[PATCH 2/2 V6] lib/show_mem.c: Add dma-buf counter to show_mem dump.

2021-04-20 Thread Peter Enderborg
On system where dma-buf is used it can be many clients that adds up
to a lot of memory. This can be relevant for OOM handling when
running out of memory or how system handle this memory. It may be to free
with a kill.

Suggested-by: Michal Hocko 
Signed-off-by: Peter Enderborg 
---
 lib/show_mem.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/lib/show_mem.c b/lib/show_mem.c
index 1c26c14ffbb9..ec4748c64353 100644
--- a/lib/show_mem.c
+++ b/lib/show_mem.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 
 void show_mem(unsigned int filter, nodemask_t *nodemask)
 {
@@ -41,4 +42,8 @@ void show_mem(unsigned int filter, nodemask_t *nodemask)
 #ifdef CONFIG_MEMORY_FAILURE
printk("%lu pages hwpoisoned\n", atomic_long_read(_poisoned_pages));
 #endif
+#ifdef CONFIG_DMA_SHARED_BUFFER
+   printk("%lu pages dma-buf\n", dma_buf_allocated_pages());
+#endif
+
 }
-- 
2.17.1



[PATCH 1/2 V6] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter Enderborg
This adds a total used dma-buf memory. Details
can be found in debugfs, however it is not for everyone
and not always available. dma-buf are indirect allocated by
userspace. So with this value we can monitor and detect
userspace applications that have problems. Typical usage
is to see that system does not do to much pre-allocations,
finding memory leaks in userspace, such as not all clients
close down the reference to the buffer.

Signed-off-by: Peter Enderborg 
---
 Documentation/filesystems/proc.rst |  5 +
 drivers/dma-buf/dma-buf.c  | 12 
 fs/proc/meminfo.c  |  5 -
 include/linux/dma-buf.h|  1 +
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst 
b/Documentation/filesystems/proc.rst
index 48fbfc336ebf..a85df9490810 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -973,6 +973,7 @@ varies by architecture and compile options.  The following 
is from a
 AnonHugePages:   49152 kB
 ShmemHugePages:  0 kB
 ShmemPmdMapped:  0 kB
+DmaBufTotal  0 kB
 
 MemTotal
   Total usable RAM (i.e. physical RAM minus a few reserved
@@ -1102,6 +1103,10 @@ VmallocChunk
 Percpu
   Memory allocated to the percpu allocator used to back percpu
   allocations. This stat excludes the cost of metadata.
+DmaBufTotal
+  Memory allocated by dma-buf driver.What memory is used
+ is  arbitrary. (It might be kernel, local or even hardware vram).
+ Details on buffers are found in debugfs if enabled.
 
 vmallocinfo
 ~~~
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..4dc37cd4293b 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,7 @@ struct dma_buf_list {
 };
 
 static struct dma_buf_list db_list;
+static atomic_long_t dma_buf_global_allocated;
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)[1])
dma_resv_fini(dmabuf->resv);
 
+   atomic_long_sub(dmabuf->size, _buf_global_allocated);
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
@@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
mutex_lock(_list.lock);
list_add(>list_node, _list.head);
mutex_unlock(_list.lock);
+   atomic_long_add(dmabuf->size, _buf_global_allocated);
 
return dmabuf;
 
@@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
dma_buf_map *map)
 }
 EXPORT_SYMBOL_GPL(dma_buf_vunmap);
 
+/**
+ * dma_buf_allocated_pages - Return the used nr of pages
+ * allocated for dma-buf
+ */
+long dma_buf_allocated_pages(void)
+{
+   return atomic_long_read(_buf_global_allocated) >> PAGE_SHIFT;
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..ccc7c40c8db7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "CmaFree:",
global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
-
+#ifdef CONFIG_DMA_SHARED_BUFFER
+   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
+#endif
hugetlb_report_meminfo(m);
 
arch_report_meminfo(m);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..5b05816bd2cd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+long dma_buf_allocated_pages(void);
 #endif /* __DMA_BUF_H__ */
-- 
2.17.1



[PATCH 0/2 V6]Add dma-buf counter

2021-04-20 Thread Peter Enderborg
The dma-buf counter is a metric for mapped memory used by it's clients.
It is a shared buffer that is typically used for interprocess communication
or process to hardware communication. In android we used to have ION,. but
it is now replaced with dma-buf. ION had some overview metrics that was similar.



V1
initial version. Add dma-buf counter

V2
Fix build depencendy error suggested by Matthew Wilcox
Extent commit message sugged by Köning

V3
Change variable and function names.

V4
Fix function name in code doc
Reported-by: kernel test robot 

V5
Removed EXPORT_SYMBOL_GPL suggested by Muchun Song

V6
Made it a patch set, Adding a addional patch for
printing dma-buf counter in show_mem.
Suggested by Michal Hocko.








[PATCH v5] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-17 Thread Peter Enderborg
This adds a total used dma-buf memory. Details
can be found in debugfs, however it is not for everyone
and not always available. dma-buf are indirect allocated by
userspace. So with this value we can monitor and detect
userspace applications that have problems.

Signed-off-by: Peter Enderborg 
---
 drivers/dma-buf/dma-buf.c | 12 
 fs/proc/meminfo.c |  5 -
 include/linux/dma-buf.h   |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..4dc37cd4293b 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,7 @@ struct dma_buf_list {
 };
 
 static struct dma_buf_list db_list;
+static atomic_long_t dma_buf_global_allocated;
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)[1])
dma_resv_fini(dmabuf->resv);
 
+   atomic_long_sub(dmabuf->size, _buf_global_allocated);
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
@@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
mutex_lock(_list.lock);
list_add(>list_node, _list.head);
mutex_unlock(_list.lock);
+   atomic_long_add(dmabuf->size, _buf_global_allocated);
 
return dmabuf;
 
@@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
dma_buf_map *map)
 }
 EXPORT_SYMBOL_GPL(dma_buf_vunmap);
 
+/**
+ * dma_buf_allocated_pages - Return the used nr of pages
+ * allocated for dma-buf
+ */
+long dma_buf_allocated_pages(void)
+{
+   return atomic_long_read(_buf_global_allocated) >> PAGE_SHIFT;
+}
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..ccc7c40c8db7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "CmaFree:",
global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
-
+#ifdef CONFIG_DMA_SHARED_BUFFER
+   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
+#endif
hugetlb_report_meminfo(m);
 
arch_report_meminfo(m);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..5b05816bd2cd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+long dma_buf_allocated_pages(void);
 #endif /* __DMA_BUF_H__ */
-- 
2.17.1



[PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-17 Thread Peter Enderborg
This adds a total used dma-buf memory. Details
can be found in debugfs, however it is not for everyone
and not always available. dma-buf are indirect allocated by
userspace. So with this value we can monitor and detect
userspace applications that have problems.

Signed-off-by: Peter Enderborg 
---
 drivers/dma-buf/dma-buf.c | 13 +
 fs/proc/meminfo.c |  5 -
 include/linux/dma-buf.h   |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..197e5c45dd26 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,7 @@ struct dma_buf_list {
 };
 
 static struct dma_buf_list db_list;
+static atomic_long_t dma_buf_global_allocated;
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)[1])
dma_resv_fini(dmabuf->resv);
 
+   atomic_long_sub(dmabuf->size, _buf_global_allocated);
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
@@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
mutex_lock(_list.lock);
list_add(>list_node, _list.head);
mutex_unlock(_list.lock);
+   atomic_long_add(dmabuf->size, _buf_global_allocated);
 
return dmabuf;
 
@@ -1346,6 +1349,16 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
dma_buf_map *map)
 }
 EXPORT_SYMBOL_GPL(dma_buf_vunmap);
 
+/**
+ * dma_buf_allocated_pages - Return the used nr of pages
+ * allocated for dma-buf
+ */
+long dma_buf_allocated_pages(void)
+{
+   return atomic_long_read(_buf_global_allocated) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..ccc7c40c8db7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "CmaFree:",
global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
-
+#ifdef CONFIG_DMA_SHARED_BUFFER
+   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
+#endif
hugetlb_report_meminfo(m);
 
arch_report_meminfo(m);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..5b05816bd2cd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+long dma_buf_allocated_pages(void);
 #endif /* __DMA_BUF_H__ */
-- 
2.17.1



[PATCH v3] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-16 Thread Peter Enderborg
This adds a total used dma-buf memory. Details
can be found in debugfs, however it is not for everyone
and not always available. dma-buf are indirect allocated by
userspace. So with this value we can monitor and detect
userspace applications that have problems.

Signed-off-by: Peter Enderborg 
---
 drivers/dma-buf/dma-buf.c | 12 
 fs/proc/meminfo.c |  5 -
 include/linux/dma-buf.h   |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..d40fff2ae1fa 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,7 @@ struct dma_buf_list {
 };
 
 static struct dma_buf_list db_list;
+static atomic_long_t dma_buf_global_allocated;
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)[1])
dma_resv_fini(dmabuf->resv);
 
+   atomic_long_sub(dmabuf->size, _buf_global_allocated);
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
@@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
mutex_lock(_list.lock);
list_add(>list_node, _list.head);
mutex_unlock(_list.lock);
+   atomic_long_add(dmabuf->size, _buf_global_allocated);
 
return dmabuf;
 
@@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
dma_buf_map *map)
 }
 EXPORT_SYMBOL_GPL(dma_buf_vunmap);
 
+/**
+ * dma_buf_get_size - Return the used nr pages by dma-buf
+ */
+long dma_buf_allocated_pages(void)
+{
+   return atomic_long_read(_buf_global_allocated) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..ccc7c40c8db7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "CmaFree:",
global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
-
+#ifdef CONFIG_DMA_SHARED_BUFFER
+   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
+#endif
hugetlb_report_meminfo(m);
 
arch_report_meminfo(m);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..5b05816bd2cd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+long dma_buf_allocated_pages(void);
 #endif /* __DMA_BUF_H__ */
-- 
2.17.1



[PATCH v2] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-16 Thread Peter Enderborg
This adds a total used dma-buf memory. Details
can be found in debugfs, however it is not for everyone
and not always available. dma-buf are indirect allocated by
userspace. So with this value we can monitor and detect
userspace applications that have problems.

Signed-off-by: Peter Enderborg 
---
 drivers/dma-buf/dma-buf.c | 12 
 fs/proc/meminfo.c |  5 -
 include/linux/dma-buf.h   |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..9f88171b394c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,7 @@ struct dma_buf_list {
 };
 
 static struct dma_buf_list db_list;
+static atomic_long_t dma_buf_size;
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)[1])
dma_resv_fini(dmabuf->resv);
 
+   atomic_long_sub(dmabuf->size, _buf_size);
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
@@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
mutex_lock(_list.lock);
list_add(>list_node, _list.head);
mutex_unlock(_list.lock);
+   atomic_long_add(dmabuf->size, _buf_size);
 
return dmabuf;
 
@@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
dma_buf_map *map)
 }
 EXPORT_SYMBOL_GPL(dma_buf_vunmap);
 
+/**
+ * dma_buf_get_size - Return the used nr pages by dma-buf
+ */
+long dma_buf_get_size(void)
+{
+   return atomic_long_read(_buf_size) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(dma_buf_get_size);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..178f6ffb1618 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "CmaFree:",
global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
-
+#ifdef CONFIG_DMA_SHARED_BUFFER
+   show_val_kb(m, "DmaBufTotal:", dma_buf_get_size());
+#endif
hugetlb_report_meminfo(m);
 
arch_report_meminfo(m);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..f6481315a377 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+long dma_buf_get_size(void);
 #endif /* __DMA_BUF_H__ */
-- 
2.17.1



[PATCH] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-16 Thread Peter Enderborg
This adds a total used dma-buf memory. Details
can be found in debugfs, however it is not for everyone
and not always available.

Signed-off-by: Peter Enderborg 
---
 drivers/dma-buf/dma-buf.c | 12 
 fs/proc/meminfo.c |  2 ++
 include/linux/dma-buf.h   |  1 +
 3 files changed, 15 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f264b70c383e..9f88171b394c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,7 @@ struct dma_buf_list {
 };
 
 static struct dma_buf_list db_list;
+static atomic_long_t dma_buf_size;
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
@@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)[1])
dma_resv_fini(dmabuf->resv);
 
+   atomic_long_sub(dmabuf->size, _buf_size);
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
@@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
mutex_lock(_list.lock);
list_add(>list_node, _list.head);
mutex_unlock(_list.lock);
+   atomic_long_add(dmabuf->size, _buf_size);
 
return dmabuf;
 
@@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
dma_buf_map *map)
 }
 EXPORT_SYMBOL_GPL(dma_buf_vunmap);
 
+/**
+ * dma_buf_get_size - Return the used nr pages by dma-buf
+ */
+long dma_buf_get_size(void)
+{
+   return atomic_long_read(_buf_size) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(dma_buf_get_size);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..3c1a82b51a6f 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -16,6 +16,7 @@
 #ifdef CONFIG_CMA
 #include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -145,6 +146,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "CmaFree:",
global_zone_page_state(NR_FREE_CMA_PAGES));
 #endif
+   show_val_kb(m, "DmaBufTotal:", dma_buf_get_size());
 
hugetlb_report_meminfo(m);
 
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..f6481315a377 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
+long dma_buf_get_size(void);
 #endif /* __DMA_BUF_H__ */
-- 
2.17.1



Re: [PATCH v2 00/12] Add build ID to stacktraces

2021-03-25 Thread peter enderborg
On 3/24/21 3:04 AM, Stephen Boyd wrote:
> This series adds the kernel's build ID[1] to the stacktrace header printed
> in oops messages, warnings, etc. and the build ID for any module that
> appears in the stacktrace after the module name. The goal is to make the
> stacktrace more self-contained and descriptive by including the relevant
> build IDs in the kernel logs when something goes wrong. This can be used
> by post processing tools like script/decode_stacktrace.sh and kernel
> developers to easily locate the debug info associated with a kernel
> crash and line up what line and file things started falling apart at.
>
> To show how this can be used I've included a patch to
> decode_stacktrace.sh that downloads the debuginfo from a debuginfod
> server.
>
> This also includes some patches to make the buildid.c file use more
> const arguments and consolidate logic into buildid.c from kdump. These
> are left to the end as they were mostly cleanup patches. I don't know
> who exactly maintains this so I guess Andrew is the best option to merge
> all this code.
>
> Here's an example lkdtm stacktrace on arm64.
>
>  WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 
> lkdtm_WARNING+0x28/0x30 [lkdtm]
>  Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup 
> uinput xt_MASQUERADE
>  CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 
> aa23f7a1231c229de205662d5a9e0d4c580f19a1
>  Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
>  pstate: 0049 (nzcv daif +PAN -UAO -TCO BTYPE=--)
>  pc : lkdtm_WARNING+0x28/0x30 [lkdtm]
>  lr : lkdtm_do_action+0x24/0x40 [lkdtm]
>  sp : ffc0134fbca0
>  x29: ffc0134fbca0 x28: ff92d53ba240
>  x27:  x26: 
>  x25:  x24: ffe3622352c0
>  x23: 0020 x22: ffe362233366
>  x21: ffe3622352e0 x20: ffc0134fbde0
>  x19: 0008 x18: 
>  x17: ff929b6536fc x16: 
>  x15:  x14: 0012
>  x13: ffe380ed892c x12: ffe381d05068
>  x11:  x10: 
>  x9 : 0001 x8 : ffe362237000
>  x7 :  x6 : 
>  x5 :  x4 : 0001
>  x3 : 0008 x2 : ff93fef25a70
>  x1 : ff93fef15788 x0 : ffe3622352e0
>  Call trace:
>   lkdtm_WARNING+0x28/0x30 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
>   direct_entry+0x16c/0x1b4 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
>   full_proxy_write+0x74/0xa4
>   vfs_write+0xec/0x2e8
>   ksys_write+0x84/0xf0
>   __arm64_sys_write+0x24/0x30
>   el0_svc_common+0xf4/0x1c0
>   do_el0_svc_compat+0x28/0x3c
>   el0_svc_compat+0x10/0x1c
>   el0_sync_compat_handler+0xa8/0xcc
>   el0_sync_compat+0x178/0x180
>  ---[ end trace 3d95032303e59e68 ]---

How will this work with the ftrace?



Re: [PATCH v2 00/12] Add build ID to stacktraces

2021-03-25 Thread peter enderborg
On 3/24/21 9:55 AM, Christoph Hellwig wrote:
> On Tue, Mar 23, 2021 at 07:04:31PM -0700, Stephen Boyd wrote:
>>  x5 :  x4 : 0001
>>  x3 : 0008 x2 : ff93fef25a70
>>  x1 : ff93fef15788 x0 : ffe3622352e0
>>  Call trace:
>>   lkdtm_WARNING+0x28/0x30 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
>>   direct_entry+0x16c/0x1b4 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
> Yikes.  No, please do not make the backtraces a complete mess for
> something that serves absolutely no need.

Would a "verbose" flag be acceptable solution?    Something like write 1 to 
/sys/kernel/debug/verbose_stack to get the extra info.

I think I see a need for it.



Re: [PATCH v5 5/8] security/brute: Mitigate a brute force attack

2021-03-11 Thread peter enderborg
On 2/27/21 4:30 PM, John Wood wrote:
> In order to mitigate a brute force attack all the offending tasks involved
> in the attack must be killed. In other words, it is necessary to kill all
> the tasks that share the fork and/or exec statistical data related to the
> attack. Moreover, if the attack happens through the fork system call, the
> processes that have the same group_leader that the current task (the task
> that has crashed) must be avoided since they are in the path to be killed.
>
> When the SIGKILL signal is sent to the offending tasks, the function
> "brute_kill_offending_tasks" will be called in a recursive way from the
> task_fatal_signal LSM hook due to a small crash period. So, to avoid kill
> again the same tasks due to a recursive call of this function, it is
> necessary to disable the attack detection for the involved hierarchies.

Would it not be useful for forensic reasons to be able to send SIGABRT and get 
the a coredump?


> To disable the attack detection, set to zero the last crash timestamp and
> avoid to compute the application crash period in this case.
>
> Signed-off-by: John Wood 
> ---
>  security/brute/brute.c | 141 ++---
>  1 file changed, 132 insertions(+), 9 deletions(-)
>
> diff --git a/security/brute/brute.c b/security/brute/brute.c
> index 0a99cd4c3303..48b07d923ec7 100644
> --- a/security/brute/brute.c
> +++ b/security/brute/brute.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -64,7 +65,7 @@ struct brute_cred {
>   * @lock: Lock to protect the brute_stats structure.
>   * @refc: Reference counter.
>   * @faults: Number of crashes.
> - * @jiffies: Last crash timestamp.
> + * @jiffies: Last crash timestamp. If zero, the attack detection is disabled.
>   * @period: Crash period's moving average.
>   * @saved_cred: Saved credentials.
>   * @network: Network activity flag.
> @@ -566,6 +567,125 @@ static inline void print_fork_attack_running(void)
>   pr_warn("Fork brute force attack detected [%s]\n", current->comm);
>  }
>
> +/**
> + * brute_disabled() - Test if the brute force attack detection is disabled.
> + * @stats: Statistical data shared by all the fork hierarchy processes.
> + *
> + * The brute force attack detection enabling/disabling is based on the last
> + * crash timestamp. A zero timestamp indicates that this feature is 
> disabled. A
> + * timestamp greater than zero indicates that the attack detection is 
> enabled.
> + *
> + * The statistical data shared by all the fork hierarchy processes cannot be
> + * NULL.
> + *
> + * It's mandatory to disable interrupts before acquiring the 
> brute_stats::lock
> + * since the task_free hook can be called from an IRQ context during the
> + * execution of the task_fatal_signal hook.
> + *
> + * Context: Must be called with interrupts disabled and brute_stats_ptr_lock
> + *  held.
> + * Return: True if the brute force attack detection is disabled. False
> + * otherwise.
> + */
> +static bool brute_disabled(struct brute_stats *stats)
> +{
> + bool disabled;
> +
> + spin_lock(>lock);
> + disabled = !stats->jiffies;
> + spin_unlock(>lock);
> +
> + return disabled;
> +}
> +
> +/**
> + * brute_disable() - Disable the brute force attack detection.
> + * @stats: Statistical data shared by all the fork hierarchy processes.
> + *
> + * To disable the brute force attack detection it is only necessary to set 
> the
> + * last crash timestamp to zero. A zero timestamp indicates that this 
> feature is
> + * disabled. A timestamp greater than zero indicates that the attack 
> detection
> + * is enabled.
> + *
> + * The statistical data shared by all the fork hierarchy processes cannot be
> + * NULL.
> + *
> + * Context: Must be called with interrupts disabled and brute_stats_ptr_lock
> + *  and brute_stats::lock held.
> + */
> +static inline void brute_disable(struct brute_stats *stats)
> +{
> + stats->jiffies = 0;
> +}
> +
> +/**
> + * enum brute_attack_type - Brute force attack type.
> + * @BRUTE_ATTACK_TYPE_FORK: Attack that happens through the fork system call.
> + * @BRUTE_ATTACK_TYPE_EXEC: Attack that happens through the execve system 
> call.
> + */
> +enum brute_attack_type {
> + BRUTE_ATTACK_TYPE_FORK,
> + BRUTE_ATTACK_TYPE_EXEC,
> +};
> +
> +/**
> + * brute_kill_offending_tasks() - Kill the offending tasks.
> + * @attack_type: Brute force attack type.
> + * @stats: Statistical data shared by all the fork hierarchy processes.
> + *
> + * When a brute force attack is detected all the offending tasks involved in 
> the
> + * attack must be killed. In other words, it is necessary to kill all the 
> tasks
> + * that share the same statistical data. Moreover, if the attack happens 
> through
> + * the fork system call, the processes that have the same group_leader that 
> the
> + * current task must be avoided since they are in the path to be killed.
> + *
> + 

Re: [PATCH] RTIC: selinux: ARM64: Move selinux_state to a separate page

2021-02-22 Thread peter enderborg
On 2/17/21 10:42 AM, Will Deacon wrote:
> [Please include arm64 and kvm folks for threads involving the stage-2 MMU]
>
> On Tue, Feb 16, 2021 at 03:47:52PM +0530, Preeti Nagar wrote:
>> The changes introduce a new security feature, RunTime Integrity Check
>> (RTIC), designed to protect Linux Kernel at runtime. The motivation
>> behind these changes is:
>> 1. The system protection offered by Security Enhancements(SE) for
>> Android relies on the assumption of kernel integrity. If the kernel
>> itself is compromised (by a perhaps as yet unknown future vulnerability),
>> SE for Android security mechanisms could potentially be disabled and
>> rendered ineffective.
>> 2. Qualcomm Snapdragon devices use Secure Boot, which adds cryptographic
>> checks to each stage of the boot-up process, to assert the authenticity
>> of all secure software images that the device executes.  However, due to
>> various vulnerabilities in SW modules, the integrity of the system can be
>> compromised at any time after device boot-up, leading to un-authorized
>> SW executing.
>>
>> The feature's idea is to move some sensitive kernel structures to a
>> separate page and monitor further any unauthorized changes to these,
>> from higher Exception Levels using stage 2 MMU. Moving these to a
>> different page will help avoid getting page faults from un-related data.
>> The mechanism we have been working on removes the write permissions for
>> HLOS in the stage 2 page tables for the regions to be monitored, such
>> that any modification attempts to these will lead to faults being
>> generated and handled by handlers. If the protected assets are moved to
>> a separate page, faults will be generated corresponding to change attempts
>> to these assets only. If not moved to a separate page, write attempts to
>> un-related data present on the monitored pages will also be generated.
>>
>> Using this feature, some sensitive variables of the kernel which are
>> initialized after init or are updated rarely can also be protected from
>> simple overwrites and attacks trying to modify these.
> Although I really like the idea of using stage-2 to protect the kernel, I
> think the approach you outline here is deeply flawed. Identifying "sensitive
> variables" of the kernel to protect is subjective and doesn't scale.
> Furthermore, the triaging of what constitues a valid access is notably
> absent from your description and is assumedly implemented in an opaque blob
> at EL2.
>
> I think a better approach would be along the lines of:
>
>   1. Introduce the protection at stage-1 (like we already have for mapping
>  e.g. the kernel text R/O)

Will that really solve the problem? There is a lot of caches that are used
to resolve policy data in selinux, and this caches will not be protected.
If you can manipulate kernel data you can do cache poisoning.


>   2. Implement the handlers in the kernel, so the heuristics are clear.
>
>   3. Extend this to involve KVM, so that the host can manage its own
>  stage-2 to firm-up the stage-1 protections.
>
> I also think we should avoid tying this to specific data structures.
> Rather, we should introduce a mechanism to make arbitrary data read-only.
>
> I've CC'd Ard and Marc, as I think they've both been thinking about this
> sort of thing recently as well.
>
> Will




Re: [RFC PATCH 0/6] Sleepable tracepoints

2020-10-26 Thread peter enderborg
On 10/23/20 9:53 PM, Michael Jeanson wrote:
> When invoked from system call enter/exit instrumentation, accessing
> user-space data is a common use-case for tracers. However, tracepoints
> currently disable preemption around iteration on the registered
> tracepoint probes and invocation of the probe callbacks, which prevents
> tracers from handling page faults.
>
> Extend the tracepoint and trace event APIs to allow specific tracer
> probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
> called from sleepable context, and convert the system call enter/exit
> instrumentation to sleepable tracepoints.

Will this not be a problem for analyse of the trace? It get two
relevant times, one it when it is called and one when it returns.

It makes things harder to correlate in what order things happen.

And handling of tracing of contexts that already are not preamptable?

Eg the same tracepoint are used in different places and contexts.


> This series only implements the tracepoint infrastructure required to
> allow tracers to handle page faults. Modifying each tracer to handle
> those page faults would be a next step after we all agree on this piece
> of instrumentation infrastructure.
>
> This patchset is base on v5.9.1.
>
> Cc: Steven Rostedt (VMware) 
> Cc: Peter Zijlstra 
> Cc: Alexei Starovoitov 
> Cc: Yonghong Song 
> Cc: Paul E. McKenney 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Mark Rutland 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Joel Fernandes (Google) 
> Cc: b...@vger.kernel.org
>
> Mathieu Desnoyers (1):
>   tracing: use sched-RCU instead of SRCU for rcuidle tracepoints
>
> Michael Jeanson (5):
>   tracing: introduce sleepable tracepoints
>   tracing: ftrace: add support for sleepable tracepoints
>   tracing: bpf-trace: add support for sleepable tracepoints
>   tracing: perf: add support for sleepable tracepoints
>   tracing: convert sys_enter/exit to sleepable tracepoints
>
>  include/linux/tracepoint-defs.h |  11 
>  include/linux/tracepoint.h  | 104 +---
>  include/trace/bpf_probe.h   |  23 ++-
>  include/trace/define_trace.h|   7 +++
>  include/trace/events/syscalls.h |   4 +-
>  include/trace/perf.h|  26 ++--
>  include/trace/trace_events.h|  79 ++--
>  init/Kconfig|   1 +
>  kernel/trace/bpf_trace.c|   5 +-
>  kernel/trace/trace_events.c |  15 -
>  kernel/trace/trace_syscalls.c   |  68 +
>  kernel/tracepoint.c | 104 +---
>  12 files changed, 351 insertions(+), 96 deletions(-)
>



[PATCH] trace: Return ENOTCONN instead of EBADF

2020-10-12 Thread Peter Enderborg
When there is no clients listening on event the trace return
EBADF. The file is not a bad file descriptor and to get the
userspace able to do a proper error handling it need a different
error code that separate a bad file descriptor from a empty listening.

Signed-off-by: Peter Enderborg 
---
 kernel/trace/trace.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d3e5de717df2..6e592bf736df 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6651,8 +6651,8 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
irq_flags, preempt_count());
if (unlikely(!event))
-   /* Ring buffer disabled, return as if not open for write */
-   return -EBADF;
+   /* Ring buffer disabled, return as if not connected */
+   return -ENOTCONN;
 
entry = ring_buffer_event_data(event);
entry->ip = _THIS_IP_;
@@ -6731,8 +6731,8 @@ tracing_mark_raw_write(struct file *filp, const char 
__user *ubuf,
event = __trace_buffer_lock_reserve(buffer, TRACE_RAW_DATA, size,
irq_flags, preempt_count());
if (!event)
-   /* Ring buffer disabled, return as if not open for write */
-   return -EBADF;
+   /* Ring buffer disabled, return not connected */
+   return -ENOTCONN;
 
entry = ring_buffer_event_data(event);
 
-- 
2.17.1



Re: [PATCH v7 3/3] binder: add transaction latency tracer

2020-09-07 Thread peter enderborg
On 8/4/20 3:59 PM, Frankie Chang wrote:
> +void probe_binder_txn_latency_free(void *ignore, struct binder_transaction 
> *t,
> + int from_proc, int from_thread,
> + int to_proc, int to_thread)
> +{
> + struct rtc_time tm;
> + struct timespec64 *startime;
> + struct timespec64 cur, sub_t;
> +
> + ktime_get_ts64();


I think sched_clock is what you want.


> + startime = >timestamp;
> + sub_t = timespec64_sub(cur, *startime);
> +
> + /* if transaction time is over than binder_txn_latency_threshold (sec),
> +  * show timeout warning log.
> +  */
> + if (sub_t.tv_sec < binder_txn_latency_threshold)
> + return;
> +
> + rtc_time_to_tm(t->tv.tv_sec, );



[PATCH] crypto: Mark tfm buffer as non leak.

2020-09-03 Thread Peter Enderborg
When running kmemleak on this I got a lot of

unreferenced object 0xfff942d4ec00 (size 1024):
  comm "init", pid 1, jiffies 4294893619 (age 17475.864s)
  hex dump (first 32 bytes):
38 1d cf bd 9e ff ff ff b8 37 8b bd 9e ff ff ff  87..
78 38 8b bd 9e ff ff ff 10 00 00 00 00 00 00 00  x8..
  backtrace:
[<c3c55a80>] __kmalloc+0x2cc/0x3b0
[<c599b091>] crypto_create_tfm+0x38/0xf0
[<d4516e51>] crypto_spawn_tfm2+0x58/0xa0
[<1bab58aa>] cryptd_skcipher_init_tfm+0x1c/0x40
[<06748df3>] crypto_skcipher_init_tfm+0x158/0x1e0
[<17f3270c>] crypto_create_tfm+0x54/0xf0
[<6af1de62>] crypto_alloc_tfm+0x88/0x198
[<0d8e8c03>] crypto_alloc_skcipher+0x1c/0x28
[<85448a2a>] cryptd_alloc_skcipher+0x5c/0xb0
[<3c48c083>] simd_skcipher_init+0x24/0x68
[<06748df3>] crypto_skcipher_init_tfm+0x158/0x1e0
[<17f3270c>] crypto_create_tfm+0x54/0xf0
[<d4516e51>] crypto_spawn_tfm2+0x58/0xa0
[<b5344705>] crypto_cts_init_tfm+0x1c/0x68
[<06748df3>] crypto_skcipher_init_tfm+0x158/0x1e0
[<17f3270c>] crypto_create_tfm+0x54/0xf0

This is caused by tfm = (struct crypto_tfm *)(mem + tfmsize);
that is keept instead of the allocated buffer in mem.
Reference counting is done on alg.

Signed-off-by: Peter Enderborg 
---
 crypto/api.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/crypto/api.c b/crypto/api.c
index ed08cbd5b9d3..1a9cb6852a56 100644
--- a/crypto/api.c
+++ b/crypto/api.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include "internal.h"
+#include 
 
 LIST_HEAD(crypto_alg_list);
 EXPORT_SYMBOL_GPL(crypto_alg_list);
@@ -460,7 +461,7 @@ void *crypto_create_tfm_node(struct crypto_alg *alg,
 
if (!tfm->exit && alg->cra_init && (err = alg->cra_init(tfm)))
goto cra_init_failed;
-
+   kmemleak_not_leak(mem);
goto out;
 
 cra_init_failed:
-- 
2.17.1



Re: [RFC PATCH] selinux: Add denied trace with permssion filter

2020-09-01 Thread peter enderborg
On 9/1/20 5:31 PM, Paul Moore wrote:
> On Mon, Aug 31, 2020 at 11:34 AM peter enderborg wrote:
>> On 8/31/20 4:16 PM, Paul Moore wrote:
>>> On Thu, Aug 27, 2020 at 10:04 AM peter enderborg wrote:
> ...
>
>>>> Im happly fine with replacing the selinux_audited with selinux_denied.  
>>>> However it is the case where there are more than one denial at the same 
>>>> time. Im not sure how and when it might happen.
>>> One thing I wondered about was why not build up a single string with
>>> all of the permissions instead of generating multiple trace events?
>>> In the previous discussion it was implied that this was due to
>>> limitations in the tracing subsystem's filtering, and based on the
>>> discussion thus far I'm guessing there is little desire for this
>>> information if it can't be filtered on?
>> The information is of course as essential as for audit messages.
>> I dont see much of the problem with having as the first suggestion with
>> a list. It works fine for trace_pipe. It is not failing due to that we can 
>> not
>> filter with that.
> I don't really have much personal experience with the kernel tracing
> tools, so an example would be helpful as I'm not really following what
> you are saying.  Are you talking about something like
> "permission=foo,bar,baz"?

In the first patch we hade premission that was in the format:

permission={read !write open}

That would have work fine if it was not for that the permission can be created
dynamically.  It would be very easy to change that to

permission=read permission=!write permission=open

But I think that will cause problems too. The create a format specifiaction
in the trace tree, and I dont think we can describe this format.


>
>> It is cause in other tools in user-space
>> that needs a plugin to parse it. It need static
>> mapping for something that is not really static. Not in runtime, and it will
>> change over time.
> I think we've all come to the conclusion that doing the permission
> bitmap-to-string translation in a plugin is not really desirable.

Yes. But Im arguing that we still have the same information,
but it will "cost" that we have more events. It is a fair trade-off,
it is usually a lot better to have the trace simple than
flexible and let it cost events when needed.


>> A other idea based on the first one is to have multiple pairs like
>>
>> class=file permission=read permission=write permission=open
>>
>> but then you need to filter on numeric values that are not static and
>> I don't know if library can make anything useful from that.
> Oh, wait, is the issue that the tracing subsystem can't filter on
> strings?  That doesn't seem right, but I can understand why one might
> want to avoid that for performance reasons.  If the tracing subsystem
> *can* filter on strings, why did you say that in the "perm=foo
> perm=bar" format one would need to filter on numeric values?  I still
> think I'm missing something here ...
>
No. It can filter on strings. But it can not do any fuzzy matching.
They are equal not not equal. So if you have a parameter value
that is { open read !write } you need to specify a exact match.

With multiple events you match for "open" or "write" not "{ open read !write }".

So you get one event for "open" and a other event for "write".

The numeric values is not matching perm=, it is a match for the
bits in the denied= parameter that is present within selinux_audited event.




Re: [RFC PATCH] selinux: Add denied trace with permssion filter

2020-08-31 Thread peter enderborg
On 8/31/20 4:16 PM, Paul Moore wrote:
> On Thu, Aug 27, 2020 at 10:04 AM peter enderborg
>  wrote:
>> On 8/27/20 3:30 PM, Paul Moore wrote:
>>> On Wed, Aug 26, 2020 at 11:06 AM peter enderborg
>>>  wrote:
>>>> On 8/26/20 4:45 PM, Paul Moore wrote:
>>>>> On Wed, Aug 26, 2020 at 10:34 AM peter enderborg
>>>>>  wrote:
>>>>>> On 8/26/20 3:42 PM, Paul Moore wrote:
>>>>>>> On Mon, Aug 24, 2020 at 9:23 AM Peter Enderborg
>>>>>>>  wrote:
>>>>>>>> This adds tracing of all denies. They are grouped with trace_seq for
>>>>>>>> each audit.
>>>>>>>>
>>>>>>>> A filter can be inserted with a write to it's filter section.
>>>>>>>>
>>>>>>>> echo "permission==\"entrypoint\"" > events/avc/selinux_denied/filter
>>>>>>>>
>>>>>>>> A output will be like:
>>>>>>>>   runcon-1046  [002] .N..   156.351738: selinux_denied:
>>>>>>>>   trace_seq=2 result=-13
>>>>>>>>   scontext=system_u:system_r:cupsd_t:s0-s0:c0.
>>>>>>>>   c1023 tcontext=system_u:object_r:bin_t:s0
>>>>>>>>   tclass=file permission=entrypoint
>>>>>>>>
>>>>>>>> Signed-off-by: Peter Enderborg 
>>>>>>>> ---
>>>>>>>>  include/trace/events/avc.h | 37 +
>>>>>>>>  security/selinux/avc.c | 27 +--
>>>>>>>>  2 files changed, 62 insertions(+), 2 deletions(-)
>>>>>>> My most significant comment is that I don't think we want, or need,
>>>>>>> two trace points in the avc_audit_post_callback() function.  Yes, I
>>>>>>> understand they are triggered slightly differently, but from my
>>>>>>> perspective there isn't enough difference between the two tracepoints
>>>>>>> to warrant including both.  However, while the tracepoints may be
>>>>>> We tried that but that was problematic too.
>>>>> My apologies if I was on that thread, but can you remind me why it was
>>>>> a problem?  Why can't we use a single tracepoint to capture the AVC
>>>>> information?
>>>> The problem is parsing the event.
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.org_lkml_2020_8_18_842=DwIBaQ=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc=oO5HuGEGxznA2F3djiiYxmxxWQonw0h6Sks-BEoB4ys=qmi2ROWsLC_0mLLhHkpb71j1YoicydLh-7l4cOsLYcY=iS3eZr3TFrN5I7BbnvPFYOKd6DfW1FHTFcwI7joS_fk=
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.org_lkml_2020_8_21_526=DwIBaQ=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc=oO5HuGEGxznA2F3djiiYxmxxWQonw0h6Sks-BEoB4ys=qmi2ROWsLC_0mLLhHkpb71j1YoicydLh-7l4cOsLYcY=9OsLN0Y5mUWxEAAqUE6K4PS57Pn1XyZz7GXak6uc_Ls=
>>>>
>>>> and the "single list" version
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.org_lkml_2020_8_17_1346=DwIBaQ=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc=oO5HuGEGxznA2F3djiiYxmxxWQonw0h6Sks-BEoB4ys=qmi2ROWsLC_0mLLhHkpb71j1YoicydLh-7l4cOsLYcY=tWSY2ry2IT6RcT5BIUwMuqBL_yPObDE1VljbLqI1zrA=
>>>>
>>>> With this patch we follow standard message format so no plugin should be 
>>>> needed.
>>> I'm evidently missing something very fundamental (likely), and/or I'm
>>> just not communicating very clearly (also likely), because the above
>>> links don't appear to make any sense with respect to my question.
>>>
>>> Let me try a reset ... Why can't we basically take the
>>> "selinux_denied" TRACE_EVENT implementation in your patch and use it
>>> to replace the "selinux_audited" TRACE_EVENT in the selinux/next tree
>>> (of course with the necessary changes to the AVC callback code)?
>>>
>>> If the "selinux_denied" implementation is valid from a tracing point
>>> of view, why can we not do this?  Of course if the "selinux_denied"
>>> implementation is not a valid TRACE_EVENT then I'm not sure why this
>>> was suggested for SELinux :)
>> Im happly fine with replacing the selinux_audited with selinux_denied.  
>> However it is the case where there are more than one denial at the same 
>> time. Im not sure how and when it might happen.
> One thing I wondered 

Re: [RFC PATCH] selinux: Add denied trace with permssion filter

2020-08-27 Thread peter enderborg
On 8/27/20 3:30 PM, Paul Moore wrote:
> On Wed, Aug 26, 2020 at 11:06 AM peter enderborg
>  wrote:
>> On 8/26/20 4:45 PM, Paul Moore wrote:
>>> On Wed, Aug 26, 2020 at 10:34 AM peter enderborg
>>>  wrote:
>>>> On 8/26/20 3:42 PM, Paul Moore wrote:
>>>>> On Mon, Aug 24, 2020 at 9:23 AM Peter Enderborg
>>>>>  wrote:
>>>>>> This adds tracing of all denies. They are grouped with trace_seq for
>>>>>> each audit.
>>>>>>
>>>>>> A filter can be inserted with a write to it's filter section.
>>>>>>
>>>>>> echo "permission==\"entrypoint\"" > events/avc/selinux_denied/filter
>>>>>>
>>>>>> A output will be like:
>>>>>>   runcon-1046  [002] .N..   156.351738: selinux_denied:
>>>>>>   trace_seq=2 result=-13
>>>>>>   scontext=system_u:system_r:cupsd_t:s0-s0:c0.
>>>>>>   c1023 tcontext=system_u:object_r:bin_t:s0
>>>>>>   tclass=file permission=entrypoint
>>>>>>
>>>>>> Signed-off-by: Peter Enderborg 
>>>>>> ---
>>>>>>  include/trace/events/avc.h | 37 +
>>>>>>  security/selinux/avc.c | 27 +--
>>>>>>  2 files changed, 62 insertions(+), 2 deletions(-)
>>>>> My most significant comment is that I don't think we want, or need,
>>>>> two trace points in the avc_audit_post_callback() function.  Yes, I
>>>>> understand they are triggered slightly differently, but from my
>>>>> perspective there isn't enough difference between the two tracepoints
>>>>> to warrant including both.  However, while the tracepoints may be
>>>> We tried that but that was problematic too.
>>> My apologies if I was on that thread, but can you remind me why it was
>>> a problem?  Why can't we use a single tracepoint to capture the AVC
>>> information?
>> The problem is parsing the event.
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.org_lkml_2020_8_18_842=DwIBaQ=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc=oO5HuGEGxznA2F3djiiYxmxxWQonw0h6Sks-BEoB4ys=qmi2ROWsLC_0mLLhHkpb71j1YoicydLh-7l4cOsLYcY=iS3eZr3TFrN5I7BbnvPFYOKd6DfW1FHTFcwI7joS_fk=
>>  
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.org_lkml_2020_8_21_526=DwIBaQ=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc=oO5HuGEGxznA2F3djiiYxmxxWQonw0h6Sks-BEoB4ys=qmi2ROWsLC_0mLLhHkpb71j1YoicydLh-7l4cOsLYcY=9OsLN0Y5mUWxEAAqUE6K4PS57Pn1XyZz7GXak6uc_Ls=
>>  
>>
>> and the "single list" version
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lkml.org_lkml_2020_8_17_1346=DwIBaQ=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc=oO5HuGEGxznA2F3djiiYxmxxWQonw0h6Sks-BEoB4ys=qmi2ROWsLC_0mLLhHkpb71j1YoicydLh-7l4cOsLYcY=tWSY2ry2IT6RcT5BIUwMuqBL_yPObDE1VljbLqI1zrA=
>>  
>>
>> With this patch we follow standard message format so no plugin should be 
>> needed.
> I'm evidently missing something very fundamental (likely), and/or I'm
> just not communicating very clearly (also likely), because the above
> links don't appear to make any sense with respect to my question.
>
> Let me try a reset ... Why can't we basically take the
> "selinux_denied" TRACE_EVENT implementation in your patch and use it
> to replace the "selinux_audited" TRACE_EVENT in the selinux/next tree
> (of course with the necessary changes to the AVC callback code)?
>
> If the "selinux_denied" implementation is valid from a tracing point
> of view, why can we not do this?  Of course if the "selinux_denied"
> implementation is not a valid TRACE_EVENT then I'm not sure why this
> was suggested for SELinux :)
>
Im happly fine with replacing the selinux_audited with selinux_denied.  However 
it is the case where there are more than one denial at the same time. Im not 
sure how and when it might happen.
When that happen we got more than one event. I have no problems with that, but 
im not sure if the debug tools and perf can make sense of that.

A other feature with the selinux_audited event it might be inserted on other 
places in the code too.  A denial is sort of final.





Re: [RFC PATCH] selinux: Add denied trace with permssion filter

2020-08-26 Thread peter enderborg
On 8/26/20 4:45 PM, Paul Moore wrote:
> On Wed, Aug 26, 2020 at 10:34 AM peter enderborg
>  wrote:
>> On 8/26/20 3:42 PM, Paul Moore wrote:
>>> On Mon, Aug 24, 2020 at 9:23 AM Peter Enderborg
>>>  wrote:
>>>> This adds tracing of all denies. They are grouped with trace_seq for
>>>> each audit.
>>>>
>>>> A filter can be inserted with a write to it's filter section.
>>>>
>>>> echo "permission==\"entrypoint\"" > events/avc/selinux_denied/filter
>>>>
>>>> A output will be like:
>>>>   runcon-1046  [002] .N..   156.351738: selinux_denied:
>>>>   trace_seq=2 result=-13
>>>>       scontext=system_u:system_r:cupsd_t:s0-s0:c0.
>>>>   c1023 tcontext=system_u:object_r:bin_t:s0
>>>>   tclass=file permission=entrypoint
>>>>
>>>> Signed-off-by: Peter Enderborg 
>>>> ---
>>>>  include/trace/events/avc.h | 37 +
>>>>  security/selinux/avc.c | 27 +--
>>>>  2 files changed, 62 insertions(+), 2 deletions(-)
>>> My most significant comment is that I don't think we want, or need,
>>> two trace points in the avc_audit_post_callback() function.  Yes, I
>>> understand they are triggered slightly differently, but from my
>>> perspective there isn't enough difference between the two tracepoints
>>> to warrant including both.  However, while the tracepoints may be
>> We tried that but that was problematic too.
> My apologies if I was on that thread, but can you remind me why it was
> a problem?  Why can't we use a single tracepoint to capture the AVC
> information?

The problem is parsing the event.

https://lkml.org/lkml/2020/8/18/842

https://lkml.org/lkml/2020/8/21/526

and the "single list" version

https://lkml.org/lkml/2020/8/17/1346

With this patch we follow standard message format so no plugin should be needed.


>> Having partly overlapping traces is not unheard off.  Check
>> compaction.c where we have a trace_mm_compaction_begin
>> and a more detailed trace_mm_compaction_migratepages.
>> (And a  trace_mm_compaction_end)
> It may not be unique to SELinux, but that doesn't mean I like it :)
>
> One of my concerns with adding tracepoints is that the code would get
> littered with tracepoints; I accepted that it the AVC decision
> codepath was an obvious place for one, so we added a tracepoint.
> Having two tracepoints here is getting awfully close to my original
> fears.
>
>>> redundant in my mind, this new event does do the permission lookup in
>>> the kernel so that the contexts/class/permissions are all available as
>>> a string which is a good thing.
>>>
>>> Without going into the details, would the tracing folks be okay with
>>> doing something similar with the existing selinux_audited tracepoint?
>>> It's extra work in the kernel, but since it would only be triggered
>>> when the tracepoint was active it seems bearable to me.
>> I think the method for expanding lists is what we tried first on
>> suggestion from Steven Rostedt.  Maybe we can do a trace_event
>> from a TP_prink but that would be recursive.
> Wait, why would you be adding a trace event to a trace event, or am I
> misunderstanding you?
>
> All I was talking about was adding the permission resolution code to
> the already existing SELinux AVC tracepoint.
>



Re: [RFC PATCH] selinux: Add denied trace with permssion filter

2020-08-26 Thread peter enderborg
On 8/26/20 3:42 PM, Paul Moore wrote:
> On Mon, Aug 24, 2020 at 9:23 AM Peter Enderborg
>  wrote:
>> This adds tracing of all denies. They are grouped with trace_seq for
>> each audit.
>>
>> A filter can be inserted with a write to it's filter section.
>>
>> echo "permission==\"entrypoint\"" > events/avc/selinux_denied/filter
>>
>> A output will be like:
>>   runcon-1046  [002] .N..   156.351738: selinux_denied:
>>   trace_seq=2 result=-13
>>   scontext=system_u:system_r:cupsd_t:s0-s0:c0.
>>   c1023 tcontext=system_u:object_r:bin_t:s0
>>   tclass=file permission=entrypoint
>>
>> Signed-off-by: Peter Enderborg 
>> ---
>>  include/trace/events/avc.h | 37 +
>>  security/selinux/avc.c | 27 +--
>>  2 files changed, 62 insertions(+), 2 deletions(-)
> My most significant comment is that I don't think we want, or need,
> two trace points in the avc_audit_post_callback() function.  Yes, I
> understand they are triggered slightly differently, but from my
> perspective there isn't enough difference between the two tracepoints
> to warrant including both.  However, while the tracepoints may be

We tried that but that was problematic too.

Having partly overlapping traces is not unheard off.  Check
compaction.c where we have a     trace_mm_compaction_begin
and a more detailed trace_mm_compaction_migratepages.
(And a  trace_mm_compaction_end)


> redundant in my mind, this new event does do the permission lookup in
> the kernel so that the contexts/class/permissions are all available as
> a string which is a good thing.
>
> Without going into the details, would the tracing folks be okay with
> doing something similar with the existing selinux_audited tracepoint?
> It's extra work in the kernel, but since it would only be triggered
> when the tracepoint was active it seems bearable to me.

I think the method for expanding lists is what we tried first on
suggestion from Steven Rostedt.  Maybe we can do a trace_event
from a TP_prink but that would be recursive.

 

>> diff --git a/include/trace/events/avc.h b/include/trace/events/avc.h
>> index 94bca8bef8d2..9a28559956de 100644
>> --- a/include/trace/events/avc.h
>> +++ b/include/trace/events/avc.h
>> @@ -54,6 +54,43 @@ TRACE_EVENT(selinux_audited,
>> )
>>  );
>>
>> +TRACE_EVENT(selinux_denied,
>> +
>> +   TP_PROTO(struct selinux_audit_data *sad,
>> +   char *scontext,
>> +   char *tcontext,
>> +   const char *tclass,
>> +   const char *permission,
>> +   int64_t seq
>> +   ),
>> +
>> +   TP_ARGS(sad, scontext, tcontext, tclass, permission, seq),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(int, result)
>> +   __string(scontext, scontext)
>> +   __string(tcontext, tcontext)
>> +   __string(permission, permission)
>> +   __string(tclass, tclass)
>> +   __field(u64, seq)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   __entry->result = sad->result;
>> +   __entry->seq= seq;
>> +   __assign_str(tcontext, tcontext);
>> +   __assign_str(scontext, scontext);
>> +   __assign_str(permission, permission);
>> +   __assign_str(tclass, tclass);
>> +   ),
>> +
>> +   TP_printk("trace_seq=%lld result=%d scontext=%s tcontext=%s 
>> tclass=%s permission=%s",
>> +__entry->seq, __entry->result, __get_str(scontext), 
>> __get_str(tcontext),
>> +__get_str(tclass), __get_str(permission)
>> +
>> +   )
>> +);
>> +
>>  #endif
>>
>>  /* This part must be outside protection */
>> diff --git a/security/selinux/avc.c b/security/selinux/avc.c
>> index 1debddfb5af9..ca53baca15e1 100644
>> --- a/security/selinux/avc.c
>> +++ b/security/selinux/avc.c
>> @@ -92,6 +92,7 @@ struct selinux_avc {
>>  };
>>
>>  static struct selinux_avc selinux_avc;
>> +static atomic64_t trace_seqno;
>>
>>  void selinux_avc_init(struct selinux_avc **avc)
>>  {
>> @@ -731,9 +732,31 @@ static void avc_audit_post_callback(struct audit_buffer 
>> *ab, void *a)
>> tclass = secclass_map[sad->tclass-1].name;
>> audit_log_format(ab, " tclass=%s", tclass);
>>
>> -   if (sad-&

Re: [RFC PATCH 3/3] selinux: track policy lifetime with refcount

2020-08-25 Thread peter enderborg
On 8/25/20 5:20 PM, Ondrej Mosnacek wrote:
> Instead of holding the RCU read lock the whole time while accessing the
> policy, add a simple refcount mechanism to track its lifetime. After
> this, the RCU read lock is held only for a brief time when fetching the
> policy pointer and incrementing the refcount. The policy struct is then
> guaranteed to stay alive until the refcount is decremented.
>
> Freeing of the policy remains the responsibility of the task that does
> the policy reload. In case the refcount drops to zero in a different
> task, the policy load task is notified via a completion.
>
> The advantage of this change is that the operations that access the
> policy can now do sleeping allocations, since they don't need to hold
> the RCU read lock anymore. This patch so far only leverages this in
> security_read_policy() for the vmalloc_user() allocation (although this
> function is always called under fsi->mutex and could just access the
> policy pointer directly). The conversion of affected GFP_ATOMIC
> allocations to GFP_KERNEL is left for a later patch, since auditing
> which code paths may still need GFP_ATOMIC is not very easy.
>
> Signed-off-by: Ondrej Mosnacek 
>
Very clever. But is it the right prioritization? We get a lot more
cpu synchronization need with two RCU in-out and refcounts inc/dec
instead of only one RCU in-out.  What is the problem with the atomic
allocations? And this if for each syscall, all caches are on the inside?



[RFC PATCH] selinux: Add denied trace with permssion filter

2020-08-24 Thread Peter Enderborg
This adds tracing of all denies. They are grouped with trace_seq for
each audit.

A filter can be inserted with a write to it's filter section.

echo "permission==\"entrypoint\"" > events/avc/selinux_denied/filter

A output will be like:
  runcon-1046  [002] .N..   156.351738: selinux_denied:
  trace_seq=2 result=-13
  scontext=system_u:system_r:cupsd_t:s0-s0:c0.
  c1023 tcontext=system_u:object_r:bin_t:s0
  tclass=file permission=entrypoint

Signed-off-by: Peter Enderborg 
---
 include/trace/events/avc.h | 37 +
 security/selinux/avc.c | 27 +--
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/avc.h b/include/trace/events/avc.h
index 94bca8bef8d2..9a28559956de 100644
--- a/include/trace/events/avc.h
+++ b/include/trace/events/avc.h
@@ -54,6 +54,43 @@ TRACE_EVENT(selinux_audited,
)
 );
 
+TRACE_EVENT(selinux_denied,
+
+   TP_PROTO(struct selinux_audit_data *sad,
+   char *scontext,
+   char *tcontext,
+   const char *tclass,
+   const char *permission,
+   int64_t seq
+   ),
+
+   TP_ARGS(sad, scontext, tcontext, tclass, permission, seq),
+
+   TP_STRUCT__entry(
+   __field(int, result)
+   __string(scontext, scontext)
+   __string(tcontext, tcontext)
+   __string(permission, permission)
+   __string(tclass, tclass)
+   __field(u64, seq)
+   ),
+
+   TP_fast_assign(
+   __entry->result = sad->result;
+   __entry->seq= seq;
+   __assign_str(tcontext, tcontext);
+   __assign_str(scontext, scontext);
+   __assign_str(permission, permission);
+   __assign_str(tclass, tclass);
+   ),
+
+   TP_printk("trace_seq=%lld result=%d scontext=%s tcontext=%s tclass=%s 
permission=%s",
+__entry->seq, __entry->result, __get_str(scontext), 
__get_str(tcontext),
+__get_str(tclass), __get_str(permission)
+
+   )
+);
+
 #endif
 
 /* This part must be outside protection */
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index 1debddfb5af9..ca53baca15e1 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -92,6 +92,7 @@ struct selinux_avc {
 };
 
 static struct selinux_avc selinux_avc;
+static atomic64_t trace_seqno;
 
 void selinux_avc_init(struct selinux_avc **avc)
 {
@@ -731,9 +732,31 @@ static void avc_audit_post_callback(struct audit_buffer 
*ab, void *a)
tclass = secclass_map[sad->tclass-1].name;
audit_log_format(ab, " tclass=%s", tclass);
 
-   if (sad->denied)
+   if (sad->denied) {
audit_log_format(ab, " permissive=%u", sad->result ? 0 : 1);
-
+   if (trace_selinux_denied_enabled()) {
+   int i, perm;
+   const char **perms;
+   u32 denied = sad->denied;
+   int64_t seq;
+
+   seq = atomic_long_inc_return(_seqno);
+   perms = secclass_map[sad->tclass-1].perms;
+   i = 0;
+   perm = 1;
+   while (i < (sizeof(denied) * 8)) {
+   if ((perm & denied & sad->requested) && 
perms[i]) {
+   trace_selinux_denied(sad, scontext, 
tcontext,
+tclass, perms[i], 
seq);
+   denied &= ~perm;
+   if (!denied)
+   break;
+   }
+   i++;
+   perm <<= 1;
+   }
+   }
+   }
trace_selinux_audited(sad, scontext, tcontext, tclass);
kfree(tcontext);
kfree(scontext);
-- 
2.17.1



[no subject]

2020-08-24 Thread Peter Enderborg


This is other solution. I dont think needs any plugin support. It will generate 
one
event for eatch denial and there is possible to filter on permission.
.



Re: [PATCH v3 3/3] selinux: add permission names to trace event

2020-08-21 Thread peter enderborg
On 8/21/20 3:19 PM, Paul Moore wrote:
> On Fri, Aug 21, 2020 at 8:29 AM Stephen Smalley
>  wrote:
>> On Thu, Aug 20, 2020 at 10:31 PM Steven Rostedt  wrote:
>>> On Wed, 19 Aug 2020 09:11:08 -0400
>>> Stephen Smalley  wrote:
>>>
 So we'll need to update this plugin whenever we modify
 security/selinux/include/classmap.h to keep them in sync.  Is that a
 concern?  I don't suppose the plugin could directly include classmap.h?
 I guess we'd have to export it as a public header. It isn't considered
 to be part of the kernel API/ABI and can change anytime (but in practice
 changes are not that frequent, and usually just additive in nature).
>>> Yes, it would require some stability between userspace and the plugin.
>>> If the value indexes don't change then that would work fine. If you add
>>> new ones, that too should be OK, just have a way to state "unknown" in
>>> the plugin.
>> Since we introduced the dynamic class/perm mapping support, it has
>> been possible for the values of existing classes/permissions to
>> change, and that has happened at time, e.g. when we added watch
>> permissions to the common file perms, that shifted the values of the
>> class file perms like entrypoint, when we added the process2 class
>> right after the process class, it shifted the values of all the
>> subsequent classes in the classmap.h.  So you can't rely on those
>> values remaining stable across kernel versions.
> I think it is becoming increasingly clear that generating the
> permission set string in userspace isn't really workable without
> breaking the dynamic class/permission mapping to some degree.
> Unfortunately I don't see these perf changes as a big enough "win" to
> offset the loss of the dynamic mapping loss.
>
> I'm okay with merging patches 1/3 and 2/3 wth the changes Stephen
> suggested, but I think we will need to leave patch 3/3 out of this for
> now.
>
Ok.



Re: [PATCH v3 3/3] selinux: add permission names to trace event

2020-08-20 Thread peter enderborg
On 8/21/20 4:22 AM, Paul Moore wrote:
> On Tue, Aug 18, 2020 at 8:14 AM Stephen Smalley
>  wrote:
>> On Tue, Aug 18, 2020 at 4:11 AM peter enderborg  
>> wrote:
> ...
>
>>> Is there any other things we need to fix? A part 1&2 now OK?
>> They looked ok to me, but Paul should review them.
> Patches 1 and 2 look fine to me with the small nits that Stephen
> pointed out corrected.  I'm glad to see the information in string form
> now, I think that will be a big help for people making use of this.
>
> Unfortunately, I'm a little concerned about patch 3 for the reason
> Stephen already mentioned.  While changes to the class mapping are
> infrequent, they do happen, and I'm not very excited about adding it
> to the userspace kAPI via a header.  Considering that the tracing
> tools are going to be running on the same system that is being
> inspected, perhaps the tracing tools could inspect
> /sys/fs/selinux/class at runtime to query the permission mappings?
> Stephen, is there a libselinux API which does this already?
>
One way to use this trace is to write directly to a memory buffer over a time
period. In the case for Android and I guess in many other embedded cases too
they are moved to be some other machine to be analysed so having them
locked to where it was running also have problems.

So what is the problem we see with the plugin, that we have perms names
that are "unknown" ?



Re: Scheduler benchmarks

2020-08-18 Thread peter enderborg
On 8/18/20 7:53 PM, Muni Sekhar wrote:
> On Tue, Aug 18, 2020 at 11:06 PM Greg KH  wrote:
>> On Tue, Aug 18, 2020 at 11:01:35PM +0530, Muni Sekhar wrote:
>>> On Tue, Aug 18, 2020 at 10:44 PM Greg KH  wrote:
 On Tue, Aug 18, 2020 at 10:24:13PM +0530, Muni Sekhar wrote:
> On Tue, Aug 18, 2020 at 8:06 PM Greg KH  wrote:
>> On Tue, Aug 18, 2020 at 08:00:11PM +0530, Muni Sekhar wrote:
>>> Hi all,
>>>
>>> I’ve two identical Linux systems with only kernel differences.
>> What are the differences in the kernels?
 You didn't answer this question, is this the same kernel source being
 compared here?  Same version?  Same compiler?  Everything identical?
>>> Both systems are having exactly the same hardware configuration.
>>> Compiler and kernel versions are different. One system has Ubuntu
>>> 16.04.4 LTS(4.4.0-66-generic kernel with gcc version 5.4.0) kernel and
>>> the other one has Ubuntu 18.04.4 LTS(4.15.0-91-generic kernel with gcc
>>> version 7.5.0).
>> Those are _very_ different kernel versions, with many years and tens of
>> thousands of different changes between them.
>>
>> Hopefully the newer kernel is faster, so just stick with that :)
> But unfortunately the newer kernel is very slow, that is the reason
> for starting this investigation :)
> Any type of help,  and guidelines to dive deeper will be highly appreciated.

On the 4.4 kernel you dont have

+CONFIG_RETPOLINE=y
+CONFIG_INTEL_RDT=y

And your base is very different two.

Try to use mainline on both system and see.

You can also use the same base kernel version from ubuntu and

run your test.


>> greg k-h
>
>



Re: [PATCH v3 3/3] selinux: add permission names to trace event

2020-08-18 Thread peter enderborg
On 8/17/20 10:16 PM, Stephen Smalley wrote:
> On 8/17/20 1:07 PM, Thiébaud Weksteen wrote:
>
>> From: Peter Enderborg 
>>
>> In the print out add permissions, it will look like:
>>  <...>-1042  [007]    201.965142: selinux_audited:
>>  requested=0x400 denied=0x400 audited=0x400
>>  result=-13
>>  scontext=system_u:system_r:cupsd_t:s0-s0:c0.c1023
>>  tcontext=system_u:object_r:bin_t:s0
>>  tclass=file permissions={ !entrypoint }
>>
>> This patch is adding the "permissions={ !entrypoint }".
>> The permissions preceded by "!" have been denied and the permissions
>> without have been accepted.
>>
>> Note that permission filtering is done on the audited, denied or
>> requested attributes.
>>
>> Suggested-by: Steven Rostedt 
>> Suggested-by: Stephen Smalley 
>> Reviewed-by: Thiébaud Weksteen 
>> Signed-off-by: Peter Enderborg 
>> ---
>>   include/trace/events/avc.h | 11 +--
>>   security/selinux/avc.c | 36 
>>   2 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/security/selinux/avc.c b/security/selinux/avc.c
>> index 7de5cc5169af..d585b68c2a50 100644
>> --- a/security/selinux/avc.c
>> +++ b/security/selinux/avc.c
>> @@ -695,6 +695,7 @@ static void avc_audit_pre_callback(struct audit_buffer 
>> *ab, void *a)
>>   audit_log_format(ab, " } for ");
>>   }
>>   +
>>   /**
>>    * avc_audit_post_callback - SELinux specific information
>>    * will be called by generic audit code
>
> Also, drop the spurious whitespace change above.
>
>
Is there any other things we need to fix? A part 1&2 now OK?




Re: [PATCH v2 1/2] selinux: add tracepoint on denials

2020-08-15 Thread peter enderborg
On 8/14/20 7:46 PM, Steven Rostedt wrote:
> On Fri, 14 Aug 2020 19:22:13 +0200
> peter enderborg  wrote:
>
>> On 8/14/20 7:08 PM, Stephen Smalley wrote:
>>> On Fri, Aug 14, 2020 at 1:07 PM peter enderborg
>>>  wrote:  
>>>> On 8/14/20 6:51 PM, Stephen Smalley wrote:  
>>>>> On Fri, Aug 14, 2020 at 9:05 AM Thiébaud Weksteen  
>>>>> wrote:  
>>>>>> On Thu, Aug 13, 2020 at 5:41 PM Stephen Smalley
>>>>>>  wrote:  
>>>>>>> An explanation here of how one might go about decoding audited and
>>>>>>> tclass would be helpful to users (even better would be a script to do it
>>>>>>> for them).  Again, I know how to do that but not everyone using
>>>>>>> perf/ftrace will.  
>>>>>> What about something along those lines:
>>>>>>
>>>>>> The tclass value can be mapped to a class by searching
>>>>>> security/selinux/flask.h. The audited value is a bit field of the
>>>>>> permissions described in security/selinux/av_permissions.h for the
>>>>>> corresponding class.  
>>>>> Sure, I guess that works.  Would be nice if we just included the class
>>>>> and permission name(s) in the event itself but I guess you viewed that
>>>>> as too heavyweight?  
>>>> The class name is added in part 2. Im not sure how a proper format for 
>>>> permission
>>>> would look like in trace terms. It is a list, right?  
>>> Yes.  See avc_audit_pre_callback() for example code to log the permission 
>>> names.  
>> I wrote about that on some of the previous sets. The problem is that trace 
>> format is quite fixed. So it is lists are not
>> that easy to handle if you want to filter in them. You can have a trace 
>> event for each of them. You can also add
>> additional trace event "selinux_audied_permission" for each permission. With 
>> that you can filter out tclass or permissions.
>>
>> But the basic thing we would like at the moment is a event that we can debug 
>> in user space.
> We have a trace_seq p helper, that lets you create strings in
> TP_printk(). I should document this more. Thus you can do:
>
> extern const char *audit_perm_to_name(struct trace_seq *p, u16 class, u32 
> audited);
> #define __perm_to_name(p, class, audited) audit_perm_to_name(p, class, 
> audited)
>
>   TP_printk("tclass=%u audited=%x (%s)",
>   __entry->tclass,
>   __entry->audited,
>   __perm_to_name(__entry->tclass, __entry->audited))
>
>
> const char *audit_perm_to_name(struct trace_seq *p, u16 tclass, u32 av)
> {
>   const char *ret = trace_seq_buffer_ptr(p);
>   int i, perm;
>
>   ( some check for tclass integrity here)
>
>   perms = secclass_map[tclass-1].perms;
>
>   i = 0;
>   perm = 1;
>   while (i < (sizeof(av) * 8)) {
>   if ((perm & av) && perms[i]) {
>   trace_seq_printf(p, " %s", perms[i]);
>   av &= ~perm;
>   }
>   i++;
>   perm <<= 1;
>   }
>
>   return ret;
> }
>
> Note, this wont work for perf and trace-cmd as it wouldn't know how to
> parse it, but if the tclass perms are stable, you could create a plugin
> to libtraceevent that can do the above as well.
>
> -- Steve

That works fine. I will do this as third patch in our patch-set.  But I think 
we also should export the permission-map
somewhere. I don’t think there is any good place for it in tracefs. So 
selinuxfs or debugfs might do? And I think it is
more useful to print what is denied than what is audited but that does not 
match the trace event name.






Re: [PATCH v2 1/2] selinux: add tracepoint on denials

2020-08-15 Thread peter enderborg
On 8/14/20 7:46 PM, Steven Rostedt wrote:
> On Fri, 14 Aug 2020 19:22:13 +0200
> peter enderborg  wrote:
>
>> On 8/14/20 7:08 PM, Stephen Smalley wrote:
>>> On Fri, Aug 14, 2020 at 1:07 PM peter enderborg
>>>  wrote:  
>>>> On 8/14/20 6:51 PM, Stephen Smalley wrote:  
>>>>> On Fri, Aug 14, 2020 at 9:05 AM Thiébaud Weksteen  
>>>>> wrote:  
>>>>>> On Thu, Aug 13, 2020 at 5:41 PM Stephen Smalley
>>>>>>  wrote:  
>>>>>>> An explanation here of how one might go about decoding audited and
>>>>>>> tclass would be helpful to users (even better would be a script to do it
>>>>>>> for them).  Again, I know how to do that but not everyone using
>>>>>>> perf/ftrace will.  
>>>>>> What about something along those lines:
>>>>>>
>>>>>> The tclass value can be mapped to a class by searching
>>>>>> security/selinux/flask.h. The audited value is a bit field of the
>>>>>> permissions described in security/selinux/av_permissions.h for the
>>>>>> corresponding class.  
>>>>> Sure, I guess that works.  Would be nice if we just included the class
>>>>> and permission name(s) in the event itself but I guess you viewed that
>>>>> as too heavyweight?  
>>>> The class name is added in part 2. Im not sure how a proper format for 
>>>> permission
>>>> would look like in trace terms. It is a list, right?  
>>> Yes.  See avc_audit_pre_callback() for example code to log the permission 
>>> names.  
>> I wrote about that on some of the previous sets. The problem is that trace 
>> format is quite fixed. So it is lists are not
>> that easy to handle if you want to filter in them. You can have a trace 
>> event for each of them. You can also add
>> additional trace event "selinux_audied_permission" for each permission. With 
>> that you can filter out tclass or permissions.
>>
>> But the basic thing we would like at the moment is a event that we can debug 
>> in user space.
> We have a trace_seq p helper, that lets you create strings in
> TP_printk(). I should document this more. Thus you can do:
>
> extern const char *audit_perm_to_name(struct trace_seq *p, u16 class, u32 
> audited);
> #define __perm_to_name(p, class, audited) audit_perm_to_name(p, class, 
> audited)
>
>   TP_printk("tclass=%u audited=%x (%s)",
>   __entry->tclass,
>   __entry->audited,
>   __perm_to_name(__entry->tclass, __entry->audited))
>
>
> const char *audit_perm_to_name(struct trace_seq *p, u16 tclass, u32 av)
> {
>   const char *ret = trace_seq_buffer_ptr(p);
>   int i, perm;
>
>   ( some check for tclass integrity here)
>
>   perms = secclass_map[tclass-1].perms;
>
>   i = 0;
>   perm = 1;
>   while (i < (sizeof(av) * 8)) {
>   if ((perm & av) && perms[i]) {
>   trace_seq_printf(p, " %s", perms[i]);
>   av &= ~perm;
>   }
>   i++;
>   perm <<= 1;
>   }
>
>   return ret;
> }
>
> Note, this wont work for perf and trace-cmd as it wouldn't know how to
> parse it, but if the tclass perms are stable, you could create a plugin
> to libtraceevent that can do the above as well.
>
> -- Steve

Something like:

    while (i < (sizeof(av) * 8)) {
        if ((perm & av)  && perms[i]) {
            if (!(perm & avdenied))
                trace_seq_printf(p, " %s", perms[i]);
            else
                trace_seq_printf(p, " !%s", perms[i]);
            av &= ~perm;

And you get information about denied too.





Re: [PATCH v2 1/2] selinux: add tracepoint on denials

2020-08-14 Thread peter enderborg
On 8/14/20 8:30 PM, Steven Rostedt wrote:
> On Fri, 14 Aug 2020 20:06:34 +0200
> peter enderborg  wrote:
>
>> Im find with that, but then you  can not do filtering? I would be
>> pretty neat with a filter saying tclass=file permission=write.
>>
> Well, if the mapping is stable, you could do:
>
>   (tclass == 6) && (audited & 0x4)

It does not happen to exist a hook for translate strings to numeric values when 
inserting filter?


> -- Steve




Re: [PATCH v2 1/2] selinux: add tracepoint on denials

2020-08-14 Thread peter enderborg
On 8/14/20 7:46 PM, Steven Rostedt wrote:
> On Fri, 14 Aug 2020 19:22:13 +0200
> peter enderborg  wrote:
>
>> On 8/14/20 7:08 PM, Stephen Smalley wrote:
>>> On Fri, Aug 14, 2020 at 1:07 PM peter enderborg
>>>  wrote:  
>>>> On 8/14/20 6:51 PM, Stephen Smalley wrote:  
>>>>> On Fri, Aug 14, 2020 at 9:05 AM Thiébaud Weksteen  
>>>>> wrote:  
>>>>>> On Thu, Aug 13, 2020 at 5:41 PM Stephen Smalley
>>>>>>  wrote:  
>>>>>>> An explanation here of how one might go about decoding audited and
>>>>>>> tclass would be helpful to users (even better would be a script to do it
>>>>>>> for them).  Again, I know how to do that but not everyone using
>>>>>>> perf/ftrace will.  
>>>>>> What about something along those lines:
>>>>>>
>>>>>> The tclass value can be mapped to a class by searching
>>>>>> security/selinux/flask.h. The audited value is a bit field of the
>>>>>> permissions described in security/selinux/av_permissions.h for the
>>>>>> corresponding class.  
>>>>> Sure, I guess that works.  Would be nice if we just included the class
>>>>> and permission name(s) in the event itself but I guess you viewed that
>>>>> as too heavyweight?  
>>>> The class name is added in part 2. Im not sure how a proper format for 
>>>> permission
>>>> would look like in trace terms. It is a list, right?  
>>> Yes.  See avc_audit_pre_callback() for example code to log the permission 
>>> names.  
>> I wrote about that on some of the previous sets. The problem is that trace 
>> format is quite fixed. So it is lists are not
>> that easy to handle if you want to filter in them. You can have a trace 
>> event for each of them. You can also add
>> additional trace event "selinux_audied_permission" for each permission. With 
>> that you can filter out tclass or permissions.
>>
>> But the basic thing we would like at the moment is a event that we can debug 
>> in user space.
> We have a trace_seq p helper, that lets you create strings in
> TP_printk(). I should document this more. Thus you can do:
>
> extern const char *audit_perm_to_name(struct trace_seq *p, u16 class, u32 
> audited);
> #define __perm_to_name(p, class, audited) audit_perm_to_name(p, class, 
> audited)
>
>   TP_printk("tclass=%u audited=%x (%s)",
>   __entry->tclass,
>   __entry->audited,
>   __perm_to_name(__entry->tclass, __entry->audited))
>
>
> const char *audit_perm_to_name(struct trace_seq *p, u16 tclass, u32 av)
> {
>   const char *ret = trace_seq_buffer_ptr(p);
>   int i, perm;
>
>   ( some check for tclass integrity here)
>
>   perms = secclass_map[tclass-1].perms;
>
>   i = 0;
>   perm = 1;
>   while (i < (sizeof(av) * 8)) {
>   if ((perm & av) && perms[i]) {
>   trace_seq_printf(p, " %s", perms[i]);
>   av &= ~perm;
>   }
>   i++;
>   perm <<= 1;
>   }
>
>   return ret;
> }
>
> Note, this wont work for perf and trace-cmd as it wouldn't know how to
> parse it, but if the tclass perms are stable, you could create a plugin
> to libtraceevent that can do the above as well.
>
> -- Steve

Im find with that, but then you  can not do filtering? I would be pretty neat 
with a filter saying tclass=file permission=write.




Re: [PATCH v2 1/2] selinux: add tracepoint on denials

2020-08-14 Thread peter enderborg
On 8/14/20 7:08 PM, Stephen Smalley wrote:
> On Fri, Aug 14, 2020 at 1:07 PM peter enderborg
>  wrote:
>> On 8/14/20 6:51 PM, Stephen Smalley wrote:
>>> On Fri, Aug 14, 2020 at 9:05 AM Thiébaud Weksteen  wrote:
>>>> On Thu, Aug 13, 2020 at 5:41 PM Stephen Smalley
>>>>  wrote:
>>>>> An explanation here of how one might go about decoding audited and
>>>>> tclass would be helpful to users (even better would be a script to do it
>>>>> for them).  Again, I know how to do that but not everyone using
>>>>> perf/ftrace will.
>>>> What about something along those lines:
>>>>
>>>> The tclass value can be mapped to a class by searching
>>>> security/selinux/flask.h. The audited value is a bit field of the
>>>> permissions described in security/selinux/av_permissions.h for the
>>>> corresponding class.
>>> Sure, I guess that works.  Would be nice if we just included the class
>>> and permission name(s) in the event itself but I guess you viewed that
>>> as too heavyweight?
>> The class name is added in part 2. Im not sure how a proper format for 
>> permission
>> would look like in trace terms. It is a list, right?
> Yes.  See avc_audit_pre_callback() for example code to log the permission 
> names.

I wrote about that on some of the previous sets. The problem is that trace 
format is quite fixed. So it is lists are not
that easy to handle if you want to filter in them. You can have a trace event 
for each of them. You can also add
additional trace event "selinux_audied_permission" for each permission. With 
that you can filter out tclass or permissions.

But the basic thing we would like at the moment is a event that we can debug in 
user space.



Re: [PATCH v2 1/2] selinux: add tracepoint on denials

2020-08-14 Thread peter enderborg
On 8/14/20 6:51 PM, Stephen Smalley wrote:
> On Fri, Aug 14, 2020 at 9:05 AM Thiébaud Weksteen  wrote:
>> On Thu, Aug 13, 2020 at 5:41 PM Stephen Smalley
>>  wrote:
>>> An explanation here of how one might go about decoding audited and
>>> tclass would be helpful to users (even better would be a script to do it
>>> for them).  Again, I know how to do that but not everyone using
>>> perf/ftrace will.
>> What about something along those lines:
>>
>> The tclass value can be mapped to a class by searching
>> security/selinux/flask.h. The audited value is a bit field of the
>> permissions described in security/selinux/av_permissions.h for the
>> corresponding class.
> Sure, I guess that works.  Would be nice if we just included the class
> and permission name(s) in the event itself but I guess you viewed that
> as too heavyweight?

The class name is added in part 2. Im not sure how a proper format for 
permission
would look like in trace terms. It is a list, right?





Re: [PATCH v2 2/2] selinux: add basic filtering for audit trace events

2020-08-13 Thread peter enderborg
On 8/13/20 7:38 PM, Steven Rostedt wrote:
> On Thu, 13 Aug 2020 19:14:10 +0200
> peter enderborg  wrote:
>
>>> To be clear, userspace tools can't use fixed secid values because
>>> secids are dynamically assigned by SELinux and thus secid 42 need
>>> not correspond to the same security context across different boots
>>> even with the same kernel and policy.  I wouldn't include them in
>>> the event unless it is common practice to include fields that can
>>> only be interpreted if you can debug the running kernel.  It would
>>> be akin to including kernel pointers in the event (albeit without
>>> the KASLR ramifications).
>>>
>>>  
>> Just as a reference on my fedora system; out of 1808 events 244 as a
>> pointer print. I don't see that there is any obfuscating aka "%pK" as
>> there is for logs.
> Which is a reason why tracefs is root only.
>
> The "%p" gets obfuscated when printed from the trace file by default
> now. But they are consistent (where the same pointer shows up as the
> same hash).
>
> It's used mainly to map together events. For example, if you print the
> address of a skb in the networking events, it's good to know what
> events reference the same skb, and the pointer is used for that.

So what is your opinion on ssid? I dont mind removing them
now since people dont like it and the strong use-case is not
strong (yet). Is there any problem to put getting them back
later if useful? And then before the strings so the evaluation
of filter first come on number before stings Or is there already
some mechanism that optimize for that?


> -- Steve




Re: [PATCH v2 2/2] selinux: add basic filtering for audit trace events

2020-08-13 Thread peter enderborg
On 8/13/20 5:49 PM, Stephen Smalley wrote:
> On 8/13/20 11:35 AM, peter enderborg wrote:
>
>> On 8/13/20 5:05 PM, Casey Schaufler wrote:
>>> On 8/13/2020 7:48 AM, Thiébaud Weksteen wrote:
>>>> From: Peter Enderborg 
>>>>
>>>> This patch adds further attributes to the event. These attributes are
>>>> helpful to understand the context of the message and can be used
>>>> to filter the events.
>>>>
>>>> There are three common items. Source context, target context and tclass.
>>>> There are also items from the outcome of operation performed.
>>>>
>>>> An event is similar to:
>>>>     <...>-1309  [002]   6346.691689: selinux_audited:
>>>>     requested=0x400 denied=0x400 audited=0x400
>>>>     result=-13 ssid=315 tsid=61
>>> It may not be my place to ask, but *please please please* don't
>>> externalize secids. I understand that it's easier to type "42"
>>> than "system_r:cupsd_t:s0-s0:c0.c1023", and that it's easier for
>>> your tools to parse and store the number. Once you start training
>>> people that system_r:cupsd_t:s0-s0:c0.c1023 is secid 42 you'll
>>> never be able to change it. The secid will start showing up in
>>> scripts. Bad  Things  Will  Happen.
>> Ok, it seems to mostly against having this performance options.
>> Yes, it is a kernel internal data. So is most of the kernel tracing.
>> I see it is a primary tool for kernel debugging but than can also be
>> used for user-space debugging tools.  Hiding data for debuggers
>> does not make any sense too me.
>
> To be clear, userspace tools can't use fixed secid values because secids are 
> dynamically assigned by SELinux and thus secid 42 need not correspond to the 
> same security context across different boots even with the same kernel and 
> policy.  I wouldn't include them in the event unless it is common practice to 
> include fields that can only be interpreted if you can debug the running 
> kernel.  It would be akin to including kernel pointers in the event (albeit 
> without the KASLR ramifications).
>
>
Just as a reference on my fedora system; out of 1808 events 244 as a pointer 
print. I don't see that there is any obfuscating aka "%pK" as there is for logs.




Re: [PATCH v2 2/2] selinux: add basic filtering for audit trace events

2020-08-13 Thread peter enderborg
On 8/13/20 5:49 PM, Stephen Smalley wrote:
> On 8/13/20 11:35 AM, peter enderborg wrote:
>
>> On 8/13/20 5:05 PM, Casey Schaufler wrote:
>>> On 8/13/2020 7:48 AM, Thiébaud Weksteen wrote:
>>>> From: Peter Enderborg 
>>>>
>>>> This patch adds further attributes to the event. These attributes are
>>>> helpful to understand the context of the message and can be used
>>>> to filter the events.
>>>>
>>>> There are three common items. Source context, target context and tclass.
>>>> There are also items from the outcome of operation performed.
>>>>
>>>> An event is similar to:
>>>>     <...>-1309  [002]   6346.691689: selinux_audited:
>>>>     requested=0x400 denied=0x400 audited=0x400
>>>>     result=-13 ssid=315 tsid=61
>>> It may not be my place to ask, but *please please please* don't
>>> externalize secids. I understand that it's easier to type "42"
>>> than "system_r:cupsd_t:s0-s0:c0.c1023", and that it's easier for
>>> your tools to parse and store the number. Once you start training
>>> people that system_r:cupsd_t:s0-s0:c0.c1023 is secid 42 you'll
>>> never be able to change it. The secid will start showing up in
>>> scripts. Bad  Things  Will  Happen.
>> Ok, it seems to mostly against having this performance options.
>> Yes, it is a kernel internal data. So is most of the kernel tracing.
>> I see it is a primary tool for kernel debugging but than can also be
>> used for user-space debugging tools.  Hiding data for debuggers
>> does not make any sense too me.
>
> To be clear, userspace tools can't use fixed secid values because secids are 
> dynamically assigned by SELinux and thus secid 42 need not correspond to the 
> same security context across different boots even with the same kernel and 
> policy.  I wouldn't include them in the event unless it is common practice to 
> include fields that can only be interpreted if you can debug the running 
> kernel.  It would be akin to including kernel pointers in the event (albeit 
> without the KASLR ramifications).
>
Yes of course. Trace debugging is about running kernel. Would i make more sense 
if the was a debugfs entry with the sid's? This filter are a reminisce  of the 
same filter used not only to catch denials. Doing a string compare
for all syscalls keep the cpu busy.  I will do an update without it.



Re: [PATCH v2 2/2] selinux: add basic filtering for audit trace events

2020-08-13 Thread peter enderborg
On 8/13/20 5:05 PM, Casey Schaufler wrote:
> On 8/13/2020 7:48 AM, Thiébaud Weksteen wrote:
>> From: Peter Enderborg 
>>
>> This patch adds further attributes to the event. These attributes are
>> helpful to understand the context of the message and can be used
>> to filter the events.
>>
>> There are three common items. Source context, target context and tclass.
>> There are also items from the outcome of operation performed.
>>
>> An event is similar to:
>><...>-1309  [002]   6346.691689: selinux_audited:
>>requested=0x400 denied=0x400 audited=0x400
>>result=-13 ssid=315 tsid=61
> It may not be my place to ask, but *please please please* don't
> externalize secids. I understand that it's easier to type "42"
> than "system_r:cupsd_t:s0-s0:c0.c1023", and that it's easier for
> your tools to parse and store the number. Once you start training
> people that system_r:cupsd_t:s0-s0:c0.c1023 is secid 42 you'll
> never be able to change it. The secid will start showing up in
> scripts. Bad  Things  Will  Happen.

Ok, it seems to mostly against having this performance options.
Yes, it is a kernel internal data. So is most of the kernel tracing.
I see it is a primary tool for kernel debugging but than can also be
used for user-space debugging tools.  Hiding data for debuggers
does not make any sense too me.


>>scontext=system_u:system_r:cupsd_t:s0-s0:c0.c1023
>>tcontext=system_u:object_r:bin_t:s0 tclass=file
>>
>> With systems where many denials are occurring, it is useful to apply a
>> filter. The filtering is a set of logic that is inserted with
>> the filter file. Example:
>>  echo "tclass==\"file\" && ssid!=42" > events/avc/selinux_audited/filter
>>
>> This adds that we only get tclass=file but not for ssid 42.
>>
>> The trace can also have extra properties. Adding the user stack
>> can be done with
>>echo 1 > options/userstacktrace
>>
>> Now the output will be
>>  runcon-1365  [003]   6960.955530: selinux_audited:
>>  requested=0x400 denied=0x400 audited=0x400
>>  result=-13 ssid=315 tsid=61
>>  scontext=system_u:system_r:cupsd_t:s0-s0:c0.c1023
>>  tcontext=system_u:object_r:bin_t:s0 tclass=file
>>   runcon-1365  [003]   6960.955560: 
>>  =>  <7f325b4ce45b>
>>  =>  <5607093efa57>
>>
>> Note that the ssid is the internal numeric representation of scontext
>> and tsid is numeric for tcontext. They are useful for filtering.
>>
>> Signed-off-by: Peter Enderborg 
>> Reviewed-by: Thiébaud Weksteen 
>> ---
>> v2 changes:
>> - update changelog to include usage examples
>>
>>  include/trace/events/avc.h | 41 --
>>  security/selinux/avc.c | 22 +++-
>>  2 files changed, 44 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/trace/events/avc.h b/include/trace/events/avc.h
>> index 07c058a9bbcd..ac5ef2e1c2c5 100644
>> --- a/include/trace/events/avc.h
>> +++ b/include/trace/events/avc.h
>> @@ -1,6 +1,7 @@
>>  /* SPDX-License-Identifier: GPL-2.0 */
>>  /*
>> - * Author: Thiébaud Weksteen 
>> + * Authors: Thiébaud Weksteen 
>> + *  Peter Enderborg 
>>   */
>>  #undef TRACE_SYSTEM
>>  #define TRACE_SYSTEM avc
>> @@ -12,23 +13,43 @@
>>  
>>  TRACE_EVENT(selinux_audited,
>>  
>> -TP_PROTO(struct selinux_audit_data *sad),
>> +TP_PROTO(struct selinux_audit_data *sad,
>> +char *scontext,
>> +char *tcontext,
>> +const char *tclass
>> +),
>>  
>> -TP_ARGS(sad),
>> +TP_ARGS(sad, scontext, tcontext, tclass),
>>  
>>  TP_STRUCT__entry(
>> -__field(unsigned int, tclass)
>> -__field(unsigned int, audited)
>> +__field(u32, requested)
>> +__field(u32, denied)
>> +__field(u32, audited)
>> +__field(int, result)
>> +__string(scontext, scontext)
>> +__string(tcontext, tcontext)
>> +__string(tclass, tclass)
>> +__field(u32, ssid)
>> +__field(u32, tsid)
>>  ),
>>  
>>  TP_fast_assign(
>> -__entry->tclass = sad->tclass;
>> -__entry->audited = sad->audited;
>> +__entry->requested  = sad->requested;
>> +__entry->deni

Re: [PATCH 2/2] selinux: add attributes to avc tracepoint

2020-08-06 Thread peter enderborg
On 8/6/20 2:11 PM, Stephen Smalley wrote:
> On 8/6/20 4:03 AM, Thiébaud Weksteen wrote:
>
>> From: Peter Enderborg 
>>
>> Add further attributes to filter the trace events from AVC.
>
> Please include sample usage and output in the description.
>
>
Im not sure where you want it to be.

In the commit message or in a Documentation/trace/events-avc.rst ?



Re: [PATCH 2/2] selinux: add attributes to avc tracepoint

2020-08-06 Thread peter enderborg
On 8/6/20 3:49 PM, Stephen Smalley wrote:
> On Thu, Aug 6, 2020 at 9:45 AM Stephen Smalley
>  wrote:
>> On 8/6/20 8:32 AM, Stephen Smalley wrote:
>>
>>> On 8/6/20 8:24 AM, peter enderborg wrote:
>>>
>>>> On 8/6/20 2:11 PM, Stephen Smalley wrote:
>>>>> On 8/6/20 4:03 AM, Thiébaud Weksteen wrote:
>>>>>
>>>>>> From: Peter Enderborg 
>>>>>>
>>>>>> Add further attributes to filter the trace events from AVC.
>>>>> Please include sample usage and output in the description.
>>>>>
>>>>>
>>>> Im not sure where you want it to be.
>>>>
>>>> In the commit message or in a Documentation/trace/events-avc.rst ?
>>> I was just asking for it in the commit message / patch description.  I
>>> don't know what is typical for Documentation/trace.
>> For example, I just took the patches for a spin, running the
>> selinux-testsuite under perf like so:
>>
>> sudo perf record -e avc:selinux_audited -g make test
>>
>> and then ran:
>>
>> sudo perf report -g
>>
>> and a snippet of sample output included:
>>
>>   6.40% 6.40%  requested=0x80 denied=0x80
>> audited=0x80 result=-13 ssid=922 tsid=922
>> scontext=unconfined_u:unconfined_r:test_binder_mgr_t:s0-s0:c0.c1023
>> tcontext=unconfined_u:unconfined_r:test_binder_mgr_t:s0-s0:c0.c1023
>> tclass=capability
> So then the question becomes how do you use the above information,
> e.g. is that sufficient to correlate it to an actual avc: denied
> message, how do you decode the requested/denied/audited fields (or
> should the code do that for you and just report the string name(s) of
> the permission(s), do you need all three of those fields separately,
> is it useful to log the ssid/tsid at all given that you have the
> contexts and sids are dynamically assigned, etc.
>
>>  |
>>  ---0x49564128933d
>> __libc_start_main
>> |
>> |--4.60%--__GI___ioctl
>> |  entry_SYSCALL_64
>> |  do_syscall_64
>> |  __x64_sys_ioctl
>> |  ksys_ioctl
>> |  binder_ioctl
>> |  binder_set_nice
>> |  can_nice
>> |  capable
>> |  security_capable
>> |  cred_has_capability.isra.0
>> |  slow_avc_audit
>> |  common_lsm_audit
>> |  avc_audit_post_callback
>> |  avc_audit_post_callback

The real cool thing happen when you enable "user-stack-trace" too.

   <...>-4820  [007]  85878.897553: selinux_audited: 
requested=0x400 denied=0x400 audited=0x400 result=-13 ssid=341 
tsid=61 scontext=system_u:system_r:ntpd_t:s0 
tcontext=system_u:object_r:bin_t:s0 tclass=file
   <...>-4820  [007]  85878.897572: 
 =>  <7f07d99bb45b>
 =>  <555ecd89ca57>

The fields are useful for filter what you what to see and what you can ignore.  
Having the ssid and text was from the part where it is called.
The numeric can be used for two things. When you dont have any context. Same 
same reason as in post_callback. We need to be static
in format so it need  be there if it ever can happen. And it is also useful for 
faster filtering.

You can do "ssid!=42 && ssid!=43 && tsid==666".  From my view it would be good 
to have all fields there. But they need to right typed to be able
to use the filter mechanism. There must me some trade-off too where the 
argument filtering get bigger than the processing, but I think we can
add a lot more before we reach that threshold.



Re: [PATCH 2/2] selinux: add attributes to avc tracepoint

2020-08-06 Thread peter enderborg
On 8/6/20 5:03 PM, Stephen Smalley wrote:
> On Thu, Aug 6, 2020 at 10:51 AM peter enderborg
>  wrote:
>> On 8/6/20 3:49 PM, Stephen Smalley wrote:
>>> On Thu, Aug 6, 2020 at 9:45 AM Stephen Smalley
>>>  wrote:
>>>> On 8/6/20 8:32 AM, Stephen Smalley wrote:
>>>>
>>>>> On 8/6/20 8:24 AM, peter enderborg wrote:
>>>>>
>>>>>> On 8/6/20 2:11 PM, Stephen Smalley wrote:
>>>>>>> On 8/6/20 4:03 AM, Thiébaud Weksteen wrote:
>>>>>>>
>>>>>>>> From: Peter Enderborg 
>>>>>>>>
>>>>>>>> Add further attributes to filter the trace events from AVC.
>>>>>>> Please include sample usage and output in the description.
>>>>>>>
>>>>>>>
>>>>>> Im not sure where you want it to be.
>>>>>>
>>>>>> In the commit message or in a Documentation/trace/events-avc.rst ?
>>>>> I was just asking for it in the commit message / patch description.  I
>>>>> don't know what is typical for Documentation/trace.
>>>> For example, I just took the patches for a spin, running the
>>>> selinux-testsuite under perf like so:
>>>>
>>>> sudo perf record -e avc:selinux_audited -g make test
>>>>
>>>> and then ran:
>>>>
>>>> sudo perf report -g
>>>>
>>>> and a snippet of sample output included:
>>>>
>>>>   6.40% 6.40%  requested=0x80 denied=0x80
>>>> audited=0x80 result=-13 ssid=922 tsid=922
>>>> scontext=unconfined_u:unconfined_r:test_binder_mgr_t:s0-s0:c0.c1023
>>>> tcontext=unconfined_u:unconfined_r:test_binder_mgr_t:s0-s0:c0.c1023
>>>> tclass=capability
>>> So then the question becomes how do you use the above information,
>>> e.g. is that sufficient to correlate it to an actual avc: denied
>>> message, how do you decode the requested/denied/audited fields (or
>>> should the code do that for you and just report the string name(s) of
>>> the permission(s), do you need all three of those fields separately,
>>> is it useful to log the ssid/tsid at all given that you have the
>>> contexts and sids are dynamically assigned, etc.
>>>
>>>>  |
>>>>  ---0x49564128933d
>>>> __libc_start_main
>>>> |
>>>> |--4.60%--__GI___ioctl
>>>> |  entry_SYSCALL_64
>>>> |  do_syscall_64
>>>> |  __x64_sys_ioctl
>>>> |  ksys_ioctl
>>>> |  binder_ioctl
>>>> |  binder_set_nice
>>>> |  can_nice
>>>> |  capable
>>>> |  security_capable
>>>> |  cred_has_capability.isra.0
>>>> |  slow_avc_audit
>>>> |  common_lsm_audit
>>>> |  avc_audit_post_callback
>>>> |  avc_audit_post_callback
>> The real cool thing happen when you enable "user-stack-trace" too.
>>
>><...>-4820  [007]  85878.897553: selinux_audited: 
>> requested=0x400 denied=0x400 audited=0x400 result=-13 ssid=341 
>> tsid=61 scontext=system_u:system_r:ntpd_t:s0 
>> tcontext=system_u:object_r:bin_t:s0 tclass=file
>><...>-4820  [007]  85878.897572: 
>>  =>  <7f07d99bb45b>
>>  =>  <555ecd89ca57>
>>
>> The fields are useful for filter what you what to see and what you can 
>> ignore.  Having the ssid and text was from the part where it is called.
>> The numeric can be used for two things. When you dont have any context. Same 
>> same reason as in post_callback. We need to be static
>> in format so it need  be there if it ever can happen. And it is also useful 
>> for faster filtering.
>>
>> You can do "ssid!=42 && ssid!=43 && tsid==666".  From my view it would be 
>> good to have all fields there. But they need to right typed to be able
>> to use the filter mechanism. There must me some trade-off too where the 
>> argument filtering get bigger than the processing, but I think we can
>> add a lot more before we reach that 

[tip: core/rcu] rcu: Stop shrinker loop

2020-07-31 Thread tip-bot2 for Peter Enderborg
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: c6dfd72b7a3b70a2054db0f73245ea2f762a8452
Gitweb:
https://git.kernel.org/tip/c6dfd72b7a3b70a2054db0f73245ea2f762a8452
Author:Peter Enderborg 
AuthorDate:Thu, 04 Jun 2020 12:23:20 +02:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 29 Jun 2020 11:58:51 -07:00

rcu: Stop shrinker loop

The count and scan can be separated in time, and there is a fair chance
that all work is already done when the scan starts, which might in turn
result in a needless retry.  This commit therefore avoids this retry by
returning SHRINK_STOP.

Reviewed-by: Uladzislau Rezki (Sony) 
Signed-off-by: Peter Enderborg 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d17e5a0..c8196fa 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3332,7 +3332,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
shrink_control *sc)
break;
}
 
-   return freed;
+   return freed == 0 ? SHRINK_STOP : freed;
 }
 
 static struct shrinker kfree_rcu_shrinker = {


Re: [PATCH] RFC: selinux avc trace

2020-07-30 Thread peter enderborg
On 7/30/20 9:29 PM, Steven Rostedt wrote:
> On Thu, 30 Jul 2020 21:12:39 +0200
> peter enderborg  wrote:
>
>>>> avc:  denied  { find } for interface=vendor.qti.hardware.perf::IPerf 
>>>> sid=u:r:permissioncontroller_app:s0:c230,c256,c512,c768 pid=9164 
>>>> scontext=u:r:permissioncontroller_app:s0:c230,c256,c512,c768 
>>>> tcontext=u:object_r:vendor_hal_perf_hwservice:s0 tclass=hwservice_manager 
>>>> permissive=0
>>>>  avc:  denied  { execute } for  pid=13914 comm="ScionFrontendAp" 
>>>> path="/data/user_de/0/com.google.android.gms/app_chimera/m/0002/oat/arm64/DynamiteLoader.odex"
>>>>  dev="sda77" ino=204967 scontext=u:r:platform_app:s0:c512,c768 
>>>> tcontext=u:object_r:privapp_data_file:s0:c512,c768 tclass=file 
>>>> permissive=0 ppid=788 pcomm="main" pgid=13914 pgcomm="on.updatecenter"
>>>>
>>>> It omit the fields that are not used. Some parts are common some are not. 
>>>> So a correct format specification for trace will be problematic if there 
>>>> is no "optional" field indicator.  
>>> That's all quite noisy. What is the object of these changes? What
>>> exactly are you trying to trace and why?  
>> It is noisy, and it have to be. it covers a lot of different areas.  One 
>> common problem is
>> to debug userspace applications regarding violations. You get the violation 
>> from the logs
>> and try to figure out what you did to cause it. With a trace point you can 
>> do much better
>> when combine with other traces. Having a the userspace stack is a very good 
>> way,
>> unfortunately  it does not work on that many architectures within trace.
>>
>> What exactly are you doing with any trace? You collect data to analyse what's
>> going on. This is not different. Selinux do a specific thing, but is has 
>> lots of parameters.
> Have you thought of adding multiple trace events with if statements
> around them to decode each specific type of event?

Yes. And I think class is good split point. But I think it will require
a few layers, but a is mostly data driven so I think it might be hard
to do it compile time.  I think a hybrid might be possible,
but it then we need some ugly part with a other separator than =,
or some escape seq to separate.

sort of "generc1=X generic2=Y variable1^x variable2^y" or

"generc1=X generic2=Y leftover=[variable1=x variable2=y]"

If there was a formal parameter tree we could maybe do some
generated printer. I don't think there are one, maybe Paul Moore or Stephen 
Smalley
can verify that.

 

> Note, you can have a generic event that gets enabled by all the other
> events via the "reg" and "unreg" part of TRACE_EVENT_FN(). Say its
> called trace_avc, make a dummy trace_avc() call hat doesn't even need
> to be called anywhere, it just needs to exist to get to the other trace
> events.
>
> Then have:
>
>   if (trace_avc_enabled()) {
>   if (event1)
>   trace_avc_req_event1();
>   if (event2)
>   trace_avc_req_event2();
>   [..]
>   }
>
> The reason for the trace_avc_enabled() is because that's a static
> branch, which is a nop when not enabled. When enabled, it is a jump to
> the out of band if condition block that has all the other trace events.
>
> -- Steve




Re: [PATCH] RFC: selinux avc trace

2020-07-30 Thread peter enderborg
On 7/30/20 7:16 PM, Steven Rostedt wrote:
> On Thu, 30 Jul 2020 19:05:49 +0200
> peter enderborg  wrote:
>
>>>> It should be a full structure with a lot of sub strings.  But that make is 
>>>> even more relevant.  
>>> So one event instance can have a list of strings recorded?  
>> Yes, it is a list very similar to a normal trace. But it is more generic.
>>
>> For example ino= is for filesystems that have inode, but for a
>> violation that send a signal that make no sense at all.  Network
>> addresses is in many cases not applicable. laddr= is only exist for
>> for IP.
>>
>> So if you just print them it will look like:
>>
>> avc:  denied  { find } for interface=vendor.qti.hardware.perf::IPerf 
>> sid=u:r:permissioncontroller_app:s0:c230,c256,c512,c768 pid=9164 
>> scontext=u:r:permissioncontroller_app:s0:c230,c256,c512,c768 
>> tcontext=u:object_r:vendor_hal_perf_hwservice:s0 tclass=hwservice_manager 
>> permissive=0
>>  avc:  denied  { execute } for  pid=13914 comm="ScionFrontendAp" 
>> path="/data/user_de/0/com.google.android.gms/app_chimera/m/0002/oat/arm64/DynamiteLoader.odex"
>>  dev="sda77" ino=204967 scontext=u:r:platform_app:s0:c512,c768 
>> tcontext=u:object_r:privapp_data_file:s0:c512,c768 tclass=file permissive=0 
>> ppid=788 pcomm="main" pgid=13914 pgcomm="on.updatecenter"
>>
>> It omit the fields that are not used. Some parts are common some are not. So 
>> a correct format specification for trace will be problematic if there is no 
>> "optional" field indicator.
> That's all quite noisy. What is the object of these changes? What
> exactly are you trying to trace and why?

It is noisy, and it have to be. it covers a lot of different areas.  One common 
problem is
to debug userspace applications regarding violations. You get the violation 
from the logs
and try to figure out what you did to cause it. With a trace point you can do 
much better
when combine with other traces. Having a the userspace stack is a very good way,
unfortunately  it does not work on that many architectures within trace.

What exactly are you doing with any trace? You collect data to analyse what's
going on. This is not different. Selinux do a specific thing, but is has lots 
of parameters.


> -- Steve




Re: [PATCH] RFC: selinux avc trace

2020-07-30 Thread peter enderborg
On 7/30/20 6:02 PM, Steven Rostedt wrote:
> On Thu, 30 Jul 2020 17:31:17 +0200
> peter enderborg  wrote:
>
>> On 7/30/20 5:04 PM, Steven Rostedt wrote:
>>> On Thu, 30 Jul 2020 16:29:12 +0200
>>> peter enderborg  wrote:
>>>  
>>>> +#undef TRACE_SYSTEM
>>>> +#define TRACE_SYSTEM avc
>>>> +
>>>> +#if !defined(_TRACE_AVC_H) || defined(TRACE_HEADER_MULTI_READ)
>>>> +#define _TRACE_AVC_H
>>>> +
>>>> +#include 
>>>> +TRACE_EVENT(avc_data,
>>>> +        TP_PROTO(u32 requested,
>>>> +         u32 denied,
>>>> +         u32 audited,
>>>> +         int result,
>>>> +         const char *msg
>>>> +         ),
>>>> +
>>>> +        TP_ARGS(requested, denied, audited, result,msg),
>>>> +
>>>> +        TP_STRUCT__entry(
>>>> +             __field(u32, requested)
>>>> +             __field(u32, denied)
>>>> +             __field(u32, audited)
>>>> +             __field(int, result)
>>>> +             __array(char, msg, 255)  
>>> You want to use __string() here, otherwise you are wasting a lot of
>>> buffer space.
>>>
>>> __string( msg, msg)  
>> It should be a full structure with a lot of sub strings.  But that make is 
>> even more relevant.
> So one event instance can have a list of strings recorded?

Yes, it is a list very similar to a normal trace. But it is more generic.

For example ino= is for filesystems that have inode, but for a
violation that send a signal that make no sense at all.  Network
addresses is in many cases not applicable. laddr= is only exist for
for IP.

So if you just print them it will look like:

avc:  denied  { find } for interface=vendor.qti.hardware.perf::IPerf 
sid=u:r:permissioncontroller_app:s0:c230,c256,c512,c768 pid=9164 
scontext=u:r:permissioncontroller_app:s0:c230,c256,c512,c768 
tcontext=u:object_r:vendor_hal_perf_hwservice:s0 tclass=hwservice_manager 
permissive=0
 avc:  denied  { execute } for  pid=13914 comm="ScionFrontendAp" 
path="/data/user_de/0/com.google.android.gms/app_chimera/m/0002/oat/arm64/DynamiteLoader.odex"
 dev="sda77" ino=204967 scontext=u:r:platform_app:s0:c512,c768 
tcontext=u:object_r:privapp_data_file:s0:c512,c768 tclass=file permissive=0 
ppid=788 pcomm="main" pgid=13914 pgcomm="on.updatecenter"

It omit the fields that are not used. Some parts are common some are not. So a 
correct format specification for trace will be problematic if there is no 
"optional" field indicator.

>
>>>  
>>>> +             ),
>>>> +
>>>> +        TP_fast_assign(
>>>> +               __entry->requested    = requested;
>>>> +               __entry->denied    = denied;
>>>> +               __entry->audited    = audited;
>>>> +               __entry->result    = result;
>>>> +               memcpy(__entry->msg, msg, 255);  
>>> Not to mention, the above is a bug. As the msg being passed in, is
>>> highly unlikely to be 255 bytes. You just leaked all that memory after
>>> the sting to user space.
>>>
>>> Where you want here:
>>>
>>> __assign_str( msg, msg );  
>> Directly in to the code. Was more in to get in to discussion on how complex 
>> we should have
>> the trace data. There is a lot of fields. Not all is always present. Is 
>> there any good way
>> to handle that? Like "something= somethingelse=42" or "something=nil 
>> somthingelse=42"
> Can you show what you want to record and what you want to display? I'm
> not totally understanding the request.
>
> -- Steve
>
>>>> +    ),
>>>> +
>>>> +        TP_printk("requested=0x%x denied=%d audited=%d result=%d
>>>> msg=%s",
>>>> +          __entry->requested, __entry->denied, __entry->audited,
>>>> __entry->result, __entry->msg
>>>> +          )  




Re: [PATCH] RFC: selinux avc trace

2020-07-30 Thread peter enderborg
On 7/30/20 4:50 PM, Stephen Smalley wrote:
> On Thu, Jul 30, 2020 at 10:29 AM peter enderborg
>  wrote:
>> I did manage to rebase it but this is about my approach.
>>
>> Compared to Thiébaud Weksteen patch this adds:
>>
>> 1 Filtering. Types goes to trace so we can put up a filter for contexts or 
>> type etc.
>>
>> 2 It tries also to cover non denies.  And upon that you should be able to do 
>> coverage tools.
>> I think many systems have a lot more rules that what is needed, but there is 
>> good way
>> to find out what.  A other way us to make a stat page for the rules, but 
>> this way connect to
>> userspace and can be used for test cases.
>>
>> This code need a lot more work, but it shows how the filter should work 
>> (extra info is not right)
>> and there are  memory leaks, extra debug info and nonsense variable etc.
> Perhaps the two of you could work together to come up with a common
> tracepoint that addresses both needs.

Sounds good to me.

> On the one hand, we don't need/want to duplicate the avc message
> itself; we just need enough to be able to correlate them.
> With respect to non-denials, SELinux auditallow statements can be used
> to generate avc: granted messages that can be used to support coverage
> tools although you can easily flood the logs that way.  One other

That is one reason to use trace. I can be used for things that
generate a lot of data. Like memory allocations and scheduler etc,
and it is a developer tool so you should not have to worry about DOS etc.

Both netlink and android logging are too happy to throw away data for
developers to be happy.

> limitation of the other patch is that it doesn't support generating
> trace information for denials silenced by dontaudit rules, which might
> be challenging to debug especially on Android where you can't just run
> semodule -DB to strip all dontaudits.

I think that only work for rooted devices. Many application developers
run on commercial devices that are locked, but they do have access
to trace. But I have no idea if they (google) have intended the selinux traces 
to
available there.



Re: [PATCH] RFC: selinux avc trace

2020-07-30 Thread peter enderborg
On 7/30/20 5:04 PM, Steven Rostedt wrote:
> On Thu, 30 Jul 2020 16:29:12 +0200
> peter enderborg  wrote:
>
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM avc
>> +
>> +#if !defined(_TRACE_AVC_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_AVC_H
>> +
>> +#include 
>> +TRACE_EVENT(avc_data,
>> +        TP_PROTO(u32 requested,
>> +         u32 denied,
>> +         u32 audited,
>> +         int result,
>> +         const char *msg
>> +         ),
>> +
>> +        TP_ARGS(requested, denied, audited, result,msg),
>> +
>> +        TP_STRUCT__entry(
>> +             __field(u32, requested)
>> +             __field(u32, denied)
>> +             __field(u32, audited)
>> +             __field(int, result)
>> +             __array(char, msg, 255)
> You want to use __string() here, otherwise you are wasting a lot of
> buffer space.
>
>   __string( msg, msg)
It should be a full structure with a lot of sub strings.  But that make is even 
more relevant.
>
>> +             ),
>> +
>> +        TP_fast_assign(
>> +               __entry->requested    = requested;
>> +               __entry->denied    = denied;
>> +               __entry->audited    = audited;
>> +               __entry->result    = result;
>> +               memcpy(__entry->msg, msg, 255);
> Not to mention, the above is a bug. As the msg being passed in, is
> highly unlikely to be 255 bytes. You just leaked all that memory after
> the sting to user space.
>
> Where you want here:
>
>   __assign_str( msg, msg );

Directly in to the code. Was more in to get in to discussion on how complex we 
should have
the trace data. There is a lot of fields. Not all is always present. Is there 
any good way
to handle that? Like "something= somethingelse=42" or "something=nil 
somthingelse=42"


>
> -- Steve
>
>
>
>> +    ),
>> +
>> +        TP_printk("requested=0x%x denied=%d audited=%d result=%d
>> msg=%s",
>> +          __entry->requested, __entry->denied, __entry->audited,
>> __entry->result, __entry->msg
>> +          )




[PATCH] RFC: selinux avc trace

2020-07-30 Thread peter enderborg
I did manage to rebase it but this is about my approach.

Compared to Thiébaud Weksteen patch this adds:

1 Filtering. Types goes to trace so we can put up a filter for contexts or type 
etc.

2 It tries also to cover non denies.  And upon that you should be able to do 
coverage tools.
I think many systems have a lot more rules that what is needed, but there is 
good way
to find out what.  A other way us to make a stat page for the rules, but this 
way connect to
userspace and can be used for test cases.

This code need a lot more work, but it shows how the filter should work (extra 
info is not right)
and there are  memory leaks, extra debug info and nonsense variable etc.

From: Peter Enderborg 
Date: Thu, 30 Jul 2020 14:44:53 +0200
Subject: [PATCH] RFC: selinux avc trace

This is not done yet. But it shows a trace for selinux avc.
---
 include/trace/events/avc.h |  92 +
 security/selinux/avc.c | 115 -
 2 files changed, 205 insertions(+), 2 deletions(-)
 create mode 100644 include/trace/events/avc.h

diff --git a/include/trace/events/avc.h b/include/trace/events/avc.h
new file mode 100644
index ..28c1044e918b
--- /dev/null
+++ b/include/trace/events/avc.h
@@ -0,0 +1,92 @@
+/*
+ * License terms: GNU General Public License (GPL) version 2
+ *
+ * Author: Peter Enderborg 
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM avc
+
+#if !defined(_TRACE_AVC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_AVC_H
+
+#include 
+TRACE_EVENT(avc_data,
+        TP_PROTO(u32 requested,
+         u32 denied,
+         u32 audited,
+         int result,
+         const char *msg
+         ),
+
+        TP_ARGS(requested, denied, audited, result,msg),
+
+        TP_STRUCT__entry(
+             __field(u32, requested)
+             __field(u32, denied)
+             __field(u32, audited)
+             __field(int, result)
+             __array(char, msg, 255)
+             ),
+
+        TP_fast_assign(
+               __entry->requested    = requested;
+               __entry->denied    = denied;
+               __entry->audited    = audited;
+               __entry->result    = result;
+               memcpy(__entry->msg, msg, 255);
+    ),
+
+        TP_printk("requested=0x%x denied=%d audited=%d result=%d msg=%s",
+          __entry->requested, __entry->denied, __entry->audited, 
__entry->result, __entry->msg
+          )
+);
+TRACE_EVENT(avc_req,
+        TP_PROTO(u32 requested,
+         u32 denied,
+         u32 audited,
+         int result,
+         const char *msg,
+         u32 ssid,
+         struct selinux_state *state
+         ),
+
+        TP_ARGS(requested, denied, audited, result,msg, ssid, state),
+
+        TP_STRUCT__entry(
+            __field(u32, requested)
+            __field(u32, denied)
+            __field(u32, audited)
+            __field(int, result)
+            __array(char, msg, 255)
+            __field(u32, ssid)
+            __field(struct selinux_state *, state)
+            __field(char*, scontext)
+            __field(u32, ssid_len)
+            __field(u32, ssid_res)
+             ),
+
+        TP_fast_assign(
+            __entry->requested    = requested;
+            __entry->denied    = denied;
+            __entry->audited    = audited;
+             __entry->result    = result;
+            memcpy(__entry->msg, msg, 255);
+            __entry->ssid    = ssid;
+            __entry->state    = state;
+            __entry->ssid_res = security_sid_to_context(
+                           state,
+                           ssid,
+                           &__entry->scontext,
+                           &__entry->ssid_len);
+    ),
+
+        TP_printk("requested=0x%x denied=%d audited=%d result=%d msg=%s 
ssid=%d state=%p scontext=%s slen=%d s=%d",
+          __entry->requested, __entry->denied, __entry->audited, 
__entry->result, __entry->msg, __entry->ssid, 
__entry->state,__entry->scontext,__entry->ssid_len, __entry->ssid_res
+          )
+);
+
+#endif /* _TRACE_AVC_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index d18cb32a242a..d8cb9a23d669 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -30,6 +30,8 @@
 #include "avc.h"
 #include "avc_ss.h"
 #include "classmap.h"
+#define CREATE_TRACE_POINTS
+#include 
 
 #define AVC_CACHE_SLOTS            512
 #define AVC_DEF_CACHE_THRESHOLD        512
@@ -126,6 +128,41 @@ static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
 return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
 }
 
+static int avc_dump_avs(char *ab, u16 tclass, u32 av)
+{
+    const char **perms;
+    int i, perm;
+    int rp;
+
+    i

Re: [PATCH] selinux: add tracepoint on denials

2020-07-30 Thread peter enderborg
On 7/28/20 6:02 PM, Thiébaud Weksteen wrote:
> On Tue, Jul 28, 2020 at 5:12 PM Paul Moore  wrote:
>> Perhaps it would be helpful if you provided an example of how one
>> would be expected to use this new tracepoint?  That would help put
>> things in the proper perspective.
> The best example is the one I provided in the commit message, that is
> using perf (or a perf equivalent), to hook onto that tracepoint.
>
>> Well, to be honest, the very nature of this tracepoint is duplicating
>> the AVC audit record with a focus on using perf to establish a full
>> backtrace at the expense of reduced information.  At least that is how
>> it appears to me.
> I see both methods as complementary. By default, the kernel itself can
> do some reporting (i.e avc message) on which process triggered the
> denial, what was the context, etc. This is useful even in production
> and doesn't require any extra tooling.
> The case for adding this tracepoint can be seen as advanced debugging.
> That is, once an avc denial has been confirmed, a developer can use
> this tracepoint to surface the userland stacktrace. It requires more
> userland tools and symbols on the userland binaries.

I think from development view you would like to have a better
way to trap this events in userspace. One idea that I have is
is to have more outcomes from a rule. We have today allow,
dontaudit, auditallow i think it would be good to have signal sent too.
"signal-xxx-allow" for some set of signals. SIGBUS, SIGSEGV, SIGABRT maybe.

That will be a good way to pickup the problem with a debugger or generate a
a core file.

I have also done some selinux trace functions. I think they collide with this 
set,
but I think I can rebase them upon yours and see if they give some more 
functionality.

I see this functionality very much needed in some form.




[PATCH 1/2] tracefs: Remove unnecessary debug_fs checks.

2020-07-16 Thread Peter Enderborg
This is a preparation for debugfs restricted mode.
We don't need debugfs to trace, the removed check stop tracefs to work
if debugfs is not initialised. We instead tries to automount within
debugfs and relay on it's handling. The code path is to create a
backward compatibility from when tracefs was part of debugfs, it is now
standalone and does not need debugfs. When debugfs is in restricted
it is compiled in but not active and return EPERM to clients and
tracefs wont work if it assumes it is active it is compiled in
kernel.

Reported-by: kernel test robot 
Signed-off-by: Peter Enderborg 
Reviewed-by: Greg Kroah-Hartman 
Acked-by: Steven Rostedt (VMware) 
---
 kernel/trace/trace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb62269724d5..848f67a5f16d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8945,9 +8945,7 @@ struct dentry *tracing_init_dentry(void)
if (tr->dir)
return NULL;
 
-   if (WARN_ON(!tracefs_initialized()) ||
-   (IS_ENABLED(CONFIG_DEBUG_FS) &&
-WARN_ON(!debugfs_initialized(
+   if (WARN_ON(!tracefs_initialized()))
return ERR_PTR(-ENODEV);
 
/*
-- 
2.17.1



[PATCH v8 0/2] debugfs: Add access restriction option

2020-07-16 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

v2. Removed MOUNT as part of restrictions. Added API's restrictions as
separate restriction.
v3  Updated Documentation after Randy Dunlap reviews and suggestions.
v4  Removed #ifdefs from inode.c and using internal.h for configuration
and now using BIT() for that. Function is now always on, and are
instead selected by a built in default or command line parameter.
Changed return value on debug_mount
Reported-by: kernel test robot 
Im not sure about that it is right
v5  Added notes to config help suggested by GregKH.
Removed _BIT from names, white-space and tab.
(checkpatch did not complain).
v6  Using ALLOW instead of ACCESS as name on BIT's. Change the fs to
mount to make it clear and easy to understand.
v7  Updated Kconfig.debug with Randy Dunlap corrections.
v8  Spell fixes from Randy and using else-if for command argument
parser.




[PATCH 2/2] debugfs: Add access restriction option

2020-07-16 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)

It can be on or off, but also internally on but not seen from user-space.
This no-mount mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
 .../admin-guide/kernel-parameters.txt | 15 +++
 fs/debugfs/inode.c| 39 +++
 fs/debugfs/internal.h | 14 +++
 lib/Kconfig.debug | 32 +++
 4 files changed, 100 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..6766a308ad96 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,21 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] This parameter enables what is exposed to 
userspace
+   and debugfs internal clients.
+   Format: { on, no-mount, off }
+   on: All functions are enabled.
+   no-mount:
+   Filesystem is not registered but kernel clients 
can
+   access APIs and a crashkernel can be used to 
read
+   its content. There is nothing to mount.
+   off:Filesystem is not registered and clients
+   get a -EPERM as result when trying to register 
files
+   or directories within debugfs.
+   This is equivalent of the runtime functionality 
if
+   debugfs was not enabled in the kernel at all.
+   Default value is set in build-time with a kernel 
configuration.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..2fcf66473436 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -35,6 +35,7 @@
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+static unsigned int debugfs_allow = DEFAULT_DEBUGFS_ALLOW_BITS;
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +267,9 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -311,6 +315,9 @@ static struct dentry *start_creating(const char *name, 
struct dentry *parent)
struct dentry *dentry;
int error;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
pr_debug("creating file '%s'\n", name);
 
if (IS_ERR(parent))
@@ -385,6 +392,11 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +553,11 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +600,11 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+

[PATCH 2/2] debugfs: Add access restriction option

2020-07-15 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)

It can be on or off, but also internally on but not seen from user-space.
This no-mount mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
 .../admin-guide/kernel-parameters.txt | 15 
 fs/debugfs/inode.c| 37 +++
 fs/debugfs/internal.h | 14 +++
 lib/Kconfig.debug | 32 
 4 files changed, 98 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..779d6cdc9627 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,21 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] This parameter enables what is exposed to 
userspace
+   and debugfs internal clients.
+   Format: { on, no-mount, off }
+   on: All functions are enabled.
+   no-mount:
+   Filesystem is not registered but kernel clients 
can
+   access APIs and a crashkernel can be used to 
read
+   its content. There is nothing to mount.
+   off:Filesystem is not registered and clients
+   get a -EPERM as result when trying to register 
files
+   or directories within debugfs.
+   This is equilivant of the runtime functionality 
if
+   debugfs was not enabled in the kernel at all.
+   Default value is set in build-time with a kernel 
configuration.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..02d08b17d0e6 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -35,6 +35,7 @@
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+static unsigned int debugfs_allow = DEFAULT_DEBUGFS_ALLOW_BITS;
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +267,9 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -311,6 +315,9 @@ static struct dentry *start_creating(const char *name, 
struct dentry *parent)
struct dentry *dentry;
int error;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
pr_debug("creating file '%s'\n", name);
 
if (IS_ERR(parent))
@@ -385,6 +392,11 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +553,11 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +600,11 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+

[PATCH v7 0/2] debugfs: Add access restriction option

2020-07-15 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

v2. Removed MOUNT as part of restrictions. Added API's restrictions as
separate restriction.
v3  Updated Documentation after Randy Dunlap reviews and suggestions.
v4  Removed #ifdefs from inode.c and using internal.h for configuration
and now using BIT() for that. Function is now always on, and are
instead selected by a built in default or command line parameter.
Changed return value on debug_mount
Reported-by: kernel test robot 
Im not sure about that it is right
v5  Added notes to config help suggested by GregKH.
Removed _BIT from names, white-space and tab.
(checkpatch did not complain).
v6  Using ALLOW instead of ACCESS as name on BIT's. Change the fs to
mount to make it clear and easy to understand.
v7  Updated Kconfig.debug with Randy Dunlap corrections.




[PATCH 1/2] tracefs: Remove unnecessary debug_fs checks.

2020-07-15 Thread Peter Enderborg
This is a preparation for debugfs restricted mode.
We don't need debugfs to trace, the removed check stop tracefs to work
if debugfs is not initialised. We instead tries to automount within
debugfs and relay on it's handling. The code path is to create a
backward compatibility from when tracefs was part of debugfs, it is now
standalone and does not need debugfs. When debugfs is in restricted
it is compiled in but not active and return EPERM to clients and
tracefs wont work if it assumes it is active it is compiled in
kernel.

Reported-by: kernel test robot 
Signed-off-by: Peter Enderborg 
Reviewed-by: Greg Kroah-Hartman 
Acked-by: Steven Rostedt (VMware) 
---
 kernel/trace/trace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb62269724d5..848f67a5f16d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8945,9 +8945,7 @@ struct dentry *tracing_init_dentry(void)
if (tr->dir)
return NULL;
 
-   if (WARN_ON(!tracefs_initialized()) ||
-   (IS_ENABLED(CONFIG_DEBUG_FS) &&
-WARN_ON(!debugfs_initialized(
+   if (WARN_ON(!tracefs_initialized()))
return ERR_PTR(-ENODEV);
 
/*
-- 
2.17.1



[PATCH 1/2] tracefs: Remove unnecessary debug_fs checks.

2020-07-15 Thread Peter Enderborg
This is a preparation for debugfs restricted mode.
We don't need debugfs to trace, the removed check stop tracefs to work
if debugfs is not initialised. We instead tries to automount within
debugfs and relay on it's handling. The code path is to create a
backward compatibility from when tracefs was part of debugfs, it is now
standalone and does not need debugfs. When debugfs is in restricted
it is compiled in but not active and return EPERM to clients and
tracefs wont work if it assumes it is active it is compiled in
kernel.

Reported-by: kernel test robot 
Signed-off-by: Peter Enderborg 
Reviewed-by: Greg Kroah-Hartman 
Acked-by: Steven Rostedt (VMware) 
---
 kernel/trace/trace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb62269724d5..848f67a5f16d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8945,9 +8945,7 @@ struct dentry *tracing_init_dentry(void)
if (tr->dir)
return NULL;
 
-   if (WARN_ON(!tracefs_initialized()) ||
-   (IS_ENABLED(CONFIG_DEBUG_FS) &&
-WARN_ON(!debugfs_initialized(
+   if (WARN_ON(!tracefs_initialized()))
return ERR_PTR(-ENODEV);
 
/*
-- 
2.17.1



[PATCH v6 0/2] debugfs: Add access restriction option

2020-07-15 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

v2. Removed MOUNT as part of restrictions. Added API's restrictions as
separate restriction.
v3  Updated Documentation after Randy Dunlap reviews and suggestions.
v4  Removed #ifdefs from inode.c and using internal.h for configuration
and now using BIT() for that. Function is now always on, and are
instead selected by a built in default or command line parameter.
Changed return value on debug_mount
Reported-by: kernel test robot 
Im not sure about that it is right
v5  Added notes to config help suggested by GregKH.
Removed _BIT from names, white-space and tab.
(checkpatch did not complain).
v6  Using ALLOW instead of ACCESS as name on BIT's. Change the fs to
mount to make it clear and easy to understand. 




[PATCH 2/2] debugfs: Add access restriction option

2020-07-15 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)

It can be on or off, but also internally on but not seen from user-space.
This no-mount mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
 .../admin-guide/kernel-parameters.txt | 15 
 fs/debugfs/inode.c| 37 +++
 fs/debugfs/internal.h | 14 +++
 lib/Kconfig.debug | 32 
 4 files changed, 98 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..779d6cdc9627 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,21 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] This parameter enables what is exposed to 
userspace
+   and debugfs internal clients.
+   Format: { on, no-mount, off }
+   on: All functions are enabled.
+   no-mount:
+   Filesystem is not registered but kernel clients 
can
+   access APIs and a crashkernel can be used to 
read
+   its content. There is nothing to mount.
+   off:Filesystem is not registered and clients
+   get a -EPERM as result when trying to register 
files
+   or directories within debugfs.
+   This is equilivant of the runtime functionality 
if
+   debugfs was not enabled in the kernel at all.
+   Default value is set in build-time with a kernel 
configuration.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..02d08b17d0e6 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -35,6 +35,7 @@
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+static unsigned int debugfs_allow = DEFAULT_DEBUGFS_ALLOW_BITS;
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +267,9 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -311,6 +315,9 @@ static struct dentry *start_creating(const char *name, 
struct dentry *parent)
struct dentry *dentry;
int error;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
pr_debug("creating file '%s'\n", name);
 
if (IS_ERR(parent))
@@ -385,6 +392,11 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +553,11 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +600,11 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+

[PATCH 1/2] tracefs: Remove unnecessary debug_fs checks.

2020-07-15 Thread Peter Enderborg
This is a preparation for debugfs restricted mode.
We don't need debugfs to trace, the removed check stop tracefs to work
if debugfs is not initialised. We instead tries to automount within
debugfs and relay on it's handling. The code path is to create a
backward compatibility from when tracefs was part of debugfs, it is now
standalone and does not need debugfs. When debugfs is in restricted
it is compiled in but not active and return EPERM to clients and
tracefs wont work if it assumes it is active it is compiled in
kernel.

Reported-by: kernel test robot 
Signed-off-by: Peter Enderborg 
Reviewed-by: Greg Kroah-Hartman 
Acked-by: Steven Rostedt (VMware) 
---
 kernel/trace/trace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb62269724d5..848f67a5f16d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8945,9 +8945,7 @@ struct dentry *tracing_init_dentry(void)
if (tr->dir)
return NULL;
 
-   if (WARN_ON(!tracefs_initialized()) ||
-   (IS_ENABLED(CONFIG_DEBUG_FS) &&
-WARN_ON(!debugfs_initialized(
+   if (WARN_ON(!tracefs_initialized()))
return ERR_PTR(-ENODEV);
 
/*
-- 
2.17.1



[PATCH v5 0/2] debugfs: Add access restriction option

2020-07-15 Thread Peter Enderborg


Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

v2. Removed MOUNT as part of restrictions. Added API's restrictions as
separate restriction.
v3  Updated Documentation after Randy Dunlap reviews and suggestions.
v4  Removed #ifdefs from inode.c and using internal.h for configuration
and now using BIT() for that. Function is now always on, and are
instead selected by a built in default or command line parameter.
Changed return value on debug_mount
Reported-by: kernel test robot 
Im not sure about that it is right
v5  Added notes to config help suggested by GregKH.
Removed _BIT from names, white-space and tab.
(checkpatch did not complain).




[PATCH 2/2] debugfs: Add access restriction option

2020-07-15 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)

It can be on or off, but also internally on but not seen from user-space.
This no-fs mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
 .../admin-guide/kernel-parameters.txt | 14 +++
 fs/debugfs/inode.c| 37 +++
 fs/debugfs/internal.h | 14 +++
 lib/Kconfig.debug | 32 
 4 files changed, 97 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..805aa2e58491 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,20 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] This parameter enables what is exposed to 
userspace
+   and debugfs internal clients.
+   Format: { on, no-fs, off }
+   on: All functions are enabled.
+   no-fs:  Filesystem is not registered but kernel clients 
can
+   access APIs and a crashkernel can be used to 
read
+   its content. There is nothing to mount.
+   off:Filesystem is not registered and clients
+   get a -EPERM as result when trying to register 
files
+   or directories within debugfs.
+   This is equilivant of the runtime functionality 
if
+   debugfs was not enabled in the kernel at all.
+   Default value is set in build-time with a kernel 
configuration.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..eee937c49a80 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -35,6 +35,7 @@
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+static unsigned int debugfs_allow = DEFAULT_DEBUGFS_ACCESS_BITS;
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +267,9 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -311,6 +315,9 @@ static struct dentry *start_creating(const char *name, 
struct dentry *parent)
struct dentry *dentry;
int error;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+
pr_debug("creating file '%s'\n", name);
 
if (IS_ERR(parent))
@@ -385,6 +392,11 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +553,11 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +600,11 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+ 

Re: nr_cpu_ids vs AMD 3970x(32 physical CPUs)

2020-07-03 Thread peter enderborg
On 7/3/20 5:57 PM, Uladzislau Rezki wrote:
> Hello, folk.
>
> I have a system based on AMD 3970x CPUs. It has 32 physical cores
> and 64 threads. It seems that "nr_cpu_ids" variable is not correctly
> set on latest 5.8-rc3 kernel. Please have a look below on dmesg output:
>
> 
> urezki@pc638:~$ sudo dmesg | grep 128
> [0.00] IOAPIC[0]: apic_id 128, version 33, address 0xfec0, GSI 
> 0-23
> [0.00] smpboot: Allowing 128 CPUs, 64 hotplug CPUs
> [0.00] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:128 
> nr_node_ids:1

My 3950 do


[    0.005271] ACPI: SSDT 0xCA1F5000 0003F1 (v02 ALASKA CPUSSDT  
01072009 AMI  01072009)
[    0.108266] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
[    0.111384] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:32 nr_cpu_ids:32 
nr_node_ids:1

(Fedora F32  5.6.18-300) What motherboard and BIOs do you use?


> ...
> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=128, Nodes=1
> [0.00] rcu: RCU restricting CPUs from NR_CPUS=512 to 
> nr_cpu_ids=128.
> [0.00] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=128
> urezki@pc638:~$
> 
>
> For example SLUB thinks that it deals with 128 CPUs in the system what is
> wrong if i do not miss something. Since nr_cpu_ids is broken(?), thus the
> "cpu_possible_mask" does not correspond to reality as well.
>
> Any thoughts?
>
> Thanks!
>
> --
> Vlad Rezki




[PATCH 1/2] tracefs: Remove unnecessary debug_fs checks.

2020-06-22 Thread Peter Enderborg
This is a preparation for debugfs restricted mode.
We don't need debugfs to trace, the removed check stop tracefs to work
if debugfs is not initialised. We instead tries to automount within
debugfs and relay on it's handling. The code path is to create a
backward compatibility from when tracefs was part of debugfs, it is now
standalone and does not need debugfs. When debugfs is in restricted
it is compiled in but not active and return EPERM to clients and
tracefs wont work if it assumes it is active it is compiled in
kernel.

Reported-by: kernel test robot 
Signed-off-by: Peter Enderborg 
---
 kernel/trace/trace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ec44b0e2a19c..34ed82364edb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8946,9 +8946,7 @@ struct dentry *tracing_init_dentry(void)
if (tr->dir)
return NULL;
 
-   if (WARN_ON(!tracefs_initialized()) ||
-   (IS_ENABLED(CONFIG_DEBUG_FS) &&
-WARN_ON(!debugfs_initialized(
+   if (WARN_ON(!tracefs_initialized()))
return ERR_PTR(-ENODEV);
 
/*
-- 
2.26.2



[PATCH v4 0/2] debugfs: Add access restriction option

2020-06-22 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

v2. Removed MOUNT as part of restrictions. Added API's restrictions as
separate restriction.
v3  Updated Documentation after Randy Dunlap reviews and suggestions.
v4  Removed #ifdefs from inode.c and using internal.h for configuration
and now using BIT() for that. Function is now always on, and are
instead selected by a built in default or command line parameter.
Added preparation patch that removes check debugfs is initialised.
Reported-by: kernel test robot 



[PATCH 2/2] debugfs: Add access restriction option

2020-06-22 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)

It can be on or off, but also internally on but not seen from user-space.
This no-fs mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
 .../admin-guide/kernel-parameters.txt | 12 ++
 fs/debugfs/inode.c| 37 +++
 fs/debugfs/internal.h | 14 +++
 lib/Kconfig.debug | 32 
 4 files changed, 95 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..236aacaceaf5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,18 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] This parameter enables what is exposed to 
userspace
+   and debugfs internal clients.
+   Format: { on, no-fs, off }
+   on: All functions are enabled.
+   no-fs:  Filesystem is not registered but kernel clients 
can
+   access APIs and a crashkernel can be used to 
read
+   its content. There is nothing to mount.
+   off:Filesystem is not registered and clients
+   get a -EPERM as result when trying to register 
files
+   or directories within debugfs.
+   Default value is set in build-time with a kernel 
configuration.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..a4a1c92ae478 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -35,6 +35,7 @@
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+static unsigned int debugfs_allow = DEFAULT_DEBUGFS_ACCESS_BITS;
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +267,9 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API_BIT))
+   return ERR_CAST(ERR_PTR(-EPERM));
+
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -311,6 +315,9 @@ static struct dentry *start_creating(const char *name, 
struct dentry *parent)
struct dentry *dentry;
int error;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API_BIT))
+   return ERR_PTR(-EPERM);
+
pr_debug("creating file '%s'\n", name);
 
if (IS_ERR(parent))
@@ -385,6 +392,11 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API_BIT)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +553,11 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API_BIT)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +600,11 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API_BIT)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentri

[PATCH v3] debugfs: Add access restriction option

2020-06-17 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

When enabled it is needed a kernel command line parameter to be activated.

It can be on or off, but also internally on but not seen from user-space.
This no-fs mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
v2. Removed MOUNT as part of restrictions. Added API's restrictions as
separate restriction.
v3  Updated Documentation after Randy Dunlap reviews and suggestions.

 .../admin-guide/kernel-parameters.txt | 11 +
 fs/debugfs/inode.c| 47 +++
 lib/Kconfig.debug | 10 
 3 files changed, 68 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..249c86e53bb7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,17 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] When CONFIG_DEBUG_FS_RESTRICTED is set, this 
parameter
+   enables what is exposed to userspace.
+   Format: { on, no_fs, off }
+   on: All functions are enabled.
+   no_fs:  Filesystem is not registered but kernel clients 
can
+   access APIs and a crashkernel can be used to 
read
+   it's content. There its nothing to mount.
+   off:(default) Filesystem is not registered and 
clients
+   get a -EPERM as result when trying to register 
files
+   or directories within debugfs.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..2bd80a932ae1 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -31,10 +31,17 @@
 #include "internal.h"
 
 #define DEBUGFS_DEFAULT_MODE   0700
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+#define DEBUGFS_ALLOW_API 0x2
+#define DEBUGFS_ALLOW_FS 0x1
+#endif
 
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+static unsigned int debugfs_allow;
+#endif
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +273,10 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+#endif
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -385,6 +396,12 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+#endif
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +558,12 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+#endif
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +606,12 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+#endif
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create automount 

[PATCH v2] debugfs: Add access restriction option

2020-06-11 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.

When enabled it is needed a kernel command line parameter to be activated.

It can be on or off, but also internally on but not seen from user-space.
This no-fs mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.

Signed-off-by: Peter Enderborg 
---
 .../admin-guide/kernel-parameters.txt | 11 +
 fs/debugfs/inode.c| 47 +++
 lib/Kconfig.debug | 10 
 3 files changed, 68 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 4379c6ac3265..b1c1446aa90d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -827,6 +827,17 @@
useful to also enable the page_owner functionality.
on: enable the feature
 
+   debugfs=[KNL] When CONFIG_DEBUG_FS_RESTRICTED is set, this 
parameter
+   enable what is exposed to userspace.
+   Format: { on, no_fs, off }
+   on: All functions are enabled.
+   no_fs:  Filesystem is not registered but kernel clients 
can
+   access apis and a crashkernel can be used to 
read
+   it's content. There is noting to mount.
+   off:(default) Filesystem is not registered and 
clients
+   get a -EPERM as result when trying to register 
files
+   or directory’s within debugfs.
+
debugpat[X86] Enable PAT debugging
 
decnet.addr=[HW,NET]
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..2bd80a932ae1 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -31,10 +31,17 @@
 #include "internal.h"
 
 #define DEBUGFS_DEFAULT_MODE   0700
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+#define DEBUGFS_ALLOW_API 0x2
+#define DEBUGFS_ALLOW_FS 0x1
+#endif
 
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+static unsigned int debugfs_allow;
+#endif
 
 /*
  * Don't allow access attributes to be changed whilst the kernel is locked down
@@ -266,6 +273,10 @@ static struct dentry *debug_mount(struct file_system_type 
*fs_type,
int flags, const char *dev_name,
void *data)
 {
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API))
+   return ERR_PTR(-EPERM);
+#endif
return mount_single(fs_type, flags, data, debug_fill_super);
 }
 
@@ -385,6 +396,12 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
if (IS_ERR(dentry))
return dentry;
 
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+#endif
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
@@ -541,6 +558,12 @@ struct dentry *debugfs_create_dir(const char *name, struct 
dentry *parent)
if (IS_ERR(dentry))
return dentry;
 
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+#endif
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
@@ -583,6 +606,12 @@ struct dentry *debugfs_create_automount(const char *name,
if (IS_ERR(dentry))
return dentry;
 
+#ifdef CONFIG_DEBUG_FS_RESTRICTED
+   if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
+   failed_creating(dentry);
+   return ERR_PTR(-EPERM);
+   }
+#endif
inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create automount '%s'\n",
@@ -786,10 +815,28 @@ bool debugfs_initialized(void)
 }
 EXPORT_SYMBOL_GPL(debugfs_initialized);
 
+static int __init debugfs_kernel(char *str)
+{

[PATCH] rcu: Stop shrinker loop

2020-06-04 Thread Peter Enderborg
The count and scan can be separated in time. It is a fair chance
that all work is already done when the scan starts. It
then might retry. This is can be avoided with returning SHRINK_STOP.

Signed-off-by: Peter Enderborg 
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c716eadc7617..8b36c6b2887d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3310,7 +3310,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct 
shrink_control *sc)
break;
}
 
-   return freed;
+   return freed == 0 ? SHRINK_STOP : freed;
 }
 
 static struct shrinker kfree_rcu_shrinker = {
-- 
2.17.1



Re: [tip: core/rcu] rcu/tree: Add a shrinker to prevent OOM due to kfree_rcu() batching

2020-06-03 Thread peter enderborg
On 5/11/20 10:59 PM, tip-bot2 for Joel Fernandes (Google) wrote:
> The following commit has been merged into the core/rcu branch of tip:
>
> Commit-ID: 9154244c1ab6c9db4f1f25ac8f73bd46dba64287
> Gitweb:
> https://git.kernel.org/tip/9154244c1ab6c9db4f1f25ac8f73bd46dba64287
> Author:Joel Fernandes (Google) 
> AuthorDate:Mon, 16 Mar 2020 12:32:27 -04:00
> Committer: Paul E. McKenney 
> CommitterDate: Mon, 27 Apr 2020 11:02:50 -07:00
>
> rcu/tree: Add a shrinker to prevent OOM due to kfree_rcu() batching
>
> To reduce grace periods and improve kfree() performance, we have done
> batching recently dramatically bringing down the number of grace periods
> while giving us the ability to use kfree_bulk() for efficient kfree'ing.
>
> However, this has increased the likelihood of OOM condition under heavy
> kfree_rcu() flood on small memory systems. This patch introduces a
> shrinker which starts grace periods right away if the system is under
> memory pressure due to existence of objects that have still not started
> a grace period.
>
> With this patch, I do not observe an OOM anymore on a system with 512MB
> RAM and 8 CPUs, with the following rcuperf options:
>
> rcuperf.kfree_loops=2 rcuperf.kfree_alloc_num=8000
> rcuperf.kfree_rcu_test=1 rcuperf.kfree_mult=2
>
> Otherwise it easily OOMs with the above parameters.
>
> NOTE:
> 1. On systems with no memory pressure, the patch has no effect as intended.
> 2. In the future, we can use this same mechanism to prevent grace periods
>from happening even more, by relying on shrinkers carefully.
>
> Cc: ure...@gmail.com
> Signed-off-by: Joel Fernandes (Google) 
> Signed-off-by: Paul E. McKenney 
> ---
>  kernel/rcu/tree.c | 60 ++-
>  1 file changed, 60 insertions(+)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 156ac8d..e299cd0 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2824,6 +2824,8 @@ struct kfree_rcu_cpu {
>   struct delayed_work monitor_work;
>   bool monitor_todo;
>   bool initialized;
> + // Number of objects for which GP not started
> + int count;


Isn't it better with a atomic counter to avoid the irq handling  in 
shrink_count?


>  };
>  
>  static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
> @@ -2937,6 +2939,8 @@ static inline bool queue_kfree_rcu_work(struct 
> kfree_rcu_cpu *krcp)
>   krcp->head = NULL;
>   }
>  
> + krcp->count = 0;
> +
>   /*
>* One work is per one batch, so there are two "free 
> channels",
>* "bhead_free" and "head_free" the batch can handle. 
> It can be
> @@ -3073,6 +3077,8 @@ void kfree_call_rcu(struct rcu_head *head, 
> rcu_callback_t func)
>   krcp->head = head;
>   }
>  
> + krcp->count++;
> +
>   // Set timer to drain after KFREE_DRAIN_JIFFIES.
>   if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
>   !krcp->monitor_todo) {
> @@ -3087,6 +3093,58 @@ unlock_return:
>  }
>  EXPORT_SYMBOL_GPL(kfree_call_rcu);
>  
> +static unsigned long
> +kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> + int cpu;
> + unsigned long flags, count = 0;
> +
> + /* Snapshot count of all CPUs */
> + for_each_online_cpu(cpu) {
> + struct kfree_rcu_cpu *krcp = per_cpu_ptr(, cpu);
> +
> + spin_lock_irqsave(>lock, flags);
> + count += krcp->count;
> + spin_unlock_irqrestore(>lock, flags);
> + }
> +
> + return count;
> +}
> +
> +static unsigned long
> +kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> + int cpu, freed = 0;
> + unsigned long flags;
> +
> + for_each_online_cpu(cpu) {
> + int count;
> + struct kfree_rcu_cpu *krcp = per_cpu_ptr(, cpu);
> +
> + count = krcp->count;

inside the lock held


> + spin_lock_irqsave(>lock, flags);
> + if (krcp->monitor_todo)
> + kfree_rcu_drain_unlock(krcp, flags);
> + else
> + spin_unlock_irqrestore(>lock, flags);
> +
> + sc->nr_to_scan -= count;
> + freed += count;
> +
> + if (sc->nr_to_scan <= 0)
> + break;
> + }
> +
> + return freed;
> +}
> +
> +static struct shrinker kfree_rcu_shrinker = {
> + .count_objects = kfree_rcu_shrink_count,
> + .scan_objects = kfree_rcu_shrink_scan,
> + .batch = 0,
> + .seeks = DEFAULT_SEEKS,
> +};
> +
>  void __init kfree_rcu_scheduler_running(void)
>  {
>   int cpu;
> @@ -4007,6 +4065,8 @@ static void __init kfree_rcu_batch_init(void)
>   INIT_DELAYED_WORK(>monitor_work, kfree_rcu_monitor);
>   krcp->initialized = true;
>   }
> + if (register_shrinker(_rcu_shrinker))
> + pr_err("Failed to register 

[PATCH] debugfs: Add mount restriction option

2020-05-28 Thread Peter Enderborg
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. It is needed new
kernel command line parameter to be activated.

Signed-off-by: Peter Enderborg 
---
 fs/debugfs/inode.c | 17 -
 lib/Kconfig.debug  | 10 ++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index b7f2e971ecbc..bde37dab77e0 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -786,10 +786,25 @@ bool debugfs_initialized(void)
 }
 EXPORT_SYMBOL_GPL(debugfs_initialized);
 
+static int allow_debugfs;
+
+static int __init debugfs_kernel(char *str)
+{
+   if (str && !strcmp(str, "true"))
+   allow_debugfs = true;
+
+   return 0;
+
+}
+early_param("debugfs", debugfs_kernel);
+
 static int __init debugfs_init(void)
 {
int retval;
-
+#ifdef CONFIG_DEBUG_FS_MOUNT_RESTRICTED
+   if (!allow_debugfs)
+   return -EPERM;
+#endif
retval = sysfs_create_mount_point(kernel_kobj, "debug");
if (retval)
return retval;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 21d9c5f6e7ec..d3a3338740d2 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -443,6 +443,16 @@ config DEBUG_FS
 
  If unsure, say N.
 
+config DEBUG_FS_MOUNT_RESTRICTED
+   bool "Debug Filesystem mount restricted"
+   depends on DEBUG_FS
+   help
+ This is an additional restriction for mounting debugfs. It allows
+ the kernel to have debugfs compiled, but requires that kernel command
+ line has a debugfs parameter to register as a filesystem.
+
+ If unsure, say N.
+
 source "lib/Kconfig.kgdb"
 
 source "lib/Kconfig.ubsan"
-- 
2.26.2



Re: [PATCH] mm, compaction: Indicate when compaction is manually triggered by sysctl

2020-05-18 Thread peter enderborg
On 5/11/20 1:26 PM, Guilherme Piccoli wrote:
> On Sun, May 10, 2020 at 10:25 PM David Rientjes  wrote:
>> [...]
>> The kernel log is not preferred for this (or drop_caches, really) because
>> the amount of info can causing important information to be lost.  We don't
>> really gain anything by printing that someone manually triggered
>> compaction; they could just write to the kernel log themselves if they
>> really wanted to.  The reverse is not true: we can't suppress your kernel
>> message with this patch.
>>
>> Instead, a statsfs-like approach could be used to indicate when this has
>> happened and there is no chance of losing events because it got scrolled
>> off the kernel log.  It has the added benefit of not requiring the entire
>> log to be parsed for such events.
> OK, agreed! Let's forget the kernel log. So, do you think the way to
> go is the statsfs, not a zoneinfo stat, a per-node thing? I'm saying
> that because kernel mm subsystem statistics seem pretty.."comfortable"
> the way they are, in files like vmstat, zoneinfo, etc. Let me know
> your thoughts on this, if I could work on that or should wait statsfs.
> Thanks,
>
>
> Guilherme

I think a trace notification in compaction like kcompad_wake would be good.



Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

2020-04-29 Thread peter enderborg
On 4/28/20 9:43 AM, Michal Hocko wrote:
> On Mon 27-04-20 16:35:58, Andrew Morton wrote:
> [...]
>> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
>> it.
>> Subsystems such as networking will consume a certain amount and
>> will then start recycling it.  The total amount in-flight will vary
>> over the longer term as workloads change.  A dynamically tuning
>> threshold system will need to adapt rapidly enough to sudden load
>> shifts, which might require unreasonable amounts of headroom.
> I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the
> reserves under memory pressure. Then allocatios start failing very
> quickly and users have to cope with that, usually by deferring to a
> sleepable context. Tuning reserves dynamically for heavy reserves
> consumers would be possible but I am worried that this is far from
> trivial.
>
> We definitely need to understand what is going on here.  Why doesn't
> kswapd + N*direct reclaimers do not provide enough memory to satisfy
> both N threads + reserves consumers? How many times those direct
> reclaimers have to retry?

Was this not supposed to be avoided with PSI, user-space should
a fair change to take actions before it goes bad in user-space?


> We used to have the allocation stall warning as David mentioned in the
> patch description and I have seen it triggering without heavy reserves
> consumers (aka reported free pages corresponded to the min watermark).
> The underlying problem was usually kswapd being stuck on some FS locks,
> direct reclaimers stuck in shrinkers or way too overloaded system with
> dozens if not hundreds of processes stuck in the page allocator each
> racing with the reclaim and betting on luck. The last problem was the
> most annoying because it is really hard to tune for.




Re: [PATCH] lib/genalloc.c: include vmalloc.h

2019-01-10 Thread peter enderborg
On 1/7/19 11:56 PM, Andrew Morton wrote:
> On Sat, 5 Jan 2019 13:35:33 -0800 Linus Torvalds 
>  wrote:
>
>> On Sat, Jan 5, 2019 at 1:21 PM Olof Johansson  wrote:
>>> Fixes build break on most ARM/ARM64 defconfigs:
>> Interesting.
>>
>> Andrew, I thought ARM was one of the platforms that your tree compiled
>> against? Or was some other change just hiding this?
>>
> I've become a bit lazy with the cross-compiling because linux-next's
> build coverage is pretty broad.  But Stephen's holiday interrupted
> things and this one fell in that window.
>
Is it now OK to use vmalloc in fast path's on all platforms?



Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4

2018-09-17 Thread peter enderborg
Will it be part of the backport to 4.9 google android or is it for test only?
I guess that this patch is to big for the LTS tree.

On 09/07/2018 05:58 PM, Suren Baghdasaryan wrote:
> Thanks for the new patchset! Backported to 4.9 and retested on ARMv8 8
> code system running Android. Signals behave as expected reacting to
> memory pressure, no jumps in "total" counters that would indicate an
> overflow/underflow issues. Nicely done!
>
> Tested-by: Suren Baghdasaryan 
>
> On Fri, Sep 7, 2018 at 8:09 AM, Johannes Weiner  wrote:
>> On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote:
>>> So yeah, grudingly acked. Did you want me to pick this up through the
>>> scheduler tree since most of this lives there?
>> Thanks for the ack.
>>
>> As for routing it, I'll leave that decision to you and Andrew. It
>> touches stuff all over, so it could result in quite a few conflicts
>> between trees (although I don't expect any of them to be non-trivial).




Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4

2018-09-17 Thread peter enderborg
Will it be part of the backport to 4.9 google android or is it for test only?
I guess that this patch is to big for the LTS tree.

On 09/07/2018 05:58 PM, Suren Baghdasaryan wrote:
> Thanks for the new patchset! Backported to 4.9 and retested on ARMv8 8
> code system running Android. Signals behave as expected reacting to
> memory pressure, no jumps in "total" counters that would indicate an
> overflow/underflow issues. Nicely done!
>
> Tested-by: Suren Baghdasaryan 
>
> On Fri, Sep 7, 2018 at 8:09 AM, Johannes Weiner  wrote:
>> On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote:
>>> So yeah, grudingly acked. Did you want me to pick this up through the
>>> scheduler tree since most of this lives there?
>> Thanks for the ack.
>>
>> As for routing it, I'll leave that decision to you and Andrew. It
>> touches stuff all over, so it could result in quite a few conflicts
>> between trees (although I don't expect any of them to be non-trivial).




Re: [PATCH RFC] mm: don't raise MEMCG_OOM event due to failed high-order allocation

2018-09-11 Thread peter enderborg
On 09/11/2018 02:11 PM, Michal Hocko wrote:
> Why is this a problem though? IIRC this event was deliberately placed
> outside of the oom path because we wanted to count allocation failures
> and this is also documented that way
>
>   oom
> The number of time the cgroup's memory usage was
> reached the limit and allocation was about to fail.
>
> Depending on context result could be invocation of OOM
> killer and retrying allocation or failing a
>
> One could argue that we do not apply the same logic to GFP_NOWAIT
> requests but in general I would like to see a good reason to change
> the behavior and if it is really the right thing to do then we need to
> update the documentation as well.
>

Why not introduce a MEMCG_ALLOC_FAIL in to memcg_memory_event?


Re: [PATCH RFC] mm: don't raise MEMCG_OOM event due to failed high-order allocation

2018-09-11 Thread peter enderborg
On 09/11/2018 02:11 PM, Michal Hocko wrote:
> Why is this a problem though? IIRC this event was deliberately placed
> outside of the oom path because we wanted to count allocation failures
> and this is also documented that way
>
>   oom
> The number of time the cgroup's memory usage was
> reached the limit and allocation was about to fail.
>
> Depending on context result could be invocation of OOM
> killer and retrying allocation or failing a
>
> One could argue that we do not apply the same logic to GFP_NOWAIT
> requests but in general I would like to see a good reason to change
> the behavior and if it is really the right thing to do then we need to
> update the documentation as well.
>

Why not introduce a MEMCG_ALLOC_FAIL in to memcg_memory_event?


Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v3

2018-08-07 Thread peter enderborg
On 08/01/2018 05:19 PM, Johannes Weiner wrote:
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> aggregate task stalls at the cgroup level instead of system-wide.
>
Usually there are objections to add more stuff to /proc. Is this an exception?


Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v3

2018-08-07 Thread peter enderborg
On 08/01/2018 05:19 PM, Johannes Weiner wrote:
>
> A kernel with CONFIG_PSI=y will create a /proc/pressure directory with
> 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have
> cpu.pressure, memory.pressure and io.pressure files, which simply
> aggregate task stalls at the cgroup level instead of system-wide.
>
Usually there are objections to add more stuff to /proc. Is this an exception?


Re: [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing

2018-08-02 Thread peter enderborg
On 08/01/2018 05:13 PM, Johannes Weiner wrote:
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e34a27727b9a..7af1c3c15d8e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -69,13 +69,14 @@
>   */
>  enum pageflags {
>   PG_locked,  /* Page is locked. Don't touch. */
> - PG_error,
>   PG_referenced,
>   PG_uptodate,
>   PG_dirty,
>   PG_lru,
>   PG_active,
> + PG_workingset,
>   PG_waiters, /* Page has waiters, check its waitqueue. Must 
> be bit #7 and in the same byte as "PG_locked" */
> + PG_error,
>   PG_slab,
>   PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
>   PG_arch_1,
> @@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, 
> PF_HEAD)
Any reason why the PG_error was moved? And dont you need to do some handling of 
this flag in proc/fs/page.c ?
Some KFP_WORKINGSET ?




Re: [PATCH 2/9] mm: workingset: tell cache transitions from workingset thrashing

2018-08-02 Thread peter enderborg
On 08/01/2018 05:13 PM, Johannes Weiner wrote:
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e34a27727b9a..7af1c3c15d8e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -69,13 +69,14 @@
>   */
>  enum pageflags {
>   PG_locked,  /* Page is locked. Don't touch. */
> - PG_error,
>   PG_referenced,
>   PG_uptodate,
>   PG_dirty,
>   PG_lru,
>   PG_active,
> + PG_workingset,
>   PG_waiters, /* Page has waiters, check its waitqueue. Must 
> be bit #7 and in the same byte as "PG_locked" */
> + PG_error,
>   PG_slab,
>   PG_owner_priv_1,/* Owner use. If pagecache, fs may use*/
>   PG_arch_1,
> @@ -280,6 +281,8 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, 
> PF_HEAD)
Any reason why the PG_error was moved? And dont you need to do some handling of 
this flag in proc/fs/page.c ?
Some KFP_WORKINGSET ?




Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-06-25 Thread peter enderborg
On 06/25/2018 03:07 PM, Michal Hocko wrote:

> On Mon 25-06-18 15:03:40, peter enderborg wrote:
>> On 06/20/2018 01:55 PM, Michal Hocko wrote:
>>> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>>>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>>>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>>>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>>>> with oom_lock held is currently an unavoidable problem.
>>> Could you be more specific about the potential deadlock? Sleeping while
>>> holding oom lock is certainly not nice but I do not see how that would
>>> result in a deadlock assuming that the sleeping context doesn't sleep on
>>> the memory allocation obviously.
>> It is a mutex you are supposed to be able to sleep.  It's even exported.
> What do you mean? oom_lock is certainly not exported for general use. It
> is not local to oom_killer.c just because it is needed in other _mm_
> code.
>  

It  is in the oom.h file include/linux/oom.h, if it that sensitive it should
be in mm/ and a documented note about the special rules. It is only used
in drivers/tty/sysrq.c and that be replaced by a help function in mm that
do the  oom stuff.


>>>> As a preparation for not to sleep with oom_lock held, this patch brings
>>>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>>>> changes explained below.
>>> Can we just eliminate this ugliness and remove it altogether? We do not
>>> have that many notifiers. Is there anything fundamental that would
>>> prevent us from moving them to shrinkers instead?
>> @Hocko Do you remember the lowmemorykiller from android? Some things
>> might not be the right thing for shrinkers.
> Just that lmk did it wrong doesn't mean others have to follow.
>
If all you have is a hammer, everything looks like a nail. (I don’t argument 
that it was right)
But if you don’t have a way to interact with the memory system we will get 
attempts like lmk. 
Oom notifiers and vmpressure is for this task better than shrinkers.




Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-06-25 Thread peter enderborg
On 06/25/2018 03:07 PM, Michal Hocko wrote:

> On Mon 25-06-18 15:03:40, peter enderborg wrote:
>> On 06/20/2018 01:55 PM, Michal Hocko wrote:
>>> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>>>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>>>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>>>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>>>> with oom_lock held is currently an unavoidable problem.
>>> Could you be more specific about the potential deadlock? Sleeping while
>>> holding oom lock is certainly not nice but I do not see how that would
>>> result in a deadlock assuming that the sleeping context doesn't sleep on
>>> the memory allocation obviously.
>> It is a mutex you are supposed to be able to sleep.  It's even exported.
> What do you mean? oom_lock is certainly not exported for general use. It
> is not local to oom_killer.c just because it is needed in other _mm_
> code.
>  

It  is in the oom.h file include/linux/oom.h, if it that sensitive it should
be in mm/ and a documented note about the special rules. It is only used
in drivers/tty/sysrq.c and that be replaced by a help function in mm that
do the  oom stuff.


>>>> As a preparation for not to sleep with oom_lock held, this patch brings
>>>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>>>> changes explained below.
>>> Can we just eliminate this ugliness and remove it altogether? We do not
>>> have that many notifiers. Is there anything fundamental that would
>>> prevent us from moving them to shrinkers instead?
>> @Hocko Do you remember the lowmemorykiller from android? Some things
>> might not be the right thing for shrinkers.
> Just that lmk did it wrong doesn't mean others have to follow.
>
If all you have is a hammer, everything looks like a nail. (I don’t argument 
that it was right)
But if you don’t have a way to interact with the memory system we will get 
attempts like lmk. 
Oom notifiers and vmpressure is for this task better than shrinkers.




Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-06-25 Thread peter enderborg
On 06/25/2018 03:07 PM, Michal Hocko wrote:
> On Mon 25-06-18 15:03:40, peter enderborg wrote:
>> On 06/20/2018 01:55 PM, Michal Hocko wrote:
>>> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>>>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>>>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>>>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>>>> with oom_lock held is currently an unavoidable problem.
>>> Could you be more specific about the potential deadlock? Sleeping while
>>> holding oom lock is certainly not nice but I do not see how that would
>>> result in a deadlock assuming that the sleeping context doesn't sleep on
>>> the memory allocation obviously.
>> It is a mutex you are supposed to be able to sleep.  It's even exported.
> What do you mean? oom_lock is certainly not exported for general use. It
> is not local to oom_killer.c just because it is needed in other _mm_
> code.
>  

It  is in the oom.h file include/linux/oom.h, if it that sensitive it should
be in mm/ and a documented note about the special rules. It is only used
in drivers/tty/sysrq.c and that be replaced by a help function in mm that
do the  oom stuff.


>>>> As a preparation for not to sleep with oom_lock held, this patch brings
>>>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>>>> changes explained below.
>>> Can we just eliminate this ugliness and remove it altogether? We do not
>>> have that many notifiers. Is there anything fundamental that would
>>> prevent us from moving them to shrinkers instead?
>>
>> @Hocko Do you remember the lowmemorykiller from android? Some things
>> might not be the right thing for shrinkers.
> Just that lmk did it wrong doesn't mean others have to follow.
>
If all you have is a hammer, everything looks like a nail. (I don’t argument 
that it was right)
But if you don’t have a way to interact with the memory system we will get 
attempts like lmk. 
Oom notifiers and vmpressure is for this task better than shrinkers.



Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-06-25 Thread peter enderborg
On 06/25/2018 03:07 PM, Michal Hocko wrote:
> On Mon 25-06-18 15:03:40, peter enderborg wrote:
>> On 06/20/2018 01:55 PM, Michal Hocko wrote:
>>> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>>>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>>>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>>>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>>>> with oom_lock held is currently an unavoidable problem.
>>> Could you be more specific about the potential deadlock? Sleeping while
>>> holding oom lock is certainly not nice but I do not see how that would
>>> result in a deadlock assuming that the sleeping context doesn't sleep on
>>> the memory allocation obviously.
>> It is a mutex you are supposed to be able to sleep.  It's even exported.
> What do you mean? oom_lock is certainly not exported for general use. It
> is not local to oom_killer.c just because it is needed in other _mm_
> code.
>  

It  is in the oom.h file include/linux/oom.h, if it that sensitive it should
be in mm/ and a documented note about the special rules. It is only used
in drivers/tty/sysrq.c and that be replaced by a help function in mm that
do the  oom stuff.


>>>> As a preparation for not to sleep with oom_lock held, this patch brings
>>>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>>>> changes explained below.
>>> Can we just eliminate this ugliness and remove it altogether? We do not
>>> have that many notifiers. Is there anything fundamental that would
>>> prevent us from moving them to shrinkers instead?
>>
>> @Hocko Do you remember the lowmemorykiller from android? Some things
>> might not be the right thing for shrinkers.
> Just that lmk did it wrong doesn't mean others have to follow.
>
If all you have is a hammer, everything looks like a nail. (I don’t argument 
that it was right)
But if you don’t have a way to interact with the memory system we will get 
attempts like lmk. 
Oom notifiers and vmpressure is for this task better than shrinkers.



Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-06-25 Thread peter enderborg
On 06/20/2018 01:55 PM, Michal Hocko wrote:
> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>> with oom_lock held is currently an unavoidable problem.
> Could you be more specific about the potential deadlock? Sleeping while
> holding oom lock is certainly not nice but I do not see how that would
> result in a deadlock assuming that the sleeping context doesn't sleep on
> the memory allocation obviously.

It is a mutex you are supposed to be able to sleep.  It's even exported.

>> As a preparation for not to sleep with oom_lock held, this patch brings
>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>> changes explained below.
> Can we just eliminate this ugliness and remove it altogether? We do not
> have that many notifiers. Is there anything fundamental that would
> prevent us from moving them to shrinkers instead?


@Hocko Do you remember the lowmemorykiller from android? Some things might not 
be the right thing for shrinkers.



Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-06-25 Thread peter enderborg
On 06/20/2018 01:55 PM, Michal Hocko wrote:
> On Wed 20-06-18 20:20:38, Tetsuo Handa wrote:
>> Sleeping with oom_lock held can cause AB-BA lockup bug because
>> __alloc_pages_may_oom() does not wait for oom_lock. Since
>> blocking_notifier_call_chain() in out_of_memory() might sleep, sleeping
>> with oom_lock held is currently an unavoidable problem.
> Could you be more specific about the potential deadlock? Sleeping while
> holding oom lock is certainly not nice but I do not see how that would
> result in a deadlock assuming that the sleeping context doesn't sleep on
> the memory allocation obviously.

It is a mutex you are supposed to be able to sleep.  It's even exported.

>> As a preparation for not to sleep with oom_lock held, this patch brings
>> OOM notifier callbacks to outside of OOM killer, with two small behavior
>> changes explained below.
> Can we just eliminate this ugliness and remove it altogether? We do not
> have that many notifiers. Is there anything fundamental that would
> prevent us from moving them to shrinkers instead?


@Hocko Do you remember the lowmemorykiller from android? Some things might not 
be the right thing for shrinkers.



[PATCH V3 1/5 selinux-next] selinux: Make allocation atomic in policydb objects functions.

2018-05-30 Thread Peter Enderborg
From: peter 

As preparation for RCU the allocation need to be atomic,
there is a lot of them so they do in this patch.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/avtab.c   |   8 +--
 security/selinux/ss/conditional.c |  14 ++---
 security/selinux/ss/ebitmap.c |   3 +-
 security/selinux/ss/hashtab.c |   6 +--
 security/selinux/ss/policydb.c| 104 +++---
 5 files changed, 69 insertions(+), 66 deletions(-)

diff --git a/security/selinux/ss/avtab.c b/security/selinux/ss/avtab.c
index a2c9148b0662..1114a308aa94 100644
--- a/security/selinux/ss/avtab.c
+++ b/security/selinux/ss/avtab.c
@@ -72,13 +72,13 @@ avtab_insert_node(struct avtab *h, int hvalue,
 {
struct avtab_node *newnode;
struct avtab_extended_perms *xperms;
-   newnode = kmem_cache_zalloc(avtab_node_cachep, GFP_KERNEL);
+   newnode = kmem_cache_zalloc(avtab_node_cachep, GFP_ATOMIC);
if (newnode == NULL)
return NULL;
newnode->key = *key;
 
if (key->specified & AVTAB_XPERMS) {
-   xperms = kmem_cache_zalloc(avtab_xperms_cachep, GFP_KERNEL);
+   xperms = kmem_cache_zalloc(avtab_xperms_cachep, GFP_ATOMIC);
if (xperms == NULL) {
kmem_cache_free(avtab_node_cachep, newnode);
return NULL;
@@ -95,7 +95,7 @@ avtab_insert_node(struct avtab *h, int hvalue,
} else {
newnode->next = flex_array_get_ptr(h->htable, hvalue);
if (flex_array_put_ptr(h->htable, hvalue, newnode,
-  GFP_KERNEL|__GFP_ZERO)) {
+  GFP_ATOMIC|__GFP_ZERO)) {
kmem_cache_free(avtab_node_cachep, newnode);
return NULL;
}
@@ -330,7 +330,7 @@ int avtab_alloc(struct avtab *h, u32 nrules)
mask = nslot - 1;
 
h->htable = flex_array_alloc(sizeof(struct avtab_node *), nslot,
-GFP_KERNEL | __GFP_ZERO);
+GFP_ATOMIC | __GFP_ZERO);
if (!h->htable)
return -ENOMEM;
 
diff --git a/security/selinux/ss/conditional.c 
b/security/selinux/ss/conditional.c
index c91543a617ac..a09c8a8e9472 100644
--- a/security/selinux/ss/conditional.c
+++ b/security/selinux/ss/conditional.c
@@ -178,7 +178,7 @@ int cond_init_bool_indexes(struct policydb *p)
kfree(p->bool_val_to_struct);
p->bool_val_to_struct = kmalloc_array(p->p_bools.nprim,
  sizeof(*p->bool_val_to_struct),
- GFP_KERNEL);
+ GFP_ATOMIC);
if (!p->bool_val_to_struct)
return -ENOMEM;
return 0;
@@ -205,7 +205,7 @@ int cond_index_bool(void *key, void *datum, void *datap)
 
fa = p->sym_val_to_name[SYM_BOOLS];
if (flex_array_put_ptr(fa, booldatum->value - 1, key,
-  GFP_KERNEL | __GFP_ZERO))
+  GFP_ATOMIC | __GFP_ZERO))
BUG();
p->bool_val_to_struct[booldatum->value - 1] = booldatum;
 
@@ -227,7 +227,7 @@ int cond_read_bool(struct policydb *p, struct hashtab *h, 
void *fp)
u32 len;
int rc;
 
-   booldatum = kzalloc(sizeof(*booldatum), GFP_KERNEL);
+   booldatum = kzalloc(sizeof(*booldatum), GFP_ATOMIC);
if (!booldatum)
return -ENOMEM;
 
@@ -247,7 +247,7 @@ int cond_read_bool(struct policydb *p, struct hashtab *h, 
void *fp)
goto err;
 
rc = -ENOMEM;
-   key = kmalloc(len + 1, GFP_KERNEL);
+   key = kmalloc(len + 1, GFP_ATOMIC);
if (!key)
goto err;
rc = next_entry(key, fp, len);
@@ -332,7 +332,7 @@ static int cond_insertf(struct avtab *a, struct avtab_key 
*k, struct avtab_datum
goto err;
}
 
-   list = kzalloc(sizeof(*list), GFP_KERNEL);
+   list = kzalloc(sizeof(*list), GFP_ATOMIC);
if (!list) {
rc = -ENOMEM;
goto err;
@@ -420,7 +420,7 @@ static int cond_read_node(struct policydb *p, struct 
cond_node *node, void *fp)
goto err;
 
rc = -ENOMEM;
-   expr = kzalloc(sizeof(*expr), GFP_KERNEL);
+   expr = kzalloc(sizeof(*expr), GFP_ATOMIC);
if (!expr)
goto err;
 
@@ -471,7 +471,7 @@ int cond_read_list(struct policydb *p, void *fp)
 
for (i = 0; i < len; i++) {
rc = -ENOMEM;
-   node = kzalloc(sizeof(*node), GFP_KERNEL);
+   node = kzalloc(sizeof(*node), GFP_ATOMIC);
if (!node)
goto err;
 
diff --git a/security/selinux/ss/ebitmap.c b/security/selinux/ss/ebitmap.c
index 5ae8c61b75bf..a49fab

[PATCH V3 3/5 selinux-next] selinux: sidtab_clone switch to use rwlock.

2018-05-30 Thread Peter Enderborg
We need a copy of sidtabs, so change the generic sidtab_clone
as from a function pointer and let it use a read rwlock while
do the clone.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/services.c | 20 +---
 security/selinux/ss/sidtab.c   | 39 ---
 security/selinux/ss/sidtab.h   |  3 ++-
 3 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 4f3ce389084c..2be471d72c85 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -1891,19 +1891,6 @@ int security_change_sid(struct selinux_state *state,
out_sid, false);
 }
 
-/* Clone the SID into the new SID table. */
-static int clone_sid(u32 sid,
-struct context *context,
-void *arg)
-{
-   struct sidtab *s = arg;
-
-   if (sid > SECINITSID_NUM)
-   return sidtab_insert(s, sid, context);
-   else
-   return 0;
-}
-
 static inline int convert_context_handle_invalid_context(
struct selinux_state *state,
struct context *context)
@@ -2199,10 +2186,7 @@ int security_load_policy(struct selinux_state *state, 
void *data, size_t len)
goto err;
}
 
-   /* Clone the SID table. */
-   sidtab_shutdown(old_set->sidtab);
-
-   rc = sidtab_map(old_set->sidtab, clone_sid, next_set->sidtab);
+   rc = sidtab_clone(old_set->sidtab, next_set->sidtab);
if (rc)
goto err;
 
@@ -2926,8 +2910,6 @@ int security_set_bools(struct selinux_state *state, int 
len, int *values)
goto out;
}
 
-   seqno = ++state->ss->latest_granting;
-   state->ss->active_set = next_set;
rc = 0;
 out:
if (!rc) {
diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
index 5be31b7af225..811503cd7c2b 100644
--- a/security/selinux/ss/sidtab.c
+++ b/security/selinux/ss/sidtab.c
@@ -27,7 +27,7 @@ int sidtab_init(struct sidtab *s)
s->nel = 0;
s->next_sid = 1;
s->shutdown = 0;
-   spin_lock_init(>lock);
+   rwlock_init(>lock);
return 0;
 }
 
@@ -116,6 +116,31 @@ struct context *sidtab_search_force(struct sidtab *s, u32 
sid)
return sidtab_search_core(s, sid, 1);
 }
 
+int sidtab_clone(struct sidtab *s, struct sidtab *d)
+{
+   int i, rc = 0;
+   struct sidtab_node *cur;
+
+   if (!s || !d)
+   goto errout;
+
+   read_lock(>lock);
+   for (i = 0; i < SIDTAB_SIZE; i++) {
+   cur = s->htable[i];
+   while (cur) {
+   if (cur->sid > SECINITSID_NUM)
+   rc =  sidtab_insert(d, cur->sid, >context);
+   if (rc)
+   goto out;
+   cur = cur->next;
+   }
+   }
+out:
+   read_unlock(>lock);
+errout:
+   return rc;
+}
+
 int sidtab_map(struct sidtab *s,
   int (*apply) (u32 sid,
 struct context *context,
@@ -202,7 +227,7 @@ int sidtab_context_to_sid(struct sidtab *s,
if (!sid)
sid = sidtab_search_context(s, context);
if (!sid) {
-   spin_lock_irqsave(>lock, flags);
+   write_lock_irqsave(>lock, flags);
/* Rescan now that we hold the lock. */
sid = sidtab_search_context(s, context);
if (sid)
@@ -221,7 +246,7 @@ int sidtab_context_to_sid(struct sidtab *s,
if (ret)
s->next_sid--;
 unlock_out:
-   spin_unlock_irqrestore(>lock, flags);
+   write_unlock_irqrestore(>lock, flags);
}
 
if (ret)
@@ -287,21 +312,21 @@ void sidtab_set(struct sidtab *dst, struct sidtab *src)
unsigned long flags;
int i;
 
-   spin_lock_irqsave(>lock, flags);
+   write_lock_irqsave(>lock, flags);
dst->htable = src->htable;
dst->nel = src->nel;
dst->next_sid = src->next_sid;
dst->shutdown = 0;
for (i = 0; i < SIDTAB_CACHE_LEN; i++)
dst->cache[i] = NULL;
-   spin_unlock_irqrestore(>lock, flags);
+   write_unlock_irqrestore(>lock, flags);
 }
 
 void sidtab_shutdown(struct sidtab *s)
 {
unsigned long flags;
 
-   spin_lock_irqsave(>lock, flags);
+   write_lock_irqsave(>lock, flags);
s->shutdown = 1;
-   spin_unlock_irqrestore(>lock, flags);
+   write_unlock_irqrestore(>lock, flags);
 }
diff --git a/security/selinux/ss/sidtab.h b/security/selinux/ss/sidtab.h
index a1a1d2617b6f..6751f8bcbd66 100644
--- a/security/selinux/ss/sidtab.h
+++ b/security/selinux/ss/sidtab.h
@@ -29,7 +29,7 @@ struct sidtab {
unsigned ch

[PATCH V3 1/5 selinux-next] selinux: Make allocation atomic in policydb objects functions.

2018-05-30 Thread Peter Enderborg
From: peter 

As preparation for RCU the allocation need to be atomic,
there is a lot of them so they do in this patch.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/avtab.c   |   8 +--
 security/selinux/ss/conditional.c |  14 ++---
 security/selinux/ss/ebitmap.c |   3 +-
 security/selinux/ss/hashtab.c |   6 +--
 security/selinux/ss/policydb.c| 104 +++---
 5 files changed, 69 insertions(+), 66 deletions(-)

diff --git a/security/selinux/ss/avtab.c b/security/selinux/ss/avtab.c
index a2c9148b0662..1114a308aa94 100644
--- a/security/selinux/ss/avtab.c
+++ b/security/selinux/ss/avtab.c
@@ -72,13 +72,13 @@ avtab_insert_node(struct avtab *h, int hvalue,
 {
struct avtab_node *newnode;
struct avtab_extended_perms *xperms;
-   newnode = kmem_cache_zalloc(avtab_node_cachep, GFP_KERNEL);
+   newnode = kmem_cache_zalloc(avtab_node_cachep, GFP_ATOMIC);
if (newnode == NULL)
return NULL;
newnode->key = *key;
 
if (key->specified & AVTAB_XPERMS) {
-   xperms = kmem_cache_zalloc(avtab_xperms_cachep, GFP_KERNEL);
+   xperms = kmem_cache_zalloc(avtab_xperms_cachep, GFP_ATOMIC);
if (xperms == NULL) {
kmem_cache_free(avtab_node_cachep, newnode);
return NULL;
@@ -95,7 +95,7 @@ avtab_insert_node(struct avtab *h, int hvalue,
} else {
newnode->next = flex_array_get_ptr(h->htable, hvalue);
if (flex_array_put_ptr(h->htable, hvalue, newnode,
-  GFP_KERNEL|__GFP_ZERO)) {
+  GFP_ATOMIC|__GFP_ZERO)) {
kmem_cache_free(avtab_node_cachep, newnode);
return NULL;
}
@@ -330,7 +330,7 @@ int avtab_alloc(struct avtab *h, u32 nrules)
mask = nslot - 1;
 
h->htable = flex_array_alloc(sizeof(struct avtab_node *), nslot,
-GFP_KERNEL | __GFP_ZERO);
+GFP_ATOMIC | __GFP_ZERO);
if (!h->htable)
return -ENOMEM;
 
diff --git a/security/selinux/ss/conditional.c 
b/security/selinux/ss/conditional.c
index c91543a617ac..a09c8a8e9472 100644
--- a/security/selinux/ss/conditional.c
+++ b/security/selinux/ss/conditional.c
@@ -178,7 +178,7 @@ int cond_init_bool_indexes(struct policydb *p)
kfree(p->bool_val_to_struct);
p->bool_val_to_struct = kmalloc_array(p->p_bools.nprim,
  sizeof(*p->bool_val_to_struct),
- GFP_KERNEL);
+ GFP_ATOMIC);
if (!p->bool_val_to_struct)
return -ENOMEM;
return 0;
@@ -205,7 +205,7 @@ int cond_index_bool(void *key, void *datum, void *datap)
 
fa = p->sym_val_to_name[SYM_BOOLS];
if (flex_array_put_ptr(fa, booldatum->value - 1, key,
-  GFP_KERNEL | __GFP_ZERO))
+  GFP_ATOMIC | __GFP_ZERO))
BUG();
p->bool_val_to_struct[booldatum->value - 1] = booldatum;
 
@@ -227,7 +227,7 @@ int cond_read_bool(struct policydb *p, struct hashtab *h, 
void *fp)
u32 len;
int rc;
 
-   booldatum = kzalloc(sizeof(*booldatum), GFP_KERNEL);
+   booldatum = kzalloc(sizeof(*booldatum), GFP_ATOMIC);
if (!booldatum)
return -ENOMEM;
 
@@ -247,7 +247,7 @@ int cond_read_bool(struct policydb *p, struct hashtab *h, 
void *fp)
goto err;
 
rc = -ENOMEM;
-   key = kmalloc(len + 1, GFP_KERNEL);
+   key = kmalloc(len + 1, GFP_ATOMIC);
if (!key)
goto err;
rc = next_entry(key, fp, len);
@@ -332,7 +332,7 @@ static int cond_insertf(struct avtab *a, struct avtab_key 
*k, struct avtab_datum
goto err;
}
 
-   list = kzalloc(sizeof(*list), GFP_KERNEL);
+   list = kzalloc(sizeof(*list), GFP_ATOMIC);
if (!list) {
rc = -ENOMEM;
goto err;
@@ -420,7 +420,7 @@ static int cond_read_node(struct policydb *p, struct 
cond_node *node, void *fp)
goto err;
 
rc = -ENOMEM;
-   expr = kzalloc(sizeof(*expr), GFP_KERNEL);
+   expr = kzalloc(sizeof(*expr), GFP_ATOMIC);
if (!expr)
goto err;
 
@@ -471,7 +471,7 @@ int cond_read_list(struct policydb *p, void *fp)
 
for (i = 0; i < len; i++) {
rc = -ENOMEM;
-   node = kzalloc(sizeof(*node), GFP_KERNEL);
+   node = kzalloc(sizeof(*node), GFP_ATOMIC);
if (!node)
goto err;
 
diff --git a/security/selinux/ss/ebitmap.c b/security/selinux/ss/ebitmap.c
index 5ae8c61b75bf..a49fab

[PATCH V3 3/5 selinux-next] selinux: sidtab_clone switch to use rwlock.

2018-05-30 Thread Peter Enderborg
We need a copy of sidtabs, so change the generic sidtab_clone
as from a function pointer and let it use a read rwlock while
do the clone.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/services.c | 20 +---
 security/selinux/ss/sidtab.c   | 39 ---
 security/selinux/ss/sidtab.h   |  3 ++-
 3 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 4f3ce389084c..2be471d72c85 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -1891,19 +1891,6 @@ int security_change_sid(struct selinux_state *state,
out_sid, false);
 }
 
-/* Clone the SID into the new SID table. */
-static int clone_sid(u32 sid,
-struct context *context,
-void *arg)
-{
-   struct sidtab *s = arg;
-
-   if (sid > SECINITSID_NUM)
-   return sidtab_insert(s, sid, context);
-   else
-   return 0;
-}
-
 static inline int convert_context_handle_invalid_context(
struct selinux_state *state,
struct context *context)
@@ -2199,10 +2186,7 @@ int security_load_policy(struct selinux_state *state, 
void *data, size_t len)
goto err;
}
 
-   /* Clone the SID table. */
-   sidtab_shutdown(old_set->sidtab);
-
-   rc = sidtab_map(old_set->sidtab, clone_sid, next_set->sidtab);
+   rc = sidtab_clone(old_set->sidtab, next_set->sidtab);
if (rc)
goto err;
 
@@ -2926,8 +2910,6 @@ int security_set_bools(struct selinux_state *state, int 
len, int *values)
goto out;
}
 
-   seqno = ++state->ss->latest_granting;
-   state->ss->active_set = next_set;
rc = 0;
 out:
if (!rc) {
diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
index 5be31b7af225..811503cd7c2b 100644
--- a/security/selinux/ss/sidtab.c
+++ b/security/selinux/ss/sidtab.c
@@ -27,7 +27,7 @@ int sidtab_init(struct sidtab *s)
s->nel = 0;
s->next_sid = 1;
s->shutdown = 0;
-   spin_lock_init(>lock);
+   rwlock_init(>lock);
return 0;
 }
 
@@ -116,6 +116,31 @@ struct context *sidtab_search_force(struct sidtab *s, u32 
sid)
return sidtab_search_core(s, sid, 1);
 }
 
+int sidtab_clone(struct sidtab *s, struct sidtab *d)
+{
+   int i, rc = 0;
+   struct sidtab_node *cur;
+
+   if (!s || !d)
+   goto errout;
+
+   read_lock(>lock);
+   for (i = 0; i < SIDTAB_SIZE; i++) {
+   cur = s->htable[i];
+   while (cur) {
+   if (cur->sid > SECINITSID_NUM)
+   rc =  sidtab_insert(d, cur->sid, >context);
+   if (rc)
+   goto out;
+   cur = cur->next;
+   }
+   }
+out:
+   read_unlock(>lock);
+errout:
+   return rc;
+}
+
 int sidtab_map(struct sidtab *s,
   int (*apply) (u32 sid,
 struct context *context,
@@ -202,7 +227,7 @@ int sidtab_context_to_sid(struct sidtab *s,
if (!sid)
sid = sidtab_search_context(s, context);
if (!sid) {
-   spin_lock_irqsave(>lock, flags);
+   write_lock_irqsave(>lock, flags);
/* Rescan now that we hold the lock. */
sid = sidtab_search_context(s, context);
if (sid)
@@ -221,7 +246,7 @@ int sidtab_context_to_sid(struct sidtab *s,
if (ret)
s->next_sid--;
 unlock_out:
-   spin_unlock_irqrestore(>lock, flags);
+   write_unlock_irqrestore(>lock, flags);
}
 
if (ret)
@@ -287,21 +312,21 @@ void sidtab_set(struct sidtab *dst, struct sidtab *src)
unsigned long flags;
int i;
 
-   spin_lock_irqsave(>lock, flags);
+   write_lock_irqsave(>lock, flags);
dst->htable = src->htable;
dst->nel = src->nel;
dst->next_sid = src->next_sid;
dst->shutdown = 0;
for (i = 0; i < SIDTAB_CACHE_LEN; i++)
dst->cache[i] = NULL;
-   spin_unlock_irqrestore(>lock, flags);
+   write_unlock_irqrestore(>lock, flags);
 }
 
 void sidtab_shutdown(struct sidtab *s)
 {
unsigned long flags;
 
-   spin_lock_irqsave(>lock, flags);
+   write_lock_irqsave(>lock, flags);
s->shutdown = 1;
-   spin_unlock_irqrestore(>lock, flags);
+   write_unlock_irqrestore(>lock, flags);
 }
diff --git a/security/selinux/ss/sidtab.h b/security/selinux/ss/sidtab.h
index a1a1d2617b6f..6751f8bcbd66 100644
--- a/security/selinux/ss/sidtab.h
+++ b/security/selinux/ss/sidtab.h
@@ -29,7 +29,7 @@ struct sidtab {
unsigned ch

Re: [PATCH] mm:Add watermark slope for high mark

2017-11-24 Thread peter enderborg
On 11/24/2017 11:14 AM, Michal Hocko wrote:
> On Fri 24-11-17 11:07:07, Peter Enderborg wrote:
>> When tuning the watermark_scale_factor to reduce stalls and compactions
>> the high mark is also changed, it changed a bit too much. So this
>> patch introduces a slope that can reduce this overhead a bit, or
>> increase it if needed.
> This doesn't explain what is the problem, why it is a problem and why we
> need yet another tuning to address it. Users shouldn't really care about
> internal stuff like watermark tuning for each watermark independently.
> This looks like a gross hack. Please start over with the problem
> description and then we can move on to an approapriate fix. Piling up
> tuning knobs to workaround problems is simply not acceptable.
>  

In the original patch - https://lkml.org/lkml/2016/2/18/498 - had a

discussion about small systems with 8GB RAM. In the handheld world, that's
a lot of RAM. However, the magic number 2 used in the present algorithm
is out of the blue. Compaction problems are the same for both small and
big. So small devices also need to increase watermark to
get compaction to work and reduce direct reclaims. Changing the low watermark
makes direct reclaim rate drop a lot. But it will cause kswap to work more,
and that has a negative impact. Lowering the gap will smooth out the kswap
workload to suite embedded devices a lot better. This can be addressed by
reducing the high watermark using the slope patch herein. Im sort of understand
your opinion on user knobs, but hard-coded magic numbers are even worse.



Re: [PATCH] mm:Add watermark slope for high mark

2017-11-24 Thread peter enderborg
On 11/24/2017 11:14 AM, Michal Hocko wrote:
> On Fri 24-11-17 11:07:07, Peter Enderborg wrote:
>> When tuning the watermark_scale_factor to reduce stalls and compactions
>> the high mark is also changed, it changed a bit too much. So this
>> patch introduces a slope that can reduce this overhead a bit, or
>> increase it if needed.
> This doesn't explain what is the problem, why it is a problem and why we
> need yet another tuning to address it. Users shouldn't really care about
> internal stuff like watermark tuning for each watermark independently.
> This looks like a gross hack. Please start over with the problem
> description and then we can move on to an approapriate fix. Piling up
> tuning knobs to workaround problems is simply not acceptable.
>  

In the original patch - https://lkml.org/lkml/2016/2/18/498 - had a

discussion about small systems with 8GB RAM. In the handheld world, that's
a lot of RAM. However, the magic number 2 used in the present algorithm
is out of the blue. Compaction problems are the same for both small and
big. So small devices also need to increase watermark to
get compaction to work and reduce direct reclaims. Changing the low watermark
makes direct reclaim rate drop a lot. But it will cause kswap to work more,
and that has a negative impact. Lowering the gap will smooth out the kswap
workload to suite embedded devices a lot better. This can be addressed by
reducing the high watermark using the slope patch herein. Im sort of understand
your opinion on user knobs, but hard-coded magic numbers are even worse.



Re: [PATCH] mm:Add watermark slope for high mark

2017-11-24 Thread peter enderborg
On 11/24/2017 11:15 AM, Vlastimil Babka wrote:
> Agreed. Also if you send a patch adding userspace API or a tuning knob,
> please CC linux-api mailing list (did that for this reply).
>
The cc-list is generated by get_maintainer.pl script.



Re: [PATCH] mm:Add watermark slope for high mark

2017-11-24 Thread peter enderborg
On 11/24/2017 11:15 AM, Vlastimil Babka wrote:
> Agreed. Also if you send a patch adding userspace API or a tuning knob,
> please CC linux-api mailing list (did that for this reply).
>
The cc-list is generated by get_maintainer.pl script.



[PATCH] mm:Add watermark slope for high mark

2017-11-24 Thread Peter Enderborg
When tuning the watermark_scale_factor to reduce stalls and compactions
the high mark is also changed, it changed a bit too much. So this
patch introduces a slope that can reduce this overhead a bit, or
increase it if needed.

Signed-off-by: Peter Enderborg <peter.enderb...@sony.com>
---
 Documentation/sysctl/vm.txt | 15 +++
 include/linux/mm.h  |  1 +
 include/linux/mmzone.h  |  2 ++
 kernel/sysctl.c |  9 +
 mm/page_alloc.c |  6 +-
 5 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index eda628c..aecff6c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
 - user_reserve_kbytes
 - vfs_cache_pressure
 - watermark_scale_factor
+- watermark_high_factor_slope
 - zone_reclaim_mode
 
 ==
@@ -857,6 +858,20 @@ that the number of free pages kswapd maintains for latency 
reasons is
 too small for the allocation bursts occurring in the system. This knob
 can then be used to tune kswapd aggressiveness accordingly.
 
+=
+
+watermark_high_factor_slope:
+
+This factor is high mark for watermark_scale_factor.
+The unit is in percent.
+Max value is 1000 and min value is 100. (High watermark is the same as
+low water mark) Low watermark is min_wmark_pages + watermark_scale_factor.
+and high watermark is
+min_wmark_pages+(watermark_scale_factor * watermark_high_factor_slope).
+
+The default value is 200.
+
+
 ==
 
 zone_reclaim_mode:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7661156..c89536b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2094,6 +2094,7 @@ extern void zone_pcp_reset(struct zone *zone);
 /* page_alloc.c */
 extern int min_free_kbytes;
 extern int watermark_scale_factor;
+extern int watermark_high_factor_slope;
 
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..91bf842 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -886,6 +886,8 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+//int watermark_high_factor_tilt_sysctl_handler(struct ctl_table *, int,
+// void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2fb4e27..83c48c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1444,6 +1444,15 @@ static struct ctl_table vm_table[] = {
.extra2 = _thousand,
},
{
+   .procname   = "watermark_high_factor_slope",
+   .data   = _high_factor_slope,
+   .maxlen = sizeof(watermark_high_factor_slope),
+   .mode   = 0644,
+   .proc_handler   = watermark_scale_factor_sysctl_handler,
+   .extra1 = _hundred,
+   .extra2 = _thousand,
+   },
+   {
.procname   = "percpu_pagelist_fraction",
.data   = _pagelist_fraction,
.maxlen = sizeof(percpu_pagelist_fraction),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48b5b01..3dc50ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_scale_factor = 10;
+int watermark_high_factor_slope = 200;
 
 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned long __meminitdata nr_all_pages;
@@ -6989,6 +6990,7 @@ static void __setup_per_zone_wmarks(void)
 
for_each_zone(zone) {
u64 tmp;
+   u64 tmp_high;
 
spin_lock_irqsave(>lock, flags);
tmp = (u64)pages_min * zone->managed_pages;
@@ -7026,7 +7028,9 @@ static void __setup_per_zone_wmarks(void)
  watermark_scale_factor, 1));
 
zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
-   zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+   tmp_high = mult_frac(tmp, watermark_high_factor_slope, 100);
+   zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp_high;
+
 
spin_unlock_irqrestore(>lock, flags);
}
-- 
2.7.4



[PATCH] mm:Add watermark slope for high mark

2017-11-24 Thread Peter Enderborg
When tuning the watermark_scale_factor to reduce stalls and compactions
the high mark is also changed, it changed a bit too much. So this
patch introduces a slope that can reduce this overhead a bit, or
increase it if needed.

Signed-off-by: Peter Enderborg 
---
 Documentation/sysctl/vm.txt | 15 +++
 include/linux/mm.h  |  1 +
 include/linux/mmzone.h  |  2 ++
 kernel/sysctl.c |  9 +
 mm/page_alloc.c |  6 +-
 5 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index eda628c..aecff6c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
 - user_reserve_kbytes
 - vfs_cache_pressure
 - watermark_scale_factor
+- watermark_high_factor_slope
 - zone_reclaim_mode
 
 ==
@@ -857,6 +858,20 @@ that the number of free pages kswapd maintains for latency 
reasons is
 too small for the allocation bursts occurring in the system. This knob
 can then be used to tune kswapd aggressiveness accordingly.
 
+=
+
+watermark_high_factor_slope:
+
+This factor is high mark for watermark_scale_factor.
+The unit is in percent.
+Max value is 1000 and min value is 100. (High watermark is the same as
+low water mark) Low watermark is min_wmark_pages + watermark_scale_factor.
+and high watermark is
+min_wmark_pages+(watermark_scale_factor * watermark_high_factor_slope).
+
+The default value is 200.
+
+
 ==
 
 zone_reclaim_mode:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7661156..c89536b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2094,6 +2094,7 @@ extern void zone_pcp_reset(struct zone *zone);
 /* page_alloc.c */
 extern int min_free_kbytes;
 extern int watermark_scale_factor;
+extern int watermark_high_factor_slope;
 
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c..91bf842 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -886,6 +886,8 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+//int watermark_high_factor_tilt_sysctl_handler(struct ctl_table *, int,
+// void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2fb4e27..83c48c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1444,6 +1444,15 @@ static struct ctl_table vm_table[] = {
.extra2 = _thousand,
},
{
+   .procname   = "watermark_high_factor_slope",
+   .data   = _high_factor_slope,
+   .maxlen = sizeof(watermark_high_factor_slope),
+   .mode   = 0644,
+   .proc_handler   = watermark_scale_factor_sysctl_handler,
+   .extra1 = _hundred,
+   .extra2 = _thousand,
+   },
+   {
.procname   = "percpu_pagelist_fraction",
.data   = _pagelist_fraction,
.maxlen = sizeof(percpu_pagelist_fraction),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48b5b01..3dc50ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_scale_factor = 10;
+int watermark_high_factor_slope = 200;
 
 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned long __meminitdata nr_all_pages;
@@ -6989,6 +6990,7 @@ static void __setup_per_zone_wmarks(void)
 
for_each_zone(zone) {
u64 tmp;
+   u64 tmp_high;
 
spin_lock_irqsave(>lock, flags);
tmp = (u64)pages_min * zone->managed_pages;
@@ -7026,7 +7028,9 @@ static void __setup_per_zone_wmarks(void)
  watermark_scale_factor, 1));
 
zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
-   zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+   tmp_high = mult_frac(tmp, watermark_high_factor_slope, 100);
+   zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp_high;
+
 
spin_unlock_irqrestore(>lock, flags);
}
-- 
2.7.4



Re: [PATCH] Add slowpath enter/exit trace events

2017-11-24 Thread peter enderborg
On 11/23/2017 03:01 PM, Michal Hocko wrote:
> I am not sure adding a probe on a production system will fly in many
> cases. A static tracepoint would be much easier in that case. But I
> agree there are other means to accomplish the same thing. My main point
> was to have an easy out-of-the-box way to check latencies. But that is
> not something I would really insist on.
>
In android tracefs (or part of it) is the way for the system to control to what 
developers can access to the linux system on commercial devices.  So it is very 
much used on production systems, it is even  a requirement from google to be 
certified as android.  Things like dmesg is not.  However, this probe is at the 
moment not in that scope. 

My point is that you need to condense the information as much as possible but 
still be useful before making the effort to copy it to userspace.  And  for 
this the trace-event are very useful for small systems since the cost is very 
low for events where no one is listening.



Re: [PATCH] Add slowpath enter/exit trace events

2017-11-24 Thread peter enderborg
On 11/23/2017 03:01 PM, Michal Hocko wrote:
> I am not sure adding a probe on a production system will fly in many
> cases. A static tracepoint would be much easier in that case. But I
> agree there are other means to accomplish the same thing. My main point
> was to have an easy out-of-the-box way to check latencies. But that is
> not something I would really insist on.
>
In android tracefs (or part of it) is the way for the system to control to what 
developers can access to the linux system on commercial devices.  So it is very 
much used on production systems, it is even  a requirement from google to be 
certified as android.  Things like dmesg is not.  However, this probe is at the 
moment not in that scope. 

My point is that you need to condense the information as much as possible but 
still be useful before making the effort to copy it to userspace.  And  for 
this the trace-event are very useful for small systems since the cost is very 
low for events where no one is listening.



Re: [PATCH] Add slowpath enter/exit trace events

2017-11-23 Thread peter enderborg
On 11/23/2017 02:43 PM, Tetsuo Handa wrote:
> Please see my attempt at
> http://lkml.kernel.org/r/1510833448-19918-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp
>  .
> Printing just current thread is not sufficient for me.
>
>
Seems to  me that it is a lot more overhead with timers and stuff.
My probe is for the health of the system trying to capture how get the penalty 
and how much. A slowpath alloc in a audio stream can causes drop-outs. And they 
are very hard to debug in userspace.



  1   2   >