[RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list

2019-04-24 Thread Fan Du
On a system with heterogeneous memory, reasonable fallback lists would be:
a. No fallback, stick to the current running node.
b. Fall back to other nodes, same type first, then the other type,
   e.g. DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
c. Fall back to other nodes of the same type only,
   e.g. DRAM node 0 -> DRAM node 1

Option a. is already in place; the previous patch implements b., which
satisfies memory requests on a best-effort basis by default. This patch
builds c., which falls back only to nodes of the same type, selected when
the caller specifies GFP_SAME_NODE_TYPE.
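
As a minimal illustration (not part of this patch; the helper name and GFP
combination are made up for the example), a kernel caller wanting behavior
c. would pass the new flag at allocation time:

/*
 * Sketch only: allocate a page that may come solely from nodes of the
 * same type (DRAM or PMEM) as the local node.  If every same-type node
 * is exhausted, the allocation fails rather than spilling over to the
 * other memory type.
 */
static struct page *alloc_page_same_node_type(void)
{
	return __alloc_pages_node(numa_node_id(),
				  GFP_HIGHUSER_MOVABLE | GFP_SAME_NODE_TYPE, 0);
}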

Signed-off-by: Fan Du 
---
 include/linux/gfp.h    |  7 +++++++
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 15 +++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de..ca5fdfc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -44,6 +44,8 @@
 #else
 #define ___GFP_NOLOCKDEP   0
 #endif
+#define ___GFP_SAME_NODE_TYPE  0x100u
+
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -215,6 +217,7 @@
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+#define __GFP_SAME_NODE_TYPE ((__force gfp_t)___GFP_SAME_NODE_TYPE)
 
 /* Room for N __GFP_FOO bits */
 #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
@@ -301,6 +304,8 @@
 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE  (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 
+#define GFP_SAME_NODE_TYPE (__GFP_SAME_NODE_TYPE)
+
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 #define GFP_MOVABLE_SHIFT 3
@@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
 #ifdef CONFIG_NUMA
if (unlikely(flags & __GFP_THISNODE))
return ZONELIST_NOFALLBACK;
+   if (unlikely(flags & __GFP_SAME_NODE_TYPE))
+   return ZONELIST_FALLBACK_SAME_TYPE;
 #endif
return ZONELIST_FALLBACK;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8c37e1c..2f8603e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
 
 enum {
ZONELIST_FALLBACK,  /* zonelist with fallback */
+   ZONELIST_FALLBACK_SAME_TYPE,/* zonelist with fallback to the same type node */
 #ifdef CONFIG_NUMA
/*
 * The NUMA zonelists are doubled because we need zonelists that
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a408a91..de797921 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5448,6 +5448,21 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
}
zonerefs->zone = NULL;
zonerefs->zone_idx = 0;
+
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
+
+   for (i = 0; i < nr_nodes; i++) {
+   int nr_zones;
+
+   pg_data_t *node = NODE_DATA(node_order[i]);
+
+   if (!is_node_same_type(node->node_id, pgdat->node_id))
+   continue;
+   nr_zones = build_zonerefs_node(node, zonerefs);
+   zonerefs += nr_zones;
+   }
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
-- 
1.8.3.1



[RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-24 Thread Fan Du
This is another approach to building the zonelists, based on patch #10 of
patchset [1].

For systems with heterogeneous DRAM and PMEM (persistent memory),

1) change ZONELIST_FALLBACK to fall back first to nodes of the same type,
   then to the other types

2) add ZONELIST_FALLBACK_SAME_TYPE to fall back to same-type nodes only,
   to be explicitly selected by __GFP_SAME_NODE_TYPE.

For example, a 2S DRAM+PMEM system may have NUMA distances:
node   0   1   2   3 
  0:  10  21  17  28 
  1:  21  10  28  17 
  2:  17  28  10  28 
  3:  28  17  28  10

Nodes 0 and 1 are DRAM nodes; nodes 2 and 3 are PMEM nodes.

ZONELIST_FALLBACK
=================
Current fallback lists are built on NUMA distance alone, which means a page
allocation request from node 0 will iterate zones in the order:
DRAM node 0 -> PMEM node 2 -> DRAM node 1 -> PMEM node 3.

However, PMEM has different characteristics from DRAM; the more reasonable
and desirable fallback order would be:
DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3,
i.e. try PMEM only when DRAM is exhausted.
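
For illustration only (not part of the series), this is roughly how the
allocator consumes a fallback list; with this change, the walk below for
node 0 would visit both DRAM nodes before any PMEM zone:

	struct zonelist *zl = node_zonelist(0, GFP_KERNEL);
	struct zoneref *z;
	struct zone *zone;

	/* Walk node 0's ZONELIST_FALLBACK list in order. */
	for_each_zone_zonelist(zone, z, zl, gfp_zone(GFP_KERNEL))
		pr_info("fallback: node %d zone %s\n",
			zone_to_nid(zone), zone->name);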

ZONELIST_FALLBACK_SAME_TYPE
===========================
Some use cases fit PMEM characteristics well, e.g. pages that are read much
more frequently than they are written; other cases may prefer DRAM only. In
either case it does not matter whether the page comes from the local node
or a remote one.

Create __GFP_SAME_NODE_TYPE to request pages of the same node type: we get
either DRAM (from nodes 0, 1) or PMEM (from nodes 2, 3). It is effectively
an extension of the nofallback list, widened to all nodes of the same type.

This patchset is self-contained, and based on Linux 5.1-rc6.

[1]:
https://lkml.org/lkml/2018/12/26/138

Fan Du (5):
  acpi/numa: memorize NUMA node type from SRAT table
  mmzone: new pgdat flags for DRAM and PMEM
  x86,numa: update numa node type
  mm, page alloc: build fallback list on per node type basis
  mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list

 arch/x86/include/asm/numa.h |  2 ++
 arch/x86/mm/numa.c  |  3 +++
 drivers/acpi/numa.c |  5 +++++
 include/linux/gfp.h |  7 +++++++
 include/linux/mmzone.h  | 35 +++++++++++++++++++++++++++++++++++
 mm/page_alloc.c | 57 +++++++++++++++++++++++++++++++++++++++++----------------
 6 files changed, 93 insertions(+), 16 deletions(-)

-- 
1.8.3.1



[RFC PATCH 4/5] mm, page alloc: build fallback list on per node type basis

2019-04-24 Thread Fan Du
On a box with both DRAM and PMEM managed by the mm system, usually
nodes 0 and 1 are DRAM nodes and nodes 2 and 3 are PMEM nodes. The
nofallback lists are the same as before; the fallback lists are now
redesigned to be arranged on a per node type basis, iow, an allocation
request for a DRAM page starting from node 0 will go through the
node0 -> node1 -> node2 -> node3 zonelists (see the worked example below).
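
As a worked example (illustrative, using the DRAM nodes {0, 1} and PMEM
nodes {2, 3} from the cover letter), the two-pass build produces:

/*
 * build_zonelists(node 0): pass 1 (same type)  -> 0, 1
 *                          pass 2 (other type) -> 2, 3
 *   node_order = { 0, 1, 2, 3 }
 *
 * build_zonelists(node 2): pass 1 (same type)  -> 2, 3
 *                          pass 2 (other type) -> 0, 1
 *   node_order = { 2, 3, 0, 1 }
 */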

Signed-off-by: Fan Du 
---
 include/linux/mmzone.h |  8 ++++++++
 mm/page_alloc.c        | 42 ++++++++++++++++++++++++++----------------
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d3ee9f9..8c37e1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -939,6 +939,14 @@ static inline int is_node_dram(int nid)
return test_bit(PGDAT_DRAM, &pgdat->flags);
 }
 
+static inline int is_node_same_type(int nida, int nidb)
+{
+   if (node_isset(nida, numa_nodes_pmem))
+   return node_isset(nidb, numa_nodes_pmem);
+   else
+   return node_isset(nidb, numa_nodes_dram);
+}
+
 static inline void set_node_type(int nid)
 {
pg_data_t *pgdat = NODE_DATA(nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6ce20a..a408a91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5372,7 +5372,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask, int need_same_type)
 {
int n, val;
int min_val = INT_MAX;
@@ -5380,7 +5380,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
const struct cpumask *tmp = cpumask_of_node(0);
 
/* Use the local node if we haven't already */
-   if (!node_isset(node, *used_node_mask)) {
+   if (need_same_type && !node_isset(node, *used_node_mask)) {
node_set(node, *used_node_mask);
return node;
}
@@ -5391,6 +5391,12 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
if (node_isset(n, *used_node_mask))
continue;
 
+   if (need_same_type && !is_node_same_type(node, n))
+   continue;
+
+   if (!need_same_type && is_node_same_type(node, n))
+   continue;
+
/* Use the distance array to find the distance */
val = node_distance(node, n);
 
@@ -5472,31 +5478,35 @@ static void build_zonelists(pg_data_t *pgdat)
int node, load, nr_nodes = 0;
nodemask_t used_mask;
int local_node, prev_node;
+   int need_same_type;
 
/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
load = nr_online_nodes;
prev_node = local_node;
-   nodes_clear(used_mask);
 
memset(node_order, 0, sizeof(node_order));
-   while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-   /*
-* We don't want to pressure a particular node.
-* So adding penalty to the first node in same
-* distance group to make it round-robin.
-*/
-   if (node_distance(local_node, node) !=
-   node_distance(local_node, prev_node))
-   node_load[node] = load;
+   for (need_same_type = 1; need_same_type >= 0; need_same_type--) {
+   nodes_clear(used_mask);
+   while ((node = find_next_best_node(local_node, &used_mask,
+   need_same_type)) >= 0) {
+   /*
+* We don't want to pressure a particular node.
+* So adding penalty to the first node in same
+* distance group to make it round-robin.
+*/
+   if (node_distance(local_node, node) !=
+   node_distance(local_node, prev_node))
+   node_load[node] = load;
 
-   node_order[nr_nodes++] = node;
-   prev_node = node;
-   load--;
+   node_order[nr_nodes++] = node;
+   prev_node = node;
+   load--;
+   }
}
-
build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
build_thisnode_zonelists(pgdat);
+
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-- 
1.8.3.1



[RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table

2019-04-24 Thread Fan Du
Mark each NUMA node as DRAM or PMEM.

This can happen at boot time (see the e820 pmem type override patch),
or on the fly when binding a devdax device with the kmem driver.

It depends on the BIOS supplying the PMEM NUMA proximity in the SRAT
table, which current production BIOSes do.
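
For illustration (a sketch, not part of the patch), once SRAT parsing has
filled the new nodemasks they can be queried like any other nodemask:

	int nid;

	/* List every node that was marked persistent in the SRAT. */
	for_each_node_mask(nid, numa_nodes_pmem)
		pr_info("node %d provides persistent memory\n", nid);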

Signed-off-by: Fan Du 
---
 arch/x86/include/asm/numa.h | 2 ++
 arch/x86/mm/numa.c  | 2 ++
 drivers/acpi/numa.c | 5 +++++
 3 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index bbfde3d..5191198 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -30,6 +30,8 @@
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_nodes_pmem;
+extern nodemask_t numa_nodes_dram;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index dfb6c4d..3c3a1f5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -20,6 +20,8 @@
 
 int numa_off;
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_nodes_pmem;
+nodemask_t numa_nodes_dram;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 867f6e3..ec4b7a7e 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -298,6 +298,11 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 
node_set(node, numa_nodes_parsed);
 
+   if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
+   node_set(node, numa_nodes_pmem);
+   else
+   node_set(node, numa_nodes_dram);
+
pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1,
-- 
1.8.3.1



[RFC PATCH 2/5] mmzone: new pgdat flags for DRAM and PMEM

2019-04-24 Thread Fan Du
On a system with both DRAM and PMEM, we need new flags to tag whether
a pgdat is made of DRAM or persistent memory.

This patch serves as preparation for the follow-up patches.
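
A minimal sketch (not part of the patch) of how callers are expected to
use the helpers introduced below:

	int nid = numa_node_id();

	/* Branch on the memory type of a node once it has been tagged. */
	if (is_node_pmem(nid))
		pr_info("node %d is persistent memory\n", nid);
	else if (is_node_dram(nid))
		pr_info("node %d is DRAM\n", nid);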

Signed-off-by: Fan Du 
---
 include/linux/mmzone.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..d3ee9f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -520,6 +520,8 @@ enum pgdat_flags {
 * many pages under writeback
 */
PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */
+   PGDAT_DRAM, /* Volatile DRAM memory node */
+   PGDAT_PMEM, /* Persistent memory node */
 };
 
 enum zone_flags {
@@ -923,6 +925,30 @@ extern int numa_zonelist_order_handler(struct ctl_table *, int,
 
 #endif /* !CONFIG_NEED_MULTIPLE_NODES */
 
+static inline int is_node_pmem(int nid)
+{
+   pg_data_t *pgdat = NODE_DATA(nid);
+
+   return test_bit(PGDAT_PMEM, &pgdat->flags);
+}
+
+static inline int is_node_dram(int nid)
+{
+   pg_data_t *pgdat = NODE_DATA(nid);
+
+   return test_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
+static inline void set_node_type(int nid)
+{
+   pg_data_t *pgdat = NODE_DATA(nid);
+
+   if (node_isset(nid, numa_nodes_pmem))
+   set_bit(PGDAT_PMEM, &pgdat->flags);
+   else
+   set_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
 extern struct pglist_data *first_online_pgdat(void);
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
-- 
1.8.3.1



[RFC PATCH 3/5] x86,numa: update numa node type

2019-04-24 Thread Fan Du
Give the newly created node a type according to its SRAT attributes.

Signed-off-by: Fan Du 
---
 arch/x86/mm/numa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3c3a1f5..ff8ad63 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -590,6 +590,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
continue;
 
alloc_node_data(nid);
+   set_node_type(nid);
}
 
/* Dump memblock with node info and return. */
-- 
1.8.3.1



[PATCH] memory hotplug: fix comments when add section

2017-11-15 Thread Fan Du
Here, pfn_to_node should be page_to_nid.

Signed-off-by: Fan Du <fan...@intel.com>
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4b5f29..c3c1c57 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -265,7 +265,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
/*
 * Make all the pages reserved so that nobody will stumble over half
 * initialized state.
-* FIXME: We also have to associate it with a node because pfn_to_node
+* FIXME: We also have to associate it with a node because page_to_nid
 * relies on having page with the proper node.
 */
for (i = 0; i < PAGES_PER_SECTION; i++) {
-- 
1.8.3.1



[PATCH v4] Add /proc/PID/smaps support for DAX

2017-10-25 Thread Fan Du
Memory behind device DAX is not attached to the normal memory
management system. When a user mmaps /dev/dax, the smaps entries are
currently missing, so there is no way for the user to check how much
device DAX memory is actually used in practice.

Whether the vma is backed by normal pages, huge pages, or both at the
same time makes no difference for device DAX users so far.

The existing smaps structure is enough to do the job, so this patch
reuses the existing RSS/PSS machinery for these statistics. An example
reading looks like this:

7f30fe20-7f310220 rw-s  00:06 19567  
/dev/dax12.0
Size:  65536 kB
KernelPageSize:4 kB
MMUPageSize:   4 kB
Rss:   65536 kB
Pss:   65536 kB
Shared_Clean:  0 kB
Shared_Dirty:  0 kB
Private_Clean: 0 kB
Private_Dirty: 65536 kB
Referenced:65536 kB
Anonymous: 0 kB
LazyFree:  0 kB
AnonHugePages: 0 kB
ShmemPmdMapped:0 kB
Shared_Hugetlb:0 kB
Private_Hugetlb:   0 kB
Swap:  0 kB
SwapPss:   0 kB
Locked:65536 kB
ProtectionKey: 0
VmFlags: rd wr sh mr mw me ms mm hg
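
For reference (illustrative numbers, not from the patch), the PSS values
above come from dividing each mapping's size by its map count, with
PSS_SHIFT preserving precision during the division:

	u64 pss = 0;
	unsigned long size = 64UL << 20;	/* a 64 MB DAX mapping */
	int mapcount = 2;			/* shared by two processes */

	pss += ((u64)size << PSS_SHIFT) / mapcount;
	/* Each process would report (pss >> PSS_SHIFT) / 1024 = 32768 kB. */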

Signed-off-by: Fan Du <fan...@intel.com>
---
v4:
* Merge device DAX readings into existing smap counters
  for simplicity.

v3:
* Elaborate more about the usage suggested by Michal Hocko

v2:
* Using pte_devmap to check valid pfn page structure,
  Pointed out by Dan. thx!
 fs/proc/task_mmu.c | 74 --
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..9b2d3e6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -507,6 +507,55 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
}
 }
 
+/* The page structure behind DAX mappings is NOT a compound page
+ * when it is a huge page mapping, so introduce a new API to
+ * account for both PMD and PUD mappings.
+ */
+static void smaps_account_dax_huge(struct mem_size_stats *mss,
+   struct page *page, unsigned long size, bool young, bool dirty)
+{
+   int mapcount = page_mapcount(page);
+
+   if (PageAnon(page)) {
+   mss->anonymous += size;
+   if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
+   mss->lazyfree += size;
+   }
+
+   mss->resident += size;
+   /* Accumulate the size in pages that have been accessed. */
+   if (young || page_is_young(page) || PageReferenced(page))
+   mss->referenced += size;
+
+   /*
+* page_count(page) == 1 guarantees the page is mapped exactly once.
+* If any subpage of the compound page mapped with PTE it would elevate
+* page_count().
+*/
+   if (page_count(page) == 1) {
+   if (dirty || PageDirty(page))
+   mss->private_dirty += size;
+   else
+   mss->private_clean += size;
+   mss->pss += (u64)size << PSS_SHIFT;
+   return;
+   }
+
+   if (mapcount >= 2) {
+   if (dirty || PageDirty(page))
+   mss->shared_dirty += size;
+   else
+   mss->shared_clean += size;
+   mss->pss += (size << PSS_SHIFT) / mapcount;
+   } else {
+   if (dirty || PageDirty(page))
+   mss->private_dirty += size;
+   else
+   mss->private_clean += size;
+   mss->pss += size << PSS_SHIFT;
+   }
+}
+
 #ifdef CONFIG_SHMEM
 static int smaps_pte_hole(unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -528,7 +577,16 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct page *page = NULL;
 
if (pte_present(*pte)) {
-   page = vm_normal_page(vma, addr, *pte);
+   if (!vma_is_dax(vma))
+   page = vm_normal_page(vma, addr, *pte);
+   else if (pte_devmap(*pte)) {
+   struct dev_pagemap *pgmap;
+
+   pgmap = get_dev_pagemap(pte_pfn(*pte), NULL);
+   if (!pgmap)
+   return;
+   page = pte_page(*pte);
+   }
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
@@ -579,7 +637,19 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
struct page *page;
 
/* FOLL_DUMP will return -EFAULT on huge zero page */
-   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   if (!vma_is_dax(vma))
+   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   else if (pmd_devmap(*pmd)) {
+  


[PATCH 2/2] Add /proc/PID/{smaps, numa_maps} support for DAX

2017-10-24 Thread Fan Du
So users can check these interfaces for more detailed information
about how many DAX mappings are currently created.

Here we use the vma_is_dax method to find the specific page structure
for DAX {huge, normal} page mappings; the vm_normal_page routine works
as before, without any impact on the existing logic where
_vm_normal_page is called.

With this patch, users can check device DAX usage via the Ptes@
readings. For example, statistics from a device DAX aligned to 2MB:

7f6c1780-7f6c17e0 rw-s  00:06 20559  /dev/dax12.0
Size:   6144 kB
.
.
.
Ptes@2MB:   6144 kB

Signed-off-by: Fan Du <fan...@intel.com>
---
v3:
* Elaborate more about the usage suggested by Michal Hocko

v2:
* Using pte_devmap to check valid pfn page structure,
  Pointed out by Dan. thx!

---
 fs/proc/task_mmu.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 30dbf37..5c4535c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -531,7 +531,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct page *page = NULL;
 
if (pte_present(*pte)) {
-   page = vm_normal_page(vma, addr, *pte);
+   if (!vma_is_dax(vma))
+   page = vm_normal_page(vma, addr, *pte);
+   else if (pte_devmap(*pte))
+   page = pte_page(*pte);
mss->rss_pte += PAGE_SIZE;
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -583,7 +586,11 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
struct page *page;
 
/* FOLL_DUMP will return -EFAULT on huge zero page */
-   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   if (!vma_is_dax(vma))
+   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   else if (pmd_devmap(*pmd))
+   page = pmd_page(*pmd);
+
if (IS_ERR_OR_NULL(page))
return;
if (PageAnon(page))
@@ -1770,13 +1777,15 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *orig_pte;
pte_t *pte;
+   struct page *page;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
-   struct page *page;
-
-   page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+   if (!vma_is_dax(vma))
+   page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+   else if (pmd_devmap(*pmd))
+   page = pmd_page(*pmd);
if (page)
gather_stats(page, md, pmd_dirty(*pmd),
 HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1789,7 +1798,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
-   struct page *page = can_gather_numa_stats(*pte, vma, addr);
+   if (!vma_is_dax(vma))
+   page = can_gather_numa_stats(*pte, vma, addr);
+   else if (pte_devmap(*pte))
+   page = pte_page(*pte);
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
-- 
1.8.3.1



[PATCHv3 1/2] proc: mm: export PTE sizes directly in smaps

2017-10-24 Thread Fan Du
From: Dave Hansen <dave.han...@intel.com>

/proc/$pid/smaps has a number of fields that are intended to imply the
kinds of PTEs used to map memory.  "AnonHugePages" obviously tells you
how many PMDs are being used.  "MMUPageSize" along with the "Hugetlb"
fields tells you how many PTEs you have for a huge page.

The current mechanisms work fine when we have one or two page sizes.
But, they start to get a bit muddled when we mix page sizes inside
one VMA.  For instance, the DAX folks were proposing adding a set of
fields like:

DevicePages:
DeviceHugePages:
DeviceGiganticPages:
DeviceGinormousPages:

to unmuddle things when page sizes get mixed.  That's fine, but
it does require userspace know the mapping from our various
arbitrary names to hardware page sizes on each architecture and
kernel configuration.  That seems rather suboptimal.

What folks really want is to know how much memory is mapped with
each page size.  How about we just do *that* instead?

Patch attached.  Seems harmless enough.  Seems to compile on a
bunch of random architectures.  Makes smaps look like this:

Private_Hugetlb:   0 kB
Swap:  0 kB
SwapPss:   0 kB
KernelPageSize:4 kB
MMUPageSize:   4 kB
Locked:0 kB
Ptes@4kB: 32 kB
Ptes@2MB:   2048 kB

The format I used here should be unlikely to break smaps parsers
unless they're looking for "kB" and now match the 'Ptes@4kB' instead
of the one at the end of the line.

Note: hugetlbfs PTEs are unusual.  We can have more than one "pte_t"
for each hugetlbfs "page".  arm64 has this configuration, and probably
others.  The code should now handle when an hstate's size is not equal
to one of the page table entry sizes.  For instance, it assumes that
hstates between PMD_SIZE and PUD_SIZE are made up of multiple PMDs
and prints them as such.
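
As an illustrative example (numbers assumed, not from the patch): with
PMD_SIZE = 2 MB and PUD_SIZE = 1 GB, a 32 MB hstate lies between the two,
so each such page is counted as 32 MB / 2 MB = 16 PMD entries and
contributes 32768 kB to the Ptes@2MB line.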

I've tested this on x86 with normal 4k ptes, anonymous huge pages,
1G hugetlbfs and 2M hugetlbfs pages.

1. I'd like to thank Dan Williams for showing me a mirror as I
   complained about the bozo that introduced 'AnonHugePages'.

[Fan]
Rebase the original patch from Dave Hansen by fixing a couple of compile
issues.

Signed-off-by: Fan Du <fan...@intel.com>
Signed-off-by: Dave Hansen <dave.han...@intel.com>
---
 Documentation/filesystems/proc.txt |   6 +++
 fs/proc/task_mmu.c | 106 -
 mm/hugetlb.c   |  11 
 3 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index adba21b..a11ab80 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -423,6 +423,9 @@ SwapPss:   0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB
 Locked:0 kB
+Ptes@4kB: 4 kB
+Ptes@2MB:  8192 kB
+
 VmFlags: rd ex mr mw me dw
 
 the first of these lines shows the same information as is displayed for the
@@ -460,6 +463,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
 "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
+"Ptes@..." lines show how many page table entries are currently in place and
+pointing to memory.  There is an entry for each size present in the hardware
+page tables for this mapping.
 
 "VmFlags" field deserves a separate description. This member represents the 
kernel
 flags associated with the particular virtual memory area in two letter encoded
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..30dbf37 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -451,6 +451,9 @@ struct mem_size_stats {
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
unsigned long first_vma_start;
+   unsigned long rss_pte;
+   unsigned long rss_pmd;
+   unsigned long rss_pud;
u64 pss;
u64 pss_locked;
u64 swap_pss;
@@ -529,6 +532,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
+   mss->rss_pte += PAGE_SIZE;
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
@@ -590,6 +594,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
/* pass */;
else
VM_BUG_ON_PAGE(1, page);
+   mss->rss_pmd += PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
@@ -699,6 +704,30 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef


[PATCHv2 2/2] Add /proc/PID/{smaps, numa_maps} support for DAX

2017-10-24 Thread Fan Du
So users can check these interfaces for more detailed information
about how many DAX mappings are currently created.

Here we use the vma_is_dax method to find the specific page structure
for DAX {huge, normal} page mappings; the vm_normal_page routine works
as before, without any impact on the existing logic where
_vm_normal_page is called.

Signed-off-by: Fan Du <fan...@intel.com>
---
v2:
* Using pte_devmap to check valid pfn page structure,
  Pointed out by Dan. thx!

---
 fs/proc/task_mmu.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 30dbf37..5c4535c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -531,7 +531,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct page *page = NULL;
 
if (pte_present(*pte)) {
-   page = vm_normal_page(vma, addr, *pte);
+   if (!vma_is_dax(vma))
+   page = vm_normal_page(vma, addr, *pte);
+   else if (pte_devmap(*pte))
+   page = pte_page(*pte);
mss->rss_pte += PAGE_SIZE;
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -583,7 +586,11 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
struct page *page;
 
/* FOLL_DUMP will return -EFAULT on huge zero page */
-   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   if (!vma_is_dax(vma))
+   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   else if (pmd_devmap(*pmd))
+   page = pmd_page(*pmd);
+
if (IS_ERR_OR_NULL(page))
return;
if (PageAnon(page))
@@ -1770,13 +1777,15 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *orig_pte;
pte_t *pte;
+   struct page *page;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
-   struct page *page;
-
-   page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+   if (!vma_is_dax(vma))
+   page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+   else if (pmd_devmap(*pmd))
+   page = pmd_page(*pmd);
if (page)
gather_stats(page, md, pmd_dirty(*pmd),
 HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1789,7 +1798,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
-   struct page *page = can_gather_numa_stats(*pte, vma, addr);
+   if (!vma_is_dax(vma))
+   page = can_gather_numa_stats(*pte, vma, addr);
+   else if (pte_devmap(*pte))
+   page = pte_page(*pte);
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
-- 
1.8.3.1



[PATCHv2 1/2] proc: mm: export PTE sizes directly in smaps

2017-10-24 Thread Fan Du
/proc/$pid/smaps has a number of fields that are intended to imply the
kinds of PTEs used to map memory.  "AnonHugePages" obviously tells you
how many PMDs are being used.  "MMUPageSize" along with the "Hugetlb"
fields tells you how many PTEs you have for a huge page.

The current mechanisms work fine when we have one or two page sizes.
But, they start to get a bit muddled when we mix page sizes inside
one VMA.  For instance, the DAX folks were proposing adding a set of
fields like:

DevicePages:
DeviceHugePages:
DeviceGiganticPages:
DeviceGinormousPages:

to unmuddle things when page sizes get mixed.  That's fine, but
it does require userspace know the mapping from our various
arbitrary names to hardware page sizes on each architecture and
kernel configuration.  That seems rather suboptimal.

What folks really want is to know how much memory is mapped with
each page size.  How about we just do *that* instead?

Patch attached.  Seems harmless enough.  Seems to compile on a
bunch of random architectures.  Makes smaps look like this:

Private_Hugetlb:   0 kB
Swap:  0 kB
SwapPss:   0 kB
KernelPageSize:4 kB
MMUPageSize:   4 kB
Locked:0 kB
Ptes@4kB: 32 kB
Ptes@2MB:   2048 kB

The format I used here should be unlikely to break smaps parsers
unless they're looking for "kB" and now match the 'Ptes@4kB' instead
of the one at the end of the line.

Note: hugetlbfs PTEs are unusual.  We can have more than one "pte_t"
for each hugetlbfs "page".  arm64 has this configuration, and probably
others.  The code should now handle when an hstate's size is not equal
to one of the page table entry sizes.  For instance, it assumes that
hstates between PMD_SIZE and PUD_SIZE are made up of multiple PMDs
and prints them as such.

I've tested this on x86 with normal 4k ptes, anonymous huge pages,
1G hugetlbfs and 2M hugetlbfs pages.

1. I'd like to thank Dan Williams for showing me a mirror as I
   complained about the bozo that introduced 'AnonHugePages'.

[Fan]
Rebase the original patch from Dave Hansen by fixing a couple of compile
issues.

Signed-off-by: Fan Du <fan...@intel.com>
Signed-off-by: Dave Hansen <dave.han...@intel.com>
---
 Documentation/filesystems/proc.txt |   6 +++
 fs/proc/task_mmu.c | 106 -
 mm/hugetlb.c   |  11 
 3 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index adba21b..a11ab80 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -423,6 +423,9 @@ SwapPss:   0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB
 Locked:0 kB
+Ptes@4kB: 4 kB
+Ptes@2MB:  8192 kB
+
 VmFlags: rd ex mr mw me dw
 
 the first of these lines shows the same information as is displayed for the
@@ -460,6 +463,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
 "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
+"Ptes@..." lines show how many page table entries are currently in place and
+pointing to memory.  There is an entry for each size present in the hardware
+page tables for this mapping.
 
 "VmFlags" field deserves a separate description. This member represents the 
kernel
 flags associated with the particular virtual memory area in two letter encoded
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..30dbf37 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -451,6 +451,9 @@ struct mem_size_stats {
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
unsigned long first_vma_start;
+   unsigned long rss_pte;
+   unsigned long rss_pmd;
+   unsigned long rss_pud;
u64 pss;
u64 pss_locked;
u64 swap_pss;
@@ -529,6 +532,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
+   mss->rss_pte += PAGE_SIZE;
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
@@ -590,6 +594,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
/* pass */;
else
VM_BUG_ON_PAGE(1, page);
+   mss->rss_pmd += PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
@@ -699,6 +704,30 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Most architectures 


[PATCH] Add /proc/PID/{smaps, numa_maps} support for DAX

2017-10-24 Thread Fan Du
So users can check these interfaces for more detailed information
about how many DAX mappings are currently created.

Here we use the vma_is_dax method to find the specific page structure
for DAX {huge, normal} page mappings; the vm_normal_page routine works
as before, without any impact on the existing logic where
_vm_normal_page is called.

Signed-off-by: Fan Du <fan...@intel.com>
---
 fs/proc/task_mmu.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..ba2e58c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -528,7 +528,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
struct page *page = NULL;
 
if (pte_present(*pte)) {
-   page = vm_normal_page(vma, addr, *pte);
+   if (!vma_is_dax(vma))
+   page = vm_normal_page(vma, addr, *pte);
+   else
+   page = pte_page(*pte);
+
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
@@ -579,7 +583,11 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
struct page *page;
 
/* FOLL_DUMP will return -EFAULT on huge zero page */
-   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   if (!vma_is_dax(vma))
+   page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+   else
+   page = pmd_page(*pmd);
+
if (IS_ERR_OR_NULL(page))
return;
if (PageAnon(page))
@@ -1668,13 +1676,15 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *orig_pte;
pte_t *pte;
+   struct page *page;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
-   struct page *page;
-
-   page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+   if (!vma_is_dax(vma))
+   page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+   else
+   page = pmd_page(*pmd);
if (page)
gather_stats(page, md, pmd_dirty(*pmd),
 HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1687,7 +1697,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
-   struct page *page = can_gather_numa_stats(*pte, vma, addr);
+   if (!vma_is_dax(vma))
+   page = can_gather_numa_stats(*pte, vma, addr);
+   else
+   page = pte_page(*pte);
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
-- 
1.8.3.1



Re: [PATCH v3 net-next] xfrm: Add check to prevent un-complete key manager

2013-11-10 Thread Fan Du



On 2013-11-11 14:39, baker.ker...@gmail.com wrote:

From: Baker Zhang

"acquire" and "compile_policy" callbacks are necessary for a key manager.

Signed-off-by: Baker Zhang
---
Thanks for all the replies.

V1:
For current kernel source, there is no problem.

In our vpn product, we need a xfrm_km in kernel module
to monitor the xfrm state change.
thus, the 'acquire' and 'compile_policy' may be NULL.

So I think we should do the check before use it.

V2:
Align the continuation line according the networking coding style.

V3:
Add check to prevent un-complete key manager at register time.

  net/xfrm/xfrm_state.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index b9c3f9e..178283e 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1806,6 +1806,9 @@ static DEFINE_SPINLOCK(xfrm_km_lock);

  int xfrm_register_km(struct xfrm_mgr *km)
  {
+   if (km->acquire == NULL || km->compile_policy == NULL)


"acquire" is a MUST, "compile_policy" is not a necessity.

From  the fist commit log, you probably add functionality providing SA state 
changes
in your private key manager, which current implementation does not. Maybe it's 
worthwhile
to elaborate the missing functionality than add those checking, because both 
key manage
(pfkeyv2/netlink) in use has "acquire" and "compile_policy" at the same time.
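
To make that concrete, a minimal sketch of the check being argued for here,
rejecting only a missing ->acquire (hypothetical, not a posted patch; the
registration body is reconstructed from memory of net/xfrm/xfrm_state.c of
that era, so treat it as approximate):

int xfrm_register_km(struct xfrm_mgr *km)
{
	/* ->acquire is a MUST; ->compile_policy may legitimately be NULL */
	if (km->acquire == NULL)
		return -EINVAL;

	spin_lock_bh(&xfrm_km_lock);
	list_add_tail_rcu(&km->list, &xfrm_km_list);
	spin_unlock_bh(&xfrm_km_lock);
	return 0;
}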



--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

2013-09-17 Thread Fan Du



On 2013-09-16 17:01, Thomas Gleixner wrote:

On Mon, 16 Sep 2013, Fan Du wrote:

On 2013-09-13 22:32, Thomas Gleixner wrote:

On Fri, 13 Sep 2013, Fan Du wrote:

(2) What I have been bugging you around here for this long time is really the
second problem; I'm sorry I didn't make it clear to you and others. It is
below:

  Why use wall clock time to calculate soft/hard IPsec events when the
  xfrm_state timeout happens in its timeout handler? Because even if
  xfrm_state uses CLOCK_BOOTTIME, a system wall clock time change will
  surely disturb the soft/hard IPsec events, which you raised your concern
  about in (*a*).


No, CLOCK_BOOTTIME is not affected by wall clock time changes. It's
basically CLOCK_MONOTONIC, it just keeps counting the suspend time as
well. So without a suspend/resume cycle CLOCK_MONOTONIC and
CLOCK_BOOTTIME are the same. After a suspend/resume cycle
CLOCK_BOOTTIME will be N seconds ahead of CLOCK_MONOTONIC, where N is
the number of seconds spent in suspend.


Sorry for the ambiguity. Yes, CLOCK_BOOTTIME is monotonic *and* counts
suspend/resume time, and it is not affected by wall clock time changes.

Using CLOCK_BOOTTIME guarantees that b- will time out correctly, i.e.,
unaffected by wall clock time changes, and keeps the timer valid across
suspend/resume.

A classic way of using a timer is:
  a- Arm a timer with a specified interval
  b- Wait for the timer to time out
  c- After the timeout, notify the event elsewhere from the timer handler.

The IPsec key lifetime timer does it this way:
  a- Arm a timer with a specified interval
  b- Wait for the timer to time out
  c- After the timeout, in the timer handler, use wall clock time to calculate
     which kind of event to notify the user of (soft or hard, for both key use
     time and key creation time). So the problem arises at this stage if the
     wall clock time changed.


And why on earth must it use wall clock time for selecting the event
type?


/*shivering... sorry to bother you again.*/

Yep, it's a bit twisted. This part of the code comes a long way from v2.5.67;
beyond that fact, I barely know the original design goal behind it.
The first version of this patch removed the wall clock time dependency
entirely, but it's a pity that the feedback was not very welcoming in the end.
So later on, adding a notifier at clock_was_set() was suggested by David.


Thanks,

tglx


--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

2013-09-15 Thread Fan Du

Hi, Thomas

On 2013-09-13 22:32, Thomas Gleixner wrote:

On Fri, 13 Sep 2013, Fan Du wrote:

(2) What I have been bugging you around here for this long time is really the
second problem; I'm sorry I didn't make it clear to you and others. It is
below:

 Why use wall clock time to calculate soft/hard IPsec events when the
 xfrm_state timeout happens in its timeout handler? Because even if
 xfrm_state uses CLOCK_BOOTTIME, a system wall clock time change will
 surely disturb the soft/hard IPsec events, which you raised your concern
 about in (*a*).


No, CLOCK_BOOTTIME is not affected by wall clock time changes. It's
basically CLOCK_MONOTONIC, it just keeps counting the suspend time as
well. So without a suspend/resume cycle CLOCK_MONOTONIC and
CLOCK_BOOTTIME are the same. After a suspend/resume cycle
CLOCK_BOOTTIME will be N seconds ahead of CLOCK_MONOTONIC, where N is
the number of seconds spent in suspend.


Sorry for the ambiguity. Yes, CLOCK_BOOTTIME is monotonic *and* counts
suspend/resume time, and it is not affected by wall clock time changes.

Using CLOCK_BOOTTIME guarantees that b- will time out correctly, i.e.,
unaffected by wall clock time changes, and keeps the timer valid across
suspend/resume.
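
The distinction is easy to observe from user space. A minimal demo, assuming a
kernel/libc with CLOCK_BOOTTIME (available since Linux 2.6.39; older glibc may
need -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec mono, boot;

	clock_gettime(CLOCK_MONOTONIC, &mono);
	clock_gettime(CLOCK_BOOTTIME, &boot);

	/* boot - mono == accumulated suspend time (0 if never suspended);
	 * neither value jumps when settimeofday() changes the wall clock. */
	printf("CLOCK_MONOTONIC: %ld.%09ld\n", (long)mono.tv_sec, mono.tv_nsec);
	printf("CLOCK_BOOTTIME:  %ld.%09ld\n", (long)boot.tv_sec, boot.tv_nsec);
	return 0;
}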

A classic way of using a timer is:
 a- Arm a timer with a specified interval
 b- Wait for the timer to time out
 c- After the timeout, notify the event elsewhere from the timer handler.

The IPsec key lifetime timer does it this way (see the sketch after this list):
 a- Arm a timer with a specified interval
 b- Wait for the timer to time out
 c- After the timeout, in the timer handler, use wall clock time to calculate
   which kind of event to notify the user of (soft or hard, for both key use
   time and key creation time). So the problem arises at this stage if the
   wall clock time changed.
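
A hypothetical sketch of that c- step (all names are invented for illustration;
only get_seconds() and the soft/hard lifetime concept come from the xfrm code):

static void key_timer_handler(struct my_sa *x)
{
	unsigned long now = get_seconds();	/* wall clock: jumps on settimeofday */

	/* The event type is chosen by comparing wall time against the
	 * recorded creation/first-use stamp, which is exactly what breaks
	 * when the wall clock is stepped. */
	if (now - x->add_time >= x->hard_add_expires_seconds)
		notify_user(x, HARD_EXPIRE);
	else if (now - x->add_time >= x->soft_add_expires_seconds)
		notify_user(x, SOFT_EXPIRE);
}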



Thanks


Thanks,

tglx



--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

2013-09-12 Thread Fan Du

Hi Dave/Thomas

On 2013-09-13 01:32, David Miller wrote:

From: Thomas Gleixner
Date: Thu, 12 Sep 2013 16:43:37 +0200 (CEST)


So what about going back to timer_list timers and simply utilize
register_pm_notifier(), which will tell you that the system resumed?


The thing to understand is that there are two timeouts for an IPSEC
rule, a soft and a hard timeout.

There is a gap between these two exactly so that we can negotiate a
new encapsulation with the IPSEC gateway before communication ceases
to be possible over the IPSEC protected path.

So the idea is that the soft timeout triggers the re-negotiation,
and after a hard timeout the IPSEC path is no longer usable and
all communication will fail.

Simply triggering a re-negotiation after every suspend/resume makes
no sense at all.  Spurious re-negotiations are undesirable.

^^ (*a*)

What's the difference between this and a re-negotiation after every
system wall clock change via the clock_was_set notifier?


> On 2013-08-02 06:35, David Miller wrote:
>
> I suspect the thing to do is to have system time changes generate a
> notifier when clock_was_set() is called.
>
> The XFRM code would walk the rules and pretend that we hit the soft
> timeout for every rule that we haven't hit the soft timeout yet
> already.
>
> If a rule hit the soft timeout, force a hard timeout.
>
> When forcing a soft timeout, adjust the hard timeout to be
> (hard_timeout - soft_timeout) into the future.




What we want are real timers.  We want that rather than a "we
suspended so just assume all timers expired" event which is not very
useful for this kind of application.



Here we are facing two problems :)

(1) What kind of timer should xfrm_state employ? Two requirements here:
first, the KEY lifetime should include suspend/resume time; second,
system wall clock time changes (backward/forward) should *not* impact
the *timer* timeout event (not the soft/hard IPsec events fired to user
space!)

net-next commit 99565a6c471cbb66caa68347c195133017559943 ("xfrm: Make
xfrm_state timer monotonic"), by utilizing *CLOCK_BOOTTIME*, has solved
this problem.

(2) What I have been bugging you around here for this long time is really the
second problem; I'm sorry I didn't make it clear to you and others. It is
below:

Why use wall clock time to calculate soft/hard IPsec events when the
xfrm_state timeout happens in its timeout handler? Because even if
xfrm_state uses CLOCK_BOOTTIME, a system wall clock time change will
surely disturb the soft/hard IPsec events, which you raised your concern
about in (*a*).

The initial approach (http://marc.info/?l=linux-netdev&m=137534280429187&w=2)
tried to solve this second problem by eliminating the dependency on the
system wall clock in the xfrm_state timer timeout handler.

I think this time I have made the situation crystal clear.
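
For concreteness, a hypothetical sketch of David's suggestion quoted above
(all names are invented; nothing here was posted as a patch):

/* On clock_was_set: any SA that has not yet hit its soft timeout is
 * treated as if it just did, and its hard timeout is pushed out by
 * (hard - soft); an SA already past its soft timeout is forced hard. */
static void sa_clock_was_set(struct my_sa *x, unsigned long now)
{
	if (!x->soft_expired) {
		x->soft_expired = true;
		notify_user(x, SOFT_EXPIRE);
		x->hard_deadline = now + (x->hard_lifetime - x->soft_lifetime);
	} else {
		notify_user(x, HARD_EXPIRE);
	}
}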

--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

2013-08-19 Thread Fan Du



On 2013-08-18 17:05, Thomas Gleixner wrote:

On Wed, 14 Aug 2013, Fan Du wrote:


 From e3929d4fdfad5b40fd8cad0e217597670d1aef54 Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Wed, 14 Aug 2013 16:39:23 +0800
Subject: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was
called

When clock_was_set is called in case of system wall time change
or host resume from suspend state, use this notifier for places
where interested in this action, e.g Ipsec SA lifetime management.


Sigh. These notifiers have been proposed in the past and we always
rejected them. Why do you need this? There should be nothing except
the core timekeeping code which needs to know about clock_was_set.

Can you please explain what kind of users you have in mind and WHY
they need to know about it.


Hi, Thomas


Thanks for your patience. Please let me take a few seconds to try to
explain this.

The current xfrm layer has *one* hrtimer to guard IPsec key timeouts.
The timeout can be measured in either of the two ways below:

 (1) The timer is started once the key is created, even though the
 key is not necessarily in actual use right away. In detail,
 record get_seconds() when the key is created.

 (2) The timer is started when the key is actually used, e.g. when
 an IP packet needs to be encrypted. In detail, record
 get_seconds() when the key is first used.

So in the hrtimer handler, the code gets the current get_seconds(),
checks it against what was saved in (1) or (2), and notifies the
timeout up to user land.

So the pitfall is using one hrtimer for two timeout events and, most
importantly, using get_seconds() to check the timeout: once the system
clock is changed intentionally by the user, the key timeout can
misbehave wildly.

A refactor was proposed to get rid of the dependency on the system wall
clock by cleaning up the hrtimer handler. Unfortunately David frowned
on it in (3), and suggested instead that the key's timeout be adjusted
once the system clock is changed.


(3): http://www.spinics.net/lists/netdev/msg245169.html



Thanks,

tglx



--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

2013-08-14 Thread Fan Du

From e3929d4fdfad5b40fd8cad0e217597670d1aef54 Mon Sep 17 00:00:00 2001
From: Fan Du 
Date: Wed, 14 Aug 2013 16:39:23 +0800
Subject: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was 
called

When clock_was_set() is called in case of a system wall time change
or the host resuming from suspend, use this notifier for places
interested in this action, e.g. IPsec SA lifetime management.

Signed-off-by: Fan Du 

v3:
  -Beautify notifier with register/unregister API exported for other subsystem.

---
 include/linux/hrtimer.h |3 +++
 kernel/hrtimer.c|   19 +++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index d19a5c2..f0404e4 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -461,4 +461,7 @@ extern u64 ktime_divns(const ktime_t kt, s64 div);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);

+extern int register_clock_change_notifier(struct notifier_block *nb);
+extern int unregister_clock_change_notifier(struct notifier_block *nb);
+
 #endif
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 383319b..c6e6405 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -755,6 +755,24 @@ static inline void retrigger_next_event(void *arg) { }

 #endif /* CONFIG_HIGH_RES_TIMERS */

+static ATOMIC_NOTIFIER_HEAD(clock_change_notifier_list);
+static int call_clock_change_notifiers(void)
+{
+   return atomic_notifier_call_chain(&clock_change_notifier_list, 0, 0);
+}
+
+int register_clock_change_notifier(struct notifier_block *nb)
+{
+   return atomic_notifier_chain_register(&clock_change_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_clock_change_notifier);
+
+int unregister_clock_change_notifier(struct notifier_block *nb)
+{
+   return atomic_notifier_chain_unregister(&clock_change_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_clock_change_notifier);
+
 /*
  * Clock realtime was set
  *
@@ -773,6 +791,7 @@ void clock_was_set(void)
on_each_cpu(retrigger_next_event, NULL, 1);
 #endif
timerfd_clock_was_set();
+   call_clock_change_notifiers();
 }

 /*
--
1.7.9.5
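
A hypothetical consumer of the API this patch adds, to illustrate the intended
usage (the xfrm-side body is invented; only register_clock_change_notifier()
and unregister_clock_change_notifier() come from the patch):

static int xfrm_clock_change(struct notifier_block *nb,
			     unsigned long action, void *data)
{
	/* wall clock was stepped: re-evaluate SA soft/hard lifetimes here */
	return NOTIFY_DONE;
}

static struct notifier_block xfrm_clock_nb = {
	.notifier_call = xfrm_clock_change,
};

/* at subsystem init */
register_clock_change_notifier(&xfrm_clock_nb);
/* at subsystem exit */
unregister_clock_change_notifier(&xfrm_clock_nb);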


--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] ipv6, rawv6_close(): BUG: unable to handle kernel paging request

2013-07-22 Thread Fan Du

Hello Srivatsa


On 2013-07-22 02:28, Srivatsa S. Bhat wrote:

Hi,

I'm seeing this on every boot.

Version: Latest mainline (commit ea45ea70b)


I tested with this commit using your updated IPv6 config; this incident didn't
show up after several reboots.
Could you please elaborate on your testing details if possible?

A wild guess: it dereferences mrt->mroute6_sk, indicating that mrt is invalid.


Regards,
Srivatsa S. Bhat

---

BUG: unable to handle kernel paging request at 882018552020
IP: [] ip6mr_sk_done+0x32/0xb0 [ipv6]
PGD 290a067 PUD 207ffe0067 PMD 207ff1d067 PTE 802018552060
Oops:  [#1] SMP DEBUG_PAGEALLOC
Modules linked in: ebtable_nat ebtables nfs fscache nf_conntrack_ipv4 
nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables 
nfsd lockd nfs_acl exportfs auth_rpcgss autofs4 sunrpc 8021q garp bridge stp 
llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack 
ip6table_filter ip6_tables ipv6 vfat fat vhost_net macvtap macvlan vhost tun 
kvm_intel kvm uinput iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii 
microcode i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma dca mlx4_core 
be2net wmi acpi_cpufreq mperf ext4 jbd2 mbcache dm_mirror dm_region_hash dm_log 
dm_mod
CPU: 0 PID: 7 Comm: kworker/u33:0 Not tainted 3.11.0-rc1-ea45e-a #4
Hardware name: IBM  -[8737R2A]-/00Y2738, BIOS -[B2E120RUS-1.20]- 11/30/2012
Workqueue: netns cleanup_net
task: 8810393641c0 ti: 881039366000 task.ti: 881039366000
RIP: 0010:[]  [] ip6mr_sk_done+0x32/0xb0 
[ipv6]
RSP: 0018:881039367bd8  EFLAGS: 00010286
RAX: 881039367fd8 RBX: 882018552000 RCX: dead00200200
RDX:  RSI: 881039367b68 RDI: 881039367b68
RBP: 881039367bf8 R08: 881039367b68 R09: 
R10:  R11:  R12: 882015a7a040
R13: 882014eb89c0 R14: 8820289e2800 R15: 
FS:  () GS:88103fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 882018552020 CR3: 01c0b000 CR4: 000407f0
Stack:
  881039367c18 882014eb89c0 882015e28c00 
  881039367c18 a034d9d1 8820289e2800 882014eb89c0
  881039367c58 815bdecb 815bddf2 882014eb89c0
Call Trace:
  [] rawv6_close+0x21/0x40 [ipv6]
  [] inet_release+0xfb/0x220
  [] ? inet_release+0x22/0x220
  [] inet6_release+0x3f/0x50 [ipv6]
  [] sock_release+0x29/0xa0
  [] sk_release_kernel+0x30/0x70
  [] icmpv6_sk_exit+0x3b/0x80 [ipv6]
  [] ops_exit_list+0x39/0x60
  [] cleanup_net+0xfb/0x1a0
  [] process_one_work+0x1da/0x610
  [] ? process_one_work+0x169/0x610
  [] worker_thread+0x120/0x3a0
  [] ? process_one_work+0x610/0x610
  [] kthread+0xee/0x100
  [] ? __init_kthread_worker+0x70/0x70
  [] ret_from_fork+0x7c/0xb0
  [] ? __init_kthread_worker+0x70/0x70
Code: 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 4c 8b 67 30 49 89 fd e8 
db 3c 1e e1 49 8b 9c 24 90 08 00 00 48 85 db 74 06<4c>  39 6b 20 74 20 bb f3 ff 
ff ff e8 8e 3c 1e e1 89 d8 4c 8b 65
RIP  [] ip6mr_sk_done+0x32/0xb0 [ipv6]
  RSP
CR2: 882018552020
---[ end trace e8367f5addd58b5f ]---
BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic(): 0, irqs_disabled(): 1, pid: 7, name: kworker/u33:0
INFO: lockdep is turned off.
irq event stamp: 7804
hardirqs last  enabled at (7803): [] 
_raw_spin_unlock_irq+0x30/0x50
hardirqs last disabled at (7804): [] 
_raw_spin_lock_irq+0x17/0x60
softirqs last  enabled at (7122): [] __do_softirq+0x1e6/0x400
softirqs last disabled at (7113): [] irq_exit+0xed/0x100
CPU: 0 PID: 7 Comm: kworker/u33:0 Tainted: G  D  3.11.0-rc1-ea45e-a #4
Hardware name: IBM  -[8737R2A]-/00Y2738, BIOS -[B2E120RUS-1.20]- 11/30/2012
Workqueue: netns cleanup_net
  819f4a61 881039367828 8161ab9c 881039367828
  8810393641c0 881039367858 8108cbee 881039367898
  881039357ec8 0009 0009 881039367888
Call Trace:
  [] dump_stack+0x59/0x7d
  [] __might_sleep+0x17e/0x230
  [] down_read+0x24/0x70
  [] exit_signals+0x24/0x140
  [] ? blocking_notifier_call_chain+0x16/0x20
  [] do_exit+0xb2/0x4c0
  [] oops_end+0xa9/0xf0
  [] no_context+0x11e/0x1f0
  [] __bad_area_nosemaphore+0x12d/0x230
  [] bad_area_nosemaphore+0x13/0x20
  [] __do_page_fault+0x133/0x4e0
  [] ? __change_page_attr+0x6b/0x2b0
  [] ? __change_page_attr_set_clr+0x4d/0xb0
  [] do_page_fault+0x37/0x70
  [] ? restore_args+0x30/0x30
  [] page_fault+0x22/0x30
  [] ? ip6mr_sk_done+0x32/0xb0 [ipv6]
  [] ? ip6mr_sk_done+0x25/0xb0 [ipv6]
  [] rawv6_close+0x21/0x40 [ipv6]
  [] inet_release+0xfb/0x220
  [] ? inet_release+0x22/0x220
  [] inet6_release+0x3f/0x50 [ipv6]
  [] sock_release+0x29/0xa0
  [] sk_release_kernel+0x30/0x70
  [] icmpv6_sk_exit+0x3b/0x80 [ipv6]
  [] ops_exit_list+0x39/0x60
  [] cleanup_net+0xfb/0x1a0
  [] 

[PATCH] percpu_counter: __this_cpu_write doesn't need to be protected by spinlock

2013-06-30 Thread Fan Du
__this_cpu_write() doesn't need to be protected by the spinlock, as we are
doing a per-cpu write with preemption disabled. And another reason to move
__this_cpu_write() outside of the spinlock: __percpu_counter_sum() is not
an accurate counter anyway.

Signed-off-by: Fan Du 
---
 lib/percpu_counter.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ba6085d..1fc23a3 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -80,8 +80,8 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 	if (count >= batch || count <= -batch) {
 		raw_spin_lock(&fbc->lock);
 		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
 		raw_spin_unlock(&fbc->lock);
+		__this_cpu_write(*fbc->counters, 0);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
-- 
1.7.1
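
For context, the whole function with this patch applied, reconstructed from the
diff and from memory of lib/percpu_counter.c of that era (treat details as
approximate). The surrounding preempt_disable()/preempt_enable() pair is what
makes the per-cpu write safe outside the lock:

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (count >= batch || count <= -batch) {
		raw_spin_lock(&fbc->lock);
		fbc->count += count;
		raw_spin_unlock(&fbc->lock);
		/* per-cpu slot reset moved outside the lock by this patch */
		__this_cpu_write(*fbc->counters, 0);
	} else {
		__this_cpu_write(*fbc->counters, count);
	}
	preempt_enable();
}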

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PANIC at net/xfrm/xfrm_output.c:125 (3.9.4)

2013-06-05 Thread Fan Du

Hello Chris/Jean

This issue might have already been fixed by this:
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/net/xfrm/xfrm_output.c?id=497574c72c9922cf20c12aed15313c389f722fa0

Hope it helps.

On 2013-06-06 09:04, Jean Sacren wrote:

From: Chris Boot
Date: Wed, 05 Jun 2013 22:47:48 +0100


Hi folks,

I have a re-purposed Watchguard Firebox running Debian GNU/Linux with a
self-built vanilla 3.9.4 kernel. I have an IPsec tunnel up to a remote
router through which I was passing a fair bit of traffic when I hit the
following panic:

[486832.949560] BUG: unable to handle kernel NULL pointer dereference at
0010
[486832.953431] IP: [] xfrm_output_resume+0x61/0x29f
[486832.953431] *pde = 
[486832.953431] Oops:  [#1]
[486832.953431] Modules linked in: xt_realm xt_nat authenc esp4
xfrm4_mode_tunnel tun ip6table_nat nf_nat_ipv6 sch_fq_codel xt_statistic
xt_CT xt_LOG xt_connlimit xt_recent xt_time xt_TCPMSS xt_sctp
ip6t_REJECT pppoe deflate zlib_deflate pppox ctr twofish_generic
twofish_i586 twofish_common camellia_generic serpent_sse2_i586 xts
serpent_generic lrw gf128mul glue_helper ablk_helper cryptd
blowfish_generic blowfish_common cast5_generic cast_common des_generic
cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic hmac
crypto_null af_key xfrm_algo xt_comment xt_addrtype xt_policy
ip_set_hash_ip ipt_ULOG ipt_REJECT ipt_MASQUERADE ipt_ECN ipt_CLUSTERIP
ipt_ah act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio
sch_htb sch_hfsc sch_ingress sch_sfq xt_set ip_set nf_nat_tftp
nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp
nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp
nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip
nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp
nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns
nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323
nf_conntrack_ftp xt_TPROXY nf_tproxy_core xt_tcpmss xt_pkttype
xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport
xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit
xt_DSCP xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT xt_state
nfnetlink bridge 8021q garp stp mrp llc ppp_generic slhc
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_raw
ip6table_filter ip6_tables xt_tcpudp xt_conntrack iptable_mangle
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
nf_conntrack iptable_raw iptable_filter ip_tables x_tables w83627hf
hwmon_vid loop iTCO_wdt iTCO_vendor_support evdev snd_pcm snd_page_alloc
snd_timer snd soundcore acpi_cpufreq mperf processor pcspkr serio_raw
drm_kms_helper lpc_ich i2c_i801 of_i2c drm rng_core thermal_sys
i2c_algo_bit ehci_pci i2c_core ext4 crc16 jbd2 mbcache dm_mod sg sd_mod
crc_t10dif ata_generic ata_piix uhci_hcd ehci_hcd libata microcode
scsi_mod skge sky2 usbcore usb_common
[486832.953431] Pid: 0, comm: swapper Not tainted 3.9.4-1-bootc #1
[486832.953431] EIP: 0060:[] EFLAGS: 00210246 CPU: 0
[486832.953431] EIP is at xfrm_output_resume+0x61/0x29f
[486832.953431] EAX:  EBX: f3fbc100 ECX: f77f1288 EDX: f6130200
[486832.953431] ESI: 0016 EDI:  EBP: f70b3c00 ESP: c1407c44
[486832.953431]  DS: 007b ES: 007b FS:  GS: 00e0 SS: 0068
[486832.953431] CR0: 8005003b CR2: 0010 CR3: 37247000 CR4: 07d0
[486832.953431] DR0:  DR1:  DR2:  DR3: 
[486832.953431] DR6: 0ff0 DR7: 0400
[486832.953431] Process swapper (pid: 0, ti=c1406000 task=c1413490
task.ti=c1406000)
[486832.953431] Stack:
[486832.953431]  c129d44f 8000 0002 c1457254 f3fbc100 c129d44f
 0008
[486832.953431]  c129d49e  f4524000 c129d44f 8000 
f3fbc100 c1268b49
[486832.953431]  f3fbc100 f127604e c12678c3  f7123000 c1267685
c1457f8c c1456b80
[486832.953431] Call Trace:
[486832.953431]  [] ? xfrm4_extract_output+0x94/0x94
[486832.953431]  [] ? xfrm4_extract_output+0x94/0x94
[486832.953431]  [] ? xfrm4_output+0x2c/0x6a
[486832.953431]  [] ? xfrm4_extract_output+0x94/0x94
[486832.953431]  [] ? ip_forward_finish+0x59/0x5c
[486832.953431]  [] ? ip_rcv_finish+0x23e/0x274
[486832.953431]  [] ? pskb_may_pull+0x2d/0x2d
[486832.953431]  [] ? __netif_receive_skb_core+0x39d/0x406
[486832.953431]  [] ? br_handle_frame_finish+0x22c/0x264 [bridge]
[486832.953431]  [] ? process_backlog+0xd0/0xd0
[486832.953431]  [] ? br_handle_local_finish+0x4d/0x4d [bridge]
[486832.953431]  [] ? NF_HOOK_THRESH+0x1d/0x4c [bridge]
[486832.953431]  [] ? br_handle_local_finish+0x4d/0x4d [bridge]
[486832.953431]  [] ? br_nf_pre_routing_finish+0x1c8/0x1d2
[bridge]
[486832.953431]  [] ? br_handle_local_finish+0x4d/0x4d [bridge]
[486832.953431]  [] ? nf_hook_slow+0x52/0xed
[486832.953431]  [] ? nf_bridge_alloc.isra.18+0x32/0x32 [bridge]
[486832.953431]  [] ? nf_bridge_alloc.isra.18+0x32/0x32 [bridge]
[486832.953431]  [] ? NF_HOOK_THRESH+0x1d/0x4c [bridge]
[486832.953431]  [] ? 

Re: [PATCH] fs: Disable preempt when acquire i_size_seqcount write lock

2013-01-10 Thread Fan Du



On 2013-01-11 06:38, Andrew Morton wrote:

On Wed, 9 Jan 2013 11:34:19 +0800
Fan Du  wrote:


Two rt tasks are bound to one CPU core.

A lower priority rt task A has already taken the write seq lock; a higher
priority rt task B then preempts it and tries to acquire the read seq lock.
It's doomed to lock up.

rt task A with lower priority: call i_size_write    rt task B with higher priority: call sync, and preempt task A
  write_seqcount_begin(&inode->i_size_seqcount);
  inode->i_size = i_size;                            i_size_read
                                                     read_seqcount_begin  <-- lockup here...



Ouch.

And even if the preempting task is preemptible, it will spend an entire
timeslice pointlessly spinning, which isn't very good.


So disabling preemption when acquiring every i_size_seqcount *write* lock
will cure the problem.

...

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -758,9 +758,11 @@ static inline loff_t i_size_read(const struct inode *inode)
  static inline void i_size_write(struct inode *inode, loff_t i_size)
  {
  #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+	preempt_disable();
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
+	preempt_enable();
  #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
	preempt_disable();
	inode->i_size = i_size;


afacit all write_seqcount_begin()/read_seqretry() sites are vulnerable
to this problem.  Would it not be better to do the preempt_disable() in
write_seqcount_begin()?


IMHO, write_seqcount_begin()/write_seqcount_end() are often wrapped by a
mutex; this gives the higher priority task a chance to sleep, and the lower
priority task then gets the CPU to unlock, which avoids the problematic
scenario this patch describes.

But in the i_size_write() case, I can only see disabling preemption as a
good choice, unless someone else has a better idea :)
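
For reference, a minimal sketch of the alternative Andrew is asking about
(hypothetical; no such patch was posted): fold the preemption disable into the
seqcount write side itself, so every writer is covered:

static inline void write_seqcount_begin_nopreempt(seqcount_t *s)
{
	preempt_disable();
	write_seqcount_begin(s);
}

static inline void write_seqcount_end_nopreempt(seqcount_t *s)
{
	write_seqcount_end(s);
	preempt_enable();
}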



Possible problems:

- mm/filemap_xip.c does disk I/O under write_seqcount_begin().

- dev_change_name() does GFP_KERNEL allocations under write_seqcount_begin()

- I didn't review u64_stats_update_begin() callers.

But I think calling schedule() under preempt_disable() is OK anyway?



--
浮沉随浪只记今朝笑

--fan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] fs: Disable preempt when acquire i_size_seqcount write lock

2013-01-08 Thread Fan Du
Two rt tasks are bound to one CPU core.

A lower priority rt task A has already taken the write seq lock; a higher
priority rt task B then preempts it and tries to acquire the read seq lock.
It's doomed to lock up.

rt task A with lower priority: call i_size_write    rt task B with higher priority: call sync, and preempt task A
  write_seqcount_begin(&inode->i_size_seqcount);
  inode->i_size = i_size;                            i_size_read
                                                     read_seqcount_begin  <-- lockup here...

So disabling preemption when acquiring every i_size_seqcount *write* lock
will cure the problem.

Signed-off-by: Fan Du 
---
 include/linux/fs.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index db84f77..1b69e87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -758,9 +758,11 @@ static inline loff_t i_size_read(const struct inode *inode)
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+   preempt_disable();
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
+   preempt_enable();
 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
preempt_disable();
inode->i_size = i_size;
-- 
1.7.0.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Disable preempt when acquire i_size_seqcount write lock

2012-12-21 Thread Fan Du
Two rt tasks are bound to one CPU core.

The higher priority rt task preempts a lower priority rt task which
has already taken the write seq lock, and then the higher priority
task tries to acquire the read seq lock; it's doomed to lock up.

rt task A (lower priority):             rt task B (higher priority):
calls i_size_write                      calls sync, preempting task A
  write_seqcount_begin(&inode->i_size_seqcount);
  inode->i_size = i_size;                 i_size_read
                                            read_seqcount_begin <-- lockup here...

So disabling preemption when taking every i_size_seqcount *write* lock
will cure the problem.

Signed-off-by: Fan Du 
---
 include/linux/fs.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index db84f77..1b69e87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -758,9 +758,11 @@ static inline loff_t i_size_read(const struct inode *inode)
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+   preempt_disable();
	write_seqcount_begin(&inode->i_size_seqcount);
	inode->i_size = i_size;
	write_seqcount_end(&inode->i_size_seqcount);
+   preempt_enable();
 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
preempt_disable();
inode->i_size = i_size;
-- 
1.7.0.5


Re: suspicious RCU usage in xfrm_net_init()

2012-08-16 Thread Fan Du



On 2012-08-17 09:34, Paul Gortmaker wrote:

On Thu, Aug 16, 2012 at 9:07 PM, Fan Du  wrote:



On 2012-08-16 23:19, Fengguang Wu wrote:


Hi Fan,

On Thu, Aug 16, 2012 at 05:36:35PM +0800, Fan Du wrote:



Hi, Fengguang

Could you please try the patch below and see if the spewing is still there?
thanks



Yes, it worked, thank you very much!



Hi, Dave

Could you please pick up this patch?


Please do not make extra work for maintainers by sending attachments,
or requests for status/merge etc.  Your 1st patch had to be manually
set to an RFC, and now you add another patch less than 24h later.

Please see:

   http://patchwork.ozlabs.org/patch/177934/
   http://patchwork.ozlabs.org/patch/178132/

Also, a patch should describe the problem it solves (i.e. the symptom
the end user sees), and how the problem originated, and why the fix
in the patch is the _right_ fix.  The worst description a commit log
can have is one that just describes the C change in words, since
most people can read C on their own.

Here you add "_bh" in the code and then repeat exactly that in
the commit log.  Your commit log does not tell me when it broke,
or why it broke, or who had their use case broken.  Can you see
why this is not acceptable?

Please take the time to look at the traffic in netdev, and read
the feedback given by maintainers on other patches, so that the
common errors are understood by you, and not repeated.  It
will be time well spent!



Rick Jones has already informed me off-list about the etiquette,
which I just broke.

Anyway, thanks for your time writing those suggestions.


Thanks,
Paul.
---


thanks





btw, your email client wraps long lines..


Oh, I will definitely fix this.
thanks Fengguang for the testing :)




Thanks,
Fengguang


   From a3f86ecc3ee16ff81d49416bbf791780422988b3 Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Thu, 16 Aug 2012 17:31:25 +0800
Subject: [PATCH] Use rcu_dereference_bh to dereference pointer
protected by rcu_read_lock_bh

Signed-off-by: Fan Du
---
   net/xfrm/xfrm_policy.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5ad4d2c..75a9d6a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2501,7 +2501,7 @@ static void __net_init
xfrm_dst_ops_init(struct net *net)
 struct xfrm_policy_afinfo *afinfo;

 rcu_read_lock_bh();
-   afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]);
+   afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]);
 if (afinfo)
 net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops;
   #if IS_ENABLED(CONFIG_IPV6)
--
1.7.1




On 2012-08-16 15:37, Fengguang Wu wrote:


Hi Priyanka,

The below warning shows up, probably related to this commit:

418a99ac6ad487dc9c42e6b0e85f941af56330f2 Replace rwlock on
xfrm_policy_afinfo with rcu

[0.921216]
[0.921645] ===
[0.922766] [ INFO: suspicious RCU usage. ]
[0.923887] 3.5.0-01540-g1669891 #64 Not tainted
[0.925123] ---
[0.932860] /c/kernel-tests/src/tip/net/xfrm/xfrm_policy.c:2504
suspicious rcu_dereference_check() usage!
[0.935361]
[0.935361] other info that might help us debug this:
[0.935361]
[0.937472]
[0.937472] rcu_scheduler_active = 1, debug_locks = 0
[0.939182] 2 locks held by swapper/1:
[0.940171]  #0:  (net_mutex){+.+.+.}, at: []
register_pernet_subsys+0x21/0x57
[0.942705]  #1:  (rcu_read_lock_bh){..}, at:
[] xfrm_net_init+0x1e4/0x437
[0.951507]
[0.951507] stack backtrace:
[0.952660] Pid: 1, comm: swapper Not tainted 3.5.0-01540-g1669891
#64
[0.954364] Call Trace:
[0.955074]  [] lockdep_rcu_suspicious+0x174/0x187
[0.956736]  [] xfrm_net_init+0x30e/0x437
[0.958205]  [] ? xfrm_net_init+0x1e4/0x437
[0.959712]  [] ops_init+0x1bb/0x1ff
[0.961067]  [] ? trace_hardirqs_on+0x1b/0x24
[0.962644]  []
register_pernet_operations.isra.5+0x9d/0xfe
[0.971376]  [] register_pernet_subsys+0x30/0x57
[0.972992]  [] xfrm_init+0x17/0x2c
[0.974316]  [] ip_rt_init+0x82/0xe7
[0.975668]  [] ip_init+0x10/0x25
[0.976952]  [] inet_init+0x235/0x360
[0.978352]  [] ? devinet_init+0xf2/0xf2
[0.979808]  [] do_one_initcall+0xb4/0x203
[0.981313]  [] kernel_init+0x1a9/0x29a
[0.982732]  [] ? loglevel+0x46/0x46
[0.990889]  [] kernel_thread_helper+0x4/0x10
[0.992472]  [] ? retint_restore_args+0x13/0x13
[0.994076]  [] ? do_one_initcall+0x203/0x203
[0.995636]  [] ? gs_change+0x13/0x13
[0.997197] TCP established hash table entries: 8192 (order: 5,
131072 bytes)
[1.74] TCP bind hash table entries: 8192 (order: 7, 655360
bytes)

Thanks,
Fengguang



--

Love each day!
--fan





--

Love each day!
--fan




--

Love each day!
--fan

Re: suspicious RCU usage in xfrm_net_init()

2012-08-16 Thread Fan Du



On 2012-08-16 23:19, Fengguang Wu wrote:

Hi Fan,

On Thu, Aug 16, 2012 at 05:36:35PM +0800, Fan Du wrote:


Hi, Fengguang

Could you please try the patch below and see if the spewing is still there?
thanks


Yes, it worked, thank you very much!



Hi, Dave

Could you please pick up this patch?
thanks




btw, your email client wraps long lines..


Oh, I will definitely fix this.
thanks Fengguang for the testing :)



Thanks,
Fengguang


 From a3f86ecc3ee16ff81d49416bbf791780422988b3 Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Thu, 16 Aug 2012 17:31:25 +0800
Subject: [PATCH] Use rcu_dereference_bh to dereference pointer
protected by rcu_read_lock_bh

Signed-off-by: Fan Du
---
  net/xfrm/xfrm_policy.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5ad4d2c..75a9d6a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2501,7 +2501,7 @@ static void __net_init
xfrm_dst_ops_init(struct net *net)
struct xfrm_policy_afinfo *afinfo;

rcu_read_lock_bh();
-   afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]);
+   afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]);
if (afinfo)
net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops;
  #if IS_ENABLED(CONFIG_IPV6)
--
1.7.1




On 2012-08-16 15:37, Fengguang Wu wrote:

Hi Priyanka,

The below warning shows up, probably related to this commit:

418a99ac6ad487dc9c42e6b0e85f941af56330f2 Replace rwlock on xfrm_policy_afinfo 
with rcu

[0.921216]
[0.921645] ===
[0.922766] [ INFO: suspicious RCU usage. ]
[0.923887] 3.5.0-01540-g1669891 #64 Not tainted
[0.925123] ---
[0.932860] /c/kernel-tests/src/tip/net/xfrm/xfrm_policy.c:2504 suspicious 
rcu_dereference_check() usage!
[0.935361]
[0.935361] other info that might help us debug this:
[0.935361]
[0.937472]
[0.937472] rcu_scheduler_active = 1, debug_locks = 0
[0.939182] 2 locks held by swapper/1:
[0.940171]  #0:  (net_mutex){+.+.+.}, at: [] 
register_pernet_subsys+0x21/0x57
[0.942705]  #1:  (rcu_read_lock_bh){..}, at: [] 
xfrm_net_init+0x1e4/0x437
[0.951507]
[0.951507] stack backtrace:
[0.952660] Pid: 1, comm: swapper Not tainted 3.5.0-01540-g1669891 #64
[0.954364] Call Trace:
[0.955074]  [] lockdep_rcu_suspicious+0x174/0x187
[0.956736]  [] xfrm_net_init+0x30e/0x437
[0.958205]  [] ? xfrm_net_init+0x1e4/0x437
[0.959712]  [] ops_init+0x1bb/0x1ff
[0.961067]  [] ? trace_hardirqs_on+0x1b/0x24
[0.962644]  [] register_pernet_operations.isra.5+0x9d/0xfe
[0.971376]  [] register_pernet_subsys+0x30/0x57
[0.972992]  [] xfrm_init+0x17/0x2c
[0.974316]  [] ip_rt_init+0x82/0xe7
[0.975668]  [] ip_init+0x10/0x25
[0.976952]  [] inet_init+0x235/0x360
[0.978352]  [] ? devinet_init+0xf2/0xf2
[0.979808]  [] do_one_initcall+0xb4/0x203
[0.981313]  [] kernel_init+0x1a9/0x29a
[0.982732]  [] ? loglevel+0x46/0x46
[0.990889]  [] kernel_thread_helper+0x4/0x10
[0.992472]  [] ? retint_restore_args+0x13/0x13
[0.994076]  [] ? do_one_initcall+0x203/0x203
[0.995636]  [] ? gs_change+0x13/0x13
[0.997197] TCP established hash table entries: 8192 (order: 5, 131072 bytes)
[1.74] TCP bind hash table entries: 8192 (order: 7, 655360 bytes)

Thanks,
Fengguang


--

Love each day!
--fan




--

Love each day!
--fan
From f3b4d84f7ca2ed235c8c8ae5186f45e20b9b80db Mon Sep 17 00:00:00 2001
From: Fan Du 
Date: Thu, 16 Aug 2012 17:51:25 +0800
Subject: [PATCH] Use rcu_dereference_bh to dereference pointer protected by rcu_read_lock_bh

Signed-off-by: Fan Du 
---
 net/xfrm/xfrm_policy.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5ad4d2c..6405764 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2501,11 +2501,11 @@ static void __net_init xfrm_dst_ops_init(struct net *net)
 	struct xfrm_policy_afinfo *afinfo;
 
 	rcu_read_lock_bh();
-	afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]);
+	afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]);
 	if (afinfo)
 		net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops;
 #if IS_ENABLED(CONFIG_IPV6)
-	afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET6]);
+	afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET6]);
 	if (afinfo)
 		net->xfrm.xfrm6_dst_ops = *afinfo->dst_ops;
 #endif
-- 
1.7.1
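
For context, the lockdep rule the fix satisfies, as a minimal hedged
sketch ('gp', 'struct foo' and 'consume()' are hypothetical):
rcu_dereference() asserts rcu_read_lock_held(), which is false when only
the BH-flavored read lock is taken, hence the splat; rcu_dereference_bh()
asserts rcu_read_lock_bh_held() instead:

struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

static void reader(void)
{
	struct foo *p;

	rcu_read_lock_bh();
	p = rcu_dereference_bh(gp);	/* pairs with rcu_read_lock_bh(): no warning */
	/* rcu_dereference(gp) here would trip the "suspicious
	 * rcu_dereference_check() usage" warning quoted above. */
	if (p)
		consume(p);		/* hypothetical consumer */
	rcu_read_unlock_bh();
}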



Re: suspicious RCU usage in xfrm_net_init()

2012-08-16 Thread Fan Du


Hi, Fengguang

Could you please try the patch below and see if the spewing is still there?
thanks


From a3f86ecc3ee16ff81d49416bbf791780422988b3 Mon Sep 17 00:00:00 2001
From: Fan Du 
Date: Thu, 16 Aug 2012 17:31:25 +0800
Subject: [PATCH] Use rcu_dereference_bh to dereference pointer protected
by rcu_read_lock_bh


Signed-off-by: Fan Du 
---
 net/xfrm/xfrm_policy.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5ad4d2c..75a9d6a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2501,7 +2501,7 @@ static void __net_init xfrm_dst_ops_init(struct 
net *net)

struct xfrm_policy_afinfo *afinfo;

rcu_read_lock_bh();
-   afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]);
+   afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]);
if (afinfo)
net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops;
 #if IS_ENABLED(CONFIG_IPV6)
--
1.7.1




On 2012-08-16 15:37, Fengguang Wu wrote:

Hi Priyanka,

The below warning shows up, probably related to this commit:

418a99ac6ad487dc9c42e6b0e85f941af56330f2 Replace rwlock on xfrm_policy_afinfo 
with rcu

[0.921216]
[0.921645] ===
[0.922766] [ INFO: suspicious RCU usage. ]
[0.923887] 3.5.0-01540-g1669891 #64 Not tainted
[0.925123] ---
[0.932860] /c/kernel-tests/src/tip/net/xfrm/xfrm_policy.c:2504 suspicious 
rcu_dereference_check() usage!
[0.935361]
[0.935361] other info that might help us debug this:
[0.935361]
[0.937472]
[0.937472] rcu_scheduler_active = 1, debug_locks = 0
[0.939182] 2 locks held by swapper/1:
[0.940171]  #0:  (net_mutex){+.+.+.}, at: [] 
register_pernet_subsys+0x21/0x57
[0.942705]  #1:  (rcu_read_lock_bh){..}, at: [] 
xfrm_net_init+0x1e4/0x437
[0.951507]
[0.951507] stack backtrace:
[0.952660] Pid: 1, comm: swapper Not tainted 3.5.0-01540-g1669891 #64
[0.954364] Call Trace:
[0.955074]  [] lockdep_rcu_suspicious+0x174/0x187
[0.956736]  [] xfrm_net_init+0x30e/0x437
[0.958205]  [] ? xfrm_net_init+0x1e4/0x437
[0.959712]  [] ops_init+0x1bb/0x1ff
[0.961067]  [] ? trace_hardirqs_on+0x1b/0x24
[0.962644]  [] register_pernet_operations.isra.5+0x9d/0xfe
[0.971376]  [] register_pernet_subsys+0x30/0x57
[0.972992]  [] xfrm_init+0x17/0x2c
[0.974316]  [] ip_rt_init+0x82/0xe7
[0.975668]  [] ip_init+0x10/0x25
[0.976952]  [] inet_init+0x235/0x360
[0.978352]  [] ? devinet_init+0xf2/0xf2
[0.979808]  [] do_one_initcall+0xb4/0x203
[0.981313]  [] kernel_init+0x1a9/0x29a
[0.982732]  [] ? loglevel+0x46/0x46
[0.990889]  [] kernel_thread_helper+0x4/0x10
[0.992472]  [] ? retint_restore_args+0x13/0x13
[0.994076]  [] ? do_one_initcall+0x203/0x203
[0.995636]  [] ? gs_change+0x13/0x13
[0.997197] TCP established hash table entries: 8192 (order: 5, 131072 bytes)
[1.74] TCP bind hash table entries: 8192 (order: 7, 655360 bytes)

Thanks,
Fengguang


--

Love each day!
--fan

Re: [PATCH] MIPS: oops when show backtrace of all active cpu

2012-08-01 Thread Fan Du



On 2012-08-02 12:18, Kevin Cernekee wrote:

On Wed, Aug 1, 2012 at 8:31 PM, Fan Du  wrote:

show_backtrace must have a valid task when calling unwind_stack,
so fix it by checking first.

[...]

--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -151,6 +151,10 @@ static void show_backtrace(struct task_struct *task, const 
struct pt_regs *regs)
 show_raw_backtrace(sp);
 return;
 }
+
+   if (task == NULL)
+   task = current;
+
 printk("Call Trace:\n");
 do {
 print_ip_sym(pc);


FYI, a slightly different version of this change was accepted:

https://patchwork.linux-mips.org/patch/3524/




Oh, looks like I'm late :)
thanks anyway.


--

Love each day!
--fan


[PATCH] MIPS: oops when show backtrace of all active cpu

2012-08-01 Thread Fan Du
show_backtrace must have a valid task when calling unwind_stack,
so fix it by checking first.

root@romashell:/root> echo l > /proc/sysrq-trigger
SysRq : Show backtrace of all active CPUs
Stack : 81bf 81ce 81bf 0036
81bf 0003 a8041b60b850 811dfa8c
4000 a8041fc0c060 a8041b60b870 8124c42c
0038 811b8efc a80003abbc00 81be0980
81be4200 0038 81ce 
81b92c40 a8041b60b8c0 81b92c40 8157b0d8
 4000 4000 
81be0990 a8041b60b810 a8041b60b910 8157b0e4
a8041b60b958 a80003b9b358 0001 81d7
 811a1ac8 a8041b60b990 8157b0d8
...
Call Trace:
[] show_stack+0xd8/0x100
CPU 14 Unable to handle kernel paging request at virtual address 
0008, epc == 8119c6a4, ra == 811a0664
Oops[#1]:
Cpu 14
$ 0   :  811a0664 002e 00020003
$ 4   :  a8041b60b770 811a1ac8 a8041b60b778
$ 8   :  3fff 282d 0001
$12   : 811a19a8 101f a8041fe68000 
$16   :  81b5ae70 a8041b60b770 811a1ac8
$20   : a8041b60b810  81b5ae98 00200200
$24   :  2b6e8c94
$28   : a8041b608000 a8041b60b700 a8041b60b700 811a0664
Hi: 0002
Lo: bf7889a5a000
epc   : 8119c6a4 unwind_stack+0x2c/0x180
Not tainted
ra: 811a0664 show_backtrace+0x15c/0x178
Status: 10008ce2KX SX UX KERNEL EXL
Cause : 40808008
BadVA : 0008
PrId  : 000d030b (Cavium Octeon+)
Modules linked in:
Process klogd (pid: 1163, threadinfo=a8041b608000, task=a8041c271600, 
tls=2aed3f70)
Stack :   81b5ae70 a8041b60b778
a8041b60b770 811a0650 a8041b60b740 811a1ac8
811a1ac8  81b5ae70 a8041b60b778
a8041b60b770 811a0664 a8041b60b810 8157b0e4
 a8041b60b950 0028 
81b5abe0 a8041b60b810 a8041b60b7c0 811a075c
a8041b60b7d0   
4000 4000  81be0990
a8041b60b810 811a1a30 81bf 81ce
81bf 0036 81bf 0003
...
Call Trace:
[] unwind_stack+0x2c/0x180
[] show_backtrace+0x15c/0x178
[] show_stacktrace+0xdc/0x138
[] show_stack+0x40/0x100
[] showacpu+0x84/0x98
[] generic_smp_call_function_interrupt+0x174/0x250
[] smp_call_function_interrupt+0x34/0x50
[] mailbox_interrupt+0x78/0x80
[] handle_IRQ_event+0x9c/0x368
[] handle_percpu_irq+0x70/0xd0
[] do_IRQ+0x40/0x60
[] native_plat_irq_dispatch+0xf8/0x188
[] plat_irq_dispatch+0x24/0x38
[] ret_from_irq+0x0/0x4
[] _raw_spin_unlock_irq+0x38/0x70
[] do_syslog+0x408/0x590
[] kmsg_read+0x48/0x98
[] proc_reg_read+0xa0/0xf0
[] vfs_read+0xcc/0x160
[] SyS_read+0x6c/0x168
[] handle_sysn32+0x44/0x84

Code:     00a0902d  122d  00c0882d  3c02811a  
64423e60  10c20033
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Fan Du 
---
 arch/mips/kernel/traps.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index b25b921..99c6611 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -151,6 +151,10 @@ static void show_backtrace(struct task_struct *task, const 
struct pt_regs *regs)
show_raw_backtrace(sp);
return;
}
+
+   if (task == NULL)
+   task = current;
+
printk("Call Trace:\n");
do {
print_ip_sym(pc);
-- 
1.7.0.5
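
For context, a hedged sketch of how show_backtrace() is reached with
task == NULL on this path, simplified from the sysrq 'l' handler of that
era (drivers/tty/sysrq.c; locking and idle-CPU checks omitted):

static void showacpu(void *dummy)
{
	/* sysrq 'l' runs this on each CPU and dumps the *current*
	 * context, so no task pointer is passed down the stack. */
	printk(KERN_INFO "CPU%d:\n", smp_processor_id());
	show_stack(NULL, NULL);	/* NULL task flows into show_backtrace() */
}

The accepted variant linked in the reply above apparently takes the same
NULL-to-current approach, just at a different spot in the call chain.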
