Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-08-11 Thread Bharata B Rao


On 11-Aug-23 11:56 AM, Huang, Ying wrote:
> Hi, Rao,
> 
> Bharata B Rao  writes:
> 
>> On 24-Jul-23 11:28 PM, Andrew Morton wrote:
>>> On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple  
>>> wrote:
>>>
>>>> Thanks for this Huang, I had been hoping to take a look at it this week
>>>> but have run out of time. I'm keen to do some testing with it as well.
>>>
>>> Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
>>> review and testing would be appreciated.
>>
>> I gave this series a try on a 2P system with 2 CXL cards. I don't trust the
>> bandwidth and latency numbers reported by HMAT here, but FWIW, this patchset
>> puts the CXL nodes on a lower tier than DRAM nodes.
> 
> Thank you very much!
> 
> Can I add your "Tested-by" for the series?

Yes if the above test qualifies for it, please go ahead.

Regards,
Bharata.



Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

2023-07-31 Thread Bharata B Rao
On 24-Jul-23 11:28 PM, Andrew Morton wrote:
> On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple  wrote:
> 
>> Thanks for this Huang, I had been hoping to take a look at it this week
>> but have run out of time. I'm keen to do some testing with it as well.
> 
> Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
> review and testing would be appreciated.

I gave this series a try on a 2P system with 2 CXL cards. I don't trust the
bandwidth and latency numbers reported by HMAT here, but FWIW, this patchset
puts the CXL nodes on a lower tier than DRAM nodes.

Regards,
Bharata.




Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-15 Thread Bharata B Rao
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> 
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> Avoiding memory allocation in list_lru_add() was one of the main
> reasons for up-front static allocation of memcg lists. We cannot do
> memory allocation while callers are holding multiple spinlocks in
> core system algorithms (e.g. dentry_kill -> retain_dentry ->
> d_lru_add -> list_lru_add), let alone while holding an internal
> spinlock.
> 
> Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> path we know might have memory demand in the *hundreds of GB* range
> gets an NACK from me. It's a great idea, but it's just not a

I do understand that GFP_ATOMIC allocations are really not preferable
but want to point out that the allocations in the range of hundreds of
GBs get reduced to tens of MBs when we do lazy list_lru head allocations
under GFP_ATOMIC.

As shown earlier, this is what I see in my experimental setup with
10k containers:

Number of kmalloc-32 allocations
                Before      During          After
W/o patch       178176      3442409472      388933632
W/  patch       190464      468992          468992

So 3442409472*32=102GB upfront list_lru creation-time GFP_KERNEL allocations
get reduced to 468992*32=14MB dynamic list_lru addition-time GFP_ATOMIC
allocations.

This really depends on the type of the container and
the number of mounts it does, but I suspect we are looking
at GFP_ATOMIC allocations in the MB range. Also, the number of
GFP_ATOMIC slab allocation requests matters, I suppose.

There are other users of list_lru, but I was just looking at
the dentry and inode list_lru use cases. It appears to me that for both
dentry and inode, we can tolerate the failure from list_lru_add
due to GFP_ATOMIC allocation failure. The failure to add dentry
or inode to the lru list means that they won't be retained in
the lru list, but would be freed immediately. Is this understanding
correct?

If so, would that likely impact the subsequent lookups adversely?
We failed to retain a dentry or inode in the lru list because
we failed to allocate memory, presumably under memory pressure.
Even in such a scenario, is failure to add dentry or inode to
lru list so bad and not tolerable? 

Regards,
Bharata.


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-15 Thread Bharata B Rao
On Thu, Apr 15, 2021 at 08:54:43AM +0200, Michal Hocko wrote:
> On Thu 15-04-21 10:53:00, Bharata B Rao wrote:
> > On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > > 
> > > Another approach may be to identify filesystem types that do not
> > > need memcg awareness and feed that into alloc_super() to set/clear
> > > the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> > > virtual filesystems that expose system information do not really
> > > need full memcg awareness because they are generally only visible to
> > > a single memcg instance...
> > 
> > Would something like below be appropriate?
> 
> No. First of all you are defining yet another way to say
> SHRINKER_MEMCG_AWARE which is messy.

Ok.

> And secondly why would shmem, proc
> and ramfs be any special and they would be ok to opt out? There is no
> single word about that reasoning in your changelog.

Right, I am only checking whether this is indeed what David suggested
(see above). There are a few other things to take care of,
which I shall do if the overall direction of the patch turns
out to be acceptable.

Regards,
Bharata.


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-14 Thread Bharata B Rao
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> 
> Another approach may be to identify filesystem types that do not
> need memcg awareness and feed that into alloc_super() to set/clear
> the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> virtual filesystems that expose system information do not really
> need full memcg awareness because they are generally only visible to
> a single memcg instance...

Would something like below be appropriate?

From f314083ad69fde2a420a1b74febd6d3f7a25085f Mon Sep 17 00:00:00 2001
From: Bharata B Rao 
Date: Wed, 14 Apr 2021 11:21:24 +0530
Subject: [PATCH 1/1] fs: Let filesystems opt out of memcg awareness

All filesystem mounts by default are memcg aware and hence
end up creating shrinker list_lrus for all the memcgs. Due to
the way memcg_nr_cache_ids grows and the list_lru heads are
allocated for all memcgs, a huge amount of memory gets consumed
by the kmalloc-32 slab cache when running thousands of containers.

Improve this situation by allowing filesystems to opt out
of memcg awareness. In this patch, tmpfs, proc and ramfs
opt out of memcg awareness. This leads to considerable memory
savings when running 10k containers.

Signed-off-by: Bharata B Rao 
---
 fs/proc/root.c |  1 +
 fs/ramfs/inode.c   |  1 +
 fs/super.c | 27 +++
 include/linux/fs_context.h |  2 ++
 mm/shmem.c |  1 +
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index c7e3b1350ef8..7856bc2ca9f4 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -257,6 +257,7 @@ static int proc_init_fs_context(struct fs_context *fc)
fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
fc->fs_private = ctx;
+   fc->ops = &proc_fs_context_ops;
+   fc->memcg_optout = true;
return 0;
 }
 
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 9ebd17d7befb..576a88bb7407 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -278,6 +278,7 @@ int ramfs_init_fs_context(struct fs_context *fc)
fsi->mount_opts.mode = RAMFS_DEFAULT_MODE;
fc->s_fs_info = fsi;
+   fc->ops = &ramfs_context_ops;
+   fc->memcg_optout = true;
return 0;
 }
 
diff --git a/fs/super.c b/fs/super.c
index 8c1baca35c16..59aa22c678e6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -198,7 +198,8 @@ static void destroy_unused_super(struct super_block *s)
  * returns a pointer new superblock or %NULL if allocation had failed.
  */
 static struct super_block *alloc_super(struct file_system_type *type, int flags,
-  struct user_namespace *user_ns)
+  struct user_namespace *user_ns,
+  bool memcg_optout)
 {
struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
static const struct super_operations default_op;
@@ -266,13 +267,22 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_shrink.scan_objects = super_cache_scan;
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
-   s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+   s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+   if (!memcg_optout)
+   s->s_shrink.flags |= SHRINKER_MEMCG_AWARE;
if (prealloc_shrinker(&s->s_shrink))
goto fail;
-   if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
-   goto fail;
-   if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
-   goto fail;
+   if (memcg_optout) {
+   if (list_lru_init(&s->s_dentry_lru))
+   goto fail;
+   if (list_lru_init(&s->s_inode_lru))
+   goto fail;
+   } else {
+   if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
+   goto fail;
+   if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
+   goto fail;
+   }
return s;
 
 fail:
@@ -527,7 +537,8 @@ struct super_block *sget_fc(struct fs_context *fc,
}
if (!s) {
spin_unlock(&sb_lock);
-   s = alloc_super(fc->fs_type, fc->sb_flags, user_ns);
+   s = alloc_super(fc->fs_type, fc->sb_flags, user_ns,
+   fc->memcg_optout);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -610,7 +621,7 @@ struct super_block *sget(struct file_system_type *type,
}
if (!s) {
spin_unlock(&sb_lock);
-   s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+   s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns, false);
if (!s)
return ERR_P

Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-07 Thread Bharata B Rao
On Wed, Apr 07, 2021 at 01:54:48PM +0200, Michal Hocko wrote:
> On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > Hi,
> > 
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> 
> Is this 10k cgroups a testing enviroment or does anybody really use that
> in production? I would be really curious to hear how that behaves when
> those containers are not idle. E.g. global memory reclaim iterating over
> 10k memcgs will likely be very visible. I do remember playing with
> similar setups few years back and the overhead was very high.

This 10k containers is only a test scenario that we are looking at.

Regards,
Bharata.


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-07 Thread Bharata B Rao
On Wed, Apr 07, 2021 at 01:07:27PM +0300, Kirill Tkhai wrote:
> > Here is how the calculation turns out to be in my setup:
> > 
> > Number of possible NUMA nodes = 2
> > Number of mounts per container = 7 (Check below to see which are these)
> > Number of list creation requests per mount = 2
> > Number of containers = 10000
> > memcg_nr_cache_ids for 10k containers = 12286
> 
> Luckily, we have "+1" in memcg_nr_cache_ids formula: size = 2 * (id + 1).
> In case of we only multiplied it, you would have to had 
> memcg_nr_cache_ids=20000.

Not really, it would grow like this for size = 2 * id

id 0 size 4
id 4 size 8
id 8 size 16
id 16 size 32
id 32 size 64
id 64 size 128
id 128 size 256
id 256 size 512
id 512 size 1024
id 1024 size 2048
id 2048 size 4096
id 4096 size 8192
id 8192 size 16384

Currently (size = 2 * (id + 1)), it grows like this:

id 0 size 4
id 4 size 10
id 10 size 22
id 22 size 46
id 46 size 94
id 94 size 190
id 190 size 382
id 382 size 766
id 766 size 1534
id 1534 size 3070
id 3070 size 6142
id 6142 size 12286

> 
> Maybe, we need change that formula to increase memcg_nr_cache_ids more 
> accurate
> for further growths of containers number. Say,
> 
> size = id < 2000 ? 2 * (id + 1) : id + 2000

For the above, it would only be marginally better like this:

id 0 size 4
id 4 size 10
id 10 size 22
id 22 size 46
id 46 size 94
id 94 size 190
id 190 size 382
id 382 size 766
id 766 size 1534
id 1534 size 3070
id 3070 size 5070
id 5070 size 7070
id 7070 size 9070
id 9070 size 11070

All the above numbers are for 10k memcgs.
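
For reference, the sequences above can be reproduced with a small userspace
sketch (plain C, not kernel code) that mimics the growth step of
memcg_alloc_cache_id(): the size starts at 4 (MEMCG_CACHES_MIN_SIZE) and is
bumped by the given formula whenever a newly allocated id reaches it.

#include <stdio.h>

/* Simulate memcg_nr_cache_ids growth as ids 0..max_id get allocated. */
static void simulate(const char *name, int (*grow)(int id), int max_id)
{
	int size = 4;

	printf("%s:\n  id 0 size %d\n", name, size);
	for (int id = 1; id <= max_id; id++) {
		if (id >= size) {
			size = grow(id);
			printf("  id %d size %d\n", id, size);
		}
	}
}

static int grow_mul(int id) { return 2 * id; }		/* size = 2 * id */
static int grow_cur(int id) { return 2 * (id + 1); }	/* current formula */
static int grow_mix(int id) { return id < 2000 ? 2 * (id + 1) : id + 2000; }

int main(void)
{
	simulate("size = 2 * id", grow_mul, 10000);
	simulate("size = 2 * (id + 1)", grow_cur, 10000);
	simulate("size = id < 2000 ? 2 * (id + 1) : id + 2000", grow_mix, 10000);
	return 0;
}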

Regards,
Bharata.


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-06 Thread Bharata B Rao
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> > Hi,
> > 
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> > 
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> > 
> > If,
> > 
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> > 
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> > 
> > Following factors contribute to the excessive allocations:
> > 
> > - Lists are created for possible NUMA nodes.
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and 
> > additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> 
> So, by your numbers, we have 2 * 2 * 12286 * 32 = 1.5MB per mount.
> 
> So for that to make up 100GB of RAM, you must have somewhere over
> 500,000 mounted superblocks on the machine?
> 
> That implies 50+ unique mounted superblocks per container, which
> seems like an awful lot.

Here is how the calculation turns out to be in my setup:

Number of possible NUMA nodes = 2
Number of mounts per container = 7 (Check below to see which are these)
Number of list creation requests per mount = 2
Number of containers = 10000
memcg_nr_cache_ids for 10k containers = 12286
size of kmalloc-32 = 32 bytes

2*7*2*10000*12286*32 = 110082560000 bytes = 102.5G
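
The same arithmetic as a small standalone C sketch (userspace only, values
hard-coded from the setup above), for anyone who wants to recheck the
per-mount and total figures:

#include <stdio.h>

int main(void)
{
	unsigned long long lists = 2;		/* dentry + inode list_lru per mount */
	unsigned long long nodes = 2;		/* possible NUMA nodes */
	unsigned long long mounts = 7;		/* mounts per container */
	unsigned long long containers = 10000;
	unsigned long long cache_ids = 12286;	/* memcg_nr_cache_ids at 10k memcgs */
	unsigned long long objsize = 32;	/* kmalloc-32 object size */

	unsigned long long per_mount = lists * nodes * cache_ids * objsize;
	unsigned long long total = per_mount * mounts * containers;

	printf("per mount: %llu bytes (~%.1f MB)\n",
	       per_mount, per_mount / (1024.0 * 1024));
	printf("total:     %llu bytes (~%.1f GiB)\n",
	       total, total / (1024.0 * 1024 * 1024));
	return 0;
}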

> 
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> 
> Yeah, at first glance this doesn't strike me as a problem with the
> list_lru structure, it smells more like a problem resulting from a
> huge number of superblock instantiations on the machine. Which,
> probably, mostly have no significant need for anything other than a
> single memcg awareness?
> 
> Can you post a typical /proc/self/mounts output from one of these
> idle/empty containers so we can see exactly how many mounts and
> their type are being instantiated in each container?

Tracing type->name in alloc_super() lists these 7 types for
every container.

3-2691[041]    222.761041: alloc_super: fstype: mqueue
3-2692[072]    222.812187: alloc_super: fstype: proc
3-2692[072]    222.812261: alloc_super: fstype: tmpfs
3-2692[072]    222.812329: alloc_super: fstype: devpts
3-2692[072]    222.812392: alloc_super: fstype: tmpfs
3-2692[072]    222.813102: alloc_super: fstype: tmpfs
3-2692[072]    222.813159: alloc_super: fstype: tmpfs

> 
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> 
> Superblocks have to support an unknown number

Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-06 Thread Bharata B Rao
On Mon, Apr 05, 2021 at 11:38:44AM -0700, Roman Gushchin wrote:
> > > @@ -534,7 +521,17 @@ static void memcg_drain_list_lru_node(struct 
> > > list_lru *lru, int nid,
> > > spin_lock_irq(&nlru->lock);
> > >
> > > src = list_lru_from_memcg_idx(nlru, src_idx);
> > > +   if (!src)
> > > +   goto out;
> > > +
> > > dst = list_lru_from_memcg_idx(nlru, dst_idx);
> > > +   if (!dst) {
> > > +   /* TODO: Use __GFP_NOFAIL? */
> > > +   dst = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
> > > +   init_one_lru(dst);
> > > +   memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, 
> > > true);
> > > +   memcg_lrus->lru[dst_idx] = dst;
> > > +   }
> 
> Hm, can't we just reuse src as dst in this case?
> We don't need src anymore and we're basically allocating dst to move all data 
> from src.

Yes, we can do that and it would be much simpler.

> If not, we can allocate up to the root memcg every time to avoid having
> !dst case and fiddle with __GFP_NOFAIL.
> 
> Otherwise I like the idea and I think it might reduce the memory overhead
> especially on (very) big machines.

Yes, I will however have to check whether the callers of list_lru_add() are capable
of handling the failure that can happen with this approach if kmalloc fails.

Regards,
Bharata.


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-06 Thread Bharata B Rao
On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao  wrote:
> >
> > Hi,
> >
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> >
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> >
> > If,
> >
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> >
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> 
> Yes, this is exactly what the current implementation does.
> 
> >
> > Following factors contribute to the excessive allocations:
> >
> > - Lists are created for possible NUMA nodes.
> 
> Yes, because filesystem caches (dentry and inode) are NUMA aware.

True, but creating lists for possible nodes as against online nodes
can hurt platforms where possible is typically higher than online.

> 
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and 
> > additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> 
> The current implementation has list_lrus attached with super_block. So
> the list can't be freed until the super block is unmounted.
> 
> I'm looking into consolidating list_lrus more closely with memcgs. It
> means the list_lrus will have the same life cycles as memcgs rather
> than filesystems. This may be able to improve some. But I'm supposed
> the filesystem will be unmounted once the container exits and the
> memcgs will get offlined for your usecase.

Yes, but when the containers are still running, the lists that get
created for non-existing memcgs and non-relevant memcgs are the main
cause of increased memory consumption.

> 
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> 
> Yes, because the kernel is *NOT* aware of containers.
> 
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> >
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> >
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> It is fine to reduce the memory consumption for your usecase, but I'm
> not sure if this would incur any noticeable overhead for vfs
> operations since list_lru_add() should be called quite often, but it
> just needs to allocate the list for once (for each memcg +
> filesystem), so the

High kmalloc-32 slab cache consumption with 10k containers

2021-04-04 Thread Bharata B Rao
Hi,

When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
consumption increases quite a lot (around 172G) when the containers are
running. Most of it comes from slab (149G) and within slab, the majority of
it comes from kmalloc-32 cache (102G)

The major allocator of kmalloc-32 slab cache happens to be the list_head
allocations of list_lru_one list. These lists are created whenever a
FS mount happens. Specifically, two such lists are registered by alloc_super(),
one for dentry and another for inode shrinker list. And these lists
are created for all possible NUMA nodes and for all given memcgs
(memcg_nr_cache_ids to be particular)

If,

A = Nr allocation request per mount: 2 (one for dentry and inode list)
B = Nr NUMA possible nodes
C = memcg_nr_cache_ids
D = size of each kmalloc-32 object: 32 bytes,

then for every mount, the amount of memory consumed by kmalloc-32 slab
cache for list_lru creation is A*B*C*D bytes.
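For example, with the numbers from this setup (B = 2 possible nodes and
C = 12286, see below), that works out to 2 * 2 * 12286 * 32 bytes, i.e.
roughly 1.5 MB of kmalloc-32 memory for every single mount.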

Following factors contribute to the excessive allocations:

- Lists are created for possible NUMA nodes.
- memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id()) and additional
  list_lrus are created when it grows. Thus we end up creating list_lru_one
  list_heads even for those memcgs which are yet to be created.
  For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
  a value of 12286.
- When a memcg goes offline, the list elements are drained to the parent
  memcg, but the list_head entry remains.
- The lists are destroyed only when the FS is unmounted. So list_heads
  for non-existing memcgs remain and continue to contribute to the
  kmalloc-32 allocation. This is presumably done for performance
  reason as they get reused when new memcgs are created, but they end up
  consuming slab memory until then.
- In case of containers, a few file systems get mounted and are specific
  to the container namespace and hence to a particular memcg, but we
  end up creating lists for all the memcgs.
  As an example, if 7 FS mounts are done for every container and when
  10k containers are created, we end up creating 2*7*12286 list_lru_one
  lists for each NUMA node. It appears that no elements will get added
  to other than 2*7=14 of them in the case of containers.

One straight forward way to prevent this excessive list_lru_one
allocations is to limit the list_lru_one creation only to the
relevant memcg. However I don't see an easy way to figure out
that relevant memcg from FS mount path (alloc_super())

As an alternative approach, I have this below hack that does lazy
list_lru creation. The memcg-specific list is created and initialized
only when there is a request to add an element to that particular
list. Though I am not sure about the full impact of this change
on the owners of the lists and also the performance impact of this,
the overall savings look good.

Used memory
                Before      During      After
W/o patch       23G         172G        40G
W/  patch       23G         69G         29G

Slab consumption
                Before      During      After
W/o patch       1.5G        149G        22G
W/  patch       1.5G        45G         10G

Number of kmalloc-32 allocations
                Before      During          After
W/o patch       178176      3442409472      388933632
W/  patch       190464      468992          468992

Any thoughts on other approaches to address this scenario and
any specific comments about the approach that I have taken are
appreciated. Meanwhile the patch looks like below:

From 9444a0c6734c2853057b1f486f85da2c409fdc84 Mon Sep 17 00:00:00 2001
From: Bharata B Rao 
Date: Wed, 31 Mar 2021 18:21:45 +0530
Subject: [PATCH 1/1] mm: list_lru: Allocate list_lru_one only when required.

Don't pre-allocate list_lru_one list heads for all memcg_cache_ids.
Instead allocate and initialize them only when required.

Signed-off-by: Bharata B Rao 
---
 mm/list_lru.c | 79 +--
 1 file changed, 38 insertions(+), 41 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 6f067b6b935f..b453fa5008cc 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -112,16 +112,32 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static void init_one_lru(struct list_lru_one *l)
+{
+   INIT_LIST_HEAD(&l->list);
+   l->nr_items = 0;
+}
+
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
struct mem_cgroup *memcg;
struct list_lru_one *l;
+   struct list_lru_memcg *memcg_lrus;
 
spin_lock(&nlru->lock);
if (list_empty(item)) {
l = list_lru_from_kmem(nlru, item, &memcg);
+   if (!l) {
+   l = kmalloc(sizeof(struct list_lru_o

Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-02-03 Thread Bharata B Rao
On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > On Tue, 26 Jan 2021, Will Deacon wrote:
> > 
> >> > Hm, but booting the secondaries is just a software (kernel) action? They 
> >> > are
> >> > already physically there, so it seems to me as if the cpu_present_mask 
> >> > is not
> >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> >>
> >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> >> the online mask does not. We can't really do any better on arm64, as 
> >> there's
> >> no way of telling that a CPU is present until we've seen it.
> > 
> > The order of each page in a kmem cache --and therefore also the number
> > of objects in a slab page-- can be different because that information is
> > stored in the page struct.
> > 
> > Therefore it is possible to retune the order while the cache is in operaton.
> 
> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> is enabled, see [1].
> 
> But as a quick fix for the regression, the heuristic idea could work 
> reasonably
> on all architectures?
> - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
> arm64, and use it
> - otherwise use nr_cpu_ids
> 
> Long-term we can attempt to do the retuning safe, or decide that number of 
> cpus
> shouldn't determine the order...
> 
> [1] 
> https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b...@suse.cz/

So what is preferable here now? The above, some other quick fix, or
reverting the original commit?

Regards,
Bharata.


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-24 Thread Bharata B Rao
On Fri, Jan 22, 2021 at 02:05:47PM +0100, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka  wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > The problem is that calculate_order() is called a number of times
> > >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > >> > This makes the use of num_online_cpus() irrelevant for those cases
> > >> >
> > >> > After adding in my command line "slub_min_objects=36" which equals to
> > >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > >> > , the regression diseapears:
> > >> >
> > >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway. 
> > It's
> > supposed to be a scheduler benchmark? What exactly is going on?
> 
> Uuuh, I think powerpc doesn't have cmpxchg_double?
> 
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
> 
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slub_free()? Do you have detailed
> profiling results from "perf top" or something like that?

I can check that, but the current patch was aimed at reducing
the page order of the slub caches so that they don't end up
consuming more memory on 64K systems.

> 
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

Sounds interesting, will keep a watch to see its effect on powerpc.

Regards,
Bharata.


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-22 Thread Bharata B Rao
On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka  wrote:
> >>
> >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >> >
> >> >> > The problem is that calculate_order() is called a number of times
> >> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> >> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >> >
> >> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> >> >> > , the regression diseapears:
> >> >> >
> >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >>
> >> I'm surprised that hackbench is that sensitive to slab performance, 
> >> anyway. It's
> >> supposed to be a scheduler benchmark? What exactly is going on?
> >>
> > 
> > From hackbench description:
> > Hackbench is both a benchmark and a stress test for the Linux kernel
> > scheduler. It's  main
> >job  is  to  create a specified number of pairs of schedulable
> > entities (either threads or
> >traditional processes) which communicate via either sockets or
> > pipes and time how long  it
> >takes for each pair to send data back and forth.
> 
> Yep, so I wonder which slab entities this is stressing that much.
> 
> >> Things would be easier if we could trust *on all arches* either
> >>
> >> - num_present_cpus() to count what the hardware really physically has 
> >> during
> >> boot, even if not yet onlined, at the time we init slab. This would still 
> >> not
> >> handle later hotplug (probably mostly in a VM scenario, not that somebody 
> >> would
> >> bring bunch of actual new cpu boards to a running bare metal system?).
> >>
> >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on 
> >> systems
> >> where it's not really possible to plug more CPU's. In a VM scenario we 
> >> could
> >> still have an opposite problem, where theoretically "anything is possible" 
> >> but
> >> the virtual cpus are never added later.
> > 
> > On all the system that I have tested num_possible_cpus()/nr_cpu_ids
> > were correctly initialized
> > 
> > large arm64 acpi system
> > small arm64 DT based system
> > VM on x86 system
> 
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it 
> caused
> by bios or the hypervisor? How does num_present_cpus() look there?

PowerPC PowerNV Host: (160 cpus)
num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160 

PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160 

That's what I see on powerpc, hence I thought num_present_cpus() could
be the correct one to use in slub page order calculation.

> 
> What about heuristic:
> - num_online_cpus() > 1 - we trust that and use it
> - otherwise nr_cpu_ids
> Would that work? Too arbitrary?

Looking at the following snippet from include/linux/cpumask.h, it
appears that num_present_cpus() should be a reasonable compromise
between online and possible/nr_cpu_ids to use here.

/*
 * The following particular system cpumasks and operations manage
 * possible, present, active and online cpus.
 *
 * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
 * cpu_present_mask - has bit 'cpu' set iff cpu is populated
 * cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
 * cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
 *
 *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
 *
 *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
 *  that it is possible might ever be plugged in at anytime during the
 *  life of that system boot.  The cpu_present_mask is dynamic(*),
 *  representing which CPUs are currently plugged in.  And
 *  cpu_online_mask is the dynamic subset of cpu_present_mask,
 *  indicating those CPUs available for scheduling.
 *
 *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
 *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
 *  ACPI reports present at boot.
 *
 *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
 *  depending on what ACPI reports as currently plugged in, otherwise
 *  cpu_present_mask is just a copy of cpu_possible_mask.
 *
 *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
 *  hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
 */

So for host systems, present is (usually) equal to possible, and for
guest systems present should indicate the CPUs found to be present
at boot time. The intention of my original patch was to use this
metric in the slub page order calculation rather than nr_cpu_ids
or num_possible_cpus(), which can be high on guest systems that
typically support CPU hotplug.
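
To make the comparison concrete, here is a small userspace sketch (not
kernel code) that evaluates the slub min_objects formula for the CPU counts
printed above, plus the online>1 heuristic suggested earlier in the thread;
fls() here mirrors the kernel helper's semantics:

#include <stdio.h>

/* Position of the highest set bit, fls(0) == 0, like the kernel's fls(). */
static int fls(unsigned int x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Starting value used by calculate_order(). */
static int min_objects(unsigned int cpus)
{
	return 4 * (fls(cpus) + 1);
}

static void report(const char *system, unsigned int online,
		   unsigned int present, unsigned int nr_cpu_ids)
{
	/* Heuristic discussed above: trust online if > 1, else nr_cpu_ids. */
	unsigned int heuristic = online > 1 ? online : nr_cpu_ids;

	printf("%s: online %u -> %d, present %u -> %d, nr_cpu_ids %u -> %d, heuristic -> %d\n",
	       system, online, min_objects(online),
	       present, min_objects(present),
	       nr_cpu_ids, min_objects(nr_cpu_ids),
	       min_objects(heuristic));
}

int main(void)
{
	report("PowerNV host ", 1, 160, 160);
	report("pseries guest", 1, 16, 160);
	return 0;
}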

Regards,
Bharata.


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-20 Thread Bharata B Rao
On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> Hi,
> 
> On Wed, 18 Nov 2020 at 09:28, Bharata B Rao  wrote:
> >
> > The page order of the slab that gets chosen for a given slab
> > cache depends on the number of objects that can be fit in the
> > slab while meeting other requirements. We start with a value
> > of minimum objects based on nr_cpu_ids that is driven by
> > possible number of CPUs and hence could be higher than the
> > actual number of CPUs present in the system. This leads to
> > calculate_order() chosing a page order that is on the higher
> > side leading to increased slab memory consumption on systems
> > that have bigger page sizes.
> >
> > Hence rely on the number of online CPUs when determining the
> > mininum objects, thereby increasing the chances of chosing
> > a lower conservative page order for the slab.
> >
> > Signed-off-by: Bharata B Rao 
> > ---
> > This is a generic change and I am unsure how it would affect
> > other archs, but as a start, here are some numbers from
> > PowerPC pseries KVM guest with and without this patch:
> >
> > This table shows how this change has affected some of the slab
> > caches.
> > =========================================================================
> >                         Current                     Patched
> > Cache           objperslab  pagesperslab    objperslab  pagesperslab
> > =========================================================================
> > TCPv6               53           2              26           1
> > net_namespace       53           4              26           2
> > dtl                 32           2              16           1
> > names_cache         32           2              16           1
> > task_struct         53           8              13           2
> > thread_stack        32           8               8           2
> > pgtable-2^11        16           8               8           4
> > pgtable-2^8         32           2              16           1
> > kmalloc-32k         16           8               8           4
> > kmalloc-16k         32           8               8           2
> > kmalloc-8k          32           4               8           1
> > kmalloc-4k          32           2              16           1
> > =========================================================================
> >
> > Slab memory (kB) consumption comparision
> > ==
> > Current Patched
> > ==
> > After-boot  205760  156096
> > During-hackbench629145  506752 (Avg of 5 runs)
> > After-hackbench 474176  331840 (after drop_caches)
> > ==
> >
> > Hackbench Time (Avg of 5 runs)
> > (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> > ==
> > Current Patched
> > ==
> > 10.990  11.010
> > ==
> >
> > Measuring the effect due to CPU hotplug
> > 
> > Since the patch doesn't consider all the possible CPUs for page
> > order calcluation, let's see how affects the case when CPUs are
> > hotplugged. Here I compare a system that is booted with 64CPUs
> > with a system that is booted with 16CPUs but hotplugged with
> > 48CPUs after boot. These numbers are with the patch applied.
> >
> > Slab memory (kB) consumption comparision
> > ===
> > 64bootCPUs  16bootCPUs+48HotPluggedCPUs
> > ===
> > After-boot  390272  159744
> > After-hotplug   -   251328
> > During-hackbench1001267 941926 (Avg of 5 runs)
> > After-hackbench 913600  827200 (after drop_caches)
> > ===
> >
> > Hackbench Time (Avg of 5 runs)
> > (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> > ===
> > 64bootCPUs  16bootCPUs+48HotPluggedCPUs
> > ===
> > 12.554  12.589
> > ===
> >  mm/slub.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> 
> I'm facing significant performances regression on a large arm64 server
> system (224 CPUs). Regressions is also present on small arm64 system
> (8 CPUs) but in a far smaller order of magn

[RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2020-11-18 Thread Bharata B Rao
The page order of the slab that gets chosen for a given slab
cache depends on the number of objects that can fit in the
slab while meeting other requirements. We start with a value
of minimum objects based on nr_cpu_ids, which is driven by the
possible number of CPUs and hence could be higher than the
actual number of CPUs present in the system. This leads to
calculate_order() choosing a page order that is on the higher
side, leading to increased slab memory consumption on systems
that have bigger page sizes.

Hence rely on the number of online CPUs when determining the
minimum objects, thereby increasing the chances of choosing
a lower, conservative page order for the slab.
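
For example, on a guest with nr_cpu_ids = 160 but only 16 CPUs online
(numbers reported elsewhere in this thread), the starting value drops from
min_objects = 4 * (fls(160) + 1) = 36 to 4 * (fls(16) + 1) = 24, and to
4 * (fls(1) + 1) = 8 if calculate_order() runs before secondary CPUs are
brought online; the smaller starting point is what lets calculate_order()
settle on the lower page orders shown in the tables below.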

Signed-off-by: Bharata B Rao 
---
This is a generic change and I am unsure how it would affect
other archs, but as a start, here are some numbers from
PowerPC pseries KVM guest with and without this patch:

This table shows how this change has affected some of the slab
caches.
=========================================================================
                        Current                     Patched
Cache           objperslab  pagesperslab    objperslab  pagesperslab
=========================================================================
TCPv6               53           2              26           1
net_namespace       53           4              26           2
dtl                 32           2              16           1
names_cache         32           2              16           1
task_struct         53           8              13           2
thread_stack        32           8               8           2
pgtable-2^11        16           8               8           4
pgtable-2^8         32           2              16           1
kmalloc-32k         16           8               8           4
kmalloc-16k         32           8               8           2
kmalloc-8k          32           4               8           1
kmalloc-4k          32           2              16           1
=========================================================================

Slab memory (kB) consumption comparison
==========================================================
                        Current         Patched
==========================================================
After-boot              205760          156096
During-hackbench        629145          506752  (Avg of 5 runs)
After-hackbench         474176          331840  (after drop_caches)
==========================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
==
Current Patched
==
10.990  11.010
==

Measuring the effect due to CPU hotplug

Since the patch doesn't consider all the possible CPUs for page
order calculation, let's see how it affects the case when CPUs are
hotplugged. Here I compare a system that is booted with 64CPUs
with a system that is booted with 16CPUs but hotplugged with
48CPUs after boot. These numbers are with the patch applied.

Slab memory (kB) consumption comparison
===========================================================
                        64bootCPUs      16bootCPUs+48HotPluggedCPUs
===========================================================
After-boot              390272          159744
After-hotplug           -               251328
During-hackbench        1001267         941926  (Avg of 5 runs)
After-hackbench         913600          827200  (after drop_caches)
===========================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
===
64bootCPUs  16bootCPUs+48HotPluggedCPUs
===
12.554  12.589
===
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 34dcc09e2ec9..8342c0a167b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 */
min_objects = slub_min_objects;
if (!min_objects)
-   min_objects = 4 * (fls(nr_cpu_ids) + 1);
+   min_objects = 4 * (fls(num_online_cpus()) + 1);
max_objects = order_objects(slub_max_order, size);
min_objects = min(min_objects, max_objects);
 
-- 
2.26.2



Re: Higher slub memory consumption on 64K page-size systems?

2020-11-11 Thread Bharata B Rao
On Thu, Nov 05, 2020 at 05:47:03PM +0100, Vlastimil Babka wrote:
> On 10/28/20 6:50 AM, Bharata B Rao wrote:
> > slub_max_order
> > --
> > The most promising tunable that shows consistent reduction in slab memory
> > is slub_max_order. Here is a table that shows the number of slabs that
> > end up with different orders and the total slab consumption at boot
> > for different values of slub_max_order:
> > ---
> > slub_max_order  Order   NrSlabs Slab memory
> > ---
> > 0   276
> > 3   1   16  207488 kB
> >  (default)  2   4
> > 3   11
> > ---
> > 0   276
> > 2   1   16  166656 kB
> > 2   4
> > ---
> > 0   276 144128 kB
> > 1   1   31
> > ---
> > 
> > Though only a few bigger sized caches fall into order-2 or order-3, they
> > seem to make a considerable difference to the overall slab consumption.
> > If we take task_struct cache as an example, this is how it ends up when
> > slub_max_order is varied:
> > 
> > task_struct, objsize=9856
> > 
> > slub_max_order  objperslab  pagesperslab
> > 
> > 3   53  8
> > 2   26  4
> > 1   13  2
> > 
> > 
> > The slab page-order and hence the number of objects in a slab has a
> > bearing on the performance, but I wonder if some caches like task_struct
> > above can be auto-tuned to fall into a conservative order and do good
> > both wrt both memory and performance?
> 
> Hmm ideally this should be based on objperslab so if there's larger page
> sizes, then the calculated order becomes smaller, even 0?

It is indeed based on the number of objects that can be optimally
fit within a slab. As I explain below, currently we start with a
minimum objects value that ends up pushing the page order higher
for some slab size and page size combinations. The question is: can
we start with a more conservative/lower value for min_objects in
calculate_order()?

> 
> > mm/slub.c:calulate_order() has the logic which determines the the
> > page-order for the slab. It starts with min_objects and attempts
> > to arrive at the best configuration for the slab. The min_objects
> > is starts like this:
> > 
> > min_objects = 4 * (fls(nr_cpu_ids) + 1);
> > 
> > Here nr_cpu_ids depends on the maxcpus and hence this can have a
> > significant effect on those systems which define maxcpus. Slab numbers
> > post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
> > number of maxcpus look like this:
> > ---
> > maxcpus Slab memory(kB)
> > ---
> > 64  209280
> > 256 253824
> > 512 293824
> > ---
> 
> Yeah IIRC nr_cpu_ids is related to number of possible cpus which is rather
> excessive on some systems, so a relation to actually online cpus would make
> more sense.

Maybe I can send a patch to change the above calculation of
min_objects to be based on online cpus and see how it is received.

> 
> > Page-order is a one time setting and obviously can't be tweaked dynamically
> > on CPU hotplug, but just wanted to bring out the effect of the same.
> > 
> > And that constant multiplicative factor of 4 was infact added by the commit
> > 9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."
> > 
> > Reducing that to say 2, does give some reduction in the slab memory
> > and also same hackbench performance with reduced slab memory, but I am not
> > sure if that could be assumed to be beneficial for all scenarios.
> > 
> > MIN_PARTIAL
> > ---
> > This determines the number of slabs left on the partial list even if they
> > are empty. My initial thought was that the default MIN_PARTIAL value of 5
> > is on the higher side and we are accumulating MIN_PARTIAL number of
> > empty slabs in all caches without freeing them. However I hardly find
> > the case where an empty slab is retained during freeing on account of
> > partial slabs being lesser than MIN_PARTIAL.
> > 
> > However what I find in practice is that 

Re: Higher slub memory consumption on 64K page-size systems?

2020-11-02 Thread Bharata B Rao
On Wed, Oct 28, 2020 at 05:07:57PM -0700, Roman Gushchin wrote:
> On Wed, Oct 28, 2020 at 11:20:30AM +0530, Bharata B Rao wrote:
> > I have mostly looked at reducing the slab memory consumption here.
> > But I do understand that default tunable values have been arrived
> > at based on some benchmark numbers. Are there ways or possibilities
> > to reduce the slub memory consumption with the existing level of
> > performance is what I would like to understand and explore.
> 
> Hi Bharata!
> 
> I wonder how the distribution of the consumed memory by slab_caches
> differs between 4k and 64k pages. In particular, I wonder if
> page-sized and larger kmallocs make the difference (or a big part of it)?
> There are many places in the kernel which are doing something like
> kmalloc(PAGE_SIZE).

Here is a comparison of the topmost slabs in terms of memory usage between
the 4K and 64K configurations:

Case 1: After boot
==
4K page-size

Name              Objects  Objsize    Space  Slabs/Part/Cpu  O/S  O  %Fr  %Ef  Flg
inode_cache         23382      592    14.1M        400/0/33   54  3    0   97  a
dentry              29484      192     5.7M       592/0/110   42  1    0   98  a
kmalloc-1k           5358     1024     5.6M        130/9/42   32  3    5   97
task_struct           371     9856     4.1M         88/6/40    3  3    4   87
kmalloc-512          6640      512     3.4M        159/3/49   32  2    1   99
...
kmalloc-4k            530     4096     2.2M         42/6/27    8  3    8   96

64K page-size
-
pgtable-2^11          935    16384    38.7M        16/16/58   16  3   21   39
inode_cache         23980      592    14.4M        203/0/17  109  0    0   98  a
thread_stack          709    16384    12.0M          6/1/17   32  3    4   96
task_struct          1012     9856    10.4M          4/1/16   53  3    5   95
kmalloc-64k           144    65536     9.4M          2/0/16    8  3    0  100

Case 2: After hackbench run
===
4K page-size

inode_cache         21823      592    13.3M        361/3/46   54  3    0   96  a
kmalloc-512         10309      512     9.4M     433/325/146   32  2   56   55
kmalloc-1k           6207     1024     6.5M       121/12/78   32  3    6   97
dentry              28923      192     5.9M      468/48/261   42  1    6   92  a
task_struct           418     9856     5.1M       106/24/51    3  3   15   80
...
kmalloc-4k            510     4096     2.1M        41/10/26    8  3   14   95

64K page-size
-
kmalloc-8k           3081     8192    84.9M      241/241/83   32  2   74   29
thread_stack         2919    16384    52.4M        15/10/85   32  3   10   91
pgtable-2^11         1281    16384    50.8M        20/20/77   16  3   20   41
task_struct          3771     9856    40.3M          9/6/68   53  3    7   92
vm_area_struct      92295      200    18.9M         8/8/281  327  0    2   97
...
kmalloc-64k           144    65536     9.4M          2/0/16    8  3    0  100

I can't see any specific pattern w.r.t. kmalloc cache usage in the two
cases above (boot vs hackbench run). In the boot case, the 64K configuration
consuming more memory can probably be attributed to the bigger page size
itself. However, in the case of the hackbench run, the significant number of
partial slabs does contribute to the significant increase in memory for the
64K configuration.

> 
> Re slub tuning: in general we do care about the number of objects
> in a partial list, less about the number of pages. If we can have the
> same amount of objects but on fewer pages, it's even better.

Right, but how do we achieve that when a small number of in-use objects is
spread across a number of partial slabs? This is specifically the case
we see after a workload run (hackbench in this case).

> So I don't see any reasons why we shouldn't scale down these tunables
> if the PAGE_SIZE > 4K.
> Idk if it makes sense to switch to byte-sized tunables or just to hardcode
> custom default values for the 64k page case. The latter is probably
> is easier.

Right, tuning the minimum number of objects when calculating the page order
of the slab and tuning the cpu_partial value show some consistent reduction
in the slab memory consumption. (I have shown this in a previous mail.)

Thanks for your comments.

Regards,
Bharata.


Higher slub memory consumption on 64K page-size systems?

2020-10-28 Thread Bharata B Rao
Hi,

On POWER systems, where 64K PAGE_SIZE is the default, I see that slub
consumes a higher amount of memory compared to any 4K page-size system.
While slub is obviously going to consume more memory on 64K page-size
systems compared to 4K, as slabs are allocated in page-size granularity,
I want to check if there is any obvious tuning (via existing tunables
or via some code change) that we can do to reduce the amount of memory
consumed by slub.

Here is a comparison of the slab memory consumption between a 4K and a
64K page-size pseries hash KVM guest with a 16-core and 16G memory
configuration, immediately after boot:

64K 209280 kB
4K  67636 kB

A 64K configuration may never be able to consume as little as a 4K configuration,
but this certainly shows that slub can be optimized better for the 64K page-size.

slub_max_order
--
The most promising tunable that shows consistent reduction in slab memory
is slub_max_order. Here is a table that shows the number of slabs that
end up with different orders and the total slab consumption at boot
for different values of slub_max_order:
--------------------------------------------------------
slub_max_order      Order       NrSlabs     Slab memory
--------------------------------------------------------
                      0           276
      3               1            16       207488 kB
  (default)           2             4
                      3            11
--------------------------------------------------------
                      0           276
      2               1            16       166656 kB
                      2             4
--------------------------------------------------------
                      0           276       144128 kB
      1               1            31
--------------------------------------------------------

Though only a few bigger sized caches fall into order-2 or order-3, they
seem to make a considerable difference to the overall slab consumption.
If we take task_struct cache as an example, this is how it ends up when
slub_max_order is varied:

task_struct, objsize=9856

slub_max_order  objperslab  pagesperslab

3   53  8
2   26  4
1   13  2

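(These per-order object counts follow directly from the slab size: an
order-3 slab on a 64K page-size system is 8 * 64K = 512K, and 512K / 9856
gives 53 objects; order-2 gives 256K / 9856 = 26 and order-1 gives
128K / 9856 = 13.)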

The slab page-order, and hence the number of objects in a slab, has a
bearing on performance, but I wonder if some caches like task_struct
above can be auto-tuned to fall into a conservative order and do well
with respect to both memory and performance?

mm/slub.c:calculate_order() has the logic which determines the
page-order for the slab. It starts with min_objects and attempts
to arrive at the best configuration for the slab. The min_objects
value starts like this:

min_objects = 4 * (fls(nr_cpu_ids) + 1);

Here nr_cpu_ids depends on the maxcpus and hence this can have a
significant effect on those systems which define maxcpus. Slab numbers
post-boot for a KVM pseries guest that has 16 boottime CPUs and varying
number of maxcpus look like this:
---
maxcpus Slab memory(kB)
---
64  209280
256 253824
512 293824
---
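
(With the current formula, these maxcpus values translate to starting
min_objects of 4 * (fls(64) + 1) = 32, 4 * (fls(256) + 1) = 40 and
4 * (fls(512) + 1) = 44 respectively, which is what pushes calculate_order()
towards larger page orders as nr_cpu_ids grows.)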

Page-order is a one time setting and obviously can't be tweaked dynamically
on CPU hotplug, but just wanted to bring out the effect of the same.

And that constant multiplicative factor of 4 was in fact added by the commit
9b2cd506e5f2 - "slub: Calculate min_objects based on number of processors."

Reducing that to, say, 2 does give some reduction in the slab memory
while retaining the same hackbench performance, but I am not
sure if that can be assumed to be beneficial for all scenarios.

MIN_PARTIAL
---
This determines the number of slabs left on the partial list even if they
are empty. My initial thought was that the default MIN_PARTIAL value of 5
is on the higher side and we are accumulating MIN_PARTIAL number of
empty slabs in all caches without freeing them. However I hardly find
the case where an empty slab is retained during freeing on account of
partial slabs being lesser than MIN_PARTIAL.

However what I find in practice is that we are accumulating a lot of partial
slabs with just one in-use object in the whole slab. A high number of such
partial slabs is indeed contributing to the increased slab memory consumption.

For example, after a hackbench run, I find the distribution of objects
like this for kmalloc-2k cache:

total_objects                   3168
objects                         1611
Nr partial slabs                  54
Nr partial slabs with
just 1 inuse object               38

With 64K page-size, so many partial slabs with just 1 inuse object can
result in high memory usage. Is there any workaround possible to prevent this
kind of situation?
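
To put a rough number on it: assuming kmalloc-2k uses order-0 slabs here
(a single 64K page holding 32 objects, which would be consistent with the
3168 total objects spread over about 99 slabs), each of those 38 nearly-empty
partial slabs pins a full 64K page while holding only 2K of live data, i.e.
roughly 38 * 62K, or about 2.3 MB of slack, for this one cache alone.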

cpu_partial
---
Here is how the slab consumption post-boot varies when all the slab
caches are forced with the fixed cpu_partial value:
---
cpu_partial Slab Memory
---
0   175872 kB
2   187136 

Re: [PATCH FIX v0] mm: memcg/slab: Uncharge during kmem_cache_free_bulk()

2020-10-11 Thread Bharata B Rao
On Fri, Oct 09, 2020 at 11:45:51AM -0700, Roman Gushchin wrote:
> On Fri, Oct 09, 2020 at 11:34:23AM +0530, Bharata B Rao wrote:
> 
> Hi Bharata,
> 
> > Object cgroup charging is done for all the objects during
> > allocation, but during freeing, uncharging ends up happening
> > for only one object in the case of bulk allocation/freeing.
> 
> Yes, it's definitely a problem. Thank you for catching it!
> 
> I'm curious, did you find it in the wild or by looking into the code?

Found by looking at the code.

> 
> > 
> > Fix this by having a separate call to uncharge all the
> > objects from kmem_cache_free_bulk() and by modifying
> > memcg_slab_free_hook() to take care of bulk uncharging.
> >
> > Signed-off-by: Bharata B Rao 
> 
> Please, add:
> 
> Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab 
> objects")
> Cc: sta...@vger.kernel.org
> 
> > ---
> >  mm/slab.c |  2 +-
> >  mm/slab.h | 42 +++---
> >  mm/slub.c |  3 ++-
> >  3 files changed, 30 insertions(+), 17 deletions(-)
> > 
> > diff --git a/mm/slab.c b/mm/slab.c
> > index f658e86ec8cee..5c70600d8b1cc 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -3440,7 +3440,7 @@ void ___cache_free(struct kmem_cache *cachep, void 
> > *objp,
> > memset(objp, 0, cachep->object_size);
> > kmemleak_free_recursive(objp, cachep->flags);
> > objp = cache_free_debugcheck(cachep, objp, caller);
> > -   memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
> > +   memcg_slab_free_hook(cachep, &objp, 1);
> >  
> > /*
> >  * Skip calling cache_free_alien() when the platform is not numa.
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 6cc323f1313af..6dd4b702888a7 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -345,30 +345,42 @@ static inline void memcg_slab_post_alloc_hook(struct 
> > kmem_cache *s,
> > obj_cgroup_put(objcg);
> >  }
> >  
> > -static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page 
> > *page,
> > -   void *p)
> > +static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
> > +   void **p, int objects)
> >  {
> > +   struct kmem_cache *s;
> > struct obj_cgroup *objcg;
> > +   struct page *page;
> > unsigned int off;
> > +   int i;
> >  
> > if (!memcg_kmem_enabled())
> > return;
> >  
> > -   if (!page_has_obj_cgroups(page))
> > -   return;
> > +   for (i = 0; i < objects; i++) {
> > +   if (unlikely(!p[i]))
> > +   continue;
> >  
> > -   off = obj_to_index(s, page, p);
> > -   objcg = page_obj_cgroups(page)[off];
> > -   page_obj_cgroups(page)[off] = NULL;
> > +   page = virt_to_head_page(p[i]);
> > +   if (!page_has_obj_cgroups(page))
> > +   continue;
> >  
> > -   if (!objcg)
> > -   return;
> > +   if (!s_orig)
> > +   s = page->slab_cache;
> > +   else
> > +   s = s_orig;
> >  
> > -   obj_cgroup_uncharge(objcg, obj_full_size(s));
> > -   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
> > -   -obj_full_size(s));
> > +   off = obj_to_index(s, page, p[i]);
> > +   objcg = page_obj_cgroups(page)[off];
> > +   if (!objcg)
> > +   continue;
> >  
> > -   obj_cgroup_put(objcg);
> > +   page_obj_cgroups(page)[off] = NULL;
> > +   obj_cgroup_uncharge(objcg, obj_full_size(s));
> > +   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
> > +   -obj_full_size(s));
> > +   obj_cgroup_put(objcg);
> > +   }
> >  }
> >  
> >  #else /* CONFIG_MEMCG_KMEM */
> > @@ -406,8 +418,8 @@ static inline void memcg_slab_post_alloc_hook(struct 
> > kmem_cache *s,
> >  {
> >  }
> >  
> > -static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page 
> > *page,
> > -   void *p)
> > +static inline void memcg_slab_free_hook(struct kmem_cache *s,
> > +   void **p, int objects)
> >  {
> >  }
> >  #endif /* CONFIG_MEMCG_KMEM */
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 6d3574013b2f8..0cbe67f13946e 100644
> > --- a/mm/slub.c
> > ++

[PATCH FIX v0] mm: memcg/slab: Uncharge during kmem_cache_free_bulk()

2020-10-09 Thread Bharata B Rao
Object cgroup charging is done for all the objects during
allocation, but during freeing, uncharging ends up happening
for only one object in the case of bulk allocation/freeing.

Fix this by having a separate call to uncharge all the
objects from kmem_cache_free_bulk() and by modifying
memcg_slab_free_hook() to take care of bulk uncharging.

Signed-off-by: Bharata B Rao 
---
 mm/slab.c |  2 +-
 mm/slab.h | 42 +++---
 mm/slub.c |  3 ++-
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index f658e86ec8cee..5c70600d8b1cc 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3440,7 +3440,7 @@ void ___cache_free(struct kmem_cache *cachep, void *objp,
memset(objp, 0, cachep->object_size);
kmemleak_free_recursive(objp, cachep->flags);
objp = cache_free_debugcheck(cachep, objp, caller);
-   memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
+   memcg_slab_free_hook(cachep, &objp, 1);
 
/*
 * Skip calling cache_free_alien() when the platform is not numa.
diff --git a/mm/slab.h b/mm/slab.h
index 6cc323f1313af..6dd4b702888a7 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -345,30 +345,42 @@ static inline void memcg_slab_post_alloc_hook(struct 
kmem_cache *s,
obj_cgroup_put(objcg);
 }
 
-static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page 
*page,
-   void *p)
+static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
+   void **p, int objects)
 {
+   struct kmem_cache *s;
struct obj_cgroup *objcg;
+   struct page *page;
unsigned int off;
+   int i;
 
if (!memcg_kmem_enabled())
return;
 
-   if (!page_has_obj_cgroups(page))
-   return;
+   for (i = 0; i < objects; i++) {
+   if (unlikely(!p[i]))
+   continue;
 
-   off = obj_to_index(s, page, p);
-   objcg = page_obj_cgroups(page)[off];
-   page_obj_cgroups(page)[off] = NULL;
+   page = virt_to_head_page(p[i]);
+   if (!page_has_obj_cgroups(page))
+   continue;
 
-   if (!objcg)
-   return;
+   if (!s_orig)
+   s = page->slab_cache;
+   else
+   s = s_orig;
 
-   obj_cgroup_uncharge(objcg, obj_full_size(s));
-   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-   -obj_full_size(s));
+   off = obj_to_index(s, page, p[i]);
+   objcg = page_obj_cgroups(page)[off];
+   if (!objcg)
+   continue;
 
-   obj_cgroup_put(objcg);
+   page_obj_cgroups(page)[off] = NULL;
+   obj_cgroup_uncharge(objcg, obj_full_size(s));
+   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
+   -obj_full_size(s));
+   obj_cgroup_put(objcg);
+   }
 }
 
 #else /* CONFIG_MEMCG_KMEM */
@@ -406,8 +418,8 @@ static inline void memcg_slab_post_alloc_hook(struct 
kmem_cache *s,
 {
 }
 
-static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page 
*page,
-   void *p)
+static inline void memcg_slab_free_hook(struct kmem_cache *s,
+   void **p, int objects)
 {
 }
 #endif /* CONFIG_MEMCG_KMEM */
diff --git a/mm/slub.c b/mm/slub.c
index 6d3574013b2f8..0cbe67f13946e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3091,7 +3091,7 @@ static __always_inline void do_slab_free(struct 
kmem_cache *s,
struct kmem_cache_cpu *c;
unsigned long tid;
 
-   memcg_slab_free_hook(s, page, head);
+   memcg_slab_free_hook(s, &head, 1);
 redo:
/*
 * Determine the currently cpus per cpu slab.
@@ -3253,6 +3253,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t 
size, void **p)
if (WARN_ON(!size))
return;
 
+   memcg_slab_free_hook(s, p, size);
do {
struct detached_freelist df;
 
-- 
2.26.2



Re: [PATCH 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-28 Thread Bharata B Rao
On Fri, Sep 25, 2020 at 01:44:42PM -0700, Ralph Campbell wrote:
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
> 
> Signed-off-by: Ralph Campbell 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>  include/linux/dax.h|  2 +-
>  include/linux/memremap.h   |  7 ++-
>  include/linux/mm.h | 44 --
>  lib/test_hmm.c |  2 +-
>  mm/gup.c   | 44 --
>  mm/internal.h  |  8 +++
>  mm/memremap.c  | 82 ++
>  mm/migrate.c   |  5 --
>  mm/page_alloc.c|  3 +
>  mm/swap.c  | 46 +++
>  12 files changed, 44 insertions(+), 203 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 7705d5557239..e6ec98325fab 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
> gpa, struct kvm *kvm)
>  
>   dpage = pfn_to_page(uvmem_pfn);
>   dpage->zone_device_data = pvt;
> - get_page(dpage);
> + init_page_count(dpage);

The powerpc change looks good. Passes a quick sanity test of
booting/rebooting a secure guest on Power.

Tested-by: Bharata B Rao 

Regards,
Bharata.


Re: [PATCH v2 00/28] The new cgroup slab memory controller

2020-09-02 Thread Bharata B Rao
On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao  wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > replace crash dump from piping to Linux traditional save to files
> > > method. However, after trying this workaround, I still observed
> > > hardware watchdog resets during machine  shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown systemd
> > > calls a service that hot-removes memory, and if hot-removing fails for
> > > some reason systemd kills that service after timeout. However, systemd
> > > is never able to kill the service, and we get hardware reset caused by
> > > watchdog or a hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >   do {
> > >   pfn = scan_movable_pages(pfn, end_pfn);
> > >   # Returns 0, meaning there is nothing available to
> > >   # migrate, no page is PageLRU(page)
> > >   ...
> > >   ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > NULL, 
> > > check_pages_isolated_cb);
> > >   # Returns -EBUSY, meaning there is at least one PFN that
> > >   # still has to be migrated.
> > >   } while (ret);
> > >
> > > Thread #2: ccs killer kthread
> > >css_killed_work_fn
> > >  cgroup_mutex  <- Grab this Mutex
> > >  mem_cgroup_css_offline
> > >memcg_offline_kmem.part
> > >   memcg_deactivate_kmem_caches
> > > get_online_mems
> > >   mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: systemd
> > > ksys_read
> > >  vfs_read
> > >__vfs_read
> > >  seq_read
> > >proc_single_show
> > >  proc_cgroup_show
> > >mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> > >
> > > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > > to thread #1.
> > >
> > > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > > still present in all stable branches. Unfortunately, I do not see a
> > > simple fix in how to remove mem_hotplug_lock from
> > > memcg_deactivate_kmem_caches without using Roman's series that is too
> > > big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from hung task timer)
> > instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will fix
> > this, I also wonder if infinitely looping in the memory unplug path
> > with mem_hotplug_lock held is the right thing to do? Earlier we had
> > a few other exit possibilities in this path (like max retries etc)
> > but those were removed by commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or, is the user-space test expected to induce a signal back-off when
> > unplug doesn't complete within a reasonable amount of time?
> 
> Hi Bharata,
> 
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
> 
> What I found is that the reason why our machines did not complete
> hot-remove within the given time is because of this bug:
> https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatas...@soleen.com
> 
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix that you are proposing,
so I might be seeing some other issue, which I will debug further.

So it looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by

c52e75935f8d: mm: remove extra drain pages on pcp list

Regards,
Bharata.


Re: [PATCH v2 00/28] The new cgroup slab memory controller

2020-08-31 Thread Bharata B Rao
On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for
> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>   do {
>   pfn = scan_movable_pages(pfn, end_pfn);
>   # Returns 0, meaning there is nothing available to
>   # migrate, no page is PageLRU(page)
>   ...
>   ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> NULL, check_pages_isolated_cb);
>   # Returns -EBUSY, meaning there is at least one PFN that
>   # still has to be migrated.
>   } while (ret);
> 
> Thread #2: ccs killer kthread
>css_killed_work_fn
>  cgroup_mutex  <- Grab this Mutex
>  mem_cgroup_css_offline
>memcg_offline_kmem.part
>   memcg_deactivate_kmem_caches
> get_online_mems
>   mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>__vfs_read
>  seq_read
>proc_single_show
>  proc_cgroup_show
>mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.

We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from hung task timer)
instead of Thread #2 above:

__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread

While I understand that Roman's new slab controller patchset will fix
this, I also wonder if infinitely looping in the memory unplug path
with mem_hotplug_lock held is the right thing to do? Earlier we had
a few other exit possibilities in this path (like max retries etc)
but those were removed by commits:

72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory

Or, is the user-space test expected to induce a signal back-off when
unplug doesn't complete within a reasonable amount of time?

Regards,
Bharata.



Re: [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-22 Thread Bharata B Rao
On Tue, Jul 21, 2020 at 12:42:02PM +0200, Laurent Dufour wrote:
> When a secure memslot is dropped, all the pages backed in the secure device
> (aka really backed by secure memory by the Ultravisor) should be paged out
> to a normal page. Previously, this was achieved by triggering the page
> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
> 
> This can't work when hot unplugging a memory slot because the memory slot
> is flagged as invalid and gfn_to_pfn() is then not trying to access the
> page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to directly calling it instead of triggering such a mechanism. This
> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.
> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.
> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> addition, the mmap_sem is held in read mode during that time, not in write
> mode since the virtual memory layout is not impacted, and
> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> 
> Cc: Ram Pai 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
>  1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 5a4b02d3f651..ba5c7c77cc3a 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -624,35 +624,55 @@ static inline int kvmppc_svm_page_out(struct 
> vm_area_struct *vma,
>   * fault on them, do fault time migration to replace the device PTEs in
>   * QEMU page table with normal PTEs from newly allocated pages.
>   */
> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>struct kvm *kvm, bool skip_page_out)
>  {
>   int i;
>   struct kvmppc_uvmem_page_pvt *pvt;
> - unsigned long pfn, uvmem_pfn;
> - unsigned long gfn = free->base_gfn;
> + struct page *uvmem_page;
> + struct vm_area_struct *vma = NULL;
> + unsigned long uvmem_pfn, gfn;
> + unsigned long addr, end;
> +
> + mmap_read_lock(kvm->mm);
> +
> + addr = slot->userspace_addr;

We typically use gfn_to_hva() for that, but that won't work for a
memslot that is already marked INVALID, which is the case here.
I think it is ok to access slot->userspace_addr of an INVALID memslot
here, but just thought of explicitly bringing this up.

> + end = addr + (slot->npages * PAGE_SIZE);
>  
> - for (i = free->npages; i; --i, ++gfn) {
> - struct page *uvmem_page;
> + gfn = slot->base_gfn;
> + for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
> +
> + /* Fetch the VMA if addr is not in the latest fetched one */
> + if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
> + vma = find_vma_intersection(kvm->mm, addr, end);
> + if (!vma ||
> + vma->vm_start > addr || vma->vm_end < end) {
> + pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
> + break;
> + }
> + }

In Ram's series, kvmppc_memslot_page_merge() also walks the VMAs spanning
the memslot, but it uses a different logic for the same. Why can't these
two cases use the same method to walk the VMAs? Is there anything subtly
different between the two cases?

Regards,
Bharata.


Re: [PATCH v3 2/5] mm/migrate: add a flags parameter to migrate_vma

2020-07-22 Thread Bharata B Rao
On Tue, Jul 21, 2020 at 02:31:16PM -0700, Ralph Campbell wrote:
> The src_owner field in struct migrate_vma is being used for two purposes,
> it acts as a selection filter for which types of pages are to be migrated
> and it identifies device private pages owned by the caller. Split this
> into separate parameters so the src_owner field can be used just to
> identify device private pages owned by the caller of migrate_vma_setup().
> Rename the src_owner field to pgmap_owner to reflect it is now used only
> to identify which device private pages to migrate.
> 
> Signed-off-by: Ralph Campbell 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c |  4 +++-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  4 +++-
>  include/linux/migrate.h| 13 +
>  lib/test_hmm.c |  6 --
>  mm/migrate.c   |  6 --
>  5 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 09d8119024db..6850bd04bcb9 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -400,6 +400,7 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
> long start,
>   mig.end = end;
>   mig.src = &src_pfn;
>   mig.dst = &dst_pfn;
> + mig.flags = MIGRATE_VMA_SELECT_SYSTEM;
>  
>   /*
>* We come here with mmap_lock write lock held just for
> @@ -577,7 +578,8 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned 
> long start,
>   mig.end = end;
>   mig.src = &src_pfn;
>   mig.dst = &dst_pfn;
> - mig.src_owner = &kvmppc_uvmem_pgmap;
> + mig.pgmap_owner = &kvmppc_uvmem_pgmap;
> + mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
>  
>   mutex_lock(&kvm->arch.uvmem_lock);

For the kvmppc changes above,
Reviewed-by: Bharata B Rao 


Re: [PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE

2020-07-16 Thread Bharata B Rao
On Fri, Jul 17, 2020 at 12:44:00PM +1000, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of July 17, 2020 12:08 pm:
> > Excerpts from Qian Cai's message of July 17, 2020 3:27 am:
> >> On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> >>> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> >>> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> >>> permitted to use instructions like tlbie and tlbsync directly, but is
> >>> expected to make hypervisor calls to get the TLB flushed.
> >>> 
> >>> This series enables the TLB flush routines in the radix code to
> >>> off-load TLB flushing to hypervisor via the newly proposed hcall
> >>> H_RPT_INVALIDATE. 
> >>> 
> >>> To easily check the availability of GTSE, it is made an MMU feature.
> >>> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> >>> handle GTSE as an optionally available feature and to not assume GTSE
> >>> when radix support is available.
> >>> 
> >>> The actual hcall implementation for KVM isn't included in this
> >>> patchset and will be posted separately.
> >>> 
> >>> Changes in v3
> >>> =
> >>> - Fixed a bug in the hcall wrapper code where we were missing setting
> >>>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
> >>>   a full flush for the nested case.
> >>> - s/psize_to_h_rpti/psize_to_rpti_pgsize
> >>> 
> >>> v2: 
> >>> https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bhar...@linux.ibm.com/T/#t
> >>> 
> >>> Bharata B Rao (2):
> >>>   powerpc/mm: Enable radix GTSE only if supported.
> >>>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
> >>> enabled
> >>> 
> >>> Nicholas Piggin (1):
> >>>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
> >>> !GTSE
> >> 
> >> Reverting the whole series fixed random memory corruptions during boot on
> >> POWER9 PowerNV systems below.
> > 
> > If I s/mmu_has_feature(MMU_FTR_GTSE)/(1)/g in radix_tlb.c, then the .o
> > disasm is the same as reverting my patch.
> > 
> > Feature bits not being set right? PowerNV should be pretty simple, seems
> > to do the same as FTR_TYPE_RADIX.
> 
> Might need this fix
> 
> ---
> 
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 9cc49f265c86..54c9bcea9d4e 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -163,7 +163,7 @@ static struct ibm_pa_feature {
>   { .pabyte = 0,  .pabit = 6, .cpu_features  = CPU_FTR_NOEXECUTE },
>   { .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
>  #ifdef CONFIG_PPC_RADIX_MMU
> - { .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
> + { .pabyte = 40, .pabit = 0, .mmu_features  = (MMU_FTR_TYPE_RADIX | 
> MMU_FTR_GTSE) },
>  #endif
>   { .pabyte = 1,  .pabit = 1, .invert = 1, .cpu_features = 
> CPU_FTR_NODSISRALIGN },
>   { .pabyte = 5,  .pabit = 0, .cpu_features  = CPU_FTR_REAL_LE,

Michael - Let me know if this should be folded into 1/3 and the complete
series resent.

Regards,
Bharata.


Re: [PATCH 1/2] mm/migrate: optimize migrate_vma_setup() for holes

2020-07-10 Thread Bharata B Rao
On Thu, Jul 09, 2020 at 09:57:10AM -0700, Ralph Campbell wrote:
> When migrating system memory to device private memory, if the source
> address range is a valid VMA range and there is no memory or a zero page,
> the source PFN array is marked as valid but with no PFN. This lets the
> device driver allocate private memory and clear it, then insert the new
> device private struct page into the CPU's page tables when
> migrate_vma_pages() is called. migrate_vma_pages() only inserts the
> new page if the VMA is an anonymous range. There is no point in telling
> the device driver to allocate device private memory and then not migrate
> the page. Instead, mark the source PFN array entries as not migrating to
> avoid this overhead.
> 
> Signed-off-by: Ralph Campbell 
> ---
>  mm/migrate.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b0125c082549..8aa434691577 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2204,9 +2204,13 @@ static int migrate_vma_collect_hole(unsigned long 
> start,
>  {
>   struct migrate_vma *migrate = walk->private;
>   unsigned long addr;
> + unsigned long flags;
> +
> + /* Only allow populating anonymous memory. */
> + flags = vma_is_anonymous(walk->vma) ? MIGRATE_PFN_MIGRATE : 0;
>  
>   for (addr = start; addr < end; addr += PAGE_SIZE) {
> - migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
> + migrate->src[migrate->npages] = flags;

I see a few other such cases where we directly populate MIGRATE_PFN_MIGRATE
w/o a pfn in migrate_vma_collect_pmd() and wonder why the vma_is_anonymous()
check can't help there as well?

1. pte_none() check in migrate_vma_collect_pmd()
2. is_zero_pfn() check in migrate_vma_collect_pmd()
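
Something along these lines is what I have in mind for the pte_none()
case - a rough, untested sketch against migrate_vma_collect_pmd(), with
the is_zero_pfn() case getting the same treatment:

	if (pte_none(pte)) {
		/* only anonymous VMAs can be populated at migrate time */
		if (vma_is_anonymous(walk->vma)) {
			mpfn = MIGRATE_PFN_MIGRATE;
			migrate->cpages++;
		}
		goto next;
	}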

Regards,
Bharata.


Re: [PATCH 2/5] mm/migrate: add a direction parameter to migrate_vma

2020-07-08 Thread Bharata B Rao
On Mon, Jul 06, 2020 at 03:23:44PM -0700, Ralph Campbell wrote:
> The src_owner field in struct migrate_vma is being used for two purposes,
> it implies the direction of the migration and it identifies device private
> pages owned by the caller. Split this into separate parameters so the
> src_owner field can be used just to identify device private pages owned
> by the caller of migrate_vma_setup().
> 
> Signed-off-by: Ralph Campbell 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 ++
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 ++
>  include/linux/migrate.h| 12 +---
>  lib/test_hmm.c |  2 ++
>  mm/migrate.c   |  5 +++--
>  5 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 09d8119024db..acbf14cd2d72 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -400,6 +400,7 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
> long start,
>   mig.end = end;
>   mig.src = &src_pfn;
>   mig.dst = &dst_pfn;
> + mig.dir = MIGRATE_VMA_FROM_SYSTEM;
>  
>   /*
>* We come here with mmap_lock write lock held just for
> @@ -578,6 +579,7 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned 
> long start,
>   mig.src = &src_pfn;
>   mig.dst = &dst_pfn;
>   mig.src_owner = &kvmppc_uvmem_pgmap;
> + mig.dir = MIGRATE_VMA_FROM_DEVICE_PRIVATE;

Reviewed-by: Bharata B Rao 

for the above kvmppc change.


Re: [PATCH 0/5] mm/migrate: avoid device private invalidations

2020-07-08 Thread Bharata B Rao
On Mon, Jul 06, 2020 at 03:23:42PM -0700, Ralph Campbell wrote:
> The goal for this series is to avoid device private memory TLB
> invalidations when migrating a range of addresses from system
> memory to device private memory and some of those pages have already
> been migrated. The approach taken is to introduce a new mmu notifier
> invalidation event type and use that in the device driver to skip
> invalidation callbacks from migrate_vma_setup(). The device driver is
> also then expected to handle device MMU invalidations as part of the
> migrate_vma_setup(), migrate_vma_pages(), migrate_vma_finalize() process.
> Note that this is opt-in. A device driver can simply invalidate its MMU
> in the mmu notifier callback and not handle MMU invalidations in the
> migration sequence.

In the kvmppc secure guest usecase,

1. We ensure that we don't issue migrate_vma() calls for pages that have
already been migrated to the device side (which is actually secure memory
for us that is managed by Ultravisor firmware)

2. The page table mappings on the device side (secure memory) are managed
transparently to the kernel by the Ultravisor firmware.

Hence I assume that no specific action would be required by the kvmppc
usecase due to this patchset. In fact, we never registered for these
mmu notifier events.

Regards,
Bharata.


Re: [PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-08 Thread Bharata B Rao
On Fri, Jul 03, 2020 at 05:59:14PM +0200, Laurent Dufour wrote:
> When a secure memslot is dropped, all the pages backed in the secure device
> (aka really backed by secure memory by the Ultravisor) should be paged out
> to a normal page. Previously, this was achieved by triggering the page
> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
> 
> This can't work when hot unplugging a memory slot because the memory slot
> is flagged as invalid and gfn_to_pfn() is then not trying to access the
> page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to directly calling it instead of triggering such a mechanism. This
> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.

Yes, this appears much simpler.

> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.
> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> addition, the mmap_sem is held in read mode during that time, not in write
> mode since the virtual memory layout is not impacted, and
> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> 
> Cc: Ram Pai 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
>  1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 852cc9ae6a0b..479ddf16d18c 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct 
> vm_area_struct *vma,
>   * fault on them, do fault time migration to replace the device PTEs in
>   * QEMU page table with normal PTEs from newly allocated pages.
>   */
> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>struct kvm *kvm, bool skip_page_out)
>  {
>   int i;
>   struct kvmppc_uvmem_page_pvt *pvt;
> - unsigned long pfn, uvmem_pfn;
> - unsigned long gfn = free->base_gfn;
> + struct page *uvmem_page;
> + struct vm_area_struct *vma = NULL;
> + unsigned long uvmem_pfn, gfn;
> + unsigned long addr, end;
> +
> + down_read(&kvm->mm->mmap_sem);

You should be using mmap_read_lock(kvm->mm) with recent kernels.

> +
> + addr = slot->userspace_addr;
> + end = addr + (slot->npages * PAGE_SIZE);
>  
> - for (i = free->npages; i; --i, ++gfn) {
> - struct page *uvmem_page;
> + gfn = slot->base_gfn;
> + for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
> +
> + /* Fetch the VMA if addr is not in the latest fetched one */
> + if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
> + vma = find_vma_intersection(kvm->mm, addr, end);
> + if (!vma ||
> + vma->vm_start > addr || vma->vm_end < end) {
> + pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
> + break;
> + }
> + }

The first find_vma_intersection() was called for the range spanning the
entire memslot, but you have code to check if vma remains valid for the
new addr in each iteration. Guess you wanted to get vma for one page at
a time and use it for subsequent pages until it covers the range?
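
In other words, something like the below is presumably what the lazy
fetch intends (untested sketch, reusing the names from the hunk above):

	/* refetch only when addr falls outside the VMA we already have */
	if (!vma || addr >= vma->vm_end) {
		vma = find_vma_intersection(kvm->mm, addr, addr + PAGE_SIZE);
		if (!vma || vma->vm_start > addr) {
			pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
			break;
		}
	}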

Regards,
Bharata.


Re: add a not device managed memremap_pages v3

2019-08-18 Thread Bharata B Rao
On Sun, Aug 18, 2019 at 11:05:53AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
> 
> Bharata has been working on secure page management for kvmppc guests,
> and one I thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> instrastructure for device private memory.
> 
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.

Tested kvmppc ultravisor patchset with migrate_vma changes and this
patchset. (Had to manually patch mm/memremap.c instead of kernel/memremap.c
though)

For the series,

Tested-by: Bharata B Rao 



Re: add a not device managed memremap_pages v2

2019-08-16 Thread Bharata B Rao
On Fri, Aug 16, 2019 at 08:54:30AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
> 
> Bharata has been working on secure page management for kvmppc guests,
> and one I thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> instrastructure for device private memory.
> 
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.

Tested this series with my patches that add secure page management
for kvmppc guests. These patches along with migrate_vma-cleanup
series are good-to-have to support secure guests on ultravisor enabled
POWER platforms.

Regards,
Bharata.



Re: [PATCH 5/5] memremap: provide a not device managed memremap_pages

2019-08-14 Thread Bharata B Rao
On Wed, Aug 14, 2019 at 08:11:50AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 13, 2019 at 10:26:11AM +0530, Bharata B Rao wrote:
> > Yes, this patchset works non-modular and with kvm-hv as module, it
> > works with devm_memremap_pages_release() and release_mem_region() in the
> > cleanup path. The cleanup path will be required in the non-modular
> > case too for proper recovery from failures.
> 
> Can you check if the version here:
> 
> git://git.infradead.org/users/hch/misc.git pgmap-remove-dev
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-remove-dev
> 
> works for you fully before I resend?

Yes, this works for us. This and the migrate-vma-cleanup series help to
really simplify the kvmppc secure pages management code. Thanks.

Regards,
Bharata.



Re: [PATCH 5/5] memremap: provide a not device managed memremap_pages

2019-08-12 Thread Bharata B Rao
On Mon, Aug 12, 2019 at 05:00:12PM +0200, Christoph Hellwig wrote:
> On Mon, Aug 12, 2019 at 08:20:58PM +0530, Bharata B Rao wrote:
> > On Sun, Aug 11, 2019 at 10:12:47AM +0200, Christoph Hellwig wrote:
> > > The kvmppc ultravisor code wants a device private memory pool that is
> > > system wide and not attached to a device.  Instead of faking up one
> > > provide a low-level memremap_pages for it.  Note that this function is
> > > not exported, and doesn't have a cleanup routine associated with it to
> > > discourage use from more driver like users.
> > 
> > The kvmppc secure pages management code will be part of kvm-hv which
> > can be built as module too. So it would require memremap_pages() to be
> > exported.
> > 
> > Additionally, non-dev version of the cleanup routine
> > devm_memremap_pages_release() or equivalent would also be required.
> > With device being present, put_device() used to take care of this
> > cleanup.
> 
> Oh well.  We can add them fairly easily if we really need to, but I
> tried to avoid that.  Can you try to see if this works non-modular
> for you for now until we hear more feedback from Dan?

Yes, this patchset works non-modular and with kvm-hv as module, it
works with devm_memremap_pages_release() and release_mem_region() in the
cleanup path. The cleanup path will be required in the non-modular
case too for proper recovery from failures.

Regards,
Bharata.



Re: [PATCH 5/5] memremap: provide a not device managed memremap_pages

2019-08-12 Thread Bharata B Rao
On Sun, Aug 11, 2019 at 10:12:47AM +0200, Christoph Hellwig wrote:
> The kvmppc ultravisor code wants a device private memory pool that is
> system wide and not attached to a device.  Instead of faking up one
> provide a low-level memremap_pages for it.  Note that this function is
> not exported, and doesn't have a cleanup routine associated with it to
> discourage use from more driver like users.

The kvmppc secure pages management code will be part of kvm-hv which
can be built as module too. So it would require memremap_pages() to be
exported.

Additionally, a non-dev version of the cleanup routine
devm_memremap_pages_release() or an equivalent would also be required.
With a device being present, put_device() used to take care of this
cleanup.
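
To make the requirement concrete, here is a rough sketch of how the
kvm-hv side would pair up with such an interface. The names and struct
fields follow my understanding of the current series and the 5.3-era
struct dev_pagemap (a device-private pgmap additionally needs a
dev_pagemap_ops with a page_free() callback, omitted here), and the
memunmap_pages()/release_mem_region() pair in the exit path stands in
for whatever the non-devm cleanup ends up being called:

static struct dev_pagemap secure_pgmap;

static int secure_mem_init(unsigned long size)
{
	struct resource *res;
	void *addr;

	res = request_free_mem_region(&iomem_resource, size, "secure-mem");
	if (IS_ERR(res))
		return PTR_ERR(res);

	secure_pgmap.type = MEMORY_DEVICE_PRIVATE;
	secure_pgmap.res = *res;
	addr = memremap_pages(&secure_pgmap, NUMA_NO_NODE);
	if (IS_ERR(addr)) {
		release_mem_region(res->start, resource_size(res));
		return PTR_ERR(addr);
	}
	return 0;
}

static void secure_mem_exit(void)
{
	memunmap_pages(&secure_pgmap);
	release_mem_region(secure_pgmap.res.start,
			   resource_size(&secure_pgmap.res));
}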

Regards,
Bharata.



Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest

2019-05-20 Thread Bharata B Rao
On Tue, May 21, 2019 at 12:55:49AM +1000, Nicholas Piggin wrote:
> Bharata B Rao's on May 21, 2019 12:29 am:
> > On Mon, May 20, 2019 at 01:50:35PM +0530, Bharata B Rao wrote:
> >> On Mon, May 20, 2019 at 05:00:21PM +1000, Nicholas Piggin wrote:
> >> > Bharata B Rao's on May 20, 2019 3:56 pm:
> >> > > On Mon, May 20, 2019 at 02:48:35PM +1000, Nicholas Piggin wrote:
> >> > >> >> > git bisect points to
> >> > >> >> >
> >> > >> >> > commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
> >> > >> >> > Author: Nicholas Piggin 
> >> > >> >> > Date:   Fri Jul 27 21:48:17 2018 +1000
> >> > >> >> >
> >> > >> >> > powerpc/64s: Fix page table fragment refcount race vs 
> >> > >> >> > speculative references
> >> > >> >> >
> >> > >> >> > The page table fragment allocator uses the main page 
> >> > >> >> > refcount racily
> >> > >> >> > with respect to speculative references. A customer observed 
> >> > >> >> > a BUG due
> >> > >> >> > to page table page refcount underflow in the fragment 
> >> > >> >> > allocator. This
> >> > >> >> > can be caused by the fragment allocator set_page_count 
> >> > >> >> > stomping on a
> >> > >> >> > speculative reference, and then the speculative failure 
> >> > >> >> > handler
> >> > >> >> > decrements the new reference, and the underflow eventually 
> >> > >> >> > pops when
> >> > >> >> > the page tables are freed.
> >> > >> >> >
> >> > >> >> > Fix this by using a dedicated field in the struct page for 
> >> > >> >> > the page
> >> > >> >> > table fragment allocator.
> >> > >> >> >
> >> > >> >> > Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory 
> >> > >> >> > wastage")
> >> > >> >> > Cc: sta...@vger.kernel.org # v3.10+
> >> > >> >> 
> >> > >> >> That's the commit that added the BUG_ON(), so prior to that you 
> >> > >> >> won't
> >> > >> >> see the crash.
> >> > >> > 
> >> > >> > Right, but the commit says it fixes page table page refcount 
> >> > >> > underflow by
> >> >> > introducing a new field page->pt_frag_refcount. Now we are hitting
> >> > >> > the underflow
> >> > >> > for this pt_frag_refcount.
> >> > >> 
> >> > >> The fixed underflow is caused by a bug (race on page count) that got 
> >> > >> fixed by that patch. You are hitting a different underflow here. It's
> >> > >> not certain my patch caused it, I'm just trying to reproduce now.
> >> > > 
> >> > > Ok.
> >> > 
> >> > Can't reproduce I'm afraid, tried adding and removing 8GB memory from a
> >> > 4GB guest (via host adding / removing memory device), and it just works.
> >> 
> >> Boot, add 8G, reboot, remove 8G is the sequence to reproduce.
> >> 
> >> > 
> >> > It's likely to be an edge case like an off by one or rounding error
> >> > that just happens to trigger in your config. Might be easiest if you
> >> > could test with a debug patch.
> >> 
> >> Sure, I will continue debugging.
> > 
> > When the guest is rebooted after hotplug, the entire memory (which includes
> > the hotplugged memory) gets remapped again freshly. However at this time
> > since no slab is available yet, pt_frag_refcount never gets initialized as 
> > we
> > never do pte_fragment_alloc() for these mappings. So we right away hit the
> > underflow during the first unplug itself, it looks like.
> 
> Nice catch, good debugging work.

Thanks, with help from Aneesh.

> 
> > I will check how this can be fixed.
> 
> Tricky problem. What do you think? You might be able to make the early 
> page table allocations in the same pattern as the frag allocations, and 
> then fill in the struct page metadata when you have those.

Will explore.

> 
> Other option may be create a new set of page tables after mm comes up
> to replace the early page tables with. That's a bigger hammer though.

Will also check if similar scenario exists on x86 and if so, how and when
pte frag data is fixed there.

Regards,
Bharata.



Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest

2019-05-20 Thread Bharata B Rao
On Mon, May 20, 2019 at 01:50:35PM +0530, Bharata B Rao wrote:
> On Mon, May 20, 2019 at 05:00:21PM +1000, Nicholas Piggin wrote:
> > Bharata B Rao's on May 20, 2019 3:56 pm:
> > > On Mon, May 20, 2019 at 02:48:35PM +1000, Nicholas Piggin wrote:
> > >> >> > git bisect points to
> > >> >> >
> > >> >> > commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
> > >> >> > Author: Nicholas Piggin 
> > >> >> > Date:   Fri Jul 27 21:48:17 2018 +1000
> > >> >> >
> > >> >> > powerpc/64s: Fix page table fragment refcount race vs 
> > >> >> > speculative references
> > >> >> >
> > >> >> > The page table fragment allocator uses the main page refcount 
> > >> >> > racily
> > >> >> > with respect to speculative references. A customer observed a 
> > >> >> > BUG due
> > >> >> > to page table page refcount underflow in the fragment 
> > >> >> > allocator. This
> > >> >> > can be caused by the fragment allocator set_page_count stomping 
> > >> >> > on a
> > >> >> > speculative reference, and then the speculative failure handler
> > >> >> > decrements the new reference, and the underflow eventually pops 
> > >> >> > when
> > >> >> > the page tables are freed.
> > >> >> >
> > >> >> > Fix this by using a dedicated field in the struct page for the 
> > >> >> > page
> > >> >> > table fragment allocator.
> > >> >> >
> > >> >> > Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> > >> >> > Cc: sta...@vger.kernel.org # v3.10+
> > >> >> 
> > >> >> That's the commit that added the BUG_ON(), so prior to that you won't
> > >> >> see the crash.
> > >> > 
> > >> > Right, but the commit says it fixes page table page refcount underflow 
> > >> > by
> > >> >> > introducing a new field page->pt_frag_refcount. Now we are hitting
> > >> > the underflow
> > >> > for this pt_frag_refcount.
> > >> 
> > >> The fixed underflow is caused by a bug (race on page count) that got 
> > >> fixed by that patch. You are hitting a different underflow here. It's
> > >> not certain my patch caused it, I'm just trying to reproduce now.
> > > 
> > > Ok.
> > 
> > Can't reproduce I'm afraid, tried adding and removing 8GB memory from a
> > 4GB guest (via host adding / removing memory device), and it just works.
> 
> Boot, add 8G, reboot, remove 8G is the sequence to reproduce.
> 
> > 
> > It's likely to be an edge case like an off by one or rounding error
> > that just happens to trigger in your config. Might be easiest if you
> > could test with a debug patch.
> 
> Sure, I will continue debugging.

When the guest is rebooted after hotplug, the entire memory (which includes
the hotplugged memory) gets remapped again freshly. However at this time
since no slab is available yet, pt_frag_refcount never gets initialized as we
never do pte_fragment_alloc() for these mappings. So we right away hit the
underflow during the first unplug itself, it looks like.

I will check how this can be fixed.

> 
> Regards,
> Bharata.



Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest

2019-05-20 Thread Bharata B Rao
On Mon, May 20, 2019 at 05:00:21PM +1000, Nicholas Piggin wrote:
> Bharata B Rao's on May 20, 2019 3:56 pm:
> > On Mon, May 20, 2019 at 02:48:35PM +1000, Nicholas Piggin wrote:
> >> >> > git bisect points to
> >> >> >
> >> >> > commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
> >> >> > Author: Nicholas Piggin 
> >> >> > Date:   Fri Jul 27 21:48:17 2018 +1000
> >> >> >
> >> >> > powerpc/64s: Fix page table fragment refcount race vs speculative 
> >> >> > references
> >> >> >
> >> >> > The page table fragment allocator uses the main page refcount 
> >> >> > racily
> >> >> > with respect to speculative references. A customer observed a BUG 
> >> >> > due
> >> >> > to page table page refcount underflow in the fragment allocator. 
> >> >> > This
> >> >> > can be caused by the fragment allocator set_page_count stomping 
> >> >> > on a
> >> >> > speculative reference, and then the speculative failure handler
> >> >> > decrements the new reference, and the underflow eventually pops 
> >> >> > when
> >> >> > the page tables are freed.
> >> >> >
> >> >> > Fix this by using a dedicated field in the struct page for the 
> >> >> > page
> >> >> > table fragment allocator.
> >> >> >
> >> >> > Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> >> >> > Cc: sta...@vger.kernel.org # v3.10+
> >> >> 
> >> >> That's the commit that added the BUG_ON(), so prior to that you won't
> >> >> see the crash.
> >> > 
> >> > Right, but the commit says it fixes page table page refcount underflow by
> >> > introducing a new field page->pt_frag_refcount. Now we are hitting the
> >> > underflow
> >> > for this pt_frag_refcount.
> >> 
> >> The fixed underflow is caused by a bug (race on page count) that got 
> >> fixed by that patch. You are hitting a different underflow here. It's
> >> not certain my patch caused it, I'm just trying to reproduce now.
> > 
> > Ok.
> 
> Can't reproduce I'm afraid, tried adding and removing 8GB memory from a
> 4GB guest (via host adding / removing memory device), and it just works.

Boot, add 8G, reboot, remove 8G is the sequence to reproduce.

> 
> It's likely to be an edge case like an off by one or rounding error
> that just happens to trigger in your config. Might be easiest if you
> could test with a debug patch.

Sure, I will continue debugging.

Regards,
Bharata.



Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest

2019-05-19 Thread Bharata B Rao
On Mon, May 20, 2019 at 02:48:35PM +1000, Nicholas Piggin wrote:
> >> > git bisect points to
> >> >
> >> > commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
> >> > Author: Nicholas Piggin 
> >> > Date:   Fri Jul 27 21:48:17 2018 +1000
> >> >
> >> > powerpc/64s: Fix page table fragment refcount race vs speculative 
> >> > references
> >> >
> >> > The page table fragment allocator uses the main page refcount racily
> >> > with respect to speculative references. A customer observed a BUG due
> >> > to page table page refcount underflow in the fragment allocator. This
> >> > can be caused by the fragment allocator set_page_count stomping on a
> >> > speculative reference, and then the speculative failure handler
> >> > decrements the new reference, and the underflow eventually pops when
> >> > the page tables are freed.
> >> >
> >> > Fix this by using a dedicated field in the struct page for the page
> >> > table fragment allocator.
> >> >
> >> > Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> >> > Cc: sta...@vger.kernel.org # v3.10+
> >> 
> >> That's the commit that added the BUG_ON(), so prior to that you won't
> >> see the crash.
> > 
> > Right, but the commit says it fixes page table page refcount underflow by
> > introducing a new field page->pt_frag_refcount. Now we are hitting the
> > underflow
> > for this pt_frag_refcount.
> 
> The fixed underflow is caused by a bug (race on page count) that got 
> fixed by that patch. You are hitting a different underflow here. It's
> not certain my patch caused it, I'm just trying to reproduce now.

Ok.

> 
> > 
> > BTW, if I go below this commit, I don't hit the pagecount
> > 
> > VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > 
> > which is in pte_fragment_free() path.
> 
> Do you have CONFIG_DEBUG_VM=y?

Yes.

Regards,
Bharata.



Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest

2019-05-19 Thread Bharata B Rao
On Mon, May 20, 2019 at 12:02:23PM +1000, Michael Ellerman wrote:
> Bharata B Rao  writes:
> > On Thu, May 16, 2019 at 07:44:20PM +0530, srikanth wrote:
> >> Hello,
> >> 
> >> On power9 host, performing memory hotunplug from ppc64le guest results in
> >> kernel oops.
> >> 
> >> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using
> >> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest.
> >> 
> >> Recreation steps:
> >> 
> >> 1. Boot a guest with below mem configuration:
> >>   33554432
> >>   8388608
> >>   4194304
> >>   
> >>     
> >>   
> >>     
> >>   
> >> 
> >> 2. From host hotplug 8G memory -> verify memory hotadded succesfully -> now
> >> reboot guest -> once guest comes back try to unplug 8G memory
> >> 
> >> mem.xml used:
> >> 
> >> 
> >> 8
> >> 0
> >> 
> >> 
> >> 
> >> Memory attach and detach commands used:
> >>     virsh attach-device vm1 ./mem.xml --live
> >>     virsh detach-device vm1 ./mem.xml --live
> >> 
> >> Trace seen inside guest after unplug, guest just hangs there forever:
> >> 
> >> [   21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
> >> [   21.963064] Oops: Exception in kernel mode, sig: 5 [#1]
> >> [   21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA
> >> pSeries
> >> [   21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse
> >> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi 
> >> scsi_transport_iscsi
> >> ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress
> >> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx
> >> xor raid6_pq multipath crc32c_vpmsum
> >> [   21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not
> >> tainted 5.1.0-dirty #2
> >> [   21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn
> >> [   21.963355] NIP:  c0079e18 LR: c0c79308 CTR:
> >> 8000
> >> [   21.963392] REGS: c003f88034f0 TRAP: 0700   Not tainted 
> >> (5.1.0-dirty)
> >> [   21.963422] MSR:  8282b033   
> >> CR:
> >> 28002884  XER: 2004
> >> [   21.963470] CFAR: c0c79304 IRQMASK: 0
> >> [   21.963470] GPR00: c0c79308 c003f8803780 c1521000
> >> 00fff8c0
> >> [   21.963470] GPR04: 0001 ffe30005 0005
> >> 0020
> >> [   21.963470] GPR08:  0001 c00a00fff8e0
> >> c16d21a0
> >> [   21.963470] GPR12: c16e7b90 c7ff2700 c00a00a0
> >> c003ffe30100
> >> [   21.963470] GPR16: c003ffe3 c14aa4de c00a009f
> >> c16d21b0
> >> [   21.963470] GPR20: c14de588 0001 c16d21b8
> >> c00a00a0
> >> [   21.963470] GPR24:   c00a00a0
> >> c003ffe96000
> >> [   21.963470] GPR28: c00a00a0 c00a00a0 c003fffec000
> >> c00a00fff8c0
> >> [   21.963802] NIP [c0079e18] pte_fragment_free+0x48/0xd0
> >> [   21.963838] LR [c0c79308] remove_pagetable+0x49c/0x5b4
> >> [   21.963873] Call Trace:
> >> [   21.963890] [c003f8803780] [c003ffe997f0] 0xc003ffe997f0
> >> (unreliable)
> >> [   21.963933] [c003f88037b0] [] (null)
> >> [   21.963969] [c003f88038c0] [c006f038]
> >> vmemmap_free+0x218/0x2e0
> >> [   21.964006] [c003f8803940] [c036f100]
> >> sparse_remove_one_section+0xd0/0x138
> >> [   21.964050] [c003f8803980] [c0383a50]
> >> __remove_pages+0x410/0x560
> >> [   21.964093] [c003f8803a90] [c0c784d8]
> >> arch_remove_memory+0x68/0xdc
> >> [   21.964136] [c003f8803ad0] [c0385d74]
> >> __remove_memory+0xc4/0x110
> >> [   21.964180] [c003f8803b10] [c00d44e4]
> >> dlpar_remove_lmb+0x94/0x140
> >> [   21.964223] [c003f8803b50] [c00d52b4]
> >> dlpar_memory+0x464/0xd00
> >> [   21.964259] [c003f8803be0] [c00cd5c0]
> >> handle_dlpar_errorlog+0xc0/0x190
> >> [   21.964303] [c003f8803c50] [c00cd6bc]
> >

Re: PROBLEM: Power9: kernel oops on memory hotunplug from ppc64le guest

2019-05-18 Thread Bharata B Rao
On Thu, May 16, 2019 at 07:44:20PM +0530, srikanth wrote:
> Hello,
> 
> On power9 host, performing memory hotunplug from ppc64le guest results in
> kernel oops.
> 
> Kernel used : https://github.com/torvalds/linux/tree/v5.1 built using
> ppc64le_defconfig for host and ppc64le_guest_defconfig for guest.
> 
> Recreation steps:
> 
> 1. Boot a guest with below mem configuration:
>   33554432
>   8388608
>   4194304
>   
>     
>   
>     
>   
> 
> 2. From host hotplug 8G memory -> verify memory hotadded succesfully -> now
> reboot guest -> once guest comes back try to unplug 8G memory
> 
> mem.xml used:
> 
> 
> 8
> 0
> 
> 
> 
> Memory attach and detach commands used:
>     virsh attach-device vm1 ./mem.xml --live
>     virsh detach-device vm1 ./mem.xml --live
> 
> Trace seen inside guest after unplug, guest just hangs there forever:
> 
> [   21.962986] kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!
> [   21.963064] Oops: Exception in kernel mode, sig: 5 [#1]
> [   21.963090] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA
> pSeries
> [   21.963131] Modules linked in: xt_tcpudp iptable_filter squashfs fuse
> vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi
> ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress
> raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx
> xor raid6_pq multipath crc32c_vpmsum
> [   21.963281] CPU: 11 PID: 316 Comm: kworker/u64:5 Kdump: loaded Not
> tainted 5.1.0-dirty #2
> [   21.963323] Workqueue: pseries hotplug workque pseries_hp_work_fn
> [   21.963355] NIP:  c0079e18 LR: c0c79308 CTR:
> 8000
> [   21.963392] REGS: c003f88034f0 TRAP: 0700   Not tainted (5.1.0-dirty)
> [   21.963422] MSR:  8282b033   CR:
> 28002884  XER: 2004
> [   21.963470] CFAR: c0c79304 IRQMASK: 0
> [   21.963470] GPR00: c0c79308 c003f8803780 c1521000
> 00fff8c0
> [   21.963470] GPR04: 0001 ffe30005 0005
> 0020
> [   21.963470] GPR08:  0001 c00a00fff8e0
> c16d21a0
> [   21.963470] GPR12: c16e7b90 c7ff2700 c00a00a0
> c003ffe30100
> [   21.963470] GPR16: c003ffe3 c14aa4de c00a009f
> c16d21b0
> [   21.963470] GPR20: c14de588 0001 c16d21b8
> c00a00a0
> [   21.963470] GPR24:   c00a00a0
> c003ffe96000
> [   21.963470] GPR28: c00a00a0 c00a00a0 c003fffec000
> c00a00fff8c0
> [   21.963802] NIP [c0079e18] pte_fragment_free+0x48/0xd0
> [   21.963838] LR [c0c79308] remove_pagetable+0x49c/0x5b4
> [   21.963873] Call Trace:
> [   21.963890] [c003f8803780] [c003ffe997f0] 0xc003ffe997f0
> (unreliable)
> [   21.963933] [c003f88037b0] [] (null)
> [   21.963969] [c003f88038c0] [c006f038]
> vmemmap_free+0x218/0x2e0
> [   21.964006] [c003f8803940] [c036f100]
> sparse_remove_one_section+0xd0/0x138
> [   21.964050] [c003f8803980] [c0383a50]
> __remove_pages+0x410/0x560
> [   21.964093] [c003f8803a90] [c0c784d8]
> arch_remove_memory+0x68/0xdc
> [   21.964136] [c003f8803ad0] [c0385d74]
> __remove_memory+0xc4/0x110
> [   21.964180] [c003f8803b10] [c00d44e4]
> dlpar_remove_lmb+0x94/0x140
> [   21.964223] [c003f8803b50] [c00d52b4]
> dlpar_memory+0x464/0xd00
> [   21.964259] [c003f8803be0] [c00cd5c0]
> handle_dlpar_errorlog+0xc0/0x190
> [   21.964303] [c003f8803c50] [c00cd6bc]
> pseries_hp_work_fn+0x2c/0x60
> [   21.964346] [c003f8803c80] [c013a4a0]
> process_one_work+0x2b0/0x5a0
> [   21.964388] [c003f8803d10] [c013a818]
> worker_thread+0x88/0x610
> [   21.964434] [c003f8803db0] [c0143884] kthread+0x1a4/0x1b0
> [   21.964468] [c003f8803e20] [c000bdc4]
> ret_from_kernel_thread+0x5c/0x78
> [   21.964506] Instruction dump:
> [   21.964527] fbe1fff8 f821ffd1 78638502 78633664 ebe9 7fff1a14
> 395f0020 813f0020
> [   21.964569] 7d2907b4 7d2900d0 79290fe0 69290001 <0b09> 7c0004ac
> 7d205028 3129
> [   21.964613] ---[ end trace aaa571aa1636fee6 ]---
> [   21.966349]
> [   21.966383] Sending IPI to other CPUs
> [   21.978335] IPI complete
> [   21.981354] kexec: Starting switchover sequence.
> I'm in purgatory

git bisect points to

commit 4231aba000f5a4583dd9f67057aadb68c3eca99d
Author: Nicholas Piggin 
Date:   Fri Jul 27 21:48:17 2018 +1000

powerpc/64s: Fix page table fragment refcount race vs speculative references

The page table fragment allocator uses the main page refcount racily
with respect to speculative references. A customer observed a BUG due
to page table page refcount underflow in the fragment allocator. This
can be caused by the fragment allocator set_page_count stomping on a

Re: Memory hotplug not increasing the total RAM

2018-01-30 Thread Bharata B Rao
On Tue, Jan 30, 2018 at 10:28:15AM +0100, Michal Hocko wrote:
> On Tue 30-01-18 10:16:00, Michal Hocko wrote:
> > On Tue 30-01-18 14:00:06, Bharata B Rao wrote:
> > > Hi,
> > > 
> > > With the latest upstream, I see that memory hotplug is not working
> > > as expected. The hotplugged memory isn't seen to increase the total
> > > RAM pages. This has been observed with both x86 and Power guests.
> > > 
> > > 1. Memory hotplug code initially marks pages as PageReserved via
> > > __add_section().
> > > 2. Later the struct page gets cleared in __init_single_page().
> > > 3. Next online_pages_range() increments totalram_pages only when
> > >PageReserved is set.
> > 
> > You are right. I have completely forgot about this late struct page
> > initialization during onlining. memory hotplug really doesn't want
> > zeroing. Let me think about a fix.
> 
> Could you test with the following please? Not an act of beauty but
> we are initializing memmap in sparse_add_one_section for memory
> hotplug. I hate how this is different from the initialization case
> but there is quite a long route to unify those two... So a quick
> fix should be as follows.

Tested on Power guest, fixes the issue. I can now see the total memory
size increasing after hotplug.

Regards,
Bharata.



Memory hotplug not increasing the total RAM

2018-01-30 Thread Bharata B Rao
Hi,

With the latest upstream, I see that memory hotplug is not working
as expected. The hotplugged memory isn't seen to increase the total
RAM pages. This has been observed with both x86 and Power guests.

1. Memory hotplug code initially marks pages as PageReserved via
__add_section().
2. Later the struct page gets cleared in __init_single_page().
3. Next online_pages_range() increments totalram_pages only when
   PageReserved is set.
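
Roughly, the check in step 3 behaves like this sketch (simplified, not the
exact mm/memory_hotplug.c code; the function name is made up):

static unsigned long online_range_sketch(unsigned long start_pfn,
					 unsigned long nr_pages)
{
	unsigned long pfn, onlined = 0;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/*
		 * Step 3: a page is only onlined and counted while the
		 * PageReserved mark from step 1 is still present.  If
		 * step 2 re-initialized the struct page without it, the
		 * body below never runs and totalram_pages stays put.
		 */
		if (!PageReserved(page))
			continue;

		ClearPageReserved(page);
		init_page_count(page);
		__free_page(page);
		totalram_pages++;
		onlined++;
	}

	return onlined;
}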

The step 2 has been introduced recently by the following commit:

commit f7f99100d8d95dbcf09e0216a143211e79418b9f
Author: Pavel Tatashin 
Date:   Wed Nov 15 17:36:44 2017 -0800

mm: stop zeroing memory during allocation in vmemmap

Reverting this commit restores the correct behaviour of memory hotplug.

Regards,
Bharata.



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-09-08 Thread Bharata B Rao
On Tue, Sep 08, 2015 at 01:46:52PM +0100, Dr. David Alan Gilbert wrote:
> * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote:
> > On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> > > * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote:
> > > > In fact I had successfully done postcopy migration of sPAPR guest with
> > > > this setup.
> > > 
> > > Interesting - I'd not got that far myself on power; I was hitting a 
> > > problem
> > > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab 
> > > stream (htab_shift=25) )
> > > 
> > > Did you have to make any changes to the qemu code to get that happy?
> > 
> > I should have mentioned that I tried only QEMU driven migration within
> > the same host using wp3-postcopy branch of your tree. I don't see the
> > above issue.
> > 
> > (qemu) info migrate
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: 
> > off compress: off x-postcopy-ram: on 
> > Migration status: completed
> > total time: 39432 milliseconds
> > downtime: 162 milliseconds
> > setup: 14 milliseconds
> > transferred ram: 1297209 kbytes
> > throughput: 270.72 mbps
> > remaining ram: 0 kbytes
> > total ram: 4194560 kbytes
> > duplicate: 734015 pages
> > skipped: 0 pages
> > normal: 318469 pages
> > normal bytes: 1273876 kbytes
> > dirty sync count: 4
> > 
> > I will try migration between different hosts soon and check.
> 
> I hit that on the same host; are you sure you've switched into postcopy mode;
> i.e. issued a migrate_start_postcopy before the end of migration?

Sorry I was following your discussion with Li in this thread

https://www.marc.info/?l=qemu-devel&m=143035620026744&w=4

and it wasn't obvious to me that anything apart from turning on the
x-postcopy-ram capability was required :(

So I do see the problem now.
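
For reference, the sequence needed on the source side is roughly the
following (a sketch; capability and command names as used by this postcopy
series, and the destination URI is just an example):

(qemu) migrate_set_capability x-postcopy-ram on
(qemu) migrate -d tcp:localhost:4444
(qemu) migrate_start_postcopy
(qemu) info migrate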

At the source
-
Error reading data from KVM HTAB fd: Bad file descriptor
Segmentation fault

At the target
-
htab_load() bad index 2113929216 (14336+0 entries) in htab stream 
(htab_shift=25)
qemu-system-ppc64: error while loading state section id 56(spapr/htab)
qemu-system-ppc64: postcopy_ram_listen_thread: loadvm failed: -22
qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 
0x1f: delta 0xffe1
qemu-system-ppc64: error while loading state for instance 0x0 of device 
'pci@8002000:00.0/virtio-net'
*** Error in `./ppc64-softmmu/qemu-system-ppc64': corrupted double-linked list: 
0x0100241234a0 ***
=== Backtrace: =
/lib64/power8/libc.so.6Segmentation fault



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-09-08 Thread Bharata B Rao
On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote:
> > In fact I had successfully done postcopy migration of sPAPR guest with
> > this setup.
> 
> Interesting - I'd not got that far myself on power; I was hitting a problem
> loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab 
> stream (htab_shift=25) )
> 
> Did you have to make any changes to the qemu code to get that happy?

I should have mentioned that I tried only QEMU driven migration within
the same host using wp3-postcopy branch of your tree. I don't see the
above issue.

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off 
compress: off x-postcopy-ram: on 
Migration status: completed
total time: 39432 milliseconds
downtime: 162 milliseconds
setup: 14 milliseconds
transferred ram: 1297209 kbytes
throughput: 270.72 mbps
remaining ram: 0 kbytes
total ram: 4194560 kbytes
duplicate: 734015 pages
skipped: 0 pages
normal: 318469 pages
normal bytes: 1273876 kbytes
dirty sync count: 4

I will try migration between different hosts soon and check.

Regards,
Bharata.



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-09-08 Thread Bharata B Rao
On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote:
> > On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> > > Hello Bharata,
> > > 
> > > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > > > May be it is a bit late to bring this up, but I needed the following fix
> > > > to userfault21 branch of your git tree to compile on powerpc.
> > > 
> > > Not late, just in time. I increased the number of syscalls in earlier
> > > versions, it must have gotten lost during a rejecting rebase, sorry.
> > > 
> > > I applied it to my tree and it can be applied to -mm and linux-next,
> > > thanks!
> > > 
> > > The syscall for arm32 are also ready and on their way to the arm tree,
> > > the testsuite worked fine there. ppc also should work fine if you
> > > could confirm it'd be interesting, just beware that I got a typo in
> > > the testcase:
> > 
> > The testsuite passes on powerpc.
> > 
> > 
> > running userfaultfd
> > 
> > nr_pages: 2040, nr_pages_per_cpu: 170
> > bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 
> > 2 96 13 128
> > bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 
> > 1 0
> > bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 
> > 114 96 1 0
> > bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
> > bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 
> > 71 20
> > bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
> > bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
> > bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
> > bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 
> > 3 6 0
> > bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 
> > 40 34 0
> > bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
> > bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
> > bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 
> > 0 0
> > bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
> > bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
> > bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
> > bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 
> > 0 180 75
> > bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 
> > 0 30
> > bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 
> > 68 1 2
> > bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 
> > 10 1
> > bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 
> > 51 37 1
> > bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 
> > 25 10
> > bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
> > bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
> > bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 
> > 7 4 5
> > bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 
> > 23 2
> > bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
> > bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
> > bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
> > bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
> > bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
> > bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
> > [PASS]
> 
> Hmm, not for me. See below.
> 
> What setup were you testing on Bharata?

I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

#uname -a
Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le 
GNU/Linux

In fact I had successfully done postcopy migration of sPAPR guest with
this setup.

> 
> Mine is:
> 
> $ uname -a
> Linux lebuntu 4.2.0-09705-g3a166acc1432 #2 SMP Tue Sep 8 15:18:00 AEST 2015 
> ppc64le ppc64le ppc64le GNU/Linux
> 
> Which is 7d9071a09502 plus a couple of powerpc patches.
> 
> $ zgrep USERFAULTFD /proc/config.gz
> CONFIG_USERFAULTFD=y
> 
> $ sudo ./userfaultfd 128 32
> nr_pages: 2048, nr_pages_per_cpu: 128
> bounces: 31, mode: rnd racing ver poll, error mutex 2 2
> error mutex 2 10



Re: [RFC PATCH] driver: base: memory: Maintain correct mem->end_section_nr when memory block is partially filled

2015-08-17 Thread Bharata B Rao
On Fri, Aug 14, 2015 at 10:27:53AM -0500, Nathan Fontenot wrote:
> On 08/13/2015 04:17 AM, Bharata B Rao wrote:
> > Last section of memory block is always initialized to
> > 
> > mem->start_section_nr + sections_per_block - 1
> > 
> > which will not be true for a section that doesn't contain sections_per_block
> > sections due to the memory size specified. This causes the following
> > kernel crash when memory blocks under a node are registered during reboot
> > that follows a memory hotplug operation on pseries guest.
> > 
> > Unable to handle kernel paging request for data at address 
> > 0xf03f0020
> > Faulting instruction address: 0xc07657cc
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > SMP NR_CPUS=1024 NUMA pSeries
> > 
> > Modules linked in:
> > 
> > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc6+ #48
> > task: c000ba3c ti: c0013c58 task.ti: c0013c58
> > NIP: c07657cc LR: c0592dbc CTR: 0400
> > REGS: c0013c5836f0 TRAP: 0300   Not tainted  (4.2.0-rc6+)
> > MSR: 80009032  MSR: 80009032 
> > <>  CR: 4848  XER: 
> >   CR: 4848  XER: 
> > CFAR: 3fff990f50ec CFAR: 3fff990f50ec DAR: f03f0020 DSISR: 
> > 4000 DAR: f03f0020 DSISR: 4000 SOFTE: 1 SOFTE: 1
> > GPR00: c0592dbc c0592dbc c0013c583970 c0013c583970 
> > c14f0300 c14f0300 003f 003f
> > GPR04:   c000f43b2900 c000f43b2900 
> > c000ba324668 c000ba324668 0001 0001
> > GPR08: c1540300 c1540300 f000 f000 
> > f03f f03f 0001 0001
> > GPR12: 2484 2484 cff2 cff2 
> > c000b5b0 c000b5b0  
> > GPR16:     
> >    
> > GPR20:     
> >    
> > GPR24: c188c380 c188c380   
> > 00014000 00014000 c18b54e8 c18b54e8
> > GPR28: c0013c06e800 c0013c06e800   
> >   fc00 fc00
> > 
> > NIP [c07657cc] .get_nid_for_pfn+0x2c/0x60
> > LR [c0592dbc] .register_mem_sect_under_node+0x8c/0x150
> > Call Trace:
> > [c0013c583970] [c056e44c] .put_device+0x2c/0x50
> > [c0013c5839f0] [c0592dbc] 
> > .register_mem_sect_under_node+0x8c/0x150
> > [c0013c583a80] [c05932b4] .register_one_node+0x2c4/0x380
> > [c0013c583b30] [c0c882b8] .topology_init+0x44/0x1e0
> > [c0013c583bf0] [c000ad30] .do_one_initcall+0x110/0x270
> > [c0013c583ce0] [c0c845d4] .kernel_init_freeable+0x278/0x360
> > [c0013c583db0] [c000b5d4] .kernel_init+0x24/0x130
> > [c0013c583e30] [c00094e8] .ret_from_kernel_thread+0x58/0x70
> > 
> > Fix this by updating the memory block to always contain the right
> > number of sections instead of assuming sections_per_block.
> > 
> > Signed-off-by: Bharata B Rao 
> > Cc: Nathan Fontenot 
> > ---
> >  drivers/base/memory.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 2804aed..7f3ce2e 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -645,6 +645,7 @@ static int add_memory_block(int base_section_nr)
> > if (ret)
> > return ret;
> > mem->section_count = section_count;
> > +mem->end_section_nr = mem->start_section_nr + section_count -1;
> 
> I think this change may be correct but makes me wonder if we need to update
> code elsewhere. There are places (at least in drivers/base/memory.c) that 
> assume
> a memory block contains sections_per_block sections.
> 
> Also, I think you may need to cc GregKH for this patch.
 
Hi Greg - Do you think the above is the right fix to the problem that is
described here ?

Regards,
Bharata.



[RFC PATCH] driver: base: memory: Maintain correct mem->end_section_nr when memory block is partially filled

2015-08-13 Thread Bharata B Rao
Last section of memory block is always initialized to

mem->start_section_nr + sections_per_block - 1

which will not be true for a memory block that doesn't contain
sections_per_block sections due to the memory size specified. This causes the following
kernel crash when memory blocks under a node are registered during reboot
that follows a memory hotplug operation on pseries guest.

Unable to handle kernel paging request for data at address 0xf03f0020
Faulting instruction address: 0xc07657cc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA pSeries

Modules linked in:

CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc6+ #48
task: c000ba3c ti: c0013c58 task.ti: c0013c58
NIP: c07657cc LR: c0592dbc CTR: 0400
REGS: c0013c5836f0 TRAP: 0300   Not tainted  (4.2.0-rc6+)
MSR: 80009032  MSR: 80009032 
<>  CR: 4848  XER: 
  CR: 4848  XER: 
CFAR: 3fff990f50ec CFAR: 3fff990f50ec DAR: f03f0020 DSISR: 
4000 DAR: f03f0020 DSISR: 4000 SOFTE: 1 SOFTE: 1
GPR00: c0592dbc c0592dbc c0013c583970 c0013c583970 
c14f0300 c14f0300 003f 003f
GPR04:   c000f43b2900 c000f43b2900 
c000ba324668 c000ba324668 0001 0001
GPR08: c1540300 c1540300 f000 f000 
f03f f03f 0001 0001
GPR12: 2484 2484 cff2 cff2 
c000b5b0 c000b5b0  
GPR16:     
   
GPR20:     
   
GPR24: c188c380 c188c380   
00014000 00014000 c18b54e8 c18b54e8
GPR28: c0013c06e800 c0013c06e800   
  fc00 fc00

NIP [c07657cc] .get_nid_for_pfn+0x2c/0x60
LR [c0592dbc] .register_mem_sect_under_node+0x8c/0x150
Call Trace:
[c0013c583970] [c056e44c] .put_device+0x2c/0x50
[c0013c5839f0] [c0592dbc] .register_mem_sect_under_node+0x8c/0x150
[c0013c583a80] [c05932b4] .register_one_node+0x2c4/0x380
[c0013c583b30] [c0c882b8] .topology_init+0x44/0x1e0
[c0013c583bf0] [c000ad30] .do_one_initcall+0x110/0x270
[c0013c583ce0] [c0c845d4] .kernel_init_freeable+0x278/0x360
[c0013c583db0] [c000b5d4] .kernel_init+0x24/0x130
[c0013c583e30] [c00094e8] .ret_from_kernel_thread+0x58/0x70

Fix this by updating the memory block to always contain the right
number of sections instead of assuming sections_per_block.
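
As a hypothetical illustration of the arithmetic: with sections_per_block = 16
and a block whose first section is 32 but which only received
section_count = 4 sections, end_section_nr is currently set to
32 + 16 - 1 = 47 even though only sections 32..35 exist; with the change
below it becomes 32 + 4 - 1 = 35.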

Signed-off-by: Bharata B Rao 
Cc: Nathan Fontenot 
---
 drivers/base/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2804aed..7f3ce2e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -645,6 +645,7 @@ static int add_memory_block(int base_section_nr)
if (ret)
return ret;
mem->section_count = section_count;
+mem->end_section_nr = mem->start_section_nr + section_count -1;
return 0;
 }
 
-- 
2.1.0



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-08-11 Thread Bharata B Rao
On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> Hello Bharata,
> 
> On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > May be it is a bit late to bring this up, but I needed the following fix
> > to userfault21 branch of your git tree to compile on powerpc.
> 
> Not late, just in time. I increased the number of syscalls in earlier
> versions, it must have gotten lost during a rejecting rebase, sorry.
> 
> I applied it to my tree and it can be applied to -mm and linux-next,
> thanks!
> 
> The syscall for arm32 are also ready and on their way to the arm tree,
> the testsuite worked fine there. ppc also should work fine if you
> could confirm it'd be interesting, just beware that I got a typo in
> the testcase:

The testsuite passes on powerpc.


running userfaultfd

nr_pages: 2040, nr_pages_per_cpu: 170
bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 
13 128
bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 
1 0
bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 
34 0
bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 
180 75
bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 
2
bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 
37 1
bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 
5
bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
[PASS]

Regards,
Bharata.



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-08-11 Thread Bharata B Rao
On Thu, May 14, 2015 at 07:31:16PM +0200, Andrea Arcangeli wrote:
> This activates the userfaultfd syscall.
> 
> Signed-off-by: Andrea Arcangeli 
> ---
>  arch/powerpc/include/asm/systbl.h  | 1 +
>  arch/powerpc/include/uapi/asm/unistd.h | 1 +
>  arch/x86/syscalls/syscall_32.tbl   | 1 +
>  arch/x86/syscalls/syscall_64.tbl   | 1 +
>  include/linux/syscalls.h   | 1 +
>  kernel/sys_ni.c| 1 +
>  6 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/systbl.h 
> b/arch/powerpc/include/asm/systbl.h
> index f1863a1..4741b15 100644
> --- a/arch/powerpc/include/asm/systbl.h
> +++ b/arch/powerpc/include/asm/systbl.h
> @@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create)
>  SYSCALL_SPU(bpf)
>  COMPAT_SYS(execveat)
>  PPC64ONLY(switch_endian)
> +SYSCALL_SPU(userfaultfd)
> diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
> b/arch/powerpc/include/uapi/asm/unistd.h
> index e4aa173..6ad58d4 100644
> --- a/arch/powerpc/include/uapi/asm/unistd.h
> +++ b/arch/powerpc/include/uapi/asm/unistd.h
> @@ -386,5 +386,6 @@
>  #define __NR_bpf 361
>  #define __NR_execveat362
>  #define __NR_switch_endian   363
> +#define __NR_userfaultfd 364

May be it is a bit late to bring this up, but I needed the following fix
to userfault21 branch of your git tree to compile on powerpc.


powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd

From: Bharata B Rao 

With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
Reflect the same in __NR_syscalls.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/unistd.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index f4f8b66..4a055b6 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls  364
+#define __NR_syscalls  365
 
 #define __NR__exit __NR_exit
 #define NR_syscalls __NR_syscalls



Re: powerpc,numa: Memory hotplug to memory-less nodes ?

2015-06-23 Thread Bharata B Rao
So will it be correct to say that memory hotplug to a memory-less node
isn't supported by the PowerPC kernel ? Should I enforce the same in QEMU
for PowerKVM ?

On Mon, Jun 22, 2015 at 10:18 AM, Bharata B Rao  wrote:
> Hi,
>
> While developing memory hotplug support in QEMU for PowerKVM, I
> realized that guest kernel has specific checks to prevent hot addition
> of memory to a memory-less node.
>
> I am referring to arch/powerpc/mm/numa.c:hot_add_scn_to_nid() which
> has explicit checks to ensure that it returns a nid that has some some
> memory (NODE_DATA(nid)->node_spanned_pages) even when user wants to
> hotplug to a node that currently has zero memory.
>
> Is this limitation by design ?
>
> Regards,
> Bharata.
> --
> http://raobharata.wordpress.com/



-- 
http://raobharata.wordpress.com/


powerpc,numa: Memory hotplug to memory-less nodes ?

2015-06-21 Thread Bharata B Rao
Hi,

While developing memory hotplug support in QEMU for PowerKVM, I
realized that the guest kernel has specific checks to prevent hot addition
of memory to a memory-less node.

I am referring to arch/powerpc/mm/numa.c:hot_add_scn_to_nid() which
has explicit checks to ensure that it returns a nid that has some
memory (NODE_DATA(nid)->node_spanned_pages) even when the user wants to
hotplug to a node that currently has zero memory.
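
The kind of check I am referring to looks roughly like this (a simplified
sketch, not the actual arch/powerpc/mm/numa.c code; the helper name is
made up):

static int hot_add_nid_sketch(int nid)
{
	/*
	 * The check in question: refuse a memory-less target node and
	 * fall back to a node that already spans some memory.
	 */
	if (!node_online(nid) || NODE_DATA(nid)->node_spanned_pages == 0)
		nid = first_online_node;

	return nid;
}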

Is this limitation by design ?

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


Re: [RFC v0 PATCH] kvm: powerpc: Allow reuse of vCPU object

2015-03-15 Thread Bharata B Rao
Any feedback on the below patch ?

On Mon, Mar 9, 2015 at 11:00 AM,   wrote:
> From: Bharata B Rao 
>
> Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU)
> correctly, certain work arounds have to be employed to allow reuse of
> vcpu array slot in KVM during cpu hot plug/unplug from guest. One such
> proposed workaround is to park the vcpu fd in userspace during cpu unplug
> and reuse it later during next hotplug.
>
> More details can be found here:
> KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html
> QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html
>
> In order to support this workaround with PowerPC KVM, don't create or
> initialize ICP if the vCPU is found to be already associated with an ICP.
>
> Signed-off-by: Bharata B Rao 
> ---
> Note: It is not sure at the moment if "park vcpu and reuse" approach will
> be acceptable to KVM/QEMU community, but nevertheless I wanted to check
> if this little patch is harmful or not.
>
>  arch/powerpc/kvm/book3s_xics.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
> index a4a8d9f..ead3a35 100644
> --- a/arch/powerpc/kvm/book3s_xics.c
> +++ b/arch/powerpc/kvm/book3s_xics.c
> @@ -1313,8 +1313,13 @@ int kvmppc_xics_connect_vcpu(struct kvm_device *dev, 
> struct kvm_vcpu *vcpu,
> return -EPERM;
> if (xics->kvm != vcpu->kvm)
> return -EPERM;
> -   if (vcpu->arch.irq_type)
> -   return -EBUSY;
> +
> +   /*
> +* If irq_type is already set, don't reinitialize but
> +* return success allowing this vcpu to be reused.
> +*/
> +   if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
> +   return 0;
>
> r = kvmppc_xics_create_icp(vcpu, xcpu);
> if (!r)
> --
> 2.1.0
>



-- 
http://raobharata.wordpress.com/


Re: [PATCH] powerpc: use device_online/offline() instead of cpu_up/down()

2014-11-01 Thread Bharata B Rao
On Fri, Oct 31, 2014 at 03:41:34PM -0400, Dan Streetman wrote:
> In powerpc pseries platform dlpar operations, Use device_online() and
> device_offline() instead of cpu_up() and cpu_down().
> 
> Calling cpu_up/down directly does not update the cpu device offline
> field, which is used to online/offline a cpu from sysfs.  Calling
> device_online/offline instead keeps the sysfs cpu online value correct.
> The hotplug lock, which is required to be held when calling
> device_online/offline, is already held when dlpar_online/offline_cpu
> are called, since they are called only from cpu_probe|release_store.
> 
> This patch fixes errors on PowerVM systems that have cpu(s) added/removed
> using dlpar operations; without this patch, the
> /sys/devices/system/cpu/cpuN/online nodes do not correctly show the
> online state of added/removed cpus.
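
A rough sketch of the distinction being described (simplified; not the
actual pseries dlpar code, and the wrapper name is made up):

static int dlpar_online_cpu_sketch(unsigned int cpu)
{
	struct device *dev = get_cpu_device(cpu);

	/*
	 * cpu_up(cpu) alone brings the cpu online but leaves dev->offline
	 * stale, so /sys/devices/system/cpu/cpuN/online keeps reporting
	 * the old state.  device_online() runs the bus online callback
	 * (which ends up calling cpu_up()) and also updates dev->offline.
	 * The device hotplug lock is assumed to be held by the caller,
	 * as the patch description notes.
	 */
	return dev ? device_online(dev) : -ENODEV;
}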

Verified the patch to be working as expected when I online and offline
CPUs of a PowerKVM guest using QEMU (plus my RFC hotplug patchset for
QEMU)

Regards,
Bharata.



Re: [PATCH] pseries: Make CPU hotplug path endian safe

2014-09-05 Thread Bharata B Rao
On Fri, Sep 5, 2014 at 7:38 PM, Nathan Fontenot
<nf...@linux.vnet.ibm.com> wrote:
> On 09/05/2014 04:16 AM, bharata@gmail.com wrote:
>> From: Bharata B Rao 
>>
>> - ibm,rtas-configure-connector should treat the RTAS data as big endian.
>> - Treat ibm,ppc-interrupt-server#s as big-endian when setting
>>   smp_processor_id during hotplug.
>>
>> Signed-off-by: Bharata B Rao 
>> ---
>>  arch/powerpc/platforms/pseries/dlpar.c   | 10 +-
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  4 ++--
>>  2 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
>> b/arch/powerpc/platforms/pseries/dlpar.c
>> index 2d0b4d6..dc55f9c 100644
>> --- a/arch/powerpc/platforms/pseries/dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/dlpar.c
>> @@ -48,11 +48,11 @@ static struct property *dlpar_parse_cc_property(struct 
>> cc_workarea *ccwa)
>>   if (!prop)
>>   return NULL;
>>
>> - name = (char *)ccwa + ccwa->name_offset;
>> + name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
>>   prop->name = kstrdup(name, GFP_KERNEL);
>>
>> - prop->length = ccwa->prop_length;
>> - value = (char *)ccwa + ccwa->prop_offset;
>> + prop->length = be32_to_cpu(ccwa->prop_length);
>> + value = (char *)ccwa + be32_to_cpu(ccwa->prop_offset);
>>   prop->value = kmemdup(value, prop->length, GFP_KERNEL);
>>   if (!prop->value) {
>>   dlpar_free_cc_property(prop);
>> @@ -78,7 +78,7 @@ static struct device_node *dlpar_parse_cc_node(struct 
>> cc_workarea *ccwa,
>>   if (!dn)
>>   return NULL;
>>
>> - name = (char *)ccwa + ccwa->name_offset;
>> + name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
>>   dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
>>   if (!dn->full_name) {
>>   kfree(dn);
>> @@ -148,7 +148,7 @@ struct device_node *dlpar_configure_connector(u32 
>> drc_index,
>>   return NULL;
>>
>>   ccwa = (struct cc_workarea *)_buf[0];
>> - ccwa->drc_index = drc_index;
>> + ccwa->drc_index = cpu_to_be32(drc_index);
>
> I need to look at this some more but I think this may cause an issue for
> partition migration. If I am following the code correctly, starting in
> pseries_devicetree_update(), the drc_index value passed to
> dlpar_configure_connector is pulled directly out of a buffer we get from
> firmware. This would mean the drc_index value is already in BE format.

Yes I see that now.

>
> Whereas for cpu hotplug the drc_index value is passed in from userspace
> via the cpu probe interface in sysfs. I assume that you are seeing the
> drc_index value getting passed in in LE format.

Yes I am seeing drc_index in LE format for an LE guest during CPU
hotplug operation.
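
To make the two cases concrete, a small sketch (hypothetical struct and
helper names, mirroring but not copied from the pseries code) of converting
a drc_index that arrives in CPU byte order, e.g. from a sysfs write on an
LE guest, into the big-endian form the RTAS work area expects:

#include <linux/types.h>
#include <asm/byteorder.h>

struct example_cc_workarea {
	__be32 drc_index;	/* RTAS reads this as big endian */
	__be32 zero;
	__be32 name_offset;
	__be32 prop_length;
	__be32 prop_offset;
};

static void example_fill_ccwa(struct example_cc_workarea *ccwa, u32 index)
{
	/* index is in CPU byte order here (LE on an LE guest) */
	ccwa->drc_index = cpu_to_be32(index);
}

When the index instead comes straight out of a firmware-provided buffer, as
in the migration path Nathan describes, it is already big endian and must
not be converted a second time.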

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Bharata B Rao
On Thu, Dec 06, 2007 at 11:01:18AM +0100, Jan Blunck wrote:
> On Wed, Dec 05, Dave Hansen wrote:
> 
> > I think the key here is what kind of consistency we're trying to
> > provide.  If a directory is being changed underneath a reader, what
> > kinds of guarantees do they get about the contents of their directory
> > read?  When do those guarantees start?  Are there any at open() time?
> 
> But we still want to be compliant to what POSIX defines. The problem isn't the
> consistency of the readdir result but the seekdir/telldir interface. IMHO that
> interface is totally broken: you need to be able to find every offset given by
> telldir since the last open. The problem is that seekdir isn't able to return
> errors. Otherwise you could just forbid seeking on union directories.

Also, what kind of consistency is expected when a directory is open(2)ed
and readdir(2) and lseek(2) are applied to it while the directory gets
changed underneath the reader? From this:
http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html
the behaviour/guarantees weren't apparent to me.

> 
> > Rather than give each _dirent_ an offset, could we give each sub-mount
> > an offset?  Let's say we have three members comprising a union mount
> > directory.  The first has 100 dirents, the second 200, and the third
> > 10,000.  When the first readdir is done, we populate the table like
> > this:
> > 
> > mount_offset[0] = 0;
> > mount_offset[1] = 100;
> > mount_offset[2] = 300;
> > 
> > If someone seeks back to 150, then we subtrack the mount[1]'s offset
> > (100), and realize that we want the 50th dirent from mount[1].
> 
> Yes, that is a nice idea and it is exactly what I have implemented in my patch
> series. But you forgot one thing: directories are not flat files. The dentry
> offset in a directory is a random cookie. Therefore it is not possible to have
> a linear mapping without allocating memory.

And I defined this linear behaviour on the cache of dirents we maintain
in the approach I posted. The main reason we maintain a cache of
dirents in memory is duplicate elimination.
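
A tiny sketch of the per-member offset table idea (hypothetical names, and
assuming each member's dirent count is already known), just to make the
offset arithmetic explicit:

#include <linux/types.h>

struct union_member_example {
	loff_t base;		/* first union-wide offset of this member */
	loff_t nr_dirents;	/* entries this member contributed */
};

/* Map a union-wide offset to (member index, offset local to that member). */
static int example_map_offset(const struct union_member_example *m, int nr,
			      loff_t pos, loff_t *local)
{
	int i;

	for (i = nr - 1; i >= 0; i--) {
		if (pos >= m[i].base) {
			*local = pos - m[i].base;
			return i;
		}
	}
	return -1;	/* offset before the first member: invalid */
}

With the numbers quoted above, offset 150 maps to member 1 at local offset
50. The catch, as pointed out, is that the local offset is linear while real
directory offsets are opaque cookies, so some mapping (i.e. the dirent
cache) is still needed.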

> 
> > I don't know whether we're bound to this:
> > 
> > http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
> > 
> > "If a file is removed from or added to the directory after the
> > most recent call to opendir() or rewinddir(), whether a
> > subsequent call to readdir() returns an entry for that file is
> > unspecified."
> > 
> > But that would seem to tell me that once you populate a table such as
> > the one I've described and create it at open(dir) time, you don't
> > actually ever need to update it.
> 
> Yes, I'm using such a patch on our S390 buildservers to work around some
> readdir/seek/rm problem with old glibc versions. It seems to work but on the
> other hand this are really huge systems and I haven't run out of memory while
> doing a readdir yet ;)
> 
> The proper way to implement this would be to cache the offsets on a per inode
> base. Otherwise the user could easily DoS this by opening a number of
> directories and never close them.
> 

You mean cache the offsets or dirents ? How would that solve
the seek problem ? How would it enable you to define a seek behaviour
for the entire union of directories ?

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 3/5] Add list_for_each_entry_reverse_from()

2007-12-05 Thread Bharata B Rao
Introduce list_for_each_entry_reverse_from() needed by a subsequent patch.

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 include/linux/list.h |   13 +
 1 file changed, 13 insertions(+)

--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -562,6 +562,19 @@ static inline void list_splice_init_rcu(
 pos = list_entry(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_reverse_from - iterate backwards over list of given
+ * type from the current point
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_struct within the struct.
+ *
+ * Iterate backwards over list of given type, continuing from current position.
+ */
+#define list_for_each_entry_reverse_from(pos, head, member)\
+   for (; prefetch(pos->member.prev), >member != (head);  \
+pos = list_entry(pos->member.prev, typeof(*pos), member))
+
+/**
  * list_for_each_entry_safe - iterate over list of given type safe against 
removal of list entry
  * @pos:   the type * to use as a loop cursor.
  * @n: another type * to use as temporary storage
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
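
A minimal usage sketch for the macro added above (hypothetical struct and
helper, not taken from the patch series): resume a backwards walk from a
saved cursor entry, visiting the cursor itself first:

#include <linux/list.h>

struct item_example {
	struct list_head list;
	int seq;
};

/* Walk backwards from "pos" (inclusive) towards the head of the list,
 * e.g. to undo work on entries that were already visited. */
static void example_rewind(struct list_head *head, struct item_example *pos)
{
	list_for_each_entry_reverse_from(pos, head, list)
		pos->seq = 0;
}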


[RFC PATCH 5/5] Directory cache invalidation

2007-12-05 Thread Bharata B Rao
Changes to keep the dirent cache up to date.

The dirent cache stored as part of the topmost directory's struct file needs to
be marked stale whenever there is a modification in any of the directories
that is part of the union. Modifications (like addition/deletion of
entries) to a directory can occur from places like mkdir, rmdir, mknod etc.

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 fs/dcache.c|1 
 fs/namei.c |   13 +++
 fs/union.c |  178 -
 include/linux/dcache.h |4 -
 include/linux/fs.h |4 +
 include/linux/union.h  |3 
 6 files changed, 201 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -974,6 +974,7 @@ struct dentry *d_alloc(struct dentry * p
 #endif
 #ifdef CONFIG_UNION_MOUNT
	INIT_LIST_HEAD(&dentry->d_unions);
+   INIT_LIST_HEAD(&dentry->d_overlaid);
	dentry->d_unionized = 0;
 #endif
	INIT_HLIST_NODE(&dentry->d_hash);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2237,6 +2237,7 @@ static int __open_namei_create(struct na
nd->path.dentry = path->dentry;
if (error)
return error;
+   rdcache_invalidate(&nd->path);
/* Don't check for write permission, don't truncate */
return may_open(nd, 0, flag & ~O_TRUNC);
 }
@@ -2682,6 +2683,8 @@ asmlinkage long sys_mknodat(int dfd, con
  mode, 0);
break;
}
+   if (!error)
+   rdcache_invalidate(&nd.path);
	mnt_drop_write(nd.path.mnt);
 out_dput:
	path_put_conditional(&path, &nd);
@@ -2757,6 +2760,8 @@ asmlinkage long sys_mkdirat(int dfd, con
if (error)
goto out_dput;
error = vfs_mkdir(nd.path.dentry->d_inode, path.dentry, mode);
+   if (!error)
+   rdcache_invalidate(&nd.path);
	mnt_drop_write(nd.path.mnt);
 out_dput:
	path_put_conditional(&path, &nd);
@@ -3287,6 +3292,8 @@ static long do_rmdir(int dfd, const char
if (error)
goto exit3;
error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
+   if (!error)
+   rdcache_invalidate(&nd.path);
	mnt_drop_write(nd.path.mnt);
 exit3:
	path_put_conditional(&path, &nd);
@@ -3375,6 +3382,8 @@ static long do_unlinkat(int dfd, const c
if (error)
goto exit2;
error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
+   if (!error)
+   rdcache_invalidate(&nd.path);
	mnt_drop_write(nd.path.mnt);
exit2:
	path_put_conditional(&path, &nd);
@@ -3466,6 +3475,8 @@ asmlinkage long sys_symlinkat(const char
goto out_dput;
error = vfs_symlink(nd.path.dentry->d_inode, path.dentry, from,
S_IALLUGO);
+   if (!error)
+   rdcache_invalidate(&nd.path);
	mnt_drop_write(nd.path.mnt);
 out_dput:
	path_put_conditional(&path, &nd);
@@ -3566,6 +3577,8 @@ asmlinkage long sys_linkat(int olddfd, c
goto out_dput;
error = vfs_link(old_nd.path.dentry, nd.path.dentry->d_inode,
 path.dentry);
+   if (!error)
+   rdcache_invalidate(&nd.path);
	mnt_drop_write(nd.path.mnt);
 out_dput:
	path_put_conditional(&path, &nd);
--- a/fs/union.c
+++ b/fs/union.c
@@ -39,6 +39,7 @@ static struct hlist_head *union_hashtabl
 static unsigned int union_rhash_mask __read_mostly;
 static unsigned int union_rhash_shift __read_mostly;
 static struct hlist_head *union_rhashtable __read_mostly;
+static struct hlist_head *readdir_hashtable __read_mostly;
 
 /*
  * Locking Rules:
@@ -103,6 +104,18 @@ static int __init init_union(void)
for (loop = 0; loop < (1 << union_rhash_shift); loop++)
	INIT_HLIST_HEAD(&union_rhashtable[loop]);
 
+   readdir_hashtable = alloc_large_system_hash("readdir-cache",
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ &union_rhash_shift,
+ &union_rhash_mask,
+ 0);
+
+   for (loop = 0; loop < (1 << union_rhash_shift); loop++)
+   INIT_HLIST_HEAD(&readdir_hashtable[loop]);
+
readdir_cache = kmem_cache_create("readdir-cache",
sizeof(struct rdcache_entry), 0,
SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
@@ -126,6 +139,7 @@ struct union_mount *union_alloc(struct d
	atomic_set(&um->u_count, 1);
	INIT_LIST_HEAD(&um->u_unions);
	INIT_LIST_HEAD(&um->u_list);
+   INIT_LIST_HEAD(&um->u_overlaid);
	INIT_HLIST_NODE(&um->u_hash)

[RFC PATCH 4/5] Directory seek support

2007-12-05 Thread Bharata B Rao
Directory seek support.

Define the seek behaviour on the stored cache of dirents.

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 fs/read_write.c   |   11 ---
 fs/union.c|  171 +-
 include/linux/fs.h|8 ++
 include/linux/union.h |   25 +++
 4 files changed, 205 insertions(+), 10 deletions(-)

--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "read_write.h"
 
 #include 
@@ -116,15 +117,7 @@ EXPORT_SYMBOL(default_llseek);
 
 loff_t vfs_llseek(struct file *file, loff_t offset, int origin)
 {
-   loff_t (*fn)(struct file *, loff_t, int);
-
-   fn = no_llseek;
-   if (file->f_mode & FMODE_LSEEK) {
-   fn = default_llseek;
-   if (file->f_op && file->f_op->llseek)
-   fn = file->f_op->llseek;
-   }
-   return fn(file, offset, origin);
+   return do_llseek(file, offset, origin);
 }
 EXPORT_SYMBOL(vfs_llseek);
 
--- a/fs/union.c
+++ b/fs/union.c
@@ -614,6 +614,7 @@ static int rdcache_add_entry(struct rdst
this->dtype = d_type;
INIT_LIST_HEAD(>list);
list_add_tail(>list, list);
+   r->cur_dirent = this;
return 0;
 }
 
@@ -636,18 +637,96 @@ static int filldir_union(void *buf, cons
	if (rdcache_find_entry(&r->dirent_cache, name, namlen))
return 0;
 
-   err =  cb->filldir(cb->buf, name, namlen, r->cur_off,
+   /* We come here with NULL cb->filldir from lseek path */
+   if (cb->filldir)
+   err =  cb->filldir(cb->buf, name, namlen, r->cur_off,
ino, d_type);
if (err >= 0) {
		rdcache_add_entry(r, &r->dirent_cache,
name, namlen, offset, ino, d_type);
r->cur_off = ++r->last_off;
r->nr_dirents++;
+   if (r->cur_off == r->fill_off) {
+   /* We filled up to the required seek offset */
+   r->fill_off = 0;
+   err = -EINVAL;
+   }
}
cb->error = err;
return err;
 }
 
+/*
+ * This is called when current offset in rdcache gets changed and when
+ * we need to change the current underlying directory in the rdstate
+ * to match the current offset.
+ */
+static void update_rdstate(struct file *file)
+{
+   struct rdstate *r = file->f_rdstate;
+   loff_t off = r->cur_off;
+   struct union_mount *um;
+
+   if (!(r->flags & RDSTATE_NEED_UPDATE))
+   return;
+
+   spin_lock(&union_lock);
+   um = union_lookup(file->f_path.dentry, file->f_path.mnt);
+   spin_unlock(&union_lock);
+   if (!um)
+   goto out;
+   off -= um->nr_dirents;
+   path_put(&r->cur_path);
+   r->cur_path = file->f_path;
+   path_get(&r->cur_path);
+
+   while (off > 0) {
+   spin_lock(&union_lock);
+   um = union_lookup(r->cur_path.dentry, r->cur_path.mnt);
+   spin_unlock(&union_lock);
+   if (!um)
+   goto out;
+   off -= um->nr_dirents;
+   path_put(&r->cur_path);
+   r->cur_path = um->u_next;
+   path_get(&r->cur_path);
+   }
+out:
+   r->file_off = r->cur_dirent->off;
+}
+
+/*
+ * Returns dirents from rdcache to userspace.
+ */
+static int readdir_rdcache(struct file *file, struct rdcache_callback *cb)
+{
+   struct rdstate *r = cb->rdstate;
+   struct rdcache_entry *tmp = r->cur_dirent;
+   int err = 0;
+
+   BUG_ON(r->cur_off > r->last_off);
+
+   /* If offsets already uptodate, just return */
+   if (likely(r->cur_off == r->last_off))
+   return 0;
+
+   /*
+* return the entries from cur_off till last_off from rdcache to
+* user space.
+*/
+   list_for_each_entry_from(tmp, &r->dirent_cache, list) {
+   err =  cb->filldir(cb->buf, tmp->name.name, tmp->name.len,
+   r->cur_off, tmp->ino, tmp->dtype);
+   r->cur_dirent = tmp;
+   if (err < 0)
+   break;
+   r->cur_off++;
+   r->flags |= RDSTATE_NEED_UPDATE;
+   }
+   update_rdstate(file);
+   return err;
+}
+
 /* Called from last fput() */
 void put_rdstate(struct rdstate *rdstate)
 {
@@ -710,6 +789,10 @@ int readdir_union(struct file *file, voi
cb.rdstate = rdstate;
cb.error = 0;
 
+   err = readdir_rdcache(file, &cb);
+   if (err)
+   return err;
+
offset = rdstate->file_off;
 
/* Read from the topmost directory */
@@ -796,6 +879,92 @@ out:
return err;
 }
 
+static

[RFC PATCH 2/5] Add New directory listing approach

2007-12-05 Thread Bharata B Rao
Another readdir implementation for union mounted directories.

Reads dirents from all layers of the union into a cache and eliminates
duplicates before returning them to userspace. The cache is stored persistently
as part of the struct file of the topmost directory. Instead of the original
directory offsets, offsets are defined as linearly increasing indices on this
cache and those indices are returned to userspace.

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 fs/file_table.c   |1 
 fs/readdir.c  |   10 -
 fs/union.c|  281 ++
 include/linux/fs.h|   30 +
 include/linux/union.h |   28 
 5 files changed, 342 insertions(+), 8 deletions(-)

--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -286,6 +286,7 @@ void fastcall __fput(struct file *file)
drop_file_write_access(file);
 
put_pid(file->f_owner.pid);
+   put_rdstate(file->f_rdstate);
file_kill(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,12 +16,12 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
 int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
int res = -ENOTDIR;
 
if (!file->f_op || !file->f_op->readdir)
@@ -31,13 +31,7 @@ int vfs_readdir(struct file *file, filld
if (res)
goto out;
 
-   mutex_lock(>i_mutex);
-   res = -ENOENT;
-   if (!IS_DEADDIR(inode)) {
-   res = file->f_op->readdir(file, buf, filler);
-   file_accessed(file);
-   }
-   mutex_unlock(>i_mutex);
+   res = do_readdir(file, buf, filler);
 out:
return res;
 }
--- a/fs/union.c
+++ b/fs/union.c
@@ -46,8 +46,10 @@ static struct hlist_head *union_rhashtab
  * - union_lock
  */
 DEFINE_SPINLOCK(union_lock);
+DEFINE_MUTEX(union_rdmutex);
 
 static struct kmem_cache *union_cache __read_mostly;
+static struct kmem_cache *readdir_cache;
 
 static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
 {
@@ -101,6 +103,9 @@ static int __init init_union(void)
for (loop = 0; loop < (1 << union_rhash_shift); loop++)
INIT_HLIST_HEAD(_rhashtable[loop]);
 
+   readdir_cache = kmem_cache_create("readdir-cache",
+   sizeof(struct rdcache_entry), 0,
+   SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
return 0;
 }
 
@@ -516,6 +521,282 @@ int last_union_is_root(struct path *path
 }
 
 /*
+ * readdir support for Union mounts.
+ */
+
+struct rdcache_callback {
+   void *buf;  /* original callback buffer */
+   filldir_t filldir;  /* the filldir() we should call */
+   int error;  /* stores filldir error */
+   struct rdstate *rdstate;/* readdir state */
+};
+
+/*
+ * This is called after every ->readdir() to persistently store the number of
+ * entries in a directory in the corresponding union_mount structure.
+ */
+static void update_um_dirents(struct rdstate *r)
+{
+   struct union_mount *um;
+
+   spin_lock(_lock);
+   um = union_lookup(r->cur_path.dentry, r->cur_path.mnt);
+   if (!um)
+   goto out;
+   um->nr_dirents = r->nr_dirents;
+out:
+   spin_unlock(_lock);
+}
+
+static void rdcache_free(struct list_head *list)
+{
+   struct list_head *p;
+   struct list_head *ptmp;
+   int count = 0;
+
+   list_for_each_safe(p, ptmp, list) {
+   struct rdcache_entry *this;
+
+   this = list_entry(p, struct rdcache_entry, list);
+   list_del_init(>list);
+   kfree(this->name.name);
+   kmem_cache_free(readdir_cache, this);
+   count++;
+   }
+   INIT_LIST_HEAD(list);
+   return;
+}
+
+static int rdcache_find_entry(struct list_head *uc_list,
+ const char *name, int namelen)
+{
+   struct rdcache_entry *p;
+   int ret = 0;
+
+   list_for_each_entry(p, uc_list, list) {
+   if (p->name.len != namelen)
+   continue;
+   if (strncmp(p->name.name, name, namelen) == 0) {
+   ret = 1;
+   break;
+   }
+   }
+   return ret;
+}
+
+static int rdcache_add_entry(struct rdstate *r, struct list_head *list,
+   const char *name, int namelen, loff_t offset, u64 ino,
+   unsigned int d_type)
+{
+   struct rdcache_entry *this;
+   char *tmp_name;
+
+   this = kmem_cache_alloc(readdir_cache, GFP_KERNEL);
+   if (!this) {
+   printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
+   return -ENOMEM;
+   }
+
+   t

[RFC PATCH 1/5] Remove existing directory listing implementation

2007-12-05 Thread Bharata B Rao
Remove the existing readdir implementation.

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 fs/readdir.c  |   10 +
 fs/union.c|  333 --
 include/linux/union.h |   23 ---
 3 files changed, 8 insertions(+), 358 deletions(-)

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,12 +16,12 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 
 int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 {
+   struct inode *inode = file->f_path.dentry->d_inode;
int res = -ENOTDIR;
 
if (!file->f_op || !file->f_op->readdir)
@@ -31,7 +31,13 @@ int vfs_readdir(struct file *file, filld
if (res)
goto out;
 
-   res = do_readdir(file, buf, filler);
+   mutex_lock(>i_mutex);
+   res = -ENOENT;
+   if (!IS_DEADDIR(inode)) {
+   res = file->f_op->readdir(file, buf, filler);
+   file_accessed(file);
+   }
+   mutex_unlock(>i_mutex);
 out:
return res;
 }
--- a/fs/union.c
+++ b/fs/union.c
@@ -516,339 +516,6 @@ int last_union_is_root(struct path *path
 }
 
 /*
- * Union mounts support for readdir.
- */
-
-/* This is a copy from fs/readdir.c */
-struct getdents_callback {
-   struct linux_dirent __user *current_dir;
-   struct linux_dirent __user *previous;
-   int count;
-   int error;
-};
-
-/* The readdir union cache object */
-struct union_cache_entry {
-   struct list_head list;
-   struct qstr name;
-};
-
-static int union_cache_add_entry(struct list_head *list,
-const char *name, int namelen)
-{
-   struct union_cache_entry *this;
-   char *tmp_name;
-
-   this = kmalloc(sizeof(*this), GFP_KERNEL);
-   if (!this) {
-   printk(KERN_CRIT
-  "union_cache_add_entry(): out of kernel memory\n");
-   return -ENOMEM;
-   }
-
-   tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
-   if (!tmp_name) {
-   printk(KERN_CRIT
-  "union_cache_add_entry(): out of kernel memory\n");
-   kfree(this);
-   return -ENOMEM;
-   }
-
-   this->name.name = tmp_name;
-   this->name.len = namelen;
-   this->name.hash = 0;
-   memcpy(tmp_name, name, namelen);
-   tmp_name[namelen] = 0;
-   INIT_LIST_HEAD(>list);
-   list_add(>list, list);
-   return 0;
-}
-
-static void union_cache_free(struct list_head *uc_list)
-{
-   struct list_head *p;
-   struct list_head *ptmp;
-   int count = 0;
-
-   list_for_each_safe(p, ptmp, uc_list) {
-   struct union_cache_entry *this;
-
-   this = list_entry(p, struct union_cache_entry, list);
-   list_del_init(>list);
-   kfree(this->name.name);
-   kfree(this);
-   count++;
-   }
-   return;
-}
-
-static int union_cache_find_entry(struct list_head *uc_list,
- const char *name, int namelen)
-{
-   struct union_cache_entry *p;
-   int ret = 0;
-
-   list_for_each_entry(p, uc_list, list) {
-   if (p->name.len != namelen)
-   continue;
-   if (strncmp(p->name.name, name, namelen) == 0) {
-   ret = 1;
-   break;
-   }
-   }
-
-   return ret;
-}
-
-/*
- * There are four filldir() wrapper necessary for the union mount readdir
- * implementation:
- *
- * - filldir_topmost(): fills the union's readdir cache and the user space
- * buffer. This is only used for the topmost directory
- * in the union stack.
- * - filldir_topmost_cacheonly(): only fills the union's readdir cache.
- * This is only used for the topmost directory in the
- * union stack.
- * - filldir_overlaid(): fills the union's readdir cache and the user space
- * buffer. This is only used for directories on the
- * stack's lower layers.
- * - filldir_overlaid_cacheonly(): only fills the union's readdir cache.
- * This is only used for directories on the stack's
- * lower layers.
- */
-
-struct union_cache_callback {
-   struct getdents_callback *buf;  /* original getdents_callback */
-   struct list_head list;  /* list of union cache entries */
-   filldir_t filler;   /* the filldir() we should call */
-   loff_t offset;  /* base offset of our dirents */
-   loff_t count;   /* maximum number of bytes to "read" */
-};
-
-static int filldir_topmost(void *buf, const char *name, int namlen,
-  loff_t offset, u64 ino, unsigned int d_type)
-{
-   struct union_cache_callback *cb = buf;
-
-   u

[RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-05 Thread Bharata B Rao
Hi,

In Union Mount, the merged view of directories of the union is obtained
by enhancing readdir(2)/getdents(2) to read and merge the entries of
all the directories  by eliminating the duplicates. While we have tried
a few approaches for this, none of them could perfectly solve all the problems.
One of the challenges has been to provide a correct directory seek support for
the union mounted directories. Sometime back when I posted an
RFC (http://lkml.org/lkml/2007/9/7/22) on this problem, one of the
suggestions I got was to have the dirents stored in a cache (which we
already do for duplicate elimination) and define the directory seek
behaviour on this cache constructed out of unioned directories.

I set out to try this and the result is this set of patches. I am myself
not impressed by the implementation complexity in these patches, but am posting
them here only to get further comments and suggestions. Moreover I haven't
debugged them thoroughly to uncover all the problems. While I don't expect
anybody to try these patches, for completeness' sake I have to mention that
these apply on top of Jan Blunck's patches on 2.6.24-rc2-mm1
(ftp://ftp.suse.com/pub/people/jblunck/patches/).

In this approach, the cached dirents are given offsets in the form of
linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
uniformly define offsets across all the directories of the union
irrespective of the type of filesystem involved. Also this is needed to
define a seek behaviour on the union mounted directory. This cache is stored
as part of the struct file of the topmost directory of the union and will
remain as long as the directory is kept open.
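
In rough terms, the per-open state this implies looks like the sketch below
(field names simplified from the actual patches): every dirent seen from any
layer is appended to a list, and its position in that list is the linear
index reported to user space.

struct rdstate_sketch {
	struct path cur_path;		/* layer currently being read */
	struct list_head dirent_cache;	/* merged, de-duplicated entries */
	loff_t cur_off;			/* linear index handed to user space */
	loff_t last_off;		/* index of the newest cached entry */
	loff_t file_off;		/* real f_pos within cur_path's directory */
};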

While this approach works, it has the following downsides:

- The cache can grow arbitrarily large in size for big directories thereby
consuming lots of memory. Pruning individual cache entries is out of question
as entire cache is needed for subsequent readdirs for duplicate elimination.

- When an existing union gets overlaid by a new directory, there is a
possibility of the cache getting duplicated for the new union, thereby wasting
more space.

- Whenever _any_ directory that is part of the union gets
modified (addition/deletion of entries), the dirent cache of all the unions
which this directory is part of, needs to be purged and rebuilt. This is
expensive not only due to re-reads of dirents but also because
readdir(2)/getdents(2) needs to be synchronized with other operations
like mkdir/mknod/link/unlink etc.

- Since lseek(2) of the unioned directory also works on the same dirent
cache, it too needs to be invalidated when the directory gets modified.

- Supporting SEEK_END of lseek(2) is expensive as it involves reading in
all the dirents to know the EOF of the directory file.

After all this, I am beginning to wonder whether it would be better to delegate
this readdir and whiteout processing to userspace. Can this be better handled
by readdir(3) in glibc ? But even with this, I am not sure if correct seek
behaviour (from lseek(2) or seekdir(3))  can be achieved in an easy manner.
Any thoughts on this ?

Earlier Erez Zadok had suggested that things would become easier if readdir
state is stored as a disk file (http://lkml.org/lkml/2007/9/7/114). This
approach simplifies directory seek support in Unionfs. But I am not sure if
such an approach would go well with VFS based unification approach like
Union Mount.

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 3/5] Add list_for_each_entry_reverse_from()

2007-12-05 Thread Bharata B Rao
Introduce list_for_each_entry_reverse_from() needed by a subsequent patch.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 include/linux/list.h |   13 +
 1 file changed, 13 insertions(+)

--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -562,6 +562,19 @@ static inline void list_splice_init_rcu(
 pos = list_entry(pos-member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_reverse_from - iterate backwards over list of given
+ * type from the current point
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_struct within the struct.
+ *
+ * Iterate backwards over list of given type, continuing from current position.
+ */
+#define list_for_each_entry_reverse_from(pos, head, member)\
+   for (; prefetch(pos-member.prev), pos-member != (head);  \
+pos = list_entry(pos-member.prev, typeof(*pos), member))
+
+/**
  * list_for_each_entry_safe - iterate over list of given type safe against 
removal of list entry
  * @pos:   the type * to use as a loop cursor.
  * @n: another type * to use as temporary storage
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 5/5] Directory cache invalidation

2007-12-05 Thread Bharata B Rao
Changes to keep dirent cache uptodate.

Dirent cache stored as part of topmost directory's struct file needs to
be marked stale whenever there is a modification in any of the directories
that is part of the union. Modifications(like addition/deletion of new
entries) to a directory can occur from places like mkdir, rmdir, mknod etc.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/dcache.c|1 
 fs/namei.c |   13 +++
 fs/union.c |  178 -
 include/linux/dcache.h |4 -
 include/linux/fs.h |4 +
 include/linux/union.h  |3 
 6 files changed, 201 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -974,6 +974,7 @@ struct dentry *d_alloc(struct dentry * p
 #endif
 #ifdef CONFIG_UNION_MOUNT
INIT_LIST_HEAD(dentry-d_unions);
+   INIT_LIST_HEAD(dentry-d_overlaid);
dentry-d_unionized = 0;
 #endif
INIT_HLIST_NODE(dentry-d_hash);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2237,6 +2237,7 @@ static int __open_namei_create(struct na
nd-path.dentry = path-dentry;
if (error)
return error;
+   rdcache_invalidate(nd-path);
/* Don't check for write permission, don't truncate */
return may_open(nd, 0, flag  ~O_TRUNC);
 }
@@ -2682,6 +2683,8 @@ asmlinkage long sys_mknodat(int dfd, con
  mode, 0);
break;
}
+   if (!error)
+   rdcache_invalidate(nd.path);
mnt_drop_write(nd.path.mnt);
 out_dput:
path_put_conditional(path, nd);
@@ -2757,6 +2760,8 @@ asmlinkage long sys_mkdirat(int dfd, con
if (error)
goto out_dput;
error = vfs_mkdir(nd.path.dentry-d_inode, path.dentry, mode);
+   if (!error)
+   rdcache_invalidate(nd.path);
mnt_drop_write(nd.path.mnt);
 out_dput:
path_put_conditional(path, nd);
@@ -3287,6 +3292,8 @@ static long do_rmdir(int dfd, const char
if (error)
goto exit3;
error = vfs_rmdir(nd.path.dentry-d_inode, path.dentry);
+   if (!error)
+   rdcache_invalidate(nd.path);
mnt_drop_write(nd.path.mnt);
 exit3:
path_put_conditional(path, nd);
@@ -3375,6 +3382,8 @@ static long do_unlinkat(int dfd, const c
if (error)
goto exit2;
error = vfs_unlink(nd.path.dentry-d_inode, path.dentry);
+   if (!error)
+   rdcache_invalidate(nd.path);
mnt_drop_write(nd.path.mnt);
exit2:
path_put_conditional(path, nd);
@@ -3466,6 +3475,8 @@ asmlinkage long sys_symlinkat(const char
goto out_dput;
error = vfs_symlink(nd.path.dentry-d_inode, path.dentry, from,
S_IALLUGO);
+   if (!error)
+   rdcache_invalidate(nd.path);
mnt_drop_write(nd.path.mnt);
 out_dput:
path_put_conditional(path, nd);
@@ -3566,6 +3577,8 @@ asmlinkage long sys_linkat(int olddfd, c
goto out_dput;
error = vfs_link(old_nd.path.dentry, nd.path.dentry-d_inode,
 path.dentry);
+   if (!error)
+   rdcache_invalidate(nd.path);
mnt_drop_write(nd.path.mnt);
 out_dput:
path_put_conditional(path, nd);
--- a/fs/union.c
+++ b/fs/union.c
@@ -39,6 +39,7 @@ static struct hlist_head *union_hashtabl
 static unsigned int union_rhash_mask __read_mostly;
 static unsigned int union_rhash_shift __read_mostly;
 static struct hlist_head *union_rhashtable __read_mostly;
+static struct hlist_head *readdir_hashtable __read_mostly;
 
 /*
  * Locking Rules:
@@ -103,6 +104,18 @@ static int __init init_union(void)
for (loop = 0; loop  (1  union_rhash_shift); loop++)
INIT_HLIST_HEAD(union_rhashtable[loop]);
 
+   readdir_hashtable = alloc_large_system_hash(readdir-cache,
+ sizeof(struct hlist_head),
+ union_hash_entries,
+ 14,
+ 0,
+ union_rhash_shift,
+ union_rhash_mask,
+ 0);
+
+   for (loop = 0; loop  (1  union_rhash_shift); loop++)
+   INIT_HLIST_HEAD(readdir_hashtable[loop]);
+
readdir_cache = kmem_cache_create(readdir-cache,
sizeof(struct rdcache_entry), 0,
SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
@@ -126,6 +139,7 @@ struct union_mount *union_alloc(struct d
atomic_set(um-u_count, 1);
INIT_LIST_HEAD(um-u_unions);
INIT_LIST_HEAD(um-u_list);
+   INIT_LIST_HEAD(um-u_overlaid);
INIT_HLIST_NODE(um-u_hash

[RFC PATCH 1/5] Remove existing directory listing implementation

2007-12-05 Thread Bharata B Rao
Remove the existing readdir implementation.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/readdir.c  |   10 +
 fs/union.c|  333 --
 include/linux/union.h |   23 ---
 3 files changed, 8 insertions(+), 358 deletions(-)

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,12 +16,12 @@
 #include linux/security.h
 #include linux/syscalls.h
 #include linux/unistd.h
-#include linux/union.h
 
 #include asm/uaccess.h
 
 int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 {
+   struct inode *inode = file-f_path.dentry-d_inode;
int res = -ENOTDIR;
 
if (!file-f_op || !file-f_op-readdir)
@@ -31,7 +31,13 @@ int vfs_readdir(struct file *file, filld
if (res)
goto out;
 
-   res = do_readdir(file, buf, filler);
+   mutex_lock(inode-i_mutex);
+   res = -ENOENT;
+   if (!IS_DEADDIR(inode)) {
+   res = file-f_op-readdir(file, buf, filler);
+   file_accessed(file);
+   }
+   mutex_unlock(inode-i_mutex);
 out:
return res;
 }
--- a/fs/union.c
+++ b/fs/union.c
@@ -516,339 +516,6 @@ int last_union_is_root(struct path *path
 }
 
 /*
- * Union mounts support for readdir.
- */
-
-/* This is a copy from fs/readdir.c */
-struct getdents_callback {
-   struct linux_dirent __user *current_dir;
-   struct linux_dirent __user *previous;
-   int count;
-   int error;
-};
-
-/* The readdir union cache object */
-struct union_cache_entry {
-   struct list_head list;
-   struct qstr name;
-};
-
-static int union_cache_add_entry(struct list_head *list,
-const char *name, int namelen)
-{
-   struct union_cache_entry *this;
-   char *tmp_name;
-
-   this = kmalloc(sizeof(*this), GFP_KERNEL);
-   if (!this) {
-   printk(KERN_CRIT
-  union_cache_add_entry(): out of kernel memory\n);
-   return -ENOMEM;
-   }
-
-   tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
-   if (!tmp_name) {
-   printk(KERN_CRIT
-  union_cache_add_entry(): out of kernel memory\n);
-   kfree(this);
-   return -ENOMEM;
-   }
-
-   this-name.name = tmp_name;
-   this-name.len = namelen;
-   this-name.hash = 0;
-   memcpy(tmp_name, name, namelen);
-   tmp_name[namelen] = 0;
-   INIT_LIST_HEAD(this-list);
-   list_add(this-list, list);
-   return 0;
-}
-
-static void union_cache_free(struct list_head *uc_list)
-{
-   struct list_head *p;
-   struct list_head *ptmp;
-   int count = 0;
-
-   list_for_each_safe(p, ptmp, uc_list) {
-   struct union_cache_entry *this;
-
-   this = list_entry(p, struct union_cache_entry, list);
-   list_del_init(this-list);
-   kfree(this-name.name);
-   kfree(this);
-   count++;
-   }
-   return;
-}
-
-static int union_cache_find_entry(struct list_head *uc_list,
- const char *name, int namelen)
-{
-   struct union_cache_entry *p;
-   int ret = 0;
-
-   list_for_each_entry(p, uc_list, list) {
-   if (p-name.len != namelen)
-   continue;
-   if (strncmp(p-name.name, name, namelen) == 0) {
-   ret = 1;
-   break;
-   }
-   }
-
-   return ret;
-}
-
-/*
- * There are four filldir() wrapper necessary for the union mount readdir
- * implementation:
- *
- * - filldir_topmost(): fills the union's readdir cache and the user space
- * buffer. This is only used for the topmost directory
- * in the union stack.
- * - filldir_topmost_cacheonly(): only fills the union's readdir cache.
- * This is only used for the topmost directory in the
- * union stack.
- * - filldir_overlaid(): fills the union's readdir cache and the user space
- * buffer. This is only used for directories on the
- * stack's lower layers.
- * - filldir_overlaid_cacheonly(): only fills the union's readdir cache.
- * This is only used for directories on the stack's
- * lower layers.
- */
-
-struct union_cache_callback {
-   struct getdents_callback *buf;  /* original getdents_callback */
-   struct list_head list;  /* list of union cache entries */
-   filldir_t filler;   /* the filldir() we should call */
-   loff_t offset;  /* base offset of our dirents */
-   loff_t count;   /* maximum number of bytes to read */
-};
-
-static int filldir_topmost(void *buf, const char *name, int namlen,
-  loff_t offset, u64 ino, unsigned int d_type)
-{
-   struct union_cache_callback *cb = buf

[RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-05 Thread Bharata B Rao
Hi,

In Union Mount, the merged view of directories of the union is obtained
by enhancing readdir(2)/getdents(2) to read and merge the entries of
all the directories  by eliminating the duplicates. While we have tried
a few approaches for this, none of them could perfectly solve all the problems.
One of the challenges has been to provide a correct directory seek support for
the union mounted directories. Sometime back when I posted an
RFC (http://lkml.org/lkml/2007/9/7/22) on this problem, one of the
suggestions I got was to have the dirents stored in a cache (which we
already do for duplicate elimination) and define the directory seek
behaviour on this cache constructed out of unioned directories.

I set out to try this and the result is this set of patches. I am myself
not impressed by the implementation complexity in these patches but posting
them here only to get further comments and suggestions. Moreover I haven't
debugged them thoroughly to uncover all the problems. While I don't expect
anybody try these patches, for the completeness sake I have to mention that
these apply on top of Jan Blunck's patches on 2.6.24-rc2-mm1
(ftp://ftp.suse.com/pub/people/jblunck/patches/).

In this approach, the cached dirents are given offsets in the form of
linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
uniformly define offsets across all the directories of the union
irrespective of the type of filesystem involved. Also this is needed to
define a seek behaviour on the union mounted directory. This cache is stored
as part of the struct file of the topmost directory of the union and will
remain as long as the directory is kept open.

While this approach works, it has the following downsides:

- The cache can grow arbitrarily large in size for big directories thereby
consuming lots of memory. Pruning individual cache entries is out of question
as entire cache is needed for subsequent readdirs for duplicate elimination.

- When an exising union gets overlaid by a new directory, there is a
possibility of the cache getting duplicated for the new union, thereby wasting
more space.

- Whenever _any_ directory that is part of the union gets
modified (addition/deletion of entries), the dirent cache of all the unions
which this directory is part of, needs to be purged and rebuilt. This is
expensive not only due to re-reads of dirents but also because
readdir(2)/getdents(2) needs to be synchronized with other operations
like mkdir/mknod/link/unlink etc.

- Since lseek(2) of the unioned directory also works on the same dirent
cache, it too needs to be invalidated when the directory gets modified.

- Supporting SEEK_END of lseek(2) is expensive as it involves reading in
all the dirents to know the EOF of the directory file.

After all this, I am beginning to think if it would be better to delegate
this readdir and whiteout processing to userspace. Can this be better handled
by readdir(3) in glibc ? But even with this, I am not sure if correct seek
behaviour (from lseek(2) or seekdir(3))  can be achieved in an easy manner.
Any thoughts on this ?

Earlier Erez Zodak had suggested that things will become easier if readdir
state is stored as a disk file (http://lkml.org/lkml/2007/9/7/114). This
approach simplifies directory seek support in Unionfs. But I am not sure if
such an approach would go well with VFS based unification approach like
Union Mount.

Regards,
Bharata.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 2/5] Add New directory listing approach

2007-12-05 Thread Bharata B Rao
Another readdir implementation for union uounted directories.

Reads dirents from all layers of the union into a cache, eliminates duplicates,
before returning them into userspace. The cache is stored persistently as part
of struct file of the topmost directory. Instead of original directory offsets,
offsets are defined as linearly increasing indices on this cache and the same
is returned to userspace.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/file_table.c   |1 
 fs/readdir.c  |   10 -
 fs/union.c|  281 ++
 include/linux/fs.h|   30 +
 include/linux/union.h |   28 
 5 files changed, 342 insertions(+), 8 deletions(-)

--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -286,6 +286,7 @@ void fastcall __fput(struct file *file)
drop_file_write_access(file);
 
put_pid(file-f_owner.pid);
+   put_rdstate(file-f_rdstate);
file_kill(file);
file-f_path.dentry = NULL;
file-f_path.mnt = NULL;
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,12 +16,12 @@
 #include linux/security.h
 #include linux/syscalls.h
 #include linux/unistd.h
+#include linux/union.h
 
 #include asm/uaccess.h
 
 int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 {
-   struct inode *inode = file-f_path.dentry-d_inode;
int res = -ENOTDIR;
 
if (!file-f_op || !file-f_op-readdir)
@@ -31,13 +31,7 @@ int vfs_readdir(struct file *file, filld
if (res)
goto out;
 
-   mutex_lock(inode-i_mutex);
-   res = -ENOENT;
-   if (!IS_DEADDIR(inode)) {
-   res = file-f_op-readdir(file, buf, filler);
-   file_accessed(file);
-   }
-   mutex_unlock(inode-i_mutex);
+   res = do_readdir(file, buf, filler);
 out:
return res;
 }
--- a/fs/union.c
+++ b/fs/union.c
@@ -46,8 +46,10 @@ static struct hlist_head *union_rhashtab
  * - union_lock
  */
 DEFINE_SPINLOCK(union_lock);
+DEFINE_MUTEX(union_rdmutex);
 
 static struct kmem_cache *union_cache __read_mostly;
+static struct kmem_cache *readdir_cache;
 
 static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
 {
@@ -101,6 +103,9 @@ static int __init init_union(void)
for (loop = 0; loop  (1  union_rhash_shift); loop++)
INIT_HLIST_HEAD(union_rhashtable[loop]);
 
+   readdir_cache = kmem_cache_create(readdir-cache,
+   sizeof(struct rdcache_entry), 0,
+   SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
return 0;
 }
 
@@ -516,6 +521,282 @@ int last_union_is_root(struct path *path
 }
 
 /*
+ * readdir support for Union mounts.
+ */
+
+struct rdcache_callback {
+   void *buf;  /* original callback buffer */
+   filldir_t filldir;  /* the filldir() we should call */
+   int error;  /* stores filldir error */
+   struct rdstate *rdstate;/* readdir state */
+};
+
+/*
+ * This is called after every -readdir() to persistently store the number of
+ * entries in a directory in the corresponding union_mount structure.
+ */
+static void update_um_dirents(struct rdstate *r)
+{
+   struct union_mount *um;
+
+   spin_lock(union_lock);
+   um = union_lookup(r-cur_path.dentry, r-cur_path.mnt);
+   if (!um)
+   goto out;
+   um-nr_dirents = r-nr_dirents;
+out:
+   spin_unlock(union_lock);
+}
+
+static void rdcache_free(struct list_head *list)
+{
+   struct list_head *p;
+   struct list_head *ptmp;
+   int count = 0;
+
+   list_for_each_safe(p, ptmp, list) {
+   struct rdcache_entry *this;
+
+   this = list_entry(p, struct rdcache_entry, list);
+   list_del_init(this-list);
+   kfree(this-name.name);
+   kmem_cache_free(readdir_cache, this);
+   count++;
+   }
+   INIT_LIST_HEAD(list);
+   return;
+}
+
+static int rdcache_find_entry(struct list_head *uc_list,
+ const char *name, int namelen)
+{
+   struct rdcache_entry *p;
+   int ret = 0;
+
+   list_for_each_entry(p, uc_list, list) {
+   if (p-name.len != namelen)
+   continue;
+   if (strncmp(p-name.name, name, namelen) == 0) {
+   ret = 1;
+   break;
+   }
+   }
+   return ret;
+}
+
+static int rdcache_add_entry(struct rdstate *r, struct list_head *list,
+   const char *name, int namelen, loff_t offset, u64 ino,
+   unsigned int d_type)
+{
+   struct rdcache_entry *this;
+   char *tmp_name;
+
+   this = kmem_cache_alloc(readdir_cache, GFP_KERNEL);
+   if (!this) {
+   printk(KERN_CRIT rdcache_add_entry(): out of kernel memory\n);
+   return -ENOMEM;
+   }
+
+   tmp_name = kmalloc(namelen + 1, GFP_KERNEL

[RFC PATCH 4/5] Directory seek support

2007-12-05 Thread Bharata B Rao
Directory seek support.

Define the seek behaviour on the stored cache of dirents.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/read_write.c   |   11 ---
 fs/union.c|  171 +-
 include/linux/fs.h|8 ++
 include/linux/union.h |   25 +++
 4 files changed, 205 insertions(+), 10 deletions(-)

--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include linux/syscalls.h
 #include linux/pagemap.h
 #include linux/splice.h
+#include linux/union.h
 #include read_write.h
 
 #include asm/uaccess.h
@@ -116,15 +117,7 @@ EXPORT_SYMBOL(default_llseek);
 
 loff_t vfs_llseek(struct file *file, loff_t offset, int origin)
 {
-   loff_t (*fn)(struct file *, loff_t, int);
-
-   fn = no_llseek;
-   if (file-f_mode  FMODE_LSEEK) {
-   fn = default_llseek;
-   if (file-f_op  file-f_op-llseek)
-   fn = file-f_op-llseek;
-   }
-   return fn(file, offset, origin);
+   return do_llseek(file, offset, origin);
 }
 EXPORT_SYMBOL(vfs_llseek);
 
--- a/fs/union.c
+++ b/fs/union.c
@@ -614,6 +614,7 @@ static int rdcache_add_entry(struct rdst
this-dtype = d_type;
INIT_LIST_HEAD(this-list);
list_add_tail(this-list, list);
+   r-cur_dirent = this;
return 0;
 }
 
@@ -636,18 +637,96 @@ static int filldir_union(void *buf, cons
if (rdcache_find_entry(r-dirent_cache, name, namlen))
return 0;
 
-   err =  cb-filldir(cb-buf, name, namlen, r-cur_off,
+   /* We come here with NULL cb-filldir from lseek path */
+   if (cb-filldir)
+   err =  cb-filldir(cb-buf, name, namlen, r-cur_off,
ino, d_type);
if (err = 0) {
rdcache_add_entry(r, r-dirent_cache,
name, namlen, offset, ino, d_type);
r-cur_off = ++r-last_off;
r-nr_dirents++;
+   if (r-cur_off == r-fill_off) {
+   /* We filled up to the required seek offset */
+   r-fill_off = 0;
+   err = -EINVAL;
+   }
}
cb-error = err;
return err;
 }
 
+/*
+ * This is called when current offset in rdcache gets changed and when
+ * we need to change the current underlying directory in the rdstate
+ * to match the current offset.
+ */
+static void update_rdstate(struct file *file)
+{
+   struct rdstate *r = file-f_rdstate;
+   loff_t off = r-cur_off;
+   struct union_mount *um;
+
+   if (!(r-flags  RDSTATE_NEED_UPDATE))
+   return;
+
+   spin_lock(union_lock);
+   um = union_lookup(file-f_path.dentry, file-f_path.mnt);
+   spin_unlock(union_lock);
+   if (!um)
+   goto out;
+   off -= um-nr_dirents;
+   path_put(r-cur_path);
+   r-cur_path = file-f_path;
+   path_get(r-cur_path);
+
+   while (off  0) {
+   spin_lock(union_lock);
+   um = union_lookup(r-cur_path.dentry, r-cur_path.mnt);
+   spin_unlock(union_lock);
+   if (!um)
+   goto out;
+   off -= um-nr_dirents;
+   path_put(r-cur_path);
+   r-cur_path = um-u_next;
+   path_get(r-cur_path);
+   }
+out:
+   r-file_off = r-cur_dirent-off;
+}
+
+/*
+ * Returns dirents from rdcache to userspace.
+ */
+static int readdir_rdcache(struct file *file, struct rdcache_callback *cb)
+{
+   struct rdstate *r = cb-rdstate;
+   struct rdcache_entry *tmp = r-cur_dirent;
+   int err = 0;
+
+   BUG_ON(r-cur_off  r-last_off);
+
+   /* If offsets already uptodate, just return */
+   if (likely(r-cur_off == r-last_off))
+   return 0;
+
+   /*
+* return the entries from cur_off till last_off from rdcache to
+* user space.
+*/
+   list_for_each_entry_from(tmp, r-dirent_cache, list) {
+   err =  cb-filldir(cb-buf, tmp-name.name, tmp-name.len,
+   r-cur_off, tmp-ino, tmp-dtype);
+   r-cur_dirent = tmp;
+   if (err  0)
+   break;
+   r-cur_off++;
+   r-flags |= RDSTATE_NEED_UPDATE;
+   }
+   update_rdstate(file);
+   return err;
+}
+
 /* Called from last fput() */
 void put_rdstate(struct rdstate *rdstate)
 {
@@ -710,6 +789,10 @@ int readdir_union(struct file *file, voi
cb.rdstate = rdstate;
cb.error = 0;
 
+   err = readdir_rdcache(file, cb);
+   if (err)
+   return err;
+
offset = rdstate-file_off;
 
/* Read from the topmost directory */
@@ -796,6 +879,92 @@ out:
return err;
 }
 
+static void rdcache_rewind(struct file *file, struct rdstate *r, loff_t offset)
+{
+   struct rdcache_entry *tmp = r->cur_dirent;
+
+   list_for_each_entry_reverse_from(tmp, &r->dirent_cache, list

Re: [PATCH 6/7] d_path: Make d_path() use a struct path

2007-11-01 Thread Bharata B Rao
On 10/29/07, Jan Blunck <[EMAIL PROTECTED]> wrote:
>
>

Did you miss the d_path() caller arch/blackfin/kernel/traps.c:printk_address() ?

Regards,
Bharata.
-- 
"Men come and go but mountains remain" -- Ruskin Bond.


Re: [PATCH 00/13] Use struct path in struct nameidata

2007-10-23 Thread Bharata B Rao
On Tue, Oct 23, 2007 at 10:43:05AM +0200, Jan Blunck wrote:
> 
> The thing is: how do we keep going from here? Do you want to send my patches
> in the future or are you going to ask me before sending things out? We don't
> need to duplicate the work here. I already put my quilt stack into a public
> place for you to work on them but I don't like the way this is going on at the
> moment.

My intention was to help speed up the Union Mount effort by ensuring
that patches don't rot waiting for developer's attention. Going by the
past interactions, I got a feeling that you have a lot of other work
besides this, while I have time to spare on this. Hence I wanted to do my
bit to get patches moving as quickly as possible.

As I have conveyed to you many times, I would still like you to maintain
the patches and send out as timely as possible on lkml. If you can't do
this because of your other commitments, then I would more than willing
to maintain these and give them maximum attention.

And thanks for making available the patches publicly, I have been asking
this for months now.

Regards,
Bharata.


Re: [PATCH 00/13] Use struct path in struct nameidata

2007-10-22 Thread Bharata B Rao
On Mon, Oct 22, 2007 at 03:57:58PM +0200, Christoph Hellwig wrote:
> 
> Any reason we've got this patchset posted by three people now? :)

Two reasons actually !

- The set of patches posted by Jan last was on 2.6.23-rc8-mm1. So I
thought let me help Andrew a bit by making them available on latest
-mm :) And I didn't know that these were already under consideration
by Andrew.

- The set of patches posted by Jan didn't even pass a compile test for me.
So I made sure that the patches compiled and worked on x86, x86_64 and powerpc.

Regards,
Bharata.


[PATCH 13/13] Rename {__}d_path() to {__}print_path() and fix comments

2007-10-22 Thread Bharata B Rao
Changes the name of d_path() and __d_path() to print_path() and __print_path()
respectively and fixes the kerneldoc comments for print_path().

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 arch/blackfin/kernel/traps.c  |2 -
 drivers/md/bitmap.c   |2 -
 drivers/usb/gadget/file_storage.c |4 +--
 fs/compat_ioctl.c |3 +-
 fs/dcache.c   |   41 +-
 fs/dcookies.c |2 -
 fs/ecryptfs/super.c   |2 -
 fs/nfsd/export.c  |2 -
 fs/proc/base.c|2 -
 fs/seq_file.c |2 -
 fs/sysfs/file.c   |2 -
 fs/unionfs/super.c|2 -
 include/linux/dcache.h|2 -
 kernel/audit.c|2 -
 14 files changed, 34 insertions(+), 36 deletions(-)

--- a/arch/blackfin/kernel/traps.c
+++ b/arch/blackfin/kernel/traps.c
@@ -102,7 +102,7 @@ static int printk_address(unsigned long 
struct file *file = vma->vm_file;
if (file) {
char _tmpbuf[256];
-   name = d_path(file->f_dentry,
+   name = print_path(file->f_dentry,
  file->f_vfsmnt,
  _tmpbuf,
  sizeof(_tmpbuf));
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -215,7 +215,7 @@ char *file_path(struct file *file, char 
d = file->f_path.dentry;
v = file->f_path.mnt;
 
-   buf = d_path(d, v, buf, count);
+   buf = print_path(d, v, buf, count);
 
return IS_ERR(buf) ? NULL : buf;
 }
--- a/drivers/usb/gadget/file_storage.c
+++ b/drivers/usb/gadget/file_storage.c
@@ -3567,7 +3567,7 @@ static ssize_t show_file(struct device *
 
down_read(&fsg->filesem);
if (backing_file_is_open(curlun)) { // Get the complete pathname
-   p = d_path(curlun->filp->f_path.dentry,
+   p = print_path(curlun->filp->f_path.dentry,
curlun->filp->f_path.mnt, buf, PAGE_SIZE - 1);
if (IS_ERR(p))
rc = PTR_ERR(p);
@@ -3985,7 +3985,7 @@ static int __init fsg_bind(struct usb_ga
if (backing_file_is_open(curlun)) {
p = NULL;
if (pathbuf) {
-   p = d_path(curlun->filp->f_path.dentry,
+   p = print_path(curlun->filp->f_path.dentry,
curlun->filp->f_path.mnt,
pathbuf, PATH_MAX);
if (IS_ERR(p))
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -3544,7 +3544,8 @@ static void compat_ioctl_error(struct fi
/* find the name of the device. */
path = (char *)__get_free_page(GFP_KERNEL);
if (path) {
-   fn = d_path(filp->f_path.dentry, filp->f_path.mnt, path, 
PAGE_SIZE);
+   fn = print_path(filp->f_path.dentry, filp->f_path.mnt,
+   path, PAGE_SIZE);
if (IS_ERR(fn))
fn = "?";
}
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1762,23 +1762,7 @@ shouldnt_be_hashed:
goto shouldnt_be_hashed;
 }
 
-/**
- * d_path - return the path of a dentry
- * @dentry: dentry to report
- * @vfsmnt: vfsmnt to which the dentry belongs
- * @root: root dentry
- * @rootmnt: vfsmnt to which the root dentry belongs
- * @buffer: buffer to return value in
- * @buflen: buffer length
- *
- * Convert a dentry into an ASCII path name. If the entry has been deleted
- * the string " (deleted)" is appended. Note that this is ambiguous.
- *
- * Returns the buffer or an error code if the path was too long.
- *
- * "buflen" should be positive. Caller holds the dcache_lock.
- */
-static char * __d_path( struct dentry *dentry, struct vfsmount *vfsmnt,
+static char *__print_path( struct dentry *dentry, struct vfsmount *vfsmnt,
struct path *root, char *buffer, int buflen)
 {
char * end = buffer+buflen;
@@ -1845,8 +1829,21 @@ Elong:
return ERR_PTR(-ENAMETOOLONG);
 }
 
-/* write full pathname into buffer and return start of pathname */
-char * d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
+/**
+ * print_path - return the path of a dentry
+ * @dentry: dentry to report
+ * @vfsmnt: vfsmnt to which the dentry belongs
+ * @buffer: buffer to return value in
+ * @buflen: buffer length
+ *
+ * Convert a dentry into an ASCII path name. If the entry has been deleted
+ * the string " (deleted)" is appended. Note that this is a

[PATCH 12/13] Use struct path argument in proc_get_link()

2007-10-22 Thread Bharata B Rao
Replace the (vfsmnt, dentry) arguments in proc_inode operation proc_get_link()
by struct path.

Also, this should eventually allow do_proc_readlink() to call d_path() with
a struct path argument.
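
For illustration only (the function name below is made up and is not part of
the patch), a consumer of the reworked hook now fills a struct path, uses it,
and drops both references with a single path_put():

static int example_proc_readlink(struct inode *inode, char __user *buf,
                                 int buflen)
{
        struct path path;
        int error;

        /* the hook fills in both the dentry and the vfsmount */
        error = PROC_I(inode)->op.proc_get_link(inode, &path);
        if (error)
                return error;
        error = do_proc_readlink(&path, buf, buflen);
        path_put(&path);        /* one call releases both references */
        return error;
}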

Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 fs/proc/base.c  |   60 +---
 fs/proc/internal.h  |2 -
 fs/proc/task_mmu.c  |6 ++--
 fs/proc/task_nommu.c|6 ++--
 include/linux/proc_fs.h |2 -
 5 files changed, 35 insertions(+), 41 deletions(-)

--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -153,7 +153,7 @@ static int get_nr_threads(struct task_st
return count;
 }
 
-static int proc_cwd_link(struct inode *inode, struct dentry **dentry, struct 
vfsmount **mnt)
+static int proc_cwd_link(struct inode *inode, struct path *path)
 {
struct task_struct *task = get_proc_task(inode);
struct fs_struct *fs = NULL;
@@ -165,8 +165,8 @@ static int proc_cwd_link(struct inode *i
}
if (fs) {
read_lock(&fs->lock);
-   *mnt = mntget(fs->pwd.mnt);
-   *dentry = dget(fs->pwd.dentry);
+   *path = fs->pwd;
+   path_get(path);
read_unlock(&fs->lock);
result = 0;
put_fs_struct(fs);
@@ -174,7 +174,7 @@ static int proc_cwd_link(struct inode *i
return result;
 }
 
-static int proc_root_link(struct inode *inode, struct dentry **dentry, struct 
vfsmount **mnt)
+static int proc_root_link(struct inode *inode, struct path *path)
 {
struct task_struct *task = get_proc_task(inode);
struct fs_struct *fs = NULL;
@@ -186,8 +186,8 @@ static int proc_root_link(struct inode *
}
if (fs) {
read_lock(&fs->lock);
-   *mnt = mntget(fs->root.mnt);
-   *dentry = dget(fs->root.dentry);
+   *path = fs->root;
+   path_get(path);
read_unlock(&fs->lock);
result = 0;
put_fs_struct(fs);
@@ -1039,34 +1039,32 @@ static void *proc_pid_follow_link(struct
if (!proc_fd_access_allowed(inode))
goto out;
 
-   error = PROC_I(inode)->op.proc_get_link(inode, &nd->path.dentry,
-   &nd->path.mnt);
+   error = PROC_I(inode)->op.proc_get_link(inode, &nd->path);
nd->last_type = LAST_BIND;
 out:
return ERR_PTR(error);
 }
 
-static int do_proc_readlink(struct dentry *dentry, struct vfsmount *mnt,
-   char __user *buffer, int buflen)
+static int do_proc_readlink(struct path *path, char __user *buffer, int buflen)
 {
struct inode * inode;
char *tmp = (char*)__get_free_page(GFP_TEMPORARY);
-   char *path;
+   char *pathname;
int len;
 
if (!tmp)
return -ENOMEM;
 
-   inode = dentry->d_inode;
-   path = d_path(dentry, mnt, tmp, PAGE_SIZE);
-   len = PTR_ERR(path);
-   if (IS_ERR(path))
+   inode = path->dentry->d_inode;
+   pathname = d_path(path->dentry, path->mnt, tmp, PAGE_SIZE);
+   len = PTR_ERR(pathname);
+   if (IS_ERR(pathname))
goto out;
-   len = tmp + PAGE_SIZE - 1 - path;
+   len = tmp + PAGE_SIZE - 1 - pathname;
 
if (len > buflen)
len = buflen;
-   if (copy_to_user(buffer, path, len))
+   if (copy_to_user(buffer, pathname, len))
len = -EFAULT;
  out:
free_page((unsigned long)tmp);
@@ -1077,20 +1075,18 @@ static int proc_pid_readlink(struct dent
 {
int error = -EACCES;
struct inode *inode = dentry->d_inode;
-   struct dentry *de;
-   struct vfsmount *mnt = NULL;
+   struct path path;
 
/* Are we allowed to snoop on the tasks file descriptors? */
if (!proc_fd_access_allowed(inode))
goto out;
 
-   error = PROC_I(inode)->op.proc_get_link(inode, &de, &mnt);
+   error = PROC_I(inode)->op.proc_get_link(inode, &path);
if (error)
goto out;
 
-   error = do_proc_readlink(de, mnt, buffer, buflen);
-   dput(de);
-   mntput(mnt);
+   error = do_proc_readlink(&path, buffer, buflen);
+   path_put(&path);
 out:
return error;
 }
@@ -1317,8 +1313,7 @@ out:
 
 #define PROC_FDINFO_MAX 64
 
-static int proc_fd_info(struct inode *inode, struct dentry **dentry,
-   struct vfsmount **mnt, char *info)
+static int proc_fd_info(struct inode *inode, struct path *path, char *info)
 {
struct task_struct *task = get_proc_task(inode);
struct files_struct *files = NULL;
@@ -1337,10 +1332,10 @@ static int proc_fd_info(struct inode *in
spin_lock(&files->file_lock);
file = fcheck_files(files, fd);
if (file) {
-   if (mnt)
-   *mnt = mntget(file->f_path.mnt);
-  

[PATCH 10/13] Make set_fs_{root,pwd} take a struct path

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

In nearly all cases the set_fs_{root,pwd}() calls work on a struct
path. Change the function to reflect this and use path_get() here.
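
As a quick illustration (not part of the patch, function name invented), a
chdir-style caller now hands over the whole struct path in one go instead of
passing the vfsmount and dentry separately:

static void example_set_cwd(struct nameidata *nd)
{
        set_fs_pwd(current->fs, &nd->path);     /* was: set_fs_pwd(fs, mnt, dentry) */
}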

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 fs/namespace.c|   28 ++--
 fs/open.c |   12 
 include/linux/fs_struct.h |4 ++--
 3 files changed, 20 insertions(+), 24 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2040,15 +2040,14 @@ out1:
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
  * It can block. Requires the big lock held.
  */
-void set_fs_root(struct fs_struct *fs, struct vfsmount *mnt,
-struct dentry *dentry)
+void set_fs_root(struct fs_struct *fs, struct path *path)
 {
struct path old_root;
 
write_lock(&fs->lock);
old_root = fs->root;
-   fs->root.mnt = mntget(mnt);
-   fs->root.dentry = dget(dentry);
+   fs->root = *path;
+   path_get(path);
write_unlock(&fs->lock);
if (old_root.dentry)
path_put(&old_root);
@@ -2058,15 +2057,14 @@ void set_fs_root(struct fs_struct *fs, s
  * Replace the fs->{pwdmnt,pwd} with {mnt,dentry}. Put the old values.
  * It can block. Requires the big lock held.
  */
-void set_fs_pwd(struct fs_struct *fs, struct vfsmount *mnt,
-   struct dentry *dentry)
+void set_fs_pwd(struct fs_struct *fs, struct path *path)
 {
struct path old_pwd;
 
write_lock(&fs->lock);
old_pwd = fs->pwd;
-   fs->pwd.mnt = mntget(mnt);
-   fs->pwd.dentry = dget(dentry);
+   fs->pwd = *path;
+   path_get(path);
write_unlock(&fs->lock);
 
if (old_pwd.dentry)
@@ -2087,12 +2085,10 @@ static void chroot_fs_refs(struct nameid
task_unlock(p);
if (fs->root.dentry == old_nd->path.dentry
&& fs->root.mnt == old_nd->path.mnt)
-   set_fs_root(fs, new_nd->path.mnt,
-   new_nd->path.dentry);
+   set_fs_root(fs, &new_nd->path);
if (fs->pwd.dentry == old_nd->path.dentry
&& fs->pwd.mnt == old_nd->path.mnt)
-   set_fs_pwd(fs, new_nd->path.mnt,
-  new_nd->path.dentry);
+   set_fs_pwd(fs, &new_nd->path);
put_fs_struct(fs);
} else
task_unlock(p);
@@ -2235,6 +2231,7 @@ static void __init init_mount_tree(void)
 {
struct vfsmount *mnt;
struct mnt_namespace *ns;
+   struct path root;
 
mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
if (IS_ERR(mnt))
@@ -2253,8 +2250,11 @@ static void __init init_mount_tree(void)
init_task.nsproxy->mnt_ns = ns;
get_mnt_ns(ns);
 
-   set_fs_pwd(current->fs, ns->root, ns->root->mnt_root);
-   set_fs_root(current->fs, ns->root, ns->root->mnt_root);
+   root.mnt = ns->root;
+   root.dentry = ns->root->mnt_root;
+
+   set_fs_pwd(current->fs, &root);
+   set_fs_root(current->fs, &root);
 }
 
 void __init mnt_init(void)
--- a/fs/open.c
+++ b/fs/open.c
@@ -501,7 +501,7 @@ asmlinkage long sys_chdir(const char __u
if (error)
goto dput_and_out;
 
-   set_fs_pwd(current->fs, nd.path.mnt, nd.path.dentry);
+   set_fs_pwd(current->fs, &nd.path);
 
 dput_and_out:
path_put(&nd.path);
@@ -512,9 +512,7 @@ out:
 asmlinkage long sys_fchdir(unsigned int fd)
 {
struct file *file;
-   struct dentry *dentry;
struct inode *inode;
-   struct vfsmount *mnt;
int error;
 
error = -EBADF;
@@ -522,9 +520,7 @@ asmlinkage long sys_fchdir(unsigned int 
if (!file)
goto out;
 
-   dentry = file->f_path.dentry;
-   mnt = file->f_path.mnt;
-   inode = dentry->d_inode;
+   inode = file->f_path.dentry->d_inode;
 
error = -ENOTDIR;
if (!S_ISDIR(inode->i_mode))
@@ -532,7 +528,7 @@ asmlinkage long sys_fchdir(unsigned int 
 
error = file_permission(file, MAY_EXEC);
if (!error)
-   set_fs_pwd(current->fs, mnt, dentry);
+   set_fs_pwd(current->fs, &file->f_path);
 out_putf:
fput(file);
 out:
@@ -556,7 +552,7 @@ asmlinkage long sys_chroot(const char __
if (!capable(CAP_SYS_CHROOT))
goto dput_and_out;
 
-   set_fs_root(current->fs, nd.path.mnt, nd.path.dentry);
+   set_fs_root(current->fs, &nd.path);
set_fs_altroot();

[PATCH 11/13] Make __d_path() to take a struct path argument

2007-10-22 Thread Bharata B Rao
From: Andreas Gruenbacher <[EMAIL PROTECTED]>

One less argument to __d_path.

All callers to __d_path pass the dentry and vfsmount of a struct
path to __d_path. Pass the struct path directly, instead.

Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 fs/dcache.c |   10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1779,8 +1779,7 @@ shouldnt_be_hashed:
  * "buflen" should be positive. Caller holds the dcache_lock.
  */
 static char * __d_path( struct dentry *dentry, struct vfsmount *vfsmnt,
-   struct dentry *root, struct vfsmount *rootmnt,
-   char *buffer, int buflen)
+   struct path *root, char *buffer, int buflen)
 {
char * end = buffer+buflen;
char * retval;
@@ -1805,7 +1804,7 @@ static char * __d_path( struct dentry *d
for (;;) {
struct dentry * parent;
 
-   if (dentry == root && vfsmnt == rootmnt)
+   if (dentry == root->dentry && vfsmnt == root->mnt)
break;
if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
/* Global root? */
@@ -1868,7 +1867,7 @@ char * d_path(struct dentry *dentry, str
path_get(&current->fs->root);
read_unlock(&current->fs->lock);
spin_lock(&dcache_lock);
-   res = __d_path(dentry, vfsmnt, root.dentry, root.mnt, buf, buflen);
+   res = __d_path(dentry, vfsmnt, &root, buf, buflen);
spin_unlock(&dcache_lock);
path_put(&root);
return res;
@@ -1936,8 +1935,7 @@ asmlinkage long sys_getcwd(char __user *
unsigned long len;
char * cwd;
 
-   cwd = __d_path(pwd.dentry, pwd.mnt, root.dentry, root.mnt,
-  page, PAGE_SIZE);
+   cwd = __d_path(pwd.dentry, pwd.mnt, &root, page, PAGE_SIZE);
spin_unlock(&dcache_lock);
 
error = PTR_ERR(cwd);


[PATCH 09/13] Use struct path in fs_struct

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

* Use struct path in fs_struct.
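
For reference, a sketch of the structure this converges on; the surrounding
members (count, lock, umask) are assumptions based on the mainline fs_struct
of that era and are not taken from this patch:

struct fs_struct {
        atomic_t count;
        rwlock_t lock;
        int umask;
        struct path root, pwd, altroot;         /* previously six mnt/dentry pairs */
};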

Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 fs/dcache.c   |   34 ---
 fs/namei.c|   53 ++
 fs/namespace.c|   57 --
 fs/proc/base.c|8 +++---
 include/linux/fs_struct.h |6 +---
 init/do_mounts.c  |6 ++--
 kernel/auditsc.c  |4 +--
 kernel/exit.c |   12 +++--
 kernel/fork.c |   18 +++---
 9 files changed, 87 insertions(+), 111 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1851,8 +1851,7 @@ char * d_path(struct dentry *dentry, str
char *buf, int buflen)
 {
char *res;
-   struct vfsmount *rootmnt;
-   struct dentry *root;
+   struct path root;
 
/*
 * We have various synthetic filesystems that never get mounted.  On
@@ -1865,14 +1864,13 @@ char * d_path(struct dentry *dentry, str
return dentry->d_op->d_dname(dentry, buf, buflen);
 
read_lock(&current->fs->lock);
-   rootmnt = mntget(current->fs->rootmnt);
-   root = dget(current->fs->root);
+   root = current->fs->root;
+   path_get(&current->fs->root);
read_unlock(&current->fs->lock);
spin_lock(&dcache_lock);
-   res = __d_path(dentry, vfsmnt, root, rootmnt, buf, buflen);
+   res = __d_path(dentry, vfsmnt, root.dentry, root.mnt, buf, buflen);
spin_unlock(&dcache_lock);
-   dput(root);
-   mntput(rootmnt);
+   path_put(&root);
return res;
 }
 
@@ -1918,28 +1916,28 @@ char *dynamic_dname(struct dentry *dentr
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size)
 {
int error;
-   struct vfsmount *pwdmnt, *rootmnt;
-   struct dentry *pwd, *root;
+   struct path pwd, root;
char *page = (char *) __get_free_page(GFP_USER);
 
if (!page)
return -ENOMEM;
 
read_lock(&current->fs->lock);
-   pwdmnt = mntget(current->fs->pwdmnt);
-   pwd = dget(current->fs->pwd);
-   rootmnt = mntget(current->fs->rootmnt);
-   root = dget(current->fs->root);
+   pwd = current->fs->pwd;
+   path_get(&current->fs->pwd);
+   root = current->fs->root;
+   path_get(&current->fs->root);
read_unlock(&current->fs->lock);
 
error = -ENOENT;
/* Has the current directory has been unlinked? */
spin_lock(&dcache_lock);
-   if (pwd->d_parent == pwd || !d_unhashed(pwd)) {
+   if (pwd.dentry->d_parent == pwd.dentry || !d_unhashed(pwd.dentry)) {
unsigned long len;
char * cwd;
 
-   cwd = __d_path(pwd, pwdmnt, root, rootmnt, page, PAGE_SIZE);
+   cwd = __d_path(pwd.dentry, pwd.mnt, root.dentry, root.mnt,
+  page, PAGE_SIZE);
spin_unlock(_lock);
 
error = PTR_ERR(cwd);
@@ -1957,10 +1955,8 @@ asmlinkage long sys_getcwd(char __user *
spin_unlock(_lock);
 
 out:
-   dput(pwd);
-   mntput(pwdmnt);
-   dput(root);
-   mntput(rootmnt);
+   path_put();
+   path_put(&pwd);
+   path_put(&root);
return error;
 }
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -550,16 +550,16 @@ walk_init_root(const char *name, struct 
struct fs_struct *fs = current->fs;
 
read_lock(&fs->lock);
-   if (fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
-   nd->path.mnt = mntget(fs->altrootmnt);
-   nd->path.dentry = dget(fs->altroot);
+   if (fs->altroot.dentry && !(nd->flags & LOOKUP_NOALT)) {
+   nd->path = fs->altroot;
+   path_get(&fs->altroot);
read_unlock(&fs->lock);
if (__emul_lookup_dentry(name,nd))
return 0;
read_lock(&fs->lock);
}
-   nd->path.mnt = mntget(fs->rootmnt);
-   nd->path.dentry = dget(fs->root);
+   nd->path = fs->root;
+   path_get(&fs->root);
read_unlock(&fs->lock);
return 1;
 }
@@ -756,8 +756,8 @@ static __always_inline void follow_dotdo
struct dentry *old = nd->path.dentry;
 
read_lock(&fs->lock);
-   if (nd->path.dentry == fs->root &&
-   nd->path.mnt == fs->rootmnt) {
+   if (nd->path.dentry == fs->root.dentry &&
+   nd->path.mnt == fs->root.mnt) {
read_unlock(&fs->lock);
break;
}
@@ -1079,8 +1079,

[PATCH 06/13] Introduce path_put()

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

* Add path_put() functions for releasing a reference to the dentry and
  vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()
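
A minimal sketch (illustration only, not the actual hunk) of what the new
helper boils down to, keeping the ordering described above of dropping the
dentry reference before the vfsmount one:

void path_put(struct path *path)
{
        dput(path->dentry);     /* release the dentry first */
        mntput(path->mnt);      /* then the vfsmount it lives on */
}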

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
---
 arch/alpha/kernel/osf_sys.c  |2 
 arch/mips/kernel/sysirix.c   |6 +-
 arch/parisc/hpux/sys_hpux.c  |2 
 arch/powerpc/platforms/cell/spufs/syscalls.c |2 
 arch/sparc64/solaris/fs.c|4 -
 drivers/md/dm-table.c|2 
 drivers/mtd/mtdsuper.c   |4 -
 fs/afs/mntpt.c   |2 
 fs/autofs4/root.c|2 
 fs/block_dev.c   |2 
 fs/coda/pioctl.c |4 -
 fs/compat.c  |4 -
 fs/configfs/symlink.c|4 -
 fs/dquot.c   |2 
 fs/ecryptfs/main.c   |2 
 fs/exec.c|4 -
 fs/ext3/super.c  |4 -
 fs/ext4/super.c  |4 -
 fs/gfs2/ops_fstype.c |2 
 fs/inotify_user.c|4 -
 fs/namei.c   |   56 ++-
 fs/namespace.c   |   20 -
 fs/nfs/namespace.c   |2 
 fs/nfsctl.c  |2 
 fs/nfsd/export.c |   10 ++--
 fs/nfsd/nfs4recover.c|2 
 fs/nfsd/nfs4state.c  |2 
 fs/open.c|   22 +-
 fs/proc/base.c   |2 
 fs/reiserfs/super.c  |8 +--
 fs/revoke.c  |2 
 fs/stat.c|6 +-
 fs/unionfs/main.c|2 
 fs/unionfs/super.c   |   12 ++---
 fs/utimes.c  |2 
 fs/xattr.c   |   16 +++
 fs/xfs/linux-2.6/xfs_ioctl.c |2 
 include/linux/namei.h|7 ---
 include/linux/path.h |2 
 kernel/audit_tree.c  |   16 +++
 kernel/auditfilter.c |4 -
 net/sunrpc/rpc_pipe.c|2 
 net/unix/af_unix.c   |6 +-
 43 files changed, 134 insertions(+), 133 deletions(-)

--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -261,7 +261,7 @@ osf_statfs(char __user *path, struct osf
retval = user_path_walk(path, &nd);
if (!retval) {
retval = do_osf_statfs(nd.path.dentry, buffer, bufsiz);
-   path_release(&nd);
+   path_put(&nd.path);
}
return retval;
 }
--- a/arch/mips/kernel/sysirix.c
+++ b/arch/mips/kernel/sysirix.c
@@ -711,7 +711,7 @@ asmlinkage int irix_statfs(const char __
}
 
 dput_and_out:
-   path_release(&nd);
+   path_put(&nd.path);
 out:
return error;
 }
@@ -1385,7 +1385,7 @@ asmlinkage int irix_statvfs(char __user 
error |= __put_user(0, >f_fstr[i]);
 
 dput_and_out:
-   path_release(&nd);
+   path_put(&nd.path);
 out:
return error;
 }
@@ -1636,7 +1636,7 @@ asmlinkage int irix_statvfs64(char __use
error |= __put_user(0, >f_fstr[i]);
 
 dput_and_out:
-   path_release(&nd);
+   path_put(&nd.path);
 out:
return error;
 }
--- a/arch/parisc/hpux/sys_hpux.c
+++ b/arch/parisc/hpux/sys_hpux.c
@@ -222,7 +222,7 @@ asmlinkage long hpux_statfs(const char _
error = vfs_statfs_hpux(nd.path.dentry, &tmp);
if (!error && copy_to_user(buf, &tmp, sizeof(tmp)))
error = -EFAULT;
-   path_release(&nd);
+   path_put(&nd.path);
}
return error;
 }
--- a/arch/powerpc/platforms/cell/spufs/syscalls.c
+++ b/arch/powerpc/platforms/cell/spufs/syscalls.c
@@ -73,7 +73,7 @@ static long do_spu_create(const char __u
LOOKUP_OPEN|LOOKUP_CREATE, &nd);
if (!ret) {
ret = spufs_create(&nd, flags, mode, neighbor);
-   path_release(&nd);
+   path_put(&nd.path);
}
putname(tmp);
}
--- a/arch/sparc64/solaris/fs.c
+++ b/arch/sparc64/solaris/fs.c
@@ -436,7 +436,7 @@ asmlinkage int solaris_statvfs(u32 path,
if (!error) {
struct inode *inode = nd.path.dentry->d_inode;
er

[PATCH 07/13] Use path_put() in a few places instead of {mnt,d}put()

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

Use path_put() in a few places instead of {mnt,d}put()

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 fs/afs/mntpt.c |3 +--
 fs/namei.c |   15 +--
 2 files changed, 6 insertions(+), 12 deletions(-)

--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -235,8 +235,7 @@ static void *afs_mntpt_follow_link(struc
err = do_add_mount(newmnt, nd, MNT_SHRINKABLE, &afs_vfsmounts);
switch (err) {
case 0:
-   dput(nd->path.dentry);
-   mntput(nd->path.mnt);
+   path_put(&nd->path);
nd->path.mnt = newmnt;
nd->path.dentry = dget(newmnt->mnt_root);
schedule_delayed_work(_mntpt_expiry_timer,
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -626,8 +626,7 @@ static __always_inline int __do_follow_l
if (dentry->d_inode->i_op->put_link)
dentry->d_inode->i_op->put_link(dentry, nd, cookie);
}
-   dput(dentry);
-   mntput(path->mnt);
+   path_put(path);
 
return error;
 }
@@ -1034,8 +1033,7 @@ static int fastcall link_path_walk(const
result = __link_path_walk(name, nd);
}
 
-   dput(save.path.dentry);
-   mntput(save.path.mnt);
+   path_put(&save.path);
 
return result;
 }
@@ -1057,8 +1055,7 @@ static int __emul_lookup_dentry(const ch
 
if (!nd->path.dentry->d_inode ||
S_ISDIR(nd->path.dentry->d_inode->i_mode)) {
-   struct dentry *old_dentry = nd->path.dentry;
-   struct vfsmount *old_mnt = nd->path.mnt;
+   struct path old_path = nd->path;
struct qstr last = nd->last;
int last_type = nd->last_type;
struct fs_struct *fs = current->fs;
@@ -1074,14 +1071,12 @@ static int __emul_lookup_dentry(const ch
read_unlock(&fs->lock);
if (path_walk(name, nd) == 0) {
if (nd->path.dentry->d_inode) {
-   dput(old_dentry);
-   mntput(old_mnt);
+   path_put(&old_path);
return 1;
}
path_put(&nd->path);
}
-   nd->path.dentry = old_dentry;
-   nd->path.mnt = old_mnt;
+   nd->path = old_path;
nd->last = last;
nd->last_type = last_type;
}


[PATCH 08/13] Introduce path_get()

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

This introduces the symmetric function to path_put() for getting a reference
to the dentry and vfsmount of a struct path in the right order.
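
Typical usage (illustrative only, the helper below is made up): copy the
struct path by value first, then take the references on the copy, and pair
it with path_put() when done:

static void example_cache_path(struct file *file, struct path *cached)
{
        *cached = file->f_path;         /* copy mnt + dentry by value */
        path_get(cached);               /* then pin both references */
}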

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 fs/namei.c|   17 +++--
 fs/unionfs/super.c|2 +-
 include/linux/namei.h |6 --
 include/linux/path.h  |1 +
 4 files changed, 17 insertions(+), 9 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -363,6 +363,19 @@ int deny_write_access(struct file * file
 }
 
 /**
+ * path_get - get a reference to a path
+ * @path: path to get the reference to
+ *
+ * Given a path increment the reference count to the dentry and the vfsmount.
+ */
+void path_get(struct path *path)
+{
+   mntget(path->mnt);
+   dget(path->dentry);
+}
+EXPORT_SYMBOL(path_get);
+
+/**
  * path_put - put a reference to a path
  * @path: path to put the reference to
  *
@@ -1161,8 +1174,8 @@ static int fastcall do_path_lookup(int d
if (retval)
goto fput_fail;
 
-   nd->path.mnt = mntget(file->f_path.mnt);
-   nd->path.dentry = dget(dentry);
+   nd->path = file->f_path;
+   path_get(&file->f_path);
 
fput_light(file, fput_needed);
}
--- a/fs/unionfs/super.c
+++ b/fs/unionfs/super.c
@@ -544,7 +544,7 @@ static int unionfs_remount_fs(struct sup
memcpy(tmp_lower_paths, UNIONFS_D(sb->s_root)->lower_paths,
   cur_branches * sizeof(struct path));
for (i = 0; i < cur_branches; i++)
-   pathget(&tmp_lower_paths[i]); /* drop refs at end of fxn */
+   path_get(&tmp_lower_paths[i]); /* drop refs at end of fxn */
 
/***
 * For each branch command, do path_lookup on the requested branch,
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -94,10 +94,4 @@ static inline char *nd_get_link(struct n
return nd->saved_names[nd->depth];
 }
 
-static inline void pathget(struct path *path)
-{
-   mntget(path->mnt);
-   dget(path->dentry);
-}
-
 #endif /* _LINUX_NAMEI_H */
--- a/include/linux/path.h
+++ b/include/linux/path.h
@@ -9,6 +9,7 @@ struct path {
struct dentry *dentry;
 };
 
+extern void path_get(struct path *);
 extern void path_put(struct path *);
 
 #endif  /* _LINUX_PATH_H */


[PATCH 04/13] Move struct path into its own header

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

Move the definition of struct path into its own header file for further
patches.

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 include/linux/namei.h |6 +-
 include/linux/path.h  |   12 
 2 files changed, 13 insertions(+), 5 deletions(-)

--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include <linux/path.h>
 
 struct vfsmount;
 
@@ -30,11 +31,6 @@ struct nameidata {
} intent;
 };
 
-struct path {
-   struct vfsmount *mnt;
-   struct dentry *dentry;
-};
-
 /*
  * Type of the last component on LOOKUP_PARENT
  */
--- /dev/null
+++ b/include/linux/path.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_PATH_H
+#define _LINUX_PATH_H
+
+struct dentry;
+struct vfsmount;
+
+struct path {
+   struct vfsmount *mnt;
+   struct dentry *dentry;
+};
+
+#endif  /* _LINUX_PATH_H */


[PATCH 03/13] Remove path_release_on_umount()

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

path_release_on_umount() should only be called from sys_umount(). I merged the
function into sys_umount() instead of having it in namei.c.

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 fs/namei.c|   10 --
 fs/namespace.c|4 +++-
 include/linux/namei.h |1 -
 3 files changed, 3 insertions(+), 12 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -368,16 +368,6 @@ void path_release(struct nameidata *nd)
mntput(nd->mnt);
 }
 
-/*
- * umount() mustn't call path_release()/mntput() as that would clear
- * mnt_expiry_mark
- */
-void path_release_on_umount(struct nameidata *nd)
-{
-   dput(nd->dentry);
-   mntput_no_expire(nd->mnt);
-}
-
 /**
  * release_open_intent - free up open intent resources
  * @nd: pointer to nameidata
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -988,7 +988,9 @@ asmlinkage long sys_umount(char __user *
 
retval = do_umount(nd.mnt, flags);
 dput_and_out:
-   path_release_on_umount(&nd);
+   /* we mustn't call path_put() as that would clear mnt_expiry_mark */
+   dput(nd.dentry);
+   mntput_no_expire(nd.mnt);
 out:
return retval;
 }
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -73,7 +73,6 @@ extern int FASTCALL(path_lookup(const ch
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
   const char *, unsigned int, struct nameidata *);
 extern void path_release(struct nameidata *);
-extern void path_release_on_umount(struct nameidata *);
 
 extern int __user_path_lookup_open(const char __user *, unsigned lookup_flags, 
struct nameidata *nd, int open_flags);
 extern int path_lookup_open(int dfd, const char *name, unsigned lookup_flags, 
struct nameidata *, int open_flags);


[PATCH 02/13] Don't touch fs_struct in usermodehelper

2007-10-22 Thread Bharata B Rao
From: Jan Blunck <[EMAIL PROTECTED]>

This test seems to be unnecessary since we always have rootfs mounted before
calling a usermodehelper.

Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
---
 kernel/kmod.c |5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -173,10 +173,7 @@ static int call_usermodehelper(void 
 */
set_user_nice(current, 0);
 
-   retval = -EPERM;
-   if (current->fs->root)
-   retval = kernel_execve(sub_info->path,
-   sub_info->argv, sub_info->envp);
+   retval = kernel_execve(sub_info->path, sub_info->argv, sub_info->envp);
 
/* Exec failed? */
sub_info->retval = retval;

