Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Pavel Emelianov
Dave Hansen wrote:
> On Mon, 2007-03-12 at 20:19 +0300, Pavel Emelianov wrote:
>> Dave Hansen wrote:
>>> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
>>>> now VE2 maps the same page. You can't determine whether this page is mapped
>>>> to this container or another one w/o page->container pointer.
>>> Hi Kirill,
>>>
>>> I thought we can always get from the page to the VMA.  rmap provides
>>> this to us via page->mapping and the 'struct address_space' or anon_vma.
>>> Do we agree on that?
>> Not completely. When page is unmapped from the *very last*
>> user its *first* toucher may already be dead. So we'll never
>> find out who it was.
> 
> OK, but  this is assuming that we didn't *un*account for the page when
> the last user of the "owning" container stopped using the page.

That's exactly what we agreed on during our discussions:
When a page is touched, it is charged to the touching container.
When the page is touched again by a new container, it is NOT
charged to the new container, but keeps holding the old one
until it (the page) is completely freed. Nobody worried about
the fact that a single page can hold a container for good.
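
For illustration, a minimal userspace sketch of the first-touch rule
described above. All struct and function names here are hypothetical and
do not come from the patch set; the sketch only models the charging rule,
not the kernel code.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patches. */
struct container {
	long charged_pages;		/* pages currently charged to this container */
};

struct page {
	struct container *owner;	/* container that touched the page first */
	int users;			/* number of current users of the page */
};

/* A container starts using the page; only the first toucher is charged. */
static void page_touch(struct page *pg, struct container *c)
{
	if (pg->owner == NULL) {
		pg->owner = c;
		c->charged_pages++;
	}
	pg->users++;
}

/* One user drops the page; the charge is released only when the page is free. */
static void page_release(struct page *pg)
{
	if (--pg->users == 0) {
		pg->owner->charged_pages--;
		pg->owner = NULL;
	}
}
```

Note that page_release() never needs to know who the caller is: the
owner recorded at first touch stays charged until the last user is gone,
even if the owner itself stopped using the page long before.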

OpenVZ beancounters work the other way (and we proposed this
solution when we first sent the patches). We keep track of
*all* the containers (i.e. beancounters) holding this page.

>>> We can also get from the vma to the mm very easily, via vma->vm_mm,
>>> right?
>>>
>>> We can also get from a task to the container quite easily.  
>>>
>>> So, the only question becomes whether there is a 1:1 relationship
>>> between mm_structs and containers.  Does each mm_struct belong to one
>> No. The question is "how to get a container that touched the
>> page first" which is the same as "how to find mm_struct which
>> touched the page first". Obviously there's no answer on this
>> question unless we hold some direct page->container reference.
>> This may be a hash, a direct on-page pointer, or mirrored
>> array of pointers.
> 
> Or, you keep track of when the last user from the container goes away,
> and you effectively account it to another one.

We can migrate the page to another user, but we decided
to implement that later, once simple accounting is accepted.

> Are there problems with shifting ownership around like this?
> 
> -- Dave
> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Dave Hansen
On Mon, 2007-03-12 at 20:19 +0300, Pavel Emelianov wrote:
> Dave Hansen wrote:
> > On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
> >> now VE2 maps the same page. You can't determine whether this page is mapped
> >> to this container or another one w/o page->container pointer. 
> > 
> > Hi Kirill,
> > 
> > I thought we can always get from the page to the VMA.  rmap provides
> > this to us via page->mapping and the 'struct address_space' or anon_vma.
> > Do we agree on that?
> 
> Not completely. When page is unmapped from the *very last*
> user its *first* toucher may already be dead. So we'll never
> find out who it was.

OK, but  this is assuming that we didn't *un*account for the page when
the last user of the "owning" container stopped using the page.

> > We can also get from the vma to the mm very easily, via vma->vm_mm,
> > right?
> > 
> > We can also get from a task to the container quite easily.  
> > 
> > So, the only question becomes whether there is a 1:1 relationship
> > between mm_structs and containers.  Does each mm_struct belong to one
> 
> No. The question is "how to get a container that touched the
> page first" which is the same as "how to find mm_struct which
> touched the page first". Obviously there's no answer on this
> question unless we hold some direct page->container reference.
> This may be a hash, a direct on-page pointer, or mirrored
> array of pointers.

Or, you keep track of when the last user from the container goes away,
and you effectively account it to another one.

Are there problems with shifting ownership around like this?

-- Dave



Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Balbir Singh

On 3/12/07, Dave Hansen <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
> > now VE2 maps the same page. You can't determine whether this page is mapped
> > to this container or another one w/o page->container pointer.
>
> Hi Kirill,
>
> I thought we can always get from the page to the VMA.  rmap provides
> this to us via page->mapping and the 'struct address_space' or anon_vma.
> Do we agree on that?
>
> We can also get from the vma to the mm very easily, via vma->vm_mm,
> right?
>
> We can also get from a task to the container quite easily.
>
> So, the only question becomes whether there is a 1:1 relationship
> between mm_structs and containers.  Does each mm_struct belong to one
> and only one container?  Basically, can a threaded process have
> different threads in different containers?
>
> It seems that we could bridge the gap pretty easily by either assigning
> each mm_struct to a container directly, or putting some kind of
> task-to-mm lookup.  Perhaps just a list like
> mm->tasks_using_this_mm_list.
>
> Not rocket science, right?
>
> -- Dave


These patches are very similar to what I posted at
   http://lwn.net/Articles/223829/
In my patches, the thread group leader owns the mm_struct and all
threads belong to the same container. I did not have a per-container
LRU, so walking the global list for reclaim was a bit slow, but otherwise
my patches did not add anything to struct page.

I used rmap information to get to the VMA and then the mm_struct.
Kirill, it is possible to determine all the containers that map the
page. Please see the page_in_container() function of
http://lkml.org/lkml/2007/2/26/7.

I was also thinking of using the page table(s) to identify all pages
belonging to a container, by obtaining all the mm_structs of tasks
belonging to a container. But this approach would not work well for
the page cache controller, when we add that to our memory controller.
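
To illustrate the page -> VMA -> mm_struct -> container walk mentioned
above, here is a small userspace sketch. It is not the page_in_container()
function from the posted patches; the types are simplified stand-ins, the
"mappers" list is a hypothetical replacement for the real rmap structures,
and the mm -> container pointer is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; all fields are assumptions. */
struct container { int id; };
struct mm { struct container *container; };	/* assumes one container per mm */
struct vma { struct mm *vm_mm; struct vma *next; };
struct page { struct vma *mappers; };		/* stand-in for the rmap lists */

/* Walk every VMA mapping the page and check whether its mm belongs to c. */
static bool page_mapped_by(const struct page *pg, const struct container *c)
{
	for (const struct vma *v = pg->mappers; v != NULL; v = v->next)
		if (v->vm_mm->container == c)
			return true;
	return false;
}
```

The walk answers "who maps this page now", which is enough to find all
current users, but it cannot tell which container touched the page first.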

Balbir


Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Pavel Emelianov
Dave Hansen wrote:
> On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
>> now VE2 maps the same page. You can't determine whether this page is mapped
>> to this container or another one w/o page->container pointer. 
> 
> Hi Kirill,
> 
> I thought we can always get from the page to the VMA.  rmap provides
> this to us via page->mapping and the 'struct address_space' or anon_vma.
> Do we agree on that?

Not completely. When page is unmapped from the *very last*
user its *first* toucher may already be dead. So we'll never
find out who it was.

> We can also get from the vma to the mm very easily, via vma->vm_mm,
> right?
> 
> We can also get from a task to the container quite easily.  
> 
> So, the only question becomes whether there is a 1:1 relationship
> between mm_structs and containers.  Does each mm_struct belong to one

No. The question is "how to get a container that touched the
page first" which is the same as "how to find mm_struct which
touched the page first". Obviously there's no answer on this
question unless we hold some direct page->container reference.
This may be a hash, a direct on-page pointer, or mirrored
array of pointers.

> and only one container?  Basically, can a threaded process have
> different threads in different containers?
> 
> It seems that we could bridge the gap pretty easily by either assigning
> each mm_struct to a container directly, or putting some kind of
> task-to-mm lookup.  Perhaps just a list like
> mm->tasks_using_this_mm_list.

This could work for reclamation: we scan through all the
mm_struct-s within the container and shrink their pages, but
we can't build an LRU this way.

> Not rocket science, right?
> 
> -- Dave
> 


Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Dave Hansen
On Mon, 2007-03-12 at 19:16 +0300, Kirill Korotaev wrote:
> now VE2 maps the same page. You can't determine whether this page is mapped
> to this container or another one w/o page->container pointer. 

Hi Kirill,

I thought we can always get from the page to the VMA.  rmap provides
this to us via page->mapping and the 'struct address_space' or anon_vma.
Do we agree on that?

We can also get from the vma to the mm very easily, via vma->vm_mm,
right?

We can also get from a task to the container quite easily.  

So, the only question becomes whether there is a 1:1 relationship
between mm_structs and containers.  Does each mm_struct belong to one
and only one container?  Basically, can a threaded process have
different threads in different containers?

It seems that we could bridge the gap pretty easily by either assigning
each mm_struct to a container directly, or putting some kind of
task-to-mm lookup.  Perhaps just a list like
mm->tasks_using_this_mm_list.

Not rocket science, right?

-- Dave



Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-12 Thread Kirill Korotaev
Eric W. Biederman wrote:
> Pavel Emelianov <[EMAIL PROTECTED]> writes:
> 
> 
>>Adds needed pointers to mm_struct and page struct,
>>places hooks to core code for mm_struct initialization
>>and hooks in container_init_early() to preinitialize
>>RSS accounting subsystem.
> 
> 
> An extra pointer in struct page is unlikely to fly.
> Both because it increases the size of a size critical structure,
> and because conceptually it is ridiculous.
As was discussed multiple times (and according to OLS):
- it is not critical nowadays to expand struct page a bit when
  accounting is on.
- it can be done w/o extending it, e.g. via a page <-> container
  mapping using a hash or some other data structure,
  i.e. we can optimize it for size if that is considered necessary.
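
A rough userspace sketch of the second option, i.e. keeping the
page <-> container association in a side hash table instead of in
struct page. Everything here is hypothetical: the table size, the use of
a page frame number as the key, and the pc_* names are illustration only.

```c
#include <stddef.h>

#define PC_HASH_SIZE 64		/* arbitrary illustrative size */

struct container { int id; };

struct pc_entry {
	unsigned long pfn;		/* key: page frame number */
	struct container *owner;	/* value: container charged for the page */
	int used;
};

static struct pc_entry pc_hash[PC_HASH_SIZE];

static size_t pc_slot(unsigned long pfn)
{
	return pfn % PC_HASH_SIZE;
}

/* Insert or update; linear probing. Returns 0 on success, -1 if the table is full. */
static int pc_set(unsigned long pfn, struct container *owner)
{
	for (size_t i = 0; i < PC_HASH_SIZE; i++) {
		struct pc_entry *e = &pc_hash[(pc_slot(pfn) + i) % PC_HASH_SIZE];

		if (!e->used || e->pfn == pfn) {
			e->pfn = pfn;
			e->owner = owner;
			e->used = 1;
			return 0;
		}
	}
	return -1;
}

/* Look up the owning container; NULL if the page has no entry. */
static struct container *pc_get(unsigned long pfn)
{
	for (size_t i = 0; i < PC_HASH_SIZE; i++) {
		const struct pc_entry *e = &pc_hash[(pc_slot(pfn) + i) % PC_HASH_SIZE];

		if (!e->used)
			return NULL;
		if (e->pfn == pfn)
			return e->owner;
	}
	return NULL;
}
```

The sketch omits deletion and locking; it only shows that struct page
itself does not have to grow, at the price of a lookup on each access.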

> If you are limiting the RSS size you are counting the number of pages in
> the page tables.  You don't care about the page itself.
> 
> With the rmap code it is relatively straight forward to see if this is
> the first time a page has been added to a page table in your rss
> group, or if this is the last reference to a particular page in your
> rss group.  The counters should only increment the first time a
> particular page is added to your rss group.  The counters should only
> decrement when it is the last reference in your rss subsystem.
You are fundamentally wrong where shared pages are concerned.
Imagine a glibc page shared between 2 containers - VE1 and VE2.
VE1 was the first who mapped it, so it is accounted to VE1
(rmap count was increased by it).
now VE2 maps the same page. You can't determine whether this page is mapped
to this container or another one w/o page->container pointer.
All the choices you have are:
a) do not account this page, since it is already accounted to some other VE.
b) account this page again to the current container.

(a) is bad, since VE1 can unmap this page first, and the last user will be VE2.
Which means VE1 will stay charged for it, while VE2 gets uncharged. Accounting
screws up.

(b) is bad, since:
  - the same page is accounted multiple times, which makes it impossible
    to understand how much real memory a container needs/consumes
  - and because on container enter the process and its pages
    are essentially moved to another context, while accounting
    can not be fixed up easily and we essentially have (a).
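
The skew in (a) can be reproduced with a tiny userspace model. This is
only a sketch under assumptions: the page carries nothing but a global
map count (no page->container pointer), the first mapper is charged, and
whoever drops the last mapping is uncharged. The names are hypothetical.

```c
/* Hypothetical model of option (a): no owner is recorded on the page. */
struct container { long rss; };
struct page { int mapcount; };

static void map_page(struct page *pg, struct container *c)
{
	if (pg->mapcount++ == 0)
		c->rss++;	/* charged: first mapper overall */
}

static void unmap_page(struct page *pg, struct container *c)
{
	if (--pg->mapcount == 0)
		c->rss--;	/* uncharged: last unmapper, whoever that is */
}
```

With the sequence VE1 maps, VE2 maps, VE1 unmaps, VE2 unmaps, VE1 ends
with rss == 1 and VE2 with rss == -1, although neither maps anything.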

> This allow important little cases like glibc to be properly accounted
> for. One of the key features of a rss limit is that the kernel can
> still keep pages that you need in-core, that are accessible with just
> a minor fault.  Directly owning pages works directly against that
> principle.
Sorry, I can't understand what you mean. It doesn't work against it.
Each container has its own LRU. So if glibc has the most
often used pages, it won't be thrashed out.

Thanks,
Kirill


Re: [RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-11 Thread Eric W. Biederman
Pavel Emelianov <[EMAIL PROTECTED]> writes:

> Adds needed pointers to mm_struct and page struct,
> places hooks to core code for mm_struct initialization
> and hooks in container_init_early() to preinitialize
> RSS accounting subsystem.

An extra pointer in struct page is unlikely to fly.
Both because it increases the size of a size critical structure,
and because conceptually it is ridiculous.

If you are limiting the RSS size you are counting the number of pages in
the page tables.  You don't care about the page itself.

With the rmap code it is relatively straight forward to see if this is
the first time a page has been added to a page table in your rss
group, or if this is the last reference to a particular page in your
rss group.  The counters should only increment the first time a
particular page is added to your rss group.  The counters should only
decrement when it is the last reference in your rss subsystem.
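
A userspace sketch of the counting rule in the paragraph above: a group's
counter goes up only on its first mapping of a page and down only on its
last unmapping. The per-group, per-page table is a stand-in chosen for the
sketch; in the scheme described here that information would come from an
rmap walk rather than from stored counts. All names are hypothetical.

```c
#define NR_GROUPS 2	/* arbitrary sizes for the model */
#define NR_PAGES  4

static int  group_maps[NR_GROUPS][NR_PAGES];	/* mappings of each page per group */
static long group_rss[NR_GROUPS];		/* pages counted against each group */

/* Count the page only the first time this group maps it. */
static void group_map(int g, int page)
{
	if (group_maps[g][page]++ == 0)
		group_rss[g]++;
}

/* Drop the count only when this group's last mapping goes away. */
static void group_unmap(int g, int page)
{
	if (--group_maps[g][page] == 0)
		group_rss[g]--;
}
```

Under this rule a page shared by two groups is counted once in each of
them, independent of which group mapped it first.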

This allows important little cases like glibc to be properly accounted
for. One of the key features of a rss limit is that the kernel can
still keep pages that you need in-core, that are accessible with just
a minor fault.  Directly owning pages works directly against that
principle.


> diff -upr linux-2.6.20.orig/include/linux/mm_types.h linux-2.6.20-0/include/linux/mm_types.h
> --- linux-2.6.20.orig/include/linux/mm_types.h 2007-02-04 21:44:54.0 +0300
> +++ linux-2.6.20-0/include/linux/mm_types.h 2007-03-06 13:33:28.0 +0300
> @@ -62,6 +62,9 @@ struct page {
>   void *virtual;  /* Kernel virtual address (NULL if
>  not kmapped, ie. highmem) */
>  #endif /* WANT_PAGE_VIRTUAL */
> +#ifdef CONFIG_RSS_CONTAINER
> + struct page_container *rss_container;
> +#endif
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */

Eric


[RFC][PATCH 3/7] Data structures changes for RSS accounting

2007-03-06 Thread Pavel Emelianov
Adds needed pointers to mm_struct and page struct,
places hooks to core code for mm_struct initialization
and hooks in container_init_early() to preinitialize
RSS accounting subsystem.
diff -upr linux-2.6.20.orig/include/linux/mm.h linux-2.6.20-0/include/linux/mm.h
--- linux-2.6.20.orig/include/linux/mm.h	2007-02-04 21:44:54.0 +0300
+++ linux-2.6.20-0/include/linux/mm.h	2007-03-06 13:33:28.0 +0300
@@ -220,6 +220,12 @@ struct vm_operations_struct {
 struct mmu_gather;
 struct inode;
 
+#ifdef CONFIG_RSS_CONTAINER
+#define page_container(page)   (page->rss_container)
+#else
+#define page_container(page)   (NULL)
+#endif
+
 #define page_private(page) ((page)->private)
 #define set_page_private(page, v)  ((page)->private = (v))
 
diff -upr linux-2.6.20.orig/include/linux/mm_types.h linux-2.6.20-0/include/linux/mm_types.h
--- linux-2.6.20.orig/include/linux/mm_types.h	2007-02-04 21:44:54.0 +0300
+++ linux-2.6.20-0/include/linux/mm_types.h	2007-03-06 13:33:28.0 +0300
@@ -62,6 +62,9 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
+#ifdef CONFIG_RSS_CONTAINER
+	struct page_container *rss_container;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff -upr linux-2.6.20.orig/include/linux/sched.h linux-2.6.20-0/include/linux/sched.h
--- linux-2.6.20.orig/include/linux/sched.h	2007-03-06 13:33:28.0 +0300
+++ linux-2.6.20-0/include/linux/sched.h	2007-03-06 13:33:28.0 +0300
@@ -373,6 +373,9 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+#ifdef CONFIG_RSS_CONTAINER
+	struct rss_container	*rss_container;
+#endif
 };
 
 struct sighand_struct {
diff -upr linux-2.6.20.orig/kernel/fork.c linux-2.6.20-0/kernel/fork.c
--- linux-2.6.20.orig/kernel/fork.c	2007-03-06 13:33:28.0 +0300
+++ linux-2.6.20-0/kernel/fork.c	2007-03-06 13:33:28.0 +0300
@@ -57,6 +57,8 @@
 #include 
 #include 
 
+#include 
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -325,7 +328,7 @@ static inline void mm_free_pgd(struct mm
 
 #include 
 
-static struct mm_struct * mm_init(struct mm_struct * mm)
+static struct mm_struct * mm_init(struct mm_struct *mm, struct task_struct *tsk)
 {
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
@@ -341,10 +344,18 @@ static struct mm_struct * mm_init(struct
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 
-	if (likely(!mm_alloc_pgd(mm))) {
-		mm->def_flags = 0;
-		return mm;
-	}
+	if (unlikely(mm_init_container(mm, tsk)))
+		goto out_cont;
+
+	if (unlikely(mm_alloc_pgd(mm)))
+		goto out_pgd;
+
+	mm->def_flags = 0;
+	return mm;
+
+out_pgd:
+	mm_free_container(mm);
+out_cont:
 	free_mm(mm);
 	return NULL;
 }
@@ -359,7 +370,7 @@ struct mm_struct * mm_alloc(void)
 	mm = allocate_mm();
 	if (mm) {
 		memset(mm, 0, sizeof(*mm));
-		mm = mm_init(mm);
+		mm = mm_init(mm, current);
 	}
 	return mm;
 }
@@ -373,6 +384,7 @@ void fastcall __mmdrop(struct mm_struct 
 {
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
+	mm_free_container(mm);
 	destroy_context(mm);
 	free_mm(mm);
 }
@@ -493,7 +505,7 @@ static struct mm_struct *dup_mm(struct t
 	mm->token_priority = 0;
 	mm->last_interval = 0;
 
-	if (!mm_init(mm))
+	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
 	if (init_new_context(tsk, mm))
@@ -520,6 +532,7 @@ fail_nocontext:
 	 * because it calls destroy_context()
 	 */
 	mm_free_pgd(mm);
+	mm_free_container(mm);
 	free_mm(mm);
 	return NULL;
 }
diff -upr linux-2.6.20.orig/kernel/container.c linux-2.6.20-0/kernel/container.c
--- linux-2.6.20.orig/kernel/container.c	2007-03-06 13:33:28.0 +0300
+++ linux-2.6.20-0/kernel/container.c	2007-03-06 13:35:48.0 +0300
@@ -60,6 +60,8 @@
 #include 
 #include 
 
+#include 
+
 #define CONTAINER_SUPER_MAGIC		0x27e0eb
 
 static struct container_subsys *subsys[CONFIG_MAX_CONTAINER_SUBSYS];
@@ -1721,6 +1725,8 @@ int __init container_init_early(void)
 	}
 	init_task.containers = &init_container_group;
 
+	container_rss_init_early();
+
 	return 0;
 }
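
For readers unfamiliar with the error path introduced in mm_init() above,
here is a standalone userspace sketch of the same two-step goto unwinding.
It is not the kernel code: struct ctx, the helper functions and the
fail_pgd knob are hypothetical stand-ins for mm_struct,
mm_init_container()/mm_free_container() and mm_alloc_pgd().

```c
#include <stdlib.h>

/* Hypothetical stand-in for mm_struct with two resources to set up. */
struct ctx {
	void *container;
	void *pgd;
};

static int fail_pgd;	/* test knob: force the second step to fail */

static int init_container(struct ctx *c)
{
	c->container = malloc(1);
	return c->container ? 0 : -1;
}

static void free_container(struct ctx *c)
{
	free(c->container);
	c->container = NULL;
}

static int alloc_pgd(struct ctx *c)
{
	if (fail_pgd)
		return -1;
	c->pgd = malloc(1);
	return c->pgd ? 0 : -1;
}

/* Same shape as the patched mm_init(): undo in reverse order on failure. */
static struct ctx *ctx_init(struct ctx *c)
{
	if (init_container(c))
		goto out_cont;

	if (alloc_pgd(c))
		goto out_pgd;

	return c;

out_pgd:
	free_container(c);
out_cont:
	free(c);
	return NULL;
}
```

Each label releases exactly what was set up before the failing step, so
a failure in the second step frees the container and then the object
itself, while a failure in the first step frees only the object.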