[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers

2012-07-30 Thread Andrea Righi
On Wed, Jul 25, 2012 at 02:00:41PM +0400, Glauber Costa wrote:
 On 07/25/2012 02:00 PM, Eric W. Biederman wrote:
  Glauber Costa glom...@parallels.com writes:
  
  On 07/12/2012 01:41 AM, Kir Kolyshkin wrote:
  Gentlemen,
 
  We are organizing containers mini-summit during next Linux Plumbers (San
  Diego, August 29-31).
  The idea is to gather and discuss everything relevant to namespaces,
  cgroups, resource management,
  checkpoint-restore and so on.
 
  We are trying to come up with a list of topics to discuss, so please
  reply with topic suggestions, and
  let me know if you are going to come.
 
  I probably forgot a few more people (such as, I am not sure who else
  from Google is working
  on cgroups stuff), so feel free to forward this to anyone you believe
  should go,
  or just let me know whom I missed.
 
  Regards,
Kir.
 
  BTW, sorry for not replying before (vacations + post-vacations laziness)
 
  I would be interested in adding /proc virtualization to the discussion.
  By now it seems userspace would be the best place for that to happen, in
  a fuse overlay. I know Daniel has an initial implementation of that, and
  it would be good to have it as library that both OpenVZ and LXC (and
  whoever else wants) can use.
 
  Shouldn't take much time...
  
  What would you need proc virtualization for?
  
 
 proc provides a lot of information that userspace tools rely upon.
 For instance, when running top, you can draw per-process figures from
 what we have now, but you can't make sense of percentages without
 aggregating container-wide information.
 
 When you read /proc/cpuinfo, as well, you would expect to see something
 that matches your container configuration.
 
 free is another example. The list goes on.

Another interesting feature IMHO would be the per-cgroup loadavg. A
typical use case could be a monitoring system that wants to know which
containers are more overloaded than others, instead of using a single
system-wide measure in /proc/loadavg.
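
(Just to make the idea concrete: a per-cgroup loadavg could reuse the same
fixed-point exponential decay the global /proc/loadavg uses. A small
stand-alone sketch of that math, with made-up per-container run-queue
samples; nothing kernel-specific, only to show what a monitoring tool would
get out of it:)

#include <stdio.h>

/* same fixed-point constants the kernel uses for the 1-minute average */
#define FSHIFT   11
#define FIXED_1  (1 << FSHIFT)        /* 1.0 in fixed point */
#define EXP_1    1884                 /* 1/exp(5sec/1min), fixed point */

/* one 5-second update step: load = load*exp + active*(1-exp) */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
        load *= exp;
        load += active * (FIXED_1 - exp);
        return load >> FSHIFT;
}

int main(void)
{
        /* made-up runnable-task counts of one container, one sample per 5s */
        unsigned long samples[] = { 4, 4, 3, 0, 0, 2 };
        unsigned long load = 0;
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                load = calc_load(load, EXP_1, samples[i] * FIXED_1);
                printf("container loadavg (1 min): %lu.%02lu\n",
                       load >> FSHIFT,
                       ((load & (FIXED_1 - 1)) * 100) >> FSHIFT);
        }
        return 0;
}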

-Andrea



[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Andrea Righi
On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote:
  I think we should accept inode granularity. We could redesign
  the writeback code to work per-cgroup / per-page, etc., but that would
  add a huge overhead. The limit of inode granularity could be an
  acceptable tradeoff; cgroups usually work on different files,
  well.. except when databases come into play (ouch!).
 
 Agreed. Granularity at the per-inode level might be acceptable in many
 cases. Again, I am worried about a faster group getting stuck behind a
 slower group.
 
 I am wondering if we are trying to solve the problem of ASYNC write throttling
 at the wrong layer. Should ASYNC IO be throttled before we allow the task to
 write to the page cache? The way we throttle the process based on the dirty
 ratio, can we just check for throttle limits there too, or something like
 that? (I think that's what you had done in your initial throttling controller
 implementation?)

Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.

This is probably the simplest way to resolve the problem of a faster group
getting blocked by a slower group, but the controller will be a little bit
more leaky, because the writeback IO will never be throttled and we'll
see some limited IO spikes during writeback. However, this is still
a better solution IMHO compared to the current implementation, which is
affected by that kind of priority inversion problem.

I can try to add this logic to the current blk-throttle controller if
you think it is worth testing.
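
(For reference, the core of what I have in mind is little more than a rate
check applied where the task dirties pages, instead of where the flusher
writes them back. A stand-alone user-space sketch, with invented names and a
crude 1-second window, just to show the shape of it:)

#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* crude per-cgroup budget for buffered (async) writes: bytes per second */
struct dirty_throttle {
        uint64_t bps_limit;     /* configured limit, bytes/sec */
        uint64_t dirtied;       /* bytes dirtied in the current window */
        time_t   window;        /* start of the current 1-second window */
};

/*
 * Called each time a task in the group dirties 'bytes' of page cache:
 * once the per-second budget is exceeded the dirtier itself sleeps,
 * so the writeback path never has to be throttled.
 */
static void throttle_dirtier(struct dirty_throttle *dt, uint64_t bytes)
{
        time_t now = time(NULL);

        if (now != dt->window) {        /* new window, reset the budget */
                dt->window = now;
                dt->dirtied = 0;
        }
        dt->dirtied += bytes;
        if (dt->dirtied > dt->bps_limit) {
                sleep(1);               /* block the writer, not writeback */
                dt->window = time(NULL);
                dt->dirtied = 0;
        }
}

int main(void)
{
        struct dirty_throttle dt = { .bps_limit = 1024 * 1024 };
        int i;

        for (i = 0; i < 4; i++)
                throttle_dirtier(&dt, 512 * 1024);      /* 512 KiB per call */
        return 0;
}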

-Andrea


[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages

2011-02-23 Thread Andrea Righi
On Tue, Feb 22, 2011 at 07:07:19PM -0500, Vivek Goyal wrote:
 On Wed, Feb 23, 2011 at 12:05:34AM +0100, Andrea Righi wrote:
  On Tue, Feb 22, 2011 at 04:00:30PM -0500, Vivek Goyal wrote:
   On Tue, Feb 22, 2011 at 06:12:55PM +0100, Andrea Righi wrote:
Add the tracking of buffered (writeback) and anonymous pages.

Dirty pages in the page cache can be processed asynchronously by the
per-bdi flusher kernel threads or by any other thread in the system,
according to the writeback policy.

For this reason the real writes to the underlying block devices may
occur in a different IO context with respect to the task that originally
generated the dirty pages involved in the IO operation. This makes
the tracking and throttling of writeback IO more complicated than that of
synchronous IO from the blkio controller's point of view.

The idea is to save the cgroup owner of each anonymous page and dirty
page in page cache. A page is associated to a cgroup the first time it
is dirtied in memory (for file cache pages) or when it is set as
swap-backed (for anonymous pages). This information is stored using the
page_cgroup functionality.

Then, at the block layer, it is possible to retrieve the throttle group
looking at the bio_page(bio). If the page was not explicitly associated
to any cgroup the IO operation is charged to the current task/cgroup, as
it was done by the previous implementation.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-throttle.c   |   87 +++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -8,6 +8,10 @@
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include <linux/bio.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/page_cgroup.h>
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>
 
@@ -221,6 +225,85 @@ done:
return tg;
 }
 
+static inline bool is_kernel_io(void)
+{
+   return !!(current->flags & (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC));
+}
+
+static int throtl_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+   struct blkio_cgroup *blkcg;
+   unsigned short id = 0;
+
+   if (blkio_cgroup_disabled())
+   return 0;
+   if (!mm)
+   goto out;
+   rcu_read_lock();
+   blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
+   if (likely(blkcg))
+   id = css_id(blkcg->css);
+   rcu_read_unlock();
+out:
+   return page_cgroup_set_owner(page, id);
+}
+
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
+{
+   return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
+
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
+{
+   if (is_kernel_io() || !page_is_file_cache(page))
+   return 0;
+   return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_filepage_owner);
   
   Why are we exporting all these symbols?
  
  Right. Probably a single one is enough:
  
   int blk_throtl_set_page_owner(struct page *page,
  struct mm_struct *mm, bool anon);
 
 Who is going to use this single export? Which module?
 

I was actually thinking of some filesystem modules, but I was wrong,
because at the moment no one needs the export. I'll remove it in the
next version of the patch.
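
(For reference, folding the two exported wrappers from the patch into the
single entry point suggested above would look roughly like this; untested
sketch, not the final code:)

int blk_throtl_set_page_owner(struct page *page, struct mm_struct *mm,
                              bool anon)
{
        /* keep the file-cache-only filtering of the old filepage variant */
        if (!anon && (is_kernel_io() || !page_is_file_cache(page)))
                return 0;
        return throtl_set_page_owner(page, mm);
}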

Thanks,
-Andrea


[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-23 Thread Andrea Righi
On Wed, Feb 23, 2011 at 01:49:10PM +0900, KAMEZAWA Hiroyuki wrote:
 On Wed, 23 Feb 2011 00:37:18 +0100
 Andrea Righi ari...@develer.com wrote:
 
  On Tue, Feb 22, 2011 at 06:06:30PM -0500, Vivek Goyal wrote:
   On Wed, Feb 23, 2011 at 12:01:47AM +0100, Andrea Righi wrote:
On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote:
 On Tue, 22 Feb 2011 18:12:54 +0100
 Andrea Righi ari...@develer.com wrote:
 
   The page_cgroup infrastructure, currently available only for the memory
   cgroup controller, can be used to store the owner of each page and
   opportunely track the writeback IO. This information is encoded in
   the upper 16-bits of the page_cgroup->flags.
   
   An owner can be identified using a generic ID number and the following
   interfaces are provided to store and retrieve this information:
  
unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
 int page_cgroup_copy_owner(struct page *npage, struct page *opage);
 
 My immediate observation is that you're not really tracking the owner
 here - you're tracking an opaque 16-bit token known only to the block
 controller in a field which - if changed by anybody other than the block
 controller - will lead to mayhem in the block controller.  I think it
 might be clearer - and safer - to say blkcg or some such instead of
 owner here.
 

Basically the idea here was to be as generic as possible and make this
feature potentially available also to other subsystems, so that cgroup
subsystems may represent whatever they want with the 16-bit token.
However, no more than a single subsystem may be able to use this feature
at the same time.

 I'm tempted to say it might be better to just add a pointer to your
 throtl_grp structure into struct page_cgroup.  Or maybe replace the
 mem_cgroup pointer with a single pointer to struct css_set.  Both of
 those ideas, though, probably just add unwanted extra overhead now to 
 gain
 generality which may or may not be wanted in the future.

The pointer to css_set sounds good, but it would add additional space to
the page_cgroup struct. Now, page_cgroup is 40 bytes (in 64-bit arch)
and all of them are allocated at boot time. Using unused bits in
page_cgroup->flags is a choice with no overhead from this point of view.
   
    I think John suggested replacing the mem_cgroup pointer with css_set so that
    the size of the structure does not increase, but it leads to an extra level
    of indirection.
  
  OK, got it sorry.
  
  So, IIUC we save css_set pointer and get a struct cgroup as following:
  
  struct cgroup *cgrp = css_set->subsys[subsys_id]->cgroup;
  
  Then, for example to get the mem_cgroup reference:
  
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
  
  It seems a lot of indirections, but I may have done something wrong or
  there could be a simpler way to do it.
  
 
 
 Then, page_cgroup would need a reference count on css_set and tons of
 atomic ops.
 
 BTW, bits of pc->flags are used for storing the sectionID or nodeID.
 Please clarify that your 16 bits never break that information. And please
 keep 4-5 more flags for dirty_ratio support of memcg.

OK, I didn't see the recent work about section and node id encoded in
the pc->flags, thanks. So, it'd probably be better to rebase the patch to
the latest mmotm to check all this stuff.

 
 I wonder if I can make pc->mem_cgroup be pc->memid (16bit), then,
 ==
 static inline struct mem_cgroup *get_memcg_from_pc(struct page_cgroup *pc)
 {
 	struct cgroup_subsys_state *css = css_lookup(&mem_cgroup_subsys, pc->memid);
 	return container_of(css, struct mem_cgroup, css);
 }
 ==
 ==
 Overhead will be seen when updating file statistics and in LRU management.
 
 But, hmm, can't you do that tracking without page_cgroup?
 Because the number of dirty/writeback pages is far smaller than the total
 number of pages, chasing I/O with a dynamic structure is not very bad..
 
 Is preparing a [pfn -> blkio] record table and moving that information to
 struct bio in a dynamic way very difficult?

This would be ok for dirty pages, but consider that we're also tracking
anonymous pages. So, if we want to control the swap IO we actually need
to save this information for a lot of pages and at the end I think we'll
basically duplicate the page_cgroup code.

Thanks,
-Andrea


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Andrea Righi
On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
   Agreed. Granularity at the per-inode level might be acceptable in many
   cases. Again, I am worried about a faster group getting stuck behind a
   slower group.
   
   I am wondering if we are trying to solve the problem of ASYNC write
   throttling at the wrong layer. Should ASYNC IO be throttled before we
   allow the task to write to the page cache? The way we throttle the process
   based on the dirty ratio, can we just check for throttle limits there too,
   or something like that? (I think that's what you had done in your initial
   throttling controller implementation?)
  
  Right. This is exactly the same approach I've used in my old throttling
  controller: throttle sync READs and WRITEs at the block layer and async
  WRITEs when the task is dirtying memory pages.
  
  This is probably the simplest way to resolve the problem of a faster group
  getting blocked by a slower group, but the controller will be a little bit
  more leaky, because the writeback IO will never be throttled and we'll
  see some limited IO spikes during writeback.
 
 Yes writeback will not be throttled. Not sure how big a problem that is.
 
 - We have controlled the input rate. So that should help a bit.
 - Maybe one can put some high limit on the root cgroup in the blkio throttle
   controller to limit the overall WRITE rate of the system.
 - For SATA disks, try to use CFQ which can try to minimize the impact of
   WRITE.
 
 It will at least provide a consistent bandwidth experience to the application.

Right.

 
 However, this is still
  a better solution IMHO compared to the current implementation, which is
  affected by that kind of priority inversion problem.
  
  I can try to add this logic to the current blk-throttle controller if
  you think it is worth testing.
 
 At this point of time I have a few concerns with this approach.
 
 - Configuration issues. Asking the user to plan for SYNC and ASYNC IO
   separately is inconvenient. One has to know the nature of the workload.
 
 - Most likely we will come up with global limits (at least to begin with),
   and not a per-device limit. That can lead to contention on one single
   lock and scalability issues on big systems.
 
 Having said that, this approach should reduce the kernel complexity a lot.
 So if we can do some intelligent locking to limit the overhead then it
 will boil down to reduced complexity in the kernel vs ease of use for the
 user. I guess at this point of time I am inclined towards keeping it simple
 in the kernel.
 

BTW, with this approach probably we can even get rid of the page
tracking stuff for now. If we don't consider the swap IO, any other IO
operation from our point of view will happen directly from process
context (writes in memory + sync reads from the block device).

However, I'm sure we'll need the page tracking for the blkio controller
sooner or later. This is important information, and the proportional
bandwidth controller can also take advantage of it.

 
 A couple of people have told me that they have backup jobs running at night
 and want to reduce the IO bandwidth of these jobs to limit the impact
 on the latency of other jobs; I guess this approach will definitely solve
 that issue.
 
 IMHO, it might be worth trying this approach and seeing how well it works. It
 might not solve all the problems but can be helpful in many situations.

Agreed. This could be a good tradeoff for a lot of common cases.

 
 I feel that for proportional bandwidth division, implementing ASYNC
 control in CFQ will make sense because even if things get serialized in
 higher layers, the consequences are not very bad as it is a work-conserving
 algorithm. But for throttling, serialization will lead to bad consequences.

Agreed.

 
 Maybe one can think of new files in the blkio controller to limit async IO
 per group at page-dirtying time.
 
 blkio.throttle.async.write_bps_limit
 blkio.throttle.async.write_iops_limit

OK, I'll try to add the async throttling logic and use this interface.

-Andrea


[Devel] [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-22 Thread Andrea Righi
Currently the blkio.throttle controller only supports synchronous IO requests.
This means that we always look at the current task to identify the owner of
each IO request.

However dirty pages in the page cache can be written to disk asynchronously by
the per-bdi flusher kernel threads or by any other thread in the system,
according to the writeback policy.

For this reason the real writes to the underlying block devices may
occur in a different IO context with respect to the task that originally
generated the dirty pages involved in the IO operation. This makes the
tracking and throttling of writeback IO more complicated than that of
synchronous IO from the blkio controller's perspective.

The same concept is also valid for anonymous pages involved in IO operations
(swap).

This patch set allows tracking the cgroup that originally dirtied each page in
the page cache and each anonymous page, and passes this information to the
blk-throttle controller. This information can be used to provide better
service level differentiation of buffered write and swap IO between different
cgroups.

Testcase

- create a cgroup with 1MiB/s write limit:
  # mount -t cgroup -o blkio none /mnt/cgroup
  # mkdir /mnt/cgroup/foo
  # echo 8:0 $((1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device

- move a task into the cgroup and run a dd to generate some writeback IO

Results:
  - 2.6.38-rc6 vanilla:
  $ cat /proc/$$/cgroup
  1:blkio:/foo
  $ dd if=/dev/zero of=zero bs=1M count=1024 &
  $ dstat -df
  --dsk/sda--
   read  writ
     0    19M
     0    19M
     0     0
     0     0
     0    19M
  ...

  - 2.6.38-rc6 + blk-throttle writeback IO control:
  $ cat /proc/$$/cgroup
  1:blkio:/foo
  $ dd if=/dev/zero of=zero bs=1M count=1024 &
  $ dstat -df
  --dsk/sda--
   read  writ
     0  1024
     0  1024
     0  1024
     0  1024
     0  1024
  ...

TODO

 - lots of testing

Any feedback is welcome.
-Andrea

[PATCH 1/5] blk-cgroup: move blk-cgroup.h in include/linux/blk-cgroup.h
[PATCH 2/5] blk-cgroup: introduce task_to_blkio_cgroup()
[PATCH 3/5] page_cgroup: make page tracking available for blkio
[PATCH 4/5] blk-throttle: track buffered and anonymous pages
[PATCH 5/5] blk-throttle: buffered and anonymous page tracking instrumentation

 block/Kconfig   |2 +
 block/blk-cgroup.c  |   15 ++-
 block/blk-cgroup.h  |  335 --
 block/blk-throttle.c|   89 +++-
 block/cfq.h |2 +-
 fs/buffer.c |1 +
 include/linux/blk-cgroup.h  |  341 +++
 include/linux/blkdev.h  |   26 +++-
 include/linux/memcontrol.h  |6 +
 include/linux/mmzone.h  |4 +-
 include/linux/page_cgroup.h |   33 -
 init/Kconfig|4 +
 mm/Makefile |3 +-
 mm/bounce.c |1 +
 mm/filemap.c|1 +
 mm/memcontrol.c |6 +
 mm/memory.c |5 +
 mm/page-writeback.c |1 +
 mm/page_cgroup.c|  129 +++--
 mm/swap_state.c |2 +
 20 files changed, 649 insertions(+), 357 deletions(-)


[Devel] [PATCH 1/5] blk-cgroup: move blk-cgroup.h in include/linux/blk-cgroup.h

2011-02-22 Thread Andrea Righi
Move blk-cgroup.h to include/linux for generic usage.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-cgroup.c |2 +-
 block/blk-cgroup.h |  335 ---
 block/blk-throttle.c   |2 +-
 block/cfq.h|2 +-
 include/linux/blk-cgroup.h |  337 
 5 files changed, 340 insertions(+), 338 deletions(-)
 delete mode 100644 block/blk-cgroup.h
 create mode 100644 include/linux/blk-cgroup.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..bf9d354 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -17,7 +17,7 @@
 #include linux/err.h
 #include linux/blkdev.h
 #include linux/slab.h
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
 #include linux/genhd.h
 
 #define MAX_KEY_LEN 100
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
deleted file mode 100644
index ea4861b..000
--- a/block/blk-cgroup.h
+++ /dev/null
@@ -1,335 +0,0 @@
-#ifndef _BLK_CGROUP_H
-#define _BLK_CGROUP_H
-/*
- * Common Block IO controller cgroup interface
- *
- * Based on ideas and code from CFQ, CFS and BFQ:
- * Copyright (C) 2003 Jens Axboe ax...@kernel.dk
- *
- * Copyright (C) 2008 Fabio Checconi fa...@gandalf.sssup.it
- *   Paolo Valente paolo.vale...@unimore.it
- *
- * Copyright (C) 2009 Vivek Goyal vgo...@redhat.com
- *   Nauman Rafique nau...@google.com
- */
-
-#include <linux/cgroup.h>
-
-enum blkio_policy_id {
-   BLKIO_POLICY_PROP = 0,  /* Proportional Bandwidth division */
-   BLKIO_POLICY_THROTL,/* Throttling */
-};
-
-/* Max limits for throttle policy */
-#define THROTL_IOPS_MAX	UINT_MAX
-
-#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
-
-#ifndef CONFIG_BLK_CGROUP
-/* When blk-cgroup is a module, its subsys_id isn't a compile-time constant */
-extern struct cgroup_subsys blkio_subsys;
-#define blkio_subsys_id blkio_subsys.subsys_id
-#endif
-
-enum stat_type {
-   /* Total time spent (in ns) between request dispatch to the driver and
-* request completion for IOs done by this cgroup. This may not be
-* accurate when NCQ is turned on. */
-   BLKIO_STAT_SERVICE_TIME = 0,
-   /* Total bytes transferred */
-   BLKIO_STAT_SERVICE_BYTES,
-   /* Total IOs serviced, post merge */
-   BLKIO_STAT_SERVICED,
-   /* Total time spent waiting in scheduler queue in ns */
-   BLKIO_STAT_WAIT_TIME,
-   /* Number of IOs merged */
-   BLKIO_STAT_MERGED,
-   /* Number of IOs queued up */
-   BLKIO_STAT_QUEUED,
-   /* All the single valued stats go below this */
-   BLKIO_STAT_TIME,
-   BLKIO_STAT_SECTORS,
-#ifdef CONFIG_DEBUG_BLK_CGROUP
-   BLKIO_STAT_AVG_QUEUE_SIZE,
-   BLKIO_STAT_IDLE_TIME,
-   BLKIO_STAT_EMPTY_TIME,
-   BLKIO_STAT_GROUP_WAIT_TIME,
-   BLKIO_STAT_DEQUEUE
-#endif
-};
-
-enum stat_sub_type {
-   BLKIO_STAT_READ = 0,
-   BLKIO_STAT_WRITE,
-   BLKIO_STAT_SYNC,
-   BLKIO_STAT_ASYNC,
-   BLKIO_STAT_TOTAL
-};
-
-/* blkg state flags */
-enum blkg_state_flags {
-   BLKG_waiting = 0,
-   BLKG_idling,
-   BLKG_empty,
-};
-
-/* cgroup files owned by proportional weight policy */
-enum blkcg_file_name_prop {
-   BLKIO_PROP_weight = 1,
-   BLKIO_PROP_weight_device,
-   BLKIO_PROP_io_service_bytes,
-   BLKIO_PROP_io_serviced,
-   BLKIO_PROP_time,
-   BLKIO_PROP_sectors,
-   BLKIO_PROP_io_service_time,
-   BLKIO_PROP_io_wait_time,
-   BLKIO_PROP_io_merged,
-   BLKIO_PROP_io_queued,
-   BLKIO_PROP_avg_queue_size,
-   BLKIO_PROP_group_wait_time,
-   BLKIO_PROP_idle_time,
-   BLKIO_PROP_empty_time,
-   BLKIO_PROP_dequeue,
-};
-
-/* cgroup files owned by throttle policy */
-enum blkcg_file_name_throtl {
-   BLKIO_THROTL_read_bps_device,
-   BLKIO_THROTL_write_bps_device,
-   BLKIO_THROTL_read_iops_device,
-   BLKIO_THROTL_write_iops_device,
-   BLKIO_THROTL_io_service_bytes,
-   BLKIO_THROTL_io_serviced,
-};
-
-struct blkio_cgroup {
-   struct cgroup_subsys_state css;
-   unsigned int weight;
-   spinlock_t lock;
-   struct hlist_head blkg_list;
-   struct list_head policy_list; /* list of blkio_policy_node */
-};
-
-struct blkio_group_stats {
-   /* total disk time and nr sectors dispatched by this group */
-   uint64_t time;
-   uint64_t sectors;
-   uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
-#ifdef CONFIG_DEBUG_BLK_CGROUP
-   /* Sum of number of IOs queued across all samples */
-   uint64_t avg_queue_size_sum;
-   /* Count of samples taken for average */
-   uint64_t avg_queue_size_samples;
-   /* How many times this group has been removed from service tree */
-   unsigned long dequeue;
-
-   /* Total time spent waiting for it to be assigned a timeslice. */
-   uint64_t group_wait_time

[Devel] [PATCH 2/5] blk-cgroup: introduce task_to_blkio_cgroup()

2011-02-22 Thread Andrea Righi
Introduce a helper function to retrieve a blkio cgroup from a task.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-cgroup.c |7 +++
 include/linux/blk-cgroup.h |4 
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bf9d354..f283ae1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -107,6 +107,13 @@ blkio_policy_search_node(const struct blkio_cgroup *blkcg, 
dev_t dev,
return NULL;
 }
 
+struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *task)
+{
+   return container_of(task_subsys_state(task, blkio_subsys_id),
+   struct blkio_cgroup, css);
+}
+EXPORT_SYMBOL_GPL(task_to_blkio_cgroup);
+
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 {
return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 5e48204..41b59db 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -287,6 +287,7 @@ static inline void blkiocg_set_start_empty_time(struct 
blkio_group *blkg) {}
 extern struct blkio_cgroup blkio_root_cgroup;
 extern bool blkio_cgroup_disabled(void);
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+extern struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *task);
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
enum blkio_policy_id plid);
@@ -311,6 +312,9 @@ static inline bool blkio_cgroup_disabled(void) { return 
true; }
 static inline struct blkio_cgroup *
 cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
 
+static inline struct blkio_cgroup *
+task_to_blkio_cgroup(struct task_struct *task) { return NULL; }
+
 static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev,
enum blkio_policy_id plid) {}
-- 
1.7.1



[Devel] [PATCH 4/5] blk-throttle: track buffered and anonymous pages

2011-02-22 Thread Andrea Righi
Add the tracking of buffered (writeback) and anonymous pages.

Dirty pages in the page cache can be processed asynchronously by the
per-bdi flusher kernel threads or by any other thread in the system,
according to the writeback policy.

For this reason the real writes to the underlying block devices may
occur in a different IO context with respect to the task that originally
generated the dirty pages involved in the IO operation. This makes
the tracking and throttling of writeback IO more complicated than that of
synchronous IO from the blkio controller's point of view.

The idea is to save the cgroup owner of each anonymous page and dirty
page in page cache. A page is associated to a cgroup the first time it
is dirtied in memory (for file cache pages) or when it is set as
swap-backed (for anonymous pages). This information is stored using the
page_cgroup functionality.

Then, at the block layer, it is possible to retrieve the throttle group
looking at the bio_page(bio). If the page was not explicitly associated
to any cgroup the IO operation is charged to the current task/cgroup, as
it was done by the previous implementation.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-throttle.c   |   87 +++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -8,6 +8,10 @@
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include <linux/bio.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/page_cgroup.h>
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>
 
@@ -221,6 +225,85 @@ done:
return tg;
 }
 
+static inline bool is_kernel_io(void)
+{
+   return !!(current->flags & (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC));
+}
+
+static int throtl_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+   struct blkio_cgroup *blkcg;
+   unsigned short id = 0;
+
+   if (blkio_cgroup_disabled())
+   return 0;
+   if (!mm)
+   goto out;
+   rcu_read_lock();
+   blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
+   if (likely(blkcg))
+   id = css_id(blkcg->css);
+   rcu_read_unlock();
+out:
+   return page_cgroup_set_owner(page, id);
+}
+
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
+{
+   return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
+
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
+{
+   if (is_kernel_io() || !page_is_file_cache(page))
+   return 0;
+   return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_filepage_owner);
+
+int blk_throtl_copy_page_owner(struct page *npage, struct page *opage)
+{
+   if (blkio_cgroup_disabled())
+   return 0;
+   return page_cgroup_copy_owner(npage, opage);
+}
+EXPORT_SYMBOL(blk_throtl_copy_page_owner);
+
+/*
+ * A helper function to get the throttle group from css id.
+ *
+ * NOTE: must be called under rcu_read_lock().
+ */
+static struct throtl_grp *throtl_tg_lookup(struct throtl_data *td, int id)
+{
+   struct cgroup_subsys_state *css;
+
+   if (!id)
+   return NULL;
+   css = css_lookup(&blkio_subsys, id);
+   if (!css)
+   return NULL;
+   return throtl_find_alloc_tg(td, css->cgroup);
+}
+
+static struct throtl_grp *
+throtl_get_tg_from_page(struct throtl_data *td, struct page *page)
+{
+   struct throtl_grp *tg;
+   int id;
+
+   if (unlikely(!page))
+   return NULL;
+   id = page_cgroup_get_owner(page);
+
+   rcu_read_lock();
+   tg = throtl_tg_lookup(td, id);
+   rcu_read_unlock();
+
+   return tg;
+}
+
 static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 {
struct cgroup *cgroup;
@@ -1000,7 +1083,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
}
 
	spin_lock_irq(q->queue_lock);
-   tg = throtl_get_tg(td);
+   tg = throtl_get_tg_from_page(td, bio_page(bio));
+   if (!tg)
+   tg = throtl_get_tg(td);
 
	if (tg->nr_queued[rw]) {
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..2d03dee 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1136,10 +1136,34 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
 extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
+extern int blk_throtl_set_anonpage_owner(struct page *page,
+   struct mm_struct *mm);
+extern int blk_throtl_set_filepage_owner(struct page *page,
+   struct

[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 10:42:41AM -0800, Chad Talbott wrote:
 On Tue, Feb 22, 2011 at 9:12 AM, Andrea Righi ari...@develer.com wrote:
  Add the tracking of buffered (writeback) and anonymous pages.
 ...
  ---
   block/blk-throttle.c   |   87 +++-
   include/linux/blkdev.h |   26 ++-
   2 files changed, 111 insertions(+), 2 deletions(-)
 
  diff --git a/block/blk-throttle.c b/block/blk-throttle.c
  index 9ad3d1e..a50ee04 100644
  --- a/block/blk-throttle.c
  +++ b/block/blk-throttle.c
 ...
  +int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
  +int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
  +int blk_throtl_copy_page_owner(struct page *npage, struct page *opage)
 
 It would be nice if these were named blk_cgroup_*.  This is arguably
 more correct as the id comes from the blkio subsystem, and isn't
 specific to blk-throttle.  This will be more important very shortly,
 as CFQ will be using this same cgroup id for async IO tracking soon.

Sounds reasonable. Will do in the next version.

 
 is_kernel_io() is a good idea, it avoids a bug that we've run into
 with CFQ async IO tracking.  Why isn't PF_KTHREAD sufficient to cover
 all kernel threads, including kswapd and those marked PF_MEMALLOC?

With PF_MEMALLOC we're sure we don't also add the page tracking overhead
to non-kernel threads when memory gets low.

PF_KSWAPD is probably not needed; AFAICS it is only used by kswapd, which
is created by kthread_create() and so has the PF_KTHREAD flag set.

Let's see if someone can give more details about that. In the meanwhile I'll
investigate and try to do some tests with only PF_KTHREAD.

Thanks,
-Andrea


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 02:34:03PM -0500, Vivek Goyal wrote:
 On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
   Currently the blkio.throttle controller only supports synchronous IO
   requests. This means that we always look at the current task to identify
   the owner of each IO request.
   
   However dirty pages in the page cache can be written to disk asynchronously
   by the per-bdi flusher kernel threads or by any other thread in the system,
   according to the writeback policy.
   
   For this reason the real writes to the underlying block devices may
   occur in a different IO context with respect to the task that originally
   generated the dirty pages involved in the IO operation. This makes the
   tracking and throttling of writeback IO more complicated than that of
   synchronous IO from the blkio controller's perspective.
   
   The same concept is also valid for anonymous pages involved in IO
   operations (swap).
   
   This patch set allows tracking the cgroup that originally dirtied each page
   in the page cache and each anonymous page, and passes this information to
   the blk-throttle controller. This information can be used to provide better
   service level differentiation of buffered write and swap IO between
   different cgroups.
  
 
 Hi Andrea,
 
 Thanks for the patches. Before I look deeper into the patches, I had a few
 general queries/thoughts.
 
 - So this requires the memory controller to be enabled. Does it also require
   these to be co-mounted?

No and no. The blkio controller enables and uses the page_cgroup
functionality, but it doesn't depend on the memory controller. It
automatically selects CONFIG_MM_OWNER and CONFIG_PAGE_TRACKING (last
one added in PATCH 3/5) and this is sufficient to make page_cgroup
usable from any generic controller.

 
 - Currently in throttling there is no limit on the number of bios queued
   per group. I think this is not necessarily a very good idea because
   if throttling limits are low, we will build very long bio queues. So
   some AIO process can queue up lots of bios and consume lots of memory
   without getting blocked. I am sure there will be other side effects
   too. One of the side effects I noticed is that if an AIO process
   queues up too much IO, and if I want to kill it now, it just hangs
   there for a really, really long time (waiting for all the throttled IO
   to complete).
 
   So I was thinking of implementing either a per-group limit or a
   per-io-context limit, and after that the process will be put to sleep
   (something like the request descriptor mechanism).

A per-io-context limit seems a better solution for now. We can also expect
some help from the memory controller: if we have a per-cgroup dirty memory
limit in the future, the max amount of bios queued will be automatically
limited by this functionality.
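
(Roughly, I'm thinking of something like the sketch below: a per-context
counter of queued bios, with submitters put to sleep once a limit is hit and
woken at completion time, much like the request descriptor mechanism. A
user-space analogy with invented names, only to illustrate the idea:)

#include <pthread.h>

/* toy per-io-context limit on throttled-and-queued bios */
struct bio_budget {
        pthread_mutex_t lock;
        pthread_cond_t  wait;
        unsigned int    nr_queued;
        unsigned int    limit;
};

static struct bio_budget budget = {
        .lock  = PTHREAD_MUTEX_INITIALIZER,
        .wait  = PTHREAD_COND_INITIALIZER,
        .limit = 128,                   /* max bios queued per context */
};

/* submitter side: sleep once the context has too many bios queued */
static void bio_budget_get(struct bio_budget *b)
{
        pthread_mutex_lock(&b->lock);
        while (b->nr_queued >= b->limit)
                pthread_cond_wait(&b->wait, &b->lock);
        b->nr_queued++;
        pthread_mutex_unlock(&b->lock);
}

/* completion side: release one slot and wake up a sleeping submitter */
static void bio_budget_put(struct bio_budget *b)
{
        pthread_mutex_lock(&b->lock);
        b->nr_queued--;
        pthread_cond_signal(&b->wait);
        pthread_mutex_unlock(&b->lock);
}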

 
   If that's the case, then comes the question of what to do about kernel
   threads. Should they be blocked or not? If these are blocked then a
   fast group will also be indirectly throttled behind a slow group. If
   they are not, then we still have the problem of too many bios queued
   in the throttling layer.

I think kernel threads should never be forced to sleep, to avoid the
classic priority inversion problem and the potential for DoS in the
system.

Also for this part the per-cgroup dirty memory limit could help a lot,
because a cgroup will never exceed its quota of dirty memory, so it
will not be able to submit more than a certain amount of bios
(corresponding to the dirty memory limit).

 
 - What to do about other kernel threads, like kjournald, which are doing
   IO on behalf of all the filesystem users. If data is also journalled
   then I think again everything gets serialized and a faster group gets
   backlogged behind a slower one.

This is the most critical issue IMHO.

The blkio controller would need some help from the filesystems to
understand which IO requests can be throttled and which cannot. At the
moment critical IO requests (by critical I mean requests that other
requests depend on) and non-critical requests are mixed together in a
way that throttling a single request may stop a lot of other requests in
the system, and at the block layer it's not possible to retrieve this
information.

I don't have a solution for this right now, except looking at each
filesystem implementation and trying to understand how to pass this
information to the block layer.

 
 - Two processes doing IO to the same file: the slower group will throttle
   IO for the faster group also (flushing is per inode).
 

I think we should accept inode granularity. We could redesign
the writeback code to work per-cgroup / per-page, etc., but that would
add a huge overhead. The limit of inode granularity could be an
acceptable tradeoff; cgroups usually work on different files,
well.. except when databases come into play (ouch!).

Thanks,
-Andrea

[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote:
 On Tue, 22 Feb 2011 18:12:54 +0100
 Andrea Righi ari...@develer.com wrote:
 
  The page_cgroup infrastructure, currently available only for the memory
  cgroup controller, can be used to store the owner of each page and
  opportunely track the writeback IO. This information is encoded in
  the upper 16-bits of the page_cgroup->flags.
  
  An owner can be identified using a generic ID number and the following
  interfaces are provided to store and retrieve this information:
  
unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
int page_cgroup_copy_owner(struct page *npage, struct page *opage);
 
 My immediate observation is that you're not really tracking the owner
 here - you're tracking an opaque 16-bit token known only to the block
 controller in a field which - if changed by anybody other than the block
 controller - will lead to mayhem in the block controller.  I think it
 might be clearer - and safer - to say blkcg or some such instead of
 owner here.
 

Basically the idea here was to be as generic as possible and make this
feature potentially available also to other subsystems, so that cgroup
subsystems may represent whatever they want with the 16-bit token.
However, no more than a single subsystem may be able to use this feature
at the same time.

 I'm tempted to say it might be better to just add a pointer to your
 throtl_grp structure into struct page_cgroup.  Or maybe replace the
 mem_cgroup pointer with a single pointer to struct css_set.  Both of
 those ideas, though, probably just add unwanted extra overhead now to gain
 generality which may or may not be wanted in the future.

The pointer to css_set sounds good, but it would add additional space to
the page_cgroup struct. Now, page_cgroup is 40 bytes (in 64-bit arch)
and all of them are allocated at boot time. Using unused bits in
page_cgroup->flags is a choice with no overhead from this point of view.
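
(To make the encoding concrete: the set/get helpers essentially pack the
16-bit ID into the top of the flags word, along the lines of this simplified
stand-alone sketch; the struct and names here are illustrative, not the real
page_cgroup layout, and the real code has to use atomic bit operations:)

#include <stdio.h>

#define OWNER_ID_SHIFT  (sizeof(unsigned long) * 8 - 16)

struct fake_page_cgroup {
        unsigned long flags;    /* low bits: flags, top 16 bits: owner ID */
};

static void set_owner_id(struct fake_page_cgroup *pc, unsigned short id)
{
        pc->flags &= (1UL << OWNER_ID_SHIFT) - 1;       /* clear the old ID */
        pc->flags |= (unsigned long)id << OWNER_ID_SHIFT;
}

static unsigned short get_owner_id(const struct fake_page_cgroup *pc)
{
        return pc->flags >> OWNER_ID_SHIFT;
}

int main(void)
{
        struct fake_page_cgroup pc = { .flags = 0x5 }; /* some flag bits set */

        set_owner_id(&pc, 42);
        printf("id=%u, low flags=%#lx\n", get_owner_id(&pc),
               pc.flags & ((1UL << OWNER_ID_SHIFT) - 1));
        return 0;
}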

Thanks,
-Andrea


[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 03:49:28PM -0500, Vivek Goyal wrote:
 On Tue, Feb 22, 2011 at 10:42:41AM -0800, Chad Talbott wrote:
  On Tue, Feb 22, 2011 at 9:12 AM, Andrea Righi ari...@develer.com wrote:
   Add the tracking of buffered (writeback) and anonymous pages.
  ...
   ---
    block/blk-throttle.c   |   87 +++-
    include/linux/blkdev.h |   26 ++-
    2 files changed, 111 insertions(+), 2 deletions(-)
  
   diff --git a/block/blk-throttle.c b/block/blk-throttle.c
   index 9ad3d1e..a50ee04 100644
   --- a/block/blk-throttle.c
   +++ b/block/blk-throttle.c
  ...
   +int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
   +int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
   +int blk_throtl_copy_page_owner(struct page *npage, struct page *opage)
  
  It would be nice if these were named blk_cgroup_*.  This is arguably
  more correct as the id comes from the blkio subsystem, and isn't
  specific to blk-throttle.  This will be more important very shortly,
  as CFQ will be using this same cgroup id for async IO tracking soon.
 
 Should this really be all part of blk-cgroup.c and not blk-throttle.c
 so that it can be used by CFQ code also down the line? Anyway all this
 is not throttle specific as such but blkio controller specific.

Agreed.

 
 Though the function naming convention is not great in blk-cgroup.c, functions
 either have a blkio_ prefix or a blkiocg_ prefix.

ok.

 
 Functions which are not directly dealing with cgroups, or in general
 are called by blk-throttle.c and/or cfq-iosched.c, I have marked with
 the blkio_ prefix. Functions which directly deal with cgroup stuff
 and register with the cgroup subsystem for this controller generally
 have the blkiocg_ prefix.
 
 In this case we can probably use the blkio_ prefix.

ok.

Thanks,
-Andrea


[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 04:22:53PM -0500, Vivek Goyal wrote:
 On Tue, Feb 22, 2011 at 06:12:54PM +0100, Andrea Righi wrote:
  The page_cgroup infrastructure, currently available only for the memory
  cgroup controller, can be used to store the owner of each page and
  opportunely track the writeback IO. This information is encoded in
   the upper 16-bits of the page_cgroup->flags.
   
   An owner can be identified using a generic ID number and the following
   interfaces are provided to store and retrieve this information:
  
unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
int page_cgroup_copy_owner(struct page *npage, struct page *opage);
  
  The blkio.throttle controller can use the cgroup css_id() as the owner's
  ID number.
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   block/Kconfig   |2 +
   block/blk-cgroup.c  |6 ++
   include/linux/memcontrol.h  |6 ++
   include/linux/mmzone.h  |4 +-
   include/linux/page_cgroup.h |   33 ++-
   init/Kconfig|4 +
   mm/Makefile |3 +-
   mm/memcontrol.c |6 ++
    mm/page_cgroup.c|  129 +++
   9 files changed, 176 insertions(+), 17 deletions(-)
  
  diff --git a/block/Kconfig b/block/Kconfig
  index 60be1e0..1351ea8 100644
  --- a/block/Kconfig
  +++ b/block/Kconfig
  @@ -80,6 +80,8 @@ config BLK_DEV_INTEGRITY
   config BLK_DEV_THROTTLING
   bool "Block layer bio throttling support"
   depends on BLK_CGROUP=y && EXPERIMENTAL
  +   select MM_OWNER
  +   select PAGE_TRACKING
  default n
  ---help---
  Block layer bio throttling support. It can be used to limit
  diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
  index f283ae1..5c57f0a 100644
  --- a/block/blk-cgroup.c
  +++ b/block/blk-cgroup.c
  @@ -107,6 +107,12 @@ blkio_policy_search_node(const struct blkio_cgroup 
  *blkcg, dev_t dev,
  return NULL;
   }
   
  +bool blkio_cgroup_disabled(void)
  +{
  +   return blkio_subsys.disabled ? true : false;
  +}
  +EXPORT_SYMBOL_GPL(blkio_cgroup_disabled);
  +
 
 I think there should be an option to just disable this async feature of
 the blkio controller. So those who don't want it (running VMs with the
 cache=none option) and don't want to take the memory reservation hit should
 be able to disable just the ASYNC facility of the blkio controller and not
 the whole blkio controller facility.

Definitely a better choice.

OK, I'll apply all your suggestions and post a new version of the patch.

Thanks for the review!
-Andrea


[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 04:00:30PM -0500, Vivek Goyal wrote:
 On Tue, Feb 22, 2011 at 06:12:55PM +0100, Andrea Righi wrote:
  Add the tracking of buffered (writeback) and anonymous pages.
  
  Dirty pages in the page cache can be processed asynchronously by the
  per-bdi flusher kernel threads or by any other thread in the system,
  according to the writeback policy.
  
   For this reason the real writes to the underlying block devices may
   occur in a different IO context with respect to the task that originally
   generated the dirty pages involved in the IO operation. This makes
   the tracking and throttling of writeback IO more complicated than that of
   synchronous IO from the blkio controller's point of view.
  
  The idea is to save the cgroup owner of each anonymous page and dirty
  page in page cache. A page is associated to a cgroup the first time it
  is dirtied in memory (for file cache pages) or when it is set as
  swap-backed (for anonymous pages). This information is stored using the
  page_cgroup functionality.
  
  Then, at the block layer, it is possible to retrieve the throttle group
  looking at the bio_page(bio). If the page was not explicitly associated
  to any cgroup the IO operation is charged to the current task/cgroup, as
  it was done by the previous implementation.
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
  block/blk-throttle.c   |   87 +++-
   include/linux/blkdev.h |   26 ++-
   2 files changed, 111 insertions(+), 2 deletions(-)
  
  diff --git a/block/blk-throttle.c b/block/blk-throttle.c
  index 9ad3d1e..a50ee04 100644
  --- a/block/blk-throttle.c
  +++ b/block/blk-throttle.c
  @@ -8,6 +8,10 @@
    #include <linux/slab.h>
    #include <linux/blkdev.h>
    #include <linux/bio.h>
   +#include <linux/memcontrol.h>
   +#include <linux/mm_inline.h>
   +#include <linux/pagemap.h>
   +#include <linux/page_cgroup.h>
    #include <linux/blktrace_api.h>
    #include <linux/blk-cgroup.h>
   
  @@ -221,6 +225,85 @@ done:
  return tg;
   }
   
  +static inline bool is_kernel_io(void)
  +{
   +   return !!(current->flags & (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC));
  +}
  +
  +static int throtl_set_page_owner(struct page *page, struct mm_struct *mm)
  +{
  +   struct blkio_cgroup *blkcg;
  +   unsigned short id = 0;
  +
  +   if (blkio_cgroup_disabled())
  +   return 0;
  +   if (!mm)
  +   goto out;
  +   rcu_read_lock();
   +   blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
   +   if (likely(blkcg))
   +   id = css_id(blkcg->css);
  +   rcu_read_unlock();
  +out:
  +   return page_cgroup_set_owner(page, id);
  +}
  +
  +int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
  +{
  +   return throtl_set_page_owner(page, mm);
  +}
  +EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
  +
  +int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
  +{
  +   if (is_kernel_io() || !page_is_file_cache(page))
  +   return 0;
  +   return throtl_set_page_owner(page, mm);
  +}
  +EXPORT_SYMBOL(blk_throtl_set_filepage_owner);
 
 Why are we exporting all these symbols?

Right. Probably a single one is enough:

 int blk_throtl_set_page_owner(struct page *page,
struct mm_struct *mm, bool anon);

-Andrea


[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 04:27:29PM -0700, Jonathan Corbet wrote:
 On Wed, 23 Feb 2011 00:01:47 +0100
 Andrea Righi ari...@develer.com wrote:
 
   My immediate observation is that you're not really tracking the owner
   here - you're tracking an opaque 16-bit token known only to the block
   controller in a field which - if changed by anybody other than the block
   controller - will lead to mayhem in the block controller.  I think it
   might be clearer - and safer - to say blkcg or some such instead of
   owner here.
  
  Basically the idea here was to be as generic as possible and make this
  feature potentially available also to other subsystems, so that cgroup
  subsystems may represent whatever they want with the 16-bit token.
  However, no more than a single subsystem may be able to use this feature
  at the same time.
 
 That makes me nervous; it can't really be used that way unless we want to
 say that certain controllers are fundamentally incompatible and can't be
 allowed to play together.  For whatever my $0.02 are worth (given the
 state of the US dollar, that's not a whole lot), I'd suggest keeping the
 current mechanism, but make it clear that it belongs to your controller.
 If and when another controller comes along with a need for similar
 functionality, somebody can worry about making it more general.

OK, I understand. I'll use blkio instead of owner, also because I
wouldn't like to introduce additional logic and overhead to check whether two
controllers are using this feature at the same time. Better to hard-code
this information in the names of the functions.

Probably the most generic solution is the one that you suggested:
replace the mem_cgroup pointer with a pointer to css_set. I'll also try to
investigate this approach.

Thanks,
-Andrea


[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 06:06:30PM -0500, Vivek Goyal wrote:
 On Wed, Feb 23, 2011 at 12:01:47AM +0100, Andrea Righi wrote:
  On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote:
   On Tue, 22 Feb 2011 18:12:54 +0100
   Andrea Righi ari...@develer.com wrote:
   
The page_cgroup infrastructure, currently available only for the memory
cgroup controller, can be used to store the owner of each page and
opportunely track the writeback IO. This information is encoded in
the upper 16-bits of the page_cgroup->flags.

An owner can be identified using a generic ID number and the following
interfaces are provided to store and retrieve this information:

  unsigned long page_cgroup_get_owner(struct page *page);
  int page_cgroup_set_owner(struct page *page, unsigned long id);
  int page_cgroup_copy_owner(struct page *npage, struct page *opage);
   
   My immediate observation is that you're not really tracking the owner
   here - you're tracking an opaque 16-bit token known only to the block
   controller in a field which - if changed by anybody other than the block
   controller - will lead to mayhem in the block controller.  I think it
   might be clearer - and safer - to say blkcg or some such instead of
   owner here.
   
  
  Basically the idea here was to be as generic as possible and make this
  feature potentially available also to other subsystems, so that cgroup
  subsystems may represent whatever they want with the 16-bit token.
  However, no more than a single subsystem may be able to use this feature
  at the same time.
  
   I'm tempted to say it might be better to just add a pointer to your
   throtl_grp structure into struct page_cgroup.  Or maybe replace the
   mem_cgroup pointer with a single pointer to struct css_set.  Both of
   those ideas, though, probably just add unwanted extra overhead now to gain
   generality which may or may not be wanted in the future.
  
  The pointer to css_set sounds good, but it would add additional space to
  the page_cgroup struct. Now, page_cgroup is 40 bytes (in 64-bit arch)
  and all of them are allocated at boot time. Using unused bits in
  page_cgroup->flags is a choice with no overhead from this point of view.
 
 I think John suggested replacing the mem_cgroup pointer with css_set so that
 the size of the structure does not increase, but it leads to an extra level
 of indirection.

OK, got it sorry.

So, IIUC we save css_set pointer and get a struct cgroup as following:

  struct cgroup *cgrp = css_set->subsys[subsys_id]->cgroup;

Then, for example to get the mem_cgroup reference:

  struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

It seems a lot of indirections, but I may have done something wrong or
there could be a simpler way to do it.
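
(Wrapping the two steps above, I guess the lookup would end up as a small
helper along these lines, assuming page_cgroup grows a css_set pointer
(here pc->cset, a hypothetical field); purely illustrative:)

static struct mem_cgroup *page_cgroup_to_memcg(struct page_cgroup *pc)
{
	/* css_set -> subsystem state -> cgroup -> mem_cgroup */
	struct cgroup *cgrp = pc->cset->subsys[mem_cgroup_subsys_id]->cgroup;

	return mem_cgroup_from_cont(cgrp);
}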

Thanks,
-Andrea


[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-06 Thread Andrea Righi
On Wed, Oct 06, 2010 at 11:34:16AM -0700, Greg Thelen wrote:
 Andrea Righi ari...@develer.com writes:
 
  On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote:
  KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com writes:
  
   On Sun,  3 Oct 2010 23:58:03 -0700
   Greg Thelen gthe...@google.com wrote:
  
   Add cgroupfs interface to memcg dirty page limits:
 Direct write-out is controlled with:
 - memory.dirty_ratio
 - memory.dirty_bytes
   
 Background write-out is controlled with:
 - memory.dirty_background_ratio
 - memory.dirty_background_bytes
   
   Signed-off-by: Andrea Righi ari...@develer.com
   Signed-off-by: Greg Thelen gthe...@google.com
  
   Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  
   a question below.
  
  
   ---
    mm/memcontrol.c |   89 +++
1 files changed, 89 insertions(+), 0 deletions(-)
   
   diff --git a/mm/memcontrol.c b/mm/memcontrol.c
   index 6ec2625..2d45a0a 100644
   --- a/mm/memcontrol.c
   +++ b/mm/memcontrol.c
   @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
   MEM_CGROUP_STAT_NSTATS,
};

   +enum {
   +   MEM_CGROUP_DIRTY_RATIO,
   +   MEM_CGROUP_DIRTY_BYTES,
   +   MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
   +   MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
   +};
   +
struct mem_cgroup_stat_cpu {
   s64 count[MEM_CGROUP_STAT_NSTATS];
};
   @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct 
   cgroup *cgrp,
   return 0;
}

    +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
    +{
    +   struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
    +   bool root;
    +
    +   root = mem_cgroup_is_root(mem);
    +
    +   switch (cft->private) {
    +   case MEM_CGROUP_DIRTY_RATIO:
    +   return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
    +   case MEM_CGROUP_DIRTY_BYTES:
    +   return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
    +   case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
    +   return root ? dirty_background_ratio :
    +   mem->dirty_param.dirty_background_ratio;
    +   case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
    +   return root ? dirty_background_bytes :
    +   mem->dirty_param.dirty_background_bytes;
    +   default:
    +   BUG();
    +   }
    +}
   +
    +static int
    +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
    +{
    +   struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
    +   int type = cft->private;
    +
    +   if (cgrp->parent == NULL)
    +   return -EINVAL;
    +   if ((type == MEM_CGROUP_DIRTY_RATIO ||
    +        type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
    +   return -EINVAL;
    +   switch (type) {
    +   case MEM_CGROUP_DIRTY_RATIO:
    +   memcg->dirty_param.dirty_ratio = val;
    +   memcg->dirty_param.dirty_bytes = 0;
    +   break;
    +   case MEM_CGROUP_DIRTY_BYTES:
    +   memcg->dirty_param.dirty_bytes = val;
    +   memcg->dirty_param.dirty_ratio = 0;
    +   break;
    +   case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
    +   memcg->dirty_param.dirty_background_ratio = val;
    +   memcg->dirty_param.dirty_background_bytes = 0;
    +   break;
    +   case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
    +   memcg->dirty_param.dirty_background_bytes = val;
    +   memcg->dirty_param.dirty_background_ratio = 0;
    +   break;
  
  
   Curious... is this same behavior as vm_dirty_ratio ?
  
  I think this is same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
  changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
  vm_dirty_bytes is written dirty_bytes_handler() will set
  vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
  mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
  global dirty parameters.
  
  Am I missing your question?
 
  mmh... looking at the code it seems the same behaviour, but in
  Documentation/sysctl/vm.txt we say a different thing (i.e., for
  dirty_bytes):
 
  If dirty_bytes is written, dirty_ratio becomes a function of its value
  (dirty_bytes / the amount of dirtyable system memory).
 
  However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
  the counterpart value as 0.
 
  I think we should clarify the documentation.
 
  Signed-off-by: Andrea Righi ari...@develer.com
 
 Reviewed-by: Greg Thelen gthe...@google.com
 
 This documentation change is general cleanup that is independent of the
 memcg patch series shown on the subject.

Thanks Greg. I'll resend it as an independent patch.
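
Just to make the documented behaviour concrete, this is essentially what
happens (stripped-down sketch, not the real sysctl handlers):

  /* sketch: writing one knob clears its counterpart instead of recomputing it */
  static int demo_dirty_ratio = 20;
  static unsigned long demo_dirty_bytes;

  static void demo_write_dirty_ratio(int val)
  {
          demo_dirty_ratio = val;
          demo_dirty_bytes = 0;           /* dirty_bytes now reads back as 0 */
  }

  static void demo_write_dirty_bytes(unsigned long val)
  {
          demo_dirty_bytes = val;
          demo_dirty_ratio = 0;           /* dirty_ratio now reads back as 0 */
  }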

-Andrea

[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-05 Thread Andrea Righi
On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote:
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com writes:
 
  On Sun,  3 Oct 2010 23:58:03 -0700
  Greg Thelen gthe...@google.com wrote:
 
  Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_bytes
  
Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_bytes
  
  Signed-off-by: Andrea Righi ari...@develer.com
  Signed-off-by: Greg Thelen gthe...@google.com
 
  Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
  a question below.
 
 
  ---
   mm/memcontrol.c |   89 
  +++
   1 files changed, 89 insertions(+), 0 deletions(-)
  
  diff --git a/mm/memcontrol.c b/mm/memcontrol.c
  index 6ec2625..2d45a0a 100644
  --- a/mm/memcontrol.c
  +++ b/mm/memcontrol.c
  @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
 MEM_CGROUP_STAT_NSTATS,
   };
   
  +enum {
  +  MEM_CGROUP_DIRTY_RATIO,
  +  MEM_CGROUP_DIRTY_BYTES,
  +  MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
  +  MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
  +};
  +
   struct mem_cgroup_stat_cpu {
 s64 count[MEM_CGROUP_STAT_NSTATS];
   };
  @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct 
  cgroup *cgrp,
 return 0;
   }
   
  +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
  +{
  +  struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
  +  bool root;
  +
  +  root = mem_cgroup_is_root(mem);
  +
  +  switch (cft-private) {
  +  case MEM_CGROUP_DIRTY_RATIO:
  +  return root ? vm_dirty_ratio : mem-dirty_param.dirty_ratio;
  +  case MEM_CGROUP_DIRTY_BYTES:
  +  return root ? vm_dirty_bytes : mem-dirty_param.dirty_bytes;
  +  case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
  +  return root ? dirty_background_ratio :
  +  mem-dirty_param.dirty_background_ratio;
  +  case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
  +  return root ? dirty_background_bytes :
  +  mem-dirty_param.dirty_background_bytes;
  +  default:
  +  BUG();
  +  }
  +}
  +
  +static int
  +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
  +{
  +  struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
  +  int type = cft-private;
  +
  +  if (cgrp-parent == NULL)
  +  return -EINVAL;
  +  if ((type == MEM_CGROUP_DIRTY_RATIO ||
  +   type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)  val  100)
  +  return -EINVAL;
  +  switch (type) {
  +  case MEM_CGROUP_DIRTY_RATIO:
  +  memcg-dirty_param.dirty_ratio = val;
  +  memcg-dirty_param.dirty_bytes = 0;
  +  break;
  +  case MEM_CGROUP_DIRTY_BYTES:
  +  memcg-dirty_param.dirty_bytes = val;
  +  memcg-dirty_param.dirty_ratio  = 0;
  +  break;
  +  case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
  +  memcg-dirty_param.dirty_background_ratio = val;
  +  memcg-dirty_param.dirty_background_bytes = 0;
  +  break;
  +  case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
  +  memcg-dirty_param.dirty_background_bytes = val;
  +  memcg-dirty_param.dirty_background_ratio = 0;
  +  break;
 
 
  Curious... is this same behavior as vm_dirty_ratio ?
 
 I think this is same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
 changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
 vm_dirty_bytes is written dirty_bytes_handler() will set
 vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
 mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
 global dirty parameters.
 
 Am I missing your question?

mmh... looking at the code it seems the same behaviour, but in
Documentation/sysctl/vm.txt we say a different thing (i.e., for
dirty_bytes):

If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory).

However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
the counterpart value as 0.

I think we should clarify the documentation.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/sysctl/vm.txt |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..30289fa 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -80,8 +80,10 @@ dirty_background_bytes
 Contains the amount of dirty memory at which the pdflush background writeback
 daemon will start writeback.
 
-If dirty_background_bytes is written, dirty_background_ratio becomes a function
-of its value (dirty_background_bytes / the amount of dirtyable system memory).
+Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
+one of them may be specified at a time. When one sysctl is written it is
+immediately taken into account to evaluate the dirty memory limits

[Devel] Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup

2010-10-05 Thread Andrea Righi
On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
 Extend mem_cgroup to contain dirty page limits.  Also add routines
 allowing the kernel to query the dirty usage of a memcg.
 
 These interfaces are not used by the kernel yet.  A subsequent commit
 will add kernel calls to utilize these new routines.

A small note below.

 
 Signed-off-by: Greg Thelen gthe...@google.com
 Signed-off-by: Andrea Righi ari...@develer.com
 ---
  include/linux/memcontrol.h |   44 +++
  mm/memcontrol.c|  180 
 +++-
  2 files changed, 223 insertions(+), 1 deletions(-)
 
 diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
 index 6303da1..dc8952d 100644
 --- a/include/linux/memcontrol.h
 +++ b/include/linux/memcontrol.h
 @@ -19,6 +19,7 @@
  
  #ifndef _LINUX_MEMCONTROL_H
  #define _LINUX_MEMCONTROL_H
 +#include linux/writeback.h
  #include linux/cgroup.h
  struct mem_cgroup;
  struct page_cgroup;
 @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
   MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
  };
  
 +/* Cgroup memory statistics items exported to the kernel */
 +enum mem_cgroup_read_page_stat_item {
 + MEMCG_NR_DIRTYABLE_PAGES,
 + MEMCG_NR_RECLAIM_PAGES,
 + MEMCG_NR_WRITEBACK,
 + MEMCG_NR_DIRTY_WRITEBACK_PAGES,
 +};
 +
 +/* Dirty memory parameters */
 +struct vm_dirty_param {
 + int dirty_ratio;
 + int dirty_background_ratio;
 + unsigned long dirty_bytes;
 + unsigned long dirty_background_bytes;
 +};
 +
 +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
 +{
 + param-dirty_ratio = vm_dirty_ratio;
 + param-dirty_bytes = vm_dirty_bytes;
 + param-dirty_background_ratio = dirty_background_ratio;
 + param-dirty_background_bytes = dirty_background_bytes;
 +}
 +
  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
   struct list_head *dst,
   unsigned long *scanned, int order,
 @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page 
 *page,
   mem_cgroup_update_page_stat(page, idx, -1);
  }
  
 +bool mem_cgroup_has_dirty_limit(void);
 +void get_vm_dirty_param(struct vm_dirty_param *param);
 +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
 +
  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
   gfp_t gfp_mask);
  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page 
 *page,
  {
  }
  
 +static inline bool mem_cgroup_has_dirty_limit(void)
 +{
 + return false;
 +}
 +
 +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
 +{
 + get_global_vm_dirty_param(param);
 +}
 +
 +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item 
 item)
 +{
 + return -ENOSYS;
 +}
 +
  static inline
  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
   gfp_t gfp_mask)
 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index f40839f..6ec2625 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -233,6 +233,10 @@ struct mem_cgroup {
   atomic_trefcnt;
  
   unsigned intswappiness;
 +
 + /* control memory cgroup dirty pages */
 + struct vm_dirty_param dirty_param;
 +
   /* OOM-Killer disable */
   int oom_kill_disable;
  
 @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup 
 *memcg)
   return swappiness;
  }
  
 +/*
 + * Returns a snapshot of the current dirty limits which is not synchronized 
 with
 + * the routines that change the dirty limits.  If this routine races with an
 + * update to the dirty bytes/ratio value, then the caller must handle the 
 case
 + * where both dirty_[background_]_ratio and _bytes are set.
 + */
 +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
 +  struct mem_cgroup *mem)
 +{
 + if (mem  !mem_cgroup_is_root(mem)) {
 + param-dirty_ratio = mem-dirty_param.dirty_ratio;
 + param-dirty_bytes = mem-dirty_param.dirty_bytes;
 + param-dirty_background_ratio =
 + mem-dirty_param.dirty_background_ratio;
 + param-dirty_background_bytes =
 + mem-dirty_param.dirty_background_bytes;
 + } else {
 + get_global_vm_dirty_param(param);
 + }
 +}
 +
 +/*
 + * Get dirty memory parameters of the current memcg or global values (if 
 memory
 + * cgroups are disabled or querying the root cgroup).
 + */
 +void get_vm_dirty_param(struct vm_dirty_param *param)
 +{
 + struct mem_cgroup *memcg;
 +
 + if (mem_cgroup_disabled()) {
 + get_global_vm_dirty_param(param);
 + return;
 + }
 +
 + /*
 +  * It's possible

[Devel] Re: [PATCH 00/10] memcg: per cgroup dirty page accounting

2010-10-05 Thread Andrea Righi
On Sun, Oct 03, 2010 at 11:57:55PM -0700, Greg Thelen wrote:
 This patch set provides the ability for each cgroup to have independent dirty
 page limits.
 
 Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
 page cache used by a cgroup.  So, in case of multiple cgroup writers, they 
 will
 not be able to consume more than their designated share of dirty pages and 
 will
 be forced to perform write-out if they cross that limit.
 
 These patches were developed and tested on mmotm 2010-09-28-16-13.  The 
 patches
 are based on a series proposed by Andrea Righi in Mar 2010.
 
 Overview:
 - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
   unstable.
 - Extend mem_cgroup to record the total number of pages in each of the 
   interesting dirty states (dirty, writeback, unstable_nfs).  
 - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
   via cgroupfs control files.
 - Consider both system and per-memcg dirty limits in page writeback when
   deciding to queue background writeback or block for foreground writeback.
 
 Known shortcomings:
 - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
   just inodes contributing dirty pages to the cgroup exceeding its limit.  
 
 Performance measurements:
 - kernel builds are unaffected unless run with a small dirty limit.
 - all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
 - dd has three data points (in secs) for three data sizes (100M, 200M, and 
 1G).  
   As expected, dd slows when it exceed its cgroup dirty limit.
 
                 kernel_build   dd
 mmotm           2:37           0.18, 0.38, 1.65
   root_memcg

 mmotm           2:37           0.18, 0.35, 1.66
   non-root_memcg

 mmotm+patches   2:37           0.18, 0.35, 1.68
   root_memcg

 mmotm+patches   2:37           0.19, 0.35, 1.69
   non-root_memcg

 mmotm+patches   2:37           0.19, 2.34, 22.82
   non-root_memcg
   150 MiB memcg dirty limit

 mmotm+patches   3:58           1.71, 3.38, 17.33
   non-root_memcg
   1 MiB memcg dirty limit

Hi Greg,

the patchset seems to work fine on my box.

I also ran a pretty simple test to directly verify the effectiveness of
the dirty memory limit, using a dd running on a non-root memcg:

  dd if=/dev/zero of=tmpfile bs=1M count=512

and monitoring the max of the dirty value in cgroup/memory.stat:

Here the results:
  dd in non-root memcg (  4 MiB memcg dirty limit): dirty max=4227072
  dd in non-root memcg (  8 MiB memcg dirty limit): dirty max=8454144
  dd in non-root memcg ( 16 MiB memcg dirty limit): dirty max=15179776
  dd in non-root memcg ( 32 MiB memcg dirty limit): dirty max=32235520
  dd in non-root memcg ( 64 MiB memcg dirty limit): dirty max=64245760
  dd in non-root memcg (128 MiB memcg dirty limit): dirty max=121028608
  dd in non-root memcg (256 MiB memcg dirty limit): dirty max=232865792
  dd in non-root memcg (512 MiB memcg dirty limit): dirty max=445194240

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC] [PATCH 0/2] memcg: per cgroup dirty limit

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_bytes in cgroupfs
 - start to write-out in balance_dirty_pages() when the cgroup or global limit
   is exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in the VM
layer and enforce a write-out before any cgroup consumes the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes limit.
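
The check to add in balance_dirty_pages() is conceptually something like
this (sketch only, built on the helpers introduced in patch 1/2):

  static int memcg_over_dirty_limit(void)
  {
          unsigned long limit = mem_cgroup_dirty_bytes() / PAGE_SIZE;
          u64 dirty = mem_cgroup_page_state(MEMCG_NR_FILE_DIRTY) +
                      mem_cgroup_page_state(MEMCG_NR_UNSTABLE_NFS);

          /* dirty_bytes == 0 means "no per-cgroup limit, use the global one" */
          return limit && dirty >= limit;
  }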

TODO:
 - handle the migration of tasks across different cgroups (a page may be set
   dirty when a task runs in a cgroup and cleared after the task is moved to
   another cgroup).
 - provide an appropriate documentation (in Documentation/cgroups/memory.txt)

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

Signed-off-by: Andrea Righi ari...@develer.com
---
 fs/fuse/file.c  |3 ++
 fs/nfs/write.c  |3 ++
 fs/nilfs2/segment.c |8 -
 mm/filemap.c|1 +
 mm/page-writeback.c |   69 --
 mm/truncate.c   |1 +
 6 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..357632a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include linux/pagemap.h
 #include linux/slab.h
 #include linux/kernel.h
+#include linux/memcontrol.h
 #include linux/sched.h
 #include linux/module.h
 
@@ -1129,6 +1130,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
 
list_del(req-writepages_entry);
dec_bdi_stat(bdi, BDI_WRITEBACK);
+   mem_cgroup_charge_dirty(req-pages[0], NR_WRITEBACK_TEMP, -1);
dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
wake_up(fi-page_waitq);
@@ -1240,6 +1242,7 @@ static int fuse_writepage_locked(struct page *page)
req-inode = inode;
 
inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
+   mem_cgroup_charge_dirty(tmp_page, NR_WRITEBACK_TEMP, 1);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d63d964..3d9de01 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
req-wb_index,
NFS_PAGE_TAG_COMMIT);
spin_unlock(inode-i_lock);
+   mem_cgroup_charge_dirty(req-wb_page, NR_UNSTABLE_NFS, 1);
inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req-wb_page;
 
if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
+   mem_cgroup_charge_dirty(page, NR_UNSTABLE_NFS, -1);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page-mapping-backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1320,6 +1322,7 @@ nfs_commit_list(struct inode *inode, struct list_head 
*head, int how)
req = nfs_list_entry(head-next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+   mem_cgroup_charge_dirty(req-wb_page, NR_UNSTABLE_NFS, -1);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 105b508..b9ffac5 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1660,8 +1660,10 @@ nilfs_copy_replace_page_buffers(struct page *page, 
struct list_head *out)
} while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head);
kunmap_atomic(kaddr, KM_USER0);
 
-   if (!TestSetPageWriteback(clone_page))
+   if (!TestSetPageWriteback(clone_page)) {
+   mem_cgroup_charge_dirty(clone_page, NR_WRITEBACK, 1);
inc_zone_page_state(clone_page, NR_WRITEBACK);
+   }
unlock_page(clone_page);
 
return 0;
@@ -1788,8 +1790,10 @@ static void __nilfs_end_page_io(struct page *page, int 
err)
}
 
if (buffer_nilfs_allocated(page_buffers(page))) {
-   if (TestClearPageWriteback(page))
+   if (TestClearPageWriteback(page)) {
+   mem_cgroup_charge_dirty(clone_page, NR_WRITEBACK, -1);
dec_zone_page_state(page, NR_WRITEBACK);
+   }
} else
end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 698ea80..c19d809 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 * having removed the page entirely.
 */
if (PageDirty(page)  mapping_cap_account_dirty(mapping)) {
+   mem_cgroup_charge_dirty(page, NR_FILE_DIRTY, -1);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping-backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b19943..c9ff1cd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-   unsigned long dirty_total;
+   unsigned long dirty_total, dirty_bytes;
 
-   if (vm_dirty_bytes)
-   dirty_total = vm_dirty_bytes / PAGE_SIZE;
+   dirty_bytes = mem_cgroup_dirty_bytes

[Devel] [PATCH 1/2] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
Infrastructure to account dirty pages per cgroup + add memory.dirty_bytes limit
in cgroupfs.

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |   31 ++
 mm/memcontrol.c|  218 +++-
 2 files changed, 248 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..ba3fe0d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,16 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum memcg_page_stat_item {
+   MEMCG_NR_FREE_PAGES,
+   MEMCG_NR_RECLAIMABLE_PAGES,
+   MEMCG_NR_FILE_DIRTY,
+   MEMCG_NR_WRITEBACK,
+   MEMCG_NR_WRITEBACK_TEMP,
+   MEMCG_NR_UNSTABLE_NFS,
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -48,6 +58,8 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup 
*ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+extern void mem_cgroup_charge_dirty(struct page *page,
+   enum zone_stat_item idx, int charge);
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
@@ -117,6 +129,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern int do_swap_account;
 #endif
 
+extern unsigned long mem_cgroup_dirty_bytes(void);
+
+extern u64 mem_cgroup_page_state(enum memcg_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
if (mem_cgroup_subsys.disabled)
@@ -144,6 +160,11 @@ static inline int mem_cgroup_cache_charge(struct page 
*page,
return 0;
 }
 
+static inline void mem_cgroup_charge_dirty(struct page *page,
+   enum zone_stat_item idx, int charge)
+{
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
@@ -312,6 +333,16 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return 0;
 }
 
+static inline unsigned long mem_cgroup_dirty_bytes(void)
+{
+   return vm_dirty_bytes;
+}
+
+static inline u64 mem_cgroup_page_state(enum memcg_page_stat_item item)
+{
+   return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 954032b..288b9a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -64,13 +64,18 @@ enum mem_cgroup_stat_index {
/*
 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
 */
-   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+   MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+   MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+   MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+   temporary buffers */
+   MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
 
MEM_CGROUP_STAT_NSTATS,
 };
@@ -225,6 +230,9 @@ struct mem_cgroup {
/* set when res.limit == memsw.limit */
boolmemsw_is_minimum;
 
+   /* control memory cgroup dirty pages */
+   unsigned long dirty_bytes;
+
/*
 * statistics. This must be placed at the end of memcg.
 */
@@ -519,6 +527,67 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup 
*mem,
put_cpu();
 }
 
+static struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+   struct page_cgroup *pc;
+   struct mem_cgroup *mem = NULL;
+
+   pc = lookup_page_cgroup(page);
+   if (unlikely(!pc))
+   return NULL;
+   lock_page_cgroup(pc);
+   if (PageCgroupUsed(pc)) {
+   mem = pc-mem_cgroup;
+   if (mem)
+   css_get(mem-css);
+   }
+   unlock_page_cgroup(pc);
+   return mem;
+}
+
+void mem_cgroup_charge_dirty(struct page *page,
+   enum zone_stat_item idx, int charge)
+{
+   struct mem_cgroup *mem;
+   struct mem_cgroup_stat_cpu *cpustat;
+   unsigned long flags;
+   int cpu

[Devel] Re: [PATCH 1/2] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Sun, Feb 21, 2010 at 01:28:35PM -0800, David Rientjes wrote:

[snip]

  +static struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
  +{
  +   struct page_cgroup *pc;
  +   struct mem_cgroup *mem = NULL;
  +
  +   pc = lookup_page_cgroup(page);
  +   if (unlikely(!pc))
  +   return NULL;
  +   lock_page_cgroup(pc);
  +   if (PageCgroupUsed(pc)) {
  +   mem = pc-mem_cgroup;
  +   if (mem)
  +   css_get(mem-css);
  +   }
  +   unlock_page_cgroup(pc);
  +   return mem;
  +}
 
 Is it possible to merge this with try_get_mem_cgroup_from_page()?

Agreed.

 
  +
  +void mem_cgroup_charge_dirty(struct page *page,
  +   enum zone_stat_item idx, int charge)
  +{
  +   struct mem_cgroup *mem;
  +   struct mem_cgroup_stat_cpu *cpustat;
  +   unsigned long flags;
  +   int cpu;
  +
  +   if (mem_cgroup_disabled())
  +   return;
  +   /* Translate the zone_stat_item into a mem_cgroup_stat_index */
  +   switch (idx) {
  +   case NR_FILE_DIRTY:
  +   idx = MEM_CGROUP_STAT_FILE_DIRTY;
  +   break;
  +   case NR_WRITEBACK:
  +   idx = MEM_CGROUP_STAT_WRITEBACK;
  +   break;
  +   case NR_WRITEBACK_TEMP:
  +   idx = MEM_CGROUP_STAT_WRITEBACK_TEMP;
  +   break;
  +   case NR_UNSTABLE_NFS:
  +   idx = MEM_CGROUP_STAT_UNSTABLE_NFS;
  +   break;
  +   default:
  +   return;
 
 WARN()?  We don't want to silently leak counters.

Agreed.

 
  +   }
  +   /* Charge the memory cgroup statistics */
  +   mem = get_mem_cgroup_from_page(page);
  +   if (!mem) {
  +   mem = root_mem_cgroup;
  +   css_get(mem-css);
  +   }
 
 get_mem_cgroup_from_page() should probably handle the root_mem_cgroup case 
 and return a reference from it.

Right. But I'd prefer to use try_get_mem_cgroup_from_page() without
changing the behaviour of this function.
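
For instance, something along these lines (sketch of the fallback only):

  static struct mem_cgroup *dirty_stat_get_memcg(struct page *page)
  {
          struct mem_cgroup *mem = try_get_mem_cgroup_from_page(page);

          if (!mem) {
                  /* page not associated with any memcg yet: charge the root */
                  mem = root_mem_cgroup;
                  css_get(&mem->css);
          }
          return mem;     /* the caller releases it with css_put(&mem->css) */
  }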

 
  +
  +   local_irq_save(flags);
  +   cpu = get_cpu();
  +   cpustat = mem-stat.cpustat[cpu];
  +   __mem_cgroup_stat_add_safe(cpustat, idx, charge);
 
 get_cpu()?  Preemption is already disabled, just use smp_processor_id().

mmmh... actually, we can just copy the code from
mem_cgroup_charge_statistics(), so local_irq_save/restore are not
necessarily needed and we can just use get_cpu()/put_cpu().

  +   put_cpu();
  +   local_irq_restore(flags);
  +   css_put(mem-css);
  +}
  +
   static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
  enum lru_list idx)
   {
  @@ -992,6 +1061,97 @@ static unsigned int get_swappiness(struct mem_cgroup 
  *memcg)
  return swappiness;
   }
   
  +static unsigned long get_dirty_bytes(struct mem_cgroup *memcg)
  +{
  +   struct cgroup *cgrp = memcg-css.cgroup;
  +   unsigned long dirty_bytes;
  +
  +   /* root ? */
  +   if (cgrp-parent == NULL)
  +   return vm_dirty_bytes;
  +
  +   spin_lock(memcg-reclaim_param_lock);
  +   dirty_bytes = memcg-dirty_bytes;
  +   spin_unlock(memcg-reclaim_param_lock);
  +
  +   return dirty_bytes;
  +}
  +
  +unsigned long mem_cgroup_dirty_bytes(void)
  +{
  +   struct mem_cgroup *memcg;
  +   unsigned long dirty_bytes;
  +
  +   if (mem_cgroup_disabled())
  +   return vm_dirty_bytes;
  +
  +   rcu_read_lock();
  +   memcg = mem_cgroup_from_task(current);
  +   if (memcg == NULL)
  +   dirty_bytes = vm_dirty_bytes;
  +   else
  +   dirty_bytes = get_dirty_bytes(memcg);
  +   rcu_read_unlock();
 
 The rcu_read_lock() isn't protecting anything here.

Right!

 
  +
  +   return dirty_bytes;
  +}
  +
  +u64 mem_cgroup_page_state(enum memcg_page_stat_item item)
  +{
  +   struct mem_cgroup *memcg;
  +   struct cgroup *cgrp;
  +   u64 ret = 0;
  +
  +   if (mem_cgroup_disabled())
  +   return 0;
  +
  +   rcu_read_lock();
 
 Again, this isn't necessary.

OK. I'll apply your changes to the next version of this patch.

Thanks for reviewing!

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Sun, Feb 21, 2010 at 01:38:28PM -0800, David Rientjes wrote:
 On Sun, 21 Feb 2010, Andrea Righi wrote:
 
  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index 0b19943..c9ff1cd 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
*/
   static int calc_period_shift(void)
   {
  -   unsigned long dirty_total;
  +   unsigned long dirty_total, dirty_bytes;
   
  -   if (vm_dirty_bytes)
  -   dirty_total = vm_dirty_bytes / PAGE_SIZE;
  +   dirty_bytes = mem_cgroup_dirty_bytes();
  +   if (dirty_bytes)
  +   dirty_total = dirty_bytes / PAGE_SIZE;
  else
  dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
  100;
 
 This needs a comment since mem_cgroup_dirty_bytes() doesn't imply that it 
 is responsible for returning the global vm_dirty_bytes when that's 
 actually what it does (both for CONFIG_CGROUP_MEM_RES_CTRL=n and root 
 cgroup).

Fair enough.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Mon, Feb 22, 2010 at 09:32:21AM +0900, KAMEZAWA Hiroyuki wrote:
  -   if (vm_dirty_bytes)
  -   dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
  +   dirty_bytes = mem_cgroup_dirty_bytes();
  +   if (dirty_bytes)
  +   dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
  else {
  int dirty_ratio;
 
 you use local value. But, if hierarchila accounting used, memcg-dirty_bytes
 should be got from root-of-hierarchy memcg.
 
 I have no objection if you add a pointer as
   memcg-subhierarchy_root
 to get root of hierarchical accounting. But please check problem of 
 hierarchy, again.

Right, it won't work with hierarchy. I'll fix it, also considering the
hierarchy case.
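
Probably something like this (just a sketch; subhierarchy_root is the
pointer you suggested, it doesn't exist yet):

  static unsigned long memcg_effective_dirty_bytes(struct mem_cgroup *memcg)
  {
          /* with hierarchical accounting the limit lives in the subtree root */
          if (memcg->use_hierarchy && memcg->subhierarchy_root)
                  memcg = memcg->subhierarchy_root;

          return memcg->dirty_bytes;
  }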

Thanks for your review.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/2] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Mon, Feb 22, 2010 at 09:44:42PM +0530, Balbir Singh wrote:
[snip]
  +void mem_cgroup_charge_dirty(struct page *page,
  +   enum zone_stat_item idx, int charge)
  +{
  +   struct mem_cgroup *mem;
  +   struct mem_cgroup_stat_cpu *cpustat;
  +   unsigned long flags;
  +   int cpu;
  +
  +   if (mem_cgroup_disabled())
  +   return;
  +   /* Translate the zone_stat_item into a mem_cgroup_stat_index */
  +   switch (idx) {
  +   case NR_FILE_DIRTY:
  +   idx = MEM_CGROUP_STAT_FILE_DIRTY;
  +   break;
  +   case NR_WRITEBACK:
  +   idx = MEM_CGROUP_STAT_WRITEBACK;
  +   break;
  +   case NR_WRITEBACK_TEMP:
  +   idx = MEM_CGROUP_STAT_WRITEBACK_TEMP;
  +   break;
  +   case NR_UNSTABLE_NFS:
  +   idx = MEM_CGROUP_STAT_UNSTABLE_NFS;
  +   break;
  +   default:
  +   return;
  +   }
  +   /* Charge the memory cgroup statistics */
  +   mem = get_mem_cgroup_from_page(page);
  +   if (!mem) {
  +   mem = root_mem_cgroup;
  +   css_get(mem-css);
  +   }
  +
  +   local_irq_save(flags);
  +   cpu = get_cpu();
 
 Kamezawa is in the process of changing these, so you might want to
 look at and integrate with those patches when they are ready.

OK, I'll rebase the patch to -mm. Are those changes already included in
mmotm?

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Mon, Feb 22, 2010 at 11:52:15AM -0500, Vivek Goyal wrote:
   unsigned long determine_dirtyable_memory(void)
   {
  -   unsigned long x;
  -
  -   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
  -
  +   unsigned long memcg_memory, memory;
  +
  +   memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
  +   memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
   +   if (memcg_memory > 0) {
 
 it could be just 
 
   if (memcg_memory) {

Agreed.

   }
 
  +   memcg_memory +=
  +   mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
   +   if (memcg_memory > memory)
  +   return memcg_memory;
  +   }
  if (!vm_highmem_is_dirtyable)
  -   x -= highmem_dirtyable_memory(x);
  +   memory -= highmem_dirtyable_memory(memory);
   
 
 If vm_highmem_is_dirtyable=0, in that case, we can still return with
 memcg_memory which can be more than memory.  IOW, highmem is not
 dirtyable system wide but still we can potentially return back saying
 for this cgroup we can dirty more pages which can potentially actually
 be more than system wide allowed?
 
 Because you have modified dirtyable_memory() and made it per cgroup, I
 think it automatically takes care of the cases of per cgroup dirty ratio,
 I mentioned in my previous mail. So we will use system wide dirty ratio
 to calculate the allowed dirty pages in this cgroup (dirty_ratio *
 available_memory()) and if this cgroup wrote too many pages start
 writeout? 

OK, if I've understood correctly, you're proposing to use a per-cgroup
dirty_ratio interface and do something like:

unsigned long determine_dirtyable_memory(void)
{
        unsigned long memcg_memory, memory;

        memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
        if (!vm_highmem_is_dirtyable)
                memory -= highmem_dirtyable_memory(memory);

        memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
        if (!memcg_memory)
                return memory + 1;      /* Ensure that we never return 0 */
        memcg_memory += mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
        if (!vm_highmem_is_dirtyable)
                memcg_memory -= highmem_dirtyable_memory(memory) *
                                        mem_cgroup_dirty_ratio() / 100;
        if (memcg_memory > memory)
                return memcg_memory;
}


 
  -   return x + 1;   /* Ensure that we never return 0 */
  +   return memory + 1;  /* Ensure that we never return 0 */
   }
   
   void
  @@ -421,12 +428,13 @@ get_dirty_limits(unsigned long *pbackground, unsigned 
  long *pdirty,
   unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
   {
  unsigned long background;
  -   unsigned long dirty;
  +   unsigned long dirty, dirty_bytes;
  unsigned long available_memory = determine_dirtyable_memory();
  struct task_struct *tsk;
   
  -   if (vm_dirty_bytes)
  -   dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
  +   dirty_bytes = mem_cgroup_dirty_bytes();
  +   if (dirty_bytes)
  +   dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
  else {
  int dirty_ratio;
   
  @@ -505,9 +513,17 @@ static void balance_dirty_pages(struct address_space 
  *mapping,
  get_dirty_limits(background_thresh, dirty_thresh,
  bdi_thresh, bdi);
   
  -   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
  +   nr_reclaimable = mem_cgroup_page_state(MEMCG_NR_FILE_DIRTY);
  +   if (nr_reclaimable == 0) {
  +   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
  global_page_state(NR_UNSTABLE_NFS);
  -   nr_writeback = global_page_state(NR_WRITEBACK);
  +   nr_writeback = global_page_state(NR_WRITEBACK);
  +   } else {
  +   nr_reclaimable +=
  +   mem_cgroup_page_state(MEMCG_NR_UNSTABLE_NFS);
  +   nr_writeback =
  +   mem_cgroup_page_state(MEMCG_NR_WRITEBACK);
  +   }
   
  bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
  bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
  @@ -660,6 +676,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  unsigned long dirty_thresh;
   
   for ( ; ; ) {
  +   unsigned long dirty;
  +
  get_dirty_limits(background_thresh, dirty_thresh, NULL, NULL);
   
   /*
  @@ -668,10 +686,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
*/
   dirty_thresh += dirty_thresh / 10;  /* wh... */
   
  -if (global_page_state(NR_UNSTABLE_NFS) +
  -   global_page_state(NR_WRITEBACK) = dirty_thresh)
  -   break;
  -congestion_wait(BLK_RW_ASYNC, HZ/10);
  +   dirty = mem_cgroup_page_state(MEMCG_NR_WRITEBACK);
  +   

[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Feb 23, 2010 at 10:40:40AM +0100, Andrea Righi wrote:
  If vm_highmem_is_dirtyable=0, in that case, we can still return with
  memcg_memory which can be more than memory.  IOW, highmem is not
  dirtyable system wide but still we can potentially return back saying
  for this cgroup we can dirty more pages which can potentially actually
  be more than system wide allowed?
  
  Because you have modified dirtyable_memory() and made it per cgroup, I
  think it automatically takes care of the cases of per cgroup dirty ratio,
  I mentioned in my previous mail. So we will use system wide dirty ratio
  to calculate the allowed dirty pages in this cgroup (dirty_ratio *
  available_memory()) and if this cgroup wrote too many pages start
  writeout? 
 
 OK, if I've understood well, you're proposing to use per-cgroup
 dirty_ratio interface and do something like:
 
 unsigned long determine_dirtyable_memory(void)
 {
   unsigned long memcg_memory, memory;
 
   memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
   if (!vm_highmem_is_dirtyable)
   memory -= highmem_dirtyable_memory(memory);
 
   memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
   if (!memcg_memory)
   return memory + 1;  /* Ensure that we never return 0 */
   memcg_memory += mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
   if (!vm_highmem_is_dirtyable)
memcg_memory -= highmem_dirtyable_memory(memory) *
   mem_cgroup_dirty_ratio() / 100;

ok, this is wrong:

   if (memcg_memory > memory)
   return memcg_memory;
 }

return min(memcg_memory, memory);
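
So, folding the fix back in, the sketch becomes:

unsigned long determine_dirtyable_memory(void)
{
        unsigned long memcg_memory, memory;

        memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
        if (!vm_highmem_is_dirtyable)
                memory -= highmem_dirtyable_memory(memory);

        memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
        if (!memcg_memory)
                return memory + 1;      /* ensure that we never return 0 */
        memcg_memory += mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
        if (!vm_highmem_is_dirtyable)
                memcg_memory -= highmem_dirtyable_memory(memory) *
                                        mem_cgroup_dirty_ratio() / 100;

        return min(memcg_memory, memory);
}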

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH 0/2] memcg: per cgroup dirty limit

2010-03-30 Thread Andrea Righi
On Mon, Feb 22, 2010 at 01:29:34PM -0500, Vivek Goyal wrote:
  I wouldn't like to add many different interfaces to do the same thing.
  I'd prefer to choose just one interface and always use it. We just have
  to define which is the best one. IMHO dirty_bytes is more generic. If
  we want to define the limit as a % we can always do that in userspace.
  
 
 dirty_ratio is easy to configure. One system wide default value works for
 all the newly created cgroups. For dirty_bytes, you will have to
 configure each individual cgroup with a specific value depending on
 what the upper limit of memory for that cgroup is.

OK.

 
 Secondly, memory cgroup kind of partitions the global memory resource per
 cgroup. So as long as we have global dirty ratio knobs, it makes sense
 to have a per cgroup dirty ratio knob also. 
 
 But I guess we can introduce that later and use gloabl dirty ratio for
 all the memory cgroups (instead of each cgroup having a separate dirty
 ratio). The only thing is that we need to enforce this dirty ratio on the
 cgroup and if I am reading the code correctly, your modifications of
 calculating available_memory() per cgroup should take care of that.

At the moment (with dirty_bytes) if the cgroup has dirty_bytes == 0, it
simply uses the system wide available_memory(), ignoring the memory
upper limit for that cgroup, and falls back to the current behaviour.

With dirty_ratio, should we change the code to *always* apply this
percentage to the cgroup memory upper limit, and automatically set it
equal to the global dirty_ratio by default when the cgroup is created?
mmmh... I vote yes.
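
In other words, at cgroup creation time we would do something like this
(sketch; dirty_ratio here is the new per-memcg field being discussed):

  static void memcg_init_dirty_limits(struct mem_cgroup *memcg)
  {
          memcg->dirty_ratio = vm_dirty_ratio;    /* inherit the global default */
          memcg->dirty_bytes = 0;                 /* bytes limit disabled */
  }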

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Feb 23, 2010 at 02:22:12PM -0800, David Rientjes wrote:
 On Tue, 23 Feb 2010, Vivek Goyal wrote:
 
Because you have modified dirtyable_memory() and made it per cgroup, I
think it automatically takes care of the cases of per cgroup dirty 
ratio,
I mentioned in my previous mail. So we will use system wide dirty ratio
to calculate the allowed dirty pages in this cgroup (dirty_ratio *
available_memory()) and if this cgroup wrote too many pages start
writeout? 
   
   OK, if I've understood well, you're proposing to use per-cgroup
   dirty_ratio interface and do something like:
  
  I think we can use system wide dirty_ratio for per cgroup (instead of
  providing configurable dirty_ratio for each cgroup where each memory
  cgroup can have different dirty ratio. Can't think of a use case
  immediately).
 
 I think each memcg should have both dirty_bytes and dirty_ratio, 
 dirty_bytes defaults to 0 (disabled) while dirty_ratio is inherited from 
 the global vm_dirty_ratio.  Changing vm_dirty_ratio would not change 
 memcgs already using their own dirty_ratio, but new memcgs would get the 
 new value by default.  The ratio would act over the amount of available 
 memory to the cgroup as though it were its own virtual system operating 
 with a subset of the system's RAM and the same global ratio.

Agreed.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote:
 On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote:
 
 [..]
  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index 0b19943..c9ff1cd 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
*/
   static int calc_period_shift(void)
   {
  -   unsigned long dirty_total;
  +   unsigned long dirty_total, dirty_bytes;
   
  -   if (vm_dirty_bytes)
  -   dirty_total = vm_dirty_bytes / PAGE_SIZE;
  +   dirty_bytes = mem_cgroup_dirty_bytes();
  +   if (dirty_bytes)
  +   dirty_total = dirty_bytes / PAGE_SIZE;
  else
  dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
  100;
 
 Ok, I don't understand this so I better ask. Can you explain a bit how memory
 cgroup dirty ratio is going to play with per BDI dirty proportion thing.
 
 Currently we seem to be calculating per BDI proportion (based on recently
 completed events), of system wide dirty ratio and decide whether a process
 should be throttled or not.
 
 Because throttling decision is also based on BDI and its proportion, how
 are we going to fit it with mem cgroup? Is it going to be BDI proportion
 of dirty memory with-in memory cgroup (and not system wide)?

IMHO we need to calculate the BDI dirty threshold as a function of the
cgroup's dirty memory, and keep BDI statistics system wide.

So, if a task is generating some writes, the threshold at which it will
itself start writeback will be calculated as a function of the cgroup's
dirty memory. If the BDI dirty memory is greater than this threshold, the
task must start to write back dirty pages until it reaches the expected
dirty limit.

OK, in this way a cgroup with a small dirty limit may be forced to write
back a lot of pages dirtied by other cgroups on the same device. But this
is always related to the fact that tasks are forced to write back dirty
inodes randomly, and not the inodes they've actually dirtied.
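
In pseudo-code, the per-task dirty threshold would be something like this
(sketch; the BDI share is still derived from the system-wide completion
proportion, as it is today):

  static unsigned long memcg_dirty_thresh(void)
  {
          unsigned long available = determine_dirtyable_memory(); /* per-cgroup */
          unsigned long dirty_bytes = mem_cgroup_dirty_bytes();

          if (dirty_bytes)
                  return dirty_bytes / PAGE_SIZE;
          return vm_dirty_ratio * available / 100;
  }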

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Fri, Feb 26, 2010 at 04:48:11PM -0500, Vivek Goyal wrote:
 On Thu, Feb 25, 2010 at 04:12:11PM +0100, Andrea Righi wrote:
  On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote:
   On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote:
   
   [..]
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b19943..c9ff1cd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-   unsigned long dirty_total;
+   unsigned long dirty_total, dirty_bytes;
 
-   if (vm_dirty_bytes)
-   dirty_total = vm_dirty_bytes / PAGE_SIZE;
+   dirty_bytes = mem_cgroup_dirty_bytes();
+   if (dirty_bytes)
+   dirty_total = dirty_bytes / PAGE_SIZE;
else
dirty_total = (vm_dirty_ratio * 
determine_dirtyable_memory()) /
100;
   
   Ok, I don't understand this so I better ask. Can you explain a bit how 
   memory
   cgroup dirty ratio is going to play with per BDI dirty proportion thing.
   
   Currently we seem to be calculating per BDI proportion (based on recently
   completed events), of system wide dirty ratio and decide whether a process
   should be throttled or not.
   
   Because throttling decision is also based on BDI and its proportion, how
   are we going to fit it with mem cgroup? Is it going to be BDI proportion
   of dirty memory with-in memory cgroup (and not system wide)?
  
  IMHO we need to calculate the BDI dirty threshold as a function of the
  cgroup's dirty memory, and keep BDI statistics system wide.
  
  So, if a task is generating some writes, the threshold at which it will
  itself start writeback will be calculated as a function of the cgroup's
  dirty memory. If the BDI dirty memory is greater than this threshold, the
  task must start to write back dirty pages until it reaches the expected
  dirty limit.
  
 
 Ok, so calculate dirty per cgroup and calculate the BDI's proportion from
 the cgroup's dirty count? So will you be keeping track of vm_completion
 events per cgroup, or will you rely on the existing system wide and per BDI
 completion events to calculate the BDI proportion?
 
 BDI proportions are more of an indication of device speed, and a faster
 device gets a higher share of dirty, so maybe we don't have to keep track of
 completion events per cgroup and can rely on system wide completion events
 for calculating the proportion of a BDI.
 
  OK, in this way a cgroup with a small dirty limit may be forced to write
  back a lot of pages dirtied by other cgroups on the same device. But this
  is always related to the fact that tasks are forced to write back dirty
  inodes randomly, and not the inodes they've actually dirtied.
 
 So we are left with the following two issues.
 
 - Should we rely on global BDI stats for BDI_RECLAIMABLE and BDI_WRITEBACK,
   or do we need to make these per cgroup to determine how many pages have
   actually been dirtied by a cgroup and force writeouts accordingly?
 
 - Once we decide to throttle a cgroup, it should write its own inodes and
   should not be serialized behind other cgroups' inodes.

We could try to save who made the inode dirty
(inode->cgroup_that_made_inode_dirty) so that during the active
writeback each cgroup can be forced to write only its own inodes.
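
Roughly (sketch; the new inode field and where exactly to set it are of
course still to be decided):

  static void inode_record_dirtier(struct inode *inode, struct mem_cgroup *memcg)
  {
          inode->cgroup_that_made_inode_dirty = memcg;    /* new field */
  }

  static bool inode_dirtied_by(struct inode *inode, struct mem_cgroup *memcg)
  {
          /* writeback can then skip inodes dirtied by other cgroups */
          return inode->cgroup_that_made_inode_dirty == memcg;
  }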

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 0/2] memcg: per cgroup dirty limit (v2)

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in the VM
layer and enforce a write-out before any cgroup consumes the global amount
of dirty pages.

Changelog (v1 -> v2)
~~~~~~~~~~~~~~~~~~~~
 * rebased to -mmotm
 * properly handle hierarchical accounting
 * added the same system-wide interfaces to set dirty limits
   (memory.dirty_ratio / memory.dirty_bytes, memory.dirty_background_ratio, 
memory.dirty_background_bytes)
 * other minor fixes and improvements based on the received feedbacks

TODO:
 - handle the migration of tasks across different cgroups (maybe adding
   DIRTY/WRITEBACK/UNSTABLE flag to struct page_cgroup)
 - provide an appropriate documentation (in Documentation/cgroups/memory.txt)
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 1/2] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Active write-out: memory.dirty_ratio, memory.dirty_bytes
 - Background write-out: memory.dirty_background_ratio, 
memory.dirty_background_bytes

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |   74 +-
 mm/memcontrol.c|  354 
 2 files changed, 399 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..e6af95c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,41 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+   MEMCG_NR_DIRTYABLE_PAGES,
+   MEMCG_NR_RECLAIM_PAGES,
+   MEMCG_NR_WRITEBACK,
+   MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+   /*
+* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+*/
+   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
+   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
+   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
+   MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
+   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+   MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
+   used by soft limit implementation */
+   MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
+   used by threshold implementation */
+   MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+   MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+   MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+   temporary buffers */
+   MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+   MEM_CGROUP_STAT_NSTATS,
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +152,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern int do_swap_account;
 #endif
 
+extern long mem_cgroup_dirty_ratio(void);
+extern unsigned long mem_cgroup_dirty_bytes(void);
+extern long mem_cgroup_dirty_background_ratio(void);
+extern unsigned long mem_cgroup_dirty_background_bytes(void);
+
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
if (mem_cgroup_subsys.disabled)
@@ -125,7 +167,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+   enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
@@ -300,8 +343,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct 
task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-   int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+   enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +355,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return 0;
 }
 
+static inline long mem_cgroup_dirty_ratio(void)
+{
+   return vm_dirty_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_bytes(void)
+{
+   return vm_dirty_bytes;
+}
+
+static inline long mem_cgroup_dirty_background_ratio(void)
+{
+   return dirty_background_ratio;
+}
+
+static inline unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+   return dirty_background_bytes;
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+   return -ENOMEM;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a443c30..56f3204 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for 
remember boot option*/
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 #define THRESHOLDS_EVENTS_THRESH (100)
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-   /*
-* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-*/
-   MEM_CGROUP_STAT_CACHE

[Devel] [PATCH -mmotm 2/2] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

Signed-off-by: Andrea Righi ari...@develer.com
---
 fs/fuse/file.c  |5 +++
 fs/nfs/write.c  |4 ++
 fs/nilfs2/segment.c |   10 +-
 mm/filemap.c|1 +
 mm/page-writeback.c |   84 --
 mm/rmap.c   |4 +-
 mm/truncate.c   |2 +
 7 files changed, 76 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include linux/pagemap.h
 #include linux/slab.h
 #include linux/kernel.h
+#include linux/memcontrol.h
 #include linux/sched.h
 #include linux/module.h
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
 
list_del(req-writepages_entry);
dec_bdi_stat(bdi, BDI_WRITEBACK);
+   mem_cgroup_update_stat(req-pages[0],
+   MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
wake_up(fi-page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
req-inode = inode;
 
inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
+   mem_cgroup_update_stat(tmp_page,
+   MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
req-wb_index,
NFS_PAGE_TAG_COMMIT);
spin_unlock(inode-i_lock);
+   mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req-wb_page;
 
if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
+   mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE);
return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
*head, int how)
req = nfs_list_entry(head-next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+   mem_cgroup_update_stat(req-wb_page,
+   MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..aef6d13 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, 
struct list_head *out)
} while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head);
kunmap_atomic(kaddr, KM_USER0);
 
-   if (!TestSetPageWriteback(clone_page))
+   if (!TestSetPageWriteback(clone_page)) {
+   mem_cgroup_update_stat(clone_page,
+   MEM_CGROUP_STAT_WRITEBACK, 1);
inc_zone_page_state(clone_page, NR_WRITEBACK);
+   }
unlock_page(clone_page);
 
return 0;
@@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int 
err)
}
 
if (buffer_nilfs_allocated(page_buffers(page))) {
-   if (TestClearPageWriteback(page))
+   if (TestClearPageWriteback(page)) {
+   mem_cgroup_update_stat(clone_page,
+   MEM_CGROUP_STAT_WRITEBACK, -1);
dec_zone_page_state(page, NR_WRITEBACK);
+   }
} else
end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 * having removed the page entirely.
 */
	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+   mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..d83f41c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
  */
 static

[Devel] [PATCH -mmotm 0/3] memcg: per cgroup dirty limit (v3)

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is meant to work closely with any underlying IO controller
implementation, so we can stop increasing dirty pages at the VM layer and
enforce a write-out before any single cgroup consumes the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes limit.
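
To make the write-out trigger concrete, here is a minimal sketch of how a
balance_dirty_pages()-style check could derive the per-cgroup threshold from
the interfaces introduced by this series (memcg_dirty_limit_pages() is a
made-up helper name, used only for illustration):

/*
 * Illustration only: compute the per-cgroup dirty threshold in pages,
 * falling back to the ratio when no absolute byte limit is set.
 */
static unsigned long memcg_dirty_limit_pages(unsigned long dirtyable_pages)
{
	unsigned long bytes = mem_cgroup_dirty_bytes();

	if (bytes)
		return DIV_ROUND_UP(bytes, PAGE_SIZE);
	return (mem_cgroup_dirty_ratio() * dirtyable_pages) / 100;
}

When the cgroup's dirty + unstable NFS + writeback pages cross this value the
writer is throttled and forced into write-out, exactly as it would be against
the global threshold.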

Changelog (v2 -> v3)
~~~~~~~~~~~~~~~~~~~~
 * properly handle the swapless case when reading dirtyable pages statistic
 * combine similar functions + code cleanup based on the received feedbacks
 * updated documentation in Documentation/cgroups/memory.txt

-Andrea


[Devel] [PATCH -mmotm 1/3] memcg: dirty memory documentation

2010-03-30 Thread Andrea Righi
Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/cgroups/memory.txt |   36 
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index aad7d05..878afa7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -308,6 +308,11 @@ cache  - # of bytes of page cache memory.
 rss- # of bytes of anonymous and swap cache memory.
 pgpgin - # of pages paged in (equivalent to # of charging events).
 pgpgout- # of pages paged out (equivalent to # of uncharging 
events).
+filedirty  - # of pages that are waiting to get written back to the disk.
+writeback  - # of pages that are actively being written back to the disk.
+writeback_tmp  - # of pages used by FUSE for temporary writeback buffers.
+nfs- # of NFS pages sent to the server, but not yet committed to
+ the actual storage.
 active_anon- # of bytes of anonymous and  swap cache memory on active
  lru list.
 inactive_anon  - # of bytes of anonymous memory and swap cache memory on
@@ -343,6 +348,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given 
time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup 
writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+amount of dirty memory at which a process which is generating disk writes
+inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+bytes) at which a process generating disk writes will start itself writing
+out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+memory, the amount of dirty memory at which background writeback kernel
+threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+bytes) at which background writeback kernel threads will start writing out
+dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3
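
As a purely illustrative user-space sketch (the cgroupfs mount point and the
group name below are assumptions, adjust them to the local setup), configuring
a 100MB hard limit and a 50MB background limit for one cgroup would look like:

/* sketch: set the proposed per-cgroup dirty limits from user space */
#include <stdio.h>
#include <stdlib.h>

static void write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* assumed mount point /cgroup and group "foo" */
	write_val("/cgroup/foo/memory.dirty_bytes", "104857600");
	write_val("/cgroup/foo/memory.dirty_background_bytes", "52428800");
	return 0;
}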



[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
  @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
*/
   dirty_thresh += dirty_thresh / 10;  /* wh... */
   
  -if (global_page_state(NR_UNSTABLE_NFS) +
  -   global_page_state(NR_WRITEBACK) >= dirty_thresh)
  -   break;
  -congestion_wait(BLK_RW_ASYNC, HZ/10);
  +
  +   dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
  +   if (dirty < 0)
  +   dirty = global_page_state(NR_UNSTABLE_NFS) +
  +   global_page_state(NR_WRITEBACK);
 
 dirty is unsigned long. As mentioned last time, above will never be true?
 In general these patches look ok to me. I will do some testing with these.

Re-introduced the same bug. My bad. :(

The value returned from mem_cgroup_page_stat() can be negative, e.g. when
the memory cgroup is disabled. We could simply use a signed type for dirty
(the unit is # of pages, so an s64 would be enough), or cast dirty to long
only for the check (see below).

Thanks!
-Andrea

Signed-off-by: Andrea Righi ari...@develer.com
---
 mm/page-writeback.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d83f41c..dbee976 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 
dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-   if (dirty < 0)
+   if ((long)dirty < 0)
dirty = global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK);
	if (dirty >= dirty_thresh)


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
 On Mon,  1 Mar 2010 22:23:40 +0100
 Andrea Righi ari...@develer.com wrote:
 
  Apply the cgroup dirty pages accounting and limiting infrastructure to
  the opportune kernel functions.
  
  Signed-off-by: Andrea Righi ari...@develer.com
 
 Seems nice.
 
 Hmm. the last problem is moving account between memcg.
 
 Right ?

Correct. This was actually the last item on the TODO list. Anyway, I'm
still considering whether it's correct to move dirty pages when a task is
migrated from one cgroup to another. Currently, dirty pages just remain in
the original cgroup and are flushed according to the original cgroup's
settings. That is not totally wrong... at least moving the dirty pages
between memcgs should be optional (move_charge_at_immigrate?).

Thanks for your ack and the detailed review!

-Andrea


[Devel] Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 12:04:53PM +0200, Kirill A. Shutemov wrote:
[snip]
  +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
  +{
  +       return -ENOMEM;
 
 Why ENOMEM? Probably, EINVAL or ENOSYS?

OK, ENOSYS is more appropriate IMHO.

  +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
  +                               enum mem_cgroup_page_stat_item item)
  +{
  +       s64 ret;
  +
  +       switch (item) {
  +       case MEMCG_NR_DIRTYABLE_PAGES:
  +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
  +                       res_counter_read_u64(&memcg->res, RES_USAGE);
  +               /* Translate free memory in pages */
  +               ret >>= PAGE_SHIFT;
  +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
  +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
  +               if (mem_cgroup_can_swap(memcg))
  +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
  +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
  +               break;
  +       case MEMCG_NR_RECLAIM_PAGES:
  +               ret = mem_cgroup_read_stat(memcg, 
  MEM_CGROUP_STAT_FILE_DIRTY) +
  +                       mem_cgroup_read_stat(memcg,
  +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
  +               break;
  +       case MEMCG_NR_WRITEBACK:
  +               ret = mem_cgroup_read_stat(memcg, 
  MEM_CGROUP_STAT_WRITEBACK);
  +               break;
  +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
  +               ret = mem_cgroup_read_stat(memcg, 
  MEM_CGROUP_STAT_WRITEBACK) +
  +                       mem_cgroup_read_stat(memcg,
  +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
  +               break;
  +       default:
  +               ret = 0;
  +               WARN_ON_ONCE(1);
 
 I think it's a bug, not warning.

OK.

  +       }
  +       return ret;
  +}
  +
  +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
  +{
  +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat 
  *)data;
  +
  +       stat-value += mem_cgroup_get_local_page_stat(mem, stat-item);
  +       return 0;
  +}
  +
  +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
  +{
  +       struct mem_cgroup_page_stat stat = {};
  +       struct mem_cgroup *memcg;
  +
  +       if (mem_cgroup_disabled())
  +               return -ENOMEM;
 
 EINVAL/ENOSYS?

OK.

 
  +       rcu_read_lock();
  +       memcg = mem_cgroup_from_task(current);
  +       if (memcg) {
  +               /*
  +                * Recursively evaulate page statistics against all cgroup
  +                * under hierarchy tree
  +                */
  +               stat.item = item;
  +               mem_cgroup_walk_tree(memcg, stat, mem_cgroup_page_stat_cb);
  +       } else
  +               stat.value = -ENOMEM;
 
 ditto.

OK.

 
  +       rcu_read_unlock();
  +
  +       return stat.value;
  +}
  +
   static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
   {
         int *val = data;
  @@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
   }
 
   /*
  - * Currently used to update mapped file statistics, but the routine can be
  - * generalized to update other statistics as well.
  + * Generalized routine to update memory cgroup statistics.
   */
  -void mem_cgroup_update_file_mapped(struct page *page, int val)
  +void mem_cgroup_update_stat(struct page *page,
  +                       enum mem_cgroup_stat_index idx, int val)
 
 EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since
 it uses by filesystems.

Agreed.

  +static int
  +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
  +{
  +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
  +       int type = cft-private;
  +
  +       if (cgrp-parent == NULL)
  +               return -EINVAL;
  +       if (((type == MEM_CGROUP_DIRTY_RATIO) ||
  +               (type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))
 
 Too many unnecessary brackets
 
if ((type == MEM_CGROUP_DIRTY_RATIO ||
	type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
 

OK.

Thanks,
-Andrea


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
 On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi ari...@develer.com wrote:
  Apply the cgroup dirty pages accounting and limiting infrastructure to
  the opportune kernel functions.
 
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   fs/fuse/file.c      |    5 +++
   fs/nfs/write.c      |    4 ++
   fs/nilfs2/segment.c |   10 +-
   mm/filemap.c        |    1 +
   mm/page-writeback.c |   84 
  --
   mm/rmap.c           |    4 +-
   mm/truncate.c       |    2 +
   7 files changed, 76 insertions(+), 34 deletions(-)
 
  diff --git a/fs/fuse/file.c b/fs/fuse/file.c
  index a9f5e13..dbbdd53 100644
  --- a/fs/fuse/file.c
  +++ b/fs/fuse/file.c
  @@ -11,6 +11,7 @@
   #include linux/pagemap.h
   #include linux/slab.h
   #include linux/kernel.h
  +#include linux/memcontrol.h
   #include linux/sched.h
   #include linux/module.h
 
  @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn 
  *fc, struct fuse_req *req)
 
         list_del(req-writepages_entry);
         dec_bdi_stat(bdi, BDI_WRITEBACK);
  +       mem_cgroup_update_stat(req-pages[0],
  +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
         dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
         bdi_writeout_inc(bdi);
         wake_up(fi-page_waitq);
  @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
         req-inode = inode;
 
         inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
  +       mem_cgroup_update_stat(tmp_page,
  +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
         inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
         end_page_writeback(page);
 
  diff --git a/fs/nfs/write.c b/fs/nfs/write.c
  index b753242..7316f7a 100644
  --- a/fs/nfs/write.c
  +++ b/fs/nfs/write.c
  @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
                         req-wb_index,
                         NFS_PAGE_TAG_COMMIT);
         spin_unlock(inode-i_lock);
  +       mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 
  1);
         inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
         inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE);
         __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
  @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
         struct page *page = req-wb_page;
 
         if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
  +               mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, 
  -1);
                 dec_zone_page_state(page, NR_UNSTABLE_NFS);
                 dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE);
                 return 1;
  @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
  *head, int how)
                 req = nfs_list_entry(head-next);
                 nfs_list_remove_request(req);
                 nfs_mark_request_commit(req);
  +               mem_cgroup_update_stat(req-wb_page,
  +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
                 dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
                 dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
                                 BDI_UNSTABLE);
  diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
  index ada2f1b..aef6d13 100644
  --- a/fs/nilfs2/segment.c
  +++ b/fs/nilfs2/segment.c
  @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, 
  struct list_head *out)
         } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head);
         kunmap_atomic(kaddr, KM_USER0);
 
  -       if (!TestSetPageWriteback(clone_page))
  +       if (!TestSetPageWriteback(clone_page)) {
  +               mem_cgroup_update_stat(clone_page,
 
 s/clone_page/page/

mmh... shouldn't we use the same page used by TestSetPageWriteback() and
inc_zone_page_state()?

 
 And #include linux/memcontrol.h is missed.

OK.

I'll apply your fixes and post a new version.

Thanks for reviewing,
-Andrea


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 01:09:24PM +0200, Kirill A. Shutemov wrote:
 On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi ari...@develer.com wrote:
  On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote:
  On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi ari...@develer.com wrote:
   Apply the cgroup dirty pages accounting and limiting infrastructure to
   the opportune kernel functions.
  
   Signed-off-by: Andrea Righi ari...@develer.com
   ---
    fs/fuse/file.c      |    5 +++
    fs/nfs/write.c      |    4 ++
    fs/nilfs2/segment.c |   10 +-
    mm/filemap.c        |    1 +
    mm/page-writeback.c |   84 
   --
    mm/rmap.c           |    4 +-
    mm/truncate.c       |    2 +
    7 files changed, 76 insertions(+), 34 deletions(-)
  
   diff --git a/fs/fuse/file.c b/fs/fuse/file.c
   index a9f5e13..dbbdd53 100644
   --- a/fs/fuse/file.c
   +++ b/fs/fuse/file.c
   @@ -11,6 +11,7 @@
    #include linux/pagemap.h
    #include linux/slab.h
    #include linux/kernel.h
   +#include linux/memcontrol.h
    #include linux/sched.h
    #include linux/module.h
  
   @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn 
   *fc, struct fuse_req *req)
  
          list_del(req-writepages_entry);
          dec_bdi_stat(bdi, BDI_WRITEBACK);
   +       mem_cgroup_update_stat(req-pages[0],
   +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
          dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
          bdi_writeout_inc(bdi);
          wake_up(fi-page_waitq);
   @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
          req-inode = inode;
  
          inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
   +       mem_cgroup_update_stat(tmp_page,
   +                       MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
          inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
          end_page_writeback(page);
  
   diff --git a/fs/nfs/write.c b/fs/nfs/write.c
   index b753242..7316f7a 100644
   --- a/fs/nfs/write.c
   +++ b/fs/nfs/write.c
   @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
                          req-wb_index,
                          NFS_PAGE_TAG_COMMIT);
          spin_unlock(inode-i_lock);
   +       mem_cgroup_update_stat(req-wb_page, 
   MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
          inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
          inc_bdi_stat(req-wb_page-mapping-backing_dev_info, 
   BDI_UNSTABLE);
          __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
   @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
          struct page *page = req-wb_page;
  
          if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
   +               mem_cgroup_update_stat(page, 
   MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
                  dec_zone_page_state(page, NR_UNSTABLE_NFS);
                  dec_bdi_stat(page-mapping-backing_dev_info, 
   BDI_UNSTABLE);
                  return 1;
   @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct 
   list_head *head, int how)
                  req = nfs_list_entry(head-next);
                  nfs_list_remove_request(req);
                  nfs_mark_request_commit(req);
   +               mem_cgroup_update_stat(req-wb_page,
   +                               MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
                  dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
                  dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
                                  BDI_UNSTABLE);
   diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
   index ada2f1b..aef6d13 100644
   --- a/fs/nilfs2/segment.c
   +++ b/fs/nilfs2/segment.c
   @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page 
   *page, struct list_head *out)
          } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != 
   head);
          kunmap_atomic(kaddr, KM_USER0);
  
   -       if (!TestSetPageWriteback(clone_page))
   +       if (!TestSetPageWriteback(clone_page)) {
   +               mem_cgroup_update_stat(clone_page,
 
  s/clone_page/page/
 
  mmh... shouldn't we use the same page used by TestSetPageWriteback() and
  inc_zone_page_state()?
 
 Sorry, I've commented wrong hunk. It's for the next one.

Yes. Good catch! Will fix in the next version.
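
For clarity, the hunk that actually needs the change is the one in
__nilfs_end_page_io(), where the accounting must be done against page
(clone_page is not even in scope there). The planned fix is simply:

	if (buffer_nilfs_allocated(page_buffers(page))) {
		if (TestClearPageWriteback(page)) {
			mem_cgroup_update_stat(page,
					MEM_CGROUP_STAT_WRITEBACK, -1);
			dec_zone_page_state(page, NR_WRITEBACK);
		}
	} else
		end_page_writeback(page);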

Thanks,
-Andrea


[Devel] Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 06:32:24PM +0530, Balbir Singh wrote:

[snip]

  +extern long mem_cgroup_dirty_ratio(void);
  +extern unsigned long mem_cgroup_dirty_bytes(void);
  +extern long mem_cgroup_dirty_background_ratio(void);
  +extern unsigned long mem_cgroup_dirty_background_bytes(void);
  +
  +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
  +
 
 Docstyle comments for each function would be appreciated

OK.

   /*
* The memory controller data structure. The memory controller controls 
  both
* page cache and RSS per cgroup. We would eventually like to provide
  @@ -205,6 +199,9 @@ struct mem_cgroup {
  
  unsigned intswappiness;
  
  +   /* control memory cgroup dirty pages */
  +   unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
  +
 
 Could you mention what protects this field, is it the reclaim_lock?

Yes, it is.

Actually, we could avoid the lock completely for dirty_param[], using a
validation routine to check for incoherencies after any read with
get_dirty_param(), and retry if the validation fails. In practice, the
same approach we're using to read global vm_dirty_ratio, vm_dirty_bytes,
etc...

Considering that those values are rarely written and often read, we can
protect them in an RCU-like way.
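
Roughly, the lockless read side would become something like the following
sketch (get_dirty_param_pair() is just an illustrative name; the invariant
being checked is that a ratio and its corresponding bytes value are never
both set):

/*
 * Sketch only: speculative read of a ratio/bytes pair without taking
 * reclaim_param_lock; retry if a concurrent writer left the pair in a
 * transiently inconsistent state (both values set).
 */
static void get_dirty_param_pair(struct mem_cgroup *memcg,
				 unsigned long *ratio, unsigned long *bytes)
{
	do {
		*ratio = memcg->dirty_param[MEM_CGROUP_DIRTY_RATIO];
		*bytes = memcg->dirty_param[MEM_CGROUP_DIRTY_BYTES];
	} while (*ratio && *bytes);
}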


 BTW, is unsigned long sufficient to represent dirty_param(s)?

I think so. It's the same type used for the equivalent global values.

 
  /* set when res.limit == memsw.limit */
  boolmemsw_is_minimum;
  
  @@ -1021,6 +1018,164 @@ static unsigned int get_swappiness(struct 
  mem_cgroup *memcg)
  return swappiness;
   }
  
  +static unsigned long get_dirty_param(struct mem_cgroup *memcg,
  +   enum mem_cgroup_dirty_param idx)
  +{
  +   unsigned long ret;
  +
  +   VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
  +   spin_lock(&memcg->reclaim_param_lock);
  +   ret = memcg->dirty_param[idx];
  +   spin_unlock(&memcg->reclaim_param_lock);
 
 Do we need a spinlock if we protect it using RCU? Is precise data very
 important?

See above.

  +unsigned long mem_cgroup_dirty_background_bytes(void)
  +{
  +   struct mem_cgroup *memcg;
  +   unsigned long ret = dirty_background_bytes;
  +
  +   if (mem_cgroup_disabled())
  +   return ret;
  +   rcu_read_lock();
  +   memcg = mem_cgroup_from_task(current);
  +   if (likely(memcg))
  +   ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
  +   rcu_read_unlock();
  +
  +   return ret;
  +}
  +
  +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
  +{
  +   return do_swap_account ?
  +   res_counter_read_u64(&memcg->memsw, RES_LIMIT) :
 
 Shouldn't you do a res_counter_read_u64(...) > 0 for readability?

OK.

 What happens if memcg->res, RES_LIMIT == memcg->memsw, RES_LIMIT?

OK, we should also check memcg->memsw_is_minimum.
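
i.e. something along these lines (just a sketch of what I have in mind):

static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
{
	/* no swap accounting, or mem and memsw limits are the same */
	if (!do_swap_account || memcg->memsw_is_minimum)
		return false;
	return res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0;
}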

   static struct cgroup_subsys_state * __ref
   mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
   {
  @@ -3776,8 +4031,37 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct 
  cgroup *cont)
  mem-last_scanned_child = 0;
  spin_lock_init(mem-reclaim_param_lock);
  
  -   if (parent)
  +   if (parent) {
  mem-swappiness = get_swappiness(parent);
  +
  +   spin_lock(parent-reclaim_param_lock);
  +   copy_dirty_params(mem, parent);
  +   spin_unlock(parent-reclaim_param_lock);
  +   } else {
  +   /*
  +* XXX: should we need a lock here? we could switch from
  +* vm_dirty_ratio to vm_dirty_bytes or vice versa but we're not
  +* reading them atomically. The same for dirty_background_ratio
  +* and dirty_background_bytes.
  +*
  +* For now, try to read them speculatively and retry if a
  +* conflict is detected.
 
 The do while loop is subtle, can we add a validate check,share it with
 the write routine and retry if validation fails?

Agreed.
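
Something like a single helper shared by mem_cgroup_create() and the global
sysctl read path, e.g. (names are only illustrative):

/* sketch: one coherency rule shared by all readers of a ratio/bytes pair */
static bool dirty_params_coherent(unsigned long ratio, unsigned long bytes)
{
	/* at most one of the two values may be set at any time */
	return !(ratio && bytes);
}

static void read_global_dirty_params(unsigned long *ratio,
				     unsigned long *bytes)
{
	do {
		*ratio = vm_dirty_ratio;
		*bytes = vm_dirty_bytes;
	} while (!dirty_params_coherent(*ratio, *bytes));
}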

 
  +*/
  +   do {
  +   mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] =
  +   vm_dirty_ratio;
  +   mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] =
  +   vm_dirty_bytes;
  +   } while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] &&
  +mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]);
  +   do {
  +   mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] =
  +   dirty_background_ratio;
  +   mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] =
  +   dirty_background_bytes;
  +   } while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] &&
  +   mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]);
  +   }
  atomic_set(mem-refcnt, 1);
  mem-move_charge_at_immigrate = 0;
  mutex_init(mem-thresholds_lock);

Many thanks for reviewing,
-Andrea

[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 02:48:56PM +0100, Peter Zijlstra wrote:
 On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote:
  Apply the cgroup dirty pages accounting and limiting infrastructure to
  the opportune kernel functions.
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
 
  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index 5a0f8f3..d83f41c 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties;
*/
   static int calc_period_shift(void)
   {
  -   unsigned long dirty_total;
  +   unsigned long dirty_total, dirty_bytes;
   
  -   if (vm_dirty_bytes)
  -   dirty_total = vm_dirty_bytes / PAGE_SIZE;
  +   dirty_bytes = mem_cgroup_dirty_bytes();
  +   if (dirty_bytes)
 
 So you don't think 0 is a valid max dirty amount?

A value of 0 means disabled. It's used to select between dirty_ratio
or dirty_bytes. It's the same for the gloabl vm_dirty_* parameters.

 
  +   dirty_total = dirty_bytes / PAGE_SIZE;
  else
  -   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
  -   100;
  +   dirty_total = (mem_cgroup_dirty_ratio() *
  +   determine_dirtyable_memory()) / 100;
  return 2 + ilog2(dirty_total - 1);
   }
   
  @@ -408,14 +409,16 @@ static unsigned long 
  highmem_dirtyable_memory(unsigned long total)
*/
   unsigned long determine_dirtyable_memory(void)
   {
  -   unsigned long x;
  -
  -   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
  +   unsigned long memory;
  +   s64 memcg_memory;
   
  +   memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
  if (!vm_highmem_is_dirtyable)
  -   x -= highmem_dirtyable_memory(x);
  -
  -   return x + 1;   /* Ensure that we never return 0 */
  +   memory -= highmem_dirtyable_memory(memory);
  +   memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
  +   if (memcg_memory < 0)
 
 And here you somehow return negative?
 
  +   return memory + 1;
  +   return min((unsigned long)memcg_memory, memory + 1);
   }
   
   void
  @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned 
  long *pdirty,
   unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
   {
  unsigned long background;
  -   unsigned long dirty;
  +   unsigned long dirty, dirty_bytes, dirty_background;
  unsigned long available_memory = determine_dirtyable_memory();
  struct task_struct *tsk;
   
  -   if (vm_dirty_bytes)
  -   dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
  +   dirty_bytes = mem_cgroup_dirty_bytes();
  +   if (dirty_bytes)
 
 zero not valid again
 
  +   dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
  else {
  int dirty_ratio;
   
  -   dirty_ratio = vm_dirty_ratio;
  +   dirty_ratio = mem_cgroup_dirty_ratio();
	if (dirty_ratio < 5)
  dirty_ratio = 5;
  dirty = (dirty_ratio * available_memory) / 100;
  }
   
  -   if (dirty_background_bytes)
  -   background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
  +   dirty_background = mem_cgroup_dirty_background_bytes();
  +   if (dirty_background)
 
 idem
 
  +   background = DIV_ROUND_UP(dirty_background, PAGE_SIZE);
  else
  -   background = (dirty_background_ratio * available_memory) / 100;
  -
  +   background = (mem_cgroup_dirty_background_ratio() *
  +   available_memory) / 100;
	if (background >= dirty)
  background = dirty / 2;
  tsk = current;
  @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space 
  *mapping,
  get_dirty_limits(background_thresh, dirty_thresh,
  bdi_thresh, bdi);
   
  -   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
  +   nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
  +   nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
  +   if ((nr_reclaimable < 0) || (nr_writeback < 0)) {
  +   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
  global_page_state(NR_UNSTABLE_NFS);
 
 ??? why would a page_state be negative.. I see you return -ENOMEM on !
 cgroup, but how can one specify no dirty limit with this compiled in?
 
  -   nr_writeback = global_page_state(NR_WRITEBACK);
  +   nr_writeback = global_page_state(NR_WRITEBACK);
  +   }
   
  bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
  if (bdi_cap_account_unstable(bdi)) {
  @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space 
  *mapping,
   * In normal mode, we start background writeout at the lower
   * background_thresh, to keep the amount of dirty memory low.
   */
  +   nr_reclaimable

[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
 * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-02 17:23:16]:
 
  On Tue, 2 Mar 2010 09:01:58 +0100
  Andrea Righi ari...@develer.com wrote:
  
   On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
On Mon,  1 Mar 2010 22:23:40 +0100
Andrea Righi ari...@develer.com wrote:

 Apply the cgroup dirty pages accounting and limiting infrastructure to
 the opportune kernel functions.
 
 Signed-off-by: Andrea Righi ari...@develer.com

Seems nice.

Hmm. the last problem is moving account between memcg.

Right ?
   
   Correct. This was actually the last item of the TODO list. Anyway, I'm
   still considering if it's correct to move dirty pages when a task is
   migrated from a cgroup to another. Currently, dirty pages just remain in
   the original cgroup and are flushed depending on the original cgroup
   settings. That is not totally wrong... at least moving the dirty pages
   between memcgs should be optional (move_charge_at_immigrate?).
   
  
  My concern is 
   - migration between memcg is already suppoted
  - at task move
  - at rmdir
  
  Then, if you leave DIRTY_PAGE accounting to original cgroup,
  the new cgroup (migration target)'s Dirty page accounting may
  goes to be negative, or incorrect value. Please check FILE_MAPPED
  implementation in __mem_cgroup_move_account()
  
  As
 if (page_mapped(page) && !PageAnon(page)) {
  /* Update mapped_file data for mem_cgroup */
  preempt_disable();
  
  __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
  
  __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
  preempt_enable();
  }
  then, FILE_MAPPED never goes negative.
 
 
 Absolutely! I am not sure how complex dirty memory migration will be,
 but one way of working around it would be to disable migration of
 charges when the feature is enabled (dirty* is set in the memory
 cgroup). We might need additional logic to allow that to happen. 

I've started to look at dirty memory migration. First attempt is to add
DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
__mem_cgroup_move_account(). Probably I'll have something ready for the
next version of the patch. I still need to figure out whether this can work
as expected...
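
A first rough sketch of the move_account side, modeled on the FILE_MAPPED
handling quoted above (the PCG_ACCT_DIRTY flag name is provisional):

	/* sketch: inside __mem_cgroup_move_account(), pc already locked */
	if (PageCgroupAcctDirty(pc)) {
		preempt_disable();
		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
		preempt_enable();
	}

and the same pattern repeated for WRITEBACK, WRITEBACK_TEMP and UNSTABLE_NFS.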

-Andrea


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
 On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
  On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
@@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  */
 dirty_thresh += dirty_thresh / 10;  /* wh... */
 
-if (global_page_state(NR_UNSTABLE_NFS) +
-   global_page_state(NR_WRITEBACK) >= dirty_thresh)
-   break;
-congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+   dirty = 
mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+   if (dirty < 0)
+   dirty = global_page_state(NR_UNSTABLE_NFS) +
+   global_page_state(NR_WRITEBACK);
   
   dirty is unsigned long. As mentioned last time, above will never be true?
   In general these patches look ok to me. I will do some testing with these.
  
  Re-introduced the same bug. My bad. :(
  
  The value returned from mem_cgroup_page_stat() can be negative, i.e.
  when memory cgroup is disabled. We could simply use a long for dirty,
  the unit is in # of pages so s64 should be enough. Or cast dirty to long
  only for the check (see below).
  
  Thanks!
  -Andrea
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   mm/page-writeback.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index d83f41c..dbee976 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
   
   
  dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
  -   if (dirty < 0)
  +   if ((long)dirty < 0)
 
 This will also be problematic as on 32bit systems, your uppper limit of
 dirty memory will be 2G?
 
 I guess, I will prefer one of the two.
 
 - return the error code from function and pass a pointer to store stats
   in as function argument.
 
 - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
   per cgroup dirty control is enabled, then use per cgroup stats. In that
   case you don't have to return negative values.
 
   Only tricky part will be careful accouting so that none of the stats go
   negative in corner cases of migration etc.

What do you think about Peter's suggestion plus the locking stuff? (see the
previous email). Otherwise, I'll choose the other solution: passing a
pointer and always returning an error code is not bad.
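
For reference, the pointer-based variant would look roughly like this (sketch
only, reusing the walk-tree helper from the current series):

/* sketch: return an error code, hand the value back through a pointer */
int mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item, s64 *val)
{
	struct mem_cgroup_page_stat stat = {};
	struct mem_cgroup *memcg;

	if (mem_cgroup_disabled())
		return -ENOSYS;

	rcu_read_lock();
	memcg = mem_cgroup_from_task(current);
	if (!memcg) {
		rcu_read_unlock();
		return -ENOSYS;
	}
	stat.item = item;
	mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
	rcu_read_unlock();

	*val = stat.value;
	return 0;
}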

Thanks,
-Andrea


[Devel] Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 10:08:17AM -0800, Greg Thelen wrote:
 Comments below.  Yet to be tested on my end, but I will test it.
 
 On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi ari...@develer.com wrote:
  Infrastructure to account dirty pages per cgroup and add dirty limit
  interfaces in the cgroupfs:
 
   - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
 
   - Background write-out: memory.dirty_background_ratio, 
  memory.dirty_background_bytes
 
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   include/linux/memcontrol.h |   77 ++-
   mm/memcontrol.c            |  336 
  
   2 files changed, 384 insertions(+), 29 deletions(-)
 
  diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
  index 1f9b119..cc88b2e 100644
  --- a/include/linux/memcontrol.h
  +++ b/include/linux/memcontrol.h
  @@ -19,12 +19,50 @@
 
   #ifndef _LINUX_MEMCONTROL_H
   #define _LINUX_MEMCONTROL_H
  +
  +#include linux/writeback.h
   #include linux/cgroup.h
  +
   struct mem_cgroup;
   struct page_cgroup;
   struct page;
   struct mm_struct;
 
  +/* Cgroup memory statistics items exported to the kernel */
  +enum mem_cgroup_page_stat_item {
  +       MEMCG_NR_DIRTYABLE_PAGES,
  +       MEMCG_NR_RECLAIM_PAGES,
  +       MEMCG_NR_WRITEBACK,
  +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
  +};
  +
  +/*
  + * Statistics for memory cgroup.
  + */
  +enum mem_cgroup_stat_index {
  +       /*
  +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
  +        */
  +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
  +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
  +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
  +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
  +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
  +       MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use 
  */
  +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
  +       MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out.
  +                                       used by soft limit implementation */
  +       MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
  +                                       used by threshold implementation */
  +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
  +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
  +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback 
  using
  +                                               temporary buffers */
  +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
  +
  +       MEM_CGROUP_STAT_NSTATS,
  +};
  +
   #ifdef CONFIG_CGROUP_MEM_RES_CTLR
   /*
   * All charge functions with gfp_mask should use GFP_KERNEL or
  @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct 
  mem_cgroup *memcg,
   extern int do_swap_account;
   #endif
 
  +extern long mem_cgroup_dirty_ratio(void);
  +extern unsigned long mem_cgroup_dirty_bytes(void);
  +extern long mem_cgroup_dirty_background_ratio(void);
  +extern unsigned long mem_cgroup_dirty_background_bytes(void);
  +
  +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
  +
   static inline bool mem_cgroup_disabled(void)
   {
         if (mem_cgroup_subsys.disabled)
  @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void)
   }
 
   extern bool mem_cgroup_oom_called(struct task_struct *task);
  -void mem_cgroup_update_file_mapped(struct page *page, int val);
  +void mem_cgroup_update_stat(struct page *page,
  +                       enum mem_cgroup_stat_index idx, int val);
   unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
                                                 gfp_t gfp_mask, int nid,
                                                 int zid);
  @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
  struct task_struct *p)
   {
   }
 
  -static inline void mem_cgroup_update_file_mapped(struct page *page,
  -                                                       int val)
  +static inline void mem_cgroup_update_stat(struct page *page,
  +                       enum mem_cgroup_stat_index idx, int val)
   {
   }
 
  @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct 
  zone *zone, int order,
         return 0;
   }
 
  +static inline long mem_cgroup_dirty_ratio(void)
  +{
  +       return vm_dirty_ratio;
  +}
  +
  +static inline unsigned long mem_cgroup_dirty_bytes(void)
  +{
  +       return vm_dirty_bytes;
  +}
  +
  +static inline long mem_cgroup_dirty_background_ratio(void)
  +{
  +       return dirty_background_ratio;
  +}
  +
  +static inline unsigned long mem_cgroup_dirty_background_bytes(void)
  +{
  +       return dirty_background_bytes;
  +}
  +
  +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item

[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
 On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
  On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
   On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
  @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
*/
   dirty_thresh += dirty_thresh / 10;  /* 
  wh... */
   
  -if (global_page_state(NR_UNSTABLE_NFS) +
  -   global_page_state(NR_WRITEBACK) >= dirty_thresh)
  -   break;
  -congestion_wait(BLK_RW_ASYNC, HZ/10);
  +
  +   dirty = 
  mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
  +   if (dirty < 0)
  +   dirty = global_page_state(NR_UNSTABLE_NFS) +
  +   global_page_state(NR_WRITEBACK);
 
 dirty is unsigned long. As mentioned last time, above will never be 
 true?
 In general these patches look ok to me. I will do some testing with 
 these.

Re-introduced the same bug. My bad. :(

The value returned from mem_cgroup_page_stat() can be negative, i.e.
when memory cgroup is disabled. We could simply use a long for dirty,
the unit is in # of pages so s64 should be enough. Or cast dirty to long
only for the check (see below).

Thanks!
-Andrea

Signed-off-by: Andrea Righi ari...@develer.com
---
 mm/page-writeback.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d83f41c..dbee976 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
 
dirty = 
mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-   if (dirty < 0)
+   if ((long)dirty < 0)
   
   This will also be problematic as on 32bit systems, your uppper limit of
   dirty memory will be 2G?
   
   I guess, I will prefer one of the two.
   
   - return the error code from function and pass a pointer to store stats
 in as function argument.
   
   - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
 per cgroup dirty control is enabled, then use per cgroup stats. In that
 case you don't have to return negative values.
   
 Only tricky part will be careful accouting so that none of the stats go
 negative in corner cases of migration etc.
  
  What do you think about Peter's suggestion + the locking stuff? (see the
  previous email). Otherwise, I'll choose the other solution, passing a
  pointer and always return the error code is not bad.
  
 
 Ok, so you are worried about that by the we finish 
 mem_cgroup_has_dirty_limit()
 call, task might change cgroup and later we might call
 mem_cgroup_get_page_stat() on a different cgroup altogether which might or
 might not have dirty limits specified?

Correct.

 
 But in what cases you don't want to use memory cgroup specified limit? I
 thought cgroup disabled what the only case where we need to use global
 limits. Otherwise a memory cgroup will have either dirty_bytes specified
 or by default inherit global dirty_ratio which is a valid number. If
 that's the case then you don't have to take rcu_lock() outside
 get_page_stat()?
 
 IOW, apart from cgroup being disabled, what are the other cases where you
 expect to not use cgroup's page stat and use global stats?

At boot, when mem_cgroup_from_task() may return NULL. But this is not
related to the RCU acquisition.

Anyway, probably the RCU protection is not so critical for this
particular case, and we can simply get rid of it. In this way we can
easily implement the interface proposed by Peter.
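
i.e. the callers would end up looking like this (sketch, following Peter's
mem_cgroup_has_dirty_limit() suggestion):

	/* sketch: balance_dirty_pages()-style caller, no extra locking */
	if (mem_cgroup_has_dirty_limit()) {
		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
		nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
	} else {
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				 global_page_state(NR_UNSTABLE_NFS);
		nr_writeback = global_page_state(NR_WRITEBACK);
	}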

-Andrea


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Wed, Mar 03, 2010 at 08:21:07AM +0900, Daisuke Nishimura wrote:
 On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi ari...@develer.com wrote:
  On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote:
   * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-02 
   17:23:16]:
   
On Tue, 2 Mar 2010 09:01:58 +0100
Andrea Righi ari...@develer.com wrote:

 On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
  On Mon,  1 Mar 2010 22:23:40 +0100
  Andrea Righi ari...@develer.com wrote:
  
   Apply the cgroup dirty pages accounting and limiting 
   infrastructure to
   the opportune kernel functions.
   
   Signed-off-by: Andrea Righi ari...@develer.com
  
  Seems nice.
  
  Hmm. the last problem is moving account between memcg.
  
  Right ?
 
 Correct. This was actually the last item of the TODO list. Anyway, I'm
 still considering if it's correct to move dirty pages when a task is
 migrated from a cgroup to another. Currently, dirty pages just remain 
 in
 the original cgroup and are flushed depending on the original cgroup
 settings. That is not totally wrong... at least moving the dirty pages
 between memcgs should be optional (move_charge_at_immigrate?).
 

My concern is 
 - migration between memcg is already suppoted
- at task move
- at rmdir

Then, if you leave DIRTY_PAGE accounting to original cgroup,
the new cgroup (migration target)'s Dirty page accounting may
goes to be negative, or incorrect value. Please check FILE_MAPPED
implementation in __mem_cgroup_move_account()

As
   if (page_mapped(page) && !PageAnon(page)) {
/* Update mapped_file data for mem_cgroup */
preempt_disable();

__this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);

__this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]);
preempt_enable();
}
then, FILE_MAPPED never goes negative.
   
   
   Absolutely! I am not sure how complex dirty memory migration will be,
   but one way of working around it would be to disable migration of
   charges when the feature is enabled (dirty* is set in the memory
   cgroup). We might need additional logic to allow that to happen. 
  
  I've started to look at dirty memory migration. First attempt is to add
  DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in
  __mem_cgroup_move_account(). Probably I'll have something ready for the
  next version of the patch. I still need to figure if this can work as
  expected...
  
 I agree it's a right direction(in fact, I have been planning to post a patch
 in that direction), so I leave it to you.
 Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the
 same way as other flags you're trying to add, and we can change
 if (page_mapped(page) && !PageAnon(page)) to if (PageCgroupFileMapped(pc))
 in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO.

OK, sounds good to me. I'll introduce PCG_FILE_MAPPED in the next
version.
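
For completeness, the new flag would just follow the existing page_cgroup
helpers, roughly (sketch):

/* sketch: new page_cgroup flag plus the usual accessors */
enum {
	/* ...existing PCG_* flags... */
	PCG_FILE_MAPPED, /* page is accounted as file mapped */
};

SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
TESTPCGFLAG(FileMapped, FILE_MAPPED)

so that __mem_cgroup_move_account() can test PageCgroupFileMapped(pc) instead
of page_mapped(page) && !PageAnon(page).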

Thanks,
-Andrea


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
 On Wed, 3 Mar 2010 15:15:49 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  Agreed.
  Let's try how we can write a code in clean way. (we have time ;)
  For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
  over killing. What I really want is lockless code...but it seems impossible
  under current implementation.
  
  I wonder the fact the page is never unchareged under us can give us some 
  chances
  ...Hmm.
  
 
 How about this ? Basically, I don't like duplicating information...so,
 # of new pcg_flags may be able to be reduced.
 
 I'm glad this can be a hint for Andrea-san.

Many thanks! I had already written pretty much the same code, but at this
point I think I'll just apply and test this one. ;)

-Andrea

 
 ==
 ---
  include/linux/page_cgroup.h |   44 -
  mm/memcontrol.c |   91 
 +++-
  2 files changed, 132 insertions(+), 3 deletions(-)
 
 Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
 ===
 --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
 +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
 @@ -39,6 +39,11 @@ enum {
   PCG_CACHE, /* charged as cache */
   PCG_USED, /* this object is in use. */
   PCG_ACCT_LRU, /* page has been accounted for */
 + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
 + PCG_ACCT_DIRTY,
 + PCG_ACCT_WB,
 + PCG_ACCT_WB_TEMP,
 + PCG_ACCT_UNSTABLE,
  };
  
  #define TESTPCGFLAG(uname, lname)\
 @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
  TESTPCGFLAG(AcctLRU, ACCT_LRU)
  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
  
 +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
 +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
 +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
 +
 +SETPCGFLAG(AcctWB, ACCT_WB);
 +CLEARPCGFLAG(AcctWB, ACCT_WB);
 +TESTPCGFLAG(AcctWB, ACCT_WB);
 +
 +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
 +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
 +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
 +
 +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
 +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
 +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
 +
 +
  static inline int page_cgroup_nid(struct page_cgroup *pc)
  {
   return page_to_nid(pc-page);
 @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
  {
   return page_zonenum(pc-page);
  }
 -
 +/*
 + * lock_page_cgroup() should not be held under mapping-tree_lock
 + */
  static inline void lock_page_cgroup(struct page_cgroup *pc)
  {
   bit_spin_lock(PCG_LOCK, pc-flags);
 @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
   bit_spin_unlock(PCG_LOCK, pc-flags);
  }
  
 +/*
 + * Lock order is
 + *   lock_page_cgroup()
 + *   lock_page_cgroup_migrate()
 + * This lock is not be lock for charge/uncharge but for account moving.
 + * i.e. overwrite pc-mem_cgroup. The lock owner should guarantee by itself
 + * the page is uncharged while we hold this.
 + */
 +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
 +{
 + bit_spin_lock(PCG_MIGRATE_LOCK, pc-flags);
 +}
 +
 +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
 +{
 + bit_spin_unlock(PCG_MIGRATE_LOCK, pc-flags);
 +}
 +
  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
  struct page_cgroup;
  
 Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
 ===
 --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
 +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
 @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
   MEM_CGROUP_EVENTS,  /* incremented at every  pagein/pageout */
 + MEM_CGROUP_STAT_DIRTY,
 + MEM_CGROUP_STAT_WBACK,
 + MEM_CGROUP_STAT_WBACK_TEMP,
 + MEM_CGROUP_STAT_UNSTABLE_NFS,
  
   MEM_CGROUP_STAT_NSTATS,
  };
 @@ -1360,6 +1364,86 @@ done:
  }
  
  /*
 + * Update file cache's status for memcg. Before calling this,
 + * mapping-tree_lock should be held and preemption is disabled.
 + * Then, it's guarnteed that the page is not uncharged while we
 + * access page_cgroup. We can make use of that.
 + */
 +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
 +{
 + struct page_cgroup *pc;
 + struct mem_cgroup *mem;
 +
 + pc = lookup_page_cgroup(page);
 + /* Not accounted ? */
 + if (!PageCgroupUsed(pc))
 + return;
 + lock_page_cgroup_migrate(pc);
 + /*
 +  * It's guarnteed that this page is never uncharged.
 +  * The only racy problem is moving account among memcgs.
 +  */
 + switch (idx) {
 + case MEM_CGROUP_STAT_DIRTY:
 + if (set)
 + SetPageCgroupAcctDirty(pc);
 + else
 + ClearPageCgroupAcctDirty(pc);
 + 

[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Wed, Mar 03, 2010 at 12:47:03PM +0100, Andrea Righi wrote:
 On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote:
  On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote:
   On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote:
On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote:
 On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:
   @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 */
dirty_thresh += dirty_thresh / 10;  /* 
   wh... */

   -if (global_page_state(NR_UNSTABLE_NFS) +
   - global_page_state(NR_WRITEBACK) >= dirty_thresh)
   - break;
   -congestion_wait(BLK_RW_ASYNC, HZ/10);
   +
   + dirty = 
   mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
   + if (dirty < 0)
   + dirty = global_page_state(NR_UNSTABLE_NFS) +
   + global_page_state(NR_WRITEBACK);
  
  dirty is unsigned long. As mentioned last time, above will never be 
  true?
  In general these patches look ok to me. I will do some testing with 
  these.
 
 Re-introduced the same bug. My bad. :(
 
 The value returned from mem_cgroup_page_stat() can be negative, i.e.
 when memory cgroup is disabled. We could simply use a long for dirty,
 the unit is in # of pages so s64 should be enough. Or cast dirty to 
 long
 only for the check (see below).
 
 Thanks!
 -Andrea
 
 Signed-off-by: Andrea Righi ari...@develer.com
 ---
  mm/page-writeback.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/mm/page-writeback.c b/mm/page-writeback.c
 index d83f41c..dbee976 100644
 --- a/mm/page-writeback.c
 +++ b/mm/page-writeback.c
 @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
  
  
   dirty = 
 mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
 - if (dirty < 0)
 + if ((long)dirty < 0)

This will also be problematic as on 32bit systems, your uppper limit of
dirty memory will be 2G?

I guess, I will prefer one of the two.

- return the error code from function and pass a pointer to store stats
  in as function argument.

- Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if
  per cgroup dirty control is enabled, then use per cgroup stats. In 
that
  case you don't have to return negative values.

  The only tricky part will be careful accounting so that none of the stats go
  negative in corner cases of migration etc.
   
   What do you think about Peter's suggestion + the locking stuff? (see the
   previous email). Otherwise, I'll choose the other solution, passing a
   pointer and always return the error code is not bad.
   
  
  Ok, so you are worried that by the time we finish 
  mem_cgroup_has_dirty_limit()
  call, task might change cgroup and later we might call
  mem_cgroup_get_page_stat() on a different cgroup altogether which might or
  might not have dirty limits specified?
 
 Correct.
 
  
  But in what cases do you not want to use the memory cgroup's specified limit? I
  thought cgroup being disabled was the only case where we need to use global
  limits. Otherwise a memory cgroup will have either dirty_bytes specified
  or by default inherit global dirty_ratio which is a valid number. If
  that's the case then you don't have to take rcu_lock() outside
  get_page_stat()?
  
  IOW, apart from cgroup being disabled, what are the other cases where you
  expect to not use cgroup's page stat and use global stats?
 
 At boot, when mem_cgroup_from_task() may return NULL. But this is not
 related to the RCU acquisition.

Never mind, you're right. In any case, even if a task is migrated to a
different cgroup it will always have mem_cgroup_has_dirty_limit() ==
true.

So RCU protection is not needed outside these functions.

OK, I'll go with Peter's suggestion.
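
For reference, a minimal sketch of what the check inside throttle_vm_writeout()'s
retry loop could then look like (helper names follow this thread, not
necessarily the final patch):

	unsigned long dirty;

	/*
	 * Use the per-cgroup counters only when a per-cgroup dirty limit
	 * is in effect, otherwise fall back to the global counters. No
	 * negative return value is needed, so dirty can stay unsigned.
	 */
	if (mem_cgroup_has_dirty_limit())
		dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
	else
		dirty = global_page_state(NR_UNSTABLE_NFS) +
			global_page_state(NR_WRITEBACK);

	if (dirty <= dirty_thresh)
		break;
	congestion_wait(BLK_RW_ASYNC, HZ/10);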

Thanks!
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Wed, Mar 03, 2010 at 11:07:35AM +0100, Peter Zijlstra wrote:
 On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote:
  
  I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under
  RCU, so something like:
  
  rcu_read_lock();
  if (mem_cgroup_has_dirty_limit())
  mem_cgroup_get_page_stat()
  else
  global_page_state()
  rcu_read_unlock();
  
  That is bad when mem_cgroup_has_dirty_limit() always returns false
  (e.g., when memory cgroups are disabled). So I fall back to the old
  interface.
 
 Why is it that mem_cgroup_has_dirty_limit() needs RCU when
 mem_cgroup_get_page_stat() doesn't? That is, simply make
 mem_cgroup_has_dirty_limit() not require RCU in the same way
 *_get_page_stat() doesn't either.

OK, I agree we can get rid of RCU protection here (see my previous
email).

BTW the point was that after mem_cgroup_has_dirty_limit() the task might
be moved to another cgroup, but even in this case mem_cgroup_has_dirty_limit()
will always be true, so mem_cgroup_get_page_stat() is always coherent.
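
FWIW, a rough sketch of what the helper itself could boil down to, with the RCU
taken only internally (illustrative only, not the final implementation):

	bool mem_cgroup_has_dirty_limit(void)
	{
		struct mem_cgroup *memcg;
		bool ret;

		/* global limits apply when the controller is disabled */
		if (mem_cgroup_disabled())
			return false;
		rcu_read_lock();
		memcg = mem_cgroup_from_task(current);
		/* memcg can only be NULL very early during boot */
		ret = memcg != NULL;
		rcu_read_unlock();
		return ret;
	}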

 
  What do you think about:
  
  mem_cgroup_lock();
  if (mem_cgroup_has_dirty_limit())
  mem_cgroup_get_page_stat()
  else
  global_page_state()
  mem_cgroup_unlock();
  
  Where mem_cgroup_read_lock/unlock() simply expand to nothing when
  memory cgroups are disabled.
 
 I think you're engineering the wrong way around.
 
   
   That allows for a 0 dirty limit (which should work and basically makes
   all io synchronous).
  
  IMHO it is better to reserve 0 for the special value disabled like the
  global settings. A synchronous IO can be also achieved using a dirty
  limit of 1.
 
 Why?! 0 clearly states no writeback cache, IOW sync writes, a 1
 byte/page writeback cache effectively reduces to the same thing, but it's
 not the same thing conceptually. If you want to put the size and enable
 into a single variable pick -1 for disable or so.

I might agree, and actually I prefer this solution... but in this way we
would use a different interface with respect to the equivalent vm_dirty_ratio
/ vm_dirty_bytes global settings (as well as dirty_background_ratio /
dirty_background_bytes).

IMHO it's better to use the same interface to avoid user
misunderstandings.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote:
 On Wed, 3 Mar 2010 15:15:49 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  Agreed.
  Let's try how we can write a code in clean way. (we have time ;)
  For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little
  over killing. What I really want is lockless code...but it seems impossible
  under current implementation.
  
  I wonder the fact the page is never unchareged under us can give us some 
  chances
  ...Hmm.
  
 
 How about this ? Basically, I don't like duplicating information...so,
 # of new pcg_flags may be able to be reduced.
 
 I'm glad this can be a hint for Andrea-san.
 
 ==
 ---
  include/linux/page_cgroup.h |   44 -
  mm/memcontrol.c |   91 
 +++-
  2 files changed, 132 insertions(+), 3 deletions(-)
 
 Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
 ===
 --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h
 +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h
 @@ -39,6 +39,11 @@ enum {
   PCG_CACHE, /* charged as cache */
   PCG_USED, /* this object is in use. */
   PCG_ACCT_LRU, /* page has been accounted for */
 + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
 + PCG_ACCT_DIRTY,
 + PCG_ACCT_WB,
 + PCG_ACCT_WB_TEMP,
 + PCG_ACCT_UNSTABLE,
  };
  
  #define TESTPCGFLAG(uname, lname)\
 @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
  TESTPCGFLAG(AcctLRU, ACCT_LRU)
  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
  
 +SETPCGFLAG(AcctDirty, ACCT_DIRTY);
 +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY);
 +TESTPCGFLAG(AcctDirty, ACCT_DIRTY);
 +
 +SETPCGFLAG(AcctWB, ACCT_WB);
 +CLEARPCGFLAG(AcctWB, ACCT_WB);
 +TESTPCGFLAG(AcctWB, ACCT_WB);
 +
 +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
 +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
 +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP);
 +
 +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
 +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
 +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE);
 +
 +
  static inline int page_cgroup_nid(struct page_cgroup *pc)
  {
   return page_to_nid(pc-page);
 @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup
  {
   return page_zonenum(pc-page);
  }
 -
 +/*
 + * lock_page_cgroup() should not be held under mapping->tree_lock
 + */
  static inline void lock_page_cgroup(struct page_cgroup *pc)
  {
   bit_spin_lock(PCG_LOCK, pc-flags);
 @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st
   bit_spin_unlock(PCG_LOCK, pc-flags);
  }
  
 +/*
 + * Lock order is
 + *   lock_page_cgroup()
 + *   lock_page_cgroup_migrate()
 + * This lock is not a lock for charge/uncharge but for account moving,
 + * i.e. overwriting pc->mem_cgroup. The lock owner should itself guarantee
 + * that the page is not uncharged while we hold this.
 + */
 +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
 +{
 + bit_spin_lock(PCG_MIGRATE_LOCK, pc-flags);
 +}
 +
 +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
 +{
 + bit_spin_unlock(PCG_MIGRATE_LOCK, pc-flags);
 +}
 +
  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
  struct page_cgroup;
  
 Index: mmotm-2.6.33-Mar2/mm/memcontrol.c
 ===
 --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c
 +++ mmotm-2.6.33-Mar2/mm/memcontrol.c
 @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index {
   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
   MEM_CGROUP_EVENTS,  /* incremented at every  pagein/pageout */
 + MEM_CGROUP_STAT_DIRTY,
 + MEM_CGROUP_STAT_WBACK,
 + MEM_CGROUP_STAT_WBACK_TEMP,
 + MEM_CGROUP_STAT_UNSTABLE_NFS,
  
   MEM_CGROUP_STAT_NSTATS,
  };
 @@ -1360,6 +1364,86 @@ done:
  }
  
  /*
 + * Update file cache's status for memcg. Before calling this,
 + * mapping->tree_lock should be held and preemption is disabled.
 + * Then, it's guaranteed that the page is not uncharged while we
 + * access page_cgroup. We can make use of that.
 + */
 +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set)
 +{
 + struct page_cgroup *pc;
 + struct mem_cgroup *mem;
 +
 + pc = lookup_page_cgroup(page);
 + /* Not accounted ? */
 + if (!PageCgroupUsed(pc))
 + return;
 + lock_page_cgroup_migrate(pc);
 + /*
 +  * It's guaranteed that this page is never uncharged.
 +  * The only racy problem is moving account among memcgs.
 +  */
 + switch (idx) {
 + case MEM_CGROUP_STAT_DIRTY:
 + if (set)
 + SetPageCgroupAcctDirty(pc);
 + else
 + ClearPageCgroupAcctDirty(pc);
 + break;
 + case MEM_CGROUP_STAT_WBACK:
 + if (set)
 + SetPageCgroupAcctWB(pc);
 + 

[Devel] [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.

Changelog (v3 -> v4)
~~~~~~~~~~~~~~~~~~~~
 * handle the migration of tasks across different cgroups
   NOTE: at the moment we don't move charges of file cache pages, so this
   functionality is not immediately necessary. However, since the migration of
   file cache pages is planned, it is better to start handling file pages
   anyway.
 * properly account dirty pages in nilfs2
   (thanks to Kirill A. Shutemov kir...@shutemov.name)
 * lockless access to dirty memory parameters
 * fix: page_cgroup lock must not be acquired under mapping->tree_lock
   (thanks to Daisuke Nishimura nishim...@mxp.nes.nec.co.jp and
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com)
 * code restyling

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 1/4] memcg: dirty memory documentation

2010-03-30 Thread Andrea Righi
Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/cgroups/memory.txt |   36 
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 49f86f3..38ca499 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -310,6 +310,11 @@ cache  - # of bytes of page cache memory.
 rss- # of bytes of anonymous and swap cache memory.
 pgpgin - # of pages paged in (equivalent to # of charging events).
 pgpgout- # of pages paged out (equivalent to # of uncharging 
events).
+filedirty  - # of pages that are waiting to get written back to the disk.
+writeback  - # of pages that are actively being written back to the disk.
+writeback_tmp  - # of pages used by FUSE for temporary writeback buffers.
+nfs- # of NFS pages sent to the server, but not yet committed to
+ the actual storage.
 active_anon- # of bytes of anonymous and  swap cache memory on active
  lru list.
 inactive_anon  - # of bytes of anonymous memory and swap cache memory on
@@ -345,6 +350,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given 
time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup 
writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+amount of dirty memory at which a process which is generating disk writes
+inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+bytes) at which a process generating disk writes will start itself writing
+out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+memory, the amount of dirty memory at which background writeback kernel
+threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+bytes) at which background writeback kernel threads will start writing out
+dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags

2010-03-30 Thread Andrea Righi
Introduce page_cgroup flags to keep track of file cache pages.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/page_cgroup.h |   49 +++
 1 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..1b79ded 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -39,6 +39,12 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
+   PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
+   PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/
+   PCG_ACCT_DIRTY, /* page is dirty */
+   PCG_ACCT_WRITEBACK, /* page is being written back to disk */
+   PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+   PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
 };
 
 #define TESTPCGFLAG(uname, lname)  \
@@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+/* File cache and dirty memory flags */
+TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+
+TESTPCGFLAG(Dirty, ACCT_DIRTY)
+SETPCGFLAG(Dirty, ACCT_DIRTY)
+CLEARPCGFLAG(Dirty, ACCT_DIRTY)
+
+TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
+SETPCGFLAG(Writeback, ACCT_WRITEBACK)
+CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
+
+TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+
+TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
return page_to_nid(pc-page);
@@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct 
page_cgroup *pc)
return page_zonenum(pc-page);
 }
 
+/*
+ * lock_page_cgroup() should not be held under mapping->tree_lock
+ */
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_lock(PCG_LOCK, pc-flags);
@@ -93,6 +123,25 @@ static inline void unlock_page_cgroup(struct page_cgroup 
*pc)
bit_spin_unlock(PCG_LOCK, pc-flags);
 }
 
+/*
+ * Lock order is
+ * lock_page_cgroup()
+ * lock_page_cgroup_migrate()
+ *
+ * This lock is not a lock for charge/uncharge but for account moving,
+ * i.e. overwriting pc->mem_cgroup. The lock owner should itself guarantee
+ * that the page is not uncharged while we hold this.
+ */
+static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+   bit_spin_lock(PCG_MIGRATE_LOCK, pc-flags);
+}
+
+static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+   bit_spin_unlock(PCG_MIGRATE_LOCK, pc-flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, 
memory.dirty_background_bytes

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |   80 -
 mm/memcontrol.c|  420 +++-
 2 files changed, 450 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..cc3421b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,66 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include linux/writeback.h
 #include linux/cgroup.h
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+   MEMCG_NR_DIRTYABLE_PAGES,
+   MEMCG_NR_RECLAIM_PAGES,
+   MEMCG_NR_WRITEBACK,
+   MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct dirty_param {
+   int dirty_ratio;
+   unsigned long dirty_bytes;
+   int dirty_background_ratio;
+   unsigned long dirty_background_bytes;
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+   /*
+* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+*/
+   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
+   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
+   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
+   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+   MEM_CGROUP_EVENTS,  /* incremented at every  pagein/pageout */
+   MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+   MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+   MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+   temporary buffers */
+   MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+   MEM_CGROUP_STAT_NSTATS,
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_dirty_param(struct dirty_param *param)
+{
+   param-dirty_ratio = vm_dirty_ratio;
+   param-dirty_bytes = vm_dirty_bytes;
+   param-dirty_background_ratio = dirty_background_ratio;
+   param-dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern int do_swap_account;
 #endif
 
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_dirty_param(struct dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
if (mem_cgroup_subsys.disabled)
@@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+   enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
@@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct 
task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-   int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+   enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return 0;
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+   return false;
+}
+
+static inline void get_dirty_param(struct dirty_param *param)
+{
+   get_global_dirty_param(param);
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+   return -ENOSYS;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 497b6f7..9842e7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for 
remember boot option*/
 #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
 #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
 
-/*
- * Statistics for memory cgroup.
- */
-enum

[Devel] [PATCH -mmotm 4/4] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
Apply the cgroup dirty pages accounting and limiting infrastructure
to the opportune kernel functions.

Signed-off-by: Andrea Righi ari...@develer.com
---
 fs/fuse/file.c  |5 +++
 fs/nfs/write.c  |4 ++
 fs/nilfs2/segment.c |   11 +-
 mm/filemap.c|1 +
 mm/page-writeback.c |   91 ++-
 mm/rmap.c   |4 +-
 mm/truncate.c   |2 +
 7 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include linux/pagemap.h
 #include linux/slab.h
 #include linux/kernel.h
+#include linux/memcontrol.h
 #include linux/sched.h
 #include linux/module.h
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, 
struct fuse_req *req)
 
list_del(req-writepages_entry);
dec_bdi_stat(bdi, BDI_WRITEBACK);
+   mem_cgroup_update_stat(req-pages[0],
+   MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
wake_up(fi-page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
req-inode = inode;
 
inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
+   mem_cgroup_update_stat(tmp_page,
+   MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
req-wb_index,
NFS_PAGE_TAG_COMMIT);
spin_unlock(inode-i_lock);
+   mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req-wb_page;
 
if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
+   mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE);
return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
*head, int how)
req = nfs_list_entry(head-next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+   mem_cgroup_update_stat(req-wb_page,
+   MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..27a01b1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -24,6 +24,7 @@
 #include linux/pagemap.h
 #include linux/buffer_head.h
 #include linux/writeback.h
+#include linux/memcontrol.h
 #include linux/bio.h
 #include linux/completion.h
 #include linux/blkdev.h
@@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, 
struct list_head *out)
} while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head);
kunmap_atomic(kaddr, KM_USER0);
 
-   if (!TestSetPageWriteback(clone_page))
+   if (!TestSetPageWriteback(clone_page)) {
+   mem_cgroup_update_stat(clone_page,
+   MEM_CGROUP_STAT_WRITEBACK, 1);
inc_zone_page_state(clone_page, NR_WRITEBACK);
+   }
unlock_page(clone_page);
 
return 0;
@@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int 
err)
}
 
if (buffer_nilfs_allocated(page_buffers(page))) {
-   if (TestClearPageWriteback(page))
+   if (TestClearPageWriteback(page)) {
+   mem_cgroup_update_stat(page,
+   MEM_CGROUP_STAT_WRITEBACK, -1);
dec_zone_page_state(page, NR_WRITEBACK);
+   }
} else
end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 * having removed the page entirely.
 */
if (PageDirty(page)  mapping_cap_account_dirty(mapping)) {
+   mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY);
}
diff --git

[Devel] Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Thu, Mar 04, 2010 at 11:18:28AM -0500, Vivek Goyal wrote:
 On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:
 
 [..]
  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index 5a0f8f3..c5d14ea 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
*/
   static int calc_period_shift(void)
   {
  +   struct dirty_param dirty_param;
  unsigned long dirty_total;
   
  -   if (vm_dirty_bytes)
  -   dirty_total = vm_dirty_bytes / PAGE_SIZE;
  +   get_dirty_param(dirty_param);
  +
  +   if (dirty_param.dirty_bytes)
  +   dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
  else
  -   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
  -   100;
  +   dirty_total = (dirty_param.dirty_ratio *
  +   determine_dirtyable_memory()) / 100;
  return 2 + ilog2(dirty_total - 1);
   }
   
  @@ -408,41 +411,46 @@ static unsigned long 
  highmem_dirtyable_memory(unsigned long total)
*/
   unsigned long determine_dirtyable_memory(void)
   {
  -   unsigned long x;
  -
  -   x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
  +   unsigned long memory;
  +   s64 memcg_memory;
   
  +   memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
  if (!vm_highmem_is_dirtyable)
  -   x -= highmem_dirtyable_memory(x);
  -
  -   return x + 1;   /* Ensure that we never return 0 */
  +   memory -= highmem_dirtyable_memory(memory);
  +   if (mem_cgroup_has_dirty_limit())
  +   return memory + 1;
 
 Should above be?
   if (!mem_cgroup_has_dirty_limit())
   return memory + 1;

Very true.

I'll post another patch with this and Kirill's fixes.
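
For clarity, with that fix determine_dirtyable_memory() would end up looking
roughly like this (just a sketch of the agreed change, the actual patch may
differ in details):

	unsigned long determine_dirtyable_memory(void)
	{
		unsigned long memory;
		s64 memcg_memory;

		memory = global_page_state(NR_FREE_PAGES) +
				global_reclaimable_pages();
		if (!vm_highmem_is_dirtyable)
			memory -= highmem_dirtyable_memory(memory);
		/* no per-cgroup dirty limit: use the global value */
		if (!mem_cgroup_has_dirty_limit())
			return memory + 1;	/* never return 0 */
		memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
		return min((unsigned long)memcg_memory, memory + 1);
	}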

Thanks,
-Andrea

 
 Vivek
 
  +   memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
  +   return min((unsigned long)memcg_memory, memory + 1);
   }
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)

2010-03-30 Thread Andrea Righi
On Thu, Mar 04, 2010 at 10:41:43PM +0530, Balbir Singh wrote:
 * Andrea Righi ari...@develer.com [2010-03-04 11:40:11]:
 
  Control the maximum amount of dirty pages a cgroup can have at any given 
  time.
  
  Per cgroup dirty limit is like fixing the max amount of dirty (hard to 
  reclaim)
  page cache used by any cgroup. So, in case of multiple cgroup writers, they
  will not be able to consume more than their designated share of dirty pages 
  and
  will be forced to perform write-out if they cross that limit.
  
  The overall design is the following:
  
   - account dirty pages per cgroup
   - limit the number of dirty pages via memory.dirty_ratio / 
  memory.dirty_bytes
 and memory.dirty_background_ratio / memory.dirty_background_bytes in
 cgroupfs
   - start to write-out (background or actively) when the cgroup limits are
 exceeded
  
  This feature is supposed to be strictly connected to any underlying IO
  controller implementation, so we can stop increasing dirty pages in VM layer
  and enforce a write-out before any cgroup will consume the global amount of
  dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
  /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
  
  Changelog (v3 - v4)
  ~~
   * handle the migration of tasks across different cgroups
 NOTE: at the moment we don't move charges of file cache pages, so this
 functionality is not immediately necessary. However, since the migration 
  of
 file cache pages is in plan, it is better to start handling file pages
 anyway.
   * properly account dirty pages in nilfs2
 (thanks to Kirill A. Shutemov kir...@shutemov.name)
   * lockless access to dirty memory parameters
   * fix: page_cgroup lock must not be acquired under mapping-tree_lock
 (thanks to Daisuke Nishimura nishim...@mxp.nes.nec.co.jp and
  KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com)
   * code restyling
 
 
 This seems to be converging, what sort of tests are you running on
 this patchset? 

A very simple test at the moment, just some parallel dd's running in
different cgroups. For example:

 - cgroup A: low dirty limits (writes are almost sync)
   echo 1000 > /cgroups/A/memory.dirty_bytes
   echo 1000 > /cgroups/A/memory.dirty_background_bytes

 - cgroup B: high dirty limits (writes are all buffered in page cache)
   echo 100 > /cgroups/B/memory.dirty_ratio
   echo 50 > /cgroups/B/memory.dirty_background_ratio

Then run the dd's and look at memory.stat:
  - cgroup A: # dd if=/dev/zero of=A bs=1M count=1000
  - cgroup B: # dd if=/dev/zero of=B bs=1M count=1000

A random snapshot during the writes:

# grep dirty\|writeback /cgroups/[AB]/memory.stat
/cgroups/A/memory.stat:filedirty 0
/cgroups/A/memory.stat:writeback 0
/cgroups/A/memory.stat:writeback_tmp 0
/cgroups/A/memory.stat:dirty_pages 0
/cgroups/A/memory.stat:writeback_pages 0
/cgroups/A/memory.stat:writeback_temp_pages 0
/cgroups/B/memory.stat:filedirty 67226
/cgroups/B/memory.stat:writeback 136
/cgroups/B/memory.stat:writeback_tmp 0
/cgroups/B/memory.stat:dirty_pages 67226
/cgroups/B/memory.stat:writeback_pages 136
/cgroups/B/memory.stat:writeback_temp_pages 0

I plan to run more detailed IO benchmark soon.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Thu, Mar 04, 2010 at 02:41:44PM -0500, Vivek Goyal wrote:
 On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:
 
 [..]
  diff --git a/mm/page-writeback.c b/mm/page-writeback.c
  index 5a0f8f3..c5d14ea 100644
  --- a/mm/page-writeback.c
  +++ b/mm/page-writeback.c
  @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
*/
   static int calc_period_shift(void)
   {
  +   struct dirty_param dirty_param;
  unsigned long dirty_total;
   
  -   if (vm_dirty_bytes)
  -   dirty_total = vm_dirty_bytes / PAGE_SIZE;
  +   get_dirty_param(dirty_param);
  +
  +   if (dirty_param.dirty_bytes)
  +   dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
  else
  -   dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
  -   100;
  +   dirty_total = (dirty_param.dirty_ratio *
  +   determine_dirtyable_memory()) / 100;
  return 2 + ilog2(dirty_total - 1);
   }
   
 
 Hmm.., I have been staring at this for some time and I think something is
 wrong. I don't fully understand the way floating proportions are working
 but this function seems to be calculating the period over which we need
 to measure the proportions. (vm_completion proportion and vm_dirties
 proportions).
 
 And we update this period (shift) when the admin updates dirty_ratio or dirty_bytes
 etc. In that case we recalculate the global dirty limit and take log2 and
 use that as period over which we monitor and calculate proportions.
 
 If yes, then it should be global and not per cgroup (because all our 
 accounting of bdi completion is global and not per cgroup).
 
 PeterZ, can tell us more about it. I am just raising the flag here to be
 sure.
 
 Thanks
 Vivek

Hi Vivek,

I tend to agree, we must use global dirty values here.

BTW, update_completion_period() is called from the dirty_* handlers, so it
makes no sense to use the current memcg there. That's the memcg where the
admin is running, so probably it's the root memcg almost all the time,
but it's wrong in principle. In conclusion this patch shouldn't touch
calc_period_shift().
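
In other words calc_period_shift() should simply keep using the global
parameters, roughly as in the current upstream code (sketch only):

	static int calc_period_shift(void)
	{
		unsigned long dirty_total;

		/*
		 * Keep the global settings here: the vm_completions /
		 * vm_dirties proportions are accounted globally, not
		 * per cgroup.
		 */
		if (vm_dirty_bytes)
			dirty_total = vm_dirty_bytes / PAGE_SIZE;
		else
			dirty_total = (vm_dirty_ratio *
					determine_dirtyable_memory()) / 100;
		return 2 + ilog2(dirty_total - 1);
	}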

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Fri, Mar 05, 2010 at 10:12:34AM +0900, Daisuke Nishimura wrote:
 On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi ari...@develer.com wrote:
  Infrastructure to account dirty pages per cgroup and add dirty limit
  interfaces in the cgroupfs:
  
   - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
  
   - Background write-out: memory.dirty_background_ratio, 
  memory.dirty_background_bytes
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   include/linux/memcontrol.h |   80 -
   mm/memcontrol.c|  420 
  +++-
   2 files changed, 450 insertions(+), 50 deletions(-)
  
  diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
  index 1f9b119..cc3421b 100644
  --- a/include/linux/memcontrol.h
  +++ b/include/linux/memcontrol.h
  @@ -19,12 +19,66 @@
   
   #ifndef _LINUX_MEMCONTROL_H
   #define _LINUX_MEMCONTROL_H
  +
  +#include linux/writeback.h
   #include linux/cgroup.h
  +
   struct mem_cgroup;
   struct page_cgroup;
   struct page;
   struct mm_struct;
   
  +/* Cgroup memory statistics items exported to the kernel */
  +enum mem_cgroup_page_stat_item {
  +   MEMCG_NR_DIRTYABLE_PAGES,
  +   MEMCG_NR_RECLAIM_PAGES,
  +   MEMCG_NR_WRITEBACK,
  +   MEMCG_NR_DIRTY_WRITEBACK_PAGES,
  +};
  +
  +/* Dirty memory parameters */
  +struct dirty_param {
  +   int dirty_ratio;
  +   unsigned long dirty_bytes;
  +   int dirty_background_ratio;
  +   unsigned long dirty_background_bytes;
  +};
  +
  +/*
  + * Statistics for memory cgroup.
  + */
  +enum mem_cgroup_stat_index {
  +   /*
  +* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
  +*/
  +   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
  +   MEM_CGROUP_STAT_RSS,   /* # of pages charged as anon rss */
  +   MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
  +   MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
  +   MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
  +   MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
  +   MEM_CGROUP_EVENTS,  /* incremented at every  pagein/pageout */
  +   MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
  +   MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
  +   MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
  +   temporary buffers */
  +   MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
  +
  +   MEM_CGROUP_STAT_NSTATS,
  +};
  +
 I must have said it earlier, but I don't think exporting all of these flags
 is a good idea.
 Can you export only mem_cgroup_page_stat_item(of course, need to add 
 MEMCG_NR_FILE_MAPPED)?
 We can translate mem_cgroup_page_stat_item to mem_cgroup_stat_index by simple 
 arithmetic
 if you define MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS 
 sequentially.

Agreed.
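
Something like the following, I guess (sketch only: item names as suggested
above, assuming the MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS
block is defined sequentially and in the same order):

	/* kernel-visible file cache items, mirroring the internal indexes */
	enum mem_cgroup_page_stat_item {
		MEMCG_NR_FILE_MAPPED,
		MEMCG_NR_FILE_DIRTY,
		MEMCG_NR_FILE_WRITEBACK,
		MEMCG_NR_FILE_WRITEBACK_TEMP,
		MEMCG_NR_FILE_UNSTABLE_NFS,
	};

	/* simple arithmetic translation to the internal stat index */
	static inline enum mem_cgroup_stat_index
	page_stat_to_stat_index(enum mem_cgroup_page_stat_item item)
	{
		return MEM_CGROUP_STAT_FILE_MAPPED + (int)item;
	}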

 
  +/*
  + * TODO: provide a validation check routine. And retry if validation
  + * fails.
  + */
  +static inline void get_global_dirty_param(struct dirty_param *param)
  +{
  +   param-dirty_ratio = vm_dirty_ratio;
  +   param-dirty_bytes = vm_dirty_bytes;
  +   param-dirty_background_ratio = dirty_background_ratio;
  +   param-dirty_background_bytes = dirty_background_bytes;
  +}
  +
   #ifdef CONFIG_CGROUP_MEM_RES_CTLR
   /*
* All charge functions with gfp_mask should use GFP_KERNEL or
  @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct 
  mem_cgroup *memcg,
   extern int do_swap_account;
   #endif
   
  +extern bool mem_cgroup_has_dirty_limit(void);
  +extern void get_dirty_param(struct dirty_param *param);
  +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
  +
   static inline bool mem_cgroup_disabled(void)
   {
  if (mem_cgroup_subsys.disabled)
  @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
   }
   
   extern bool mem_cgroup_oom_called(struct task_struct *task);
  -void mem_cgroup_update_file_mapped(struct page *page, int val);
  +void mem_cgroup_update_stat(struct page *page,
  +   enum mem_cgroup_stat_index idx, int val);
   unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
  gfp_t gfp_mask, int nid,
  int zid);
  @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
  struct task_struct *p)
   {
   }
   
  -static inline void mem_cgroup_update_file_mapped(struct page *page,
  -   int val)
  +static inline void mem_cgroup_update_stat(struct page *page,
  +   enum mem_cgroup_stat_index idx, int val)
   {
   }
   
  @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct 
  zone *zone, int order,
  return 0;
   }
   
  +static inline bool mem_cgroup_has_dirty_limit(void)
  +{
  +   return false;
  +}
  +
  +static inline void get_dirty_param(struct

[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Fri, Mar 05, 2010 at 10:58:55AM +0900, KAMEZAWA Hiroyuki wrote:
 On Fri, 5 Mar 2010 10:12:34 +0900
 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
 
  On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi ari...@develer.com wrote:
   Infrastructure to account dirty pages per cgroup and add dirty limit
static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void 
   *data)
{
 int *val = data;
   @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup 
   *mem)
}

/*
   - * Currently used to update mapped file statistics, but the routine can 
   be
   - * generalized to update other statistics as well.
   + * Generalized routine to update file cache's status for memcg.
   + *
   + * Before calling this, mapping-tree_lock should be held and preemption 
   is
   + * disabled.  Then, it's guarnteed that the page is not uncharged while 
   we
   + * access page_cgroup. We can make use of that.
 */
  IIUC, mapping->tree_lock is held with irq disabled, so I think mapping->tree_lock
  should be held with irq disabled would be enough.
  And, as far as I can see, callers of this function have not ensured this 
  yet in [4/4].
  
  how about:
  
  void mem_cgroup_update_stat_locked(...)
  {
  ...
  }
  
  void mem_cgroup_update_stat_unlocked(mapping, ...)
  {
  spin_lock_irqsave(&mapping->tree_lock, ...);
  mem_cgroup_update_stat_locked();
  spin_unlock_irqrestore(...);
  }
 
 Rather than tree_lock, lock_page_cgroup() can be used if tree_lock is not 
 held.
 
   lock_page_cgroup();
   mem_cgroup_update_stat_locked();
   unlock_page_cgroup();
 
 Andrea-san, FILE_MAPPED is updated without treelock, at least. You can't 
 depend
 on migration_lock about FILE_MAPPED.

Right. I'll consider this in the next version of the patch.
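
As a sketch of that split (names are illustrative; whether lock_page_cgroup()
or mapping->tree_lock is the right outer lock is exactly what is being
discussed here):

	/*
	 * Convenience wrapper for callers that do not hold any lock yet;
	 * the *_locked variant does the actual counter/flag update.
	 */
	void mem_cgroup_update_stat_unlocked(struct page *page, int idx, bool set)
	{
		struct page_cgroup *pc;

		if (mem_cgroup_disabled())
			return;
		pc = lookup_page_cgroup(page);
		if (unlikely(!pc))
			return;
		lock_page_cgroup(pc);
		mem_cgroup_update_stat_locked(page, idx, set);
		unlock_page_cgroup(pc);
	}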

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags

2010-03-30 Thread Andrea Righi
On Fri, Mar 05, 2010 at 12:02:49PM +0530, Balbir Singh wrote:
 * Andrea Righi ari...@develer.com [2010-03-04 11:40:13]:
 
  Introduce page_cgroup flags to keep track of file cache pages.
  
  Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
 
 Looks good
 
 
 Acked-by: Balbir Singh bal...@linux.vnet.ibm.com
  
 
   include/linux/page_cgroup.h |   49 
  +++
   1 files changed, 49 insertions(+), 0 deletions(-)
  
  diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
  index 30b0813..1b79ded 100644
  --- a/include/linux/page_cgroup.h
  +++ b/include/linux/page_cgroup.h
  @@ -39,6 +39,12 @@ enum {
  PCG_CACHE, /* charged as cache */
  PCG_USED, /* this object is in use. */
  PCG_ACCT_LRU, /* page has been accounted for */
  +   PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
  +   PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/
  +   PCG_ACCT_DIRTY, /* page is dirty */
  +   PCG_ACCT_WRITEBACK, /* page is being written back to disk */
  +   PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
  +   PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
   };
  
   #define TESTPCGFLAG(uname, lname)  \
  @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
   TESTPCGFLAG(AcctLRU, ACCT_LRU)
   TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
  
  +/* File cache and dirty memory flags */
  +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
  +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
  +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
  +
  +TESTPCGFLAG(Dirty, ACCT_DIRTY)
  +SETPCGFLAG(Dirty, ACCT_DIRTY)
  +CLEARPCGFLAG(Dirty, ACCT_DIRTY)
  +
  +TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
  +SETPCGFLAG(Writeback, ACCT_WRITEBACK)
  +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
  +
  +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
  +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
  +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
  +
  +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
  +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
  +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
  +
   static inline int page_cgroup_nid(struct page_cgroup *pc)
   {
  return page_to_nid(pc-page);
  @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct 
  page_cgroup *pc)
  return page_zonenum(pc-page);
   }
  
  +/*
  + * lock_page_cgroup() should not be held under mapping-tree_lock
  + */
 
 Maybe a DEBUG WARN_ON would be appropriate here?

Sounds good. WARN_ON_ONCE()?
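
A rough sketch of what that could look like (since mapping->tree_lock is taken
with IRQs disabled, irqs_disabled() is only an approximate proxy for "called
under tree_lock"; a real patch might prefer a lockdep annotation instead):

	static inline void lock_page_cgroup(struct page_cgroup *pc)
	{
		/*
		 * Must not be called under mapping->tree_lock (see the
		 * comment above); warn once if we appear to be in such
		 * a context.
		 */
		WARN_ON_ONCE(irqs_disabled());
		bit_spin_lock(PCG_LOCK, &pc->flags);
	}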

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
On Fri, Mar 05, 2010 at 12:08:43PM +0530, Balbir Singh wrote:
 * Andrea Righi ari...@develer.com [2010-03-04 11:40:15]:
 
  Apply the cgroup dirty pages accounting and limiting infrastructure
  to the opportune kernel functions.
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   fs/fuse/file.c  |5 +++
   fs/nfs/write.c  |4 ++
   fs/nilfs2/segment.c |   11 +-
   mm/filemap.c|1 +
   mm/page-writeback.c |   91 
  ++-
   mm/rmap.c   |4 +-
   mm/truncate.c   |2 +
   7 files changed, 84 insertions(+), 34 deletions(-)
  
  diff --git a/fs/fuse/file.c b/fs/fuse/file.c
  index a9f5e13..dbbdd53 100644
  --- a/fs/fuse/file.c
  +++ b/fs/fuse/file.c
  @@ -11,6 +11,7 @@
   #include linux/pagemap.h
   #include linux/slab.h
   #include linux/kernel.h
  +#include linux/memcontrol.h
   #include linux/sched.h
   #include linux/module.h
  
  @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn 
  *fc, struct fuse_req *req)
  
  list_del(req-writepages_entry);
  dec_bdi_stat(bdi, BDI_WRITEBACK);
  +   mem_cgroup_update_stat(req-pages[0],
  +   MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
  dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP);
  bdi_writeout_inc(bdi);
  wake_up(fi-page_waitq);
  @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
  req-inode = inode;
  
  inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK);
  +   mem_cgroup_update_stat(tmp_page,
  +   MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
  inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
  end_page_writeback(page);
  
  diff --git a/fs/nfs/write.c b/fs/nfs/write.c
  index b753242..7316f7a 100644
  --- a/fs/nfs/write.c
  +++ b/fs/nfs/write.c
  @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
  req-wb_index,
  NFS_PAGE_TAG_COMMIT);
  spin_unlock(inode-i_lock);
  +   mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
  inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
  inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE);
  __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
  @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
  struct page *page = req-wb_page;
  
  if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
  +   mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
  dec_zone_page_state(page, NR_UNSTABLE_NFS);
  dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE);
  return 1;
  @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
  *head, int how)
  req = nfs_list_entry(head-next);
  nfs_list_remove_request(req);
  nfs_mark_request_commit(req);
  +   mem_cgroup_update_stat(req-wb_page,
  +   MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
  dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
  dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
  BDI_UNSTABLE);
  diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
  index ada2f1b..27a01b1 100644
  --- a/fs/nilfs2/segment.c
  +++ b/fs/nilfs2/segment.c
  @@ -24,6 +24,7 @@
   #include linux/pagemap.h
   #include linux/buffer_head.h
   #include linux/writeback.h
  +#include linux/memcontrol.h
   #include linux/bio.h
   #include linux/completion.h
   #include linux/blkdev.h
  @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, 
  struct list_head *out)
  } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head);
  kunmap_atomic(kaddr, KM_USER0);
  
  -   if (!TestSetPageWriteback(clone_page))
  +   if (!TestSetPageWriteback(clone_page)) {
  +   mem_cgroup_update_stat(clone_page,
  +   MEM_CGROUP_STAT_WRITEBACK, 1);
 
 I wonder if we should start implementing inc and dec to avoid passing
 the +1 and -1 parameters. It should make the code easier to read.

OK, it's always +1/-1, and I don't see any case where we should use
different numbers. So, better to move to the inc/dec naming.
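
So something along the lines of these thin wrappers (names are illustrative;
the v5 posting defines helpers of this kind):

	static inline void mem_cgroup_inc_stat(struct page *page,
					enum mem_cgroup_stat_index idx)
	{
		mem_cgroup_update_stat(page, idx, 1);
	}

	static inline void mem_cgroup_dec_stat(struct page *page,
					enum mem_cgroup_stat_index idx)
	{
		mem_cgroup_update_stat(page, idx, -1);
	}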

 
  inc_zone_page_state(clone_page, NR_WRITEBACK);
  +   }
  unlock_page(clone_page);
  
  return 0;
  @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, 
  int err)
  }
  
  if (buffer_nilfs_allocated(page_buffers(page))) {
  -   if (TestClearPageWriteback(page))
  +   if (TestClearPageWriteback(page)) {
  +   mem_cgroup_update_stat(page,
  +   MEM_CGROUP_STAT_WRITEBACK, -1);
  dec_zone_page_state(page, NR_WRITEBACK);
  +   }
  } else
  end_page_writeback(page);
   }
  diff --git a/mm/filemap.c b/mm/filemap.c
  index fe09e51..f85acae 100644
  --- a/mm/filemap.c
  +++ b/mm/filemap.c

[Devel] [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v5)

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (directly or background) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes.

Changelog (v4 -> v5)
~~~~~~~~~~~~~~~~~~~~
 * fixed a potential deadlock between lock_page_cgroup and mapping->tree_lock
   (I'm not sure I did the right thing for this point, so review and tests are
   very welcome)
 * introduce inc/dec functions to update file cache accounting
 * export only a restricted subset of mem_cgroup_stat_index flags
 * fixed a bug in determine_dirtyable_memory() to correctly return the local
   memcg dirtyable memory
 * always use global dirty memory settings in calc_period_shift()

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 1/4] memcg: dirty memory documentation

2010-03-30 Thread Andrea Righi
Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/cgroups/memory.txt |   36 
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 49f86f3..38ca499 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -310,6 +310,11 @@ cache  - # of bytes of page cache memory.
 rss- # of bytes of anonymous and swap cache memory.
 pgpgin - # of pages paged in (equivalent to # of charging events).
 pgpgout- # of pages paged out (equivalent to # of uncharging 
events).
+filedirty  - # of pages that are waiting to get written back to the disk.
+writeback  - # of pages that are actively being written back to the disk.
+writeback_tmp  - # of pages used by FUSE for temporary writeback buffers.
+nfs- # of NFS pages sent to the server, but not yet committed to
+ the actual storage.
 active_anon- # of bytes of anonymous and  swap cache memory on active
  lru list.
 inactive_anon  - # of bytes of anonymous memory and swap cache memory on
@@ -345,6 +350,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given 
time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup 
writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+amount of dirty memory at which a process which is generating disk writes
+inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+bytes) at which a process generating disk writes will start itself writing
+out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+memory, the amount of dirty memory at which background writeback kernel
+threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+bytes) at which background writeback kernel threads will start writing out
+dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
Infrastructure to account dirty pages per cgroup and to add dirty limit
interface to the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, 
memory.dirty_background_bytes

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |  122 +++-
 mm/memcontrol.c|  507 +---
 2 files changed, 593 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44301c6..61fdca4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,55 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include linux/writeback.h
 #include linux/cgroup.h
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+   MEMCG_NR_DIRTYABLE_PAGES,
+   MEMCG_NR_RECLAIM_PAGES,
+   MEMCG_NR_WRITEBACK,
+   MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* File cache pages accounting */
+enum mem_cgroup_write_page_stat_item {
+   MEMCG_NR_FILE_MAPPED,   /* # of pages charged as file rss */
+   MEMCG_NR_FILE_DIRTY,/* # of dirty pages in page cache */
+   MEMCG_NR_FILE_WRITEBACK,/* # of pages under writeback */
+   MEMCG_NR_FILE_WRITEBACK_TEMP,   /* # of pages under writeback using
+  temporary buffers */
+   MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+
+   MEMCG_NR_FILE_NSTAT,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+   int dirty_ratio;
+   int dirty_background_ratio;
+   unsigned long dirty_bytes;
+   unsigned long dirty_background_bytes;
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+   param-dirty_ratio = vm_dirty_ratio;
+   param-dirty_bytes = vm_dirty_bytes;
+   param-dirty_background_ratio = dirty_background_ratio;
+   param-dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +160,40 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern int do_swap_account;
 #endif
 
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_vm_dirty_param(struct vm_dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
+
+extern void mem_cgroup_update_page_stat_locked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx, bool charge);
+
+extern void mem_cgroup_update_page_stat_unlocked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx, bool charge);
+
+static inline void mem_cgroup_inc_page_stat_locked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat_locked(page, idx, true);
+}
+
+static inline void mem_cgroup_dec_page_stat_locked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat_locked(page, idx, false);
+}
+
+static inline void mem_cgroup_inc_page_stat_unlocked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat_unlocked(page, idx, true);
+}
+
+static inline void mem_cgroup_dec_page_stat_unlocked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat_unlocked(page, idx, false);
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
if (mem_cgroup_subsys.disabled)
@@ -124,7 +201,6 @@ static inline bool mem_cgroup_disabled(void)
return false;
 }
 
-void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
@@ -294,8 +370,38 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct 
task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-   int val)
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item 
item)
+{
+   return -ENOSYS;
+}
+
+static inline void mem_cgroup_update_page_stat_locked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx, bool charge)
+{
+}
+
+static inline void mem_cgroup_update_page_stat_unlocked(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx, bool charge)
+{
+}
+
+static inline void mem_cgroup_inc_page_stat_locked(struct page *page,
+   enum

[Devel] [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags

2010-03-30 Thread Andrea Righi
Introduce page_cgroup flags to keep track of file cache pages.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/page_cgroup.h |   45 +++
 1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..dc66bee 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -39,6 +39,12 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
+   PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
+   PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/
+   PCG_ACCT_DIRTY, /* page is dirty */
+   PCG_ACCT_WRITEBACK, /* page is being written back to disk */
+   PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+   PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
 };
 
 #define TESTPCGFLAG(uname, lname)  \
@@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+/* File cache and dirty memory flags */
+TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+
+TESTPCGFLAG(Dirty, ACCT_DIRTY)
+SETPCGFLAG(Dirty, ACCT_DIRTY)
+CLEARPCGFLAG(Dirty, ACCT_DIRTY)
+
+TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
+SETPCGFLAG(Writeback, ACCT_WRITEBACK)
+CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
+
+TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+
+TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
	return page_to_nid(pc->page);
@@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct 
page_cgroup *pc)
	return page_zonenum(pc->page);
 }
 
+/*
+ * lock_page_cgroup() should not be held under mapping->tree_lock
+ */
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
	bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -93,6 +123,21 @@ static inline void unlock_page_cgroup(struct page_cgroup 
*pc)
	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+/*
+ * This lock is not for charge/uncharge but for account moving, i.e.
+ * overwriting pc->mem_cgroup. The lock owner should guarantee by itself
+ * that the page is uncharged while we hold this.
+ */
+static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+   bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+   bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)

2010-03-30 Thread Andrea Righi
On Tue, Mar 09, 2010 at 10:29:28AM +0900, Daisuke Nishimura wrote:
 On Tue, 9 Mar 2010 09:19:14 +0900, KAMEZAWA Hiroyuki 
 kamezawa.hir...@jp.fujitsu.com wrote:
  On Tue, 9 Mar 2010 01:12:52 +0100
  Andrea Righi ari...@develer.com wrote:
  
   On Mon, Mar 08, 2010 at 05:31:00PM +0900, KAMEZAWA Hiroyuki wrote:
On Mon, 8 Mar 2010 17:07:11 +0900
Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

 On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki 
 kamezawa.hir...@jp.fujitsu.com wrote:
  On Mon, 8 Mar 2010 11:17:24 +0900
  Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
  
But IIRC, clear_writeback is done under treelock No ?

   The place where NR_WRITEBACK is updated is out of tree_lock.
   
  1311 int test_clear_page_writeback(struct page *page)
  1312 {
  1313 struct address_space *mapping = 
   page_mapping(page);
  1314 int ret;
  1315
  1316 if (mapping) {
  1317 struct backing_dev_info *bdi = 
   mapping-backing_dev_info;
  1318 unsigned long flags;
  1319
  1320 spin_lock_irqsave(mapping-tree_lock, 
   flags);
  1321 ret = TestClearPageWriteback(page);
  1322 if (ret) {
  1323 
   radix_tree_tag_clear(mapping-page_tree,
  1324 
   page_index(page),
  1325 
   PAGECACHE_TAG_WRITEBACK);
  1326 if 
   (bdi_cap_account_writeback(bdi)) {
  1327 __dec_bdi_stat(bdi, 
   BDI_WRITEBACK);
  1328 __bdi_writeout_inc(bdi);
  1329 }
  1330 }
  1331 
   spin_unlock_irqrestore(mapping-tree_lock, flags);
  1332 } else {
  1333 ret = TestClearPageWriteback(page);
  1334 }
  1335 if (ret)
  1336 dec_zone_page_state(page, NR_WRITEBACK);
  1337 return ret;
  1338 }
  
  We can move this up to under tree_lock. Considering memcg, all our 
  target has mapping.
  
  If we newly account bounce-buffers (for NILFS, FUSE, etc..), which 
  has no -mapping,
  we need much more complex new charge/uncharge theory.
  
  But yes, adding new lock scheme seems complicated. (Sorry Andrea.)
  My concerns is performance. We may need somehing new 
  re-implementation of
  locks/migrate/charge/uncharge.
  
 I agree. Performance is my concern too.
 
 I made a patch below and measured the time(average of 10 times) of 
 kernel build
 on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig).
 
 before
 - root cgroup: 190.47 sec
 - child cgroup: 192.81 sec
 
 after
 - root cgroup: 191.06 sec
 - child cgroup: 193.06 sec
 
 Hmm... about 0.3% slower for root, 0.1% slower for child.
 

Hmm...accepatable ? (sounds it's in error-range)

BTW, why local_irq_disable() ? 
local_irq_save()/restore() isn't better ?
   
   Probably there's not the overhead of saving flags? 
  maybe.
  
   Anyway, it would make the code much more readable...
   
  ok.
  
  please go ahead in this direction. Nishimura-san, would you post an
  independent patch ? If no, Andrea-san, please.
  
 This is the updated version.
 
 Andrea-san, can you merge this into your patch set ?

OK, I'll merge, do some tests and post a new version.

Thanks!
-Andrea

 
 ===
 From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
 
 In current implementation, we don't have to disable irq at lock_page_cgroup()
 because the lock is never acquired in interrupt context.
 But we are going to call it in later patch in an interrupt context or with
 irq disabled, so this patch disables irq at lock_page_cgroup() and enables it
 at unlock_page_cgroup().
 
 Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
 ---
  include/linux/page_cgroup.h |   16 ++--
  mm/memcontrol.c |   43 
 +--
  2 files changed, 39 insertions(+), 20 deletions(-)
 
 diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
 index 30b0813..0d2f92c 100644
 --- a/include/linux/page_cgroup.h
 +++ b/include/linux/page_cgroup.h
 @@ -83,16 +83,28 @@ static inline enum zone_type page_cgroup_zid(struct 
 page_cgroup *pc)
   return page_zonenum(pc-page);
  }
  
 -static inline void lock_page_cgroup(struct page_cgroup *pc)
 +static inline void __lock_page_cgroup(struct page_cgroup *pc)
  {
   bit_spin_lock(PCG_LOCK, pc-flags);
  }
  
 -static inline void unlock_page_cgroup

[Devel] [PATCH -mmotm 5/5] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

[ NOTE: for now do not account WritebackTmp pages (FUSE) and NILFS2
bounce pages. This depends on charging also bounce pages per cgroup. ]

As a bonus, make determine_dirtyable_memory() static again: this
function isn't used anymore outside page writeback.

Signed-off-by: Andrea Righi ari...@develer.com
---
 fs/nfs/write.c|4 +
 include/linux/writeback.h |2 -
 mm/filemap.c  |1 +
 mm/page-writeback.c   |  215 -
 mm/rmap.c |4 +-
 mm/truncate.c |1 +
 6 files changed, 141 insertions(+), 86 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..3e8b9f8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req)
NFS_PAGE_TAG_COMMIT);
nfsi-ncommit++;
spin_unlock(inode-i_lock);
+   mem_cgroup_inc_page_stat(req-wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +452,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req-wb_page;
 
if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) {
+   mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page-mapping-backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1277,6 +1279,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
*head, int how)
req = nfs_list_entry(head-next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+   mem_cgroup_dec_page_stat(req-wb_page,
+   MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req-wb_page-mapping-backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dd9512d..39e4cb2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/mm/filemap.c b/mm/filemap.c
index 62cbac0..bd833fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 * having removed the page entirely.
 */
if (PageDirty(page)  mapping_cap_account_dirty(mapping)) {
+   mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping-backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab84693..fcac9b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around.  To avoid stressing page reclaim with lots of unreclaimable
+ * pages.  It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+   int node;
+   unsigned long x = 0;
+
+   for_each_node_state(node, N_HIGH_MEMORY) {
+   struct zone *z =
+   NODE_DATA(node)-node_zones[ZONE_HIGHMEM];
+
+   x += zone_page_state(z, NR_FREE_PAGES) +
+zone_reclaimable_pages(z);
+   }
+   /*
+* Make sure that the number of highmem pages is never larger
+* than the number of the total dirtyable memory. This can only
+* occur in very strange VM situations but we want to make sure
+* that this does not occur.
+*/
+   return min(x, total);
+#else
+   return 0;
+#endif
+}
+
+static unsigned long get_global_dirtyable_memory(void)
+{
+   unsigned long memory;
+
+   memory = global_page_state

[Devel] [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock

2010-03-30 Thread Andrea Righi
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

In the current implementation, we don't have to disable irqs at lock_page_cgroup()
because the lock is never acquired in interrupt context.
But a later patch is going to call it in interrupt context or with irqs
disabled, so this patch disables irqs at lock_page_cgroup() and enables them
at unlock_page_cgroup().

Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 include/linux/page_cgroup.h |   16 ++--
 mm/memcontrol.c |   43 +--
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..0d2f92c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,16 +83,28 @@ static inline enum zone_type page_cgroup_zid(struct 
page_cgroup *pc)
return page_zonenum(pc-page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_lock(PCG_LOCK, pc-flags);
 }
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
bit_spin_unlock(PCG_LOCK, pc-flags);
 }
 
+#define lock_page_cgroup(pc, flags)\
+   do {\
+   local_irq_save(flags);  \
+   __lock_page_cgroup(pc); \
+   } while (0)
+
+#define unlock_page_cgroup(pc, flags)  \
+   do {\
+   __unlock_page_cgroup(pc);   \
+   local_irq_restore(flags);   \
+   } while (0)
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fab84e..a9fd736 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1352,12 +1352,13 @@ void mem_cgroup_update_file_mapped(struct page *page, 
int val)
 {
struct mem_cgroup *mem;
struct page_cgroup *pc;
+   unsigned long flags;
 
pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
 
-   lock_page_cgroup(pc);
+   lock_page_cgroup(pc, flags);
mem = pc-mem_cgroup;
if (!mem)
goto done;
@@ -1371,7 +1372,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int 
val)
__this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 
 done:
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup(pc, flags);
 }
 
 /*
@@ -1705,11 +1706,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
struct page_cgroup *pc;
unsigned short id;
swp_entry_t ent;
+   unsigned long flags;
 
VM_BUG_ON(!PageLocked(page));
 
pc = lookup_page_cgroup(page);
-   lock_page_cgroup(pc);
+   lock_page_cgroup(pc, flags);
if (PageCgroupUsed(pc)) {
mem = pc-mem_cgroup;
if (mem  !css_tryget(mem-css))
@@ -1723,7 +1725,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
page *page)
mem = NULL;
rcu_read_unlock();
}
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup(pc, flags);
return mem;
 }
 
@@ -1736,13 +1738,15 @@ static void __mem_cgroup_commit_charge(struct 
mem_cgroup *mem,
 struct page_cgroup *pc,
 enum charge_type ctype)
 {
+   unsigned long flags;
+
/* try_charge() can return NULL to *memcg, taking care of it. */
if (!mem)
return;
 
-   lock_page_cgroup(pc);
+   lock_page_cgroup(pc, flags);
if (unlikely(PageCgroupUsed(pc))) {
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup(pc, flags);
mem_cgroup_cancel_charge(mem);
return;
}
@@ -1772,7 +1776,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup 
*mem,
 
mem_cgroup_charge_statistics(mem, pc, true);
 
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup(pc, flags);
/*
 * charge_statistics updated event counter. Then, check it.
 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1842,12 +1846,13 @@ static int mem_cgroup_move_account(struct page_cgroup 
*pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
int ret = -EINVAL;
-   lock_page_cgroup(pc);
+   unsigned long flags;
+   lock_page_cgroup(pc, flags);
if (PageCgroupUsed(pc)  pc-mem_cgroup == from) {
__mem_cgroup_move_account(pc, from, to, uncharge);
ret = 0;
}
-   unlock_page_cgroup(pc);
+   unlock_page_cgroup(pc, flags);
/*
 * check events
 */
@@ -1974,17 +1979,17 @@ int mem_cgroup_cache_charge(struct page *page, struct 
mm_struct *mm,
 */
   

[Devel] [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
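
As a rough usage sketch (the cgroup mount point and the cgroup name below
are only assumptions for illustration, not part of the patchset; the file
names are the ones introduced by the documentation patch):

/*
 * Example: cap the dirty page cache of cgroup "foo" at 100MB and start
 * background write-out at 5% of the cgroup's memory.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	write_knob("/cgroups/memory/foo/memory.dirty_bytes", "104857600");
	write_knob("/cgroups/memory/foo/memory.dirty_background_ratio", "5");
	return 0;
}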

Changelog (v5 -> v6)
~~
 * always disable/enable IRQs at lock/unlock_page_cgroup(): this allows to drop
   the previous complicated locking scheme in favor of a simpler locking, even
   if this obviously adds some overhead (see results below)
 * drop FUSE and NILFS2 dirty pages accounting for now (this depends on
   charging bounce pages per cgroup)

Results
~~~
I ran some tests using a kernel build (2.6.33 x86_64_defconfig) on a
Intel Core 2 @ 1.2GHz as testcase using different kernels:
 - mmotm vanilla
 - mmotm with cgroup-dirty-memory using the previous complex locking scheme
   (my previous patchset + the fixes reported by Kame-san and Daisuke-san)
 - mmotm with cgroup-dirty-memory using the simple locking scheme
   (lock_page_cgroup() with IRQs disabled)

Following the results:
before
 - mmotm vanilla, root  cgroup:   11m51.983s
 - mmotm vanilla, child cgroup:   11m56.596s

after
 - mmotm, complex locking scheme, root  cgroup:   11m53.037s
 - mmotm, complex locking scheme, child cgroup:   11m57.896s

 - mmotm, lock_page_cgroup+irq_disabled, root  cgroup:  12m5.499s
 - mmotm, lock_page_cgroup+irq_disabled, child cgroup:  12m9.920s

With the complex locking solution, the overhead introduced by the
cgroup dirty memory accounting is minimal (0.14%), compared with the overhead
introduced by the lock_page_cgroup+irq_disabled solution (1.90%).

The performance overhead is not huge in either solution, but the impact on
performance is even smaller with the more complicated locking scheme...

Maybe we can go ahead with the simplest implementation for now and start to
think about an alternative implementation of the page_cgroup locking and
charge/uncharge of pages.

If someone is interested or want to repeat the tests (maybe on a bigger
machine) I can post also the other version of the patchset. Just let me know.

-Andrea

 Documentation/cgroups/memory.txt |   36 +++
 fs/nfs/write.c   |4 +
 include/linux/memcontrol.h   |   87 +++-
 include/linux/page_cgroup.h  |   42 -
 include/linux/writeback.h|2 -
 mm/filemap.c |1 +
 mm/memcontrol.c  |  475 +-
 mm/page-writeback.c  |  215 +++---
 mm/rmap.c|4 +-
 mm/truncate.c|1 +
 10 files changed, 722 insertions(+), 145 deletions(-)
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, 
memory.dirty_background_bytes

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |   87 +-
 mm/memcontrol.c|  432 
 2 files changed, 480 insertions(+), 39 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44301c6..0602ec9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,55 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+   MEMCG_NR_DIRTYABLE_PAGES,
+   MEMCG_NR_RECLAIM_PAGES,
+   MEMCG_NR_WRITEBACK,
+   MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* File cache pages accounting */
+enum mem_cgroup_write_page_stat_item {
+   MEMCG_NR_FILE_MAPPED,   /* # of pages charged as file rss */
+   MEMCG_NR_FILE_DIRTY,/* # of dirty pages in page cache */
+   MEMCG_NR_FILE_WRITEBACK,/* # of pages under writeback */
+   MEMCG_NR_FILE_WRITEBACK_TEMP,   /* # of pages under writeback using
+  temporary buffers */
+   MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+
+   MEMCG_NR_FILE_NSTAT,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+   int dirty_ratio;
+   int dirty_background_ratio;
+   unsigned long dirty_bytes;
+   unsigned long dirty_background_bytes;
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+   param->dirty_ratio = vm_dirty_ratio;
+   param->dirty_bytes = vm_dirty_bytes;
+   param->dirty_background_ratio = dirty_background_ratio;
+   param->dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +160,25 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern int do_swap_account;
 #endif
 
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_vm_dirty_param(struct vm_dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
+
+extern void mem_cgroup_update_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx, bool charge);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat(page, idx, true);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat(page, idx, false);
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
if (mem_cgroup_subsys.disabled)
@@ -124,7 +186,6 @@ static inline bool mem_cgroup_disabled(void)
return false;
 }
 
-void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
@@ -294,8 +355,18 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct 
task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-   int val)
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item 
item)
+{
+   return -ENOSYS;
+}
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
 {
 }
 
@@ -306,6 +377,16 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return 0;
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+   return false;
+}
+
+static inline void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+   get_global_vm_dirty_param(param);
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a9fd736..ffcf37c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,14 +80,21 @@ enum mem_cgroup_stat_index {
/*
 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
 */
-   MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
+   MEM_CGROUP_STAT_CACHE

[Devel] [PATCH -mmotm 3/5] page_cgroup: introduce file cache flags

2010-03-30 Thread Andrea Righi
Introduce page_cgroup flags to keep track of file cache pages.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/page_cgroup.h |   26 ++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 0d2f92c..4e09c8c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -39,6 +39,11 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
+   PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/
+   PCG_ACCT_DIRTY, /* page is dirty */
+   PCG_ACCT_WRITEBACK, /* page is being written back to disk */
+   PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+   PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
 };
 
 #define TESTPCGFLAG(uname, lname)  \
@@ -73,6 +78,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+/* File cache and dirty memory flags */
+TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+
+TESTPCGFLAG(Dirty, ACCT_DIRTY)
+SETPCGFLAG(Dirty, ACCT_DIRTY)
+CLEARPCGFLAG(Dirty, ACCT_DIRTY)
+
+TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
+SETPCGFLAG(Writeback, ACCT_WRITEBACK)
+CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
+
+TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+
+TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
return page_to_nid(pc-page);
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 2/5] memcg: dirty memory documentation

2010-03-30 Thread Andrea Righi
Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/cgroups/memory.txt |   36 
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 49f86f3..38ca499 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -310,6 +310,11 @@ cache  - # of bytes of page cache memory.
 rss- # of bytes of anonymous and swap cache memory.
 pgpgin - # of pages paged in (equivalent to # of charging events).
 pgpgout- # of pages paged out (equivalent to # of uncharging 
events).
+filedirty  - # of pages that are waiting to get written back to the disk.
+writeback  - # of pages that are actively being written back to the disk.
+writeback_tmp  - # of pages used by FUSE for temporary writeback buffers.
+nfs- # of NFS pages sent to the server, but not yet committed to
+ the actual storage.
 active_anon- # of bytes of anonymous and  swap cache memory on active
  lru list.
 inactive_anon  - # of bytes of anonymous memory and swap cache memory on
@@ -345,6 +350,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given 
time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup 
writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+amount of dirty memory at which a process which is generating disk writes
+inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+bytes) at which a process generating disk writes will start itself writing
+out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+memory, the amount of dirty memory at which background writeback kernel
+threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+bytes) at which background writeback kernel threads will start writing out
+dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Thu, Mar 11, 2010 at 09:39:13AM +0900, KAMEZAWA Hiroyuki wrote:
 On Wed, 10 Mar 2010 00:00:31 +0100
 Andrea Righi ari...@develer.com wrote:
 
  Control the maximum amount of dirty pages a cgroup can have at any given 
  time.
  
  Per cgroup dirty limit is like fixing the max amount of dirty (hard to 
  reclaim)
  page cache used by any cgroup. So, in case of multiple cgroup writers, they
  will not be able to consume more than their designated share of dirty pages 
  and
  will be forced to perform write-out if they cross that limit.
  
  The overall design is the following:
  
   - account dirty pages per cgroup
   - limit the number of dirty pages via memory.dirty_ratio / 
  memory.dirty_bytes
 and memory.dirty_background_ratio / memory.dirty_background_bytes in
 cgroupfs
   - start to write-out (background or actively) when the cgroup limits are
 exceeded
  
  This feature is supposed to be strictly connected to any underlying IO
  controller implementation, so we can stop increasing dirty pages in VM layer
  and enforce a write-out before any cgroup will consume the global amount of
  dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
  /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
  
  Changelog (v5 - v6)
  ~~
   * always disable/enable IRQs at lock/unlock_page_cgroup(): this allows to 
  drop
 the previous complicated locking scheme in favor of a simpler locking, 
  even
 if this obviously adds some overhead (see results below)
   * drop FUSE and NILFS2 dirty pages accounting for now (this depends on
 charging bounce pages per cgroup)
  
  Results
  ~~~
  I ran some tests using a kernel build (2.6.33 x86_64_defconfig) on a
  Intel Core 2 @ 1.2GHz as testcase using different kernels:
   - mmotm vanilla
   - mmotm with cgroup-dirty-memory using the previous complex locking 
  scheme
 (my previous patchset + the fixes reported by Kame-san and Daisuke-san)
   - mmotm with cgroup-dirty-memory using the simple locking scheme
 (lock_page_cgroup() with IRQs disabled)
  
  Following the results:
  before
   - mmotm vanilla, root  cgroup:   11m51.983s
   - mmotm vanilla, child cgroup:   11m56.596s
  
  after
   - mmotm, complex locking scheme, root  cgroup:   11m53.037s
   - mmotm, complex locking scheme, child cgroup:   11m57.896s
  
   - mmotm, lock_page_cgroup+irq_disabled, root  cgroup:  12m5.499s
   - mmotm, lock_page_cgroup+irq_disabled, child cgroup:  12m9.920s
  
  With the complex locking solution, the overhead introduced by the
  cgroup dirty memory accounting is minimal (0.14%), compared with the 
  overhead
  introduced by the lock_page_cgroup+irq_disabled solution (1.90%).
  
 Hmm... isn't this bigger than expected ?

Consider that I'm not running the kernel build on tmpfs, but on a fs
defined on /dev/sda. So the additional overhead should be normal
compared to the mmotm vanilla, where there's only FILE_MAPPED
accounting.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Thu, Mar 11, 2010 at 06:42:44PM +0900, KAMEZAWA Hiroyuki wrote:
 On Thu, 11 Mar 2010 18:25:00 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
  Then, it's not problem that check pc-mem_cgroup is root cgroup or not
  without spinlock.
  ==
  void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
  {
  pc = lookup_page_cgroup(page);
  if (unlikely(!pc) || mem_cgroup_is_root(pc-mem_cgroup))
  return; 
  ...
  }
  ==
  This can be handle in the same logic of lock failure path.
  And we just do ignore accounting.
  
  There are will be no spinlocksto do more than this,
  I think we have to use struct page rather than struct page_cgroup.
  
 Hmm..like this ? The bad point of this patch is that this will corrupt 
 FILE_MAPPED
 status in root cgroup. This kind of change is not very good.
 So, one way is to use this kind of function only for new parameters. Hmm.

This kind of accounting shouldn't be a big problem for the dirty memory
write-out. The benefit in terms of performance is much more important I
think.

The missing accounting of root cgroup statistics could be an issue if we
move a lot of pages from root cgroup into a child cgroup (when migration
of file cache pages will be supported and enabled). But at worst we'll
continue to write-out pages using the global settings. Remember that
memcg dirty memory is always the min(memcg_dirty_memory, total_dirty_memory),
so even if we're leaking dirty memory accounting at worst we'll touch
the global dirty limit and fallback to the current write-out
implementation.
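
To put the same thing as a toy model (this is just an illustration of the
clamp described above, not the actual patch code):

static unsigned long effective_dirty_pages(unsigned long memcg_dirty,
					   unsigned long global_dirty)
{
	/*
	 * The per-cgroup figure is never taken to be larger than the global
	 * one, so a leak in per-cgroup accounting can at worst make us hit
	 * the global dirty limit and fall back to the current write-out
	 * behaviour.
	 */
	return memcg_dirty < global_dirty ? memcg_dirty : global_dirty;
}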

I'll merge this patch, re-run some tests (kernel build and large file
copy) and post a new version.

Unfortunately at the moment I've not a big machine to use for these
tests, but maybe I can get some help. Vivek has probably a nice hardware
to test this code.. ;)

Thanks!
-Andrea

 ==
 
 From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 Now, file-mapped is maintained. But a more generic update function
 will be needed for dirty page accounting.
 
 For accounting page status, we have to guarantee lock_page_cgroup()
 will never be called under tree_lock held.
 To guarantee that, we use trylock at updating status.
 By this, we do fuzzy accounting, but in almost all cases, it's correct.
 
 Changelog:
  - removed unnecessary preempt_disable()
  - added root cgroup check. By this, we do no lock/account in root cgroup.
 
 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 ---
  include/linux/memcontrol.h  |7 ++-
  include/linux/page_cgroup.h |   15 +++
  mm/memcontrol.c |   92 
 +---
  mm/rmap.c   |4 -
  4 files changed, 94 insertions(+), 24 deletions(-)
 
 Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
 ===
 --- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
 +++ mmotm-2.6.34-Mar9/mm/memcontrol.c
 @@ -1348,30 +1348,83 @@ bool mem_cgroup_handle_oom(struct mem_cg
   * Currently used to update mapped file statistics, but the routine can be
   * generalized to update other statistics as well.
   */
 -void mem_cgroup_update_file_mapped(struct page *page, int val)
 +void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
  {
   struct mem_cgroup *mem;
 - struct page_cgroup *pc;
 -
 - pc = lookup_page_cgroup(page);
 - if (unlikely(!pc))
 - return;
 + int val;
  
 - lock_page_cgroup(pc);
   mem = pc-mem_cgroup;
 - if (!mem)
 - goto done;
 + if (!mem || !PageCgroupUsed(pc))
 + return;
  
 - if (!PageCgroupUsed(pc))
 - goto done;
 + if (charge)
 + val = 1;
 + else
 + val = -1;
  
 + switch (idx) {
 + case MEMCG_NR_FILE_MAPPED:
 + if (charge) {
 + if (!PageCgroupFileMapped(pc))
 + SetPageCgroupFileMapped(pc);
 + else
 + val = 0;
 + } else {
 + if (PageCgroupFileMapped(pc))
 + ClearPageCgroupFileMapped(pc);
 + else
 + val = 0;
 + }
 + idx = MEM_CGROUP_STAT_FILE_MAPPED;
 + break;
 + default:
 + BUG();
 + break;
 + }
   /*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
 - __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
 + __this_cpu_add(mem-stat-count[idx], val);
 +}
  
 -done:
 - unlock_page_cgroup(pc);
 +void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
 +{
 + struct page_cgroup *pc;
 +
 + pc = lookup_page_cgroup(page);
 + if (!pc || mem_cgroup_is_root(pc-mem_cgroup))
 + return;
 +
 + if (trylock_page_cgroup(pc)) {
 + __mem_cgroup_update_stat(pc, idx, charge);
 +  

[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Wed, Mar 10, 2010 at 05:23:39PM -0500, Vivek Goyal wrote:
 On Wed, Mar 10, 2010 at 12:00:35AM +0100, Andrea Righi wrote:
 
 [..]
 
  - * Currently used to update mapped file statistics, but the routine can be
  - * generalized to update other statistics as well.
  + * mem_cgroup_update_page_stat() - update memcg file cache's accounting
  + * @page:  the page involved in a file cache operation.
  + * @idx:   the particular file cache statistic.
  + * @charge:true to increment, false to decrement the statistic 
  specified
  + * by @idx.
  + *
  + * Update memory cgroup file cache's accounting.
*/
  -void mem_cgroup_update_file_mapped(struct page *page, int val)
  +void mem_cgroup_update_page_stat(struct page *page,
  +   enum mem_cgroup_write_page_stat_item idx, bool charge)
   {
  -   struct mem_cgroup *mem;
  struct page_cgroup *pc;
  unsigned long flags;
   
  +   if (mem_cgroup_disabled())
  +   return;
  pc = lookup_page_cgroup(page);
  -   if (unlikely(!pc))
  +   if (unlikely(!pc) || !PageCgroupUsed(pc))
  return;
  -
  lock_page_cgroup(pc, flags);
  -   mem = pc-mem_cgroup;
  -   if (!mem)
  -   goto done;
  -
  -   if (!PageCgroupUsed(pc))
  -   goto done;
  -
  -   /*
  -* Preemption is already disabled. We can use __this_cpu_xxx
  -*/
  -   __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val);
  -
  -done:
  +   __mem_cgroup_update_page_stat(pc, idx, charge);
  unlock_page_cgroup(pc, flags);
   }
  +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_unlocked);
 
   CC  mm/memcontrol.o
 mm/memcontrol.c:1600: error: ‘mem_cgroup_update_page_stat_unlocked’
 undeclared here (not in a function)
 mm/memcontrol.c:1600: warning: type defaults to ‘int’ in declaration of
 ‘mem_cgroup_update_page_stat_unlocked’
 make[1]: *** [mm/memcontrol.o] Error 1
 make: *** [mm] Error 2

Thanks! Will fix in the next version.

(mmh... why I didn't see this? probably because I'm building a static kernel...)
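
For the record, the error above is just the EXPORT line naming a symbol that
no longer exists after the rename; presumably the one-line fix is simply
(my assumption, not the posted patch):

/* export the function that is actually defined above */
EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat);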

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Fri, Mar 12, 2010 at 08:42:30AM +0900, KAMEZAWA Hiroyuki wrote:
 On Thu, 11 Mar 2010 10:03:07 -0500
 Vivek Goyal vgo...@redhat.com wrote:
 
  On Thu, Mar 11, 2010 at 06:25:00PM +0900, KAMEZAWA Hiroyuki wrote:
   On Thu, 11 Mar 2010 10:14:25 +0100
   Peter Zijlstra pet...@infradead.org wrote:
   
On Thu, 2010-03-11 at 10:17 +0900, KAMEZAWA Hiroyuki wrote:
 On Thu, 11 Mar 2010 09:39:13 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
   The performance overhead is not so huge in both solutions, but 
   the impact on
   performance is even more reduced using a complicated solution...
   
   Maybe we can go ahead with the simplest implementation for now 
   and start to
   think to an alternative implementation of the page_cgroup locking 
   and
   charge/uncharge of pages.

FWIW bit spinlocks suck massive.

  
  maybe. But in this 2 years, one of our biggest concerns was the 
  performance.
  So, we do something complex in memcg. But complex-locking is , yes, 
  complex.
  Hmm..I don't want to bet we can fix locking scheme without 
  something complex.
  
 But overall patch set seems good (to me.) And dirty_ratio and 
 dirty_background_ratio
 will give us much benefit (of performance) than we lose by small 
 overheads.

Well, the !cgroup or root case should really have no performance impact.

 IIUC, this series affects trgger for background-write-out.

Not sure though, while this does the accounting the actual writeout is
still !cgroup aware and can definately impact performance negatively by
shrinking too much.

   
   Ah, okay, your point is !cgroup (ROOT cgroup case.)
   I don't think accounting these file cache status against root cgroup is 
   necessary.
   
  
  I think what peter meant was that with memory cgroups created we will do
  writeouts much more aggressively.
  
  In balance_dirty_pages()
  
  if (bdi_nr_reclaimable + bdi_nr_writeback = bdi_thresh)
  break;
  
  Now with Andrea's patches, we are calculating bdi_thres per memory cgroup
  (almost)
 hmm.
 
  
  bdi_thres ~= per_memory_cgroup_dirty * bdi_fraction
  
  But bdi_nr_reclaimable and bdi_nr_writeback stats are still global.
  
 Why bdi_thresh of ROOT cgroup doesn't depend on global number ?

Very true. mem_cgroup_has_dirty_limit() must always return false in case
of root cgroup, so that global numbers are used.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Fri, Mar 12, 2010 at 09:03:26AM +0900, KAMEZAWA Hiroyuki wrote:
 On Fri, 12 Mar 2010 00:59:22 +0100
 Andrea Righi ari...@develer.com wrote:
 
  On Thu, Mar 11, 2010 at 01:07:53PM -0500, Vivek Goyal wrote:
   On Wed, Mar 10, 2010 at 12:00:31AM +0100, Andrea Righi wrote:
 
  mmmh.. strange, on my side I get something as expected:
  
  root cgroup
  $ dd if=/dev/zero of=test bs=1M count=500
  500+0 records in
  500+0 records out
  524288000 bytes (524 MB) copied, 6.28377 s, 83.4 MB/s
  
  child cgroup with 100M memory.limit_in_bytes
  $ dd if=/dev/zero of=test bs=1M count=500
  500+0 records in
  500+0 records out
  524288000 bytes (524 MB) copied, 11.8884 s, 44.1 MB/s
  
  Did you change the global /proc/sys/vm/dirty_* or memcg dirty
  parameters?
  
 what happens when bs=4k count=100 under 100M ? no changes ?

OK, I confirm the results found by Vivek. Repeating the tests 10 times:

root cgroup  ~= 34.05 MB/s average
 child cgroup (100M) ~= 38.80 MB/s average

So, actually the child cgroup with the 100M limit seems to perform
better in terms of throughput.

IIUC, with the large write and the 100M memory limit it happens that
direct write-out is enforced more frequently and a single write chunk is
enough to meet the bdi_thresh or the global background_thresh +
dirty_thresh limits. This means the task is never (or less) throttled
with io_schedule_timeout() in the balance_dirty_pages() loop. And the
child cgroup gets better performance over the root cgroup.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Fri, Mar 12, 2010 at 08:52:44AM +0900, KAMEZAWA Hiroyuki wrote:
 On Fri, 12 Mar 2010 00:27:09 +0100
 Andrea Righi ari...@develer.com wrote:
 
  On Thu, Mar 11, 2010 at 10:03:07AM -0500, Vivek Goyal wrote:
 
   I am still setting up the system to test whether we see any speedup in
   writeout of large files with-in a memory cgroup with small memory limits.
   I am assuming that we are expecting a speedup because we will start
   writeouts early and background writeouts probably are faster than direct
   reclaim?
  
  mmh... speedup? I think with a large file write + reduced dirty limits
  you'll get a more uniform write-out (more frequent small writes),
  respect to few and less frequent large writes. The system will be more
  reactive, but I don't think you'll be able to see a speedup in the large
  write itself.
  
 Ah, sorry. I misunderstood something. But it's depends on dirty_ratio param.
 If
   background_dirty_ratio = 5
   dirty_ratio= 100
 under 100M cgroup, I think background write-out will be a help.

Right, in this case background flusher threads will help a lot to
write-out the cgroup dirty memory and it'll get better performance.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Fri, Mar 12, 2010 at 10:14:11AM +0900, Daisuke Nishimura wrote:
 On Thu, 11 Mar 2010 18:42:44 +0900, KAMEZAWA Hiroyuki 
 kamezawa.hir...@jp.fujitsu.com wrote:
  On Thu, 11 Mar 2010 18:25:00 +0900
  KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
   Then, it's not problem that check pc-mem_cgroup is root cgroup or not
   without spinlock.
   ==
   void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
   {
 pc = lookup_page_cgroup(page);
 if (unlikely(!pc) || mem_cgroup_is_root(pc-mem_cgroup))
 return; 
 ...
   }
   ==
   This can be handle in the same logic of lock failure path.
   And we just do ignore accounting.
   
   There are will be no spinlocksto do more than this,
   I think we have to use struct page rather than struct page_cgroup.
   
  Hmm..like this ? The bad point of this patch is that this will corrupt 
  FILE_MAPPED
  status in root cgroup. This kind of change is not very good.
  So, one way is to use this kind of function only for new parameters. Hmm.
 IMHO, if we disable accounting file stats in root cgroup, it would be better
 not to show them in memory.stat to avoid confusing users.

Or just show the same values that we show in /proc/meminfo.. (I mean,
not actually the same, but coherent with them).

 But, hmm, I think accounting them in root cgroup isn't so meaningless.
 Isn't making mem_cgroup_has_dirty_limit() return false in case of root cgroup 
 enough?

Agreed. Returning false from mem_cgroup_has_dirty_limit() is enough to
always use global stats for the writeback, so this shouldn't introduce
any overhead for the root cgroup (at least for this part).
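
In other words, something like this toy model (userspace C; every name here
is just a stand-in for the kernel helpers being discussed, picked only for
illustration):

#include <stdbool.h>
#include <stdio.h>

struct vm_dirty_param { int dirty_ratio, dirty_background_ratio; };

static struct vm_dirty_param global_param = { 20, 10 }; /* illustrative values */
static struct vm_dirty_param memcg_param  = { 40,  5 }; /* illustrative values */
static bool in_root_cgroup = true;	/* stand-in for mem_cgroup_is_root() */

static bool mem_cgroup_has_dirty_limit(void)
{
	/* the root cgroup is not accounted: report no per-cgroup limit */
	return !in_root_cgroup;
}

static void get_vm_dirty_param(struct vm_dirty_param *p)
{
	/* writeback transparently falls back to the global numbers */
	*p = mem_cgroup_has_dirty_limit() ? memcg_param : global_param;
}

int main(void)
{
	struct vm_dirty_param p;

	get_vm_dirty_param(&p);
	printf("dirty_ratio used: %d%%\n", p.dirty_ratio); /* 20%: global */
	return 0;
}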

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 3/5] page_cgroup: introduce file cache flags

2010-03-30 Thread Andrea Righi
Introduce page_cgroup flags to keep track of file cache pages.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/page_cgroup.h |   22 +-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index bf9a913..65247e4 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,7 +40,11 @@ enum {
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
/* for cache-status accounting */
-   PCG_FILE_MAPPED,
+   PCG_FILE_MAPPED, /* page is accounted as file rss*/
+   PCG_FILE_DIRTY, /* page is dirty */
+   PCG_FILE_WRITEBACK, /* page is being written back to disk */
+   PCG_FILE_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+   PCG_FILE_UNSTABLE_NFS, /* NFS page not yet committed to the server */
 };
 
 #define TESTPCGFLAG(uname, lname)  \
@@ -83,6 +87,22 @@ TESTPCGFLAG(FileMapped, FILE_MAPPED)
 SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+TESTPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+SETPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+CLEARPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
return page_to_nid(pc-page);
-- 
1.6.3.3

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v7)

2010-03-30 Thread Andrea Righi
Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.

Changelog (v6 -> v7)
~~
 * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
   is never called under tree_lock (no strict accounting, but better overall
   performance)
 * do not account file cache statistics for the root cgroup (zero
   overhead for the root cgroup)
 * fix: evaluate cgroup free pages as the minimum of the free pages of all
   its parents (see the sketch after this list)
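
A sketch of that last point (the structure and the walk below are purely
illustrative, not the actual memcg code):

struct cgroup_model {
	struct cgroup_model *parent;
	unsigned long free_pages;
};

static unsigned long effective_free_pages(const struct cgroup_model *cg)
{
	unsigned long free = cg->free_pages;

	/*
	 * A child can never rely on more free memory than the most
	 * constrained of its ancestors, so take the minimum over the
	 * cgroup and all of its parents.
	 */
	for (cg = cg->parent; cg; cg = cg->parent)
		if (cg->free_pages < free)
			free = cg->free_pages;
	return free;
}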

Results
~~~
The testcase is a kernel build (2.6.33 x86_64_defconfig) on a Intel Core 2 @
1.2GHz:

before
 - root  cgroup:11m51.983s
 - child cgroup:11m56.596s

after
 - root cgroup: 11m51.742s
 - child cgroup:12m5.016s

In the previous version of this patchset, using the complex locking scheme
with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
child cgroup required 11m57.896s and 12m9.920s with 
lock_page_cgroup()+irq_disabled.

With this version there's no overhead for the root cgroup (the small difference
is within the error range). I expected to see less overhead for the child
cgroup; I'll do more testing and try to figure out what's happening.

In the meanwhile, it would be great if someone could perform some tests on a
larger system... unfortunately at the moment I don't have a big system
available for this kind of test...

Thanks,
-Andrea

 Documentation/cgroups/memory.txt |   36 +++
 fs/nfs/write.c   |4 +
 include/linux/memcontrol.h   |   87 ++-
 include/linux/page_cgroup.h  |   35 +++
 include/linux/writeback.h|2 -
 mm/filemap.c |1 +
 mm/memcontrol.c  |  542 +++---
 mm/page-writeback.c  |  215 ++--
 mm/rmap.c|4 +-
 mm/truncate.c|1 +
 10 files changed, 806 insertions(+), 121 deletions(-)
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, 
memory.dirty_background_bytes

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |   92 -
 mm/memcontrol.c|  484 +---
 2 files changed, 540 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 88d3f9e..0602ec9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,55 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+   MEMCG_NR_DIRTYABLE_PAGES,
+   MEMCG_NR_RECLAIM_PAGES,
+   MEMCG_NR_WRITEBACK,
+   MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* File cache pages accounting */
+enum mem_cgroup_write_page_stat_item {
+   MEMCG_NR_FILE_MAPPED,   /* # of pages charged as file rss */
+   MEMCG_NR_FILE_DIRTY,/* # of dirty pages in page cache */
+   MEMCG_NR_FILE_WRITEBACK,/* # of pages under writeback */
+   MEMCG_NR_FILE_WRITEBACK_TEMP,   /* # of pages under writeback using
+  temporary buffers */
+   MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+
+   MEMCG_NR_FILE_NSTAT,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+   int dirty_ratio;
+   int dirty_background_ratio;
+   unsigned long dirty_bytes;
+   unsigned long dirty_background_bytes;
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+   param->dirty_ratio = vm_dirty_ratio;
+   param->dirty_bytes = vm_dirty_bytes;
+   param->dirty_background_ratio = dirty_background_ratio;
+   param->dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +160,25 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup 
*memcg,
 extern int do_swap_account;
 #endif
 
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_vm_dirty_param(struct vm_dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
+
+extern void mem_cgroup_update_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx, bool charge);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat(page, idx, true);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+   mem_cgroup_update_page_stat(page, idx, false);
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
if (mem_cgroup_subsys.disabled)
@@ -124,12 +186,6 @@ static inline bool mem_cgroup_disabled(void)
return false;
 }
 
-enum mem_cgroup_page_stat_item {
-   MEMCG_NR_FILE_MAPPED,
-   MEMCG_NR_FILE_NSTAT,
-};
-
-void mem_cgroup_update_stat(struct page *page, int idx, bool charge);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
@@ -299,8 +355,18 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct 
task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-   int val)
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item 
item)
+{
+   return -ENOSYS;
+}
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
+{
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+   enum mem_cgroup_write_page_stat_item idx)
 {
 }
 
@@ -311,6 +377,16 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone 
*zone, int order,
return 0;
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+   return false;
+}
+
+static inline void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+   get_global_vm_dirty_param(param);
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b7c23ea..91770d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,14 +80,21 @@ enum mem_cgroup_stat_index {
/*
 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss

[Devel] [PATCH -mmotm 5/5] memcg: dirty pages instrumentation

2010-03-30 Thread Andrea Righi
Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

[ NOTE: for now do not account WritebackTmp pages (FUSE) and NILFS2
bounce pages. This depends on charging also bounce pages per cgroup. ]

As a bonus, make determine_dirtyable_memory() static again: this
function isn't used anymore outside page writeback.

Signed-off-by: Andrea Righi ari...@develer.com
---
 fs/nfs/write.c|4 +
 include/linux/writeback.h |2 -
 mm/filemap.c  |1 +
 mm/page-writeback.c   |  215 -
 mm/rmap.c |4 +-
 mm/truncate.c |1 +
 6 files changed, 141 insertions(+), 86 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..3e8b9f8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req)
NFS_PAGE_TAG_COMMIT);
	nfsi->ncommit++;
	spin_unlock(&inode->i_lock);
+   mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +452,7 @@ nfs_clear_request_commit(struct nfs_page *req)
	struct page *page = req->wb_page;
 
	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+   mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
	dec_zone_page_state(page, NR_UNSTABLE_NFS);
	dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1277,6 +1279,8 @@ nfs_commit_list(struct inode *inode, struct list_head 
*head, int how)
	req = nfs_list_entry(head->next);
	nfs_list_remove_request(req);
	nfs_mark_request_commit(req);
+   mem_cgroup_dec_page_stat(req->wb_page,
+   MEMCG_NR_FILE_UNSTABLE_NFS);
	dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
	dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dd9512d..39e4cb2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/mm/filemap.c b/mm/filemap.c
index 62cbac0..bd833fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 * having removed the page entirely.
 */
	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+   mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
	dec_zone_page_state(page, NR_FILE_DIRTY);
	dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab84693..fcac9b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around.  To avoid stressing page reclaim with lots of unreclaimable
+ * pages.  It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+   int node;
+   unsigned long x = 0;
+
+   for_each_node_state(node, N_HIGH_MEMORY) {
+   struct zone *z =
+   NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+   x += zone_page_state(z, NR_FREE_PAGES) +
+zone_reclaimable_pages(z);
+   }
+   /*
+* Make sure that the number of highmem pages is never larger
+* than the number of the total dirtyable memory. This can only
+* occur in very strange VM situations but we want to make sure
+* that this does not occur.
+*/
+   return min(x, total);
+#else
+   return 0;
+#endif
+}
+
+static unsigned long get_global_dirtyable_memory(void)
+{
+   unsigned long memory;
+
+   memory = global_page_state

[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock

2010-03-30 Thread Andrea Righi
On Mon, Mar 15, 2010 at 09:06:38AM +0900, KAMEZAWA Hiroyuki wrote:
 On Mon, 15 Mar 2010 00:26:38 +0100
 Andrea Righi ari...@develer.com wrote:
 
  From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  
  Now, file-mapped is maintained. But a more generic update function
  will be needed for dirty page accounting.
  
  For accounting page status, we have to guarantee lock_page_cgroup()
  will be never called under tree_lock held.
  To guarantee that, we use trylock at updating status.
  By this, we do fuzzy accounting, but in almost all case, it's correct.
  
  Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 Bad patch title... "use trylock for safe accounting in some contexts" ?

OK, sounds better. I just copy & paste the email subject, but the title
was probably related to the old lock_page_cgroup()+irq_disable patch.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v7)

2010-03-30 Thread Andrea Righi
On Mon, Mar 15, 2010 at 11:36:12AM +0900, KAMEZAWA Hiroyuki wrote:
 On Mon, 15 Mar 2010 00:26:37 +0100
 Andrea Righi ari...@develer.com wrote:
 
  Control the maximum amount of dirty pages a cgroup can have at any given 
  time.
  
  Per cgroup dirty limit is like fixing the max amount of dirty (hard to 
  reclaim)
  page cache used by any cgroup. So, in case of multiple cgroup writers, they
  will not be able to consume more than their designated share of dirty pages 
  and
  will be forced to perform write-out if they cross that limit.
  
  The overall design is the following:
  
   - account dirty pages per cgroup
   - limit the number of dirty pages via memory.dirty_ratio / 
  memory.dirty_bytes
 and memory.dirty_background_ratio / memory.dirty_background_bytes in
 cgroupfs
   - start to write-out (background or actively) when the cgroup limits are
 exceeded
  
  This feature is supposed to be strictly connected to any underlying IO
  controller implementation, so we can stop increasing dirty pages in VM layer
  and enforce a write-out before any cgroup will consume the global amount of
  dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
  /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
  
  Changelog (v6 -> v7)
  ~~
   * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
 is never called under tree_lock (no strict accounting, but better overall
 performance)
   * do not account file cache statistics for the root cgroup (zero
 overhead for the root cgroup)
   * fix: evaluate cgroup free pages as at the minimum free pages of all
 its parents
  
  Results
  ~~~
  The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
  1.2GHz:
  
  before
   - root  cgroup:11m51.983s
   - child cgroup:11m56.596s
  
  after
   - root cgroup: 11m51.742s
   - child cgroup:12m5.016s
  
  In the previous version of this patchset, using the complex locking scheme
  with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
  child cgroup required 11m57.896s and 12m9.920s with 
  lock_page_cgroup()+irq_disabled.
  
  With this version there's no overhead for the root cgroup (the small 
  difference
  is in error range). I expected to see less overhead for the child cgroup, 
  I'll
  do more testing and try to figure better what's happening.
  
 Okay, thanks. This seems like a good result. Optimization for children can be
 done under the -mm tree, I think. (If there is no NACK, this seems ready for
 testing in -mm.)

OK, I'll wait a bit to see if someone has other fixes or issues and post
a new version soon including these small changes.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)

2010-03-30 Thread Andrea Righi
On Mon, Mar 15, 2010 at 10:38:41AM -0400, Vivek Goyal wrote:
   
   bdi_thres ~= per_memory_cgroup_dirty * bdi_fraction
   
   But bdi_nr_reclaimable and bdi_nr_writeback stats are still global.
   
  Why bdi_thresh of ROOT cgroup doesn't depend on global number ?
  
 
 I think in current implementation ROOT cgroup bdi_thres is always same
 as global number. It is only for other child groups where it is different
 from global number because of reduced dirytable_memory() limit. And we
 don't seem to be allowing any control on root group. 
 
 But I am wondering, what happens in following case.
 
 IIUC, with use_hierarchy=0, if I create two test groups test1 and test2, then
 hierarchy looks as follows.
 
   root  test1  test2
 
 Now root group's DIRTYABLE is still system wide but test1 and test2's
 dirtyable will be reduced based on RES_LIMIT in those groups.
 
 Conceptually, per cgroup dirty ratio is like fixing page cache share of
 each group. So effectively we are saying that these limits apply to only
 child group of root but not to root as such?

Correct. In this implementation the root cgroup means everything outside
all the other cgroups. I think this can be acceptable behaviour, since in
general we don't set any limit on the root cgroup.
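
A minimal usage sketch of the proposed interface (illustrative only; the
mount point is just an example, memory.limit_in_bytes and tasks are the
existing memcg/cgroup files, the memory.dirty_* files are the ones
introduced by this patchset):

    mount -t cgroup -o memory none /cgroup/memory    # example mount point
    mkdir /cgroup/memory/test1
    echo $((100 * 1024 * 1024)) > /cgroup/memory/test1/memory.limit_in_bytes
    echo 10 > /cgroup/memory/test1/memory.dirty_ratio             # direct writeout at 10%
    echo 5  > /cgroup/memory/test1/memory.dirty_background_ratio  # background writeback at 5%
    echo $$ > /cgroup/memory/test1/tasks                          # move the current shell in

The root cgroup is left alone, so tasks outside test1 keep following the
global /proc/sys/vm/dirty_* settings.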

  
   So for the same number of dirty pages system wide on this bdi, we will be
   triggering writeouts much more aggressively if somebody has created few
   memory cgroups and tasks are running in those cgroups.
   
   I guess it might cause performance regressions in case of small file
   writeouts because previously one could have written the file to cache and
   be done with it but with this patch set, there are higher chances that
   you will be throttled to write the pages back to disk.
   
   I guess we need two pieces to resolve this.
 - BDI stats per cgroup.
 - Writeback of inodes from same cgroup.
   
   I think BDI stats per cgroup will increase the complextiy.
   
  Thank you for the clarification. IIUC, the dirty_limit implementation should
  assume there is an I/O resource controller; usual users will probably use the
  I/O resource controller and memcg at the same time.
  Then, my question is what happens when used with I/O resource controller ?
  
 
 Currently IO resource controller keep all the async IO queues in root
 group so we can't measure exactly. But my guess is until and unless we
 at least implement writeback inodes from same cgroup we will not see
 increased flow of writes from one cgroup over other cgroup.

Agreed. And I plan to look at the writeback inodes per cgroup feature
soon. I'm sorry, but I have some deadlines this week, so I'll probably
start working on this next weekend.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Tue, Mar 16, 2010 at 10:11:50AM -0400, Vivek Goyal wrote:
 On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote:
 
 [..]
   + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
   + * @item:memory statistic item exported to the kernel
   + *
   + * Return the accounted statistic value, or a negative value in case of 
   error.
   + */
   +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
   +{
   + struct mem_cgroup_page_stat stat = {};
   + struct mem_cgroup *mem;
   +
   + rcu_read_lock();
   + mem = mem_cgroup_from_task(current);
   + if (mem && !mem_cgroup_is_root(mem)) {
   + /*
   +  * If we're looking for dirtyable pages we need to evaluate
   +  * free pages depending on the limit and usage of the parents
   +  * first of all.
   +  */
   + if (item == MEMCG_NR_DIRTYABLE_PAGES)
   + stat.value = memcg_get_hierarchical_free_pages(mem);
   + /*
   +  * Recursively evaluate page statistics against all cgroup
   +  * under hierarchy tree
   +  */
   + stat.item = item;
   + mem_cgroup_walk_tree(mem, stat, mem_cgroup_page_stat_cb);
   + } else
   + stat.value = -EINVAL;
   + rcu_read_unlock();
   +
   + return stat.value;
   +}
   +
  hmm, mem_cgroup_page_stat() can return negative value, but you place 
  BUG_ON()
  in [5/5] to check it returns negative value. What happens if the current is 
  moved
  to root between mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat() ?
  How about making mem_cgroup_has_dirty_limit() return the target mem_cgroup, 
  and
  passing the mem_cgroup to mem_cgroup_page_stat() ?
  
 
 Hmm, if mem_cgroup_has_dirty_limit() returns a pointer to the memcg, then one
 shall have to use rcu_read_lock() and that will look ugly.
 
 Why don't we simply look at the return value and if it is negative, we
 fall back to using global stats and get rid of BUG_ON()?

I vote for this one. IMHO the caller of mem_cgroup_page_stat() should
fall back to the equivalent global stats. This keeps things separated
and leaves only the memcg stuff in mm/memcontrol.c.
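
A rough sketch of that fallback (not compiled, just to illustrate the
idea; the wrapper name memcg_dirtyable_memory() is hypothetical, the
other helpers are the ones from this patchset and page-writeback.c):

    static unsigned long memcg_dirtyable_memory(void)
    {
    	s64 ret = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);

    	/* task in root cgroup, memcg disabled, ...: fall back to global stats */
    	if (ret < 0)
    		return determine_dirtyable_memory();
    	return (unsigned long)ret;
    }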

 
 Or, modify mem_cgroup_page_stat() to return global stats if it can't
 determine per cgroup stat for some reason. (mem=NULL or root cgroup etc).
 
 Vivek

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 2/5] memcg: dirty memory documentation

2010-03-30 Thread Andrea Righi
On Tue, Mar 16, 2010 at 04:41:21PM +0900, Daisuke Nishimura wrote:
 On Mon, 15 Mar 2010 00:26:39 +0100, Andrea Righi ari...@develer.com wrote:
  Document cgroup dirty memory interfaces and statistics.
  
  Signed-off-by: Andrea Righi ari...@develer.com
  ---
   Documentation/cgroups/memory.txt |   36 
  
   1 files changed, 36 insertions(+), 0 deletions(-)
  
  diff --git a/Documentation/cgroups/memory.txt 
  b/Documentation/cgroups/memory.txt
  index 49f86f3..38ca499 100644
  --- a/Documentation/cgroups/memory.txt
  +++ b/Documentation/cgroups/memory.txt
  @@ -310,6 +310,11 @@ cache  - # of bytes of page cache memory.
   rss- # of bytes of anonymous and swap cache memory.
   pgpgin - # of pages paged in (equivalent to # of charging 
  events).
   pgpgout- # of pages paged out (equivalent to # of uncharging 
  events).
  +filedirty  - # of pages that are waiting to get written back to the disk.
  +writeback  - # of pages that are actively being written back to the disk.
  +writeback_tmp  - # of pages used by FUSE for temporary writeback 
  buffers.
  +nfs- # of NFS pages sent to the server, but not yet 
  committed to
  + the actual storage.
   active_anon- # of bytes of anonymous and  swap cache memory on 
  active
lru list.
   inactive_anon  - # of bytes of anonymous memory and swap cache memory 
  on
  @@ -345,6 +350,37 @@ Note:
 - a cgroup which uses hierarchy and it has child cgroup.
 - a cgroup which uses hierarchy and not the root of hierarchy.
   
  +5.4 dirty memory
  +
  +  Control the maximum amount of dirty pages a cgroup can have at any given 
  time.
  +
  +  Limiting dirty memory is like fixing the max amount of dirty (hard to
  +  reclaim) page cache used by any cgroup. So, in case of multiple cgroup 
  writers,
  +  they will not be able to consume more than their designated share of 
  dirty
  +  pages and will be forced to perform write-out if they cross that limit.
  +
  +  The interface is equivalent to the procfs interface: 
  /proc/sys/vm/dirty_*.
  +  It is possible to configure a limit to trigger both a direct writeback 
  or a
  +  background writeback performed by per-bdi flusher threads.
  +
  +  Per-cgroup dirty limits can be set using the following files in the 
  cgroupfs:
  +
  +  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
  +amount of dirty memory at which a process which is generating disk 
  writes
  +inside the cgroup will start itself writing out dirty data.
  +
  +  - memory.dirty_bytes: the amount of dirty memory of the cgroup 
  (expressed in
  +bytes) at which a process generating disk writes will start itself 
  writing
  +out dirty data.
  +
  +  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
  +memory, the amount of dirty memory at which background writeback kernel
  +threads will start writing out dirty data.
  +
  +  - memory.dirty_background_bytes: the amount of dirty memory of the 
  cgroup (in
  +bytes) at which background writeback kernel threads will start writing 
  out
  +dirty data.
  +
   
 It would be better to note what those files mean for the root cgroup.
 We cannot write any value to them, IOW, we cannot control the dirty limit
 of the root cgroup.

OK.

 And they show the same value as the global one (strictly speaking, that's not
 true, because the global values can change. We need a hook in
 mem_cgroup_dirty_read()?).

OK, we can just return the system-wide value when mem_cgroup_is_root() in
mem_cgroup_dirty_read(). Will change this in the next version.
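
Something along these lines (just a sketch, not the final code; the
MEMCG_FILE_* identifiers and the get_memcg_vm_dirty_param() helper are
hypothetical names used only for illustration, the rest comes from the
patchset):

    static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
    {
    	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
    	struct vm_dirty_param param;

    	if (mem_cgroup_is_root(memcg))
    		get_global_vm_dirty_param(&param);	/* mirror vm.dirty_* */
    	else
    		get_memcg_vm_dirty_param(memcg, &param);	/* per-cgroup values */

    	switch (cft->private) {
    	case MEMCG_FILE_DIRTY_RATIO:
    		return param.dirty_ratio;
    	case MEMCG_FILE_DIRTY_BYTES:
    		return param.dirty_bytes;
    	case MEMCG_FILE_DIRTY_BACKGROUND_RATIO:
    		return param.dirty_background_ratio;
    	default: /* MEMCG_FILE_DIRTY_BACKGROUND_BYTES */
    		return param.dirty_background_bytes;
    	}
    }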

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure

2010-03-30 Thread Andrea Righi
On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote:
[snip]
  @@ -3190,10 +3512,14 @@ struct {
   } memcg_stat_strings[NR_MCS_STAT] = {
   {"cache", "total_cache"},
   {"rss", "total_rss"},
  -   {"mapped_file", "total_mapped_file"},
   {"pgpgin", "total_pgpgin"},
   {"pgpgout", "total_pgpgout"},
   {"swap", "total_swap"},
  +   {"mapped_file", "total_mapped_file"},
  +   {"filedirty", "dirty_pages"},
  +   {"writeback", "writeback_pages"},
  +   {"writeback_tmp", "writeback_temp_pages"},
  +   {"nfs", "nfs_unstable"},
   {"inactive_anon", "total_inactive_anon"},
   {"active_anon", "total_active_anon"},
   {"inactive_file", "total_inactive_file"},
 Why not use "total_xxx" for total_name?

Agreed. It would definitely be more clear. Balbir, KAME-san, what do you
think?

 
  @@ -3212,8 +3538,6 @@ static int mem_cgroup_get_local_stat(struct 
  mem_cgroup *mem, void *data)
  s-stat[MCS_CACHE] += val * PAGE_SIZE;
  val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
  s-stat[MCS_RSS] += val * PAGE_SIZE;
  -   val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
  -   s-stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
  val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
  s-stat[MCS_PGPGIN] += val;
  val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGOUT_COUNT);
  @@ -3222,6 +3546,16 @@ static int mem_cgroup_get_local_stat(struct 
  mem_cgroup *mem, void *data)
  val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
  s-stat[MCS_SWAP] += val * PAGE_SIZE;
  }
  +   val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
  +   s-stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
  +   val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
  +   s-stat[MCS_FILE_DIRTY] += val;
  +   val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
  +   s-stat[MCS_WRITEBACK] += val;
  +   val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
  +   s-stat[MCS_WRITEBACK_TEMP] += val;
  +   val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
  +   s-stat[MCS_UNSTABLE_NFS] += val;
   
 I don't have a strong objection, but I prefer showing them in bytes.
 And can you add to mem_cgroup_stat_show() something like:
 
   for (i = 0; i  NR_MCS_STAT; i++) {
   if (i == MCS_SWAP  !do_swap_account)
   continue;
 + if (i >= MCS_FILE_STAT_START && i <= MCS_FILE_STAT_END &&
 +     mem_cgroup_is_root(mem_cont))
 +         continue;
   cb-fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
   }

I like this. And I also prefer to show these values in bytes.

 
 so as not to show the file stats in the root cgroup? It's a meaningless value anyway.
 Of course, you'd better mention it in [2/5] too.

OK.

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

2009-10-17 Thread Andrea Righi
On Mon, Oct 12, 2009 at 05:11:20PM -0400, Vivek Goyal wrote:

[snip]

 I modified my report scripts to also output aggreagate iops numbers and
 remove max-bandwidth and min-bandwidth numbers. So for same tests and same
 results I am now reporting iops numbers also. ( I have not re-run the
 tests.)
 
 IO scheduler controller + CFQ
 ---
 [Multiple Random Reader][Sequential Reader] 
 nr  Agg-bandw Max-latency Agg-iops  nr  Agg-bandw Max-latency Agg-iops  
 1   223KB/s   132K usec   551   5551KB/s  129K usec   1387  
 2   190KB/s   154K usec   461   5718KB/s  122K usec   1429  
 4   445KB/s   208K usec   111   1   5909KB/s  116K usec   1477  
 8   158KB/s   2820 msec   361   5445KB/s  168K usec   1361  
 16  145KB/s   5963 msec   281   5418KB/s  164K usec   1354  
 32  139KB/s   12762 msec  231   5398KB/s  175K usec   1349  
 
 io-throttle + CFQ
 ---
 BW limit group1=10 MB/s BW limit group2=10 MB/s 
 [Multiple Random Reader][Sequential Reader] 
 nr  Agg-bandw Max-latency Agg-iops  nr  Agg-bandw Max-latency Agg-iops  
 1   36KB/s218K usec   9 1   8006KB/s  20529 usec  2001  
 2   360KB/s   228K usec   891   7475KB/s  33665 usec  1868  
 4   699KB/s   262K usec   173   1   6800KB/s  46224 usec  1700  
 8   573KB/s   1800K usec  139   1   2835KB/s  885K usec   708   
 16  294KB/s   3590 msec   681   437KB/s   1855K usec  109   
 32  980KB/s   2861K usec  230   1   1145KB/s  1952K usec  286   
 
 Note that in case of random reader groups, iops are really small. Few
 thougts.
 
 - What should be the iops limit I should choose for the group. Lets say if
   I choose 80, then things should be better for sequential reader group,
   but just think of what will happen to random reader group. Especially,
   if nature of workload in group1 changes to sequential. Group1 will
   simply be killed.
 
   So yes, one can limit a group both by BW as well as iops-max, but this
   requires you to know in advance exactly what workload is running in the
   group. The moment workoload changes, these settings might have a very
   bad effects.
 
   So my biggest concern with max-bwidth and max-iops limits is that how
   will one configure the system for a dynamic environment. Think of two
   virtual machines being used by two customers. At one point they might be
   doing some copy operation and running sequential workload an later some
   webserver or database query might be doing some random read operations.

The main problem IMHO is how to accurately evaluate the cost of an IO
operation. On rotational media, for example, the cost of reading two distant
blocks is not the same as the cost of reading two contiguous blocks (while on
a flash/SSD drive the cost is probably the same).

io-throttle tries to quantify the cost in absolute terms (iops and BW),
but this is not enough to cover all the possible cases. For example, you
could hit a physical disk limit because the workload is too seeky, even
if the iops and BW numbers are low.
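
A back-of-the-envelope example of that case (numbers are only
illustrative): with ~10ms of average seek + rotational latency per random
access, a rotational disk tops out around

    1 / 0.010s = 100 IOPS  ->  100 * 4KB = ~400KB/s

for a purely random 4KB read workload, so the disk is already saturated
while both figures stay far below, say, a 2150 iops / 10MB/s limit.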

 
 - Notice the interesting case of 16 random readers. iops for random reader
   group is really low, but still the throughput and iops of sequential
   reader group is very bad. I suspect that at CFQ level, some kind of
   mixup has taken place where we have not enabled idling for sequential
   reader and disk became seek bound hence both the group are loosing.
   (Just a guess)

Yes, my guess is the same.

I've re-run some of your tests using a SSD (a MOBI MTRON MSD-PATA3018-ZIF1),
but changing few parameters: I used a larger block size for the
sequential workload (there's no need to reduce the block size of the
single reads if we suppose to read a lot of contiguous blocks).

And for all the io-throttle tests I switched to noop scheduler (CFQ must
be changed to be cgroup-aware before using it together with io-throttle,
otherwise the result is that one simply breaks the logic of the other).

=== io-throttle settings ===
cgroup #1: max-bw 10MB/s, max-iops 2150 iop/s
cgroup #2: max-bw 10MB/s, max-iops 2150 iop/s

During the tests I used a larger block size for sequential readers,
respect to the random readers:

sequential-read:	block size = 1MB
random-read:		block size = 4KB

sequential-readers vs sequential-reader
===
[ cgroup #1 workload ]
fio_args="--rw=read --bs=1M --size=512M --runtime=30 --numjobs=N --direct=1"
[ cgroup #2 workload ]
fio_args="--rw=read --bs=1M --size=512M --runtime=30 --numjobs=1 --direct=1"
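
(A minimal sketch of how such a run can be launched from inside each
cgroup; the /cgroup/blockio/... paths are hypothetical and fio's
mandatory --name argument is added here:)

    sh -c 'echo $$ > /cgroup/blockio/grp1/tasks; \
           exec fio --name=grp1 --rw=read --bs=1M --size=512M \
                    --runtime=30 --numjobs=4 --direct=1' &
    sh -c 'echo $$ > /cgroup/blockio/grp2/tasks; \
           exec fio --name=grp2 --rw=read --bs=1M --size=512M \
                    --runtime=30 --numjobs=1 --direct=1' &
    wait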

__2.6.32-rc5__
[   cgroup #1   ]   [   cgroup #2   ]
tasks   aggr-bw tasks   aggr-bw
1   36210KB/s   1   36992KB/s
2   47558KB/s   1   24479KB/s
4   57587KB/s   1   14809KB/s
8   64667KB/s   1   8393KB/s


[Devel] Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

2009-10-10 Thread Andrea Righi
 ===
 This time run multiple buffered writers in group1 and see run a single
 buffered writer in other group and see if we can provide fairness and
 isolation.
 
 Vanilla CFQ
 
 [Multiple Buffered Writer][Buffered Writer] 
 nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
 1   68997KB/s 68997KB/s 67380KB/s 645K usec   1   67122KB/s 567K usec   
 2   47509KB/s 46218KB/s 91510KB/s 865K usec   1   45118KB/s 865K usec   
 4   28002KB/s 26906KB/s 105MB/s   1649K usec  1   26879KB/s 1643K usec  
 8   15985KB/s 14849KB/s 117MB/s   943K usec   1   15653KB/s 766K usec   
 16  11567KB/s 6881KB/s  128MB/s   1174K usec  1   7333KB/s  947K usec   
 32  5877KB/s  3649KB/s  130MB/s   1205K usec  1   5142KB/s  988K usec   
 
 IO scheduler controller + CFQ
 ---
 [Multiple Buffered Writer][Buffered Writer] 
 nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
 1   68580KB/s 68580KB/s 66972KB/s 2901K usec  1   67194KB/s 2901K usec  
 2   47419KB/s 45700KB/s 90936KB/s 3149K usec  1   44628KB/s 2377K usec  
 4   27825KB/s 27274KB/s 105MB/s   1177K usec  1   27584KB/s 1177K usec  
 8   15382KB/s 14288KB/s 114MB/s   1539K usec  1   14794KB/s 783K usec   
 16  9161KB/s  7592KB/s  124MB/s   3177K usec  1   7713KB/s  886K usec   
 32  4928KB/s  3961KB/s  126MB/s   1152K usec  1   6465KB/s  4510K usec  
 
 Notes:
 - It does not work. Buffered writer in second group are being overwhelmed
   by writers in group1.
 
 - This is a limitation of IO scheduler based controller currently as page
   cache at higher layer evens out the traffic and does not throw more
   traffic from higher weight group.
 
 - This is something that needs more work at higher layers, like dirty limits
   per cgroup in the memory controller and the method to writeout buffered 
   pages belonging to a particular memory cgroup. This is still being
   brainstormed.
 
 io-throttle + CFQ
 ---
 BW limit group1=30 MB/s   BW limit group2=30 MB/s   
 [Multiple Buffered Writer][Buffered Writer] 
 nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
 1   33863KB/s 33863KB/s 33070KB/s 3046K usec  1   25165KB/s 13248K usec 
 2   13457KB/s 12906KB/s 25745KB/s 9286K usec  1   29958KB/s 3736K usec  
 4   7414KB/s  6543KB/s  27145KB/s 10557K usec 1   30968KB/s 8356K usec  
 8   3562KB/s  2640KB/s  24430KB/s 12012K usec 1   30801KB/s 7037K usec  
 16  3962KB/s  881KB/s   26632KB/s 12650K usec 1   31150KB/s 7173K usec  
 32  3275KB/s  406KB/s   27295KB/s 14609K usec 1   26328KB/s 8069K usec  
 
 Notes:
 - This seems to work well here. io-throttle is throttling the writers
   before they write too much of data in page cache. One side effect of
   this seems to be that now a process will not be allowed to write at
   memory speed in page cache and will be limited to disk IO speed limits
   set for the cgroup.
 
   Andrea is thinking of removing throttling in balance_dirty_pages() to allow
   writing at disk speed till we hit dirty_limits. But removing it leads
   to a different issue where too many dirty pages from a single group can
   be present from a cgroup in page cache and if that cgroup is a slow moving
   one, then pages are flushed to disk at slower speed, delaying other
   higher rate cgroups. (all discussed in private mails with Andrea).

I confirm this. :) But IMHO before removing the throttling in
balance_dirty_pages() we really need the per-cgroup dirty limit / dirty
page cache quota.
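
Roughly the kind of check balance_dirty_pages() would gain with the
per-cgroup dirty limit series (just a sketch; memcg_over_dirty_limit()
is a hypothetical helper, the other names come from that patchset):

    static bool memcg_over_dirty_limit(void)
    {
    	struct vm_dirty_param param;
    	s64 dirty, thresh;

    	if (!mem_cgroup_has_dirty_limit())
    		return false;	/* root cgroup: keep the global logic */

    	get_vm_dirty_param(&param);
    	dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);

    	if (param.dirty_bytes)
    		thresh = param.dirty_bytes / PAGE_SIZE;
    	else
    		thresh = param.dirty_ratio *
    			 mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES) / 100;

    	return dirty > thresh;	/* true: start writeback and throttle */
    }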

 
 
 ioprio class and iopriority with-in cgroups issues with IO-throttle
 ===
 
 Currently throttling logic is designed in such a way that it makes the
 throttling uniform for every process in the group. So we will lose the
 differentiation between different classes of processes or the differentiation
 between different priorities of processes within the group.
 
 I have run the tests of these in the past and reported it here in the
 past.
 
 https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html
 
 Thanks
 Vivek

-- 
Andrea Righi - Develer s.r.l
http://www.develer.com
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: More performance numbers (Was: Re: IO scheduler based IO controller V10)

2009-10-08 Thread Andrea Righi
On Thu, Oct 08, 2009 at 12:42:51AM -0400, Vivek Goyal wrote:
 Apart from the IO scheduler controller numbers, I also got a chance to run the same
 tests with dm-ioband controller. I am posting these too. I am also
 planning to run similar numbers on Andrea's max bw controller also.
 Should be able to post those numbers also in 2-3 days.

For those who are interested (especially to help Vivek test all this
stuff), here is the all-in-one patchset of the io-throttle controller,
rebased to 2.6.31:
http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch

And this one is v18 rebased to 2.6.32-rc3:
http://www.develer.com/~arighi/linux/patches/io-throttle/cgroup-io-throttle-v18.patch

Thanks,
-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-27 Thread Andrea Righi
On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:
  I think that only putting the hook in try_to_unmap() doesn't work
  correctly, because IOs will be charged to reclaiming processes or
  kswapd. These IOs should be charged to processes which cause memory
  pressure.
 
 Consider the following case:
 
   (1) There are two processes Proc-A and Proc-B.
   (2) Proc-A maps a large file into many pages by mmap() and writes
   many data to the file.
   (3) After (2), Proc-B try to get a page, but there are no available
   pages because Proc-A has used them.
   (4) kernel starts to reclaim pages, call try_to_unmap() to unmap
   a page which is owned by Proc-A, then blkio_cgroup_set_owner()
   sets Proc-B's ID on the page because the task's context is Proc-B.
   (5) After (4), kernel writes the page out to a disk. This IO is
   charged to Proc-B.
 
 In the above case, I think that the IO should be charged to a Proc-A,
 because the IO is caused by Proc-A's memory pressure. 
 I think we should consider in the case without memory and swap
 isolation.

mmmh.. even if they're strictly related I think we're mixing two
different problems in this way: memory pressure control and IO control.

It seems you're proposing something like the badness() for OOM
conditions to charge swap IO depending on how bad is a cgroup in terms
of memory consumption. I don't think this is the right way to proceed,
also because we already have the memory and swap control.

-Andrea
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-18 Thread Andrea Righi
On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
 On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
  On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
   On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
 Vivek Goyal wrote:
 ...
   }
  @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
   /*
* Find the io group bio belongs to.
* If create is set, io group is created if it is not already 
  present.
  + * If curr is set, io group is information is searched for 
  current
  + * task and not with the help of bio.
  + *
  + * FIXME: Can we assume that if bio is NULL then lookup group for 
  current
  + * task and not create extra function parameter ?
*
  - * Note: There is a narrow window of race where a group is being 
  freed
  - * by cgroup deletion path and some rq has slipped through in this 
  group.
  - * Fix it.
*/
  -struct io_group *io_get_io_group_bio(struct request_queue *q, 
  struct bio *bio,
  -   int create)
  +struct io_group *io_get_io_group(struct request_queue *q, struct 
  bio *bio,
  +   int create, int curr)
 
   Hi Vivek,
 
   IIUC we can get rid of curr, and just determine iog from bio. If 
 bio is not NULL,
   get iog from bio, otherwise get it from current task.

Consider also that get_cgroup_from_bio() is much slower than
task_cgroup() and needs to lock/unlock_page_cgroup() in
get_blkio_cgroup_id(), while task_cgroup() is RCU protected.

   
   True.
   
BTW another optimization could be to use the blkio-cgroup functionality
only for dirty pages and cut out some blkio_set_owner(). For all the
other cases IO always occurs in the same context of the current task,
and you can use task_cgroup().

   
   Yes, may be in some cases we can avoid setting page owner. I will get
   to it once I have got functionality going well. In the mean time if
   you have a patch for it, it will be great.
   
However, this is true only for page cache pages, for IO generated by
anonymous pages (swap) you still need the page tracking functionality
both for reads and writes.

   
   Right now I am assuming that all the sync IO will belong to task
   submitting the bio hence use task_cgroup() for that. Only for async
   IO, I am trying to use page tracking functionality to determine the owner.
   Look at elv_bio_sync(bio).
   
   You seem to be saying that there are cases where even for sync IO, we
   can't use submitting task's context and need to rely on page tracking
   functionlity? In case of getting page (read) from swap, will it not happen
   in the context of process who will take a page fault and initiate the
   swap read?
  
  No, for example in read_swap_cache_async():
  
  @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, 
  gfp_t gfp_mask,
   */
  __set_page_locked(new_page);
  SetPageSwapBacked(new_page);
  +   blkio_cgroup_set_owner(new_page, current->mm);
  	err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
  if (likely(!err)) {
  /*
  
  This is a read, but the current task is not always the owner of this
  swap cache page, because it's a readahead operation.
  
 
 But will this readahead be not initiated in the context of the task taking
 the page fault?
 
 handle_pte_fault()
   do_swap_page()
   swapin_readahead()
   read_swap_cache_async()
 
 If yes, then swap reads issued will still be in the context of process and
 we should be fine?

Right. I was trying to say that the current task may also swap in pages
belonging to a different task, so from a certain point of view it's not
entirely fair to charge the current task for the whole activity. But OK, I
think it's a minor issue.

 
  Anyway, this is a minor corner case I think. And probably it is safe to
  consider this like any other read IO and get rid of the
  blkio_cgroup_set_owner().
 
 Agreed.
 
  
  I wonder if it would be better to attach the blkio_cgroup to the
  anonymous page only when swap-out occurs.
 
 Swap seems to be an interesting case in general. Somebody raised this
 question on lwn io controller article also. A user process never asked
 for swap activity. It is something enforced by kernel. So while doing
 some swap outs, it does not seem too fair to charge the write out to
 the process page belongs to and the fact of the matter may be that there
 is some other memory hungry application which is forcing these swap outs.
 
 Keeping this in mind, should swap activity be considered as system
 activity and be charged to root group instead of to user tasks in other
 cgroups
