[Devel] Re: containers and cgroups mini-summit @ Linux Plumbers
On Wed, Jul 25, 2012 at 02:00:41PM +0400, Glauber Costa wrote:

On 07/25/2012 02:00 PM, Eric W. Biederman wrote:

Glauber Costa glom...@parallels.com writes:

On 07/12/2012 01:41 AM, Kir Kolyshkin wrote:

Gentlemen,

We are organizing a containers mini-summit during the next Linux Plumbers (San Diego, August 29-31). The idea is to gather and discuss everything relevant to namespaces, cgroups, resource management, checkpoint-restore and so on. We are trying to come up with a list of topics to discuss, so please reply with topic suggestions, and let me know if you are going to come. I probably forgot a few more people (for instance, I am not sure who else from Google is working on cgroups), so feel free to forward this to anyone you believe should go, or just let me know whom I missed.

Regards, Kir.

BTW, sorry for not replying before (vacations + post-vacation laziness). I would be interested in adding /proc virtualization to the discussion. By now it seems userspace would be the best place for that to happen, in a FUSE overlay. I know Daniel has an initial implementation of that, and it would be good to have it as a library that both OpenVZ and LXC (and whoever else wants) can use. Shouldn't take much time...

What would you need proc virtualization for?

proc provides a lot of information that userspace tools rely upon. For instance, when running top, you can draw per-process figures from what we have now, but you can't make sense of percentages without aggregating container-wide information. When you read /proc/cpuinfo, as well, you would expect to see something that matches your container configuration. free is another example. The list goes on.

Another interesting feature IMHO would be the per-cgroup loadavg. A typical use case could be a monitoring system that wants to know which containers are more overloaded than others, instead of using a single system-wide measure in /proc/loadavg.

-Andrea

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote:

I think we should accept to have an inode granularity. We could redesign the writeback code to work per-cgroup / per-page, etc., but that would add a huge overhead. The limit of inode granularity could be an acceptable tradeoff; cgroups are usually supposed to work on different files, well.. except when databases come into play (ouch!).

Agreed. Granularity at the per-inode level might be acceptable in many cases. Again, I am worried about a faster group getting stuck behind a slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at the wrong layer. Should ASYNC IO be throttled before we allow a task to write to the page cache? The way we throttle the process based on the dirty ratio, can we just check for throttle limits there as well, or something like that? (I think that's what you had done in your initial throttling controller implementation?)

Right. This is exactly the same approach I used in my old throttling controller: throttle sync READs and WRITEs at the block layer, and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of a faster group getting blocked by a slower group, but the controller will be a little more leaky, because the writeback IO will never be throttled and we'll see some limited IO spikes during writeback. However, IMHO this is still a better solution than the current implementation, which is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth testing.

-Andrea

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages
On Tue, Feb 22, 2011 at 07:07:19PM -0500, Vivek Goyal wrote:

On Wed, Feb 23, 2011 at 12:05:34AM +0100, Andrea Righi wrote:

On Tue, Feb 22, 2011 at 04:00:30PM -0500, Vivek Goyal wrote:

On Tue, Feb 22, 2011 at 06:12:55PM +0100, Andrea Righi wrote:

Add the tracking of buffered (writeback) and anonymous pages. Dirty pages in the page cache can be processed asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context than that of the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicated, compared with synchronous IO, from the blkio controller's point of view.

The idea is to save the cgroup owner of each anonymous page and of each dirty page in the page cache. A page is associated with a cgroup the first time it is dirtied in memory (for file cache pages) or when it is set as swap-backed (for anonymous pages). This information is stored using the page_cgroup functionality. Then, at the block layer, it is possible to retrieve the throttle group by looking at bio_page(bio). If the page was not explicitly associated with any cgroup, the IO operation is charged to the current task/cgroup, as was done by the previous implementation.
Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-throttle.c   |   87 +++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -8,6 +8,10 @@
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include <linux/bio.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/page_cgroup.h>
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>

@@ -221,6 +225,85 @@ done:
 	return tg;
 }

+static inline bool is_kernel_io(void)
+{
+	return !!(current->flags & (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC));
+}
+
+static int throtl_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *blkcg;
+	unsigned short id = 0;
+
+	if (blkio_cgroup_disabled())
+		return 0;
+	if (!mm)
+		goto out;
+	rcu_read_lock();
+	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
+	if (likely(blkcg))
+		id = css_id(&blkcg->css);
+	rcu_read_unlock();
+out:
+	return page_cgroup_set_owner(page, id);
+}
+
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
+{
+	return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
+
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
+{
+	if (is_kernel_io() || !page_is_file_cache(page))
+		return 0;
+	return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_filepage_owner);

Why are we exporting all these symbols?

Right. Probably a single one is enough:

int blk_throtl_set_page_owner(struct page *page, struct mm_struct *mm, bool anon);

Who is going to use this single export? Which module?

I was actually thinking of some filesystem modules, but I was wrong, because at the moment no one needs the export. I'll remove it in the next version of the patch.
Thanks,
-Andrea
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Wed, Feb 23, 2011 at 01:49:10PM +0900, KAMEZAWA Hiroyuki wrote:

On Wed, 23 Feb 2011 00:37:18 +0100 Andrea Righi ari...@develer.com wrote:

On Tue, Feb 22, 2011 at 06:06:30PM -0500, Vivek Goyal wrote:

On Wed, Feb 23, 2011 at 12:01:47AM +0100, Andrea Righi wrote:

On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote:

On Tue, 22 Feb 2011 18:12:54 +0100 Andrea Righi ari...@develer.com wrote:

The page_cgroup infrastructure, currently available only for the memory cgroup controller, can be used to store the owner of each page and suitably track the writeback IO. This information is encoded in the upper 16 bits of page_cgroup->flags. An owner can be identified using a generic ID number, and the following interfaces are provided to store and retrieve this information:

unsigned long page_cgroup_get_owner(struct page *page);
int page_cgroup_set_owner(struct page *page, unsigned long id);
int page_cgroup_copy_owner(struct page *npage, struct page *opage);

My immediate observation is that you're not really tracking the owner here - you're tracking an opaque 16-bit token known only to the block controller in a field which - if changed by anybody other than the block controller - will lead to mayhem in the block controller. I think it might be clearer - and safer - to say "blkcg" or some such instead of "owner" here.

Basically the idea here was to be as generic as possible and make this feature potentially available to other subsystems as well, so that cgroup subsystems may represent whatever they want with the 16-bit token. However, no more than a single subsystem may be able to use this feature at the same time.

I'm tempted to say it might be better to just add a pointer to your throtl_grp structure into struct page_cgroup. Or maybe replace the mem_cgroup pointer with a single pointer to struct css_set. Both of those ideas, though, probably just add unwanted extra overhead now to gain generality which may or may not be wanted in the future.

The pointer to css_set sounds good, but it would add additional space to the page_cgroup struct. Right now page_cgroup is 40 bytes (on a 64-bit arch) and all of them are allocated at boot time. Using unused bits in page_cgroup->flags is a choice with no overhead from this point of view.

I think Jon suggested replacing the mem_cgroup pointer with css_set so that the size of the structure does not increase, but it leads to an extra level of indirection.

OK, got it, sorry. So, IIUC, we save the css_set pointer and get a struct cgroup as follows:

struct cgroup *cgrp = css_set->subsys[subsys_id]->cgroup;

Then, for example, to get the mem_cgroup reference:

struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

It seems like a lot of indirection, but I may have done something wrong or there could be a simpler way to do it.

Then page_cgroup would need to hold a reference count on the css_set and make tons of atomic ops.

BTW, bits of pc->flags are used for storing the section ID or node ID. Please clarify that your 16 bits never break that information. And please keep 4-5 more flags free for dirty_ratio support of memcg.

OK, I didn't see the recent work about the section and node id encoded in pc->flags, thanks. So it'd probably be better to rebase the patch on the latest mmotm to check all this stuff.

I wonder if I can turn pc->mem_cgroup into pc->memid (16 bits); then:

==
static inline struct mem_cgroup *get_memcg_from_pc(struct page_cgroup *pc)
{
	struct cgroup_subsys_state *css = css_lookup(&mem_cgroup_subsys, pc->memid);
	return container_of(css, struct mem_cgroup, css);
}
==

Overhead will be seen at updating file statistics and LRU management.

But, hmm, can't you do that tracking without page_cgroup? Because the number of dirty/writeback pages is far smaller than the total number of pages, chasing the I/O with a dynamic structure is not very bad.. Is preparing a [pfn -> blkio] record table and moving that information into struct bio dynamically very difficult?

This would be ok for dirty pages, but consider that we're also tracking anonymous pages.
So, if we want to control the swap IO, we actually need to save this information for a lot of pages, and in the end I think we'd basically duplicate the page_cgroup code.

Thanks,
-Andrea
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:

Agreed. Granularity at the per-inode level might be acceptable in many cases. Again, I am worried about a faster group getting stuck behind a slower group. I am wondering if we are trying to solve the problem of ASYNC write throttling at the wrong layer. Should ASYNC IO be throttled before we allow a task to write to the page cache? The way we throttle the process based on the dirty ratio, can we just check for throttle limits there as well, or something like that? (I think that's what you had done in your initial throttling controller implementation?)

Right. This is exactly the same approach I used in my old throttling controller: throttle sync READs and WRITEs at the block layer, and async WRITEs when the task is dirtying memory pages. This is probably the simplest way to resolve the problem of a faster group getting blocked by a slower group, but the controller will be a little more leaky, because the writeback IO will never be throttled and we'll see some limited IO spikes during writeback.

Yes, writeback will not be throttled. Not sure how big a problem that is.

- We have controlled the input rate, so that should help a bit.
- Maybe one can put some high limit on the root cgroup in the blkio throttle controller to limit the overall WRITE rate of the system.
- For SATA disks, try to use CFQ, which can try to minimize the impact of WRITEs. It will at least provide a consistent bandwidth experience to applications.

Right.

However, IMHO this is still a better solution than the current implementation, which is affected by that kind of priority inversion problem. I can try to add this logic to the current blk-throttle controller if you think it is worth testing.

At this point of time I have a few concerns with this approach.

- Configuration issues. Asking the user to plan for SYNC and ASYNC IO separately is inconvenient. One has to know the nature of the workload.
- Most likely we will come up with global limits (at least to begin with), and not per-device limits. That can lead to contention on one single lock and scalability issues on big systems.

Having said that, this approach should reduce the kernel complexity a lot. So if we can do some intelligent locking to limit the overhead, then it boils down to reduced complexity in the kernel vs ease of use for the user. I guess at this point of time I am inclined towards keeping it simple in the kernel.

BTW, with this approach we can probably even get rid of the page tracking stuff for now. If we don't consider the swap IO, any other IO operation from our point of view will happen directly from process context (writes in memory + sync reads from the block device).

However, I'm sure we'll need the page tracking for the blkio controller sooner or later. This is important information, and the proportional bandwidth controller can take advantage of it as well.

A couple of people have asked me about backup jobs running at night, where they want to reduce the IO bandwidth of these jobs to limit the impact on the latency of other jobs. I guess this approach will definitely solve that issue. IMHO, it might be worth trying this approach and seeing how well it works. It might not solve all the problems, but it can be helpful in many situations.

Agreed. This could be a good tradeoff for a lot of common cases.

I feel that for proportional bandwidth division, implementing ASYNC control in CFQ makes sense, because even if things get serialized in higher layers, the consequences are not very bad, as it is a work-conserving algorithm. But for throttling, serialization will lead to bad consequences.

Agreed.

Maybe one can think of new files in the blkio controller to limit async IO per group at page-dirty time:

blkio.throttle.async.write_bps_limit
blkio.throttle.async.write_iops_limit

OK, I'll try to add the async throttling logic and use this interface.
-Andrea
[Devel] [PATCH 0/5] blk-throttle: writeback and swap IO control
Currently the blkio.throttle controller only supports synchronous IO requests. This means that we always look at the current task to identify the owner of each IO request. However, dirty pages in the page cache can be written to disk asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context than that of the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicated, compared with synchronous IO, from the blkio controller's perspective. The same concept also applies to anonymous pages involved in IO operations (swap).

This patch set allows tracking the cgroup that originally dirtied each page in the page cache and each anonymous page, and passes this information to the blk-throttle controller. This information can be used to provide better service level differentiation of buffered writes / swap IO between different cgroups.

Testcase:

- create a cgroup with a 1 MiB/s write limit:

  # mount -t cgroup -o blkio none /mnt/cgroup
  # mkdir /mnt/cgroup/foo
  # echo 8:0 $((1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device

- move a task into the cgroup and run a dd to generate some writeback IO

Results:

- 2.6.38-rc6 vanilla:

  $ cat /proc/$$/cgroup
  1:blkio:/foo
  $ dd if=/dev/zero of=zero bs=1M count=1024
  $ dstat -df
  --dsk/sda--
   read  writ
     0    19M
     0    19M
     0     0
     0     0
     0    19M
  ...

- 2.6.38-rc6 + blk-throttle writeback IO control:

  $ cat /proc/$$/cgroup
  1:blkio:/foo
  $ dd if=/dev/zero of=zero bs=1M count=1024
  $ dstat -df
  --dsk/sda--
   read  writ
     0   1024
     0   1024
     0   1024
     0   1024
     0   1024
  ...

TODO:
- lots of testing

Any feedback is welcome.
-Andrea

[PATCH 1/5] blk-cgroup: move blk-cgroup.h in include/linux/blk-cgroup.h
[PATCH 2/5] blk-cgroup: introduce task_to_blkio_cgroup()
[PATCH 3/5] page_cgroup: make page tracking available for blkio
[PATCH 4/5] blk-throttle: track buffered and anonymous pages
[PATCH 5/5] blk-throttle: buffered and anonymous page tracking instrumentation

 block/Kconfig               |    2 +
 block/blk-cgroup.c          |   15 ++-
 block/blk-cgroup.h          |  335 --
 block/blk-throttle.c        |   89 +++-
 block/cfq.h                 |    2 +-
 fs/buffer.c                 |    1 +
 include/linux/blk-cgroup.h  |  341 +++
 include/linux/blkdev.h      |   26 +++-
 include/linux/memcontrol.h  |    6 +
 include/linux/mmzone.h      |    4 +-
 include/linux/page_cgroup.h |   33 +++-
 init/Kconfig                |    4 +
 mm/Makefile                 |    3 +-
 mm/bounce.c                 |    1 +
 mm/filemap.c                |    1 +
 mm/memcontrol.c             |    6 +
 mm/memory.c                 |    5 +
 mm/page-writeback.c         |    1 +
 mm/page_cgroup.c            |  129 +++++--
 mm/swap_state.c             |    2 +
 20 files changed, 649 insertions(+), 357 deletions(-)
[Devel] [PATCH 1/5] blk-cgroup: move blk-cgroup.h in include/linux/blk-cgroup.h
Move blk-cgroup.h into include/linux for generic usage.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-cgroup.c         |    2 +-
 block/blk-cgroup.h         |  335 ---
 block/blk-throttle.c       |    2 +-
 block/cfq.h                |    2 +-
 include/linux/blk-cgroup.h |  337
 5 files changed, 340 insertions(+), 338 deletions(-)
 delete mode 100644 block/blk-cgroup.h
 create mode 100644 include/linux/blk-cgroup.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..bf9d354 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -17,7 +17,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/slab.h>
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
 #include <linux/genhd.h>

 #define MAX_KEY_LEN 100

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
deleted file mode 100644
index ea4861b..0000000
--- a/block/blk-cgroup.h
+++ /dev/null
@@ -1,335 +0,0 @@
-#ifndef _BLK_CGROUP_H
-#define _BLK_CGROUP_H
-/*
- * Common Block IO controller cgroup interface
- *
- * Based on ideas and code from CFQ, CFS and BFQ:
- * Copyright (C) 2003 Jens Axboe ax...@kernel.dk
- *
- * Copyright (C) 2008 Fabio Checconi fa...@gandalf.sssup.it
- *                    Paolo Valente paolo.vale...@unimore.it
- *
- * Copyright (C) 2009 Vivek Goyal vgo...@redhat.com
- *                    Nauman Rafique nau...@google.com
- */
-
-#include <linux/cgroup.h>
-
-enum blkio_policy_id {
-	BLKIO_POLICY_PROP = 0,	/* Proportional Bandwidth division */
-	BLKIO_POLICY_THROTL,	/* Throttling */
-};
-
-/* Max limits for throttle policy */
-#define THROTL_IOPS_MAX		UINT_MAX
-
-#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
-
-#ifndef CONFIG_BLK_CGROUP
-/* When blk-cgroup is a module, its subsys_id isn't a compile-time constant */
-extern struct cgroup_subsys blkio_subsys;
-#define blkio_subsys_id blkio_subsys.subsys_id
-#endif
-
-enum stat_type {
-	/* Total time spent (in ns) between request dispatch to the driver and
-	 * request completion for IOs doen by this cgroup. This may not be
-	 * accurate when NCQ is turned on.
-	 */
-	BLKIO_STAT_SERVICE_TIME = 0,
-	/* Total bytes transferred */
-	BLKIO_STAT_SERVICE_BYTES,
-	/* Total IOs serviced, post merge */
-	BLKIO_STAT_SERVICED,
-	/* Total time spent waiting in scheduler queue in ns */
-	BLKIO_STAT_WAIT_TIME,
-	/* Number of IOs merged */
-	BLKIO_STAT_MERGED,
-	/* Number of IOs queued up */
-	BLKIO_STAT_QUEUED,
-	/* All the single valued stats go below this */
-	BLKIO_STAT_TIME,
-	BLKIO_STAT_SECTORS,
-#ifdef CONFIG_DEBUG_BLK_CGROUP
-	BLKIO_STAT_AVG_QUEUE_SIZE,
-	BLKIO_STAT_IDLE_TIME,
-	BLKIO_STAT_EMPTY_TIME,
-	BLKIO_STAT_GROUP_WAIT_TIME,
-	BLKIO_STAT_DEQUEUE
-#endif
-};
-
-enum stat_sub_type {
-	BLKIO_STAT_READ = 0,
-	BLKIO_STAT_WRITE,
-	BLKIO_STAT_SYNC,
-	BLKIO_STAT_ASYNC,
-	BLKIO_STAT_TOTAL
-};
-
-/* blkg state flags */
-enum blkg_state_flags {
-	BLKG_waiting = 0,
-	BLKG_idling,
-	BLKG_empty,
-};
-
-/* cgroup files owned by proportional weight policy */
-enum blkcg_file_name_prop {
-	BLKIO_PROP_weight = 1,
-	BLKIO_PROP_weight_device,
-	BLKIO_PROP_io_service_bytes,
-	BLKIO_PROP_io_serviced,
-	BLKIO_PROP_time,
-	BLKIO_PROP_sectors,
-	BLKIO_PROP_io_service_time,
-	BLKIO_PROP_io_wait_time,
-	BLKIO_PROP_io_merged,
-	BLKIO_PROP_io_queued,
-	BLKIO_PROP_avg_queue_size,
-	BLKIO_PROP_group_wait_time,
-	BLKIO_PROP_idle_time,
-	BLKIO_PROP_empty_time,
-	BLKIO_PROP_dequeue,
-};
-
-/* cgroup files owned by throttle policy */
-enum blkcg_file_name_throtl {
-	BLKIO_THROTL_read_bps_device,
-	BLKIO_THROTL_write_bps_device,
-	BLKIO_THROTL_read_iops_device,
-	BLKIO_THROTL_write_iops_device,
-	BLKIO_THROTL_io_service_bytes,
-	BLKIO_THROTL_io_serviced,
-};
-
-struct blkio_cgroup {
-	struct cgroup_subsys_state css;
-	unsigned int weight;
-	spinlock_t lock;
-	struct hlist_head blkg_list;
-	struct list_head policy_list;	/* list of blkio_policy_node */
-};
-
-struct blkio_group_stats {
-	/* total disk time and nr sectors dispatched by this group */
-	uint64_t time;
-	uint64_t sectors;
-	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
-#ifdef CONFIG_DEBUG_BLK_CGROUP
-	/* Sum of number of IOs queued across all samples */
-	uint64_t avg_queue_size_sum;
-	/* Count of samples taken for average */
-	uint64_t avg_queue_size_samples;
-	/* How many times this group has been removed from service tree */
-	unsigned long dequeue;
-
-	/* Total time spent waiting for it to be assigned a timeslice. */
-	uint64_t group_wait_time
[Devel] [PATCH 2/5] blk-cgroup: introduce task_to_blkio_cgroup()
Introduce a helper function to retrieve a blkio cgroup from a task.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-cgroup.c         |    7 +++
 include/linux/blk-cgroup.h |    4
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bf9d354..f283ae1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -107,6 +107,13 @@ blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev,
 	return NULL;
 }

+struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *task)
+{
+	return container_of(task_subsys_state(task, blkio_subsys_id),
+			    struct blkio_cgroup, css);
+}
+EXPORT_SYMBOL_GPL(task_to_blkio_cgroup);
+
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 {
 	return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 5e48204..41b59db 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -287,6 +287,7 @@ static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg) {}
 extern struct blkio_cgroup blkio_root_cgroup;
 extern bool blkio_cgroup_disabled(void);
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+extern struct blkio_cgroup *task_to_blkio_cgroup(struct task_struct *task);
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid);
@@ -311,6 +312,9 @@ static inline bool blkio_cgroup_disabled(void) { return true; }
 static inline struct blkio_cgroup *
 cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
+static inline struct blkio_cgroup *
+task_to_blkio_cgroup(struct task_struct *task) { return NULL; }
+
 static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid) {}
--
1.7.1
[Devel] [PATCH 4/5] blk-throttle: track buffered and anonymous pages
Add the tracking of buffered (writeback) and anonymous pages. Dirty pages in the page cache can be processed asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context than that of the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicated, compared with synchronous IO, from the blkio controller's point of view.

The idea is to save the cgroup owner of each anonymous page and of each dirty page in the page cache. A page is associated with a cgroup the first time it is dirtied in memory (for file cache pages) or when it is set as swap-backed (for anonymous pages). This information is stored using the page_cgroup functionality. Then, at the block layer, it is possible to retrieve the throttle group by looking at bio_page(bio). If the page was not explicitly associated with any cgroup, the IO operation is charged to the current task/cgroup, as was done by the previous implementation.
Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-throttle.c   |   87 +++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -8,6 +8,10 @@
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include <linux/bio.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/page_cgroup.h>
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>

@@ -221,6 +225,85 @@ done:
 	return tg;
 }

+static inline bool is_kernel_io(void)
+{
+	return !!(current->flags & (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC));
+}
+
+static int throtl_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *blkcg;
+	unsigned short id = 0;
+
+	if (blkio_cgroup_disabled())
+		return 0;
+	if (!mm)
+		goto out;
+	rcu_read_lock();
+	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
+	if (likely(blkcg))
+		id = css_id(&blkcg->css);
+	rcu_read_unlock();
+out:
+	return page_cgroup_set_owner(page, id);
+}
+
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
+{
+	return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
+
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
+{
+	if (is_kernel_io() || !page_is_file_cache(page))
+		return 0;
+	return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_filepage_owner);
+
+int blk_throtl_copy_page_owner(struct page *npage, struct page *opage)
+{
+	if (blkio_cgroup_disabled())
+		return 0;
+	return page_cgroup_copy_owner(npage, opage);
+}
+EXPORT_SYMBOL(blk_throtl_copy_page_owner);
+
+/*
+ * A helper function to get the throttle group from css id.
+ *
+ * NOTE: must be called under rcu_read_lock().
+ */
+static struct throtl_grp *throtl_tg_lookup(struct throtl_data *td, int id)
+{
+	struct cgroup_subsys_state *css;
+
+	if (!id)
+		return NULL;
+	css = css_lookup(&blkio_subsys, id);
+	if (!css)
+		return NULL;
+	return throtl_find_alloc_tg(td, css->cgroup);
+}
+
+static struct throtl_grp *
+throtl_get_tg_from_page(struct throtl_data *td, struct page *page)
+{
+	struct throtl_grp *tg;
+	int id;
+
+	if (unlikely(!page))
+		return NULL;
+	id = page_cgroup_get_owner(page);
+
+	rcu_read_lock();
+	tg = throtl_tg_lookup(td, id);
+	rcu_read_unlock();
+
+	return tg;
+}
+
 static struct throtl_grp *throtl_get_tg(struct throtl_data *td)
 {
 	struct cgroup *cgroup;
@@ -1000,7 +1083,9 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 	}

 	spin_lock_irq(q->queue_lock);
-	tg = throtl_get_tg(td);
+	tg = throtl_get_tg_from_page(td, bio_page(bio));
+	if (!tg)
+		tg = throtl_get_tg(td);

 	if (tg->nr_queued[rw]) {
 		/*

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..2d03dee 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1136,10 +1136,34 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
 extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
+extern int blk_throtl_set_anonpage_owner(struct page *page,
+	struct mm_struct *mm);
+extern int blk_throtl_set_filepage_owner(struct page *page,
+	struct
[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages
On Tue, Feb 22, 2011 at 10:42:41AM -0800, Chad Talbott wrote:

On Tue, Feb 22, 2011 at 9:12 AM, Andrea Righi ari...@develer.com wrote:

Add the tracking of buffered (writeback) and anonymous pages.
...
 block/blk-throttle.c   |   87 +++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
...
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
+int blk_throtl_copy_page_owner(struct page *npage, struct page *opage)

It would be nice if these were named blk_cgroup_*. This is arguably more correct, as the id comes from the blkio subsystem and isn't specific to blk-throttle. This will be more important very shortly, as CFQ will be using this same cgroup id for async IO tracking soon.

Sounds reasonable. Will do in the next version.

is_kernel_io() is a good idea; it avoids a bug that we've run into with CFQ async IO tracking. Why isn't PF_KTHREAD sufficient to cover all kernel threads, including kswapd and those marked PF_MEMALLOC?

With PF_MEMALLOC we're sure we don't add the page tracking overhead to non-kernel threads as well when memory gets low. PF_KSWAPD is probably not needed; AFAICS it is only used by kswapd, which is created by kthread_create() and so has the PF_KTHREAD flag set. Let's see if someone can give more details about that. In the meantime I'll investigate and try to do some tests with only PF_KTHREAD.

Thanks,
-Andrea
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 02:34:03PM -0500, Vivek Goyal wrote: On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote: Currently the blkio.throttle controller only supports synchronous IO requests. This means that we always look at the current task to identify the owner of each IO request. However, dirty pages in the page cache can be written to disk asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context with respect to the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicated than that of synchronous IO from the blkio controller's perspective. The same concept also applies to anonymous pages involved in IO operations (swap). This patch set allows tracking of the cgroup that originally dirtied each page in the page cache and each anonymous page, and passes this information to the blk-throttle controller. The information can be used to provide better service level differentiation of buffered writes and swap IO between different cgroups. Hi Andrea, Thanks for the patches. Before I look deeper into the patches, I had a few general queries/thoughts. - So this requires the memory controller to be enabled. Does it also require these to be co-mounted? No and no. The blkio controller enables and uses the page_cgroup functionality, but it doesn't depend on the memory controller. It automatically selects CONFIG_MM_OWNER and CONFIG_PAGE_TRACKING (the latter added in PATCH 3/5), and this is sufficient to make page_cgroup usable from any generic controller. - Currently in throttling there is no limit on the number of bios queued per group. I think this is not necessarily a very good idea, because if throttling limits are low we will build very long bio queues.
So some AIO process can queue up lots of bios, consuming lots of memory, without getting blocked. I am sure there will be other side effects too. One of the side effects I noticed is that if an AIO process queues up too much IO and I then want to kill it, it just hangs there for a really, really long time (waiting for all the throttled IO to complete). So I was thinking of implementing either a per-group limit or a per-io-context limit, after which the process will be put to sleep (something like the request descriptor mechanism). An io context limit seems a better solution for now. We can also expect some help from the memory controller: if we have a per-cgroup dirty memory limit in the future, the maximum number of queued bios will be automatically bounded by that functionality. If that's the case, then comes the question of what to do about kernel threads. Should they be blocked or not? If they are blocked, then a fast group will also be indirectly throttled behind a slow group. If they are not, then we still have the problem of too many bios queued in the throttling layer. I think kernel threads should never be forced to sleep, to avoid the classic priority inversion problem and a potential DoS in the system. Also for this part the per-cgroup dirty memory limit could help a lot, because a cgroup would never exceed its quota of dirty memory, so it would not be able to submit more than a certain amount of bios (corresponding to the dirty memory limit). - What to do about other kernel threads, like kjournald, which do IO on behalf of all the filesystem users? If data is also journalled, then I think again everything gets serialized and a faster group gets backlogged behind a slower one. This is the most critical issue IMHO. The blkio controller would need some help from the filesystems to understand which IO requests can be throttled and which cannot.
At the moment critical IO requests (by critical I mean requests that other requests depend on) and non-critical requests are mixed together in such a way that throttling a single request may stall a lot of other requests in the system, and at the block layer it's not possible to retrieve this information. I don't have a solution for this right now, except looking at each filesystem implementation and trying to understand how to pass this information down to the block layer. - Two processes doing IO to the same file: the slower group will throttle IO for the faster group also (flushing is per inode). I think we should accept inode granularity. We could redesign the writeback code to work per-cgroup / per-page, etc., but that would add a huge overhead. The limit of inode granularity could be an acceptable tradeoff; cgroups are usually supposed to work on different files... well, except when databases come into play (ouch!). Thanks, -Andrea
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote: On Tue, 22 Feb 2011 18:12:54 +0100 Andrea Righi ari...@develer.com wrote: The page_cgroup infrastructure, currently available only for the memory cgroup controller, can be used to store the owner of each page and opportunely track the writeback IO. This information is encoded in the upper 16 bits of page_cgroup->flags. An owner can be identified using a generic ID number, and the following interfaces are provided to store and retrieve this information: unsigned long page_cgroup_get_owner(struct page *page); int page_cgroup_set_owner(struct page *page, unsigned long id); int page_cgroup_copy_owner(struct page *npage, struct page *opage); My immediate observation is that you're not really tracking the owner here - you're tracking an opaque 16-bit token known only to the block controller, in a field which - if changed by anybody other than the block controller - will lead to mayhem in the block controller. I think it might be clearer - and safer - to say "blkcg" or some such instead of "owner" here. Basically the idea here was to be as generic as possible and make this feature potentially available to other subsystems as well, so that cgroup subsystems may represent whatever they want with the 16-bit token. However, no more than a single subsystem can use this feature at the same time. I'm tempted to say it might be better to just add a pointer to your throtl_grp structure into struct page_cgroup. Or maybe replace the mem_cgroup pointer with a single pointer to struct css_set. Both of those ideas, though, probably just add unwanted extra overhead now to gain generality which may or may not be wanted in the future. The pointer to css_set sounds good, but it would add additional space to the page_cgroup struct. Currently page_cgroup is 40 bytes (on a 64-bit arch) and all of them are allocated at boot time. Using unused bits in page_cgroup->flags is a choice with no overhead from this point of view.
Thanks, -Andrea
[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages
On Tue, Feb 22, 2011 at 03:49:28PM -0500, Vivek Goyal wrote: On Tue, Feb 22, 2011 at 10:42:41AM -0800, Chad Talbott wrote: On Tue, Feb 22, 2011 at 9:12 AM, Andrea Righi ari...@develer.com wrote: Add the tracking of buffered (writeback) and anonymous pages. ... --- block/blk-throttle.c | 87 +++- include/linux/blkdev.h | 26 ++- 2 files changed, 111 insertions(+), 2 deletions(-) diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 9ad3d1e..a50ee04 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c ... +int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm) +int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm) +int blk_throtl_copy_page_owner(struct page *npage, struct page *opage) It would be nice if these were named blk_cgroup_*. This is arguably more correct as the id comes from the blkio subsystem, and isn't specific to blk-throttle. This will be more important very shortly, as CFQ will be using this same cgroup id for async IO tracking soon. Should this really all be part of blk-cgroup.c rather than blk-throttle.c, so that it can be used by the CFQ code also down the line? Anyway, all this is not throttle-specific as such, but blkio controller specific. Agreed. Though the function naming convention is not great in blk-cgroup.c. But functions have either a blkio_ prefix or a blkiocg_ prefix. ok. Functions which are not directly dealing with cgroups, or which in general are called by blk-throttle.c and/or cfq-iosched.c, I have marked with the blkio_ prefix. Functions which directly deal with cgroup stuff and register with the cgroup subsystem for this controller generally have the blkiocg_ prefix. In this case we can probably use the blkio_ prefix. ok. Thanks, -Andrea
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Tue, Feb 22, 2011 at 04:22:53PM -0500, Vivek Goyal wrote: On Tue, Feb 22, 2011 at 06:12:54PM +0100, Andrea Righi wrote: The page_cgroup infrastructure, currently available only for the memory cgroup controller, can be used to store the owner of each page and opportunely track the writeback IO. This information is encoded in the upper 16 bits of page_cgroup->flags. An owner can be identified using a generic ID number, and the following interfaces are provided to store and retrieve this information: unsigned long page_cgroup_get_owner(struct page *page); int page_cgroup_set_owner(struct page *page, unsigned long id); int page_cgroup_copy_owner(struct page *npage, struct page *opage); The blkio.throttle controller can use the cgroup css_id() as the owner's ID number.

Signed-off-by: Andrea Righi ari...@develer.com
---
 block/Kconfig               |    2 +
 block/blk-cgroup.c          |    6 ++
 include/linux/memcontrol.h  |    6 ++
 include/linux/mmzone.h      |    4 +-
 include/linux/page_cgroup.h |   33 ++-
 init/Kconfig                |    4 +
 mm/Makefile                 |    3 +-
 mm/memcontrol.c             |    6 ++
 mm/page_cgroup.c            |  129 +++
 9 files changed, 176 insertions(+), 17 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 60be1e0..1351ea8 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -80,6 +80,8 @@ config BLK_DEV_INTEGRITY
 config BLK_DEV_THROTTLING
 	bool "Block layer bio throttling support"
 	depends on BLK_CGROUP=y && EXPERIMENTAL
+	select MM_OWNER
+	select PAGE_TRACKING
 	default n
 	---help---
 	Block layer bio throttling support. It can be used to limit

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f283ae1..5c57f0a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -107,6 +107,12 @@ blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev,
 	return NULL;
 }

+bool blkio_cgroup_disabled(void)
+{
+	return blkio_subsys.disabled ? true : false;
+}
+EXPORT_SYMBOL_GPL(blkio_cgroup_disabled);
+

I think there should be an option to disable just this async feature of the blkio controller.
So those who don't want it (e.g., running VMs with the cache=none option) and don't want to take the memory reservation hit should be able to disable just the ASYNC facility of the blkio controller, and not the whole blkio controller facility. Definitely a better choice. OK, I'll apply all your suggestions and post a new version of the patch. Thanks for the review! -Andrea
[Devel] Re: [PATCH 4/5] blk-throttle: track buffered and anonymous pages
On Tue, Feb 22, 2011 at 04:00:30PM -0500, Vivek Goyal wrote: On Tue, Feb 22, 2011 at 06:12:55PM +0100, Andrea Righi wrote: Add the tracking of buffered (writeback) and anonymous pages. Dirty pages in the page cache can be processed asynchronously by the per-bdi flusher kernel threads or by any other thread in the system, according to the writeback policy. For this reason the real writes to the underlying block devices may occur in a different IO context with respect to the task that originally generated the dirty pages involved in the IO operation. This makes the tracking and throttling of writeback IO more complicated compared to synchronous IO from the blkio controller's point of view. The idea is to save the cgroup owner of each anonymous page and each dirty page in the page cache. A page is associated with a cgroup the first time it is dirtied in memory (for file cache pages) or when it is set as swap-backed (for anonymous pages). This information is stored using the page_cgroup functionality. Then, at the block layer, it is possible to retrieve the throttle group by looking at bio_page(bio). If the page was not explicitly associated with any cgroup, the IO operation is charged to the current task/cgroup, as was done by the previous implementation.
Signed-off-by: Andrea Righi ari...@develer.com
---
 block/blk-throttle.c   |   87 +++-
 include/linux/blkdev.h |   26 ++-
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9ad3d1e..a50ee04 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -8,6 +8,10 @@
 #include <linux/slab.h>
 #include <linux/blkdev.h>
 #include <linux/bio.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/page_cgroup.h>
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>
@@ -221,6 +225,85 @@ done:
 	return tg;
 }

+static inline bool is_kernel_io(void)
+{
+	return !!(current->flags & (PF_KTHREAD | PF_KSWAPD | PF_MEMALLOC));
+}
+
+static int throtl_set_page_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *blkcg;
+	unsigned short id = 0;
+
+	if (blkio_cgroup_disabled())
+		return 0;
+	if (!mm)
+		goto out;
+	rcu_read_lock();
+	blkcg = task_to_blkio_cgroup(rcu_dereference(mm->owner));
+	if (likely(blkcg))
+		id = css_id(&blkcg->css);
+	rcu_read_unlock();
+out:
+	return page_cgroup_set_owner(page, id);
+}
+
+int blk_throtl_set_anonpage_owner(struct page *page, struct mm_struct *mm)
+{
+	return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_anonpage_owner);
+
+int blk_throtl_set_filepage_owner(struct page *page, struct mm_struct *mm)
+{
+	if (is_kernel_io() || !page_is_file_cache(page))
+		return 0;
+	return throtl_set_page_owner(page, mm);
+}
+EXPORT_SYMBOL(blk_throtl_set_filepage_owner);

Why are we exporting all these symbols? Right. Probably a single one is enough:

int blk_throtl_set_page_owner(struct page *page, struct mm_struct *mm, bool anon);

-Andrea
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Tue, Feb 22, 2011 at 04:27:29PM -0700, Jonathan Corbet wrote: On Wed, 23 Feb 2011 00:01:47 +0100 Andrea Righi ari...@develer.com wrote: My immediate observation is that you're not really tracking the owner here - you're tracking an opaque 16-bit token known only to the block controller, in a field which - if changed by anybody other than the block controller - will lead to mayhem in the block controller. I think it might be clearer - and safer - to say "blkcg" or some such instead of "owner" here. Basically the idea here was to be as generic as possible and make this feature potentially available to other subsystems as well, so that cgroup subsystems may represent whatever they want with the 16-bit token. However, no more than a single subsystem can use this feature at the same time. That makes me nervous; it can't really be used that way unless we want to say that certain controllers are fundamentally incompatible and can't be allowed to play together. For whatever my $0.02 is worth (given the state of the US dollar, that's not a whole lot), I'd suggest keeping the current mechanism, but making it clear that it belongs to your controller. If and when another controller comes along with a need for similar functionality, somebody can worry about making it more general. OK, I understand. I'll use "blkio" instead of "owner", also because I wouldn't like to introduce additional logic and overhead to check whether two controllers are using this feature at the same time. Better to hard-code this information in the names of the functions. Probably the most generic solution is the one that you suggested: replace the mem_cgroup pointer with a pointer to css_set. I'll also try to investigate this way. Thanks, -Andrea
[Devel] Re: [PATCH 3/5] page_cgroup: make page tracking available for blkio
On Tue, Feb 22, 2011 at 06:06:30PM -0500, Vivek Goyal wrote: On Wed, Feb 23, 2011 at 12:01:47AM +0100, Andrea Righi wrote: On Tue, Feb 22, 2011 at 01:01:45PM -0700, Jonathan Corbet wrote: On Tue, 22 Feb 2011 18:12:54 +0100 Andrea Righi ari...@develer.com wrote: The page_cgroup infrastructure, currently available only for the memory cgroup controller, can be used to store the owner of each page and opportunely track the writeback IO. This information is encoded in the upper 16 bits of page_cgroup->flags. An owner can be identified using a generic ID number, and the following interfaces are provided to store and retrieve this information: unsigned long page_cgroup_get_owner(struct page *page); int page_cgroup_set_owner(struct page *page, unsigned long id); int page_cgroup_copy_owner(struct page *npage, struct page *opage); My immediate observation is that you're not really tracking the owner here - you're tracking an opaque 16-bit token known only to the block controller, in a field which - if changed by anybody other than the block controller - will lead to mayhem in the block controller. I think it might be clearer - and safer - to say "blkcg" or some such instead of "owner" here. Basically the idea here was to be as generic as possible and make this feature potentially available to other subsystems as well, so that cgroup subsystems may represent whatever they want with the 16-bit token. However, no more than a single subsystem can use this feature at the same time. I'm tempted to say it might be better to just add a pointer to your throtl_grp structure into struct page_cgroup. Or maybe replace the mem_cgroup pointer with a single pointer to struct css_set. Both of those ideas, though, probably just add unwanted extra overhead now to gain generality which may or may not be wanted in the future. The pointer to css_set sounds good, but it would add additional space to the page_cgroup struct.
Currently page_cgroup is 40 bytes (on a 64-bit arch) and all of them are allocated at boot time. Using unused bits in page_cgroup->flags is a choice with no overhead from this point of view. I think John suggested replacing the mem_cgroup pointer with css_set so that the size of the structure does not increase, but it leads to an extra level of indirection. OK, got it, sorry. So, IIUC, we save the css_set pointer and get a struct cgroup as follows:

struct cgroup *cgrp = css_set->subsys[subsys_id]->cgroup;

Then, for example, to get the mem_cgroup reference:

struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

It seems like a lot of indirection, but I may have done something wrong, or there could be a simpler way to do it. Thanks, -Andrea
[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
On Wed, Oct 06, 2010 at 11:34:16AM -0700, Greg Thelen wrote: Andrea Righi ari...@develer.com writes: On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com writes: On Sun, 3 Oct 2010 23:58:03 -0700 Greg Thelen gthe...@google.com wrote: Add cgroupfs interface to memcg dirty page limits: Direct write-out is controlled with: - memory.dirty_ratio - memory.dirty_bytes Background write-out is controlled with: - memory.dirty_background_ratio - memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com Signed-off-by: Greg Thelen gthe...@google.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com a question below.
---
 mm/memcontrol.c |   89 +++
 1 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ec2625..2d45a0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };

+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+};
+
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
@@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	bool root;
+
+	root = mem_cgroup_is_root(mem);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_BYTES:
+		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return root ? dirty_background_ratio :
+			mem->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		return root ? dirty_background_bytes :
+			mem->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;

Curious... is this the same behavior as vm_dirty_ratio? I think this is the same behavior as vm_dirty_ratio. When vm_dirty_ratio is changed, dirty_ratio_handler() will set vm_dirty_bytes=0. When vm_dirty_bytes is written, dirty_bytes_handler() will set vm_dirty_ratio=0. So I think that the per-memcg dirty memory parameters mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other global dirty parameters. Am I missing your question? mmh... looking at the code it seems the same behaviour, but in Documentation/sysctl/vm.txt we say a different thing (i.e., for dirty_bytes): "If dirty_bytes is written, dirty_ratio becomes a function of its value (dirty_bytes / the amount of dirtyable system memory)." However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set the counterpart value to 0. I think we should clarify the documentation.
Signed-off-by: Andrea Righi ari...@develer.com Reviewed-by: Greg Thelen gthe...@google.com This documentation change is general cleanup that is independent of the memcg patch series shown in the subject. Thanks Greg. I'll resend it as an independent patch. -Andrea
[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com writes: On Sun, 3 Oct 2010 23:58:03 -0700 Greg Thelen gthe...@google.com wrote: Add cgroupfs interface to memcg dirty page limits: Direct write-out is controlled with: - memory.dirty_ratio - memory.dirty_bytes Background write-out is controlled with: - memory.dirty_background_ratio - memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com Signed-off-by: Greg Thelen gthe...@google.com Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com a question below.
---
 mm/memcontrol.c |   89 +++
 1 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ec2625..2d45a0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };

+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+};
+
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
@@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	bool root;
+
+	root = mem_cgroup_is_root(mem);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_BYTES:
+		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return root ? dirty_background_ratio :
+			mem->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		return root ? dirty_background_bytes :
+			mem->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;

Curious... is this the same behavior as vm_dirty_ratio? I think this is the same behavior as vm_dirty_ratio. When vm_dirty_ratio is changed, dirty_ratio_handler() will set vm_dirty_bytes=0. When vm_dirty_bytes is written, dirty_bytes_handler() will set vm_dirty_ratio=0. So I think that the per-memcg dirty memory parameters mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other global dirty parameters. Am I missing your question? mmh... looking at the code it seems the same behaviour, but in Documentation/sysctl/vm.txt we say a different thing (i.e., for dirty_bytes): "If dirty_bytes is written, dirty_ratio becomes a function of its value (dirty_bytes / the amount of dirtyable system memory)." However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set the counterpart value to 0. I think we should clarify the documentation.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/sysctl/vm.txt |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..30289fa 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -80,8 +80,10 @@ dirty_background_bytes

 Contains the amount of dirty memory at which the pdflush background writeback
 daemon will start writeback.

-If dirty_background_bytes is written, dirty_background_ratio becomes a function
-of its value (dirty_background_bytes / the amount of dirtyable system memory).
+Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
+one of them may be specified at a time. When one sysctl is written it is
+immediately taken into account to evaluate the dirty memory limits
[Devel] Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote: Extend mem_cgroup to contain dirty page limits. Also add routines allowing the kernel to query the dirty usage of a memcg. These interfaces are not used by the kernel yet. A subsequent commit will add kernel calls to utilize these new routines. A small note below.

Signed-off-by: Greg Thelen gthe...@google.com
Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |   44 +++
 mm/memcontrol.c            |  180 +++-
 2 files changed, 223 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6303da1..dc8952d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,6 +19,7 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
 struct mem_cgroup;
 struct page_cgroup;
@@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
 	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };

+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
+
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = vm_dirty_bytes;
+	param->dirty_background_ratio = dirty_background_ratio;
+	param->dirty_background_bytes = dirty_background_bytes;
+}
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }

+bool mem_cgroup_has_dirty_limit(void);
+void get_vm_dirty_param(struct vm_dirty_param *param);
+s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }

+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+	get_global_vm_dirty_param(param);
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
+{
+	return -ENOSYS;
+}
+
 static inline unsigned long
 mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f40839f..6ec2625 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -233,6 +233,10 @@ struct mem_cgroup {
 	atomic_t	refcnt;

 	unsigned int	swappiness;
+
+	/* control memory cgroup dirty pages */
+	struct vm_dirty_param dirty_param;
+
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
@@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }

+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits. If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where both dirty_[background_]_ratio and _bytes are set.
+ */
+static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
+					 struct mem_cgroup *mem)
+{
+	if (mem && !mem_cgroup_is_root(mem)) {
+		param->dirty_ratio = mem->dirty_param.dirty_ratio;
+		param->dirty_bytes = mem->dirty_param.dirty_bytes;
+		param->dirty_background_ratio =
+			mem->dirty_param.dirty_background_ratio;
+		param->dirty_background_bytes =
+			mem->dirty_param.dirty_background_bytes;
+	} else {
+		get_global_vm_dirty_param(param);
+	}
+}
+
+/*
+ * Get dirty memory parameters of the current memcg or global values (if memory
+ * cgroups are disabled or querying the root cgroup).
+ */
+void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled()) {
+		get_global_vm_dirty_param(param);
+		return;
+	}
+
+	/*
+	 * It's possible
[Devel] Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
On Sun, Oct 03, 2010 at 11:57:55PM -0700, Greg Thelen wrote:

This patch set provides the ability for each cgroup to have independent dirty page limits. Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim) page cache used by a cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. These patches were developed and tested on mmotm 2010-09-28-16-13. The patches are based on a series proposed by Andrea Righi in Mar 2010.

Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs unstable.
- Extend mem_cgroup to record the total number of pages in each of the interesting dirty states (dirty, writeback, unstable_nfs).
- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs control files.
- Consider both system and per-memcg dirty limits in page writeback when deciding to queue background writeback or block for foreground writeback.

Known shortcomings:
- When a cgroup dirty limit is exceeded, bdi writeback is employed to write back dirty inodes. Bdi writeback considers inodes from any cgroup, not just inodes contributing dirty pages to the cgroup exceeding its limit.

Performance measurements:
- kernel builds are unaffected unless run with a small dirty limit.
- all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
- dd has three data points (in secs) for three data sizes (100M, 200M, and 1G). As expected, dd slows when it exceeds its cgroup dirty limit.
               kernel_build   dd (100M, 200M, 1G)
mmotm          2:37           0.18, 0.38, 1.65    root_memcg
mmotm          2:37           0.18, 0.35, 1.66    non-root_memcg
mmotm+patches  2:37           0.18, 0.35, 1.68    root_memcg
mmotm+patches  2:37           0.19, 0.35, 1.69    non-root_memcg
mmotm+patches  2:37           0.19, 2.34, 22.82   non-root_memcg, 150 MiB memcg dirty limit
mmotm+patches  3:58           1.71, 3.38, 17.33   non-root_memcg, 1 MiB memcg dirty limit

Hi Greg, the patchset seems to work fine on my box. I also ran a pretty simple test to directly verify the effectiveness of the dirty memory limit, using a dd running on a non-root memcg:

dd if=/dev/zero of=tmpfile bs=1M count=512

and monitoring the max of the dirty value in cgroup/memory.stat. Here are the results:

dd in non-root memcg (  4 MiB memcg dirty limit): dirty max=4227072
dd in non-root memcg (  8 MiB memcg dirty limit): dirty max=8454144
dd in non-root memcg ( 16 MiB memcg dirty limit): dirty max=15179776
dd in non-root memcg ( 32 MiB memcg dirty limit): dirty max=32235520
dd in non-root memcg ( 64 MiB memcg dirty limit): dirty max=64245760
dd in non-root memcg (128 MiB memcg dirty limit): dirty max=121028608
dd in non-root memcg (256 MiB memcg dirty limit): dirty max=232865792
dd in non-root memcg (512 MiB memcg dirty limit): dirty max=445194240

-Andrea

___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC] [PATCH 0/2] memcg: per cgroup dirty limit
Control the maximum amount of dirty pages a cgroup can have at any given time. The per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit.

The overall design is the following:
- account dirty pages per cgroup
- limit the number of dirty pages via memory.dirty_bytes in cgroupfs
- start to write out in balance_dirty_pages() when the cgroup or global limit is exceeded

This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in the VM layer and enforce a write-out before any cgroup consumes the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes limit.

TODO:
- handle the migration of tasks across different cgroups (a page may be set dirty when a task runs in a cgroup and cleared after the task is moved to another cgroup)
- provide appropriate documentation (in Documentation/cgroups/memory.txt)

-Andrea
[Devel] [PATCH 2/2] memcg: dirty pages instrumentation
Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions.

Signed-off-by: Andrea Righi <ari...@develer.com>
---
 fs/fuse/file.c      |    3 ++
 fs/nfs/write.c      |    3 ++
 fs/nilfs2/segment.c |    8 +++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   69 ++++++++++++++++++++++++++++++++-------------
 mm/truncate.c       |    1 +
 6 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..357632a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>

@@ -1129,6 +1130,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)

	list_del(&req->writepages_entry);
	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_charge_dirty(req->pages[0], NR_WRITEBACK_TEMP, -1);
	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
	bdi_writeout_inc(bdi);
	wake_up(&fi->page_waitq);
@@ -1240,6 +1242,7 @@ static int fuse_writepage_locked(struct page *page)
	req->inode = inode;

	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_charge_dirty(tmp_page, NR_WRITEBACK_TEMP, 1);
	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
	end_page_writeback(page);

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d63d964..3d9de01 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
			req->wb_index, NFS_PAGE_TAG_COMMIT);
	spin_unlock(&inode->i_lock);
+	mem_cgroup_charge_dirty(req->wb_page, NR_UNSTABLE_NFS, 1);
	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
	struct page *page = req->wb_page;

	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_charge_dirty(page, NR_UNSTABLE_NFS, -1);
		dec_zone_page_state(page, NR_UNSTABLE_NFS);
		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
		return 1;
@@ -1320,6 +1322,7 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
		req = nfs_list_entry(head->next);
		nfs_list_remove_request(req);
		nfs_mark_request_commit(req);
+		mem_cgroup_charge_dirty(req->wb_page, NR_UNSTABLE_NFS, -1);
		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
				BDI_RECLAIMABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 105b508..b9ffac5 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1660,8 +1660,10 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
	kunmap_atomic(kaddr, KM_USER0);

-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_charge_dirty(clone_page, NR_WRITEBACK, 1);
		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
	unlock_page(clone_page);

	return 0;
@@ -1788,8 +1790,10 @@ static void __nilfs_end_page_io(struct page *page, int err)
	}

	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_charge_dirty(clone_page, NR_WRITEBACK, -1);
			dec_zone_page_state(page, NR_WRITEBACK);
+		}
	} else
		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 698ea80..c19d809 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
	 * having removed the page entirely.
	 */
	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_charge_dirty(page, NR_FILE_DIRTY, -1);
		dec_zone_page_state(page, NR_FILE_DIRTY);
		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b19943..c9ff1cd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	unsigned long dirty_total, dirty_bytes;

-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	dirty_bytes = mem_cgroup_dirty_bytes
[Devel] [PATCH 1/2] memcg: dirty pages accounting and limiting infrastructure
Infrastructure to account dirty pages per cgroup + add memory.dirty_bytes limit in cgroupfs.

Signed-off-by: Andrea Righi <ari...@develer.com>
---
 include/linux/memcontrol.h |   31 ++++++
 mm/memcontrol.c            |  218 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 248 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..ba3fe0d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,16 @@ struct page_cgroup;
 struct page;
 struct mm_struct;

+/* Cgroup memory statistics items exported to the kernel */
+enum memcg_page_stat_item {
+	MEMCG_NR_FREE_PAGES,
+	MEMCG_NR_RECLAIMABLE_PAGES,
+	MEMCG_NR_FILE_DIRTY,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_WRITEBACK_TEMP,
+	MEMCG_NR_UNSTABLE_NFS,
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -48,6 +58,8 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
					gfp_t gfp_mask);
+extern void mem_cgroup_charge_dirty(struct page *page,
+			enum zone_stat_item idx, int charge);
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
@@ -117,6 +129,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif

+extern unsigned long mem_cgroup_dirty_bytes(void);
+
+extern u64 mem_cgroup_page_state(enum memcg_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
	if (mem_cgroup_subsys.disabled)
@@ -144,6 +160,11 @@ static inline int mem_cgroup_cache_charge(struct page *page,
	return 0;
 }

+static inline void mem_cgroup_charge_dirty(struct page *page,
+			enum zone_stat_item idx, int charge)
+{
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
@@ -312,6 +333,16 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
	return 0;
 }

+static inline unsigned long mem_cgroup_dirty_bytes(void)
+{
+	return vm_dirty_bytes;
+}
+
+static inline u64 mem_cgroup_page_state(enum memcg_page_stat_item item)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */

 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 954032b..288b9a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -64,13 +64,18 @@ enum mem_cgroup_stat_index {
	/*
	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
	 */
	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
	MEM_CGROUP_STAT_EVENTS,		/* sum of pagein + pageout for internal use */
	MEM_CGROUP_STAT_SWAPOUT,	/* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
+	MEM_CGROUP_STAT_WRITEBACK_TEMP,	/* # of pages under writeback using
+					   temporary buffers */
+	MEM_CGROUP_STAT_UNSTABLE_NFS,	/* # of NFS unstable pages */

	MEM_CGROUP_STAT_NSTATS,
 };
@@ -225,6 +230,9 @@ struct mem_cgroup {
	/* set when res.limit == memsw.limit */
	bool		memsw_is_minimum;

+	/* control memory cgroup dirty pages */
+	unsigned long dirty_bytes;
+
	/*
	 * statistics. This must be placed at the end of memcg.
	 */
@@ -519,6 +527,67 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
	put_cpu();
 }

+static struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return NULL;
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc)) {
+		mem = pc->mem_cgroup;
+		if (mem)
+			css_get(&mem->css);
+	}
+	unlock_page_cgroup(pc);
+	return mem;
+}
+
+void mem_cgroup_charge_dirty(struct page *page,
+			enum zone_stat_item idx, int charge)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup_stat_cpu *cpustat;
+	unsigned long flags;
+	int cpu
[Devel] Re: [PATCH 1/2] memcg: dirty pages accounting and limiting infrastructure
On Sun, Feb 21, 2010 at 01:28:35PM -0800, David Rientjes wrote:
[snip]

+static struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return NULL;
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc)) {
+		mem = pc->mem_cgroup;
+		if (mem)
+			css_get(&mem->css);
+	}
+	unlock_page_cgroup(pc);
+	return mem;
+}

Is it possible to merge this with try_get_mem_cgroup_from_page()?

Agreed.

+void mem_cgroup_charge_dirty(struct page *page,
+			enum zone_stat_item idx, int charge)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup_stat_cpu *cpustat;
+	unsigned long flags;
+	int cpu;
+
+	if (mem_cgroup_disabled())
+		return;
+	/* Translate the zone_stat_item into a mem_cgroup_stat_index */
+	switch (idx) {
+	case NR_FILE_DIRTY:
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+	case NR_WRITEBACK:
+		idx = MEM_CGROUP_STAT_WRITEBACK;
+		break;
+	case NR_WRITEBACK_TEMP:
+		idx = MEM_CGROUP_STAT_WRITEBACK_TEMP;
+		break;
+	case NR_UNSTABLE_NFS:
+		idx = MEM_CGROUP_STAT_UNSTABLE_NFS;
+		break;
+	default:
+		return;

WARN()? We don't want to silently leak counters.

Agreed.

+	}
+	/* Charge the memory cgroup statistics */
+	mem = get_mem_cgroup_from_page(page);
+	if (!mem) {
+		mem = root_mem_cgroup;
+		css_get(&mem->css);
+	}

get_mem_cgroup_from_page() should probably handle the root_mem_cgroup case and return a reference from it.

Right. But I'd prefer to use try_get_mem_cgroup_from_page() without changing the behaviour of this function.

+
+	local_irq_save(flags);
+	cpu = get_cpu();
+	cpustat = &mem->stat.cpustat[cpu];
+	__mem_cgroup_stat_add_safe(cpustat, idx, charge);

get_cpu()? Preemption is already disabled, just use smp_processor_id().

mmmh... actually, we can just copy the code from mem_cgroup_charge_statistics(), so local_irq_save/restore are not necessarily needed and we can just use get_cpu()/put_cpu().

+	put_cpu();
+	local_irq_restore(flags);
+	css_put(&mem->css);
+}
+
 static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
					enum lru_list idx)
 {
@@ -992,6 +1061,97 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
	return swappiness;
 }

+static unsigned long get_dirty_bytes(struct mem_cgroup *memcg)
+{
+	struct cgroup *cgrp = memcg->css.cgroup;
+	unsigned long dirty_bytes;
+
+	/* root ? */
+	if (cgrp->parent == NULL)
+		return vm_dirty_bytes;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	dirty_bytes = memcg->dirty_bytes;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return dirty_bytes;
+}
+
+unsigned long mem_cgroup_dirty_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long dirty_bytes;
+
+	if (mem_cgroup_disabled())
+		return vm_dirty_bytes;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg == NULL)
+		dirty_bytes = vm_dirty_bytes;
+	else
+		dirty_bytes = get_dirty_bytes(memcg);
+	rcu_read_unlock();

The rcu_read_lock() isn't protecting anything here.

Right!

+
+	return dirty_bytes;
+}
+
+u64 mem_cgroup_page_state(enum memcg_page_stat_item item)
+{
+	struct mem_cgroup *memcg;
+	struct cgroup *cgrp;
+	u64 ret = 0;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	rcu_read_lock();

Again, this isn't necessary.

OK. I'll apply your changes to the next version of this patch. Thanks for reviewing!

-Andrea
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Sun, Feb 21, 2010 at 01:38:28PM -0800, David Rientjes wrote:
On Sun, 21 Feb 2010, Andrea Righi wrote:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b19943..c9ff1cd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	unsigned long dirty_total, dirty_bytes;

-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty_total = dirty_bytes / PAGE_SIZE;
	else
		dirty_total = (vm_dirty_ratio *
				determine_dirtyable_memory()) / 100;

This needs a comment, since mem_cgroup_dirty_bytes() doesn't imply that it is responsible for returning the global vm_dirty_bytes when that's actually what it does (both for CONFIG_CGROUP_MEM_RES_CTLR=n and the root cgroup).

Fair enough.

Thanks,
-Andrea
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Mon, Feb 22, 2010 at 09:32:21AM +0900, KAMEZAWA Hiroyuki wrote:

-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
	else {
		int dirty_ratio;

you use the local value. But, if hierarchical accounting is used, memcg->dirty_bytes should be got from the root of the hierarchy memcg. I have no objection if you add a pointer as memcg->subhierarchy_root to get the root of hierarchical accounting. But please check the problem of hierarchy, again.

Right, it won't work with hierarchy. I'll fix it, also considering the hierarchy case. Thanks for your review.

-Andrea
[Devel] Re: [PATCH 1/2] memcg: dirty pages accounting and limiting infrastructure
On Mon, Feb 22, 2010 at 09:44:42PM +0530, Balbir Singh wrote:
[snip]

+void mem_cgroup_charge_dirty(struct page *page,
+			enum zone_stat_item idx, int charge)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup_stat_cpu *cpustat;
+	unsigned long flags;
+	int cpu;
+
+	if (mem_cgroup_disabled())
+		return;
+	/* Translate the zone_stat_item into a mem_cgroup_stat_index */
+	switch (idx) {
+	case NR_FILE_DIRTY:
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+	case NR_WRITEBACK:
+		idx = MEM_CGROUP_STAT_WRITEBACK;
+		break;
+	case NR_WRITEBACK_TEMP:
+		idx = MEM_CGROUP_STAT_WRITEBACK_TEMP;
+		break;
+	case NR_UNSTABLE_NFS:
+		idx = MEM_CGROUP_STAT_UNSTABLE_NFS;
+		break;
+	default:
+		return;
+	}
+	/* Charge the memory cgroup statistics */
+	mem = get_mem_cgroup_from_page(page);
+	if (!mem) {
+		mem = root_mem_cgroup;
+		css_get(&mem->css);
+	}
+
+	local_irq_save(flags);
+	cpu = get_cpu();

Kamezawa is in the process of changing these, so you might want to look at and integrate with those patches when they are ready.

OK, I'll rebase the patch to -mm. Are those changes already included in mmotm?

Thanks,
-Andrea
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Mon, Feb 22, 2010 at 11:52:15AM -0500, Vivek Goyal wrote:

 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
-
+	unsigned long memcg_memory, memory;
+
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
+	if (memcg_memory > 0) {

it could be just

	if (memcg_memory) {

Agreed.

+		memcg_memory +=
+			mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
+		if (memcg_memory < memory)
+			return memcg_memory;
+	}
	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
+		memory -= highmem_dirtyable_memory(memory);

If vm_highmem_is_dirtyable=0, in that case we can still return with memcg_memory, which can be more than memory. IOW, highmem is not dirtyable system wide, but still we can potentially return back saying that for this cgroup we can dirty more pages, which can actually be more than the system wide allowed?

Because you have modified dirtyable_memory() and made it per cgroup, I think it automatically takes care of the cases of per cgroup dirty ratio I mentioned in my previous mail. So we will use the system wide dirty ratio to calculate the allowed dirty pages in this cgroup (dirty_ratio * available_memory()) and, if this cgroup wrote too many pages, start writeout?

OK, if I've understood well, you're proposing to use the per-cgroup dirty_ratio interface and do something like:

unsigned long determine_dirtyable_memory(void)
{
	unsigned long memcg_memory, memory;

	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
	if (!vm_highmem_is_dirtyable)
		memory -= highmem_dirtyable_memory(memory);

	memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
	if (!memcg_memory)
		return memory + 1;	/* Ensure that we never return 0 */
	memcg_memory += mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
	if (!vm_highmem_is_dirtyable)
		memcg_memory -= highmem_dirtyable_memory(memory) *
					mem_cgroup_dirty_ratio() / 100;
	if (memcg_memory < memory)
		return memcg_memory;
}

-	return x + 1;	/* Ensure that we never return 0 */
+	return memory + 1;	/* Ensure that we never return 0 */
 }

 void
@@ -421,12 +428,13 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, dirty_bytes;
	unsigned long available_memory = determine_dirtyable_memory();
	struct task_struct *tsk;

-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);
	else {
		int dirty_ratio;

@@ -505,9 +513,17 @@ static void balance_dirty_pages(struct address_space *mapping,
		get_dirty_limits(&background_thresh, &dirty_thresh,
				&bdi_thresh, bdi);

-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		nr_reclaimable = mem_cgroup_page_state(MEMCG_NR_FILE_DIRTY);
+		if (nr_reclaimable == 0) {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		} else {
+			nr_reclaimable +=
+				mem_cgroup_page_state(MEMCG_NR_UNSTABLE_NFS);
+			nr_writeback =
+				mem_cgroup_page_state(MEMCG_NR_WRITEBACK);
+		}
		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
@@ -660,6 +676,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
	unsigned long dirty_thresh;

	for ( ; ; ) {
+		unsigned long dirty;
+
		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

		/*
@@ -668,10 +686,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
		 */
		dirty_thresh += dirty_thresh / 10;	/* wh... */

-		if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-			break;
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		dirty = mem_cgroup_page_state(MEMCG_NR_WRITEBACK);
+
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Tue, Feb 23, 2010 at 10:40:40AM +0100, Andrea Righi wrote:

If vm_highmem_is_dirtyable=0, in that case we can still return with memcg_memory, which can be more than memory. IOW, highmem is not dirtyable system wide, but still we can potentially return back saying that for this cgroup we can dirty more pages, which can actually be more than the system wide allowed?

Because you have modified dirtyable_memory() and made it per cgroup, I think it automatically takes care of the cases of per cgroup dirty ratio I mentioned in my previous mail. So we will use the system wide dirty ratio to calculate the allowed dirty pages in this cgroup (dirty_ratio * available_memory()) and, if this cgroup wrote too many pages, start writeout?

OK, if I've understood well, you're proposing to use the per-cgroup dirty_ratio interface and do something like:

unsigned long determine_dirtyable_memory(void)
{
	unsigned long memcg_memory, memory;

	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
	if (!vm_highmem_is_dirtyable)
		memory -= highmem_dirtyable_memory(memory);

	memcg_memory = mem_cgroup_page_state(MEMCG_NR_FREE_PAGES);
	if (!memcg_memory)
		return memory + 1;	/* Ensure that we never return 0 */
	memcg_memory += mem_cgroup_page_state(MEMCG_NR_RECLAIMABLE_PAGES);
	if (!vm_highmem_is_dirtyable)
		memcg_memory -= highmem_dirtyable_memory(memory) *
					mem_cgroup_dirty_ratio() / 100;

ok, this is wrong:

	if (memcg_memory < memory)
		return memcg_memory;
}

	return min(memcg_memory, memory);

-Andrea
[Devel] Re: [RFC] [PATCH 0/2] memcg: per cgroup dirty limit
On Mon, Feb 22, 2010 at 01:29:34PM -0500, Vivek Goyal wrote:

I wouldn't like to add many different interfaces to do the same thing. I'd prefer to choose just one interface and always use it. We just have to define which is the best one. IMHO dirty_bytes is more generic. If we want to define the limit as a % we can always do that in userspace.

dirty_ratio is easy to configure. One system wide default value works for all the newly created cgroups. For dirty_bytes, you shall have to configure each individual cgroup with a specific value depending on what the upper limit of memory for that cgroup is.

OK.

Secondly, the memory cgroup kind of partitions the global memory resource per cgroup. So, as long as we have global dirty ratio knobs, it makes sense to have a per cgroup dirty ratio knob also. But I guess we can introduce that later and use the global dirty ratio for all the memory cgroups (instead of each cgroup having a separate dirty ratio). The only thing is that we need to enforce this dirty ratio on the cgroup and, if I am reading the code correctly, your modifications of calculating available_memory() per cgroup should take care of that.

At the moment (with dirty_bytes) if the cgroup has dirty_bytes == 0, it simply uses the system wide available_memory(), ignoring the memory upper limit for that cgroup, and falls back to the current behaviour.

With dirty_ratio, should we change the code to *always* apply this percentage to the cgroup memory upper limit, and automatically set it equal to the global dirty_ratio by default when the cgroup is created? mmmh... I vote yes.

-Andrea
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Tue, Feb 23, 2010 at 02:22:12PM -0800, David Rientjes wrote:
On Tue, 23 Feb 2010, Vivek Goyal wrote:

Because you have modified dirtyable_memory() and made it per cgroup, I think it automatically takes care of the cases of per cgroup dirty ratio I mentioned in my previous mail. So we will use the system wide dirty ratio to calculate the allowed dirty pages in this cgroup (dirty_ratio * available_memory()) and, if this cgroup wrote too many pages, start writeout?

OK, if I've understood well, you're proposing to use the per-cgroup dirty_ratio interface and do something like:

I think we can use the system wide dirty_ratio for per cgroup (instead of providing a configurable dirty_ratio for each cgroup, where each memory cgroup can have a different dirty ratio. Can't think of a use case immediately).

I think each memcg should have both dirty_bytes and dirty_ratio: dirty_bytes defaults to 0 (disabled) while dirty_ratio is inherited from the global vm_dirty_ratio. Changing vm_dirty_ratio would not change memcgs already using their own dirty_ratio, but new memcgs would get the new value by default. The ratio would act over the amount of memory available to the cgroup, as though it were its own virtual system operating with a subset of the system's RAM and the same global ratio.

Agreed.

-Andrea
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote:
On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote:
[..]

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b19943..c9ff1cd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	unsigned long dirty_total, dirty_bytes;

-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	dirty_bytes = mem_cgroup_dirty_bytes();
+	if (dirty_bytes)
+		dirty_total = dirty_bytes / PAGE_SIZE;
	else
		dirty_total = (vm_dirty_ratio *
				determine_dirtyable_memory()) / 100;

Ok, I don't understand this, so I'd better ask. Can you explain a bit how the memory cgroup dirty ratio is going to play with the per-BDI dirty proportion thing? Currently we seem to be calculating the per-BDI proportion (based on recently completed events) of the system wide dirty ratio and deciding whether a process should be throttled or not. Because the throttling decision is also based on the BDI and its proportion, how are we going to fit it with the mem cgroup? Is it going to be the BDI proportion of dirty memory within the memory cgroup (and not system wide)?

IMHO we need to calculate the BDI dirty threshold as a function of the cgroup's dirty memory, and keep BDI statistics system wide. So, if a task is generating some writes, the threshold to start the writeback itself will be calculated as a function of the cgroup's dirty memory. If the BDI dirty memory is greater than this threshold, the task must start to write back dirty pages until it reaches the expected dirty limit.

OK, in this way a cgroup with a small dirty limit may be forced to write back a lot of pages dirtied by other cgroups on the same device. But this is always related to the fact that tasks are forced to write back dirty inodes randomly, and not the inodes they've actually dirtied.
-Andrea
[Devel] Re: [PATCH 2/2] memcg: dirty pages instrumentation
On Fri, Feb 26, 2010 at 04:48:11PM -0500, Vivek Goyal wrote: On Thu, Feb 25, 2010 at 04:12:11PM +0100, Andrea Righi wrote: On Tue, Feb 23, 2010 at 04:29:43PM -0500, Vivek Goyal wrote: On Sun, Feb 21, 2010 at 04:18:45PM +0100, Andrea Righi wrote: [..] diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..c9ff1cd 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -137,10 +137,11 @@ static struct prop_descriptor vm_dirties; */ static int calc_period_shift(void) { - unsigned long dirty_total; + unsigned long dirty_total, dirty_bytes; - if (vm_dirty_bytes) - dirty_total = vm_dirty_bytes / PAGE_SIZE; + dirty_bytes = mem_cgroup_dirty_bytes(); + if (dirty_bytes) + dirty_total = dirty_bytes / PAGE_SIZE; else dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100; Ok, I don't understand this so I better ask. Can you explain a bit how memory cgroup dirty ratio is going to play with per BDI dirty proportion thing. Currently we seem to be calculating per BDI proportion (based on recently completed events), of system wide dirty ratio and decide whether a process should be throttled or not. Because throttling decision is also based on BDI and its proportion, how are we going to fit it with mem cgroup? Is it going to be BDI proportion of dirty memory with-in memory cgroup (and not system wide)? IMHO we need to calculate the BDI dirty threshold as a function of the cgroup's dirty memory, and keep BDI statistics system wide. So, if a task is generating some writes, the threshold to start itself the writeback will be calculated as a function of the cgroup's dirty memory. If the BDI dirty memory is greater than this threshold, the task must start to writeback dirty pages until it reaches the expected dirty limit. Ok, so calculate dirty per cgroup and calculate BDI's proportion from cgroup dirty? 
So will you be keeping track of vm_completion events per cgroup, or will you rely on the existing system-wide and per-BDI completion events to calculate the BDI proportion?

BDI proportions are more an indication of device speed, and a faster device gets a higher share of dirty pages, so maybe we don't have to keep track of completion events per cgroup and can rely on system-wide completion events for calculating the proportion of a BDI.

OK. In this way a cgroup with a small dirty limit may be forced to write back a lot of pages dirtied by other cgroups on the same device. But this is always related to the fact that tasks are forced to write back dirty inodes randomly, and not the inodes they have actually dirtied.

So we are left with the following two issues:

- Should we rely on global BDI stats for BDI_RECLAIMABLE and BDI_WRITEBACK, or do we need to make these per-cgroup to determine how many pages have actually been dirtied by a cgroup and force writeouts accordingly?

- Once we decide to throttle a cgroup, it should write its own inodes and should not be serialized behind other cgroups' inodes.

We could try to save who made the inode dirty (inode->cgroup_that_made_inode_dirty) so that during active writeback each cgroup can be forced to write only its own inodes.

-Andrea
[Devel] [PATCH -mmotm 0/2] memcg: per cgroup dirty limit (v2)
Control the maximum amount of dirty pages a cgroup can have at any given time.

The per-cgroup dirty limit fixes the maximum amount of dirty (hard to reclaim) page cache that any cgroup may use. So, in the case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit.

The overall design is the following:

- account dirty pages per cgroup
- limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs
- start to write out (in the background or actively) when the cgroup limits are exceeded

This feature is supposed to be strictly connected to any underlying IO controller implementation, so that we can stop increasing dirty pages in the VM layer and enforce a write-out before any cgroup consumes the global amount of dirty pages.

Changelog (v1 -> v2)
~~~~~~~~~~~~~~~~~~~~
* rebased to -mmotm
* properly handle hierarchical accounting
* added the same system-wide interfaces to set dirty limits (memory.dirty_ratio / memory.dirty_bytes, memory.dirty_background_ratio / memory.dirty_background_bytes)
* other minor fixes and improvements based on the received feedback

TODO:
- handle the migration of tasks across different cgroups (maybe by adding a DIRTY/WRITEBACK/UNSTABLE flag to struct page_cgroup)
- provide appropriate documentation (in Documentation/cgroups/memory.txt)
[Devel] [PATCH -mmotm 1/2] memcg: dirty pages accounting and limiting infrastructure
Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs: - Active write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 74 +- mm/memcontrol.c| 354 2 files changed, 399 insertions(+), 29 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1f9b119..e6af95c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -25,6 +25,41 @@ struct page_cgroup; struct page; struct mm_struct; +/* Cgroup memory statistics items exported to the kernel */ +enum mem_cgroup_page_stat_item { + MEMCG_NR_DIRTYABLE_PAGES, + MEMCG_NR_RECLAIM_PAGES, + MEMCG_NR_WRITEBACK, + MEMCG_NR_DIRTY_WRITEBACK_PAGES, +}; + +/* + * Statistics for memory cgroup. + */ +enum mem_cgroup_stat_index { + /* +* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. +*/ + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ + MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */ + MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out. + used by soft limit implementation */ + MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out. 
+ used by threshold implementation */ + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ + MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ + MEM_CGROUP_STAT_WRITEBACK_TEMP, /* # of pages under writeback using + temporary buffers */ + MEM_CGROUP_STAT_UNSTABLE_NFS, /* # of NFS unstable pages */ + + MEM_CGROUP_STAT_NSTATS, +}; + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All charge functions with gfp_mask should use GFP_KERNEL or @@ -117,6 +152,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern int do_swap_account; #endif +extern long mem_cgroup_dirty_ratio(void); +extern unsigned long mem_cgroup_dirty_bytes(void); +extern long mem_cgroup_dirty_background_ratio(void); +extern unsigned long mem_cgroup_dirty_background_bytes(void); + +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item); + static inline bool mem_cgroup_disabled(void) { if (mem_cgroup_subsys.disabled) @@ -125,7 +167,8 @@ static inline bool mem_cgroup_disabled(void) } extern bool mem_cgroup_oom_called(struct task_struct *task); -void mem_cgroup_update_file_mapped(struct page *page, int val); +void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); @@ -300,8 +343,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_update_file_mapped(struct page *page, - int val) +static inline void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val) { } @@ -312,6 +355,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline long mem_cgroup_dirty_ratio(void) +{ + return vm_dirty_ratio; +} + +static inline unsigned long mem_cgroup_dirty_bytes(void) +{ + return vm_dirty_bytes; +} + +static inline long mem_cgroup_dirty_background_ratio(void) +{ + return dirty_background_ratio; 
+} + +static inline unsigned long mem_cgroup_dirty_background_bytes(void) +{ + return dirty_background_bytes; +} + +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item) +{ + return -ENOMEM; +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a443c30..56f3204 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -66,31 +66,16 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/ #define SOFTLIMIT_EVENTS_THRESH (1000) #define THRESHOLDS_EVENTS_THRESH (100) -/* - * Statistics for memory cgroup. - */ -enum mem_cgroup_stat_index { - /* -* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. -*/ - MEM_CGROUP_STAT_CACHE
[Devel] [PATCH -mmotm 2/2] memcg: dirty pages instrumentation
Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com --- fs/fuse/file.c |5 +++ fs/nfs/write.c |4 ++ fs/nilfs2/segment.c | 10 +- mm/filemap.c|1 + mm/page-writeback.c | 84 -- mm/rmap.c |4 +- mm/truncate.c |2 + 7 files changed, 76 insertions(+), 34 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a9f5e13..dbbdd53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -11,6 +11,7 @@ #include linux/pagemap.h #include linux/slab.h #include linux/kernel.h +#include linux/memcontrol.h #include linux/sched.h #include linux/module.h @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(req-writepages_entry); dec_bdi_stat(bdi, BDI_WRITEBACK); + mem_cgroup_update_stat(req-pages[0], + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1); dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP); bdi_writeout_inc(bdi); wake_up(fi-page_waitq); @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page) req-inode = inode; inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + mem_cgroup_update_stat(tmp_page, + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); end_page_writeback(page); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index b753242..7316f7a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req) req-wb_index, NFS_PAGE_TAG_COMMIT); spin_unlock(inode-i_lock); + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = req-wb_page; if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1); 
dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE); return 1; @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_update_stat(req-wb_page, + MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index ada2f1b..aef6d13 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head); kunmap_atomic(kaddr, KM_USER0); - if (!TestSetPageWriteback(clone_page)) + if (!TestSetPageWriteback(clone_page)) { + mem_cgroup_update_stat(clone_page, + MEM_CGROUP_STAT_WRITEBACK, 1); inc_zone_page_state(clone_page, NR_WRITEBACK); + } unlock_page(clone_page); return 0; @@ -1783,8 +1786,11 @@ static void __nilfs_end_page_io(struct page *page, int err) } if (buffer_nilfs_allocated(page_buffers(page))) { - if (TestClearPageWriteback(page)) + if (TestClearPageWriteback(page)) { + mem_cgroup_update_stat(clone_page, + MEM_CGROUP_STAT_WRITEBACK, -1); dec_zone_page_state(page, NR_WRITEBACK); + } } else end_page_writeback(page); } diff --git a/mm/filemap.c b/mm/filemap.c index fe09e51..f85acae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page) * having removed the page entirely. 
*/ if (PageDirty(page) mapping_cap_account_dirty(mapping)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5a0f8f3..d83f41c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties; */ static
[Devel] [PATCH -mmotm 0/3] memcg: per cgroup dirty limit (v3)
Control the maximum amount of dirty pages a cgroup can have at any given time.

The per-cgroup dirty limit fixes the maximum amount of dirty (hard to reclaim) page cache that any cgroup may use. So, in the case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit.

The overall design is the following:

- account dirty pages per cgroup
- limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs
- start to write out (in the background or actively) when the cgroup limits are exceeded

This feature is supposed to be strictly connected to any underlying IO controller implementation, so that we can stop increasing dirty pages in the VM layer and enforce a write-out before any cgroup consumes the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes limit.

Changelog (v2 -> v3)
~~~~~~~~~~~~~~~~~~~~
* properly handle the swapless case when reading the dirtyable pages statistic
* combine similar functions + code cleanup based on the received feedback
* updated documentation in Documentation/cgroups/memory.txt

-Andrea
[Devel] [PATCH -mmotm 1/3] memcg: dirty memory documentation
Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/cgroups/memory.txt | 36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index aad7d05..878afa7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -308,6 +308,11 @@
 cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout	- # of pages paged out (equivalent to # of uncharging events).
+filedirty	- # of pages that are waiting to get written back to the disk.
+writeback	- # of pages that are actively being written back to the disk.
+writeback_tmp	- # of pages used by FUSE for temporary writeback buffers.
+nfs		- # of NFS pages sent to the server, but not yet committed to
+		  the actual storage.
 active_anon	- # of bytes of anonymous and swap cache memory on active lru
 		  list.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
@@ -343,6 +348,37 @@
 Note:
 - a cgroup which uses hierarchy and it has child cgroup.
 - a cgroup which uses hierarchy and not the root of hierarchy.

+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+    amount of dirty memory at which a process which is generating disk writes
+    inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+    bytes) at which a process generating disk writes will start itself writing
+    out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+    memory, the amount of dirty memory at which background writeback kernel
+    threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+    bytes) at which background writeback kernel threads will start writing out
+    dirty data.
+
 6. Hierarchy support
--
1.6.3.3
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote:

@@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	 */
 	dirty_thresh += dirty_thresh / 10;      /* wh... */

-	if (global_page_state(NR_UNSTABLE_NFS) +
-	    global_page_state(NR_WRITEBACK) <= dirty_thresh)
-		break;
-	congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+	dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+	if (dirty < 0)
+		dirty = global_page_state(NR_UNSTABLE_NFS) +
+			global_page_state(NR_WRITEBACK);

dirty is unsigned long. As mentioned last time, the above will never be true?

In general these patches look ok to me. I will do some testing with them.

Re-introduced the same bug. My bad. :( The value returned from mem_cgroup_page_stat() can be negative, i.e. when the memory cgroup is disabled. We could simply use a long for dirty; the unit is # of pages, so s64 should be enough. Or cast dirty to long only for the check (see below).

Thanks!
-Andrea

Signed-off-by: Andrea Righi ari...@develer.com
---
 mm/page-writeback.c |	2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d83f41c..dbee976 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)

 	dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-	if (dirty < 0)
+	if ((long)dirty < 0)
 		dirty = global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK);
 	if (dirty >= dirty_thresh)
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote:
On Mon, 1 Mar 2010 22:23:40 +0100 Andrea Righi ari...@develer.com wrote:

Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions.

Signed-off-by: Andrea Righi ari...@develer.com

Seems nice. Hmm, the last problem is moving accounts between memcgs. Right?

Correct. This was actually the last item of the TODO list. Anyway, I'm still considering whether it's correct to move dirty pages when a task is migrated from one cgroup to another. Currently, dirty pages just remain in the original cgroup and are flushed depending on the original cgroup's settings. That is not totally wrong... at least, moving the dirty pages between memcgs should be optional (move_charge_at_immigrate?).

Thanks for your ack and the detailed review!

-Andrea
[Devel] Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
On Tue, Mar 02, 2010 at 12:04:53PM +0200, Kirill A. Shutemov wrote:
[snip]

+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	return -ENOMEM;

Why ENOMEM? Probably EINVAL or ENOSYS?

OK, ENOSYS is more appropriate IMHO.

+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
+			enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
+		/* Translate free memory in pages */
+		ret >>= PAGE_SHIFT;
+		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	default:
+		ret = 0;
+		WARN_ON_ONCE(1);

I think it's a bug, not a warning.

OK.

+	}
+	return ret;
+}
+
+static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
+
+	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
+	return 0;
+}
+
+s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return -ENOMEM;

EINVAL/ENOSYS?

OK.
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg) {
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * under the hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -ENOMEM;

ditto.

OK.

+	rcu_read_unlock();
+
+	return stat.value;
+}
+
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
 {
 	int *val = data;
@@ -1263,14 +1418,16 @@ static void record_last_oom(struct mem_cgroup *mem)
 }

 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
+ * Generalized routine to update memory cgroup statistics.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)

EXPORT_SYMBOL_GPL(mem_cgroup_update_stat) is needed, since it is used by filesystems.

Agreed.

+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if (((type == MEM_CGROUP_DIRTY_RATIO) ||
+		(type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO)) && (val > 100))

Too many unnecessary brackets:

	if ((type == MEM_CGROUP_DIRTY_RATIO ||
		type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)

OK.

Thanks,
-Andrea
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote: On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com --- fs/fuse/file.c | 5 +++ fs/nfs/write.c | 4 ++ fs/nilfs2/segment.c | 10 +- mm/filemap.c | 1 + mm/page-writeback.c | 84 -- mm/rmap.c | 4 +- mm/truncate.c | 2 + 7 files changed, 76 insertions(+), 34 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a9f5e13..dbbdd53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -11,6 +11,7 @@ #include linux/pagemap.h #include linux/slab.h #include linux/kernel.h +#include linux/memcontrol.h #include linux/sched.h #include linux/module.h @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(req-writepages_entry); dec_bdi_stat(bdi, BDI_WRITEBACK); + mem_cgroup_update_stat(req-pages[0], + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1); dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP); bdi_writeout_inc(bdi); wake_up(fi-page_waitq); @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page) req-inode = inode; inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + mem_cgroup_update_stat(tmp_page, + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); end_page_writeback(page); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index b753242..7316f7a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req) req-wb_index, NFS_PAGE_TAG_COMMIT); spin_unlock(inode-i_lock); + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = 
req-wb_page; if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE); return 1; @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_update_stat(req-wb_page, + MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index ada2f1b..aef6d13 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head); kunmap_atomic(kaddr, KM_USER0); - if (!TestSetPageWriteback(clone_page)) + if (!TestSetPageWriteback(clone_page)) { + mem_cgroup_update_stat(clone_page, s/clone_page/page/ mmh... shouldn't we use the same page used by TestSetPageWriteback() and inc_zone_page_state()? And #include linux/memcontrol.h is missed. OK. I'll apply your fixes and post a new version. Thanks for reviewing, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 01:09:24PM +0200, Kirill A. Shutemov wrote: On Tue, Mar 2, 2010 at 1:02 PM, Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 12:11:10PM +0200, Kirill A. Shutemov wrote: On Mon, Mar 1, 2010 at 11:23 PM, Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com --- fs/fuse/file.c | 5 +++ fs/nfs/write.c | 4 ++ fs/nilfs2/segment.c | 10 +- mm/filemap.c | 1 + mm/page-writeback.c | 84 -- mm/rmap.c | 4 +- mm/truncate.c | 2 + 7 files changed, 76 insertions(+), 34 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a9f5e13..dbbdd53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -11,6 +11,7 @@ #include linux/pagemap.h #include linux/slab.h #include linux/kernel.h +#include linux/memcontrol.h #include linux/sched.h #include linux/module.h @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(req-writepages_entry); dec_bdi_stat(bdi, BDI_WRITEBACK); + mem_cgroup_update_stat(req-pages[0], + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1); dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP); bdi_writeout_inc(bdi); wake_up(fi-page_waitq); @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page) req-inode = inode; inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + mem_cgroup_update_stat(tmp_page, + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); end_page_writeback(page); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index b753242..7316f7a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req) req-wb_index, NFS_PAGE_TAG_COMMIT); spin_unlock(inode-i_lock); + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); 
__mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = req-wb_page; if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE); return 1; @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_update_stat(req-wb_page, + MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index ada2f1b..aef6d13 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head); kunmap_atomic(kaddr, KM_USER0); - if (!TestSetPageWriteback(clone_page)) + if (!TestSetPageWriteback(clone_page)) { + mem_cgroup_update_stat(clone_page, s/clone_page/page/ mmh... shouldn't we use the same page used by TestSetPageWriteback() and inc_zone_page_state()? Sorry, I've commented wrong hunk. It's for the next one. Yes. Good catch! Will fix in the next version. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
On Tue, Mar 02, 2010 at 06:32:24PM +0530, Balbir Singh wrote:
[snip]

+extern long mem_cgroup_dirty_ratio(void);
+extern unsigned long mem_cgroup_dirty_bytes(void);
+extern long mem_cgroup_dirty_background_ratio(void);
+extern unsigned long mem_cgroup_dirty_background_bytes(void);
+
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+

Docstyle comments for each function would be appreciated.

OK.

 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -205,6 +199,9 @@ struct mem_cgroup {
 	unsigned int	swappiness;

+	/* control memory cgroup dirty pages */
+	unsigned long dirty_param[MEM_CGROUP_DIRTY_NPARAMS];
+

Could you mention what protects this field? Is it the reclaim_lock?

Yes, it is. Actually, we could avoid the lock completely for dirty_param[], using a validation routine to check for incoherencies after any read with get_dirty_param(), and retrying if the validation fails. In practice, the same approach we're using to read the global vm_dirty_ratio, vm_dirty_bytes, etc. Considering that those values are rarely written and read often, we can protect them in an RCU way.

BTW, is unsigned long sufficient to represent the dirty_param(s)?

Yes, I think so. It's the same type used for the equivalent global values.

 	/* set when res.limit == memsw.limit */
 	bool	memsw_is_minimum;

@@ -1021,6 +1018,164 @@

+static unsigned long get_dirty_param(struct mem_cgroup *memcg,
+			enum mem_cgroup_dirty_param idx)
+{
+	unsigned long ret;
+
+	VM_BUG_ON(idx >= MEM_CGROUP_DIRTY_NPARAMS);
+	spin_lock(&memcg->reclaim_param_lock);
+	ret = memcg->dirty_param[idx];
+	spin_unlock(&memcg->reclaim_param_lock);

Do we need a spinlock if we protect it using RCU? Is precise data very important?

See above.
+unsigned long mem_cgroup_dirty_background_bytes(void)
+{
+	struct mem_cgroup *memcg;
+	unsigned long ret = dirty_background_bytes;
+
+	if (mem_cgroup_disabled())
+		return ret;
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (likely(memcg))
+		ret = get_dirty_param(memcg, MEM_CGROUP_DIRTY_BACKGROUND_BYTES);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	return do_swap_account ?
+		res_counter_read_u64(&memcg->memsw, RES_LIMIT) :

Shouldn't you do a res_counter_read_u64(...) > 0 for readability?

OK.

What happens if memcg->res RES_LIMIT == memcg->memsw RES_LIMIT?

OK, we should also check memcg->memsw_is_minimum.

 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -3776,8 +4031,37 @@
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);

-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+
+		spin_lock(&parent->reclaim_param_lock);
+		copy_dirty_params(mem, parent);
+		spin_unlock(&parent->reclaim_param_lock);
+	} else {
+		/*
+		 * XXX: do we need a lock here? We could switch from
+		 * vm_dirty_ratio to vm_dirty_bytes or vice versa, but we're
+		 * not reading them atomically. The same for
+		 * dirty_background_ratio and dirty_background_bytes.
+		 *
+		 * For now, try to read them speculatively and retry if a
+		 * conflict is detected.

The do-while loop is subtle. Can we add a validate check, share it with the write routine, and retry if validation fails?

Agreed.
+*/ + do { + mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] = + vm_dirty_ratio; + mem->dirty_param[MEM_CGROUP_DIRTY_BYTES] = + vm_dirty_bytes; + } while (mem->dirty_param[MEM_CGROUP_DIRTY_RATIO] && + mem->dirty_param[MEM_CGROUP_DIRTY_BYTES]); + do { + mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] = + dirty_background_ratio; + mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES] = + dirty_background_bytes; + } while (mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_RATIO] && + mem->dirty_param[MEM_CGROUP_DIRTY_BACKGROUND_BYTES]); + } atomic_set(&mem->refcnt, 1); mem->move_charge_at_immigrate = 0; mutex_init(&mem->thresholds_lock); Many thanks for reviewing, -Andrea
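The speculative read-and-retry pattern discussed above (read a pair of rarely-written values, then validate that the snapshot is coherent) can be sketched in userspace C. This is only an illustration of the idea, not the kernel code; the variable and struct names are invented, and the validation rule relies on the writer keeping at most one of the (ratio, bytes) pair nonzero:

```c
#include <assert.h>

/* Hypothetical globals mirroring vm_dirty_ratio / vm_dirty_bytes:
 * a writer keeps at most one of them nonzero, so a torn read can be
 * detected by observing both set at once, and the reader retries. */
static volatile long g_dirty_ratio = 20;          /* percent, 0 = disabled */
static volatile unsigned long g_dirty_bytes = 0;  /* bytes,   0 = disabled */

struct dirty_snapshot {
	long ratio;
	unsigned long bytes;
};

/* Speculative read: loop until the snapshot passes validation
 * (at most one of the two fields nonzero). */
static struct dirty_snapshot read_dirty_params(void)
{
	struct dirty_snapshot s;

	do {
		s.ratio = g_dirty_ratio;
		s.bytes = g_dirty_bytes;
	} while (s.ratio && s.bytes);   /* incoherent snapshot: retry */
	return s;
}
```

As the reviewer notes, the validation condition is subtle and is best shared with the write routine so both sides agree on what "coherent" means.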
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 02:48:56PM +0100, Peter Zijlstra wrote: On Mon, 2010-03-01 at 22:23 +0100, Andrea Righi wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com --- diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5a0f8f3..d83f41c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -137,13 +137,14 @@ static struct prop_descriptor vm_dirties; */ static int calc_period_shift(void) { - unsigned long dirty_total; + unsigned long dirty_total, dirty_bytes; - if (vm_dirty_bytes) - dirty_total = vm_dirty_bytes / PAGE_SIZE; + dirty_bytes = mem_cgroup_dirty_bytes(); + if (dirty_bytes) So you don't think 0 is a valid max dirty amount? A value of 0 means disabled. It's used to select between dirty_ratio or dirty_bytes. It's the same for the global vm_dirty_* parameters. + dirty_total = dirty_bytes / PAGE_SIZE; else - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / - 100; + dirty_total = (mem_cgroup_dirty_ratio() * + determine_dirtyable_memory()) / 100; return 2 + ilog2(dirty_total - 1); } @@ -408,14 +409,16 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) */ unsigned long determine_dirtyable_memory(void) { - unsigned long x; - - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages(); + unsigned long memory; + s64 memcg_memory; + memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages(); if (!vm_highmem_is_dirtyable) - x -= highmem_dirtyable_memory(x); - - return x + 1; /* Ensure that we never return 0 */ + memory -= highmem_dirtyable_memory(memory); + memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES); + if (memcg_memory < 0) And here you somehow return negative?
+ return memory + 1; + return min((unsigned long)memcg_memory, memory + 1); } void @@ -423,26 +426,28 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty, unsigned long *pbdi_dirty, struct backing_dev_info *bdi) { unsigned long background; - unsigned long dirty; + unsigned long dirty, dirty_bytes, dirty_background; unsigned long available_memory = determine_dirtyable_memory(); struct task_struct *tsk; - if (vm_dirty_bytes) - dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); + dirty_bytes = mem_cgroup_dirty_bytes(); + if (dirty_bytes) zero not valid again + dirty = DIV_ROUND_UP(dirty_bytes, PAGE_SIZE); else { int dirty_ratio; - dirty_ratio = vm_dirty_ratio; + dirty_ratio = mem_cgroup_dirty_ratio(); if (dirty_ratio < 5) dirty_ratio = 5; dirty = (dirty_ratio * available_memory) / 100; } - if (dirty_background_bytes) - background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); + dirty_background = mem_cgroup_dirty_background_bytes(); + if (dirty_background) idem + background = DIV_ROUND_UP(dirty_background, PAGE_SIZE); else - background = (dirty_background_ratio * available_memory) / 100; - + background = (mem_cgroup_dirty_background_ratio() * + available_memory) / 100; if (background >= dirty) background = dirty / 2; tsk = current; @@ -508,9 +513,13 @@ static void balance_dirty_pages(struct address_space *mapping, get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); - nr_reclaimable = global_page_state(NR_FILE_DIRTY) + + nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES); + nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK); + if ((nr_reclaimable < 0) || (nr_writeback < 0)) { + nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); ??? why would a page_state be negative.. I see you return -ENOMEM on !cgroup, but how can one specify no dirty limit with this compiled in?
- nr_writeback = global_page_state(NR_WRITEBACK); + nr_writeback = global_page_state(NR_WRITEBACK); + } bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); if (bdi_cap_account_unstable(bdi)) { @@ -611,10 +620,12 @@ static void balance_dirty_pages(struct address_space *mapping, * In normal mode, we start background writeout at the lower * background_thresh, to keep the amount of dirty memory low. */ + nr_reclaimable
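The limit-selection logic being reviewed in get_dirty_limits() can be condensed into a small standalone sketch: a nonzero bytes value takes precedence, 0 means "disabled, fall back to the ratio", and the ratio is clamped to at least 5%. The function and constant names below are illustrative, not the kernel's:

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL

/* Sketch of the selection in get_dirty_limits(): bytes wins when
 * nonzero (0 = disabled), otherwise the clamped ratio applies. */
static unsigned long dirty_limit_pages(unsigned long dirty_bytes,
				       int dirty_ratio,
				       unsigned long available_pages)
{
	if (dirty_bytes)   /* DIV_ROUND_UP(dirty_bytes, PAGE_SIZE) */
		return (dirty_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	if (dirty_ratio < 5)   /* clamp, as the kernel does */
		dirty_ratio = 5;
	return (unsigned long)dirty_ratio * available_pages / 100;
}
```

This also makes Peter's objection concrete: with this encoding there is no way to express a genuine limit of 0, because 0 is overloaded to mean "disabled".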
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-02 17:23:16]: On Tue, 2 Mar 2010 09:01:58 +0100 Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 1 Mar 2010 22:23:40 +0100 Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com Seems nice. Hmm. the last problem is moving account between memcg. Right ? Correct. This was actually the last item of the TODO list. Anyway, I'm still considering if it's correct to move dirty pages when a task is migrated from a cgroup to another. Currently, dirty pages just remain in the original cgroup and are flushed depending on the original cgroup settings. That is not totally wrong... at least moving the dirty pages between memcgs should be optional (move_charge_at_immigrate?). My concern is - migration between memcg is already suppoted - at task move - at rmdir Then, if you leave DIRTY_PAGE accounting to original cgroup, the new cgroup (migration target)'s Dirty page accounting may goes to be negative, or incorrect value. Please check FILE_MAPPED implementation in __mem_cgroup_move_account() As if (page_mapped(page) !PageAnon(page)) { /* Update mapped_file data for mem_cgroup */ preempt_disable(); __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); preempt_enable(); } then, FILE_MAPPED never goes negative. Absolutely! I am not sure how complex dirty memory migration will be, but one way of working around it would be to disable migration of charges when the feature is enabled (dirty* is set in the memory cgroup). We might need additional logic to allow that to happen. I've started to look at dirty memory migration. First attempt is to add DIRTY, WRITEBACK, etc. 
to page_cgroup flags and handle them in __mem_cgroup_move_account(). Probably I'll have something ready for the next version of the patch. I still need to figure if this can work as expected... -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote: On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote: On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote: @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask) */ dirty_thresh += dirty_thresh / 10; /* wh... */ -if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) >= dirty_thresh) - break; -congestion_wait(BLK_RW_ASYNC, HZ/10); + + dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES); + if (dirty < 0) + dirty = global_page_state(NR_UNSTABLE_NFS) + + global_page_state(NR_WRITEBACK); dirty is unsigned long. As mentioned last time, above will never be true? In general these patches look ok to me. I will do some testing with these. Re-introduced the same bug. My bad. :( The value returned from mem_cgroup_page_stat() can be negative, i.e. when the memory cgroup is disabled. We could simply use a long for dirty; the unit is in # of pages, so s64 should be enough. Or cast dirty to long only for the check (see below). Thanks! -Andrea Signed-off-by: Andrea Righi ari...@develer.com --- mm/page-writeback.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d83f41c..dbee976 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask) dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES); - if (dirty < 0) + if ((long)dirty < 0) This will also be problematic as on 32bit systems, your upper limit of dirty memory will be 2G? I guess, I will prefer one of the two. - return the error code from function and pass a pointer to store stats in as function argument. - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if per cgroup dirty control is enabled, then use per cgroup stats. In that case you don't have to return negative values.
Only tricky part will be careful accounting so that none of the stats go negative in corner cases of migration etc. What do you think about Peter's suggestion + the locking stuff? (see the previous email). Otherwise, I'll choose the other solution: passing a pointer and always returning the error code is not bad. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
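The bug chased through this exchange is the classic signed/unsigned comparison trap, which can be shown in isolation. This is a generic C illustration, not the kernel code; the helper names are invented:

```c
#include <assert.h>

/* Once a possibly-negative s64 error value is stored into an unsigned
 * long, "dirty < 0" can never be true; casting back to long "fixes"
 * the test but sacrifices the top bit, so on 32-bit the usable range
 * of page counts is halved (the ~2G concern raised above). */
static int negative_check_unsigned(unsigned long dirty)
{
	return dirty < 0;          /* always false: dirty is unsigned */
}

static int negative_check_cast(unsigned long dirty)
{
	return (long)dirty < 0;    /* detects the error, loses the top bit */
}
```

This is why the thread converges on not encoding errors in the return value at all, either via an error code plus an out-pointer or via a mem_cgroup_has_dirty_limit() predicate.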
[Devel] Re: [PATCH -mmotm 2/3] memcg: dirty pages accounting and limiting infrastructure
On Tue, Mar 02, 2010 at 10:08:17AM -0800, Greg Thelen wrote: Comments below. Yet to be tested on my end, but I will test it. On Mon, Mar 1, 2010 at 1:23 PM, Andrea Righi ari...@develer.com wrote: Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs: - Direct write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 77 ++- mm/memcontrol.c | 336 2 files changed, 384 insertions(+), 29 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1f9b119..cc88b2e 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -19,12 +19,50 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H + +#include linux/writeback.h #include linux/cgroup.h + struct mem_cgroup; struct page_cgroup; struct page; struct mm_struct; +/* Cgroup memory statistics items exported to the kernel */ +enum mem_cgroup_page_stat_item { + MEMCG_NR_DIRTYABLE_PAGES, + MEMCG_NR_RECLAIM_PAGES, + MEMCG_NR_WRITEBACK, + MEMCG_NR_DIRTY_WRITEBACK_PAGES, +}; + +/* + * Statistics for memory cgroup. + */ +enum mem_cgroup_stat_index { + /* + * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. + */ + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ + MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */ + MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out. + used by soft limit implementation */ + MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out. 
+ used by threshold implementation */ + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ + MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ + MEM_CGROUP_STAT_WRITEBACK_TEMP, /* # of pages under writeback using + temporary buffers */ + MEM_CGROUP_STAT_UNSTABLE_NFS, /* # of NFS unstable pages */ + + MEM_CGROUP_STAT_NSTATS, +}; + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All charge functions with gfp_mask should use GFP_KERNEL or @@ -117,6 +155,13 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern int do_swap_account; #endif +extern long mem_cgroup_dirty_ratio(void); +extern unsigned long mem_cgroup_dirty_bytes(void); +extern long mem_cgroup_dirty_background_ratio(void); +extern unsigned long mem_cgroup_dirty_background_bytes(void); + +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item); + static inline bool mem_cgroup_disabled(void) { if (mem_cgroup_subsys.disabled) @@ -125,7 +170,8 @@ static inline bool mem_cgroup_disabled(void) } extern bool mem_cgroup_oom_called(struct task_struct *task); -void mem_cgroup_update_file_mapped(struct page *page, int val); +void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); @@ -300,8 +346,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_update_file_mapped(struct page *page, - int val) +static inline void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val) { } @@ -312,6 +358,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline long mem_cgroup_dirty_ratio(void) +{ + return vm_dirty_ratio; +} + +static inline unsigned long mem_cgroup_dirty_bytes(void) +{ + return vm_dirty_bytes; +} + +static inline long mem_cgroup_dirty_background_ratio(void) +{ + return dirty_background_ratio; 
+} + +static inline unsigned long mem_cgroup_dirty_background_bytes(void) +{ + return dirty_background_bytes; +} + +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote: On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote: On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote: On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote: On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote: @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask) */ dirty_thresh += dirty_thresh / 10; /* wh... */ -if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) = dirty_thresh) - break; -congestion_wait(BLK_RW_ASYNC, HZ/10); + + dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES); + if (dirty 0) + dirty = global_page_state(NR_UNSTABLE_NFS) + + global_page_state(NR_WRITEBACK); dirty is unsigned long. As mentioned last time, above will never be true? In general these patches look ok to me. I will do some testing with these. Re-introduced the same bug. My bad. :( The value returned from mem_cgroup_page_stat() can be negative, i.e. when memory cgroup is disabled. We could simply use a long for dirty, the unit is in # of pages so s64 should be enough. Or cast dirty to long only for the check (see below). Thanks! -Andrea Signed-off-by: Andrea Righi ari...@develer.com --- mm/page-writeback.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d83f41c..dbee976 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask) dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES); - if (dirty 0) + if ((long)dirty 0) This will also be problematic as on 32bit systems, your uppper limit of dirty memory will be 2G? I guess, I will prefer one of the two. - return the error code from function and pass a pointer to store stats in as function argument. - Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if per cgroup dirty control is enabled, then use per cgroup stats. 
In that case you don't have to return negative values. Only tricky part will be careful accouting so that none of the stats go negative in corner cases of migration etc. What do you think about Peter's suggestion + the locking stuff? (see the previous email). Otherwise, I'll choose the other solution, passing a pointer and always return the error code is not bad. Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit() call, task might change cgroup and later we might call mem_cgroup_get_page_stat() on a different cgroup altogether which might or might not have dirty limits specified? Correct. But in what cases you don't want to use memory cgroup specified limit? I thought cgroup disabled what the only case where we need to use global limits. Otherwise a memory cgroup will have either dirty_bytes specified or by default inherit global dirty_ratio which is a valid number. If that's the case then you don't have to take rcu_lock() outside get_page_stat()? IOW, apart from cgroup being disabled, what are the other cases where you expect to not use cgroup's page stat and use global stats? At boot, when mem_cgroup_from_task() may return NULL. But this is not related to the RCU acquisition. Anyway, probably the RCU protection is not so critical for this particular case, and we can simply get rid of it. In this way we can easily implement the interface proposed by Peter. -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
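The interface the thread settles on (Peter's suggestion) can be sketched as a check-then-use pattern: ask whether the cgroup enforces its own dirty limit, and only then consult its stats, otherwise fall back to the global counters. Everything below is a hypothetical userspace model, not the kernel API; a NULL cgroup stands in for "memory cgroups disabled":

```c
#include <assert.h>
#include <stddef.h>

struct memcg {
	int has_dirty_limit;   /* 1 if this group enforces its own limit */
	long nr_dirty;         /* per-group dirty page count */
};

static long global_nr_dirty = 500;   /* stand-in for global_page_state() */

/* Use per-group stats only when the group has its own dirty limit;
 * no negative sentinel value is needed anywhere. */
static long effective_nr_dirty(const struct memcg *cg)
{
	if (cg && cg->has_dirty_limit)
		return cg->nr_dirty;   /* per-cgroup accounting */
	return global_nr_dirty;        /* disabled or unlimited: global */
}
```

The point made above is that this stays coherent even if the task migrates between the predicate check and the stat read, because any cgroup it lands in also reports has_dirty_limit == true (it at least inherits the global dirty_ratio).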
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Wed, Mar 03, 2010 at 08:21:07AM +0900, Daisuke Nishimura wrote: On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-02 17:23:16]: On Tue, 2 Mar 2010 09:01:58 +0100 Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 1 Mar 2010 22:23:40 +0100 Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com Seems nice. Hmm. the last problem is moving account between memcg. Right ? Correct. This was actually the last item of the TODO list. Anyway, I'm still considering if it's correct to move dirty pages when a task is migrated from a cgroup to another. Currently, dirty pages just remain in the original cgroup and are flushed depending on the original cgroup settings. That is not totally wrong... at least moving the dirty pages between memcgs should be optional (move_charge_at_immigrate?). My concern is - migration between memcg is already suppoted - at task move - at rmdir Then, if you leave DIRTY_PAGE accounting to original cgroup, the new cgroup (migration target)'s Dirty page accounting may goes to be negative, or incorrect value. Please check FILE_MAPPED implementation in __mem_cgroup_move_account() As if (page_mapped(page) !PageAnon(page)) { /* Update mapped_file data for mem_cgroup */ preempt_disable(); __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); preempt_enable(); } then, FILE_MAPPED never goes negative. Absolutely! I am not sure how complex dirty memory migration will be, but one way of working around it would be to disable migration of charges when the feature is enabled (dirty* is set in the memory cgroup). 
We might need additional logic to allow that to happen. I've started to look at dirty memory migration. First attempt is to add DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in __mem_cgroup_move_account(). Probably I'll have something ready for the next version of the patch. I still need to figure if this can work as expected... I agree it's a right direction(in fact, I have been planning to post a patch in that direction), so I leave it to you. Can you add PCG_FILE_MAPPED flag too ? I think this flag can be handled in the same way as other flags you're trying to add, and we can change if (page_mapped(page) !PageAnon(page)) to if (PageCgroupFileMapped(pc) in __mem_cgroup_move_account(). It would be cleaner than current code, IMHO. OK, sounds good to me. I'll introduce PCG_FILE_MAPPED in the next version. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
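The FILE_MAPPED technique referenced above (and the PCG_FILE_MAPPED flag agreed on here) amounts to: record on the page itself which statistics its charge contributed to, so that on migration the source counter is decremented only when a matching increment happened earlier and therefore can never go negative. A minimal userspace sketch, with invented names:

```c
#include <assert.h>

#define PCG_FILE_MAPPED_FLAG 0x1u   /* hypothetical per-page flag bit */

struct group_stat {
	long file_mapped;   /* per-group counter, must stay >= 0 */
};

/* Sketch of the move in __mem_cgroup_move_account(): the per-page
 * flag gates the transfer, so "from" is only decremented if this
 * page was counted into it in the first place. */
static void move_account(unsigned int pc_flags,
			 struct group_stat *from, struct group_stat *to)
{
	if (pc_flags & PCG_FILE_MAPPED_FLAG) {
		from->file_mapped--;
		to->file_mapped++;
	}
}
```

Extending the same gating to DIRTY, WRITEBACK, etc. via new page_cgroup flags is exactly what the patch under discussion sets out to do.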
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote: On Wed, 3 Mar 2010 15:15:49 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: Agreed. Let's try how we can write a code in clean way. (we have time ;) For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little over killing. What I really want is lockless code...but it seems impossible under current implementation. I wonder the fact the page is never unchareged under us can give us some chances ...Hmm. How about this ? Basically, I don't like duplicating information...so, # of new pcg_flags may be able to be reduced. I'm glad this can be a hint for Andrea-san. Many thanks! I already wrote pretty the same code, but at this point I think I'll just apply and test this one. ;) -Andrea == --- include/linux/page_cgroup.h | 44 - mm/memcontrol.c | 91 +++- 2 files changed, 132 insertions(+), 3 deletions(-) Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h === --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h @@ -39,6 +39,11 @@ enum { PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. 
*/ PCG_ACCT_LRU, /* page has been accounted for */ + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */ + PCG_ACCT_DIRTY, + PCG_ACCT_WB, + PCG_ACCT_WB_TEMP, + PCG_ACCT_UNSTABLE, }; #define TESTPCGFLAG(uname, lname)\ @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU) TESTPCGFLAG(AcctLRU, ACCT_LRU) TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU) +SETPCGFLAG(AcctDirty, ACCT_DIRTY); +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY); +TESTPCGFLAG(AcctDirty, ACCT_DIRTY); + +SETPCGFLAG(AcctWB, ACCT_WB); +CLEARPCGFLAG(AcctWB, ACCT_WB); +TESTPCGFLAG(AcctWB, ACCT_WB); + +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP); +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP); +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP); + +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE); +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE); +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE); + + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc->page); @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup { return page_zonenum(pc->page); } - +/* + * lock_page_cgroup() should not be held under mapping->tree_lock + */ static inline void lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, &pc->flags); @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st bit_spin_unlock(PCG_LOCK, &pc->flags); } +/* + * Lock order is + * lock_page_cgroup() + * lock_page_cgroup_migrate() + * This lock is not be lock for charge/uncharge but for account moving. + * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself + * the page is uncharged while we hold this.
+ */ +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_lock(PCG_MIGRATE_LOCK, pc-flags); +} + +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_unlock(PCG_MIGRATE_LOCK, pc-flags); +} + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; Index: mmotm-2.6.33-Mar2/mm/memcontrol.c === --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c +++ mmotm-2.6.33-Mar2/mm/memcontrol.c @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */ + MEM_CGROUP_STAT_DIRTY, + MEM_CGROUP_STAT_WBACK, + MEM_CGROUP_STAT_WBACK_TEMP, + MEM_CGROUP_STAT_UNSTABLE_NFS, MEM_CGROUP_STAT_NSTATS, }; @@ -1360,6 +1364,86 @@ done: } /* + * Update file cache's status for memcg. Before calling this, + * mapping-tree_lock should be held and preemption is disabled. + * Then, it's guarnteed that the page is not uncharged while we + * access page_cgroup. We can make use of that. + */ +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set) +{ + struct page_cgroup *pc; + struct mem_cgroup *mem; + + pc = lookup_page_cgroup(page); + /* Not accounted ? */ + if (!PageCgroupUsed(pc)) + return; + lock_page_cgroup_migrate(pc); + /* + * It's guarnteed that this page is never uncharged. + * The only racy problem is moving account among memcgs. + */ + switch (idx) { + case MEM_CGROUP_STAT_DIRTY: + if (set) + SetPageCgroupAcctDirty(pc); + else + ClearPageCgroupAcctDirty(pc); +
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Wed, Mar 03, 2010 at 12:47:03PM +0100, Andrea Righi wrote: On Tue, Mar 02, 2010 at 06:59:32PM -0500, Vivek Goyal wrote: On Tue, Mar 02, 2010 at 11:22:48PM +0100, Andrea Righi wrote: On Tue, Mar 02, 2010 at 10:05:29AM -0500, Vivek Goyal wrote: On Mon, Mar 01, 2010 at 11:18:31PM +0100, Andrea Righi wrote: On Mon, Mar 01, 2010 at 05:02:08PM -0500, Vivek Goyal wrote: @@ -686,10 +699,14 @@ void throttle_vm_writeout(gfp_t gfp_mask) */ dirty_thresh += dirty_thresh / 10; /* wh... */ -if (global_page_state(NR_UNSTABLE_NFS) + - global_page_state(NR_WRITEBACK) = dirty_thresh) - break; -congestion_wait(BLK_RW_ASYNC, HZ/10); + + dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES); + if (dirty 0) + dirty = global_page_state(NR_UNSTABLE_NFS) + + global_page_state(NR_WRITEBACK); dirty is unsigned long. As mentioned last time, above will never be true? In general these patches look ok to me. I will do some testing with these. Re-introduced the same bug. My bad. :( The value returned from mem_cgroup_page_stat() can be negative, i.e. when memory cgroup is disabled. We could simply use a long for dirty, the unit is in # of pages so s64 should be enough. Or cast dirty to long only for the check (see below). Thanks! -Andrea Signed-off-by: Andrea Righi ari...@develer.com --- mm/page-writeback.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d83f41c..dbee976 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -701,7 +701,7 @@ void throttle_vm_writeout(gfp_t gfp_mask) dirty = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES); - if (dirty 0) + if ((long)dirty 0) This will also be problematic as on 32bit systems, your uppper limit of dirty memory will be 2G? I guess, I will prefer one of the two. - return the error code from function and pass a pointer to store stats in as function argument. 
- Or Peter's suggestion of checking mem_cgroup_has_dirty_limit() and if per cgroup dirty control is enabled, then use per cgroup stats. In that case you don't have to return negative values. Only tricky part will be careful accouting so that none of the stats go negative in corner cases of migration etc. What do you think about Peter's suggestion + the locking stuff? (see the previous email). Otherwise, I'll choose the other solution, passing a pointer and always return the error code is not bad. Ok, so you are worried about that by the we finish mem_cgroup_has_dirty_limit() call, task might change cgroup and later we might call mem_cgroup_get_page_stat() on a different cgroup altogether which might or might not have dirty limits specified? Correct. But in what cases you don't want to use memory cgroup specified limit? I thought cgroup disabled what the only case where we need to use global limits. Otherwise a memory cgroup will have either dirty_bytes specified or by default inherit global dirty_ratio which is a valid number. If that's the case then you don't have to take rcu_lock() outside get_page_stat()? IOW, apart from cgroup being disabled, what are the other cases where you expect to not use cgroup's page stat and use global stats? At boot, when mem_cgroup_from_task() may return NULL. But this is not related to the RCU acquisition. Nevermind. You're right. In any case even if a task is migrated to a different cgroup it will always have mem_cgroup_has_dirty_limit() == true. So RCU protection is not needed outside these functions. OK, I'll go with the Peter's suggestion. Thanks! -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Wed, Mar 03, 2010 at 11:07:35AM +0100, Peter Zijlstra wrote: On Tue, 2010-03-02 at 23:14 +0100, Andrea Righi wrote: I agree mem_cgroup_has_dirty_limit() is nicer. But we must do that under RCU, so something like: rcu_read_lock(); if (mem_cgroup_has_dirty_limit()) mem_cgroup_get_page_stat() else global_page_state() rcu_read_unlock(); That is bad when mem_cgroup_has_dirty_limit() always returns false (e.g., when memory cgroups are disabled). So I fallback to the old interface. Why is it that mem_cgroup_has_dirty_limit() needs RCU when mem_cgroup_get_page_stat() doesn't? That is, simply make mem_cgroup_has_dirty_limit() not require RCU in the same way *_get_page_stat() doesn't either. OK, I agree we can get rid of RCU protection here (see my previous email). BTW the point was that after mem_cgroup_has_dirty_limit() the task might be moved to another cgroup, but also in this case mem_cgroup_has_dirty_limit() will be always true, so mem_cgroup_get_page_stat() is always coherent. What do you think about: mem_cgroup_lock(); if (mem_cgroup_has_dirty_limit()) mem_cgroup_get_page_stat() else global_page_state() mem_cgroup_unlock(); Where mem_cgroup_read_lock/unlock() simply expand to nothing when memory cgroups are disabled. I think you're engineering the wrong way around. That allows for a 0 dirty limit (which should work and basically makes all io synchronous). IMHO it is better to reserve 0 for the special value disabled like the global settings. A synchronous IO can be also achieved using a dirty limit of 1. Why?! 0 clearly states no writeback cache, IOW sync writes, a 1 byte/page writeback cache effectively reduces to the same thing, but its not the same thing conceptually. If you want to put the size and enable into a single variable pick -1 for disable or so. I might agree, and actually I prefer this solution.. 
but in this way we would use a different interface respect to the equivalent vm_dirty_ratio / vm_dirty_bytes global settings (as well as dirty_background_ratio / dirty_background_bytes). IMHO it's better to use the same interface to avoid user misunderstandings. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Wed, Mar 03, 2010 at 05:21:32PM +0900, KAMEZAWA Hiroyuki wrote: On Wed, 3 Mar 2010 15:15:49 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: Agreed. Let's try how we can write a code in clean way. (we have time ;) For now, to me, IRQ disabling while lock_page_cgroup() seems to be a little over killing. What I really want is lockless code...but it seems impossible under current implementation. I wonder the fact the page is never unchareged under us can give us some chances ...Hmm. How about this ? Basically, I don't like duplicating information...so, # of new pcg_flags may be able to be reduced. I'm glad this can be a hint for Andrea-san. == --- include/linux/page_cgroup.h | 44 - mm/memcontrol.c | 91 +++- 2 files changed, 132 insertions(+), 3 deletions(-) Index: mmotm-2.6.33-Mar2/include/linux/page_cgroup.h === --- mmotm-2.6.33-Mar2.orig/include/linux/page_cgroup.h +++ mmotm-2.6.33-Mar2/include/linux/page_cgroup.h @@ -39,6 +39,11 @@ enum { PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. 
*/ PCG_ACCT_LRU, /* page has been accounted for */ + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */ + PCG_ACCT_DIRTY, + PCG_ACCT_WB, + PCG_ACCT_WB_TEMP, + PCG_ACCT_UNSTABLE, }; #define TESTPCGFLAG(uname, lname)\ @@ -73,6 +78,23 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU) TESTPCGFLAG(AcctLRU, ACCT_LRU) TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU) +SETPCGFLAG(AcctDirty, ACCT_DIRTY); +CLEARPCGFLAG(AcctDirty, ACCT_DIRTY); +TESTPCGFLAG(AcctDirty, ACCT_DIRTY); + +SETPCGFLAG(AcctWB, ACCT_WB); +CLEARPCGFLAG(AcctWB, ACCT_WB); +TESTPCGFLAG(AcctWB, ACCT_WB); + +SETPCGFLAG(AcctWBTemp, ACCT_WB_TEMP); +CLEARPCGFLAG(AcctWBTemp, ACCT_WB_TEMP); +TESTPCGFLAG(AcctWBTemp, ACCT_WB_TEMP); + +SETPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE); +CLEARPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE); +TESTPCGFLAG(AcctUnstableNFS, ACCT_UNSTABLE); + + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc-page); @@ -82,7 +104,9 @@ static inline enum zone_type page_cgroup { return page_zonenum(pc-page); } - +/* + * lock_page_cgroup() should not be held under mapping-tree_lock + */ static inline void lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, pc-flags); @@ -93,6 +117,24 @@ static inline void unlock_page_cgroup(st bit_spin_unlock(PCG_LOCK, pc-flags); } +/* + * Lock order is + * lock_page_cgroup() + * lock_page_cgroup_migrate() + * This lock is not be lock for charge/uncharge but for account moving. + * i.e. overwrite pc-mem_cgroup. The lock owner should guarantee by itself + * the page is uncharged while we hold this. 
+ */ +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_lock(PCG_MIGRATE_LOCK, pc-flags); +} + +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_unlock(PCG_MIGRATE_LOCK, pc-flags); +} + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; Index: mmotm-2.6.33-Mar2/mm/memcontrol.c === --- mmotm-2.6.33-Mar2.orig/mm/memcontrol.c +++ mmotm-2.6.33-Mar2/mm/memcontrol.c @@ -87,6 +87,10 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */ + MEM_CGROUP_STAT_DIRTY, + MEM_CGROUP_STAT_WBACK, + MEM_CGROUP_STAT_WBACK_TEMP, + MEM_CGROUP_STAT_UNSTABLE_NFS, MEM_CGROUP_STAT_NSTATS, }; @@ -1360,6 +1364,86 @@ done: } /* + * Update file cache's status for memcg. Before calling this, + * mapping-tree_lock should be held and preemption is disabled. + * Then, it's guarnteed that the page is not uncharged while we + * access page_cgroup. We can make use of that. + */ +void mem_cgroup_update_stat_locked(struct page *page, int idx, bool set) +{ + struct page_cgroup *pc; + struct mem_cgroup *mem; + + pc = lookup_page_cgroup(page); + /* Not accounted ? */ + if (!PageCgroupUsed(pc)) + return; + lock_page_cgroup_migrate(pc); + /* + * It's guarnteed that this page is never uncharged. + * The only racy problem is moving account among memcgs. + */ + switch (idx) { + case MEM_CGROUP_STAT_DIRTY: + if (set) + SetPageCgroupAcctDirty(pc); + else + ClearPageCgroupAcctDirty(pc); + break; + case MEM_CGROUP_STAT_WBACK: + if (set) + SetPageCgroupAcctWB(pc); +
[Devel] [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
Control the maximum amount of dirty pages a cgroup can have at any given time. The per-cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start to write-out (background or actively) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in the VM layer and enforce a write-out before any cgroup will consume the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits. Changelog (v3 -> v4) * handle the migration of tasks across different cgroups NOTE: at the moment we don't move charges of file cache pages, so this functionality is not immediately necessary. However, since the migration of file cache pages is planned, it is better to start handling file pages anyway. * properly account dirty pages in nilfs2 (thanks to Kirill A. Shutemov kir...@shutemov.name) * lockless access to dirty memory parameters * fix: page_cgroup lock must not be acquired under mapping->tree_lock (thanks to Daisuke Nishimura nishim...@mxp.nes.nec.co.jp and KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com) * code restyling -Andrea
[Devel] [PATCH -mmotm 1/4] memcg: dirty memory documentation
Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi ari...@develer.com --- Documentation/cgroups/memory.txt | 36 1 files changed, 36 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 49f86f3..38ca499 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -310,6 +310,11 @@ cache - # of bytes of page cache memory. rss- # of bytes of anonymous and swap cache memory. pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout- # of pages paged out (equivalent to # of uncharging events). +filedirty - # of pages that are waiting to get written back to the disk. +writeback - # of pages that are actively being written back to the disk. +writeback_tmp - # of pages used by FUSE for temporary writeback buffers. +nfs- # of NFS pages sent to the server, but not yet committed to + the actual storage. active_anon- # of bytes of anonymous and swap cache memory on active lru list. inactive_anon - # of bytes of anonymous memory and swap cache memory on @@ -345,6 +350,37 @@ Note: - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. +5.4 dirty memory + + Control the maximum amount of dirty pages a cgroup can have at any given time. + + Limiting dirty memory is like fixing the max amount of dirty (hard to + reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, + they will not be able to consume more than their designated share of dirty + pages and will be forced to perform write-out if they cross that limit. + + The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. + It is possible to configure a limit to trigger both a direct writeback or a + background writeback performed by per-bdi flusher threads. 
+ + Per-cgroup dirty limits can be set using the following files in the cgroupfs: + + - memory.dirty_ratio: contains, as a percentage of cgroup memory, the +amount of dirty memory at which a process which is generating disk writes +inside the cgroup will start itself writing out dirty data. + + - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in +bytes) at which a process generating disk writes will start itself writing +out dirty data. + + - memory.dirty_background_ratio: contains, as a percentage of the cgroup +memory, the amount of dirty memory at which background writeback kernel +threads will start writing out dirty data. + + - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in +bytes) at which background writeback kernel threads will start writing out +dirty data. + 6. Hierarchy support -- 1.6.3.3
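The ratio/bytes precedence documented above (a nonzero memory.dirty_bytes overrides memory.dirty_ratio, mirroring the global vm.dirty_* sysctls) can be sketched as a small helper; this is an illustrative userspace function, not kernel code:

```c
#define SKETCH_PAGE_SIZE 4096UL

/* Dirty threshold in pages: dirty_bytes, when nonzero, takes precedence
 * over dirty_ratio (a percentage of the cgroup's dirtyable memory).
 * Illustrative sketch of the documented semantics only. */
static unsigned long dirty_limit_pages(unsigned long dirty_bytes,
                                       int dirty_ratio,
                                       unsigned long memory_pages)
{
    if (dirty_bytes)
        return dirty_bytes / SKETCH_PAGE_SIZE;
    return memory_pages * (unsigned long)dirty_ratio / 100;
}
```

For example, with 1000 dirtyable pages, dirty_ratio=20 gives a 200-page threshold, while setting dirty_bytes=8192 overrides that with a 2-page threshold.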
[Devel] [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
Introduce page_cgroup flags to keep track of file cache pages. Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/page_cgroup.h | 49 +++ 1 files changed, 49 insertions(+), 0 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..1b79ded 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -39,6 +39,12 @@ enum { PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ PCG_ACCT_LRU, /* page has been accounted for */ + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */ + PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/ + PCG_ACCT_DIRTY, /* page is dirty */ + PCG_ACCT_WRITEBACK, /* page is being written back to disk */ + PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */ + PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */ }; #define TESTPCGFLAG(uname, lname) \ @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU) TESTPCGFLAG(AcctLRU, ACCT_LRU) TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU) +/* File cache and dirty memory flags */ +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED) + +TESTPCGFLAG(Dirty, ACCT_DIRTY) +SETPCGFLAG(Dirty, ACCT_DIRTY) +CLEARPCGFLAG(Dirty, ACCT_DIRTY) + +TESTPCGFLAG(Writeback, ACCT_WRITEBACK) +SETPCGFLAG(Writeback, ACCT_WRITEBACK) +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK) + +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) + +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc-page); @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return 
page_zonenum(pc->page); } +/* + * lock_page_cgroup() should not be held under mapping->tree_lock + */ static inline void lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, pc->flags); @@ -93,6 +123,25 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc) bit_spin_unlock(PCG_LOCK, pc->flags); } +/* + * Lock order is + * lock_page_cgroup() + * lock_page_cgroup_migrate() + * + * This lock is not a lock for charge/uncharge but for account moving, + * i.e. overwriting pc->mem_cgroup. The lock owner should guarantee by itself + * the page is uncharged while we hold this. + */ +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_lock(PCG_MIGRATE_LOCK, pc->flags); +} + +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_unlock(PCG_MIGRATE_LOCK, pc->flags); +} + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; -- 1.6.3.3
[Devel] [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs: - Direct write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 80 - mm/memcontrol.c| 420 +++- 2 files changed, 450 insertions(+), 50 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1f9b119..cc3421b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -19,12 +19,66 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H + +#include linux/writeback.h #include linux/cgroup.h + struct mem_cgroup; struct page_cgroup; struct page; struct mm_struct; +/* Cgroup memory statistics items exported to the kernel */ +enum mem_cgroup_page_stat_item { + MEMCG_NR_DIRTYABLE_PAGES, + MEMCG_NR_RECLAIM_PAGES, + MEMCG_NR_WRITEBACK, + MEMCG_NR_DIRTY_WRITEBACK_PAGES, +}; + +/* Dirty memory parameters */ +struct dirty_param { + int dirty_ratio; + unsigned long dirty_bytes; + int dirty_background_ratio; + unsigned long dirty_background_bytes; +}; + +/* + * Statistics for memory cgroup. + */ +enum mem_cgroup_stat_index { + /* +* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. 
+*/ + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ + MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ + MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */ + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ + MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ + MEM_CGROUP_STAT_WRITEBACK_TEMP, /* # of pages under writeback using + temporary buffers */ + MEM_CGROUP_STAT_UNSTABLE_NFS, /* # of NFS unstable pages */ + + MEM_CGROUP_STAT_NSTATS, +}; + +/* + * TODO: provide a validation check routine. And retry if validation + * fails. + */ +static inline void get_global_dirty_param(struct dirty_param *param) +{ + param-dirty_ratio = vm_dirty_ratio; + param-dirty_bytes = vm_dirty_bytes; + param-dirty_background_ratio = dirty_background_ratio; + param-dirty_background_bytes = dirty_background_bytes; +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All charge functions with gfp_mask should use GFP_KERNEL or @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern int do_swap_account; #endif +extern bool mem_cgroup_has_dirty_limit(void); +extern void get_dirty_param(struct dirty_param *param); +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item); + static inline bool mem_cgroup_disabled(void) { if (mem_cgroup_subsys.disabled) @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void) } extern bool mem_cgroup_oom_called(struct task_struct *task); -void mem_cgroup_update_file_mapped(struct page *page, int val); +void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); @@ -300,8 +359,8 @@ 
mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_update_file_mapped(struct page *page, - int val) +static inline void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val) { } @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline bool mem_cgroup_has_dirty_limit(void) +{ + return false; +} + +static inline void get_dirty_param(struct dirty_param *param) +{ + get_global_dirty_param(param); +} + +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item) +{ + return -ENOSYS; +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 497b6f7..9842e7b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/ #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */ #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */ -/* - * Statistics for memory cgroup. - */ -enum
[Devel] [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com --- fs/fuse/file.c |5 +++ fs/nfs/write.c |4 ++ fs/nilfs2/segment.c | 11 +- mm/filemap.c|1 + mm/page-writeback.c | 91 ++- mm/rmap.c |4 +- mm/truncate.c |2 + 7 files changed, 84 insertions(+), 34 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a9f5e13..dbbdd53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -11,6 +11,7 @@ #include linux/pagemap.h #include linux/slab.h #include linux/kernel.h +#include linux/memcontrol.h #include linux/sched.h #include linux/module.h @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(req-writepages_entry); dec_bdi_stat(bdi, BDI_WRITEBACK); + mem_cgroup_update_stat(req-pages[0], + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1); dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP); bdi_writeout_inc(bdi); wake_up(fi-page_waitq); @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page) req-inode = inode; inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + mem_cgroup_update_stat(tmp_page, + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); end_page_writeback(page); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index b753242..7316f7a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req) req-wb_index, NFS_PAGE_TAG_COMMIT); spin_unlock(inode-i_lock); + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = req-wb_page; if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1); 
dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE); return 1; @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_update_stat(req-wb_page, + MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index ada2f1b..27a01b1 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -24,6 +24,7 @@ #include linux/pagemap.h #include linux/buffer_head.h #include linux/writeback.h +#include linux/memcontrol.h #include linux/bio.h #include linux/completion.h #include linux/blkdev.h @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head); kunmap_atomic(kaddr, KM_USER0); - if (!TestSetPageWriteback(clone_page)) + if (!TestSetPageWriteback(clone_page)) { + mem_cgroup_update_stat(clone_page, + MEM_CGROUP_STAT_WRITEBACK, 1); inc_zone_page_state(clone_page, NR_WRITEBACK); + } unlock_page(clone_page); return 0; @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err) } if (buffer_nilfs_allocated(page_buffers(page))) { - if (TestClearPageWriteback(page)) + if (TestClearPageWriteback(page)) { + mem_cgroup_update_stat(page, + MEM_CGROUP_STAT_WRITEBACK, -1); dec_zone_page_state(page, NR_WRITEBACK); + } } else end_page_writeback(page); } diff --git a/mm/filemap.c b/mm/filemap.c index fe09e51..f85acae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page) * having removed the page entirely. 
*/ if (PageDirty(page) mapping_cap_account_dirty(mapping)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); } diff --git
[Devel] Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
On Thu, Mar 04, 2010 at 11:18:28AM -0500, Vivek Goyal wrote: On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote: [..] diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5a0f8f3..c5d14ea 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties; */ static int calc_period_shift(void) { + struct dirty_param dirty_param; unsigned long dirty_total; - if (vm_dirty_bytes) - dirty_total = vm_dirty_bytes / PAGE_SIZE; + get_dirty_param(&dirty_param); + + if (dirty_param.dirty_bytes) + dirty_total = dirty_param.dirty_bytes / PAGE_SIZE; else - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / - 100; + dirty_total = (dirty_param.dirty_ratio * + determine_dirtyable_memory()) / 100; return 2 + ilog2(dirty_total - 1); } @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) */ unsigned long determine_dirtyable_memory(void) { - unsigned long x; - - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages(); + unsigned long memory; + s64 memcg_memory; + memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages(); if (!vm_highmem_is_dirtyable) - x -= highmem_dirtyable_memory(x); - - return x + 1; /* Ensure that we never return 0 */ + memory -= highmem_dirtyable_memory(memory); + if (mem_cgroup_has_dirty_limit()) + return memory + 1; Should above be? if (!mem_cgroup_has_dirty_limit()) return memory + 1; Very true. I'll post another patch with this and Kirill's fixes. Thanks, -Andrea Vivek + memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES); + return min((unsigned long)memcg_memory, memory + 1); }
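The corrected fallback Vivek points out (return the global figure when there is no per-cgroup limit, otherwise clamp to the smaller of the cgroup's and the global value) can be sketched in userspace as follows; names are illustrative, not the kernel's:

```c
#include <stdbool.h>

/* Sketch of the fixed determine_dirtyable_memory() logic: without a
 * per-cgroup dirty limit use the global value (+1 so we never return 0);
 * with one, return min(memcg pages, global + 1). Illustrative only. */
static unsigned long dirtyable_memory(bool has_memcg_limit,
                                      long memcg_pages,
                                      unsigned long global_pages)
{
    unsigned long global = global_pages + 1; /* never return 0 */

    if (!has_memcg_limit)
        return global;
    return (unsigned long)memcg_pages < global
            ? (unsigned long)memcg_pages : global;
}
```

With the original (inverted) condition, a cgroup with a dirty limit would have been handed the system-wide figure and its limit never clamped, which is exactly the bug the `!` fixes.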
[Devel] Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
On Thu, Mar 04, 2010 at 10:41:43PM +0530, Balbir Singh wrote: * Andrea Righi ari...@develer.com [2010-03-04 11:40:11]: Control the maximum amount of dirty pages a cgroup can have at any given time. Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start to write-out (background or actively) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in VM layer and enforce a write-out before any cgroup will consume the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits. Changelog (v3 - v4) ~~ * handle the migration of tasks across different cgroups NOTE: at the moment we don't move charges of file cache pages, so this functionality is not immediately necessary. However, since the migration of file cache pages is in plan, it is better to start handling file pages anyway. * properly account dirty pages in nilfs2 (thanks to Kirill A. Shutemov kir...@shutemov.name) * lockless access to dirty memory parameters * fix: page_cgroup lock must not be acquired under mapping-tree_lock (thanks to Daisuke Nishimura nishim...@mxp.nes.nec.co.jp and KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com) * code restyling This seems to be converging, what sort of tests are you running on this patchset? A very simple test at the moment, just some parallel dd's running in different cgroups. 
For example: - cgroup A: low dirty limits (writes are almost sync) echo 1000 > /cgroups/A/memory.dirty_bytes echo 1000 > /cgroups/A/memory.dirty_background_bytes - cgroup B: high dirty limits (writes are all buffered in page cache) echo 100 > /cgroups/B/memory.dirty_ratio echo 50 > /cgroups/B/memory.dirty_background_ratio Then run the dd's and look at memory.stat: - cgroup A: # dd if=/dev/zero of=A bs=1M count=1000 - cgroup B: # dd if=/dev/zero of=B bs=1M count=1000 A random snapshot during the writes: # grep dirty\|writeback /cgroups/[AB]/memory.stat /cgroups/A/memory.stat:filedirty 0 /cgroups/A/memory.stat:writeback 0 /cgroups/A/memory.stat:writeback_tmp 0 /cgroups/A/memory.stat:dirty_pages 0 /cgroups/A/memory.stat:writeback_pages 0 /cgroups/A/memory.stat:writeback_temp_pages 0 /cgroups/B/memory.stat:filedirty 67226 /cgroups/B/memory.stat:writeback 136 /cgroups/B/memory.stat:writeback_tmp 0 /cgroups/B/memory.stat:dirty_pages 67226 /cgroups/B/memory.stat:writeback_pages 136 /cgroups/B/memory.stat:writeback_temp_pages 0 I plan to run a more detailed IO benchmark soon. -Andrea
[Devel] Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
On Thu, Mar 04, 2010 at 02:41:44PM -0500, Vivek Goyal wrote: On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote: [..] diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5a0f8f3..c5d14ea 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties; */ static int calc_period_shift(void) { + struct dirty_param dirty_param; unsigned long dirty_total; - if (vm_dirty_bytes) - dirty_total = vm_dirty_bytes / PAGE_SIZE; + get_dirty_param(&dirty_param); + + if (dirty_param.dirty_bytes) + dirty_total = dirty_param.dirty_bytes / PAGE_SIZE; else - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / - 100; + dirty_total = (dirty_param.dirty_ratio * + determine_dirtyable_memory()) / 100; return 2 + ilog2(dirty_total - 1); } Hmm.., I have been staring at this for some time and I think something is wrong. I don't fully understand the way floating proportions are working, but this function seems to be calculating the period over which we need to measure the proportions (the vm_completion proportion and vm_dirties proportions). And we update this period (shift) when the admin updates dirty_ratio or dirty_bytes etc. In that case we recalculate the global dirty limit, take log2 and use that as the period over which we monitor and calculate proportions. If yes, then it should be global and not per cgroup (because all our accounting of bdi completion is global and not per cgroup). PeterZ can tell us more about it. I am just raising the flag here to be sure. Thanks Vivek Hi Vivek, I tend to agree, we must use global dirty values here. BTW, update_completion_period() is called from the dirty_* handlers, so the current memcg is unrelated there. That's the memcg where the admin is running, so probably it's the root memcg almost all the time, but it's wrong in principle. In conclusion this patch shouldn't touch calc_period_shift().
Thanks, -Andrea
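The conclusion above — calc_period_shift() must keep using the global dirty values — can be illustrated with a userspace sketch of its computation, period = 2 + ilog2(dirty_total - 1); the helper names are illustrative, not kernel code:

```c
/* Minimal integer log2 (floor), stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
    int r = -1;

    while (v) {
        v >>= 1;
        r++;
    }
    return r;
}

/* Sketch of calc_period_shift() using global values only, per the
 * conclusion of the thread: a nonzero vm_dirty_bytes takes precedence,
 * otherwise dirty_total is a ratio of the dirtyable pages. */
static int calc_period_shift_sketch(unsigned long vm_dirty_bytes,
                                    int vm_dirty_ratio,
                                    unsigned long dirtyable_pages,
                                    unsigned long page_size)
{
    unsigned long dirty_total;

    if (vm_dirty_bytes)
        dirty_total = vm_dirty_bytes / page_size;
    else
        dirty_total = (unsigned long)vm_dirty_ratio * dirtyable_pages / 100;
    return 2 + ilog2_ul(dirty_total - 1);
}
```

Note how the result depends only on system-wide inputs; feeding it per-cgroup parameters, as the v4 patch briefly did, would make the global proportion period depend on whichever cgroup the writing admin happens to be in.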
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
On Fri, Mar 05, 2010 at 10:12:34AM +0900, Daisuke Nishimura wrote: On Thu, 4 Mar 2010 11:40:14 +0100, Andrea Righi ari...@develer.com wrote: Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs: - Direct write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 80 - mm/memcontrol.c| 420 +++- 2 files changed, 450 insertions(+), 50 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1f9b119..cc3421b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -19,12 +19,66 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H + +#include linux/writeback.h #include linux/cgroup.h + struct mem_cgroup; struct page_cgroup; struct page; struct mm_struct; +/* Cgroup memory statistics items exported to the kernel */ +enum mem_cgroup_page_stat_item { + MEMCG_NR_DIRTYABLE_PAGES, + MEMCG_NR_RECLAIM_PAGES, + MEMCG_NR_WRITEBACK, + MEMCG_NR_DIRTY_WRITEBACK_PAGES, +}; + +/* Dirty memory parameters */ +struct dirty_param { + int dirty_ratio; + unsigned long dirty_bytes; + int dirty_background_ratio; + unsigned long dirty_background_bytes; +}; + +/* + * Statistics for memory cgroup. + */ +enum mem_cgroup_stat_index { + /* +* For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. 
+*/ + MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ + MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ + MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ + MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */ + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ + MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ + MEM_CGROUP_STAT_WRITEBACK_TEMP, /* # of pages under writeback using + temporary buffers */ + MEM_CGROUP_STAT_UNSTABLE_NFS, /* # of NFS unstable pages */ + + MEM_CGROUP_STAT_NSTATS, +}; + I must have said it earlier, but I don't think exporting all of these flags is a good idea. Can you export only mem_cgroup_page_stat_item(of course, need to add MEMCG_NR_FILE_MAPPED)? We can translate mem_cgroup_page_stat_item to mem_cgroup_stat_index by simple arithmetic if you define MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS sequentially. Agreed. +/* + * TODO: provide a validation check routine. And retry if validation + * fails. 
+ */ +static inline void get_global_dirty_param(struct dirty_param *param) +{ + param-dirty_ratio = vm_dirty_ratio; + param-dirty_bytes = vm_dirty_bytes; + param-dirty_background_ratio = dirty_background_ratio; + param-dirty_background_bytes = dirty_background_bytes; +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All charge functions with gfp_mask should use GFP_KERNEL or @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern int do_swap_account; #endif +extern bool mem_cgroup_has_dirty_limit(void); +extern void get_dirty_param(struct dirty_param *param); +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item); + static inline bool mem_cgroup_disabled(void) { if (mem_cgroup_subsys.disabled) @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void) } extern bool mem_cgroup_oom_called(struct task_struct *task); -void mem_cgroup_update_file_mapped(struct page *page, int val); +void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_update_file_mapped(struct page *page, - int val) +static inline void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val) { } @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline bool mem_cgroup_has_dirty_limit(void) +{ + return false; +} + +static inline void get_dirty_param(struct
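Nishimura-san's suggestion — lay the file statistics out sequentially so an exported mem_cgroup_page_stat_item maps onto the internal mem_cgroup_stat_index by a constant offset — can be sketched like this; the enum values here are illustrative, not the actual kernel layout:

```c
/* Items exported to the rest of the kernel (illustrative subset). */
enum page_stat_item {
    SKETCH_NR_FILE_MAPPED,
    SKETCH_NR_FILE_DIRTY,
    SKETCH_NR_WRITEBACK,
    SKETCH_NR_UNSTABLE_NFS,
};

/* Internal statistics indices: the file stats must start at a fixed
 * position and stay in the same order as page_stat_item for the
 * constant-offset translation to work. */
enum stat_index {
    SKETCH_STAT_CACHE,
    SKETCH_STAT_RSS,
    SKETCH_STAT_FILE_MAPPED,   /* file stats start here, sequential */
    SKETCH_STAT_FILE_DIRTY,
    SKETCH_STAT_WRITEBACK,
    SKETCH_STAT_UNSTABLE_NFS,
    SKETCH_STAT_NSTATS,
};

/* "Simple arithmetic" translation: exported item -> internal index. */
static enum stat_index page_stat_to_index(enum page_stat_item item)
{
    return (enum stat_index)(SKETCH_STAT_FILE_MAPPED + (int)item);
}
```

The payoff is that only the small exported enum leaks into memcontrol.h, while the full internal statistics enum stays private to memcontrol.c, which is exactly the encapsulation being asked for in the review.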
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
On Fri, Mar 05, 2010 at 10:58:55AM +0900, KAMEZAWA Hiroyuki wrote: On Fri, 5 Mar 2010 10:12:34 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: On Thu, 4 Mar 2010 11:40:14 +0100, Andrea Righi ari...@develer.com wrote: Infrastructure to account dirty pages per cgroup and add dirty limit static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data) { int *val = data; @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem) } /* - * Currently used to update mapped file statistics, but the routine can be - * generalized to update other statistics as well. + * Generalized routine to update file cache's status for memcg. + * + * Before calling this, mapping->tree_lock should be held and preemption is + * disabled. Then, it's guaranteed that the page is not uncharged while we + * access page_cgroup. We can make use of that. */ IIUC, mapping->tree_lock is held with irq disabled, so I think saying "mapping->tree_lock should be held with irq disabled" would be enough. And, as far as I can see, callers of this function have not ensured this yet in [4/4]. How about: void mem_cgroup_update_stat_locked(...) { ... } void mem_cgroup_update_stat_unlocked(mapping, ...) { spin_lock_irqsave(&mapping->tree_lock, ...); mem_cgroup_update_stat_locked(); spin_unlock_irqrestore(...); } Rather than tree_lock, lock_page_cgroup() can be used if tree_lock is not held: lock_page_cgroup(); mem_cgroup_update_stat_locked(); unlock_page_cgroup(); Andrea-san, FILE_MAPPED is updated without tree_lock, at least. You can't depend on the migration lock for FILE_MAPPED. Right. I'll consider this in the next version of the patch. Thanks, -Andrea
[Devel] Re: [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
On Fri, Mar 05, 2010 at 12:02:49PM +0530, Balbir Singh wrote: * Andrea Righi ari...@develer.com [2010-03-04 11:40:13]: Introduce page_cgroup flags to keep track of file cache pages. Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Signed-off-by: Andrea Righi ari...@develer.com --- Looks good Acked-by: Balbir Singh bal...@linux.vnet.ibm.com include/linux/page_cgroup.h | 49 +++ 1 files changed, 49 insertions(+), 0 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..1b79ded 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -39,6 +39,12 @@ enum { PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ PCG_ACCT_LRU, /* page has been accounted for */ + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */ + PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/ + PCG_ACCT_DIRTY, /* page is dirty */ + PCG_ACCT_WRITEBACK, /* page is being written back to disk */ + PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */ + PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */ }; #define TESTPCGFLAG(uname, lname) \ @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU) TESTPCGFLAG(AcctLRU, ACCT_LRU) TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU) +/* File cache and dirty memory flags */ +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED) + +TESTPCGFLAG(Dirty, ACCT_DIRTY) +SETPCGFLAG(Dirty, ACCT_DIRTY) +CLEARPCGFLAG(Dirty, ACCT_DIRTY) + +TESTPCGFLAG(Writeback, ACCT_WRITEBACK) +SETPCGFLAG(Writeback, ACCT_WRITEBACK) +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK) + +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) + +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) + static inline int 
page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc->page); @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc->page); } +/* + * lock_page_cgroup() should not be held under mapping->tree_lock + */ Maybe a DEBUG WARN_ON would be appropriate here? Sounds good. WARN_ON_ONCE()? Thanks, -Andrea
[Devel] Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
On Fri, Mar 05, 2010 at 12:08:43PM +0530, Balbir Singh wrote: * Andrea Righi ari...@develer.com [2010-03-04 11:40:15]: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com --- fs/fuse/file.c |5 +++ fs/nfs/write.c |4 ++ fs/nilfs2/segment.c | 11 +- mm/filemap.c|1 + mm/page-writeback.c | 91 ++- mm/rmap.c |4 +- mm/truncate.c |2 + 7 files changed, 84 insertions(+), 34 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a9f5e13..dbbdd53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -11,6 +11,7 @@ #include linux/pagemap.h #include linux/slab.h #include linux/kernel.h +#include linux/memcontrol.h #include linux/sched.h #include linux/module.h @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(req-writepages_entry); dec_bdi_stat(bdi, BDI_WRITEBACK); + mem_cgroup_update_stat(req-pages[0], + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1); dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP); bdi_writeout_inc(bdi); wake_up(fi-page_waitq); @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page) req-inode = inode; inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + mem_cgroup_update_stat(tmp_page, + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); end_page_writeback(page); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index b753242..7316f7a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req) req-wb_index, NFS_PAGE_TAG_COMMIT); spin_unlock(inode-i_lock); + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = req-wb_page; if 
(test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE); return 1; @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_update_stat(req-wb_page, + MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index ada2f1b..27a01b1 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -24,6 +24,7 @@ #include linux/pagemap.h #include linux/buffer_head.h #include linux/writeback.h +#include linux/memcontrol.h #include linux/bio.h #include linux/completion.h #include linux/blkdev.h @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head); kunmap_atomic(kaddr, KM_USER0); - if (!TestSetPageWriteback(clone_page)) + if (!TestSetPageWriteback(clone_page)) { + mem_cgroup_update_stat(clone_page, + MEM_CGROUP_STAT_WRITEBACK, 1); I wonder if we should start implementing inc and dec to avoid passing the +1 and -1 parameters. It should make the code easier to read. OK, it's always +1/-1, and I don't see any case where we should use different numbers. So, better to move to the inc/dec naming. 
inc_zone_page_state(clone_page, NR_WRITEBACK); + } unlock_page(clone_page); return 0; @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err) } if (buffer_nilfs_allocated(page_buffers(page))) { - if (TestClearPageWriteback(page)) + if (TestClearPageWriteback(page)) { + mem_cgroup_update_stat(page, + MEM_CGROUP_STAT_WRITEBACK, -1); dec_zone_page_state(page, NR_WRITEBACK); + } } else end_page_writeback(page); } diff --git a/mm/filemap.c b/mm/filemap.c index fe09e51..f85acae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c
[Devel] [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v5)
Control the maximum amount of dirty pages a cgroup can have at any given time. A per-cgroup dirty limit fixes the maximum amount of dirty (hard-to-reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start write-out (direct or background) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in the VM layer and enforce a write-out before any cgroup consumes the global amount of dirty pages defined by /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes. Changelog (v4 -> v5) * fixed a potential deadlock between lock_page_cgroup() and mapping->tree_lock (I'm not sure I did the right thing for this point, so review and tests are very welcome) * introduced inc/dec functions to update file cache accounting * export only a restricted subset of mem_cgroup_stat_index flags * fixed a bug in determine_dirtyable_memory() to correctly return the local memcg dirtyable memory * always use global dirty memory settings in calc_period_shift() -Andrea
[Devel] [PATCH -mmotm 1/4] memcg: dirty memory documentation
Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi ari...@develer.com --- Documentation/cgroups/memory.txt | 36 1 files changed, 36 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 49f86f3..38ca499 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -310,6 +310,11 @@ cache - # of bytes of page cache memory. rss- # of bytes of anonymous and swap cache memory. pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout- # of pages paged out (equivalent to # of uncharging events). +filedirty - # of pages that are waiting to get written back to the disk. +writeback - # of pages that are actively being written back to the disk. +writeback_tmp - # of pages used by FUSE for temporary writeback buffers. +nfs- # of NFS pages sent to the server, but not yet committed to + the actual storage. active_anon- # of bytes of anonymous and swap cache memory on active lru list. inactive_anon - # of bytes of anonymous memory and swap cache memory on @@ -345,6 +350,37 @@ Note: - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. +5.4 dirty memory + + Control the maximum amount of dirty pages a cgroup can have at any given time. + + Limiting dirty memory is like fixing the max amount of dirty (hard to + reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, + they will not be able to consume more than their designated share of dirty + pages and will be forced to perform write-out if they cross that limit. + + The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. + It is possible to configure a limit to trigger both a direct writeback or a + background writeback performed by per-bdi flusher threads. 
+ + Per-cgroup dirty limits can be set using the following files in the cgroupfs: + + - memory.dirty_ratio: contains, as a percentage of cgroup memory, the +amount of dirty memory at which a process which is generating disk writes +inside the cgroup will start itself writing out dirty data. + + - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in +bytes) at which a process generating disk writes will start itself writing +out dirty data. + + - memory.dirty_background_ratio: contains, as a percentage of the cgroup +memory, the amount of dirty memory at which background writeback kernel +threads will start writing out dirty data. + + - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in +bytes) at which background writeback kernel threads will start writing out +dirty data. + 6. Hierarchy support -- 1.6.3.3 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
Infrastructure to account dirty pages per cgroup and to add dirty limit interface to the cgroupfs: - Direct write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 122 +++- mm/memcontrol.c| 507 +--- 2 files changed, 593 insertions(+), 36 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 44301c6..61fdca4 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -19,12 +19,55 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H + +#include linux/writeback.h #include linux/cgroup.h + struct mem_cgroup; struct page_cgroup; struct page; struct mm_struct; +/* Cgroup memory statistics items exported to the kernel */ +enum mem_cgroup_read_page_stat_item { + MEMCG_NR_DIRTYABLE_PAGES, + MEMCG_NR_RECLAIM_PAGES, + MEMCG_NR_WRITEBACK, + MEMCG_NR_DIRTY_WRITEBACK_PAGES, +}; + +/* File cache pages accounting */ +enum mem_cgroup_write_page_stat_item { + MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */ + MEMCG_NR_FILE_DIRTY,/* # of dirty pages in page cache */ + MEMCG_NR_FILE_WRITEBACK,/* # of pages under writeback */ + MEMCG_NR_FILE_WRITEBACK_TEMP, /* # of pages under writeback using + temporary buffers */ + MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */ + + MEMCG_NR_FILE_NSTAT, +}; + +/* Dirty memory parameters */ +struct vm_dirty_param { + int dirty_ratio; + int dirty_background_ratio; + unsigned long dirty_bytes; + unsigned long dirty_background_bytes; +}; + +/* + * TODO: provide a validation check routine. And retry if validation + * fails. 
+ */ +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param) +{ + param->dirty_ratio = vm_dirty_ratio; + param->dirty_bytes = vm_dirty_bytes; + param->dirty_background_ratio = dirty_background_ratio; + param->dirty_background_bytes = dirty_background_bytes; +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All charge functions with gfp_mask should use GFP_KERNEL or @@ -117,6 +160,40 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern int do_swap_account; #endif +extern bool mem_cgroup_has_dirty_limit(void); +extern void get_vm_dirty_param(struct vm_dirty_param *param); +extern s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item); + +extern void mem_cgroup_update_page_stat_locked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge); + +extern void mem_cgroup_update_page_stat_unlocked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge); + +static inline void mem_cgroup_inc_page_stat_locked(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ + mem_cgroup_update_page_stat_locked(page, idx, true); +} + +static inline void mem_cgroup_dec_page_stat_locked(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ + mem_cgroup_update_page_stat_locked(page, idx, false); +} + +static inline void mem_cgroup_inc_page_stat_unlocked(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ + mem_cgroup_update_page_stat_unlocked(page, idx, true); +} + +static inline void mem_cgroup_dec_page_stat_unlocked(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ + mem_cgroup_update_page_stat_unlocked(page, idx, false); +} + static inline bool mem_cgroup_disabled(void) { if (mem_cgroup_subsys.disabled) @@ -124,7 +201,6 @@ static inline bool mem_cgroup_disabled(void) return false; } -void mem_cgroup_update_file_mapped(struct page *page, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid);
@@ -294,8 +370,38 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_update_file_mapped(struct page *page, - int val) +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item) +{ + return -ENOSYS; +} + +static inline void mem_cgroup_update_page_stat_locked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) +{ +} + +static inline void mem_cgroup_update_page_stat_unlocked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) +{ +} + +static inline void mem_cgroup_inc_page_stat_locked(struct page *page, + enum
[Devel] [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
Introduce page_cgroup flags to keep track of file cache pages. Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/page_cgroup.h | 45 +++ 1 files changed, 45 insertions(+), 0 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..dc66bee 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -39,6 +39,12 @@ enum { PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ PCG_ACCT_LRU, /* page has been accounted for */ + PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */ + PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/ + PCG_ACCT_DIRTY, /* page is dirty */ + PCG_ACCT_WRITEBACK, /* page is being written back to disk */ + PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */ + PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */ }; #define TESTPCGFLAG(uname, lname) \ @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU) TESTPCGFLAG(AcctLRU, ACCT_LRU) TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU) +/* File cache and dirty memory flags */ +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED) + +TESTPCGFLAG(Dirty, ACCT_DIRTY) +SETPCGFLAG(Dirty, ACCT_DIRTY) +CLEARPCGFLAG(Dirty, ACCT_DIRTY) + +TESTPCGFLAG(Writeback, ACCT_WRITEBACK) +SETPCGFLAG(Writeback, ACCT_WRITEBACK) +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK) + +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) + +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc-page); @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return 
page_zonenum(pc->page); } +/* + * lock_page_cgroup() should not be held under mapping->tree_lock + */ static inline void lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, &pc->flags); @@ -93,6 +123,21 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc) bit_spin_unlock(PCG_LOCK, &pc->flags); } +/* + * This lock is not for charge/uncharge but for account moving, + * i.e. overwriting pc->mem_cgroup. The lock owner should guarantee by itself + * that the page is not uncharged while the lock is held. + */ +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags); +} + +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc) +{ + bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags); +} + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; -- 1.6.3.3
[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)
On Tue, Mar 09, 2010 at 10:29:28AM +0900, Daisuke Nishimura wrote: On Tue, 9 Mar 2010 09:19:14 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Tue, 9 Mar 2010 01:12:52 +0100 Andrea Righi ari...@develer.com wrote: On Mon, Mar 08, 2010 at 05:31:00PM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 8 Mar 2010 17:07:11 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Mon, 8 Mar 2010 11:17:24 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: But IIRC, clear_writeback is done under treelock No ? The place where NR_WRITEBACK is updated is out of tree_lock. 1311 int test_clear_page_writeback(struct page *page) 1312 { 1313 struct address_space *mapping = page_mapping(page); 1314 int ret; 1315 1316 if (mapping) { 1317 struct backing_dev_info *bdi = mapping-backing_dev_info; 1318 unsigned long flags; 1319 1320 spin_lock_irqsave(mapping-tree_lock, flags); 1321 ret = TestClearPageWriteback(page); 1322 if (ret) { 1323 radix_tree_tag_clear(mapping-page_tree, 1324 page_index(page), 1325 PAGECACHE_TAG_WRITEBACK); 1326 if (bdi_cap_account_writeback(bdi)) { 1327 __dec_bdi_stat(bdi, BDI_WRITEBACK); 1328 __bdi_writeout_inc(bdi); 1329 } 1330 } 1331 spin_unlock_irqrestore(mapping-tree_lock, flags); 1332 } else { 1333 ret = TestClearPageWriteback(page); 1334 } 1335 if (ret) 1336 dec_zone_page_state(page, NR_WRITEBACK); 1337 return ret; 1338 } We can move this up to under tree_lock. Considering memcg, all our target has mapping. If we newly account bounce-buffers (for NILFS, FUSE, etc..), which has no -mapping, we need much more complex new charge/uncharge theory. But yes, adding new lock scheme seems complicated. (Sorry Andrea.) My concerns is performance. We may need somehing new re-implementation of locks/migrate/charge/uncharge. I agree. Performance is my concern too. 
I made a patch below and measured the time(average of 10 times) of kernel build on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig). before - root cgroup: 190.47 sec - child cgroup: 192.81 sec after - root cgroup: 191.06 sec - child cgroup: 193.06 sec Hmm... about 0.3% slower for root, 0.1% slower for child. Hmm...accepatable ? (sounds it's in error-range) BTW, why local_irq_disable() ? local_irq_save()/restore() isn't better ? Probably there's not the overhead of saving flags? maybe. Anyway, it would make the code much more readable... ok. please go ahead in this direction. Nishimura-san, would you post an independent patch ? If no, Andrea-san, please. This is the updated version. Andrea-san, can you merge this into your patch set ? OK, I'll merge, do some tests and post a new version. Thanks! -Andrea === From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp In current implementation, we don't have to disable irq at lock_page_cgroup() because the lock is never acquired in interrupt context. But we are going to call it in later patch in an interrupt context or with irq disabled, so this patch disables irq at lock_page_cgroup() and enables it at unlock_page_cgroup(). Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp --- include/linux/page_cgroup.h | 16 ++-- mm/memcontrol.c | 43 +-- 2 files changed, 39 insertions(+), 20 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..0d2f92c 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -83,16 +83,28 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc-page); } -static inline void lock_page_cgroup(struct page_cgroup *pc) +static inline void __lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, pc-flags); } -static inline void unlock_page_cgroup
[Devel] [PATCH -mmotm 5/5] memcg: dirty pages instrumentation
Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. [ NOTE: for now do not account WritebackTmp pages (FUSE) and NILFS2 bounce pages. This depends on charging also bounce pages per cgroup. ] As a bonus, make determine_dirtyable_memory() static again: this function isn't used anymore outside page writeback. Signed-off-by: Andrea Righi ari...@develer.com --- fs/nfs/write.c|4 + include/linux/writeback.h |2 - mm/filemap.c |1 + mm/page-writeback.c | 215 - mm/rmap.c |4 +- mm/truncate.c |1 + 6 files changed, 141 insertions(+), 86 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 53ff70e..3e8b9f8 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -440,6 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req) NFS_PAGE_TAG_COMMIT); nfsi-ncommit++; spin_unlock(inode-i_lock); + mem_cgroup_inc_page_stat(req-wb_page, MEMCG_NR_FILE_UNSTABLE_NFS); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_RECLAIMABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -451,6 +452,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = req-wb_page; if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS); dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_RECLAIMABLE); return 1; @@ -1277,6 +1279,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_dec_page_stat(req-wb_page, + MEMCG_NR_FILE_UNSTABLE_NFS); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_RECLAIMABLE); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index dd9512d..39e4cb2 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable; 
extern int block_dump; extern int laptop_mode; -extern unsigned long determine_dirtyable_memory(void); - extern int dirty_background_ratio_handler(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); diff --git a/mm/filemap.c b/mm/filemap.c index 62cbac0..bd833fe 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page) * having removed the page entirely. */ if (PageDirty(page) mapping_cap_account_dirty(mapping)) { + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping-backing_dev_info, BDI_RECLAIMABLE); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ab84693..fcac9b4 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions; static struct prop_descriptor vm_dirties; /* + * Work out the current dirty-memory clamping and background writeout + * thresholds. + * + * The main aim here is to lower them aggressively if there is a lot of mapped + * memory around. To avoid stressing page reclaim with lots of unreclaimable + * pages. It is better to clamp down on writers than to start swapping, and + * performing lots of scanning. + * + * We only allow 1/2 of the currently-unmapped memory to be dirtied. + * + * We don't permit the clamping level to fall below 5% - that is getting rather + * excessive. + * + * We make sure that the background writeout level is below the adjusted + * clamping level. + */ + +static unsigned long highmem_dirtyable_memory(unsigned long total) +{ +#ifdef CONFIG_HIGHMEM + int node; + unsigned long x = 0; + + for_each_node_state(node, N_HIGH_MEMORY) { + struct zone *z = + NODE_DATA(node)-node_zones[ZONE_HIGHMEM]; + + x += zone_page_state(z, NR_FREE_PAGES) + +zone_reclaimable_pages(z); + } + /* +* Make sure that the number of highmem pages is never larger +* than the number of the total dirtyable memory. 
This can only +* occur in very strange VM situations but we want to make sure +* that this does not occur. +*/ + return min(x, total); +#else + return 0; +#endif +} + +static unsigned long get_global_dirtyable_memory(void) +{ + unsigned long memory; + + memory = global_page_state
[Devel] [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp In current implementation, we don't have to disable irq at lock_page_cgroup() because the lock is never acquired in interrupt context. But we are going to call it in later patch in an interrupt context or with irq disabled, so this patch disables irq at lock_page_cgroup() and enables it at unlock_page_cgroup(). Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp --- include/linux/page_cgroup.h | 16 ++-- mm/memcontrol.c | 43 +-- 2 files changed, 39 insertions(+), 20 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..0d2f92c 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -83,16 +83,28 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc-page); } -static inline void lock_page_cgroup(struct page_cgroup *pc) +static inline void __lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, pc-flags); } -static inline void unlock_page_cgroup(struct page_cgroup *pc) +static inline void __unlock_page_cgroup(struct page_cgroup *pc) { bit_spin_unlock(PCG_LOCK, pc-flags); } +#define lock_page_cgroup(pc, flags)\ + do {\ + local_irq_save(flags); \ + __lock_page_cgroup(pc); \ + } while (0) + +#define unlock_page_cgroup(pc, flags) \ + do {\ + __unlock_page_cgroup(pc); \ + local_irq_restore(flags); \ + } while (0) + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7fab84e..a9fd736 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1352,12 +1352,13 @@ void mem_cgroup_update_file_mapped(struct page *page, int val) { struct mem_cgroup *mem; struct page_cgroup *pc; + unsigned long flags; pc = lookup_page_cgroup(page); if (unlikely(!pc)) return; - lock_page_cgroup(pc); + lock_page_cgroup(pc, flags); mem = pc-mem_cgroup; if (!mem) goto done; @@ -1371,7 +1372,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val) 
__this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val); done: - unlock_page_cgroup(pc); + unlock_page_cgroup(pc, flags); } /* @@ -1705,11 +1706,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) struct page_cgroup *pc; unsigned short id; swp_entry_t ent; + unsigned long flags; VM_BUG_ON(!PageLocked(page)); pc = lookup_page_cgroup(page); - lock_page_cgroup(pc); + lock_page_cgroup(pc, flags); if (PageCgroupUsed(pc)) { mem = pc-mem_cgroup; if (mem !css_tryget(mem-css)) @@ -1723,7 +1725,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) mem = NULL; rcu_read_unlock(); } - unlock_page_cgroup(pc); + unlock_page_cgroup(pc, flags); return mem; } @@ -1736,13 +1738,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, struct page_cgroup *pc, enum charge_type ctype) { + unsigned long flags; + /* try_charge() can return NULL to *memcg, taking care of it. */ if (!mem) return; - lock_page_cgroup(pc); + lock_page_cgroup(pc, flags); if (unlikely(PageCgroupUsed(pc))) { - unlock_page_cgroup(pc); + unlock_page_cgroup(pc, flags); mem_cgroup_cancel_charge(mem); return; } @@ -1772,7 +1776,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, mem_cgroup_charge_statistics(mem, pc, true); - unlock_page_cgroup(pc); + unlock_page_cgroup(pc, flags); /* * charge_statistics updated event counter. Then, check it. * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree. @@ -1842,12 +1846,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc, struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge) { int ret = -EINVAL; - lock_page_cgroup(pc); + unsigned long flags; + lock_page_cgroup(pc, flags); if (PageCgroupUsed(pc) pc-mem_cgroup == from) { __mem_cgroup_move_account(pc, from, to, uncharge); ret = 0; } - unlock_page_cgroup(pc); + unlock_page_cgroup(pc, flags); /* * check events */ @@ -1974,17 +1979,17 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, */
[Devel] [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
Control the maximum amount of dirty pages a cgroup can have at any given time. Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start to write-out (background or actively) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in VM layer and enforce a write-out before any cgroup will consume the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits. 
Changelog (v5 -> v6)
~~~~~~~~~~~~~~~~~~~~
* always disable/enable IRQs at lock/unlock_page_cgroup(): this allows us to drop the previous complicated locking scheme in favor of a simpler one, even if it obviously adds some overhead (see results below)
* drop FUSE and NILFS2 dirty pages accounting for now (this depends on charging bounce pages per cgroup)

Results
~~~~~~~
I ran some tests using a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @ 1.2GHz as the testcase, using different kernels:
 - mmotm vanilla
 - mmotm with cgroup-dirty-memory using the previous complex locking scheme (my previous patchset + the fixes reported by Kame-san and Daisuke-san)
 - mmotm with cgroup-dirty-memory using the simple locking scheme (lock_page_cgroup() with IRQs disabled)

The results:

before
 - mmotm vanilla, root cgroup:  11m51.983s
 - mmotm vanilla, child cgroup: 11m56.596s

after
 - mmotm, complex locking scheme, root cgroup:  11m53.037s
 - mmotm, complex locking scheme, child cgroup: 11m57.896s
 - mmotm, lock_page_cgroup+irq_disabled, root cgroup:  12m5.499s
 - mmotm, lock_page_cgroup+irq_disabled, child cgroup: 12m9.920s

With the complex locking solution, the overhead introduced by the cgroup dirty memory accounting is minimal (0.14%), compared with the overhead introduced by the lock_page_cgroup+irq_disabled solution (1.90%). The performance overhead is not huge with either solution, but it is clearly smaller with the complicated one... Maybe we can go ahead with the simplest implementation for now and start thinking about an alternative implementation of the page_cgroup locking and of the charge/uncharge of pages.

If someone is interested or wants to repeat the tests (maybe on a bigger machine) I can also post the other version of the patchset. Just let me know.
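The quoted percentages can be reproduced from the root-cgroup timings; the exact formula used in the post is an assumption here, taken as the relative slowdown against the vanilla root-cgroup time:

```c
/*
 * Sanity check of the overhead figures quoted above, assuming
 * overhead = (t_patched - t_vanilla) / t_vanilla on the root-cgroup
 * build times: vanilla 11m51.983s = 711.983s, complex locking
 * 11m53.037s = 713.037s, irq-disabled 12m5.499s = 725.499s.
 */
static double overhead_pct(double vanilla_s, double patched_s)
{
	return (patched_s - vanilla_s) / vanilla_s * 100.0;
}
```

With these inputs, overhead_pct(711.983, 725.499) comes out at about 1.90, and overhead_pct(711.983, 713.037) at about 0.15, consistent with the 1.90%/0.14% figures above.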
-Andrea

 Documentation/cgroups/memory.txt |  36 +++
 fs/nfs/write.c                   |   4 +
 include/linux/memcontrol.h       |  87 +++-
 include/linux/page_cgroup.h      |  42 -
 include/linux/writeback.h        |   2 -
 mm/filemap.c                     |   1 +
 mm/memcontrol.c                  | 475 +-
 mm/page-writeback.c              | 215 +++---
 mm/rmap.c                        |   4 +-
 mm/truncate.c                    |   1 +
 10 files changed, 722 insertions(+), 145 deletions(-)

___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs: - Direct write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 87 +- mm/memcontrol.c| 432 2 files changed, 480 insertions(+), 39 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 44301c6..0602ec9 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -19,12 +19,55 @@ #ifndef _LINUX_MEMCONTROL_H #define _LINUX_MEMCONTROL_H + +#include linux/writeback.h #include linux/cgroup.h + struct mem_cgroup; struct page_cgroup; struct page; struct mm_struct; +/* Cgroup memory statistics items exported to the kernel */ +enum mem_cgroup_read_page_stat_item { + MEMCG_NR_DIRTYABLE_PAGES, + MEMCG_NR_RECLAIM_PAGES, + MEMCG_NR_WRITEBACK, + MEMCG_NR_DIRTY_WRITEBACK_PAGES, +}; + +/* File cache pages accounting */ +enum mem_cgroup_write_page_stat_item { + MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */ + MEMCG_NR_FILE_DIRTY,/* # of dirty pages in page cache */ + MEMCG_NR_FILE_WRITEBACK,/* # of pages under writeback */ + MEMCG_NR_FILE_WRITEBACK_TEMP, /* # of pages under writeback using + temporary buffers */ + MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */ + + MEMCG_NR_FILE_NSTAT, +}; + +/* Dirty memory parameters */ +struct vm_dirty_param { + int dirty_ratio; + int dirty_background_ratio; + unsigned long dirty_bytes; + unsigned long dirty_background_bytes; +}; + +/* + * TODO: provide a validation check routine. And retry if validation + * fails. 
+ */ +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param) +{ + param-dirty_ratio = vm_dirty_ratio; + param-dirty_bytes = vm_dirty_bytes; + param-dirty_background_ratio = dirty_background_ratio; + param-dirty_background_bytes = dirty_background_bytes; +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All charge functions with gfp_mask should use GFP_KERNEL or @@ -117,6 +160,25 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, extern int do_swap_account; #endif +extern bool mem_cgroup_has_dirty_limit(void); +extern void get_vm_dirty_param(struct vm_dirty_param *param); +extern s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item); + +extern void mem_cgroup_update_page_stat(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge); + +static inline void mem_cgroup_inc_page_stat(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ + mem_cgroup_update_page_stat(page, idx, true); +} + +static inline void mem_cgroup_dec_page_stat(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ + mem_cgroup_update_page_stat(page, idx, false); +} + static inline bool mem_cgroup_disabled(void) { if (mem_cgroup_subsys.disabled) @@ -124,7 +186,6 @@ static inline bool mem_cgroup_disabled(void) return false; } -void mem_cgroup_update_file_mapped(struct page *page, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); @@ -294,8 +355,18 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_update_file_mapped(struct page *page, - int val) +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item) +{ + return -ENOSYS; +} + +static inline void mem_cgroup_inc_page_stat(struct page *page, + enum mem_cgroup_write_page_stat_item idx) +{ +} + +static inline void mem_cgroup_dec_page_stat(struct page *page, + enum mem_cgroup_write_page_stat_item idx) { } @@ -306,6 
+377,16 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline bool mem_cgroup_has_dirty_limit(void) +{ + return false; +} + +static inline void get_vm_dirty_param(struct vm_dirty_param *param) +{ + get_global_vm_dirty_param(param); +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a9fd736..ffcf37c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -80,14 +80,21 @@ enum mem_cgroup_stat_index { /* * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss. */ - MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ + MEM_CGROUP_STAT_CACHE
[Devel] [PATCH -mmotm 3/5] page_cgroup: introduce file cache flags
Introduce page_cgroup flags to keep track of file cache pages. Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/page_cgroup.h | 26 ++ 1 files changed, 26 insertions(+), 0 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 0d2f92c..4e09c8c 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -39,6 +39,11 @@ enum { PCG_CACHE, /* charged as cache */ PCG_USED, /* this object is in use. */ PCG_ACCT_LRU, /* page has been accounted for */ + PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/ + PCG_ACCT_DIRTY, /* page is dirty */ + PCG_ACCT_WRITEBACK, /* page is being written back to disk */ + PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */ + PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */ }; #define TESTPCGFLAG(uname, lname) \ @@ -73,6 +78,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU) TESTPCGFLAG(AcctLRU, ACCT_LRU) TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU) +/* File cache and dirty memory flags */ +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED) +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED) + +TESTPCGFLAG(Dirty, ACCT_DIRTY) +SETPCGFLAG(Dirty, ACCT_DIRTY) +CLEARPCGFLAG(Dirty, ACCT_DIRTY) + +TESTPCGFLAG(Writeback, ACCT_WRITEBACK) +SETPCGFLAG(Writeback, ACCT_WRITEBACK) +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK) + +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP) + +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc-page); -- 1.6.3.3 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org 
https://openvz.org/mailman/listinfo/devel
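The TESTPCGFLAG/SETPCGFLAG/CLEARPCGFLAG generators above each expand to a tiny accessor over one bit of `pc->flags`. A userspace sketch of the same pattern is below; the real kernel macros use atomic bitops (test_bit/set_bit/clear_bit), while plain bit operations are enough to illustrate how the accessor names are generated:

```c
#include <stdbool.h>

/* Userspace sketch of the page_cgroup flag-accessor generators.
 * Each macro expands to a small function that tests, sets, or clears
 * one bit in pc->flags (the kernel versions use atomic bitops). */
enum {
	PCG_ACCT_DIRTY,
	PCG_ACCT_WRITEBACK,
};

struct page_cgroup {
	unsigned long flags;
};

#define TESTPCGFLAG(uname, lname)				\
static bool PageCgroup##uname(struct page_cgroup *pc)		\
	{ return (pc->flags >> PCG_##lname) & 1UL; }

#define SETPCGFLAG(uname, lname)				\
static void SetPageCgroup##uname(struct page_cgroup *pc)	\
	{ pc->flags |= 1UL << PCG_##lname; }

#define CLEARPCGFLAG(uname, lname)				\
static void ClearPageCgroup##uname(struct page_cgroup *pc)	\
	{ pc->flags &= ~(1UL << PCG_##lname); }

/* One invocation per flag generates the full accessor family,
 * e.g. PageCgroupDirty()/SetPageCgroupDirty()/ClearPageCgroupDirty(). */
TESTPCGFLAG(Dirty, ACCT_DIRTY)
SETPCGFLAG(Dirty, ACCT_DIRTY)
CLEARPCGFLAG(Dirty, ACCT_DIRTY)
```

The design keeps every new per-page statistic down to one enum entry plus three macro invocations, which is why adding the dirty/writeback/unstable-NFS flags in this patch is so compact.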
[Devel] [PATCH -mmotm 2/5] memcg: dirty memory documentation
Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi ari...@develer.com --- Documentation/cgroups/memory.txt | 36 1 files changed, 36 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 49f86f3..38ca499 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -310,6 +310,11 @@ cache - # of bytes of page cache memory. rss- # of bytes of anonymous and swap cache memory. pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout- # of pages paged out (equivalent to # of uncharging events). +filedirty - # of pages that are waiting to get written back to the disk. +writeback - # of pages that are actively being written back to the disk. +writeback_tmp - # of pages used by FUSE for temporary writeback buffers. +nfs- # of NFS pages sent to the server, but not yet committed to + the actual storage. active_anon- # of bytes of anonymous and swap cache memory on active lru list. inactive_anon - # of bytes of anonymous memory and swap cache memory on @@ -345,6 +350,37 @@ Note: - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. +5.4 dirty memory + + Control the maximum amount of dirty pages a cgroup can have at any given time. + + Limiting dirty memory is like fixing the max amount of dirty (hard to + reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, + they will not be able to consume more than their designated share of dirty + pages and will be forced to perform write-out if they cross that limit. + + The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. + It is possible to configure a limit to trigger both a direct writeback or a + background writeback performed by per-bdi flusher threads. 
+ + Per-cgroup dirty limits can be set using the following files in the cgroupfs: + + - memory.dirty_ratio: contains, as a percentage of cgroup memory, the +amount of dirty memory at which a process which is generating disk writes +inside the cgroup will start itself writing out dirty data. + + - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in +bytes) at which a process generating disk writes will start itself writing +out dirty data. + + - memory.dirty_background_ratio: contains, as a percentage of the cgroup +memory, the amount of dirty memory at which background writeback kernel +threads will start writing out dirty data. + + - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in +bytes) at which background writeback kernel threads will start writing out +dirty data. + 6. Hierarchy support -- 1.6.3.3 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Thu, Mar 11, 2010 at 09:39:13AM +0900, KAMEZAWA Hiroyuki wrote: On Wed, 10 Mar 2010 00:00:31 +0100 Andrea Righi ari...@develer.com wrote: Control the maximum amount of dirty pages a cgroup can have at any given time. Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start to write-out (background or actively) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in VM layer and enforce a write-out before any cgroup will consume the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits. 
Changelog (v5 - v6) ~~ * always disable/enable IRQs at lock/unlock_page_cgroup(): this allows to drop the previous complicated locking scheme in favor of a simpler locking, even if this obviously adds some overhead (see results below) * drop FUSE and NILFS2 dirty pages accounting for now (this depends on charging bounce pages per cgroup) Results ~~~ I ran some tests using a kernel build (2.6.33 x86_64_defconfig) on a Intel Core 2 @ 1.2GHz as testcase using different kernels: - mmotm vanilla - mmotm with cgroup-dirty-memory using the previous complex locking scheme (my previous patchset + the fixes reported by Kame-san and Daisuke-san) - mmotm with cgroup-dirty-memory using the simple locking scheme (lock_page_cgroup() with IRQs disabled) Following the results: before - mmotm vanilla, root cgroup: 11m51.983s - mmotm vanilla, child cgroup: 11m56.596s after - mmotm, complex locking scheme, root cgroup: 11m53.037s - mmotm, complex locking scheme, child cgroup: 11m57.896s - mmotm, lock_page_cgroup+irq_disabled, root cgroup: 12m5.499s - mmotm, lock_page_cgroup+irq_disabled, child cgroup: 12m9.920s With the complex locking solution, the overhead introduced by the cgroup dirty memory accounting is minimal (0.14%), compared with the overhead introduced by the lock_page_cgroup+irq_disabled solution (1.90%). Hmmisn't this bigger than expected ? Consider that I'm not running the kernel build on tmpfs, but on a fs defined on /dev/sda. So the additional overhead should be normal compared to the mmotm vanilla, where there's only FILE_MAPPED accounting. -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Thu, Mar 11, 2010 at 06:42:44PM +0900, KAMEZAWA Hiroyuki wrote: On Thu, 11 Mar 2010 18:25:00 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: Then, it's not problem that check pc-mem_cgroup is root cgroup or not without spinlock. == void mem_cgroup_update_stat(struct page *page, int idx, bool charge) { pc = lookup_page_cgroup(page); if (unlikely(!pc) || mem_cgroup_is_root(pc-mem_cgroup)) return; ... } == This can be handle in the same logic of lock failure path. And we just do ignore accounting. There are will be no spinlocksto do more than this, I think we have to use struct page rather than struct page_cgroup. Hmm..like this ? The bad point of this patch is that this will corrupt FILE_MAPPED status in root cgroup. This kind of change is not very good. So, one way is to use this kind of function only for new parameters. Hmm. This kind of accouting shouldn't be a big problem for the dirty memory write-out. The benefit in terms of performance is much more important I think. The missing accounting of root cgroup statistics could be an issue if we move a lot of pages from root cgroup into a child cgroup (when migration of file cache pages will be supported and enabled). But at worst we'll continue to write-out pages using the global settings. Remember that memcg dirty memory is always the min(memcg_dirty_memory, total_dirty_memory), so even if we're leaking dirty memory accounting at worst we'll touch the global dirty limit and fallback to the current write-out implementation. I'll merge this patch, re-run some tests (kernel build and large file copy) and post a new version. Unfortunately at the moment I've not a big machine to use for these tests, but maybe I can get some help. Vivek has probably a nice hardware to test this code.. ;) Thanks! -Andrea == From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Now, file-mapped is maintaiend. But more generic update function will be needed for dirty page accounting. 
For accounting page status, we have to guarantee lock_page_cgroup() will never be called while tree_lock is held. To guarantee that, we use trylock when updating the status. By this, we do fuzzy accounting, but in almost all cases it's correct.

Changelog:
 - removed unnecessary preempt_disable()
 - added root cgroup check. By this, we do no lock/account in root cgroup.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 include/linux/memcontrol.h  |  7 ++-
 include/linux/page_cgroup.h | 15 +++
 mm/memcontrol.c             | 92 +---
 mm/rmap.c                   |  4 -
 4 files changed, 94 insertions(+), 24 deletions(-)

Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
===
--- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
+++ mmotm-2.6.34-Mar9/mm/memcontrol.c
@@ -1348,30 +1348,83 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
 {
 	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
-
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!pc))
-		return;
+	int val;

-	lock_page_cgroup(pc);
 	mem = pc->mem_cgroup;
-	if (!mem)
-		goto done;
+	if (!mem || !PageCgroupUsed(pc))
+		return;

-	if (!PageCgroupUsed(pc))
-		goto done;
+	if (charge)
+		val = 1;
+	else
+		val = -1;
+
+	switch (idx) {
+	case MEMCG_NR_FILE_MAPPED:
+		if (charge) {
+			if (!PageCgroupFileMapped(pc))
+				SetPageCgroupFileMapped(pc);
+			else
+				val = 0;
+		} else {
+			if (PageCgroupFileMapped(pc))
+				ClearPageCgroupFileMapped(pc);
+			else
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_MAPPED;
+		break;
+	default:
+		BUG();
+		break;
+	}
 	/*
 	 * Preemption is already disabled.
	 * We can use __this_cpu_xxx
 	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+	__this_cpu_add(mem->stat->count[idx], val);
+}

-done:
-	unlock_page_cgroup(pc);
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
+{
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc || mem_cgroup_is_root(pc->mem_cgroup))
+		return;
+
+	if (trylock_page_cgroup(pc)) {
+		__mem_cgroup_update_stat(pc, idx, charge);
+
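The "fuzzy accounting" idea above can be illustrated with a small userspace sketch: take the per-page lock with a trylock, and if it is contended simply skip the statistics update rather than wait, so that the update path can never sleep or spin under tree_lock. The stub names here are illustrative, not the kernel's:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace sketch of trylock-based "fuzzy" stat accounting: when the
 * per-page lock is contended, the +1/-1 update is simply dropped.
 * Losing an occasional update is accepted in exchange for never
 * blocking on the lock. All names are illustrative stubs. */
struct page_cgroup_stub {
	atomic_flag lock;
	long file_mapped;	/* stands in for mem->stat->count[idx] */
};

static bool trylock_page_cgroup_stub(struct page_cgroup_stub *pc)
{
	/* atomic_flag_test_and_set returns the previous value:
	 * false means we just acquired the lock */
	return !atomic_flag_test_and_set(&pc->lock);
}

static void unlock_page_cgroup_stub(struct page_cgroup_stub *pc)
{
	atomic_flag_clear(&pc->lock);
}

/* Returns true if the update was accounted, false if it was dropped. */
static bool update_stat_fuzzy(struct page_cgroup_stub *pc, long val)
{
	if (!trylock_page_cgroup_stub(pc))
		return false;	/* contended: skip the update */
	pc->file_mapped += val;
	unlock_page_cgroup_stub(pc);
	return true;
}
```

This mirrors the trade-off discussed in the thread: the counter may drift slightly under contention, but the common uncontended path stays cheap and deadlock-free.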
[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
On Wed, Mar 10, 2010 at 05:23:39PM -0500, Vivek Goyal wrote: On Wed, Mar 10, 2010 at 12:00:35AM +0100, Andrea Righi wrote: [..] - * Currently used to update mapped file statistics, but the routine can be - * generalized to update other statistics as well. + * mem_cgroup_update_page_stat() - update memcg file cache's accounting + * @page: the page involved in a file cache operation. + * @idx: the particular file cache statistic. + * @charge:true to increment, false to decrement the statistic specified + * by @idx. + * + * Update memory cgroup file cache's accounting. */ -void mem_cgroup_update_file_mapped(struct page *page, int val) +void mem_cgroup_update_page_stat(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) { - struct mem_cgroup *mem; struct page_cgroup *pc; unsigned long flags; + if (mem_cgroup_disabled()) + return; pc = lookup_page_cgroup(page); - if (unlikely(!pc)) + if (unlikely(!pc) || !PageCgroupUsed(pc)) return; - lock_page_cgroup(pc, flags); - mem = pc-mem_cgroup; - if (!mem) - goto done; - - if (!PageCgroupUsed(pc)) - goto done; - - /* -* Preemption is already disabled. We can use __this_cpu_xxx -*/ - __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val); - -done: + __mem_cgroup_update_page_stat(pc, idx, charge); unlock_page_cgroup(pc, flags); } +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_unlocked); CC mm/memcontrol.o mm/memcontrol.c:1600: error: ‘mem_cgroup_update_page_stat_unlocked’ undeclared here (not in a function) mm/memcontrol.c:1600: warning: type defaults to ‘int’ in declaration of ‘mem_cgroup_update_page_stat_unlocked’ make[1]: *** [mm/memcontrol.o] Error 1 make: *** [mm] Error 2 Thanks! Will fix in the next version. (mmh... why I didn't see this? probably because I'm building a static kernel...) 
-Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Fri, Mar 12, 2010 at 08:42:30AM +0900, KAMEZAWA Hiroyuki wrote: On Thu, 11 Mar 2010 10:03:07 -0500 Vivek Goyal vgo...@redhat.com wrote: On Thu, Mar 11, 2010 at 06:25:00PM +0900, KAMEZAWA Hiroyuki wrote: On Thu, 11 Mar 2010 10:14:25 +0100 Peter Zijlstra pet...@infradead.org wrote: On Thu, 2010-03-11 at 10:17 +0900, KAMEZAWA Hiroyuki wrote: On Thu, 11 Mar 2010 09:39:13 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: The performance overhead is not so huge in both solutions, but the impact on performance is even more reduced using a complicated solution... Maybe we can go ahead with the simplest implementation for now and start to think to an alternative implementation of the page_cgroup locking and charge/uncharge of pages. FWIW bit spinlocks suck massive. maybe. But in this 2 years, one of our biggest concerns was the performance. So, we do something complex in memcg. But complex-locking is , yes, complex. Hmm..I don't want to bet we can fix locking scheme without something complex. But overall patch set seems good (to me.) And dirty_ratio and dirty_background_ratio will give us much benefit (of performance) than we lose by small overheads. Well, the !cgroup or root case should really have no performance impact. IIUC, this series affects trgger for background-write-out. Not sure though, while this does the accounting the actual writeout is still !cgroup aware and can definately impact performance negatively by shrinking too much. Ah, okay, your point is !cgroup (ROOT cgroup case.) I don't think accounting these file cache status against root cgroup is necessary. I think what peter meant was that with memory cgroups created we will do writeouts much more aggressively. In balance_dirty_pages() if (bdi_nr_reclaimable + bdi_nr_writeback = bdi_thresh) break; Now with Andrea's patches, we are calculating bdi_thres per memory cgroup (almost) hmm. 
bdi_thresh ~= per_memory_cgroup_dirty * bdi_fraction

But the bdi_nr_reclaimable and bdi_nr_writeback stats are still global. Why doesn't the bdi_thresh of the ROOT cgroup depend on the global numbers? Very true. mem_cgroup_has_dirty_limit() must always return false in the case of the root cgroup, so that the global numbers are used. Thanks, -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
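The relation quoted above gives each backing device a share of the (per-cgroup or global) dirty limit proportional to its recent writeback activity. A sketch of the arithmetic is below; the fixed-point scale (parts per 1024) is an assumption of this sketch, not the kernel's exact representation:

```c
/* Sketch of bdi_thresh ~= dirty_limit * bdi_fraction: the dirty limit
 * is split among backing devices in proportion to each device's
 * fraction of recent writeback. bdi_fraction is expressed here in
 * units of fraction_scale (an assumed fixed-point scale). */
static unsigned long bdi_dirty_thresh(unsigned long dirty_limit_pages,
				      unsigned long bdi_fraction,
				      unsigned long fraction_scale)
{
	return dirty_limit_pages * bdi_fraction / fraction_scale;
}
```

For example, a device responsible for half of the recent writeback (fraction 512/1024) gets half of the dirty limit as its bdi_thresh — which is exactly why computing the limit per cgroup while keeping bdi_nr_reclaimable/bdi_nr_writeback global is the mismatch being discussed.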
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Fri, Mar 12, 2010 at 09:03:26AM +0900, KAMEZAWA Hiroyuki wrote: On Fri, 12 Mar 2010 00:59:22 +0100 Andrea Righi ari...@develer.com wrote: On Thu, Mar 11, 2010 at 01:07:53PM -0500, Vivek Goyal wrote: On Wed, Mar 10, 2010 at 12:00:31AM +0100, Andrea Righi wrote: mmmh.. strange, on my side I get something as expected: root cgroup $ dd if=/dev/zero of=test bs=1M count=500 500+0 records in 500+0 records out 524288000 bytes (524 MB) copied, 6.28377 s, 83.4 MB/s child cgroup with 100M memory.limit_in_bytes $ dd if=/dev/zero of=test bs=1M count=500 500+0 records in 500+0 records out 524288000 bytes (524 MB) copied, 11.8884 s, 44.1 MB/s Did you change the global /proc/sys/vm/dirty_* or memcg dirty parameters? what happens when bs=4k count=100 under 100M ? no changes ? OK, I confirm the results found by Vivek. Repeating the tests 10 times: root cgroup ~= 34.05 MB/s average child cgroup (100M) ~= 38.80 MB/s average So, actually the child cgroup with the 100M limit seems to perform better in terms of throughput. IIUC, with the large write and the 100M memory limit it happens that direct write-out is enforced more frequently and a single write chunk is enough to meet the bdi_thresh or the global background_thresh + dirty_thresh limits. This means the task is never (or less) throttled with io_schedule_timeout() in the balance_dirty_pages() loop. And the child cgroup gets better performance over the root cgroup. -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Fri, Mar 12, 2010 at 08:52:44AM +0900, KAMEZAWA Hiroyuki wrote: On Fri, 12 Mar 2010 00:27:09 +0100 Andrea Righi ari...@develer.com wrote: On Thu, Mar 11, 2010 at 10:03:07AM -0500, Vivek Goyal wrote:

I am still setting up the system to test whether we see any speedup in writeout of large files within a memory cgroup with small memory limits. I am assuming that we are expecting a speedup because we will start writeouts early and background writeouts probably are faster than direct reclaim?

mmh... speedup? I think with a large file write + reduced dirty limits you'll get a more uniform write-out (more frequent small writes), compared to fewer and less frequent large writes. The system will be more reactive, but I don't think you'll be able to see a speedup in the large write itself.

Ah, sorry. I misunderstood something. But it depends on the dirty_ratio param. If background_dirty_ratio = 5 and dirty_ratio = 100 under a 100M cgroup, I think background write-out will help.

Right, in this case background flusher threads will help a lot in writing out the cgroup dirty memory and it'll get better performance. -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
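The arithmetic behind Kame-san's example is worth spelling out: with memory.dirty_background_ratio = 5 and memory.dirty_ratio = 100 in a 100MB cgroup, background flushers kick in at 5MB of dirty memory, long before the 100MB hard limit would throttle the writer directly.

```c
/* Ratio-to-bytes conversion for the example above: a dirty ratio is a
 * percentage of the cgroup's memory. background_dirty_ratio = 5 in a
 * 100MB cgroup -> background write-out starts at 5MB, while
 * dirty_ratio = 100 -> direct throttling only at the full 100MB. */
static unsigned long ratio_to_bytes(unsigned long memory_bytes, int ratio)
{
	return memory_bytes / 100 * ratio;
}
```

The large gap between the two thresholds is what makes the background flusher threads do most of the work in this configuration.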
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Fri, Mar 12, 2010 at 10:14:11AM +0900, Daisuke Nishimura wrote: On Thu, 11 Mar 2010 18:42:44 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Thu, 11 Mar 2010 18:25:00 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: Then, it's not problem that check pc-mem_cgroup is root cgroup or not without spinlock. == void mem_cgroup_update_stat(struct page *page, int idx, bool charge) { pc = lookup_page_cgroup(page); if (unlikely(!pc) || mem_cgroup_is_root(pc-mem_cgroup)) return; ... } == This can be handle in the same logic of lock failure path. And we just do ignore accounting. There are will be no spinlocksto do more than this, I think we have to use struct page rather than struct page_cgroup. Hmm..like this ? The bad point of this patch is that this will corrupt FILE_MAPPED status in root cgroup. This kind of change is not very good. So, one way is to use this kind of function only for new parameters. Hmm. IMHO, if we disable accounting file stats in root cgroup, it would be better not to show them in memory.stat to avoid confusing users. Or just show the same values that we show in /proc/meminfo.. (I mean, not actually the same, but coherent with them). But, hmm, I think accounting them in root cgroup isn't so meaningless. Isn't making mem_cgroup_has_dirty_limit() return false in case of root cgroup enough? Agreed. Returning false from mem_cgroup_has_dirty_limit() is enough to always use global stats for the writeback, so this shouldn't introduce any overhead for the root cgroup (at least for this part). -Andrea ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH -mmotm 3/5] page_cgroup: introduce file cache flags
Introduce page_cgroup flags to keep track of file cache pages. Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/page_cgroup.h | 22 +- 1 files changed, 21 insertions(+), 1 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index bf9a913..65247e4 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -40,7 +40,11 @@ enum { PCG_USED, /* this object is in use. */ PCG_ACCT_LRU, /* page has been accounted for */ /* for cache-status accounting */ - PCG_FILE_MAPPED, + PCG_FILE_MAPPED, /* page is accounted as file rss*/ + PCG_FILE_DIRTY, /* page is dirty */ + PCG_FILE_WRITEBACK, /* page is being written back to disk */ + PCG_FILE_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */ + PCG_FILE_UNSTABLE_NFS, /* NFS page not yet committed to the server */ }; #define TESTPCGFLAG(uname, lname) \ @@ -83,6 +87,22 @@ TESTPCGFLAG(FileMapped, FILE_MAPPED) SETPCGFLAG(FileMapped, FILE_MAPPED) CLEARPCGFLAG(FileMapped, FILE_MAPPED) +TESTPCGFLAG(FileDirty, FILE_DIRTY) +SETPCGFLAG(FileDirty, FILE_DIRTY) +CLEARPCGFLAG(FileDirty, FILE_DIRTY) + +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK) +SETPCGFLAG(FileWriteback, FILE_WRITEBACK) +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK) + +TESTPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP) +SETPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP) +CLEARPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP) + +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) + static inline int page_cgroup_nid(struct page_cgroup *pc) { return page_to_nid(pc-page); -- 1.6.3.3 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v7)
Control the maximum amount of dirty pages a cgroup can have at any given time. Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start to write-out (background or actively) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in VM layer and enforce a write-out before any cgroup will consume the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits. Changelog (v6 - v7) ~~ * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup() is never called under tree_lock (no strict accounting, but better overall performance) * do not account file cache statistics for the root cgroup (zero overhead for the root cgroup) * fix: evaluate cgroup free pages as at the minimum free pages of all its parents Results ~~~ The testcase is a kernel build (2.6.33 x86_64_defconfig) on a Intel Core 2 @ 1.2GHz: before - root cgroup:11m51.983s - child cgroup:11m56.596s after - root cgroup: 11m51.742s - child cgroup:12m5.016s In the previous version of this patchset, using the complex locking scheme with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled. With this version there's no overhead for the root cgroup (the small difference is in error range). 
I expected to see less overhead for the child cgroup; I'll do more testing and try to figure out better what's happening. In the meantime, it would be great if someone could perform some tests on a larger system... unfortunately at the moment I don't have a big system available for this kind of test...

Thanks,
-Andrea

 Documentation/cgroups/memory.txt |  36 +++
 fs/nfs/write.c                   |   4 +
 include/linux/memcontrol.h       |  87 ++-
 include/linux/page_cgroup.h      |  35 +++
 include/linux/writeback.h        |   2 -
 mm/filemap.c                     |   1 +
 mm/memcontrol.c                  | 542 +++---
 mm/page-writeback.c              | 215 ++--
 mm/rmap.c                        |   4 +-
 mm/truncate.c                    |   1 +
 10 files changed, 806 insertions(+), 121 deletions(-)

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs:
 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes

Signed-off-by: Andrea Righi ari...@develer.com
---
 include/linux/memcontrol.h |  92 -
 mm/memcontrol.c            | 484 +---
 2 files changed, 540 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 88d3f9e..0602ec9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,55 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* File cache pages accounting */
+enum mem_cgroup_write_page_stat_item {
+	MEMCG_NR_FILE_MAPPED,		/* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY,		/* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK,	/* # of pages under writeback */
+	MEMCG_NR_FILE_WRITEBACK_TEMP,	/* # of pages under writeback using
+					   temporary buffers */
+	MEMCG_NR_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
+
+	MEMCG_NR_FILE_NSTAT,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = vm_dirty_bytes;
+	param->dirty_background_ratio = dirty_background_ratio;
+	param->dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All charge functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +160,25 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_vm_dirty_param(struct vm_dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
+
+extern void mem_cgroup_update_page_stat(struct page *page,
+		enum mem_cgroup_write_page_stat_item idx, bool charge);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+		enum mem_cgroup_write_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, true);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+		enum mem_cgroup_write_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, false);
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -124,12 +186,6 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
-enum mem_cgroup_page_stat_item {
-	MEMCG_NR_FILE_MAPPED,
-	MEMCG_NR_FILE_NSTAT,
-};
-
-void mem_cgroup_update_stat(struct page *page, int idx, bool charge);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		gfp_t gfp_mask, int nid, int zid);
@@ -299,8 +355,18 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-		int val)
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
+{
+	return -ENOSYS;
+}
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+		enum mem_cgroup_write_page_stat_item idx)
+{
+}
+
+static inline void
mem_cgroup_dec_page_stat(struct page *page,
		enum mem_cgroup_write_page_stat_item idx)
{
}
@@ -311,6 +377,16 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+	get_global_vm_dirty_param(param);
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b7c23ea..91770d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,14 +80,21 @@ enum mem_cgroup_stat_index {
 /*
  * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss
[Devel] [PATCH -mmotm 5/5] memcg: dirty pages instrumentation
Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions.

[ NOTE: for now do not account WritebackTmp pages (FUSE) and NILFS2 bounce pages. This depends on charging also bounce pages per cgroup. ]

As a bonus, make determine_dirtyable_memory() static again: this function isn't used anymore outside page writeback.

Signed-off-by: Andrea Righi ari...@develer.com
---
 fs/nfs/write.c            |   4 +
 include/linux/writeback.h |   2 -
 mm/filemap.c              |   1 +
 mm/page-writeback.c       | 215 -
 mm/rmap.c                 |   4 +-
 mm/truncate.c             |   1 +
 6 files changed, 141 insertions(+), 86 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..3e8b9f8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +452,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1277,6 +1279,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat(req->wb_page,
+				MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dd9512d..39e4cb2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
diff --git a/mm/filemap.c b/mm/filemap.c
index 62cbac0..bd833fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab84693..fcac9b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around. To avoid stressing page reclaim with lots of unreclaimable
+ * pages. It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+	int node;
+	unsigned long x = 0;
+
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		struct zone *z =
+			NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+		x += zone_page_state(z, NR_FREE_PAGES) +
+		     zone_reclaimable_pages(z);
+	}
+	/*
+	 * Make sure that the number of highmem pages is never larger
+	 * than the number of the total dirtyable memory.
This can only +* occur in very strange VM situations but we want to make sure +* that this does not occur. +*/ + return min(x, total); +#else + return 0; +#endif +} + +static unsigned long get_global_dirtyable_memory(void) +{ + unsigned long memory; + + memory = global_page_state
[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock
On Mon, Mar 15, 2010 at 09:06:38AM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 15 Mar 2010 00:26:38 +0100 Andrea Righi ari...@develer.com wrote: From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, file-mapped is maintained. But a more generic update function will be needed for dirty page accounting. For accounting page status, we have to guarantee lock_page_cgroup() will never be called under tree_lock held. To guarantee that, we use trylock at updating status. By this, we do fuzzy accounting, but in almost all cases, it's correct.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Bad patch title... "use trylock for safe accounting in some contexts"?

OK, sounds better. I just copy-pasted the email subject, but the title was probably related to the old lock_page_cgroup()+irq_disable patch.

Thanks,
-Andrea
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v7)
On Mon, Mar 15, 2010 at 11:36:12AM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 15 Mar 2010 00:26:37 +0100 Andrea Righi ari...@develer.com wrote:

Control the maximum amount of dirty pages a cgroup can have at any given time. A per cgroup dirty limit fixes the maximum amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit.

The overall design is the following:
 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs
 - start to write-out (background or actively) when the cgroup limits are exceeded

This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in the VM layer and enforce a write-out before any cgroup consumes the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.

Changelog (v6 -> v7)
~~~~~~~~~~~~~~~~~~~~
* introduce trylock_page_cgroup() to guarantee that lock_page_cgroup() is never called under tree_lock (no strict accounting, but better overall performance)
* do not account file cache statistics for the root cgroup (zero overhead for the root cgroup)
* fix: evaluate cgroup free pages as the minimum free pages of all its parents

Results
~~~~~~~
The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @ 1.2GHz:

before
 - root cgroup:  11m51.983s
 - child cgroup: 11m56.596s

after
 - root cgroup:  11m51.742s
 - child cgroup: 12m5.016s

In the previous version of this patchset, using the complex locking scheme with the _locked and _unlocked versions of mem_cgroup_update_page_stat(), the child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled.
With this version there's no overhead for the root cgroup (the small difference is within the error range). I expected to see less overhead for the child cgroup; I'll do more testing and try to figure out better what's happening.

Okay, thanks. This seems a good result. Optimization for children can be done under the -mm tree, I think. (If no nack, this seems ready for testing in -mm.)

OK, I'll wait a bit to see if someone has other fixes or issues and post a new version soon including these small changes.

Thanks,
-Andrea
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Mon, Mar 15, 2010 at 10:38:41AM -0400, Vivek Goyal wrote:

bdi_thres ~= per_memory_cgroup_dirty * bdi_fraction

But bdi_nr_reclaimable and bdi_nr_writeback stats are still global. Why doesn't bdi_thresh of the ROOT cgroup depend on the global number?

I think in the current implementation the ROOT cgroup bdi_thres is always the same as the global number. It is only for the other child groups where it is different from the global number, because of the reduced dirtyable_memory() limit. And we don't seem to be allowing any control on the root group. But I am wondering what happens in the following case. IIUC, with use_hierarchy=0, if I create two test groups test1 and test2, then the hierarchy looks as follows:

    root
   /    \
 test1  test2

Now the root group's DIRTYABLE is still system wide, but test1's and test2's dirtyable will be reduced based on the RES_LIMIT in those groups. Conceptually, a per cgroup dirty ratio is like fixing the page cache share of each group. So effectively we are saying that these limits apply only to child groups of root but not to root as such?

Correct. In this implementation, root cgroup means outside all cgroups. I think this can be an acceptable behaviour, since in general we don't set any limit on the root cgroup.

So for the same number of dirty pages system wide on this bdi, we will be triggering writeouts much more aggressively if somebody has created a few memory cgroups and tasks are running in those cgroups. I guess it might cause performance regressions in the case of small file writeouts, because previously one could have written the file to cache and be done with it, but with this patch set there are higher chances that you will be throttled to write the pages back to disk.

I guess we need two pieces to resolve this:
 - BDI stats per cgroup.
 - Writeback of inodes from the same cgroup.

I think BDI stats per cgroup will increase the complexity.

Thank you for the clarification.
IIUC, the dirty_limit implementation should assume there is an I/O resource controller; maybe usual users will use an I/O resource controller and memcg at the same time. Then, my question is what happens when used with an I/O resource controller?

Currently the IO resource controller keeps all the async IO queues in the root group, so we can't measure exactly. But my guess is that until and unless we at least implement writeback of inodes from the same cgroup, we will not see an increased flow of writes from one cgroup over another cgroup.

Agreed. And I plan to look at the writeback-inodes-per-cgroup feature soon. I'm sorry but I have some deadlines this week, so probably I'll start working on this next weekend.

-Andrea
[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
On Tue, Mar 16, 2010 at 10:11:50AM -0400, Vivek Goyal wrote: On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote: [..]

+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item: memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value, or a negative value in case of error.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *mem;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	if (mem && !mem_cgroup_is_root(mem)) {
+		/*
+		 * If we're looking for dirtyable pages we need to evaluate
+		 * free pages depending on the limit and usage of the parents
+		 * first of all.
+		 */
+		if (item == MEMCG_NR_DIRTYABLE_PAGES)
+			stat.value = memcg_get_hierarchical_free_pages(mem);
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * under the hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(mem, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -EINVAL;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+

hmm, mem_cgroup_page_stat() can return a negative value, but you place a BUG_ON() in [5/5] to check whether it returns a negative value. What happens if current is moved to root between mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat()? How about making mem_cgroup_has_dirty_limit() return the target mem_cgroup, and passing that mem_cgroup to mem_cgroup_page_stat()?

Hmm, if mem_cgroup_has_dirty_limit() returns a pointer to the memcg, then one shall have to use rcu_read_lock() and that will look ugly. Why don't we simply look at the return value and, if it is negative, fall back to using the global stats and get rid of the BUG_ON()?

I vote for this one. IMHO the caller of mem_cgroup_page_stat() should fall back to the equivalent global stats. This allows us to keep things separated and put only the memcg stuff in mm/memcontrol.c.
Or, modify mem_cgroup_page_stat() to return global stats if it can't determine the per cgroup stat for some reason (mem=NULL or root cgroup etc.).

Vivek

Thanks,
-Andrea
[Devel] Re: [PATCH -mmotm 2/5] memcg: dirty memory documentation
On Tue, Mar 16, 2010 at 04:41:21PM +0900, Daisuke Nishimura wrote: On Mon, 15 Mar 2010 00:26:39 +0100, Andrea Righi ari...@develer.com wrote:

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi ari...@develer.com
---
 Documentation/cgroups/memory.txt | 36
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 49f86f3..38ca499 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -310,6 +310,11 @@
 cache         - # of bytes of page cache memory.
 rss           - # of bytes of anonymous and swap cache memory.
 pgpgin        - # of pages paged in (equivalent to # of charging events).
 pgpgout       - # of pages paged out (equivalent to # of uncharging events).
+filedirty     - # of pages that are waiting to get written back to the disk.
+writeback     - # of pages that are actively being written back to the disk.
+writeback_tmp - # of pages used by FUSE for temporary writeback buffers.
+nfs           - # of NFS pages sent to the server, but not yet committed to
+                the actual storage.
 active_anon   - # of bytes of anonymous and swap cache memory on active lru list.
 inactive_anon - # of bytes of anonymous memory and swap cache memory on
@@ -345,6 +350,37 @@
 Note:
 - a cgroup which uses hierarchy and it has child cgroup.
 - a cgroup which uses hierarchy and not the root of hierarchy.
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+    amount of dirty memory at which a process which is generating disk writes
+    inside the cgroup will start itself writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+    bytes) at which a process generating disk writes will start itself writing
+    out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+    memory, the amount of dirty memory at which background writeback kernel
+    threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+    bytes) at which background writeback kernel threads will start writing out
+    dirty data.

It would be better to note what those files mean for the root cgroup. We cannot write any value to them; IOW, we cannot control the dirty limit of the root cgroup.

OK.

And they show the same value as the global one (strictly speaking, it's not true, because the global values can change. We need a hook in mem_cgroup_dirty_read()?).

OK, we can just return the system-wide value if mem_cgroup_is_root() in mem_cgroup_dirty_read(). I will change this in the next version.

Thanks,
-Andrea
[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote: [snip]

@@ -3190,10 +3512,14 @@ struct {
 } memcg_stat_strings[NR_MCS_STAT] = {
 	{cache, total_cache},
 	{rss, total_rss},
-	{mapped_file, total_mapped_file},
 	{pgpgin, total_pgpgin},
 	{pgpgout, total_pgpgout},
 	{swap, total_swap},
+	{mapped_file, total_mapped_file},
+	{filedirty, dirty_pages},
+	{writeback, writeback_pages},
+	{writeback_tmp, writeback_temp_pages},
+	{nfs, nfs_unstable},
 	{inactive_anon, total_inactive_anon},
 	{active_anon, total_active_anon},
 	{inactive_file, total_inactive_file},

Why not use total_xxx for the total_name?

Agreed. It would definitely be more clear. Balbir, KAME-san, what do you think?

@@ -3212,8 +3538,6 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 	s->stat[MCS_CACHE] += val * PAGE_SIZE;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
 	s->stat[MCS_RSS] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
-	s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
 	s->stat[MCS_PGPGIN] += val;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGOUT_COUNT);
@@ -3222,6 +3546,16 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 	s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
+	s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
+	s->stat[MCS_WRITEBACK_TEMP] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val;

I don't have a strong objection, but I prefer showing them in bytes.
And can you add to mem_cgroup_stat_show() something like:

 	for (i = 0; i < NR_MCS_STAT; i++) {
 		if (i == MCS_SWAP && !do_swap_account)
 			continue;
+		if (i >= MCS_FILE_STAT_START && i <= MCS_FILE_STAT_END &&
+		    mem_cgroup_is_root(mem_cont))
+			continue;
 		cb->fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
 	}

I like this. And I also prefer to show these values in bytes.

not to show file stats in the root cgroup? It's a meaningless value anyway. Of course, you'd better mention it in [2/5] too.

OK.

Thanks,
-Andrea
[Devel] Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)
On Mon, Oct 12, 2009 at 05:11:20PM -0400, Vivek Goyal wrote: [snip]

I modified my report scripts to also output aggregate iops numbers and remove max-bandwidth and min-bandwidth numbers. So for the same tests and same results I am now reporting iops numbers also. (I have not re-run the tests.)

IO scheduler controller + CFQ
-----------------------------
[Multiple Random Reader]                [Sequential Reader]
nr  Agg-bandw  Max-latency  Agg-iops    nr  Agg-bandw  Max-latency  Agg-iops
1   223KB/s    132K usec    55          1   5551KB/s   129K usec    1387
2   190KB/s    154K usec    46          1   5718KB/s   122K usec    1429
4   445KB/s    208K usec    111         1   5909KB/s   116K usec    1477
8   158KB/s    2820 msec    36          1   5445KB/s   168K usec    1361
16  145KB/s    5963 msec    28          1   5418KB/s   164K usec    1354
32  139KB/s    12762 msec   23          1   5398KB/s   175K usec    1349

io-throttle + CFQ
-----------------
BW limit group1=10 MB/s  BW limit group2=10 MB/s
[Multiple Random Reader]                [Sequential Reader]
nr  Agg-bandw  Max-latency  Agg-iops    nr  Agg-bandw  Max-latency  Agg-iops
1   36KB/s     218K usec    9           1   8006KB/s   20529 usec   2001
2   360KB/s    228K usec    89          1   7475KB/s   33665 usec   1868
4   699KB/s    262K usec    173         1   6800KB/s   46224 usec   1700
8   573KB/s    1800K usec   139         1   2835KB/s   885K usec    708
16  294KB/s    3590 msec    68          1   437KB/s    1855K usec   109
32  980KB/s    2861K usec   230         1   1145KB/s   1952K usec   286

Note that in the case of random reader groups, iops are really small. A few thoughts.

- What should be the iops limit I should choose for the group? Let's say if I choose 80, then things should be better for the sequential reader group, but just think of what will happen to the random reader group. Especially, if the nature of the workload in group1 changes to sequential, group1 will simply be killed.

So yes, one can limit a group both by BW as well as iops-max, but this requires you to know in advance exactly what workload is running in the group. The moment the workload changes, these settings might have very bad effects.

So my biggest concern with max-bandwidth and max-iops limits is how one will configure the system for a dynamic environment.
Think of two virtual machines being used by two customers. At one point they might be doing some copy operation and running a sequential workload, and later some webserver or database query might be doing some random read operations.

The main problem IMHO is how to accurately evaluate the cost of an IO operation. On rotational media, for example, the cost of reading two distant blocks is not the same as the cost of reading two contiguous blocks (while on a flash/SSD drive the cost is probably the same). io-throttle tries to quantify the cost in absolute terms (iops and BW), but this is not enough to cover all the possible cases. For example, you could hit a physical disk limit because your workload is too seeky, even if the iops and BW numbers are low.

- Notice the interesting case of 16 random readers. iops for the random reader group is really low, but still the throughput and iops of the sequential reader group are very bad. I suspect that at the CFQ level some kind of mixup has taken place where we have not enabled idling for the sequential reader and the disk became seek bound, hence both groups are losing. (Just a guess)

Yes, my guess is the same.

I've re-run some of your tests using an SSD (a MOBI MTRON MSD-PATA3018-ZIF1), but changing a few parameters: I used a larger block size for the sequential workload (there's no need to reduce the block size of the single reads if we expect to read a lot of contiguous blocks). And for all the io-throttle tests I switched to the noop scheduler (CFQ must be changed to be cgroup-aware before using it together with io-throttle, otherwise the result is that one simply breaks the logic of the other).
=== io-throttle settings ===
cgroup #1: max-bw 10MB/s, max-iops 2150 iop/s
cgroup #2: max-bw 10MB/s, max-iops 2150 iop/s

During the tests I used a larger block size for the sequential readers, with respect to the random readers:
sequential-read: block size = 1MB
random-read:     block size = 4KB

sequential-readers vs sequential-reader
=======================================
[ cgroup #1 workload ]
fio_args=--rw=read --bs=1M --size=512M --runtime=30 --numjobs=N --direct=1
[ cgroup #2 workload ]
fio_args=--rw=read --bs=1M --size=512M --runtime=30 --numjobs=1 --direct=1

__2.6.32-rc5__
[ cgroup #1 ]        [ cgroup #2 ]
tasks  aggr-bw       tasks  aggr-bw
1      36210KB/s     1      36992KB/s
2      47558KB/s     1      24479KB/s
4      57587KB/s     1      14809KB/s
8      64667KB/s     1      8393KB/s
[Devel] Re: Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)
=== This time run multiple buffered writers in group1, run a single buffered writer in the other group, and see if we can provide fairness and isolation.

Vanilla CFQ
-----------
[Multiple Buffered Writer]                          [Buffered Writer]
nr  Max-bandw  Min-bandw  Agg-bandw  Max-latency    nr  Agg-bandw  Max-latency
1   68997KB/s  68997KB/s  67380KB/s  645K usec      1   67122KB/s  567K usec
2   47509KB/s  46218KB/s  91510KB/s  865K usec      1   45118KB/s  865K usec
4   28002KB/s  26906KB/s  105MB/s    1649K usec     1   26879KB/s  1643K usec
8   15985KB/s  14849KB/s  117MB/s    943K usec      1   15653KB/s  766K usec
16  11567KB/s  6881KB/s   128MB/s    1174K usec     1   7333KB/s   947K usec
32  5877KB/s   3649KB/s   130MB/s    1205K usec     1   5142KB/s   988K usec

IO scheduler controller + CFQ
-----------------------------
[Multiple Buffered Writer]                          [Buffered Writer]
nr  Max-bandw  Min-bandw  Agg-bandw  Max-latency    nr  Agg-bandw  Max-latency
1   68580KB/s  68580KB/s  66972KB/s  2901K usec     1   67194KB/s  2901K usec
2   47419KB/s  45700KB/s  90936KB/s  3149K usec     1   44628KB/s  2377K usec
4   27825KB/s  27274KB/s  105MB/s    1177K usec     1   27584KB/s  1177K usec
8   15382KB/s  14288KB/s  114MB/s    1539K usec     1   14794KB/s  783K usec
16  9161KB/s   7592KB/s   124MB/s    3177K usec     1   7713KB/s   886K usec
32  4928KB/s   3961KB/s   126MB/s    1152K usec     1   6465KB/s   4510K usec

Notes:
- It does not work. Buffered writers in the second group are being overwhelmed by the writers in group1.
- This is currently a limitation of the IO scheduler based controller, as the page cache at the higher layer evens out the traffic and does not throw more traffic from the higher weight group.
- This is something that needs more work at the higher layers, like dirty limits per cgroup in the memory controller and a method to write out buffered pages belonging to a particular memory cgroup. This is still being brainstormed.
io-throttle + CFQ
-----------------
BW limit group1=30 MB/s  BW limit group2=30 MB/s
[Multiple Buffered Writer]                          [Buffered Writer]
nr  Max-bandw  Min-bandw  Agg-bandw  Max-latency    nr  Agg-bandw  Max-latency
1   33863KB/s  33863KB/s  33070KB/s  3046K usec     1   25165KB/s  13248K usec
2   13457KB/s  12906KB/s  25745KB/s  9286K usec     1   29958KB/s  3736K usec
4   7414KB/s   6543KB/s   27145KB/s  10557K usec    1   30968KB/s  8356K usec
8   3562KB/s   2640KB/s   24430KB/s  12012K usec    1   30801KB/s  7037K usec
16  3962KB/s   881KB/s    26632KB/s  12650K usec    1   31150KB/s  7173K usec
32  3275KB/s   406KB/s    27295KB/s  14609K usec    1   26328KB/s  8069K usec

Notes:
- This seems to work well here. io-throttle is throttling the writers before they write too much data into the page cache. One side effect seems to be that now a process will not be allowed to write at memory speed into the page cache and will be limited to the disk IO speed limits set for the cgroup. Andrea is thinking of removing the throttling in balance_dirty_pages() to allow writing at disk speed until we hit the dirty_limits. But removing it leads to a different issue, where too many dirty pages from a single group can be present in the page cache, and if that cgroup is a slow moving one, then its pages are flushed to disk at a slower speed, delaying other higher rate cgroups. (All discussed in private mails with Andrea.)

I confirm this. :) But IMHO before removing the throttling in balance_dirty_pages() we really need the per-cgroup dirty limit / dirty page cache quota.

ioprio class and iopriority within cgroups issues with IO-throttle
==================================================================
Currently the throttling logic is designed in such a way that it makes the throttling uniform for every process in the group. So we will lose the differentiation between different classes of processes, or the differentiation between different priorities of processes within the group. I have run tests on this in the past and reported them here before.
https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

Thanks
Vivek

--
Andrea Righi - Develer s.r.l
http://www.develer.com

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: More performance numbers (Was: Re: IO scheduler based IO controller V10)
On Thu, Oct 08, 2009 at 12:42:51AM -0400, Vivek Goyal wrote:

Apart from the IO scheduler controller numbers, I also got a chance to run the same tests with the dm-ioband controller. I am posting these too. I am also planning to run similar numbers on Andrea's max bw controller; I should be able to post those numbers in 2-3 days.

For those who are interested (especially to help Vivek test all this stuff), here is the all-in-one patchset of the io-throttle controller, rebased to 2.6.31:
http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch

And this one is v18 rebased to 2.6.32-rc3:
http://www.develer.com/~arighi/linux/patches/io-throttle/cgroup-io-throttle-v18.patch

Thanks,
-Andrea
[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:

I think that only putting the hook in try_to_unmap() doesn't work correctly, because IOs will be charged to reclaiming processes or kswapd. These IOs should be charged to the processes which cause the memory pressure. Consider the following case:

(1) There are two processes, Proc-A and Proc-B.
(2) Proc-A maps a large file into many pages by mmap() and writes much data to the file.
(3) After (2), Proc-B tries to get a page, but there are no available pages because Proc-A has used them.
(4) The kernel starts to reclaim pages and calls try_to_unmap() to unmap a page which is owned by Proc-A; blkio_cgroup_set_owner() then sets Proc-B's ID on the page because the task's context is Proc-B.
(5) After (4), the kernel writes the page out to disk. This IO is charged to Proc-B.

In the above case, I think that the IO should be charged to Proc-A, because the IO is caused by Proc-A's memory pressure. I think we should consider the case without memory and swap isolation.

mmmh.. even if they're strictly related, I think we're mixing two different problems this way: memory pressure control and IO control. It seems you're proposing something like badness() for OOM conditions, to charge swap IO depending on how bad a cgroup is in terms of memory consumption. I don't think this is the right way to proceed, also because we already have the memory and swap controls.

-Andrea
[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
Vivek Goyal wrote:
...
 }

@@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
 /*
  * Find the io group bio belongs to.
  * If create is set, io group is created if it is not already present.
+ * If curr is set, io group information is searched for the current
+ * task and not with the help of bio.
+ *
+ * FIXME: Can we assume that if bio is NULL then lookup group for current
+ * task and not create extra function parameter ?
  *
- * Note: There is a narrow window of race where a group is being freed
- * by cgroup deletion path and some rq has slipped through in this group.
- * Fix it.
  */
-struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
-					int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
+					int create, int curr)

Hi Vivek,

IIUC we can get rid of curr and just determine iog from bio. If bio is not NULL, get iog from bio; otherwise get it from the current task.

Consider also that get_cgroup_from_bio() is much slower than task_cgroup(): it needs to lock/unlock_page_cgroup() in get_blkio_cgroup_id(), while task_cgroup() is RCU protected.

True. BTW another optimization could be to use the blkio-cgroup functionality only for dirty pages and cut out some blkio_set_owner() calls. For all the other cases IO always occurs in the same context as the current task, and you can use task_cgroup().

Yes, maybe in some cases we can avoid setting the page owner. I will get to it once I have the functionality going well. In the meantime, if you have a patch for it, that will be great.
However, this is true only for page cache pages; for IO generated by anonymous pages (swap) you still need the page tracking functionality, both for reads and writes.

Right now I am assuming that all sync IO will belong to the task submitting the bio, hence I use task_cgroup() for that. Only for async IO am I trying to use the page tracking functionality to determine the owner. Look at elv_bio_sync(bio). You seem to be saying that there are cases where even for sync IO we can't use the submitting task's context and need to rely on the page tracking functionality? In the case of getting a page (read) from swap, will it not happen in the context of the process who takes the page fault and initiates the swap read?

No, for example in read_swap_cache_async():

@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	 */
 	__set_page_locked(new_page);
 	SetPageSwapBacked(new_page);
+	blkio_cgroup_set_owner(new_page, current->mm);
 	err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 	if (likely(!err)) {
 		/*

This is a read, but the current task is not always the owner of this swap cache page, because it's a readahead operation.

But will this readahead not be initiated in the context of the task taking the page fault?

handle_pte_fault()
  do_swap_page()
    swapin_readahead()
      read_swap_cache_async()

If yes, then the swap reads issued will still be in the context of the process and we should be fine?

Right. I was trying to say that the current task may also swap in pages belonging to a different task, so from a certain point of view it's not so fair to charge the current task for the whole activity. But OK, I think it's a minor issue.

Anyway, this is a minor corner case I think. And probably it is safe to consider this like any other read IO and get rid of the blkio_cgroup_set_owner().

Agreed. I wonder if it would be better to attach the blkio_cgroup to the anonymous page only when swap-out occurs. Swap seems to be an interesting case in general.
Somebody raised this question on the LWN IO controller article as well. A user process never asked for swap activity; it is something enforced by the kernel. So when doing swap-outs, it does not seem too fair to charge the write-out to the process the page belongs to, when the fact of the matter may be that some other memory-hungry application is forcing these swap-outs. Keeping this in mind, should swap activity be considered system activity and be charged to the root group instead of to user tasks in other cgroups?