[RFC] Proportional bandwidth scheduling using anticipatory I/O scheduler on 2.6.24

2008-01-29 Thread Naveen Gupta
This patch creates channels in the anticipatory I/O scheduler for sharing
bandwidth in specified proportions. It uses the ioprio_(get/set) interface
to select among the channels, and for now it maps them onto the best-effort
priority levels.
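
From userspace a task selects its channel through the ioprio_set syscall;
here is a minimal sketch (there was no glibc wrapper, so it goes through
syscall(2), and the IOPRIO_* constants are hand-copied to mirror
include/linux/ioprio.h, which is not exported to userspace on 2.6.24):

<-- snip set-channel.c -->
/* Illustrative only: put the calling process on channel 3
 * (best-effort class, level 3). */
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_BE     2
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(class, data) \
        (((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
        /* pid 0 means the current process */
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3));
}
<-- end snip -->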

It is an initial attempt to get proportional b/w sharing working in I/O
schedulers. One application is to assign a portion of a device's b/w to
a specified container. The advantage of this approach over putting an
absolute cap on a container's b/w is that with a cap we may end up
leaving b/w unused when there is no load from the other containers.
That is not to say we cannot also do I/O limiting in schedulers.

The current patch works for read requests; congested queues and write
limiting need more work. The idea is to let processes submit requests in
a round-robin manner, and once a channel exceeds its share, make it wait
until the other channels are done submitting theirs. To prevent a very
active channel from being blocked (even though it may have exceeded its
share) when there is no I/O from the other channels, the counters are
reset after a period of inactivity on the idle channels. There is also a
loss of overall average bandwidth when using multiple classes. Some of
that is expected, since applications in multiple containers sharing a
single device behave differently, but a major portion of the loss is due
to the fact that we are still not using the scheduler optimally. A sketch
of the channel accounting follows, and after it a simple fio script to
test this patch.
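
In rough terms, the accounting could look like the sketch below
(userspace C, illustration only; the names are simplified stand-ins,
though the real patch does keep per-channel ioprio_wt and serviced
fields in struct as_data):

<-- snip channel-pick.c -->
/* Sketch only: proportional channel selection.  Each channel has a
 * weight; dispatched bytes are charged to its "serviced" counter, and
 * the channel with the smallest serviced/weight ratio goes next.
 * Resetting the counters when idle channels go quiet keeps a lone
 * busy channel from being stalled. */
#include <stdio.h>

#define NR_CHANNELS 4

struct channel {
	unsigned long weight;		/* assigned proportion */
	unsigned long serviced;		/* bytes dispatched so far */
	int pending;			/* channel has queued requests */
};

/* a/wa < b/wb  <=>  a*wb < b*wa, so no division is needed */
static int pick_channel(struct channel *ch)
{
	int i, best = -1;

	for (i = 0; i < NR_CHANNELS; i++) {
		if (!ch[i].pending)
			continue;
		if (best < 0 || ch[i].serviced * ch[best].weight <
				ch[best].serviced * ch[i].weight)
			best = i;
	}
	return best;
}

/* called when the bw timer fires after a period of inactivity */
static void reset_counters(struct channel *ch)
{
	int i;

	for (i = 0; i < NR_CHANNELS; i++)
		ch[i].serviced = 0;
}

int main(void)
{
	struct channel ch[NR_CHANNELS] = {
		{ 2, 0, 1 }, { 1, 0, 1 }, { 1, 0, 0 }, { 1, 0, 0 },
	};
	int i;

	/* six 256KB dispatches: channel 0 should get about 2/3 of them */
	for (i = 0; i < 6; i++) {
		int c = pick_channel(ch);
		ch[c].serviced += 256 * 1024;
		printf("dispatch %d goes to channel %d\n", i, c);
	}
	reset_counters(ch);
	return 0;
}
<-- end snip -->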


<-- snip fio.script -->
[global]
ioengine=sync
rw=read
direct=0
exitall

[file]
name=buffered1
directory=/tmp
bs=256k
size=1g
prio=0
prioclass=2

[file]
name=buffered3
directory=/tmp
bs=1m
size=1g
prio=3
prioclass=2
<-- end snip -->

Other interfaces are in /sys/block/[device]/queue/iosched/
1. priority_weights - Assign proportions to the various channels.
   Note these proportions are currently expressed in multiples of 1024*1024;
   I will work on getting these into exact proportions.
2. bandwidth_scheduling - Writing 0 to this file stops proportional
   scheduling.
3. bw_timer_expire - The time period after which the counters are reset.
   Writing a large value gives more exact proportions, while small values
   increase the overall average bandwidth. This is the interval after which
   the counters on the other channels are reset due to inactivity. Some
   tuning of this variable may be needed to get the required results. I will
   work on making this transparent.
This patch defaults to four channels.
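
As a quick illustration (assuming a device named sda; I only read
priority_weights back here, since its exact write format is still in
flux, as noted above):

  # echo 1 > /sys/block/sda/queue/iosched/bandwidth_scheduling
  # echo 16 > /sys/block/sda/queue/iosched/bw_timer_expire
  # cat /sys/block/sda/queue/iosched/priority_weights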

I would like initial feedback on what is expected here, especially when
it comes to container groups. Is this useful as-is, or do we need hard
limits for the various channels? What else is expected? Would assigning
priorities be of any use, either absolute priorities or soft priorities
along with the b/w limits? I can add cgroup interfaces in another patch.

Signed-off-by: Naveen Gupta <[EMAIL PROTECTED]>

Index: linux-2.6.24/block/Kconfig.iosched
===
--- linux-2.6.24.orig/block/Kconfig.iosched 2008-01-24
14:58:37.0 -0800
+++ linux-2.6.24/block/Kconfig.iosched  2008-01-27 11:24:50.0 -0800
@@ -21,6 +21,13 @@ config IOSCHED_AS
  deadline I/O scheduler, it can also be slower in some cases
  especially some database loads.

+config IOPRIO_AS_MAX
+   int "Bandwidth channels in anticipatory I/O scheduler"
+   depends on IOSCHED_AS
+   default "4"
+   help
+ Number of valid b/w channels in anticipatory scheduler.
+
 config IOSCHED_DEADLINE
tristate "Deadline I/O scheduler"
default y
Index: linux-2.6.24/block/as-iosched.c
===
--- linux-2.6.24.orig/block/as-iosched.c2008-01-24
14:58:37.0 -0800
+++ linux-2.6.24/block/as-iosched.c 2008-01-29 12:05:52.0 -0800
@@ -16,6 +16,8 @@
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
 #include <linux/interrupt.h>
+#include <linux/ioprio.h>
+#include <linux/ctype.h>

 #define REQ_SYNC   1
 #define REQ_ASYNC  0
@@ -63,6 +65,9 @@
  */
 #define MAX_THINKTIME (HZ/50UL)

+#define default_bandwidth_scheduling  (0)
+#define default_bw_timer_expire (16)   /* msecs */
+
 /* Bits in as_io_context.state */
 enum as_io_states {
AS_TASK_RUNNING=0,  /* Process has not exited */
@@ -89,10 +94,14 @@ struct as_data {
/*
 * requests (as_rq s) are present on both sort_list and fifo_list
 */
-   struct rb_root sort_list[2];
-   struct list_head fifo_list[2];
+   struct {
+   struct rb_root sort_list[2];
+   struct list_head fifo_list[2];
+   struct request *next_rq[2];
+   unsigned long ioprio_wt;
+   unsigned long serviced;
+   } prio_q[IOPRIO_AS_MAX];

-   struct request *next_rq[2]; /* next in sort order */
sector_t last_sector[2];/* last REQ_SYNC & R

Re: [PATCH] cgroup: limit block I/O bandwidth

2008-01-22 Thread Naveen Gupta
On 22/01/2008, Andrea Righi <[EMAIL PROTECTED]> wrote:
> Naveen Gupta wrote:
> > See if using priority levels to have per level bandwidth limit can
> > solve the priority inversion problem you were seeing earlier. I have a
> > priority scheduling patch for anticipatory scheduler, if you want to
> > try it. It's much simpler than CFQ priority.  I still need to port it
> > to 2.6.24 though and send across for review.
> >
> > Though as already said, this would be for read side only.
> >
> > -Naveen
>
> Thanks Naveen, I can test you scheduler if you want, but the priority
> inversion problem (or better we should call it a "bandwidth limiting"
> that impacts in wrong tasks) occurs only with write operations and, as
> said by Jens, the I/O scheduler is not the right place to implement this
> kind of limiting, because at this level the processes have already
> performed the operations (dirty pages in memory) that raise the requests
> to the I/O scheduler (made by different processes asynchronously).

If the i/o submission is happening in bursts and we limit the rate
during submission, we will have to stop the current task from
submitting any further i/o and hence change its pattern. Also, we would
then be limiting the submission rate and not the rate actually going
out on the wire, since the scheduler may reorder requests.

One way could be to limit the rate when the i/o is sent out from the
scheduler, and if we see that the number of allocated requests is above
a threshold, disallow request allocation in the offending task. This way
an application submitting bursts under the allowed average rate will not
be stopped frequently. Something like a leaky bucket, as sketched below.
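
A minimal sketch of that leaky/token-bucket idea (illustrative
userspace C, not kernel code; all names and parameters here are
placeholders of mine):

<-- snip leaky-bucket.c -->
/* Sketch only: token-bucket admission at dispatch time.  Tokens
 * accrue at the allowed average rate up to a burst-sized depth, so an
 * application that bursts but stays under the average rate is rarely
 * stopped; a sustained overrun drains the bucket and gets throttled. */
#include <time.h>

struct bucket {
	double tokens;		/* bytes currently available */
	double rate;		/* allowed bytes per second */
	double burst;		/* bucket depth: max burst in bytes */
	struct timespec last;	/* time of last refill */
};

static int bucket_allow(struct bucket *b, double bytes)
{
	struct timespec now;
	double elapsed;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - b->last.tv_sec) +
		  (now.tv_nsec - b->last.tv_nsec) / 1e9;
	b->last = now;

	b->tokens += elapsed * b->rate;		/* refill */
	if (b->tokens > b->burst)
		b->tokens = b->burst;		/* cap at depth */

	if (b->tokens < bytes)
		return 0;			/* over rate: throttle */
	b->tokens -= bytes;
	return 1;				/* dispatch allowed */
}

int main(void)
{
	/* 10MB/s average rate with a 4MB burst allowance */
	struct bucket b = { .tokens = 0, .rate = 10 << 20, .burst = 4 << 20 };

	clock_gettime(CLOCK_MONOTONIC, &b.last);
	/* admit a 1MB request only if the budget allows it */
	return bucket_allow(&b, 1 << 20) ? 0 : 1;
}
<-- end snip -->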

Now for dirtying of memory happening in a different context than the
submission path, you could still put a limit based on the dirty ratio,
set higher than the actual b/w rate you are trying to achieve. That way
you always have something to write without blowing through your entire
memory. Or you can get really fancy and track who dirtied the pages and
start limiting it that way.



>
> A possible way to model the write limiting is to look at the dirty page
> ratio that is, in part, the principal reason for the requests to the I/O
> scheduler. But in this way we would limit also the re-write operations
> in memory and this is too much limiting.
>
> So, the cgroup dirty page throttling could be very interesting anyway,
> but it's not the same thing as limiting the real write I/O bandwidth.
>
> For now I've rewritten my patch as following, moving away the code from
> the I/O scheduler, it seems to work in my small tests (apart all the
> things said above), but I'd like to find a different way to have a more
> sophisticated I/O throttling approach (probably looking also directly at
> the read()/write() level)... just investigating for now...
>
> BTW I've seen that also OpenVZ has not a solution for this problem, yet.
> AFAIU OpenVZ I/O activity is accounted in virtual enviromnents (VE) by
> the user beancounters (http://wiki.openvz.org/IO_accounting), but
> there's not any policy that implements the block I/O limiting, except
> that it's possible to set different per-VE I/O priorities (mapped on CFQ
> priorities). But I've not understood if this just sets this I/O priority
> to all processes in the VE, or if it does something different. I still
> need to look at the code in details.
>
> -Andrea
>
> Signed-off-by: Andrea Righi <[EMAIL PROTECTED]>
> ---
>
> diff -urpN linux-2.6.24-rc8/block/io-throttle.c 
> linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
> --- linux-2.6.24-rc8/block/io-throttle.c1970-01-01 01:00:00.0 
> +0100
> +++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c   2008-01-22 
> 23:06:09.0 +0100
> @@ -0,0 +1,222 @@
> +/*
> + * io-throttle.c
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + *
> + * Copyright (C) 2008 Andrea Righi <[EMAIL PROTECTED]>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/gfp.h>
> +#include <linux/err.h>
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/spinlock.h>
> +#include <linux/io-throttle.h>
> +
> +struct iothrottle {
> +   struct cgroup_subsys_state css;
> +   spinlock_t lock;
> +   unsigned long iorate;
> +   unsigned long req

Re: [PATCH] cgroup: limit block I/O bandwidth

2008-01-22 Thread Naveen Gupta
On 20/01/2008, Andrea Righi <[EMAIL PROTECTED]> wrote:
> Jens Axboe wrote:
> > On Sun, Jan 20 2008, Andrea Righi wrote:
> >> Jens Axboe wrote:
> >>> Your approach is totally flawed, imho. For instance, you don't want a
> >>> process to be able to dirty memory at foo mb/sec but only actually
> >>> write them out at bar mb/sec.
> >> Right. Actually my problem here is that the processes that write out
> >> blocks are different respect to the processes that write bytes in
> >> memory, and I would be able to add limits on those processes that are
> >> dirtying memory.
> >
> > That's another reason why you cannot do this on a per-process or group
> > basis, because you have no way of mapping back from the io queue path
> > which process originally dirtied this memory.
> >
> >>> The noop-iosched changes are also very buggy. The queue back pointer
> >>> breaks reference counting and the task pointer storage assumes the task
> >>> will also always be around. That's of course not the case.
> >> Yes, this really need a lot of fixes. I simply posted the patch to know
> >> if such approach (in general) could have sense or not.
> >
> > It doesn't need fixes, it needs to be redesigned :-). No amount of
> > fixing will make the patch you posted correct, since the approach is
> > simply not feasible.
> >
> >>> IOW, you are doing this at the wrong level.
> >>>
> >>> What problem are you trying to solve?
> >> Limiting block I/O bandwidth for tasks that belong to a generic cgroup,
> >> in order to provide a sort of a QoS on block I/O.
> >>
> >> Anyway, I'm quite new in the wonderful land of the I/O scheduling, so
> >> any help is appreciated.
> >
> > For starters, you want to throttle when queuing IO, not dispatching it.
> > If you need to modify IO schedulers, then you are already at the wrong
> > level. That doesn't solve the write problem, but reads can be done.
> >
> > If you want to solve for both read/write(2), then move the code to that
> > level. That wont work for eg mmap, though...
> >
> > And as Balbir notes, the openvz group have been looking at some of these
> > problems as well. As has lots of other people btw, you probably want to
> > search around a bit and acquaint yourself with some of that work.
> >
>
> OK, now I understand, the main problem is that pages can be written to
> the block device when information about the real process that touched
> those pages in memory isn't available anymore. So, the I/O scheduler is
> not the right place to do such limitations. Another approach would be to
> just limit the I/O requests/sec for read operations and the dirty
> memory/sec for write operations (see below). But this is ugly and not
> efficient at all, since it just limits the writes in memory and not in
> the actual block devices.
>
> AFAIK openvz supports the per-VE I/O priority (via CFQ), that is great, but
> this isn't the same as bandwidth limiting. Anyway I'll look closely at the
> openvz work to see how they addressed the problem.

See if using priority levels with a per-level bandwidth limit can solve
the priority inversion problem you were seeing earlier. I have a
priority scheduling patch for the anticipatory scheduler if you want to
try it; it's much simpler than the CFQ priority code. I still need to
port it to 2.6.24 and send it out for review, though.

Though, as already said, this would be for the read side only.

-Naveen

>
> Thanks,
> -Andrea
>
> Signed-off-by: Andrea Righi <[EMAIL PROTECTED]>
> ---
>
> diff -urpN linux-2.6.24-rc8/block/io-throttle.c 
> linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
> --- linux-2.6.24-rc8/block/io-throttle.c1970-01-01 01:00:00.0 
> +0100
> +++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c   2008-01-21 
> 00:40:25.0 +0100
> @@ -0,0 +1,221 @@
> +/*
> + * io-throttle.c
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + *
> + * Copyright (C) 2008 Andrea Righi <[EMAIL PROTECTED]>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/gfp.h>
> +#include <linux/err.h>
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/spinlock.h>
> +#include <linux/io-throttle.h>
> +
> +struct iothrottle {
> +   struct cgroup_subsys_state css;
> +   spinlock_t lock;
> +   unsigned long iorate;
> +   unsigned long req;
> +   unsigned long last_request;
> +};
> +

Re: [PATCH] cgroup: limit block I/O bandwidth

2008-01-18 Thread Naveen Gupta
>Paul Menage wrote:
>> On Jan 18,  2008 7:36 AM, Dhaval Giani <[EMAIL PROTECTED]>  wrote:
>>> On Fri, Jan 18, 2008 at 12:41:03PM +0100, Andrea Righi  wrote:
 Allow to limit the  block I/O bandwidth for  specific process containers
 (cgroups) imposing additional delays  on I/O requests for those processes
 that exceed the  limits defined in the control group filesystem.

  Example:
   # mkdir /dev/cgroup
   # mount -t cgroup -oio-throttle io-throttle /dev/cgroup
>>> Just a minor nit, can't we name it as io,  keeping in mind that other
>>> controllers are known as cpu and  memory?
>>
>> Or maybe "blockio"?
>
>Agree, blockio seems better. Not all I/O is performed on  block devices
>and in this case we're  considering block devices only.

Here we want to rate limit in the block layer; I would think the I/O
scheduler is the place where we are in a much better position to do
this kind of limiting.

Also, we are changing the behavior of the application by adding sleeps
during request submission. Moreover, we will prevent requests from
being merged, since we won't allow them to be submitted in this case.

Since the bulk of write submission is done by background kernel
threads, and we throttle based on limits on current, we will end up
throttling those threads and not the actual processes submitting the i/o.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm PATCH] set enable bit instead of lock bit of Watchdog Timer in Intel 6300 chipset

2005-08-15 Thread Naveen Gupta
David,

Yes, I have tested these patches. In fact I found these bugs while trying
to make the driver work on our machines.

-Naveen

On Tue, 16 Aug 2005, David Härdeman wrote:

> On Mon, Aug 15, 2005 at 02:21:05PM -0700, Naveen Gupta wrote:
> >
> >This patch sets the WDT_ENABLE bit of the Lock Register to enable the
> >watchdog and WDT_LOCK bit only if nowayout is set. The old code always
> >sets the WDT_LOCK bit of watchdog timer for Intel 6300ESB chipset. So, we
> >end up locking the watchdog instead of enabling it.
> 
> Naveen,
> 
> thanks alot for testing the driver further and finding these bugs. I've 
> not been able to do so myself as the only computers available to me with 
> this watchdog are production-servers meaning that I've only been able to 
> test during scheduled downtimes.
> 
> Have you tested and verified that the driver works after these patches 
> have been applied?
> 
> Re,
> David
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[-mm PATCH] remove use of pci_find_device in watchdog driver for Intel 6300ESB chipset

2005-08-15 Thread Naveen Gupta

This patch replaces the obsolete 'pci_find_device' with 'pci_get_device'
to prevent the device from being stolen from under us in the watchdog
timer driver for the Intel 6300ESB chipset. Since pci_get_device takes a
reference on the device it returns, the error and exit paths now drop it
with pci_dev_put.

Signed-off-by: Naveen Gupta <[EMAIL PROTECTED]>

Index: linux-2.6.12/drivers/char/watchdog/i6300esb.c
===
--- linux-2.6.12.orig/drivers/char/watchdog/i6300esb.c  2005-08-15 
11:28:07.0 -0700
+++ linux-2.6.12/drivers/char/watchdog/i6300esb.c   2005-08-15 
11:36:54.0 -0700
@@ -362,23 +362,24 @@
 {
u8 val1;
unsigned short val2;
-
+   struct pci_device_id *ids = esb_pci_tbl;
 struct pci_dev *dev = NULL;
 /*
  *  Find the PCI device
  */
 
-while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
-if (pci_match_id(esb_pci_tbl, dev)) {
-esb_pci = dev;
-break;
-}
-}
+   while (ids->vendor && ids->device) {
+   if ((dev = pci_get_device(ids->vendor, ids->device, dev)) != 
NULL) {
+   esb_pci = dev;
+   break;
+   }
+   ids++;
+   }
 
 if (esb_pci) {
if (pci_enable_device(esb_pci)) {
printk (KERN_ERR PFX "failed to enable device\n");
-   goto out;
+   goto err_devput;
}
 
if (pci_request_region(esb_pci, 0, ESB_MODULE_NAME)) {
@@ -430,8 +431,9 @@
pci_release_region(esb_pci, 0);
 err_disable:
pci_disable_device(esb_pci);
+err_devput:
+   pci_dev_put(esb_pci);
}
-out:
return 0;
 }
 
@@ -481,7 +483,8 @@
pci_release_region(esb_pci, 0);
 /* err_disable: */
pci_disable_device(esb_pci);
-/* out: */
+/* err_devput: */
+   pci_dev_put(esb_pci);
 return ret;
 }
 
@@ -497,6 +500,7 @@
iounmap(BASEADDR);
pci_release_region(esb_pci, 0);
pci_disable_device(esb_pci);
+   pci_dev_put(esb_pci);
 }
 
 module_init(watchdog_init);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[-mm PATCH] set enable bit instead of lock bit of Watchdog Timer in Intel 6300 chipset

2005-08-15 Thread Naveen Gupta

This patch sets the WDT_ENABLE bit of the Lock Register to enable the
watchdog, and the WDT_LOCK bit only if nowayout is set. The old code
always set the WDT_LOCK bit of the watchdog timer for the Intel 6300ESB
chipset: since '|' binds tighter than '?:', the old expression always
evaluated to 0x01, so we ended up locking the watchdog instead of
enabling it.

Signed-off-by: Naveen Gupta <[EMAIL PROTECTED]>

Index: linux-2.6.12/drivers/char/watchdog/i6300esb.c
===
--- linux-2.6.12.orig/drivers/char/watchdog/i6300esb.c  2005-08-15 
11:19:01.0 -0700
+++ linux-2.6.12/drivers/char/watchdog/i6300esb.c   2005-08-15 
11:21:35.0 -0700
@@ -97,7 +97,7 @@
u8 val;
 
/* Enable or Enable + Lock? */
-   val = 0x02 | nowayout ? 0x01 : 0x00;
+   val = 0x02 | (nowayout ? 0x01 : 0x00);
 
 pci_write_config_byte(esb_pci, ESB_LOCK_REG, val);
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[-mm PATCH] set correct bit in reload register of Watchdog Timer for Intel 6300 chipset

2005-08-15 Thread Naveen Gupta

This patch writes into bit 8 of the reload register to perform the
correct 'Reload Sequence', instead of writing into bit 4, for the
watchdog on the Intel 6300ESB chipset.

Signed-off-by: Naveen Gupta <[EMAIL PROTECTED]>

Index: linux-2.6.12/drivers/char/watchdog/i6300esb.c
===
--- linux-2.6.12.orig/drivers/char/watchdog/i6300esb.c  2005-08-15 
11:21:35.0 -0700
+++ linux-2.6.12/drivers/char/watchdog/i6300esb.c   2005-08-15 
11:28:07.0 -0700
@@ -109,7 +109,7 @@
spin_lock(&esb_lock);
/* First, reset timers as suggested by the docs */
esb_unlock_registers();
-   writew(0x10, ESB_RELOAD_REG);
+   writew(ESB_WDT_RELOAD, ESB_RELOAD_REG);
/* Then disable the WDT */
pci_write_config_byte(esb_pci, ESB_LOCK_REG, 0x0);
pci_read_config_byte(esb_pci, ESB_LOCK_REG, &val);
@@ -123,7 +123,7 @@
 {
spin_lock(&esb_lock);
esb_unlock_registers();
-   writew(0x10, ESB_RELOAD_REG);
+   writew(ESB_WDT_RELOAD, ESB_RELOAD_REG);
 /* FIXME: Do we need to flush anything here? */
spin_unlock(&esb_lock);
 }
@@ -153,7 +153,7 @@
 
 /* Reload */
esb_unlock_registers();
-   writew(0x10, ESB_RELOAD_REG);
+   writew(ESB_WDT_RELOAD, ESB_RELOAD_REG);
 
/* FIXME: Do we need to flush everything out? */
 
Index: linux-2.6.12/drivers/char/watchdog/i6300esb.h
===
--- linux-2.6.12.orig/drivers/char/watchdog/i6300esb.h  2005-08-15 
11:19:01.0 -0700
+++ linux-2.6.12/drivers/char/watchdog/i6300esb.h   2005-08-15 
11:26:58.0 -0700
@@ -54,6 +54,8 @@
 #define ESB_WDT_FREQ    ( 0x01 << 2 )   /* Decrement frequency   */
 #define ESB_WDT_INTTYPE ( 0x11 << 0 )   /* Interrupt type on timer1 timeout  */
 
+/* Reload register bits */
+#define ESB_WDT_RELOAD ( 0x01 << 8 )/* prevent timeout   */
 
 /*
  * Some magic constants
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

