Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()

2016-10-04 Thread Ming Lei
On Wed, Oct 5, 2016 at 12:16 PM, Bart Van Assche
 wrote:
> On 10/01/16 15:56, Ming Lei wrote:
>>
>> If we just take the rcu/srcu read lock (or the mutex) around .queue_rq(),
>> the above code wouldn't need to be duplicated any more.
>
>
> Hello Ming,
>
> Can you have a look at the attached patch? That patch uses an srcu read lock
> for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has been set.

That is much cleaner now.

> Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. Just like
> previous versions, this patch has been tested.

I think the QUEUE_FLAG_QUIESCING flag is still needed: we have to set it to
prevent newly arriving .queue_rq() calls from being run, because
synchronize_srcu() does not wait for read-side critical sections that start
after it has been called (see the 'Update-Side Primitives' section in [1]).


[1] https://lwn.net/Articles/202847/
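
To make the point concrete, here is a rough sketch (my illustration, not code
from either patch; 'quiescing' and 'dispatch' stand in for QUEUE_FLAG_QUIESCING
and the .queue_rq() path):

#include <linux/srcu.h>

static bool quiescing;		/* stands in for QUEUE_FLAG_QUIESCING */
DEFINE_SRCU(srcu);		/* stands in for hctx->queue_rq_srcu */

static void dispatch(void)	/* the __blk_mq_run_hw_queue() side */
{
	int idx = srcu_read_lock(&srcu);

	if (!READ_ONCE(quiescing))
		;		/* ... invoke .queue_rq() here ... */

	srcu_read_unlock(&srcu, idx);
}

static void quiesce(void)	/* the blk_mq_quiesce_queue() side */
{
	WRITE_ONCE(quiescing, true);	/* no new dispatch passes the check above */
	synchronize_srcu(&srcu);	/* waits only for dispatches that already
					 * entered the read-side critical section */
}

Without the flag, a dispatch that starts after synchronize_srcu() has returned
is not waited for at all.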

-- 
Ming Lei


Re: [PATCH v2 4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()

2016-10-04 Thread Bart Van Assche

On 10/01/16 15:56, Ming Lei wrote:

If we just take the rcu/srcu read lock (or the mutex) around .queue_rq(), the
above code wouldn't need to be duplicated any more.


Hello Ming,

Can you have a look at the attached patch? That patch uses an srcu read 
lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has 
been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. 
Just like previous versions, this patch has been tested.
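
For completeness, a driver would presumably combine the new function with
stopping the hardware queues, roughly as in the sketch below (not part of the
attached patch; my_driver_freeze_io() is a made-up helper):

static void my_driver_freeze_io(struct request_queue *q)
{
	blk_mq_stop_hw_queues(q);	/* no new .queue_rq() invocations are started */
	blk_mq_quiesce_queue(q);	/* wait for the invocations already running */
	/* reconfigure the device here; .queue_rq() is not running anywhere */
	blk_mq_start_stopped_hw_queues(q, true);
}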


Thanks,

Bart.
From 25f02ed7ab7b2308fd18b89d180c0c613e55d416 Mon Sep 17 00:00:00 2001
From: Bart Van Assche 
Date: Tue, 27 Sep 2016 10:52:36 -0700
Subject: [PATCH] blk-mq: Introduce blk_mq_quiesce_queue()

blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have completed (i.e., it does not wait for the request end_io()
callbacks to be invoked).

Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-mq.c | 40 ++--
 include/linux/blk-mq.h |  3 +++
 include/linux/blkdev.h |  1 +
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d8c45de..38ae685 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,6 +115,23 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing queue_rq calls have finished
+ *
+ * Note: this function does not prevent the struct request end_io()
+ * callback from being invoked. Additionally, it does not prevent new
+ * queue_rq() calls from occurring unless the queue has been stopped first.
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		synchronize_srcu(&hctx->queue_rq_srcu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -789,11 +806,13 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(rq_list);
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
-	int queued;
+	int queued, srcu_idx;
 
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
+
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
 		cpu_online(hctx->next_cpu));
 
@@ -885,6 +904,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
 	}
+
+	srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
 }
 
 /*
@@ -1298,7 +1319,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
 	struct blk_map_ctx data;
 	struct request *rq;
-	unsigned int request_count = 0;
+	unsigned int request_count = 0, srcu_idx;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
@@ -1341,7 +1362,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
-		 * We do limited pluging. If the bio can be merged, do that.
+		 * We do limited plugging. If the bio can be merged, do that.
 		 * Otherwise the existing request in the plug list will be
 		 * issued. So the plug list will have one request at most
 		 */
@@ -1361,9 +1382,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_put_ctx(data.ctx);
 		if (!old_rq)
 			goto done;
-		if (!blk_mq_direct_issue_request(old_rq, &cookie))
-			goto done;
-		blk_mq_insert_request(old_rq, false, true, true);
+
+		srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
+		if (blk_mq_direct_issue_request(old_rq, &cookie) != 0)
+			blk_mq_insert_request(old_rq, false, true, true);
+		srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
+
 		goto done;
 	}
 
@@ -1659,6 +1683,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
+	cleanup_srcu_struct(&hctx->queue_rq_srcu);
+
 	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
 	blk_free_flush_queue(hctx->fq);
 	sbitmap_free(&hctx->ctx_map);
@@ -1741,6 +1767,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
    flush_start_tag + hctx_idx, node))
 		goto free_fq;
 
+	init_srcu_struct(&hctx->queue_rq_srcu);
+
 	return 0;
 
  free_fq:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 368c460d..b2ccd3c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include <linux/srcu.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -41,6 +42,8 @@ struct blk_mq_hw_ctx {
 
 	struct blk_mq_tags	*tags;
 
+	struct srcu_struct	queue_rq_srcu;
+
 	unsigned long		queued;
 	unsigned long		run;
 #define BLK_MQ_MAX_DISPATCH_ORDER	7
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358..8259d87 100644
---

Re: [PATCH] blk-mq: Return invalid cookie if bio was split

2016-10-04 Thread Ming Lei
On Tue, Oct 4, 2016 at 6:00 AM, Keith Busch  wrote:
> On Tue, Sep 27, 2016 at 05:25:36PM +0800, Ming Lei wrote:
>> On Mon, 26 Sep 2016 19:00:30 -0400
>> Keith Busch  wrote:
>>
>> > The only user of polling requires its original request be completed in
>> > its entirety before continuing execution. If the bio needs to be split
>> > and chained for any reason, the direct IO path would have waited for just
>> > that split portion to complete, leading to potential data corruption if
>> > the remaining transfer has not yet completed.
>>
>> The issue looks a bit tricky because there is no per-bio place for holding
>> the cookie, and generic_make_request() only returns the cookie for the
>> last bio in the current bio list, so maybe we need the following patch too.
>
> I'm looking more into this, and I can't see why we're returning a cookie
> to poll on in the first place. blk_poll is only invoked when we could have
> called io_schedule, so we expect the task state gets set to TASK_RUNNING
> when all the work completes. So why do we need to poll for a specific
> tag instead of polling until task state is set back to running?

But .poll() needs to check whether the specific request has completed or not;
only then can blk_poll() set 'current' to RUNNING.

blk_poll():
...
	ret = q->mq_ops->poll(hctx, blk_qc_t_to_tag(cookie));
	if (ret > 0) {
		hctx->poll_success++;
		set_current_state(TASK_RUNNING);
		return true;
	}
...
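
In other words, the submitter has to keep the cookie of the request it is
waiting for and poll for that specific tag, roughly as in this simplified
sketch of the caller side (illustration only; 'done' stands for the caller's
completion flag):

static void wait_for_polled_bio(struct request_queue *q, blk_qc_t cookie,
				bool *done)
{
	while (!READ_ONCE(*done)) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (READ_ONCE(*done))
			break;
		if (!blk_poll(q, cookie))	/* ->poll() checks exactly this tag */
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);
}

If the bio gets split and chained, the cookie returned to the caller no longer
identifies the whole transfer, which is the problem Keith describes above.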


>
> I've tried this out and it seems to work fine, and should fix any issues
> from split IO requests. It also helps direct IO polling, since it can
> have a list of bios, but can only save one cookie.

I am glad to take a look at the patch if you post it.

Thanks,
Ming Lei


Re: [PATCH 2/3] zram: support page-based parallel write

2016-10-04 Thread Minchan Kim
Hi Sergey,

On Tue, Oct 04, 2016 at 01:43:14PM +0900, Sergey Senozhatsky wrote:

< snip >

> TEST
> 
> 
> new tests results; same tests, same conditions, same .config.
> 4-way test:
> - BASE zram, fio direct=1
> - BASE zram, fio fsync_on_close=1
> - NEW zram, fio direct=1
> - NEW zram, fio fsync_on_close=1
> 
> 
> 
> and what I see is that:
>  - new zram is 3x slower when we do a lot of direct=1 IO
> and
>  - 10% faster when we use buffered IO (fsync_on_close); but not always;
>for instance, test execution time is longer (a reproducible behavior)
>when the number of jobs equals the number of CPUs (4).
> 
> 
> 
> if flushing is a problem for new zram during direct=1 test, then I would
> assume that writing a huge number of small files (creat/write 4k/close)
> would probably have same fsync_on_close=1 performance as direct=1.
> 
> 
> ENV
> ===
> 
>x86_64 SMP (4 CPUs), "bare zram" 3g, lzo, static compression buffer.
> 
> 
> TEST COMMAND
> 
> 
>   ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX={NEW, OLD} FIO_LOOPS=2 ./zram-fio-test.sh
> 
> 
> EXECUTED TESTS
> ==
> 
>   - [seq-read]
>   - [rand-read]
>   - [seq-write]
>   - [rand-write]
>   - [mixed-seq]
>   - [mixed-rand]
> 
> 
> fio-perf-o-meter.sh test-fio-zram-OLD test-fio-zram-OLD-flush test-fio-zram-NEW test-fio-zram-NEW-flush
> Processing test-fio-zram-OLD
> Processing test-fio-zram-OLD-flush
> Processing test-fio-zram-NEW
> Processing test-fio-zram-NEW-flush
> 
> BASE       BASE              NEW        NEW
> direct=1   fsync_on_close=1  direct=1   fsync_on_close=1
> 
> #jobs1
> 
> READ:   2345.1MB/s 2177.2MB/s  2373.2MB/s  2185.8MB/s
> READ:   1948.2MB/s 1417.7MB/s  1987.7MB/s  1447.4MB/s
> WRITE:  1292.7MB/s 1406.1MB/s  275277KB/s  1521.1MB/s
> WRITE:  1047.5MB/s 1143.8MB/s  257140KB/s  1202.4MB/s
> READ:   429530KB/s 779523KB/s  175450KB/s  782237KB/s
> WRITE:  429840KB/s 780084KB/s  175576KB/s  782800KB/s
> READ:   414074KB/s 408214KB/s  164091KB/s  383426KB/s
> WRITE:  414402KB/s 408539KB/s  164221KB/s  383730KB/s


I tested your benchmark for 1 job on my 4-CPU machine with this diff.

Nothing different.

1. just changed ordering of test execution - hoping to reduce testing time due to
   block population before the first read, or reading just zero pages
2. used fsync_on_close instead of direct io
3. dropped perf to avoid noise
4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO (the old behavior)

diff --git a/conf/fio-template-static-buffer b/conf/fio-template-static-buffer
index 1a9a473..22ddee8 100644
--- a/conf/fio-template-static-buffer
+++ b/conf/fio-template-static-buffer
@@ -1,7 +1,7 @@
 [global]
 bs=${BLOCK_SIZE}k
 ioengine=sync
-direct=1
+fsync_on_close=1
 nrfiles=${NRFILES}
 size=${SIZE}
 numjobs=${NUMJOBS}
@@ -14,18 +14,18 @@ new_group
 group_reporting
 threads=1
 
-[seq-read]
-rw=read
-
-[rand-read]
-rw=randread
-
 [seq-write]
 rw=write
 
 [rand-write]
 rw=randwrite
 
+[seq-read]
+rw=read
+
+[rand-read]
+rw=randread
+
 [mixed-seq]
 rw=rw
 
diff --git a/zram-fio-test.sh b/zram-fio-test.sh
index 39c11b3..ca2d065 100755
--- a/zram-fio-test.sh
+++ b/zram-fio-test.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
 
 
 # Sergey Senozhatsky. sergey.senozhat...@gmail.com
@@ -37,6 +37,7 @@ function create_zram
echo $ZRAM_COMP_ALG > /sys/block/zram0/comp_algorithm
cat /sys/block/zram0/comp_algorithm
 
+   echo 0 > /sys/block/zram0/use_aio
echo $ZRAM_SIZE > /sys/block/zram0/disksize
if [ $? != 0 ]; then
return -1
@@ -137,7 +138,7 @@ function main
echo "#jobs$i fio" >> $LOG
 
 		BLOCK_SIZE=4 SIZE=100% NUMJOBS=$i NRFILES=$i FIO_LOOPS=$FIO_LOOPS \
-		$PERF stat -o $LOG-perf-stat $FIO ./$FIO_TEMPLATE >> $LOG
+   $FIO ./$FIO_TEMPLATE > $LOG
 
echo -n "perfstat jobs$i" >> $LOG
cat $LOG-perf-stat >> $LOG

And got following result.

1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
2. modify script to disable aio via /sys/block/zram0/use_aio
   ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh

  seq-write 380930 474325 124.52%
 rand-write 286183 357469 124.91%
   seq-read 266813 265731  99.59%
  rand-read 211747 210670  99.49%
   mixed-seq(R) 145750 171232 117.48%
   mixed-seq(W) 145736 171215 117.48%
  mixed-rand(R) 115355 125239 108.57%
  mixed-rand(W) 115371 125256 108.57%

LZO compression is fast and a CPU for queueing while 3 CPU for compressing
it c

Re: [PATCH 1/3] block: Add iocontext priority to request

2016-10-04 Thread Tejun Heo
Hello, Adam.

On Tue, Oct 04, 2016 at 08:49:18AM -0700, Adam Manzanares wrote:
> > I wonder whether the right thing to do is to add bio->bi_ioprio, which
> > is initialized on bio submission and carried through to req->ioprio.
> 
> I looked around and thought about this, and I'm not sure it will help.
> I dug into the bio submission code and thought generic_make_request() was
> the best place to save the ioprio information; it is quite close in the
> call stack to init_request_from_bio(). Bcache sets the bio priority before
> submission, so we would have to check whether the bio priority was valid
> on bio submission, leaving us with the same problem. Leaving the priority
> in the upper bits of bio->bi_rw is fine with me. Having bio->bi_ioprio may
> help for clarity, but I think we would still face the issue of having to
> check whether the value is set when we submit the bio or init the request,
> so I'm leaning towards leaving it as is.

I see.  Thanks for looking into it.  It's icky that we don't have a
clear path of propagating ioprio but let's save that for another day.
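
(For the record, the bio->bi_ioprio idea would look roughly like the sketch
below; bi_ioprio is the hypothetical new field and bio_set_ioprio_from_task()
a made-up helper, neither exists today.)

/* at submission time: set the hypothetical bio->bi_ioprio once, respecting
 * whatever a stacking driver such as bcache may already have put there */
static void bio_set_ioprio_from_task(struct bio *bio)
{
	if (bio->bi_ioprio)
		return;
	if (current->io_context)
		bio->bi_ioprio = current->io_context->ioprio;
}

/* and in init_request_from_bio(), the request would simply inherit it: */
/*	req->ioprio = bio->bi_ioprio; */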

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Paolo.

On Tue, Oct 04, 2016 at 09:29:48PM +0200, Paolo Valente wrote:
> > Hmm... I think we already discussed this but here's a really simple
> > case.  There are three unknown workloads A, B and C and we want to
> > give A certain best-effort guarantees (let's say around 80% of the
> > underlying device) whether A is sharing the device with B or C.
> 
> That's the same example that you proposed me in our previous
> discussion.  For this example I showed you, with many boring numbers,
> that with BFQ you get the most accurate distribution of the resource.

Yes, it is about the same example, and what I understood was that
"accurate distribution of the resources" holds as long as the
randomness is incidental (i.e., due to layout on the filesystem and so
on), with the slice expiration mechanism offsetting the genuinely random
workloads.

> If you have enough stamina, I can repeat them again.  To save your

I'll go back to the thread and re-read them.

> patience, here is a very brief summary.  In a concrete use case, the
> unknown workloads turn into something like this: there will be a first
> time interval during which A happens to be, say, sequential, B happens
> to be, say, random and C happens to be, say, quasi-sequential.  Then
> there will be a next time interval during which their characteristics
> change, and so on.  It is easy (but boring, I acknowledge it) to show
> that, for each of these time intervals BFQ provides the best possible
> service in terms of fairness, bandwidth distribution, stability and so
> on.  Why?  Because of the elastic bandwidth-time scheduling of BFQ
> that we already discussed, and because BFQ is naturally accurate in
> redistributing aggregate throughput proportionally, when needed.

Yeah, that's what I remember, and for workloads above a certain level of
randomness their time consumption is mapped to bw, right?

> > I get that bfq can be a good compromise on most desktop workloads and
> > behave reasonably well for some server workloads with the slice
> > expiration mechanism but it really isn't an IO resource partitioning
> > mechanism.
> 
> Right.  My argument is that BFQ enables you to give to each client the
> bandwidth and low-latency guarantees you want.  And this IMO is way
> better than partitioning a resource and then getting unavoidable
> unfairness and high latency.

But that statement only holds while bw is the main thing to guarantee,
no?  The level of isolation that we're looking for here is fairly
strict adherence to sub-millisecond or few-millisecond high-percentile
scheduling latency while within the configured bw/iops limits, not
"overall this device is being used pretty well".

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> Il giorno 04 ott 2016, alle ore 20:28, Shaohua Li  ha scritto:
> 
> On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
>> 
>>> Il giorno 04 ott 2016, alle ore 19:28, Shaohua Li  ha scritto:
>>> 
>>> On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
 
> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo  ha 
> scritto:
> 
> Hello,
> 
> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
>> Could you please elaborate more on this point?  BFQ uses sectors
>> served to measure service, and, on all the fast devices on which
>> we have tested it, it accurately distributes
>> bandwidth as desired, redistributes excess bandwidth without any issue,
>> and guarantees high responsiveness and low latency at application and
>> system level (e.g., ~0 drop rate in video playback, with any background
>> workload tested).
> 
> The same argument as before.  Bandwidth is a very bad measure of IO
> resources spent.  For specific use cases (like desktop or whatever),
> this can work but not generally.
> 
 
 Actually, we have already discussed this point, and IMHO the arguments
 that (apparently) convinced you that bandwidth is the most relevant
 service guarantee for I/O in desktops and the like, prove that
 bandwidth is the most important service guarantee in servers too.
 
 Again, all the examples I can think of seem to confirm it:
 . file hosting: a good service must guarantee reasonable read/write,
 i.e., download/upload, speeds to users
 . file streaming: a good service must guarantee low drop rates, and
 this can be guaranteed only by guaranteeing bandwidth and latency
 . web hosting: high bandwidth and low latency needed here too
 . clouds: high bw and low latency needed to let, e.g., users of VMs
 enjoy high responsiveness and, for example, reasonable file-copy
 time
 ...
 
 To put in yet another way, with packet I/O in, e.g., clouds, there are
 basically the same issues, and the main goal is again guaranteeing
 bandwidth and low latency among nodes.
 
 Could you please provide a concrete server example (assuming we still
 agree about desktops), where I/O bandwidth does not matter while time
 does?
>>> 
>>> I don't think IO bandwidth does not matter. The problem is that bandwidth
>>> can't measure IO cost. For example, you can't say an 8k IO costs 2x the IO
>>> resources of a 4k IO.
>>> 
>> 
>> For what goal do you need to be able to say this, once you succeeded
>> in guaranteeing bandwidth and low latency to each
>> process/client/group/node/user?
> 
> I think we are discussing whether bandwidth should be used to measure IO for
> proportional IO scheduling.


Yes. But my point is upstream of that question. It's something like this:

Can bandwidth and low latency guarantees be provided with a
sector-based proportional-share scheduler?

YOUR ANSWER: No, then we need to look for other non-trivial solutions.
Hence your arguments in this discussion.

MY ANSWER: Yes, I have already achieved this goal for years now, with
a publicly available, proportional-share scheduler.  A lot of test
results with many devices, papers discussing details, demos, and so on
are available too.

> Since bandwidth can't measure the cost and you are
> using it to do arbitration, you will either have low latency but unfair
> bandwidth, or fair bandwidth but some workloads with unexpectedly high latency.
> But it might be ok depending on the latency target (for example, you can set
> the latency target high, so low latency is guaranteed*) and workload
> characteristics. I think bandwidth-based proportional scheduling will only
> work for workloads where the disk isn't fully utilized.
> 
>> Could you please suggest me some test to show how sector-based
>> guarantees fails?
> 
> Well, mix 4k random and sequential workloads and try to distribute the
>>> actual IO resources.
> 
 
 
 If I'm not mistaken, we have already gone through this example too,
 and I thought we agreed on what service scheme worked best, again
 focusing only on desktops.  To make a long story short(er), here is a
 snippet from one of our last exchanges.
 
 --
 
 On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> Maybe the source of confusion is the fact that a simple sector-based,
> proportional share scheduler always distributes total bandwidth
> according to weights. The catch is the additional BFQ rule: random
> workloads get only time isolation, and are charged for full budgets,
> so as to not affect the schedule of quasi-sequential workloads. So,
> the correct claim for BFQ is that it distributes total bandwidth
> according to weights (only) when all competing workloads are
> quasi-sequential. If some workloads are random, then these workloads
> are just time scheduled. This does break prop

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> Il giorno 04 ott 2016, alle ore 21:14, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Tue, Oct 04, 2016 at 09:02:47PM +0200, Paolo Valente wrote:
>> That's exactly what BFQ has succeeded in doing in all the tests
>> devised so far.  Can you give me a concrete example for which I can
>> try with BFQ and with any other mechanism you deem better.  If
>> you are right, numbers will just make your point.
> 
> Hmm... I think we already discussed this but here's a really simple
> case.  There are three unknown workloads A, B and C and we want to
> give A certain best-effort guarantees (let's say around 80% of the
> underlying device) whether A is sharing the device with B or C.
> 

That's the same example that you proposed me in our previous
discussion.  For this example I showed you, with many boring numbers,
that with BFQ you get the most accurate distribution of the resource.

If you have enough stamina, I can repeat them again.  To save your
patience, here is a very brief summary.  In a concrete use case, the
unknown workloads turn into something like this: there will be a first
time interval during which A happens to be, say, sequential, B happens
to be, say, random and C happens to be, say, quasi-sequential.  Then
there will be a next time interval during which their characteristics
change, and so on.  It is easy (but boring, I acknowledge it) to show
that, for each of these time intervals BFQ provides the best possible
service in terms of fairness, bandwidth distribution, stability and so
on.  Why?  Because of the elastic bandwidth-time scheduling of BFQ
that we already discussed, and because BFQ is naturally accurate in
redistributing aggregate throughput proportionally, when needed.

> I get that bfq can be a good compromise on most desktop workloads and
> behave reasonably well for some server workloads with the slice
> expiration mechanism but it really isn't an IO resource partitioning
> mechanism.
> 

Right.  My argument is that BFQ enables you to give to each client the
bandwidth and low-latency guarantees you want.  And this IMO is way
better than partitioning a resource and then getting unavoidable
unfairness and high latency.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Paolo.

On Tue, Oct 04, 2016 at 09:02:47PM +0200, Paolo Valente wrote:
> That's exactly what BFQ has succeeded in doing in all the tests
> devised so far.  Can you give me a concrete example for which I can
> try with BFQ and with any other mechanism you deem better.  If
> you are right, numbers will just make your point.

Hmm... I think we already discussed this but here's a really simple
case.  There are three unknown workloads A, B and C and we want to
give A certain best-effort guarantees (let's say around 80% of the
underlying device) whether A is sharing the device with B or C.

I get that bfq can be a good compromise on most desktop workloads and
behave reasonably well for some server workloads with the slice
expiration mechanism but it really isn't an IO resource partitioning
mechanism.

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> Il giorno 04 ott 2016, alle ore 20:54, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
>>> I don't think IO bandwidth does not matter. The problem is that bandwidth
>>> can't measure IO cost. For example, you can't say an 8k IO costs 2x the IO
>>> resources of a 4k IO.
>> 
>> For what goal do you need to be able to say this, once you succeeded
>> in guaranteeing bandwidth and low latency to each
>> process/client/group/node/user?
> 
> For resource partitioning mostly.  It's not a single user or purpose
> use case.  The same device gets shared across unrelated workloads and
> we need to guarantee differing levels of quality of service to each
> regardless of the specifics of workload.

That's exactly what BFQ has succeeded in doing in all the tests
devised so far.  Can you give me a concrete example that I can
try with BFQ and with any other mechanism you deem better?  If
you are right, the numbers will just make your point.

Thanks,
Paolo

>  We actually need to be able
> to control IO resources.
> 


> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> Il giorno 04 ott 2016, alle ore 20:50, Tejun Heo  ha scritto:
> 
> Hello, Vivek.
> 
> On Tue, Oct 04, 2016 at 02:12:45PM -0400, Vivek Goyal wrote:
>> Agreed that we don't have a good basic unit to measure IO cost. I was
>> thinking of measuring cost in terms of sectors as that's simple and
>> gets more accurate on faster devices with almost no seek penalty. And
> 
> If this were true, we could simply base everything on bandwidth;
> unfortunately, even highspeed ssds perform wildly differently
> depending on the specifics of workloads.
> 

If you base your throttler or scheduler on time, and bandwidth varies
with workload, as you correctly point out, then the result is loss of
control over bandwidth distribution, hence unfairness and
hard-to-control (high) latency.  If you use BFQ's approach, as we
already discussed with numbers and examples, you have stable fairness
and low latency.  More precisely, given your workload, you can even
formally compute the strong service guarantees you provide.
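
(Roughly, in the usual proportional-share form: over any time interval
[t1, t2] in which group i is continuously backlogged, it receives an amount
of service

	S_i(t1, t2) >= (w_i / sum_j w_j) * B(t1, t2) - Delta_i,

where w_i is its weight, B(t1, t2) is the aggregate service delivered by the
device in the interval, and Delta_i is a constant that does not grow with the
length of the interval.  This is just a generic statement of the form of such
bounds; the exact constants are worked out in the BFQ papers.)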

Thanks,
Paolo

>> in fact this proposal is also providing fairness in terms of bandwidth.
>> One extra feature seems to be this notion of minimum bandwidth for each
>> cgroup and until and unless all competing groups have met their minimum,
>> other cgroups can't cross their limits.
> 
> Haven't read the patches yet but it should allow regulating in terms
> of both bandwidth and iops.
> 
>> (BTW, should we call io.high, io.minimum instead. To say, this is the
>> minimum bandwidth group should get before others get to cross their
>> minimum limit till max limit).
> 
> The naming convention is min, low, high, max but I'm not sure "min",
> which means hard minimum amount (whether guaranteed or best-effort),
> quite makes sense here.
> 
>>> It mostly defers the burden to the one who's configuring the limits
>>> and expects it to know the characteristics of the device and workloads
>>> and configure accordingly.  It's quite a bit more tedious to use but
>>> should be able to cover good portion of use cases without being overly
>>> complicated.  I agree that it'd be nice to have a simple proportional
>>> control but as you said can't see a good solution for it at the
>>> moment.
>> 
>> Ok, so idea is that if we can't provide something accurate in kernel,
>> then expose a very low level knob, which is harder to configure but
>> should work in some cases where users know their devices and workload
>> very well. 
> 
> Yeah, that's the basic idea for this approach.  It'd be great if we
> eventually end up with proper proportional control but having
> something low level is useful anyway, so...
> 
>>> I don't think it's catering to specific use cases.  It is a generic
>>> mechanism which demands knowledge and experimentation to configure.
>>> It's more a way for the kernel to cop out and defer figuring out
>>> device characteristics to userland.  If you have a better idea, I'm
>>> all ears.
>> 
>> I don't think I have a better idea as such. Once we had talked and you
>> mentioned that for faster devices we should probably do some token based
>> mechanism (which I believe would probably mean sector based IO
>> accounting). 
> 
> That's more about the implementation strategy and doesn't affect
> whether we support bw, iops or combined configurations.  In terms of
> implementation, I still think it'd be great to have something token
> based with per-cpu batch to lower the cpu overhead on highspeed
> devices but that shouldn't really affect the semantics.
> 
> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Paolo.

On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
> > I don't think IO bandwidth does not matter. The problem is that bandwidth
> > can't measure IO cost. For example, you can't say an 8k IO costs 2x the IO
> > resources of a 4k IO.
> 
> For what goal do you need to be able to say this, once you succeeded
> in guaranteeing bandwidth and low latency to each
> process/client/group/node/user?

For resource partitioning mostly.  It's not a single user or purpose
use case.  The same device gets shared across unrelated workloads and
we need to guarantee differing levels of quality of service to each
regardless of the specifics of workload.  We actually need to be able
to control IO resources.

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Vivek.

On Tue, Oct 04, 2016 at 02:12:45PM -0400, Vivek Goyal wrote:
> Agreed that we don't have a good basic unit to measure IO cost. I was
> thinking of measuring cost in terms of sectors as that's simple and
> gets more accurate on faster devices with almost no seek penalty. And

If this were true, we could simply base everything on bandwidth;
unfortunately, even highspeed ssds perform wildly differently
depending on the specifics of workloads.

> in fact this proposal is also providing fairness in terms of bandwidth.
> One extra feature seems to be this notion of minimum bandwidth for each
> cgroup and until and unless all competing groups have met their minimum,
> other cgroups can't cross their limits.

Haven't read the patches yet but it should allow regulating in terms
of both bandwidth and iops.

> (BTW, should we call io.high, io.minimum instead. To say, this is the
>  minimum bandwidth group should get before others get to cross their
>  minimum limit till max limit).

The naming convention is min, low, high, max but I'm not sure "min",
which means hard minimum amount (whether guaranteed or best-effort),
quite makes sense here.

> > It mostly defers the burden to the one who's configuring the limits
> > and expects it to know the characteristics of the device and workloads
> > and configure accordingly.  It's quite a bit more tedious to use but
> > should be able to cover good portion of use cases without being overly
> > complicated.  I agree that it'd be nice to have a simple proportional
> > control but as you said can't see a good solution for it at the
> > moment.
> 
> Ok, so idea is that if we can't provide something accurate in kernel,
> then expose a very low level knob, which is harder to configure but
> should work in some cases where users know their devices and workload
> very well. 

Yeah, that's the basic idea for this approach.  It'd be great if we
eventually end up with proper proportional control but having
something low level is useful anyway, so...

> > I don't think it's catering to specific use cases.  It is a generic
> > mechanism which demands knowledge and experimentation to configure.
> > It's more a way for the kernel to cop out and defer figuring out
> > device characteristics to userland.  If you have a better idea, I'm
> > all ears.
> 
> I don't think I have a better idea as such. Once we had talked and you
> mentioned that for faster devices we should probably do some token based
> mechanism (which I believe would probably mean sector based IO
> accounting). 

That's more about the implementation strategy and doesn't affect
whether we support bw, iops or combined configurations.  In terms of
implementation, I still think it'd be great to have something token
based with per-cpu batch to lower the cpu overhead on highspeed
devices but that shouldn't really affect the semantics.
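
(Concretely, the per-cpu batching I have in mind is something along these
lines, as an illustration only and not blk-throttle code; refilling 'budget'
per time window is left out.)

#include <linux/atomic.h>
#include <linux/percpu.h>

#define TOKEN_BATCH	64

static atomic64_t budget;			/* refilled every window (not shown) */
static DEFINE_PER_CPU(long, cached_tokens);	/* tokens this CPU already grabbed */

/* one token per IO; the shared atomic is touched only once per TOKEN_BATCH IOs */
static bool consume_token(void)
{
	long *cached = get_cpu_ptr(&cached_tokens);
	bool ok = true;

	if (*cached == 0) {
		if (atomic64_sub_return(TOKEN_BATCH, &budget) >= 0) {
			*cached = TOKEN_BATCH;
		} else {
			atomic64_add(TOKEN_BATCH, &budget);	/* undo, over budget */
			ok = false;				/* throttle this bio */
		}
	}
	if (ok)
		(*cached)--;
	put_cpu_ptr(&cached_tokens);
	return ok;
}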

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Shaohua Li
On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
> 
> > Il giorno 04 ott 2016, alle ore 19:28, Shaohua Li  ha scritto:
> > 
> > On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
> >> 
> >>> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo  ha 
> >>> scritto:
> >>> 
> >>> Hello,
> >>> 
> >>> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
>  Could you please elaborate more on this point?  BFQ uses sectors
>  served to measure service, and, on all the fast devices on which
>  we have tested it, it accurately distributes
>  bandwidth as desired, redistributes excess bandwidth without any issue,
>  and guarantees high responsiveness and low latency at application and
>  system level (e.g., ~0 drop rate in video playback, with any background
>  workload tested).
> >>> 
> >>> The same argument as before.  Bandwidth is a very bad measure of IO
> >>> resources spent.  For specific use cases (like desktop or whatever),
> >>> this can work but not generally.
> >>> 
> >> 
> >> Actually, we have already discussed this point, and IMHO the arguments
> >> that (apparently) convinced you that bandwidth is the most relevant
> >> service guarantee for I/O in desktops and the like, prove that
> >> bandwidth is the most important service guarantee in servers too.
> >> 
> >> Again, all the examples I can think of seem to confirm it:
> >> . file hosting: a good service must guarantee reasonable read/write,
> >> i.e., download/upload, speeds to users
> >> . file streaming: a good service must guarantee low drop rates, and
> >> this can be guaranteed only by guaranteeing bandwidth and latency
> >> . web hosting: high bandwidth and low latency needed here too
> >> . clouds: high bw and low latency needed to let, e.g., users of VMs
> >> enjoy high responsiveness and, for example, reasonable file-copy
> >> time
> >> ...
> >> 
> >> To put in yet another way, with packet I/O in, e.g., clouds, there are
> >> basically the same issues, and the main goal is again guaranteeing
> >> bandwidth and low latency among nodes.
> >> 
> >> Could you please provide a concrete server example (assuming we still
> >> agree about desktops), where I/O bandwidth does not matter while time
> >> does?
> > 
> > I don't think IO bandwidth does not matter. The problem is that bandwidth
> > can't measure IO cost. For example, you can't say an 8k IO costs 2x the IO
> > resources of a 4k IO.
> > 
> 
> For what goal do you need to be able to say this, once you succeeded
> in guaranteeing bandwidth and low latency to each
> process/client/group/node/user?

I think we are discussing whether bandwidth should be used to measure IO for
proportional IO scheduling. Since bandwidth can't measure the cost and you are
using it to do arbitration, you will either have low latency but unfair
bandwidth, or fair bandwidth but some workloads with unexpectedly high latency.
But it might be ok depending on the latency target (for example, you can set
the latency target high, so low latency is guaranteed*) and workload
characteristics. I think bandwidth-based proportional scheduling will only
work for workloads where the disk isn't fully utilized.
 
>  Could you please suggest me some test to show how sector-based
>  guarantees fails?
> >>> 
> >>> Well, mix 4k random and sequential workloads and try to distribute the
> >>> actual IO resources.
> >>> 
> >> 
> >> 
> >> If I'm not mistaken, we have already gone through this example too,
> >> and I thought we agreed on what service scheme worked best, again
> >> focusing only on desktops.  To make a long story short(er), here is a
> >> snippet from one of our last exchanges.
> >> 
> >> --
> >> 
> >> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> >>> Maybe the source of confusion is the fact that a simple sector-based,
> >>> proportional share scheduler always distributes total bandwidth
> >>> according to weights. The catch is the additional BFQ rule: random
> >>> workloads get only time isolation, and are charged for full budgets,
> >>> so as to not affect the schedule of quasi-sequential workloads. So,
> >>> the correct claim for BFQ is that it distributes total bandwidth
> >>> according to weights (only) when all competing workloads are
> >>> quasi-sequential. If some workloads are random, then these workloads
> >>> are just time scheduled. This does break proportional-share bandwidth
> >>> distribution with mixed workloads, but, much more importantly, saves
> >>> both total throughput and individual bandwidths of quasi-sequential
> >>> workloads.
> >>> 
> >>> We could then check whether I did succeed in tuning timeouts and
> >>> budgets so as to achieve the best tradeoffs. But this is probably a
> >>> second-order problem as of now.
> > 
> > I don't see why random/sequential matters for SSDs. What really matters is
> > request size and IO depth. Time-based scheduling is questionable too, as
> > workloads can dispatch all their IO within almost zero time on high queue
> > depth disks.

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Vivek Goyal
On Tue, Oct 04, 2016 at 11:56:16AM -0400, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> > On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > > Hi,
> > > 
> > > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > > prioritize processes/cgroups.
> > 
> > So this is an interim solution till we have ioscheduler for blk-mq?
> 
> It's a common permanent solution which applies to both !mq and mq.
> 
> > > This patch set tries to add basic arbitration
> > > between cgroups with blk-throttle. It adds a new limit io.high for
> > > blk-throttle. It's only for cgroup2.
> > > 
> > > io.max is a hard limit throttling. cgroups with a max limit never 
> > > dispatch more
> > > IO than their max limit. While io.high is a best effort throttling. 
> > > cgroups
> > > with high limit can run above their high limit at appropriate time.
> > > Specifically, if all cgroups reach their high limit, all cgroups can run 
> > > above
> > > their high limit. If any cgroup runs under its high limit, all other 
> > > cgroups
> > > will run according to their high limit.
> > 
> > Hi Shaohua,
> > 
> > I still don't understand why we should not implement a weight based
> > proportional IO mechanism and how this mechanism is better than 
> > proportional IO .
> 
> Oh, if we actually can implement proportional IO control, it'd be
> great.  The problem is that we have no way of knowing IO cost for
> highspeed ssd devices.  CFQ gets around the problem by using the
> walltime as the measure of resource usage and scheduling time slices,
> which works fine for rotating disks but horribly for highspeed ssds.
> 
> We can get some semblance of proportional control by just counting bw
> or iops but both break down badly as a means to measure the actual
> resource consumption depending on the workload.  While limit based
> control is more tedious to configure, it doesn't misrepresent what's
> going on and is a lot less likely to produce surprising outcomes.
> 
> We *can* try to concoct something which tries to do proportional
> control for highspeed ssds but that's gonna be quite a bit of
> complexity and I'm not so sure it'd be justifiable given that we can't
> even figure out measurement of the most basic operating unit.

Hi Tejun,

Agreed that we don't have a good basic unit to measure IO cost. I was
thinking of measuring cost in terms of sectors as that's simple and
gets more accurate on faster devices with almost no seek penalty. And
in fact this proposal is also providing fairness in terms of bandwidth.
One extra feature seems to be this notion of minimum bandwidth for each
cgroup and until and unless all competing groups have met their minimum,
other cgroups can't cross their limits.

(BTW, should we call io.high io.minimum instead? That is, this is the
 minimum bandwidth a group should get before others get to cross their
 minimum limit, up to the max limit.)
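
If I read the semantics right, the behavior would be something like this
(numbers made up for illustration): take cgroups A and B, each with
io.high = 100MB/s, on a device that can do 300MB/s. While A issues only
60MB/s, B is held to its 100MB/s high limit; once both A and B push past
100MB/s, the high limits stop constraining them and they compete for the
remaining headroom.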

> 
> > Agreed that we have issues with proportional IO and we don't have good
> > solutions for these problems. But I can't see that how this mechanism
> > will overcome these problems either.
> 
> It mostly defers the burden to the one who's configuring the limits
> and expects it to know the characteristics of the device and workloads
> and configure accordingly.  It's quite a bit more tedious to use but
> should be able to cover good portion of use cases without being overly
> complicated.  I agree that it'd be nice to have a simple proportional
> control but as you said can't see a good solution for it at the
> moment.

Ok, so idea is that if we can't provide something accurate in kernel,
then expose a very low level knob, which is harder to configure but
should work in some cases where users know their devices and workload
very well. 

> 
> > IIRC, biggest issue with proportional IO was that a low prio group might
> > fill up the device queue with plenty of IO requests and later when high
> > prio cgroup comes, it will still experience latencies anyway. And solution
> > to the problem probably would be to get some awareness in device about 
> > priority of request and map weights to those priority. That way higher
> > prio requests get prioritized.
> 
> Nah, the real problem is that we can't even decide what the
> proportions should be based on.  The most fundamental part is missing.
> 
> > Or run device at lower queue depth. That will improve latencies but might
> > reduce overall throughput.
> 
> And that we can't do this (and thus basically operate close to
> scheduling time slices) for highspeed ssds.
> 
> > Or throttle number of buffered writes (as Jens's writeback throttling)
> > patches were doing. Buffered writes seem to be biggest culprit for 
> > increased latencies and being able to control these should help.
> 
> That's a different topic.
> 
> > ioprio/weight based proportional IO mechanism is much more generic and
> > much easier to configure for any kind of storage. io.high is absolu

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> Il giorno 04 ott 2016, alle ore 19:28, Shaohua Li  ha scritto:
> 
> On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
>> 
>>> Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo  ha 
>>> scritto:
>>> 
>>> Hello,
>>> 
>>> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
 Could you please elaborate more on this point?  BFQ uses sectors
 served to measure service, and, on all the fast devices on which
 we have tested it, it accurately distributes
 bandwidth as desired, redistributes excess bandwidth without any issue,
 and guarantees high responsiveness and low latency at application and
 system level (e.g., ~0 drop rate in video playback, with any background
 workload tested).
>>> 
>>> The same argument as before.  Bandwidth is a very bad measure of IO
>>> resources spent.  For specific use cases (like desktop or whatever),
>>> this can work but not generally.
>>> 
>> 
>> Actually, we have already discussed this point, and IMHO the arguments
>> that (apparently) convinced you that bandwidth is the most relevant
>> service guarantee for I/O in desktops and the like, prove that
>> bandwidth is the most important service guarantee in servers too.
>> 
>> Again, all the examples I can think of seem to confirm it:
>> . file hosting: a good service must guarantee reasonable read/write,
>> i.e., download/upload, speeds to users
>> . file streaming: a good service must guarantee low drop rates, and
>> this can be guaranteed only by guaranteeing bandwidth and latency
>> . web hosting: high bandwidth and low latency needed here too
>> . clouds: high bw and low latency needed to let, e.g., users of VMs
>> enjoy high responsiveness and, for example, reasonable file-copy
>> time
>> ...
>> 
>> To put in yet another way, with packet I/O in, e.g., clouds, there are
>> basically the same issues, and the main goal is again guaranteeing
>> bandwidth and low latency among nodes.
>> 
>> Could you please provide a concrete server example (assuming we still
>> agree about desktops), where I/O bandwidth does not matter while time
>> does?
> 
> I don't think IO bandwidth does not matter. The problem is that bandwidth can't
> measure IO cost. For example, you can't say an 8k IO costs 2x the IO resources
> of a 4k IO.
> 

For what goal do you need to be able to say this, once you succeeded
in guaranteeing bandwidth and low latency to each
process/client/group/node/user?

 Could you please suggest me some test to show how sector-based
 guarantees fails?
>>> 
>>> Well, mix 4k random and sequential workloads and try to distribute the
> >>> actual IO resources.
>>> 
>> 
>> 
>> If I'm not mistaken, we have already gone through this example too,
>> and I thought we agreed on what service scheme worked best, again
>> focusing only on desktops.  To make a long story short(er), here is a
>> snippet from one of our last exchanges.
>> 
>> --
>> 
>> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
>>> Maybe the source of confusion is the fact that a simple sector-based,
>>> proportional share scheduler always distributes total bandwidth
>>> according to weights. The catch is the additional BFQ rule: random
>>> workloads get only time isolation, and are charged for full budgets,
>>> so as to not affect the schedule of quasi-sequential workloads. So,
>>> the correct claim for BFQ is that it distributes total bandwidth
>>> according to weights (only) when all competing workloads are
>>> quasi-sequential. If some workloads are random, then these workloads
>>> are just time scheduled. This does break proportional-share bandwidth
>>> distribution with mixed workloads, but, much more importantly, saves
>>> both total throughput and individual bandwidths of quasi-sequential
>>> workloads.
>>> 
>>> We could then check whether I did succeed in tuning timeouts and
>>> budgets so as to achieve the best tradeoffs. But this is probably a
>>> second-order problem as of now.
> 
> I don't see why random/sequential matters for SSDs. What really matters is
> request size and IO depth. Time-based scheduling is questionable too, as
> workloads can dispatch all their IO within almost zero time on high queue
> depth disks.
> 

That's an orthogonal issue.  If what matters is, e.g., size, then it is
enough to replace "sequential I/O" with "large-request I/O".  In case
I have been too vague, here is an example: I mean that, e.g., in an I/O
scheduler you replace the function that computes whether a queue is
seeky based on request distance with a function based on
request size.  And this is exactly what has already been done, for
example, in CFQ:

	if (blk_queue_nonrot(cfqd->queue))
		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
	else
		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);

Thanks,
Paolo

> Thanks,
> Shaohua


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Shaohua Li
On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
> 
> > Il giorno 04 ott 2016, alle ore 18:27, Tejun Heo  ha 
> > scritto:
> > 
> > Hello,
> > 
> > On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
> >> Could you please elaborate more on this point?  BFQ uses sectors
> >> served to measure service, and, on all the fast devices on which
> >> we have tested it, it accurately distributes
> >> bandwidth as desired, redistributes excess bandwidth without any issue,
> >> and guarantees high responsiveness and low latency at application and
> >> system level (e.g., ~0 drop rate in video playback, with any background
> >> workload tested).
> > 
> > The same argument as before.  Bandwidth is a very bad measure of IO
> > resources spent.  For specific use cases (like desktop or whatever),
> > this can work but not generally.
> > 
> 
> Actually, we have already discussed this point, and IMHO the arguments
> that (apparently) convinced you that bandwidth is the most relevant
> service guarantee for I/O in desktops and the like, prove that
> bandwidth is the most important service guarantee in servers too.
> 
> Again, all the examples I can think of seem to confirm it:
> . file hosting: a good service must guarantee reasonable read/write,
> i.e., download/upload, speeds to users
> . file streaming: a good service must guarantee low drop rates, and
> this can be guaranteed only by guaranteeing bandwidth and latency
> . web hosting: high bandwidth and low latency needed here too
> . clouds: high bw and low latency needed to let, e.g., users of VMs
> enjoy high responsiveness and, for example, reasonable file-copy
> time
> ...
> 
> To put in yet another way, with packet I/O in, e.g., clouds, there are
> basically the same issues, and the main goal is again guaranteeing
> bandwidth and low latency among nodes.
> 
> Could you please provide a concrete server example (assuming we still
> agree about desktops), where I/O bandwidth does not matter while time
> does?

I don't think IO bandwidth does not matter. The problem is that bandwidth can't
measure IO cost. For example, you can't say an 8k IO costs 2x the IO resources
of a 4k IO.

> >> Could you please suggest me some test to show how sector-based
> >> guarantees fails?
> > 
> > Well, mix 4k random and sequential workloads and try to distribute the
> > actual IO resources.
> > 
> 
> 
> If I'm not mistaken, we have already gone through this example too,
> and I thought we agreed on what service scheme worked best, again
> focusing only on desktops.  To make a long story short(er), here is a
> snippet from one of our last exchanges.
> 
> --
> 
> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> > Maybe the source of confusion is the fact that a simple sector-based,
> > proportional share scheduler always distributes total bandwidth
> > according to weights. The catch is the additional BFQ rule: random
> > workloads get only time isolation, and are charged for full budgets,
> > so as to not affect the schedule of quasi-sequential workloads. So,
> > the correct claim for BFQ is that it distributes total bandwidth
> > according to weights (only) when all competing workloads are
> > quasi-sequential. If some workloads are random, then these workloads
> > are just time scheduled. This does break proportional-share bandwidth
> > distribution with mixed workloads, but, much more importantly, saves
> > both total throughput and individual bandwidths of quasi-sequential
> > workloads.
> > 
> > We could then check whether I did succeed in tuning timeouts and
> > budgets so as to achieve the best tradeoffs. But this is probably a
> > second-order problem as of now.

I don't see why random/sequential matters for SSDs. What really matters is
request size and IO depth. Time-based scheduling is questionable too, as
workloads can dispatch all their IO within almost zero time on high queue
depth disks.

Thanks,
Shaohua


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Shaohua Li
Hi,
On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > Hi,
> > 
> > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > prioritize processes/cgroups.
> 
> So this is an interim solution till we have ioscheduler for blk-mq?

This is still a generic solution to prioritize workloads.

> > This patch set tries to add basic arbitration
> > between cgroups with blk-throttle. It adds a new limit io.high for
> > blk-throttle. It's only for cgroup2.
> > 
> > io.max is a hard limit throttling. cgroups with a max limit never dispatch 
> > more
> > IO than their max limit. While io.high is a best effort throttling. cgroups
> > with high limit can run above their high limit at appropriate time.
> > Specifically, if all cgroups reach their high limit, all cgroups can run 
> > above
> > their high limit. If any cgroup runs under its high limit, all other cgroups
> > will run according to their high limit.
> 
> Hi Shaohua,
> 
> I still don't understand why we should not implement a weight based
> proportional IO mechanism and how this mechanism is better than proportional 
> IO .
>
> Agreed that we have issues with proportional IO and we don't have good
> solutions for these problems. But I can't see that how this mechanism
> will overcome these problems either.

No, I never claimed this mechanism is better than proportional IO. The problem
with proportional IO is that we don't have a way to measure IO cost, which is
the core of proportional scheduling. This mechanism only prioritizes IO. It's
not as useful as proportional control, but it works for a lot of scenarios.

> 
> IIRC, biggest issue with proportional IO was that a low prio group might
> fill up the device queue with plenty of IO requests and later when high
> prio cgroup comes, it will still experience latencies anyway. And solution
> to the problem probably would be to get some awareness in device about 
> priority of request and map weights to those priority. That way higher
> prio requests get prioritized.
> 
> Or run device at lower queue depth. That will improve latencies but might
> reduce overall throughput.

Yep, this is the hardest part. It really depends on the tradeoff between
throughput and latency. Running the device at a low queue depth sounds workable,
but the sacrifice is extremely high for a modern SSD. Small-size IO throughput
ranges from several MB/s to several GB/s depending on queue depth. If we run the
device at a lower queue depth, the sacrifice is big enough to make device
sharing pointless.
 
> Or throttle the number of buffered writes (as Jens's writeback throttling)
> patches were doing. Buffered writes seem to be the biggest culprit for
> increased latencies and being able to control these should help.

Big reads can significantly increase latency too. Please note latency isn't
the only factor applications care about. Non-interactive workloads don't care
about single-IO latency; throughput or amortized latency is more important for
such workloads.

> ioprio/weight based proportional IO mechanism is much more generic and
> much easier to configure for any kind of storage. io.high is absolute
> limit and makes it much harder to configure. One needs to know a lot
> about underlying volume/device's bandwidth (which varies a lot anyway
> based on workload).
> IMHO, we seem to be trying to cater to one specific use case using
> this mechanism. Something ioprio/weight based will be much more
> generic and we should explore implementing that along with building
> notion of ioprio in devices. When these two work together, we might
> be able to see good results. Just software mechanism alone might not
> be enough.

Agreed, a proportional IO mechanism is easier to configure. The problem is that
we can't build it without hardware support.

Thanks,
Shaohua


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> On 4 Oct 2016, at 18:27, Tejun Heo wrote:
> 
> Hello,
> 
> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
>> Could you please elaborate more on this point?  BFQ uses sectors
>> served to measure service, and, on all the fast devices on which
>> we have tested it, it accurately distributes
>> bandwidth as desired, redistributes excess bandwidth without any issue,
>> and guarantees high responsiveness and low latency at application and
>> system level (e.g., ~0 drop rate in video playback, with any background
>> workload tested).
> 
> The same argument as before.  Bandwidth is a very bad measure of IO
> resources spent.  For specific use cases (like desktop or whatever),
> this can work but not generally.
> 

Actually, we have already discussed this point, and IMHO the arguments
that (apparently) convinced you that bandwidth is the most relevant
service guarantee for I/O in desktops and the like, prove that
bandwidth is the most important service guarantee in servers too.

Again, all the examples I can think of seem to confirm it:
. file hosting: a good service must guarantee reasonable read/write,
i.e., download/upload, speeds to users
. file streaming: a good service must guarantee low drop rates, and
this can be guaranteed only by guaranteeing bandwidth and latency
. web hosting: high bandwidth and low latency needed here too
. clouds: high bw and low latency needed to let, e.g., users of VMs
enjoy high responsiveness and, for example, reasonable file-copy
time
...

To put it yet another way, with packet I/O in, e.g., clouds, there are
basically the same issues, and the main goal is again guaranteeing
bandwidth and low latency among nodes.

Could you please provide a concrete server example (assuming we still
agree about desktops), where I/O bandwidth does not matter while time
does?


>> Could you please suggest some test to show how sector-based
>> guarantees fail?
> 
> Well, mix 4k random and sequential workloads and try to distribute the
> actual IO resources.
> 


If I'm not mistaken, we have already gone through this example too,
and I thought we agreed on what service scheme worked best, again
focusing only on desktops.  To make a long story short(er), here is a
snippet from one of our last exchanges.

--

On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> Maybe the source of confusion is the fact that a simple sector-based,
> proportional share scheduler always distributes total bandwidth
> according to weights. The catch is the additional BFQ rule: random
> workloads get only time isolation, and are charged for full budgets,
> so as to not affect the schedule of quasi-sequential workloads. So,
> the correct claim for BFQ is that it distributes total bandwidth
> according to weights (only) when all competing workloads are
> quasi-sequential. If some workloads are random, then these workloads
> are just time scheduled. This does break proportional-share bandwidth
> distribution with mixed workloads, but, much more importantly, saves
> both total throughput and individual bandwidths of quasi-sequential
> workloads.
> 
> We could then check whether I did succeed in tuning timeouts and
> budgets so as to achieve the best tradeoffs. But this is probably a
> second-order problem as of now.

[you] Ah, I see.  Yeah, that clears it up for me.  I'm gonna play with
cgroup settings and see how it actually behaves.

-

Why does the above argument not work for a server too?  What am I
missing?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente
Algogroup
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Via Campi 213/B
41125 Modena - Italy
http://algogroup.unimore.it/people/paolo/







Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello,

On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
> Could you please elaborate more on this point?  BFQ uses sectors
> served to measure service, and, on all the fast devices on which
> we have tested it, it accurately distributes
> bandwidth as desired, redistributes excess bandwidth without any issue,
> and guarantees high responsiveness and low latency at application and
> system level (e.g., ~0 drop rate in video playback, with any background
> workload tested).

The same argument as before.  Bandwidth is a very bad measure of IO
resources spent.  For specific use cases (like desktop or whatever),
this can work but not generally.

> Could you please suggest some test to show how sector-based
> guarantees fail?

Well, mix 4k random and sequential workloads and try to distribute the
actual IO resources.

Thanks.

-- 
tejun


Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Paolo Valente

> On 4 Oct 2016, at 17:56, Tejun Heo wrote:
> 
> Hello, Vivek.
> 
> On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
>> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
>>> Hi,
>>> 
>>> The background is we don't have an ioscheduler for blk-mq yet, so we can't
>>> prioritize processes/cgroups.
>> 
>> So this is an interim solution till we have ioscheduler for blk-mq?
> 
> It's a common permanent solution which applies to both !mq and mq.
> 
>>> This patch set tries to add basic arbitration
>>> between cgroups with blk-throttle. It adds a new limit io.high for
>>> blk-throttle. It's only for cgroup2.
>>> 
>>> io.max is a hard limit throttling. cgroups with a max limit never dispatch 
>>> more
>>> IO than their max limit. While io.high is a best effort throttling. cgroups
>>> with high limit can run above their high limit at appropriate time.
>>> Specifically, if all cgroups reach their high limit, all cgroups can run 
>>> above
>>> their high limit. If any cgroup runs under its high limit, all other cgroups
>>> will run according to their high limit.
>> 
>> Hi Shaohua,
>> 
>> I still don't understand why we should not implement a weight based
>> proportional IO mechanism and how this mechanism is better than proportional 
>> IO .
> 
> Oh, if we actually can implement proportional IO control, it'd be
> great.  The problem is that we have no way of knowing IO cost for
> highspeed ssd devices.  CFQ gets around the problem by using the
> walltime as the measure of resource usage and scheduling time slices,
> which works fine for rotating disks but horribly for highspeed ssds.
> 

Could you please elaborate more on this point?  BFQ uses sectors
served to measure service, and, on all the fast devices on which
we have tested it, it accurately distributes
bandwidth as desired, redistributes excess bandwidth without any issue,
and guarantees high responsiveness and low latency at application and
system level (e.g., ~0 drop rate in video playback, with any background
workload tested).

Could you please suggest some test to show how sector-based
guarantees fail?

Thanks,
Paolo

> We can get some semblance of proportional control by just counting bw
> or iops but both break down badly as a means to measure the actual
> resource consumption depending on the workload.  While limit based
> control is more tedious to configure, it doesn't misrepresent what's
> going on and is a lot less likely to produce surprising outcomes.
> 
> We *can* try to concoct something which tries to do proportional
> control for highspeed ssds but that's gonna be quite a bit of
> complexity and I'm not so sure it'd be justifiable given that we can't
> even figure out measurement of the most basic operating unit.
> 
>> Agreed that we have issues with proportional IO and we don't have good
>> solutions for these problems. But I can't see how this mechanism
>> will overcome these problems either.
> 
> It mostly defers the burden to the one who's configuring the limits
> and expects it to know the characteristics of the device and workloads
> and configure accordingly.  It's quite a bit more tedious to use but
> should be able to cover good portion of use cases without being overly
> complicated.  I agree that it'd be nice to have a simple proportional
> control but as you said can't see a good solution for it at the
> moment.
> 
>> IIRC, biggest issue with proportional IO was that a low prio group might
>> fill up the device queue with plenty of IO requests and later when high
>> prio cgroup comes, it will still experience latencies anyway. And solution
>> to the problem probably would be to get some awareness in device about 
>> priority of request and map weights to those priority. That way higher
>> prio requests get prioritized.
> 
> Nah, the real problem is that we can't even decide what the
> proportions should be based on.  The most fundamental part is missing.
> 
>> Or run device at lower queue depth. That will improve latencies but might
>> reduce overall throughput.
> 
> And that we can't do this (and thus basically operate close to
> scheduling time slices) for highspeed ssds.
> 
>> Or throttle the number of buffered writes (as Jens's writeback throttling)
>> patches were doing. Buffered writes seem to be the biggest culprit for
>> increased latencies and being able to control these should help.
> 
> That's a different topic.
> 
>> ioprio/weight based proportional IO mechanism is much more generic and
>> much easier to configure for any kind of storage. io.high is absolute
>> limit and makes it much harder to configure. One needs to know a lot
>> about underlying volume/device's bandwidth (which varies a lot anyway
>> based on workload).
> 
> Yeap, no disagreement there, but it still is a workable solution.
> 
>> IMHO, we seem to be trying to cater to one specific use case using
>> this mechanism. Something ioprio/weight based will be much more
>> generic and we should explore implementing

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Tejun Heo
Hello, Vivek.

On Tue, Oct 04, 2016 at 09:28:05AM -0400, Vivek Goyal wrote:
> On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> > Hi,
> > 
> > The background is we don't have an ioscheduler for blk-mq yet, so we can't
> > prioritize processes/cgroups.
> 
> So this is an interim solution till we have ioscheduler for blk-mq?

It's a common permanent solution which applies to both !mq and mq.

> > This patch set tries to add basic arbitration
> > between cgroups with blk-throttle. It adds a new limit io.high for
> > blk-throttle. It's only for cgroup2.
> > 
> > io.max is a hard limit throttling. cgroups with a max limit never dispatch 
> > more
> > IO than their max limit. While io.high is a best effort throttling. cgroups
> > with high limit can run above their high limit at appropriate time.
> > Specifically, if all cgroups reach their high limit, all cgroups can run 
> > above
> > their high limit. If any cgroup runs under its high limit, all other cgroups
> > will run according to their high limit.
> 
> Hi Shaohua,
> 
> I still don't understand why we should not implement a weight based
> proportional IO mechanism and how this mechanism is better than proportional 
> IO .

Oh, if we actually can implement proportional IO control, it'd be
great.  The problem is that we have no way of knowing IO cost for
highspeed ssd devices.  CFQ gets around the problem by using the
walltime as the measure of resource usage and scheduling time slices,
which works fine for rotating disks but horribly for highspeed ssds.

We can get some semblance of proportional control by just counting bw
or iops but both break down badly as a means to measure the actual
resource consumption depending on the workload.  While limit based
control is more tedious to configure, it doesn't misrepresent what's
going on and is a lot less likely to produce surprising outcomes.

We *can* try to concoct something which tries to do proportional
control for highspeed ssds but that's gonna be quite a bit of
complexity and I'm not so sure it'd be justifiable given that we can't
even figure out measurement of the most basic operating unit.

> Agreed that we have issues with proportional IO and we don't have good
> solutions for these problems. But I can't see how this mechanism
> will overcome these problems either.

It mostly defers the burden to the one who's configuring the limits
and expects it to know the characteristics of the device and workloads
and configure accordingly.  It's quite a bit more tedious to use but
should be able to cover good portion of use cases without being overly
complicated.  I agree that it'd be nice to have a simple proportional
control but as you said can't see a good solution for it at the
moment.

> IIRC, biggest issue with proportional IO was that a low prio group might
> fill up the device queue with plenty of IO requests and later when high
> prio cgroup comes, it will still experience latencies anyway. And solution
> to the problem probably would be to get some awareness in device about 
> priority of request and map weights to those priority. That way higher
> prio requests get prioritized.

Nah, the real problem is that we can't even decide what the
proportions should be based on.  The most fundamental part is missing.

> Or run device at lower queue depth. That will improve latencies but might
> reduce overall throughput.

And that we can't do this (and thus basically operate close to
scheduling time slices) for highspeed ssds.

> Or throttle the number of buffered writes (as Jens's writeback throttling)
> patches were doing. Buffered writes seem to be the biggest culprit for
> increased latencies and being able to control these should help.

That's a different topic.

> ioprio/weight based proportional IO mechanism is much more generic and
> much easier to configure for any kind of storage. io.high is absolute
> limit and makes it much harder to configure. One needs to know a lot
> about underlying volume/device's bandwidth (which varies a lot anyway
> based on workload).

Yeap, no disagreement there, but it still is a workable solution.

> IMHO, we seem to be trying to cater to one specific use case using
> this mechanism. Something ioprio/weight based will be much more
> generic and we should explore implementing that along with building
> notion of ioprio in devices. When these two work together, we might
> be able to see good results. Just software mechanism alone might not
> be enough.

I don't think it's catering to specific use cases.  It is a generic
mechanism which demands knowledge and experimentation to configure.
It's more a way for the kernel to cop out and defer figuring out
device characteristics to userland.  If you have a better idea, I'm
all ears.

Thanks.

-- 
tejun


Re: [PATCH 1/3] block: Add iocontext priority to request

2016-10-04 Thread Adam Manzanares
Hello Tejun,

10/02/2016 10:53, Tejun Heo wrote:
> Hello, Adam.
> 
> On Fri, Sep 30, 2016 at 09:02:17AM -0700, Adam Manzanares wrote:
> > I'll start with the changes I made and work my way through a grep of
> > ioprio. Please add or correct any of the assumptions I have made.
> > 
> 
> Well, it looks like you're the one who's most familiar with ioprio
> handling at this point. :)
> 
> > In blk-core, the behavior before the patch is to get the ioprio for the
> > request from the bio. The only references I found to bio_set_prio are in
> > bcache. Both of these references are in low priority operations (gc, bg
> > writeback) so the iopriority of the bio is set to IOPRIO_CLASS_IDLE in
> > these cases.
> > 
> > A kernel thread is used to submit these bios so the ioprio is going to
> > come from the current running task if the iocontext exists. This could be
> > a problem if we have set a task with high priority and some background
> > work ends up getting generated in the bcache layer. I propose that we
> > check if the iopriority of the bio is valid and if so, then we keep the
> > priority from the bio.
> > 
> 
> I wonder whether the right thing to do is adding bio->bi_ioprio which
> is initialized on bio submission and carried through req->ioprio.
> 

I looked around and thought about this and I'm not sure if this will help. 
I dug into the bio submission code and I thought generic_make_request was 
the best place to save the ioprio information. This is quite close in 
the call stack to init_request_from_bio. Bcache sets the bio priority before
the submission, so we would have to check to see if the bio priority was 
valid on bio submission leaving us with the same problem. Leaving the 
priority in the upper bits of bio->bi_rw is fine with me. It may help to 
have the bio->bi_ioprio for clarity, but I think we will still face the 
issue of having to check if this value is set when we submit the bio or 
init the request so I'm leaning towards leaving it as is.
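
As a sketch of the priority-selection rule being discussed here (the function
name is hypothetical; bio_prio(), ioprio_valid(), current->io_context and
req->ioprio are existing kernel interfaces, but this is not the actual patch):
keep a valid priority already set on the bio, otherwise fall back to the
submitting task's io context.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/ioprio.h>
#include <linux/sched.h>

static void rq_set_ioprio_from_bio(struct request *req, struct bio *bio)
{
	struct io_context *ioc = current->io_context;

	if (ioprio_valid(bio_prio(bio)))
		req->ioprio = bio_prio(bio);	/* e.g. bcache gc/writeback marked IDLE */
	else if (ioc)
		req->ioprio = ioc->ioprio;	/* priority of the submitting task */
	else
		req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
}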


> > The second area where I see a potential problem is in the merging code in
> > blk-core when a bio is queued. If there is a request that is mergeable
> > then the merge code takes the highest priority of the bio and the request.
> > This could wipe out the values set by bio_set_prio. I think it would be
> > best to set the request as non mergeable when we see that it is a high
> > priority IO request.
> > 
> 
> The current behavior should be fine for most non-pathological cases
> but I have no objection to not merging ios with differing priorities.
> 
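
A hypothetical sketch of that non-merge rule (not actual block-layer code;
bio_prio(), ioprio_valid() and req_get_ioprio() are existing helpers, the
function itself is made up): refuse the merge whenever both sides carry a
valid but different priority.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/ioprio.h>

static bool ioprio_allow_merge(struct request *rq, struct bio *bio)
{
	/* If either side carries no explicit priority, merging is harmless. */
	if (!ioprio_valid(req_get_ioprio(rq)) || !ioprio_valid(bio_prio(bio)))
		return true;

	/* Otherwise only merge IO of the same priority. */
	return req_get_ioprio(rq) == bio_prio(bio);
}
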
> > The third area that is of interest is the CFQ scheduler: the ioprio is
> > only used in the case of async IO, and I found that the priority is only
> > obtained from the task and not from the request. This leads me to believe
> > that the changes made in blk-core to add the priority to the request will
> > not impact the CFQ scheduler.
> >
> > The fourth area that might be concerning is the drivers. virtio_block
> > copies the request priority into a virtual block request. I am assuming
> > that this eventually makes it to another device driver, so we don't need
> > to worry about this. The null block device driver also uses the ioprio,
> > but this is also not a concern. lightnvm also sets the ioprio to build a
> > request that is passed onto another driver. The last driver that uses the
> > request ioprio is the fusion mptsas driver, and I don't understand how it
> > is using the ioprio. From what I can tell it is taking a request of
> > IOPRIO_CLASS_NONE with data of 0x7 and calling this high priority IO.
> > This could be impacted by the code I have proposed, but I believe the
> > authors intended to treat this particular ioprio value as high priority.
> > The driver will pass the request to the device with high priority if the
> > appropriate ioprio value is seen on the request.
> >
> > The fifth area that I noticed may be impacted is file systems. btrfs uses
> > low priority IO for read ahead. Ext4 uses ioprio for journaling. Both of
> > these are not a problem because the ioprio is set on the task and not on
> > a bio.
> 
> Yeah, looks good to me.  Care to include a brief summary of expected
> (non)impacts in the patch description?
> 
I'm going to send out an updated series of patches summarizing some o

Re: [PATCH V3 00/11] block-throttle: add .high limit

2016-10-04 Thread Vivek Goyal
On Mon, Oct 03, 2016 at 02:20:19PM -0700, Shaohua Li wrote:
> Hi,
> 
> The background is we don't have an ioscheduler for blk-mq yet, so we can't
> prioritize processes/cgroups.

So this is an interim solution till we have ioscheduler for blk-mq?

> This patch set tries to add basic arbitration
> between cgroups with blk-throttle. It adds a new limit io.high for
> blk-throttle. It's only for cgroup2.
> 
> io.max is a hard limit throttling. cgroups with a max limit never dispatch 
> more
> IO than their max limit. While io.high is a best effort throttling. cgroups
> with high limit can run above their high limit at appropriate time.
> Specifically, if all cgroups reach their high limit, all cgroups can run above
> their high limit. If any cgroup runs under its high limit, all other cgroups
> will run according to their high limit.

Hi Shaohua,

I still don't understand why we should not implement a weight based
proportional IO mechanism and how this mechanism is better than proportional IO.

Agreed that we have issues with proportional IO and we don't have good
solutions for these problems. But I can't see how this mechanism
will overcome these problems either.

IIRC, biggest issue with proportional IO was that a low prio group might
fill up the device queue with plenty of IO requests and later when high
prio cgroup comes, it will still experience latencies anyway. And solution
to the problem probably would be to get some awareness in device about 
priority of request and map weights to those priority. That way higher
prio requests get prioritized.

Or run device at lower queue depth. That will improve latencies but might
reduce overall throughput.

Or throttle the number of buffered writes (as Jens's writeback throttling)
patches were doing. Buffered writes seem to be the biggest culprit for
increased latencies and being able to control these should help.

ioprio/weight based proportional IO mechanism is much more generic and
much easier to configure for any kind of storage. io.high is absolute
limit and makes it much harder to configure. One needs to know a lot
about underlying volume/device's bandwidth (which varies a lot anyway
based on workload).

IMHO, we seem to be trying to cater to one specific use case using
this mechanism. Something ioprio/weight based will be much more
generic and we should explore implementing that along with building
notion of ioprio in devices. When these two work together, we might
be able to see good results. Just software mechanism alone might not
be enough.

Vivek


Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-04 Thread Alex Bligh
Wouter,

>>> It is impossible for nbd to make such a guarantee, due to head-of-line
>>> blocking on TCP.
>> 
>> this is perfectly accurate as far as it goes, but this isn't the current
>> NBD definition of 'flush'.
> 
> I didn't read it that way.
> 
>> That is (from the docs):
>> 
>>> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
>>> that the server completes (i.e. replies to) prior to processing to a
>>> NBD_CMD_FLUSH MUST be written to non-volatile storage prior to
>> > replying to that NBD_CMD_FLUSH.
> 
> This is somewhat ambiguous, in that (IMO) it doesn't clearly state the
> point where the cutoff of "may not be on disk yet" is. What is
> "processing"?

OK. I now get the problem. There are actually two types of HOL blocking,
server to client and client to server.

Before, we allowed the server to issue replies out of order with requests.
However, the protocol did guarantee that the server saw requests in the
order presented by clients. With the proposed multi-connection support,
this changes. Whilst the client needed to be prepared for things to be
disordered by the server, the server did not previously need to be
prepared for things being disordered by the client. And (more subtly)
the client could assume that the server got its own requests in the
order it sent them, which is important for flush the way it is written
at the moment.


Here's an actual illustration of the problem:

Currently we have:

 Client                            Server
 ======                            ======

 TX: WRITE
 TX: FLUSH
                                    RX: WRITE
                                    RX: FLUSH
                                    Process write
                                    Process flush including write
                                    TX: write reply
                                    TX: flush reply
 RX: write reply
 RX: flush reply


Currently the RX statements cannot be disordered. However the
server can process the requests in a different order. If it
does, the flush need not include the write, like this:

 Client                            Server
 ======                            ======

 TX: WRITE
 TX: FLUSH
                                    RX: WRITE
                                    RX: FLUSH
                                    Process flush not including write
                                    Process write
                                    TX: flush reply
                                    TX: write reply
 RX: flush reply
 RX: write reply

and the client gets to know of the fact, because the flush
reply comes before the write reply. It can know its data has
not been flushed. It could send another flush in this case, or
simply change its code to not send the flush until the write
reply has been received.

However, with the multi-connection support, both the replies
and the requests can be disordered. So the client can ONLY
know a flush has been completed if it has received a reply
to the write before it sends the flush.

This is in my opinion problematic, as what you want to do as
a client is stream requests (write, write, write, flush, write,
write, write). If those go down different channels, AND you
don't wait for a reply, you can no longer safely stream requests
at all. Now you need to wait for the flush request to respond
before sending another write (if write ordering to the platter
is important), which seems to defeat the object of streaming
commands.
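
To make that concrete, here is a sketch (not actual nbd client code; the
structure and names are made up) of the bookkeeping a multi-connection client
would need if it wants a flush to cover its earlier writes: it has to hold
the flush back until every such write has been acknowledged.

#include <stdbool.h>
#include <stdint.h>

struct mc_client {
	uint64_t writes_sent;	/* write requests issued so far */
	uint64_t writes_acked;	/* write replies received so far */
};

static void write_sent(struct mc_client *c)  { c->writes_sent++; }
static void write_acked(struct mc_client *c) { c->writes_acked++; }

/*
 * A flush is only guaranteed to cover writes whose replies arrived before
 * the flush was sent, so with multiple connections the client must not
 * issue the flush until everything it wants covered has been acked.
 */
static bool flush_would_cover_all_writes(const struct mc_client *c)
{
	return c->writes_acked == c->writes_sent;
}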

An 'in extremis' example would be a sequence of write / flush
requests sent down two channels, where the write requests all
end up on one channel, and the flush requests on the other,
and the write channel is serviced immediately and the flush
requests delayed indefinitely.

> We don't define that, and therefore it could be any point
> between "receipt of the request message" and "sending the reply
> message". I had interpreted it closer to the latter than was apparently
> intended, but that isn't very useful;

The thing is the server doesn't know what replies the client has
received, only the replies it has sent. Equally the server doesn't
know what commands the client has sent, only what commands it has
received.

As currently written, it's a simple rule: NBD_CMD_FLUSH means
"Mr Server: you must make sure that any write you have sent a
reply to must now be persisted on disk. If you haven't yet sent
a reply to a write - perhaps because due to HOL blocking you
haven't received it, or perhaps it's still in progress, or perhaps
it's finished but you haven't sent the reply - don't worry".
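
In server terms that rule is cheap to satisfy; a sketch (not nbd-server code,
the function is made up, fdatasync() is the standard POSIX call) is simply to
flush the backing file before sending the flush reply:

#include <unistd.h>

/* Handle NBD_CMD_FLUSH for an export backed by file descriptor export_fd. */
static int handle_flush(int export_fd)
{
	/*
	 * fdatasync() persists at least every write we have already completed
	 * and replied to, which is exactly what the rule above requires.
	 */
	if (fdatasync(export_fd) < 0)
		return -1;	/* report an error reply to the client */

	/* ...only now send the reply to NBD_CMD_FLUSH... */
	return 0;
}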

The promise to the client is that all the writes to which the
server has sent a reply are now on disk. But the client doesn't
know what replies the server has sent a reply to. It only knows
which replies it has received (which will be a subset of those). So
to the client it means that the server has persisted to disk all
those commands to which it has received a reply. However, to the
server, the 'MUST' condition ne

Re: [PATCH 2/3] zram: support page-based parallel write

2016-10-04 Thread Minchan Kim
Hi Sergey,

On Tue, Oct 04, 2016 at 01:43:14PM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> Cc Jens and block-dev,
> 
> I'll outline the commit message for Jens and blockdev people, may be
> someone will have some thoughts/ideas/opinions:

Thanks for Ccing the relevant people. I didn't even know we had a block-dev
mailing list.

> 
> > On (09/22/16 15:42), Minchan Kim wrote:
> : zram supports stream-based parallel compression. IOW, it can support
> : parallel compression on an SMP system only if each CPU has a stream.
> : For example, assuming a 4 CPU system, there are 4 sources for compressing
> : in the system and each source must be located on a separate CPU for full
> : parallel compression.
> :
> : So, if there is *one* stream in the system, it cannot be compressed
> : in parallel although the system supports multiple CPUs. This patch
> : aims to overcome such a weakness.
> :
> : The idea is to use multiple background threads to compress pages
> : on idle CPUs while the foreground just queues BIOs without blocking,
> : and the other CPUs consume the pages in those BIOs to compress them.
> : It means zram begins to support asynchronous writeback to increase
> : write bandwidth.
> 
> 
> is there any way of addressing this issue? [a silly idea] can we, for
> instance, ask the block layer to split the request and put pages into
> different queues (assuming that we run in blk-mq mode)? because this looks
> like a common issue, and it may be a bit too late to fix it in the zram
> driver. any thoughts?

Hmm, blk-mq works at the request level, not even the bio level. Right?
If so, I have a concern about that. Zram as swap storage has worked with
rw_page to avoid bio allocation, whose cost was not small, and I heard from
product people that it was a great enhancement on very memory-poor devices.
I didn't follow up at that time but I guess it was due to waiting for free
memory from the mempool. If blk-mq works at the request level, should we
abandon the rw_page approach?

> 
> 
> [..]
> > Could you retest the benchmark without direct IO? Instead of dio,
> > I used fsync_on_close to flush buffered IO.
> > 
> > DIO isn't a normal workload compared to buffered IO. The reason I used DIO
> > for the zram benchmark was that it's handy to transfer IO to the block
> > layer effectively, with no noise from the page cache.
> > If we use buffered IO, the result would be fake as dirty pages would just
> > be queued in the page cache without any flushing.
> > I think you already know this very well so no need to explain any more. :)
> > 
> > The more important thing is that current zram is poor at parallel IO.
> > Let's think of two use cases: zram-swap and zram-fs.
> 
> well, that's why we use direct=1 in benchmarks - to test the performance
> of zram; not anything else. but I can run fsync_on_close=1 tests as well
> (see later).
> 
> 
> > 1) zram-swap
> > 
> > parallel IO can be done only where every CPU have reclaim context.
> > IOW,
> > 
> > 1. kswapd on CPU 0
> > 2. A process direct reclaim on CPU 1
> > 3. process direct reclaim on CPU 2
> > 4. process direct reclaim on CPU 3
> >
> > I don't think it's usual workload. Most of time, a kswapd and a process
> > direct reclaim in embedded platform workload. The point is we can not
> > use full bandwidth.
> 
> hm. but we are on an SMP system and at least one process had to start
> direct reclaim, which basically increases the chances of direct reclaims
> from other CPUs, should running processes there request big enough
> memory allocations. I really see no reason to rule this possibility
> out.

I didn't rule out the possibility. It is just one scenario where we can use
the full bandwidth. However, there are many scenarios where we cannot use the
full bandwidth.

Imagine that kswapd woke up and a process entered direct reclaim.
The process can get memory easily thanks to the watermark play and go on with
the memory, but kswapd alone should continue to reclaim memory until the high
watermark. It means there is no process in direct reclaim.

Additionally, with your scenario, we can use just 2 CPUs for compression and
we rely on luck: the hope that (1) other processes on (2) different CPUs will
(3) allocate memory soon and (4) the VM will reclaim anonymous memory, not
page cache. It needs many assumptions to use the full bandwidth.

However, with the current approach, we can use the full bandwidth
unconditionally once the VM decides to reclaim anonymous memory.
So, I think it's the better approach.
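
As an illustration of the queue-plus-workers pattern described in the quoted
commit message (a user-space pthread sketch of the pattern only, not the zram
patch itself): the foreground queues pages and returns, while per-CPU
background workers pull and compress them.

#include <pthread.h>
#include <stdlib.h>

struct page_work {
	void *page;
	struct page_work *next;
};

static struct page_work *pending;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  more = PTHREAD_COND_INITIALIZER;

/* Foreground: queue a page and return immediately (asynchronous writeback). */
static void queue_page(void *page)
{
	struct page_work *w = malloc(sizeof(*w));

	if (!w)
		return;
	w->page = page;
	pthread_mutex_lock(&lock);
	w->next = pending;
	pending = w;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
}

/* Background worker, one per CPU: pull queued pages and compress them. */
static void *compress_worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!pending)
			pthread_cond_wait(&more, &lock);
		struct page_work *w = pending;
		pending = w->next;
		pthread_mutex_unlock(&lock);

		/* compress w->page here, then write the result out */
		free(w);
	}
	return NULL;
}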

> 
> 
> > 2) zram-fs
> > 
> > Currently, there is one work item per bdi. So, without fsync (and friends),
> > every IO submission would be done via that work on the worker thread.
> > It means the IO couldn't be parallelized. However, if we use fsync,
> > it could be parallelized, but it depends on the sync granularity.
> > For example, if your test application uses fsync directly, the IO
> > would be done in the CPU context your application is running on. So,
> > if you have 4 test applications, every CPU could be utilized.
> > However, if your test application doesn't use fsync directly and the
> > parent process calls sync after every test child application, the
> > IO could be done 2