Re: Read starvation by sync writes

2013-01-23 Thread Jeff Moyer
Jens Axboe  writes:

>>> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
>>> like buffered writes, so it eats up a queue of requests very easily. On
>>> the allocation side, the assumption is that WRITE_SYNC behaves like
>>> dependent reads. Similar to a dd with oflag=direct, not like a flood of
>>> requests. For dependent sync writes, our current behaviour is fine, we
>>> treat them like reads. For commits of WRITE_SYNC, they should be treated
>>> like async WRITE instead.
>>   Yeah. But it's similar to what happens when you run fsync() on a large
>> dirty file. That will also submit a lot of WRITE_SYNC requests... kjournald
>> could probably use WRITE instead of WRITE_SYNC for large commits. It's just
>> that we don't really want to give e.g. DIO a preference over kjournald
>> because transaction commit can effectively block any metadata changes on
>> the filesystem.
>
> Sure, I'm not advocating against changing WRITE_SYNC, we just need to be
> able to handle it a bit better. I've got a test patch, will post it
> later.

Jens, did you ever post your test patch?

-Jeff


Re: Read starvation by sync writes

2012-12-13 Thread Jens Axboe
On 2012-12-13 16:02, Jan Kara wrote:
> On Thu 13-12-12 14:30:42, Jens Axboe wrote:
>> On 2012-12-12 20:41, Jeff Moyer wrote:
>>> Jeff Moyer  writes:
>>>
>>>>> I agree. This isn't about scheduling, we haven't even reached that part
>>>>> yet. Back when we split the queues into read vs write, this problem
>>>>> obviously wasn't there. Now we have sync writes and reads, both eating
>>>>> from the same pool. The io scheduler can impact this a bit by forcing
>>>>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
>>>>> does this when it's expecting a request from this process queue.
>>>>>
>>>>> Back in the day, we used to have one list. To avoid a similar problem,
>>>>> we reserved the top of the list for reads. With the batching, it's a bit
>>>>> more complicated. If we make the request allocation (just that, not the
>>>>> scheduling) be read vs write instead of sync vs async, then we have the
>>>>> same issue for sync vs buffered writes.
>>>>>
>>>>> How about something like the below? Due to the nature of sync reads, we
>>>>> should allow a much longer timeout. The batch is really tailored towards
>>>>> writes at the moment. Also shrink the batch count, 32 is pretty large...
>>>>
>>>> Does batching even make sense for dependent reads?  I don't think it
>>>> does.
>>>
>>> Having just read the batching code in detail, I'd like to amend this
>>> misguided comment.  Batching logic kicks in when you happen to be lucky
>>> enough to use up the last request.  As such, I'd be surprised if the
>>> patch you posted helped.  Jens, don't you think the writer is way more
>>> likely to become the batcher?  I do agree with shrinking the batch count
>>> to 16, whether or not the rest of the patch goes in.
>>>
>>>>  Assuming you disagree, then you'll have to justify that fixed
>>>> time value of 2 seconds.  The amount of time between dependent reads
>>>> will vary depending on other I/O sent to the device, the properties of
>>>> the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
>>>> there, please comment it.  Maybe it's time we started keeping track of
>>>> worst case Q->C time?  That could be used to tell worst case latency,
>>>> and adjust magic timeouts like this one.
>>>>
>>>> I'm still thinking about how we might solve this in a cleaner way.
>>>
>>> The way things stand today, you can do a complete end run around the I/O
>>> scheduler by queueing up enough I/O.  To address that, I think we need
>>> to move to a request list per io_context as Jan had suggested.  That
>>> way, we can keep the logic about who gets to submit I/O when in one
>>> place.
>>>
>>> Jens, what do you think?
>>
>> I think that is pretty extreme. We have way too much accounting around
>> this already, and I'd rather just limit the batching than make
>> per-ioc request lists too.
>>
>> I agree the batch addition isn't super useful for the reads. It really
>> is mostly a writer thing, and the timing reflects that.
>>
>> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
>> like buffered writes, so it eats up a queue of requests very easily. On
>> the allocation side, the assumption is that WRITE_SYNC behaves like
>> dependent reads. Similar to a dd with oflag=direct, not like a flood of
>> requests. For dependent sync writes, our current behaviour is fine, we
>> treat them like reads. For commits of WRITE_SYNC, they should be treated
>> like async WRITE instead.
>   Yeah. But it's similar to what happens when you run fsync() on a large
> dirty file. That will also submit a lot of WRITE_SYNC requests... kjournald
> could probably use WRITE instead of WRITE_SYNC for large commits. It's just
> that we don't really want to give e.g. DIO a preference over kjournald
> because transaction commit can effectively block any metadata changes on
> the filesystem.

Sure, I'm not advocating against changing WRITE_SYNC, we just need to be
able to handle it a bit better. I've got a test patch, will post it
later.

-- 
Jens Axboe



Re: Read starvation by sync writes

2012-12-13 Thread Jan Kara
On Thu 13-12-12 14:30:42, Jens Axboe wrote:
> On 2012-12-12 20:41, Jeff Moyer wrote:
> > Jeff Moyer  writes:
> > 
> >>> I agree. This isn't about scheduling, we haven't even reached that part
> >>> yet. Back when we split the queues into read vs write, this problem
> >>> obviously wasn't there. Now we have sync writes and reads, both eating
> >>> from the same pool. The io scheduler can impact this a bit by forcing
> >>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
> >>> does this when it's expecting a request from this process queue.
> >>>
> >>> Back in the day, we used to have one list. To avoid a similar problem,
> >>> we reserved the top of the list for reads. With the batching, it's a bit
> >>> more complicated. If we make the request allocation (just that, not the
> >>> scheduling) be read vs write instead of sync vs async, then we have the
> >>> same issue for sync vs buffered writes.
> >>>
> >>> How about something like the below? Due to the nature of sync reads, we
> >>> should allow a much longer timeout. The batch is really tailored towards
> >>> writes at the moment. Also shrink the batch count, 32 is pretty large...
> >>
> >> Does batching even make sense for dependent reads?  I don't think it
> >> does.
> > 
> > Having just read the batching code in detail, I'd like to amend this
> > misguided comment.  Batching logic kicks in when you happen to be lucky
> > enough to use up the last request.  As such, I'd be surprised if the
> > patch you posted helped.  Jens, don't you think the writer is way more
> > likely to become the batcher?  I do agree with shrinking the batch count
> > to 16, whether or not the rest of the patch goes in.
> > 
> >>  Assuming you disagree, then you'll have to justify that fixed
> >> time value of 2 seconds.  The amount of time between dependent reads
> >> will vary depending on other I/O sent to the device, the properties of
> >> the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
> >> there, please comment it.  Maybe it's time we started keeping track of
> >> worst case Q->C time?  That could be used to tell worst case latency,
> >> and adjust magic timeouts like this one.
> >>
> >> I'm still thinking about how we might solve this in a cleaner way.
> > 
> > The way things stand today, you can do a complete end run around the I/O
> > scheduler by queueing up enough I/O.  To address that, I think we need
> > to move to a request list per io_context as Jan had suggested.  That
> > way, we can keep the logic about who gets to submit I/O when in one
> > place.
> > 
> > Jens, what do you think?
> 
> I think that is pretty extreme. We have way too much accounting around
> this already, and I'd rather just limit the batching than make
> per-ioc request lists too.
> 
> I agree the batch addition isn't super useful for the reads. It really
> is mostly a writer thing, and the timing reflects that.
> 
> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
> like buffered writes, so it eats up a queue of requests very easily. On
> the allocation side, the assumption is that WRITE_SYNC behaves like
> dependent reads. Similar to a dd with oflag=direct, not like a flood of
> requests. For dependent sync writes, our current behaviour is fine, we
> treat them like reads. For commits of WRITE_SYNC, they should be treated
> like async WRITE instead.
  Yeah. But it's similar to what happens when you run fsync() on a large
dirty file. That will also submit a lot of WRITE_SYNC requests... kjournald
could probably use WRITE instead of WRITE_SYNC for large commits. It's just
that we don't really want to give e.g. DIO a preference over kjournald
because transaction commit can effectively block any metadata changes on
the filesystem.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Read starvation by sync writes

2012-12-13 Thread Jeff Moyer
Jens Axboe  writes:

> On 2012-12-12 20:41, Jeff Moyer wrote:
>> Jeff Moyer  writes:
>> 
>>>> I agree. This isn't about scheduling, we haven't even reached that part
>>>> yet. Back when we split the queues into read vs write, this problem
>>>> obviously wasn't there. Now we have sync writes and reads, both eating
>>>> from the same pool. The io scheduler can impact this a bit by forcing
>>>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
>>>> does this when it's expecting a request from this process queue.
>>>>
>>>> Back in the day, we used to have one list. To avoid a similar problem,
>>>> we reserved the top of the list for reads. With the batching, it's a bit
>>>> more complicated. If we make the request allocation (just that, not the
>>>> scheduling) be read vs write instead of sync vs async, then we have the
>>>> same issue for sync vs buffered writes.
>>>>
>>>> How about something like the below? Due to the nature of sync reads, we
>>>> should allow a much longer timeout. The batch is really tailored towards
>>>> writes at the moment. Also shrink the batch count, 32 is pretty large...
>>>
>>> Does batching even make sense for dependent reads?  I don't think it
>>> does.
>> 
>> Having just read the batching code in detail, I'd like to amend this
>> misguided comment.  Batching logic kicks in when you happen to be lucky
>> enough to use up the last request.  As such, I'd be surprised if the
>> patch you posted helped.  Jens, don't you think the writer is way more
>> likely to become the batcher?  I do agree with shrinking the batch count
>> to 16, whether or not the rest of the patch goes in.
>> 
>>>  Assuming you disagree, then you'll have to justify that fixed
>>> time value of 2 seconds.  The amount of time between dependent reads
>>> will vary depending on other I/O sent to the device, the properties of
>>> the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
>>> there, please comment it.  Maybe it's time we started keeping track of
>>> worst case Q->C time?  That could be used to tell worst case latency,
>>> and adjust magic timeouts like this one.
>>>
>>> I'm still thinking about how we might solve this in a cleaner way.
>> 
>> The way things stand today, you can do a complete end run around the I/O
>> scheduler by queueing up enough I/O.  To address that, I think we need
>> to move to a request list per io_context as Jan had suggested.  That
>> way, we can keep the logic about who gets to submit I/O when in one
>> place.
>> 
>> Jens, what do you think?
>
> I think that is pretty extreme. We have way too much accounting around
> this already, and I'd rather just limit the batching than make
> per-ioc request lists too.

I'm not sure I understand your comment about accounting.  I don't think
it would add overhead to move to per-ioc request lists.  Note that, if
we did move to per-ioc request lists, we could yank out the blk cgroup
implementation of same.

> I agree the batch addition isn't super useful for the reads. It really
> is mostly a writer thing, and the timing reflects that.
>
> The problem is really that the WRITE_SYNC is (for Jan's case) behaving
> like buffered writes, so it eats up a queue of requests very easily. On
> the allocation side, the assumption is that WRITE_SYNC behaves like
> dependent reads. Similar to a dd with oflag=direct, not like a flood of
> requests. For dependent sync writes, our current behaviour is fine, we
> treat them like reads. For commits of WRITE_SYNC, they should be treated
> like async WRITE instead.

What are you suggesting?  It sounds as though you might be suggesting
that WRITE_SYNCs are allocated from the async request list, but treated
as sync requests in the I/O scheduler.  Oh, but only for this case of
streaming write syncs.  How did you want to detect that?  In the
caller?  Tracking information in the ioc?

Clear as mud.  ;-)

-Jeff


Re: Read starvation by sync writes

2012-12-13 Thread Jens Axboe
On 2012-12-12 20:41, Jeff Moyer wrote:
> Jeff Moyer  writes:
> 
>>> I agree. This isn't about scheduling, we haven't even reached that part
>>> yet. Back when we split the queues into read vs write, this problem
>>> obviously wasn't there. Now we have sync writes and reads, both eating
>>> from the same pool. The io scheduler can impact this a bit by forcing
>>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
>>> does this when it's expecting a request from this process queue.
>>>
>>> Back in the day, we used to have one list. To avoid a similar problem,
>>> we reserved the top of the list for reads. With the batching, it's a bit
>>> more complicated. If we make the request allocation (just that, not the
>>> scheduling) be read vs write instead of sync vs async, then we have the
>>> same issue for sync vs buffered writes.
>>>
>>> How about something like the below? Due to the nature of sync reads, we
>>> should allow a much longer timeout. The batch is really tailored towards
>>> writes at the moment. Also shrink the batch count, 32 is pretty large...
>>
>> Does batching even make sense for dependent reads?  I don't think it
>> does.
> 
> Having just read the batching code in detail, I'd like to amend this
> misguided comment.  Batching logic kicks in when you happen to be lucky
> enough to use up the last request.  As such, I'd be surprised if the
> patch you posted helped.  Jens, don't you think the writer is way more
> likely to become the batcher?  I do agree with shrinking the batch count
> to 16, whether or not the rest of the patch goes in.
> 
>>  Assuming you disagree, then you'll have to justify that fixed
>> time value of 2 seconds.  The amount of time between dependent reads
>> will vary depending on other I/O sent to the device, the properties of
>> the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
>> there, please comment it.  Maybe it's time we started keeping track of
>> worst case Q->C time?  That could be used to tell worst case latency,
>> and adjust magic timeouts like this one.
>>
>> I'm still thinking about how we might solve this in a cleaner way.
> 
> The way things stand today, you can do a complete end run around the I/O
> scheduler by queueing up enough I/O.  To address that, I think we need
> to move to a request list per io_context as Jan had suggested.  That
> way, we can keep the logic about who gets to submit I/O when in one
> place.
> 
> Jens, what do you think?

I think that is pretty extreme. We have way too much accounting around
this already, and I'd rather just limit the batching than make
per-ioc request lists too.

I agree the batch addition isn't super useful for the reads. It really
is mostly a writer thing, and the timing reflects that.

The problem is really that the WRITE_SYNC is (for Jan's case) behaving
like buffered writes, so it eats up a queue of requests very easily. On
the allocation side, the assumption is that WRITE_SYNC behaves like
dependent reads. Similar to a dd with oflag=direct, not like a flood of
requests. For dependent sync writes, our current behaviour is fine, we
treat them like reads. For commits of WRITE_SYNC, they should be treated
like async WRITE instead.
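
For reference, the allocation-side classification referred to above comes down
to whether a request is a read or a write flagged as sync. The standalone
sketch below illustrates that split; the flag values and the small driver are
illustrative stand-ins, not the kernel's definitions.

/* Standalone illustration (not kernel source) of the sync/async split
 * on the request allocation side: reads and sync-flagged writes are
 * charged to the "sync" half of the request pool, plain buffered
 * writes to the "async" half. Flag bits below are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define REQ_WRITE_ (1u << 0)   /* stand-in for a "this is a write" flag */
#define REQ_SYNC_  (1u << 1)   /* stand-in for a "this write is sync" flag */

static bool rw_is_sync(unsigned int rw_flags)
{
        /* sync == a read, or a write explicitly marked sync */
        return !(rw_flags & REQ_WRITE_) || (rw_flags & REQ_SYNC_);
}

int main(void)
{
        printf("READ       -> %s pool\n", rw_is_sync(0) ? "sync" : "async");
        printf("WRITE      -> %s pool\n",
               rw_is_sync(REQ_WRITE_) ? "sync" : "async");
        printf("WRITE_SYNC -> %s pool\n",
               rw_is_sync(REQ_WRITE_ | REQ_SYNC_) ? "sync" : "async");
        return 0;
}

This is why a flood of WRITE_SYNC requests competes directly with reads for
the same request pool, which is the starvation being discussed.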

-- 
Jens Axboe



Re: Read starvation by sync writes

2012-12-13 Thread Jan Kara
On Wed 12-12-12 14:41:13, Jeff Moyer wrote:
> Jeff Moyer  writes:
> 
> >> I agree. This isn't about scheduling, we haven't even reached that part
> >> yet. Back when we split the queues into read vs write, this problem
> >> obviously wasn't there. Now we have sync writes and reads, both eating
> >> from the same pool. The io scheduler can impact this a bit by forcing
> >> reads to must allocate (Jan, which io scheduler are you using?). CFQ
> >> does this when it's expecting a request from this process queue.
> >>
> >> Back in the day, we used to have one list. To avoid a similar problem,
> >> we reserved the top of the list for reads. With the batching, it's a bit
> >> more complicated. If we make the request allocation (just that, not the
> >> scheduling) be read vs write instead of sync vs async, then we have the
> >> same issue for sync vs buffered writes.
> >>
> >> How about something like the below? Due to the nature of sync reads, we
> >> should allow a much longer timeout. The batch is really tailored towards
> >> writes at the moment. Also shrink the batch count, 32 is pretty large...
> >
> > Does batching even make sense for dependent reads?  I don't think it
> > does.
> 
> Having just read the batching code in detail, I'd like to amend this
> misguided comment.  Batching logic kicks in when you happen to be lucky
> enough to use up the last request.  As such, I'd be surprised if the
> patch you posted helped.  Jens, don't you think the writer is way more
> likely to become the batcher?  I do agree with shrinking the batch count
> to 16, whether or not the rest of the patch goes in.
  Well, batching logic also triggers unconditionally after you waited for
a request...
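
For context, the batching rule under discussion works roughly like this: a task
that had to sleep for a free request is marked as batching and may then take up
to its batch quota of requests without sleeping again, as long as it stays
inside a short time window. The toy model below mirrors that rule; it is not
the kernel's ioc_batching()/ioc_set_batching(), and the constants are
illustrative (the real BLK_BATCH_TIME window is a fraction of a second).

/* Toy model (not kernel source) of the request-allocation batching
 * rule: after sleeping for a request, a task gets a batch quota and a
 * time window during which further allocations do not sleep. */
#include <stdbool.h>
#include <time.h>

#define BATCH_REQUESTS 32   /* the "32 is pretty large" batch count */
#define BATCH_TIME_SEC 1    /* illustrative stand-in for BLK_BATCH_TIME */

struct ioc_model {
        int    nr_batch_requests;   /* allocations left in the batch */
        time_t last_waited;         /* when the task last slept for a request */
};

/* Called once the task has waited for (and been granted) a request. */
static void set_batching(struct ioc_model *ioc)
{
        ioc->nr_batch_requests = BATCH_REQUESTS;
        ioc->last_waited = time(NULL);
}

/* May this task keep allocating without sleeping again?  True right
 * after the batch was granted, or while both quota and time remain;
 * the allocation path decrements nr_batch_requests on each request. */
static bool is_batching(const struct ioc_model *ioc)
{
        return ioc->nr_batch_requests == BATCH_REQUESTS ||
               (ioc->nr_batch_requests > 0 &&
                time(NULL) < ioc->last_waited + BATCH_TIME_SEC);
}

A streaming writer that keeps hitting the full queue re-arms this state on
every wait, while a reader issuing one small dependent read at a time rarely
gets to use its window, which is the asymmetry described in this thread.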

> >  Assuming you disagree, then you'll have to justify that fixed
> > time value of 2 seconds.  The amount of time between dependent reads
> > will vary depending on other I/O sent to the device, the properties of
> > the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
> > there, please comment it.  Maybe it's time we started keeping track of
> > worst case Q->C time?  That could be used to tell worst case latency,
> > and adjust magic timeouts like this one.
> >
> > I'm still thinking about how we might solve this in a cleaner way.
> 
> The way things stand today, you can do a complete end run around the I/O
> scheduler by queueing up enough I/O.  To address that, I think we need
> to move to a request list per io_context as Jan had suggested.  That
> way, we can keep the logic about who gets to submit I/O when in one
> place.
> 
> Jens, what do you think?
> 
> Jan, for now, try bumping nr_requests up really high.  ;-)
  Good idea. I tried bumping nr_requests to 10 (so one can queue 50 GB
of streaming writes - one transaction can carry several hundred MB) and
reader can now progress at a reasonable speed. So it indeed worked.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Read starvation by sync writes

2012-12-13 Thread Jan Kara
On Thu 13-12-12 09:43:31, Shaohua Li wrote:
> 2012/12/12 Jan Kara :
> > On Wed 12-12-12 10:55:15, Shaohua Li wrote:
> >> 2012/12/11 Jan Kara :
> >> >   Hi,
> >> >
> >> >   I was looking into IO starvation problems where streaming sync writes (in
> >> > my case from kjournald but DIO would look the same) starve reads. This is
> >> > because reads happen in small chunks and until a request completes we don't
> >> > start reading further (reader reads lots of small files) while writers have
> >> > plenty of big requests to submit. Both processes end up fighting for IO
> >> > requests and writer writes nr_batching 512 KB requests while reader reads
> >> > just one 4 KB request or so. Here the effect is magnified by the fact that
> >> > the drive has relatively big queue depth so it usually takes longer than
> >> > BLK_BATCH_TIME to complete the read request. The net result is it takes
> >> > close to two minutes to read files that can be read under a second without
> >> > writer load. Without the big drive's queue depth, results are not ideal but
> >> > they are bearable - it takes about 20 seconds to do the reading. And for
> >> > comparison, when writer and reader are not competing for IO requests (as it
> >> > happens when writes are submitted as async), it takes about 2 seconds to
> >> > complete reading.
> >> >
> >> > Simple reproducer is:
> >> >
> >> > echo 3 >/proc/sys/vm/drop_caches
> >> > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> >> > sleep 30
> >> > time cat /etc/* 2>&1 >/dev/null
> >> > killall dd
> >> > rm /tmp/f
> >> >
> >> >   The question is how can we fix this? Two quick hacks that come to my mind
> >> > are remove timeout from the batching logic (is it that important?) or
> >> > further separate request allocation logic so that reads have their own
> >> > request pool. More systematic fix would be to change request allocation
> >> > logic to always allow at least a fixed number of requests per IOC. What do
> >> > people think about this?
> >>
> >> As long as queue depth > workload iodepth, there is little we can do
> >> to prioritize tasks/IOC. Because throttling a task/IOC means queue
> >> will be idle. We don't want to idle a queue (especially for SSD), so
> >> we always push as many requests as possible to the queue, which
> >> will break any prioritization. As far as I know we always have such
> >> issue in CFQ for big queue depth disk.
> >   Yes, I understand that. But actually big queue depth on its own doesn't
> > make the problem really bad (at least for me). When the reader doesn't have
> > to wait for free IO requests, it progresses at a reasonable speed. What
> > makes it really bad is that big queue depth effectively disallows any use
> > of ioc_batching() mode for the reader and thus it blocks in request
> > allocation for every single read request unlike writer which always uses
> > its full batch (32 requests).
> 
> This can't explain why setting queue depth 1 makes the performance
> better.
  It does, when queue depth is small, reads are completed faster so reader
is able to submit more reads during one ioc_batching() period.

> In that case, writes still get that number of requests, read will
> wait for a request. Anyway, try setting nr_requests to a big number
> and check if performance is different.
  I have checked. Setting nr_requests to 10 makes reader proceed at a
reasonable speed.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Read starvation by sync writes

2012-12-12 Thread Shaohua Li
2012/12/12 Jan Kara :
> On Wed 12-12-12 10:55:15, Shaohua Li wrote:
>> 2012/12/11 Jan Kara :
>> >   Hi,
>> >
>> >   I was looking into IO starvation problems where streaming sync writes (in
>> > my case from kjournald but DIO would look the same) starve reads. This is
>> > because reads happen in small chunks and until a request completes we don't
>> > start reading further (reader reads lots of small files) while writers have
>> > plenty of big requests to submit. Both processes end up fighting for IO
>> > requests and writer writes nr_batching 512 KB requests while reader reads
>> > just one 4 KB request or so. Here the effect is magnified by the fact that
>> > the drive has relatively big queue depth so it usually takes longer than
>> > BLK_BATCH_TIME to complete the read request. The net result is it takes
>> > close to two minutes to read files that can be read under a second without
>> > writer load. Without the big drive's queue depth, results are not ideal but
>> > they are bearable - it takes about 20 seconds to do the reading. And for
>> > comparison, when writer and reader are not competing for IO requests (as it
>> > happens when writes are submitted as async), it takes about 2 seconds to
>> > complete reading.
>> >
>> > Simple reproducer is:
>> >
>> > echo 3 >/proc/sys/vm/drop_caches
>> > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
>> > sleep 30
>> > time cat /etc/* 2>&1 >/dev/null
>> > killall dd
>> > rm /tmp/f
>> >
>> >   The question is how can we fix this? Two quick hacks that come to my mind
>> > are remove timeout from the batching logic (is it that important?) or
>> > further separate request allocation logic so that reads have their own
>> > request pool. More systematic fix would be to change request allocation
>> > logic to always allow at least a fixed number of requests per IOC. What do
>> > people think about this?
>>
>> As long as queue depth > workload iodepth, there is little we can do
>> to prioritize tasks/IOC. Because throttling a task/IOC means queue
>> will be idle. We don't want to idle a queue (especially for SSD), so
>> we always push as many requests as possible to the queue, which
>> will break any prioritization. As far as I know we always have such
>> issue in CFQ for big queue depth disk.
>   Yes, I understand that. But actually big queue depth on its own doesn't
> make the problem really bad (at least for me). When the reader doesn't have
> to wait for free IO requests, it progresses at a reasonable speed. What
> makes it really bad is that big queue depth effectively disallows any use
> of ioc_batching() mode for the reader and thus it blocks in request
> allocation for every single read request unlike writer which always uses
> its full batch (32 requests).

This can't explain why setting queue depth 1 makes the performance
better. In that case, writes still get that number of requests, read will
wait for a request. Anyway, try setting nr_requests to a big number
and check if performance is different.

Thanks,
Shaohua


Re: Read starvation by sync writes

2012-12-12 Thread Dave Chinner
On Wed, Dec 12, 2012 at 11:26:17AM +0100, Jan Kara wrote:
> On Wed 12-12-12 15:18:21, Dave Chinner wrote:
> > On Wed, Dec 12, 2012 at 03:31:37AM +0100, Jan Kara wrote:
> > > On Tue 11-12-12 16:44:15, Jeff Moyer wrote:
> > > > Jan Kara  writes:
> > > > 
> > > > >   Hi,
> > > > >
> > > > >   I was looking into IO starvation problems where streaming sync writes (in
> > > > > my case from kjournald but DIO would look the same) starve reads. This is
> > > > > because reads happen in small chunks and until a request completes we don't
> > > > > start reading further (reader reads lots of small files) while writers have
> > > > > plenty of big requests to submit. Both processes end up fighting for IO
> > > > > requests and writer writes nr_batching 512 KB requests while reader reads
> > > > > just one 4 KB request or so. Here the effect is magnified by the fact that
> > > > > the drive has relatively big queue depth so it usually takes longer than
> > > > > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > > > > close to two minutes to read files that can be read under a second without
> > > > > writer load. Without the big drive's queue depth, results are not ideal but
> > > > > they are bearable - it takes about 20 seconds to do the reading. And for
> > > > > comparison, when writer and reader are not competing for IO requests (as it
> > > > > happens when writes are submitted as async), it takes about 2 seconds to
> > > > > complete reading.
> > > > >
> > > > > Simple reproducer is:
> > > > >
> > > > > echo 3 >/proc/sys/vm/drop_caches
> > > > > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > > > > sleep 30
> > > > > time cat /etc/* 2>&1 >/dev/null
> > > > > killall dd
> > > > > rm /tmp/f
> > > > 
> > > > This is a buffered writer.  How does it end up that you are doing all
> > > > synchronous write I/O?  Also, you forgot to mention what file system you
> > > > were using, and which I/O scheduler.
> > >   So IO scheduler is CFQ, filesystem is ext3 - which is the culprit why IO
> > > ends up being synchronous - in ext3 in data=ordered mode kjournald often ends
> > > up submitting all the data to disk and it can do it as WRITE_SYNC if someone is
> > > waiting for transaction commit. In theory this can happen with AIO DIO
> > > writes or someone running fsync on a big file as well. Although when I
> > > tried this now, I wasn't able to create as big problem as kjournald does
> > > (a kernel thread submitting huge linked list of buffer heads in a tight loop
> > > is hard to beat ;). Hum, so maybe just adding some workaround in kjournald
> > > so that it's not as aggressive will solve the real world cases as well...
> > 
> > Maybe kjournald shouldn't be using WRITE_SYNC for those buffers? I
> > mean, if there is that many of them then it's really a batch
> > submission and the latency of a single buffer IO is really
> > irrelevant to the rate at which the buffers are flushed to disk
>   Yeah, the idea why kjournald uses WRITE_SYNC is that we know someone is
> waiting for transaction commit and that's pretty much definition of what
> WRITE_SYNC means.

Well, XFS only uses WRITE_SYNC for WB_SYNC_ALL writeback, which
means only when a user is waiting on the data writeback will it use
WRITE_SYNC. I'm really not sure what category journal flushes fall
into, because XFS doesn't do data writeback from journal flushes.

> Hum, maybe if DIO wasn't using WRITE_SYNC (one could make similar
> argument there as with kjournald). But then the definition of what
> WRITE_SYNC should mean starts to be pretty foggy.

DIO used WRITE_ODIRECT, not WRITE_SYNC. The difference is that
WRITE_SYNC sets REQ_NOIDLE, so DIO is actually different to
WRITE_SYNC behaviour for CFQ...
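
The flag distinction Dave is pointing at can be written down explicitly. The
sketch below mirrors his description (both are sync writes, but only
WRITE_SYNC carries REQ_NOIDLE); the bit positions and trailing-underscore
names are illustrative stand-ins, not the kernel's header definitions.

/* Illustration (not kernel headers) of the difference described above:
 * WRITE_ODIRECT is a sync write without REQ_NOIDLE, WRITE_SYNC is a
 * sync write with REQ_NOIDLE, and REQ_NOIDLE is what tells CFQ not to
 * idle the queue waiting for more I/O from the same context. */
#define REQ_WRITE_   (1u << 0)
#define REQ_SYNC_    (1u << 1)
#define REQ_NOIDLE_  (1u << 2)

#define WRITE_PLAIN_   (REQ_WRITE_)                            /* async buffered write */
#define WRITE_ODIRECT_ (REQ_WRITE_ | REQ_SYNC_)                /* sync, CFQ may idle   */
#define WRITE_SYNC_    (REQ_WRITE_ | REQ_SYNC_ | REQ_NOIDLE_)  /* sync, no CFQ idling  */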

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Read starvation by sync writes

2012-12-12 Thread Jeff Moyer
Jeff Moyer  writes:

>> I agree. This isn't about scheduling, we haven't even reached that part
>> yet. Back when we split the queues into read vs write, this problem
>> obviously wasn't there. Now we have sync writes and reads, both eating
>> from the same pool. The io scheduler can impact this a bit by forcing
>> reads to must allocate (Jan, which io scheduler are you using?). CFQ
>> does this when it's expecting a request from this process queue.
>>
>> Back in the day, we used to have one list. To avoid a similar problem,
>> we reserved the top of the list for reads. With the batching, it's a bit
>> more complicated. If we make the request allocation (just that, not the
>> scheduling) be read vs write instead of sync vs async, then we have the
>> same issue for sync vs buffered writes.
>>
>> How about something like the below? Due to the nature of sync reads, we
>> should allow a much longer timeout. The batch is really tailored towards
>> writes at the moment. Also shrink the batch count, 32 is pretty large...
>
> Does batching even make sense for dependent reads?  I don't think it
> does.

Having just read the batching code in detail, I'd like to amend this
misguided comment.  Batching logic kicks in when you happen to be lucky
enough to use up the last request.  As such, I'd be surprised if the
patch you posted helped.  Jens, don't you think the writer is way more
likely to become the batcher?  I do agree with shrinking the batch count
to 16, whether or not the rest of the patch goes in.
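
For reference, the check in question sits in __get_request() in
block/blk-core.c. Roughly (simplified and paraphrased from memory, not an
exact excerpt; note that a task that slept in get_request() waiting for a
free request is also marked as a batcher when it wakes up):

	if (rl->count[is_sync] + 1 >= q->nr_requests) {
		if (!blk_rl_full(rl, is_sync)) {
			/* the task that takes the last request becomes the batcher */
			ioc_set_batching(q, ioc);
			blk_set_rl_full(rl, is_sync);
		} else if (may_queue != ELV_MQUEUE_MUST && !ioc_batching(q, ioc)) {
			/* pool full, not a batcher, not exempted by the elevator */
			return NULL;	/* caller goes to sleep on rl->wait[is_sync] */
		}
	}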

>  Assuming you disagree, then you'll have to justify that fixed
> time value of 2 seconds.  The amount of time between dependent reads
> will vary depending on other I/O sent to the device, the properties of
> the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
> there, please comment it.  Maybe it's time we started keeping track of
> worst case Q->C time?  That could be used to tell worst case latency,
> and adjust magic timeouts like this one.
>
> I'm still thinking about how we might solve this in a cleaner way.

The way things stand today, you can do a complete end run around the I/O
scheduler by queueing up enough I/O.  To address that, I think we need
to move to a request list per io_context as Jan had suggested.  That
way, we can keep the logic about who gets to submit I/O when in one
place.
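
Purely as an illustration of that idea (nothing below exists in the tree;
ioc_request_list() and the rl member are made up, and q->root_rl is meant to
be the existing shared per-queue list used as the fallback), the allocation
path would pick its request_list off the io_context instead of the queue:

struct io_context {
	/* ... existing fields ... */
	struct request_list	rl;	/* per-ioc free request accounting */
};

static struct request_list *ioc_request_list(struct request_queue *q,
					     struct io_context *ioc)
{
	/* fall back to the shared per-queue list when there is no ioc */
	return ioc ? &ioc->rl : &q->root_rl;
}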

Jens, what do you think?

Jan, for now, try bumping nr_requests up really high.  ;-)

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-12 Thread Jeff Moyer
Jens Axboe  writes:

> On 2012-12-12 11:11, Jan Kara wrote:
>> On Wed 12-12-12 10:55:15, Shaohua Li wrote:
>>> 2012/12/11 Jan Kara :
   Hi,

   I was looking into IO starvation problems where streaming sync writes (in
 my case from kjournald but DIO would look the same) starve reads. This is
 because reads happen in small chunks and until a request completes we don't
 start reading further (reader reads lots of small files) while writers have
 plenty of big requests to submit. Both processes end up fighting for IO
 requests and writer writes nr_batching 512 KB requests while reader reads
 just one 4 KB request or so. Here the effect is magnified by the fact that
 the drive has relatively big queue depth so it usually takes longer than
 BLK_BATCH_TIME to complete the read request. The net result is it takes
 close to two minutes to read files that can be read under a second without
 writer load. Without the big drive's queue depth, results are not ideal but
 they are bearable - it takes about 20 seconds to do the reading. And for
 comparison, when writer and reader are not competing for IO requests (as it
 happens when writes are submitted as async), it takes about 2 seconds to
 complete reading.

 Simple reproducer is:

 echo 3 >/proc/sys/vm/drop_caches
 dd if=/dev/zero of=/tmp/f bs=1M count=1 &
 sleep 30
 time cat /etc/* 2>&1 >/dev/null
 killall dd
 rm /tmp/f

   The question is how can we fix this? Two quick hacks that come to my mind
 are remove timeout from the batching logic (is it that important?) or
 further separate request allocation logic so that reads have their own
 request pool. More systematic fix would be to change request allocation
 logic to always allow at least a fixed number of requests per IOC. What do
 people think about this?
>>>
>>> As long as queue depth > workload iodepth, there is little we can do
>>> to prioritize tasks/IOC. Because throttling a task/IOC means queue
>>> will be idle. We don't want to idle a queue (especially for SSD), so
>>> we always push as more requests as possible to the queue, which
>>> will break any prioritization. As far as I know we always have such
>>> issue in CFQ for big queue depth disk.
>>   Yes, I understand that. But actually big queue depth on its own doesn't
>> make the problem really bad (at least for me). When the reader doesn't have
>> to wait for free IO requests, it progresses at a reasonable speed. What
>> makes it really bad is that big queue depth effectively disallows any use
>> of ioc_batching() mode for the reader and thus it blocks in request
>> allocation for every single read request unlike writer which always uses
>> its full batch (32 requests).
>
> I agree. This isn't about scheduling, we haven't even reached that part
> yet. Back when we split the queues into read vs write, this problem
> obviously wasn't there. Now we have sync writes and reads, both eating
> from the same pool. The io scheduler can impact this a bit by forcing
> reads to must allocate (Jan, which io scheduler are you using?). CFQ
> does this when it's expecting a request from this process queue.
>
> Back in the day, we used to have one list. To avoid a similar problem,
> we reserved the top of the list for reads. With the batching, it's a bit
> more complicated. If we make the request allocation (just that, not the
> scheduling) be read vs write instead of sync vs async, then we have the
> same issue for sync vs buffered writes.
>
> How about something like the below? Due to the nature of sync reads, we
> should allow a much longer timeout. The batch is really tailored towards
> writes at the moment. Also shrink the batch count, 32 is pretty large...

Does batching even make sense for dependent reads?  I don't think it
does.  Assuming you disagree, then you'll have to justify that fixed
time value of 2 seconds.  The amount of time between dependent reads
will vary depending on other I/O sent to the device, the properties of
the device, the I/O scheduler, and so on.  If you do stick 2 seconds in
there, please comment it.  Maybe it's time we started keeping track of
worst case Q->C time?  That could be used to tell worst case latency,
and adjust magic timeouts like this one.
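
Purely illustrative (none of this exists; worst_qc_jiffies would be a new,
made-up field on the request_queue), tracking worst-case Q->C could be as
simple as taking a max at completion time:

static void blk_note_qc_time(struct request_queue *q, struct request *rq)
{
	/* rq->start_time is the jiffies stamp taken when the request was set up */
	unsigned long qc = jiffies - rq->start_time;

	if (qc > q->worst_qc_jiffies)
		q->worst_qc_jiffies = qc;	/* hypothetical new field */
}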

I'm still thinking about how we might solve this in a cleaner way.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-12 Thread Jens Axboe
On 2012-12-12 11:11, Jan Kara wrote:
> On Wed 12-12-12 10:55:15, Shaohua Li wrote:
>> 2012/12/11 Jan Kara :
>>>   Hi,
>>>
>>>   I was looking into IO starvation problems where streaming sync writes (in
>>> my case from kjournald but DIO would look the same) starve reads. This is
>>> because reads happen in small chunks and until a request completes we don't
>>> start reading further (reader reads lots of small files) while writers have
>>> plenty of big requests to submit. Both processes end up fighting for IO
>>> requests and writer writes nr_batching 512 KB requests while reader reads
>>> just one 4 KB request or so. Here the effect is magnified by the fact that
>>> the drive has relatively big queue depth so it usually takes longer than
>>> BLK_BATCH_TIME to complete the read request. The net result is it takes
>>> close to two minutes to read files that can be read under a second without
>>> writer load. Without the big drive's queue depth, results are not ideal but
>>> they are bearable - it takes about 20 seconds to do the reading. And for
>>> comparison, when writer and reader are not competing for IO requests (as it
>>> happens when writes are submitted as async), it takes about 2 seconds to
>>> complete reading.
>>>
>>> Simple reproducer is:
>>>
>>> echo 3 >/proc/sys/vm/drop_caches
>>> dd if=/dev/zero of=/tmp/f bs=1M count=1 &
>>> sleep 30
>>> time cat /etc/* 2>&1 >/dev/null
>>> killall dd
>>> rm /tmp/f
>>>
>>>   The question is how can we fix this? Two quick hacks that come to my mind
>>> are remove timeout from the batching logic (is it that important?) or
>>> further separate request allocation logic so that reads have their own
>>> request pool. More systematic fix would be to change request allocation
>>> logic to always allow at least a fixed number of requests per IOC. What do
>>> people think about this?
>>
>> As long as queue depth > workload iodepth, there is little we can do
>> to prioritize tasks/IOC. Because throttling a task/IOC means queue
>> will be idle. We don't want to idle a queue (especially for SSD), so
>> we always push as more requests as possible to the queue, which
>> will break any prioritization. As far as I know we always have such
>> issue in CFQ for big queue depth disk.
>   Yes, I understand that. But actually big queue depth on its own doesn't
> make the problem really bad (at least for me). When the reader doesn't have
> to wait for free IO requests, it progresses at a reasonable speed. What
> makes it really bad is that big queue depth effectively disallows any use
> of ioc_batching() mode for the reader and thus it blocks in request
> allocation for every single read request unlike writer which always uses
> its full batch (32 requests).

I agree. This isn't about scheduling, we haven't even reached that part
yet. Back when we split the queues into read vs write, this problem
obviously wasn't there. Now we have sync writes and reads, both eating
from the same pool. The io scheduler can impact this a bit by forcing
reads to must allocate (Jan, which io scheduler are you using?). CFQ
does this when it's expecting a request from this process queue.

Back in the day, we used to have one list. To avoid a similar problem,
we reserved the top of the list for reads. With the batching, it's a bit
more complicated. If we make the request allocation (just that, not the
scheduling) be read vs write instead of sync vs async, then we have the
same issue for sync vs buffered writes.
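
For reference, the "same pool" comes from the fact that request allocation
only distinguishes sync vs async via rw_is_sync(), so reads and REQ_SYNC
writes are accounted against the same rl->count[BLK_RW_SYNC] bucket.
Roughly (from memory, treat as a sketch):

static inline bool rw_is_sync(unsigned int rw_flags)
{
	/* anything that is not a write, or any write marked REQ_SYNC */
	return !(rw_flags & REQ_WRITE) || (rw_flags & REQ_SYNC);
}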

How about something like the below? Due to the nature of sync reads, we
should allow a much longer timeout. The batch is really tailored towards
writes at the moment. Also shrink the batch count, 32 is pretty large...


diff --git a/block/blk-core.c b/block/blk-core.c
index 473015e..f2fcb74 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,7 +792,7 @@ static inline int ioc_batching(struct request_queue *q, struct io_context *ioc)
 */
return ioc->nr_batch_requests == q->nr_batching ||
(ioc->nr_batch_requests > 0
-   && time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
+   && time_before(jiffies, ioc->batch_timeout));
 }
 
 /*
@@ -801,13 +801,18 @@ static inline int ioc_batching(struct request_queue *q, struct io_context *ioc)
  * is the behaviour we want though - once it gets a wakeup it should be given
  * a nice run.
  */
-static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
+static void ioc_set_batching(struct request_queue *q, struct io_context *ioc,
+int rw_flags)
 {
if (!ioc || ioc_batching(q, ioc))
return;
 
ioc->nr_batch_requests = q->nr_batching;
-   ioc->last_waited = jiffies;
+
+   if (rw_flags & REQ_WRITE)
+   ioc->batch_timeout = jiffies + BLK_BATCH_TIME_WRITE;
+   else
+   ioc->batch_timeout = jiffies + BLK_BATCH_TIME_READ;
 }
 
 static void __freed_request(struct request_list *rl, int sync)
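
(The hunks above rely on a couple of definitions that are not shown;
presumably something along these lines, though this is only a guess at the
unposted parts of the patch:)

/* block/blk.h: split the batch window by direction */
#define BLK_BATCH_TIME_WRITE	(HZ/50UL)	/* the old BLK_BATCH_TIME */
#define BLK_BATCH_TIME_READ	(2 * HZ)	/* presumably the "2 seconds" Jeff questions in his reply */

/* include/linux/iocontext.h: last_waited becomes an absolute deadline */
struct io_context {
	/* ... existing fields unchanged ... */
	int		nr_batch_requests;	/* requests left in the current batch */
	unsigned long	batch_timeout;		/* replaces last_waited */
};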

Re: Read starvation by sync writes

2012-12-12 Thread Jan Kara
On Wed 12-12-12 15:18:21, Dave Chinner wrote:
> On Wed, Dec 12, 2012 at 03:31:37AM +0100, Jan Kara wrote:
> > On Tue 11-12-12 16:44:15, Jeff Moyer wrote:
> > > Jan Kara  writes:
> > > 
> > > >   Hi,
> > > >
> > > >   I was looking into IO starvation problems where streaming sync writes 
> > > > (in
> > > > my case from kjournald but DIO would look the same) starve reads. This 
> > > > is
> > > > because reads happen in small chunks and until a request completes we 
> > > > don't
> > > > start reading further (reader reads lots of small files) while writers 
> > > > have
> > > > plenty of big requests to submit. Both processes end up fighting for IO
> > > > requests and writer writes nr_batching 512 KB requests while reader 
> > > > reads
> > > > just one 4 KB request or so. Here the effect is magnified by the fact 
> > > > that
> > > > the drive has relatively big queue depth so it usually takes longer than
> > > > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > > > close to two minutes to read files that can be read under a second 
> > > > without
> > > > writer load. Without the big drive's queue depth, results are not ideal 
> > > > but
> > > > they are bearable - it takes about 20 seconds to do the reading. And for
> > > > comparison, when writer and reader are not competing for IO requests 
> > > > (as it
> > > > happens when writes are submitted as async), it takes about 2 seconds to
> > > > complete reading.
> > > >
> > > > Simple reproducer is:
> > > >
> > > > echo 3 >/proc/sys/vm/drop_caches
> > > > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > > > sleep 30
> > > > time cat /etc/* 2>&1 >/dev/null
> > > > killall dd
> > > > rm /tmp/f
> > > 
> > > This is a buffered writer.  How does it end up that you are doing all
> > > synchronous write I/O?  Also, you forgot to mention what file system you
> > > were using, and which I/O scheduler.
> >   So IO scheduler is CFQ, filesystem is ext3 - which is the culprit why IO
> > ends up being synchronous - in ext3 in data=ordered mode kjournald often 
> > ends
> > up submitting all the data to disk and it can do it as WRITE_SYNC if 
> > someone is
> > waiting for transaction commit. In theory this can happen with AIO DIO
> > writes or someone running fsync on a big file as well. Although when I
> > tried this now, I wasn't able to create as big problem as kjournald does
> > (a kernel thread submitting huge linked list of buffer heads in a tight loop
> > is hard to beat ;). Hum, so maybe just adding some workaround in kjournald
> > so that it's not as aggressive will solve the real world cases as well...
> 
> Maybe kjournald shouldn't be using WRITE_SYNC for those buffers? I
> mean, if there is that many of them then it's really a batch
> submission an dthe latency of a single buffer IO is really
> irrelevant to the rate at which the buffers are flushed to disk
  Yeah, the idea why kjournald uses WRITE_SYNC is that we know someone is
waiting for the transaction commit, and that's pretty much the definition of
what WRITE_SYNC means. Kjournald could keep using WRITE if we see we have
tens of megabytes of data in the commit. I'm just afraid that e.g. a process
doing a large DIO overwrite (which is WRITE_SYNC as well) could then starve
kjournald doing async IO and thus effectively block any other fs activity.

Hum, maybe DIO shouldn't be using WRITE_SYNC either (one could make a
similar argument there as with kjournald). But then the definition of what
WRITE_SYNC should mean starts to get pretty foggy.
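
The heuristic would be something of this shape (purely a sketch of the idea,
not existing jbd code; the function, the flag and the threshold are made up):

#define JBD_SYNC_COMMIT_MAX_BYTES	(16 << 20)	/* "tens of megabytes" */

static int commit_write_op(unsigned long commit_bytes, int someone_waiting)
{
	/* only pay the WRITE_SYNC price for small commits somebody waits on */
	if (someone_waiting && commit_bytes <= JBD_SYNC_COMMIT_MAX_BYTES)
		return WRITE_SYNC;
	return WRITE;
}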

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-12 Thread Jan Kara
On Wed 12-12-12 10:55:15, Shaohua Li wrote:
> 2012/12/11 Jan Kara :
> >   Hi,
> >
> >   I was looking into IO starvation problems where streaming sync writes (in
> > my case from kjournald but DIO would look the same) starve reads. This is
> > because reads happen in small chunks and until a request completes we don't
> > start reading further (reader reads lots of small files) while writers have
> > plenty of big requests to submit. Both processes end up fighting for IO
> > requests and writer writes nr_batching 512 KB requests while reader reads
> > just one 4 KB request or so. Here the effect is magnified by the fact that
> > the drive has relatively big queue depth so it usually takes longer than
> > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > close to two minutes to read files that can be read under a second without
> > writer load. Without the big drive's queue depth, results are not ideal but
> > they are bearable - it takes about 20 seconds to do the reading. And for
> > comparison, when writer and reader are not competing for IO requests (as it
> > happens when writes are submitted as async), it takes about 2 seconds to
> > complete reading.
> >
> > Simple reproducer is:
> >
> > echo 3 >/proc/sys/vm/drop_caches
> > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > sleep 30
> > time cat /etc/* 2>&1 >/dev/null
> > killall dd
> > rm /tmp/f
> >
> >   The question is how can we fix this? Two quick hacks that come to my mind
> > are remove timeout from the batching logic (is it that important?) or
> > further separate request allocation logic so that reads have their own
> > request pool. More systematic fix would be to change request allocation
> > logic to always allow at least a fixed number of requests per IOC. What do
> > people think about this?
> 
> As long as queue depth > workload iodepth, there is little we can do
> to prioritize tasks/IOC. Because throttling a task/IOC means queue
> will be idle. We don't want to idle a queue (especially for SSD), so
> we always push as more requests as possible to the queue, which
> will break any prioritization. As far as I know we always have such
> issue in CFQ for big queue depth disk.
  Yes, I understand that. But actually a big queue depth on its own doesn't
make the problem really bad (at least for me). When the reader doesn't have
to wait for free IO requests, it progresses at a reasonable speed. What
makes it really bad is that a big queue depth effectively disallows any use
of ioc_batching() mode for the reader, so it blocks in request
allocation for every single read request, unlike the writer, which always
uses its full batch (32 requests).

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-12 Thread Shaohua Li
2012/12/12 Jan Kara <j...@suse.cz>:
> On Wed 12-12-12 10:55:15, Shaohua Li wrote:
> > 2012/12/11 Jan Kara <j...@suse.cz>:
> > >   Hi,
> > >
> > >   I was looking into IO starvation problems where streaming sync writes (in
> > > my case from kjournald but DIO would look the same) starve reads. This is
> > > because reads happen in small chunks and until a request completes we don't
> > > start reading further (reader reads lots of small files) while writers have
> > > plenty of big requests to submit. Both processes end up fighting for IO
> > > requests and writer writes nr_batching 512 KB requests while reader reads
> > > just one 4 KB request or so. Here the effect is magnified by the fact that
> > > the drive has relatively big queue depth so it usually takes longer than
> > > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > > close to two minutes to read files that can be read under a second without
> > > writer load. Without the big drive's queue depth, results are not ideal but
> > > they are bearable - it takes about 20 seconds to do the reading. And for
> > > comparison, when writer and reader are not competing for IO requests (as it
> > > happens when writes are submitted as async), it takes about 2 seconds to
> > > complete reading.
> > >
> > > Simple reproducer is:
> > >
> > > echo 3 >/proc/sys/vm/drop_caches
> > > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > > sleep 30
> > > time cat /etc/* 2>&1 >/dev/null
> > > killall dd
> > > rm /tmp/f
> > >
> > >   The question is how can we fix this? Two quick hacks that come to my mind
> > > are remove timeout from the batching logic (is it that important?) or
> > > further separate request allocation logic so that reads have their own
> > > request pool. More systematic fix would be to change request allocation
> > > logic to always allow at least a fixed number of requests per IOC. What do
> > > people think about this?
> >
> > As long as queue depth > workload iodepth, there is little we can do
> > to prioritize tasks/IOC. Because throttling a task/IOC means queue
> > will be idle. We don't want to idle a queue (especially for SSD), so
> > we always push as more requests as possible to the queue, which
> > will break any prioritization. As far as I know we always have such
> > issue in CFQ for big queue depth disk.
>   Yes, I understand that. But actually big queue depth on its own doesn't
> make the problem really bad (at least for me). When the reader doesn't have
> to wait for free IO requests, it progresses at a reasonable speed. What
> makes it really bad is that big queue depth effectively disallows any use
> of ioc_batching() mode for the reader and thus it blocks in request
> allocation for every single read request unlike writer which always uses
> its full batch (32 requests).

This can't explain why setting queue depth 1 makes the performance
better. In that case, writes still get that number of requests, reads will
wait for a request. Anyway, try setting nr_requests to a big number
and check if performance is different.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-11 Thread Dave Chinner
On Wed, Dec 12, 2012 at 03:31:37AM +0100, Jan Kara wrote:
> On Tue 11-12-12 16:44:15, Jeff Moyer wrote:
> > Jan Kara  writes:
> > 
> > >   Hi,
> > >
> > >   I was looking into IO starvation problems where streaming sync writes 
> > > (in
> > > my case from kjournald but DIO would look the same) starve reads. This is
> > > because reads happen in small chunks and until a request completes we 
> > > don't
> > > start reading further (reader reads lots of small files) while writers 
> > > have
> > > plenty of big requests to submit. Both processes end up fighting for IO
> > > requests and writer writes nr_batching 512 KB requests while reader reads
> > > just one 4 KB request or so. Here the effect is magnified by the fact that
> > > the drive has relatively big queue depth so it usually takes longer than
> > > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > > close to two minutes to read files that can be read under a second without
> > > writer load. Without the big drive's queue depth, results are not ideal 
> > > but
> > > they are bearable - it takes about 20 seconds to do the reading. And for
> > > comparison, when writer and reader are not competing for IO requests (as 
> > > it
> > > happens when writes are submitted as async), it takes about 2 seconds to
> > > complete reading.
> > >
> > > Simple reproducer is:
> > >
> > > echo 3 >/proc/sys/vm/drop_caches
> > > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > > sleep 30
> > > time cat /etc/* 2>&1 >/dev/null
> > > killall dd
> > > rm /tmp/f
> > 
> > This is a buffered writer.  How does it end up that you are doing all
> > synchronous write I/O?  Also, you forgot to mention what file system you
> > were using, and which I/O scheduler.
>   So IO scheduler is CFQ, filesystem is ext3 - which is the culprit why IO
> ends up being synchronous - in ext3 in data=ordered mode kjournald often ends
> up submitting all the data to disk and it can do it as WRITE_SYNC if someone 
> is
> waiting for transaction commit. In theory this can happen with AIO DIO
> writes or someone running fsync on a big file as well. Although when I
> tried this now, I wasn't able to create as big problem as kjournald does
> (a kernel thread submitting huge linked list of buffer heads in a tight loop
> is hard to beat ;). Hum, so maybe just adding some workaround in kjournald
> so that it's not as aggressive will solve the real world cases as well...

Maybe kjournald shouldn't be using WRITE_SYNC for those buffers? I
mean, if there are that many of them then it's really a batch
submission and the latency of a single buffer IO is really
irrelevant to the rate at which the buffers are flushed to disk...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-11 Thread Shaohua Li
2012/12/11 Jan Kara :
>   Hi,
>
>   I was looking into IO starvation problems where streaming sync writes (in
> my case from kjournald but DIO would look the same) starve reads. This is
> because reads happen in small chunks and until a request completes we don't
> start reading further (reader reads lots of small files) while writers have
> plenty of big requests to submit. Both processes end up fighting for IO
> requests and writer writes nr_batching 512 KB requests while reader reads
> just one 4 KB request or so. Here the effect is magnified by the fact that
> the drive has relatively big queue depth so it usually takes longer than
> BLK_BATCH_TIME to complete the read request. The net result is it takes
> close to two minutes to read files that can be read under a second without
> writer load. Without the big drive's queue depth, results are not ideal but
> they are bearable - it takes about 20 seconds to do the reading. And for
> comparison, when writer and reader are not competing for IO requests (as it
> happens when writes are submitted as async), it takes about 2 seconds to
> complete reading.
>
> Simple reproducer is:
>
> echo 3 >/proc/sys/vm/drop_caches
> dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> sleep 30
> time cat /etc/* 2>&1 >/dev/null
> killall dd
> rm /tmp/f
>
>   The question is how can we fix this? Two quick hacks that come to my mind
> are remove timeout from the batching logic (is it that important?) or
> further separate request allocation logic so that reads have their own
> request pool. More systematic fix would be to change request allocation
> logic to always allow at least a fixed number of requests per IOC. What do
> people think about this?

As long as queue depth > workload iodepth, there is little we can do
to prioritize tasks/IOCs, because throttling a task/IOC means the queue
will go idle. We don't want to idle a queue (especially for SSDs), so
we always push as many requests as possible to the queue, which
will break any prioritization. As far as I know we have always had this
issue in CFQ for big queue depth disks.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-11 Thread Jan Kara
On Tue 11-12-12 16:44:15, Jeff Moyer wrote:
> Jan Kara  writes:
> 
> >   Hi,
> >
> >   I was looking into IO starvation problems where streaming sync writes (in
> > my case from kjournald but DIO would look the same) starve reads. This is
> > because reads happen in small chunks and until a request completes we don't
> > start reading further (reader reads lots of small files) while writers have
> > plenty of big requests to submit. Both processes end up fighting for IO
> > requests and writer writes nr_batching 512 KB requests while reader reads
> > just one 4 KB request or so. Here the effect is magnified by the fact that
> > the drive has relatively big queue depth so it usually takes longer than
> > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > close to two minutes to read files that can be read under a second without
> > writer load. Without the big drive's queue depth, results are not ideal but
> > they are bearable - it takes about 20 seconds to do the reading. And for
> > comparison, when writer and reader are not competing for IO requests (as it
> > happens when writes are submitted as async), it takes about 2 seconds to
> > complete reading.
> >
> > Simple reproducer is:
> >
> > echo 3 >/proc/sys/vm/drop_caches
> > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > sleep 30
> > time cat /etc/* 2>&1 >/dev/null
> > killall dd
> > rm /tmp/f
> 
> This is a buffered writer.  How does it end up that you are doing all
> synchronous write I/O?  Also, you forgot to mention what file system you
> were using, and which I/O scheduler.
  So the IO scheduler is CFQ and the filesystem is ext3 - which is the reason
why the IO ends up being synchronous: in ext3 in data=ordered mode kjournald
often ends up submitting all the data to disk, and it can do so as WRITE_SYNC
if someone is waiting for the transaction commit. In theory this can happen
with AIO DIO writes or with someone running fsync on a big file as well.
Although when I tried that now, I wasn't able to create as big a problem as
kjournald does (a kernel thread submitting a huge linked list of buffer heads
in a tight loop is hard to beat ;). Hum, so maybe just adding some workaround
in kjournald so that it's not as aggressive will solve the real world cases
as well...

> Is this happening in some real workload?  If so, can you share what that
> workload is?  How about some blktrace data?
  With ext3 it does happen in a real workload on our servers - e.g. when
you provision KVM images there are a lot of streaming writes and the machine
struggles to do anything else during that time. I have put up some 40
seconds of blktrace data to

http://beta.suse.com/private/jack/read_starvation/sda.tar.gz

> >   The question is how can we fix this? Two quick hacks that come to my mind
> > are remove timeout from the batching logic (is it that important?) or
> > further separate request allocation logic so that reads have their own
> > request pool. More systematic fix would be to change request allocation
> > logic to always allow at least a fixed number of requests per IOC. What do
> > people think about this?
> 
> There has been talk of removing the limit on the number of requests
> allocated, but I haven't seen patches for it, and I certainly am not
> convinced of its practicality.  Today, when using block cgroups you do
> get a request list per cgroup, so that's kind of the same thing as one
> per ioc.  I can certainly see moving in that direction for the
> non-cgroup case.
  Ah, I thought blk_get_rl() was one of those trivial wrappers we have in
the block layer, but now that I look into it, it actually does something
useful ;) Thanks for looking into this!

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-11 Thread Jan Engelhardt

On Monday 2012-12-10 23:12, Jan Kara wrote:
>
>  I was looking into IO starvation problems where streaming sync writes (in
>my case from kjournald but DIO would look the same) starve reads. This is
>because reads happen in small chunks and until a request completes we don't
>start reading further (reader reads lots of small files) while writers have
>plenty of big requests to submit. Both processes end up fighting for IO
>requests and writer writes nr_batching 512 KB requests while reader reads
>just one 4 KB request or so.[...]
>Simple reproducer is:
>
>echo 3 >/proc/sys/vm/drop_caches
>dd if=/dev/zero of=/tmp/f bs=1M count=1 &
>sleep 30
>time cat /etc/* 2>&1 >/dev/null
>killall dd
>rm /tmp/f
>
>  The question is how can we fix this?

The ROW scheduler attempts to address this by prioritizing reads,
and it works out for me.

 (Mail info.)
 Date: Tue, 11 Dec 2012 14:41:18
 From: Tanya Brokhman 
 Subject: [PATCH v4 0/2] ROW scheduling Algorithm
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Read starvation by sync writes

2012-12-11 Thread Jeff Moyer
Jan Kara  writes:

>   Hi,
>
>   I was looking into IO starvation problems where streaming sync writes (in
> my case from kjournald but DIO would look the same) starve reads. This is
> because reads happen in small chunks and until a request completes we don't
> start reading further (reader reads lots of small files) while writers have
> plenty of big requests to submit. Both processes end up fighting for IO
> requests and writer writes nr_batching 512 KB requests while reader reads
> just one 4 KB request or so. Here the effect is magnified by the fact that
> the drive has relatively big queue depth so it usually takes longer than
> BLK_BATCH_TIME to complete the read request. The net result is it takes
> close to two minutes to read files that can be read under a second without
> writer load. Without the big drive's queue depth, results are not ideal but
> they are bearable - it takes about 20 seconds to do the reading. And for
> comparison, when writer and reader are not competing for IO requests (as it
> happens when writes are submitted as async), it takes about 2 seconds to
> complete reading.
>
> Simple reproducer is:
>
> echo 3 >/proc/sys/vm/drop_caches
> dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> sleep 30
> time cat /etc/* 2>&1 >/dev/null
> killall dd
> rm /tmp/f

This is a buffered writer.  How does it end up that you are doing all
synchronous write I/O?  Also, you forgot to mention what file system you
were using, and which I/O scheduler.

Is this happening in some real workload?  If so, can you share what that
workload is?  How about some blktrace data?

>   The question is how can we fix this? Two quick hacks that come to my mind
> are remove timeout from the batching logic (is it that important?) or
> further separate request allocation logic so that reads have their own
> request pool. More systematic fix would be to change request allocation
> logic to always allow at least a fixed number of requests per IOC. What do
> people think about this?

There has been talk of removing the limit on the number of requests
allocated, but I haven't seen patches for it, and I certainly am not
convinced of its practicality.  Today, when using block cgroups you do
get a request list per cgroup, so that's kind of the same thing as one
per ioc.  I can certainly see moving in that direction for the
non-cgroup case.
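
To see the per-cgroup request lists in action, the streaming writer can be
pushed into its own blkio cgroup (a sketch assuming a v1 blkio hierarchy
mounted at /sys/fs/cgroup/blkio and CFQ as the active scheduler; the group
name and weight are illustrative):

# Create a group for the writer and give it a low proportional weight.
mkdir /sys/fs/cgroup/blkio/writer
echo 100 > /sys/fs/cgroup/blkio/writer/blkio.weight

# Move the current shell into the group, then start the reproducer's dd
# from this shell; its requests are then allocated from the group's own
# request list rather than from the one shared with the reader.
echo $$ > /sys/fs/cgroup/blkio/writer/tasks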

First, though, I'd like to better understand your workload.

Cheers,
Jeff


Re: Read starvation by sync writes

2012-12-11 Thread Jan Kara
On Tue 11-12-12 16:44:15, Jeff Moyer wrote:
> Jan Kara j...@suse.cz writes:
> 
> >   Hi,
> >
> >   I was looking into IO starvation problems where streaming sync writes (in
> > my case from kjournald but DIO would look the same) starve reads. This is
> > because reads happen in small chunks and until a request completes we don't
> > start reading further (reader reads lots of small files) while writers have
> > plenty of big requests to submit. Both processes end up fighting for IO
> > requests and writer writes nr_batching 512 KB requests while reader reads
> > just one 4 KB request or so. Here the effect is magnified by the fact that
> > the drive has relatively big queue depth so it usually takes longer than
> > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > close to two minutes to read files that can be read under a second without
> > writer load. Without the big drive's queue depth, results are not ideal but
> > they are bearable - it takes about 20 seconds to do the reading. And for
> > comparison, when writer and reader are not competing for IO requests (as it
> > happens when writes are submitted as async), it takes about 2 seconds to
> > complete reading.
> >
> > Simple reproducer is:
> >
> > echo 3 >/proc/sys/vm/drop_caches
> > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > sleep 30
> > time cat /etc/* 2>&1 >/dev/null
> > killall dd
> > rm /tmp/f
> 
> This is a buffered writer.  How does it end up that you are doing all
> synchronous write I/O?  Also, you forgot to mention what file system you
> were using, and which I/O scheduler.
  So IO scheduler is CFQ, filesystem is ext3 - which is the culprit why IO
ends up being synchronous - in ext3 in data=ordered mode kjournald often ends
up submitting all the data to disk and it can do it as WRITE_SYNC if someone is
waiting for transaction commit. In theory this can happen with AIO DIO
writes or someone running fsync on a big file as well. Although when I
tried this now, I wasn't able to create as big problem as kjournald does
(a kernel thread submitting huge linked list of buffer heads in a tight loop
is hard to beat ;). Hum, so maybe just adding some workaround in kjournald
so that it's not as aggressive will solve the real world cases as well...
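
Roughly, the userspace equivalents would look like this (a sketch; the sizes
and the target file are arbitrary):

# O_DIRECT writes are submitted down the sync write path in 1 MB chunks.
dd if=/dev/zero of=/tmp/big bs=1M count=2048 oflag=direct

# Buffered writes followed by an fsync of the file; the final flush
# pushes out a burst of sync writes much like a big transaction commit.
dd if=/dev/zero of=/tmp/big bs=1M count=2048 conv=fsync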

> Is this happening in some real workload?  If so, can you share what that
> workload is?  How about some blktrace data?
  With ext3 it does happen in a real workload on our servers - e.g. when
you provision KVM images it's a lot of streaming writes and the machine
struggles to do anything else during that time. I have put up some 40
seconds of blktrace data to

http://beta.suse.com/private/jack/read_starvation/sda.tar.gz

> >   The question is how can we fix this? Two quick hacks that come to my mind
> > are remove timeout from the batching logic (is it that important?) or
> > further separate request allocation logic so that reads have their own
> > request pool. More systematic fix would be to change request allocation
> > logic to always allow at least a fixed number of requests per IOC. What do
> > people think about this?
> 
> There has been talk of removing the limit on the number of requests
> allocated, but I haven't seen patches for it, and I certainly am not
> convinced of its practicality.  Today, when using block cgroups you do
> get a request list per cgroup, so that's kind of the same thing as one
> per ioc.  I can certainly see moving in that direction for the
> non-cgroup case.
  Ah, I thought blk_get_rl() is one of those trivial wrappers we have in
block layer but now when looking into it, it actually does something useful ;)
Thanks for looking into this!

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


Re: Read starvation by sync writes

2012-12-11 Thread Shaohua Li
2012/12/11 Jan Kara j...@suse.cz:
>   Hi,
>
>   I was looking into IO starvation problems where streaming sync writes (in
> my case from kjournald but DIO would look the same) starve reads. This is
> because reads happen in small chunks and until a request completes we don't
> start reading further (reader reads lots of small files) while writers have
> plenty of big requests to submit. Both processes end up fighting for IO
> requests and writer writes nr_batching 512 KB requests while reader reads
> just one 4 KB request or so. Here the effect is magnified by the fact that
> the drive has relatively big queue depth so it usually takes longer than
> BLK_BATCH_TIME to complete the read request. The net result is it takes
> close to two minutes to read files that can be read under a second without
> writer load. Without the big drive's queue depth, results are not ideal but
> they are bearable - it takes about 20 seconds to do the reading. And for
> comparison, when writer and reader are not competing for IO requests (as it
> happens when writes are submitted as async), it takes about 2 seconds to
> complete reading.
>
> Simple reproducer is:
>
> echo 3 >/proc/sys/vm/drop_caches
> dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> sleep 30
> time cat /etc/* 2>&1 >/dev/null
> killall dd
> rm /tmp/f
>
>   The question is how can we fix this? Two quick hacks that come to my mind
> are remove timeout from the batching logic (is it that important?) or
> further separate request allocation logic so that reads have their own
> request pool. More systematic fix would be to change request allocation
> logic to always allow at least a fixed number of requests per IOC. What do
> people think about this?

As long as queue depth > workload iodepth, there is little we can do
to prioritize tasks/IOC, because throttling a task/IOC means the queue
will go idle. We don't want to idle a queue (especially for SSD), so
we always push as many requests as possible to the queue, which
will break any prioritization. As far as I know we have always had this
issue in CFQ with big queue depth disks.
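
For what it's worth, both knobs are visible from userspace, so it is easy to
check what a given box is actually running with (a sketch; sda is an example,
and queue_depth is only exposed for SCSI-like devices):

# Size of the request pool that the reader and writer are fighting over.
cat /sys/block/sda/queue/nr_requests

# Hardware queue depth the scheduler dispatches into.
cat /sys/block/sda/device/queue_depth

# Shrinking the device queue depth is a crude experiment: with a shallow
# queue, the scheduler's idling/prioritization gets a chance to matter again.
echo 4 > /sys/block/sda/device/queue_depth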

Thanks,
Shaohua


Re: Read starvation by sync writes

2012-12-11 Thread Dave Chinner
On Wed, Dec 12, 2012 at 03:31:37AM +0100, Jan Kara wrote:
> On Tue 11-12-12 16:44:15, Jeff Moyer wrote:
> > Jan Kara j...@suse.cz writes:
> > 
> > >   Hi,
> > >
> > >   I was looking into IO starvation problems where streaming sync writes (in
> > > my case from kjournald but DIO would look the same) starve reads. This is
> > > because reads happen in small chunks and until a request completes we don't
> > > start reading further (reader reads lots of small files) while writers have
> > > plenty of big requests to submit. Both processes end up fighting for IO
> > > requests and writer writes nr_batching 512 KB requests while reader reads
> > > just one 4 KB request or so. Here the effect is magnified by the fact that
> > > the drive has relatively big queue depth so it usually takes longer than
> > > BLK_BATCH_TIME to complete the read request. The net result is it takes
> > > close to two minutes to read files that can be read under a second without
> > > writer load. Without the big drive's queue depth, results are not ideal but
> > > they are bearable - it takes about 20 seconds to do the reading. And for
> > > comparison, when writer and reader are not competing for IO requests (as it
> > > happens when writes are submitted as async), it takes about 2 seconds to
> > > complete reading.
> > >
> > > Simple reproducer is:
> > >
> > > echo 3 >/proc/sys/vm/drop_caches
> > > dd if=/dev/zero of=/tmp/f bs=1M count=1 &
> > > sleep 30
> > > time cat /etc/* 2>&1 >/dev/null
> > > killall dd
> > > rm /tmp/f
> > 
> > This is a buffered writer.  How does it end up that you are doing all
> > synchronous write I/O?  Also, you forgot to mention what file system you
> > were using, and which I/O scheduler.
>   So IO scheduler is CFQ, filesystem is ext3 - which is the culprit why IO
> ends up being synchronous - in ext3 in data=ordered mode kjournald often ends
> up submitting all the data to disk and it can do it as WRITE_SYNC if someone is
> waiting for transaction commit. In theory this can happen with AIO DIO
> writes or someone running fsync on a big file as well. Although when I
> tried this now, I wasn't able to create as big problem as kjournald does
> (a kernel thread submitting huge linked list of buffer heads in a tight loop
> is hard to beat ;). Hum, so maybe just adding some workaround in kjournald
> so that it's not as aggressive will solve the real world cases as well...

Maybe kjournald shouldn't be using WRITE_SYNC for those buffers? I
mean, if there are that many of them then it's really a batch
submission and the latency of a single buffer IO is really
irrelevant to the rate at which the buffers are flushed to disk.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Read starvation by sync writes

2012-12-10 Thread Jan Kara
  Hi,

  I was looking into IO starvation problems where streaming sync writes (in
my case from kjournald but DIO would look the same) starve reads. This is
because reads happen in small chunks and until a request completes we don't
start reading further (reader reads lots of small files) while writers have
plenty of big requests to submit. Both processes end up fighting for IO
requests and writer writes nr_batching 512 KB requests while reader reads
just one 4 KB request or so. Here the effect is magnified by the fact that
the drive has relatively big queue depth so it usually takes longer than
BLK_BATCH_TIME to complete the read request. The net result is it takes
close to two minutes to read files that can be read under a second without
writer load. Without the big drive's queue depth, results are not ideal but
they are bearable - it takes about 20 seconds to do the reading. And for
comparison, when writer and reader are not competing for IO requests (as it
happens when writes are submitted as async), it takes about 2 seconds to
complete reading.

Simple reproducer is:

echo 3 >/proc/sys/vm/drop_caches
dd if=/dev/zero of=/tmp/f bs=1M count=1 &
sleep 30
time cat /etc/* 2>&1 >/dev/null
killall dd
rm /tmp/f

  The question is how can we fix this? Two quick hacks that come to my mind
are remove timeout from the batching logic (is it that important?) or
further separate request allocation logic so that reads have their own
request pool. More systematic fix would be to change request allocation
logic to always allow at least a fixed number of requests per IOC. What do
people think about this?
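
A related experiment (a sketch, assuming CFQ): run the reader in the realtime
I/O class and see whether that helps at all - which tells us whether the fix
belongs in the scheduler or in the request allocation path:

# Same reader as in the reproducer, but in the RT I/O class, top priority.
ionice -c1 -n0 sh -c 'time cat /etc/* >/dev/null 2>&1'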

Honza
-- 
Jan Kara 
SUSE Labs, CR

