Re: [rfc] small bioq patch

2013-10-20 Thread John-Mark Gurney
Maksim Yevmenkin wrote this message on Tue, Oct 15, 2013 at 11:15 -0700:
 On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney j...@funkthat.com wrote:
  Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
   On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
  
   Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
   i would like to submit the attached bioq patch for review and
   comments. this is proof of concept. it helps with smoothing disk read
   service times and appears to eliminate outliers. please see attached
   pictures (about a week's worth of data)
  
   - c034 control unmodified system
   - c044 patched system
  
   Can you describe how you got this data?  Were you using the gstat
   code or some other code?
 
  Yes, it's basically gstat data.
 
  The reason I ask this is that I don't think the data you are getting
  from gstat is what you think you are...  It accumulates time for a set
  of operations and then divides by the count...  So I'm not sure if the
  stat improvements you are seeing are as meaningful as you might think
  they are...
 
 yes, i'm aware of it. however, i'm not aware of better tools. we
 also use dtrace and PCM/PMC. ktrace is not particularly usable for us
 because it does not really work well when we push the system above 5 Gbps.
 in order to actually see any issues we need to push the system to the 10
 Gbps range at least.

So, I put a test together using dtrace...  And my test wasn't a big
test, but I put HEAD on a 16G fs, and did a:
find /mnt -type f -exec cat {} +

And I varied the sysctl values as 0, 16, 32, 64, 128, 256 and 512... I
was unable to get a significant difference between the runs... Between
each run I would unmount the fs to make sure the fs cache was clean...

I've posted my scripts/results at:
https://people.freebsd.org/~jmg/disklat/

genresults runs the dtrace script and gets the results
makeresults extracts the results from each run
disklatencycmd.d is the dtrace script I used
catall/catallp are the scripts containing the find command... I tried two
different versions of the command, one single threaded as above, the
other using xargs w/ 4 processes running...  the p4 results are from
the xargs version, the sing results are from the find -exec command above...

Though I will admit that before the patch, on occasion I did see a max
latency of 6s, but in these tests I didn't see it...  The disk was:
Model Family: Maxtor MaXLine Pro 500
Device Model: Maxtor 7H500F0
Firmware Version: HA431DN0
User Capacity:    500,107,862,016 bytes [500 GB]

and the partition was close to the beginning of the drive...

   graphs show max/avg disk read service times for both systems across 36
   spinning drives. both systems are relatively busy serving production
   traffic (about 10 Gbps at peak). grey shaded areas on the graphs
   represent time when systems are refreshing their content, i.e. disks
   are both reading and writing at the same time.
  
   Can you describe why you think this change makes an improvement?  Unless
   you're running 10k or 15k RPM drives, 128 seems like a large number.. as
   that's about half the number of IOPs that a normal HD handles in a second..
 
  Our (Netflix) load is basically random disk io. We have tweaked the system 
  to ensure that our io path is wide enough, i.e. we read 1MB per disk io 
  for the majority of the requests. However, the offsets we read from are all 
  over the place. It appears that we are getting into a situation where larger 
  offsets are getting delayed because smaller offsets are jumping ahead of 
  them. Forcing the bioq insert-tail operation and effectively moving the 
  insertion point seems to help avoid getting into this situation. And, no, 
  we don't use 10k or 15k drives. Just regular enterprise 7200 RPM SATA drives.
 
  I assume that the 1mb reads are then further broken up into 8 128kb
  reads? so it's more like every 16 reads in your workload that you
  insert the ordered io...
 
 i'm not sure where 128kb comes from. are you referring to
 MAXPHYS/DFLTPHYS? if so, then, no, we have increased *PHYS to 1MB.

Ahh, ok, so another difference between your system and HEAD...

  I want to make sure that we choose the right value for this number..
  What number of IOPs are you seeing?
 
 generally we see < 100 IOPs per disk on a system pushing 10+ Gbps.

w/ 1MB IO's, that makes sense...
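
(Back-of-the-envelope check, assuming the ~10 Gbps mentioned earlier is
spread roughly evenly across the 36 drives and served mostly from disk:

    10 Gbit/s / 8 bit per Byte     ~= 1.25 GByte/s aggregate
    1.25 GByte/s / 36 drives       ~= 35 MByte/s per drive
    35 MByte/s / 1 MByte per io    ~= 35 io/s per drive

so well under 100 IOPs per disk.)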

 i've experimented with different numbers on our system and i did not
 see much of a difference on our workload. i'm up to a value of 1024 now.
 higher numbers seem to produce a slightly bigger difference between
 average and max time, but i do not think it's statistically meaningful.
 the general shape of the curve remains smooth for all values tried so far.
 
 [...]
 
   Also, do you see a similar throughput of the system?
 
  Yes. We do see almost identical throughput from both systems.  I have not 
  pushed the system to its limit yet, but having much smoother disk read 
  service time is important for us because we use it as one of the components 
  of system health metrics. We also need to ensure that disk io requests are 
  actually dispatched to the disk in a timely manner.

Re: [rfc] small bioq patch

2013-10-15 Thread Maksim Yevmenkin
On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney j...@funkthat.com wrote:
 Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
  On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
 
  Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
  i would like to submit the attached bioq patch for review and
  comments. this is proof of concept. it helps with smoothing disk read
  service times and appears to eliminate outliers. please see attached
  pictures (about a week's worth of data)
 
  - c034 control unmodified system
  - c044 patched system
 
  Can you describe how you got this data?  Were you using the gstat
  code or some other code?

 Yes, it's basically gstat data.

 The reason I ask this is that I don't think the data you are getting
 from gstat is what you think you are...  It accumulates time for a set
 of operations and then divides by the count...  So I'm not sure if the
 stat improvements you are seeing are as meaningful as you might think
 they are...

yes, i'm aware of it. however, i'm not aware of better tools. we
also use dtrace and PCM/PMC. ktrace is not particularly usable for us
because it does not really work well when we push the system above 5 Gbps.
in order to actually see any issues we need to push the system to the 10
Gbps range at least.

  graphs show max/avg disk read service times for both systems across 36
  spinning drives. both systems are relatively busy serving production
  traffic (about 10 Gbps at peak). grey shaded areas on the graphs
  represent time when systems are refreshing their content, i.e. disks
  are both reading and writing at the same time.
 
  Can you describe why you think this change makes an improvement?  Unless
  you're running 10k or 15k RPM drives, 128 seems like a large number.. as
  that's about half the number of IOPs that a normal HD handles in a second..

 Our (Netflix) load is basically random disk io. We have tweaked the system 
 to ensure that our io path is wide enough, i.e. we read 1MB per disk io 
 for the majority of the requests. However, the offsets we read from are all 
 over the place. It appears that we are getting into a situation where larger 
 offsets are getting delayed because smaller offsets are jumping ahead of them. 
 Forcing the bioq insert-tail operation and effectively moving the insertion 
 point seems to help avoid getting into this situation. And, no, we don't use 
 10k or 15k drives. Just regular enterprise 7200 RPM SATA drives.

 I assume that the 1mb reads are then further broken up into 8 128kb
 reads? so it's more like every 16 reads in your workload that you
 insert the ordered io...

i'm not sure where 128kb comes from. are you referring to
MAXPHYS/DFLTPHYS? if so, then, no, we have increased *PHYS to 1MB.

 I want to make sure that we choose the right value for this number..
 What number of IOPs are you seeing?

generally we see < 100 IOPs per disk on a system pushing 10+ Gbps.
i've experimented with different numbers on our system and i did not
see much of a difference on our workload. i'm up to a value of 1024 now.
higher numbers seem to produce a slightly bigger difference between
average and max time, but i do not think it's statistically meaningful.
the general shape of the curve remains smooth for all values tried so far.

[...]

  Also, do you see a similar throughput of the system?

 Yes. We do see almost identical throughput from both systems.  I have not 
 pushed the system to its limit yet, but having much smoother disk read 
 service time is important for us because we use it as one of the components 
 of system health metrics. We also need to ensure that disk io requests are 
 actually dispatched to the disk in a timely manner.

 Per above, have you measured at the application layer that you are
 getting better latency times on your reads?  Maybe by doing a ktrace
 of the io, and calculating times between read and return or something
 like that...

ktrace is not particularly useful. i can see if i can come up with a
dtrace probe or something. our application (or rather its clients) is
_very_ sensitive to latency. having read service time outliers is not
very good for us.

 Have you looked at the geom disk scheduler work that Luigi did a few
 years back?  There have been known issues w/ our io scheduler for a
 long time...  If you search the mailing lists, you'll see lots of
 reports of some processes starving out others, probably due to a
 similar issue...  I've seen similar unfair behavior between processes,
 but haven't spent time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

 It does look like a good improvement though...

 Thanks for the work!

ok :) i'm interested to hear from people who have a different workload
profile. for example lots of iops, i.e. very small file reads or
something like that.

thanks,
max

[rfc] small bioq patch

2013-10-11 Thread Maksim Yevmenkin
hello,

i would like to submit the attached bioq patch for review and
comments. this is proof of concept. it helps with smoothing disk read
service times and appears to eliminate outliers. please see attached
pictures (about a week's worth of data)

- c034 control unmodified system
- c044 patched system

graphs show max/avg disk read service times for both systems across 36
spinning drives. both systems are relatively busy serving production
traffic (about 10 Gbps at peak). grey shaded areas on the graphs
represent time when systems are refreshing their content, i.e. disks
are both reading and writing at the same time.

thanks,
max
Index: branches/freebsd10/src/sys/kern/subr_disk.c
===================================================================
diff -u -N -r2284 -r2698
--- branches/freebsd10/src/sys/kern/subr_disk.c	(.../subr_disk.c)	(revision 2284)
+++ branches/freebsd10/src/sys/kern/subr_disk.c	(.../subr_disk.c)	(revision 2698)
@@ -21,8 +21,13 @@
 #include <sys/bio.h>
 #include <sys/conf.h>
 #include <sys/disk.h>
+#include <sys/sysctl.h>
 #include <geom/geom_disk.h>
 
+static int bioq_batchsize = 128;
+SYSCTL_INT(_debug, OID_AUTO, bioq_batchsize, CTLFLAG_RW,
+    &bioq_batchsize, 0, "BIOQ batch size");
+
 /*-
  * Disk error is the preface to plaintive error messages
  * about failing disk transfers.  It prints messages of the form
@@ -150,6 +155,8 @@
 	TAILQ_INIT(&head->queue);
 	head->last_offset = 0;
 	head->insert_point = NULL;
+	head->total = 0;
+	head->batched = 0;
 }
 
 void
@@ -163,6 +170,7 @@
 		head->insert_point = NULL;
 
 	TAILQ_REMOVE(&head->queue, bp, bio_queue);
+	head->total--;
 }
 
 void
@@ -181,13 +189,16 @@
 	if (head->insert_point == NULL)
 		head->last_offset = bp->bio_offset;
 	TAILQ_INSERT_HEAD(&head->queue, bp, bio_queue);
+	head->total++;
 }
 
 void
 bioq_insert_tail(struct bio_queue_head *head, struct bio *bp)
 {
 
 	TAILQ_INSERT_TAIL(&head->queue, bp, bio_queue);
+	head->total++;
+	head->batched = 0;
 	head->insert_point = bp;
 	head->last_offset = bp->bio_offset;
 }
@@ -246,6 +257,11 @@
 		return;
 	}
 
+	if (bioq_batchsize > 0 && head->batched > bioq_batchsize) {
+		bioq_insert_tail(head, bp);
+		return;
+	}
+
 	prev = NULL;
 	key = bioq_bio_key(head, bp);
 	cur = TAILQ_FIRST(&head->queue);
@@ -264,4 +280,7 @@
 		TAILQ_INSERT_HEAD(&head->queue, bp, bio_queue);
 	else
 		TAILQ_INSERT_AFTER(&head->queue, prev, bp, bio_queue);
+
+	head->total++;
+	head->batched++;
 }
Index: branches/freebsd10/src/sys/sys/bio.h
===================================================================
diff -u -N -r2284 -r2698
--- branches/freebsd10/src/sys/sys/bio.h	(.../bio.h)	(revision 2284)
+++ branches/freebsd10/src/sys/sys/bio.h	(.../bio.h)	(revision 2698)
@@ -129,6 +129,8 @@
 	TAILQ_HEAD(bio_queue, bio) queue;
 	off_t	last_offset;
 	struct	bio *insert_point;
+	int	total;
+	int	batched;
 };
 
 extern struct vm_map *bio_transient_map;

Re: [rfc] small bioq patch

2013-10-11 Thread Eric van Gyzen

Maksim,

The graphs were not attached.  Perhaps the list stripped them.  Could 
you post them on the web instead?


Thanks,

Eric

On 10/11/2013 01:17 PM, Maksim Yevmenkin wrote:

hello,

i would like to submit the attached bioq patch for review and
comments. this is proof of concept. it helps with smoothing disk read
service times and appears to eliminate outliers. please see attached
pictures (about a week's worth of data)

- c034 control unmodified system
- c044 patched system

graphs show max/avg disk read service times for both systems across 36
spinning drives. both systems are relatively busy serving production
traffic (about 10 Gbps at peak). grey shaded areas on the graphs
represent time when systems are refreshing their content, i.e. disks
are both reading and writing at the same time.

thanks,
max




Re: [rfc] small bioq patch

2013-10-11 Thread Maksim Yevmenkin
On Fri, Oct 11, 2013 at 12:25 PM, Eric van Gyzen e...@vangyzen.net wrote:
 Maksim,

 The graphs were not attached.  Perhaps the list stripped them.  Could you
 post them on the web instead?

err... sorry about it.. please try

http://people.freebsd.org/~emax/c034.png
http://people.freebsd.org/~emax/c044.png

thanks
max

 hello,

 i would like to submit the attached bioq patch for review and
 comments. this is proof of concept. it helps with smoothing disk read
 service times and appears to eliminate outliers. please see attached
 pictures (about a week's worth of data)

 - c034 control unmodified system
 - c044 patched system

 graphs show max/avg disk read service times for both systems across 36
 spinning drives. both systems are relatively busy serving production
 traffic (about 10 Gbps at peak). grey shaded areas on the graphs
 represent time when systems are refreshing their content, i.e. disks
 are both reading and writing at the same time.

 thanks,
 max


Re: [rfc] small bioq patch

2013-10-11 Thread John-Mark Gurney
Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
 i would like to submit the attached bioq patch for review and
 comments. this is proof of concept. it helps with smoothing disk read
 service times and appears to eliminate outliers. please see attached
 pictures (about a week's worth of data)
 
 - c034 control unmodified system
 - c044 patched system

Can you describe how you got this data?  Were you using the gstat
code or some other code?

Also, was your control system w/ the patch, but w/ the sysctl set to
zero to possibly eliminate any code alignment issues?

 graphs show max/avg disk read service times for both systems across 36
 spinning drives. both systems are relatively busy serving production
 traffic (about 10 Gbps at peak). grey shaded areas on the graphs
 represent time when systems are refreshing their content, i.e. disks
 are both reading and writing at the same time.

Can you describe why you think this change makes an improvement?  Unless
you're running 10k or 15k RPM drives, 128 seems like a large number.. as
that's about half the number of IOPs that a normal HD handles in a second..

I assume you must be regularly seeing queue depths of 128+ for this
code to make a difference, do you see that w/ gstat?

Also, do you see a similar throughput of the system?

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: [rfc] small bioq patch

2013-10-11 Thread Maksim Yevmenkin


 On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
 
 Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
 i would like to submit the attached bioq patch for review and
 comments. this is proof of concept. it helps with smoothing disk read
 service times and appears to eliminate outliers. please see attached
 pictures (about a week's worth of data)
 
 - c034 control unmodified system
 - c044 patched system
 
 Can you describe how you got this data?  Were you using the gstat
 code or some other code?

Yes, it's basically gstat data. 

 Also, was your control system w/ the patch, but w/ the sysctl set to
 zero to possibly eliminate any code alignment issues?

Both systems use the same code base and build. The patched system has the 
patch included; the control system does not have the patch. I can rerun my 
tests with the sysctl set to zero and use that as the control. So, the answer 
to your question is no.

 graphs show max/avg disk read service times for both systems across 36
 spinning drives. both systems are relatively busy serving production
 traffic (about 10 Gbps at peak). grey shaded areas on the graphs
 represent time when systems are refreshing their content, i.e. disks
 are both reading and writing at the same time.
 
 Can you describe why you think this change makes an improvement?  Unless
 you're running 10k or 15k RPM drives, 128 seems like a large number.. as
 that's about half the number of IOPs that a normal HD handles in a second..

Our (Netflix) load is basically random disk io. We have tweaked the system to 
ensure that our io path is wide enough, i.e. we read 1MB per disk io for the 
majority of the requests. However, the offsets we read from are all over the 
place. It appears that we are getting into a situation where larger offsets 
are getting delayed because smaller offsets are jumping ahead of them. Forcing 
the bioq insert-tail operation and effectively moving the insertion point 
seems to help avoid getting into this situation. And, no, we don't use 10k or 
15k drives. Just regular enterprise 7200 RPM SATA drives. 

 I assume you must be regularly seeing queue depths of 128+ for this
 code to make a difference, do you see that w/ gstat?

No, we don't see large (128+) queue sizes in gstat data. The way I see it, we 
don't have to have a deep queue here. We could just have a steady stream of io 
requests where new, smaller offsets consistently jump ahead of older, larger 
offsets. In fact, gstat data show a shallow queue of 5 or fewer items.

 Also, do you see a similar throughput of the system?

Yes. We do see almost identical throughput from both systems.  I have not 
pushed the system to its limit yet, but having much smoother disk read service 
time is important for us because we use it as one of the components of system 
health metrics. We also need to ensure that disk io requests are actually 
dispatched to the disk in a timely manner. 

Thanks
Max



Re: [rfc] small bioq patch

2013-10-11 Thread John-Mark Gurney
Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
  On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
  
  Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
  i would like to submit the attached bioq patch for review and
  comments. this is proof of concept. it helps with smoothing disk read
  service times and appears to eliminate outliers. please see attached
  pictures (about a week's worth of data)
  
  - c034 control unmodified system
  - c044 patched system
  
  Can you describe how you got this data?  Were you using the gstat
  code or some other code?
 
 Yes, it's basically gstat data. 

The reason I ask this is that I don't think the data you are getting
from gstat is what you think you are...  It accumulates time for a set
of operations and then divides by the count...  So I'm not sure if the
stat improvements you are seeing are as meaningful as you might think
they are...
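
(A made-up example of why that averaging matters: if one gstat interval
contains 99 reads at 5 ms and one read at 500 ms, then

    99 * 5 ms + 1 * 500 ms   = 995 ms accumulated
    995 ms / 100 operations ~= 10 ms reported average

so the 500 ms outlier shows up only as a doubled average, not as a
500 ms spike, and max/avg graphs built from per-interval averages can
understate what individual requests actually see.)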

  Also, was your control system w/ the patch, but w/ the sysctl set to
  zero to possibly eliminate any code alignment issues?
 
 Both systems use the same code base and build. The patched system has the 
 patch included; the control system does not have the patch. I can rerun my 
 tests with the sysctl set to zero and use that as the control. So, the 
 answer to your question is no. 

I don't believe the code would make a difference, but I mostly wanted to
know what the control was...

  graphs show max/avg disk read service times for both systems across 36
  spinning drives. both systems are relatively busy serving production
  traffic (about 10 Gbps at peak). grey shaded areas on the graphs
  represent time when systems are refreshing their content, i.e. disks
  are both reading and writing at the same time.
  
  Can you describe why you think this change makes an improvement?  Unless
  you're running 10k or 15k RPM drives, 128 seems like a large number.. as
  that's about half the number of IOPs that a normal HD handles in a second..
 
 Our (Netflix) load is basically random disk io. We have tweaked the system to 
 ensure that our io path is wide enough, i.e. we read 1MB per disk io for the 
 majority of the requests. However, the offsets we read from are all over the 
 place. It appears that we are getting into a situation where larger offsets 
 are getting delayed because smaller offsets are jumping ahead of them. Forcing 
 the bioq insert-tail operation and effectively moving the insertion point 
 seems to help avoid getting into this situation. And, no, we don't use 10k or 
 15k drives. Just regular enterprise 7200 RPM SATA drives. 

I assume that the 1mb reads are then further broken up into 8 128kb
reads? so it's more like every 16 reads in your workload that you
insert the ordered io...
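
(Spelling out the arithmetic behind that estimate, which assumes the
stock 128kB maximum transfer size -- an assumption Maksim corrects in
his reply, since they raised *PHYS to 1MB:

    1 MB read / 128 kB per physical io  = 8 ios per logical read
    batch size 128 / 8 ios per read     = a forced tail insert roughly
                                          every 16th 1 MB read)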

I want to make sure that we choose the right value for this number..
What number of IOPs are you seeing?

  I assume you must be regularly seeing queue depths of 128+ for this
  code to make a difference, do you see that w/ gstat?
 
 No, we don't see large (128+) queue sizes in gstat data. The way I see it, we 
 don't have to have a deep queue here. We could just have a steady stream of io 
 requests where new, smaller offsets consistently jump ahead of older, larger 
 offsets. In fact, gstat data show a shallow queue of 5 or fewer items.

Sorry, I misread the patch the first time...  After rereading it,
the short summary is that if there hasn't been an ordered bio
(bioq_insert_tail) after 128 requests, the next request will be
ordered...
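
To make that concrete, below is a small userspace C sketch of the batching
logic -- an illustration under simplifying assumptions, not the kernel code:
the queue is just an array of offsets, the elevator is reduced to "ascending
by offset, but never ahead of the last forced tail insert", and every offset
and count is invented.  It counts how many requests end up queued ahead of a
single large-offset request while a steady stream of smaller offsets keeps
arriving, with the forced tail insert disabled (batch size 0, as in a control
run) and enabled (128, the default; 1024, the value Maksim mentions running).

/* toy_bioq.c -- illustration only, not FreeBSD code */
#include <stdio.h>

#define QUEUE_MAX 8192

static long queue[QUEUE_MAX];   /* pending request offsets, front first */
static int  qlen;
static int  batched;            /* sorted inserts since last tail insert */
static int  insert_point;       /* index of the last forced-tail request */
static int  bioq_batchsize;     /* mirrors the debug.bioq_batchsize sysctl */

static void
insert_tail(long offset)
{
	queue[qlen] = offset;
	insert_point = qlen;    /* later sorted inserts stay behind this one */
	qlen++;
	batched = 0;            /* a tail insert resets the batch counter */
}

static void
insert_sorted(long offset)
{
	int i;

	if (bioq_batchsize > 0 && batched > bioq_batchsize) {
		insert_tail(offset);    /* the behaviour the patch adds */
		return;
	}
	/* simplified elevator: ascending by offset, but never ahead of
	 * the insert point left by the last tail insert */
	for (i = qlen; i > insert_point + 1 && queue[i - 1] > offset; i--)
		queue[i] = queue[i - 1];
	queue[i] = offset;
	qlen++;
	batched++;
}

static void
run(int batchsize)
{
	long large = 1000000;   /* the request that tends to be starved */
	long small;
	int  i, ahead;

	qlen = 0;
	batched = 0;
	insert_point = -1;
	bioq_batchsize = batchsize;

	insert_sorted(large);
	for (small = 1; small <= 5000; small++)
		insert_sorted(small);   /* steady stream of smaller offsets */

	for (ahead = 0, i = 0; i < qlen && queue[i] != large; i++)
		ahead++;
	printf("batchsize %4d: %4d requests queued ahead of the large one\n",
	    batchsize, ahead);
}

int
main(void)
{
	run(0);         /* 0 disables the forced tail insert */
	run(128);       /* the patch default */
	run(1024);      /* the value Maksim says he is running now */
	return (0);
}

The real queue is more dynamic (requests are also being removed and the
elevator wraps), but the cutoff effect is the same: once the counter trips,
the insert point moves behind the waiting request, so later sorted inserts
can no longer jump ahead of it.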

  Also, do you see a similar throughput of the system?
 
 Yes. We do see almost identical throughput from both systems.  I have not 
 pushed the system to its limit yet, but having much smoother disk read 
 service time is important for us because we use it as one of the components 
 of system health metrics. We also need to ensure that disk io requests are 
 actually dispatched to the disk in a timely manner. 

Per above, have you measured at the application layer that you are
getting better latency times on your reads?  Maybe by doing a ktrace
of the io, and calculating times between read and return or something
like that...
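
A minimal userspace sketch of that kind of application-level measurement is
below: it times each 1MB pread() at a random offset and reports the average
and worst case.  The file argument, the read count and the use of O_DIRECT to
keep the buffer cache out of the picture are assumptions for the example, not
anything specified in the thread.

/* readlat.c -- illustrative only; build with: cc -o readlat readlat.c */
#include <sys/types.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLKSIZE	(1024 * 1024)	/* 1MB per read, matching the workload */
#define NREADS	1024

int
main(int argc, char **argv)
{
	static char buf[BLKSIZE];
	struct timespec t0, t1;
	double us, sum_us = 0, max_us = 0;
	off_t nblocks, offset;
	int fd, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return (1);
	}
	/* O_DIRECT is advisory on FreeBSD; it helps keep the buffer
	 * cache from hiding the actual disk service time */
	if ((fd = open(argv[1], O_RDONLY | O_DIRECT)) == -1) {
		perror("open");
		return (1);
	}
	nblocks = lseek(fd, 0, SEEK_END) / BLKSIZE;
	if (nblocks < 1) {
		fprintf(stderr, "file smaller than one block\n");
		return (1);
	}

	for (i = 0; i < NREADS; i++) {
		offset = (off_t)(arc4random() % nblocks) * BLKSIZE;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (pread(fd, buf, BLKSIZE, offset) == -1) {
			perror("pread");
			return (1);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		us = (t1.tv_sec - t0.tv_sec) * 1e6 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e3;
		sum_us += us;
		if (us > max_us)
			max_us = us;
	}
	printf("%d random 1MB reads: avg %.0f us, max %.0f us\n",
	    NREADS, sum_us / NREADS, max_us);
	close(fd);
	return (0);
}

Comparing the max line between batchsize settings would show whether the
smoother gstat numbers correspond to fewer application-visible outliers.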

Have you looked at the geom disk scheduler work that Luigi did a few
years back?  There have been known issues w/ our io scheduler for a
long time...  If you search the mailing lists, you'll see lots of
reports of some processes starving out others, probably due to a
similar issue...  I've seen similar unfair behavior between processes,
but haven't spent time tracking it down...

It does look like a good improvement though...

Thanks for the work!

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.