Re: [rfc] small bioq patch
Maksim Yevmenkin wrote this message on Tue, Oct 15, 2013 at 11:15 -0700:
> On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney j...@funkthat.com wrote:
> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
> > > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
> > > > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> > > > > i would like to submit the attached bioq patch for review and
> > > > > comments. this is proof of concept. it helps with smoothing
> > > > > disk read service times and appears to eliminate outliers.
> > > > > please see attached pictures (about a week's worth of data)
> > > > >
> > > > > - c034 control unmodified system
> > > > > - c044 patched system
> > > >
> > > > Can you describe how you got this data?  Were you using the gstat
> > > > code or some other code?
> > >
> > > Yes, it's basically gstat data.
> >
> > The reason I ask this is that I don't think the data you are getting
> > from gstat is what you think you are...  It accumulates time for a
> > set of operations and then divides by the count...  So I'm not sure
> > if the stat improvements you are seeing are as meaningful as you
> > might think they are...
>
> yes, i'm aware of it. however, i'm not aware of better tools. we
> also use dtrace and PCM/PMC. ktrace is not particularly usable for us
> because it does not really work well when we push system above 5 Gbps.
> in order to actually see any issues we need to push system to 10 Gbps
> range at least.

So, I put a test together using dtrace...  And my test wasn't a big
test, but I put HEAD on a 16G fs, and did a:

find /mnt -type f -exec cat {} +

And I varied the sysctl values as 0, 16, 32, 64, 128, 256 and 512...
I was unable to get a significant difference between the runs...
Between each run I would unmount the fs to make sure the fs cache was
clean...

I've posted my scripts/results at:
https://people.freebsd.org/~jmg/disklat/

genresults runs the dtrace script and gets the results
makeresults extracts the results from each run
disklatencycmd.d is the dtrace script I used
catall/catallp is the script containing the find command...

I tried two different versions of the command, one single threaded as
above, the other using xargs w/ 4 processes running...  the p4 results
are from the xargs, the sing is the find -exec command above...

Though I will admit that before the patch, on occasion I did see a
max latency of 6s, but in these tests I didn't see it...

The disk was:
Model Family:     Maxtor MaXLine Pro 500
Device Model:     Maxtor 7H500F0
Firmware Version: HA431DN0
User Capacity:    500,107,862,016 bytes [500 GB]

and the partition was close to the beginning of the drive...

> > > > > graphs show max/avg disk read service times for both systems
> > > > > across 36 spinning drives. both systems are relatively busy
> > > > > serving production traffic (about 10 Gbps at peak). grey shaded
> > > > > areas on the graphs represent time when systems are refreshing
> > > > > their content, i.e. disks are both reading and writing at the
> > > > > same time.
> > > >
> > > > Can you describe why you think this change makes an improvement?
> > > > Unless you're running 10k or 15k RPM drives, 128 seems like a
> > > > large number.. as that's about half the number of IOPs that a
> > > > normal HD handles in a second..
> > >
> > > Our (Netflix) load is basically random disk io. We have tweaked the
> > > system to ensure that our io path is wide enough, i.e. we read 1mb
> > > per disk io for majority of the requests. However offsets we read
> > > from are all over the place. It appears that we are getting into
> > > situation where larger offsets are getting delayed because smaller
> > > offsets are jumping ahead of them. Forcing bioq insert tail
> > > operation and effectively moving insertion point seems to help
> > > avoiding getting into this situation. And, no. We don't use 10k or
> > > 15k drives. Just regular enterprise 7200 sata drives.
> >
> > I assume that the 1mb reads are then further broken up into 8 128kb
> > reads? so it's more like every 16 reads in your work load that you
> > insert the ordered io...
>
> i'm not sure where 128kb comes from. are you referring to
> MAXPHYS/DFLTPHYS? if so, then, no, we have increased *PHYS to 1MB.

Ahh, ok, so another difference between your system and HEAD...

> > I want to make sure that we choose the right value for this number..
> > What number of IOPs are you seeing?
>
> generally we see 100 IOPs per disk on a system pushing 10+ Gbps.

w/ 1MB IO's, that makes sense...

> i've experimented with different numbers on our system and i did not
> see much of a difference on our workload. i'm up to a value of 1024
> now. higher numbers seem to produce slightly bigger difference between
> average and max time, but i do not think it's statistically
> meaningful. general shape of the curve remains smooth for all tried
> values so far.
>
> [...]
>
> > > > Also, do you see a similar throughput of the system?
> > >
> > > Yes. We do see almost identical throughput from both systems.  I
> > > have not pushed the system to its limit yet, but having much
> > > smoother disk read service time is important for us because we
Re: [rfc] small bioq patch
On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney j...@funkthat.com wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
> > > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> > > > i would like to submit the attached bioq patch for review and
> > > > comments. this is proof of concept. it helps with smoothing disk
> > > > read service times and appears to eliminate outliers. please see
> > > > attached pictures (about a week's worth of data)
> > > >
> > > > - c034 control unmodified system
> > > > - c044 patched system
> > >
> > > Can you describe how you got this data?  Were you using the gstat
> > > code or some other code?
> >
> > Yes, it's basically gstat data.
>
> The reason I ask this is that I don't think the data you are getting
> from gstat is what you think you are...  It accumulates time for a set
> of operations and then divides by the count...  So I'm not sure if the
> stat improvements you are seeing are as meaningful as you might think
> they are...

yes, i'm aware of it. however, i'm not aware of better tools. we
also use dtrace and PCM/PMC. ktrace is not particularly usable for us
because it does not really work well when we push system above 5 Gbps.
in order to actually see any issues we need to push system to 10 Gbps
range at least.

> > > > graphs show max/avg disk read service times for both systems
> > > > across 36 spinning drives. both systems are relatively busy
> > > > serving production traffic (about 10 Gbps at peak). grey shaded
> > > > areas on the graphs represent time when systems are refreshing
> > > > their content, i.e. disks are both reading and writing at the
> > > > same time.
> > >
> > > Can you describe why you think this change makes an improvement?
> > > Unless you're running 10k or 15k RPM drives, 128 seems like a
> > > large number.. as that's about half the number of IOPs that a
> > > normal HD handles in a second..
> >
> > Our (Netflix) load is basically random disk io. We have tweaked the
> > system to ensure that our io path is wide enough, i.e. we read 1mb
> > per disk io for majority of the requests. However offsets we read
> > from are all over the place. It appears that we are getting into
> > situation where larger offsets are getting delayed because smaller
> > offsets are jumping ahead of them. Forcing bioq insert tail operation
> > and effectively moving insertion point seems to help avoiding getting
> > into this situation. And, no. We don't use 10k or 15k drives. Just
> > regular enterprise 7200 sata drives.
>
> I assume that the 1mb reads are then further broken up into 8 128kb
> reads? so it's more like every 16 reads in your work load that you
> insert the ordered io...

i'm not sure where 128kb comes from. are you referring to
MAXPHYS/DFLTPHYS? if so, then, no, we have increased *PHYS to 1MB.

> I want to make sure that we choose the right value for this number..
> What number of IOPs are you seeing?

generally we see 100 IOPs per disk on a system pushing 10+ Gbps.
i've experimented with different numbers on our system and i did not
see much of a difference on our workload. i'm up to a value of 1024
now. higher numbers seem to produce slightly bigger difference between
average and max time, but i do not think it's statistically meaningful.
general shape of the curve remains smooth for all tried values so far.

[...]

> > > Also, do you see a similar throughput of the system?
> >
> > Yes. We do see almost identical throughput from both systems.  I have
> > not pushed the system to its limit yet, but having much smoother disk
> > read service time is important for us because we use it as one of the
> > components of system health metrics. We also need to ensure that disk
> > io request is actually dispatched to the disk in a timely manner.
>
> Per above, have you measured at the application layer that you are
> getting better latency times on your reads?  Maybe by doing a ktrace
> of the io, and calculating times between read and return or something
> like that...

ktrace is not particularly useful. i can see if i can come up with
dtrace probe or something. our application (or rather clients) are
_very_ sensitive to latency. having read service times outliers is not
very good for us.

> Have you looked at the geom disk schedulers work that Luigi did a few
> years back?  There have been known issues w/ our io scheduler for a
> long time...  If you search the mailing lists, you'll see lots of
> reports from some processes starving out others, probably due to a
> similar issue...  I've seen similar unfair behavior between processes,
> but spend time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

> It does look like a good improvement though...
>
> Thanks for the work!

ok :) i'm interested to hear from people who have different workload
profile. for example lots of iops, i.e. very small files reads or
something like that.

thanks,
max
_______________________________________________
freebsd-current@freebsd.org mailing list
[rfc] small bioq patch
hello,

i would like to submit the attached bioq patch for review and
comments. this is proof of concept. it helps with smoothing disk read
service times and appears to eliminate outliers. please see attached
pictures (about a week's worth of data)

- c034 control unmodified system
- c044 patched system

graphs show max/avg disk read service times for both systems across 36
spinning drives. both systems are relatively busy serving production
traffic (about 10 Gbps at peak). grey shaded areas on the graphs
represent time when systems are refreshing their content, i.e. disks
are both reading and writing at the same time.

thanks,
max

Index: branches/freebsd10/src/sys/kern/subr_disk.c
===================================================================
diff -u -N -r2284 -r2698
--- branches/freebsd10/src/sys/kern/subr_disk.c	(.../subr_disk.c)	(revision 2284)
+++ branches/freebsd10/src/sys/kern/subr_disk.c	(.../subr_disk.c)	(revision 2698)
@@ -21,8 +21,13 @@
 #include <sys/bio.h>
 #include <sys/conf.h>
 #include <sys/disk.h>
+#include <sys/sysctl.h>
 #include <geom/geom_disk.h>
 
+static int bioq_batchsize = 128;
+SYSCTL_INT(_debug, OID_AUTO, bioq_batchsize, CTLFLAG_RW,
+	&bioq_batchsize, 0, "BIOQ batch size");
+
 /*-
  * Disk error is the preface to plaintive error messages
  * about failing disk transfers.  It prints messages of the form
@@ -150,6 +155,8 @@
 	TAILQ_INIT(&head->queue);
 	head->last_offset = 0;
 	head->insert_point = NULL;
+	head->total = 0;
+	head->batched = 0;
 }
 
 void
@@ -163,6 +170,7 @@
 		head->insert_point = NULL;
 
 	TAILQ_REMOVE(&head->queue, bp, bio_queue);
+	head->total--;
 }
 
 void
@@ -181,13 +189,16 @@
 	if (head->insert_point == NULL)
 		head->last_offset = bp->bio_offset;
 	TAILQ_INSERT_HEAD(&head->queue, bp, bio_queue);
+	head->total++;
 }
 
 void
 bioq_insert_tail(struct bio_queue_head *head, struct bio *bp)
 {
 
 	TAILQ_INSERT_TAIL(&head->queue, bp, bio_queue);
+	head->total++;
+	head->batched = 0;
 	head->insert_point = bp;
 	head->last_offset = bp->bio_offset;
 }
@@ -246,6 +257,11 @@
 		return;
 	}
 
+	if (bioq_batchsize > 0 && head->batched > bioq_batchsize) {
+		bioq_insert_tail(head, bp);
+		return;
+	}
+
 	prev = NULL;
 	key = bioq_bio_key(head, bp);
 	cur = TAILQ_FIRST(&head->queue);
@@ -264,4 +280,7 @@
 		TAILQ_INSERT_HEAD(&head->queue, bp, bio_queue);
 	else
 		TAILQ_INSERT_AFTER(&head->queue, prev, bp, bio_queue);
+
+	head->total++;
+	head->batched++;
 }
Index: branches/freebsd10/src/sys/sys/bio.h
===================================================================
diff -u -N -r2284 -r2698
--- branches/freebsd10/src/sys/sys/bio.h	(.../bio.h)	(revision 2284)
+++ branches/freebsd10/src/sys/sys/bio.h	(.../bio.h)	(revision 2698)
@@ -129,6 +129,8 @@
 	TAILQ_HEAD(bio_queue, bio) queue;
 	off_t last_offset;
 	struct bio *insert_point;
+	int total;
+	int batched;
 };
 
 extern struct vm_map *bio_transient_map;
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
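The counter logic in the bioq_disksort() hunk can be exercised outside
the kernel. What follows is only a rough userspace sketch, not the
patch itself: struct batch_model, model_disksort() and
count_forced_tail() are hypothetical names made up for illustration,
and the sketch assumes the comparison in the patch is the strict
"batched > bioq_batchsize". It ignores offsets entirely and tracks only
the two new counters.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for the two counters the patch adds. */
struct batch_model {
	int total;	/* requests accounted as queued */
	int batched;	/* sorted inserts since the last tail insert */
};

/* Returns 1 if the request is forced to the tail, 0 if it is sorted. */
static int
model_disksort(struct batch_model *m, int batchsize)
{
	if (batchsize > 0 && m->batched > batchsize) {
		/* Mirrors bioq_insert_tail(): count it, reset the batch. */
		m->total++;
		m->batched = 0;
		return (1);
	}
	/* Mirrors the sorted-insert path at the end of bioq_disksort(). */
	m->total++;
	m->batched++;
	return (0);
}

/* Feeds nrequests sortable requests and counts the forced tail inserts. */
static int
count_forced_tail(int nrequests, int batchsize)
{
	struct batch_model m = { 0, 0 };
	int i, forced = 0;

	assert(nrequests >= 0);
	for (i = 0; i < nrequests; i++)
		forced += model_disksort(&m, batchsize);
	return (forced);
}
```

Under that assumption, with the default of 128 the first forced tail
insert is the 130th request, since 129 sorted inserts pass the strict
comparison first, and a batch size of 0 disables the behaviour.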
Re: [rfc] small bioq patch
Maksim,

The graphs were not attached.  Perhaps the list stripped them.  Could
you post them on the web instead?

Thanks,

Eric

On 10/11/2013 01:17 PM, Maksim Yevmenkin wrote:
> hello,
>
> i would like to submit the attached bioq patch for review and
> comments. this is proof of concept. it helps with smoothing disk read
> service times and appears to eliminate outliers. please see attached
> pictures (about a week's worth of data)
>
> - c034 control unmodified system
> - c044 patched system
>
> graphs show max/avg disk read service times for both systems across 36
> spinning drives. both systems are relatively busy serving production
> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
> represent time when systems are refreshing their content, i.e. disks
> are both reading and writing at the same time.
>
> thanks,
> max
Re: [rfc] small bioq patch
On Fri, Oct 11, 2013 at 12:25 PM, Eric van Gyzen e...@vangyzen.net wrote:
> Maksim,
>
> The graphs were not attached.  Perhaps the list stripped them.  Could
> you post them on the web instead?

err... sorry about it.. please try

http://people.freebsd.org/~emax/c034.png
http://people.freebsd.org/~emax/c044.png

thanks
max

> > hello,
> >
> > i would like to submit the attached bioq patch for review and
> > comments. this is proof of concept. it helps with smoothing disk read
> > service times and appears to eliminate outliers. please see attached
> > pictures (about a week's worth of data)
> >
> > - c034 control unmodified system
> > - c044 patched system
> >
> > graphs show max/avg disk read service times for both systems across
> > 36 spinning drives. both systems are relatively busy serving
> > production traffic (about 10 Gbps at peak). grey shaded areas on the
> > graphs represent time when systems are refreshing their content,
> > i.e. disks are both reading and writing at the same time.
> >
> > thanks,
> > max
Re: [rfc] small bioq patch
Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> i would like to submit the attached bioq patch for review and
> comments. this is proof of concept. it helps with smoothing disk read
> service times and appears to eliminate outliers. please see attached
> pictures (about a week's worth of data)
>
> - c034 control unmodified system
> - c044 patched system

Can you describe how you got this data?  Were you using the gstat
code or some other code?

Also, was your control system w/ the patch, but w/ the sysctl set to
zero to possibly eliminate any code alignment issues?

> graphs show max/avg disk read service times for both systems across 36
> spinning drives. both systems are relatively busy serving production
> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
> represent time when systems are refreshing their content, i.e. disks
> are both reading and writing at the same time.

Can you describe why you think this change makes an improvement?
Unless you're running 10k or 15k RPM drives, 128 seems like a large
number.. as that's about half the number of IOPs that a normal HD
handles in a second..

I assume you must be regularly seeing queue depths of 128+ for this
code to make a difference, do you see that w/ gstat?

Also, do you see a similar throughput of the system?

--
John-Mark Gurney    Voice: +1 415 225 5579
All that I will do, has been done, All that I have, has not.
Re: [rfc] small bioq patch
On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> > i would like to submit the attached bioq patch for review and
> > comments. this is proof of concept. it helps with smoothing disk read
> > service times and appears to eliminate outliers. please see attached
> > pictures (about a week's worth of data)
> >
> > - c034 control unmodified system
> > - c044 patched system
>
> Can you describe how you got this data?  Were you using the gstat
> code or some other code?

Yes, it's basically gstat data.

> Also, was your control system w/ the patch, but w/ the sysctl set to
> zero to possibly eliminate any code alignment issues?

Both systems use the same code base and build. Patched system has patch
included, control system does not have the patch. I can rerun my tests
with sysctl set to zero and use it as control. So, the answer to your
question is no.

> > graphs show max/avg disk read service times for both systems across
> > 36 spinning drives. both systems are relatively busy serving
> > production traffic (about 10 Gbps at peak). grey shaded areas on the
> > graphs represent time when systems are refreshing their content,
> > i.e. disks are both reading and writing at the same time.
>
> Can you describe why you think this change makes an improvement?
> Unless you're running 10k or 15k RPM drives, 128 seems like a large
> number.. as that's about half the number of IOPs that a normal HD
> handles in a second..

Our (Netflix) load is basically random disk io. We have tweaked the
system to ensure that our io path is wide enough, i.e. we read 1mb per
disk io for majority of the requests. However offsets we read from are
all over the place. It appears that we are getting into situation where
larger offsets are getting delayed because smaller offsets are jumping
ahead of them. Forcing bioq insert tail operation and effectively
moving insertion point seems to help avoiding getting into this
situation. And, no. We don't use 10k or 15k drives. Just regular
enterprise 7200 sata drives.

> I assume you must be regularly seeing queue depths of 128+ for this
> code to make a difference, do you see that w/ gstat?

No, we don't see large (128+) queue sizes in gstat data. The way I see
it, we don't have to have deep queue here. We could just have a steady
stream of io requests where new, smaller, offsets consistently jump
ahead of older, larger offsets. In fact gstat data show shallow queue
of 5 or less items.

> Also, do you see a similar throughput of the system?

Yes. We do see almost identical throughput from both systems.  I have
not pushed the system to its limit yet, but having much smoother disk
read service time is important for us because we use it as one of the
components of system health metrics. We also need to ensure that disk
io request is actually dispatched to the disk in a timely manner.

Thanks
Max
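The shallow-queue starvation described above can be reproduced with a
toy model. This is a sketch under loud assumptions, not the kernel
code: struct simq and its helpers are hypothetical, requests are sorted
by plain ascending offset (the real bioq_disksort() keys offsets
relative to last_offset), and the workload is artificial: one "victim"
request at a large offset, then on every tick one newcomer at a smaller
offset followed by one dispatch. The barrier index stands in for
insert_point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QCAP 16

/* Hypothetical array-backed stand-in for struct bio_queue_head. */
struct simq {
	long off[QCAP];	/* queued offsets, index 0 is dispatched next */
	int n;		/* number of queued requests */
	int barrier;	/* entries [0, barrier) may not be overtaken */
	int batched;	/* sorted inserts since the last tail insert */
};

static void
simq_insert_tail(struct simq *q, long off)
{
	assert(q->n < QCAP);
	q->off[q->n++] = off;
	q->barrier = q->n;	/* later inserts must queue behind this one */
	q->batched = 0;
}

static void
simq_disksort(struct simq *q, long off, int batchsize)
{
	int i;

	if (batchsize > 0 && q->batched > batchsize) {
		simq_insert_tail(q, off);
		return;
	}
	assert(q->n < QCAP);
	/* Find the first slot past the barrier holding a larger offset. */
	for (i = q->barrier; i < q->n && off >= q->off[i]; i++)
		;
	memmove(&q->off[i + 1], &q->off[i],
	    (size_t)(q->n - i) * sizeof(q->off[0]));
	q->off[i] = off;
	q->n++;
	q->batched++;
}

static long
simq_dispatch(struct simq *q)
{
	long off;

	assert(q->n > 0);
	off = q->off[0];
	memmove(&q->off[0], &q->off[1],
	    (size_t)(q->n - 1) * sizeof(q->off[0]));
	q->n--;
	if (q->barrier > 0)
		q->barrier--;
	return (off);
}

/* Returns the tick on which the victim is dispatched, or -1 if never. */
static int
ticks_until_victim(int batchsize, int max_ticks)
{
	struct simq q = { .n = 0, .barrier = 0, .batched = 0 };
	const long victim = 1000000;
	int t;

	simq_disksort(&q, victim, batchsize);
	for (t = 1; t <= max_ticks; t++) {
		/* A newcomer at a smaller offset arrives, then one dispatch. */
		simq_disksort(&q, (long)t, batchsize);
		if (simq_dispatch(&q) == victim)
			return (t);
	}
	return (-1);
}
```

In this model the queue never holds more than two requests, yet with
batching disabled (batchsize 0) the victim is never dispatched within
the simulated window; with a batch size of B it is dispatched on tick
B + 1.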
Re: [rfc] small bioq patch
Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
> On Oct 11, 2013, at 2:52 PM, John-Mark Gurney j...@funkthat.com wrote:
> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> > > i would like to submit the attached bioq patch for review and
> > > comments. this is proof of concept. it helps with smoothing disk
> > > read service times and appears to eliminate outliers. please see
> > > attached pictures (about a week's worth of data)
> > >
> > > - c034 control unmodified system
> > > - c044 patched system
> >
> > Can you describe how you got this data?  Were you using the gstat
> > code or some other code?
>
> Yes, it's basically gstat data.

The reason I ask this is that I don't think the data you are getting
from gstat is what you think you are...  It accumulates time for a set
of operations and then divides by the count...  So I'm not sure if the
stat improvements you are seeing are as meaningful as you might think
they are...

> > Also, was your control system w/ the patch, but w/ the sysctl set to
> > zero to possibly eliminate any code alignment issues?
>
> Both systems use the same code base and build. Patched system has
> patch included, control system does not have the patch. I can rerun my
> tests with sysctl set to zero and use it as control. So, the answer to
> your question is no.

I don't believe the code would make a difference, but more wanted to
know what control was...

> > > graphs show max/avg disk read service times for both systems
> > > across 36 spinning drives. both systems are relatively busy serving
> > > production traffic (about 10 Gbps at peak). grey shaded areas on
> > > the graphs represent time when systems are refreshing their
> > > content, i.e. disks are both reading and writing at the same time.
> >
> > Can you describe why you think this change makes an improvement?
> > Unless you're running 10k or 15k RPM drives, 128 seems like a large
> > number.. as that's about half the number of IOPs that a normal HD
> > handles in a second..
>
> Our (Netflix) load is basically random disk io. We have tweaked the
> system to ensure that our io path is wide enough, i.e. we read 1mb per
> disk io for majority of the requests. However offsets we read from are
> all over the place. It appears that we are getting into situation
> where larger offsets are getting delayed because smaller offsets are
> jumping ahead of them. Forcing bioq insert tail operation and
> effectively moving insertion point seems to help avoiding getting into
> this situation. And, no. We don't use 10k or 15k drives. Just regular
> enterprise 7200 sata drives.

I assume that the 1mb reads are then further broken up into 8 128kb
reads? so it's more like every 16 reads in your work load that you
insert the ordered io...

I want to make sure that we choose the right value for this number..
What number of IOPs are you seeing?

> > I assume you must be regularly seeing queue depths of 128+ for this
> > code to make a difference, do you see that w/ gstat?
>
> No, we don't see large (128+) queue sizes in gstat data. The way I see
> it, we don't have to have deep queue here. We could just have a steady
> stream of io requests where new, smaller, offsets consistently jump
> ahead of older, larger offsets. In fact gstat data show shallow queue
> of 5 or less items.

Sorry, I misread the patch the first time...  After rereading it,
the short summary is that if there hasn't been an ordered bio
(bioq_insert_tail) after 128 requests, the next request will be
ordered...

> > Also, do you see a similar throughput of the system?
>
> Yes. We do see almost identical throughput from both systems.  I have
> not pushed the system to its limit yet, but having much smoother disk
> read service time is important for us because we use it as one of the
> components of system health metrics. We also need to ensure that disk
> io request is actually dispatched to the disk in a timely manner.

Per above, have you measured at the application layer that you are
getting better latency times on your reads?  Maybe by doing a ktrace
of the io, and calculating times between read and return or something
like that...

Have you looked at the geom disk schedulers work that Luigi did a few
years back?  There have been known issues w/ our io scheduler for a
long time...  If you search the mailing lists, you'll see lots of
reports from some processes starving out others, probably due to a
similar issue...  I've seen similar unfair behavior between processes,
but spend time tracking it down...

It does look like a good improvement though...

Thanks for the work!

--
John-Mark Gurney    Voice: +1 415 225 5579
All that I will do, has been done, All that I have, has not.
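The concern about accumulate-then-divide statistics can be shown with a
few lines of arithmetic. This is a hedged sketch only: struct
interval_stats and its helpers are hypothetical and are not the gstat
or devstat code, and the latencies in the usage note are made-up
numbers, not measurements from this thread.

```c
#include <assert.h>

/* Hypothetical per-interval accumulator: total time, count, and max. */
struct interval_stats {
	double total_ms;	/* sum of service times in the interval */
	unsigned long count;	/* number of completed operations */
	double max_ms;		/* worst single service time seen */
};

static void
stats_record(struct interval_stats *s, double latency_ms)
{
	assert(latency_ms >= 0.0);
	s->total_ms += latency_ms;
	s->count++;
	if (latency_ms > s->max_ms)
		s->max_ms = latency_ms;
}

/* The accumulate-then-divide figure: total time over operation count. */
static double
stats_mean_ms(const struct interval_stats *s)
{
	return (s->count == 0 ? 0.0 : s->total_ms / (double)s->count);
}

/* Records n_fast operations at fast_ms plus one outlier at slow_ms. */
static struct interval_stats
stats_example(unsigned long n_fast, double fast_ms, double slow_ms)
{
	struct interval_stats s = { 0.0, 0, 0.0 };
	unsigned long i;

	for (i = 0; i < n_fast; i++)
		stats_record(&s, fast_ms);
	stats_record(&s, slow_ms);
	return (s);
}
```

For instance, stats_example(99, 10.0, 6000.0) yields a mean of about
69.9 ms while the worst single request took 6000 ms, so a per-interval
average can look reasonable even when one request in the interval was
badly delayed.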