Re: Increasing MAXPHYS
In message <20100322233607.gb1...@garage.freebsd.pl>, Pawel Jakub Dawidek writes:
>A class is suppose to interact with other classes only via GEOM, so I
>think it should be safe to choose g_up/g_down threads for each class
>individually, for example:
>
> /dev/ad0s1a (DEV)
>        |
>  g_up_0 + g_down_0
>        |
>   ad0s1a (BSD)
>        |
>  g_up_1 + g_down_1
>        |
>    ad0s1 (MBR)
>        |
>  g_up_2 + g_down_2
>        |
>      ad0 (DISK)

Uhm, that way you get _more_ context switches than today; today g_down
will typically push the requests all the way down through the stack
without a context switch.  (Similar for g_up.)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
:The whole point of the discussion, sans PHK's interlude, is to reduce the
:context switches and indirection, not to increase it.  But if you can show
:decreased latency/higher-iops benefits of increasing it, more power to you.
:I would think that the results of DFly's experiment with
:parallelism-via-more-queues would serve as a good warning, though.
:
:Scott

Well, I'm not sure which experiment you are referring to, but I'll assume
it's the network threading, which works quite well actually.  The protocol
threads can be matched against the toeplitz function, and in that case the
entire packet stream operates lockless.  Even without the matching we
still get good benefits from batching (e.g. via ether_input_chain()),
which drops the IPI and per-packet switch overhead basically to zero.  We
have other issues, but the protocol threads aren't one of them.

In any case, the lesson to learn with batching to a thread is that you
don't want the thread to immediately preempt the sender (if it happens to
be on the same cpu), or to generate an instant IPI (if going between
cpus).  This creates a degenerate case where you wind up with a thread
switch on each message or an excessive messaging interrupt rate... THAT is
what seriously screws up performance.

The key is to be able to batch multiple messages per thread switch when
under load and to be able to maintain a pipeline.  A single user-process
test case will always have a bit more latency and can wind up being
inefficient for a variety of other reasons (e.g. whether the target thread
is on the same cpu or not), but that becomes less relevant when the
machine is under load, so it's a self-correcting problem for the most
part.  Once the machine is under load, batching becomes highly efficient.

That is, latency != cpu cycle cost under load.  When the threads have
enough work to do they can pick up the next message without the cost of
entering a sleep state or needing a wakeup (or needing to generate an
actual IPI interrupt, etc).
Plus you can run lockless and you get excellent cache locality.  So as
long as you ensure these optimal operations become the norm under load,
you win.  Getting the threads to pipeline properly and avoid unnecessary
tsleeps and wakeups is the hard part.

--

But with regard to geom, I'd have to agree with you.  You don't want to
pipeline a single N-stage request through N threads.  One thread, sure...
that can be batched to reduce overhead.  N stages through N threads just
creates unnecessary latency, complicates your ability to maintain a
pipeline, and has a multiplicative effect on thread activity that negates
the advantage of having multiple cpus (and destroys cache locality as
well).

You could possibly use a different trick, at least for some of the simpler
transformations, and that is to replicate the control structures on a
per-cpu basis.  If you replicate the control structures on a per-cpu basis
then you can parallelize independent operations running through the same
set of devices and remove the bottlenecks.  The set of transformations for
a single BIO would be able to run lockless within a single thread, and the
control system as a whole would have one thread per cpu.  (Of course, a
RAID layer would require some rendezvous to deal with
contention/conflicts, but that's easily dealt with.)

That would be my suggestion.  We use that trick for our route tables in
DFly, and also for listen socket PCBs to remove choke points, and a few
other things like statistics gathering.

-Matt
Matthew Dillon
Re: Increasing MAXPHYS
Pawel Jakub Dawidek wrote:
> On Mon, Mar 22, 2010 at 08:23:43AM +, Poul-Henning Kamp wrote:
>> In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
>>> on 21/03/2010 16:05 Alexander Motin said the following:
>>>> Ivan Voras wrote:
>>>>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>>>>> barring specific class behaviour, it has a fair chance of working
>>>>> out of the box) but the incoming queue will need to also be broken
>>>>> up for greater effect.
>>>>
>>>> According to "notes", looks there is a good chance to obtain races,
>>>> as some places expect only one up and one down thread.
>>>
>>> I haven't given any deep thought to this issue, but I remember us
>>> discussing them over beer :-)
>>
>> The easiest way to obtain more parallelism, is to divide the mesh into
>> multiple independent meshes.
>>
>> This will do you no good if you have five disks in a RAID-5 config, but
>> if you have two disks each mounted on its own filesystem, you can run
>> a g_up & g_down for each of them.
>
> A class is suppose to interact with other classes only via GEOM, so I
> think it should be safe to choose g_up/g_down threads for each class
> individually, for example:
>
>  /dev/ad0s1a (DEV)
>         |
>   g_up_0 + g_down_0
>         |
>    ad0s1a (BSD)
>         |
>   g_up_1 + g_down_1
>         |
>     ad0s1 (MBR)
>         |
>   g_up_2 + g_down_2
>         |
>       ad0 (DISK)
>
> We could easly calculate g_down thread based on bio_to->geom->class and
> g_up thread based on bio_from->geom->class, so we know I/O requests for
> our class are always coming from the same threads.
>
> If we could make the same assumption for geoms it would allow for even
> better distribution.

Doesn't really help my problem however... I just want to access the base
provider directly with no geom thread involved.
Re: Increasing MAXPHYS
On Mar 22, 2010, at 5:36 PM, Pawel Jakub Dawidek wrote:
> On Mon, Mar 22, 2010 at 08:23:43AM +, Poul-Henning Kamp wrote:
>> In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
>>> on 21/03/2010 16:05 Alexander Motin said the following:
>>>> Ivan Voras wrote:
>>>>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>>>>> barring specific class behaviour, it has a fair chance of working
>>>>> out of the box) but the incoming queue will need to also be broken
>>>>> up for greater effect.
>>>>
>>>> According to "notes", looks there is a good chance to obtain races,
>>>> as some places expect only one up and one down thread.
>>>
>>> I haven't given any deep thought to this issue, but I remember us
>>> discussing them over beer :-)
>>
>> The easiest way to obtain more parallelism, is to divide the mesh into
>> multiple independent meshes.
>>
>> This will do you no good if you have five disks in a RAID-5 config, but
>> if you have two disks each mounted on its own filesystem, you can run
>> a g_up & g_down for each of them.
>
> A class is suppose to interact with other classes only via GEOM, so I
> think it should be safe to choose g_up/g_down threads for each class
> individually, for example:
>
>  /dev/ad0s1a (DEV)
>         |
>   g_up_0 + g_down_0
>         |
>    ad0s1a (BSD)
>         |
>   g_up_1 + g_down_1
>         |
>     ad0s1 (MBR)
>         |
>   g_up_2 + g_down_2
>         |
>       ad0 (DISK)
>
> We could easly calculate g_down thread based on bio_to->geom->class and
> g_up thread based on bio_from->geom->class, so we know I/O requests for
> our class are always coming from the same threads.
>
> If we could make the same assumption for geoms it would allow for even
> better distribution.

The whole point of the discussion, sans PHK's interlude, is to reduce the
context switches and indirection, not to increase it.  But if you can show
decreased latency/higher-iops benefits of increasing it, more power to
you.  I would think that the results of DFly's experiment with
parallelism-via-more-queues would serve as a good warning, though.

Scott
Re: Increasing MAXPHYS
On Mon, Mar 22, 2010 at 08:23:43AM +, Poul-Henning Kamp wrote:
> In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
> >on 21/03/2010 16:05 Alexander Motin said the following:
> >> Ivan Voras wrote:
> >>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
> >>> barring specific class behaviour, it has a fair chance of working
> >>> out of the box) but the incoming queue will need to also be broken
> >>> up for greater effect.
> >>
> >> According to "notes", looks there is a good chance to obtain races,
> >> as some places expect only one up and one down thread.
> >
> >I haven't given any deep thought to this issue, but I remember us
> >discussing them over beer :-)
>
> The easiest way to obtain more parallelism, is to divide the mesh into
> multiple independent meshes.
>
> This will do you no good if you have five disks in a RAID-5 config, but
> if you have two disks each mounted on its own filesystem, you can run
> a g_up & g_down for each of them.

A class is supposed to interact with other classes only via GEOM, so I
think it should be safe to choose g_up/g_down threads for each class
individually, for example:

 /dev/ad0s1a (DEV)
        |
  g_up_0 + g_down_0
        |
   ad0s1a (BSD)
        |
  g_up_1 + g_down_1
        |
    ad0s1 (MBR)
        |
  g_up_2 + g_down_2
        |
      ad0 (DISK)

We could easily calculate the g_down thread based on bio_to->geom->class
and the g_up thread based on bio_from->geom->class, so we know I/O
requests for our class are always coming from the same threads.

If we could make the same assumption for geoms it would allow for even
better distribution.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
p...@freebsd.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Re: Increasing MAXPHYS
In message <3c0b01821003221207p4e4eecabqb4f448813bf5a...@mail.gmail.com>,
Alexander Sack writes:
>Am I going crazy or does this sound a lot like Sun/SVR's stream based
>network stack?

That is a good and pertinent observation.

I did investigate a number of optimizations to the g_up/g_down scheme I
eventually adopted, but found none that gained anything justifying the
complexity they brought.  In some cases, the optimizations used more CPU
cycles than the straight g_up/g_down path.  But obviously, the
circumstances are vastly different with CPUs having 10 times higher
clock, multiple cores and SSD disks, so a fresh look at this tradeoff is
in order.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: Increasing MAXPHYS
On Mon, Mar 22, 2010 at 2:45 PM, M. Warner Losh wrote:
> In message: Scott Long writes:
> : I'd like to go in the opposite direction.  The queue-dispatch-queue
> : model of GEOM is elegant and easy to extend, but very wasteful for
> : the simple case, where the simple case is one or two simple
> : partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror
> : transform.  None of these need a dedicated dispatch context in order
> : to operate.  What I'd like to explore is compiling the GEOM stack at
> : creation time into a linear array of operations that happen without
> : a g_down/g_up context switch.  As providers and consumers taste each
> : other and build a stack, that stack gets compiled into a graph, and
> : that graph gets executed directly from the calling context, both
> : from the dev_strategy() side on the top and the bio_done() on the
> : bottom.  GEOM classes that need a detached context can mark
> : themselves as such, doing so will prevent a graph from being
> : created, and the current dispatch model will be retained.
>
> I have a few things to say on this.
>
> First, I've done similar things at past companies for systems that are
> similar to geom's queueing environment.  It is possible to convert the
> queueing nodes in the graph to filtering nodes in the graph.  Another
> way to look at this is to say you're implementing direct dispatch into
> geom's stack.  This can be both good and bad, but should reduce
> latency a lot.
>
> One problem that I see is that you are calling into the driver from a
> different set of contexts.  The queueing stuff was there to protect
> the driver from LoRs due to its routines being called from many
> different contexts, sometimes with other locks held (fact of life
> often in the kernel).
>
> So this certainly is something worth exploring, especially if we have
> optimized paths for up/down for certain geom classes while still
> allowing the current robust, but slow, paths for the more complicated
> nodes in the tree.  It remains to be seen if there's going to be issues
> around locking order, but we've hit that with both geom and ifnet in
> the past, so caution (eg, running with WITNESS turned on early and
> often) is advised.

Am I going crazy or does this sound a lot like Sun/SVR's stream based
network stack?  (Design and problems: stream stack locking was
notoriously tricky for the exact issue mentioned above, different running
contexts with different locking granularity/requirements.)

-aps
Re: Increasing MAXPHYS
In message: Scott Long writes:
: I'd like to go in the opposite direction.  The queue-dispatch-queue
: model of GEOM is elegant and easy to extend, but very wasteful for
: the simple case, where the simple case is one or two simple
: partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror
: transform.  None of these need a dedicated dispatch context in order
: to operate.  What I'd like to explore is compiling the GEOM stack at
: creation time into a linear array of operations that happen without
: a g_down/g_up context switch.  As providers and consumers taste each
: other and build a stack, that stack gets compiled into a graph, and
: that graph gets executed directly from the calling context, both
: from the dev_strategy() side on the top and the bio_done() on the
: bottom.  GEOM classes that need a detached context can mark
: themselves as such, doing so will prevent a graph from being
: created, and the current dispatch model will be retained.

I have a few things to say on this.

First, I've done similar things at past companies for systems that are
similar to geom's queueing environment.  It is possible to convert the
queueing nodes in the graph to filtering nodes in the graph.  Another
way to look at this is to say you're implementing direct dispatch into
geom's stack.  This can be both good and bad, but should reduce latency
a lot.

One problem that I see is that you are calling into the driver from a
different set of contexts.  The queueing stuff was there to protect the
driver from LoRs due to its routines being called from many different
contexts, sometimes with other locks held (fact of life often in the
kernel).

So this certainly is something worth exploring, especially if we have
optimized paths for up/down for certain geom classes while still
allowing the current robust, but slow, paths for the more complicated
nodes in the tree.  It remains to be seen if there's going to be issues
around locking order, but we've hit that with both geom and ifnet in
the past, so caution (eg, running with WITNESS turned on early and
often) is advised.

Warner
Re: Increasing MAXPHYS
On Mar 22, 2010, at 9:52 AM, Alexander Sack wrote:
> On Mon, Mar 22, 2010 at 8:39 AM, John Baldwin wrote:
>> On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
>>> On Sun, 21 Mar 2010 19:03:56 +0200
>>> Alexander Motin wrote:
>>>> Scott Long wrote:
>>>>> Are there non-CAM drivers that look at MAXPHYS, or that silently
>>>>> assume that MAXPHYS will never be more than 128k?
>>>>
>>>> That is a question.
>>>
>>> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>>>
>>> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
>>> MAXPHYS which is usually 128KiB.
>>>
>>> Some look at MAXPHYS to figure out other things; the details escape me.
>>>
>>> There's one driver which actually uses 100*MAXPHYS for something, but
>>> I didn't check the details.
>>>
>>> Lots of them were non-CAM drivers AFAICT.
>>
>> The problem is the drivers that _don't_ reference MAXPHYS.  The driver
>> author at the time "knew" that MAXPHYS was 128k, so he did the
>> MAXPHYS-dependent calculation and just put the result in the driver
>> (e.g. only supporting up to 32 segments (32 4k pages == 128k) in a bus
>> dma tag as a magic number to bus_dma_tag_create() w/o documenting that
>> the '32' was derived from 128k or what the actual hardware limit on
>> nsegments is).  These cannot be found by a simple grep, they require
>> manually inspecting each driver.
>
> 100% awesome comment.  On another kernel, I myself was guilty of this
> crime (I did have a nice comment though above the def).
>
> This has been a great thread since our application really needs some
> of the optimizations that are being thrown around here.  We have found
> in real-life performance testing that we are almost always either
> controller bound (i.e. adding more disks to spread IOPs has little to
> no effect in large array configurations on throughput; we suspect that
> is hitting the RAID controller's firmware limitations) or tps bound,
> i.e. I never thought going from 128k -> 256k per transaction would
> have a dramatic effect on throughput (but I never verified).
>
> Back to HBAs, AFAIK, every modern iteration of the most popular HBAs
> can easily do way more than a 128k scatter/gather I/O.  Do you guys
> know of any *modern* (circa within the last 3-4 years) that can not do
> more than 128k at a shot?

Transfers >64K are broken in MPT at the moment.  The hardware can do it,
the driver thinks it can do it, but it fails.  AAC hardware traditionally
cannot, but maybe the firmware has been improved in the past few years.
I know that there are other low-performance devices that can't do more
than 64 or 128K, but none are coming to mind at the moment.  Still, it
shouldn't be a universal assumption that all hardware can do big I/O's.

Another consideration is that some hardware can do big I/O's, but not
very efficiently.  Not all DMA engines are created equal, and moving to
compound commands and excessively long S/G lists can be a pessimization.
For example, MFI hardware does a hinted prefetch on the segment list,
but once you exceed a certain limit, that prefetch doesn't work anymore
and the firmware has to take the slow path to execute the i/o.  I
haven't quantified this penalty yet, but it's something that should be
thought about.

> In other words, I've always thought the limit was kernel imposed and
> not what the memory controller on the card can do (I certainly never
> got the impression talking with some of the IHVs over the years that
> they were designing their hardware for a 128k limit - I sure hope
> not!).

You'd be surprised at the engineering compromises and handicaps that are
committed at IHVs because of misguided marketers.

Scott
Re: Increasing MAXPHYS
On Mon, Mar 22, 2010 at 8:39 AM, John Baldwin wrote:
> On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
>> On Sun, 21 Mar 2010 19:03:56 +0200
>> Alexander Motin wrote:
>>> Scott Long wrote:
>>>> Are there non-CAM drivers that look at MAXPHYS, or that silently
>>>> assume that MAXPHYS will never be more than 128k?
>>>
>>> That is a question.
>>
>> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>>
>> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
>> MAXPHYS which is usually 128KiB.
>>
>> Some look at MAXPHYS to figure out other things; the details escape me.
>>
>> There's one driver which actually uses 100*MAXPHYS for something, but I
>> didn't check the details.
>>
>> Lots of them were non-CAM drivers AFAICT.
>
> The problem is the drivers that _don't_ reference MAXPHYS.  The driver
> author at the time "knew" that MAXPHYS was 128k, so he did the
> MAXPHYS-dependent calculation and just put the result in the driver
> (e.g. only supporting up to 32 segments (32 4k pages == 128k) in a bus
> dma tag as a magic number to bus_dma_tag_create() w/o documenting that
> the '32' was derived from 128k or what the actual hardware limit on
> nsegments is).  These cannot be found by a simple grep, they require
> manually inspecting each driver.

100% awesome comment.  On another kernel, I myself was guilty of this
crime (I did have a nice comment though above the def).

This has been a great thread since our application really needs some of
the optimizations that are being thrown around here.  We have found in
real-life performance testing that we are almost always either controller
bound (i.e. adding more disks to spread IOPs has little to no effect in
large array configurations on throughput; we suspect that is hitting the
RAID controller's firmware limitations) or tps bound, i.e. I never
thought going from 128k -> 256k per transaction would have a dramatic
effect on throughput (but I never verified).

Back to HBAs, AFAIK, every modern iteration of the most popular HBAs can
easily do way more than a 128k scatter/gather I/O.  Do you guys know of
any *modern* (circa within the last 3-4 years) that can not do more than
128k at a shot?

In other words, I've always thought the limit was kernel imposed and not
what the memory controller on the card can do (I certainly never got the
impression talking with some of the IHVs over the years that they were
designing their hardware for a 128k limit - I sure hope not!).

-aps
Re: Increasing MAXPHYS
On Mon, 22 Mar 2010 01:53, Alexander Motin wrote:
> In Message-Id: <4ba705cb.9090...@freebsd.org> jhell wrote:
>> On Sun, 21 Mar 2010 20:54, jhell@ wrote:
>>> I played with it on one re-compile of a kernel and for the sake of it
>>> DFLTPHYS=128 MAXPHYS=256 and found out that I could not cause a crash
>>> dump to be performed upon request (reboot -d) due to the boundary
>>> being hit for DMA which is 65536.  Obviously this would have to be
>>> adjusted in ata-dma.c.
>>>
>>> I suppose that there would have to be a better way to get the real
>>> allowable boundary from the running system instead of setting it
>>> statically.
>>>
>>> Other than the above I do not see a reason why not...  It is HEAD and
>>> this is the type of experimental stuff it was meant for.
>>
>> I should have also said that I also repeated the above without setting
>> DFLTPHYS and setting MAXPHYS to 256.
>
> It was a bad idea to increase DFLTPHYS.  It is not intended to be
> increased.

I just wanted to see what I could break; when I increased DFLTPHYS it was
just for that purpose.  It booted and everything was running after.  It
wasn't long enough to do any damage.

> About the DMA boundary, I do not quite understand the problem.  Yes,
> legacy ATA has a DMA boundary of 64K, but there is no problem to submit
> an S/G list of several segments.  How long ago have you tried it, on
> which controller and which diagnostics do you have?

atap...@pci0:0:31:1: class=0x01018a card=0x01271028 chip=0x24cb8086
rev=0x01 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82801DB/DBL (ICH4/ICH4-L) UltraATA/100 EIDE Controller'
    class    = mass storage
    subclass = ATA

I do not have any diagnostics, but if any are requested I do have the
kernels that I have tuned to the above values readily available to run
again.  The first time I tuned MAXPHYS was roughly about 7 weeks ago.
That was until I noticed I could not get a crash dump for a problem I was
having a week later and had to revert back to its default setting of 128.
The problem I had a week later was unrelated.

Two days ago when I saw this thread I recalled having modified MAXPHYS
but could not remember the problem it caused, so I re-enabled it again to
reproduce the problem for sureness.

Anything else you need, please ask.

Regards,

-- 
 jhell
Re: Increasing MAXPHYS
On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
> On Sun, 21 Mar 2010 19:03:56 +0200
> Alexander Motin wrote:
>> Scott Long wrote:
>>> Are there non-CAM drivers that look at MAXPHYS, or that silently
>>> assume that MAXPHYS will never be more than 128k?
>>
>> That is a question.
>
> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>
> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
> MAXPHYS which is usually 128KiB.
>
> Some look at MAXPHYS to figure out other things; the details escape me.
>
> There's one driver which actually uses 100*MAXPHYS for something, but I
> didn't check the details.
>
> Lots of them were non-CAM drivers AFAICT.

The problem is the drivers that _don't_ reference MAXPHYS.  The driver
author at the time "knew" that MAXPHYS was 128k, so he did the
MAXPHYS-dependent calculation and just put the result in the driver
(e.g. only supporting up to 32 segments (32 4k pages == 128k) in a bus
dma tag as a magic number to bus_dma_tag_create() w/o documenting that
the '32' was derived from 128k or what the actual hardware limit on
nsegments is).  These cannot be found by a simple grep; they require
manually inspecting each driver.

-- 
John Baldwin
Re: Increasing MAXPHYS
Quoting Scott Long (from Sat, 20 Mar 2010 12:17:33 -0600):
> code was actually taking advantage of the larger I/O's.  The
> improvement really depends on the workload, of course, and I wouldn't
> expect it to be noticeable for most people unless they're running
> something like a media server.

I don't think this is limited to media servers.  Think about situations
where you process a large amount of data sequentially: the sequential
access case in a big data-warehouse scenario, or a 3D render farm which
gets its huge amount of data from a shared resource (the "how many
render-clients can I support at the same time with my disk
infrastructure" scenario), or some of the bigtable/nosql stuff which
seems to be more and more popular at some sites.  There are enough
situations where sequential file access is the key performance metric
that I wouldn't say only media servers depend upon large sequential
I/O's.

Bye,
Alexander.

-- 
That's life.
What's life?
A magazine.
How much does it cost?
Two-fifty.
I only have a dollar.
That's life.

http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org       netchild @ FreeBSD.org  : PGP ID = 72077137
Re: Increasing MAXPHYS
On Sun, 21 Mar 2010 19:03:56 +0200
Alexander Motin wrote:
> Scott Long wrote:
>> Are there non-CAM drivers that look at MAXPHYS, or that silently
>> assume that MAXPHYS will never be more than 128k?
>
> That is a question.

I only did a quick&dirty grep looking for MAXPHYS in /sys.

Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
MAXPHYS which is usually 128KiB.

Some look at MAXPHYS to figure out other things; the details escape me.

There's one driver which actually uses 100*MAXPHYS for something, but I
didn't check the details.

Lots of them were non-CAM drivers AFAICT.

-- 
Gary Jennejohn
Re: Increasing MAXPHYS
In message <4ba633a0.2090...@icyb.net.ua>, Andriy Gapon writes:
>on 21/03/2010 16:05 Alexander Motin said the following:
>> Ivan Voras wrote:
>>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>>> barring specific class behaviour, it has a fair chance of working out
>>> of the box) but the incoming queue will need to also be broken up for
>>> greater effect.
>>
>> According to "notes", looks there is a good chance to obtain races, as
>> some places expect only one up and one down thread.
>
>I haven't given any deep thought to this issue, but I remember us
>discussing them over beer :-)

The easiest way to obtain more parallelism is to divide the mesh into
multiple independent meshes.

This will do you no good if you have five disks in a RAID-5 config, but
if you have two disks each mounted on its own filesystem, you can run
a g_up & g_down for each of them.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
p...@freebsd.org         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: Increasing MAXPHYS
jhell wrote:
> On Sun, 21 Mar 2010 20:54, jhell@ wrote:
>> I played with it on one re-compile of a kernel and for the sake of it
>> DFLTPHYS=128 MAXPHYS=256 and found out that I could not cause a crash
>> dump to be performed upon request (reboot -d) due to the boundary
>> being hit for DMA which is 65536.  Obviously this would have to be
>> adjusted in ata-dma.c.
>>
>> I suppose that there would have to be a better way to get the real
>> allowable boundary from the running system instead of setting it
>> statically.
>>
>> Other than the above I do not see a reason why not...  It is HEAD and
>> this is the type of experimental stuff it was meant for.
>
> I should have also said that I also repeated the above without setting
> DFLTPHYS and setting MAXPHYS to 256.

It was a bad idea to increase DFLTPHYS.  It is not intended to be
increased.

About the DMA boundary, I do not quite understand the problem.  Yes,
legacy ATA has a DMA boundary of 64K, but there is no problem to submit
an S/G list of several segments.  How long ago have you tried it, on
which controller and which diagnostics do you have?

-- 
Alexander Motin
Re: Increasing MAXPHYS
On Sun, 21 Mar 2010 20:54, jhell@ wrote:
> On Sun, 21 Mar 2010 10:04, mav@ wrote:
>> Julian Elischer wrote:
>>> In the Fusion-io driver we find that the limiting factor is not the
>>> size of MAXPHYS, but the fact that we can not push more than 170k tps
>>> through geom (in my test machine; I've seen more on some beefier
>>> machines), but that is only a limit on small transactions, or in the
>>> case of large transfers the DMA engine tops out before a bigger
>>> MAXPHYS would make any difference.
>>
>> Yes, GEOM is quite CPU-hungry on high request rates due to the number
>> of context switches.  But the impact probably may be reduced from two
>> sides: by reducing overhead per request, or by reducing the number of
>> requests.  Both ways may give benefits.  If the common opinion is not
>> to touch defaults now - OK, agreed.  (Note, Scott, I have agreed :))
>> But returning to the original question, does somebody know a real
>> situation when increased MAXPHYS still causes problems?  At least to
>> make it safe.
>
> I played with it on one re-compile of a kernel and for the sake of it
> DFLTPHYS=128 MAXPHYS=256 and found out that I could not cause a crash
> dump to be performed upon request (reboot -d) due to the boundary
> being hit for DMA which is 65536.  Obviously this would have to be
> adjusted in ata-dma.c.
>
> I suppose that there would have to be a better way to get the real
> allowable boundary from the running system instead of setting it
> statically.
>
> Other than the above I do not see a reason why not...  It is HEAD and
> this is the type of experimental stuff it was meant for.
>
> Regards,

I should have also said that I also repeated the above without setting
DFLTPHYS and setting MAXPHYS to 256.

Regards,

-- 
 jhell
Re: Increasing MAXPHYS
On Sun, 21 Mar 2010 10:04, mav@ wrote: Julian Elischer wrote: In the Fusion-io driver we find that the limiting factor is not the size of MAXPHYS, but the fact that we can not push more than 170k tps through geom (in my test machine; I've seen more on some beefier machines), but that is only a limit on small transactions, or in the case of large transfers the DMA engine tops out before a bigger MAXPHYS would make any difference. Yes, GEOM is quite CPU-hungry at high request rates due to the number of context switches. But the impact can probably be reduced from two sides: by reducing the overhead per request, or by reducing the number of requests. Both ways may give benefits. If the common opinion is not to touch the defaults now - OK, agreed. (Note, Scott, I have agreed :)) But returning to the original question, does somebody know of a real situation where an increased MAXPHYS still causes problems? At least to make it safe. I played with it on one re-compile of a kernel and for the sake of it set DFLTPHYS=128 MAXPHYS=256, and found out that I could not cause a crash dump to be performed upon request (reboot -d) due to the boundary being hit for DMA, which is 65536. Obviously this would have to be adjusted in ata-dma.c. I suppose that there would have to be a better way to get the real allowable boundary from the running system instead of setting it statically. Other than the above I do not see a reason why not... It is HEAD and this is the type of experimental stuff it was meant for. Regards, -- jhell ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Scott Long wrote: I agree that more threads just creates many more race complications. Even if it didn't, the storage driver is a serialization point; it doesn't matter if you have a dozen g_* threads if only one of them can be in the top half of the driver at a time. No amount of fine-grained locking is going to help this. Well, that depends on the driver and device. We have multiple Linux threads coming in the top under some setups, so it wouldn't be a problem. I'd like to go in the opposite direction. The queue-dispatch-queue model of GEOM is elegant and easy to extend, but very wasteful for the simple case, where the simple case is one or two simple partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror transform. None of these need a dedicated dispatch context in order to operate. What I'd like to explore is compiling the GEOM stack at creation time into a linear array of operations that happen without a g_down/g_up context switch. As providers and consumers taste each other and build a stack, that stack gets compiled into a graph, and that graph gets executed directly from the calling context, both from the dev_strategy() side on the top and the bio_done() on the bottom. GEOM classes that need a detached context can mark themselves as such; doing so will prevent a graph from being created, and the current dispatch model will be retained. I've considered similar ideas, or providing a non-queuing option for some simple transformations. I expect that this will reduce i/o latency by a great margin, thus directly addressing the performance problem that FusionIO makes an example of. I'd like to also explore having the g_bio model not require a malloc at every stage in the stack/graph; even though going through UMA is fairly fast, it still represents overhead that can be eliminated. It also represents an out-of-memory failure case that can be prevented. I might try to work on this over the summer. 
It's really a research project in my head at this point, but I'm hopeful that it'll show results. Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
On Mar 21, 2010, at 10:53 AM, Ulrich Spörlein wrote: > [CC trimmed] > On Sun, 21.03.2010 at 10:39:10 -0600, Scott Long wrote: >> On Mar 21, 2010, at 10:30 AM, Ulrich Spörlein wrote: >>> On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote: Windows has a MAXPHYS equivalent of 1M. Linux has an equivalent of an odd number less than 512k. For the purpose of benchmarking against these OS's, having comparable capabilities is essential; Linux easily beats FreeBSD in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD typically stomps linux in real I/O because of vastly better latency and caching algorithms). I'm fine with raising MAXPHYS in production once the problems are addressed. >>> >>> Hi Scott, >>> >>> while I'm sure that most of the FreeBSD admins are aware of "silly" >>> benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some >>> pointers regarding your statement that FreeBSD triumphs for real-world >>> I/O loads? Can this be simulated using iozone, bonnie, etc? More >>> importantly, is there a way to do this file system independently? >>> >> >> iozone and bonnie tend to be good at testing serialized I/O latency; each >> read and write is serialized without any buffering. My experience is that >> they give mixed results, sometimes they favor freebsd, sometime linux, >> sometimes it's a wash, all because they are so sensitive to latency. And >> that's where is also gets hard to have a "universal" benchmark; what are you >> really trying to model, and how does that model reflect your actual >> workload? Are you running a single-instance, single threaded application >> that is sensitive to latency? Are you running a >> multi-instance/multi-threaded app that is sensitive to bandwidth? Are you >> operating on a single file, or on a large tree of files, or on a raw device? >> Are you sharing a small number of relatively stable file descriptors, or >> constantly creating and deleting files and truncating space? 
> > All true, that's why I wanted to know from you, which real world > situations you encountered where FreeBSD did/does outperform Linux in > regards to I/O throughput and/or latency (depending on scenario, of > course). I have some tests that spawn N number of threads and then do sequential and random i/o either into a filesystem or a raw disk. FreeBSD gets more work done with fewer I/O's than linux when you're operating through the filesystem, thanks to softupdates and the block layer. Linux has a predictive cache that often times will generate too much i/o in a vain attempt to aggressively prefetch blocks. So even then it's hard to measure in a simple way; linux will do more i/o, but less of it will be useful to the application, thereby increasing latency and increasing application runtime. Sorry I can't be more specific, but you're asking for something that I explicitly say I can't provide. Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Scott Long wrote: > On Mar 20, 2010, at 1:26 PM, Alexander Motin wrote: >> As you should remember, we have made it in such a way that all unchecked >> drivers keep using DFLTPHYS, which is not going to be changed ever. So >> there is no problem. I would worry more about non-CAM storage and the stuff >> above, like some rare GEOM classes. > > And that's why I say that everything needs to be audited. Are there CAM > drivers > that default to being silent on cpi->maxio, but still look at DFLTPHYS and > MAXPHYS? If some (in fact most) drivers are silent on cpi->maxio, they will be limited to the safe level of DFLTPHYS, which should never be changed. There should be no problem. > Are there non-CAM drivers that look at MAXPHYS, or that silently assume that > MAXPHYS will never be more than 128k? That is a question. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
[CC trimmed] On Sun, 21.03.2010 at 10:39:10 -0600, Scott Long wrote: > On Mar 21, 2010, at 10:30 AM, Ulrich Spörlein wrote: > > On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote: > >> Windows has a MAXPHYS equivalent of 1M. Linux has an equivalent of an > >> odd number less than 512k. For the purpose of benchmarking against these > >> OS's, having comparable capabilities is essential; Linux easily beats > >> FreeBSD > >> in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD > >> typically > >> stomps linux in real I/O because of vastly better latency and caching > >> algorithms). > >> I'm fine with raising MAXPHYS in production once the problems are > >> addressed. > > > > Hi Scott, > > > > while I'm sure that most of the FreeBSD admins are aware of "silly" > > benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some > > pointers regarding your statement that FreeBSD triumphs for real-world > > I/O loads? Can this be simulated using iozone, bonnie, etc? More > > importantly, is there a way to do this file system independently? > > > > iozone and bonnie tend to be good at testing serialized I/O latency; each > read and write is serialized without any buffering. My experience is that > they give mixed results, sometimes they favor freebsd, sometime linux, > sometimes it's a wash, all because they are so sensitive to latency. And > that's where is also gets hard to have a "universal" benchmark; what are you > really trying to model, and how does that model reflect your actual workload? > Are you running a single-instance, single threaded application that is > sensitive to latency? Are you running a multi-instance/multi-threaded app > that is sensitive to bandwidth? Are you operating on a single file, or on a > large tree of files, or on a raw device? Are you sharing a small number of > relatively stable file descriptors, or constantly creating and deleting files > and truncating space? 
All true, that's why I wanted to know from you, which real world situations you encountered where FreeBSD did/does outperform Linux in regards to I/O throughput and/or latency (depending on scenario, of course). I hope you don't mind, Uli ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
On Mar 21, 2010, at 10:30 AM, Ulrich Spörlein wrote: > On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote: >> Windows has a MAXPHYS equivalent of 1M. Linux has an equivalent of an >> odd number less than 512k. For the purpose of benchmarking against these >> OS's, having comparable capabilities is essential; Linux easily beats FreeBSD >> in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD >> typically >> stomps linux in real I/O because of vastly better latency and caching >> algorithms). >> I'm fine with raising MAXPHYS in production once the problems are addressed. > > Hi Scott, > > while I'm sure that most of the FreeBSD admins are aware of "silly" > benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some > pointers regarding your statement that FreeBSD triumphs for real-world > I/O loads? Can this be simulated using iozone, bonnie, etc? More > importantly, is there a way to do this file system independently? > iozone and bonnie tend to be good at testing serialized I/O latency; each read and write is serialized without any buffering. My experience is that they give mixed results; sometimes they favor freebsd, sometimes linux, sometimes it's a wash, all because they are so sensitive to latency. And that's where it also gets hard to have a "universal" benchmark; what are you really trying to model, and how does that model reflect your actual workload? Are you running a single-instance, single-threaded application that is sensitive to latency? Are you running a multi-instance/multi-threaded app that is sensitive to bandwidth? Are you operating on a single file, or on a large tree of files, or on a raw device? Are you sharing a small number of relatively stable file descriptors, or constantly creating and deleting files and truncating space? ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
On Mar 20, 2010, at 1:26 PM, Alexander Motin wrote: > Scott Long wrote: >> On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote: >>> Diminishing returns get hit pretty quickly with larger MAXPHYS values. >>> As long as the I/O can be pipelined the reduced transaction rate >>> becomes less interesting when the transaction rate is less than a >>> certain level. Off the cuff I'd say 2000 tps is a good basis for >>> considering whether it is an issue or not. 256K is actually quite >>> a reasonable value. Even 128K is reasonable. >> >> I agree completely. I did quite a bit of testing on this in 2008 and 2009. >> I even added some hooks into CAM to support this, and I thought that I had >> discussed this extensively with Alexander at the time. Guess it was yet >> another >> wasted conversation with him =-( I'll repeat it here for the record. > > AFAIR at that time you've agreed that 256K gives improvements, and 64K > of DFLTPHYS limiting most SCSI SIMs is too small. That's why you've > implemented that hooks in CAM. I have not forgot that conversation (pity > that it quietly died for SCSI SIMs). I agree that too high value could > be just a waste of resources. As you may see I haven't blindly committed > it, but asked public opinion. If you think 256K is OK - let it be 256K. > If you think that 256K needed only for media servers - OK, but lets make > it usable there. > I think that somewhere in the range of 128-512k is appropriate for a given platform. Maybe big-iron gets 512k and notebooks and embedded systems get 128k? It's partially a platform architecture issue, and partially a platform application issue. Ultimately, it should be possible to have up to 1M, and maybe even more. I don't know how best to make that selectable, or whether it should just be the default. 
>> Besides the nswbuf sizing problem, there is a real problem that a lot of >> drivers >> have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are >> particular values, and they've sized their data structures accordingly. >> Before >> these values are changed, an audit needs to be done OF EVERY SINGLE >> STORAGE DRIVER. No exceptions. This isn't a case of changing MAXHYS >> in the ata driver, testing that your machine boots, and then committing the >> change >> to source control. Some drivers will have non-obvious restrictions based on >> the number of SG elements allowed in a particular command format. MPT >> comes to mind (its multi message SG code seems to be broken when I tried >> testing large MAXPHYS on it), but I bet that there are others. > > As you should remember, we have made it in such way, that all unchecked > drivers keep using DFLTPHYS, which is not going to be changed ever. So > there is no problem. I would more worry about non-CAM storages and above > stuff, like some rare GEOM classes. And that's why I say that everything needs to be audited. Are there CAM drivers that default to being silent on cpi->maxio, but still look at DFLTPHYS and MAXPHYS? Are there non-CAM drivers that look at MAXPHYS, or that silently assume that MAXPHYS will never be more than 128k? Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
On Sat, 20.03.2010 at 12:17:33 -0600, Scott Long wrote: > Windows has a MAXPHYS equivalent of 1M. Linux has an equivalent of an > odd number less than 512k. For the purpose of benchmarking against these > OS's, having comparable capabilities is essential; Linux easily beats FreeBSD > in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD > typically > stomps linux in real I/O because of vastly better latency and caching > algorithms). > I'm fine with raising MAXPHYS in production once the problems are addressed. Hi Scott, while I'm sure that most of the FreeBSD admins are aware of "silly" benchmarks where Linux I/O seems to dwarf FreeBSD, do you have some pointers regarding your statement that FreeBSD triumphs for real-world I/O loads? Can this be simulated using iozone, bonnie, etc? More importantly, is there a way to do this file system independently? Regards, Uli ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
On Mar 21, 2010, at 8:56 AM, Andriy Gapon wrote: > on 21/03/2010 16:05 Alexander Motin said the following: >> Ivan Voras wrote: >>> Hmm, it looks like it could be easy to spawn more g_* threads (and, >>> barring specific class behaviour, it has a fair chance of working out of >>> the box) but the incoming queue will need to also be broken up for >>> greater effect. >> >> According to "notes", looks there is a good chance to obtain races, as >> some places expect only one up and one down thread. > > I haven't given any deep thought to this issue, but I remember us discussing > them over beer :-) > I think one idea was making sure (somehow) that requests traveling over the > same > edge of a geom graph (in the same direction) do it using the same > queue/thread. > Another idea was to bring some netgraph-like optimization where some > (carefully > chosen) geom vertices pass requests by a direct call instead of requeuing. > Ah, I see that we were thinking about similar things. Another tactic, and one that is easier to prototype and implement than moving GEOM to a graph, is to allow separate but related bio's to be chained. If a caller, like maybe physio or the bufdaemon or even a middle geom transform, knows that it's going to send multiple bio's at once, it chains them together into a single request, and that request gets pipelined through the stack. Each layer operates on the entire chain before requeueing to the next layer. Layers/classes that can't operate this way will get the bio serialized automatically for them, breaking the chain, but those won't be the common cases. This will bring cache locality benefits, and is something that is known to benefit high-transaction-load network applications. Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
On Mar 21, 2010, at 8:05 AM, Alexander Motin wrote: > Ivan Voras wrote: >> Julian Elischer wrote: >>> You can get better throughput by using TSC for timing because the geom >>> and devstat code does a bit of timing.. Geom can be told to turn off >>> it's timing but devstat can't. The 170 ktps is with TSC as timer, >>> and geom timing turned off. >> >> I see. I just ran randomio on a gzero device and with 10 userland >> threads (this is a slow 2xquad machine) I get g_up and g_down saturated >> fast with ~~ 120 ktps. Randomio uses gettimeofday() for measurements. > > I've just got 140Ktps from two real Intel X25-M SSDs on ICH10R AHCI > controller and single Core2Quad CPU. So at least on synthetic tests it > is potentially reachable even with casual hardware, while it completely > saturated quad-core CPU. > >> Hmm, it looks like it could be easy to spawn more g_* threads (and, >> barring specific class behaviour, it has a fair chance of working out of >> the box) but the incoming queue will need to also be broken up for >> greater effect. > > According to "notes", looks there is a good chance to obtain races, as > some places expect only one up and one down thread. > I agree that more threads just creates many more race complications. Even if it didn't, the storage driver is a serialization point; it doesn't matter if you have a dozen g_* threads if only one of them can be in the top half of the driver at a time. No amount of fine-grained locking is going to help this. I'd like to go in the opposite direction. The queue-dispatch-queue model of GEOM is elegant and easy to extend, but very wasteful for the simple case, where the simple case is one or two simple partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror transform. None of these need a dedicated dispatch context in order to operate. What I'd like to explore is compiling the GEOM stack at creation time into a linear array of operations that happen without a g_down/g_up context switch. 
As providers and consumers taste each other and build a stack, that stack gets compiled into a graph, and that graph gets executed directly from the calling context, both from the dev_strategy() side on the top and the bio_done() on the bottom. GEOM classes that need a detached context can mark themselves as such, doing so will prevent a graph from being created, and the current dispatch model will be retained. I expect that this will reduce i/o latency by a great margin, thus directly addressing the performance problem that FusionIO makes an example of. I'd like to also explore having the g_bio model not require a malloc at every stage in the stack/graph; even though going through UMA is fairly fast, it still represents overhead that can be eliminated. It also represents an out-of-memory failure case that can be prevented. I might try to work on this over the summer. It's really a research project in my head at this point, but I'm hopeful that it'll show results. Scott ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Andriy Gapon wrote: on 21/03/2010 16:05 Alexander Motin said the following: Ivan Voras wrote: Hmm, it looks like it could be easy to spawn more g_* threads (and, barring specific class behaviour, it has a fair chance of working out of the box) but the incoming queue will need to also be broken up for greater effect. According to the "notes", it looks like there is a good chance of races, as some places expect only one up and one down thread. I haven't given any deep thought to this issue, but I remember us discussing them over beer :-) I think one idea was making sure (somehow) that requests traveling over the same edge of a geom graph (in the same direction) do it using the same queue/thread. Another idea was to bring some netgraph-like optimization where some (carefully chosen) geom vertices pass requests by a direct call instead of requeuing. Yeah, like the 1:1 single-provider case (which we and most of our customers mostly use on our cards), i.e. no slicing or dicing, just the raw flash card presented as /dev/fio0 ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Alexander Motin wrote: Julian Elischer wrote: In the Fusion-io driver we find that the limiting factor is not the size of MAXPHYS, but the fact that we can not push more than 170k tps through geom (in my test machine; I've seen more on some beefier machines), but that is only a limit on small transactions, or in the case of large transfers the DMA engine tops out before a bigger MAXPHYS would make any difference. Yes, GEOM is quite CPU-hungry at high request rates due to the number of context switches. But the impact can probably be reduced from two sides: by reducing the overhead per request, or by reducing the number of requests. Both ways may give benefits. If the common opinion is not to touch the defaults now - OK, agreed. (Note, Scott, I have agreed :)) But returning to the original question, does somebody know of a real situation where an increased MAXPHYS still causes problems? At least to make it safe. Well, I know we haven't tested our BSD driver with MAXPHYS > 128KB at this time. Must try that some time :-) ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
on 21/03/2010 16:05 Alexander Motin said the following: > Ivan Voras wrote: >> Hmm, it looks like it could be easy to spawn more g_* threads (and, >> barring specific class behaviour, it has a fair chance of working out of >> the box) but the incoming queue will need to also be broken up for >> greater effect. > > According to "notes", looks there is a good chance to obtain races, as > some places expect only one up and one down thread. I haven't given any deep thought to this issue, but I remember us discussing them over beer :-) I think one idea was making sure (somehow) that requests traveling over the same edge of a geom graph (in the same direction) do it using the same queue/thread. Another idea was to bring some netgraph-like optimization where some (carefully chosen) geom vertices pass requests by a direct call instead of requeuing. -- Andriy Gapon ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Ivan Voras wrote: > Julian Elischer wrote: >> You can get better throughput by using TSC for timing because the geom >> and devstat code does a bit of timing. Geom can be told to turn off >> its timing but devstat can't. The 170 ktps is with TSC as timer, >> and geom timing turned off. > > I see. I just ran randomio on a gzero device and with 10 userland > threads (this is a slow 2xquad machine) I get g_up and g_down saturated > fast with ~~ 120 ktps. Randomio uses gettimeofday() for measurements. I've just got 140Ktps from two real Intel X25-M SSDs on an ICH10R AHCI controller and a single Core2Quad CPU. So at least on synthetic tests it is potentially reachable even with casual hardware, while it completely saturated the quad-core CPU. > Hmm, it looks like it could be easy to spawn more g_* threads (and, > barring specific class behaviour, it has a fair chance of working out of > the box) but the incoming queue will need to also be broken up for > greater effect. According to the "notes", it looks like there is a good chance of races, as some places expect only one up and one down thread. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Julian Elischer wrote: > In the Fusion-io driver we find that the limiting factor is not the > size of MAXPHYS, but the fact that we can not push more than > 170k tps through geom (in my test machine; I've seen more on some > beefier machines), but that is only a limit on small transactions, > or in the case of large transfers the DMA engine tops out before a > bigger MAXPHYS would make any difference. Yes, GEOM is quite CPU-hungry at high request rates due to the number of context switches. But the impact can probably be reduced from two sides: by reducing the overhead per request, or by reducing the number of requests. Both ways may give benefits. If the common opinion is not to touch the defaults now - OK, agreed. (Note, Scott, I have agreed :)) But returning to the original question, does somebody know of a real situation where an increased MAXPHYS still causes problems? At least to make it safe. -- Alexander Motin ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
Ivan Voras wrote: Julian Elischer wrote: Alexander Motin wrote: Scott Long wrote: On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote: Diminishing returns get hit pretty quickly with larger MAXPHYS values. As long as the I/O can be pipelined the reduced transaction rate becomes less interesting when the transaction rate is less than a certain level. Off the cuff I'd say 2000 tps is a good basis for considering whether it is an issue or not. 256K is actually quite a reasonable value. Even 128K is reasonable. I agree completely. I did quite a bit of testing on this in 2008 and 2009. I even added some hooks into CAM to support this, and I thought that I had discussed this extensively with Alexander at the time. Guess it was yet another wasted conversation with him =-( I'll repeat it here for the record. In the Fusion-io driver we find that the limiting factor is not the size of MAXPHYS, but the fact that we can not push more than 170k tps through geom (in my test machine; I've seen more on some beefier machines), but that is only a limit on small transactions, Do the GEOM threads (g_up, g_down) go into saturation? Effectively all IO is serialized through them. Basically. You can get better throughput by using TSC for timing because the geom and devstat code does a bit of timing. Geom can be told to turn off its timing but devstat can't. The 170 ktps is with TSC as timer, and geom timing turned off. It could just be the sheer weight of the work being done. Linux on the same machine using the same driver code (with different wrappers) gets 225k tps. ___ freebsd-a...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-arch To unsubscribe, send any mail to "freebsd-arch-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
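For anyone wanting to reproduce the timing setup Julian describes, the knobs are, if memory serves, along these lines (verify the exact names with sysctl -a on the system in question; devstat's per-request timing has no equivalent switch, which matches Julian's remark that "devstat can't"):

```
# Use the TSC as the timecounter (cheap reads, assuming a stable TSC)
sysctl kern.timecounter.hardware=TSC

# Disable GEOM's per-request statistics/timing collection
sysctl kern.geom.collectstats=0
```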
Re: Increasing MAXPHYS
Alexander Motin wrote: Scott Long wrote: On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote: Diminishing returns get hit pretty quickly with larger MAXPHYS values. As long as the I/O can be pipelined the reduced transaction rate becomes less interesting when the transaction rate is less than a certain level. Off the cuff I'd say 2000 tps is a good basis for considering whether it is an issue or not. 256K is actually quite a reasonable value. Even 128K is reasonable. I agree completely. I did quite a bit of testing on this in 2008 and 2009. I even added some hooks into CAM to support this, and I thought that I had discussed this extensively with Alexander at the time. Guess it was yet another wasted conversation with him =-( I'll repeat it here for the record. In the Fusion-io driver we find that the limiting factor is not the size of MAXPHYS, but the fact that we can not push more than 170k tps through geom (in my test machine; I've seen more on some beefier machines), but that is only a limit on small transactions, or in the case of large transfers the DMA engine tops out before a bigger MAXPHYS would make any difference. Where it may make a difference is that Linux only pushes 128k at a time, it looks like, so many hardware engines have likely never been tested with anything greater (not sure about Windows). Some drivers may also be written with the assumption that they will not see more. Of course they should be able to limit the transaction size down themselves if they are written well. AFAIR at that time you've agreed that 256K gives improvements, and the 64K of DFLTPHYS limiting most SCSI SIMs is too small. That's why you've implemented those hooks in CAM. I have not forgotten that conversation (pity that it quietly died for SCSI SIMs). I agree that too high a value could just be a waste of resources. As you may see I haven't blindly committed it, but asked for public opinion. If you think 256K is OK - let it be 256K. 
If you think that 256K is needed only for media servers - OK, but let's make it usable there. Besides the nswbuf sizing problem, there is a real problem that a lot of drivers have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are particular values, and they've sized their data structures accordingly. Before these values are changed, an audit needs to be done OF EVERY SINGLE STORAGE DRIVER. No exceptions. This isn't a case of changing MAXPHYS in the ata driver, testing that your machine boots, and then committing the change to source control. Some drivers will have non-obvious restrictions based on the number of SG elements allowed in a particular command format. MPT comes to mind (its multi-message SG code seems to be broken when I tried testing large MAXPHYS on it), but I bet that there are others. As you should remember, we have made it in such a way that all unchecked drivers keep using DFLTPHYS, which is not going to be changed ever. So there is no problem. I would worry more about non-CAM storage and the stuff above, like some rare GEOM classes. I'm fine with raising MAXPHYS in production once the problems are addressed. That's why in my post I've asked people about any known problems. I've addressed several related issues in recent months, and I am looking for more. To address problems, it would be nice to know about them first. ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
Re: Increasing MAXPHYS
2010/3/20 Alexander Motin
> Hi.
>
> With the set of changes done to the ATA, CAM and GEOM subsystems recently, we may now get use for an increased MAXPHYS (maximum physical I/O size) kernel constant, from 128K to some bigger value. [snip]
>
> All above I have successfully tested last months with MAXPHYS of 1MB on i386 and amd64 platforms.
>
> So my questions are:
> - does somebody know any issues denying increasing MAXPHYS in HEAD?
> - are there any specific opinions about value? 512K, 1MB, MD?
>
> For now, I think it should be machine-dependent.

The virtual memory system should have no problems with MAXPHYS of 1MB on amd64 and ia64.

Alan
Re: Increasing MAXPHYS
Scott Long wrote:
> On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>> Diminishing returns get hit pretty quickly with larger MAXPHYS values. As long as the I/O can be pipelined the reduced transaction rate becomes less interesting when the transaction rate is less than a certain level. Off the cuff I'd say 2000 tps is a good basis for considering whether it is an issue or not. 256K is actually quite a reasonable value. Even 128K is reasonable.
>
> I agree completely. I did quite a bit of testing on this in 2008 and 2009. I even added some hooks into CAM to support this, and I thought that I had discussed this extensively with Alexander at the time. Guess it was yet another wasted conversation with him =-( I'll repeat it here for the record.

AFAIR at that time you agreed that 256K gives improvements, and that the 64K DFLTPHYS limiting most SCSI SIMs is too small. That's why you implemented those hooks in CAM. I have not forgotten that conversation (pity that it quietly died for SCSI SIMs).

I agree that too high a value could just be a waste of resources. As you may see, I haven't blindly committed it, but asked for public opinion. If you think 256K is OK - let it be 256K. If you think 256K is needed only for media servers - OK, but let's make it usable there.

> Besides the nswbuf sizing problem, there is a real problem that a lot of drivers have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are particular values, and they've sized their data structures accordingly. Before these values are changed, an audit needs to be done OF EVERY SINGLE STORAGE DRIVER. No exceptions. This isn't a case of changing MAXPHYS in the ata driver, testing that your machine boots, and then committing the change to source control. Some drivers will have non-obvious restrictions based on the number of SG elements allowed in a particular command format.
> MPT comes to mind (its multi-message SG code seems to be broken when I tried testing large MAXPHYS on it), but I bet that there are others.

As you should remember, we made it in such a way that all unchecked drivers keep using DFLTPHYS, which is not going to be changed ever. So there is no problem. I would worry more about non-CAM storage and the stuff above it, like some rare GEOM classes.

> I'm fine with raising MAXPHYS in production once the problems are addressed.

That's why in my post I've asked people about any known problems. I've addressed several related issues in recent months, and I am looking for more. To address problems, it would be nice to know about them first.

--
Alexander Motin
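The unchecked-drivers fallback Alexander describes can be sketched roughly as follows. This is an illustration, not the actual CAM code; it assumes the convention that an audited driver advertises a maximum I/O size (as with CAM's maxio), with 0 meaning "not reported":

```c
#include <stdint.h>

#define DFLTPHYS (64u * 1024u)   /* conservative default, never to change */
#define MAXPHYS  (128u * 1024u)  /* system-wide ceiling under discussion */

/*
 * An audited SIM advertises its maximum supported I/O size and is
 * honored up to MAXPHYS; an unchecked SIM reports nothing (0) and
 * keeps using DFLTPHYS, so raising MAXPHYS cannot break it.
 */
static uint32_t
sim_max_io(uint32_t reported_maxio)
{
        if (reported_maxio == 0)
                return DFLTPHYS;
        return reported_maxio < MAXPHYS ? reported_maxio : MAXPHYS;
}
```

Under this scheme only drivers that explicitly opt in ever see requests larger than 64K, which is why the audit burden falls on the opted-in drivers rather than on every SIM at once.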
Re: Increasing MAXPHYS
On Sat, Mar 20, 2010 at 6:53 PM, Matthew Dillon wrote:
>
> :All above I have successfully tested last months with MAXPHYS of 1MB on
> :i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know any issues denying increasing MAXPHYS in HEAD?
> :- are there any specific opinions about value? 512K, 1MB, MD?
> :
> :--
> :Alexander Motin
>
> (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you might hit up against KVM exhaustion issues in unrelated subsystems. nswbuf typically maxes out at around 256. For i386 1MB is probably too large (256M of reserved KVM is a lot for i386). On amd64 there shouldn't be a problem.

Pardon my ignorance, but wouldn't so much KVM make small embedded devices like Soekris boards with 128 MB of physical RAM totally unusable then? On my net4801, running RELENG_8:

vm.kmem_size: 40878080
hw.physmem: 125272064
hw.usermem: 84840448
hw.realmem: 134217728

> Diminishing returns get hit pretty quickly with larger MAXPHYS values. As long as the I/O can be pipelined the reduced transaction rate becomes less interesting when the transaction rate is less than a certain level. Off the cuff I'd say 2000 tps is a good basis for considering whether it is an issue or not. 256K is actually quite a reasonable value. Even 128K is reasonable.
>
> Nearly all the issues I've come up against in the last few years have been related more to pipeline algorithms breaking down and less with I/O size. The cluster_read() code is especially vulnerable to algorithmic breakdowns when fast media (such as an SSD) is involved. e.g. I/Os queued from the previous cluster op can create stall conditions in subsequent cluster ops before they can issue new I/Os to keep the pipeline hot.

Thanks,
-cpghost.

--
Cordula's Web. http://www.cordula.ws/
Re: Increasing MAXPHYS
:Pardon my ignorance, but wouldn't so much KVM make small embedded
:devices like Soekris boards with 128 MB of physical RAM totally unusable
:then? On my net4801, running RELENG_8:
:
:vm.kmem_size: 40878080
:hw.physmem: 125272064
:hw.usermem: 84840448
:hw.realmem: 134217728

KVM != physical memory. On i386 by default the kernel has 1G of KVM and userland has 3G. While the partition can be moved to increase available KVM on i386 (e.g. 2G/2G), it isn't recommended. So the KVM reserved for various things does not generally impact physical memory use.

The number of swap buffers (nswbuf) is scaled to 1/4 nbufs with a maximum of 256. Systems with small amounts of memory should not be impacted. The issue with regard to KVM problems on i386 is mostly restricted to systems with 2G+ of RAM, where the kernel's various internal parameters are scaled to their maximum values or limits. On systems with less RAM the kernel's internal parameters are usually scaled down sufficiently that there is very little chance of the kernel running out of KVM.

-Matt
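The scaling Matt describes can be modeled in a few lines. The constants here are illustrative rather than the kernel's exact tuning logic; the point is that the pbuf reservation is nswbuf * MAXPHYS of kernel virtual address space, capped by the clamp on nswbuf:

```c
#include <stddef.h>

/*
 * Rough model of the pbuf KVM reservation: nswbuf scales to a
 * quarter of nbuf, clamped to 256.  The product is kernel virtual
 * address space reserved up front, not physical memory.
 */
static size_t
pbuf_kvm_reserved(unsigned nbuf, size_t maxphys)
{
        unsigned nswbuf = nbuf / 4;

        if (nswbuf > 256)
                nswbuf = 256;
        return (size_t)nswbuf * maxphys;
}
```

At the 256 clamp with a 1MB MAXPHYS this gives the 256M figure quoted earlier in the thread; a small machine whose nbuf scales down reserves proportionally less, which is why a 128 MB Soekris board is not the problem case.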
Re: Increasing MAXPHYS
On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>
> :All above I have successfully tested last months with MAXPHYS of 1MB on
> :i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know any issues denying increasing MAXPHYS in HEAD?
> :- are there any specific opinions about value? 512K, 1MB, MD?
> :
> :--
> :Alexander Motin
>
> (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you might hit up against KVM exhaustion issues in unrelated subsystems. nswbuf typically maxes out at around 256. For i386 1MB is probably too large (256M of reserved KVM is a lot for i386). On amd64 there shouldn't be a problem.

Yes, this needs to be addressed. I've never gotten a clear answer from VM people like Peter Wemm and Alan Cox on what should be done.

> Diminishing returns get hit pretty quickly with larger MAXPHYS values. As long as the I/O can be pipelined the reduced transaction rate becomes less interesting when the transaction rate is less than a certain level. Off the cuff I'd say 2000 tps is a good basis for considering whether it is an issue or not. 256K is actually quite a reasonable value. Even 128K is reasonable.

I agree completely. I did quite a bit of testing on this in 2008 and 2009. I even added some hooks into CAM to support this, and I thought that I had discussed this extensively with Alexander at the time. Guess it was yet another wasted conversation with him =-( I'll repeat it here for the record.

What I call the silly-i/o-test, filling a disk up with the dd command, yields performance improvements up to a MAXPHYS of 512K. Beyond that it's negligible, and actually starts running into contention on the VM page queues lock. There is some work to break down this lock, so it's worth revisiting in the future. For the non-silly-i/o-test, where I do real file i/o using various sequential and random patterns, there was a modest improvement up to 256K, and a slight improvement up to 512K.
This surprised me, as I figured that most filesystem i/o would be in UFS block sized chunks. Then I realized that the UFS clustering code was actually taking advantage of the larger I/Os. The improvement really depends on the workload, of course, and I wouldn't expect it to be noticeable for most people unless they're running something like a media server.

Besides the nswbuf sizing problem, there is a real problem that a lot of drivers have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are particular values, and they've sized their data structures accordingly. Before these values are changed, an audit needs to be done OF EVERY SINGLE STORAGE DRIVER. No exceptions. This isn't a case of changing MAXPHYS in the ata driver, testing that your machine boots, and then committing the change to source control. Some drivers will have non-obvious restrictions based on the number of SG elements allowed in a particular command format. MPT comes to mind (its multi-message SG code seems to be broken when I tried testing large MAXPHYS on it), but I bet that there are others.

Windows has a MAXPHYS equivalent of 1M. Linux has an equivalent of an odd number less than 512k. For the purpose of benchmarking against these OSes, having comparable capabilities is essential; Linux easily beats FreeBSD in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD typically stomps Linux in real I/O because of vastly better latency and caching algorithms). I'm fine with raising MAXPHYS in production once the problems are addressed.

> Nearly all the issues I've come up against in the last few years have been related more to pipeline algorithms breaking down and less with I/O size. The cluster_read() code is especially vulnerable to algorithmic breakdowns when fast media (such as an SSD) is involved. e.g. I/Os queued from the previous cluster op can create stall conditions in subsequent cluster ops before they can issue new I/Os to keep the pipeline hot.
Yes, this is another very good point. It's time to start really figuring out what SSD means for FreeBSD I/O.

Scott
Re: Increasing MAXPHYS
:All above I have successfully tested last months with MAXPHYS of 1MB on
:i386 and amd64 platforms.
:
:So my questions are:
:- does somebody know any issues denying increasing MAXPHYS in HEAD?
:- are there any specific opinions about value? 512K, 1MB, MD?
:
:--
:Alexander Motin

(nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you might hit up against KVM exhaustion issues in unrelated subsystems. nswbuf typically maxes out at around 256. For i386 1MB is probably too large (256M of reserved KVM is a lot for i386). On amd64 there shouldn't be a problem.

Diminishing returns get hit pretty quickly with larger MAXPHYS values. As long as the I/O can be pipelined the reduced transaction rate becomes less interesting when the transaction rate is less than a certain level. Off the cuff I'd say 2000 tps is a good basis for considering whether it is an issue or not. 256K is actually quite a reasonable value. Even 128K is reasonable.

Nearly all the issues I've come up against in the last few years have been related more to pipeline algorithms breaking down and less with I/O size. The cluster_read() code is especially vulnerable to algorithmic breakdowns when fast media (such as an SSD) is involved. e.g. I/Os queued from the previous cluster op can create stall conditions in subsequent cluster ops before they can issue new I/Os to keep the pipeline hot.

-Matt
Matthew Dillon
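The 2000 tps rule of thumb above follows from simple arithmetic: the transaction rate needed to sustain a given bandwidth falls linearly with I/O size, so once the required rate is already low, doubling MAXPHYS again buys little. A trivial illustration (hypothetical helper, not from any kernel source):

```c
#include <stdint.h>

/*
 * Transactions per second needed to sustain a target sequential
 * bandwidth at a given I/O size (rounded up).  Doubling the I/O
 * size halves the required rate, which is where the diminishing
 * returns of a larger MAXPHYS come from.
 */
static uint64_t
tps_needed(uint64_t bytes_per_sec, uint64_t io_size)
{
        return (bytes_per_sec + io_size - 1) / io_size;
}
```

For 400 MB/s of sequential throughput, 128K transfers need 3200 tps while 1MB transfers need only 400; past the point where per-transaction overhead stops dominating, larger values mostly stop helping.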