Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?
Bump

2017-02-10 22:11 GMT+08:00 Mikael :
> 2017-02-10 18:39 GMT+08:00 David Gwynne :
>
>> > 2017-02-09 16:41 GMT+08:00 David Gwynne :
>> > ..
>>
>> i can go into more detail if you want.
>>
>> cheers,
>> dlg
>
> Hi David,
>
> Thank you - yes, please go into more detail!
>
> Also, on a more concrete level, I would be most curious to understand how
> far away we are from having concurrent IO access, at various distances
> from the hardware: via some low-level block device API that would be
> mp-safe; via open("/dev/rsd0c") + read() that would be mp-safe; or the
> filesystem being mp-safe, so that open("myfile") + read() would be fully
> concurrent?
>
> Then, just to get an idea of what's going on regarding how the system does
> concurrent IO: while my pre-understanding was that the kernel's big lock
> could be the performance constraint, the people I talked to seemed to
> suggest that the kernel does IO in a procedure-call-based, blocking,
> synchronous way today, which only runs one command in the SATA command
> queue at a time. So I got the impression that concurrency-friendly IO and
> higher IOPS would require a reconstruction of the IO system, where IO
> operations are instead represented internally by data structures driven by
> an asynchronous mechanism. How is it actually, and where is it going?
>
> Regarding mp-safe drivers, I would guess ahci(4) and nvme(4) are the most
> commonly used interfaces for SSDs.
>
> Best regards,
> Mikael
Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?
2017-02-10 18:39 GMT+08:00 David Gwynne :
> > 2017-02-09 16:41 GMT+08:00 David Gwynne :
> > ..
>
> i can go into more detail if you want.
>
> cheers,
> dlg

Hi David,

Thank you - yes, please go into more detail!

Also, on a more concrete level, I would be most curious to understand how far away we are from having concurrent IO access, at various distances from the hardware: via some low-level block device API that would be mp-safe; via open("/dev/rsd0c") + read() that would be mp-safe; or the filesystem being mp-safe, so that open("myfile") + read() would be fully concurrent?

Then, just to get an idea of what's going on regarding how the system does concurrent IO: while my pre-understanding was that the kernel's big lock could be the performance constraint, the people I talked to seemed to suggest that the kernel does IO in a procedure-call-based, blocking, synchronous way today, which only runs one command in the SATA command queue at a time. So I got the impression that concurrency-friendly IO and higher IOPS would require a reconstruction of the IO system, where IO operations are instead represented internally by data structures driven by an asynchronous mechanism. How is it actually, and where is it going?

Regarding mp-safe drivers, I would guess ahci(4) and nvme(4) are the most commonly used interfaces for SSDs.

Best regards,
Mikael
Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?
> On 9 Feb 2017, at 7:11 pm, Mikael wrote:
>
> 2017-02-09 16:41 GMT+08:00 David Gwynne :
> ..
>> hey mikael,
>>
>> can you be more specific about what you mean by multiqueuing for disks?
>> even a reference to an implementation of what you're asking about would
>> help me answer this question.
>>
>> ill write up a bigger reply after my kids are in bed.
>>
>> cheers,
>> dlg
>
> Hi David,
>
> Thank you for your answer.
>
> The other OpenBSDers I talked to also used the wording "multiqueue". My
> understanding of the kernel's workings here is too limited.
>
> If I were to give a reference to some implementation out there, I guess I
> would point to the one introduced in Linux 3.13/3.16:
>
> "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems"
> http://kernel.dk/blk-mq.pdf
>
> "Linux Multi-Queue Block IO Queueing Mechanism (blk-mq)"
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
>
> "The multiqueue block layer"
> https://lwn.net/Articles/552904/
>
> Looking forward a lot to your followup.

sorry, i fell asleep too.

thanks for the links to info on the linux mq stuff. i can understand what it provides. however, in the situation you are testing im not sure it is necessarily the means to addressing the difference in performance you're seeing in your environment.

anyway, tldr: you're suffering under the kernel's big giant lock.

according to the dmesg you provided, you're testing a single ssd (a samsung 850) connected to a sata controller (ahci). with this equipment, all operations between the computer and the actual disk are issued through ahci. because of the way ahci operates, operations on a specific disk are effectively serialised at this point.

in your setup you have multiple cpus though, and it sounds like your benchmark runs on them concurrently, issuing io through the kernel to the disk via ahci. two things are obviously different between linux and openbsd that would affect this benchmark.
the first is that io to physical devices is limited to a value called MAXPHYS in the kernel, which is 64 kilobytes. any larger read operations issued by userland to the kernel get cut up into a series of 64k reads against the disk. ahci itself can handle 4 meg per transfer.

the other difference is that, like most of the kernel, read() is serialised by the big lock. the result of this is that if you have userland on multiple cpus creating a heavily io bound workload, all the cpus end up waiting for each other to run. while one cpu is running through the io stack down to ahci, every other cpu is spinning, waiting for its turn to do the same thing. the distance between userland and ahci is relatively long. going through the buffer cache (i.e., /dev/sd0) is longer than bypassing it (through /dev/rsd0). your test results confirm this.

the solution to this problem is to look at taking the big lock away from the io paths. this is non-trivial work though. i have already spent time working on making sd(4) and the scsi midlayer mpsafe, but haven't been able to take advantage of that work because both sides of the scsi subsystem (adapters like ahci on one side, and the block layer and syscalls on the other) still need the big lock. some adapters have been made mpsafe, but i dont think ahci was on that list.

when i was playing with mpsafe scsi, i gave up the big lock at the start of sd(4) and ran it, the midlayer, and mpi(4) or mpii(4) unlocked. if i remember correctly, even just unlocking that part of the stack doubled the throughput of the system.

the work ive done in the midlayer should mean that if we can access it without the biglock, accesses to disks behind adapters like ahci should scale pretty well with cpu cores because of how io is handed over to the midlayer. concurrent submissions by multiple cpus end up delegating one of the cpus to operate on the adapter on behalf of all the cpus.
while that first cpu is still submitting to the hardware, other cpus are not blocked from queuing more work and returning to userland.

i can go into more detail if you want.

cheers,
dlg
Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?
2017-02-09 16:41 GMT+08:00 David Gwynne :
..
> hey mikael,
>
> can you be more specific about what you mean by multiqueuing for disks?
> even a reference to an implementation of what you're asking about would
> help me answer this question.
>
> ill write up a bigger reply after my kids are in bed.
>
> cheers,
> dlg

Hi David,

Thank you for your answer.

The other OpenBSDers I talked to also used the wording "multiqueue". My understanding of the kernel's workings here is too limited.

If I were to give a reference to some implementation out there, I guess I would point to the one introduced in Linux 3.13/3.16:

"Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems"
http://kernel.dk/blk-mq.pdf

"Linux Multi-Queue Block IO Queueing Mechanism (blk-mq)"
https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)

"The multiqueue block layer"
https://lwn.net/Articles/552904/

Looking forward a lot to your followup.

Best regards,
Mikael
Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?
> On 9 Feb 2017, at 12:42 pm, Mikael wrote:
>
> Hi misc@,
>
> The SSD reading benchmark in the previous email shows that per-device
> multiqueuing would boost multithreaded random read performance very much,
> e.g. by ~7X+: the current 50MB/sec would increase to ~350MB/sec+.
>
> (I didn't benchmark yet, but I suspect the current 50MB/sec is
> system-wide, whereas with multiqueuing the 350MB/sec+ would be per drive.)
>
> Multiuser databases, and any parallel file reading activity, would see a
> proportional speedup with multiqueuing.

hey mikael,

can you be more specific about what you mean by multiqueuing for disks? even a reference to an implementation of what you're asking about would help me answer this question.

ill write up a bigger reply after my kids are in bed.

cheers,
dlg

> Do you have plans to implement this?
>
> Was anything done to this end already, and any idea when multiqueueing
> could happen?
>
> Are donations a matter here, and if so, about what size of donations and
> to whom?
>
> Someone suggested that implementing it would take a year of work.
>
> Any clarifications of what's going on, what's possible, and how, would be
> much appreciated.
>
> Thanks,
> Mikael
Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?
Hi misc@,

The SSD reading benchmark in the previous email shows that per-device multiqueuing would boost multithreaded random read performance very much, e.g. by ~7X+: the current 50MB/sec would increase to ~350MB/sec+.

(I didn't benchmark yet, but I suspect the current 50MB/sec is system-wide, whereas with multiqueuing the 350MB/sec+ would be per drive.)

Multiuser databases, and any parallel file reading activity, would see a proportional speedup with multiqueuing.

Do you have plans to implement this?

Was anything done to this end already, and any idea when multiqueueing could happen?

Are donations a matter here, and if so, about what size of donations and to whom?

Someone suggested that implementing it would take a year of work.

Any clarifications of what's going on, what's possible, and how, would be much appreciated.

Thanks,
Mikael