Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?

2017-02-22 Thread Mikael
Bump

2017-02-10 22:11 GMT+08:00 Mikael :

> 2017-02-10 18:39 GMT+08:00 David Gwynne :
>
>> > 2017-02-09 16:41 GMT+08:00 David Gwynne :
>>
>  ..
>
>> i can go into more detail if you want.
>>
>> cheers,
>> dlg
>>
>
> Hi David,
>
> Thank you - yes please go into more detail!
>
> Also, on a more concrete level, I would be most curious to understand
> how far away we are from having concurrent IO access at various
> distances from the hardware: via some low-level block device API that is
> mp-safe, via open("/dev/rsd0c") + read() being mp-safe, or via an
> mp-safe filesystem so that open("myfile") + read() would all be
> concurrent?
>
> Then, just to get an idea of what's going on regarding how the system
> does concurrent IO: my initial understanding was that the kernel's big
> lock could be a performance constraint, but the people I talked to
> seemed to suggest that the kernel today does IO in a synchronous,
> procedure-call-based, blocking way, which only runs one command in the
> SATA command queue at a time. That gave me the impression that
> concurrency-friendly IO and higher IOPS would require a reconstruction
> of the IO system, where IO operations are instead represented internally
> as data structures driven by an asynchronous mechanism. How is it
> actually, and where is it going?
>
> Regarding mp-safe drivers, I would guess ahci(4) and nvme(4) are the most
> commonly used interfaces for SSDs.
>
> Best regards,
> Mikael



Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?

2017-02-10 Thread Mikael
2017-02-10 18:39 GMT+08:00 David Gwynne :

> > 2017-02-09 16:41 GMT+08:00 David Gwynne :
>
 ..

> i can go into more detail if you want.
>
> cheers,
> dlg
>

Hi David,

Thank you - yes please go into more detail!

Also, on a more concrete level, I would be most curious to understand how
far away we are from having concurrent IO access at various distances from
the hardware: via some low-level block device API that is mp-safe, via
open("/dev/rsd0c") + read() being mp-safe, or via an mp-safe filesystem so
that open("myfile") + read() would all be concurrent?

Then, just to get an idea of what's going on regarding how the system does
concurrent IO: my initial understanding was that the kernel's big lock could
be a performance constraint, but the people I talked to seemed to suggest
that the kernel today does IO in a synchronous, procedure-call-based,
blocking way, which only runs one command in the SATA command queue at a
time. That gave me the impression that concurrency-friendly IO and higher
IOPS would require a reconstruction of the IO system, where IO operations
are instead represented internally as data structures driven by an
asynchronous mechanism. How is it actually, and where is it going?

Regarding mp-safe drivers, I would guess ahci(4) and nvme(4) are the most
commonly used interfaces for SSDs.

Best regards,
Mikael



Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?

2017-02-10 Thread David Gwynne
> On 9 Feb 2017, at 7:11 pm, Mikael  wrote:
>
> 2017-02-09 16:41 GMT+08:00 David Gwynne :
> ..
> hey mikael,
>
> can you be more specific about what you mean by multiqueuing for disks? even a
> reference to an implementation of what you’re asking about would help me
> answer this question.
>
> ill write up a bigger reply after my kids are in bed.
>
> cheers,
> dlg
>
> Hi David,
>
> Thank you for your answer.
>
> The other OpenBSD people I talked to also used the wording "multiqueue". My
> understanding of the kernel's workings here is too limited.
>
> If I were to give a reference to some implementation out there, I guess I
> would point to the one introduced in Linux 3.13/3.16:
>
> "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems"
> http://kernel.dk/blk-mq.pdf
>
> "Linux Multi-Queue Block IO Queueing Mechanism (blk-mq)"
> https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
>
> "The multiqueue block layer"
> https://lwn.net/Articles/552904/
>
> Looking forward a lot to your followup.

sorry, i fell asleep too.

thanks for the links to info on the linux mq stuff. i can understand what it
provides. however, in the situation you are testing im not sure it is
necessarily the means of addressing the difference in performance you’re
seeing in your environment.

anyway, tldr: you’re suffering under the kernel's big giant lock.

according to the dmesg you provided you’re testing a single ssd (a samsung
850) connected to a sata controller (ahci). with this equipment all operations
between the computer and the actual disk are issued through ahci. because of
the way ahci operates, operations on a specific disk are effectively
serialised at this point. in your setup you have multiple cpus though, and it
sounds like your benchmark runs on them concurrently, issuing io through the
kernel to the disk via ahci.
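
to illustrate what i mean by serialised, the issue side of a port looks
roughly like the sketch below. this is not the actual ahci(4) code, just the
shape of it; the fake_* names are made up.

/*
 * illustration only, not ahci(4): each port has a fixed set of command
 * slots and a single CI register, so commands for one disk get issued
 * one at a time under a per-port mutex even if several cpus want to
 * submit io concurrently.  assumes the usual kernel headers.
 */
struct fake_ahci_port {
        struct mutex     mtx;           /* protects the slot bitmap and CI register */
        uint32_t         slots_inuse;   /* one bit per hardware command slot */
};

int
fake_ahci_issue(struct fake_ahci_port *p, int slot)
{
        mtx_enter(&p->mtx);
        if (p->slots_inuse & (1U << slot)) {
                mtx_leave(&p->mtx);
                return (EBUSY);
        }
        p->slots_inuse |= (1U << slot);
        /* ... write the port's CI register to kick off the command ... */
        mtx_leave(&p->mtx);
        return (0);
}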

two things are obviously different between linux and openbsd that would affect
this benchmark. the first is that io to physical devices is limited to a value
called MAXPHYS in the kernel, which is 64 kilobytes. any larger read
operations issued by userland to the kernel get cut up into a series of 64k
reads against the disk. ahci itself can handle 4 meg per transfer.
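
so a single big read against the raw device ends up behaving roughly like
this sketch. it is not the real physio() code, just the shape of it;
fake_submit_and_wait() is a made-up stand-in for the buffer/strategy
machinery.

/*
 * sketch of the MAXPHYS effect, not the real physio() code.  assumes the
 * usual kernel headers; fake_submit_and_wait() is a hypothetical helper
 * that sends one transfer to the disk and sleeps until it completes.
 */
#define FAKE_MAXPHYS    (64 * 1024)

void    fake_submit_and_wait(char *, size_t);

size_t
fake_raw_read(char *buf, size_t len)
{
        size_t off, chunk;

        for (off = 0; off < len; off += chunk) {
                chunk = len - off;
                if (chunk > FAKE_MAXPHYS)
                        chunk = FAKE_MAXPHYS;
                /* one 64k-or-smaller command against the disk, done in turn */
                fake_submit_and_wait(buf + off, chunk);
        }
        return (len);
}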

the other difference is that, like most of the kernel, read() is serialised by
the big lock. the result of this is that if you have userland on multiple cpus
creating a heavily io bound workload, all the cpus end up waiting for each
other to run. while one cpu is running through the io stack down to ahci,
every other cpu is spinning waiting for its turn to do the same thing.
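
in other words, today the path looks roughly like this sketch. it is
simplified, but KERNEL_LOCK()/KERNEL_UNLOCK() are the real big lock macros;
fake_file_read() is a made-up stand-in for the vfs/sd(4)/midlayer trip.

/*
 * sketch of the read path under the big lock; not the actual syscall
 * code.  fake_sys_read() and fake_file_read() are made-up names.
 */
ssize_t fake_file_read(int, void *, size_t);    /* vfs -> sd(4) -> midlayer -> ahci */

ssize_t
fake_sys_read(int fd, void *buf, size_t len)
{
        ssize_t n;

        KERNEL_LOCK();          /* every cpu entering this path lines up here */
        n = fake_file_read(fd, buf, len);       /* and sleeps until the io is done */
        KERNEL_UNLOCK();

        return (n);
}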

the distance between userland and ahci is relatively long. going through the
buffer cache (i.e., /dev/sd0) is longer than bypassing it (through /dev/rsd0).
your test results confirm this.

the solution to this problem is to look at taking the big lock away from the
io paths. this is non-trivial work though.

i have already spent time working on making sd(4) and the scsi midlayer
mpsafe, but haven’t been able to take advantage of that work because both
sides of the scsi subsystem (adapters like ahci on one side, and the block
layer and syscalls on the other) still need the big lock. some adapters have
been made mpsafe, but i
dont think ahci was on that list. when i was playing with mpsafe scsi, i gave
up the big lock at the start of sd(4) and ran it, the midlayer, and mpi(4) or
mpii(4) unlocked. if i remember correctly, even just unlocking that part of
the stack doubled the throughput of the system.

the work ive done in the midlayer should mean that if we can access it without
the biglock, accesses to disks beyond adapters like ahci should scale pretty
well across cpu cores because of how io is handed over to the midlayer. concurrent
submissions by multiple cpus end up delegating one of the cpus to operate on
the adapter on behalf of all the cpus. while that first cpu is still
submitting to the hardware, other cpus are not blocked from queuing more work
and returning to user land.
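
roughly, that pattern looks like the sketch below. it is a generic
illustration of the idea rather than the actual midlayer code; the fake_*
names are made up, but the mutex and SIMPLEQ macros are the normal kernel
primitives.

/*
 * generic illustration of the submission pattern, not the actual
 * midlayer code.  every cpu appends work under the mutex, but only one
 * cpu at a time becomes the runner and feeds the adapter; the rest drop
 * their work off and return to userland straight away.
 */
struct fake_io {
        SIMPLEQ_ENTRY(fake_io)   entry;
};

SIMPLEQ_HEAD(fake_io_list, fake_io);

struct fake_queue {
        struct mutex             mtx;
        struct fake_io_list      pending;
        int                      running;      /* a cpu is already feeding the adapter */
};

void    fake_adapter_start(struct fake_io *);   /* hand one command to the hardware */

void
fake_submit(struct fake_queue *q, struct fake_io *io)
{
        mtx_enter(&q->mtx);
        SIMPLEQ_INSERT_TAIL(&q->pending, io, entry);
        if (q->running) {
                /* another cpu is already on the adapter; drop off and return */
                mtx_leave(&q->mtx);
                return;
        }
        q->running = 1;

        /* this cpu got delegated: drain everything queued by everyone */
        while (!SIMPLEQ_EMPTY(&q->pending)) {
                io = SIMPLEQ_FIRST(&q->pending);
                SIMPLEQ_REMOVE_HEAD(&q->pending, entry);
                mtx_leave(&q->mtx);
                fake_adapter_start(io);
                mtx_enter(&q->mtx);
        }
        q->running = 0;
        mtx_leave(&q->mtx);
}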

i can go into more detail if you want.

cheers,
dlg



Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?

2017-02-09 Thread Mikael
2017-02-09 16:41 GMT+08:00 David Gwynne :
..

> hey mikael,
>
> can you be more specific about what you mean by multiqueuing for disks?
> even a
> reference to an implementation of what you’re asking about would help me
> answer this question.
>
> ill write up a bigger reply after my kids are in bed.
>
> cheers,
> dlg


Hi David,

Thank you for your answer.

The other OpenBSD people I talked to also used the wording "multiqueue". My
understanding of the kernel's workings here is too limited.

If I were to give a reference to some implementation out there, I guess I
would point to the one introduced in Linux 3.13/3.16:

"Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems"
http://kernel.dk/blk-mq.pdf

"Linux Multi-Queue Block IO Queueing Mechanism (blk-mq)"
https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)

"The multiqueue block layer"
https://lwn.net/Articles/552904/

Looking forward a lot to your followup.

Best regards,
Mikael



Re: Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?

2017-02-09 Thread David Gwynne
> On 9 Feb 2017, at 12:42 pm, Mikael  wrote:
>
> Hi misc@,
>
> The SSD reading benchmark in the previous email shows that per-device
> multiqueuing would boost multithreaded random read performance very much,
> e.g. by ~7X+: the current 50MB/sec would increase to ~350MB/sec+.
>
> (I didn't benchmark it yet, but I suspect the current 50MB/sec is
> system-wide, whereas with multiqueuing the 350MB/sec+ would be per drive.)
>
> Multiuser databases, and any parallel file reading activity, would see a
> proportional speedup with multiqueuing.

hey mikael,

can you be more specific about what you mean by multiqueuing for disks? even a
reference to an implementation of what you’re asking about would help me
answer this question.

ill write up a bigger reply after my kids are in bed.

cheers,
dlg

>
>
> Do you have plans to implement this?
>
> Has anything been done toward this already? Any idea when multiqueuing
> could happen?
>
>
> Are donations a matter here? If so, about what size of donations, and to whom?
>
> Someone suggested that implementing it would take a year of work.
>
> Any clarification of what's going on, what's possible, and how it could be
> done would be much appreciated.
>
>
> Thanks,
> Mikael



Per-device multiqueuing would be fantastic. Are there any plans? Are donations a matter here?

2017-02-08 Thread Mikael
Hi misc@,

The SSD reading benchmark in the previous email shows that per-device
multiqueuing would boost multithreaded random read performance very much,
e.g. by ~7X+: the current 50MB/sec would increase to ~350MB/sec+.

(I didn't benchmark it yet, but I suspect the current 50MB/sec is
system-wide, whereas with multiqueuing the 350MB/sec+ would be per drive.)

Multiuser databases, and any parallel file reading activity, would see a
proportional speedup with multiqueuing.


Do you have plans to implement this?

Has anything been done toward this already? Any idea when multiqueuing could
happen?


Are donations a matter here? If so, about what size of donations, and to whom?

Someone suggested that implementing it would take a year of work.

Any clarification of what's going on, what's possible, and how it could be
done would be much appreciated.


Thanks,
Mikael