Re: scsipi: physio split the request

2019-01-01 Thread Thor Lancelot Simon
On Tue, Jan 01, 2019 at 02:48:05PM -0500, Thor Lancelot Simon wrote:
> 
> The work remaining to be done on the branch, as I see it, is:

[...]

I missed one.  I got fed up dealing with the way arguments are
passed through the mount syscalls (especially for ufs) so the
work of letting mount(8) pass a maxxfer value through (either
initially or on an update mount) is not done.  That's definitely
a thing that should be done before this hits the tree.

Thor


Re: scsipi: physio split the request

2019-01-01 Thread Thor Lancelot Simon
On Thu, Dec 27, 2018 at 09:07:41AM +, Emmanuel Dreyfus wrote:
> Hello
> 
> A few years ago I made a failed attempt at running LTFS on a LTO 6 drive. 
> I resumed the effort, and once I got the LTFS code ported, running 
> a command like mkltfs fails with kernel console saying:
> st0(mpii0:0:2:0): physio split the request.. cannot proceed
> 
> This is netbsd-current from yesterday.

You really need tls-maxphys.  It won't be a ton of fun to rebase it
on newer NetBSD-current, but it can't be more than a day's work (IIRC,
where I left it we were pre device/softc cleanup, and that'll be some
nuisance to address if so).

tls-maxphys propagates the maximum supported transfer size down the
system's actual discovered bus topology at boot time; any node in the
tree can enforce its own restrictions as it sees fit, and nodes like
RAIDframe that effectively demux I/O can compute and declare their
own supported maximum.
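A minimal sketch of that propagation idea (every name below is invented for illustration and does not match the branch's actual API): each node in the discovered tree may impose its own limit, and the effective limit at a leaf is the minimum along the path to the root.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only -- not tls-maxphys code.  A leaf device
 * (say, st0 behind mpii0) can transfer at most the minimum of
 * its own limit and every limit above it in the bus topology.
 */
struct devnode {
	struct devnode *parent;
	unsigned long self_max;	/* node's own limit in bytes, 0 = none */
};

static unsigned long
effective_maxphys(const struct devnode *dn, unsigned long root_max)
{
	unsigned long m = root_max;

	for (; dn != NULL; dn = dn->parent)
		if (dn->self_max != 0 && dn->self_max < m)
			m = dn->self_max;
	return m;
}
```

A RAIDframe-style node that demuxes I/O would instead compute its own `self_max` from its components and publish that to the layers above.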

The work remaining to be done on the branch, as I see it, is:

1) *Some* backpressure mechanism *must* be implemented to prevent
   the filesystems from greedily attempting maximum size I/Os at
   all times, because with a new, much larger maximum in many
   cases, this will lead to much worse unfairness than we now see
   (and some threads doing I/O will much more obviously starve
   others).

   IIRC we've already got something effective for either read or
   write but not the other; it's been a while, though, so I could be wrong.

2) There's an ugly case with RAIDframe if a component is replaced with one
   that supports a smaller maxphys.  The filesystems need to be notified
   so they can change their own internal max xfer size.  I think I wrote the
   code to deal with this but it's untested.  Wants a look.

3) A number of device drivers -- particularly things in the LSI family -- will
   need to learn about newer DMA descriptor formats supported by their
   hardware in order to support transfers of reasonable size for things like
   tape drives (mpt and possibly mfi*, for example, are currently limited to
   192K because our driver only supports a very old descriptor format; this
   should be a relatively simple fix based on reading newer open-source code
   for these devices as a reference).
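The backpressure in item 1 could be as simple as a per-issuer in-flight byte cap, sketched below with invented names (this is not a proposal for the actual mechanism, just the shape of the idea):

```c
#include <assert.h>

/*
 * Illustrative sketch: cap the bytes a single issuer may have in
 * flight, so threads doing maximum-size I/Os must yield the queue
 * to others instead of starving them.
 */
struct io_throttle {
	unsigned long inflight;	/* bytes issued but not yet completed */
	unsigned long limit;	/* per-issuer cap on in-flight bytes */
};

/* Returns 1 if the request may be issued now, 0 if it must wait. */
static int
throttle_start(struct io_throttle *t, unsigned long nbytes)
{
	if (t->inflight + nbytes > t->limit)
		return 0;
	t->inflight += nbytes;
	return 1;
}

static void
throttle_done(struct io_throttle *t, unsigned long nbytes)
{
	t->inflight -= nbytes;
}
```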

I believe that should be all that's needed.  I would estimate it at 5 days
of work, or perhaps a month of evenings/weekends.  I don't have that time
available now and won't in the foreseeable future, but perhaps someone reading
this does.

And of course some of you are much quicker at this stuff than I am (thorpej,
I'm looking at you ;-)).

Most of what the branch does is useful *even if* we remove the stupid VA/PA
mapping business for I/O, I think, because it's mostly config sugar to let
the clients know how big an I/O they can ask for at runtime, and that will
be needed regardless.

Thor


Re: scsipi: physio split the request

2018-12-28 Thread Warner Losh
On Fri, Dec 28, 2018, 11:04 AM Warner Losh wrote:
>
> On Fri, Dec 28, 2018, 1:25 AM matthew green wrote:
>> > Of course larger transfers would also mitigate the overhead for each I/O
>> > operation, but we already do several Gigabyte/s with 64k transfers and
>> > filesystem I/O tends to be even smaller.
>>
>> yes - the benefits will be in the 0-10% range for most things.  it
>> will help, but only a fairly small amount, most of us won't notice.
>>
>> i've seen peaks of 1.4GB/s with an nvme(4) device with ffs on top.
>>
>
>
> I've seen 3.3GB/s of 128k-512k transfers on FreeBSD off of nvme, but
> that's mostly video. It seems to be limited there not so much by transfer
> size, but by the ability to queue transactions. We see <1% improvement by
> raising MAXPHYS to 1MB over the default 128k there.
>

Also, we are limited by what the device itself can do, which varies a lot by
drive: from a low of 1GB/s to a high of just under 3.4GB/s.

Warner



Re: scsipi: physio split the request

2018-12-28 Thread Warner Losh
On Fri, Dec 28, 2018, 1:25 AM matthew green wrote:
> > Of course larger transfers would also mitigate the overhead for each I/O
> > operation, but we already do several Gigabyte/s with 64k transfers and
> > filesystem I/O tends to be even smaller.
>
> yes - the benefits will be in the 0-10% range for most things.  it
> will help, but only a fairly small amount, most of us won't notice.
>
> i've seen peaks of 1.4GB/s with an nvme(4) device with ffs on top.
>


I've seen 3.3GB/s of 128k-512k transfers on FreeBSD off of nvme, but that's
mostly video. It seems to be limited there not so much by transfer size,
but by the ability to queue transactions. We see <1% improvement by raising
MAXPHYS to 1MB over the default 128k there.

Warner



Re: scsipi: physio split the request

2018-12-28 Thread Jaromír Doleček
> On Dec 27, 12:29pm, buh...@nfbcal.org (Brian Buhrow) wrote:
> -- Subject: Re: scsipi: physio split the request
>
> |   hello.  Just out of curiosity, why did the tls-maxphys branch never
> | get merged with head once the work was done or mostly done?

Simply nobody finished it up yet.

On Thu, Dec 27, 2018 at 22:07, Christos Zoulas wrote:
> mostly done...

I did the last tls-maxphys sync. It will need at least one resync,
since more stuff has moved around (e.g. wd(4) was converted to dksubr
since then).

I want to eliminate some duplicated code in PCI-IDE drivers before the
merge. Other than that, I don't see any other blockers for the merge.

Of course, it requires proper round of testing before it could be
really considered merge-worthy.

As others noted, the performance benefit of tls-maxphys is likely to
be small. It does, however, remove artificial limits in the stack, unlocking
new opportunities for further development, so it's worth doing nevertheless.

Jaromir


Re: scsipi: physio split the request

2018-12-28 Thread Jaromír Doleček
On Thu, Dec 27, 2018 at 15:41, Emmanuel Dreyfus wrote:
>
> On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
> > I think you need resurrect the tls-maxphys branch... It was close to working
> > IIRC.
>
> What happens if I just #define MAXPHYS (1024*1024*1024) ?

Several drivers use MAXPHYS to size the DMA memory they allocate (e.g.
nvme); they usually try to allocate
MAXPHYS * (max number of scatter-gather vectors). These allocations will
either fail, or tie up huge amounts of RAM.

Also, the current 64k just happens to be the hard I/O size limit for ISA,
and for ATA/IDE drives not supporting LBA48.

Even for hardware which does support bigger I/O (stuff behind PCI, SCSI, ATA
drives with LBA48), the drivers might be buggy and not cope.

Nothing good comes of this; unfortunately, the lower layers need
to be fixed first.
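The allocation blow-up is easy to see with back-of-the-envelope arithmetic (illustrative only; the per-driver details vary):

```c
#include <assert.h>

/*
 * Illustrative only: a driver that preallocates DMA-able space
 * proportional to MAXPHYS -- per queued command, or per
 * scatter-gather vector -- pins roughly maxphys * nalloc bytes
 * of RAM.  Harmless at 64k; enormous at 1GB.
 */
static unsigned long long
dma_prealloc_bytes(unsigned long maxphys, unsigned int nalloc)
{
	return (unsigned long long)maxphys * nalloc;
}
```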

Jaromir


re: scsipi: physio split the request

2018-12-27 Thread matthew green
> Of course larger transfers would also mitigate the overhead for each I/O
> operation, but we already do several Gigabyte/s with 64k transfers and
> filesystem I/O tends to be even smaller.

yes - the benefits will be in the 0-10% range for most things.  it
will help, but only a fairly small amount, most of us won't notice.

i've seen peaks of 1.4GB/s with an nvme(4) device with ffs on top.


.mrg.


Re: scsipi: physio split the request

2018-12-27 Thread Michael van Elst
jnem...@cue.bc.ca (John Nemeth) writes:

>On Dec 27,  6:49pm, Michael van Elst wrote:
>} So far that's mostly a problem with software raid and modern tape I/O.

> Wouldn't hardware RAID also benefit from bigger buffers?
>Although, I suppose a battery-backed cache could be used to work around
>small transfer sizes.

The transfer size currently limits I/O of stripes because it is split
over all stripe units (drives). A hardware controller does this internally
and isn't affected by MAXPHYS.
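To make the split concrete (illustrative arithmetic, not RAIDframe code): a request capped at MAXPHYS is divided over the data components of a stripe, so each drive sees only a fraction of an already small transfer.

```c
#include <assert.h>

/*
 * Illustrative only: bytes each data component sees when a request
 * capped at maxphys is split evenly over ndata stripe units.
 */
static unsigned long
per_component_xfer(unsigned long maxphys, unsigned int ndata)
{
	return maxphys / ndata;
}
```

With the current 64k cap and four data drives, each component gets 16k requests; a hardware controller doing the split internally never sees that limit.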

Of course larger transfers would also mitigate the overhead for each I/O
operation, but we already do several Gigabyte/s with 64k transfers and
filesystem I/O tends to be even smaller.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: scsipi: physio split the request

2018-12-27 Thread John Nemeth
On Dec 27,  6:49pm, Michael van Elst wrote:
} m...@netbsd.org (Emmanuel Dreyfus) writes:
} 
} >Is there a reason other than historical for NetBSD's 64kB limit?
} 
} It's a compromise. Some buffers are statically sized for MAXPHYS
} and some ancient hardware cannot exceed 64k (or even less) DMA transfers.
} The buffer size is mostly a problem because we don't support
} scatter-gather transfers, so the buffers need to be contigous in
} physical RAM (and some hardware doesn't support s-g either).
} 
} So far that's mostly a problem with software raid and modern tape I/O.

 Wouldn't hardware RAID also benefit from bigger buffers?
Although, I suppose a battery-backed cache could be used to work around
small transfer sizes.

}-- End of excerpt from Michael van Elst


Re: scsipi: physio split the request

2018-12-27 Thread Michael van Elst
thor...@me.com (Jason Thorpe) writes:

>> You need a really huge amount of RAM for that, and also a huge
>> KVA space.

>...but it doesn't have to be that way.

>The fundamental problem is that for physio, we currently have to map the 
>buffer into kernel space at all.

Mapping into KVA is another problem.

>  We really should have a more abstract way to describe memory that is passed 
> down to device drivers that currently take struct buf *s, call it an I/O 
> memory descriptor ("iomd"). This iomd would have, say, an array of vm_page 
> *'s, or perhaps an array of paddr_t's, but would also have a pointer to the 
> buffer as mapped into kernel address space.

The problem is that currently we, and also some hardware, cannot handle
such a construct.

>Then a new bus_dmamap_load_iomd() call could take an iomd as an argument, and 
>skip doing a bunch of work (calling into the pmap later to get the physical 
>address), and just build the bus_dma_segment_t's directly.

There is hardware that can only handle a single bus_dma_segment.

So that's:

- support some more abstract MAXPHYS (i.e. not a global constant).
- make buffers based on scatter-gather lists instead of a single linear
  piece of memory.
- make drivers use these scatter-gather buffers
- try to emulate this behaviour when hardware is too limited.
- make other users of buffers compatible with scatter-gather lists

That's a long way to go and still not related to mapping buffers into KVA.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: scsipi: physio split the request

2018-12-27 Thread Christos Zoulas
On Dec 27, 12:29pm, buh...@nfbcal.org (Brian Buhrow) wrote:
-- Subject: Re: scsipi: physio split the request

|   hello.  Just out of curiosity, why did the tls-maxphys branch never
| get merged with head once the work was done or mostly done?

mostly done...

christos


Re: scsipi: physio split the request

2018-12-27 Thread Jason Thorpe



> On Dec 27, 2018, at 10:51 AM, Michael van Elst  wrote:
> 
> m...@netbsd.org (Emmanuel Dreyfus) writes:
> 
>> What happens if I just #define MAXPHYS (1024*1024*1024) ?
> 
> You need a really huge amount of RAM for that, and also a huge
> KVA space.

...but it doesn't have to be that way.

The fundamental problem is that for physio, we currently have to map the buffer 
into kernel space at all.  We really should have a more abstract way to 
describe memory that is passed down to device drivers that currently take 
struct buf *s, call it an I/O memory descriptor ("iomd").  This iomd would 
have, say, an array of vm_page *'s, or perhaps an array of paddr_t's, but would 
also have a pointer to the buffer as mapped into kernel address space.  The 
necessary part is having the page array filled in, along with an offset and a 
length.  If that's not sufficient, callers could map the buffer ONLY if needed, 
e.g. if you have to do PIO to your device.

Then a new bus_dmamap_load_iomd() call could take an iomd as an argument, and 
skip doing a bunch of work (calling into the pmap later to get the physical 
address), and just build the bus_dma_segment_t's directly.  If it needs to 
bounce-buffer, then the back-end takes care of calling iomd_map() or whatever.

This isn't a fully fleshed-out proposal, or anything, but I know it's been 
brought up off and on for years... we really ought to just get around to doing 
it.  Unfortunately, it's going to mean modifying a lot of drivers before the 
upper layers can assume "I can pass iomds down everywhere for buf I/O".
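A hypothetical sketch of what such a descriptor might look like (every name below is invented for illustration; this is not a worked-out API):

```c
#include <assert.h>
#include <stddef.h>

#define PGSIZE	4096		/* illustrative page size */

typedef unsigned long xpaddr_t;	/* stand-in for the kernel's paddr_t */

/*
 * Hypothetical I/O memory descriptor: the buffer is described by its
 * backing physical pages plus an offset and length, and is mapped
 * into KVA only if a consumer (e.g. a PIO driver) actually asks.
 */
struct iomd {
	xpaddr_t *pages;	/* physical addresses of backing pages */
	size_t	  npages;
	size_t	  offset;	/* byte offset into the first page */
	size_t	  length;	/* total transfer length in bytes */
	void	 *kva;		/* lazy mapping, NULL until requested */
};

/* Pages needed to cover a transfer that may start mid-page. */
static size_t
iomd_npages(size_t offset, size_t length)
{
	return (offset % PGSIZE + length + PGSIZE - 1) / PGSIZE;
}
```

A `bus_dmamap_load_iomd()`-style back-end could then walk `pages[]` directly to build its segments instead of calling into the pmap, bounce-buffering via the lazy mapping only when it must.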

-- thorpej



Re: scsipi: physio split the request

2018-12-27 Thread Brian Buhrow
hello.  Just out of curiosity, why did the tls-maxphys branch never
get merged with head once the work was done or mostly done?
-thanks
-Brian



Re: scsipi: physio split the request

2018-12-27 Thread Christos Zoulas
In article <20181227153028.gr4...@homeworld.netbsd.org>,
Emmanuel Dreyfus   wrote:
>On Thu, Dec 27, 2018 at 09:47:03AM -0500, Christos Zoulas wrote:
>> | What happens if I just #define MAXPHYS (1024*1024*1024) ?
>> I don't think that's a good idea. My guess is that things are going to
>blow up.
>
>At least if I try to be on par with Linux limit and build with
>-DMAXPHYS=1048576 the system goes to multiuser without a hitch.
>
>Running mkltfs raises a few errors on the console, though:
>mpii0: error 27 loading dmamap
>st0(mpii0:0:2:0): passthrough: adapter inconsistency
>mpii0: error 27 loading dmamap
>st0(mpii0:0:2:0): passthrough: adapter inconsistency

Told you: EFBIG :-)
Why don't you try tls-maxphys?

christos



Re: scsipi: physio split the request

2018-12-27 Thread Michael van Elst
m...@netbsd.org (Emmanuel Dreyfus) writes:

>What happens if I just #define MAXPHYS (1024*1024*1024) ?

You need a really huge amount of RAM for that, and also a huge
KVA space. Try MAXPHYS (1024*1024) for a start.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: scsipi: physio split the request

2018-12-27 Thread Michael van Elst
m...@netbsd.org (Emmanuel Dreyfus) writes:

>Is there a reason other than historical for NetBSD's 64kB limit?

It's a compromise. Some buffers are statically sized for MAXPHYS
and some ancient hardware cannot exceed 64k (or even less) DMA transfers.
The buffer size is mostly a problem because we don't support
scatter-gather transfers, so the buffers need to be contiguous in
physical RAM (and some hardware doesn't support s-g either).

So far that's mostly a problem with software raid and modern tape I/O.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: scsipi: physio split the request

2018-12-27 Thread Michael van Elst
m...@netbsd.org (Emmanuel Dreyfus) writes:

>On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
>> tape block sizes are usually larger than 512 (I use 64k here).
>> What block size did mkltfs use ? Actually we can't do larger than 64k.

>It seems to attempt transfers of 256kB

We are limited to MAXPHYS which is currently 64k.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: scsipi: physio split the request

2018-12-27 Thread Emmanuel Dreyfus
On Thu, Dec 27, 2018 at 09:47:03AM -0500, Christos Zoulas wrote:
> | What happens if I just #define MAXPHYS (1024*1024*1024) ?
> I don't think that's a good idea. My guess is that things are going to blow 
> up.

At least if I try to be on par with Linux limit and build with
-DMAXPHYS=1048576 the system goes to multiuser without a hitch.

Running mkltfs raises a few errors on the console, though:
mpii0: error 27 loading dmamap
st0(mpii0:0:2:0): passthrough: adapter inconsistency
mpii0: error 27 loading dmamap
st0(mpii0:0:2:0): passthrough: adapter inconsistency




-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: scsipi: physio split the request

2018-12-27 Thread Christos Zoulas
On Dec 27,  2:41pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: scsipi: physio split the request

| On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
| > I think you need resurrect the tls-maxphys branch... It was close to working
| > IIRC.
| 
| What happens if I just #define MAXPHYS (1024*1024*1024) ?

I don't think that's a good idea. My guess is that things are going to blow up.

christos


Re: scsipi: physio split the request

2018-12-27 Thread Emmanuel Dreyfus
On Thu, Dec 27, 2018 at 02:33:28PM +, Christos Zoulas wrote:
> I think you need resurrect the tls-maxphys branch... It was close to working
> IIRC.

What happens if I just #define MAXPHYS (1024*1024*1024) ?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: scsipi: physio split the request

2018-12-27 Thread Christos Zoulas
In article <20181227123711.go4...@homeworld.netbsd.org>,
Emmanuel Dreyfus   wrote:
>On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
>> tape block sizes are usually larger than 512 (I use 64k here).
>> What block size did mkltfs use ? Actually we can't do larger than 64k.
>
>It seems to attempt transfers of 256kB
>
>LTFS20010D SCSI request: [ A3 1F 08 00 00 00 04 00 00 00 00 00 ]
>Requested length=262144
>LTFS20089D Driver detail:errno = 0x5
>LTFS20089D Driver detail:  host_status = 0x0
>LTFS20089D Driver detail:driver_status = 0x0
>LTFS20089D Driver detail:   status = 0x0
>LTFS20011D SCSI outcome: Driver status=0xFF SCSI status=0xFF Actual length=0

I think you need resurrect the tls-maxphys branch... It was close to working
IIRC.

christos



Re: scsipi: physio split the request

2018-12-27 Thread Emmanuel Dreyfus
On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
> tape block sizes are usually larger than 512 (I use 64k here).

I patched ltfs so that all the max sizes (256kB and 512kB for Linux)
are set to 64kB for NetBSD. I can now format and mount the LTFS filesystem,
but I need to limit the block size to under 64kB.

This will work:
dump -0f - / | dd obs=63k of=/ltfs/dump20181227 

This hangs the filesystem:
dump -0f - / | dd obs=64k of=/ltfs/dump20181227 

I verified on glusterfs that our FUSE implementation does not limit writes
to 64k chunks, so I assume I introduced a bug in ltfs with the 64kB 
limit everywhere.

Is there a reason other than historical for NetBSD's 64kB limit?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: scsipi: physio split the request

2018-12-27 Thread Emmanuel Dreyfus
On Thu, Dec 27, 2018 at 10:44:46AM +0100, Manuel Bouyer wrote:
> tape block sizes are usually larger than 512 (I use 64k here).
> What block size did mkltfs use ? Actually we can't do larger than 64k.

It seems to attempt transfers of 256kB

LTFS20010D SCSI request: [ A3 1F 08 00 00 00 04 00 00 00 00 00 ] Requested 
length=262144
LTFS20089D Driver detail:errno = 0x5
LTFS20089D Driver detail:  host_status = 0x0
LTFS20089D Driver detail:driver_status = 0x0
LTFS20089D Driver detail:   status = 0x0
LTFS20011D SCSI outcome: Driver status=0xFF SCSI status=0xFF Actual length=0


-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: scsipi: physio split the request

2018-12-27 Thread Manuel Bouyer
On Thu, Dec 27, 2018 at 09:07:41AM +, Emmanuel Dreyfus wrote:
> Hello
> 
> A few years ago I made a failed attempt at running LTFS on a LTO 6 drive. 
> I resumed the effort, and once I got the LTFS code ported, running 
> a command like mkltfs fails with kernel console saying:
> st0(mpii0:0:2:0): physio split the request.. cannot proceed
> 
> This is netbsd-current from yesterday.
> 
> I understand this is about tape block sizes larger than the usual 512. 

tape block sizes are usually larger than 512 (I use 64k here).
What block size did mkltfs use ? Actually we can't do larger than 64k.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--