Re: Time to increase MAXPHYS?

2017-09-15 Thread Warner Losh
On Fri, Sep 15, 2017 at 9:18 AM, Nikolai Lifanov 
wrote:

> On 6/3/17 11:55 PM, Allan Jude wrote:
> > On 2017-06-03 22:35, Julian Elischer wrote:
> >> On 4/6/17 4:59 am, Colin Percival wrote:
> >>> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> >>> wrote:
> >>>> Add better support for larger I/O clusters, including larger physical
> >>>> I/O.  The support is not mature yet, and some of the underlying
> >>>> implementation
> >>>> needs help.  However, support does exist for IDE devices now.
> >>> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> >>> again,
> >>> or do we need to wait at least two decades between changes?
> >>>
> >>> This is hurting performance on some systems; in particular, EC2 "io1"
> >>> disks
> >>> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> >>> spinning rust)
> >>> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> >>> recommends
> >>> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> >>> I/O it
> >>> seems to still be limited by MAXPHYS).
> >>>
> >> We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> >>
> >> sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> >> transfer size */
> >>
> >> ___
> >> freebsd-current@freebsd.org mailing list
> >> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> >> To unsubscribe, send any mail to "freebsd-current-unsubscribe@
> freebsd.org"
> >
> > At some point Warner and I discussed how hard it might be to make this a
> > boot time tunable, so that big amd64 machines can have a larger value
> > without causing problems for smaller machines.
> >
> > ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> > of the benefit.
> >
> > I am preparing some benchmarks and other data along with a patch to
> > increase the maximum size of pipe I/O's as well, because using 1MB
> > offers a relatively large performance gain there as well.
> >
>
> Hi!
>
I also migrated to a 1MB recordsize. What's the status of your patches
> and/or making MAXPHYS a boot-time tunable? I can help test these.
>

Still in my queue to do.

Warner
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-09-15 Thread Nikolai Lifanov
On 6/3/17 11:55 PM, Allan Jude wrote:
> On 2017-06-03 22:35, Julian Elischer wrote:
>> On 4/6/17 4:59 am, Colin Percival wrote:
>>> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
>>> wrote:
>>>> Add better support for larger I/O clusters, including larger physical
>>>> I/O.  The support is not mature yet, and some of the underlying
>>>> implementation
>>>> needs help.  However, support does exist for IDE devices now.
>>> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
>>> again,
>>> or do we need to wait at least two decades between changes?
>>>
>>> This is hurting performance on some systems; in particular, EC2 "io1"
>>> disks
>>> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
>>> spinning rust)
>>> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
>>> recommends
>>> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
>>> I/O it
>>> seems to still be limited by MAXPHYS).
>>>
>> We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
>>
>> sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
>> transfer size */
>>
>> ___
>> freebsd-current@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-current
>> To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
> 
> At some point Warner and I discussed how hard it might be to make this a
> boot time tunable, so that big amd64 machines can have a larger value
> without causing problems for smaller machines.
> 
> ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> of the benefit.
> 
> I am preparing some benchmarks and other data along with a patch to
> increase the maximum size of pipe I/O's as well, because using 1MB
> offers a relatively large performance gain there as well.
> 

Hi!

I also migrated to a 1MB recordsize. What's the status of your patches
and/or making MAXPHYS a boot-time tunable? I can help test these.

- Nikolai





Re: Time to increase MAXPHYS?

2017-06-14 Thread Jia-Shiun Li
On Sun, Jun 4, 2017 at 1:33 PM, Warner Losh  wrote:

> On Sat, Jun 3, 2017 at 2:59 PM, Colin Percival 
> wrote:
>
> > On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> > wrote:
> > > Add better support for larger I/O clusters, including larger physical
> > > I/O.  The support is not mature yet, and some of the underlying
> > implementation
> > > needs help.  However, support does exist for IDE devices now.
> >
> > and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> > again,
> > or do we need to wait at least two decades between changes?
> >
> > This is hurting performance on some systems; in particular, EC2 "io1"
> disks
> > are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized spinning
> > rust)
> > disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> > recommends
> > using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> I/O
> > it
> > seems to still be limited by MAXPHYS).
> >
>
> MAXPHYS is the largest I/O transaction you can push through the system. It
> doesn't matter whether the I/O is physical or not. The name is a relic from
> a time when NFS didn't exist.
>


Sounds like MAXPHYS usage has grown more widespread than it was intended to be.

Would it be better for specific components to depart from MAXPHYS if they
care about performance, and use a more specific limit from the protocol or
hardware spec, e.g. MAXDMASIZE, MAX_ATA_IO_SIZE, and maybe some query
functions?

Having a global max for everything to use just doesn't feel right.
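
As a rough illustration of the query-function idea (all names below are
hypothetical, not existing kernel API), a consumer would ask its provider
for a limit instead of assuming the global MAXPHYS:

    /* Hypothetical sketch only; io_limit_for() does not exist in the tree. */
    static inline u_int
    io_limit_for(struct disk *dp)
    {
            /* prefer the limit the driver reported, else a safe default */
            return (dp->d_maxsize != 0 ? dp->d_maxsize : DFLTPHYS);
    }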

If this is the direction to go in the long run, then changing MAXPHYS
references to their own defined limits may be more meaningful than raising
it universally. In that case I'd propose not raising it, or people would
not be motivated to fix those references.

A quick grep shows that many references use it as a cap only 'for safety'.


-Jia-Shiun
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-10 Thread Tomoaki AOKI
It's what I proposed first. ;-)

But looking through this thread, I now like Konstantin's idea, in
conjunction with quirks mechanism.

With a single MAXPHYS across the whole OS, a lot of older hardware could
fall over if it's set larger, but the larger MAXPHYS is, the better
virtual instances (AWS, Azure, ...) run.

It seems there should be a flexible way to make MAXPHYS "per
consumer" (devices, drivers, virtual instances, ...), and Konstantin's
idea looks good to me. (Although there would be some risk of memory
leak problems.)

One more possibility: abuse quirks to allow a larger MAXPHYS only where
possible, and keep the current default. This way, only devices that
require a larger value would be affected, IMHO.

 *I guess quirks should only be used for problematic things, though.


On Sun, 4 Jun 2017 12:40:55 +
Rick Macklem  wrote:

> There is an array in aio.h sized on MAXPHYS as well.
> 
> A simpler possibility might be to leave MAXPHYS as a compile
> time setting, but allow it to be set "per arch" and make it bigger
> for amd64.
> 
> Good luck with it, rick
> 
> From: owner-freebsd-curr...@freebsd.org  
> on behalf of Konstantin Belousov 
> Sent: Sunday, June 4, 2017 4:10:32 AM
> To: Warner Losh
> Cc: Allan Jude; FreeBSD Current
> Subject: Re: Time to increase MAXPHYS?
> 
> On Sat, Jun 03, 2017 at 11:28:23PM -0600, Warner Losh wrote:
> > On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude  wrote:
> >
> > > On 2017-06-03 22:35, Julian Elischer wrote:
> > > > On 4/6/17 4:59 am, Colin Percival wrote:
> > > >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> > > >> wrote:
> > > >>> Add better support for larger I/O clusters, including larger physical
> > > >>> I/O.  The support is not mature yet, and some of the underlying
> > > >>> implementation
> > > >>> needs help.  However, support does exist for IDE devices now.
> > > >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> > > >> again,
> > > >> or do we need to wait at least two decades between changes?
> > > >>
> > > >> This is hurting performance on some systems; in particular, EC2 "io1"
> > > >> disks
> > > >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> > > >> spinning rust)
> > > >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> > > >> recommends
> > > >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> > > >> I/O it
> > > >> seems to still be limited by MAXPHYS).
> > > >>
> > > > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> > > >
> > > > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> > > > transfer size */
> > > >
> > > > ___
> > > > freebsd-current@freebsd.org mailing list
> > > > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > > > To unsubscribe, send any mail to "freebsd-current-unsubscribe@
> > > freebsd.org"
> > >
> > > At some point Warner and I discussed how hard it might be to make this a
> > > boot time tunable, so that big amd64 machines can have a larger value
> > > without causing problems for smaller machines.
> > >
> > > ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> > > of the benefit.
> > >
> > > I am preparing some benchmarks and other data along with a patch to
> > > increase the maximum size of pipe I/O's as well, because using 1MB
> > > offers a relatively large performance gain there as well.
> > >
> >
> > It doesn't look to be hard to change this, though struct buf depends on
> > MAXPHYS:
> > struct  vm_page *b_pages[btoc(MAXPHYS)];
> > and b_pages isn't the last item in the list, so changing MAXPHYS at boot
> > time would cause an ABI change. IMHO, we should move it to the last element
> > so that wouldn't happen. IIRC all buf allocations are from a fixed pool.
> > We'd have to audit anybody that creates one on the stack knowing it will be
> > persisted. Given how things work, I don't think this is possible, so we may
> > be safe. Thankfully, struct bio doesn't seem to be affected.
> >
> As for making it boot-time configurable, it shouldn't be too horrible with
> the above change. We should have enough of the tunables mechanism up early
> enough to pull this in before we create the buf pool.

Re: Time to increase MAXPHYS?

2017-06-08 Thread Edward Tomasz Napierała
On 0605T1849, Edward Tomasz Napierała wrote:
> On 0604T0952, Hans Petter Selasky wrote:
> > On 06/04/17 09:39, Tomoaki AOKI wrote:
> > > Hi
> > > 
> > > One possibility would be to make it an MD build-time option,
> > > defaulting to 1M on regular systems and 128k on smaller systems.
> > > 
> > > Of course I guess making it a tunable (or sysctl) would be best,
> > > though.
> > > 
> > 
> > Hi,
> > 
> > A tunable sysctl would be fine, but beware that commonly used firmware
> > out there, produced in the millions, might hang in a non-recoverable way
> > if you exceed its "internal limits". Conditionally lowering this
> > definition is fine, but increasing it needs to be carefully verified.
> >
> > For example, many USB devices are only tested with OSes like Windows and
> > MacOS, and if those have any kind of limitation on the SCSI transfer
> > sizes, it is very likely many devices out there do not support any
> > larger transfer sizes either.
> 
> FWIW, when testing cfiscsi(4) with Windows and OSX I've noticed
> that both issue 1MB requests.  I wouldn't be surprised if they avoided
> doing that for older devices, depending on e.g. the SCSI version reported
> by the device.

Erm, this was obviously - or perhaps not so obviously - meant to say: when
testing cfumass(4).  As in, Windows and OSX both seem to issue 1MB requests
to USB Mass Storage devices.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: Time to increase MAXPHYS?

2017-06-08 Thread Edward Tomasz Napierała
Huh, can't say I've tested that.  I'll try to look into this.  Thanks
for the report!

On 0608T2046, peter.b...@bsd4all.org wrote:
> One area that breaks after changing MAXPHYS from 128K to 1MB is the iSCSI
> target. I don’t have details, because that server is semi-production and I
> reverted it back ASAP.
> > On 5 Jun 2017, at 19:49, Edward Tomasz Napierała  wrote:
> > 
> > On 0604T0952, Hans Petter Selasky wrote:
> >> On 06/04/17 09:39, Tomoaki AOKI wrote:
> >>> Hi
> >>> 
> >>> One possibility would be to make it an MD build-time option,
> >>> defaulting to 1M on regular systems and 128k on smaller systems.
> >>> 
> >>> Of course I guess making it a tunable (or sysctl) would be best,
> >>> though.
> >>> 
> >> 
> >> Hi,
> >> 
> >> A tunable sysctl would be fine, but beware that commonly used firmware
> >> out there, produced in the millions, might hang in a non-recoverable way
> >> if you exceed its "internal limits". Conditionally lowering this
> >> definition is fine, but increasing it needs to be carefully verified.
> >>
> >> For example, many USB devices are only tested with OSes like Windows and
> >> MacOS, and if those have any kind of limitation on the SCSI transfer
> >> sizes, it is very likely many devices out there do not support any
> >> larger transfer sizes either.
> > 
> > FWIW, when testing cfiscsi(4) with Windows and OSX I've noticed
> > that both issue 1MB requests.  I wouldn't be surprised if they avoided
> > doing that for older devices, depending on e.g. the SCSI version reported
> > by the device.
> > 
> > ___
> > freebsd-current@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
> 
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: Time to increase MAXPHYS?

2017-06-08 Thread peter . blok
One area that breaks after changing MAXPHYS from 128K to 1MB is the iSCSI
target. I don’t have details, because that server is semi-production and I
reverted it back ASAP.
> On 5 Jun 2017, at 19:49, Edward Tomasz Napierała  wrote:
> 
> On 0604T0952, Hans Petter Selasky wrote:
>> On 06/04/17 09:39, Tomoaki AOKI wrote:
>>> Hi
>>> 
>>> One possibility would be to make it an MD build-time option,
>>> defaulting to 1M on regular systems and 128k on smaller systems.
>>> 
>>> Of course I guess making it a tunable (or sysctl) would be best,
>>> though.
>>> 
>> 
>> Hi,
>> 
>> A tunable sysctl would be fine, but beware that commonly used firmware
>> out there, produced in the millions, might hang in a non-recoverable way
>> if you exceed its "internal limits". Conditionally lowering this
>> definition is fine, but increasing it needs to be carefully verified.
>>
>> For example, many USB devices are only tested with OSes like Windows and
>> MacOS, and if those have any kind of limitation on the SCSI transfer
>> sizes, it is very likely many devices out there do not support any
>> larger transfer sizes either.
> 
> FWIW, when testing cfiscsi(4) with Windows and OSX I've noticed
> that both issue 1MB requests.  I wouldn't be surprised if they avoided
> doing that for older devices, depending on e.g. the SCSI version reported
> by the device.
> 
> ___
> freebsd-current@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: Time to increase MAXPHYS?

2017-06-05 Thread Edward Tomasz Napierała
On 0604T0952, Hans Petter Selasky wrote:
> On 06/04/17 09:39, Tomoaki AOKI wrote:
> > Hi
> > 
> > One possibility would be to make it an MD build-time option,
> > defaulting to 1M on regular systems and 128k on smaller systems.
> > 
> > Of course I guess making it a tunable (or sysctl) would be best,
> > though.
> > 
> 
> Hi,
> 
> A tunable sysctl would be fine, but beware that commonly used firmware
> out there, produced in the millions, might hang in a non-recoverable way
> if you exceed its "internal limits". Conditionally lowering this
> definition is fine, but increasing it needs to be carefully verified.
>
> For example, many USB devices are only tested with OSes like Windows and
> MacOS, and if those have any kind of limitation on the SCSI transfer
> sizes, it is very likely many devices out there do not support any
> larger transfer sizes either.

FWIW, when testing cfiscsi(4) with Windows and OSX I've noticed
that both issue 1MB requests.  I wouldn't be surprised if they avoided
doing that for older devices, depending on e.g. the SCSI version reported
by the device.

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-05 Thread Kenneth D. Merry
On Sun, Jun 04, 2017 at 09:52:36 +0200, Hans Petter Selasky wrote:
> On 06/04/17 09:39, Tomoaki AOKI wrote:
> > Hi
> > 
> > One possibility would be to make it an MD build-time option,
> > defaulting to 1M on regular systems and 128k on smaller systems.
> > 
> > Of course I guess making it a tunable (or sysctl) would be best,
> > though.
> > 
> 
> Hi,
> 
> A tunable sysctl would be fine, but beware that commonly used firmware
> out there, produced in the millions, might hang in a non-recoverable way
> if you exceed its "internal limits". Conditionally lowering this
> definition is fine, but increasing it needs to be carefully verified.
>
> For example, many USB devices are only tested with OSes like Windows and
> MacOS, and if those have any kind of limitation on the SCSI transfer
> sizes, it is very likely many devices out there do not support any
> larger transfer sizes either.

I agree that I'd like to see a tunable.  We've been using a MAXPHYS value
slightly larger than 1MB at Spectra for years with no problems, but then
again, we're only running on newer hardware.

If we keep DFLTPHYS the same (64K) or come up with another constant that is
defined to 64K, the way the da(4) and sa(4) drivers handle things will keep
most older controllers working properly.  Here is what da(4) does:

        if (cpi.maxio == 0)
                softc->maxio = DFLTPHYS;        /* traditional default */
        else if (cpi.maxio > MAXPHYS)
                softc->maxio = MAXPHYS;         /* for safety */
        else
                softc->maxio = cpi.maxio;
        softc->disk->d_maxsize = softc->maxio;

cpi is the XPT_PATH_INQ CCB.  The maxio field was added later, so older,
unmodified drivers that haven't set the maxio field default to a 64K I/O
size.
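
For reference, those values come from a standard XPT_PATH_INQ call; a minimal
sketch of the pattern inside a periph driver, following what da(4) does
(error handling omitted):

    struct ccb_pathinq cpi;

    /* ask the transport/controller for its limits */
    xpt_setup_ccb(&cpi.ccb_h, periph->path, CAM_PRIORITY_NORMAL);
    cpi.ccb_h.func_code = XPT_PATH_INQ;
    xpt_action((union ccb *)&cpi);
    /* cpi.maxio == 0 means an older SIM that never filled in the field */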

Drivers for some of the more common SAS and FC hardware set maxio to a
value that is correct for the hardware.  (e.g. mpt(4), mps(4), mpr(4),
and isp(4) all set it correctly.)

As Warner pointed out, the way ahci(4) works is that it sets its maximum
I/O size to MAXPHYS.  The question is, does all AHCI hardware support
arbitrary transfer sizes?  Is there a way to figure out what the hardware
supports?  If not, we should probably default it to 128K instead of
MAXPHYS.

Tape drives are another related issue.  Tape block sizes up to 1MB are
pretty common.  LTFS allows for blocksizes up to 1MB.  You can't currently
read a tape with a 1MB blocksize on FreeBSD without bumping MAXPHYS and
having a controller and tape drive that can handle the larger blocksize.

The sa(4) driver has the same logic as the da(4) driver for limiting
transfer sizes to the smaller of MAXPHYS and cpi.maxio.

The sa(4) driver gives the user some tools for figuring things out:

{sm4u-1-mgmt:/root:!:1} mt status -v
Drive: sa0:  Serial Number: 101500520A
-
Mode  Density  Blocksize  bpi  Compression
Current:  0x58:LTO-5   variable   384607   enabled (0x1)
-
Current Driver State: at rest.
-
Partition:   0  Calc File Number:   0 Calc Record Number: 0
Residual:0  Reported File Number:   0 Reported Record Number: 0
Flags: BOP
-
Tape I/O parameters:
  Maximum I/O size allowed by driver and controller (maxio): 1048576 bytes
  Maximum I/O size reported by controller (cpi_maxio): 5197824 bytes
  Maximum block size supported by tape drive and media (max_blk): 8388608 bytes
  Minimum block size supported by tape drive and media (min_blk): 1 bytes
  Block granularity supported by tape drive and media (blk_gran): 0 bytes
  Maximum possible I/O size (max_effective_iosize): 1048576 bytes

On this particular FreeBSD/head machine, I have MAXPHYS set to 1MB.  The
controller (isp(4)) supports ~5MB I/O sizes and the drive (IBM LTO-5)
supports ~8MB I/O, but MAXPHYS is set to 1MB, so that is the limit.

I have considered changing the sa(4) driver to not use physio(9), and
instead use a custom allocator to allow reading and writing tapes with
blocksizes up to what the hardware (combination of tape drive and
controller) allows.  I haven't gotten around to it yet, because bumping
MAXPHYS works well enough in most cases.  It also has a nice side effect of
allowing unmapped I/O.

The pass(4) driver limits I/O sizes in the same way as the da(4) and sa(4)
drivers for CCBs sent via the blocking (CAMIOCOMMAND) ioctl, but for CCBs
sent via the asynchronous API, the only limit is the controller (cpi.maxio)
limit.  The latter is because the buffers for the asynchronous interface
are malloced.  If it were possible to send arbitrary sized, unmapped S/G
lists, then we could convert the asynchronous pass(4) interface to do
unmapped I/O.

Ken
-- 
Kenneth Merry
k...@freebsd.org
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Re: Time to increase MAXPHYS?

2017-06-04 Thread Slawa Olhovchenkov
On Sat, Jun 03, 2017 at 11:49:01PM -0600, Warner Losh wrote:

> > Netflix runs MAXPHYS of 8MB. There's issues with something this big, to be
> > sure, especially on memory limited systems. Lots of hardware can't do this
> > big an I/O, and some drivers can't cope, even if the underlying hardware
> > can. Since we don't use such drivers at work, I don't have a list handy
> > (though I think the SG list for NVMe limits it to 1MB). 128k is a totally
> > reasonable bump by default, but I think going larger by default should be
> > approached with some caution given the overhead that adds to struct buf.
> > Having it be a run-time tunable would be great.
> >
> 
> Of course 128k is reasonable, it's the current default :). I meant to say
> that doubling would have a limited impact. 1MB might be a good default, but
> it might be too big for smaller systems (nothing says it has to be a MI
> constant, though). It would be a perfectly fine default if it were a
> tunable.

Some cloud providers limit IOPS per VM; for these cases MAXPHYS must be as
large as possible (at a 500 IOPS cap, for example, 128 kB I/Os top out at
~64 MB/s while 1 MB I/Os allow ~500 MB/s).
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-04 Thread Slawa Olhovchenkov
On Sat, Jun 03, 2017 at 11:55:51PM -0400, Allan Jude wrote:

> On 2017-06-03 22:35, Julian Elischer wrote:
> > On 4/6/17 4:59 am, Colin Percival wrote:
> >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> >> wrote:
> >>> Add better support for larger I/O clusters, including larger physical
> >>> I/O.  The support is not mature yet, and some of the underlying
> >>> implementation
> >>> needs help.  However, support does exist for IDE devices now.
> >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> >> again,
> >> or do we need to wait at least two decades between changes?
> >>
> >> This is hurting performance on some systems; in particular, EC2 "io1"
> >> disks
> >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> >> spinning rust)
> >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> >> recommends
> >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> >> I/O it
> >> seems to still be limited by MAXPHYS).
> >>
> > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> > 
> > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> > transfer size */
> > 
> > ___
> > freebsd-current@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
> 
> At some point Warner and I discussed how hard it might be to make this a
> boot time tunable, so that big amd64 machines can have a larger value
> without causing problems for smaller machines.
> 
> ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> of the benefit.

16MB, actually (the ZFS large_blocks feature allows block sizes up to 16MB).

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-04 Thread Rick Macklem
There is an array in aio.h sized on MAXPHYS as well.

A simpler possibility might be to leave MAXPHYS as a compile
time setting, but allow it to be set "per arch" and make it bigger
for amd64.
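
A sketch of what that per-arch approach could look like in sys/sys/param.h
(illustrative only, not a committed change):

    #ifndef MAXPHYS
    #ifdef __amd64__
    #define MAXPHYS         (1024 * 1024)   /* 1 MB on big machines */
    #else
    #define MAXPHYS         (128 * 1024)    /* keep the historic default */
    #endif
    #endif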

Good luck with it, rick

From: owner-freebsd-curr...@freebsd.org  on 
behalf of Konstantin Belousov 
Sent: Sunday, June 4, 2017 4:10:32 AM
To: Warner Losh
Cc: Allan Jude; FreeBSD Current
Subject: Re: Time to increase MAXPHYS?

On Sat, Jun 03, 2017 at 11:28:23PM -0600, Warner Losh wrote:
> On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude  wrote:
>
> > On 2017-06-03 22:35, Julian Elischer wrote:
> > > On 4/6/17 4:59 am, Colin Percival wrote:
> > >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> > >> wrote:
> > >>> Add better support for larger I/O clusters, including larger physical
> > >>> I/O.  The support is not mature yet, and some of the underlying
> > >>> implementation
> > >>> needs help.  However, support does exist for IDE devices now.
> > >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> > >> again,
> > >> or do we need to wait at least two decades between changes?
> > >>
> > >> This is hurting performance on some systems; in particular, EC2 "io1"
> > >> disks
> > >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> > >> spinning rust)
> > >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> > >> recommends
> > >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> > >> I/O it
> > >> seems to still be limited by MAXPHYS).
> > >>
> > > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> > >
> > > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> > > transfer size */
> > >
> > > ___
> > > freebsd-current@freebsd.org mailing list
> > > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > > To unsubscribe, send any mail to "freebsd-current-unsubscribe@
> > freebsd.org"
> >
> > At some point Warner and I discussed how hard it might be to make this a
> > boot time tunable, so that big amd64 machines can have a larger value
> > without causing problems for smaller machines.
> >
> > ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> > of the benefit.
> >
> > I am preparing some benchmarks and other data along with a patch to
> > increase the maximum size of pipe I/O's as well, because using 1MB
> > offers a relatively large performance gain there as well.
> >
>
> It doesn't look to be hard to change this, though struct buf depends on
> MAXPHYS:
> struct  vm_page *b_pages[btoc(MAXPHYS)];
> and b_pages isn't the last item in the list, so changing MAXPHYS at boot
> time would cause an ABI change. IMHO, we should move it to the last element
> so that wouldn't happen. IIRC all buf allocations are from a fixed pool.
> We'd have to audit anybody that creates one on the stack knowing it will be
> persisted. Given how things work, I don't think this is possible, so we may
> be safe. Thankfully, struct bio doesn't seem to be affected.
>
> As for making it boot-time configurable, it shouldn't be too horrible with
> the above change. We should have enough of the tunables mechanism up early
> enough to pull this in before we create the buf pool.
>
> Netflix runs MAXPHYS of 8MB. There's issues with something this big, to be
> sure, especially on memory limited systems. Lots of hardware can't do this
> big an I/O, and some drivers can't cope, even if the underlying hardware
> can. Since we don't use such drivers at work, I don't have a list handy
> (though I think the SG list for NVMe limits it to 1MB). 128k is a totally
> reasonable bump by default, but I think going larger by default should be
> approached with some caution given the overhead that adds to struct buf.
> Having it be a run-time tunable would be great.
The most important side-effect of bumping MAXPHYS as high as you did,
which is somewhat counter-intuitive and also probably does not matter
for a typical Netflix cache box load (as I understand it), is an increase
in fragmentation for UFS volumes.

MAXPHYS limits the max cluster size, and the larger the cluster we try to
build, the larger the probability of failure.  We might end up with
single-block writes more often, defeating the reallocblk defragmenter.
This might be somewhat theoretical, and probably can be mitigated in the
clustering code if real, but it is a thing to look at.

Re: Time to increase MAXPHYS?

2017-06-04 Thread Tomoaki AOKI
On Sun, 4 Jun 2017 09:52:36 +0200
Hans Petter Selasky  wrote:

> On 06/04/17 09:39, Tomoaki AOKI wrote:
> > Hi
> > 
> > One possibility would be to make it an MD build-time option,
> > defaulting to 1M on regular systems and 128k on smaller systems.
> > 
> > Of course I guess making it a tunable (or sysctl) would be best,
> > though.
> > 
> 
> Hi,
> 
> A tunable sysctl would be fine, but beware that commonly used firmware
> out there, produced in the millions, might hang in a non-recoverable way
> if you exceed its "internal limits". Conditionally lowering this
> definition is fine, but increasing it needs to be carefully verified.
>
> For example, many USB devices are only tested with OSes like Windows and
> MacOS, and if those have any kind of limitation on the SCSI transfer
> sizes, it is very likely many devices out there do not support any
> larger transfer sizes either.
> 
> --HPS

Hmm, so making it a tunable (or sysctl) as Warner noted would allow
drivers to use quirks to work around it, e.g., QUIRKS_MAXPHYS_128K.
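
A rough sketch of what such a quirk table entry might look like in da(4);
the flag name and the matched device are made up, only the table pattern
mirrors the existing da(4) quirk code:

    {
            /* hypothetical USB bridge that wedges on transfers above 128K */
            {T_DIRECT, SIP_MEDIA_REMOVABLE, "VENDOR", "PRODUCT", "*"},
            /*quirks*/ DA_Q_MAXPHYS_128K
    },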

> 
> ___
> freebsd-current@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
> 


-- 
Tomoaki AOKI
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-04 Thread Konstantin Belousov
On Sat, Jun 03, 2017 at 11:28:23PM -0600, Warner Losh wrote:
> On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude  wrote:
> 
> > On 2017-06-03 22:35, Julian Elischer wrote:
> > > On 4/6/17 4:59 am, Colin Percival wrote:
> > >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> > >> wrote:
> > >>> Add better support for larger I/O clusters, including larger physical
> > >>> I/O.  The support is not mature yet, and some of the underlying
> > >>> implementation
> > >>> needs help.  However, support does exist for IDE devices now.
> > >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> > >> again,
> > >> or do we need to wait at least two decades between changes?
> > >>
> > >> This is hurting performance on some systems; in particular, EC2 "io1"
> > >> disks
> > >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> > >> spinning rust)
> > >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> > >> recommends
> > >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> > >> I/O it
> > >> seems to still be limited by MAXPHYS).
> > >>
> > > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> > >
> > > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> > > transfer size */
> > >
> > > ___
> > > freebsd-current@freebsd.org mailing list
> > > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > > To unsubscribe, send any mail to "freebsd-current-unsubscribe@
> > freebsd.org"
> >
> > At some point Warner and I discussed how hard it might be to make this a
> > boot time tunable, so that big amd64 machines can have a larger value
> > without causing problems for smaller machines.
> >
> > ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> > of the benefit.
> >
> > I am preparing some benchmarks and other data along with a patch to
> > increase the maximum size of pipe I/O's as well, because using 1MB
> > offers a relatively large performance gain there as well.
> >
> 
> It doesn't look to be hard to change this, though struct buf depends on
> MAXPHYS:
> struct  vm_page *b_pages[btoc(MAXPHYS)];
> and b_pages isn't the last item in the list, so changing MAXPHYS at boot
> time would cause an ABI change. IMHO, we should move it to the last element
> so that wouldn't happen. IIRC all buf allocations are from a fixed pool.
> We'd have to audit anybody that creates one on the stack knowing it will be
> persisted. Given how things work, I don't think this is possible, so we may
> be safe. Thankfully, struct bio doesn't seem to be affected.
> 
> As for making it boot-time configurable, it shouldn't be too horrible with
> the above change. We should have enough of the tunables mechanism up early
> enough to pull this in before we create the buf pool.
> 
> Netflix runs MAXPHYS of 8MB. There's issues with something this big, to be
> sure, especially on memory limited systems. Lots of hardware can't do this
> big an I/O, and some drivers can't cope, even if the underlying hardware
> can. Since we don't use such drivers at work, I don't have a list handy
> (though I think the SG list for NVMe limits it to 1MB). 128k is a totally
> reasonable bump by default, but I think going larger by default should be
> approached with some caution given the overhead that adds to struct buf.
> Having it be a run-time tunable would be great.
The most important side-effect of bumping MAXPHYS as high as you did,
which is somewhat counter-intuitive and also probably does not matter
for a typical Netflix cache box load (as I understand it), is an increase
in fragmentation for UFS volumes.

MAXPHYS limits the max cluster size, and the larger the cluster we try to
build, the larger the probability of failure.  We might end up with
single-block writes more often, defeating the reallocblk defragmenter.
This might be somewhat theoretical, and probably can be mitigated in the
clustering code if real, but it is a thing to look at.

WRT making the MAXPHYS tunable, I do not like the proposal of converting
b_pages[] into a flexible array.  I think that making b_pages a pointer
to an off-structure page run is better.  One of the reasons is that buf
cache buffers are not the only buffers in the system.  There are several
cases where the buffers are malloced, like markers for iterating queues.
In that case, b_pages[] can be eliminated entirely.  (I believe I changed
all local struct bufs to be allocated with malloc.)

Another supply of buffers outside the buf cache is the phys buffer pool;
see vm/vm_pager.c.
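
To make the two layouts concrete, a purely illustrative sketch (abbreviated
member subsets with made-up struct names, not the real struct buf):

    /* today: array sized at compile time, with members following it */
    struct buf_now {
            /* ... earlier members ... */
            struct vm_page  *b_pages[btoc(MAXPHYS)];
            /* ... later members, whose offsets move if MAXPHYS changes ... */
    };

    /* proposed: pointer to an off-structure page run */
    struct buf_proposed {
            /* ... other members ... */
            struct vm_page  **b_pages;      /* NULL for malloced markers */
            int             b_npages_alloc; /* run length, set at allocation */
    };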

> 
> There's a number of places in userland that depend on MAXPHYS, which is
> unfortunate since they assume a fixed value and don't pick it up from the
> kernel or kernel config. Thankfully, there are only a limited number of
> these.
> 
> Of course, there's times when I/Os can return much more than this. Reading
> drive log pages, for example, can generate tens or hundreds of MB of data,
> and there's no way to do that with one transaction today.

Re: Time to increase MAXPHYS?

2017-06-04 Thread Hans Petter Selasky

On 06/04/17 09:39, Tomoaki AOKI wrote:
> Hi
>
> One possibility would be to make it an MD build-time option,
> defaulting to 1M on regular systems and 128k on smaller systems.
>
> Of course I guess making it a tunable (or sysctl) would be best,
> though.

Hi,

A tunable sysctl would be fine, but beware that commonly used firmware
out there, produced in the millions, might hang in a non-recoverable way
if you exceed its "internal limits". Conditionally lowering this
definition is fine, but increasing it needs to be carefully verified.


For example, many USB devices are only tested with OSes like Windows and
MacOS, and if those have any kind of limitation on the SCSI transfer
sizes, it is very likely many devices out there do not support any
larger transfer sizes either.


--HPS

___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-04 Thread Tomoaki AOKI
Hi

One possibility would be to make it an MD build-time option,
defaulting to 1M on regular systems and 128k on smaller systems.

Of course I guess making it a tunable (or sysctl) would be best,
though.


On Sat, 3 Jun 2017 23:49:01 -0600
Warner Losh  wrote:

> On Sat, Jun 3, 2017 at 11:28 PM, Warner Losh  wrote:
> 
> >
> >
> > On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude  wrote:
> >
> >> On 2017-06-03 22:35, Julian Elischer wrote:
> >> > On 4/6/17 4:59 am, Colin Percival wrote:
> >> >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> >> >> wrote:
> >> >>> Add better support for larger I/O clusters, including larger physical
> >> >>> I/O.  The support is not mature yet, and some of the underlying
> >> >>> implementation
> >> >>> needs help.  However, support does exist for IDE devices now.
> >> >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> >> >> again,
> >> >> or do we need to wait at least two decades between changes?
> >> >>
> >> >> This is hurting performance on some systems; in particular, EC2 "io1"
> >> >> disks
> >> >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> >> >> spinning rust)
> >> >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> >> >> recommends
> >> >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> >> >> I/O it
> >> >> seems to still be limited by MAXPHYS).
> >> >>
> >> > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> >> >
> >> > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> >> > transfer size */
> >> >
> >> > ___
> >> > freebsd-current@freebsd.org mailing list
> >> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> >> > To unsubscribe, send any mail to "freebsd-current-unsubscribe@f
> >> reebsd.org"
> >>
> >> At some point Warner and I discussed how hard it might be to make this a
> >> boot time tunable, so that big amd64 machines can have a larger value
> >> without causing problems for smaller machines.
> >>
> >> ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> >> of the benefit.
> >>
> >> I am preparing some benchmarks and other data along with a patch to
> >> increase the maximum size of pipe I/O's as well, because using 1MB
> >> offers a relatively large performance gain there as well.
> >>
> >
> > It doesn't look to be hard to change this, though struct buf depends on
> > MAXPHYS:
> > struct  vm_page *b_pages[btoc(MAXPHYS)];
> > and b_pages isn't the last item in the list, so changing MAXPHYS at boot
> > time would cause an ABI change. IMHO, we should move it to the last element
> > so that wouldn't happen. IIRC all buf allocations are from a fixed pool.
> > We'd have to audit anybody that creates one on the stack knowing it will be
> > persisted. Given how things work, I don't think this is possible, so we may
> > be safe. Thankfully, struct bio doesn't seem to be affected.
> >
> > As for making it boot-time configurable, it shouldn't be too horrible with
> > the above change. We should have enough of the tunables mechanism up early
> > enough to pull this in before we create the buf pool.
> >
> > Netflix runs MAXPHYS of 8MB. There's issues with something this big, to be
> > sure, especially on memory limited systems. Lots of hardware can't do this
> > big an I/O, and some drivers can't cope, even if the underlying hardware
> > can. Since we don't use such drivers at work, I don't have a list handy
> > (though I think the SG list for NVMe limits it to 1MB). 128k is a totally
> > reasonable bump by default, but I think going larger by default should be
> > approached with some caution given the overhead that adds to struct buf.
> > Having it be a run-time tunable would be great.
> >
> 
> Of course 128k is reasonable, it's the current default :). I meant to say
> that doubling would have a limited impact. 1MB might be a good default, but
> it might be too big for smaller systems (nothing says it has to be a MI
> constant, though). It would be a perfectly fine default if it were a
> tunable.
> 
> 
> > There's a number of places in userland that depend on MAXPHYS, which is
> > unfortunate since they assume a fixed value and don't pick it up from the
> > kernel or kernel config. Thankfully, there are only a limited number of
> > these.
> >
> 
> There's a number of other places that assume MAXPHYS is constant. The ahci
> driver uses it to define the max number of SG operations you can have, for
> example. aio has an array sized based off of it. There are some places that
> use this when they should use 128k instead. There's several places that use
> it to define other constants, and it would take a while to run them all to
> ground to make sure they are all good. We might need to bump DFLTPHYS as
> well, so it might also make a good tunable. There's a few places that check
> things in terms of a fixed multiple of MAXPHYS that are rules of thumb that
> kinda work today, maybe by accident, or maybe the 100 * MAXPHYS is highly
> scientific. It's hard to say without careful study.

Re: Time to increase MAXPHYS?

2017-06-03 Thread Warner Losh
On Sat, Jun 3, 2017 at 11:28 PM, Warner Losh  wrote:

>
>
> On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude  wrote:
>
>> On 2017-06-03 22:35, Julian Elischer wrote:
>> > On 4/6/17 4:59 am, Colin Percival wrote:
>> >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
>> >> wrote:
>> >>> Add better support for larger I/O clusters, including larger physical
>> >>> I/O.  The support is not mature yet, and some of the underlying
>> >>> implementation
>> >>> needs help.  However, support does exist for IDE devices now.
>> >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
>> >> again,
>> >> or do we need to wait at least two decades between changes?
>> >>
>> >> This is hurting performance on some systems; in particular, EC2 "io1"
>> >> disks
>> >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
>> >> spinning rust)
>> >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
>> >> recommends
>> >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
>> >> I/O it
>> >> seems to still be limited by MAXPHYS).
>> >>
>> > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
>> >
>> > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
>> > transfer size */
>> >
>> > ___
>> > freebsd-current@freebsd.org mailing list
>> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
>> > To unsubscribe, send any mail to "freebsd-current-unsubscribe@f
>> reebsd.org"
>>
>> At some point Warner and I discussed how hard it might be to make this a
>> boot time tunable, so that big amd64 machines can have a larger value
>> without causing problems for smaller machines.
>>
>> ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
>> of the benefit.
>>
>> I am preparing some benchmarks and other data along with a patch to
>> increase the maximum size of pipe I/O's as well, because using 1MB
>> offers a relatively large performance gain there as well.
>>
>
> It doesn't look to be hard to change this, though struct buf depends on
> MAXPHYS:
> struct  vm_page *b_pages[btoc(MAXPHYS)];
> and b_pages isn't the last item in the list, so changing MAXPHYS at boot
> time would cause an ABI change. IMHO, we should move it to the last element
> so that wouldn't happen. IIRC all buf allocations are from a fixed pool.
> We'd have to audit anybody that creates one on the stack knowing it will be
> persisted. Given how things work, I don't think this is possible, so we may
> be safe. Thankfully, struct bio doesn't seem to be affected.
>
> As for making it boot-time configurable, it shouldn't be too horrible with
> the above change. We should have enough of the tunables mechanism up early
> enough to pull this in before we create the buf pool.
>
> Netflix runs MAXPHYS of 8MB. There's issues with something this big, to be
> sure, especially on memory limited systems. Lots of hardware can't do this
> big an I/O, and some drivers can't cope, even if the underlying hardware
> can. Since we don't use such drivers at work, I don't have a list handy
> (though I think the SG list for NVMe limits it to 1MB). 128k is a totally
> reasonable bump by default, but I think going larger by default should be
> approached with some caution given the overhead that adds to struct buf.
> Having it be a run-time tunable would be great.
>

Of course 128k is reasonable, it's the current default :). I meant to say
that doubling would have a limited impact. 1MB might be a good default, but
it might be too big for smaller systems (nothing says it has to be a MI
constant, though). It would be a perfectly fine default if it were a
tunable.


> There's a number of places in userland that depend on MAXPHYS, which is
> unfortunate since they assume a fixed value and don't pick it up from the
> kernel or kernel config. Thankfully, there are only a limited number of
> these.
>

There's a number of other places that assume MAXPHYS is constant. The ahci
driver uses it to define the max number of SG operations you can have, for
example. aio has an array sized based off of it. There are some places that
use this when they should use 128k instead. There's several places that use
it to define other constants, and it would take a while to run them all to
ground to make sure they are all good. We might need to bump DFLTPHYS as
well, so it might also make a good tunable. There's a few places that check
things in terms of a fixed multiple of MAXPHYS that are rules of thumb that
kinda work today, maybe by accident, or maybe the 100 * MAXPHYS is highly
scientific. It's hard to say without careful study.
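
As one concrete shape of that coupling (a paraphrase of the idea, not the
literal ahci(4) source): the SG table is sized at compile time from MAXPHYS,
roughly

    #define AHCI_SG_ENTRIES  (howmany(MAXPHYS, PAGE_SIZE) + 1)

so a boot-time MAXPHYS would mean allocating that table at attach instead.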

For example, until recently, nvmecontrol would use MAXPHYS. But it's the
system default MAXPHYS. And even if it isn't, there's currently a hard
limit of 1MB for an I/O imposed by how the driver uses nvme's SG lists. But
it doesn't show up as MAXPHYS, but rather as NVME_MAX_XFER_SIZE in places.
It totally su

Re: Time to increase MAXPHYS?

2017-06-03 Thread Warner Losh
On Sat, Jun 3, 2017 at 2:59 PM, Colin Percival  wrote:

> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> wrote:
> > Add better support for larger I/O clusters, including larger physical
> > I/O.  The support is not mature yet, and some of the underlying
> implementation
> > needs help.  However, support does exist for IDE devices now.
>
> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> again,
> or do we need to wait at least two decades between changes?
>
> This is hurting performance on some systems; in particular, EC2 "io1" disks
> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized spinning
> rust)
> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> recommends
> using a maximum I/O size of 1 MB (and despite NFS not being *physical* I/O
> it
> seems to still be limited by MAXPHYS).
>

MAXPHYS is the largest I/O transaction you can push through the system. It
doesn't matter whether the I/O is physical or not. The name is a relic from
a time when NFS didn't exist.

Warner
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-03 Thread Warner Losh
On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude  wrote:

> On 2017-06-03 22:35, Julian Elischer wrote:
> > On 4/6/17 4:59 am, Colin Percival wrote:
> >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> >> wrote:
> >>> Add better support for larger I/O clusters, including larger physical
> >>> I/O.  The support is not mature yet, and some of the underlying
> >>> implementation
> >>> needs help.  However, support does exist for IDE devices now.
> >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> >> again,
> >> or do we need to wait at least two decades between changes?
> >>
> >> This is hurting performance on some systems; in particular, EC2 "io1"
> >> disks
> >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> >> spinning rust)
> >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> >> recommends
> >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> >> I/O it
> >> seems to still be limited by MAXPHYS).
> >>
> > We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> >
> > sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> > transfer size */
> >
> > ___
> > freebsd-current@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscribe@
> freebsd.org"
>
> At some point Warner and I discussed how hard it might be to make this a
> boot time tunable, so that big amd64 machines can have a larger value
> without causing problems for smaller machines.
>
> ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> of the benefit.
>
> I am preparing some benchmarks and other data along with a patch to
> increase the maximum size of pipe I/O's as well, because using 1MB
> offers a relatively large performance gain there as well.
>

It doesn't look to be hard to change this, though struct buf depends on
MAXPHYS:
struct  vm_page *b_pages[btoc(MAXPHYS)];
and b_pages isn't the last item in the list, so changing MAXPHYS at boot
time would cause an ABI change. IMHO, we should move it to the last element
so that wouldn't happen. IIRC all buf allocations are from a fixed pool.
We'd have to audit anybody that creates one on the stack knowing it will be
persisted. Given how things work, I don't think this is possible, so we may
be safe. Thankfully, struct bio doesn't seem to be affected.
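
A sketch of the move-to-the-end idea (illustrative only): with b_pages last,
a C99 flexible array member could be sized when the buf pool is created,
without moving the offset of any other member:

    struct buf {
            /* ... every other member first ... */
            struct vm_page  *b_pages[];     /* sized as maxphys/PAGE_SIZE
                                               when the pool is allocated */
    };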

As for making it boot-time configurable, it shouldn't be too horrible with
the above change. We should have enough of the tunables mechanism up early
enough to pull this in before we create the buf pool.
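
A minimal sketch of fetching such a tunable early; the kern.maxphys name and
the SYSINIT placement are assumptions, while TUNABLE_ULONG_FETCH and SYSINIT
are the standard mechanisms:

    static u_long maxphys = 128 * 1024;     /* historic default */

    static void
    maxphys_init(void *dummy __unused)
    {
            TUNABLE_ULONG_FETCH("kern.maxphys", &maxphys);
            maxphys = round_page(maxphys);  /* keep it page-aligned */
    }
    SYSINIT(maxphys, SI_SUB_TUNABLES, SI_ORDER_ANY, maxphys_init, NULL);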

Netflix runs MAXPHYS of 8MB. There's issues with something this big, to be
sure, especially on memory limited systems. Lots of hardware can't do this
big an I/O, and some drivers can't cope, even if the underlying hardware
can. Since we don't use such drivers at work, I don't have a list handy
(though I think the SG list for NVMe limits it to 1MB). 128k is a totally
reasonable bump by default, but I think going larger by default should be
approached with some caution given the overhead that adds to struct buf.
Having it be a run-time tunable would be great.

There's a number of places in userland that depend on MAXPHYS, which is
unfortunate since they assume a fixed value and don't pick it up from the
kernel or kernel config. Thankfully, there are only a limited number of
these.

Of course, there's times when I/Os can return much more than this. Reading
drive log pages, for example, can generate tens or hundreds of MB of data,
and there's no way to do that with one transaction today. If drive makers
were perfect, we could use the generally defined offset and length fields
to read them out piecemeal. If the log is stable, a big if for some of the
snapshots of internal state logs that are sometimes necessary to
investigate problems... It sure would be nice if there were a way to have
super-huge I/O on an exception basis for these situations.

Warner
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Time to increase MAXPHYS?

2017-06-03 Thread Allan Jude
On 2017-06-03 22:35, Julian Elischer wrote:
> On 4/6/17 4:59 am, Colin Percival wrote:
>> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
>> wrote:
>>> Add better support for larger I/O clusters, including larger physical
>>> I/O.  The support is not mature yet, and some of the underlying
>>> implementation
>>> needs help.  However, support does exist for IDE devices now.
>> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
>> again,
>> or do we need to wait at least two decades between changes?
>>
>> This is hurting performance on some systems; in particular, EC2 "io1"
>> disks
>> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
>> spinning rust)
>> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
>> recommends
>> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
>> I/O it
>> seems to still be limited by MAXPHYS).
>>
> We increased it in FreeBSD 8 and 10.3 on our systems; only good results.
> 
> sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O
> transfer size */
> 
> ___
> freebsd-current@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

At some point Warner and I discussed how hard it might be to make this a
boot time tunable, so that big amd64 machines can have a larger value
without causing problems for smaller machines.

ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
of the benefit.

I am preparing some benchmarks and other data along with a patch to
increase the maximum size of pipe I/O's as well, because using 1MB
offers a relatively large performance gain there as well.

-- 
Allan Jude





Re: Time to increase MAXPHYS?

2017-06-03 Thread Julian Elischer

On 4/6/17 4:59 am, Colin Percival wrote:

> On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
> wrote:
>> Add better support for larger I/O clusters, including larger physical
>> I/O.  The support is not mature yet, and some of the underlying implementation
>> needs help.  However, support does exist for IDE devices now.
> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it again,
> or do we need to wait at least two decades between changes?
>
> This is hurting performance on some systems; in particular, EC2 "io1" disks
> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized spinning rust)
> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS) recommends
> using a maximum I/O size of 1 MB (and despite NFS not being *physical* I/O it
> seems to still be limited by MAXPHYS).


We increased it in FreeBSD 8 and 10.3 on our systems; only good results.

sys/sys/param.h:#define MAXPHYS (1024 * 1024)   /* max raw I/O 
transfer size */


___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Time to increase MAXPHYS?

2017-06-03 Thread Colin Percival
On January 24, 1998, in what was later renumbered to SVN r32724, dyson@
wrote:
> Add better support for larger I/O clusters, including larger physical
> I/O.  The support is not mature yet, and some of the underlying implementation
> needs help.  However, support does exist for IDE devices now.

and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it again,
or do we need to wait at least two decades between changes?

This is hurting performance on some systems; in particular, EC2 "io1" disks
are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized spinning rust)
disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS) recommends
using a maximum I/O size of 1 MB (and despite NFS not being *physical* I/O it
seems to still be limited by MAXPHYS).

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"