Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Stuart Henderson
On 2021-04-21, Kent Watsen  wrote:
>   - When ZFS is told to use the SSD, it starts the partition
>  on sector 256 (not the default sector 34) to ensure good
>  SSD NAND alignment.

With typical computer-component SSDs the OS doesn't get all that close
to the NAND layer: there is a layer in between doing translation and
wear levelling (and in some cases compression). It is black-box
proprietary code with presumably a fair bit of deep magic involved.
(Some OSes do have more direct access to certain types of flash devices
that need OS control of wear levelling; OpenBSD doesn't, and FFS is
probably not the right filesystem for this anyway.)

There are different block sizes involved too; one is the size in
which writes can be done; the other is for erases which is typically
much larger.

If someone wants this badly enough then the starting point is to
show some figures for a situation which it improves. Benchmarks for
speed improvements. Maybe there's something in SSD SMART stats that
will give clues to whether it reduces write amplification.
(Then it needs repeating on different hardware; even different
firmware versions in an SSD could change how it behaves, let alone
differences between the various controller manufacturers).

I've written disklabel/fdisk diffs for this before, but I couldn't
figure out whether they actually helped anything.




Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Kent Watsen
[My previous message was somewhat garbled when reflected back at me.  It looks 
better in the archives here: 
https://marc.info/?l=openbsd-misc&m=161902769301731&w=2.  I’m resending as
plain-text to see if the problem is on my end.]


I’m running OpenBSD on top of bhyve using virtual disks allocated out of ZFS
pools.  While not the same setup, some concepts carry over...

I have two types of pools:

  1) an "expensive" pool for fast random IO:
     - this pool is made up of stripes of SSD-based vdevs.
     - ZFS is configured to use a 16K recordsize for this pool.
     - good for small files (guest OS, DBs, web/mail/dns files, etc.)
     - When ZFS is told to use the SSD, it starts the partition
       on sector 256 (not the default sector 34) to ensure good
       SSD NAND alignment.

  2) a less-expensive pool for large sequential IO:
     - this pool is a single RAIDZ2-based vdev using spinning rust.
     - ZFS is configured to use a 1M recordsize for this pool.
     - good for large files (movies, high-res images, backups, etc.)

Virtual disks are exposed to the OpenBSD guests from both pools.  The guest’s 
root-disk is always allocated from pool #1.  Typically, a second 
application-specific disk is also allocated from pool #1 (e.g., /var/www/sites 
on a web server, /home on a mail server, etc.).  Only in special circumstances 
(e.g., a media server) is a disk allocated from pool #2. 

This arrangement avoids having to read/write 1M records for each small
file access, and also reduces the chance that a guest access to a given
block spans more than a single physical block.

Can VMware virtual disks be configured similarly?

K.




Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Kent Watsen
I’m running OpenBSD on top of bhyve using virtual disks allocated out of ZFS
pools.  While not the same setup, some concepts carry over…

I have two types of pools:

  1) an "expensive" pool for fast random IO:
     - this pool is made up of stripes of SSD-based vdevs.
     - ZFS is configured to use a 16K recordsize for this pool.
     - good for small files (guest OS, DBs, web/mail/dns files, etc.)
     - When ZFS is told to use the SSD, it starts the partition
       on sector 256 (not the default sector 34) to ensure good
       SSD NAND alignment.

  2) a less-expensive pool for large sequential IO:
     - this pool is a single RAIDZ2-based vdev using spinning rust.
     - ZFS is configured to use a 1M recordsize for this pool.
     - good for large files (movies, high-res images, backups, etc.)

Virtual disks are exposed to the OpenBSD guests from both pools.  The guest’s 
root-disk is always allocated from pool #1.  Typically, a second 
application-specific disk is also allocated from pool #1 (e.g., /var/www/sites 
on a web server, /home on a mail server, etc.).  Only in special circumstances 
(e.g., a media server) is a disk allocated from pool #2. 

This arrangement avoids having to read/write 1M records for each small
file access, and also reduces the chance that a guest access to a given
block spans more than a single physical block.

Can VMware virtual disks be configured similarly?

K.


> On Apr 21, 2021, at 12:35 PM, Tom Smyth  wrote:
> 
> Christian, Otto, Thanks for your feedback on this one
> 
> Ill research it further,
> but NTFS has 4K, 8K 32K and 64K Allocation units on the
> filessystem and for Microsoft  windows running Exchange or Database workloads
> they were recommending alignment of the NTFS partitions
> on the 1MB offset also.
> 
> From Otto's, explanation (Thanks) of 1/16  blocks would potentially
> cross a boundary  of the
> storage subsystem,
> 6.25% of reads(or writes)  could result in a double Read ( or double write)
> 
> of course the write issue is a bigger problem for the SSDs..
> 
> I can configure the partitions how I want ,for now anyway,
> 
> Ill do a little digging on FFS and FFS2 and see how the filesystem
> database (or table)
> is structured...
> 
> Thanks for the feedback it is very helpful to me
> 
> All the best,
> 
> Tom Smyth
> 
> 
> 
> On Wed, 21 Apr 2021 at 15:25, Christian Weisgerber  wrote:
>> 
>> Tom Smyth:
>> 
>>> if you were to have a 1MB file or  a database that needed to read 1MB
>>> of data,  i
>>> f the partitions are not aligned then
>>> your underlying storage system need to load 2 chunks  or write 2
>>> chunks for 1 MB of data, written,
>> 
>> You seem to assume that FFS2 would align a 1MB file on an 1MB border
>> within the filesystem.  That is not case.  That 1MB file will be
>> aligned on a blocksize border (16/32/64 kB, depending on filesystem
>> size).  Aligning the partition on n*blocksize has no effect on this.
>> 
>> --
>> Christian "naddy" Weisgerber  na...@mips.inka.de
> 
> 
> 
> -- 
> Kindest regards,
> Tom Smyth.
> 



Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Tom Smyth
Christian, Otto, thanks for your feedback on this one.

I'll research it further, but NTFS has 4K, 8K, 32K and 64K allocation
units on the filesystem, and for Microsoft Windows running Exchange or
database workloads they were recommending alignment of the NTFS
partitions on the 1MB offset also.

From Otto's explanation (thanks), 1 in 16 blocks would potentially
cross a boundary of the storage subsystem, so 6.25% of reads (or
writes) could result in a double read (or double write).

Of course the write issue is a bigger problem for SSDs.

I can configure the partitions how I want, for now anyway.

I'll do a little digging on FFS and FFS2 and see how the filesystem
database (or table) is structured...

Thanks for the feedback, it is very helpful to me.

All the best,

Tom Smyth



On Wed, 21 Apr 2021 at 15:25, Christian Weisgerber  wrote:
>
> Tom Smyth:
>
> > if you were to have a 1MB file or  a database that needed to read 1MB
> > of data,  i
> > f the partitions are not aligned then
> > your underlying storage system need to load 2 chunks  or write 2
> > chunks for 1 MB of data, written,
>
> You seem to assume that FFS2 would align a 1MB file on an 1MB border
> within the filesystem.  That is not case.  That 1MB file will be
> aligned on a blocksize border (16/32/64 kB, depending on filesystem
> size).  Aligning the partition on n*blocksize has no effect on this.
>
> --
> Christian "naddy" Weisgerber  na...@mips.inka.de



-- 
Kindest regards,
Tom Smyth.



Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Christian Weisgerber
Tom Smyth:

> if you were to have a 1MB file or  a database that needed to read 1MB
> of data,  i
> f the partitions are not aligned then
> your underlying storage system need to load 2 chunks  or write 2
> chunks for 1 MB of data, written,

You seem to assume that FFS2 would align a 1MB file on a 1MB border
within the filesystem.  That is not the case.  That 1MB file will be
aligned on a blocksize border (16/32/64 kB, depending on filesystem
size).  Aligning the partition on n*blocksize has no effect on this.
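
The point above can be checked with a little arithmetic. What follows is only a minimal sketch, assuming a 1 MiB chunk in the underlying storage and a 64 KiB filesystem block; `chunks_touched` is a hypothetical helper, and the block indexes at which the file starts are made up for illustration (they do not model FFS2's real allocation policy).

```python
CHUNK = 1024 * 1024      # assumed chunk size of the underlying storage
BLOCK = 64 * 1024        # assumed FFS2 block size for a large filesystem


def chunks_touched(start: int, length: int, chunk: int = CHUNK) -> int:
    """Number of underlying chunks covered by bytes [start, start + length)."""
    first = start // chunk
    last = (start + length - 1) // chunk
    return last - first + 1


# Partition start aligned on a 1 MiB boundary (2048 sectors of 512 bytes).
partition_start = 2048 * 512

# Hypothetical first block of a 1 MiB file inside the filesystem.
for block_index in (0, 5, 16):
    file_start = partition_start + block_index * BLOCK
    print(block_index, chunks_touched(file_start, CHUNK))
```

Under these assumptions, only a file that happens to start at a block index that is a multiple of 16 fits in a single chunk, even though the partition itself is 1 MiB aligned.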

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Otto Moerbeek
On Wed, Apr 21, 2021 at 09:56:59AM +0100, Tom Smyth wrote:

> Hello Otto, Christian,
> 
> I was relying on that paper for the pictures of the alignment issue,
> 
> VMFS  (vmware file system)since version 5 of vmwarehas allocation
> units of 1MB each
> 
> https://kb.vmware.com/s/article/2137120
> 
> my understanding is that SSDs   have a similar allocation unit setup of 1MB,
> 
> and that aligning your file system to 1MB would improve performance
> 
> 
> |OpenBSD Filesystem --|  FFS-Filesystem
> |VMDK Virtual Disk file for Guest |  OpenBSD-Gusest-Disk0.vmdk
> |vmware datastore--  |   1MB allocation
> |Logical Storage Device / RAID---|
> |SSD or DISK storage --|1MB allocation  unit (on some SSDs)
> 
> Figure 2 of the following paper shows what
> https://www.usenix.org/legacy/event/usenix09/tech/full_papers/rajimwale/rajimwale.pdf
> as your writes start to cross another underlying block boundary you
> see a degradation of performance
> largest impact is on a write o1 1MB (misaligned) across 2 blocks,

The maximum unit OpenBSD writes in one go is 64k, so the issue is not
that relevant.  Only 1 in 16 blocks would potentially cross a boundary.
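
The 1-in-16 figure can be reproduced with a rough sketch. It assumes 512-byte sectors, 1 MiB chunks in the underlying storage, and back-to-back 64 KiB transfers laid out from the partition start; `crossing_fraction` is a hypothetical helper, and real filesystem layout is more complicated than this.

```python
SECTOR = 512             # assumed sector size in bytes
CHUNK = 1024 * 1024      # assumed chunk size of the underlying storage
IO_SIZE = 64 * 1024      # largest single write, per the message above


def crossing_fraction(offset_sectors: int, n_ios: int = 1024) -> float:
    """Fraction of back-to-back IO_SIZE transfers that straddle a chunk boundary."""
    base = offset_sectors * SECTOR
    crossing = 0
    for i in range(n_ios):
        start = base + i * IO_SIZE
        end = start + IO_SIZE - 1
        # A transfer crosses a boundary when its first and last byte
        # fall into different chunks.
        if start // CHUNK != end // CHUNK:
            crossing += 1
    return crossing / n_ios


for offset in (64, 128, 2048):
    print(offset, crossing_fraction(offset))
```

With an offset of 64 sectors this gives 0.0625 (1 in 16); with 128 or 2048 sectors no transfer crosses a boundary in this simplified model.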

You are free to set up your disks in a way that suits you, but in
general I don't think we should enforce 1MB alignment of the start of a
partition and/or its size because *some* *might* get a benefit.

-Otto

> but it repeats as you increase the number  of MB in a transaction but
> the % overhead
> reduces for each additional 1MB in the Transaction.
> 
> If there is no downside to allocating  /Offsetting  filesystems on 1MB
> boundaries,
> can we do that by default to reduce wear on SSDs, and improve performance
> in Virtualized Environments with large allocation units on what ever storage
> subsystem they are running.
> 
> Thanks for your time
> 
> Tom Smyth
> 
> 
> 
> 
> On Wed, 21 Apr 2021 at 08:49, Otto Moerbeek  wrote:
> >
> > On Wed, Apr 21, 2021 at 08:20:10AM +0100, Tom Smyth wrote:
> >
> > > Hi Christian,
> > >
> > > if you were to have a 1MB file or  a database that needed to read 1MB
> > > of data,  i
> > > f the partitions are not aligned then
> > > your underlying storage system need to load 2 chunks  or write 2
> > > chunks for 1 MB of data, written,
> > >
> > > So *worst* case you would double the workload for the storage hardware
> > > (SSD or Hardware RAID with large chunks)  for each transaction
> > > on writing to SSDs if you are not aligned one could *worst *case
> > > double the write / wear rate.
> > >
> > > The improvement would be less for accessing small files and writing small 
> > > files
> > > (as they would need to be across  2 Chunks )
> > >
> > > The following paper explains (better  than I do )
> > > https://www.vmware.com/pdf/esx3_partition_align.pdf
> > >
> > > if the cost is  1-8MB at the start of the disk (assuming partitions are 
> > > sized
> > >  so that they dont loose the ofset of 2048 sectors)
> > > I think it is worth pursuing. (again I only have experience on amd64
> > > /i386 hardware)
> >
> > Doing a quick scan trhough the pdf I only see talk about 64k boundaries.
> >
> > FFS(2) will split up any partiition in multiple cylinder groups. Each
> > cylinder group starts with a superblock copy, inode tables and other
> > meta datas before the data blocks of that cylinder group. Having the
> > start of a partion a 1 1MB boundary does not get you those data blocks
> > at a specific boundary. So I think your resoning does not apply to FFS(2).
> >
> > It might make sense to move the start to offset 128 for big
> > partitions, so you align with the 64k boundary mentioned in the pdf,
> > the block size is already 64k (for big parttiions).
> >
> > -Otto
> >
> > >
> > > Thanks
> > > Tom Smyth
> > >
> > > On Tue, 20 Apr 2021 at 22:52, Christian Weisgerber  
> > > wrote:
> > > >
> > > > Tom Smyth:
> > > >
> > > > > just installing todays snapshot and the default offset on amd64 is 64,
> > > > >  (as it has been for as long as I can remember)
> > > >
> > > > It was changed from 63 in 2010.
> > > >
> > > > > Is it worth while updating the defaults so that OpenBSD partition
> > > > > layout will be optimal for SSD or other Virtualized RAID environments
> > > > > with 1MB  Chunks,
> > > >
> > > > What are you trying to optimize with this?  FFS2 file systems reserve
> > > > 64 kB at the start of a partition, and after that it's filesystem
> > > > blocks, which are 16/32/64 kB, depending on the size of the filesystem.
> > > > I can barely see an argument for aligning large partitions at 128
> > > > sectors, but what purpose would larger multiples serve?
> > > >
> > > > > Is there a down side  to moving the default offset to 2048 ?
> > > >
> > > > Not really.  It wastes a bit of space, but that is rather insignificant
> > > > for today's disk sizes.
> > > >
> > > > --
> > > > Christian "naddy" Weisgerber  na...@mips.inka.de
> > > >
> > >
> > >
> > > --
> > > Kindest regards,
> > 

Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Tom Smyth
Hello Otto, Christian,

I was relying on that paper for the pictures of the alignment issue.

VMFS (the VMware file system) has had allocation units of 1MB each
since version 5 of VMware:

https://kb.vmware.com/s/article/2137120

My understanding is that SSDs have a similar allocation unit setup of 1MB,
and that aligning your file system to 1MB would improve performance.


| OpenBSD filesystem               | FFS filesystem
| VMDK virtual disk file for guest | OpenBSD-Gusest-Disk0.vmdk
| VMware datastore                 | 1MB allocation
| Logical storage device / RAID    |
| SSD or disk storage              | 1MB allocation unit (on some SSDs)

Figure 2 of the following paper shows this:
https://www.usenix.org/legacy/event/usenix09/tech/full_papers/rajimwale/rajimwale.pdf
As your writes start to cross another underlying block boundary you
see a degradation of performance.  The largest impact is on a write of
1MB (misaligned) across 2 blocks, but it repeats as you increase the
number of MB in a transaction, while the % overhead reduces for each
additional 1MB in the transaction.
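
The shrinking overhead can be illustrated with a small sketch. It assumes 1 MiB chunks and a misalignment of 32 KiB (the value is arbitrary); `chunks_touched` and `overhead` are hypothetical helpers, and the numbers count chunks touched, not measured performance.

```python
CHUNK = 1024 * 1024  # assumed chunk size of the underlying storage


def chunks_touched(start: int, length: int, chunk: int = CHUNK) -> int:
    """Number of underlying chunks covered by bytes [start, start + length)."""
    return (start + length - 1) // chunk - start // chunk + 1


def overhead(n_mib: int, misalignment: int = 32 * 1024) -> float:
    """Extra chunks touched by a misaligned transfer, relative to an aligned one."""
    length = n_mib * CHUNK
    aligned = chunks_touched(0, length)
    misaligned = chunks_touched(misalignment, length)
    return (misaligned - aligned) / aligned


for n in (1, 2, 4, 8):
    print(n, overhead(n))
```

In this model a misaligned transfer of N MiB touches N + 1 chunks instead of N, so the relative overhead is 1/N: 100% for 1 MiB, 50% for 2 MiB, 12.5% for 8 MiB.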

If there is no downside to allocating/offsetting filesystems on 1MB
boundaries, can we do that by default to reduce wear on SSDs and improve
performance in virtualized environments with large allocation units on
whatever storage subsystem they are running?

Thanks for your time

Tom Smyth




On Wed, 21 Apr 2021 at 08:49, Otto Moerbeek  wrote:
>
> On Wed, Apr 21, 2021 at 08:20:10AM +0100, Tom Smyth wrote:
>
> > Hi Christian,
> >
> > if you were to have a 1MB file or  a database that needed to read 1MB
> > of data,  i
> > f the partitions are not aligned then
> > your underlying storage system need to load 2 chunks  or write 2
> > chunks for 1 MB of data, written,
> >
> > So *worst* case you would double the workload for the storage hardware
> > (SSD or Hardware RAID with large chunks)  for each transaction
> > on writing to SSDs if you are not aligned one could *worst *case
> > double the write / wear rate.
> >
> > The improvement would be less for accessing small files and writing small 
> > files
> > (as they would need to be across  2 Chunks )
> >
> > The following paper explains (better  than I do )
> > https://www.vmware.com/pdf/esx3_partition_align.pdf
> >
> > if the cost is  1-8MB at the start of the disk (assuming partitions are 
> > sized
> >  so that they dont loose the ofset of 2048 sectors)
> > I think it is worth pursuing. (again I only have experience on amd64
> > /i386 hardware)
>
> Doing a quick scan trhough the pdf I only see talk about 64k boundaries.
>
> FFS(2) will split up any partiition in multiple cylinder groups. Each
> cylinder group starts with a superblock copy, inode tables and other
> meta datas before the data blocks of that cylinder group. Having the
> start of a partion a 1 1MB boundary does not get you those data blocks
> at a specific boundary. So I think your resoning does not apply to FFS(2).
>
> It might make sense to move the start to offset 128 for big
> partitions, so you align with the 64k boundary mentioned in the pdf,
> the block size is already 64k (for big parttiions).
>
> -Otto
>
> >
> > Thanks
> > Tom Smyth
> >
> > On Tue, 20 Apr 2021 at 22:52, Christian Weisgerber  
> > wrote:
> > >
> > > Tom Smyth:
> > >
> > > > just installing todays snapshot and the default offset on amd64 is 64,
> > > >  (as it has been for as long as I can remember)
> > >
> > > It was changed from 63 in 2010.
> > >
> > > > Is it worth while updating the defaults so that OpenBSD partition
> > > > layout will be optimal for SSD or other Virtualized RAID environments
> > > > with 1MB  Chunks,
> > >
> > > What are you trying to optimize with this?  FFS2 file systems reserve
> > > 64 kB at the start of a partition, and after that it's filesystem
> > > blocks, which are 16/32/64 kB, depending on the size of the filesystem.
> > > I can barely see an argument for aligning large partitions at 128
> > > sectors, but what purpose would larger multiples serve?
> > >
> > > > Is there a down side  to moving the default offset to 2048 ?
> > >
> > > Not really.  It wastes a bit of space, but that is rather insignificant
> > > for today's disk sizes.
> > >
> > > --
> > > Christian "naddy" Weisgerber  na...@mips.inka.de
> > >
> >
> >
> > --
> > Kindest regards,
> > Tom Smyth.
> >



-- 
Kindest regards,
Tom Smyth.



Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Otto Moerbeek
On Wed, Apr 21, 2021 at 08:20:10AM +0100, Tom Smyth wrote:

> Hi Christian,
> 
> if you were to have a 1MB file or  a database that needed to read 1MB
> of data,  i
> f the partitions are not aligned then
> your underlying storage system need to load 2 chunks  or write 2
> chunks for 1 MB of data, written,
> 
> So *worst* case you would double the workload for the storage hardware
> (SSD or Hardware RAID with large chunks)  for each transaction
> on writing to SSDs if you are not aligned one could *worst *case
> double the write / wear rate.
> 
> The improvement would be less for accessing small files and writing small 
> files
> (as they would need to be across  2 Chunks )
> 
> The following paper explains (better  than I do )
> https://www.vmware.com/pdf/esx3_partition_align.pdf
> 
> if the cost is  1-8MB at the start of the disk (assuming partitions are sized
>  so that they dont loose the ofset of 2048 sectors)
> I think it is worth pursuing. (again I only have experience on amd64
> /i386 hardware)

Doing a quick scan through the pdf I only see talk about 64k boundaries.

FFS(2) will split up any partition into multiple cylinder groups. Each
cylinder group starts with a superblock copy, inode tables and other
metadata before the data blocks of that cylinder group. Having the
start of a partition on a 1MB boundary does not get you those data blocks
at a specific boundary. So I think your reasoning does not apply to FFS(2).

It might make sense to move the start to offset 128 for big
partitions, so you align with the 64k boundary mentioned in the pdf;
the block size is already 64k (for big partitions).
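
For reference, the sector offsets mentioned in this thread can be converted to byte boundaries as follows. This is a minimal sketch assuming 512-byte sectors; `alignment_report` is a hypothetical helper, not part of disklabel or fdisk.

```python
SECTOR = 512  # assumed sector size in bytes


def alignment_report(offset_sectors: int) -> dict:
    """Byte offset of a partition start and whether it sits on 64 KiB / 1 MiB boundaries."""
    offset_bytes = offset_sectors * SECTOR
    return {
        "bytes": offset_bytes,
        "64KiB": offset_bytes % (64 * 1024) == 0,
        "1MiB": offset_bytes % (1024 * 1024) == 0,
    }


# Offsets discussed in the thread: the old default (63), the current
# default (64), the 64k-aligned candidate (128) and the 1MB proposal (2048).
for offset in (63, 64, 128, 2048):
    print(offset, alignment_report(offset))
```

Under that assumption, offset 64 is 32 KiB into the disk and aligned to neither boundary, offset 128 is exactly 64 KiB, and offset 2048 is exactly 1 MiB.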

-Otto

> 
> Thanks
> Tom Smyth
> 
> On Tue, 20 Apr 2021 at 22:52, Christian Weisgerber  wrote:
> >
> > Tom Smyth:
> >
> > > just installing todays snapshot and the default offset on amd64 is 64,
> > >  (as it has been for as long as I can remember)
> >
> > It was changed from 63 in 2010.
> >
> > > Is it worth while updating the defaults so that OpenBSD partition
> > > layout will be optimal for SSD or other Virtualized RAID environments
> > > with 1MB  Chunks,
> >
> > What are you trying to optimize with this?  FFS2 file systems reserve
> > 64 kB at the start of a partition, and after that it's filesystem
> > blocks, which are 16/32/64 kB, depending on the size of the filesystem.
> > I can barely see an argument for aligning large partitions at 128
> > sectors, but what purpose would larger multiples serve?
> >
> > > Is there a down side  to moving the default offset to 2048 ?
> >
> > Not really.  It wastes a bit of space, but that is rather insignificant
> > for today's disk sizes.
> >
> > --
> > Christian "naddy" Weisgerber  na...@mips.inka.de
> >
> 
> 
> -- 
> Kindest regards,
> Tom Smyth.
> 



Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-21 Thread Tom Smyth
Hi Christian,

If you were to have a 1MB file or a database that needed to read 1MB
of data, and the partitions are not aligned, then your underlying
storage system needs to load 2 chunks or write 2 chunks for 1MB of
data.

So *worst* case you would double the workload for the storage hardware
(SSD or hardware RAID with large chunks) for each transaction.
When writing to SSDs, if you are not aligned, one could *worst* case
double the write / wear rate.

The improvement would be less for accessing and writing small files
(as they would need to be across 2 chunks to be affected).

The following paper explains it (better than I do):
https://www.vmware.com/pdf/esx3_partition_align.pdf

If the cost is 1-8MB at the start of the disk (assuming partitions are
sized so that they don't lose the offset of 2048 sectors),
I think it is worth pursuing. (Again, I only have experience on amd64
/i386 hardware.)

Thanks
Tom Smyth

On Tue, 20 Apr 2021 at 22:52, Christian Weisgerber  wrote:
>
> Tom Smyth:
>
> > just installing todays snapshot and the default offset on amd64 is 64,
> >  (as it has been for as long as I can remember)
>
> It was changed from 63 in 2010.
>
> > Is it worth while updating the defaults so that OpenBSD partition
> > layout will be optimal for SSD or other Virtualized RAID environments
> > with 1MB  Chunks,
>
> What are you trying to optimize with this?  FFS2 file systems reserve
> 64 kB at the start of a partition, and after that it's filesystem
> blocks, which are 16/32/64 kB, depending on the size of the filesystem.
> I can barely see an argument for aligning large partitions at 128
> sectors, but what purpose would larger multiples serve?
>
> > Is there a down side  to moving the default offset to 2048 ?
>
> Not really.  It wastes a bit of space, but that is rather insignificant
> for today's disk sizes.
>
> --
> Christian "naddy" Weisgerber  na...@mips.inka.de
>


-- 
Kindest regards,
Tom Smyth.



Re: default Offset to 1MB boundaries for improved SSD (and Raid Virtual Disk) partition alignment

2021-04-20 Thread Christian Weisgerber
Tom Smyth:

> just installing todays snapshot and the default offset on amd64 is 64,
>  (as it has been for as long as I can remember)

It was changed from 63 in 2010.

> Is it worth while updating the defaults so that OpenBSD partition
> layout will be optimal for SSD or other Virtualized RAID environments
> with 1MB  Chunks,

What are you trying to optimize with this?  FFS2 file systems reserve
64 kB at the start of a partition, and after that it's filesystem
blocks, which are 16/32/64 kB, depending on the size of the filesystem.
I can barely see an argument for aligning large partitions at 128
sectors, but what purpose would larger multiples serve?

> Is there a down side  to moving the default offset to 2048 ?

Not really.  It wastes a bit of space, but that is rather insignificant
for today's disk sizes.

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de