Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10)

2017-02-22 Thread Wes Vaske (wvaske)
> I used --numjobs=1 because I needed the time series values for bandwidth, 
> latencies and IOPS. The command string was the same, except for varying the IO 
> depth, with numjobs=1.

You might need to increase the number of jobs here. The primary reason for this 
parameter is to improve scaling when you're single-thread CPU bound. With 
numjobs=1, fio will use only a single thread, and there's only so much a single 
CPU core can do.
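For reference, fio can also emit per-interval logs while running more jobs; a 
rough sketch based on the command string below (log option names as in recent 
fio releases; each job writes its own log file, which can be summed afterwards):

  fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap \
      --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 \
      --numjobs=4 --runtime=60 --group_reporting --name=4ktest \
      --write_bw_log=4ktest --write_iops_log=4ktest --write_lat_log=4ktest \
      --log_avg_msec=1000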

> Since the 6 devices were bought from 4 different sellers, it's impossible that 
> they are all defective.

I was a little unclear on the disk cache part. It's a setting, generally in the 
RAID controller / HBA; it can also be changed at the drive level in Linux (hdparm) 
and in Windows (under Device Manager). The reason to disable the disk cache is 
that it's NOT protected against power loss on the MX300. By disabling it you can 
ensure 100% write consistency at the cost of write performance. (Using fully 
power-protected drives lets you keep the disk cache enabled.)
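For example, on Linux the drive-level cache can be checked and toggled with 
hdparm (sdx is a placeholder for your device):

  hdparm -W /dev/sdx     # show whether the drive's volatile write cache is enabled
  hdparm -W 0 /dev/sdx   # disable it (safer on non-power-protected drives, slower writes)
  hdparm -W 1 /dev/sdx   # re-enable it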

> Why 64k and QD=4? I thought of 8k and larger QD. Will test as soon as 
> possible and report here the results :)

It's more representative of what you'll see at the application level. If you've 
got a running system, you can use iostat to see what your average queue depth is: 
run `iostat -x 10` and look at the avgqu-sz column (change the 10-second interval 
to whatever works best for your environment).
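For reference, a 64k 70/30 run along those lines could look like the following, 
reusing the earlier command string (numjobs=1 keeps the effective queue depth 
at 4; device name is a placeholder):

  fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap \
      --randrepeat=0 --ioengine=libaio --bs=64k --rwmixread=70 --iodepth=4 \
      --numjobs=1 --runtime=60 --group_reporting --name=64k-mixed-test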

> Do you have some HBA card to suggest? What do you think of LSI SAS3008? I 
> think it’s the same as the 3108 without RAID On Chip feature. Probably I will 
> buy a Lenovo HBA card with that chip. It seems blazing fast (1mln IOPS) 
> compared to the actual embedded RAID controller (LSI 2008).

I've been able to consistently get the same performance out of any of the 
LSI-based cards. The 3008 and 3108 both work great, regardless of vendor. Just 
test or read up on the different configuration parameters (read ahead, write 
back vs. write through, disk cache).
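On MegaRAID-family cards like the 3108, those policies can typically be toggled 
per virtual drive with storcli; a rough sketch, where the /c0/v0 controller and 
volume indices and the exact policy values depend on your card and firmware:

  storcli /c0/v0 show all            # current cache and read-ahead policy for virtual drive 0
  storcli /c0/v0 set rdcache=nora    # disable read ahead
  storcli /c0/v0 set wrcache=wt      # write through instead of write back
  storcli /c0/v0 set pdcache=off     # disable the drives' own write cache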


Wes Vaske
Senior Storage Solutions Engineer
Micron Technology

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Pietro Pugni
Sent: Tuesday, February 21, 2017 5:44 PM
To: Wes Vaske (wvaske) <wva...@micron.com>
Cc: Merlin Moncure <mmonc...@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm 
RAID10)

Disclaimer: I've done extensive testing (fio and Postgres) with a few different 
RAID controllers and HW RAID vs. mdadm. We (Micron) own the Crucial brand, but I 
don't personally work with the consumer drives.

Verify whether you have your disk write cache enabled or disabled. If it’s 
disabled, that will have a large impact on write performance.

What an honor :)
My SSDs are Crucial MX300 (consumer drives) but, as previously stated, all the 
benchmarks I found on the web report ~90k IOPS, while mine top out at ~40k IOPS. 
Since the 6 devices were bought from 4 different sellers, it's impossible that 
they are all defective.


Is this the *exact* string you used? `fio --filename=/dev/sdx --direct=1 
--rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio 
--bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 
--group_reporting --name=4ktest`

With fio, you need to multiply iodepth by numjobs to get the final queue depth 
it's pushing (in this case, 256). Make sure you're looking at the correct data.

I used --numjobs=1 because I needed the time series values for bandwidth, 
latencies and IOPS. The command string was the same, except for varying the IO 
depth, with numjobs=1.



A few other things:
-  Mdadm will give better performance than HW RAID for specific 
benchmarks.
-  Performance is NOT linear with drive count for synthetic benchmarks.
-  It is often nearly linear for application performance.

mdadm RAID10 scaled linearly while mdadm RAID0 scaled much less.



-  HW RAID can give better performance if your drives do not have a 
capacitor backed cache (like the MX300) AND the controller has a battery backed 
cache. *Consumer drives can often get better performance from HW RAID*. 
(otherwise MDADM has been faster in all of my testing)

My RAID controller doesn’t have a BBU.



-  Mdadm RAID10 has a bug where reads are not properly distributed 
between the mirror pairs. (It uses head position calculated from the last IO to 
determine which drive in a mirror pair should get the next read. This results in 
really weird behavior, with most read IO going to half of your drives instead of 
being evenly split, as should be the case for SSDs.) You can see this by 
running iostat while you've got a load running and you'll see an uneven 
distribution of IOs. FYI, the RAID1 implementation has an exception where it 
does NOT use head position for SSDs. I have yet to test this but you should be 
able to get better performance by manually striping a RAID0 across multiple 
RAID1s instead of using the default RAID10 implementation.
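For reference, a rough sketch of that manual RAID1+0 layout with mdadm (device 
names are placeholders; chunk size and other options are left at defaults):

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3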

Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10)

2017-02-21 Thread Wes Vaske (wvaske)
> I'm curious what the entry point is for Micron models that are capacitor enabled...

The 5100 is the entry SATA drive with full power loss protection.
http://www.anandtech.com/show/10886/micron-announces-5100-series-enterprise-sata-ssds-with-3d-tlc-nand

Fun Fact: 3D TLC can give better endurance than planar MLC. 
http://www.chipworks.com/about-chipworks/overview/blog/intelmicron-detail-their-3d-nand-iedm

My understanding (and I’m not a process or electrical engineer) is that the 3D 
cell size is significantly larger than what was being used for planar 
(Samsung’s 3D is reportedly a ~40nm class device vs our most recent planar 
which is 16nm). This results in many more electrons per cell which provides 
better endurance.


Wes Vaske

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Merlin Moncure
Sent: Tuesday, February 21, 2017 2:02 PM
To: Wes Vaske (wvaske) <wva...@micron.com>
Cc: Pietro Pugni <pietro.pu...@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm 
RAID10)

On Tue, Feb 21, 2017 at 1:40 PM, Wes Vaske (wvaske) 
<wva...@micron.com> wrote:
-  HW RAID can give better performance if your drives do not have a 
capacitor backed cache (like the MX300) AND the controller has a battery backed 
cache. *Consumer drives can often get better performance from HW RAID*. 
(otherwise MDADM has been faster in all of my testing)

I stopped recommending non-capacitor drives a long time ago for databases. A 
capacitor is basically a battery that operates on the drive itself and is not 
subject to chemical failure. Also, drives without capacitors tend not (in my 
direct experience) to be suitable for database use in any scenario where write 
performance matters. There are capacitor-equipped drives that give excellent 
performance for around $0.60/GB. I'm curious what the entry point is for Micron 
models that are capacitor enabled...

MLC solid state drives are essentially raid systems already with very complex 
tradeoffs engineered into the controller itself -- hw raid controllers are 
redundant systems and their price and added latency to filesystem calls is not 
warranted.  I guess in theory a SSD specialized raid controller could cooperate 
with the drives and do things like manage wear leveling across multiple devices 
but AFAIK no such product exists (note: I haven't looked lately).

merlin



Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10)

2017-02-21 Thread Wes Vaske (wvaske)
Disclaimer: I've done extensive testing (fio and Postgres) with a few different 
RAID controllers and HW RAID vs. mdadm. We (Micron) own the Crucial brand, but I 
don't personally work with the consumer drives.

Verify whether you have your disk write cache enabled or disabled. If it’s 
disabled, that will have a large impact on write performance.

Is this the *exact* string you used? `fio --filename=/dev/sdx --direct=1 
--rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio 
--bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 
--group_reporting --name=4ktest`

With fio, you need to multiply iodepth by numjobs to get the final queue depth 
it's pushing (in this case, 256). Make sure you're looking at the correct data.

A few other things:

-  Mdadm will give better performance than HW RAID for specific 
benchmarks.

-  Performance is NOT linear with drive count for synthetic benchmarks.

-  It is often nearly linear for application performance.

-  HW RAID can give better performance if your drives do not have a 
capacitor backed cache (like the MX300) AND the controller has a battery backed 
cache. *Consumer drives can often get better performance from HW RAID*. 
(otherwise MDADM has been faster in all of my testing)

-  Mdadm RAID10 has a bug where reads are not properly distributed 
between the mirror pairs. (It uses head position calculated from the last IO to 
determine which drive in a mirror pair should get the next read. This results in 
really weird behavior, with most read IO going to half of your drives instead of 
being evenly split, as should be the case for SSDs.) You can see this by 
running iostat while you've got a load running and you'll see an uneven 
distribution of IOs. FYI, the RAID1 implementation has an exception where it 
does NOT use head position for SSDs. I have yet to test this but you should be 
able to get better performance by manually striping a RAID0 across multiple 
RAID1s instead of using the default RAID10 implementation.

-  Don’t focus on 4k Random Read. Do something more similar to a PG 
workload (64k 70/30 R/W @ QD=4 is *reasonably* close to what I see for heavy 
OLTP). I’ve tested multiple controllers based on the LSI 3108 and found that 
default settings from one vendor to another provide drastically different 
performance profiles. Vendor A had much better benchmark performance (2x IOPS 
of B) while vendor B gave better application performance (20% better OLTP 
performance in Postgres). (I got equivalent performance from A & B when using 
the same settings).


Wes Vaske
Senior Storage Solutions Engineer
Micron Technology

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Merlin Moncure
Sent: Tuesday, February 21, 2017 9:05 AM
To: Pietro Pugni 
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm 
RAID10)

On Tue, Feb 21, 2017 at 7:49 AM, Pietro Pugni 
> wrote:
Hi there,
I configured an IBM X3650 M4 for development and testing purposes. It consists 
of:
 - 2 x Intel Xeon E5-2690 @ 2.90Ghz (2 x 8 physical Cores + HT)
 - 96GB RAM DDR3 1333MHz (12 x 8GB)
 - 2 x 146GB SAS HDDs @ 15k rpm configured in RAID1 (mdadm)
 - 6 x 525GB SATA SSDs (over-provisioned at 25%, so 393GB available)

I've done a lot of testing focusing on 4k and 8k workloads and found that the 
IOPS of those SSDs are half the expected value. On serverfault.com someone 
suggested that the bottleneck is probably the embedded RAID controller, an IBM 
ServeRAID M5110e, which is based on an LSI 2008 controller.

I’m using the disks in JBOD mode with mdadm software RAID, which is blazing 
fast. The CPU is also very fast, so I don’t mind having a little overhead due 
to software RAID.

My typical workload is Postgres run as a DWH with 1 to 2 billion rows, big 
indexes, partitions and so on, but also intensive statistical computations.


Here’s my post on serverfault.com ( 
http://serverfault.com/questions/833642/slow-ssd-performance-ibm-x3650-m4-7915 )
and here’s a graph of those six SSDs evaluated using fio as stand-alone disks 
(outside of the RAID):

[https://i.stack.imgur.com/ZMhUJ.png]

All those IOPS figures should be double if everything were working correctly. 
The curve trend is correct for increasing IO depths.


Anyway, I would like to buy an HBA controller that leverages those 6 SSDs. Each 
SSD should deliver about 80k to 90k IOPS, so in RAID10 I should get ~240k IOPS 
(6 x 80k / 2) and in RAID0 ~480k IOPS (6 x 80k). I've seen that mdadm 
effectively scales performance, but the controller limits the overall IOPS to 
~120k (exactly half of the expected IOPS).

What HBA controller able to handle 500k IOPS would you suggest?


My server is able to handle 8 more SSDs, for a total of 14 SSDs and 1260k 
theoretical IOPS. 

Re: [PERFORM] Capacitors, etc., in hard drives and SSD for DBMS machines...

2016-07-08 Thread Wes Vaske (wvaske)
> Why all this concern about how long a disk (or SSD) drive can stay up
> after a power failure?

When we're discussing SSD power loss protection, it's not a question of how 
long the drive can stay up but whether data at rest or data in flight are going 
to be lost/corrupted in the event of a power loss.

There are a couple big reasons for this.

1. NAND write latency is actually somewhat poor.

SSDs are comprised of NAND chips, DRAM for cache, and the controller. If the 
SSD disabled its disk cache, the write latencies under moderate load would move 
from the sub 100 microseconds range to the 1-10 milliseconds range. This is due 
to how the SSD writes to NAND. A single write operation takes a fairly large 
amount of time but large blocks cans be written as a single operation. 


2. Garbage Collection

If you're not familiar with GC, I definitely recommend reading up on it, as it's 
one of the defining characteristics of SSDs (and now SMR HDDs). The basic 
principle is that SSDs don't support in-place modification of a page (8KB). 
Instead, the contents need to be erased and then written. Additionally, the 
slice of the chip that can be read, written, or erased is not the same size for 
each operation: erase blocks are much bigger than pages (e.g. 2MB vs 8KB). This 
means that to modify an 8KB page, the entire 2MB erase block needs to be read 
into the disk cache, erased, then written with the new 8KB page along with the 
rest of the existing data in the 2MB erase block.

This operation needs to be power loss protected (it's the operation that the 
Crucial drives protect against). If it's not, the data that was read into cache 
could be lost or corrupted if power is lost during the operation. The data in 
the erase block is not necessarily related to the page being modified and could 
be anywhere else in the filesystem. *IMPORTANT: This is data at rest that may 
have been written years prior. It is not just new data that may be lost if a GC 
operation cannot complete.*


TL;DR: Many SSDs will not disable disk cache even if you give the command to do 
so. Full Power Loss Protection at the drive level should be a requirement for 
any Enterprise or Data Center application to ensure no data loss or corruption 
of data at rest.


This is why there is so much concern with the internals to specific SSDs 
regarding behavior in a power loss event. It can have large impacts on the 
reliability of the entire system.


Wes Vaske | Senior Storage Solutions Engineer
Micron Technology


From: pgsql-performance-ow...@postgresql.org 
 on behalf of Levente Birta 

Sent: Friday, July 8, 2016 5:36 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Capacitors, etc., in hard drives and SSD for DBMS 
machines...

On 08/07/2016 13:23, Jean-David Beyer wrote:
> Why all this concern about how long a disk (or SSD) drive can stay up
> after a power failure?
>
> It seems to me that anyone interested in maintaining an important
> database would have suitable backup power on their entire systems,
> including the disk drives, so they could coast over any power loss.
>
> I do not have any database that important, but my machine has an APC
> Smart-UPS that has 2 1/2 hours of backup time with relatively new
> batteries in it. It is so oversize because my previous computer used
> much more power than this one does. And if my power company has a brown
> out or black out of over 7 seconds, my natural gas fueled backup
> generator picks up the load very quickly.
>
> Am I overlooking something?
>

UPS-es can fail too ... :)

And so many things can happen ... once I pulled the power cord out of the UPS 
that powered the database server (which was a production server) ... I thought I 
was unplugging something else :)
but luckily ... the controller was flash backed



--
Levi


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Tuning guidelines for server with 256GB of RAM and SSDs?

2016-07-07 Thread Wes Vaske (wvaske)
The Crucial drive does not have power loss protection. The Samsung drive does.


(The Crucial M550 has capacitors to protect data that's already been written to 
the device but not the entire cache. For instance, if data is read from the 
device during a garbage collection operation, the M550 will protect that data 
instead of introducing corruption of old data. This is listed as "power loss 
protection" on the spec sheet but it's not the level of protection that people 
on this list would expect from a drive)



From: pgsql-performance-ow...@postgresql.org 
 on behalf of Kaixi Luo 

Sent: Thursday, July 7, 2016 2:49 AM
To: Mark Kirkwood
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Tuning guidelines for server with 256GB of RAM and SSDs?

It's a Crucial CT250MX200SSD1 and a Samsung MZ7LM480HCHP-3.

Regards,

Kaixi


On Thu, Jul 7, 2016 at 6:59 AM, Mark Kirkwood 
> wrote:
On 06/07/16 07:17, Mkrtchyan, Tigran wrote:
Hi,

We had a similar situation and the best performance was with 64MB
background_bytes and 512 MB dirty_bytes.

Tigran.

On Jul 5, 2016 16:51, Kaixi Luo > 
wrote:


 Here are my server specs:

 RAID1 - 2x480GB Samsung SSD with power loss protection (will be used to
 store the PostgreSQL database)
 RAID1 - 2x240GB Crucial SSD with power loss protection. (will be used to
 store PostgreSQL transactions logs)


Can you tell the exact model numbers for the Samsung and Crucial SSD's? It 
typically matters! E.g I have some Crucial M550 that have capacitors and 
(originally) claimed to be power off safe, but with testing have been shown to 
be not really power off safe at all. I'd be dubious about Samsungs too.

The Intel Datacenter range (S3700 and similar) are known to have power off 
safety that does work.

regards

Mark



--
Sent via pgsql-performance mailing list 
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Tuning guidelines for server with 256GB of RAM and SSDs?

2016-07-06 Thread Wes Vaske (wvaske)
Regarding the Nordeus blog Merlin linked.

They say:
"This doesn't mean the data was really written to disk, it can still remain in 
the disk cache, but enterprise drives usually make sure the data was really 
written to disk on fsync calls."

This isn't actually true for enterprise drives (when I say enterprise in the 
context of an SSD, I'm assuming full power loss protection via capacitors on 
the drive like the Intel DC S3x00 series). Most enterprise SSDs will ignore 
calls to disable disk cache or to flush the disk cache as doing so is entirely 
unnecessary.
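One way to see what a drive reports for its volatile write cache is sdparm (a 
sketch; as noted above, whether the drive actually honors a change is another 
matter):

  sdparm --get=WCE /dev/sdx    # WCE 1 means the write cache is reported enabled
  sdparm --clear=WCE /dev/sdx  # request that it be disabled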


Regarding write back cache:
Disabling the write back cache won't have a real large impact on the endurance 
of the drive unless it reduces the total number of bytes written (which it 
won't). I've seen drives that perform better with it disabled and drives that 
perform better with it enabled. I would test in your environment and make the 
decision based on performance. 


Regarding the Crucial drive for logs:
As far as I'm aware, none of the Crucial drives have power loss protection. To 
use these drives you would want to disable disk cache which would drop your 
performance a fair bit.


Write amplification:
I wouldn't expect write amplification to be a serious issue unless you hit 
every LBA on the device early in its life and never execute TRIM. This is one 
of the reasons software RAID can be a better solution for something like this. 
MDADM supports TRIM in RAID devices.  So unless you run the drives above 90% 
full, the write amplification would be minimal so long as you have a daily 
fstrim cron job.
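A minimal sketch of such a job (paths and schedule are only examples):

  fstrim -v /var/lib/pgsql     # one-off trim of the filesystem holding the data directory
  # /etc/cron.d/fstrim -- trim all mounted filesystems that support discard, daily at 03:00
  0 3 * * * root /usr/sbin/fstrim -a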

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology


From: pgsql-performance-ow...@postgresql.org 
 on behalf of Merlin Moncure 

Sent: Wednesday, July 6, 2016 1:13 PM
To: Kaixi Luo
Cc: postgres performance list
Subject: Re: [PERFORM] Tuning guidelines for server with 256GB of RAM and SSDs?

On Tue, Jul 5, 2016 at 9:50 AM, Kaixi Luo  wrote:
> Hello,
>
> I've been reading Mr. Greg Smith's "Postgres 9.0 - High Performance" book
> and I have some questions regarding the guidelines I found in the book,
> because I suspect some of them can't be followed blindly to the letter on a
> server with lots of RAM and SSDs.
>
> Here are my server specs:
>
> Intel Xeon E5-1650 v3 Hexa-Core Haswell
> 256GB DDR4 ECC RAM
> Battery backed hardware RAID with 512MB of WriteBack cache (LSI MegaRAID SAS
> 9260-4i)
> RAID1 - 2x480GB Samsung SSD with power loss protection (will be used to
> store the PostgreSQL database)
> RAID1 - 2x240GB Crucial SSD with power loss protection. (will be used to
> store PostgreSQL transactions logs)
>
> First of all, the book suggests that I should enable the WriteBack cache of
> the HWRAID and disable the disk cache to increase performance and ensure
> data safety. Is it still advisable to do this on SSDs, specifically the step
> of disabling the disk cache? Wouldn't that increase the wear rate of the
> SSD?

At the time that book was written, the majority of SSDs were known not
to be completely honest and/or reliable about data integrity in the
face of a power event.  Now it's a hit or miss situation (for example,
see here: http://blog.nordeus.com/dev-ops/power-failure-testing-with-ssds.htm).
The intel drives S3500/S3700 and their descendants are the standard
against which other drives should be judged IMO. The S3500 family in
particular offers tremendous value for database usage.  Do your
research; the warning is still relevant but the blanket statement no
longer applies.  Spinning drives are completely obsolete for database
applications in my experience.

Disabling the write back cache for write-heavy database loads will destroy it 
in short order due to write amplification and will generally cause it to 
underperform hard drives in my experience.

With good SSDs and a good motherboard, I do not recommend a caching
raid controller; software raid is a better choice for many reasons.

One parameter that needs to be analyzed with SSD is
effective_io_concurrency.  see
https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com
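As a rough sketch of experimenting with it (200 is only an example starting 
value for SSDs; it can also be set per tablespace or per session):

  psql -c "ALTER SYSTEM SET effective_io_concurrency = 200;"
  psql -c "SELECT pg_reload_conf();"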

merlin


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Filesystem and Disk Partitioning for New Server Setup

2016-02-24 Thread Wes Vaske (wvaske)
FYI, if your volume for PG data is the last partition, you can always add 
drives to the Dell PERC RAID group, extend the volume, then extend the 
partition and extend the filesystem.

All of this can also be done live.
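A sketch of those last two steps, assuming for illustration an XFS filesystem 
on /dev/sdb1 mounted at /var/lib/pgsql (growpart is from cloud-utils; parted's 
resizepart works as well):

  growpart /dev/sdb 1           # extend the last partition into the new space
  xfs_growfs /var/lib/pgsql     # grow the filesystem to fill the partition, online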

Wes Vaske


From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Rick Otten
Sent: Wednesday, February 24, 2016 9:06 AM
To: Dave Stibrany
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Filesystem and Disk Partitioning for New Server Setup

An LVM gives you more options.

Without an LVM you would add a disk to the system, create a tablespace, and 
then move some of your tables over to the new disk.  Or, you'd take a full 
backup, rebuild your file system, and then restore from backup onto the newer, 
larger disk configuration.  Or you'd make softlinks to pg_log or pg_xlog or 
something to stick the extra disk in your system somehow.

You can do that with an LVM too.  However, with an LVM you can add the disk to 
the system, extend the file system, and just keep running.  Live.  No need to 
figure out which tables or files should go where.

Sometimes it is really nice to have that option.
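A sketch of that flow with a new disk (volume group and logical volume names 
here are placeholders):

  pvcreate /dev/sdc                                 # initialize the new disk
  vgextend vg_data /dev/sdc                         # add it to the existing volume group
  lvextend -r -l +100%FREE /dev/vg_data/lv_pgdata   # grow the LV and (-r) its filesystem, live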




On Wed, Feb 24, 2016 at 9:25 AM, Dave Stibrany 
> wrote:
Thanks for the advice, Rick.

I have an 8 disk chassis, so possible extension paths down the line are adding 
raid1 for WALs, adding another RAID10, or creating a 8 disk RAID10. Would LVM 
make this type of addition easier?


On Wed, Feb 24, 2016 at 6:08 AM, Rick Otten 
> wrote:

1) I'd go with xfs.  zfs might be a good alternative, but the last time I tried 
it, it was really unstable (on Linux).  I may have gotten a lot better, but xfs 
is a safe bet and well understood.

2) An LVM is just an extra couple of commands.  These days that is not a lot of 
complexity given what you gain. The main advantage is that you can extend or 
grow the file system on the fly.  Over the life of the database it is quite 
possible you'll find yourself pressed for disk space - either to drop in more 
csv files to load with the 'copy' command, to store more logs (because you need 
to turn up logging verbosity, etc...), you need more transaction logs live on 
the system, you need to take a quick database dump, or simply you collect more 
data than you expected.  It is not always convenient to change the log 
location, or move tablespaces around to make room.  In the cloud you might 
provision more volumes and attach them to the server.  On a SAN you might 
attach more disk, and with a stand alone server, you might stick more disks on 
the server.  In all those scenarios, being able to simply merge them into your 
existing volume can be really handy.

3) The main advantage of partitioning a single volume (these days) is simply 
that if one partition fills up, it doesn't impact the rest of the system.  
Putting things that are likely to fill up the disk on their own partition is 
generally a good practice.   User home directories is one example.  System 
logs.  That sort of thing.  Isolating them on their own partition will improve 
the long term reliability of your database.   The main disadvantage is those 
things get boxed into a much smaller amount of space than they would normally 
have if they could share a partition with the whole system.


On Tue, Feb 23, 2016 at 11:28 PM, dstibrany 
> wrote:
I'm about to install a new production server and wanted some advice regarding
filesystems and disk partitioning.

The server is:
- Dell PowerEdge R430
- 1 x Intel Xeon E5-2620 2.4GHz
- 32 GB RAM
- 4 x 600GB 10k SAS
- PERC H730P Raid Controller with 2GB cache

The drives will be set up in one RAID-10 volume and I'll be installing
Ubuntu 14.04 LTS as the OS. The server will be dedicated to running
PostgreSQL.

I'm trying to decide:

1) Which filesystem to use (most people seem to suggest xfs).
2) Whether to use LVM (I'm leaning against it because it seems like it adds
additional complexity).
3) How to partition the volume. Should I just create one partition on / and
create a 16-32GB swap partition? Any reason to get fancy with additional
partitions given it's all on one volume?

I'd like to keep things simple to start, but not shoot myself in the foot at
the same time.

Thanks!

Dave



--
View this message in context: 
http://postgresql.nabble.com/Filesystem-and-Disk-Partitioning-for-New-Server-Setup-tp5889074.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


--
Sent via pgsql-performance mailing list 
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance








Re: [PERFORM] New server: SSD/RAID recommendations?

2015-07-07 Thread Wes Vaske (wvaske)
Regarding:
“lie about their fsync status.”

This is mostly semantics but it might help google searches on the issue.

A drive doesn't support fsync(); that's a filesystem/kernel operation. A drive 
will do a FLUSH CACHE. In early 2.6 kernels, the fsync() call wouldn't send 
any ATA or SCSI command to flush the disk cache, whereas, AFAICT, modern 
kernels and filesystem versions *will* do this. When 'sync' is called, the 
filesystem will issue the appropriate command to the disk to flush the write 
cache.

For ATA, this is “FLUSH CACHE” (E7h). To check support for the command use:
[root@postgres ~]# smartctl --identify /dev/sdu | grep "FLUSH CACHE"
  83 13  1   FLUSH CACHE EXT supported
  83 12  1   FLUSH CACHE supported
  86 13  1   FLUSH CACHE EXT supported
  86 12  1   FLUSH CACHE supported

The 1s in the 3rd column represent SUPPORTED for the feature listed in the last 
column.

Cheers,
Wes Vaske

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Michael Nolan
Sent: Tuesday, July 07, 2015 12:28 PM
To: hlinn...@iki.fi
Cc: Wes Vaske (wvaske); Graeme B. Bell; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?



On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas 
<hlinn...@iki.fi> wrote:
On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote:
The M500/M550/M600 are consumer class drives that don't have power
protection for all inflight data.* (like the Samsung 8x0 series and
the Intel 3x0 & 5x0 series).

The M500DC has full power protection for inflight data and is an
enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700
series).

So any drive without the capacitors to protect inflight data will
suffer from data loss if you're using disk write cache and you pull
the power.

Wow, I would be pretty angry if I installed a SSD in my desktop, and it loses a 
file that I saved just before pulling the power plug.

That can (and does) happen with spinning disks, too.


*Big addendum: There are two issues on powerloss that will mess with
Postgres. Data Loss and Data Corruption. The micron consumer drives
will have power loss protection against Data Corruption and the
enterprise drive will have power loss protection against BOTH.

https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf

 The Data Corruption problem is only an issue in non-SLC NAND but
it's industry wide. And even though some drives will protect against
that, the protection of inflight data that's been fsync'd is more
important and should disqualify *any* consumer drives from *any*
company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless enforce 
the correct ordering of persisting fsync'd data? If you write to file A and 
fsync it, then write to another file B and fsync it too, is it guaranteed that 
if B is persisted, A is as well? Because if it isn't, you can end up with 
filesystem (or database) corruption anyway.

- Heikki


The sad fact is that MANY drives (ssd as well as spinning) lie about their 
fsync status.
--
Mike Nolan



Re: [PERFORM] New server: SSD/RAID recommendations?

2015-07-07 Thread Wes Vaske (wvaske)
The M500/M550/M600 are consumer class drives that don't have power protection 
for all inflight data.* (like the Samsung 8x0 series and the Intel 3x0 & 5x0 
series).

The M500DC has full power protection for inflight data and is an 
enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700 series).

So any drive without the capacitors to protect inflight data will suffer from 
data loss if you're using disk write cache and you pull the power.

*Big addendum:
There are two issues on powerloss that will mess with Postgres. Data Loss and 
Data Corruption. The micron consumer drives will have power loss protection 
against Data Corruption and the enterprise drive will have power loss 
protection against BOTH.

https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf
 

The Data Corruption problem is only an issue in non-SLC NAND but it's industry 
wide. And even though some drives will protect against that, the protection of 
inflight data that's been fsync'd is more important and should disqualify *any* 
consumer drives from *any* company from consideration for use with Postgres.

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology 

-Original Message-
From: Graeme B. Bell [mailto:graeme.b...@nibio.no] 
Sent: Tuesday, July 07, 2015 8:26 AM
To: Merlin Moncure
Cc: Wes Vaske (wvaske); Craig James; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?


As I have warned elsewhere,

The M500/M550 from $SOME_COMPANY is NOT SUITABLE for postgres unless you have a 
RAID controller with BBU to protect yourself.
The M500/M550 are NOT plug-pull safe despite the 'power loss protection' 
claimed on the packaging. Not all fsync'd data is preserved in the event of a 
power loss, which completely undermines postgres's sanity. 

I would be extremely skeptical about the M500DC given the name and 
manufacturer. 

I went to quite a lot of trouble to provide $SOME_COMPANYs engineers with the 
full details of this fault after extensive testing (we have e.g. 20-25 of these 
disks) on multiple machines and controllers, at their request. Result: they 
stopped replying to me, and soon after I saw their PR reps talking about how 
'power loss protection isn't about protecting all data during a power loss'. 

The only safe way to use an M500/M550 with postgres is:

a) disable the disk cache, which will cripple performance to about 3-5% of 
normal.
b) use a battery backed or cap-backed RAID controller, which will generally 
hurt performance, by limiting you to the peak performance of the flash on the 
raid controller. 

If you are buying such a drive, I strongly recommend buying only one and doing 
extensive plug-pull testing before committing to several. 
For myself, my time is valuable enough that it will be cheaper to buy Intel in 
future. 

Graeme.

On 07 Jul 2015, at 15:12, Merlin Moncure mmonc...@gmail.com wrote:

 On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) wva...@micron.com wrote:
 Storage Review has a pretty good process and reviewed the M500DC when it 
 released last year. 
 http://www.storagereview.com/micron_m500dc_enterprise_ssd_review
 
  
 
 The only database-specific info we have available are for Cassandra and MSSQL:
 
 http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf
 
 http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf
 
  
 
 (some of that info might be relevant)
 
  
 
 In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) 
 for 5-years. For comparison:
 
 Micron M500DC (20nm) - 2 DWPD
 
 Intel S3500 (20nm) - 0.3 DWPD
 
 Intel S3510 (16nm) - 0.3 DWPD
 
 Intel S3710 (20nm) - 10 DWPD
 
  
 
 They're all great drives, the question is how write-intensive is the workload.
 
 
 
 
 Intel added a new product, the 3610, that is rated for 3 DWPD.  Pricing looks 
 to be around 1.20$/GB.
 
 merlin 



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] New server: SSD/RAID recommendations?

2015-07-02 Thread Wes Vaske (wvaske)
What about a RAID controller? Are RAID controllers even available for 
PCI-Express SSD drives, or do we have to stick with SATA if we need a 
battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

Quite a few of the benefits of using a hardware RAID controller are irrelevant 
when using modern SSDs. The great random write performance of the drives means 
the cache on the controller is less useful and the drives you’re considering 
(Intel’s enterprise grade) will have full power protection for inflight data.

In my own testing (CentOS 7/Postgres 9.4/128GB RAM/ 8x SSDs RAID5/10/0 with 
mdadm vs hw controllers) I’ve found that the RAID controller is actually 
limiting performance compared to just using software RAID. In worst-case 
workloads I’m able to saturate the controller with 2 SATA drives.

Another advantage in using mdadm is that it’ll properly pass TRIM to the drive. 
You’ll need to test whether “discard” in your fstab will have a negative impact 
on performance but being able to run “fstrim” occasionally will definitely help 
performance in the long run.

If you want another drive to consider you should look at the Micron M500DC. 
Full power protection for inflight data, same NAND as Intel uses in their 
drives, good mixed workload performance. (I’m obviously a little biased, though 
;-)

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Andreas Joseph 
Krogh
Sent: Wednesday, July 01, 2015 6:56 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

On Thursday, 2 July 2015 at 01:06:57, Craig James 
<cja...@emolecules.com> wrote:
We're buying a new server in the near future to replace an aging system. I'd 
appreciate advice on the best SSD devices and RAID controller cards available 
today.

The database is about 750 GB. This is a warehouse server. We load supplier 
catalogs throughout a typical work week, then on the weekend (after Q/A), 
integrate the new supplier catalogs into our customer-visible store, which is 
then copied to a production server where customers see it. So the load is 
mostly data loading, and essentially no OLTP. Typically there are fewer than a 
dozen connections to Postgres.

Linux 2.6.32
Postgres 9.3
Hardware:
  2 x INTEL WESTMERE 4C XEON 2.40GHZ
  12GB DDR3 ECC 1333MHz
  3WARE 9650SE-12ML with BBU
  12 x 1TB Hitachi 7200RPM SATA disks
RAID 1 (2 disks)
   Linux partition
   Swap partition
   pg_xlog partition
RAID 10 (8 disks)
   Postgres database partition

We get 5000-7000 TPS from pgbench on this system.

The new system will have at least as many CPUs, and probably a lot more memory 
(196 GB). The database hasn't reached 1TB yet, but we'd like room to grow, so 
we'd like a 2TB file system for Postgres. We'll start with the latest versions 
of Linux and Postgres.

Intel's products have always received good reports in this forum. Is that still 
the best recommendation? Or are there good alternatives that are price 
competitive?

What about a RAID controller? Are RAID controllers even available for 
PCI-Express SSD drives, or do we have to stick with SATA if we need a 
battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

Are spinning disks still a good choice for the pg_xlog partition and OS? Is 
there any reason to get spinning disks at all, or is it better/simpler to just 
put everything on SSD drives?

Thanks in advance for your advice!


Depends on you SSD-drives, but today's enterprise-grade SSD disks can handle 
pg_xlog just fine. So I'd go full SSD, unless you have many BLOBs in 
pg_largeobject, then move that to a separate tablespace with 
archive-grade-disks (spinning disks).

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com



Re: [PERFORM] New server: SSD/RAID recommendations?

2015-07-02 Thread Wes Vaske (wvaske)
Storage Review has a pretty good process and reviewed the M500DC when it 
released last year. 
http://www.storagereview.com/micron_m500dc_enterprise_ssd_review

The only database-specific info we have available are for Cassandra and MSSQL:
http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf
http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf

(some of that info might be relevant)

In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 
5-years. For comparison:
Micron M500DC (20nm) – 2 DWPD
Intel S3500 (20nm) – 0.3 DWPD
Intel S3510 (16nm) – 0.3 DWPD
Intel S3710 (20nm) – 10 DWPD

They’re all great drives, the question is how write-intensive is the workload.

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728
Mobile: 515-451-7742

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Craig James
Sent: Thursday, July 02, 2015 12:20 PM
To: Wes Vaske (wvaske)
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

On Thu, Jul 2, 2015 at 7:01 AM, Wes Vaske (wvaske) 
<wva...@micron.com> wrote:
What about a RAID controller? Are RAID controllers even available for 
PCI-Express SSD drives, or do we have to stick with SATA if we need a 
battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

Quite a few of the benefits of using a hardware RAID controller are irrelevant 
when using modern SSDs. The great random write performance of the drives means 
the cache on the controller is less useful and the drives you’re considering 
(Intel’s enterprise grade) will have full power protection for inflight data.

In my own testing (CentOS 7/Postgres 9.4/128GB RAM/ 8x SSDs RAID5/10/0 with 
mdadm vs hw controllers) I’ve found that the RAID controller is actually 
limiting performance compared to just using software RAID. In worst-case 
workloads I’m able to saturate the controller with 2 SATA drives.

Another advantage in using mdadm is that it’ll properly pass TRIM to the drive. 
You’ll need to test whether “discard” in your fstab will have a negative impact 
on performance but being able to run “fstrim” occasionally will definitely help 
performance in the long run.

If you want another drive to consider you should look at the Micron M500DC. 
Full power protection for inflight data, same NAND as Intel uses in their 
drives, good mixed workload performance. (I’m obviously a little biased, though 
;-)

Thanks Wes. That's good advice. I've always liked mdadm and how well RAID is 
supported by Linux, and mostly used a controller for the cache and BBU.

I'll definitely check out your product. Can you point me to any benchmarks, 
both on performance and lifetime?

Craig


Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Andreas Joseph Krogh
Sent: Wednesday, July 01, 2015 6:56 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

On Thursday, 2 July 2015 at 01:06:57, Craig James 
<cja...@emolecules.com> wrote:
We're buying a new server in the near future to replace an aging system. I'd 
appreciate advice on the best SSD devices and RAID controller cards available 
today.

The database is about 750 GB. This is a warehouse server. We load supplier 
catalogs throughout a typical work week, then on the weekend (after Q/A), 
integrate the new supplier catalogs into our customer-visible store, which is 
then copied to a production server where customers see it. So the load is 
mostly data loading, and essentially no OLTP. Typically there are fewer than a 
dozen connections to Postgres.

Linux 2.6.32
Postgres 9.3
Hardware:
  2 x INTEL WESTMERE 4C XEON 2.40GHZ
  12GB DDR3 ECC 1333MHz
  3WARE 9650SE-12ML with BBU
  12 x 1TB Hitachi 7200RPM SATA disks
RAID 1 (2 disks)
   Linux partition
   Swap partition
   pg_xlog partition
RAID 10 (8 disks)
   Postgres database partition

We get 5000-7000 TPS from pgbench on this system.

The new system will have at least as many CPUs, and probably a lot more memory 
(196 GB). The database hasn't reached 1TB yet, but we'd like room to grow, so 
we'd like a 2TB file system for Postgres. We'll start with the latest versions 
of Linux and Postgres.

Intel's products have always received good reports in this forum. Is that still 
the best recommendation? Or are there good alternatives that are price 
competitive?

What about a RAID controller

[PERFORM] Fastest Backup Restore for perf testing

2015-05-27 Thread Wes Vaske (wvaske)
Hi,

I'm running performance tests against a PostgreSQL database (9.4) with various 
hardware configurations and a couple of different benchmarks (TPC-C & TPC-H).

I'm currently using pg_dump and pg_restore to refresh my dataset between runs 
but this process seems slower than it could be.

Is it possible to do a tar/untar of the entire /var/lib/pgsql tree as a backup 
& restore method?

If not, is there another way to restore a dataset more quickly? The database is 
dedicated to the test dataset, so trashing & rebuilding the entire 
application/OS/anything is no issue for me; there's no data for me to lose.
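(For what it's worth, a cold file-level copy of the data directory is a valid 
restore point as long as the server is cleanly shut down first; a minimal 
sketch assuming a default RHEL-style layout, with the paths and service name 
as placeholders:

  systemctl stop postgresql-9.4
  tar -C /var/lib/pgsql -czf /backups/pgdata-baseline.tgz 9.4/data
  # ... run tests, then restore the baseline:
  systemctl stop postgresql-9.4
  rm -rf /var/lib/pgsql/9.4/data
  tar -C /var/lib/pgsql -xzf /backups/pgdata-baseline.tgz
  systemctl start postgresql-9.4
)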

Thanks!

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728



Re: [PERFORM] Some performance testing?

2015-04-09 Thread Wes Vaske (wvaske)
Hey Mike,

What those graphs are showing is that the new kernel reduces the IO required 
for the same DB load. At least, that’s how we’re supposed to interpret it.

I’d be curious to see a measure of the database load for both of those so we 
can verify that the new kernel does in fact provide better performance.

-Wes

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Michael Nolan
Sent: Wednesday, April 08, 2015 5:09 PM
To: Josh Berkus
Cc: Mel Llaguno; Przemysław Deć; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Some performance testing?



On Wed, Apr 8, 2015 at 3:05 PM, Josh Berkus 
<j...@agliodbs.com> wrote:
On 04/07/2015 11:07 AM, Mel Llaguno wrote:
 Care to elaborate? We usually do not recommend specific kernel versions
 for our customers (who run on a variety of distributions). Thanks, M.

You should.

http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html

Performance is literally 2X to 5X different between kernels.


Josh, there seems to be an inconsistency in your blog.  You say 3.10.X is safe, 
but the graph you show with the poor performance seems to be from 3.13.X which 
as I understand it is a later kernel.  Can you clarify which 3.X kernels are 
good to use and which are not?
--
Mike Nolan