[PERFORM] Partitions and joins lead to index lookups on all partitions
Hi there,

Currently we are running into serious performance problems with our partitioning setup, because index lookups are mostly done on all partitions, instead of the one partition where the planner should know it can find the needed row. A simple example, where we have a partitioned table named part_table:

select * from part_table where id = 12123231

will do an index lookup only in the partition where it knows the id can be found. However:

select * from part_table where id = (select 12123231)

will do an index lookup in ALL partitions, meaning it is significantly slower, all the more so since the database does not fit into memory. So okay, we could just not use parameterized queries... Well, not so fast. Consider a second table referencing the first:

ref_table:
  group_id bigint
  part_table_id bigint

Now when I join the two:

select part_table.* from part_table join ref_table on (ref_table.part_table_id = part_table.id and group_id = 12321)

it will also do index lookups on ALL partitions. How do we handle this? The queries above are simplified versions of what is actually going on, but the idea is clear. I tried this on 9.1 (we are currently using 9.0), but that does not matter. So what is the practical use of partitioning if you can't even use it effectively for simple joins? constraint_exclusion is enabled correctly, and as far as I can see this behaviour is by the book. Has there been any progress in, say, 9.2 to make this better? If not, how should we handle this? We could also choose not to partition, but how will that perform on a 100 GB table?

Kind regards,

Christiaan Willemsen
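For reference, a minimal sketch of the kind of setup being described (table, partition, and index names are hypothetical), using the 8.x/9.x inheritance-based partitioning scheme. Constraint exclusion can only prune when the planner sees a literal value at plan time:

```sql
-- Hypothetical range-partitioned table via inheritance + CHECK constraints
CREATE TABLE part_table (id bigint NOT NULL, payload text);

CREATE TABLE part_table_p0 (CHECK (id >= 0        AND id < 10000000))
    INHERITS (part_table);
CREATE TABLE part_table_p1 (CHECK (id >= 10000000 AND id < 20000000))
    INHERITS (part_table);

CREATE INDEX part_table_p0_id_idx ON part_table_p0 (id);  -- one index per child
CREATE INDEX part_table_p1_id_idx ON part_table_p1 (id);

-- Literal comparison: the planner checks 12123231 against each child's
-- CHECK constraint at plan time and scans only part_table_p1.
EXPLAIN SELECT * FROM part_table WHERE id = 12123231;

-- Subquery (or join) comparison: the value is unknown at plan time,
-- so the plan must cover every child -- index scans on ALL partitions.
EXPLAIN SELECT * FROM part_table WHERE id = (SELECT 12123231);
```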
[PERFORM] SSD endurance calculations
Hey there,

We are looking at beefing up our servers with SSDs. Some of you did some interesting tests with the Intel 320, so the idea came up to build a RAID 10 out of four 600 GB models. I did, however, do some calculations against the current database server (220 GB database, expected to grow to 1 TB by the end of next year). I looked specifically at the read/write figures in /proc/diskstats. From those I could see a read/write ratio of 3:1, and I also saw a whopping 170 GB of writes per day (for a database that currently grows by 1 GB per day). That seems like an insanely high figure to me! How come? We do mostly inserts, hardly any updates, and virtually no deletes.

Secondly, I also looked at the reliability figures of the Intel 320. They are rated for 5 years at 20 GB of writes per day, meaning they would hold up for about 200 days in our system. RAID 10 will make 400 days of that, but this hardly seems like a lot... Am I missing something here?

Kind regards,

Christiaan
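To spell out the arithmetic (a back-of-the-envelope sketch; it assumes the 170 GB/day is striped evenly over the two mirror pairs of a 4-disk RAID 10, and it ignores write amplification inside the SSD, which can only make things worse):

```sql
SELECT 20 * 365 * 5                  AS rated_write_gb,     -- 36500 GB total rated endurance per drive
       round((20 * 365 * 5) / 170.0) AS days_single_drive,  -- ~215 days at 170 GB/day
       round((20 * 365 * 5) / 85.0)  AS days_raid10;        -- ~429 days: each mirror pair sees half
```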
Re: [PERFORM] Using high speed swap to improve performance?
Hi Scott,

That sounds like a useful thing to do, but the big advantage of the SAN is that if the physical machine goes down, I can quickly start up a virtual machine using the same database files to act as a fallback. It will have less memory and fewer CPUs, but it will do fine for some time. So if I put fast tables on local storage, I lose those tables when the machine goes down. Putting indexes there, however, might be interesting... What will PostgreSQL do when it is started on the backup machine and finds that the index files are missing? Will it recreate those files, will it panic and not start at all, or can we just manually reindex?

Kind regards,

Christiaan

-Original message-
From: Scott Marlowe
Sent: Sun 04-04-2010 23:08
To: Christiaan Willemsen; CC: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Using high speed swap to improve performance?

On Fri, Apr 2, 2010 at 1:15 PM, Christiaan Willemsen wrote:
> Hi there,
>
> About a year ago we set up a machine with sixteen 15k disk spindles on Solaris using ZFS. Now that Oracle has taken over Sun and is closing up Solaris, we want to move away (we are more familiar with Linux anyway).
>
> So the plan is to move to Linux and put the data on a SAN using iSCSI (two or four network interfaces). This however leaves us with 16 very nice disks doing nothing, which seems like a waste. If we were to stay on Solaris, ZFS would have a solution: use them as L2ARC. But there is no Linux filesystem with those features (ZFS on FUSE is not really an option).
>
> So I was thinking: why not make a big fat array using 14 disks (RAID 1, 10 or 5), and make this a big, fast swap disk? Latency will be lower than the SAN can provide, throughput will also be better, and it will relieve the SAN of a lot of read IOPS.
>
> So I could create a 1 TB swap disk and give it to the OS next to the 64 GB of memory. Then I can set Postgres to use more than the RAM size so it will start swapping.
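For what it's worth, as far as I know Postgres will not rebuild missing index files on its own: the server will start, but queries that touch a missing index fail with "could not open file" errors, so a manual rebuild would be needed on the fallback machine. A sketch (table and database names hypothetical):

```sql
-- Rebuild every index of one affected table:
REINDEX TABLE part_table;

-- ...or rebuild everything; run while connected to the database in question:
REINDEX DATABASE mydb;
```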
> It would appear to Postgres that the complete database fits into memory. The question is: will this do any good? And if so, what will happen?

I'd make a couple of RAID-10s out of it and use them for heavily used tables and/or indexes etc...
[PERFORM] Using high speed swap to improve performance?
Hi there,

About a year ago we set up a machine with sixteen 15k disk spindles on Solaris using ZFS. Now that Oracle has taken over Sun and is closing up Solaris, we want to move away (we are more familiar with Linux anyway).

So the plan is to move to Linux and put the data on a SAN using iSCSI (two or four network interfaces). This however leaves us with 16 very nice disks doing nothing, which seems like a waste. If we were to stay on Solaris, ZFS would have a solution: use them as L2ARC. But there is no Linux filesystem with those features (ZFS on FUSE is not really an option).

So I was thinking: why not make a big fat array using 14 disks (RAID 1, 10 or 5), and make this a big, fast swap disk? Latency will be lower than the SAN can provide, throughput will also be better, and it will relieve the SAN of a lot of read IOPS.

So I could create a 1 TB swap disk and give it to the OS next to the 64 GB of memory. Then I can set Postgres to use more than the RAM size so it will start swapping. It would appear to Postgres that the complete database fits into memory. The question is: will this do any good? And if so, what will happen?

Kind regards,

Christiaan
Re: [PERFORM] Configuring for maximum memory usage
Thanks guys, lots of info here that I didn't know about!

Since I have one of the latest OpenSolaris builds, I guess the write throttle feature is already in there. Sadly, the blog doesn't say which build includes it. For writes, I do everything synchronized because we really need a consistent database on disk. We can see that during large inserts, the intent log is used a lot.

What I'm going to be testing is a smaller shared_buffers value with a large ARC cache, and then exactly the other way around.

Another question: since we have huge tables with hundreds of millions of rows, we partitioned the database (it actually creates the partitions dynamically now on inserts, with very good performance :D ), but the question is: is the size of the partitions important for the memory parameters in the config file? How can we determine the optimal size of the partitions? Obviously, when doing selects, you preferably want those to need only a single partition, for speed. At the moment that is the case for the majority of situations, but there might be some other things to think about...

Kind regards,

Christiaan

On Oct 30, 2008, at 7:27 PM, Scott Carey wrote:

On Thu, Oct 30, 2008 at 9:55 AM, Joshua D. Drake <[EMAIL PROTECTED]> wrote:

On Thu, 2008-10-30 at 09:46 -0700, Scott Carey wrote:
> > Remember that PostgreSQL doesn't cache anything on its own, so if you do want to hit disk it has to be in file cache.
>
> By my understanding, this is absolutely false. Postgres caches pages from tables/indexes in shared_buffers. You can make this very large if you wish.

You can make it very large with a potentially serious performance hit. It is very expensive to manage large amounts of shared buffers. It can also nail your IO on checkpoint if you are not careful (even with checkpoint smoothing). You are correct that I did not explain what I meant very well, because shared buffers are exactly that: shared buffers.
You can slam your I/O by having too large of either the OS file cache or shared_buffers, and you have to tune both. In the case of large shared_buffers you have to tune Postgres, especially the background writer and checkpoints. In the case of a large OS cache, you have to tune kernel parameters to limit the amount of dirty pages there and force writes out smoothly. Both layers attempt to delay writes for their own, often similar, reasons, and suffer when a large sync comes along with lots of dirty data.

Recent ZFS changes have been made to limit this (http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle). In earlier ZFS versions this is what usually killed databases -- ZFS in some situations would delay writes too long (even if "long" is 5 seconds) and get in trouble. This still has to be tuned well, combined with good checkpoint tuning in Postgres, as you mention. For Linux there are similar issues that have to be tuned on many kernels, or up to 40% of RAM can fill with dirty pages not written to disk. Letting the OS do it doesn't get rid of the problem: both levels of cache share very similar issues with large sizes and dirty pages followed by a sync.

The buffer cache in shared_buffers is a lot more efficient for large scanning queries -- a select count(*) test will be CPU bound whether it comes from shared_buffers or the OS page cache, and in the former case I have seen it execute up to 50% faster than the latter, by avoiding calling out to the OS to get pages, purely as a result of less CPU used.

However, that isn't the exact same thing as a "cache", at least as I was trying to describe it. Shared buffers are used to keep track of pages (as well as some other stuff) and their current status. That is not the same as caching a relation. It is not possible to pin a relation to memory using PostgreSQL. PostgreSQL relies on the operating system for that type of caching.
The OS can't pin a relation either; from its point of view it's all just a bunch of disk data blocks, not relations -- so it is all roughly equivalent. The OS can do a somewhat better job at data prefetch on sequential scans or other predictable seek sequences (ARC stands for Adaptive Replacement Cache) than Postgres currently does (no intelligent prefetch in Postgres AFAIK).

So I apologize if I made it sound like Postgres caches the actual relation; it's just pages -- but it is basically the same thing as the OS cache, only kept in-process, closer to the code that needs it. It's a cache that prevents disk reads.

My suggestion for the OP is to try it both ways and see what is better for his workload / OS / hardware combination.
Re: [PERFORM] Configuring for maximum memory usage
Hi Scott,

Thanks for the clear answers!

Scott Carey wrote:
> You must either increase the memory that ZFS uses, or increase PostgreSQL's shared_buffers and work_mem to get the aggregate of the two to use more RAM. I believe that you have not told ZFS to reserve 8GB, but rather told it to limit itself to 8GB.

That is correct, but since it will use the whole 8 GB anyway, I can just as easily say that it will reserve that memory ;)

Some comments below:

On Thu, Oct 30, 2008 at 8:15 AM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:
> Hi there,
> I configured OpenSolaris on our OpenSolaris machine. Specs:
> 2x Quad 2.6 GHz Xeon
> 64 GB of memory
> 16x 15k5 SAS

If you do much writing, and even more so with ZFS, it is critical to put the WAL log on a different ZFS volume (and perhaps disks) than the data and indexes.

I already did that. I also have a separate disk pair for the ZFS intent log.

Are you counting both the memory used by Postgres and the memory used by the ZFS ARC cache? It is the combination you are interested in, and performance will be better if it is biased towards one being a good chunk larger than the other. In my experience, if you are doing more writes, a larger filesystem cache is better; if you are doing more reads, a larger Postgres cache is better (the overhead of calling read() in 8k chunks to the OS, even if it is cached, causes CPU use to increase).

No, the figure I gave is without the ARC cache.

If you do very large aggregates, you may need even 1GB of work_mem. However, a setting that high would require very careful tuning and reduction of the space used by shared_buffers and the ZFS ARC. It's dangerous, since each connection with a large aggregate or sort may consume a lot of memory.

Well, some tasks may need a lot, but I guess most will do fine with the settings we use right now.

So it looks like I can tune the ARC to use more memory, and also increase shared_buffers to let Postgres cache more tables?
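On the large-aggregate point: work_mem can also be raised per session just for the heavy queries instead of globally, which avoids the risk of every connection claiming that much at once. A sketch (table and column names hypothetical):

```sql
SET work_mem = '1GB';          -- applies to this session only

SELECT customer_id, count(*)   -- hypothetical large aggregate
FROM   orders
GROUP  BY customer_id;

RESET work_mem;                -- back to the postgresql.conf value
```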
Re: [PERFORM] Configuring for maximum memory usage
Joshua D. Drake wrote:
> PostgreSQL is only going to use what it needs. It relies on the OS for much of the caching etc...

So that would actually mean that I could raise the setting of the ARC cache to far more than 8 GB? As I said, our database is 250 GB, so I would expect that Postgres needs more than it is using right now... Several tables have over 500 million records (obviously partitioned). At the moment we are doing queries over large datasets, so I would assume that Postgres would need a bit more memory than this...

> You are missing effective_cache_size. Try setting that to 32G.

That one was set to 24 GB. But this setting only tells Postgres how much caching it can expect from the OS? It is not actually memory that it will allocate, is it?

> You also didn't mention checkpoint_segments (which isn't memory but still important) and default_statistics_target (which isn't memory but still important).

At the moment checkpoint_segments = 40, and default_statistics_target is set to the default (I think that is 10).

Thanks already,

Christiaan

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
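Right -- effective_cache_size allocates nothing. It only tells the planner how much of the database it may assume is cached (shared_buffers plus the OS/ARC cache combined), which mainly influences the index-scan vs. sequential-scan choice. A quick sketch (the query is a hypothetical example):

```sql
SHOW effective_cache_size;          -- current value; no memory is reserved for it

SET effective_cache_size = '32GB';  -- per-session override, purely a planner hint

-- With a larger assumed cache, plans tend to favor index scans more:
EXPLAIN SELECT * FROM part_table WHERE id < 1000;
```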
[PERFORM] Configuring for maximum memory usage
Hi there,

I configured OpenSolaris on our OpenSolaris machine. Specs:

2x Quad 2.6 GHz Xeon
64 GB of memory
16x 15k5 SAS

The filesystem is configured using ZFS, and I think I have found a configuration that performs fairly well. I installed the standard PostgreSQL that came with the OpenSolaris disk (8.3), and later added support for PostGIS. All fine. I also tried to tune postgresql.conf to maximize performance and memory usage. Since PostgreSQL is the only thing running on this machine, we want it to take full advantage of the hardware.

For the ZFS cache, we have 8 GB reserved. The rest can be used by Postgres. The problem is getting it to use that much: at the moment it only uses about 9 GB, which is by far not enough.

I hope you can help me get to a working config. Here are the parameters I set in the config file:

shared_buffers = 8192MB
work_mem = 128MB
maintenance_work_mem = 2048MB
max_fsm_pages = 204800
max_fsm_relations = 2000

The database is about 250 GB in size, so we really need to have as much data as possible in memory. I hope you can help us tweak a few parameters to make sure all memory will be used.
Re: [PERFORM] How to setup disk spindles for best performance
On Aug 29, 2008, at 4:43 AM, Greg Smith wrote:

> On Thu, 21 Aug 2008, Christiaan Willemsen wrote:
>> Anyway, I'm going to return the controller, because it does not scale very well with more than 4 disks in RAID 10. Bandwidth is limited to 350MB/sec, and IOPS scale badly with extra disks...
>
> How did you determine that upper limit? Usually it takes multiple benchmark processes running at once in order to get more than 350MB/s out of a controller. For example, if you look carefully at the end of http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ you can see that Joshua had to throw 8 threads at the disks in order to reach maximum bandwidth.

I used IOmeter to do some tests, with 50 worker threads doing jobs. I can get more than 350 MB/sec, but only with huge block sizes (something like 8 MB). Even worse are the random read and the 70% read / 50% random tests: they don't scale at all when you add disks. A 6 disk RAID 5 is exactly as fast as a 12 disk RAID 10 :(

>> The idea of xlog + OS on a 4 disk RAID 10 and the rest for the data sounds good.
>
> I would just use a RAID 1 pair for the OS, another pair for the xlog, and throw all the other disks into a big 0+1 set. There is some value to separating the WAL from the OS disks, from both the performance and the management perspectives. It's nice to be able to easily monitor the xlog write bandwidth under load, for example.

Yes, that's about what I had in mind.

Kind regards,

Christiaan
Re: [PERFORM] How to setup disk spindles for best performance
Hi Scott,

Great info! Our RAID card at the moment is an ICP Vortex (Adaptec) ICP5165BR, and I'll be using it with Ubuntu Server 8.04. I tried OpenSolaris, but it yielded even more terrible performance, especially using ZFS... I guess that was just a mismatch. Anyway, I'm going to return the controller, because it does not scale very well with more than 4 disks in RAID 10. Bandwidth is limited to 350MB/sec, and IOPS scale badly with extra disks... So I guess I'll be waiting for another controller first.

The idea of xlog + OS on a 4 disk RAID 10 and the rest for the data sounds good :) I hope it will turn out that way too... First another controller...

Regards,

Christiaan

Scott Carey wrote:
> Indexes will be a random write workload, but these won't be synchronous writes and will be buffered by the RAID controller's cache -- assuming you're using a hardware RAID controller, that is, and one that doesn't have major performance problems on your platform. Which brings those questions up: what are your RAID card and OS? For reads, if your shared_buffers is large enough, your heavily used indexes won't likely go to disk much at all. A good RAID controller will typically distribute the workload effectively on a large array. You probably want a simple 2 disk mirror or 4 disks in RAID 10 for your OS + xlog, and the rest for data + indexes -- with hot spares IF your card supports them.
>
> The biggest risk of splitting up data and indexes is that you don't know how much I/O each needs relative to the other, and if this isn't a relatively constant ratio you will have one subset busy while the other is idle. Unless you have extensively profiled your disk activity into index and data subsets and know roughly what the optimal ratio is, it's probably going to cause more problems than it fixes. Furthermore, if this ratio changes at all, it's a maintenance nightmare.
How much each would need in a perfect world is application dependent, so there can be no general recommendation other than: don't do it.

On Thu, Aug 21, 2008 at 1:34 AM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:

> Thanks Joshua,
>
> So what about putting the indexes on a separate array? Since we do a lot of inserts, the indexes are going to be worked on a lot of the time.
>
> Regards,
>
> Christiaan
>
> Joshua D. Drake wrote:
>> Christiaan Willemsen wrote:
>>> So, what you are basically saying is that a single mirror is in general more than enough to facilitate the transaction log.
>>
>> http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/
>> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>>
>> And to answer your question: yes. Transaction logs are written sequentially. You do not need a journaled filesystem, and RAID 1 is plenty for most if not all workloads.
>>
>> Sincerely,
>> Joshua D. Drake
Re: [PERFORM] How to setup disk spindles for best performance
Thanks Joshua,

So what about putting the indexes on a separate array? Since we do a lot of inserts, the indexes are going to be worked on a lot of the time.

Regards,

Christiaan

Joshua D. Drake wrote:
> Christiaan Willemsen wrote:
>> So, what you are basically saying is that a single mirror is in general more than enough to facilitate the transaction log.
>
> http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/
> http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide
>
> And to answer your question: yes. Transaction logs are written sequentially. You do not need a journaled filesystem, and RAID 1 is plenty for most if not all workloads.
>
> Sincerely,
> Joshua D. Drake
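Should you still want to try it, the mechanism for putting indexes on their own spindles is a tablespace; a minimal sketch (paths, table, and index names hypothetical):

```sql
-- A tablespace on the array reserved for indexes:
CREATE TABLESPACE index_space LOCATION '/mnt/index_array';

-- New indexes can be placed there explicitly...
CREATE INDEX orders_customer_idx ON orders (customer_id)
    TABLESPACE index_space;

-- ...and existing ones moved (rewrites the index files):
ALTER INDEX orders_customer_idx SET TABLESPACE index_space;
```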
Re: [PERFORM] How to setup disk spindles for best performance
So, what you are basically saying is that a single mirror is in general more than enough to facilitate the transaction log. So it would not be smart to put the indexes onto a separate disk spindle to improve index performance?

On Aug 21, 2008, at 3:49 AM, Merlin Moncure wrote:

On Wed, Aug 20, 2008 at 6:25 PM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:

> I'm currently trying to find out the best configuration for our new database server. It will serve a database of about 80 GB that is growing fast. The new machine has plenty of memory (64 GB) and 16 SAS disks, of which two are already in use as a mirror for the OS. The rest can be used for PostgreSQL. So that makes a total of 14 15k.5 SAS disks.
>
> There is obviously a lot of interesting reading to be found, most of it stating that the transaction log should be put onto a separate disk spindle. You can also do this with the indexes. Since they will be updated a lot, I guess that might be a good idea. But what no one states is what performance these spindles should have in comparison to the data spindles. If I create a 6 disk RAID 10 for the data, a 4 disk RAID 10 for the log, and a 4 disk RAID 10 for the indexes, will this yield the best performance? Or is it sufficient to just have a simple mirror for the log and/or indexes...? I have not found any information about these figures, and I guess it should be possible to give some pointers on how these different setups might affect performance.

Well, the speed of your logging device puts an upper bound on the write speed of the database. While modern SAS drives can do 80MB/sec+ with sequential ops, this can turn into 1MB/sec real fast if the logging is duking it out with the other, generally random, work the database has to do, which is why it's often separated out. 80MB/sec is actually quite a lot in database terms, and you will likely only get anything close to that when doing heavy insertion, so it's unlikely to become the bottleneck.
Even if you hit that limit sometimes, those drives are probably put to better use in the data volume somewhere.

As for partitioning the data volume, I'd advise this only if you have a mixed-duty database that does different tasks with different performance requirements. You may be serving a user interface which has a very low maximum transaction time and therefore gets dedicated disk I/O, apart from the data churn that is going on elsewhere. Apart from that, though, I'd keep it in a single volume.

merlin
[PERFORM] How to setup disk spindles for best performance
I'm currently trying to find out the best configuration for our new database server. It will serve a database of about 80 GB that is growing fast. The new machine has plenty of memory (64 GB) and 16 SAS disks, of which two are already in use as a mirror for the OS. The rest can be used for PostgreSQL. So that makes a total of 14 15k.5 SAS disks.

There is obviously a lot of interesting reading to be found, most of it stating that the transaction log should be put onto a separate disk spindle. You can also do this with the indexes. Since they will be updated a lot, I guess that might be a good idea. But what no one states is what performance these spindles should have in comparison to the data spindles. If I create a 6 disk RAID 10 for the data, a 4 disk RAID 10 for the log, and a 4 disk RAID 10 for the indexes, will this yield the best performance? Or is it sufficient to just have a simple mirror for the log and/or indexes...? I have not found any information about these figures, and I guess it should be possible to give some pointers on how these different setups might affect performance. So I hope someone has already tested this and can give some tips...

Kind regards,

Christiaan