[PERFORM] Partitions and joins lead to index lookups on all partitions

2011-12-07 Thread Christiaan Willemsen
Hi there,

Currently, we are running into serious performance problems with our
partitioning setup, because index lookups are mostly done on all partitions
instead of on the one partition that the planner should know contains the
needed row.

Simple example, where we have a partitioned table named part_table. So here it
goes:

select * from part_table where id = 12123231

This will do an index lookup only in the partition that is known to contain
the id. However:

select * from part_table where id = (select 12123231)

This will do an index lookup in ALL partitions, which makes it significantly
slower, even more so since the database does not fit into memory.

So okay, we could just not use parameterized queries... Well... not so fast.
Consider a second table referencing the first:

ref_table:
    group_id bigint
    part_table_id bigint

Now when I join the two:

select part_table.* from part_table
join ref_table on (ref_table.part_table_id = part_table.id and group_id = 12321)

It will also do index lookups on ALL partitions.

How do we handle this? The above queries are simplified versions of what is
actually going on, but the idea is clear. I tried this on 9.1 as well (we are
currently using 9.0), but it makes no difference. So what is the practical use
of partitioning if you can't even use it effectively for simple joins?
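
For reference, the layout in question is the classic inheritance-based
partitioning with CHECK constraints on the key; a minimal sketch (names, types
and ranges are illustrative, the real schema is more involved):

create table part_table (
    id      bigint not null,
    payload text
);

create table part_table_p1 (
    check (id >= 10000000 and id < 20000000)
) inherits (part_table);
create index part_table_p1_id_idx on part_table_p1 (id);

-- ... one child table and index like this per id range ...
-- constraint_exclusion is enabled in postgresql.conf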

 

 
 

constraint_exclusion is enabled correctly, and as far as I can see, this 
behaviour is according to the book.
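
So far the only workaround we see is to resolve the key outside the join and
feed it back in as a plan-time literal, roughly like this (a sketch; the values
are made up):

-- 1. resolve the referencing ids first
select part_table_id from ref_table where group_id = 12321;
-- 2. then query the partitioned table with the resulting literal value(s),
--    so constraint exclusion can prune at plan time
select * from part_table where id = 12123231;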

 

 
 

Is there any progress in maybe 9.2 to make this any better? If not, how should
we handle this? We could also choose not to partition at all, but how will that
perform on a 100 GB table?

Kind regards,

Christiaan Willemsen

[PERFORM] SSD endurance calculations

2011-11-21 Thread Christiaan Willemsen
Hey there,

 
We are looking at beefing up our servers with SSDs. Some of you did some
interesting tests with the Intel 320, so the idea came up to make a RAID 10 of
four 600 GB models.

 
I did, however, do some calculations for the current database server (220 GB
database, expected to grow to 1 TB by the end of next year). I looked
specifically at the read/write figures in /proc/diskstats. From there I could
see a read/write ratio of 3:1, and I also saw a whopping 170 GB of writes per
day (for a database that currently grows by 1 GB per day). That seems like an
insanely high figure to me! How come? We do mostly inserts, hardly any updates,
and virtually no deletes.

 
Secondly, I also looked at the endurance figures of the Intel 320. They are
rated for 5 years at 20 GB of writes per day, which means a single drive would
hold up for only about 200 days in our system. RAID 10 will make roughly 400
days of that, since each drive then sees about half of the writes, but that
still seems hardly a lot... Am I missing something here?
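
For reference, the back-of-the-envelope arithmetic behind those figures,
assuming writes are spread evenly over the two mirror pairs so each drive sees
about half of the daily volume (a sketch, numbers only):

select
    5 * 365 * 20                        as rated_writes_gb_per_drive, -- ~36500 GB
    round((5 * 365 * 20) / 170.0)       as days_on_a_single_drive,    -- ~215 days
    round((5 * 365 * 20) / (170.0 / 2)) as days_in_4_drive_raid10;    -- ~430 days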

 
Kind regards,

 
Christiaan

 
 
 


Re: [PERFORM] Using high speed swap to improve performance?

2010-04-06 Thread Christiaan Willemsen

Hi Scott,

 

That sounds like a useful thing to do, but the big advantage of the SAN is that
in case the physical machine goes down, I can quickly start up a virtual
machine using the same database files to act as a fallback. It will have less
memory and fewer CPUs, but it will do fine for some time.

So when putting fast tables on local storage, I lose those tables when the
machine goes down.

Putting indexes on there, however, might be interesting... What will PostgreSQL
do when it is started on the backup machine and finds out that the index files
are missing? Will it recreate those files, will it panic and not start at all,
or can we just manually reindex?
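
What I have in mind is something like a dedicated tablespace for the indexes on
the fast local array (a sketch only; the path and names are made up):

create tablespace local_fast location '/local_array/pg_indexes';
alter index some_big_index set tablespace local_fast;
-- new indexes could be placed there directly as well:
create index some_table_col_idx on some_table (col) tablespace local_fast;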

 

Kind regards,

 

Christiaan
 
-Original message-
From: Scott Marlowe 
Sent: Sun 04-04-2010 23:08
To: Christiaan Willemsen ; 
CC: pgsql-performance@postgresql.org; 
Subject: Re: [PERFORM] Using high speed swap to improve performance?

On Fri, Apr 2, 2010 at 1:15 PM, Christiaan Willemsen
 wrote:
> Hi there,
>
> About a year ago we setup a machine with sixteen 15k disk spindles on
> Solaris using ZFS. Now that Oracle has taken Sun, and is closing up Solaris,
> we want to move away (we are more familiar with Linux anyway).
>
> So the plan is to move to Linux and put the data on a SAN using iSCSI (two
> or four network interfaces). This however leaves us with with 16 very nice
> disks dooing nothing. Sound like a wast of time. If we were to use Solaris,
> ZFS would have a solution: use it as L2ARC. But there is no Linux filesystem
> with those features (ZFS on fuse it not really an option).
>
> So I was thinking: Why not make a big fat array using 14 disks (raid 1, 10
> or 5), and make this a big and fast swap disk. Latency will be lower than
> the SAN can provide, and throughput will also be better, and it will relief
> the SAN from a lot of read iops.
>
> So I could create a 1TB swap disk, and put it onto the OS next to the 64GB
> of memory. Then I can set Postgres to use more than the RAM size so it will
> start swapping. It would appear to postgres that the complete database will
> fit into memory. The question is: will this do any good? And if so: what
> will happen?

I'd make a couple of RAID-10s out of it and use them for highly used
tables and / or indexes etc...



[PERFORM] Using high speed swap to improve performance?

2010-04-02 Thread Christiaan Willemsen

Hi there,

 

About a year ago we set up a machine with sixteen 15k disk spindles on Solaris
using ZFS. Now that Oracle has taken over Sun and is closing up Solaris, we
want to move away (we are more familiar with Linux anyway).

So the plan is to move to Linux and put the data on a SAN using iSCSI (two or
four network interfaces). This however leaves us with 16 very nice disks doing
nothing. Sounds like a waste. If we were to use Solaris, ZFS would have a
solution: use them as L2ARC. But there is no Linux filesystem with those
features (ZFS on FUSE is not really an option).

So I was thinking: why not make a big fat array using 14 disks (RAID 1, 10 or
5) and make this a big and fast swap disk? Latency will be lower than the SAN
can provide, throughput will also be better, and it will relieve the SAN of a
lot of read IOPS.

So I could create a 1 TB swap disk and put it into the OS next to the 64 GB of
memory. Then I can set Postgres to use more than the RAM size so it will start
swapping. It would appear to Postgres that the complete database fits into
memory. The question is: will this do any good? And if so: what will happen?

Kind regards,

 

Christiaan


Re: [PERFORM] Configuring for maximum memory usage

2008-10-30 Thread Christiaan Willemsen

Thanks guys,

Lots of info here that I didn't know about! Since I have one of the latest
OpenSolaris builds, I guess the write throttle feature is already in there.
Sadly, the blog doesn't say which build includes it.


For writes, I do everything synchronously, because we really need a consistent
database on disk. We can see that during large inserts, the intent log is used
a lot.


What I'm going to be testing is a smaller shared_buffers value with a large ARC
cache, and exactly the other way around.


Another question: since we have huge tables with hundreds of millions of rows,
we partitioned the database (it actually creates the partitions dynamically now
on inserts, with very good performance :D ), but the question is: does the size
of the partitions matter for the memory parameters in the config file? How can
we determine the optimal partition size? Obviously, when doing selects, you
want those to need only a single partition for speed. At the moment that is the
case for the majority of situations, but there might be some other things to
think about...
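
For reference, the dynamic creation is done with the usual trigger-based
routing (a sketch with made-up names and ranges; the real function also creates
a missing child table before inserting into it):

create or replace function part_table_insert_trigger() returns trigger as $$
begin
    if new.id >= 10000000 and new.id < 20000000 then
        insert into part_table_p1 values (new.*);
    else
        raise exception 'no partition for id %', new.id;
    end if;
    return null;  -- the row is stored in the child table, not the parent
end;
$$ language plpgsql;

create trigger part_table_insert
    before insert on part_table
    for each row execute procedure part_table_insert_trigger();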


Kind regards,

Christiaan


On Oct 30, 2008, at 7:27 PM, Scott Carey wrote:




On Thu, Oct 30, 2008 at 9:55 AM, Joshua D. Drake  
<[EMAIL PROTECTED]> wrote:

On Thu, 2008-10-30 at 09:46 -0700, Scott Carey wrote:

> > Remember that PostgreSQL doesn't cache anything on its own, so
> > if you do want to hit disk it has to be in file cache.
>
> By my understanding, this is absolutely false.  Postgres caches pages
> from tables/indexes in shared_buffers. You can make this very large if
> you wish.

You can make it very large with a potentially serious performance hit.
It is very expensive to manage large amounts of shared buffers. It can
also nail your IO on checkpoint if you are not careful (even with
checkpoint smoothing). You are correct that I did not explain what I
meant very well because shared buffers are exactly that, shared
buffers.

You can slam your I/O by having too large of either the OS file cache or
shared_buffers, and you have to tune both.
In the case of large shared_buffers you have to tune Postgres, and especially
the background writer and checkpoints.
In the case of a large OS cache, you have to tune kernel parameters to limit
the amount of dirty pages there and force writes out smoothly.
Both layers attempt to delay writes for their own, often similar reasons, and
suffer when a large sync comes along with lots of dirty data.


Recent ZFS changes have been made to limit this
(http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle); in earlier ZFS
versions, this is what usually killed databases -- ZFS in some situations would
delay writes too long (even if "long" is 5 seconds) and get in trouble.  This
still has to be tuned well, combined with good checkpoint tuning in Postgres as
you mention.  For Linux, there are similar issues that have to be tuned on many
kernels, or up to 40% of RAM can fill with dirty pages not written to disk.


Letting the OS do it doesn't get rid of the problem, both levels of  
cache share very similar issues with large sizes and dirty pages  
followed by a sync.


The buffer cache in shared_buffers is a lot more efficient for large  
scanning queries -- A select count(*) test will be CPU bound if it  
comes from shared_buffers or the OS page cache, and in the former  
case I have seen it execute up to 50% faster than the latter, by  
avoiding calling out to the OS to get pages, purely as a result of  
less CPU used.
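
A sketch of the kind of comparison meant here (the table name is a placeholder;
run it twice and time the second, warm run):

-- from psql
\timing
select count(*) from big_table;  -- first run may still hit disk
select count(*) from big_table;  -- warm run: CPU bound if the table is fully cached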





However that isn't the exact same thing as a "cache", at least as I was trying
to describe it. Shared buffers are used to keep track of pages (as well as some
other stuff) and their current status. That is not the same as caching a
relation.

It is not possible to pin a relation to memory using PostgreSQL.
PostgreSQL relies on the operating system for that type of caching.

The OS can't pin a relation either; from its point of view it's all just a
bunch of disk data blocks, not relations -- so it is all roughly equivalent.
The OS can do a bit better job at data prefetch on sequential scans or other
predictable seek sequences (ARC stands for Adaptive Replacement Cache) than
Postgres currently does (no intelligent prefetch in Postgres AFAIK).


So I apologize if I made it sound like Postgres caches the actual relation;
it's just pages -- but it is basically the same thing as the OS cache, only
kept in-process, closer to the code that needs it.  It's a cache that prevents
disk reads.


My suggestion for the OP is to try it both ways, and see what is  
better for his workload / OS / Hardware combination.



Joshua D. Drake





--






Re: [PERFORM] Configuring for maximum memory usage

2008-10-30 Thread Christiaan Willemsen

Hi Scott,

Thanks for the clear answers!

Scott Carey wrote:
You must either increase the memory that ZFS uses, or increase PostgreSQL's
shared_mem and work_mem to get the aggregate of the two to use more RAM.

I believe that you have not told ZFS to reserve 8GB, but rather told it to
limit itself to 8GB.

That is correct, but since it will use the whole 8 GB anyway, I can just as
easily say that it will reserve that memory ;)


Some comments below:

On Thu, Oct 30, 2008 at 8:15 AM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:


Hi there,

I configured OpenSolaris on our OpenSolaris Machine. Specs:

2x Quad 2.6 Ghz Xeon
64 GB of memory
16x 15k5 SAS

If you do much writing, and even moreso with ZFS, it is critical to 
put the WAL log on a different ZFS volume (and perhaps disks) than the 
data and indexes.

I already did that. I also have a separate disk pair for the ZFS intent log.

Are you counting both the memory used by postgres and the memory used 
by the ZFS ARC cache?  It is the combination you are interested in, 
and performance will be better if it is biased towards one being a 
good chunk larger than the other.  In my experience, if you are doing 
more writes, a larger file system cache is better, if you are doing 
reads, a larger postgres cache is better (the overhead of calling 
read() in 8k chunks to the os, even if it is cached, causes CPU use to 
increase).

No, the figure I gave is this is without the ARC cache.
If you do very large aggregates, you may even need 1GB of work_mem.
However, a setting that high would require very careful tuning and reduction
of the space used by shared_buffers and the ZFS ARC.  It is dangerous, since
each connection with a large aggregate or sort may consume a lot of memory.

Well, some tasks may need a lot, but I guess most will do fine with the
settings we use right now.
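
For the tasks that do need it, a per-session override is probably the way to go
rather than raising the global setting (a sketch; the query and value are
illustrative):

set work_mem = '1GB';
select some_key, count(*) from big_table group by some_key;  -- large aggregate
reset work_mem;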


So it looks like I can tune the ARC to use more memory, and also increase
shared_mem to let Postgres cache more tables?


Re: [PERFORM] Configuring for maximum memory usage

2008-10-30 Thread Christiaan Willemsen


Joshua D. Drake wrote:


PostgreSQL is only going to use what it needs. It relies on the OS for 
much of the caching etc...


So that would actually mean that I could raise the setting of the ARC cache to
far more than 8 GB? As I said, our database is 250 GB, so I would expect that
Postgres needs more than it is using right now... Several tables have over 500
million records (obviously partitioned). At the moment we are doing queries
over large datasets, so I would assume that Postgres would need a bit more
memory than this...


You are missing effective_cache_size. Try setting that to 32G.

That one was set to 24 GB. But this setting only tells Postgres how much
caching it can expect from the OS? It is not memory that Postgres actually
allocates, is it?


You also didn't mention checkpoint_segments (which isn't memory but 
still important) and default_statistics_target (which isn't memory but 
still important).



Those are at the moment set to:

checkpoint_segments = 40
default_statistics_target is set to the default (I think that is 10)
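
If specific columns turn out to need better estimates, the statistics target
can also be raised per column instead of globally (a sketch with made-up
names):

alter table some_table alter column some_column set statistics 100;
analyze some_table;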

Thanks already,

Christiaan

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Configuring for maximum memory usage

2008-10-30 Thread Christiaan Willemsen

Hi there,

I configured OpenSolaris on our new machine. Specs:

2x Quad 2.6 Ghz Xeon
64 GB of memory
16x 15k5 SAS

The filesystem is configured using ZFS, and I think I have found a 
configuration that performs fairly well.


I installed the standard PostgreSQL that came with the OpenSolaris disk (8.3),
and later added support for PostGIS. All fine.
I also tried to tune postgresql.conf to maximize performance and memory usage.


Since PostgreSQL is the only thing running on this machine, we want it 
to take full advantage of the hardware. For the ZFS cache, we have 8 GB 
reserved. The rest can be used by postgres.


The problem is getting it to use that much. At the moment it uses only about
9 GB, which is by far not enough. The problem is getting it to use more... I
hope you can help me toward a working config.
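
One way to see what is actually occupying shared_buffers is the pg_buffercache
contrib module (a sketch; on 8.3 it is installed from the contrib SQL script,
and the 8 assumes the default 8 kB block size):

select c.relname, count(*) * 8 / 1024 as buffered_mb
from pg_buffercache b
join pg_class c on b.relfilenode = c.relfilenode
where b.reldatabase = (select oid from pg_database
                       where datname = current_database())
group by c.relname
order by buffered_mb desc
limit 10;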


Here are the parameters I set in the config file:

shared_buffers = 8192MB
work_mem = 128MB
maintenance_work_mem = 2048MB
max_fsm_pages = 204800
max_fsm_relations = 2000

Database is about 250 GB in size, so we really need to have as much data 
as possible in memory.


I hope you can help us tweak a few parameters to make sure all memory 
will be used.





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to setup disk spindles for best performance

2008-08-28 Thread Christiaan Willemsen


On Aug 29, 2008, at 4:43 AM, Greg Smith wrote:


On Thu, 21 Aug 2008, Christiaan Willemsen wrote:

Anyway, I'm going to return the controller, because it does not  
scale very well with more that 4 disks in raid 10. Bandwidth is  
limited to 350MB/sec, and IOPS scale badly with extra disks...


How did you determine that upper limit?  Usually it takes multiple benchmark
processes running at once in order to get more than 350MB/s out of a
controller.  For example, if you look carefully at the end of
http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/
you can see that Joshua had to throw 8 threads at the disks in order to reach
maximum bandwidth.


I used IOmeter to do some tests, with 50 worker threads doing jobs. I can get
more than 350 MB/sec, but I'll have to use huge block sizes (something like
8 MB). Even worse are the random read and the 70% read / 50% random tests:
they don't scale at all when you add disks. A 6 disk RAID 5 is exactly as fast
as a 12 disk RAID 10 :(


The idea of xlog + OS on a 4 disk RAID 10 and the rest for the data sounds
good.


I would just use a RAID1 pair for the OS, another pair for the xlog,  
and throw all the other disks into a big 0+1 set.  There is some  
value to separating the WAL from the OS disks, from both the  
performance and the management perspectives.  It's nice to be able  
to monitor the xlog write bandwidth rate under load easily for  
example.


Yes, that's about what I had in mind.

Kind regards,

Christiaan

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to setup disk spindles for best performance

2008-08-21 Thread Christiaan Willemsen

Hi Scott,

Great info! Our RAID card is at the moment an ICP Vortex (Adaptec) ICP5165BR,
and I'll be using it with Ubuntu Server 8.04. I tried OpenSolaris, but it
yielded even more terrible performance, especially using ZFS... I guess that
was just a mismatch. Anyway, I'm going to return the controller, because it
does not scale very well with more than 4 disks in RAID 10. Bandwidth is
limited to 350MB/sec, and IOPS scale badly with extra disks...


So I guess I'll be waiting for another controller first. The idea of xlog + OS
on a 4 disk RAID 10 and the rest for the data sounds good :) I hope it will
turn out that way too... First another controller...


Regards,

Christiaan

Scott Carey wrote:
Indexes will be random write workload, but these won't by synchronous 
writes and will be buffered by the raid controller's cache.  Assuming 
you're using a hardware raid controller that is, and one that doesn't 
have major performance problems on your platform.  Which brings those 
questions up --- what is your RAID card and OS?


For reads, if your shared_buffers is large enough, your heavily used 
indexes won't likely go to disk much at all.


A good raid controller will typically help distribute the workload 
effectively on a large array.


You probably want a simple 2 disk mirror or 4 disks in raid 10 for 
your OS + xlog, and the rest for data + indexes -- with hot spares IF 
your card supports them.


The biggest risk to splitting up data and indexes is that you don't know how
much I/O each needs relative to the other, and if this isn't a relatively
constant ratio you will have one subset busy while the other subset is idle.
Unless you have extensively profiled your disk activity into index and data
subsets and know roughly what the optimal ratio is, it's probably going to
cause more problems than it fixes.
Furthermore, if this ratio changes at all, it's a maintenance nightmare.  How
much each would need in a perfect world is application dependent, so there can
be no general recommendation other than: don't do it.


On Thu, Aug 21, 2008 at 1:34 AM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:


Thanks Joshua,

So what about putting the indexes on a separate array? Since we do
a lot of inserts indexes are going to be worked on a lot of the time.

Regards,

Christiaan


Joshua D. Drake wrote:

Christiaan Willemsen wrote:

So, what you are basically saying, is that a single mirror
is in general more than enough to facilitate the
transaction log.



http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

And to answer your question, yes. Transaction logs are written
sequentially. You do not need a journaled file system and raid
1 is plenty for most if not all work loads.

Sincerely,

Joshua D. Drake


-- 
Sent via pgsql-performance mailing list

(pgsql-performance@postgresql.org
<mailto:pgsql-performance@postgresql.org>)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




Re: [PERFORM] How to setup disk spindles for best performance

2008-08-21 Thread Christiaan Willemsen

Thanks Joshua,

So what about putting the indexes on a separate array? Since we do a lot of
inserts, the indexes are going to be worked on a lot of the time.


Regards,

Christiaan

Joshua D. Drake wrote:

Christiaan Willemsen wrote:
So, what you are basically saying, is that a single mirror is in 
general more than enough to facilitate the transaction log.


http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ 


http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide

And to answer your question, yes. Transaction logs are written 
sequentially. You do not need a journaled file system and raid 1 is 
plenty for most if not all work loads.


Sincerely,

Joshua D. Drake



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to setup disk spindles for best performance

2008-08-20 Thread Christiaan Willemsen
So what you are basically saying is that a single mirror is in general more
than enough to handle the transaction log.

So it would not be smart to put the indexes onto a separate disk spindle to
improve index performance?


On Aug 21, 2008, at 3:49 AM, Merlin Moncure wrote:


On Wed, Aug 20, 2008 at 6:25 PM, Christiaan Willemsen
<[EMAIL PROTECTED]> wrote:
I'm currently trying to find out what the best configuration is for  
our new
database server. It will server a database of about 80 GB and  
growing fast.
The new machine has plenty of memory (64GB) and 16 SAS disks, of  
wich two

are already in use as a mirror for the OS.

The rest can be used for PostgreSQL. So that makes a total of 14  
15k.5 SAS
diks. There is obviously a lot to interesting reading to be found,  
most of
them stating that the transaction log should be put onto a separate  
disk
spindle. You can also do this with the indexes. Since they will be  
updated a
lot, I guess that might be a good idea. But what no-one states, is  
what
performance these spindle should have in comparison to the data  
spindle? If
I create a raid 10 of 6 disks for the data, 4 disk raid 10 for the  
log, and
4 disk raid 10 for the indexes, will this yield best performance?  
Or is it
sufficient to just have a simple mirror for the log and/or  
indexes...? I
have not found any information about these figures, and I guess it  
should be
possible to give some pointers on how these different setup might  
affect

performance?


Well, the speed of your logging device puts an upper bound on the
write speed of the database.  While modern sas drives can do 80mb/sec
+ with sequential ops, this can turn to 1mb/sec real fast if the
logging is duking it out with the other generally random work the
database has to do, which is why it's often separated out.

80mb/sec is actually quite a lot in database terms and you will likely
only get anything close to that when doing heavy insertion, so that
it's unlikely to become the bottleneck.  Even if you hit that limit
sometimes, those drives are probably put to better use in the data
volume somewhere.

As for partitioning the data volume, I'd advise this only if you have
a mixed duty database that does different tasks with different
performance requirements.  You may be serving a user interface which
has very low maximum transaction time and therefore gets dedicated
disk i/o apart from the data churn that is going on elsewhere.  Apart
from that though, I'd keep it in a single volume.

merlin



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] How to setup disk spindles for best performance

2008-08-20 Thread Christiaan Willemsen
I'm currently trying to find out what the best configuration is for our new
database server. It will serve a database of about 80 GB that is growing fast.
The new machine has plenty of memory (64GB) and 16 SAS disks, of which two are
already in use as a mirror for the OS.

The rest can be used for PostgreSQL. So that makes a total of 14 15k.5 SAS
disks. There is obviously a lot of interesting reading to be found, most of it
stating that the transaction log should be put onto a separate disk spindle.
You can also do this with the indexes. Since they will be updated a lot, I
guess that might be a good idea. But what no one states is what performance
these spindles should have in comparison to the data spindles. If I create a
6 disk RAID 10 for the data, a 4 disk RAID 10 for the log, and a 4 disk RAID 10
for the indexes, will this yield the best performance? Or is it sufficient to
just have a simple mirror for the log and/or indexes...? I have not found any
information about these figures, and I guess it should be possible to give
some pointers on how these different setups might affect performance?


So I hope someone has already tested this and can give some tips...

Kind regards,

Christiaan 


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance