Re: [PERFORM] [GENERAL] How to know which queries are to be optimised?

2004-08-12 Thread Rudi Starcevic
Hi,
But is there a tool that could compile a summary out of the log? The
log grows awfully big after a short time.
There's also pg_analyzer to check out.
http://www.samse.fr/GPL/pg_analyzer/
It's written in Perl and produces HTML output.
You might want to look at the "Practical Query Analyser" - haven't used 
it myself yet, but it seems a sensible idea.

http://pqa.projects.postgresql.org/

Cheers,
Rudi.


Re: [PERFORM] Help specifying new machine

2004-08-12 Thread William Yu
Rory Campbell-Lange wrote:
> The present server is a 2GHz Pentium 4/512 KB cache with 2
> software-raided ide disks (Maxtors) and 1GB of RAM.
> I have been offered the following 1U server which I can just about
> afford:
> 1U server
> Intel Xeon 2.8GHz 512K cache          x 1
> 512MB PC2100 DDR ECC Registered       x 2
> 80Gb SATA HDD                         x 4
> 4 port SATA card, 3ware 8506-4        x 1
> 3 year next-day hardware warranty     x 1
You're not getting much of a bump with this server. The CPU is 
incrementally faster -- in the absolute best-case scenario, where your 
queries are 100% cpu-bound, that's roughly 25%-30% faster.

If you could use that money instead to upgrade your current server, 
you'd get a much bigger impact. Go for more memory and SCSI (RAID 
controllers with battery-backed cache).



Re: [PERFORM] Hardware upgrade for a high-traffic database

2004-08-12 Thread Merlin Moncure
> would one not have to repeat this operation regularly, to keep
> any advantage of this?  My impression was that this is a relatively
> heavy operation on a large table.

Yeah, it requires an exclusive lock and a table rebuild.  Repeating it
regularly might well be needed for a message board type database, since
(one presumes) the reads are concentrated on recently created data,
which was entered after the last cluster and so has lost any benefit.

As far as table size goes, bigger tables make for a bigger operation,
but they also take longer to get badly out of order.  The question is:
what percentage of the data turns over between maintenance periods?
Plus, there has to be a maintenance period...nobody does anything while
the table is clustering.

Also, one particular way of reading the table has to really dominate the
usage pattern.  So it's pretty rare to use cluster, but it can help in
some cases.
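
Concretely, the maintenance step would look something like this
(untested; 7.4 syntax, table/index names from the example elsewhere in
the thread):

-- first run records the clustering index
cluster user_message_idx on messages;
-- later maintenance runs just re-sort on the recorded index,
-- holding an exclusive lock for the duration
cluster messages;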

Merlin



Re: [PERFORM] Hardware upgrade for a high-traffic database

2004-08-12 Thread Merlin Moncure
Tom Lane wrote:
> The difference would be pretty marginal --- especially if you choose to
> use bigints instead of ints.  (A timestamp is just a float8 or bigint
> under the hood, and is no more expensive to compare than those datatypes.
> Timestamps *are* expensive to convert for I/O, but comparison does not
> have to do that.)  I wouldn't recommend kluging up your data schema just
> for that.

Right (int4 use was assumed).  I agree, but it's kind of a 'two birds
with one stone' kind of thing, because it's easier to work with
reverse-ordered integers than time values.  So I claim a measurable win
(the real gainer of course being able to select and sort on the same
key, which works with any type) based on the int4 vs. int8 difference:
with an int4 message_id the (user_id, message_id) key is 8 bytes instead
of 12, roughly a 33% reduction in key size.

One claim I don't have the data for is that read-forward is better than
read-back, but my gut tells me he'll get a better cache hit ratio that
way.  This will be very difficult to measure.

As for kludging, using a decrementing sequence is not a bad idea if the
general tendency is to read the table backwards, even if just for
conceptual reasons.  The main kludge is the int4 assumption, which (IMO)
isn't so bad.  He would just have to rebuild the existing p-key in
reverse order ($10 says his keys are all already int4s), and hopefully
not mess with the application code too much.
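
Very roughly, and purely hypothetically (untested; assumes an existing
ascending int4 key, ids well below 2^30, and no foreign keys pointing at
message_id -- those would need the same treatment):

begin;
create sequence message_rev_seq increment -1 start 2147483647 minvalue 0
maxvalue 2147483647;
-- remap the old ascending ids onto the descending range, so the oldest
-- row ends up with the highest key
update messages set message_id = 2147483647 - message_id;
-- park the sequence just below the newest remapped id
select setval('message_rev_seq', (select min(message_id) from messages));
-- new rows now come off the descending sequence
alter table messages alter column message_id
set default nextval('message_rev_seq');
commit;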

At least, it's what I would try if I was in his shoes :)

YMMV
Merlin



Re: [PERFORM] Hardware upgrade for a high-traffic database

2004-08-12 Thread gnari
"Merlin Moncure" <[EMAIL PROTECTED]> wrote:

> -- optional
> cluster user_message_idx on messages;

would one not have to repeat this operation regularly, to keep
any advantage of this?  My impression was that this is a relatively
heavy operation on a large table.

gnari



Re: [PERFORM] Hardware upgrade for a high-traffic database

2004-08-12 Thread Merlin Moncure
> I don't have a huge amount of experience with this in pg, but one of the
> tricks we do in the ISAM world is a 'reverse date' system, so that you
> can scan forwards on the key to pick up datetimes in descending order.
> This is often a win because the o/s cache may assume read/forwards
> giving you more cache hits.   There are a few different ways to do this,
> but imagine:

I've been thinking more about this, and there is an even better way of
doing it if you are willing to beat on your data a bit.  It involves the
use of sequences.  Let's revisit your id/timestamp query combination for
a message board.  The assumption is you are using integer keys for all
tables.  You probably have something like:

create table messages
(
user_id       int4 references users,
topic_id      int4 references topics,
message_id    serial,
message_time  timestamp default now(),
[...]
);

The following suggestion works on two principles: one is that instead of
using timestamps for ordering, integers are quicker, and the other is
that sequences have a built-in ability for reverse ordering.

Let's define:
create sequence message_seq increment -1 start 2147483647 minvalue 0
maxvalue 2147483647;

now we define our table:
create table messages
(
user_id  int4 references users,
topic_id int4 references topics,
message_id   int4 default nextval('message_seq') primary key,
message_time timestamp default now(),
[...]
);

create index user_message_idx on messages(user_id, message_id);
-- optional
cluster user_message_idx on messages;

Since the sequence is in descending order, we don't have to do any
tricks to logically reverse order the table.

-- return last k posts made by user u in descending order;

select * from messages where user_id = u order by user_id, message_id
limit k;

-- return last k posts on a topic
create index topic_message_idx on messages(topic_id, message_id);
select * from messages where topic_id = t order by topic_id, message_id
limit k;
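
To confirm the planner really uses the index for both the lookup and the
ordering (i.e. no separate sort step in the plan), run the query through
explain analyze -- the literal values here are just made up:

explain analyze
select * from messages where user_id = 51
order by user_id, message_id limit 20;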

A side benefit of clustering is that there is little penalty for
increasing k because of read-ahead optimization, whereas in normal
scenarios your read time scales with k (forcing small values for k).  If
we tended to pull up messages by topic more frequently than by user, we
would cluster on topic_message_idx instead.  (If we couldn't decide, we
might cluster on message_id, or not at all.)

The crucial point is that we are making this one index run really fast
at the expense of other operations.  The other major point is that we
can use a sequence in place of a timestamp for ordering.  Using int4 vs.
timestamp is a minor efficiency win; if you are worried about > 4B rows,
stick with timestamp.

This all boils down to a central unifying principle: organize your
indices around your expected access patterns to the data.  Sorry if I'm
bleating on and on about this...I just think there is plenty of
optimization room left in there :)

Merlin




Re: [PERFORM] Hardware upgrade for a high-traffic database

2004-08-12 Thread Tom Lane
"Merlin Moncure" <[EMAIL PROTECTED]> writes:
> The following suggestion works on two principles: one is that instead of
> using timestamps for ordering, integers are quicker,

The difference would be pretty marginal --- especially if you choose to
use bigints instead of ints.  (A timestamp is just a float8 or bigint
under the hood, and is no more expensive to compare than those datatypes.
Timestamps *are* expensive to convert for I/O, but comparison does not
have to do that.)  I wouldn't recommend kluging up your data schema just
for that.

regards, tom lane



Re: [PERFORM] [GENERAL] How to know which queries are to be optimised?

2004-08-12 Thread Christopher Kings-Lynne
> But is there a tool that could compile a summary out of the log? The log
> grows awfully big after a short time.

Actually, yes there is.  Check out www.pgfoundry.org.  I think it's called
pqa or Postgres Query Analyzer or something.

Chris




Re: [PERFORM] [GENERAL] How to know which queries are to be optimised?

2004-08-12 Thread Christopher Kings-Lynne
> I do a vacuum full analyze every night.
> How can I see if my FSM setting is appropriate?

On a busy website, run vacuum analyze once an hour, or even better, use
contrib/pg_autovacuum.
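
To gauge the FSM, a database-wide vacuum in verbose mode (7.4) prints a
free space map summary at the end -- pages needed vs. allocated -- that
you can compare against max_fsm_pages and max_fsm_relations.  From
memory, so check the output on your version:

vacuum verbose;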

Chris





Re: [PERFORM] Hardware upgrade for a high-traffic database

2004-08-12 Thread Merlin Moncure
> This example looks fine, but since userid 51 evidently only has 35
> posts, there's not much time needed to read 'em all and sort 'em.  The
> place where the double-column index will win big is on userids with
> hundreds of posts.
> 
> You have to keep in mind that each index costs time to maintain during
> inserts/updates.  So adding an index just because it makes a few queries
> a little faster probably isn't a win.  You need to make tradeoffs.

IMNSHO, in Jason's case he needs to do everything possible to get his
frequently run queries going as quickly as possible.  ISTM he can give
up a little on the update side, especially since he is running
fsync=false.  A .3-.5 sec query multiplied over 50-100 users running
concurrently adds up quickly.  Ideally, you are looking up records based
on a key that takes you directly to the first record you want and then
points to the next records you need in ascending order.  I can't stress
enough how important this is, so long as you can deal with the
index/update overhead.

I don't have a huge amount of experience with this in pg, but one of the
tricks we do in the ISAM world is a 'reverse date' system, so that you
can scan forwards on the key to pick up datetimes in descending order.
This is often a win because the o/s cache may assume read/forwards
giving you more cache hits.   There are a few different ways to do this,
but imagine:

create table t
(
id int,
ts  timestamp default now(),
iv  interval  default ('01/01/2050'::timestamp - now())
);

create index t_idx on t(id, iv);
select * from t where id = k order by id, iv limit 5;

The above query should do a much better job pulling up data and should
be easier on your cache.  A further win might be to cluster the table on
this key if the table is really big.

Note: interval is a poor type to do this with, because it's a 12-byte
type (it's just used here for demonstration purposes because it's easy).
With a little trickery you can stuff it into a time type or an int4 type
(even better!).  If you want to be really clever you can do it without
adding any data to your table at all, through functional indexes.
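
For example, something along these lines (untested; relies on 7.4-style
expression indexes, and the planner has to see the identical expression
in the ORDER BY):

create index t_rev_idx on t (id, ('01/01/2050'::timestamp - ts));

select * from t where id = k
order by id, ('01/01/2050'::timestamp - ts) limit 5;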

Since the planner can use the same index for both the extraction and the
ordering, you get some savings...not much, but worthwhile when applied
over a lot of users.  Knowing when and how to apply multi-column and
functional indexes will make you feel like you have 10 times the
database you are using right now.

Merlin



Re: [PERFORM] [GENERAL] How to know which queries are to be optimised?

2004-08-12 Thread Richard Huxton
Ulrich Wisser wrote:
You can log queries that run for at least a specified amount of time.
This will be useful in finding what the long running queries are.
You can then use explain analyse to see why they are long running.
But is there a tool that could compile a summary out of the log? The log 
grows awfully big after a short time.
You might want to look at the "Practical Query Analyser" - haven't used 
it myself yet, but it seems a sensible idea.

http://pqa.projects.postgresql.org/
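
For the logging itself, the relevant postgresql.conf setting (from
memory, on 7.4) is:

  log_min_duration_statement = 1000   # milliseconds; -1 disables, 0 logs all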
--
  Richard Huxton
  Archonet Ltd


Re: [PERFORM] [GENERAL] How to know which queries are to be optimised?

2004-08-12 Thread Ulrich Wisser
Hi Bruno,
my web application grows slower and slower over time. After some 
profiling I came to the conclusion that my SQL queries are where most of 
the time goes (25 seconds). Obviously I need to optimise my queries and 
maybe introduce some new indexes.
This sounds like you aren't doing proper maintenance. You need to be
vacuuming with a large enough FSM setting.
I do a vacuum full analyze every night.
How can I see if my FSM setting is appropriate?
The problem is that my application uses dynamic queries. I therefore 
cannot determine what the most common queries are.

I have used the PostgreSQL logging option before. Is there a tool to 
analyze the logfile for the most common and/or most time-consuming queries?

You can log queries that run for at least a specified amount of time.
This will be useful in finding what the long running queries are.
You can then use explain analyse to see why they are long running.
But is there a tool that could compile a summary out of the log? The log 
grows awfully big after a short time.

Thanks
/Ulrich

