Re: [PERFORM] inserting into brand new database faster than old database

2004-07-08 Thread Shridhar Daithankar
Missner, T. R. wrote:
Hello,
I have been a happy postgresql developer for a few years now.  Recently
I have discovered a very strange phenomenon in regards to inserting
rows.
My app inserts millions of records a day, averaging about 30 rows a
second. I use autovac to make sure my stats and indexes are up to date.
Rarely are rows ever deleted.  Each day a brand new set of tables is
created and eventually the old tables are dropped. The app calls
functions which based on some simple logic perform the correct inserts.
Have you profiled where the time goes in a brand new schema and a degraded 
database? Is it IO? Is it CPU? Is the function that makes the decisions becoming 
the bottleneck?

The problem I am seeing is that after a particular database gets kinda
old, say a couple of months, performance begins to degrade.  Even after
creating brand new tables my insert speed is slow in comparison ( by a
magnitude of 5 or more ) with a brand new schema which has the exact
same tables.  I am running on an IBM 360 dual processor Linux server
with a 100 gig raid array spanning 5 scsi disks.  The machine has 1 gig
of ram of which 500 meg is dedicated to Postgresql.
Just to be clear, the question I have is why would a brand new db schema
allow inserts faster than an older schema with brand new tables?  Since
the tables are empty to start, vacuuming should not be an issue at all.
Each schema is identical in every way except the db name and creation
date.
You can do a few things.
- Get an explain analyze. See the difference between actual and projected timings. 
The difference is a hint about where the planner is going wrong.

- Is IO your bottleneck? Are vacuums taking longer and longer? If yes, then you 
could try the vacuum delay patch. If your IO is saturated for any reason, 
everything is going to crawl.

- Are your indexes bloat-free? If you are using pre-7.x, vacuum does not clean up 
indexes. You need to reindex.

- Have you vacuumed the complete database? If the catalogs collect dead space, it 
could cause degradation too, but that is just a guess.

Basically monitor slow inserts and try to find out where time is spent.
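As a rough sketch of those checks (the table, column and index names below are 
placeholders, not from your schema):

  -- Planned vs. actual timings for a representative insert;
  -- the ROLLBACK keeps the test row out of the table
  BEGIN;
  EXPLAIN ANALYZE INSERT INTO my_table (id, payload) VALUES (1, 'x');
  ROLLBACK;

  -- Vacuum and re-analyze the whole database, system catalogs included
  VACUUM ANALYZE;

  -- Rebuild a possibly bloated index (pre-7.4 vacuum does not shrink indexes)
  REINDEX INDEX my_table_pkey;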
HTH
 Shridhar
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [PERFORM] Terrible performance after deleting/recreating indexes

2004-07-08 Thread Shridhar Daithankar
Bill Chandler wrote:
Hi,
Using PostgreSQL 7.4.2 on Solaris.  I'm trying to 
improve performance on some queries to my databases so
I wanted to try out various index structures.  

Since I'm going to be running my performance tests
repeatedly, I created some SQL scripts to delete and 
recreate various index configurations.  One of the
scripts contains the commands for recreating the 
'original' index configuration (i.e. the one I've 
already got some numbers for).  Only thing is now
when I delete and recreate the original indexes then
run the query, I'm finding the performance has gone
completely down the tubes compared to what I 
originally had.  A query that used to take 5 minutes
to complete now takes hours to complete.

For what it's worth my query looks something like:
select * from tbl_1, tbl_2 where tbl_1.id = tbl_2.id
and tbl_2.name like 'x%y%' and tbl_1.x > 1234567890123
order by tbl_1.x;
tbl_1 is very big (> 2 million rows)
tbl_2 is relatively small (7000 or so rows)
tbl_1.x is a numeric(13)
tbl_1.id and tbl_2.id are integers
tbl_2.name is a varchar(64)
I've run 'VACUUM ANALYZE' on both tables involved in
the query.  I also used 'EXPLAIN' and observed that
the query plan is completely changed from what it 
was originally.  
Get an explain analyze. That gives actual vs. planned time spent. See what is 
causing the difference. A discrepancy between planned and actual rows is usually 
an indication of out-of-date stats.

Which are the indexes on these tables? You should list fields with indexes first 
in the where clause. Also list the most selective field first so that it eliminates as 
many rows as possible in the first scan.

I hope you have read the tuning articles on varlena.com and applied some basic 
tuning.

And post the table schema, hardware config, postgresql config (important ones of 
course) and explain analyze for queries. That would be something to start with.
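A minimal sketch of getting that output, using the query from your mail:

  -- EXPLAIN ANALYZE runs the query and reports planned vs. actual rows and timings
  EXPLAIN ANALYZE
  SELECT *
    FROM tbl_1, tbl_2
   WHERE tbl_1.id = tbl_2.id
     AND tbl_2.name LIKE 'x%y%'
     AND tbl_1.x > 1234567890123
   ORDER BY tbl_1.x;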

 Shridhar
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send unregister YourEmailAddressHere to [EMAIL PROTECTED])


[PERFORM] Odd sorting behaviour

2004-07-08 Thread Steinar H. Gunderson
[Apologies if this reaches the list twice -- I sent a copy before
 subscribing, but it seems to be stuck waiting for listmaster forever, so I
 subscribed and sent it again.]

Hi,

I'm trying to find out why one of my queries is so slow -- I'm primarily
using PostgreSQL 7.2 (Debian stable), but I don't really get much better
performance with 7.4 (Debian unstable). My prototype table looks like this:

  CREATE TABLE opinions (
prodid INTEGER NOT NULL,
uid INTEGER NOT NULL,
opinion INTEGER NOT NULL,
PRIMARY KEY ( prodid, uid )
  );

In addition, there are separate indexes on prodid and uid. I've run VACUUM
ANALYZE before all queries, and they are repeatable. (If anybody needs the
data, that could be arranged -- it's not secret or anything :-) ) My query
looks like this:

EXPLAIN ANALYZE
  SELECT o3.prodid, SUM(o3.opinion*o12.correlation) AS total_correlation FROM opinions 
o3
  RIGHT JOIN (
SELECT o2.uid, SUM(o1.opinion*o2.opinion)/SQRT(count(*)+0.0) AS correlation
FROM opinions o1 LEFT JOIN opinions o2 ON o1.prodid=o2.prodid
WHERE o1.uid=1355
GROUP BY o2.uid
  ) o12 ON o3.uid=o12.uid
  LEFT JOIN (
SELECT o4.prodid, COUNT(*) as num_my_comments
FROM opinions o4
WHERE o4.uid=1355
GROUP BY o4.prodid
  ) nmc ON o3.prodid=nmc.prodid
  WHERE nmc.num_my_comments IS NULL AND o3.opinion<>0 AND o12.correlation<>0
  GROUP BY o3.prodid
  ORDER BY total_correlation desc;

And produces the query plan at

  http://www.samfundet.no/~sesse/queryplan.txt

(The lines were a bit too long to include in an e-mail :-) ) Note that the
o3.opinion<>0 AND o12.correlation<>0 lines are an optimization; I can run
the query fine without them and it will produce the same results, but it
goes slower both in 7.2 and 7.4.

There are a few oddities here:

- The subquery scan o12 phase outputs 1186 rows, yet 83792 are sorted. Where
  do the other ~82000 rows come from? And why would it take ~100ms to sort the
  rows at all? (In earlier tests, this was _one full second_ but somehow that
  seems to have improved, yet without really improving the overall query time.
  shared_buffers is 4096 and sort_mem is 16384, so it should really fit into
  RAM.)
- Why does it use uid_index for an index scan on the table, when it obviously
  has no filter on it (since it returns all the rows)? Furthermore, why would
  this take half a second? (The machine is a 950MHz machine with SCSI disks.)
- Also, the outer sort (the sorting of the 58792 rows from the merge join)
  is slow. :-)

7.4 isn't really much better:

  http://www.samfundet.no/~sesse/queryplan74.txt

Note that this is run on a machine with almost twice the speed (in terms of
CPU speed, at least). The same oddities are mostly present (such as o12
returning 1186 rows, but 58788 rows are sorted), so I really don't understand
what's going on here. Any ideas on how to improve this?

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] finding a max value

2004-07-08 Thread James Antill
Edoardo Ceccarelli [EMAIL PROTECTED] writes:

 This is the query:
 select max(KA) from annuncio

 field KA is indexed and is int4,

 explaining gives:
 explain select max(KA) from annuncio;
                                QUERY PLAN
 -------------------------------------------------------------------------
  Aggregate  (cost=21173.70..21173.70 rows=1 width=4)
    ->  Seq Scan on annuncio  (cost=0.00..20326.76 rows=338776 width=4)
 (2 rows)


 Wasn't it supposed to do an index scan? It takes about 1 sec to get the result.

 This is a known misfeature of max() in postgresql, see...

http://archives.postgresql.org/pgsql-performance/2003-12/msg00283.php
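The usual workaround is to rewrite the aggregate as an ORDER BY ... LIMIT 1,
which lets the planner walk the index on KA backwards instead of scanning the
whole table:

  -- Same result as max(KA), but can use the index on KA
  SELECT KA FROM annuncio ORDER BY KA DESC LIMIT 1;

  -- If KA can contain NULLs, filter them out so the result matches max(),
  -- which ignores NULLs:
  SELECT KA FROM annuncio WHERE KA IS NOT NULL ORDER BY KA DESC LIMIT 1;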

-- 
# James Antill -- [EMAIL PROTECTED]
:0:
* ^From: [EMAIL PROTECTED]
/dev/null

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] query plan wierdness?

2004-07-08 Thread Guido Barosio
The limit is tricking you.
I guess a sequential scan is cheaper than an index scan with the limit 26 found there.

Am I wrong?

Greets

-- 
---
Guido Barosio
Buenos Aires, Argentina
---


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] query plan wierdness?

2004-07-08 Thread Rod Taylor

 However, this query performs a sequence scan on the table, ignoring the
 call_idx13 index (the only difference is the addition of the aspid field
 in the order by clause):

You do not have an index which matches the ORDER BY, so PostgreSQL
cannot simply scan the index for the data you want. Thus it needs to
find all matching rows, order them, etc.

 23:59:59.999' order by aspid, openeddatetime desc, callstatus desc,
 calltype desc, callkey desc limit 26;

aspid ASC, openeddatetime DESC, callstatus DESC, calltype DESC

  call_idx13 btree (aspid, openeddatetime, callstatus, calltype,
 callkey),

This index is: aspid ASC, openeddatetime ASC, callstatus ASC, calltype
ASC, callkey ASC

A reverse scan would, of course, be DESC, DESC, DESC, DESC, DESC --
neither of which matches your requested order by, and thus cannot help
reduce the lines looked at to 26.

This leaves your WHERE clause to restrict the dataset and it doesn't do
a very good job of it. There are more than 45 rows matching the
where clause, which means the sequential scan was probably the right
choice (unless you have over 10 million entries in the table).


Since your WHERE clause contains a single aspid, an improvement to the
PostgreSQL optimizer may be to ignore that field in the ORDER BY as
order is no longer important since there is only one possible value. If
it did ignore aspid, it would use a plan similar to the first one you
provided.

You can accomplish the same thing by leaving out aspid ASC OR by setting
it to aspid DESC in the ORDER BY. Leaving it out entirely will be
slightly faster, but DESC will cause PostgreSQL to use index
call_idx13.
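For illustration, a sketch using the columns of call_idx13 (the table name,
aspid value and date range below are placeholders):

  -- Every sort key DESC matches a backward scan of call_idx13
  -- (aspid, openeddatetime, callstatus, calltype, callkey)
  SELECT *
    FROM calls
   WHERE aspid = 'XYZ'
     AND openeddatetime BETWEEN '2004-01-01 00:00:00.0'
                            AND '2004-06-30 23:59:59.999'
   ORDER BY aspid DESC, openeddatetime DESC, callstatus DESC,
            calltype DESC, callkey DESC
   LIMIT 26;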



---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] vacuum_mem

2004-07-08 Thread Litao Wu
Hi,

I tested the vacuum_mem setting on a 
4-CPU, 4 GB RAM machine. I am the only person 
on that machine.

The table:
 tablename |  size_kb  |   reltuples
-----------+-----------+--------------
 big_t     |  2048392  |  7.51515e+06

Case 1:
1. vacuum full big_t;
2. begin;
   update big_t set email = lpad('a', 255, 'b');
   rollback;
3. set vacuum_mem=655360; -- 640M
4. vacuum big_t;
It takes 1415,375 ms
Also from top, the max SIZE is 615M while 
SHARE is always 566M

  PID USER PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM
  TIME COMMAND
5914 postgres  16   0   615M 615M  566M D 7.5 15.8
 21:21 postgres: postgres mydb xxx.xxx.xxx.xxx:34361
VACUUM

Case 2:
1. vacuum full big_t;
2. begin;
   update big_t set email = lpad('a', 255, 'b');
   rollback;
3. set vacuum_mem=65536; -- 64M
4. vacuum big_t;
It takes 1297,798 ms
Also from top, the max SIZE is 615M while 
SHARE is always 566M

  PID USER PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM
  TIME COMMAND
 3613 postgres  15   0  615M 615M  566M D17.1 15.8
  9:04 postgres: postgres mydb xxx.xxx.xxx.xxx:34365
VACUUM

It seems vacuum_mem does not have a performance 
effect at all.

In reality, we vacuum nightly and I want to find out 
which vacuum_mem value is the 
best to shorten vacuum time.

Any thoughts?

Thanks,




---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] vacuum_mem

2004-07-08 Thread Rod Taylor
 It seems vacuum_mem does not have a performance 
 effect at all.

Wrong conclusion. It implies that your test case takes less than 64M of
memory to track your removed tuples. I think it takes 8 bytes to track a
tuple for vacuuming an index, which means it should be able to track
roughly 8 million deletions. Since your demonstration had about 7.5 million
rows for removal, it's under the limit.

Try your test again with 32MB; it should make a single sequential pass
on the table, and 2 passes on each index for that table.

Either that, or do a few more aborted updates.
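A sketch of that retest (vacuum_mem is in kilobytes, and VERBOSE reports the
passes made over each index):

  -- At ~8 bytes per dead tuple, 32 MB tracks roughly 4 million tuples,
  -- fewer than the ~7.5 million updated and rolled back, so VACUUM VERBOSE
  -- should show two passes over each index of big_t
  SET vacuum_mem = 32768;
  VACUUM VERBOSE big_t;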



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Terrible performance after deleting/recreating indexes

2004-07-08 Thread Bill Chandler
Thanks for the advice.

On further review it appears I am only getting this 
performance degradation when I run the command via
a JDBC app.  If I do the exact same query from
psql, the performance is fine.  I've tried both the
JDBC2 and JDBC3 jars.  Same results.

It definitely seems to correspond to deleting and
recreating the indexes, though.  The same query thru
JDBC worked fine before recreating the indexes.  

Does that make any sense at all?

thanks

Bill

--- Shridhar Daithankar [EMAIL PROTECTED]
wrote:
 Bill Chandler wrote:
 
  Hi,
  
  Using PostgreSQL 7.4.2 on Solaris.  I'm trying to 
  improve performance on some queries to my
 databases so
  I wanted to try out various index structures.  
  
  Since I'm going to be running my performance tests
  repeatedly, I created some SQL scripts to delete
 and 
  recreate various index configurations.  One of the
  scripts contains the commands for recreating the 
  'original' index configuration (i.e. the one I've 
  already got some numbers for).  Only thing is now
  when I delete and recreate the original indexes
 then
  run the query, I'm finding the performance has
 gone
  completely down the tubes compared to what I 
  originally had.  A query that used to take 5
 minutes
  to complete now takes hours to complete.
  
  For what it's worth my query looks something like:
  
  select * from tbl_1, tbl_2 where tbl_1.id =
 tbl_2.id
  and tbl_2.name like 'x%y%' and tbl_1.x >
 1234567890123
  order by tbl_1.x;
  
  tbl_1 is very big (> 2 million rows)
  tbl_2 is relatively small (7000 or so rows)
  tbl_1.x is a numeric(13)
  tbl_1.id and tbl_2.id are integers
  tbl_2.name is a varchar(64)
  
  I've run 'VACUUM ANALYZE' on both tables involved
 in
  the query.  I also used 'EXPLAIN' and observed
 that
  the query plan is completely changed from what it 
  was originally.  
 
 Get an explain analyze. That gives actual vs.
 planned time spent. See what is 
 causing the difference. A discrepancy between
 planned and actual rows is usually 
 an indication of out-of-date stats.
 
 
 Which are the indexes on these tables? You should
 list fields with indexes first 
 in the where clause. Also list the most selective field
 first so that it eliminates as 
 many rows as possible in the first scan.
 
 
 I hope you have read the tuning articles on
 varlena.com and applied some basic 
 tuning.
 
 And post the table schema, hardware config,
 postgresql config (important ones of 
 course) and explain analyze for queries. That would
 be something to start with.
 
   Shridhar
 
 ---(end of
 broadcast)---
 TIP 2: you can get off all lists at once with the
 unregister command
 (send unregister YourEmailAddressHere to
 [EMAIL PROTECTED])
 





---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Terrible performance after deleting/recreating indexes

2004-07-08 Thread Mark Kirkwood
That is interesting - both psql and JDBC merely submit statements for 
the backend to process, so generally you would expect no difference in 
execution plan or performance.

It might be worth setting log_statement=true in postgresql.conf and 
checking that you are executing *exactly* the same statement in both 
JDBC and psql.
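A sketch of that (log_statement is a plain boolean in 7.4 and takes effect
after a configuration reload):

  # postgresql.conf
  log_statement = true    # log every statement the backends receive

Comparing the statements logged during the JDBC run against the one issued
from psql should show whether the driver is sending different SQL text (or
extra SET commands) that could account for the different plan.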

regards
Mark
P.S.: let's see the output from EXPLAIN ANALYZE :-)
Bill Chandler wrote:
Thanks for the advice.
On further review it appears I am only getting this 
performance degradation when I run the command via
a JDBC app.  If I do the exact same query from
psql, the performance is fine.  I've tried both the
JDBC2 and JDBC3 jars.  Same results.

 

 
 

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] Working on huge RAM based datasets

2004-07-08 Thread Andy Ballingall
Hi,

I'm really stuck and I wonder if any of you could help.

I have an application which will be sitting on a quite large database
(roughly 8-16GB). The nature of the application is such that, on a
second-by-second basis, the working set of the database is likely to be a
substantial portion (e.g. between 50 and 70%) of the data - just imagine an
almost stochastic sampling of the data in each table, and you'll get an idea.
Potentially quite smelly.

To start with, I thought: no problem. Just configure a DB server with an
obscene amount of RAM (e.g. 64GB), and configure PG with a shared buffer
cache that is big enough to hold every page of data in the database, plus
10% or whatever to allow for a bit of room, ensuring that there is enough
RAM in the box so that all the backend processes can do their thing, and
all the other services can do their thing, and the swap system on the host
remains idle.

Apparently not :(

I've read a number of places now saying that the PG cache has an optimal
size which isn't as big as you can make it without affecting other stuff on
the machine.

The suggestion is to let linux take the strain for the lion's share of the
caching (using its buffer cache), and just make the PG cache big enough to
hold the data it needs for individual queries.
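(For what it's worth, that advice translates into something like the
following 7.x-era settings - the numbers below are placeholders to show the
shape, not recommendations:)

  # postgresql.conf - shared_buffers and effective_cache_size are in 8 kB pages
  shared_buffers = 10000           # ~80 MB managed by PostgreSQL itself
  effective_cache_size = 786432    # tell the planner ~6 GB of OS cache is available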

___

Ignoring for a moment the problem of answering the question 'so how big
shall I make the PG cache?', and ignoring the possibility that as the
database content changes over the months this answer will need updating from
time to time for optimal performance, does anyone have any actual experience
with trying to maintain a large, mainly RAM resident database?

What is it about the buffer cache that makes it so unhappy being able to
hold everything? I don't want to be seen as a cache hit fascist, but isn't
it just better if the data is just *there*, available in the postmaster's
address space ready for each backend process to access it, rather than
expecting the Linux cache mechanism, optimised as it may be, to have to do
the caching?

Is it that the PG cache entries are accessed through a 'not particularly
optimal for large numbers of tuples' type of strategy? (Optimal though it
might be for more modest numbers).

And on a more general note, with the advent of 64 bit addressing and rising
RAM sizes, won't there, with time, be more and more DB applications that
would want to capitalise on the potential speed improvements that come with
not having to work hard to get the right bits in the right bit of memory all
the time?

And finally, am I worrying too much, and actually this problem is common to
all databases?

Thanks for reading,

Andy






---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Working on huge RAM based datasets

2004-07-08 Thread Christopher Kings-Lynne
What is it about the buffer cache that makes it so unhappy being able to
hold everything? I don't want to be seen as a cache hit fascist, but isn't
it just better if the data is just *there*, available in the postmaster's
address space ready for each backend process to access it, rather than
expecting the Linux cache mechanism, optimised as it may be, to have to do
the caching?
Because the PostgreSQL buffer management algorithms are pitiful compared 
to Linux's.  In 7.5, it's improved with the new ARC algorithm, but still 
- the Linux disk buffer cache will be very fast.

Chris
---(end of broadcast)---
TIP 8: explain analyze is your friend