[PERFORM] NOT IN query takes forever

2004-08-03 Thread Marius Andreiana
Hi

I have 2 tables like this:
CREATE TABLE query (
    query_id       int          not null,
    dat            varchar(64)  null,
    sub_acc_id     int          null,
    query_ip       varchar(64)  null,
    osd_user_type  varchar(64)  null
);

CREATE TABLE trans (
    transaction_id  varchar(64)  not null,
    date            varchar(64)  null,
    query_id        int          not null,
    sub_acc_id      int          null,
    reg_acc_id      int          null
);

CREATE UNIQUE INDEX query_query_id_idx
ON query (query_id)
;

CREATE INDEX trans_reg_acc_id_idx
ON trans (reg_acc_id)
;

CREATE INDEX trans_query_id_idx
ON trans(query_id)
;
osd=> select count(*) from trans;
 count
--------
 598809
(1 row)

osd=> select count(*) from query;
 count
--------
 137042
(1 row)

I just vacuum analyse'd the database. 

Trying to run this query:
EXPLAIN ANALYSE
select * FROM trans
WHERE query_id NOT IN (select query_id FROM query)

but it just runs forever (I cancelled it after 30 minutes).

My postgresql.conf is the default:
# - Memory -
 
shared_buffers = 1000   # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024# min 64, size in KB
#vacuum_mem = 8192  # min 1024, size in KB

Should I adjust something?

Using PostgreSQL 7.4.2; I saw in the release notes that IN/NOT IN queries are
now at least as fast as EXISTS.

Thank you!
-- 
Marius Andreiana
Galuna - Solutii Linux in Romania
http://www.galuna.ro




[PERFORM] pg_autovacuum parameters

2004-08-03 Thread Lending, Rune



Hello all.

I am managing a large database with lots of transactions in different tables.
The largest tables have around 5-6 million tuples and around 5-6 inserts and
maybe 2 updates per day. The smallest tables have only a few tuples and a few
updates/inserts per day. In addition we have small tables with many
updates/inserts. So what I am saying is that there are all kinds of tables and
uses of tables in our database. This, I think, makes it difficult to set up
pg_autovacuum. I am now running vacuum jobs on different tables in cron.

What things should I consider when setting both base and threshold values in
pg_autovacuum? Since the running of vacuum and analyze is relative to the
table size, as it must be, I think it is difficult to cover all tables.

Does anyone have some thoughts on this?

Regards
Rune



Re: [PERFORM] pg_autovacuum parameters

2004-08-03 Thread Matthew T. O'Connor
Lending, Rune wrote:
> Hello all.
>
> I am managing a large database with lots of transactions in different
> tables. The largest tables have around 5-6 million tuples and around
> 5-6 inserts and maybe 2 updates per day. The smallest tables have only
> a few tuples and a few updates/inserts per day. In addition we have
> small tables with many updates/inserts. So what I am saying is that
> there are all kinds of tables and uses of tables in our database.
> This, I think, makes it difficult to set up pg_autovacuum. I am now
> running vacuum jobs on different tables in cron.
>
> What things should I consider when setting both base and threshold
> values in pg_autovacuum? Since the running of vacuum and analyze is
> relative to the table size, as it must be, I think it is difficult to
> cover all tables.
One of the biggest problems with the version of pg_autovacuum in 7.4
contrib is that you can only specify one set of thresholds, which often
isn't flexible enough.  That said, the thresholds are based on table
size, since you specify both a base value and a scaling factor: so
pg_autovacuum -v 100 -V 1 will vacuum a table with 100 rows every 200
updates, but will vacuum a table with 1,000,000 rows every 1,000,100
updates.
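
In other words (a sketch of the arithmetic, assuming the 7.4 contrib formula
vacuum threshold = base value + scaling factor * reltuples):

  100-row table:        100 + 1 *       100 =       200 updates before a vacuum
  1,000,000-row table:  100 + 1 * 1,000,000 = 1,000,100 updates before a vacuum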

> Does anyone have some thoughts on this?
Basically, you should be able to use pg_autovacuum to do most of the
vacuuming; if there are a few tables that aren't getting vacuumed often
enough, then you can add a vacuum command to cron for those specific tables.

Matthew



Re: [PERFORM] NOT IN query takes forever

2004-08-03 Thread Tom Lane
Merlin Moncure [EMAIL PROTECTED] writes:
> Try bumping up shared buffers some and sort mem as much as you safely
> can.

sort_mem is probably the issue here.  The only reasonable way to do NOT
IN is with a hash table, and the default setting of sort_mem is probably
too small to support a 137042-element table.
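
For example, one way to test this is to raise sort_mem for just the one
session before re-running the query (a sketch; the 32 MB figure is only an
illustrative value, not a recommendation for the global setting):

SET sort_mem = 32768;   -- 32 MB, for this connection only
EXPLAIN ANALYZE
SELECT * FROM trans
WHERE query_id NOT IN (SELECT query_id FROM query);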

regards, tom lane



Re: [PERFORM] NOT IN query takes forever

2004-08-03 Thread Marius Andreiana
On Tue, 2004-08-03 at 08:05 -0400, Merlin Moncure wrote:
> > Trying to run this query:
> > EXPLAIN ANALYSE
> > select * FROM trans
> > WHERE query_id NOT IN (select query_id FROM query)
> >
> > but it will remain like that forever (cancelled after 30 min).
>
> explain analyze actually runs the query to do timings.  Just run explain
> and see what you come up with.  More than likely there is a nestloop in
> there which is causing the long query time.
>
> Try bumping up shared buffers some and sort mem as much as you safely
> can.
Thank you, that did it!

With
shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
sort_mem = 128000   # min 64, size in KB

it takes 3 seconds (my hardware is not server-class).

-- 
Marius Andreiana
Galuna - Solutii Linux in Romania
http://www.galuna.ro




Re: [PERFORM] NOT IN query takes forever

2004-08-03 Thread Merlin Moncure
> > Try bumping up shared buffers some and sort mem as much as you safely
> > can.
> Thank you, that did it!
>
> With
> shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
> sort_mem = 128000   # min 64, size in KB
>
> it takes 3 seconds (my hardware is not server-class).

Be careful...sort_mem applies to each connection and (IIRC) in some
cases more than once to a connection.  Of all the configuration
parameters, sort_mem (IMO) is the most important and the hardest to get
right.  128k (or 128MB) is awfully high unless you have a ton of memory
(you don't) or you are running in single connection scenarios.  Do some
experimentation by lowering the value until you get a good balance
between potential memory consumption and speed.

Merlin



Re: [PERFORM] pg_autovacuum parameters

2004-08-03 Thread Gaetano Mendola
Matthew T. O'Connor wrote:
> One of the biggest problems with the version of pg_autovacuum in 7.4
> contrib is that you can only specify one set of thresholds, which often
> isn't flexible enough.  That said, the thresholds are based on table
> size, since you specify both a base value and a scaling factor: so
> pg_autovacuum -v 100 -V 1 will vacuum a table with 100 rows every 200
> updates, but will vacuum a table with 1,000,000 rows every 1,000,100
> updates.
>
> Basically, you should be able to use pg_autovacuum to do most of the
> vacuuming; if there are a few tables that aren't getting vacuumed often
> enough, then you can add a vacuum command to cron for those specific
> tables.
And in version 7.5^H^H^H8.0 ( Tom Lane docet :-) ) I think it will be possible
to specify those thresholds per table...

Regards
Gaetano Mendola




[PERFORM] Performance Bottleneck

2004-08-03 Thread Martin Foster
I run a Perl/CGI driven website that makes extensive use of PostgreSQL
(7.4.3) for everything from user information to formatting and display
of specific sections of the site.   The server itself is a dual-processor
AMD Opteron 1.4GHz with 2GB RAM and 2 x 120GB hard drives mirrored for
redundancy, running under FreeBSD 5.2.1 (AMD64).

Recently, loads on the site have increased during peak hours to the point
of showing considerable loss in performance.   This can be observed
when connections move from the 120 concurrent connections to PostgreSQL
to roughly 175 or more.   Essentially, the machine seems to struggle
to keep up with continual requests and slows down accordingly as
resources are tied down.

Code changes have been made to the scripts to essentially back off in
high-load working environments, which has worked to an extent.
However, as loads continue to increase, the database itself is not taking
well to the increased traffic taking place.

I have taken a look at 'Tuning PostgreSQL for Performance'
(http://www.varlena.com/GeneralBits/Tidbits/perf.html) and used it as best
I could to set my settings.   However, even with statistics disabled and
every setting tweaked, things still continue to deteriorate.

Is there anything anyone can recommend in order to give the system a
necessary speed boost?   It would seem to me that a modest dataset of
roughly a gig combined with that type of hardware should be able to
handle substantially more load than it is.   Can anyone provide me
with clues as to where to pursue this?   Would disabling 'fsync' provide
more performance if I accept that information may be lost in case of a crash?

If anyone needs access to logs, settings, et cetera, please ask; I
simply wish to test the waters first on what is needed.   Thanks!

Martin Foster
[EMAIL PROTECTED]


Re: [PERFORM] NOT IN query takes forever

2004-08-03 Thread Scott Marlowe
On Tue, 2004-08-03 at 10:10, Merlin Moncure wrote:
> > > Try bumping up shared buffers some and sort mem as much as you safely
> > > can.
> > Thank you, that did it!
> >
> > With
> > shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
> > sort_mem = 128000   # min 64, size in KB
> >
> > it takes 3 seconds (my hardware is not server-class).
>
> Be careful...sort_mem applies to each connection and (IIRC) in some
> cases more than once to a connection.  Of all the configuration
> parameters, sort_mem (IMO) is the most important and the hardest to get
> right.  128k (or 128MB) is awfully high unless you have a ton of memory
> (you don't) or you are running in single connection scenarios.  Do some
> experimentation by lowering the value until you get a good balance
> between potential memory consumption and speed.

Minor nit: sort_mem actually applies to EACH sort individually, so a
query that had to run three sorts could use 3 x sort_mem.

Note that one can set sort_mem per backend connection with
SET sort_mem = 128000 if need be, so as not to use up all the memory
with other backends.
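
For example (a sketch): a single batch job that needs the large hash could
raise the setting just for its own backend and then put it back, while the
global default in postgresql.conf stays modest:

SET sort_mem = 128000;   -- applies only to this backend
-- ... run the big NOT IN query here ...
RESET sort_mem;          -- return to the configured default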




Re: [PERFORM] Performance Bottleneck

2004-08-03 Thread Dennis Bjorklund
On Tue, 3 Aug 2004, Martin Foster wrote:

> to roughly 175 or more. Essentially, the machine seems to struggle
> to keep up with continual requests and slows down respectively as
> resources are tied down.

I suggest you try to find queries that are slow and check to see if the 
plans are optimal for those queries.

There are some logging options for logging queries that run longer than a
user-set limit. That can help in finding the slow queries. Just doing some
logging for some typical page fetches often shows things that can be done
better. For example, it's not uncommon to see the same information being
pulled several times by mistake.
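
For example, something like this in postgresql.conf would be one way to do it
(a sketch; check the 7.4 documentation for the exact parameter in your version):

log_min_duration_statement = 1000   # log any statement running longer than 1000 ms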

Maybe you can also try something like connection pooling. I'm not sure how
much that will give you, but for small queries the connection setup time is
usually the big part.

> Would disabling 'fsync' provide more performance if I choose that
> information may be lost in case of a crash?

I would not do that. In most cases the performance increase is modest and
the data corruption risk after a crash is much bigger so it's not worth
it.

If you have a lot of small inserts then it might be faster with fsync off, but
if possible it's much better to try to do more work per transaction than
before.

-- 
/Dennis Björklund




Re: [PERFORM] Performance Bottleneck

2004-08-03 Thread Joshua D. Drake
Hello,
It sounds to me like you are I/O bound. 2 x 120GB hard drives just aren't
going to cut it with that many connections (as a general rule). Are you
swapping?

Sincerely,
Joshua D. Drake


Martin Foster wrote:
> I run a Perl/CGI driven website that makes extensive use of PostgreSQL
> (7.4.3) for everything from user information to formatting and display
> of specific sections of the site.   The server itself is a dual-processor
> AMD Opteron 1.4GHz with 2GB RAM and 2 x 120GB hard drives mirrored for
> redundancy, running under FreeBSD 5.2.1 (AMD64).
--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL




Re: [PERFORM] NOT IN query takes forever

2004-08-03 Thread Gaetano Mendola
Marius Andreiana wrote:
> Thank you, that did it!
>
> With
> shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
> sort_mem = 128000   # min 64, size in KB
>
> it takes 3 seconds (my hardware is not server-class).
128 MB for sort_mem is too much; consider that this way each backend can
use 128 MB for sort operations...
Also, shared_buffers = 3000 means 24 MB, which is not balanced against the
128 MB allowed for sorts...
Try bumping shared_buffers up to 128 MB ( you may need to instruct your
OS to allow that amount of shared memory usage ) and sort_mem down to 24 MB.
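
In postgresql.conf terms that would look something like this (a sketch,
assuming the default 8 KB buffer size; the kernel's shared memory limits may
need to be raised as well):

shared_buffers = 16384   # roughly 128 MB at 8 KB per buffer
sort_mem = 24576         # roughly 24 MB per sort operation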

Regards
Gaetano Mendola








Re: [PERFORM] Performance Bottleneck

2004-08-03 Thread Gaetano Mendola
Martin Foster wrote:
> [snip]
Tell us about your typical queries and show us your configuration file.
Is the access read-only, or do you have concurrent writers and readers?
During peak hours are your processors pegged at 100%?
What do vmstat and iostat say?
Maybe you are not using indexes somewhere, or maybe you are but the
planner is not using them... In two words, we need more information
in order to help you.

Regards
Gaetano Mendola



Re: [PERFORM] NOT IN query takes forever

2004-08-03 Thread Christopher Kings-Lynne
> explain analyze actually runs the query to do timings.  Just run explain
> and see what you come up with.  More than likely there is a nestloop in
> there which is causing the long query time.
>
> Try bumping up shared buffers some and sort mem as much as you safely
> can.
Just use an EXISTS query I suggest.
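
For example, one possible rewrite of the original query (a sketch, not tested
against the poster's schema):

SELECT t.*
FROM trans t
WHERE NOT EXISTS (
    SELECT 1 FROM query q WHERE q.query_id = t.query_id
);
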
Chris


Re: [PERFORM] Performance Bottleneck

2004-08-03 Thread Martin Foster
Gaetano Mendola wrote:
> Martin Foster wrote:
> > [snip]
>
> Tell us about your typical queries and show us your configuration file.
> Is the access read-only, or do you have concurrent writers and readers?
> During peak hours are your processors pegged at 100%?
> What do vmstat and iostat say?
> Maybe you are not using indexes somewhere, or maybe you are but the
> planner is not using them... In two words, we need more information
> in order to help you.
>
> Regards
> Gaetano Mendola

I included all the files as attachments, which will hopefully cut down
on any replied-to emails.   As for things like connection pooling, the
web server makes use of Apache::DBI to pool the connections for the Perl
scripts being driven on that server.   For the sake of being thorough,
a quick 'apachectl status' was thrown in when the database was under a
good load.

Since it would rather slow things down to wait for the servers to really
get bogged down with load averages of 20.00 and more, I opted to choose
a period of time where we are a bit busier than normal.   You will be
able to see how the system behaves under a light load and as it
subsequently reaches 125 or so concurrent connections.

The queries themselves are simple, normally drawing information from one
table with few conditions or, in the most complex cases, using joins on
two tables or subqueries.   These behave very well and always have; the
problem is that these queries take place in rather large numbers due to
the dumb nature of the scripts themselves.

Over a year ago, when I was still using MySQL for the project, the
statistics generated would report well over 65 queries per second under
loads ranging from 130 to 160 at peak, averaged over the weeks of
operation.   Looking at the Apache status, one can see that it averages
only roughly 2.5 requests per second, giving a slight indication as
to what is taking place.

A quick run of 'systat -ifstat' shows the following graph:

           /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
Load Average

Interface   Traffic             Peak            Total
  lo0  in      0.000 KB/s      0.000 KB/s      37.690 GB
       out     0.000 KB/s      0.000 KB/s      37.690 GB
  em0  in     34.638 KB/s     41.986 KB/s      28.998 GB
       out    70.777 KB/s     70.777 KB/s      39.553 GB
em0 is a full-duplex 100Mbps connection to an internal switch that
supports the servers directly.   Load on the loopback was cut down
considerably once I stopped using pg_autovacuum, since its performance
benefits under low load were buried under the hindrance it caused when
traffic was high.

I am sure that there are some places that could benefit from some
optimization, especially in the case of indexes; however, as a whole the
problem seems to be related more to the massive onslaught of queries
than to anything else.

Also note that some of these scripts run for longer durations even if
they are web based.   Some run as long as 30 minutes, making queries to
the database after waits of anywhere from five seconds to twenty-five
seconds.   Under high duress the