Re: [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Shridhar Daithankar
Josh Berkus wrote:

Shridhar,
 However I do not agree with this logic entirely. It pegs the next vacuum
w.r.t. the current table size, which is not always a good thing.


No, I think the logic's fine, it's the numbers which are wrong.   We want to 
vacuum when updates reach between 5% and 15% of total rows.   NOT when 
updates reach 110% of total rows ... that's much too late.
Well, looks like thresholds below 1 should be the norm rather than the exception.

Hmmm ... I also think the threshold level needs to be lowered; I guess the 
purpose was to prevent continuous re-vacuuming of small tables?  
Unfortunately, in the current implementation, the result is that small tables 
never get vacuumed at all.

So for defaults, I would peg -V at 0.1 and -v at 100, so our default 
calculation for a table with 10,000 rows is:

100 +  ( 0.1 * 10,000 ) = 1100 rows.
I would say -V 0.2-0.4 could be great as well. The fact to emphasize is that 
thresholds less than 1 should be used.
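Plugging those scale factors into the same calculation, for a 10,000-row table
with -v 100:

-V 0.1:  100 + 0.1 * 10,000 =  1,100
-V 0.2:  100 + 0.2 * 10,000 =  2,100
-V 0.4:  100 + 0.4 * 10,000 =  4,100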

 Furthermore, the analyze threshold depends upon inserts+updates. I think it
should also depend upon deletes for obvious reasons.
Yes.  Vacuum threshold is counting deletes, I hope?
It does.

My comment about the frequency of vacuums vs. analyze is that currently the 
*default* is to analyze twice as often as you vacuum.  Based on my 
experience as a PG admin on a variety of databases, I believe that the default 
should be to analyze half as often as you vacuum.
OK.

 I am all for experimentation. If you have real life data to play with, I
can give you some patches to play around with.
I will have real data very soon.
I will submit a patch that would account for deletes in the analyze threshold. Since you 
want to delay the analyze, I would calculate the analyze count as

n=updates + inserts *-* deletes

rather than the current n = updates + inserts. I will also update the README with 
examples and a note on analyze frequency.

What do the statistics gather, BTW? Just the number of rows or something else as well? 
I think I would put that on Hackers separately.

I am still wary of inverting the vacuum/analyze frequency. Do you think it is better to 
set an inverted default rather than documenting it?

 Shridhar



Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Shridhar Daithankar
On Thursday 20 November 2003 20:00, Matthew T. O'Connor wrote:
 Shridhar Daithankar wrote:
  I will submit a patch that would account for deletes in the analyze threshold.
  Since you want to delay the analyze, I would calculate the analyze count as

 deletes are already accounted for in the analyze threshold.

Yes, my bad. Deletes are not accounted for when initializing the analyze count, but 
they are used later.

  I am still wary of inverting the vacuum/analyze frequency. Do you think it is
  better to set an inverted default rather than documenting it?

 I think inverting the vacuum and analyze frequency is wrong.

Me too. ATM all I can think of is the patch attached. Josh, is it sufficient 
for you?.. :-)

Matthew, I am confused about one thing. Why would autovacuum count updates 
while checking the analyze threshold? Updates do not change the statistics, 
right? (w.r.t. line 1072, pg_autovacuum.c). For updating statistics, 
shouldn't inserts+deletes suffice?
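For reference, pg_autovacuum takes these counts from the stats collector; a
quick sketch of which counters feed which threshold (assuming row-level stats
are enabled, so pg_stat_user_tables is populated):

-- analyze activity counts inserts as well; vacuum activity only updates + deletes
SELECT relname,
       n_tup_ins + n_tup_upd + n_tup_del AS analyze_activity,
       n_tup_upd + n_tup_del             AS vacuum_activity
FROM pg_stat_user_tables;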

Other than that, I think autovacuum does everything it can.

Comments?

 Shridhar
*** README.pg_autovacuum.orig	Thu Nov 20 19:58:29 2003
--- README.pg_autovacuum	Thu Nov 20 20:26:39 2003
***
*** 141,150 
  depending on the mixture of table activity (insert, update, or
  delete):
  
! - If the number of (inserts + updates + deletes) > AnalyzeThreshold, then
only an analyze is performed.
  
! - If the number of (deletes + updates) > VacuumThreshold, then a
vacuum analyze is performed.
  
  VacuumThreshold is equal to:
--- 141,150 
  depending on the mixture of table activity (insert, update, or
  delete):
  
! - If the number of (inserts + updates + deletes) >= AnalyzeThreshold, then
only an analyze is performed.
  
! - If the number of (deletes + updates) >= VacuumThreshold, then a
vacuum analyze is performed.
  
  VacuumThreshold is equal to:
***
*** 158,163 
--- 158,186 
  and running ANALYZE more often should not substantially degrade system
  performance.
  
+ Examples:
+ 
+ The following table shows typical pg_autovacuum settings.
+ These are put here so that a DBA has some starting point while
+ tuning pg_autovacuum.
+ 
+ Vacuum is triggered by updates and deletes, so for vacuum the
+ last column indicates the total number of updates and deletes required
+ to trigger it. For analyze, the operations counted are the total
+ number of inserts, updates, and deletes.
+ 
+ Base		Scaling factor	Records		No. of Operations
+ 1,000		1		10,000		11,000
+ 1,000		2		10,000		21,000
+ 1,000		0.5		10,000		 6,000
+ 1,000		0.1		10,000		 2,000
+ 
+ Although analyze is a cheaper operation than vacuum,
+ it may be needed less often. The default is to analyze twice as often as
+ vacuum, but that might be too aggressive for some installations. It is advised that
+ such installations tune their analyze threshold separately, rather than relying upon
+ the default behaviour.
+ 
  Sleeping:
  -
  



Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Shridhar Daithankar
On Thursday 20 November 2003 20:29, Shridhar Daithankar wrote:
 On Thursday 20 November 2003 20:00, Matthew T. O'Connor wrote:
  Shridhar Daithankar wrote:
   I will submit a patch that would account for deletes in the analyze threshold.
   Since you want to delay the analyze, I would calculate the analyze count as
 
  deletes are already accounted for in the analyze threshold.

 Yes, my bad. Deletes are not accounted for when initializing the analyze count,
 but they are used later.

   I am still wary of inverting the vacuum/analyze frequency. Do you think it is
   better to set an inverted default rather than documenting it?
 
  I think inverting the vacuum and analyze frequency is wrong.

 Me too. ATM all I can think of is the patch attached. Josh, is it sufficient
 for you?.. :-)

Use this one. A warning is added for too-aggressive vacuuming. If it is OK by 
everybody, we can send it to the patches list.

 Shridhar
*** README.pg_autovacuum.orig	Thu Nov 20 19:58:29 2003
--- README.pg_autovacuum	Thu Nov 20 20:35:34 2003
***
*** 141,163 
  depending on the mixture of table activity (insert, update, or
  delete):
  
! - If the number of (inserts + updates + deletes) > AnalyzeThreshold, then
only an analyze is performed.
  
! - If the number of (deletes + updates) > VacuumThreshold, then a
vacuum analyze is performed.
  
  VacuumThreshold is equal to:
! vacuum_base_value + (vacuum_scaling_factor * number of tuples in the table)
! 
  AnalyzeThreshold is equal to:
! analyze_base_value + (analyze_scaling_factor * number of tuples in the table)
! 
  The AnalyzeThreshold defaults to half of the VacuumThreshold since it
  represents a much less expensive operation (approx 5%-10% of vacuum),
  and running ANALYZE more often should not substantially degrade system
  performance.
  
  Sleeping:
  -
  
--- 141,191 
  depending on the mixture of table activity (insert, update, or
  delete):
  
! - If the number of (inserts + updates + deletes) >= AnalyzeThreshold, then
only an analyze is performed.
  
! - If the number of (deletes + updates) >= VacuumThreshold, then a
vacuum analyze is performed.
  
  VacuumThreshold is equal to:
! vacuum_base_value + (vacuum_scaling_factor * number of tuples in the 
! table)
  AnalyzeThreshold is equal to:
! analyze_base_value + (analyze_scaling_factor * number of tuples in the 
! table)
  The AnalyzeThreshold defaults to half of the VacuumThreshold since it
  represents a much less expensive operation (approx 5%-10% of vacuum),
  and running ANALYZE more often should not substantially degrade system
  performance.
  
+ Examples:
+ 
+ The following table shows typical pg_autovacuum settings.
+ These are put here so that a DBA has some starting point while
+ tuning pg_autovacuum.
+ 
+ Vacuum is triggered by updates and deletes, so for vacuum the
+ last column indicates the total number of updates and deletes required
+ to trigger it. For analyze, the operations counted are the total
+ number of inserts, updates, and deletes.
+ 
+ Base		Scaling factor	Records		No. of Operations
+ 1,000		1		10,000		11,000
+ 1,000		2		10,000		21,000
+ 1,000		0.5		10,000		 6,000
+ 1,000		0.1		10,000		 2,000
+ 
+ Although analyze is a cheaper operation than vacuum,
+ it may be needed less often. The default is to analyze twice as often as
+ vacuum, but that might be too aggressive for some installations. It is advised 
+ that such installations tune their analyze threshold separately, rather than 
+ relying upon the default behaviour.
+ 
+ Furthermore, for aggressive vacuum/analyze behaviour, it is recommended that
+ the scaling factor be set to less than 1. However, too aggressive a setting can
+ adversely affect the performance of normal database operations. Do not apply such
+ settings to production databases without prior testing.
+ 
  Sleeping:
  -
  



Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Josh Berkus
Matthew,

 For small tables, you don't need to vacuum too often.  In the testing I
 did, a small table of ~100 rows didn't really show significant performance
 degradation until it had close to 1000 updates.

This is accounted for by using the threshold value.   That way small tables 
get vacuumed less often. However, the way large tables work is very different 
and I think your strategy shows a lack of testing on large active tables.

 For large tables,
 vacuum is so expensive, that you don't want to do it very often, and
 scanning the whole table when there is only 5% wasted space is not very
 helpful.

5% is probably too low, you're right ... in my experience, performance 
degradation starts to set in at 10-15% updates to, for example, a 1.1 million 
row table, particularly since users tend to request the most recently updated 
rows.   As long as we have the I/O issues that Background Writer and ARC are 
intended to solve, though, I can see being less aggressive on the defaults; 
perhaps 20% or 25%.   If you wait until 110% of a 1.1 million row table is 
updated, though, that vacuum will take an hour or more.

Additionally, you are not thinking of this in terms of an overall database 
maintenance strategy.   Lazy Vacuum needs to stay below the threshold of the 
Free Space Map (max_fsm_pages) to prevent creeping bloat from setting in to 
your databases.   With proper configuration of pg_avd, vacuum_mem and FSM 
values, it should be possible to never run a VACUUM FULL again, and as of 7.4 
never run a REINDEX again either.

But this means running vacuum frequently enough that your max_fsm_pages 
threshold is never reached.   Which for a large database is going to have to 
be more frequent than 110% updates, because setting max_fsm_pages to 
20,000,000 will eat your RAM.

 Yes, I set the defaults a little high perhaps, so as to err on the
 side of caution.  I didn't want people to say pg_autovacuum kills the
 performance of my server.  A small table will get vacuumed, just not
 until it has reached the threshold.  So a table with 100 rows will get
 vacuumed after 1200 updates / deletes.

Ok, I can see that for small tables.

 In my testing it showed that
 there was no major performance problems  until you reached several
 thousand updates / deletes.

Sure.  But several thousand updates can be only 2% of a very large table.

 HUH?  analyze is very very cheap compared to vacuum.  Why not do it more
 often?

Because nothing is cheap if it's not needed.   

Analyze is needed only as often as the *aggregate distribution* of data in the 
tables changes.   Depending on the application, this could be frequently, but 
far more often (in my experience running multiple databases for several 
clients) the data distribution of very large tables changes very slowly over 
time.  

One client's database, for example, that I have running VACUUM on cron 
scripts runs on this schedule for the main tables:
VACUUM only: twice per hour
VACUUM ANALYZE: twice per day

On the other hand, I've another client's database where most activity involves 
updates to entire classes of records.   They run ANALYZE at the end of every 
transaction.

So if you're going to have a separate ANALYZE schedule at all, it should be 
slightly less frequent than VACUUM for large tables.   Either that, or drop 
the idea, and simplify pg_avd by running VACUUM ANALYZE all the time instead 
of having 2 separate schedules.

BUT ... now I see how you arrived at the logic you did.  If you're testing 
only on small tables, and not vacuuming them until they reach 110% updates, 
then you *would* need to analyze more frequently.  This is because of your 
threshold value ... you'd want to analyze the small table as soon as even 30% 
of its rows changed.

So the answer is to dramatically lower the threshold for the small tables.

 What I think I am hearing is that people would like very much to be able
 to tweak the settings of pg_autovacuum for individual tables / databases
 etc. 

Not from me you're not.   Though that would be nice, too.

So, my suggested defaults based on our conversation above:

Vacuum threshold: 1000 records
Vacuum scale factor:  0.2
Analyze threshold:  50 records
Analyze scale factor: 0.3
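
As a pg_autovacuum command line this would be roughly the following; -v/-V are 
the flags used above, while -D (daemon mode) and -a/-A as the analyze 
counterparts are assumptions based on the contrib README, so check the README 
for your build:

pg_autovacuum -D -v 1000 -V 0.2 -a 50 -A 0.3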

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Josh Berkus
Shridhar,

 I would say -V 0.2-0.4 could be great as well. The fact to emphasize is that
 thresholds less than 1 should be used.

Yes, but not thresholds, scale factors of less than 1.0.  Thresholds should 
still be in the range of 100 to 1000.

 I will submit a patch that would account for deletes in the analyze threshold.
 Since you want to delay the analyze, I would calculate the analyze count as

 n=updates + inserts *-* deletes

I'm not clear on how this is a benefit.  Deletes affect the statistics, too.

 What do the statistics gather, BTW? Just the number of rows or something else as
 well? I think I would put that on Hackers separately.

Number of tuples, degree of uniqueness, some sample values, and high/low 
values.   Just query your pg_statistics view for an example.
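
For example, against the standard view (spelled pg_stats in 7.3/7.4; 'foo' is a 
placeholder table name):

-- what ANALYZE gathered for each column of one table
SELECT attname, null_frac, n_distinct,
       most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'foo';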

 I am still wary of inverting the vacuum/analyze frequency. Do you think it is
 better to set an inverted default rather than documenting it?

See my post to Matthew.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Chester Kustarz
On Thu, 20 Nov 2003, Josh Berkus wrote:
 Additionally, you are not thinking of this in terms of an overall database
 maintenance strategy.   Lazy Vacuum needs to stay below the threshold of the
 Free Space Map (max_fsm_pages) to prevent creeping bloat from setting in to
 your databases.   With proper configuration of pg_avd, vacuum_mem and FSM
 values, it should be possible to never run a VACUUM FULL again, and as of 7.4
 never run a REINDEX again either.

is there any command you can run to see how much of the FSM is filled? is
there any way to tell which tables are filling it?

 Analyze is needed only as often as the *aggregate distribution* of data in the
 tables changes.   Depending on the application, this could be frequently, but
 far more often (in my experience running multiple databases for several
 clients) the data distribution of very large tables changes very slowly over
 time.

analyze does 2 things for me:
1. gets reasonable aggregate statistics
2. generates STATISTICS # of bins for the most frequent hitters

(2) is very important for me. my values typically seem to have power-law
like distributions. i need enough bins to reach a cross-over point where
the last bin is frequent enough to make an index scan useful. also,
i want enough bins so that the planner can choose index a or b for:
select * from foo where a=n and b=m;

the selectivity of either index depends not only on the average selectivity
of index a or index b, but on n and m as well. for example, 1M row table:

value   % of rows
v1  23
v2  12
v3  4.5
v4  4
v5  3.5
...

you can see that picking an index for =v1 would be poor. picking the
20th most common value would be 0.5% selective. much better. of course
this breaks down for more complex operators, but = is fairly common.
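
A sketch of how to get more bins for a skewed column like that, using the 
per-column statistics target (foo and a are placeholder names):

-- keep more most-common-value/histogram bins for a power-law column
ALTER TABLE foo ALTER COLUMN a SET STATISTICS 100;
ANALYZE foo;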

 So if you're going to have a separate ANALYZE schedule at all, it should be
 slightly less frequent than VACUUM for large tables.   Either that, or drop
 the idea, and simplify pg_avd by running VACUUM ANALYZE all the time instead
 of having 2 separate schedules.

i have some tables which are insert only. i do not want to vacuum them
because there are never any dead tuples in them and the vacuum grows the
indexes. plus it is very expensive (the tables grow rather large.) after they
expire i drop the whole table to make room for a newer one (making sort
of a rolling log with many large tables.)

i need to analyze them every so often so that the planner knows that
there is 1 row, 100 rows, 100k rows, or 1M rows. the funny thing is
that because i never vacuum the tables, the relpages on the index never
grows. don't know if this affects anything (this is on 7.2.3).
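
One way to watch that, as a sketch ('foo%' is a placeholder pattern; 
relpages/reltuples are only refreshed by VACUUM or ANALYZE):

-- planner-visible size estimates for a table and its indexes
SELECT relname, relkind, relpages, reltuples
FROM pg_class
WHERE relname LIKE 'foo%';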

vacuum is to reclaim dead tuples. this means it depends on update and
delete. analyze depends on data values/distribution. this means it depends on
insert, update, and delete. thus the dependencies are slightly different
between the 2 operations, and so you can come up with use-cases that
justify running either more frequently.

i am not sure how failed transactions fit into this though, not that i think
anybody ever has very many. maybe big rollbacks during testing?





Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Tom Lane
Chester Kustarz [EMAIL PROTECTED] writes:
 i have some tables which are insert only. i do not want to vacuum them
 because there are never any dead tuples in them and the vacuum grows the
 indexes.

Those claims cannot both be true.  In any case, plain vacuum cannot grow
the indexes --- only a VACUUM FULL that moves a significant number of
rows could cause index growth.

 vacuum is to reclaim dead tuples. this means it depends on update and
 delete. analyze depends on data values/distribution. this means it depends on
 insert, update, and delete. thus the dependencies are slightly different
 between the 2 operations, and so you can come up with use-cases that
 justify running either more frequently.

Agreed.

regards, tom lane



Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Chester Kustarz
On Thu, 20 Nov 2003, Tom Lane wrote:
 Those claims cannot both be true.  In any case, plain vacuum cannot grow
 the indexes --- only a VACUUM FULL that moves a significant number of
 rows could cause index growth.

er, yeah. you're right of course. having flashbacks of vacuum full.





[PERFORM] Problem with insert into select...

2003-11-20 Thread stephen farrell
I'm having a problem with a query like:  INSERT INTO FACT (x,x,x,x,x,x)
SELECT a.key,b.key,c.key,d.key,e.key,f.key from x,a,b,c,d,e,f where x=a
and x=b    -- postgres7.4 is running out of memory.  I'm not sure
why this would happen -- does it buffer the subselect before doing the
insert?
Things are pretty big scale: 3gb ram, 32768 shared buffers, 700gb disk,
millions of rows in the tables.






Re: [PERFORM] Problem with insert into select...

2003-11-20 Thread Tom Lane
stephen farrell [EMAIL PROTECTED] writes:
 I'm having a problem with a query like:  INSERT INTO FACT (x,x,x,x,x,x)
 SELECT a.key,b.key,c.key,d.key,e.key,f.key from x,a,b,c,d,e,f where x=a
 and x=b    -- postgres7.4 is running out of memory.  I'm not sure
 why this would happen -- does it buffer the subselect before doing the
 insert?

What does EXPLAIN show for the query?  And we need to see the exact
query and table definitions, not abstractions.

regards, tom lane



Re: [PERFORM] duration logging setting in 7.4

2003-11-20 Thread Tom Lane
Ryszard Lach [EMAIL PROTECTED] writes:
 Nov 18 10:05:20  postgres[1348]: [318-1] LOG:  duration: 0.297 ms  statement:
 Nov 18 10:05:20  postgres[1311]: [5477-1] LOG:  duration: 0.617 ms  statement:
 Nov 18 10:05:20  postgres[1312]: [5134-1] LOG:  duration: 0.477 ms  statement:
 Nov 18 10:05:20  postgres[1349]: [318-1] LOG:  duration: 0.215 ms  statement:
 Nov 18 10:05:20  postgres[1313]: [5449-1] LOG:  duration: 0.512 ms  statement:
 Nov 18 10:05:20  postgres[1314]: [5534-1] LOG:  duration: 0.420 ms  statement:
 Nov 18 10:05:20  postgres[1330]: [772-1] LOG:  duration: 1.386 ms  statement: SELECT 
 * FROM mytablemius WHERE id = 0;
 Nov 18 10:05:20  postgres[1315]: [5757-1] LOG:  duration: 0.417 ms  statement:
 Nov 18 10:05:20  postgres[1316]: [5885-1] LOG:  duration: 0.315 ms  statement:
 Nov 18 10:05:20  postgres[1317]: [5914-1] LOG:  duration: 0.301 ms  statement:
 Nov 18 10:05:20  postgres[1318]: [5990-1] LOG:  duration: 0.293 ms  statement:
 Nov 18 10:05:20  postgres[1319]: [6009-1] LOG:  duration: 0.211 ms  statement:
 Nov 18 10:05:20  postgres[1320]: [6039-1] LOG:  duration: 0.188 ms  statement:


Is it possible that you're sending a lot of queries that have an initial
newline in the text?  I'd expect the first line of log output for such a
query to look as above.

regards, tom lane



Re: [PERFORM] Problem with insert into select...

2003-11-20 Thread stephen farrell
Ok -- so we created indexes and it was able to complete successfully. 
But why would creating indexes affect the memory footprint, and should it?

Does it buffer the sub-select before doing the insert, or does it do the 
insert record-by-record?

See correspondence below for details:

Steve,

With the indexes created it worked.  It took about 4 hours, but 
it inserted all of the records.

stephen farrell [EMAIL PROTECTED]
11/20/2003 05:22 PM
To: James Rhodes/Almaden/[EMAIL PROTECTED]
cc:
Subject: Re: [Fwd: Re: [PERFORM] Problem with insert into select...]



if you do explain before the sql statement (e.g., explain select *
from foo), it'll tell you the query plan.
James Rhodes wrote:

 Steve,

 Here is the detailed structure of the tables and the query that
 is failing (the INSERT INTO FACT query) and I attached the logfile.
  Also what is EXPLAIN???

 CREATE TABLE RAW ( RAW_KEY  serial, PATNO_TEXT VARCHAR (9),
 APPDATE_DATETIME VARCHAR (11), ISDATE_DATETIME VARCHAR (11),
 WHATEVERSNO_TEXT VARCHAR (5), WHATEVERSNO_NUMBER VARCHAR (6), APPNO_TEXT
 VARCHAR (10), TITLE_TEXT TEXT, USCLASS_TEXT VARCHAR (14),
 USCLASS_TEXTLIST_TEXT TEXT, AUTHORCODE_TEXT VARCHAR (9),
 AUTHORNORM_TEXT VARCHAR (195), AUTHOR_TEXT VARCHAR (212),
 AUTHOR_TEXTLIST_TEXT TEXT, AUTHORADDRESS_TEXT VARCHAR (84),
 AUTHORADDRESS_TEXTLIST_TEXT TEXT, INVENTOR_TEXT VARCHAR (50),
 INVENTOR_TEXTLIST_TEXT TEXT, INVENTORADDRESS_TEXT VARCHAR (90),
 INVENTORADDRESS_TEXTLIST_TEXT TEXT, AGENT_TEXT TEXT, AGENT_TEXTLIST_TEXT
 TEXT, USSEARCHFIELD_TEXT VARCHAR (26), USSEARCHFIELD_TEXTLIST_TEXT
 VARCHAR (150), USREFISDATE_TEXT VARCHAR (13), USREFISDATE_TEXTLIST_TEXT
 TEXT, USREFNAME_TEXT VARCHAR (34), USREFNAME_TEXTLIST_TEXT TEXT,
 ABSTRACT_TEXT TEXT, ABSTRACT_TEXTLIST_TEXT TEXT, ABSTRACT_RICHTEXT_PAR
 TEXT, WHATEVERS_RICHTEXT_PAR TEXT, USREFPATNO_RICHTEXT_PAR TEXT, PRIMARY
 KEY(RAW_KEY));


 CREATE TABLE ISSUE_TIME (
 TAB_KEY  serial,
 ISDATE_DATETIME varchar (8),
 MONTH INT,
 DAY INT,
 YEAR INT
 , PRIMARY KEY(TAB_KEY))

 CREATE TABLE SOMETHING_NUMBER (
 TAB_KEY  serial,
 PATNO_TEXT varchar (7)
 , PRIMARY KEY(TAB_KEY))

 CREATE TABLE APP_TIME (
 TAB_KEY  serial,
 APPDATE_DATETIME varchar (8),
 MONTH INT,
 DAY INT,
 YEAR INT
 , PRIMARY KEY(TAB_KEY))

 CREATE TABLE AUTHOR (
 TAB_KEY  serial,
 CODE varchar (6),
 AUTHOR text
 , PRIMARY KEY(TAB_KEY))

 CREATE TABLE APPLICATION_NUMBER (
 TAB_KEY  serial,
 APPNO_TEXT varchar (7)
 , PRIMARY KEY(TAB_KEY))

 CREATE TABLE WHATEVERS (
 TAB_KEY  serial,
 abstract_richtext_par text,
 WHATEVERS_richtext_par text,
 raw_key int,
 title_text text
 , PRIMARY KEY(TAB_KEY))

 CREATE TABLE FACT (DYN_DIM1 BIGINT, DYN_DIM2 BIGINT,DYN_DIM3
 BIGINT,ISSUE_TIME BIGINT, SOMETHING_NUMBER BIGINT, APP_TIME BIGINT,
 AUTHOR BIGINT, APPLICATION_NUMBER BIGINT, WHATEVERS BIGINT)

 INSERT INTO FACT (ISSUE_TIME, SOMETHING_NUMBER, APP_TIME, AUTHOR,
 APPLICATION_NUMBER, WHATEVERS)  SELECT ISSUE_TIME.TAB_KEY,
 SOMETHING_NUMBER.TAB_KEY, APP_TIME.TAB_KEY, AUTHOR.TAB_KEY,
 APPLICATION_NUMBER.TAB_KEY, WHATEVERS.TAB_KEY  FROM ISSUE_TIME,
 SOMETHING_NUMBER, APP_TIME, AUTHOR, APPLICATION_NUMBER, WHATEVERS, raw
  WHERE ISSUE_TIME.ISDATE_DATETIME=raw.ISDATE_DATETIME  AND
 SOMETHING_NUMBER.PATNO_TEXT=raw.PATNO_TEXT  AND
 APP_TIME.APPDATE_DATETIME=raw.APPDATE_DATETIME  AND
 AUTHOR.CODE=AUTHORCODE_TEXT AND AUTHOR.AUTHOR=(AUTHOR_TEXT ||
  ' | ' || AUTHOR_TEXTLIST_TEXT) AND
 APPLICATION_NUMBER.APPNO_TEXT=raw.APPNO_TEXT  AND
 WHATEVERS.raw_key=raw.raw_key
Tom Lane wrote:
stephen farrell [EMAIL PROTECTED] writes:

I'm having a problem with a query like:  INSERT INTO FACT (x,x,x,x,x,x)
SELECT a.key,b.key,c.key,d.key,e.key,f.key from x,a,b,c,d,e,f where x=a
and x=b    -- postgres7.4 is running out of memory.  I'm not sure
why this would happen -- does it buffer the subselect before doing the
insert?


What does EXPLAIN show for the query?  And we need to see the exact
query and table definitions, not abstractions.
			regards, tom lane




Re: [HACKERS] [PERFORM] More detail on settings for pgavd?

2003-11-20 Thread Josh Berkus
Matthew,

  110% of a 1.1 million row table is updated, though, that vacuum will
  take an hour or more.

 True, but I think it would be one hour once, rather than 30 minutes 4
 times.

Well, generally it would be about 6-8 times at 2-4 minutes each.

 This is one of the things I had hoped to add to pg_autovacuum, but never
 got to.  In addition to just the information from the stats collector on
 inserts updates and deletes, pg_autovacuum should also look at the FSM,
 and make decisions based on it.  Anyone looking for a project?

Hmmm ... I think that's the wrong approach.  Once your database is populated, 
it's very easy to determine how to set the FSM for a given pg_avd level.   If 
you're vacuuming after 20% updates, for example, just set fsm_pages to 20% of 
the total database pages plus growth & safety margins.
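
A sketch of that arithmetic (an estimate only, since relpages is refreshed by 
VACUUM/ANALYZE; the 0.2 mirrors the 20% example):

-- rough fsm_pages target: 20% of current heap pages, before growth/safety margins
SELECT sum(relpages)       AS total_pages,
       sum(relpages) * 0.2 AS fsm_pages_estimate
FROM pg_class
WHERE relkind IN ('r', 't');  -- plain tables plus their toast tables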

I'd be really reluctant to base pg_avd frequency on the fsm settings instead.  
What if the user loads 8GB of data but leaves fsm_pages at the default of 
10,000?  You can't do much with that; you'd have to vacuum if even 1% of the 
data changed.

The other problem is that calculating data pages from a count of 
updates+deletes would require pg_avd to keep more statistics and do more math 
for every table.  Do we want to do this?

 But I can't imagine that 2% makes any difference on a large table.  In
 fact I would think that 10-15% would hardly be noticeable, beyond that
 I'm not sure.

I've seen performance lag at 10% of records, especially in tables where both 
update and select activity focus on one subset of the table (calendar tables, 
for example).

 Valid points, and again I think this points to the fact that
 pg_autovacuum needs to be more configurable.  Being able to set
 different thresholds for different tables will help considerably.  In
 fact, you may find that some tables should have a vac threshold much
 larger than the analyze threshold, while other tables might want the
 opposite.

Sure.  Though I think we can make the present configuration work with a little 
adjustment of the numbers.   I'll have a chance to test on production 
databases soon.

 I would be surprised if you can notice the difference between a vacuum
 analyze and a vacuum, especially on large tables.

It's substantial for tables with high statistics settings.   A 1,000,000 row 
table with 5 columns set to statistics=250 can take 3 minutes to analyze on a 
medium-grade server.

 I think you need two separate schedules.  There are lots of times where
 a vacuum doesn't help, and an analyze is all that is needed

Agreed.  And I've just talked to a client who may want to use pg_avd's ANALYZE 
scheduling but not use vacuum at all.   BTW, I think we should have a setting 
for this; for example, if -V is -1, don't vacuum.

 I'm open to discussion on changing the defaults.  Perhaps it would be
 better to use some non-linear (perhaps logarithmic) scaling factor,
 so that you wind up with something roughly like this:

 #tuples   activity% for vacuum
 1k        100%
 10k        70%
 100k       45%
 1M         20%
 10M        10%
 100M        8%

That would be cool, too.   Though a count of data pages would be a better 
scale than a count of rows, and equally obtainable from pg_class.
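
A sketch of what a page-based version of the formula could look like 
(hypothetical; the 1000 and 0.2 are just the default base and scale factor 
discussed above):

-- scale the threshold by pages rather than rows
SELECT relname,
       1000 + 0.2 * reltuples AS row_based_threshold,
       1000 + 0.2 * relpages  AS page_based_threshold
FROM pg_class
WHERE relkind = 'r';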

 Thanks for the lucid feedback / discussion.  autovacuum is a feature
 that, despite its simple implementation, has generated a lot of
 feedback from users, and I would really like to see it become something
 closer to what it should be.

Well, I hope to help now.  Until very recently, I've not had a chance to 
seriously look at pg_avd and test it in production.   Now that I do, I'm 
interested in improving it.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
