Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 01:26:00AM -0500, Matthew T. O'Connor wrote:
 Tom Lane wrote:
 Matthew T. O'Connor matthew@zeut.net writes:
 I'm not sure what you are saying here. Are you now saying that partial
 vacuum won't work for autovac?  Or are you saying that saving state as
 Jim is describing above won't work?
 
 I'm saying that I don't like the idea of trying to stop on a dime by
 saving the current contents of vacuum's dead-TID array to disk with the
 idea that we can trust those values 100% later.  Saving the array is
 expensive both in runtime and code complexity, and I don't believe we
 can trust it later --- at least not without even more expensive-and-
 complex measures, such as WAL-logging every such save :-(
 
 I'm for stopping only after completing an index-cleaning pass, at the
 point where we empty the dead-TID array anyway.  If you really have to
 have stop on a dime, just kill -INT the process, accepting that you
 will have to redo your heap scan since the last restart point.
 
 OK, so if I understand correctly, a vacuum of a table with 10 indexes on
 it can be interrupted 10 times, once after each index-cleaning pass?
 That might have some value, especially for breaking up the work required
 to vacuum a large table. Or am I still not getting it?

It'd stop after scanning the heap, scanning all the indexes, and then
cleaning the heap. After that's done it no longer needs any of the
dead-TID info; anytime before that it does need that info, and Tom's
objection is that trying to store that info is a bad idea.

The problem with this is that typically it takes a long time to go
through a complete vacuum cycle; minutes at least, and preferably
longer. Decreasing that cycle time will greatly increase the amount of
IO required for vacuuming a table with any indexes, because every time
you cycle through you have to read the entire index. That's why I don't
see it as being useful at all for getting autovac to work on hot tables
-- if you actually got that cycle time low enough you'd kill the system
with all the extra index scanning.
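
To put rough numbers on that (a back-of-the-envelope Python sketch; the
index size is invented):

index_pages = 125_000                 # ~1 GB of indexes at 8 kB/page

def index_scan_pages_per_hour(cycle_minutes):
    # every vacuum cycle rescans all of the table's indexes in full
    return (60 / cycle_minutes) * index_pages

print(index_scan_pages_per_hour(60))  # 125,000 pages/hour at one cycle per hour
print(index_scan_pages_per_hour(1))   # 7,500,000 pages/hour at one cycle per minute
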
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:54:28AM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 Here is a worst-case example: a DB with 6 tables, all of which are highly
 active and will need to be vacuumed constantly.  While this is totally
 hypothetical, it is how I envision things working (without the threshold).
 
 I fail to see how a simple 6 table case is 'worst case'. It's common to
 see hundreds of tables, and I've run across more than one database with
 thousands of tables (think partitioning). In cases like those it's
certainly possible, perhaps even likely, that you would get many daemons
 running in the database at one time just from different tables suddenly
 needing vacuuming and appearing at a higher point in the list than other
 tables. With 100 ~1G tables getting updates it certainly wouldn't be
 hard to end up with 10 of those being vacuumed all at the same time.
 
 Yes, 6 tables is small; the worst-case part of the example was that all
 the tables would need to be vacuumed constantly.  Most databases only
 have a few hot tables.  Most tables only need to be vacuumed every once
 in a while.
 
It's not the hot tables that are the issue; it's how many large tables
(hot or not) can come up for vacuuming in sequence. For example, if
A-Z are all large tables (ie: a few GB), with A being the largest and Z
the smallest, think about what happens here:

Round 1: A needs vacuuming. Daemon gets to it and starts working.
Round 2: B now needs vacuuming. It's slightly smaller than A, so daemon
2 gets to it.
Round 3: C now needs vacuuming. Daemon 3.
...
Round 26: Z now needs vacuuming. Daemon 26 picks it up.

You now have 26 daemons running in the database.

Now, we can argue about how likely that scenario is, but I don't think
it's relevant. What matters is that it *is* possible, and as long as
that's the case you'd have to have some kind of limit. (While this
simple 26-table example is definitely worst-case, if you've got hundreds
of tables that are all multiple GB in size, I think it wouldn't be hard
at all to end up with a dozen or more daemons all hammering away.)
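
A toy counting model of that pile-up (Python; the 10-round vacuum
duration is invented, the point is just the steady state):

vacuum_rounds = 10        # rounds a daemon needs to finish one big table
active = []               # rounds remaining for each running daemon
for r in range(26):       # tables A..Z come due one per round
    active = [t - 1 for t in active if t > 1]    # retire finished daemons
    active.append(vacuum_rounds)                 # a new daemon starts
    print(chr(ord('A') + r), len(active))        # count climbs to 10 and stays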

 I do like the idea since it should be easier to tune, but I think we
 still need some limit on it. Perhaps as a first-pass we could just have
 a hard limit and log a message and/or set a flag any time we hit it.
 That would hopefully allow us to get information about how big a problem
 it really is. We could go one step further and say that the last daemon
 that can start in a database will only vacuum tables that can be done
 quickly; that's essentially what we've been talking about, except the
 limit we've been discussing would be hard-coded at 2.
 
 I'm confused, what limit would be set at 2?  The number of concurrent 
 workers?  I've never said that.

The point I was making is that the proposal about limiting the 2nd
daemon to only processing tables it can do in a short period of time is
akin to setting a limit of only 2 daemons in a database at a time.
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
  The advantage to keying this to autovac_naptime is that it means we
  don't need another GUC, but after I suggested that before I realized
  that's probably not the best idea. For example, I've seen clusters that
are running dozens to hundreds of databases; in that environment you really
  need to turn naptime way down (to like a second). In that case you
  wouldn't want to key to naptime.
 
 Actually, I've been thinking that it would be a good idea to change the
 semantics of autovacuum_naptime so that it means the average time to
 start a worker in any given database.  That way, the time between
 autovac runs is not dependent on the number of databases you have.

BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users who
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:

Jim C. Nasby wrote:


The advantage to keying this to autovac_naptime is that it means we
don't need another GUC, but after I suggested that before I realized
that's probably not the best idea. For example, I've seen clusters that
are running dozens to hundreds of databases; in that environment you really
need to turn naptime way down (to like a second). In that case you
wouldn't want to key to naptime.

Actually, I've been thinking that it would be a good idea to change the
semantics of autovacuum_naptime so that it means the average time to
start a worker in any given database.  That way, the time between
autovac runs is not dependent on the number of databases you have.


BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users who
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.


I think we all agree that autovacuum needs to become tablespace aware at
some point, but I think that is further down the line; we're having
enough trouble figuring things out without that additional complication.




Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:12:22PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
 The advantage to keying this to autovac_naptime is that it means we
 don't need another GUC, but after I suggested that before I realized
 that's probably not the best idea. For example, I've seen clusters that
 are running dozens to hundreds of databases; in that environment you really
 need to turn naptime way down (to like a second). In that case you
 wouldn't want to key to naptime.
 Actually, I've been thinking that it would be a good idea to change the
 semantics of autovacuum_naptime so that it means the average time to
 start a worker in any given database.  That way, the time between
 autovac runs is not dependent on the number of databases you have.
 
 BTW, another issue that I don't think we can ignore: we actually need to
 do this on a per-tablespace level, or at least have the ability to
 disable or somehow limit it. While it's not common, there are users who
 run a hundred or more databases in a single cluster; it would be ugly if
 we suddenly had 100 vacuums trying to run on the same set of drives
 concurrently.
 
 I think we all agree that autovacuum needs to become tablespace aware at
 some point, but I think that is further down the line; we're having
 enough trouble figuring things out without that additional complication.

Sure, we just need a way to disable the multiple autovac daemon stuff
then.
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Casey Duncan


On Feb 26, 2007, at 12:49 PM, Alvaro Herrera wrote:


Jim C. Nasby wrote:


That's why I'm thinking it would be best to keep the maximum size of
stuff for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to hit
the high-update tables every X number of seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger side, if
it's way over the threshold the second vacuum will hit it.


Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently depending on whether
there are already one or more workers on the database.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the plain worker mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit on what it can do.

The second mode is the hot table worker mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker mode were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.


Perhaps this has already been proposed, but maybe some combination of
the following inputs could be used to determine which table most needs
vacuuming:

- The proportion of tuples in a table that are dead (updated rows since
the last vacuum / estimated row count). This would favor hot tables
naturally, regardless of size.

- The time since the last vacuum, so that larger tables are eventually
vacuumed even if hot tables totally dominate.

Of course tables that did not pass the minimum parameters specified in
postgresql.conf would not even get considered.

I'm being intentionally vague here on the exact algorithm, since you
all have thought about this more than I have. One thing I like about
the above is that it is independent of table size, and doesn't require
anyone to determine which tables are hot manually.
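
One way to make that concrete (a hypothetical scoring function in
Python; the weight, names, and inputs are invented for illustration,
not proposed values):

import time

def vacuum_priority(dead_tuples, est_live_tuples, last_vacuum_ts,
                    staleness_weight_per_hour=0.05):
    # dead fraction favors hot tables regardless of size
    dead_fraction = dead_tuples / max(est_live_tuples, 1)
    # the age term lets big, slowly-churning tables rise eventually
    hours_since_vacuum = (time.time() - last_vacuum_ts) / 3600
    return dead_fraction + staleness_weight_per_hour * hours_since_vacuum

Tables below the postgresql.conf minimums would be filtered out before
scoring; workers would then process tables in descending score order.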


-Casey



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Galy Lee

Tom Lane wrote:
 Saving the array is
 expensive both in runtime and code complexity, and I don't believe we
 can trust it later --- at least not without even more expensive-and-
 complex measures, such as WAL-logging every such save :-(
I don't understand well the things you are worrying about.
If we find that we cannot trust the saved file, or the file has been
corrupted, then we can drop it and scan from the beginning of the heap.
If something like CLUSTER or PITR has changed the physical layout of the
heap, then we can simply drop the files. Why do we need WAL for it?
I don't see any point in it.

Also, I don't think it is expensive. If it is combined with a
maintenance window, stopping once in a whole day, writing out
256MB/2 = 128MB of state cannot be called expensive. Of course, this
feature isn't for autovacuum to use every minute; autovacuum can use it
after it has adopted maintenance windows.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Jim C. Nasby wrote:

 That's why I'm thinking it would be best to keep the maximum size of
 stuff for the second worker small. It probably also makes sense to tie
 it to time and not size, since the key factor is that you want it to hit
 the high-update tables every X number of seconds.
 
 If we wanted to get fancy, we could factor in how far over the vacuum
 threshold a table is, so even if the table is on the larger side, if
 it's way over the threshold the second vacuum will hit it.

Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently depending on whether
there are already one or more workers on the database.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the plain worker mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit on what it can do.

The second mode is the hot table worker mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker mode were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.

-- 
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Jim C. Nasby wrote:


That's why I'm thinking it would be best to keep the maximum size of
stuff for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to hit
the high-update tables every X number of seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger side, if
it's way over the threshold the second vacuum will hit it.


Ok, I think we may be actually getting somewhere.


Me too.


I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently depending on whether
there are already one or more workers on the database.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the plain worker mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit on what it can do.


Right, I like this.


The second mode is the hot table worker mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker mode were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.


How can you determine what tables can be vacuumed within 
autovacuum_naptime?  I agree that large tables should be excluded, but I 
don't know how we can do that calculation based on autovacuum_naptime.


So at:
t=0*autovacuum_naptime: worker1 gets started on DBX
t=1*autovacuum_naptime: worker2 gets started on DBX
worker2 determines all tables that need to be vacuumed,
worker2 excludes tables that are too big from its to-do list,
worker2 gets started working,
worker2 exits when it either:
a) finishes its entire to-do list, or
b) catches up to worker1.

I think the questions are: 1) What is the exact math you are planning on
using to determine which tables are too big?  2) Do we want worker2 to
exit when it catches worker1, or does the fact that we have excluded
tables that are too big mean that we don't have to worry about this?





Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:

 The second mode is the hot table worker mode, enabled when the worker
 detects that there's already a worker in the database.  In this mode,
 the worker is limited to those tables that can be vacuumed in less than
 autovacuum_naptime, so large tables are not considered.  Because of
 this, it'll generally not compete with the first mode above -- the
 tables in plain worker mode were sorted by size, so the small tables were
 among the first vacuumed by the plain worker.  The estimated time to
 vacuum may be calculated according to autovacuum_vacuum_delay settings,
 assuming that all pages constitute cache misses.
 
 How can you determine what tables can be vacuumed within 
 autovacuum_naptime?

My assumption is that
pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

This is of course not the reality, because the delay is not how long
it takes to fetch the pages.  But it lets us have a value with which we
can do something.  With the default values, vacuum_cost_delay=10,
vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
of under 600 pages, 4800 kB (should we include indexes here in the
relpages count?  My guess is no).
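
Spelled out (a Python sketch of exactly that arithmetic; note it charges
one delay per cost point and ignores vacuum_cost_limit):

vacuum_cost_page_miss = 10        # cost points charged per page miss
vacuum_cost_delay_ms = 10         # sleep charged per point in this simplified model
autovacuum_naptime_ms = 60 * 1000

# relpages * page_miss * delay = time  =>  solve for relpages
max_relpages = autovacuum_naptime_ms / (vacuum_cost_page_miss * vacuum_cost_delay_ms)
print(max_relpages)               # 600.0 pages
print(max_relpages * 8, "kB")     # 4800.0 kB at 8 kB per heap page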

A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.

 So at:
 t=0*autovacuum_naptime: worker1 gets started on DBX
 t=1*autovacuum_naptime: worker2 gets started on DBX
 worker2 determines all tables that need to be vacuumed,
 worker2 excludes tables that are too big from its to-do list,
 worker2 gets started working,
 worker2 exits when it either:
 a) finishes its entire to-do list, or
 b) catches up to worker1.

 I think the questions are: 1) What is the exact math you are planning on
 using to determine which tables are too big?  2) Do we want worker2 to
 exit when it catches worker1, or does the fact that we have excluded
 tables that are too big mean that we don't have to worry about this?

Right, I think the fact that we excluded big tables means that this
won't be a problem most of the time, but we'll need some sort of
protection anyway.  I think this is easy to achieve -- store the table
each worker is currently processing in shared memory, and have all
workers check all other workers.  If a plain worker finds that another
worker is processing the table already, it skips that table and
continues with the next one.  A hot table worker instead exits right
away (caught up).
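
A toy model of that check (Python standing in for the shared-memory
bookkeeping; a sketch of the rule as described, not actual backend code):

import threading

autovac_lock = threading.Lock()   # stands in for the autovacuum LWLock
current_table = {}                # worker id -> OID of the table being vacuumed

def try_claim(worker, table_oid, hot_table_worker):
    with autovac_lock:
        if table_oid in current_table.values():
            # someone else is on it: plain workers skip to the next table,
            # hot table workers treat this as "caught up" and exit
            return "exit" if hot_table_worker else "skip"
        current_table[worker] = table_oid
        return "claimed"

def release(worker):
    with autovac_lock:
        current_table.pop(worker, None)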

-- 
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Matthew T. O'Connor wrote:
How can you determine what tables can be vacuumed within 
autovacuum_naptime?


My assumption is that
pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

This is of course not the reality, because the delay is not how long
it takes to fetch the pages.  But it lets us have a value with which we
can do something.  With the default values, vacuum_cost_delay=10,
vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
of under 600 pages, 4800 kB (should we include indexes here in the
relpages count?  My guess is no).


I'm not sure how pg_class.relpages is maintained, but what happens to a
bloated table?  For example, take a 100-row table that is constantly
updated and hasn't been vacuumed in a while (say the admin disabled
autovacuum for a while); that small 100-row table now has 1000 pages in
it, most of which are just bloat.  Will we miss this table?  Perhaps
basing this on reltuples would be better?



A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.


I'm not sure it's a good idea to tie this to the vacuum cost delay
settings either, so let me ask you this: how is this better than just
allowing the admin to set a new GUC variable like
autovacuum_hot_table_size_threshold (or something shorter), to which we
can assign a decent default of, say, 8MB?


Thoughts?



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:
 Matthew T. O'Connor wrote:
 How can you determine what tables can be vacuumed within 
 autovacuum_naptime?
 
 My assumption is that
 pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to 
 vacuum
 
 This is of course not the reality, because the delay is not how long
 it takes to fetch the pages.  But it lets us have a value with which we
 can do something.  With the default values, vacuum_cost_delay=10,
 vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
 of under 600 pages, 4800 kB (should we include indexes here in the
 relpages count?  My guess is no).
 
 I'm not sure how pg_class.relpages is maintained, but what happens to a
 bloated table?  For example, take a 100-row table that is constantly
 updated and hasn't been vacuumed in a while (say the admin disabled
 autovacuum for a while); that small 100-row table now has 1000 pages in
 it, most of which are just bloat.  Will we miss this table?  Perhaps
 basing this on reltuples would be better?

Well, this would only happen the first time, until the plain worker
processed the table; next time it would be picked up by the hot table
worker.  But yeah, we can build a better estimate using the same trick
the planner uses: estimate tuple density as reltuples/relpages, and
multiply by the actual number of blocks on disk.

 A table over 600 pages does not sound like a good candidate for hot, so
 this seems more or less reasonable to me.  On the other hand, maybe we
 shouldn't tie this to the vacuum cost delay stuff.
 
 I'm not sure it's a good idea to tie this to the vacuum cost delay
 settings either, so let me ask you this: how is this better than just
 allowing the admin to set a new GUC variable like
 autovacuum_hot_table_size_threshold (or something shorter), to which we
 can assign a decent default of, say, 8MB?

Yeah, maybe that's better -- it's certainly simpler.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Matthew T. O'Connor wrote:
I'm not sure how pg_class.relpages is maintained, but what happens to a
bloated table?  For example, take a 100-row table that is constantly
updated and hasn't been vacuumed in a while (say the admin disabled
autovacuum for a while); that small 100-row table now has 1000 pages in
it, most of which are just bloat.  Will we miss this table?  Perhaps
basing this on reltuples would be better?


Well, this would only happen the first time, until the plain worker
processed the table; next time it would be picked up by the hot table
worker.  But yeah, we can build a better estimate using the same trick
the planner uses: estimate tuple density as reltuples/relpages, and
multiply by the actual number of blocks on disk.


Well, even skipping it the first time isn't good; anything that further
delays a hot table from getting vacuumed is bad.  Also, I'm not sure it
would just be the first time, since plain VACUUM isn't going to reclaim
most of the space, just mark it as reusable.  This is moot, however, if
we use a good metric; I have no idea whether what you suggest above
would be good enough.



A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.
I'm not sure it's a good idea to tie this to the vacuum cost delay
settings either, so let me ask you this: how is this better than just
allowing the admin to set a new GUC variable like
autovacuum_hot_table_size_threshold (or something shorter), to which we
can assign a decent default of, say, 8MB?


Yeah, maybe that's better -- it's certainly simpler.


Simple is better, at least until proven otherwise.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Matthew T. O'Connor wrote:
 I'm not sure it's a good idea to tie this to the vacuum cost delay
 settings either, so let me ask you this: how is this better than just
 allowing the admin to set a new GUC variable like
 autovacuum_hot_table_size_threshold (or something shorter), to which we
 can assign a decent default of, say, 8MB?

 Yeah, maybe that's better -- it's certainly simpler.

I'm not liking any of these very much, as they seem critically dependent
on impossible-to-tune parameters.  I think it'd be better to design this
around having the first worker explicitly expose its state (list of
tables to process, in order) and having subsequent workers key off that
info.  The shared memory state could include the OID of the table each
worker is currently working on, and we could keep the to-do list in some
simple flat file for instance (since we don't care about crash safety).

I'm not certain exactly what "key off" needs to mean; perhaps each
worker should make its own to-do list and then discard items that are
either in-progress or recently done by another worker when it gets to
them.

I think an absolute minimum requirement for a sane design is that no two
workers ever try to vacuum the same table concurrently, and I don't see
where that behavior will emerge from your proposal; whereas it's fairly
easy to make it happen if non-first workers pay attention to what other
workers are doing.

BTW, it's probably necessary to treat shared catalogs specially ...

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

Matthew T. O'Connor wrote:
I'm not sure it's a good idea to tie this to the vacuum cost delay
settings either, so let me ask you this: how is this better than just
allowing the admin to set a new GUC variable like
autovacuum_hot_table_size_threshold (or something shorter), to which we
can assign a decent default of, say, 8MB?



Yeah, maybe that's better -- it's certainly simpler.


I'm not liking any of these very much, as they seem critically dependent
on impossible-to-tune parameters.  I think it'd be better to design this
around having the first worker explicitly expose its state (list of
tables to process, in order) and having subsequent workers key off that
info.  The shared memory state could include the OID of the table each
worker is currently working on, and we could keep the to-do list in some
simple flat file for instance (since we don't care about crash safety).


So far we are only talking about one parameter, the
hot_table_size_threshold, which I agree would be a guess by an admin.
But if we went in this direction, I would also advocate adding a column
to the pg_autovacuum table that allows an admin to explicitly define a
table as hot or not.


Also, I think each worker should be mostly independent; the only caveat
is that (assuming each worker works in size order) if we catch up to
an older worker (get to the table they are currently working on) we
exit.  Personally I think this is all we need, but others felt the
additional threshold was needed.  What do you think?  Or what do you
think might be better?



I'm not certain exactly what key off needs to mean; perhaps each
worker should make its own to-do list and then discard items that are
either in-progress or recently done by another worker when it gets to
them.


My initial design didn't have any threshold at all, but others felt this
would/could result in too many workers working concurrently in the same DB.



I think an absolute minimum requirement for a sane design is that no two
workers ever try to vacuum the same table concurrently, and I don't see
where that behavior will emerge from your proposal; whereas it's fairly
easy to make it happen if non-first workers pay attention to what other
workers are doing.


Maybe we never made that clear, I was always working on the assumption 
that two workers would never try to work on the same table at the same time.



BTW, it's probably necessary to treat shared catalogs specially ...


Certainly.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Matthew T. O'Connor wrote:
  I'm not sure it's a good idea to tie this to the vacuum cost delay
  settings either, so let me ask you this: how is this better than just
  allowing the admin to set a new GUC variable like
  autovacuum_hot_table_size_threshold (or something shorter), to which we
  can assign a decent default of, say, 8MB?
 
  Yeah, maybe that's better -- it's certainly simpler.
 
 I'm not liking any of these very much, as they seem critically dependent
 on impossible-to-tune parameters.  I think it'd be better to design this
 around having the first worker explicitly expose its state (list of
 tables to process, in order) and having subsequent workers key off that
 info.  The shared memory state could include the OID of the table each
 worker is currently working on, and we could keep the to-do list in some
 simple flat file for instance (since we don't care about crash safety).
 
 I'm not certain exactly what key off needs to mean; perhaps each
 worker should make its own to-do list and then discard items that are
 either in-progress or recently done by another worker when it gets to
 them.
 
The real problem is trying to set that up in such a fashion that keeps
hot tables frequently vacuumed; that means that the 2nd daemon in a
database either needs to avoid large tables that will take a
'significant' length of time to vacuum, or you need to allow any number
of daemons in each database (which sounds like a good way to thrash the
machine).

 I think an absolute minimum requirement for a sane design is that no two
 workers ever try to vacuum the same table concurrently, and I don't see
 where that behavior will emerge from your proposal; whereas it's fairly
 easy to make it happen if non-first workers pay attention to what other
 workers are doing.

Isn't there a special lock acquired on a relation by vacuum? Can't we
just check for that? It seems much simpler than building out the ability
for daemons to see what each other is doing (and that still wouldn't
take manual vacuums into account).
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 08:11:44PM -0300, Alvaro Herrera wrote:
 Matthew T. O'Connor wrote:
  Alvaro Herrera wrote:
 
  The second mode is the hot table worker mode, enabled when the worker
  detects that there's already a worker in the database.  In this mode,
  the worker is limited to those tables that can be vacuumed in less than
  autovacuum_naptime, so large tables are not considered.  Because of
  this, it'll generally not compete with the first mode above -- the
  tables in plain worker were sorted by size, so the small tables were
  among the first vacuumed by the plain worker.  The estimated time to
  vacuum may be calculated according to autovacuum_vacuum_delay settings,
  assuming that all pages constitute cache misses.
  
  How can you determine what tables can be vacuumed within 
  autovacuum_naptime?
 
 My assumption is that
 pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

Need to take vacuum_cost_limit into account.
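
With the batching folded in, the estimate would presumably look more
like this (a sketch under the same all-pages-miss assumption, using the
default vacuum_cost_limit of 200):

def est_vacuum_time_ms(relpages, page_miss=10, cost_limit=200, delay_ms=10):
    # vacuum sleeps for delay_ms each time it accumulates cost_limit points
    total_points = relpages * page_miss        # every page assumed a miss
    return total_points / cost_limit * delay_ms

print(est_vacuum_time_ms(600))       # 300 ms, not 60 s, once batching is counted
print(est_vacuum_time_ms(120_000))   # 60000 ms: the naptime-sized table is far larger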

The advantage to keying this to autovac_naptime is that it means we
don't need another GUC, but after I suggested that before I realized
that's probably not the best idea. For example, I've seen clusters that
are running dozens to hundreds of databases; in that environment you really
need to turn naptime way down (to like a second). In that case you
wouldn't want to key to naptime.
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:
 Matthew T. O'Connor wrote:
 How can you determine what tables can be vacuumed within 
 autovacuum_naptime?
 
 My assumption is that
 pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to 
 vacuum
 
 This is of course not the reality, because the delay is not how long
 it takes to fetch the pages.  But it lets us have a value with which we
 can do something.  With the default values, vacuum_cost_delay=10,
 vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
 of under 600 pages, 4800 kB (should we include indexes here in the
 relpages count?  My guess is no).
 
 I'm not sure how pg_class.relpages is maintained, but what happens to a
 bloated table?  For example, take a 100-row table that is constantly
 updated and hasn't been vacuumed in a while (say the admin disabled
 autovacuum for a while); that small 100-row table now has 1000 pages in
 it, most of which are just bloat.  Will we miss this table?  Perhaps
 basing this on reltuples would be better?

The entire point of this is to ensure that the second daemon will only
vacuum tables that it can finish very quickly. If you let a table bloat
so it's too big, then you just can't vacuum it very frequently without
risking all your other hot tables bloating because they're no longer
getting vacuumed.

The reality is that you can actually vacuum a pretty good-sized table in
60 seconds with typical cost-delay settings (ie: defaults except
cost_delay set to 10). That means you can do 9 pages ~100 times a
second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
that's still 27k pages per minute.
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Jim C. Nasby wrote:

 The advantage to keying this to autovac_naptime is that it means we
 don't need another GUC, but after I suggested that before I realized
 that's probably not the best idea. For example, I've seen clusters that
 are running dozens to hundreds of databases; in that environment you really
 need to turn naptime way down (to like a second). In that case you
 wouldn't want to key to naptime.

Actually, I've been thinking that it would be a good idea to change the
semantics of autovacuum_naptime so that it means the average time to
start a worker in any given database.  That way, the time between
autovac runs is not dependent on the number of databases you have.

-- 
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
 I'm not liking any of these very much, as they seem critically dependent
 on impossible-to-tune parameters.  I think it'd be better to design this
 around having the first worker explicitly expose its state (list of
 tables to process, in order) and having subsequent workers key off that
 info.

 The real problem is trying to set that up in such a fashion that keeps
 hot tables frequently vacuumed;

Certainly, but it's not clear where that behavior emerges from Alvaro's
or Matthew's proposals, either.

Are we assuming that no single worker instance will vacuum a given table
more than once?  (That's not a necessary assumption, certainly, but
without it there are so many degrees of freedom that I'm not sure how
it should act.)  Given that assumption, the maximum vacuuming rate for
any table is once per autovacuum_naptime, and most of the magic lies in
the launcher's algorithm for deciding which databases to launch workers
into.

I'm inclined to propose an even simpler algorithm in which every worker
acts alike; its behavior is
1. On startup, generate a to-do list of tables to process, sorted in
priority order.
2. For each table in the list, if the table is still around and has not
been vacuumed by someone else since you started (including the case of
a vacuum-in-progress), then vacuum it.

Detecting "already vacuumed since you started" is a bit tricky; you
can't really rely on the stats collector since its info isn't very
up-to-date.  That's why I was thinking of exposing the to-do lists
explicitly; comparing those with an advertised current-table would
allow accurate determination of what had just gotten done.
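
In outline (Python pseudocode; build_todo_list, vacuumed_since, and the
other helpers are hypothetical stand-ins for the to-do-list and
current-table machinery just described):

def worker_main(start_time):
    todo = build_todo_list()        # step 1: priority-sorted at startup
    for table in todo:
        if not still_exists(table):
            continue                # dropped since the list was built
        if vacuumed_since(table, start_time) or in_progress_elsewhere(table):
            continue                # step 2: someone else got to it
        vacuum(table)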

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
[ oh, I forgot to respond to this: ]

Jim C. Nasby [EMAIL PROTECTED] writes:
 Isn't there a special lock acquired on a relation by vacuum? Can't we
 just check for that?

I think you're thinking that ConditionalLockRelation solves the problem,
but it does not, because it will fail if someone has taken a (quasi)
exclusive lock unrelated to vacuuming.  You don't want an application
that frequently takes short-term ExclusiveLocks on a table to thereby
cause autovacuum to frequently skip that table.

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Tom Lane wrote:

 I think an absolute minimum requirement for a sane design is that no two
 workers ever try to vacuum the same table concurrently, and I don't see
 where that behavior will emerge from your proposal; whereas it's fairly
 easy to make it happen if non-first workers pay attention to what other
 workers are doing.

FWIW, I've always considered this to be a very important and obvious
issue, and I think I've neglected mentioning it (or maybe I mentioned it
too few times).  But I think this is pretty easy to do: just have each
worker advertise the current table it's working on in shared memory, and
add a recheck loop to the table-pick algorithm (with appropriate grabs
of the autovacuum LWLock) to make sure no one starts to vacuum the same
table you're going to process at the same time.

-- 
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
I'm not sure how pg_class.relpages is maintained, but what happens to a
bloated table?  For example, take a 100-row table that is constantly
updated and hasn't been vacuumed in a while (say the admin disabled
autovacuum for a while); that small 100-row table now has 1000 pages in
it, most of which are just bloat.  Will we miss this table?  Perhaps
basing this on reltuples would be better?


The entire point of this is to ensure that the second daemon will only
vacuum tables that it can finish very quickly. If you let a table bloat
so it's too big, then you just can't vacuum it very frequently without
risking all your other hot tables bloating because they're no longer
getting vacuumed.

The reality is that you can actually vacuum a pretty good-sized table in
60 seconds with typical cost-delay settings (ie: defaults except
cost_delay set to 10). That means you can do 9 pages ~100 times a
second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
that's still 27k pages per minute.


At the risk of sounding like a broken record, I still think the size
limit threshold is unnecessary.  Since all workers will be working on
tables in size order, younger workers will typically catch older workers
fairly quickly, since the tables will be either small or recently
vacuumed and thus not in need of work.  And since younger workers exit
when they catch up to an older worker, there is some inherent stability
in the number of workers.


Here is a worst-case example: a DB with 6 tables, all of which are
highly active and will need to be vacuumed constantly.  While this is
totally hypothetical, it is how I envision things working (without the
threshold).


table1:10 rows
table2:   100 rows
table3: 1,000 rows
table4:10,000 rows
table5:   100,000 rows
table6: 1,000,000 rows

time=0*naptime: No workers in the DB.

time=1*naptime: worker1 starts on table1.

time=2*naptime: worker1 has finished table1, table2, and table3; it's now
working on table4. worker2 starts on table1.

time=3*naptime: worker1 is on table5, worker2 is working on table4,
worker3 starts on table1.

time=4*naptime: worker1 is still on table5; worker2 has caught up to
worker1 and exits; worker3 also catches up to worker1, since tables 2-4
didn't require vacuuming at this time, so it exits too; worker4 starts
on table1.

time=5*naptime: worker1 is working on table6, worker4 is up to table4,
worker5 starts on table1.

time=6*naptime: worker1 is working on table6; worker4 catches up to
worker1 and exits; worker5 finds no additional work to be done and
exits; worker6 starts at table1.

time=7*naptime: worker1 still working on table6, worker6 is up to
table4, worker7 starts at table1.

time=8*naptime: worker1 still working on table6, worker6 still working
on table4, worker7 working on table3, worker8 starting on table1.

time=9*naptime: worker1 still working on table6; worker6 working on
table5; worker7 catches worker6 and exits; worker8 finds nothing more
to do and exits; worker9 starts on table1.

time=10*naptime: worker1 still working on table6, worker9 working on
table4, worker10 starts on table1.
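
A toy simulation of that policy (Python; the per-table durations are
invented, so the exact trace differs from the timeline above, but the
self-limiting worker count shows up):

vacuum_ticks = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 6}  # table -> naptimes needed
workers = []   # oldest first; each is {"table": n, "left": ticks remaining}

for tick in range(1, 11):
    for w in workers:                        # advance everyone one naptime
        w["left"] -= 1
        while w["left"] == 0:
            w["table"] += 1                  # move on to the next-larger table
            if w["table"] > 6:
                w["left"] = None             # finished the whole to-do list
                break
            w["left"] = vacuum_ticks[w["table"]]
    survivors, busy = [], set()
    for w in workers:                        # apply the catch-up-and-exit rule
        if w["left"] is None or w["table"] in busy:
            continue                         # done, or caught an older worker
        busy.add(w["table"])
        survivors.append(w)
    workers = survivors + [{"table": 1, "left": 1}]  # a new worker each naptime
    print(tick, [w["table"] for w in workers])       # worker count stays bounded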





Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I think an absolute minimum requirement for a sane design is that no two
 workers ever try to vacuum the same table concurrently,

 FWIW, I've always considered this to be a very important and obvious
 issue, and I think I've neglected mentioning it (maybe I did too few
 times).  But I think this is pretty easy to do, just have each worker
 advertise the current table it's working on in shared memory, and add a
 recheck loop on the table-pick algorithm (with appropriate grabs of the
 autovacuum lwlock), to make sure no one starts to vacuum the same table
 you're going to process, at the same time.

Well, any of these proposals need that at the bottom level, to prevent
race conditions.  But I'd prefer a design that wasn't positively
encouraging multiple workers to try to pick the same table concurrently.
Not only is that wasteful, but it makes it harder to predict what is the
real behavior that emerges after race conditions are backed off from.

BTW, to what extent might this whole problem be simplified if we adopt
chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
If the unit of work has a reasonable upper bound regardless of table
size, maybe the problem of big tables starving small ones goes away.

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Jim C. Nasby [EMAIL PROTECTED] writes:

The real problem is trying to set that up in such a fashion that keeps
hot tables frequently vacuumed;


Are we assuming that no single worker instance will vacuum a given table
more than once?  (That's not a necessary assumption, certainly, but
without it there are so many degrees of freedom that I'm not sure how
it should act.)  Given that assumption, the maximum vacuuming rate for
any table is once per autovacuum_naptime, and most of the magic lies in
the launcher's algorithm for deciding which databases to launch workers
into.


Yes, I have been working under the assumption that a worker goes through
the list of tables once and exits, and yes, the maximum vacuuming rate
for any table would be once per autovacuum_naptime.  We can lower the
default if necessary; as far as I'm concerned it's (or should be) fairly
cheap to fire off a worker and have it find that there isn't anything
to do and exit.



I'm inclined to propose an even simpler algorithm in which every worker
acts alike; its behavior is
1. On startup, generate a to-do list of tables to process, sorted in
priority order.
2. For each table in the list, if the table is still around and has not
been vacuumed by someone else since you started (including the case of
a vacuum-in-progress), then vacuum it.


That is what I'm proposing, except for one difference: when you catch up
to an older worker, exit.  This has the benefit of reducing the number of
workers concurrently working on big tables, which I think is a good thing.



Detecting already vacuumed since you started is a bit tricky; you
can't really rely on the stats collector since its info isn't very
up-to-date.  That's why I was thinking of exposing the to-do lists
explicitly; comparing those with an advertised current-table would
allow accurate determination of what had just gotten done.


Sounds good, but I have very little insight into how we would implement
"already vacuumed since you started" or "have I caught up to another
worker".




Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

BTW, to what extent might this whole problem be simplified if we adopt
chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
If the unit of work has a reasonable upper bound regardless of table
size, maybe the problem of big tables starving small ones goes away.


So if we adopted chunk-at-a-time, then perhaps each worker processes the
list of tables in OID order (or some unique and stable order) and does
one chunk per table that needs vacuuming.  This way an equal amount of
bandwidth is given to all tables.
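
That loop might look like this (a sketch; tables_needing_vacuum and
vacuum_one_chunk are hypothetical helpers, and one "chunk" is one
fill-workmem-and-scan-indexes cycle):

def worker_pass():
    # one chunk per table that needs work, in a stable (OID) order
    for table in sorted(tables_needing_vacuum(), key=lambda t: t.oid):
        vacuum_one_chunk(table)   # resumes wherever the last chunk stopped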


That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?


Matt




Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 Tom Lane wrote:
 I'm inclined to propose an even simpler algorithm in which every worker
 acts alike;

 That is what I'm proposing except for one difference, when you catch up 
 to an older worker, exit.

No, that's a bad idea, because it means that any large table starves
even-larger tables.

(Note: in all this I assume we're all using "size" as a shorthand for
some sort of priority metric that considers the number of dirty tuples,
not only size.  We don't want every worker insisting on passing over
every small read-only table every time, for instance.)

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?

It seems fairly trivial to me to have a scheme where you do one
fill-workmem-and-scan-indexes cycle per invocation, and store the
next-heap-page-to-scan in some handy place (new pg_class column updated
along with relpages/reltuples, likely).  Galy is off in left field with
some far more complex ideas :-( but I don't see that there's all that
much needed to support this behavior ... especially if we don't expose
it to the SQL level but only support it for autovac's use.  Then we're
not making any big commitment to support the behavior forever.
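
A sketch of what one such invocation might track (Python pseudocode; the
next_scan_page column and all helpers are hypothetical, following the
suggestion above):

def vacuum_one_chunk(table, max_dead_tids):
    page = next_scan_page(table)            # the suggested new pg_class column
    dead = []
    while page < table.relpages and len(dead) < max_dead_tids:
        dead.extend(collect_dead_tids(table, page))   # the heap-scan half
        page += 1
    remove_from_indexes(dead)               # one scan-indexes cycle
    cleanup_heap_pages(dead)
    # wrap back to page 0 once the whole heap has been covered
    save_next_scan_page(table, 0 if page >= table.relpages else page)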

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:

Tom Lane wrote:

I'm inclined to propose an even simpler algorithm in which every worker
acts alike;


That is what I'm proposing except for one difference, when you catch up 
to an older worker, exit.


No, that's a bad idea, because it means that any large table starves
even-larger tables.


True, but the assumption I'm making is that there is a finite amount of
bandwidth available, and more concurrent activity will have a net
negative effect on the time it takes to vacuum all tables.  I'm willing
to pay that price to prevent small hot tables from getting starved, but
less willing to pay the same price for large tables, where the
percentage of bloat will be much smaller.



(Note: in all this I assume we're all using "size" as a shorthand for
some sort of priority metric that considers the number of dirty tuples,
not only size.  We don't want every worker insisting on passing over
every small read-only table every time, for instance.)


I was using size to mean reltuples.  The whole concept of sorting by 
size was designed to ensure that smaller (more susceptible to bloat) 
tables got priority.  It might be useful for workers to sort their to-do 
lists by some other metric, but I don't have a clear vision of what that 
might be.





Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:

That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?


It seems fairly trivial to me to have a scheme where you do one
fill-workmem-and-scan-indexes cycle per invocation, and store the
next-heap-page-to-scan in some handy place (new pg_class column updated
along with relpages/reltuples, likely).  Galy is off in left field with
some far more complex ideas :-( but I don't see that there's all that
much needed to support this behavior ... especially if we don't expose
it to the SQL level but only support it for autovac's use.  Then we're
not making any big commitment to support the behavior forever.


Well, if we can make it happen soon, it might be the best thing for 
autovacuum.




Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
  The advantage to keying this to autovac_naptime is that it means we
  don't need another GUC, but after I suggested that before I realized
  that's probably not the best idea. For example, I've seen clusters that
  are running dozens to hundreds of databases; in that environment you really
  need to turn naptime way down (to like a second). In that case you
  wouldn't want to key to naptime.
 
 Actually, I've been thinking that it would be a good idea to change the
 semantics of autovacuum_naptime so that it means the average time to
 start a worker in any given database.  That way, the time between
 autovac runs is not dependent on the number of databases you have.

Hrm... how would that work?

BTW, another thought is to only sleep if you've scanned through every
database and found nothing to do.
-- 
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 10:48:49PM -0500, Tom Lane wrote:
 Matthew T. O'Connor matthew@zeut.net writes:
  That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?
 
 It seems fairly trivial to me to have a scheme where you do one
 fill-workmem-and-scan-indexes cycle per invocation, and store the
 next-heap-page-to-scan in some handy place (new pg_class column updated
 along with relpages/reltuples, likely).  Galy is off in left field with
 some far more complex ideas :-( but I don't see that there's all that
 much needed to support this behavior ... especially if we don't expose
 it to the SQL level but only support it for autovac's use.  Then we're
 not making any big commitment to support the behavior forever.

The problem I see there is that the case we're trying to fix is tables
that need to be vacuumed every few minutes. As I posted elsewhere, it's
reasonable to assume a vacuum rate of ~1000 pages/second for a small
table that's going to be in memory (assuming that vacuum dirties every
page). That means that you can only dirty about 60k pages per cycle,
which seems way too small to be practical unless we come up with a way to
avoid scanning the indexes on every cycle.

The proposal to save enough state to be able to resume a vacuum at
pretty much any point in its cycle might work; we'd have to benchmark
it.  With the default maintenance_work_mem of 128M it would mean writing
out 64M of state every minute on average, which is likely to take
several seconds to fsync (though, maybe we wouldn't need to fsync it...)
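
Spelling out the arithmetic behind both estimates, under the stated 
assumptions (~1000 pages/s, 60 s cycles, dead-TID array half full on 
average):

    pages_per_sec = 1000                  # assumed rate for an in-memory table
    cycle_sec = 60
    print(pages_per_sec * cycle_sec)      # 60000 dirtied pages per cycle

    maintenance_work_mem = 128 * 1024**2  # default, in bytes
    avg_state = maintenance_work_mem // 2 # dead-TID array half full on average
    print(avg_state // 1024**2, "MB of state written per save")   # 64 MB
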
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 The proposal to save enough state to be able to resume a vacuum at
 pretty much any point in its cycle might work; we'd have to benchmark
 it.  With the default maintenance_work_mem of 128M it would mean writing
 out 64M of state every minute on average, which is likely to take
 several seconds to fsync (though, maybe we wouldn't need to fsync it...)

Which is exactly why we needn't bother benchmarking it.  Even if it
weren't complex and unsafe, it would be a net loss when you consider the
fact that it adds I/O instead of removing it.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
 I'm not sure how pg_class.relpages is maintained, but what happens to a 
 bloated table?  For example, take a 100-row table that is constantly updated 
 and hasn't been vacuumed in a while (say the admin disabled autovacuum 
 for a while); now that small 100-row table has 1000 pages in it, most of 
 which are just bloat.  Will we miss this table?  Perhaps basing this on 
 reltuples would be better?
 
 The entire point of this is to ensure that the second daemon will only
 vacuum tables that it can finish very quickly. If you let a table bloat
 so it's too big, then you just can't vacuum it very frequently without
 risking all your other hot tables bloating because they're no longer
 getting vacuumed.
 
 The reality is that you can actually vacuum a pretty good-sized table in
 60 seconds with typical cost-delay settings (ie: defaults except
 cost_delay set to 10). That means you can do 9 pages ~100 times a
 second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
 that's still 27k pages per minute.
 
 At the risk of sounding like a broken record, I still think the size 
 limit threshold is unnecessary.  Since all workers will be working on 
 tables in size order, younger workers will typically catch older workers 
 fairly quickly, since the tables will be either small or recently 
 vacuumed and not in need of work.  And since younger workers exit when they 
 catch up to an older worker, there is some inherent stability in the 
 number of workers.
 
 Here is a worst case example: A DB with 6 tables all of which are highly 
 active and will need to be vacuumed constantly.  While this is totally 
 hypothetical, it is how I envision things working (without the threshold).

I fail to see how a simple 6 table case is 'worst case'. It's common to
see hundreds of tables, and I've run across more than one database with
thousands of tables (think partitioning). In cases like those it's
certainly possible, perhaps even likely that you would get many daemons
running in the database at one time just from different tables suddenly
needing vacuuming and appearing at a higher point in the list than other
tables. With 100 ~1G tables getting updates it certainly wouldn't be
hard to end up with 10 of those being vacuumed all at the same time.

I do like the idea since it should be easier to tune, but I think we
still need some limit on it. Perhaps as a first-pass we could just have
a hard limit and log a message and/or set a flag any time we hit it.
That would hopefully allow us to get information about how big a problem
it really is. We could go one step further and say that the last daemon
that can start in a database will only vacuum tables that can be done
quickly; that's essentially what we've been talking about, except the
limit we've been discussing would be hard-coded at 2.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:37:42AM -0500, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  The proposal to save enough state to be able to resume a vacuum at
  pretty much any point in its cycle might work; we'd have to benchmark
  it.  With the default maintenance_work_mem of 128M it would mean writing
  out 64M of state every minute on average, which is likely to take
  several seconds to fsync (though, maybe we wouldn't need to fsync it...)
 
 Which is exactly why we needn't bother benchmarking it.  Even if it
 weren't complex and unsafe, it would be a net loss when you consider the
 fact that it adds I/O instead of removing it.

Well, it depends on how often you're doing that. Adding extra IO at the
end of 4 hours of vacuuming isn't going to make any real difference, but
once a minute...

Looks like partial vacuum won't help this problem. :(
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:

Jim C. Nasby wrote:
Here is a worst case example: A DB with 6 tables all of which are highly 
active and will need to be vacuumed constantly.  While this is totally 
hypothetical, it is how I envision things working (without the threshold).


I fail to see how a simple 6 table case is 'worst case'. It's common to
see hundreds of tables, and I've run across more than one database with
thousands of tables (think partitioning). In cases like those it's
certainly possible, perhaps even likely that you would get many daemons
running in the database at one time just from different tables suddenly
needing vacuuming and appearing at a higher point in the list than other
tables. With 100 ~1G tables getting updates it certainly wouldn't be
hard to end up with 10 of those being vacuumed all at the same time.


Yes 6 tables is small, the worst-case part of the example was that all 
the tables would need to be vacuumed constantly.  Most databases only 
have a few hot tables.  Most tables only need to be vacuumed every once in 
a while.



I do like the idea since it should be easier to tune, but I think we
still need some limit on it. Perhaps as a first-pass we could just have
a hard limit and log a message and/or set a flag any time we hit it.
That would hopefully allow us to get information about how big a problem
it really is. We could go one step further and say that the last daemon
that can start in a database will only vacuum tables that can be done
quickly; that's essentially what we've been talking about, except the
limit we've been discussing would be hard-coded at 2.


I'm confused: what limit would be set at 2?  The number of concurrent 
workers?  I've never said that.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Jim C. Nasby [EMAIL PROTECTED] writes:

The proposal to save enough state to be able to resume a vacuum at
pretty much any point in its cycle might work; we'd have to benchmark
it.  With the default maintenance_work_mem of 128M it would mean writing
out 64M of state every minute on average, which is likely to take
several seconds to fsync (though, maybe we wouldn't need to fsync it...)


Which is exactly why we needn't bother benchmarking it.  Even if it
weren't complex and unsafe, it would be a net loss when you consider the
fact that it adds I/O instead of removing it.


I'm not sure what you are saying here.  Are you now saying that partial 
vacuum won't work for autovac?  Or are you saying that saving state as 
Jim is describing above won't work?


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 I'm not sure what you are saying here.  Are you now saying that partial 
 vacuum won't work for autovac?  Or are you saying that saving state as 
 Jim is describing above won't work?

I'm saying that I don't like the idea of trying to stop on a dime by
saving the current contents of vacuum's dead-TID array to disk with the
idea that we can trust those values 100% later.  Saving the array is
expensive both in runtime and code complexity, and I don't believe we
can trust it later --- at least not without even more expensive-and-
complex measures, such as WAL-logging every such save :-(

I'm for stopping only after completing an index-cleaning pass, at the
point where we empty the dead-TID array anyway.  If you really have to
have stop on a dime, just kill -INT the process, accepting that you
will have to redo your heap scan since the last restart point.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:
I'm not sure what you are saying here.  Are you now saying that partial 
vacuum won't work for autovac?  Or are you saying that saving state as 
Jim is describing above won't work?


I'm saying that I don't like the idea of trying to stop on a dime by
saving the current contents of vacuum's dead-TID array to disk with the
idea that we can trust those values 100% later.  Saving the array is
expensive both in runtime and code complexity, and I don't believe we
can trust it later --- at least not without even more expensive-and-
complex measures, such as WAL-logging every such save :-(

I'm for stopping only after completing an index-cleaning pass, at the
point where we empty the dead-TID array anyway.  If you really have to
have stop on a dime, just kill -INT the process, accepting that you
will have to redo your heap scan since the last restart point.


OK, so if I understand correctly, a vacuum of a table with 10 indexes on 
it can be interrupted 10 times, once after each index-cleaning pass? 
That might have some value, especially in breaking up the work required to 
vacuum a large table. Or am I still not getting it?


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-23 Thread Alvaro Herrera
Jim C. Nasby wrote:
 On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:

  I'm not sure this is a great idea, but I don't see how this would result 
  in large numbers of workers working in one database.   If workers work 
  on tables in size order, and exit as soon as they catch up to an older 
  worker, I don't see the problem.  Newer workers are going to catch up to 
  older workers pretty quickly since small tables will vacuum fairly quickly.
 
 The reason that won't necessarily happen is that you can get large
 tables popping up as needing vacuuming at any time.

Right.

We know that a table that needs frequent vacuum necessarily has to be
small -- so maybe have the second worker exit when it catches up with
the first, or when the next table is above 1 GB, whichever happens
first.  That way, only the first worker can be processing the huge
tables.  The problem with this is that if one of your hot tables grows
a bit larger than 1 GB, you suddenly have a change in autovacuuming
behavior, for no really good reason.

And while your second worker is processing the tables in the hundreds-MB
range, your high-update 2 MB tables are neglected :-(

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-23 Thread Jim C. Nasby
On Fri, Feb 23, 2007 at 01:22:17PM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
  On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
 
   I'm not sure this is a great idea, but I don't see how this would result 
   in large numbers of workers working in one database.   If workers work 
   on tables in size order, and exit as soon as they catch up to an older 
   worker, I don't see the problem.  Newer works are going to catch-up to 
   older workers pretty quickly since small tables will vacuum fairly 
   quickly.
  
  The reason that won't necessarily happen is that you can get large
  tables popping up as needing vacuuming at any time.
 
 Right.
 
 We know that a table that needs frequent vacuum necessarily has to be
 small -- so maybe have the second worker exit when it catches up with
 the first, or when the next table is above 1 GB, whichever happens
 first.  That way, only the first worker can be processing the huge
 tables.  The problem with this is that if one of your hot tables grows
 a bit larger than 1 GB, you suddenly have a change in autovacuuming
 behavior, for no really good reason.
 
 And while your second worker is processing the tables in the hundreds-MB
 range, your high-update 2 MB tables are neglected :-(

That's why I'm thinking it would be best to keep the maximum size of
tables for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to hit
the high-update tables every X seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger side, if
it's way over the threshold the second vacuum will hit it.

You know, maybe the best way to handle this is to force both vacuums to
exit after a certain amount of time, probably with a longer time limit
for the first vacuum in a database. That would mean that after
processing a large table for 10 minutes, the first vacuum would
exit and re-evaluate what work needs to be done, so
medium-sized tables wouldn't get completely starved.
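
A sketch of that time-limit idea (illustrative names and limits, not 
proposed GUCs): each worker gets a deadline and re-evaluates once it 
expires:

    import time

    def run_worker(todo, is_first, vacuum_table,
                   first_limit=600.0, later_limit=60.0):
        deadline = time.monotonic() + (first_limit if is_first else later_limit)
        for table in todo:
            if time.monotonic() >= deadline:
                return "re-evaluate"   # exit and rebuild the to-do list
            vacuum_table(table)
        return "done"

    print(run_worker(["t1", "t2", "t3"], is_first=False,
                     vacuum_table=lambda t: None))   # -> done
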
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Zeugswetter Andreas ADI SD

 vacuum should be a process with the least amount of voodoo. 
 If we can just have vacuum_delay and vacuum_threshold, where 
 threshold allows an arbitrary setting of how much bandwidth 
 we will allot to the process, then that is a beyond wonderful thing.
 
 It is easy to determine how much IO you have, and what you can spare.

The tricky part is what metric to use.  IMHO IO per second would be good.
In a typical DB scenario that is the IO bottleneck, not the MB/s.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
 My Proposal:  If we require admins to identify hot tables, then: 
 1) Launcher fires off a worker1 into database X.
 2) worker1 deals with hot tables first, then regular tables.
 3) Launcher continues to launch workers to DB X every autovac naptime. 
 4) worker2 (or 3 or 4 etc...) checks whether it is alone in DB X; if so, it acts as 
 worker1 did above.  If worker1 is still working in DB X then worker2 
 looks for hot tables that are being starved because worker1 got busy. 
 If worker2 finds no hot tables that need work, then worker2 exits.

Rather than requiring people to manually identify hot tables, what if we
just prioritize based on table size? So if a second autovac process hits
a specific database, it would find the smallest table in need of
vacuuming that it should be able to complete before the next naptime and
vacuum that. It could even continue picking tables until it can't find
one that it could finish within the naptime. Granted, it would have to
make some assumptions about how many pages it would dirty.

ISTM that's a lot easier than forcing admins to mark specific tables.
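
In sketch form, the second worker's table selection could be a simple 
page-budget walk of the size-ordered list (hypothetical names; assumes, 
as elsewhere in this thread, that vacuum dirties every page):

    def pick_tables(needing_vacuum, naptime_sec, pages_per_sec=1000):
        budget = naptime_sec * pages_per_sec
        picked = []
        for t in sorted(needing_vacuum, key=lambda t: t["relpages"]):
            if t["relpages"] > budget:
                break                  # everything after is bigger still
            picked.append(t)
            budget -= t["relpages"]
        return picked

    tables = [{"name": "big",  "relpages": 500000},
              {"name": "hot1", "relpages": 300},
              {"name": "hot2", "relpages": 1200}]
    print([t["name"] for t in pick_tables(tables, naptime_sec=60)])
    # ['hot1', 'hot2'] -- 'big' would blow the 60000-page budget
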
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
  
My Proposal:  If we require admins to identify hot tables, then: 
1) Launcher fires off a worker1 into database X.

2) worker1 deals with hot tables first, then regular tables.
3) Launcher continues to launch workers to DB X every autovac naptime. 
4) worker2 (or 3 or 4 etc...) checks whether it is alone in DB X; if so, it acts as 
worker1 did above.  If worker1 is still working in DB X then worker2 
looks for hot tables that are being starved because worker1 got busy. 
If worker2 finds no hot tables that need work, then worker2 exits.



Rather than requiring people to manually identify hot tables, what if we
just prioritize based on table size? So if a second autovac process hits
a specific database, it would find the smallest table in need of
vacuuming that it should be able to complete before the next naptime and
vacuum that. It could even continue picking tables until it can't find
one that it could finish within the naptime. Granted, it would have to
make some assumptions about how many pages it would dirty.

ISTM that's a lot easier than forcing admins to mark specific tables.


So the heuristic would be:
* Launcher fires off workers into a database at a given interval 
(perhaps configurable?)
* Each worker works on tables in size order. 
* If a worker ever catches up to an older worker, then the younger 
worker exits.


This sounds simple and workable to me, perhaps we can later modify this 
to include some max_workers variable so that a worker would only exit if 
it catches an older worker and there are max_workers currently active.
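
A toy rendering of that heuristic (hypothetical; real workers would 
coordinate through shared memory rather than a passed-in position):

    def worker_pass(todo_by_size, older_worker_pos, vacuum):
        # older_worker_pos is the index the older worker has reached;
        # None means this worker is alone in the database.
        for i, table in enumerate(todo_by_size):
            if older_worker_pos is not None and i >= older_worker_pos:
                return "caught older worker, exiting"
            vacuum(table)
        return "finished list"

    tables = ["tiny", "small", "medium", "large", "huge"]
    print(worker_pass(tables, older_worker_pos=3, vacuum=print))
    # vacuums tiny, small, medium, then exits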



Thoughts?

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
   
 My Proposal:  If we require admins to identify hot tables, then: 
 1) Launcher fires off a worker1 into database X.
 2) worker1 deals with hot tables first, then regular tables.
 3) Launcher continues to launch workers to DB X every autovac naptime. 
 4) worker2 (or 3 or 4 etc...) checks whether it is alone in DB X; if so, it acts as 
 worker1 did above.  If worker1 is still working in DB X then worker2 
 looks for hot tables that are being starved because worker1 got busy. 
 If worker2 finds no hot tables that need work, then worker2 exits.
 
 
 Rather than requiring people to manually identify hot tables, what if we
 just prioritize based on table size? So if a second autovac process hits
 a specific database, it would find the smallest table in need of
 vacuuming that it should be able to complete before the next naptime and
 vacuum that. It could even continue picking tables until it can't find
 one that it could finish within the naptime. Granted, it would have to
 make some assumptions about how many pages it would dirty.
 
 ISTM that's a lot easier than forcing admins to mark specific tables.
 
 So the heuristic would be:
 * Launcher fires off workers into a database at a given interval 
 (perhaps configurable?)
 * Each worker works on tables in size order. 
 * If a worker ever catches up to an older worker, then the younger 
 worker exits.
 
 This sounds simple and workable to me, perhaps we can later modify this 
 to include some max_workers variable so that a worker would only exit if 
 it catches an older worker and there are max_workers currently active.

That would likely result in a number of workers running in one database,
unless you limited how many workers per database. And if you did that,
you wouldn't be addressing the frequently updated table problem.

A second vacuum in a database *must* exit after a fairly short time so
that we can go back in and vacuum the important tables again (well, or
the 2nd vacuum has to periodically re-evaluate what tables need to be
vacuumed).
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Thu, Feb 22, 2007 at 09:35:45AM +0100, Zeugswetter Andreas ADI SD wrote:
 
  vacuum should be a process with the least amount of voodoo. 
  If we can just have vacuum_delay and vacuum_threshold, where 
  threshold allows an arbitrary setting of how much bandwidth 
  we will allot to the process, then that is a beyond wonderful thing.
  
  It is easy to determine how much IO you have, and what you can spare.
 
  The tricky part is what metric to use.  IMHO IO per second would be good.
  In a typical DB scenario that is the IO bottleneck, not the MB/s.

Well, right now they're one and the same... but yeah, IO/sec probably
does make more sense.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Zeugswetter Andreas ADI SD

   vacuum should be a process with the least amount of voodoo. 
   If we can just have vacuum_delay and vacuum_threshold, where 
   threshold allows an arbitrary setting of how much bandwidth we 
   will allot to the process, then that is a beyond wonderful thing.
   
   It is easy to determine how much IO you have, and what you can spare.
  
  The tricky part is what metric to use.  IMHO IO per second would be good.
  In a typical DB scenario that is the IO bottleneck, not the MB/s.
 
 Well, right now they're one and the same... but yeah, IO/sec 
 probably does make more sense.

Hopefully not :-)  Else you have no readahead, and that is IMHO the
problem.  You need to anticipate how many physical IOs your logical
IOs cause, and this is nearly impossible unless we group IOs in pg itself.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
  

So the heuristic would be:
* Launcher fires off workers into a database at a given interval 
(perhaps configurable?)
* Each worker works on tables in size order. 
* If a worker ever catches up to an older worker, then the younger 
worker exits.


This sounds simple and workable to me, perhaps we can later modify this 
to include some max_workers variable so that a worker would only exit if 
it catches an older worker and there are max_workers currently active.



That would likely result in a number of workers running in one database,
unless you limited how many workers per database. And if you did that,
you wouldn't be addressing the frequently updated table problem.

A second vacuum in a database *must* exit after a fairly short time so
that we can go back in and vacuum the important tables again (well, or
the 2nd vacuum has to periodically re-evaluate what tables need to be
vacuumed).
  


I'm not sure this is a great idea, but I don't see how this would result 
in large numbers of workers working in one database.   If workers work 
on tables in size order, and exit as soon as they catch up to an older 
worker, I don't see the problem.  Newer workers are going to catch up to 
older workers pretty quickly since small tables will vacuum fairly quickly.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
   
 So the heuristic would be:
 * Launcher fires off workers into a database at a given interval 
 (perhaps configurable?)
 * Each worker works on tables in size order. 
 * If a worker ever catches up to an older worker, then the younger 
 worker exits.
 
 This sounds simple and workable to me, perhaps we can later modify this 
 to include some max_workers variable so that a worker would only exit if 
 it catches an older worker and there are max_workers currently active.
 
 
 That would likely result in a number of workers running in one database,
 unless you limited how many workers per database. And if you did that,
 you wouldn't be addressing the frequently updated table problem.
 
 A second vacuum in a database *must* exit after a fairly short time so
 that we can go back in and vacuum the important tables again (well, or
 the 2nd vacuum has to periodically re-evaluate what tables need to be
 vacuumed).
   
 
 I'm not sure this is a great idea, but I don't see how this would result 
 in large numbers of workers working in one database.   If workers work 
 on tables in size order, and exit as soon as they catch up to an older 
 worker, I don't see the problem.  Newer workers are going to catch up to 
 older workers pretty quickly since small tables will vacuum fairly quickly.

The reason that won't necessarily happen is that you can get large
tables popping up as needing vacuuming at any time.

Round 1:
Fire up autovac worker; starts working and soon hits 100G table

Round 2:
Another worker starts. Since round 1, a 98G table now needs vacuuming,
which this worker soon hits.

Round 3:
89G table now needs vacuuming. Worker 3 starts up and soon hits it.

So now we have 3 workers, all hammering away in the same database, and
likely causing a lot of random IO.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Alvaro Herrera
Ok, scratch that :-)  Another round of braindumping below.


Launcher starts one worker in each database.  This worker is not going
to do vacuum work, just report how much vacuum effort is needed in the
database.  Vacuum effort is measured as the total number of pages in
need of vacuum, being the sum of relpages of all tables and indexes
needing vacuum.  (Note: we weight heap pages the same as index pages.
Is this OK?)

Create a plan for vacuuming all those databases within the constraints
of max_workers.  Databases needing the most work are vacuumed first.
One worker per database.  Thus max_workers databases are being vacuumed
in parallel at this time.  When one database is finished, the launcher
starts a worker in the next database in the list.

When the plan is complete (i.e. the list is empty) we can do the whole
thing again, excluding the databases that are still being vacuumed.

Perhaps we should wait autovacuum_naptime seconds between finishing one
vacuum round in all databases and starting the next.  How do we measure
this: do we start sleeping when the last worker finishes, or when the
list is empty?

Perhaps we should reserve a worker for vacuuming hot tables.  Launcher
then uses max_workers-1 workers for the above plan, and the spare worker
is continuously connecting to one database, vacuuming hot tables, and going
away; the launcher then starts it again to connect to the next database.
Definitional problem: how to decide what's a hot table?  One idea (the
simplest) is to let the DBA define it.

Thus, at most two workers are on any database: one of them is working on
normal tables, the other on hot tables.

(This idea can be complemented by having another GUC var,
autovacuum_hot_workers, which allows the DBA to have more than one
worker on hot tables (just for the case where there are too many hot
tables).  This may be overkill.)
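
A condensed sketch of the plan so far (hypothetical structures; effort is 
pages needing vacuum, heap and index pages weighted equally, and one 
worker slot held back for hot-table duty):

    def plan_round(databases, max_workers):
        by_effort = sorted(databases, key=lambda d: d["effort"], reverse=True)
        slots = max_workers - 1            # one spare worker for hot tables
        return by_effort[:slots], by_effort[slots:]   # running, queued

    dbs = [{"name": "a", "effort": 120000},
           {"name": "b", "effort": 5000},
           {"name": "c", "effort": 900000}]
    running, queued = plan_round(dbs, max_workers=3)
    print([d["name"] for d in running], [d["name"] for d in queued])
    # ['c', 'a'] ['b'] -- queued databases launch as workers finish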


Ron Mayer expressed the thought that we're complicating needlessly the
UI for vacuum_delay, naptime, etc.  He proposes that instead of having
cost_delay etc, we have an mbytes_per_second parameter of some sort.
This strikes me as a good idea, but I think we could do that after this
proposal is implemented.  So this take 2 could be implemented, and
then we could switch the cost_delay stuff to using a MB/s kind of
measurement somehow (he says waving his hands wildly).

Greg Stark and Matthew O'Connor say that we're misdirected in having
more than one worker per tablespace.  I say we're not :-)  If we
consider Ron Mayer's idea of measuring MB/s, but we do it per
tablespace, then we would inflict the correct amount of vacuum pain to
each tablespace, sleeping as appropriate.  I think this would require
workers of different databases to communicate what tablespaces they are
using, so that all of them can utilize the correct amount of bandwidth.


I'd like to know if this responds to the mentioned people's objections.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Ok, scratch that :-)  Another round of braindumping below.


I still think this is a solution in search of a problem.  The main problem 
we have right now is that hot tables can be starved from vacuum.  Most 
of this proposal doesn't touch that.  I would like to see that problem 
solved first, then we can talk about adding multiple workers per 
database or per tablespace etc...



(This idea can be complemented by having another GUC var,
autovacuum_hot_workers, which allows the DBA to have more than one
worker on hot tables (just for the case where there are too many hot
tables).  This may be overkill.)


I think this is more along the lines of what we need first.


Ron Mayer expressed the thought that we're complicating needlessly the
UI for vacuum_delay, naptime, etc.  He proposes that instead of having
cost_delay etc, we have an mbytes_per_second parameter of some sort.
This strikes me as a good idea, but I think we could do that after this
proposal is implemented.  So this take 2 could be implemented, and
then we could switch the cost_delay stuff to using a MB/s kind of
measurement somehow (he says waving his hands wildly).


Agreed, this is probably a good idea in the long run, but it is 
lower on the priority list and should come next.



Greg Stark and Matthew O'Connor say that we're misdirected in having
more than one worker per tablespace.  I say we're not :-)  If we
consider Ron Mayer's idea of measuring MB/s, but we do it per
tablespace, then we would inflict the correct amount of vacuum pain to
each tablespace, sleeping as appropriate.  I think this would require
workers of different databases to communicate what tablespaces they are
using, so that all of them can utilize the correct amount of bandwidth.


I agree that in the long run it might be better to have multiple workers 
with MB/s throttling and tablespace awareness, but we don't have any of that 
infrastructure right now.  I think the piece of low-hanging fruit that 
your launcher concept can solve is the hot table starvation.


My Proposal:  If we require admins to identify hot tables, then: 
1) Launcher fires off a worker1 into database X.

2) worker1 deals with hot tables first, then regular tables.
3) Launcher continues to launch workers to DB X every autovac naptime. 
4) worker2 (or 3 or 4 etc...) checks whether it is alone in DB X; if so, it acts as 
worker1 did above.  If worker1 is still working in DB X then worker2 
looks for hot tables that are being starved because worker1 got busy. 
If worker2 finds no hot tables that need work, then worker2 exits.


This seems like a very simple solution (given your launcher work) that can 
solve the starvation problem.



Thoughts?

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Gregory Stark

Alvaro Herrera [EMAIL PROTECTED] writes:

 Greg Stark and Matthew O'Connor say that we're misdirected in having
 more than one worker per tablespace.  I say we're not :-)  

I did say that. But your comment about using a high cost_delay was fairly
convincing too. It would be a simpler design and I think you're right. As long
as you raise both cost_delay and cost_limit by enough, you should get pretty much
the same sequential i/o rate and not step on each other's toes too much.
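
The arithmetic behind that: the cost machinery sleeps for cost_delay each 
time accumulated cost reaches cost_limit, so the steady-state rate scales 
with cost_limit/cost_delay, and multiplying both by the same factor leaves 
it unchanged (illustrative figures using the default page-dirty cost of 
20, and ignoring the time the work itself takes):

    def pages_per_second(cost_limit, cost_delay_ms, page_cost=20):
        pages_per_burst = cost_limit / page_cost
        bursts_per_sec = 1000.0 / cost_delay_ms
        return pages_per_burst * bursts_per_sec

    print(pages_per_second(200, 10))     # defaults-ish: 1000 pages/s
    print(pages_per_second(2000, 100))   # 10x both: still 1000 pages/s,
                                         # just in bigger bursts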

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Joshua D. Drake

 Ron Mayer expressed the thought that we're complicating needlessly the
 UI for vacuum_delay, naptime, etc.  He proposes that instead of having
 cost_delay etc, we have an mbytes_per_second parameter of some sort.
 This strikes me as a good idea, but I think we could do that after this
 proposal is implemented.  So this take 2 could be implemented, and
 then we could switch the cost_delay stuff to using a MB/s kind of
 measurement somehow (he says waving his hands wildly).

vacuum should be a process with the least amount of voodoo. If we can
just have vacuum_delay and vacuum_threshold, where threshold allows an
arbitrary setting of how much bandwidth we will allot to the process,
then that is a beyond wonderful thing.

It is easy to determine how much IO you have, and what you can spare.

Joshua D. Drake




 
 Greg Stark and Matthew O'Connor say that we're misdirected in having
 more than one worker per tablespace.  I say we're not :-)  If we
 consider Ron Mayer's idea of measuring MB/s, but we do it per
 tablespace, then we would inflict the correct amount of vacuum pain to
 each tablespace, sleeping as appropriate.  I think this would require
 workers of different databases to communicate what tablespaces they are
 using, so that all of them can utilize the correct amount of bandwidth.
 
 
 I'd like to know if this responds to the mentioned people's objections.
 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match