Re: [HACKERS] autovacuum next steps, take 3

2007-03-14 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  1. Grab the AutovacSchedule LWLock exclusively.
  2. Check to see if another worker is currently processing
  that table; if so drop LWLock and go to next list entry.
  3. Recompute whether table needs vacuuming; if not,
  drop LWLock and go to next entry.  (This test covers the
  case where someone vacuumed the table since you made your
  list.)
  4. Put table OID into shared memory, drop LWLock, then
  vacuum table.
  5. Clear current-table OID from shared memory, then
  repeat for next list entry.
 
  The point I'm not very sure about is that this proposal means we need to
  do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
  stats.
 
 True.  You could probably drop the lock while rechecking stats, at the
 cost of having to recheck for collision (repeat step 2) afterwards.
 Or recheck stats before you start, but if collisions are likely then
 that's a waste of time.  But on the third hand, does it matter?
 Rechecking the stats should be much cheaper than a vacuum operation,
 so I'm not seeing that there's going to be a problem.  It's not like
 there are going to be hundreds of workers contending for that lock...

Turns out that it does matter, because not only do we need to read pgstats,
but we also need to fetch the pg_autovacuum and pg_class rows again for
the table.  So we must release the AutovacuumSchedule lock before trying
to open pg_class etc.

Unless we are prepared to cache (keep a private copy of) the contents
of said tuples between the first check (i.e. when building the initial
table list) and the recheck?  This is possible as well, but it gives me
an uneasy feeling.
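
For illustration, here is a rough C sketch of the recheck with the lock
dropped around the catalog/pgstats I/O, as discussed above.  The helper
names and the WorkerInfo field are hypothetical, not actual PostgreSQL
code; only the locking order is the point.

    static bool
    try_claim_table(Oid relid)
    {
        /* step 2: collision check under the lock */
        LWLockAcquire(AutovacSchedule, LW_EXCLUSIVE);
        if (table_being_vacuumed(relid))
        {
            LWLockRelease(AutovacSchedule);
            return false;
        }
        LWLockRelease(AutovacSchedule);

        /* step 3, without the lock: pgstats, pg_class, pg_autovacuum I/O */
        if (!table_still_needs_vacuum(relid))
            return false;

        /* re-take the lock and repeat the collision check */
        LWLockAcquire(AutovacSchedule, LW_EXCLUSIVE);
        if (table_being_vacuumed(relid))
        {
            LWLockRelease(AutovacSchedule);
            return false;
        }
        MyWorkerInfo->wi_tableoid = relid;      /* step 4: claim the table */
        LWLockRelease(AutovacSchedule);
        return true;
    }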

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Galy Lee

Alvaro Herrera wrote:
worker to-do list
-----------------
It removes from its to-do list the tables being processed.  Finally, it
writes the list to disk.

I am worried about the worker to-do list in your proposal. I don't
think the worker is a suitable place to maintain a vacuum task list;
it is better to maintain a unified vacuum task queue in autovacuum
shared memory.

Here are the basic ideas:

* Why is such a task queue needed?

- The launcher can schedule all vacuum tasks through such a queue. It
provides a facility to schedule tasks intelligently for further
autovacuum improvements.

- Such a task list can also be viewed easily from a system view. With
the task queue, this can be implemented easily in 8.3.

* VACUUM task queue

VACUUM tasks for the cluster are maintained in a unified cluster-wide
queue in autovacuum shared memory.

  global shared TaskInfo tasks[];

It can be viewed as:

SELECT * FROM pg_autovacuum_tasks;
 dbid | relid | group | worker
------+-------+-------+--------
    2 | 20001 |     0 |   1001
    2 | 20002 |     0 |
    3 | 30001 |     0 |   1002

VACUUM tasks belonging to the same database might be divided into
several groups, and one worker might be assigned to process one
specific task group.
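
For illustration only, a TaskInfo entry might mirror the columns of the
view above; the field names here are invented, not part of the proposal:

    typedef struct TaskInfo
    {
        Oid     dbid;       /* database the table belongs to */
        Oid     relid;      /* table to be vacuumed */
        int     group;      /* task group, used to assign workers */
        int     worker;     /* PID of the assigned worker, or 0 if none */
    } TaskInfo;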

The task queue might be filled by a dedicated task-gathering worker, or
it might be filled by an *external task gatherer*.

This allows an external program to implement a more sophisticated vacuum
scheme. Based on previous discussion, it appears difficult to implement
an all-purpose algorithm that satisfies the requirements of all
applications, so it is better to allow users to develop their own vacuum
strategies. A *user-defined external program* might fill the task queue
and schedule tasks by its own strategy; the launcher would then be
responsible only for coordinating workers. This pluggable-vacuum-strategy
approach seems like a good solution.

* Status of workers

It is also convenient to allow the user to monitor the status of vacuum
workers through a system view. A snapshot of the workers can be viewed as:

SELECT * FROM pg_autovacuum_workers;
  pid  | dbid  | relid | group
 ------+-------+-------+-------
  1001 |     2 | 20001 |     0
  1002 |     3 | 30001 |     0


Best Regards
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center




Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Tom Lane
Galy Lee [EMAIL PROTECTED] writes:
 I am worried about the worker to-do list in your proposal. I don't
 think the worker is a suitable place to maintain a vacuum task list;
 it is better to maintain a unified vacuum task queue in autovacuum
 shared memory.

Shared memory is fixed-size.

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Alvaro Herrera
Galy Lee wrote:
 
 Alvaro Herrera wrote:
 worker to-do list
 -----------------
 It removes from its to-do list the tables being processed.  Finally, it
 writes the list to disk.
 
 I am worried about the worker to-do list in your proposal. I don't
 think the worker is a suitable place to maintain a vacuum task list;
 it is better to maintain a unified vacuum task queue in autovacuum
 shared memory.

Galy,

Thanks for your comments.

I like the idea of having a global task queue, but sadly it doesn't work
for a simple reason: the launcher does not have enough information to
build it.  This is because we need access to catalogs in the database;
pg_class and pg_autovacuum in the current code, and the catalogs related
to the maintenance window feature when we implement it in the (hopefully
near) future.

Another point to be made, though of less importance, is that we cannot
keep such a task list in shared memory, because we aren't able to grow
that memory after postmaster start.  It is of lesser importance, because
we could keep the task list in plain files on disk; this is merely a
SMOP.  The functions to expose the task list to SQL queries would just
need to read those files.  It would be slower than shared memory,
certainly, but I don't think it's a showstopper (given the amount of
work VACUUM takes, anyway).

Not having access to the catalogs is a much more serious problem for the
scheduling.  One could think about dumping catalogs to plain files that
are readable to the launcher, but this is not very workable: how do you
dump pg_class and have it up to date all the time?  You'd have to be
writing that file pretty frequently, which doesn't sound a very good
idea.

Another idea I had was to have a third kind of autovacuum process, namely
a schedule builder, which would connect to the database, read catalogs,
compute the needed vacuuming, write it to disk, and exit.  This seems similar
to your task-gathering worker.  The launcher could then dispatch regular
workers as appropriate.  Furthermore, the launcher could create a global
schedule, based on the combination of the schedules for all databases.
I dismissed this idea because a schedule gets out of date very quickly
as tables continue to be used by regular operation.  A worker starting
at t0 may find that a task list built at t0 - 5 min is not very relevant.
So it needs to build a new task list anyway, which then raises the
question of why not just let the worker itself build its task list.
Also, combining schedules is complicated, and you start thinking about
asking the DBA to give each database a priority, which is annoying.

So the idea I am currently playing with is to have workers determine the
task list at start, by looking at both the catalogs and considering the
task lists of other workers.  I think this is the natural evolution of
the other ideas -- the worker is just smarter to start with, and the
whole thing is a lot simpler.


 The task queue might be filled by a dedicated task-gathering worker, or
 it might be filled by an *external task gatherer*.

The idea of an external task gatherer is an interesting one which I
think would make sense to implement in the future.  I think it is not
very difficult to implement once the proposal we're currently discussing
is done, because it just means we have to modify the part where each
worker decides what needs to be done, and at what times the launcher
decides to start a worker on each database.  The rest of the stuff I'm
working on is just infrastructure to make it happen.

So I think your basic idea here is still workable, just not right now.
Let's discuss it again as soon as I'm done with the current stuff.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Alvaro Herrera
Alvaro Herrera wrote:

 worker to-do list
 -----------------
 
 When each worker starts, it determines which tables to process in the
 usual fashion: get pg_autovacuum and pgstat data and compute the
 equations.
 
 The worker then takes a snapshot of what's currently going on in the
 database, by storing worker PIDs, the corresponding table OID that's
 being currently worked, and the to-do list for each worker.
 
 It removes from its to-do list the tables being processed.  Finally, it
 writes the list to disk.
 
 The table list will be written to a file in
 PGDATA/vacuum/database-oid/todo.worker-pid
 The file will consist of table OIDs, in the order in which they are
 going to be vacuumed.
 
 At this point, vacuuming can begin.
 
 Before processing each table, it scans the WorkerInfos to see if there's
 a new worker, in which case it reads its to-do list to memory.
 
 Then it again fetches the tables being processed by other workers in the
 same database, and for each other worker, removes from its own in-memory
 to-do all those tables mentioned in the other lists that appear earlier
 than the current table being processed (inclusive).  Then it picks the
 next non-removed table in the list.  All of this must be done with the
 Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
 pick the same table (no I/O takes place here, because the whole lists
 were saved in memory at the start.)

Sorry, I confused matters here by not clarifying on-disk to-do lists
versus in-memory ones.  When we write the to-do list to file, that's the
to-do list that other workers will see.  It will not change; when I say
remove a table from the to-do list, it will be removed from the to-do
list in memory, but the file will not get rewritten.

Note that a worker will not remove from its list a table that's in the
to-do list of another worker but not yet processed.  It will only remove
those tables that are currently being processed (i.e. they appear in the
shared memory entry for that worker), and any tables that appear _before
that one_ in that particular worker's file.
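
A rough C sketch of that pruning rule, using PostgreSQL's List API but
with hypothetical names for everything else ("other" stands for another
worker: its on-disk to-do list read into memory, plus its current table
as seen in shared memory):

    static void
    prune_todo(List **my_todo, OtherWorker *other)
    {
        ListCell   *lc;

        foreach(lc, other->todo_list)
        {
            Oid     relid = lfirst_oid(lc);

            /* drop tables up to and including the one being processed */
            *my_todo = list_delete_oid(*my_todo, relid);
            if (relid == other->current_table)
                break;          /* entries after this one stay in my list */
        }
    }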

So this behaves very much like what Tom describes in an email downthread,
not like what Matthew is thinking.  In fact I'm thinking that the above
is needlessly complex, and that Tom's proposal is simpler and achieves
pretty much the same effect, so I'll have a look at evolving from that
instead.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 1. Grab the AutovacSchedule LWLock exclusively.
 2. Check to see if another worker is currently processing
 that table; if so drop LWLock and go to next list entry.
 3. Recompute whether table needs vacuuming; if not,
 drop LWLock and go to next entry.  (This test covers the
 case where someone vacuumed the table since you made your
 list.)
 4. Put table OID into shared memory, drop LWLock, then
 vacuum table.
 5. Clear current-table OID from shared memory, then
 repeat for next list entry.

 The point I'm not very sure about is that this proposal means we need to
 do I/O with the AutovacSchedule LWLock grabbed, to obtain up-to-date
 stats.

True.  You could probably drop the lock while rechecking stats, at the
cost of having to recheck for collision (repeat step 2) afterwards.
Or recheck stats before you start, but if collisions are likely then
that's a waste of time.  But on the third hand, does it matter?
Rechecking the stats should be much cheaper than a vacuum operation,
so I'm not seeing that there's going to be a problem.  It's not like
there are going to be hundreds of workers contending for that lock...

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Galy Lee
Hi, Alvaro

Alvaro Herrera wrote:
 keep such a task list in shared memory, because we aren't able to grow
 that memory after postmaster start. 

We can use fixed-size shared memory to maintain such a queue. The
maximum number of tasks is the number of tables in the cluster, so the
size of the queue can be the same as max_fsm_relations, which is usually
larger than the number of tables and indexes in the cluster. This is
sufficient to contain most of the vacuum tasks.

Even if the queue overflows, since the task gatherer scans the whole
cluster every autovacuum_naptime, it can quickly pick those tasks up
again. We don't need to write anything to an external file. So there is
no problem with using fixed-size shared memory to maintain a global
queue.
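
A minimal sketch of the overflow-tolerant fixed-size queue being
described, reusing the illustrative TaskInfo from upthread (the size
and all names are assumptions, not a real implementation):

    #define AV_QUEUE_SIZE 1024          /* stand-in for ~max_fsm_relations */

    static TaskInfo av_tasks[AV_QUEUE_SIZE];
    static int      av_ntasks = 0;

    static bool
    enqueue_task(Oid dbid, Oid relid)
    {
        if (av_ntasks >= AV_QUEUE_SIZE)
            return false;       /* overflow: drop; re-gathered next naptime */
        av_tasks[av_ntasks].dbid   = dbid;
        av_tasks[av_ntasks].relid  = relid;
        av_tasks[av_ntasks].group  = 0;
        av_tasks[av_ntasks].worker = 0;
        av_ntasks++;
        return true;
    }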

 Other idea I had was having a third kind of autovacuum process, namely a
 schedule builder

If we have such a global queue, a task-gathering worker can connect to
every database every naptime to gather tasks in time.

The task-gathering worker won't build the schedule; the launcher or an
external program is responsible for that activity. How to dispatch tasks
to workers is just a scheduling problem: a good dispatching algorithm
needs to ensure each worker can finish its tasks on time, which might
resolve the headache of the hot-table problem. But this is a further
issue, to be discussed after 8.3.

Best Regards

Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center




Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Tom Lane
Galy Lee [EMAIL PROTECTED] writes:
 We can use fixed-size shared memory to maintain such a queue. The
 maximum number of tasks is the number of tables in the cluster, so the
 size of the queue can be the same as max_fsm_relations, which is usually
 larger than the number of tables and indexes in the cluster.

The trouble with that analogy is that the system can still operate
reasonably sanely when max_fsm_relations is exceeded (at least, the
excess relations behave no worse than they did before we had FSM).
If there are relations that autovacuum ignores indefinitely because they
don't fit in a fixed-size work queue, that will be a big step backward
from prior behavior.

In any case, I still haven't seen a good case made why a global work
queue will provide better behavior than each worker keeping a local
queue.  The need for small hot tables to be visited more often than
big tables suggests to me that a global queue will actually be
counterproductive, because you'll have to contort the algorithm in
some hard-to-understand way to get it to do that.

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread ITAGAKI Takahiro
Tom Lane [EMAIL PROTECTED] wrote:

 In any case, I still haven't seen a good case made why a global work
 queue will provide better behavior than each worker keeping a local
 queue.  The need for small hot tables to be visited more often than
 big tables suggests to me that a global queue will actually be
 counterproductive, because you'll have to contort the algorithm in
 some hard-to-understand way to get it to do that.

If we have external vacuum schedulers, they need to see and touch the
contents of the work queue; that's why he suggested the shared work queue.
I think the present autovacuum strategy is not enough in some heavily-used
cases and needs more sophisticated schedulers, even if the optimization
for hot tables is added. Also, the best vacuum strategies are highly
system-dependent, so I don't think we can supply one monolithic
strategy that fits all purposes.

That was a proposal for the infrastructure for interaction between
autovacuum and user-land vacuum schedulers. Of course, we can supply a
simple scheduler for not-so-high-load systems, but I need a kind of
autovacuum that can be controlled from an external program that knows
the user application well.

Though we could use a completely separate autovacuum daemon, like
contrib/pg_autovacuum in 8.0, I think it is good for us to share some of
the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] wrote:
 In any case, I still haven't seen a good case made why a global work
 queue will provide better behavior than each worker keeping a local
 queue.

 If we have some external vacuum schedulers, we need to see and touch the
 content of work queue.

Who said anything about external schedulers?  I remind you that this is
AUTOvacuum.  If you want to implement manual scheduling you can still
use plain old VACUUM commands.

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-12 Thread ITAGAKI Takahiro
Tom Lane [EMAIL PROTECTED] wrote:

 Who said anything about external schedulers?  I remind you that this is
 AUTOvacuum.  If you want to implement manual scheduling you can still
 use plain 'ol vacuum commands.

I think we can split autovacuum into two (or more?) functions:
task gatherers and task workers. We don't have to stick with the
monolithic style of the current autovacuum.


Galy said:
 The task queue might be filled by a dedicated task-gathering worker, or
 it might be filled by an *external task gatherer*.

Alvaro said:
 The idea of an external task gatherer is an interesting one which I
 think would make sense to implement in the future.  I think it is not
 very difficult to implement once the proposal we're currently discussing
 is done

I said:
 Though we could use a completely separate autovacuum daemon, like
 contrib/pg_autovacuum in 8.0, I think it is good for us to share some of
 the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Matthew T. O'Connor
My initial reaction is that this looks good to me, but I still have a
few comments below.


Alvaro Herrera wrote:

Here is a low-level, very detailed description of the implementation of
the autovacuum ideas we have so far.

launcher's dealing with databases
---------------------------------


[ Snip ]


launcher and worker interactions
--------------------------------

[Snip]


worker to-do list
-----------------
When each worker starts, it determines which tables to process in the
usual fashion: get pg_autovacuum and pgstat data and compute the
equations.

The worker then takes a snapshot of what's currently going on in the
database, by storing worker PIDs, the corresponding table OID that's
being currently worked, and the to-do list for each worker.


Does a new worker really care about the PID of other workers or what 
table they are currently working on?



It removes from its to-do list the tables being processed.  Finally, it
writes the list to disk.


Just to be clear, the new worker removes from its to-do list all the
tables mentioned in the to-do lists of all the other workers?



The table list will be written to a file in
PGDATA/vacuum/database-oid/todo.worker-pid
The file will consist of table OIDs, in the order in which they are
going to be vacuumed.

At this point, vacuuming can begin.


This all sounds good to me so far.


Before processing each table, it scans the WorkerInfos to see if there's
a new worker, in which case it reads its to-do list to memory.


It's not clear to me why a worker cares that there is a new worker, 
since the new worker is going to ignore all the tables that are already 
claimed by all worker todo lists.



Then it again fetches the tables being processed by other workers in the
same database, and for each other worker, removes from its own in-memory
to-do all those tables mentioned in the other lists that appear earlier
than the current table being processed (inclusive).  Then it picks the
next non-removed table in the list.  All of this must be done with the
Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
pick the same table (no I/O takes place here, because the whole lists
were saved in memory at the start.)


Again, it's not clear to me what this gains us.  It seems to me that
when a worker starts up and writes out its to-do list, it should just do
it; I don't see the value in workers constantly updating their to-do
lists.  Maybe I'm just missing something; can you enlighten me?



other things to consider
------------------------

This proposal doesn't deal with the hot tables stuff at all, but that is
very easy to bolt on later: just change the first phase, where the
initial to-do list is determined, to exclude cold tables.  That way,
the vacuuming will be fast.  Determining what is a cold table is still
an exercise for the reader ...


I think we can make this algorithm naturally favor small / hot tables 
with one small change: have workers remove tables that they have just 
vacuumed from their to-do lists and re-write their to-do lists to disk. 
Assuming the to-do lists are ordered by size, ascending, smaller tables 
will be made available for inspection by newer workers sooner rather 
than later.





Re: [HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 Does a new worker really care about the PID of other workers or what 
 table they are currently working on?

As written, it needs the PIDs so it can read in the other workers' todo
lists (which are in files named by PID).

 It's not clear to me why a worker cares that there is a new worker, 
 since the new worker is going to ignore all the tables that are already 
 claimed by all worker todo lists.

That seems wrong to me, since it means that new workers will ignore
tables that are scheduled for processing by an existing worker, no
matter how far in the future that schedule extends.  As an example,
suppose you have half a dozen large tables in need of vacuuming.
The first worker in will queue them all up, and subsequent workers
will do nothing useful, at least not till the first worker is done
with the first table.  Having the first worker update its todo
list file after each table allows the earlier tables to be exposed
for reconsideration, but that's expensive and it does nothing for
later tables.

I suggest that maybe we don't need exposed TODO lists at all.  Rather
the workers could have internal TODO lists that are priority-sorted
in some way, and expose only their current table OID in shared memory.
Then the algorithm for processing each table in your list is

1. Grab the AutovacSchedule LWLock exclusively.
2. Check to see if another worker is currently processing
   that table; if so drop LWLock and go to next list entry.
3. Recompute whether table needs vacuuming; if not,
   drop LWLock and go to next entry.  (This test covers the
   case where someone vacuumed the table since you made your
   list.)
4. Put table OID into shared memory, drop LWLock, then
   vacuum table.
5. Clear current-table OID from shared memory, then
   repeat for next list entry.

This creates a behavior of whoever gets to it first rather than
allowing workers to claim tables that they actually won't be able
to service any time soon.
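
For concreteness, a rough C sketch of that loop, holding the lock across
the recheck as described.  All helper names are invented for
illustration; this is not actual autovacuum code:

    ListCell   *cell;

    foreach(cell, my_todo_list)
    {
        Oid     relid = lfirst_oid(cell);

        LWLockAcquire(AutovacSchedule, LW_EXCLUSIVE);       /* step 1 */
        if (table_claimed_by_other_worker(relid) ||         /* step 2 */
            !table_still_needs_vacuum(relid))               /* step 3 */
        {
            LWLockRelease(AutovacSchedule);
            continue;
        }
        MyWorkerInfo->wi_tableoid = relid;                  /* step 4 */
        LWLockRelease(AutovacSchedule);

        vacuum_one_table(relid);

        MyWorkerInfo->wi_tableoid = InvalidOid;             /* step 5 */
    }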

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:
It's not clear to me why a worker cares that there is a new worker, 
since the new worker is going to ignore all the tables that are already 
claimed by all worker todo lists.


That seems wrong to me, since it means that new workers will ignore
tables that are scheduled for processing by an existing worker, no
matter how far in the future that schedule extends.  As an example,
suppose you have half a dozen large tables in need of vacuuming.
The first worker in will queue them all up, and subsequent workers
will do nothing useful, at least not till the first worker is done
with the first table.  Having the first worker update its todo
list file after each table allows the earlier tables to be exposed
for reconsideration, but that's expensive and it does nothing for
later tables.


Well, the big problem we have is not that large tables are being 
starved, so this doesn't bother me too much; besides, there is only so 
much I/O, so one worker working sequentially through the big tables 
seems OK to me.



I suggest that maybe we don't need exposed TODO lists at all.  Rather
the workers could have internal TODO lists that are priority-sorted
in some way, and expose only their current table OID in shared memory.
Then the algorithm for processing each table in your list is

1. Grab the AutovacSchedule LWLock exclusively.
2. Check to see if another worker is currently processing
   that table; if so drop LWLock and go to next list entry.
3. Recompute whether table needs vacuuming; if not,
   drop LWLock and go to next entry.  (This test covers the
   case where someone vacuumed the table since you made your
   list.)
4. Put table OID into shared memory, drop LWLock, then
   vacuum table.
5. Clear current-table OID from shared memory, then
   repeat for next list entry.

This creates a behavior of whoever gets to it first rather than
allowing workers to claim tables that they actually won't be able
to service any time soon.


Right, but you could wind up with as many workers working concurrently 
as you have tables in a database, which doesn't seem like a good idea 
either.  One thing I like about the to-do list setup Alvaro had is that 
new workers will be assigned fewer tables to work on and hence exit 
sooner.  We are going to fire off a new worker every autovacuum_naptime, 
so availability of new workers isn't going to be a problem.





Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 01:26:00AM -0500, Matthew T. O'Connor wrote:
 Tom Lane wrote:
 Matthew T. O'Connor matthew@zeut.net writes:
 I'm not sure what you are saying here: are you now saying that partial 
 vacuum won't work for autovac?  Or are you saying that saving state as 
 Jim is describing above won't work?
 
 I'm saying that I don't like the idea of trying to stop on a dime by
 saving the current contents of vacuum's dead-TID array to disk with the
 idea that we can trust those values 100% later.  Saving the array is
 expensive both in runtime and code complexity, and I don't believe we
 can trust it later --- at least not without even more expensive-and-
 complex measures, such as WAL-logging every such save :-(
 
 I'm for stopping only after completing an index-cleaning pass, at the
 point where we empty the dead-TID array anyway.  If you really have to
 have stop on a dime, just kill -INT the process, accepting that you
 will have to redo your heap scan since the last restart point.
 
 OK, so if I understand correctly, a vacuum of a table with 10 indexes on 
 it can be interrupted 10 times, once after each index-cleaning pass? 
 That might have some value, especially breaking up the work required to 
 vacuum a large table. Or am I still not getting it?

It'd stop after scanning the heap, scanning all the indexes, and then
cleaning the heap. After that's done it no longer needs any of the
dead-TID info; anytime before that it does need that info, and Tom's
objection is that trying to store that info is a bad idea.

The problem with this is that typically it takes a long time to go
through a complete vacuum cycle; minutes at least, and preferably
longer. Decreasing that cycle time will greatly increase the amount of
IO required for vacuuming a table with any indexes, because every time
you cycle through you have to read the entire index. That's why I don't
see it as being useful at all for getting autovac to work on hot tables
- if you actually got that cycle time low enough you'd kill the system
with all the extra index scanning.
-- 
Jim Nasby                                      [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:54:28AM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 Here is a worst case example: A DB with 6 tables all of which are highly 
 active and will need to be vacuumed constantly.  While this is totally 
 hypothetical, it is how I envision things working (without the threshold).
 
 I fail to see how a simple 6 table case is 'worst case'. It's common to
 see hundreds of tables, and I've run across more than one database with
 thousands of tables (think partitioning). In cases like those it's
 certainly possible, perhaps even likely that you would get many daemons
 running in the database at one time just from different tables suddenly
 needing vacuuming and appearing at a higher point in the list than other
 tables. With 100 ~1G tables getting updates it certainly wouldn't be
 hard to end up with 10 of those being vacuumed all at the same time.
 
 Yes, 6 tables is small; the worst-case part of the example was that all 
 the tables would need to be vacuumed constantly.  Most databases only 
 have a few hot tables.  Most tables only need to be vacuumed every once 
 in a while.
 
It's not the hot tables that are the issue; it's how many large tables
(hot or not) that can come up for vacuuming in order. For example, if
A-Z are all large tables (ie: a few GB), with A being the largest and Z
the smallest, think about what happens here:

Round 1: A needs vacuuming. Daemon gets to it and starts working.
Round 2: B now needs vacuuming. It's slightly smaller than A, so daemon
2 gets to it.
Round 3: C now needs vacuuming. Daemon 3.
...
Round 26: Z now needs vacuuming. Daemon 26 picks it up.

You now have 26 daemons running in the database.

Now, we can argue about how likely that scenario is, but I don't think
it's relevant. What matters is that it *is* possible, and as long as
that's the case you'd have to have some kind of limit. (While this
simple 26 table example is definitely worst-case, if you've got hundreds
of tables that are all multiple GB in size I think it wouldn't be hard
at all for you to end up with a dozen or more daemons all hammering
away).

 I do like the idea since it should be easier to tune, but I think we
 still need some limit on it. Perhaps as a first-pass we could just have
 a hard limit and log a message and/or set a flag any time we hit it.
 That would hopefully allow us to get information about how big a problem
 it really is. We could go one step further and say that the last daemon
 that can start in a database will only vacuum tables that can be done
 quickly; that's essentially what we've been talking about, except the
 limit we've been discussing would be hard-coded at 2.
 
 I'm confused, what limit would be set at 2?  The number of concurrent 
 workers?  I've never said that.

The point I was making is that the proposal about limiting the 2nd
daemon to only processing tables it can do in a short period of time is
akin to setting a limit of only 2 daemons in a database at a time.
-- 
Jim Nasby                                      [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
  The advantage to keying this to autovac_naptime is that it means we
  don't need another GUC, but after I suggested that before I realized
  that's probably not the best idea. For example, I've seen clusters that
  are running dozens-hundreds of databases; in that environment you really
  need to turn naptime way down (to like a second). In that case you
  wouldn't want to key to naptime.
 
 Actually, I've been thinking that it would be a good idea to change the
 semantics of autovacuum_naptime so that it means the average time to
 start a worker in any given database.  That way, the time between
 autovac runs is not dependent on the number of databases you have.

BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users that
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.
-- 
Jim Nasby                                      [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:

Jim C. Nasby wrote:


The advantage to keying this to autovac_naptime is that it means we
don't need another GUC, but after I suggested that before I realized
that's probably not the best idea. For example, I've seen clusters that
are running dozens-hundreds of databases; in that environment you really
need to turn naptime way down (to like a second). In that case you
wouldn't want to key to naptime.

Actually, I've been thinking that it would be a good idea to change the
semantics of autovacuum_naptime so that it means the average time to
start a worker in any given database.  That way, the time between
autovac runs is not dependent on the number of databases you have.


BTW, another issue that I don't think we can ignore: we actually need to
do this on a per-tablespace level, or at least have the ability to
disable or somehow limit it. While it's not common, there are users that
run a hundred or more databases in a single cluster; it would be ugly if
we suddenly had 100 vacuums trying to run on the same set of drives
concurrently.


I think we all agree that autovacuum needs to become tablespace-aware at 
some point, but I think that is further down the line; we're having 
enough trouble figuring things out without that additional complication.




Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:12:22PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
 The advantage to keying this to autovac_naptime is that it means we
 don't need another GUC, but after I suggested that before I realized
 that's probably not the best idea. For example, I've seen clusters that
 are running dozens-hundreds of databases; in that environment you really
 need to turn naptime way down (to like a second). In that case you
 wouldn't want to key to naptime.
 Actually, I've been thinking that it would be a good idea to change the
 semantics of autovacuum_naptime so that it means the average time to
 start a worker in any given database.  That way, the time between
 autovac runs is not dependent on the number of databases you have.
 
 BTW, another issue that I don't think we can ignore: we actually need to
 do this on a per-tablespace level, or at least have the ability to
 disable or somehow limit it. While it's not common, there are users that
 run a hundred or more databases in a single cluster; it would be ugly if
 we suddenly had 100 vacuums trying to run on the same set of drives
 concurrently.
 
 I think we all agree that autovacuum needs to become tablespace-aware at 
 some point, but I think that is further down the line; we're having 
 enough trouble figuring things out without that additional complication.

Sure, we just need a way to disable the multiple autovac daemon stuff
then.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Casey Duncan


On Feb 26, 2007, at 12:49 PM, Alvaro Herrera wrote:


Jim C. Nasby wrote:

 That's why I'm thinking it would be best to keep the maximum size of
 stuff for the second worker small. It probably also makes sense to tie
 it to time and not size, since the key factor is that you want it to hit
 the high-update tables every X number of seconds.

 If we wanted to get fancy, we could factor in how far over the vacuum
 threshold a table is, so even if the table is on the larger size, if
 it's way over the threshold the second vacuum will hit it.

Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently, depending on whether
there is one or more workers on the database already or not.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the plain worker mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit on what it can do.

The second mode is the hot table worker mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.


Perhaps this has already been proposed, but maybe some combination of
the following inputs could be used to determine which table most needs
vacuuming:

- The proportion of tuples in a table that are dead (updated rows since
the last vacuum / estimated row count). This would favor hot tables
naturally, regardless of size.

- The time since the last vacuum, so that larger tables are eventually
vacuumed even if hot tables totally dominate.

Of course, tables that did not pass the minimum parameters specified in
postgresql.conf would not even get considered.

I'm being intentionally vague here on the exact algorithm, since you all
have thought about this more than I have. One thing I like about the
above is that it is independent of table size, and doesn't require
anyone to determine which tables are hot manually.


-Casey



Re: [HACKERS] autovacuum next steps, take 2

2007-02-27 Thread Galy Lee

Tom Lane wrote:
 Saving the array is
 expensive both in runtime and code complexity, and I don't believe we
 can trust it later --- at least not without even more expensive-and-
 complex measures, such as WAL-logging every such save :-(
I don't quite understand the things you are worrying about.
If we find that we cannot trust the saved file, or the file has been
corrupted, then we can drop it and scan from the beginning of the heap.
If something like CLUSTER or PITR has changed the physical layout of the
heap, then we can simply drop the files. Why do we need WAL for it?
I don't see any point in it.

Also, I don't think it is expensive. If it is combined with a
maintenance window so that vacuum stops once in a whole day, writing out
256MB/2 = 128MB of state cannot be called expensive. Of course, this
feature isn't for autovacuum to use every minute; autovacuum can use it
after it has adopted the maintenance window.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Jim C. Nasby wrote:

 That's why I'm thinking it would be best to keep the maximum size of
 stuff for the second worker small. It probably also makes sense to tie
 it to time and not size, since the key factor is that you want it to hit
 the high-update tables every X number of seconds.
 
 If we wanted to get fancy, we could factor in how far over the vacuum
 threshold a table is, so even if the table is on the larger size, if
 it's way over the threshold the second vacuum will hit it.

Ok, I think we may be actually getting somewhere.

I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently, depending on whether
there is one or more workers on the database already or not.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the plain worker mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit on what it can do.

The second mode is the hot table worker mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.
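
In rough C terms, the mode selection might look like this (the names
here are invented for illustration, not real autovacuum code):

    /* At connect time, pick a mode based on other workers in this DB. */
    if (count_workers_in_database(MyDatabaseId) > 0)
        mode = WORKER_HOT_TABLE;    /* only tables vacuumable in < naptime */
    else
        mode = WORKER_PLAIN;        /* full threshold-equation pass */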

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Jim C. Nasby wrote:


That's why I'm thinking it would be best to keep the maximum size of
stuff for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to hit
the high-update tables every X number of seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger size, if
it's way over the threshold the second vacuum will hit it.


Ok, I think we may be actually getting somewhere.


Me too.


I propose to have two different algorithms for choosing the tables to
work on.  The worker would behave differently, depending on whether
there is one or more workers on the database already or not.

The first algorithm is the plain threshold equation stuff we use today.
If a worker connects and determines that no other worker is in the
database, it uses the plain worker mode.  A worker in this mode would
examine pgstats, determine what tables to vacuum/analyze, sort them by
size (smaller to larger), and go about its work.  This kind of worker
can take a long time to vacuum the whole database -- we don't impose any
time limit or table size limit on what it can do.


Right, I like this.


The second mode is the hot table worker mode, enabled when the worker
detects that there's already a worker in the database.  In this mode,
the worker is limited to those tables that can be vacuumed in less than
autovacuum_naptime, so large tables are not considered.  Because of
this, it'll generally not compete with the first mode above -- the
tables in plain worker were sorted by size, so the small tables were
among the first vacuumed by the plain worker.  The estimated time to
vacuum may be calculated according to autovacuum_vacuum_delay settings,
assuming that all pages constitute cache misses.


How can you determine what tables can be vacuumed within 
autovacuum_naptime?  I agree that large tables should be excluded, but I 
don't know how we can do that calculation based on autovacuum_naptime.


So at:
t=0*autovacuum_naptime: worker1 gets started on DBX
t=1*autovacuum_naptime: worker2 gets started on DBX
worker2 determines all tables that need to be vacuumed,
worker2 excludes tables that are too big from its to-do list,
worker2 gets started working,
worker2 exits when it either:
a) Finishes its entire to-do list.
b) Catches up to worker1

I think the questions are: 1) What is the exact math you are planning on 
using to determine which tables are too big?  2) Do we want worker2 to 
exit when it catches worker1, or does the fact that we have excluded 
tables that are too big mean that we don't have to worry about this?





Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:

 The second mode is the hot table worker mode, enabled when the worker
 detects that there's already a worker in the database.  In this mode,
 the worker is limited to those tables that can be vacuumed in less than
 autovacuum_naptime, so large tables are not considered.  Because of
 this, it'll generally not compete with the first mode above -- the
 tables in plain worker were sorted by size, so the small tables were
 among the first vacuumed by the plain worker.  The estimated time to
 vacuum may be calculated according to autovacuum_vacuum_delay settings,
 assuming that all pages constitute cache misses.
 
 How can you determine what tables can be vacuumed within 
 autovacuum_naptime?

My assumption is that
pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

This is of course not the reality, because the delay is not how long
it takes to fetch the pages.  But it lets us have a value with which we
can do something.  With the default values, vacuum_cost_delay=10,
vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
of under 600 pages, 4800 kB (should we include indexes here in the
relpages count?  My guess is no).
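
Spelled out as a worked example (values as assumed above; this just
reproduces the arithmetic, charging vacuum_cost_delay milliseconds per
cost point and treating every page as a miss):

    int vacuum_cost_delay     = 10;         /* ms, value assumed above */
    int vacuum_cost_page_miss = 10;         /* cost points per page miss */
    int naptime_ms            = 60 * 1000;  /* autovacuum_naptime = 60s */

    /* max "hot" table size: 60000 / (10 * 10) = 600 pages */
    int max_pages = naptime_ms / (vacuum_cost_page_miss * vacuum_cost_delay);

    /* 600 pages * 8 kB per page = 4800 kB, the figure quoted above */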

A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.

 So at:
 t=0*autovacuum_naptime: worker1 gets started on DBX
 t=1*autovacuum_naptime: worker2 gets started on DBX
 worker2 determines all tables that need to be vacuumed,
 worker2 excludes tables that are too big from its to-do list,
 worker2 gets started working,
 worker2 exits when it either:
 a) Finishes its entire to-do list.
 b) Catches up to worker1
 
 I think the questions are: 1) What is the exact math you are planning on 
 using to determine which tables are too big?  2) Do we want worker2 to 
 exit when it catches worker1, or does the fact that we have excluded 
 tables that are too big mean that we don't have to worry about this?

Right, I think the fact that we excluded big tables means that this
won't be a problem most of the time, but we'll need some sort of
protection anyway.  I think this is easy to achieve -- store the table
each worker is currently processing in shared memory, and have all
workers check all other workers.  If a plain worker finds that another
worker is processing the table already, it skips that table and
continues with the next one.  A hot table worker instead exits right
away (caught up).
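
A fragment of how that check might read in C, continuing the invented
names from above (plain workers skip a claimed table, hot table workers
exit; this is a sketch, not actual code):

    /* inside the per-table loop, before starting to vacuum relid */
    if (table_claimed_by_other_worker(relid))
    {
        if (mode == WORKER_HOT_TABLE)
            proc_exit(0);       /* caught up with the plain worker */
        continue;               /* plain worker: skip and move on */
    }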

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Matthew T. O'Connor wrote:
How can you determine what tables can be vacuumed within 
autovacuum_naptime?


My assumption is that
pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

This is of course not the reality, because the delay is not how long
it takes to fetch the pages.  But it lets us have a value with which we
can do something.  With the default values, vacuum_cost_delay=10,
vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
of under 600 pages, 4800 kB (should we include indexes here in the
relpages count?  My guess is no).


I'm not sure how pg_class.relpages is maintained, but what happens to a 
bloated table?  For example, take a 100-row table that is constantly 
updated and hasn't been vacuumed in a while (say the admin disabled 
autovacuum for a while); that small 100-row table may now have 1000 
pages in it, most of which are just bloat.  Will we miss this table?  
Perhaps basing this on reltuples would be better?



A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.


I'm not sure it's a good idea to tie this to the vacuum cost delay 
settings either, so let me ask you this: how is this better than just 
allowing the admin to set a new GUC variable like 
autovacuum_hot_table_size_threshold (or something shorter), which we can 
assign a decent default of, say, 8MB?


Thoughts?



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:
 Matthew T. O'Connor wrote:
 How can you determine what tables can be vacuumed within 
 autovacuum_naptime?
 
 My assumption is that
 pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to 
 vacuum
 
 This is of course not the reality, because the delay is not how long
 it takes to fetch the pages.  But it lets us have a value with which we
 can do something.  With the default values, vacuum_cost_delay=10,
 vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
 of under 600 pages, 4800 kB (should we include indexes here in the
 relpages count?  My guess is no).
 
 I'm not sure how pg_class.relpages is maintained, but what happens to a 
 bloated table?  For example, take a 100-row table that is constantly 
 updated and hasn't been vacuumed in a while (say the admin disabled 
 autovacuum for a while); that small 100-row table may now have 1000 
 pages in it, most of which are just bloat.  Will we miss this table?  
 Perhaps basing this on reltuples would be better?

Well, this would only happen the first time, until the plain worker
processed the table; next time it would be picked up by the hot table
worker.  But yeah, we can build a better estimate using the same trick
the planner uses: estimate tuple density as reltuples/relpages times the
actual number of blocks on disk.
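
A small C sketch of that planner-style estimate (the function name is
invented; reltuples and relpages are the stats left by the last
VACUUM/ANALYZE, curpages the table's actual current size on disk):

    static double
    estimated_tuple_count(BlockNumber curpages, float4 reltuples,
                          int32 relpages)
    {
        double  density;

        if (relpages == 0)
            return 0.0;             /* never vacuumed or analyzed yet */
        density = reltuples / (double) relpages;
        return density * curpages;  /* tuples at the current table size */
    }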

 A table over 600 pages does not sound like a good candidate for hot, so
 this seems more or less reasonable to me.  On the other hand, maybe we
 shouldn't tie this to the vacuum cost delay stuff.
 
 I'm not sure it's a good idea to tie this to the vacuum cost delay 
 settings either, so let me ask you this: how is this better than just 
 allowing the admin to set a new GUC variable like 
 autovacuum_hot_table_size_threshold (or something shorter), which we can 
 assign a decent default of, say, 8MB?

Yeah, maybe that's better -- it's certainly simpler.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Matthew T. O'Connor wrote:
I'm not sure how pg_class.relpages is maintained, but what happens to a 
bloated table?  For example, take a 100-row table that is constantly 
updated and hasn't been vacuumed in a while (say the admin disabled 
autovacuum for a while); that small 100-row table may now have 1000 
pages in it, most of which are just bloat.  Will we miss this table?  
Perhaps basing this on reltuples would be better?


Well, this would only happen the first time, until the plain worker
processed the table; next time it would be picked up by the hot table
worker.  But yeah, we can build a better estimate using the same trick
the planner uses: estimate tuple density as reltuples/relpages times the
actual number of blocks on disk.


Well, even skipping it the first time isn't good; anything that further 
delays a hot table from getting vacuumed is bad.  Also, I'm not sure it 
would just be the first time, since plain VACUUM isn't going to reclaim 
most of the space, just mark it as reusable.  This is moot, however, if 
we use a good metric; I have no idea whether what you suggest above would 
be good enough.



A table over 600 pages does not sound like a good candidate for hot, so
this seems more or less reasonable to me.  On the other hand, maybe we
shouldn't tie this to the vacuum cost delay stuff.
I'm not sure it's a good idea to tie this to the vacuum cost delay 
settings either, so let me ask you this: how is this better than just 
allowing the admin to set a new GUC variable like 
autovacuum_hot_table_size_threshold (or something shorter), which we can 
assign a decent default of, say, 8MB?


Yeah, maybe that's better -- it's certainly simpler.


Simple is better, at least until proven otherwise.



Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Matthew T. O'Connor wrote:
 I'm not sure it's a good idea to tie this to the vacuum cost delay 
 settings either, so let me ask you this: how is this better than just 
 allowing the admin to set a new GUC variable like 
 autovacuum_hot_table_size_threshold (or something shorter), which we can 
 assign a decent default of, say, 8MB?

 Yeah, maybe that's better -- it's certainly simpler.

I'm not liking any of these very much, as they seem critically dependent
on impossible-to-tune parameters.  I think it'd be better to design this
around having the first worker explicitly expose its state (list of
tables to process, in order) and having subsequent workers key off that
info.  The shared memory state could include the OID of the table each
worker is currently working on, and we could keep the to-do list in some
simple flat file for instance (since we don't care about crash safety).

I'm not certain exactly what "key off" needs to mean; perhaps each
worker should make its own to-do list and then discard items that are
either in-progress or recently done by another worker when it gets to
them.

I think an absolute minimum requirement for a sane design is that no two
workers ever try to vacuum the same table concurrently, and I don't see
where that behavior will emerge from your proposal; whereas it's fairly
easy to make it happen if non-first workers pay attention to what other
workers are doing.

BTW, it's probably necessary to treat shared catalogs specially ...

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

Matthew T. O'Connor wrote:
I'm not sure it's a good idea to tie this to the vacuum cost delay 
settings either, so let me ask you this: how is this better than just 
allowing the admin to set a new GUC variable like 
autovacuum_hot_table_size_threshold (or something shorter) which we can 
assign a decent default of, say, 8MB.



Yeah, maybe that's better -- it's certainly simpler.


I'm not liking any of these very much, as they seem critically dependent
on impossible-to-tune parameters.  I think it'd be better to design this
around having the first worker explicitly expose its state (list of
tables to process, in order) and having subsequent workers key off that
info.  The shared memory state could include the OID of the table each
worker is currently working on, and we could keep the to-do list in some
simple flat file for instance (since we don't care about crash safety).


So far we are only talking about one parameter, the 
hot_table_size_threshold, which I agree would be a guess by an admin, 
but if we went in this direction, I would also advocate adding a column 
to the pg_autovacuum table that allows an admin to explicitly define a 
table as hot or not.


Also I think each worker should be mostly independent, the only caveat 
being that (assuming each worker works in size order) if we catch up to 
an older worker (get to the table they are currently working on) we 
exit.  Personally I think this is all we need, but others felt the 
additional threshold was needed.  What do you think?  Or what do you 
think might be better?



 I'm not certain exactly what "key off" needs to mean; perhaps each
worker should make its own to-do list and then discard items that are
either in-progress or recently done by another worker when it gets to
them.


My initial design didn't have any threshold at all, but others felt this 
would/could result in too many workers working concurrently in the same DB.



I think an absolute minimum requirement for a sane design is that no two
workers ever try to vacuum the same table concurrently, and I don't see
where that behavior will emerge from your proposal; whereas it's fairly
easy to make it happen if non-first workers pay attention to what other
workers are doing.


Maybe we never made that clear, I was always working on the assumption 
that two workers would never try to work on the same table at the same time.



BTW, it's probably necessary to treat shared catalogs specially ...


Certainly.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Matthew T. O'Connor wrote:
  I'm not sure it's a good idea to tie this to the vacuum cost delay 
  settings either, so let me ask you this: how is this better than just 
  allowing the admin to set a new GUC variable like 
  autovacuum_hot_table_size_threshold (or something shorter) which we can 
  assign a decent default of, say, 8MB.
 
  Yeah, maybe that's better -- it's certainly simpler.
 
 I'm not liking any of these very much, as they seem critically dependent
 on impossible-to-tune parameters.  I think it'd be better to design this
 around having the first worker explicitly expose its state (list of
 tables to process, in order) and having subsequent workers key off that
 info.  The shared memory state could include the OID of the table each
 worker is currently working on, and we could keep the to-do list in some
 simple flat file for instance (since we don't care about crash safety).
 
 I'm not certain exactly what "key off" needs to mean; perhaps each
 worker should make its own to-do list and then discard items that are
 either in-progress or recently done by another worker when it gets to
 them.
 
The real problem is trying to set that up in such a fashion that keeps
hot tables frequently vacuumed; that means that the 2nd daemon in a
database either needs to avoid large tables that will take 'a
significant' length of time to vacuum, or you need to allow any number
of daemons in each database (which sounds like a good way to thrash the
machine).

 I think an absolute minimum requirement for a sane design is that no two
 workers ever try to vacuum the same table concurrently, and I don't see
 where that behavior will emerge from your proposal; whereas it's fairly
 easy to make it happen if non-first workers pay attention to what other
 workers are doing.

Isn't there a special lock acquired on a relation by vacuum? Can't we
just check for that? Seems much simpler than building out the ability
for daemons to see what each other is doing (and that still wouldn't
take manual vacuums into account).
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 08:11:44PM -0300, Alvaro Herrera wrote:
 Matthew T. O'Connor wrote:
  Alvaro Herrera wrote:
 
  The second mode is the hot table worker mode, enabled when the worker
  detects that there's already a worker in the database.  In this mode,
  the worker is limited to those tables that can be vacuumed in less than
  autovacuum_naptime, so large tables are not considered.  Because of
  this, it'll generally not compete with the first mode above -- the
  tables in plain worker were sorted by size, so the small tables were
  among the first vacuumed by the plain worker.  The estimated time to
  vacuum may be calculated according to autovacuum_vacuum_delay settings,
  assuming that all pages constitute cache misses.
  
  How can you determine what tables can be vacuumed within 
  autovacuum_naptime?
 
 My assumption is that
 pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum

Need to take vacuum_cost_limit into account.
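
To illustrate, folding the cost limit into the formula gives a rough
fits-in-one-naptime test along these lines (a sketch only: the GUCs
are passed as plain parameters and the function name is invented):

    #include <stdbool.h>
    #include <stdio.h>

    static bool
    fits_in_naptime(double relpages, int cost_page_miss, int cost_limit,
                    int cost_delay_ms, int naptime_s)
    {
        /* vacuum sleeps cost_delay_ms once per cost_limit of
         * accumulated cost, so a page miss costs miss/limit of one
         * sleep: 10 * 10 / 200 = 0.5 ms per page with the defaults */
        double ms_per_page =
            (double) cost_page_miss * cost_delay_ms / cost_limit;

        return relpages * ms_per_page <= naptime_s * 1000.0;
    }

    int
    main(void)
    {
        /* 600 pages at 0.5 ms/page is well inside a 60 s naptime */
        printf("%d\n", fits_in_naptime(600.0, 10, 200, 10, 60));
        return 0;
    }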

The advantage to keying this to autovac_naptime is that it means we
don't need another GUC, but after I suggested that before I realized
that's probably not the best idea. For example, I've seen clusters that
are running dozens-hundreds of databases; in that environment you really
need to turn naptime way down (to like a second). In that case you
wouldn't want to key to naptime.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:
 Matthew T. O'Connor wrote:
 How can you determine what tables can be vacuumed within 
 autovacuum_naptime?
 
 My assumption is that
 pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to 
 vacuum
 
 This is of course not the reality, because the delay is not how long
 it takes to fetch the pages.  But it lets us have a value with which we
 can do something.  With the default values, vacuum_cost_delay=10,
 vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables
 of under 600 pages, 4800 kB (should we include indexes here in the
 relpages count?  My guess is no).
 
 I'm not sure how pg_class.relpages is maintained but what happens to a 
 bloated table?  For example, a 100 row table that is constantly updated 
 and hasn't been vacuumed in a while (say the admin disabled autovacuum 
 for a while), now that small 100 row table has 1000 pages in it most of 
 which are just bloat, will we miss this table?  Perhaps basing this on 
 reltuples would be better?

The entire point of this is to ensure that the second daemon will only
vacuum tables that it can finish very quickly. If you let a table bloat
so it's too big, then you just can't vacuum it very frequently without
risking all your other hot tables bloating because they're no longer
getting vacuumed.

The reality is that you can actually vacuum a pretty good-sized table in
60 seconds with typical cost-delay settings (ie: defaults except
cost_delay set to 10). That means you can do 9 pages ~100 times a
second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
that's still 27k pages per minute.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Jim C. Nasby wrote:

 The advantage to keying this to autovac_naptime is that it means we
 don't need another GUC, but after I suggested that before I realized
 that's probably not the best idea. For example, I've seen clusters that
 are running dozens-hundreds of databases; in that environment you really
 need to turn naptime way down (to like a second). In that case you
 wouldn't want to key to naptime.

Actually, I've been thinking that it would be a good idea to change the
semantics of autovacuum_naptime so that it means the average time to
start a worker in any given database.  That way, the time between
autovac runs is not dependent on the number of databases you have.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote:
 I'm not liking any of these very much, as they seem critically dependent
 on impossible-to-tune parameters.  I think it'd be better to design this
 around having the first worker explicitly expose its state (list of
 tables to process, in order) and having subsequent workers key off that
 info.

 The real problem is trying to set that up in such a fashion that keeps
 hot tables frequently vacuumed;

Certainly, but it's not clear where that behavior emerges from Alvaro's
or Matthew's proposals, either.

Are we assuming that no single worker instance will vacuum a given table
more than once?  (That's not a necessary assumption, certainly, but
without it there are so many degrees of freedom that I'm not sure how
it should act.)  Given that assumption, the maximum vacuuming rate for
any table is once per autovacuum_naptime, and most of the magic lies in
the launcher's algorithm for deciding which databases to launch workers
into.

I'm inclined to propose an even simpler algorithm in which every worker
acts alike; its behavior is
1. On startup, generate a to-do list of tables to process, sorted in
priority order.
2. For each table in the list, if the table is still around and has not
been vacuumed by someone else since you started (including the case of
a vacuum-in-progress), then vacuum it.

Detecting "already vacuumed since you started" is a bit tricky; you
can't really rely on the stats collector since its info isn't very
up-to-date.  That's why I was thinking of exposing the to-do lists
explicitly; comparing those with an advertised current-table would
allow accurate determination of what had just gotten done.
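
As a rough sketch of that scheme (every name and type here is a
hypothetical stand-in, not backend code), each worker would re-check
its list entries against the advertised tables and anything observed
completed since it started:

    #include <stdio.h>

    typedef unsigned int Oid;

    #define MAX_WORKERS 8
    static Oid advertised[MAX_WORKERS];  /* current table per worker, 0 = idle */

    /* Skip a to-do entry if it is being vacuumed right now or was
     * finished by someone else after we built our list. */
    static int
    skip_table(Oid relid, const Oid *done_since_start, int n_done)
    {
        int i;

        for (i = 0; i < MAX_WORKERS; i++)
            if (advertised[i] == relid)
                return 1;               /* vacuum-in-progress elsewhere */
        for (i = 0; i < n_done; i++)
            if (done_since_start[i] == relid)
                return 1;               /* already vacuumed since we started */
        return 0;
    }

    int
    main(void)
    {
        Oid todo[] = {16384, 16390, 16402};  /* priority-ordered list */
        Oid done[] = {16390};                /* seen completed elsewhere */
        int i;

        advertised[0] = 16402;               /* another worker has this one */

        for (i = 0; i < 3; i++)
            if (!skip_table(todo[i], done, 1))
                printf("vacuum %u\n", todo[i]);  /* only 16384 remains */
        return 0;
    }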

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
[ oh, I forgot to respond to this: ]

Jim C. Nasby [EMAIL PROTECTED] writes:
 Isn't there a special lock acquired on a relation by vacuum? Can't we
 just check for that?

I think you're thinking that ConditionalLockRelation solves the problem,
but it does not, because it will fail if someone has taken a (quasi)
exclusive lock unrelated to vacuuming.  You don't want an application
that frequently takes short-term ExclusiveLocks on a table to thereby
cause autovacuum to frequently skip that table.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Alvaro Herrera
Tom Lane wrote:

 I think an absolute minimum requirement for a sane design is that no two
 workers ever try to vacuum the same table concurrently, and I don't see
 where that behavior will emerge from your proposal; whereas it's fairly
 easy to make it happen if non-first workers pay attention to what other
 workers are doing.

FWIW, I've always considered this to be a very important and obvious
issue, and I think I've neglected mentioning it (maybe I did too few
times).  But I think this is pretty easy to do: just have each worker
advertise the current table it's working on in shared memory, and add a
recheck loop on the table-pick algorithm (with appropriate grabs of the
autovacuum lwlock), to make sure no one starts to vacuum the same table
you're going to process, at the same time.
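
In sketch form, with a pthread mutex standing in for the autovacuum
LWLock and every name invented for illustration:

    #include <pthread.h>
    #include <stdio.h>

    typedef unsigned int Oid;

    #define MAX_WORKERS 8
    static pthread_mutex_t autovac_lock = PTHREAD_MUTEX_INITIALIZER;
    static Oid current_table[MAX_WORKERS];   /* 0 = idle */

    /* Advertise relid for my_slot and return 1, or return 0 if some
     * other worker already claimed the same table. */
    static int
    claim_table(int my_slot, Oid relid)
    {
        int i, ok = 1;

        pthread_mutex_lock(&autovac_lock);
        for (i = 0; i < MAX_WORKERS; i++)
            if (i != my_slot && current_table[i] == relid)
                ok = 0;                      /* concurrent vacuum: skip */
        if (ok)
            current_table[my_slot] = relid;  /* advertise before vacuuming */
        pthread_mutex_unlock(&autovac_lock);
        return ok;
    }

    int
    main(void)
    {
        current_table[1] = 16384;            /* worker 1 is on this table */
        printf("%d %d\n", claim_table(0, 16384), claim_table(0, 16390));
        return 0;
    }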

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
I'm not sure how pg_class.relpages is maintained but what happens to a 
bloated table?  For example, a 100 row table that is constantly updated 
and hasn't been vacuumed in a while (say the admin disabled autovacuum 
for a while), now that small 100 row table has 1000 pages in it most of 
which are just bloat, will we miss this table?  Perhaps basing this on 
reltuples would be better?


The entire point of this is to ensure that the second daemon will only
vacuum tables that it can finish very quickly. If you let a table bloat
so it's too big, then you just can't vacuum it very frequently without
risking all your other hot tables bloating because they're no longer
getting vacuumed.

The reality is that you can actually vacuum a pretty good-sized table in
60 seconds with typical cost-delay settings (ie: defaults except
cost_delay set to 10). That means you can do 9 pages ~100 times a
second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
that's still 27k pages per minute.


At the risk of sounding like a broken record, I still think the size 
limit threshold is unnecessary.  Since all workers will be working on 
tables in size order, younger workers will typically catch older workers 
fairly quickly since the tables will be either small or recently 
vacuumed and not in need of work.  And since younger workers exit when they 
catch up to an older worker, there is some inherent stability in the 
number of workers.


Here is a worst case example: A DB with 6 tables all of which are highly 
active and will need to be vacuumed constantly.  While this is totally 
hypothetical, it is how I envision things working (without the threshold).


table1:10 rows
table2:   100 rows
table3: 1,000 rows
table4:10,000 rows
table5:   100,000 rows
table6: 1,000,000 rows

time=0*naptime: No workers in the DB

time=1*naptime: worker1 starts on table1

time=2*naptime: worker1 has finished table1,table2 and table3, it's now 
working on table4, worker2 starts on table1.


time=3*naptime: worker1 is on table5, worker2 is working on table4, 
worker3 starts on table1.


time=4*naptime: worker1 is still on table5, worker2 has caught up to 
worker1 and exits, worker3 also catches up to worker1 since tables2-4 
didn't require vacuum at this time so it exits, worker4 starts on table1


time=5*naptime: worker1 is working on table6, worker4 is up to table4, 
worker5 starts on table1


time=6*naptime: worker1 is working on table6, worker4 catches up to 
worker1 and exits, worker5 finds no additional work to be done and 
exits, worker6 starts at table1.


time=7*naptime: worker1 still working on table6, worker6 is up to 
table4, worker7 starts at table1.


time=8*naptime: worker1 still working on table6, worker6 still working 
on table4, worker7 working on table3, worker8 starting on table1.


time=9*naptime: worker1 still working on table6, worker6 working on 
table5, worker7 catches worker 6 and exits, worker8 finds nothing more 
todo and exits, worker9 starts on table1


time=10*naptime: worker1 still working on table6, worker9 working on 
table4, worker10 starts on table1.



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I think an absolute minimum requirement for a sane design is that no two
 workers ever try to vacuum the same table concurrently,

 FWIW, I've always considered this to be a very important and obvious
 issue, and I think I've neglected mentioning it (maybe I did too few
 times).  But I think this is pretty easy to do: just have each worker
 advertise the current table it's working on in shared memory, and add a
 recheck loop on the table-pick algorithm (with appropriate grabs of the
 autovacuum lwlock), to make sure no one starts to vacuum the same table
 you're going to process, at the same time.

Well, any of these proposals need that at the bottom level, to prevent
race conditions.  But I'd prefer a design that wasn't positively
encouraging multiple workers to try to pick the same table concurrently.
Not only is that wasteful, but it makes it harder to predict what is the
real behavior that emerges after race conditions are backed off from.

BTW, to what extent might this whole problem be simplified if we adopt
chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
If the unit of work has a reasonable upper bound regardless of table
size, maybe the problem of big tables starving small ones goes away.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Jim C. Nasby [EMAIL PROTECTED] writes:

The real problem is trying to set that up in such a fashion that keeps
hot tables frequently vacuumed;


Are we assuming that no single worker instance will vacuum a given table
more than once?  (That's not a necessary assumption, certainly, but
without it there are so many degrees of freedom that I'm not sure how
it should act.)  Given that assumption, the maximum vacuuming rate for
any table is once per autovacuum_naptime, and most of the magic lies in
the launcher's algorithm for deciding which databases to launch workers
into.


Yes, I have been working under the assumption that a worker goes through 
the list of tables once and exits, and yes the maximum vacuuming rate 
for any table would be once per autovacuum_naptime.  We can lower the 
default if necessary; as far as I'm concerned it's (or should be) fairly 
cheap to fire off a worker and have it find that there isn't anything 
to do and exit.



I'm inclined to propose an even simpler algorithm in which every worker
acts alike; its behavior is
1. On startup, generate a to-do list of tables to process, sorted in
priority order.
2. For each table in the list, if the table is still around and has not
been vacuumed by someone else since you started (including the case of
a vacuum-in-progress), then vacuum it.


That is what I'm proposing except for one difference: when you catch up 
to an older worker, exit.  This has the benefit of reducing the number of 
workers concurrently working on big tables, which I think is a good thing.



Detecting "already vacuumed since you started" is a bit tricky; you
can't really rely on the stats collector since its info isn't very
up-to-date.  That's why I was thinking of exposing the to-do lists
explicitly; comparing those with an advertised current-table would
allow accurate determination of what had just gotten done.


Sounds good, but I have very little insight into how we would implement 
"already vacuumed since you started" or "have I caught up to another 
worker".


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

BTW, to what extent might this whole problem be simplified if we adopt
chunk-at-a-time vacuuming (compare current discussion with Galy Lee)?
If the unit of work has a reasonable upper bound regardless of table
size, maybe the problem of big tables starving small ones goes away.


So if we adopted chunk-at-a-time then perhaps each worker processes the 
list of tables in OID order (or some unique and stable order) and does 
one chunk per table that needs vacuuming.  This way an equal amount of 
bandwidth is given to all tables.


That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?


Matt


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 Tom Lane wrote:
 I'm inclined to propose an even simpler algorithm in which every worker
 acts alike;

 That is what I'm proposing except for one difference, when you catch up 
 to an older worker, exit.

No, that's a bad idea, because it means that any large table starves
even-larger tables.

(Note: in all this I assume we're all using "size" as a shorthand for
some sort of priority metric that considers the number of dirty tuples,
not only size.  We don't want every worker insisting on passing over every
small read-only table every time, for instance.)

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?

It seems fairly trivial to me to have a scheme where you do one
fill-workmem-and-scan-indexes cycle per invocation, and store the
next-heap-page-to-scan in some handy place (new pg_class column updated
along with relpages/reltuples, likely).  Galy is off in left field with
some far more complex ideas :-( but I don't see that there's all that
much needed to support this behavior ... especially if we don't expose
it to the SQL level but only support it for autovac's use.  Then we're
not making any big commitment to support the behavior forever.
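
For illustration only, a toy model of that bookkeeping, with the
proposed pg_class column reduced to a static variable and the page
scan faked:

    #include <stdio.h>

    typedef unsigned int BlockNumber;

    static BlockNumber next_heap_page_to_scan = 0;  /* stand-in column */

    /* One invocation: scan heap pages until the dead-TID area fills,
     * do one index-cleaning pass, and remember where to resume. */
    static void
    vacuum_one_chunk(BlockNumber rel_nblocks, unsigned max_dead_tids)
    {
        BlockNumber blk = next_heap_page_to_scan;
        unsigned    dead = 0;

        while (blk < rel_nblocks && dead < max_dead_tids)
        {
            dead += 10;        /* pretend each page yields 10 dead tuples */
            blk++;
        }
        /* ... one fill-workmem-and-scan-indexes cycle happens here ... */
        next_heap_page_to_scan = (blk >= rel_nblocks) ? 0 : blk;
        printf("scanned through block %u, resume at %u\n",
               blk, next_heap_page_to_scan);
    }

    int
    main(void)
    {
        vacuum_one_chunk(1000, 5000);   /* stops after 500 pages */
        vacuum_one_chunk(1000, 5000);   /* picks up at block 500 */
        return 0;
    }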

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:

Tom Lane wrote:

I'm inclined to propose an even simpler algorithm in which every worker
acts alike;


That is what I'm proposing except for one difference, when you catch up 
to an older worker, exit.


No, that's a bad idea, because it means that any large table starves
even-larger tables.


True, but the assumption I'm making is that there is a finite amount of 
bandwidth available and more concurrent activity will have a net 
negative effect on the time it takes to vacuum all tables.  I'm willing to 
pay that price to prevent small hot tables from getting starved, but 
less willing to pay the same price for large tables where the percentage 
of bloat will be much smaller.



(Note: in all this I assume we're all using "size" as a shorthand for
some sort of priority metric that considers the number of dirty tuples,
not only size.  We don't want every worker insisting on passing over every
small read-only table every time, for instance.)


I was using size to mean reltuples.  The whole concept of sorting by 
size was designed to ensure that smaller (more susceptible to bloat) 
tables got priority.  It might be useful for workers to sort their to-do 
lists by some other metric, but I don't have a clear vision of what that 
might be.



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:

That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?


It seems fairly trivial to me to have a scheme where you do one
fill-workmem-and-scan-indexes cycle per invocation, and store the
next-heap-page-to-scan in some handy place (new pg_class column updated
along with relpages/reltuples, likely).  Galy is off in left field with
some far more complex ideas :-( but I don't see that there's all that
much needed to support this behavior ... especially if we don't expose
it to the SQL level but only support it for autovac's use.  Then we're
not making any big commitment to support the behavior forever.


Well, if we can make it happen soon, it might be the best thing for 
autovacuum.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
 
  The advantage to keying this to autovac_naptime is that it means we
  don't need another GUC, but after I suggested that before I realized
  that's probably not the best idea. For example, I've seen clusters that
  are running dozens-hundreds of databases; in that environment you really
  need to turn naptime way down (to like a second). In that case you
  wouldn't want to key to naptime.
 
 Actually, I've been thinking that it would be a good idea to change the
 semantics of autovacuum_naptime so that it means the average time to
 start a worker in any given database.  That way, the time between
 autovac runs is not dependent on the number of databases you have.

Hrm... how would that work?

BTW, another thought is to only sleep if you've scanned through every
database and found nothing to do.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 10:48:49PM -0500, Tom Lane wrote:
 Matthew T. O'Connor matthew@zeut.net writes:
  That does sound simpler. Is chunk-at-a-time a realistic option for 8.3?
 
 It seems fairly trivial to me to have a scheme where you do one
 fill-workmem-and-scan-indexes cycle per invocation, and store the
 next-heap-page-to-scan in some handy place (new pg_class column updated
 along with relpages/reltuples, likely).  Galy is off in left field with
 some far more complex ideas :-( but I don't see that there's all that
 much needed to support this behavior ... especially if we don't expose
 it to the SQL level but only support it for autovac's use.  Then we're
 not making any big commitment to support the behavior forever.

The problem I see there is that the case we're trying to fix is tables
that need to be vacuumed every few minutes. As I posted elsewhere, it's
reasonable to assume a vacuum rate of ~1000 pages/second for a small
table that's going to be in memory (assuming that vacuum dirties every
page). That means that you can only dirty about 60k pages per cycle,
which seems way too small to be practical unless we come up with a way to
avoid scanning the indexes on every cycle.

The proposal to save enough state to be able to resume a vacuum at
pretty much any point in its cycle might work; we'd have to benchmark
it.  With the default maintenance_work_mem of 128M it would mean writing
out 64M of state (a half-full dead-TID array) every minute on average,
which is likely to take several seconds to fsync (though, maybe we
wouldn't need to fsync it...)
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 The proposal to save enough state to be able to resume a vacuum at
 pretty much any point in its cycle might work; we'd have to benchmark
 it.  With the default maintenance_work_mem of 128M it would mean writing
 out 64M of state every minute on average, which is likely to take
 several seconds to fsync (though, maybe we wouldn't need to fsync it...)

Which is exactly why we needn't bother benchmarking it.  Even if it
weren't complex and unsafe, it will be a net loss when you consider the
fact that it adds I/O instead of removing it.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote:
 I'm not sure how pg_class.relpages is maintained but what happens to a 
 bloated table?  For example, a 100 row table that is constantly updated 
 and hasn't been vacuumed in a while (say the admin disabled autovacuum 
 for a while), now that small 100 row table has 1000 pages in it most of 
 which are just bloat, will we miss this table?  Perhaps basing this on 
 reltuples would be better?
 
 The entire point of this is to ensure that the second daemon will only
 vacuum tables that it can finish very quickly. If you let a table bloat
 so it's too big, then you just can't vacuum it very frequently without
 risking all your other hot tables bloating because they're no longer
 getting vacuumed.
 
 The reality is that you can actually vacuum a pretty good-sized table in
 60 seconds with typical cost-delay settings (ie: defaults except
 cost_delay set to 10). That means you can do 9 pages ~100 times a
 second, or 54k pages a minute. Even with a vacuum_cost_delay of 20,
 that's still 27k pages per minute.
 
 At the risk of sounding like a broken record, I still think the size 
 limit threshold is unnecessary.  Since all workers will be working on 
 tables in size order, younger workers will typically catch older workers 
 fairly quickly since the tables will be either small or recently 
 vacuumed and not in need of work.  And since younger workers exit when they 
 catch up to an older worker, there is some inherent stability in the 
 number of workers.
 
 Here is a worst case example: A DB with 6 tables all of which are highly 
 active and will need to be vacuumed constantly.  While this is totally 
 hypothetical, it is how I envision things working (without the threshold).

I fail to see how a simple 6 table case is 'worst case'. It's common to
see hundreds of tables, and I've run across more than one database with
thousands of tables (think partitioning). In cases like those it's
certainly possible, perhaps even likely that you would get many daemons
running in the database at one time just from different tables suddenly
needing vacuuming and appearing at a higher point in the list than other
tables. With 100 ~1G tables getting updates it certainly wouldn't be
hard to end up with 10 of those being vacuumed all at the same time.

I do like the idea since it should be easier to tune, but I think we
still need some limit on it. Perhaps as a first-pass we could just have
a hard limit and log a message and/or set a flag any time we hit it.
That would hopefully allow us to get information about how big a problem
it really is. We could go one step further and say that the last daemon
that can start in a database will only vacuum tables that can be done
quickly; that's essentially what we've been talking about, except the
limit we've been discussing would be hard-coded at 2.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Jim C. Nasby
On Tue, Feb 27, 2007 at 12:37:42AM -0500, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  The proposal to save enough state to be able to resume a vacuum at
  pretty much any point in its cycle might work; we'd have to benchmark
  it.  With the default maintenance_work_mem of 128M it would mean writing
  out 64M of state every minute on average, which is likely to take
  several seconds to fsync (though, maybe we wouldn't need to fsync it...)
 
 Which is exactly why we needn't bother benchmarking it.  Even if it
 weren't complex and unsafe, it will be a net loss when you consider the
 fact that it adds I/O instead of removing it.

Well, it depends on how often you're doing that. Adding extra IO at the
end of 4 hours of vacuuming isn't going to make any real difference, but
once a minute...

Looks like partial vacuum won't help this problem. :(
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:

Jim C. Nasby wrote:
Here is a worst case example: A DB with 6 tables all of which are highly 
active and will need to be vacuumed constantly.  While this is totally 
hypothetical, it is how I envision things working (without the threshold).


I fail to see how a simple 6 table case is 'worst case'. It's common to
see hundreds of tables, and I've run across more than one database with
thousands of tables (think partitioning). In cases like those it's
certainly possible, perhaps even likely that you would get many daemons
running in the database at one time just from different tables suddenly
needing vacuuming and appearing at a higher point in the list than other
tables. With 100 ~1G tables getting updates it certainly wouldn't be
hard to end up with 10 of those being vacuumed all at the same time.


Yes, 6 tables is small; the worst-case part of the example was that all 
the tables would need to be vacuumed constantly.  Most databases only 
have a few hot tables.  Most tables only need to be vacuumed every once in 
a while.



I do like the idea since it should be easier to tune, but I think we
still need some limit on it. Perhaps as a first-pass we could just have
a hard limit and log a message and/or set a flag any time we hit it.
That would hopefully allow us to get information about how big a problem
it really is. We could go one step further and say that the last daemon
that can start in a database will only vacuum tables that can be done
quickly; that's essentially what we've been talking about, except the
limit we've been discussing would be hard-coded at 2.


I'm confused, what limit would be set at 2?  The number of concurrent 
workers?  I've never said that.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Jim C. Nasby [EMAIL PROTECTED] writes:

The proposal to save enough state to be able to resume a vacuum at
pretty much any point in its cycle might work; we'd have to benchmark
it.  With the default maintenance_work_mem of 128M it would mean writing
out 64M of state every minute on average, which is likely to take
several seconds to fsync (though, maybe we wouldn't need to fsync it...)


Which is exactly why we needn't bother benchmarking it.  Even if it
weren't complex and unsafe, it will be a net loss when you consider the
fact that it adds I/O instead of removing it.


I'm not sure what you are saying here: are you now saying that partial 
vacuum won't work for autovac?  Or are you saying that saving state as 
Jim is describing above won't work?


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 I'm not sure what you are saying here: are you now saying that partial 
 vacuum won't work for autovac?  Or are you saying that saving state as 
 Jim is describing above won't work?

I'm saying that I don't like the idea of trying to stop on a dime by
saving the current contents of vacuum's dead-TID array to disk with the
idea that we can trust those values 100% later.  Saving the array is
expensive both in runtime and code complexity, and I don't believe we
can trust it later --- at least not without even more expensive-and-
complex measures, such as WAL-logging every such save :-(

I'm for stopping only after completing an index-cleaning pass, at the
point where we empty the dead-TID array anyway.  If you really have to
have "stop on a dime", just kill -INT the process, accepting that you
will have to redo your heap scan since the last restart point.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-26 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:
I'm not sure what you are saying here: are you now saying that partial 
vacuum won't work for autovac?  Or are you saying that saving state as 
Jim is describing above won't work?


I'm saying that I don't like the idea of trying to stop on a dime by
saving the current contents of vacuum's dead-TID array to disk with the
idea that we can trust those values 100% later.  Saving the array is
expensive both in runtime and code complexity, and I don't believe we
can trust it later --- at least not without even more expensive-and-
complex measures, such as WAL-logging every such save :-(

I'm for stopping only after completing an index-cleaning pass, at the
point where we empty the dead-TID array anyway.  If you really have to
have "stop on a dime", just kill -INT the process, accepting that you
will have to redo your heap scan since the last restart point.


OK, so if I understand correctly, a vacuum of a table with 10 indexes on 
it can be interrupted 10 times, once after each index-cleaning pass? 
That might have some value, especially breaking up the work required to 
vacuum a large table.  Or am I still not getting it?


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-23 Thread Alvaro Herrera
Jim C. Nasby wrote:
 On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:

  I'm not sure this is a great idea, but I don't see how this would result 
  in large numbers of workers working in one database.   If workers work 
  on tables in size order, and exit as soon as they catch up to an older 
  worker, I don't see the problem.  Newer workers are going to catch up to 
  older workers pretty quickly since small tables will vacuum fairly quickly.
 
 The reason that won't necessarily happen is because you can get large
 tables popping up as needing vacuuming at any time.

Right.

We know that a table that needs frequent vacuum necessarily has to be
small -- so maybe have the second worker exit when it catches up with
the first, or when the next table is above 1 GB, whichever happens
first.  That way, only the first worker can be processing the huge
tables.  The problem with this is that if one of your hot tables grows
a bit larger than 1 GB, you suddenly have a change in autovacuuming
behavior, for no really good reason.
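
Stated as a predicate, the rule would be something like this sketch
(the 8 kB page size and all names are assumptions):

    #include <stdbool.h>
    #include <stdio.h>

    #define GB_IN_PAGES (1024L * 1024 * 1024 / 8192)  /* 1 GB of 8 kB pages */

    /* Second worker stops when it catches the first worker, or when
     * the next table on its list exceeds 1 GB. */
    static bool
    second_worker_should_exit(bool caught_first, long next_relpages)
    {
        return caught_first || next_relpages > GB_IN_PAGES;
    }

    int
    main(void)
    {
        /* ~780 MB: keep going; ~1.1 GB: leave it to the first worker */
        printf("%d %d\n",
               second_worker_should_exit(false, 100000),
               second_worker_should_exit(false, 150000));
        return 0;
    }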

And while your second worker is processing the tables in the hundreds-MB
range, your high-update 2 MB tables are neglected :-(

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] autovacuum next steps, take 2

2007-02-23 Thread Jim C. Nasby
On Fri, Feb 23, 2007 at 01:22:17PM -0300, Alvaro Herrera wrote:
 Jim C. Nasby wrote:
  On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
 
   I'm not sure this is a great idea, but I don't see how this would result 
   in large numbers of workers working in one database.   If workers work 
   on tables in size order, and exit as soon as they catch up to an older 
   worker, I don't see the problem.  Newer workers are going to catch up to 
   older workers pretty quickly since small tables will vacuum fairly 
   quickly.
  
  The reason that won't necessarily happen is because you can get large
  tables popping up as needing vacuuming at any time.
 
 Right.
 
 We know that a table that needs frequent vacuum necessarily has to be
 small -- so maybe have the second worker exit when it catches up with
 the first, or when the next table is above 1 GB, whichever happens
 first.  That way, only the first worker can be processing the huge
 tables.  The problem with this is that if one of your hot tables grows
 a bit larger than 1 GB, you suddenly have a change in autovacuuming
 behavior, for no really good reason.
 
 And while your second worker is processing the tables in the hundreds-MB
 range, your high-update 2 MB tables are neglected :-(

That's why I'm thinking it would be best to keep the maximum size of
stuff for the second worker small. It probably also makes sense to tie
it to time and not size, since the key factor is that you want it to hit
the high-update tables every X number of seconds.

If we wanted to get fancy, we could factor in how far over the vacuum
threshold a table is, so even if the table is on the larger size, if
it's way over the threshold the second vacuum will hit it.

You know, maybe the best way to handle this is to force both vacuums to
exit after a certain amount of time, probably with a longer time limit
for the first vacuum in a database. That would mean that after
processing a large table for 10 minutes, the first vacuum would
exit/re-evaluate what work needs to be done. That would mean
medium-sized tables wouldn't get completely starved.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Zeugswetter Andreas ADI SD

 vacuum should be a process with the least amount of voodoo. 
 If we can just have vacuum_delay and vacuum_threshold, where 
 threshold allows an arbitrary setting of how much bandwidth 
 we will allot to the process, then that is a beyond wonderful thing.
 
 It is easy to determine how much IO you have, and what you can spare.

The tricky part is what metric to use. Imho IO per second would be
good.
In a typical DB scenario that is the IO bottleneck, not the Mb/s.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
My Proposal:  If we require admins to identify hot tables, then: 
 1) Launcher fires-off a worker1 into database X.
 2) worker1 deals with hot tables first, then regular tables.
 3) Launcher continues to launch workers to DB X every autovac naptime. 
 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as 
 worker1 did above.  If worker1 is still working in DB X then worker2 
 looks for hot tables that are being starved because worker1 got busy. 
 If worker2 finds no hot tables that need work, then worker2 exits.

Rather than required people to manually identify hot tables, what if we
just prioritize based on table size? So if a second autovac process hits
a specific database, it would find the smallest table in need of
vacuuming that it should be able to complete before the next naptime and
vacuum that. It could even continue picking tables until it can't find
one that it could finish within the naptime. Granted, it would have to
make some assumptions about how many pages it would dirty.
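
For illustration, the selection might look like this sketch, reusing
the rough 0.5 ms/page figure from the cost-delay arithmetic upthread
(all names and numbers are invented):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        unsigned oid;
        double   relpages;
        int      needs_vacuum;
    } Rel;

    static int
    by_size(const void *a, const void *b)
    {
        double d = ((const Rel *) a)->relpages - ((const Rel *) b)->relpages;

        return (d > 0) - (d < 0);
    }

    int
    main(void)
    {
        Rel    rels[] = { {3, 200000, 1}, {1, 120, 1}, {2, 900, 1} };
        double budget_ms = 60 * 1000.0;   /* one naptime's worth of work */
        double ms_per_page = 0.5;         /* from the cost-delay math */
        int    i;

        qsort(rels, 3, sizeof(Rel), by_size);   /* smallest first */
        for (i = 0; i < 3; i++)
        {
            double cost = rels[i].relpages * ms_per_page;

            if (!rels[i].needs_vacuum || cost > budget_ms)
                continue;                 /* too big to finish in time */
            printf("vacuum oid %u\n", rels[i].oid);
            budget_ms -= cost;            /* spend the remaining budget */
        }
        return 0;
    }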

ISTM that's a lot easier than forcing admins to mark specific tables.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
  
My Proposal:  If we require admins to identify hot tables, then: 
1) Launcher fires-off a worker1 into database X.

2) worker1 deals with hot tables first, then regular tables.
3) Launcher continues to launch workers to DB X every autovac naptime. 
4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as 
worker1 did above.  If worker1 is still working in DB X then worker2 
looks for hot tables that are being starved because worker1 got busy. 
If worker2 finds no hot tables that need work, then worker2 exits.



Rather than required people to manually identify hot tables, what if we
just prioritize based on table size? So if a second autovac process hits
a specific database, it would find the smallest table in need of
vacuuming that it should be able to complete before the next naptime and
vacuum that. It could even continue picking tables until it can't find
one that it could finish within the naptime. Granted, it would have to
make some assumptions about how many pages it would dirty.

ISTM that's a lot easier than forcing admins to mark specific tables.


So the heuristic would be:
* Launcher fires off workers into a database at a given interval 
(perhaps configurable?)
* Each worker works on tables in size order. 
* If a worker ever catches up to an older worker, then the younger 
worker exits.


This sounds simple and workable to me, perhaps we can later modify this 
to include some max_workers variable so that a worker would only exit if 
it catches an older worker and there are max_workers currently active.



Thoughts?

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote:
   
 My Proposal:  If we require admins to identify hot tables, then: 
 1) Launcher fires-off a worker1 into database X.
 2) worker1 deals with hot tables first, then regular tables.
 3) Launcher continues to launch workers to DB X every autovac naptime. 
 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as 
 worker1 did above.  If worker1 is still working in DB X then worker2 
 looks for hot tables that are being starved because worker1 got busy. 
 If worker2 finds no hot tables that need work, then worker2 exits.
 
 
 Rather than required people to manually identify hot tables, what if we
 just prioritize based on table size? So if a second autovac process hits
 a specific database, it would find the smallest table in need of
 vacuuming that it should be able to complete before the next naptime and
 vacuum that. It could even continue picking tables until it can't find
 one that it could finish within the naptime. Granted, it would have to
 make some assumptions about how many pages it would dirty.
 
 ISTM that's a lot easier than forcing admins to mark specific tables.
 
 So the heuristic would be:
 * Launcher fires off workers into a database at a given interval 
 (perhaps configurable?)
 * Each worker works on tables in size order. 
 * If a worker ever catches up to an older worker, then the younger 
 worker exits.
 
 This sounds simple and workable to me, perhaps we can later modify this 
 to include some max_workers variable so that a worker would only exit if 
 it catches an older worker and there are max_workers currently active.

That would likely result in a number of workers running in one database,
unless you limited how many workers per database. And if you did that,
you wouldn't be addressing the frequently update table problem.

A second vacuum in a database *must* exit after a fairly short time so
that we can go back in and vacuum the important tables again (well or
the 2nd vacuum has to periodically re-evaluate what tables need to be
vacuumed).
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Thu, Feb 22, 2007 at 09:35:45AM +0100, Zeugswetter Andreas ADI SD wrote:
 
  vacuum should be a process with the least amount of voodoo. 
  If we can just have vacuum_delay and vacuum_threshold, where 
  threshold allows an arbitrary setting of how much bandwidth 
  we will allot to the process, then that is a beyond wonderful thing.
  
  It is easy to determine how much IO you have, and what you can spare.
 
 The tricky part is what metric to use. Imho IO per second would be
 good.
 In a typical DB scenario that is the IO bottleneck, not the Mb/s.

Well, right now they're one and the same... but yeah, IO/sec probably
does make more sense.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Zeugswetter Andreas ADI SD

   vacuum should be a process with the least amount of voodoo. 
   If we can just have vacuum_delay and vacuum_threshold, where 
   threshold allows an arbitrary setting of how much bandwidth we will 
   allot to the process, then that is a beyond wonderful thing.
   
   It is easy to determine how much IO you have, and what you can spare.
  
  The tricky part is what metric to use. Imho IO per second would be good.
  In a typical DB scenario that is the IO bottleneck, not the Mb/s.
 
 Well, right now they're one and the same... but yeah, IO/sec probably 
 does make more sense.

Hopefully not :-) Else you have no readahead. And that is imho the
problem.
You need to anticipate how many physical IO's your logical IO's cause.
And this is near impossible unless we group IO's in pg itself.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Matthew T. O'Connor

Jim C. Nasby wrote:

On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
  

So the heuristic would be:
* Launcher fires off workers into a database at a given interval 
(perhaps configurable?)
* Each worker works on tables in size order. 
* If a worker ever catches up to an older worker, then the younger 
worker exits.


This sounds simple and workable to me, perhaps we can later modify this 
to include some max_workers variable so that a worker would only exit if 
it catches an older worker and there are max_workers currently active.



That would likely result in a number of workers running in one database,
unless you limited how many workers per database. And if you did that,
you wouldn't be addressing the frequently update table problem.

A second vacuum in a database *must* exit after a fairly short time so
that we can go back in and vacuum the important tables again (well or
the 2nd vacuum has to periodically re-evaluate what tables need to be
vacuumed).
  


I'm not sure this is a great idea, but I don't see how this would result 
in large numbers of workers working in one database.   If workers work 
on tables in size order, and exit as soon as they catch up to an older 
worker, I don't see the problem.  Newer workers are going to catch up to 
older workers pretty quickly since small tables will vacuum fairly quickly.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] autovacuum next steps, take 2

2007-02-22 Thread Jim C. Nasby
On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote:
 Jim C. Nasby wrote:
 On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote:
   
 So the heuristic would be:
 * Launcher fires off workers into a database at a given interval 
 (perhaps configurable?)
 * Each worker works on tables in size order. 
 * If a worker ever catches up to an older worker, then the younger 
 worker exits.
 
 This sounds simple and workable to me, perhaps we can later modify this 
 to include some max_workers variable so that a worker would only exit if 
 it catches an older worker and there are max_workers currently active.
 
 
 That would likely result in a number of workers running in one database,
 unless you limited how many workers per database. And if you did that,
 you wouldn't be addressing the frequently update table problem.
 
 A second vacuum in a database *must* exit after a fairly short time so
 that we can go back in and vacuum the important tables again (well or
 the 2nd vacuum has to periodically re-evaluate what tables need to be
 vacuumed).
   
 
 I'm not sure this is a great idea, but I don't see how this would result 
 in large numbers of workers working in one database.   If workers work 
 on tables in size order, and exit as soon as they catch up to an older 
 worker, I don't see the problem.  Newer workers are going to catch up to 
 older workers pretty quickly since small tables will vacuum fairly quickly.

The reason that won't necessarily happen is because you can get large
tables popping up as needing vacuuming at any time.

Round 1:
Fire up autovac worker; starts working and soon hits 100G table

Round 2:
Another worker starts. Since round 1, a 98G table now needs vacuuming,
which this worker soon hits.

Round 3:
89G table now needs vacuuming. Worker 3 starts up and soon hits it.

So now we have 3 workers, all hammering away in the same database, and
likely causing a lot of random IO.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Alvaro Herrera
Ok, scratch that :-)  Another round of braindumping below.


Launcher starts one worker in each database.  This worker is not going
to do vacuum work, just report how much vacuum effort is needed in the
database.  Vacuum effort is measured as the total number of pages in
need of vacuum, being the sum of relpages of all tables and indexes
needing vacuum.  (Note: we weight heap pages the same as index pages.
Is this OK?)

Create a plan for vacuuming all those databases within the constraints
of max_workers.  Databases needing the most work are vacuumed first.
One worker per database.  Thus max_workers databases are being vacuumed
in parallel at this time.  When one database is finished, the launcher
starts a worker in the next database in the list.

When the plan is complete (i.e. the list is empty) we can do the whole
thing again, excluding the databases that are still being vacuumed.
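
To make the planning step concrete, here is a minimal C sketch of the
launcher's loop (all structure and variable names are invented for
illustration; this is not actual backend code): measure each database's
effort, sort descending, and keep up to max_workers workers busy:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct DbEffort
    {
        unsigned int dbid;      /* database OID */
        long         effort;    /* pages needing vacuum, heap + index */
    } DbEffort;

    /* Sort databases so the neediest come first. */
    static int
    effort_desc(const void *a, const void *b)
    {
        long ea = ((const DbEffort *) a)->effort;
        long eb = ((const DbEffort *) b)->effort;

        return (ea < eb) ? 1 : (ea > eb) ? -1 : 0;
    }

    int main(void)
    {
        DbEffort dbs[] = {{16384, 120000}, {16385, 5000}, {16386, 90000}};
        int      ndbs = 3, max_workers = 2, active = 0, next = 0;

        qsort(dbs, ndbs, sizeof(DbEffort), effort_desc);

        while (next < ndbs)
        {
            if (active < max_workers)
            {
                printf("launch worker in db %u (effort %ld pages)\n",
                       dbs[next].dbid, dbs[next].effort);
                active++;
                next++;
            }
            else
                active--;       /* stand-in for "some worker finished" */
        }
        return 0;
    }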

Perhaps we should wait autovacuum_naptime seconds between finishing one
vacuum round in all databases and starting the next.  How do we measure
this: do we start sleeping when the last worker finishes, or when the
list is empty?

Perhaps we should reserve a worker for vacuuming hot tables.  The launcher
then uses max_workers-1 workers for the above plan, and the spare worker
continuously connects to one database, vacuums its hot tables, and goes
away; the launcher then starts it again on the next database.
Definitional problem: how to decide what's a hot table?  One idea (the
simplest) is to let the DBA define it.

Thus, at most two workers are on any database: one of them is working on
normal tables, the other on hot tables.

(This idea can be complemented by having another GUC var,
autovacuum_hot_workers, which allows the DBA to have more than one
worker on hot tables (just for the case where there are too many hot
tables).  This may be overkill.)


Ron Mayer expressed the thought that we're needlessly complicating the
UI for vacuum_delay, naptime, etc.  He proposes that instead of having
cost_delay etc., we have a mbytes_per_second parameter of some sort.
This strikes me as a good idea, but I think we could do that after this
proposal is implemented.  So this take 2 could be implemented, and
then we could switch the cost_delay stuff to using a MB/s kind of
measurement somehow (he says, waving his hands wildly).

Greg Stark and Matthew O'Connor say that we're misdirected in having
more than one worker per tablespace.  I say we're not :-)  If we
consider Ron Mayer's idea of measuring MB/s, but we do it per
tablespace, then we would inflict the correct amount of vacuum pain on
each tablespace, sleeping as appropriate.  I think this would require
workers of different databases to communicate which tablespaces they are
using, so that all of them can utilize the correct amount of bandwidth.
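
One way that communication could work, sketched with invented names (a
hypothetical per-tablespace entry in autovacuum shared memory; not
actual backend code):

    #include <stdio.h>

    typedef struct TablespaceUsage
    {
        unsigned int spcid;       /* tablespace OID */
        int          nworkers;    /* workers currently reading it */
        double       mb_per_sec;  /* configured budget for the device */
    } TablespaceUsage;

    /* Each worker limits itself to an equal share of the budget.  In
     * the real thing, nworkers would live in shared memory and be
     * updated under a lock as workers start and finish. */
    static double
    my_rate_limit(const TablespaceUsage *ts)
    {
        return ts->mb_per_sec / (ts->nworkers > 0 ? ts->nworkers : 1);
    }

    int main(void)
    {
        TablespaceUsage ts = {1663, 3, 12.0};   /* 3 workers share 12 MB/s */

        printf("each worker throttles to %.1f MB/s\n", my_rate_limit(&ts));
        return 0;
    }

Each worker would increment nworkers under a lock when it starts reading
a tablespace and decrement it when it finishes, so the device's budget is
split evenly among whoever is active.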


I'd like to know if this responds to the mentioned people's objections.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Ok, scratch that :-)  Another round of braindumping below.


I still think this is a solution in search of a problem.  The main problem 
we have right now is that hot tables can be starved of vacuum.  Most 
of this proposal doesn't touch that.  I would like to see that problem 
solved first; then we can talk about adding multiple workers per 
database or per tablespace, etc.



(This idea can be complemented by having another GUC var,
autovacuum_hot_workers, which allows the DBA to have more than one
worker on hot tables (just for the case where there are too many hot
tables).  This may be overkill.)


I think this is more along the lines of what we need first.


Ron Mayer expressed the thought that we're needlessly complicating the
UI for vacuum_delay, naptime, etc.  He proposes that instead of having
cost_delay etc., we have a mbytes_per_second parameter of some sort.
This strikes me as a good idea, but I think we could do that after this
proposal is implemented.  So this take 2 could be implemented, and
then we could switch the cost_delay stuff to using a MB/s kind of
measurement somehow (he says, waving his hands wildly).


Agreed, this is probably a good idea in the long run, but it is 
lower on the priority list and should come next.



Greg Stark and Matthew O'Connor say that we're misdirected in having
more than one worker per tablespace.  I say we're not :-)  If we
consider Ron Mayer's idea of measuring MB/s, but we do it per
tablespace, then we would inflict the correct amount of vacuum pain on
each tablespace, sleeping as appropriate.  I think this would require
workers of different databases to communicate which tablespaces they are
using, so that all of them can utilize the correct amount of bandwidth.


I agree that in the long run it might be better to have multiple workers 
that are MB/s-throttled and tablespace-aware, but we don't have any of that 
infrastructure right now.  I think the piece of low-hanging fruit that 
your launcher concept can solve is hot table starvation.


My Proposal:  If we require admins to identify hot tables, then: 
1) Launcher fires off a worker1 into database X.

2) worker1 deals with hot tables first, then regular tables.
3) Launcher continues to launch workers to DB X every autovacuum naptime. 
4) worker2 (or 3 or 4, etc.) checks whether it is alone in DB X; if so, it 
acts as worker1 did above.  If worker1 is still working in DB X, then 
worker2 looks for hot tables that are being starved because worker1 got 
busy.  If worker2 finds no hot tables that need work, then worker2 exits.


This seems a very simple solution (given your launcher work) that can 
solve the starvation problem.
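
For concreteness, a small C sketch of the decision a newly launched
worker would make under this proposal (the helper names are invented
stand-ins for checks against shared memory and pgstat):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for checks against shared memory and pgstat. */
    static bool another_worker_active_here(void) { return true; }
    static bool starved_hot_table_exists(void)   { return false; }
    static void vacuum_hot_then_regular(void)    { puts("full pass"); }
    static void vacuum_starved_hot_tables(void)  { puts("hot tables only"); }

    static void
    worker_main(void)
    {
        if (!another_worker_active_here())
        {
            /* Alone in DB X: act as worker1, hot tables first, then
             * regular tables. */
            vacuum_hot_then_regular();
        }
        else if (starved_hot_table_exists())
        {
            /* worker1 is busy on a big table: sweep the starved hot
             * tables, then exit. */
            vacuum_starved_hot_tables();
        }
        else
            puts("no starved hot tables; exiting");
    }

    int main(void)
    {
        worker_main();
        return 0;
    }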



Thoughts?



Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Gregory Stark

Alvaro Herrera [EMAIL PROTECTED] writes:

 Greg Stark and Matthew O'Connor say that we're misdirected in having
 more than one worker per tablespace.  I say we're not :-)  

I did say that. But your comment about using a high cost_delay was fairly
convincing too. It would be a simpler design, and I think you're right. As long
as you raise both cost_delay and cost_limit by enough, you should get pretty much
the same sequential I/O rate and not step on each other's toes too much.
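
(To spell out the arithmetic of the cost model: a worker sleeps for
cost_delay milliseconds each time it has accumulated cost_limit units of
work, so its average rate is roughly cost_limit/cost_delay.  Raising both
by, say, 10x, from 200 and 20ms to 2000 and 200ms, leaves that average
unchanged but turns the I/O into longer sequential bursts separated by
longer sleeps, so concurrent workers interleave in larger sequential
chunks rather than page by page.)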

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] autovacuum next steps, take 2

2007-02-21 Thread Joshua D. Drake

 Ron Mayer expressed the thought that we're needlessly complicating the
 UI for vacuum_delay, naptime, etc.  He proposes that instead of having
 cost_delay etc., we have a mbytes_per_second parameter of some sort.
 This strikes me as a good idea, but I think we could do that after this
 proposal is implemented.  So this take 2 could be implemented, and
 then we could switch the cost_delay stuff to using a MB/s kind of
 measurement somehow (he says, waving his hands wildly).

Vacuum should be a process with the least amount of voodoo. If we can
just have vacuum_delay and vacuum_threshold, where the threshold allows an
arbitrary setting of how much bandwidth we will allot to the process,
then that would be a wonderful thing.

It is easy to determine how much IO you have, and what you can spare.

Joshua D. Drake




 
 Greg Stark and Matthew O'Connor say that we're misdirected in having
 more than one worker per tablespace.  I say we're not :-)  If we
 consider Ron Mayer's idea of measuring MB/s, but we do it per
 tablespace, then we would inflict the correct amount of vacuum pain on
 each tablespace, sleeping as appropriate.  I think this would require
 workers of different databases to communicate which tablespaces they are
 using, so that all of them can utilize the correct amount of bandwidth.
 
 
 I'd like to know if this responds to the mentioned people's objections.
 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] autovacuum next steps

2007-02-20 Thread Jim C. Nasby
I'm wondering if we can do one better...

Since what we really care about is I/O responsiveness for the rest of
the system, could we just time how long I/O calls take to complete? I
know that gettimeofday can have a non-trivial overhead, but do we care
that much about it in the case of autovac?
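
As a rough sketch of the idea (invented names and simulated I/O; not
backend code), each read is bracketed with gettimeofday() and the worker
naps whenever I/O has consumed more than its share of wall time:

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double io_secs = 0.0;      /* accumulated time inside I/O calls */

    /* Wrap one (simulated) page read with gettimeofday() timing. */
    static void
    timed_page_read(void)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        usleep(200);                  /* stand-in for the actual read() */
        gettimeofday(&t1, NULL);
        io_secs += (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }

    int main(void)
    {
        const double budget = 0.5;    /* let vacuum spend 50% of wall time in I/O */
        struct timeval start, now;

        gettimeofday(&start, NULL);
        for (int i = 0; i < 1000; i++)
        {
            double wall;

            timed_page_read();
            gettimeofday(&now, NULL);
            wall = (now.tv_sec - start.tv_sec) +
                   (now.tv_usec - start.tv_usec) / 1e6;
            /* If I/O has taken more than its share of wall time, nap. */
            if (wall > 0 && io_secs / wall > budget)
                usleep(1000);
        }
        printf("spent %.3f s in I/O\n", io_secs);
        return 0;
    }

The gettimeofday() overhead is paid once per I/O call, which seems likely
to be in the noise next to an actual disk read.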

On Fri, Feb 16, 2007 at 05:37:26PM -0800, Ron Mayer wrote:
 Alvaro Herrera wrote:
  
  Once autovacuum_naptime... autovacuum_max_workers...
  How does this sound?
 
 The knobs exposed on autovacuum feel kinda tangential to
 what I think I'd really want to control.
 
 IMHO vacuum_mbytes_per_second would be quite a bit more
 intuitive than cost_delay, naptime, etc.
 
 
 ISTM I can relatively easily estimate and/or spec out how
 much extra I/O bandwidth I have per device for vacuum;
 and would pretty much want vacuum to be constantly
 running on whichever table needs it the most, so
 long as it can stay under that bandwidth limit.
 
 Could vacuum have a tunable that says X MBytes/second
 (perhaps per device) and have it measure how much I/O
 it's actually doing and try to stay under that limit?
 
 For more fine-grained control a cron job could go
 around setting different MBytes/second limits during
 peak times vs idle times.
 
 
 If people are concerned about CPU-intensive vacuums
 instead of I/O-intensive ones (does anyone experience
 that?), another tuneable vacuum_percent_of_cpu would
 be more straightforward than delay_cost, cost_page_hit,
 etc.  But I'd be a bit surprised if CPU-intensive
 vacuums are common.
 

-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] autovacuum next steps

2007-02-19 Thread Csaba Nagy
 One option that I've heard before is to have vacuum, after a single iteration
 (ie, after it fills maintenance_work_mem and does the index cleanup and the
 second heap pass), remember where it was and pick up from that point next
 time.

From my experience this is not acceptable... I have tables for which the
index cleanup takes hours, so no matter how low I set
maintenance_work_mem (in fact I set it high enough that there's only one
iteration), it will take too much time and the queue tables will get overly
bloated (that's not actually happening either, because they now get
special cluster treatment).

Cheers,
Csaba.





Re: [HACKERS] autovacuum next steps

2007-02-19 Thread Galy Lee


Gregory Stark wrote:

If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle
of vacuums on smaller tables, then resume, that problem would go away. That
sounds too difficult though, but perhaps we could do something nearly as good.


I think making vacuum support this interrupt-and-resume capability would 
be quite useful for large tables.


It can give autovacuum more flexibility to create a good scheduling 
scheme.  Sometimes it takes a whole day to vacuum a large table (a 
hundreds-of-GB table may qualify); setting cost_delay can make it last 
even for several days.  If the system has a maintenance window, the 
vacuum task for the large table can be split to fit into that window by 
the interrupt-and-resume feature.



One option that I've heard before is to have vacuum, after a single iteration
(ie, after it fills maintenance_work_mem and does the index cleanup and the
second heap pass), remember where it was and pick up from that point next
time.



Even a single iteration may take a long time, so breaking only at 
iteration boundaries is not very useful.  I think it is not so 
difficult to get vacuum to remember where it left off and to restart 
from that point next time.  The following is the basic idea.


A typical vacuum process mainly has the following phases:
 Phase 1. scan heap
 Phase 2. scan and sweep index
 Phase 3. sweep heap
 Phase 4. update FSM
 Phase 5. truncate CLOG

When vacuum is interrupted, we can just save the collected information 
to disk, and restore it later when vacuum restarts: we can remember the 
dead tuple list and the block number it has scanned up to in phase 1; 
the indexes it has cleaned up in phase 2; and the tuples it has swept in 
phase 3.  Before exiting, vacuum can also merge the free space 
information it has collected into the FSM.
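
For concreteness, one possible shape for that saved state, sketched with
invented type and field names standing in for the backend's real ones:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNumber;    /* stand-in for the backend's type */

    typedef enum
    {
        VAC_SCAN_HEAP,               /* phase 1 */
        VAC_SWEEP_INDEX,             /* phase 2 */
        VAC_SWEEP_HEAP               /* phase 3 */
    } VacPhase;

    typedef struct VacuumResumeState
    {
        VacPhase    phase;            /* phase we were interrupted in */
        BlockNumber next_heap_block;  /* phase 1: first unscanned block */
        int         next_index;       /* phase 2: first uncleaned index */
        BlockNumber next_sweep_block; /* phase 3: first unswept block */
        long        num_dead_tuples;  /* dead-tuple TIDs stored after header */
    } VacuumResumeState;

    int main(void)
    {
        VacuumResumeState st = {VAC_SWEEP_INDEX, 123456, 2, 0, 500000};
        FILE   *f = fopen("vacuum_resume.state", "wb");

        if (f == NULL)
            return 1;
        fwrite(&st, sizeof(st), 1, f);   /* the TID array would follow */
        fclose(f);
        printf("saved: phase %d, next heap block %u\n",
               (int) st.phase, (unsigned) st.next_heap_block);
        return 0;
    }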


We are working on this feature now.  I will propose it later for 
discussion with you.


Best Regards
Galy Lee
--
NTT Open Source Software Center



Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

After staring at my previous notes for autovac scheduling, it has become
clear that the basics of it are not really going to work as specified.
So here is a more realistic plan:


[Snip Detailed Description]


How does this sound?


At first blush, I'm not sure I like this, as it doesn't directly attack 
the table starvation problem, and I think it could be a net loss of speed.


VACUUM is I/O bound; as such, just sending multiple vacuum commands at a 
DB isn't going to make things faster, since you are now going to have 
multiple processes reading from multiple tables at the same time.  I 
think in general this is a bad thing (unless we someday account for I/O 
made available from multiple tablespaces).  In general, the only time 
it's a good idea to have multiple vacuums running at the same time is 
when a big table is starving a small hot table and causing bloat.


I think we can extend the current autovacuum stats to add one more 
column that specifies is hot or something to that effect.  Then when 
the AV launcher sends a worker to a DB, it will first look for tables 
marked as hot and work on them.  While a worker is working on hot 
tables, the launcher need not send any additional workers to this 
database.  If the launcher notices that a worker is working on regular 
tables, it can send another worker, which will look for hot tables to 
work on; if that worker doesn't find any hot tables that need work, it 
exits, leaving the original worker to continue plodding along.


Thoughts?





Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Gregory Stark

In an ideal world I think you want precisely one vacuum process running per
tablespace on the assumption that each tablespace represents a distinct
physical device.

The cases where we currently find ourselves wanting more are where small
tables are due for vacuuming more frequently than the time it takes for a
large table to receive a single full pass.

If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle
of vacuums on smaller tables, then resume, that problem would go away. That
sounds too difficult though, but perhaps we could do something nearly as good.

One option that I've heard before is to have vacuum, after a single iteration
(ie, after it fills maintenance_work_mem and does the index cleanup and the
second heap pass), remember where it was and pick up from that point next
time.

If instead autovacuum could tell vacuum exactly how long to run for (or
calculate how many pages that represented based on cost_delay), then it could
calculate when it will next need to schedule another table in the same
tablespace and try to arrange for the vacuum of the large table to be done by
then.
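
(Roughly, under the cost model: ignoring the time the work itself takes,
a worker finishes one cost_limit-sized batch per cost_delay sleep.  With,
say, cost_limit = 200, cost_delay = 20ms, and about 10 cost units per
page read, that is ~20 pages per cycle, on the order of 1000 pages/s, so
a five-minute slot corresponds to roughly 300,000 pages.)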

Once there are no more-frequently-vacuumed small tables due to be
scheduled, it would start vacuum for the large table again, and it would
resume from where the first one left off.

This only works if the large tables really don't need to be vacuumed so often
that autovacuum can't keep up. Our current situation is that there is a size
at which this happens. But arranging to have only one vacuum process per
tablespace will only make that less likely to happen rather than more.

I think the changes to vacuum itself are pretty small to get it to remember
where it left off last time and start from mid-table. I'm not sure how easy it
would be to get autovacuum to juggle all these variables though.

Of course users may not create separate tablespaces for physical devices, or
they may set cost_delay so high you really do need more vacuum processes, etc.
So you probably still need a num_vacuum_daemons but the recommended setting
would be the same as the number of physical devices and autovacuum could try
to divide them equally between tablespaces which would amount to the same
thing.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Alvaro Herrera
Matthew T. O'Connor wrote:
 Alvaro Herrera wrote:
 After staring at my previous notes for autovac scheduling, it has become
 clear that the basics of it are not really going to work as specified.
 So here is a more realistic plan:
 
 [Snip Detailed Description]
 
 How does this sound?
 
 At first blush, I'm not sure I like this, as it doesn't directly attack 
 the table starvation problem, and I think it could be a net loss of speed.
 
 VACUUM is I/O bound; as such, just sending multiple vacuum commands at a 
 DB isn't going to make things faster, since you are now going to have 
 multiple processes reading from multiple tables at the same time.  I 
 think in general this is a bad thing (unless we someday account for I/O 
 made available from multiple tablespaces).

Yeah, I understand that.  However, I think that can be remedied by using
a reasonable autovacuum_vacuum_cost_delay setting, so that each worker
uses less than the total I/O available.  The main point of the proposal
is to allow multiple workers on a DB while also allowing multiple
databases to be processed in parallel.

 I think we can extend the current autovacuum stats to add one more 
 column that specifies is hot or something to that effect.  Then when 
 the AV launcher sends a worker to a DB, it will first look for tables 
 marked as hot and work on them.  While a worker is working on hot 
 tables, the launcher need not send any additional workers to this 
 database.  If the launcher notices that a worker is working on regular 
 tables, it can send another worker, which will look for hot tables to 
 work on; if that worker doesn't find any hot tables that need work, it 
 exits, leaving the original worker to continue plodding along.

How would you define what's a hot table?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Matthew T. O'Connor

Alvaro Herrera wrote:

Matthew T. O'Connor wrote:
At first blush, I'm not sure I like this, as it doesn't directly attack 
the table starvation problem, and I think it could be a net loss of speed.


VACUUM is I/O bound; as such, just sending multiple vacuum commands at a 
DB isn't going to make things faster, since you are now going to have 
multiple processes reading from multiple tables at the same time.  I 
think in general this is a bad thing (unless we someday account for I/O 
made available from multiple tablespaces).


Yeah, I understand that.  However, I think that can be remedied by using
a reasonable autovacuum_vacuum_cost_delay setting, so that each worker
uses less than the total I/O available.  The main point of the proposal
is to allow multiple workers on a DB while also allowing multiple
databases to be processed in parallel.


So you are telling people to choose an autovacuum_delay so high that 
they need to run multiple autovacuums at once to keep up?  I'm probably 
being too dramatic, but it seems inconsistent.


I think we can extend the current autovacuum stats to add one more 
column that specifies is hot or something to that effect.  Then when 
the AV launcher sends a worker to a DB, it will first look for tables 
marked as hot and work on them.  While a worker is working on hot 
tables, the launcher need not send any additional workers to this 
database.  If the launcher notices that a worker is working on regular 
tables, it can send another worker, which will look for hot tables to 
work on; if that worker doesn't find any hot tables that need work, it 
exits, leaving the original worker to continue plodding along.


How would you define what's a hot table?


I wasn't clear: I would have the admin specify it, and we can store it 
as an additional column in the pg_autovacuum settings table.  Or perhaps 
a table qualifies if it is below some size threshold and autovacuum sees 
that it needs to be vacuumed every time it checks, say, 10 times in a 
row, or something like that.
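
The automatic variant might look something like this sketch (names
invented; the size threshold and the run of 10 checks are the tunables):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct TableStats
    {
        long relpages;                  /* current size, in pages */
        int  consecutive_needs_vacuum;  /* checks in a row it needed vacuum */
    } TableStats;

    /* A small table that needed vacuum on every one of the last 10
     * checks gets treated as hot; both numbers are knobs. */
    static bool
    is_hot(const TableStats *t, long size_threshold)
    {
        return t->relpages < size_threshold &&
               t->consecutive_needs_vacuum >= 10;
    }

    int main(void)
    {
        TableStats queue_table = {300, 12};

        printf("hot? %s\n", is_hot(&queue_table, 1000) ? "yes" : "no");
        return 0;
    }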




Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Chris Browne
[EMAIL PROTECTED] (Alvaro Herrera) writes:
 When there is a single worker processing a database, it does not recheck
 pgstat data after each table.  This is to prevent a high-update-rate
 table from starving the vacuuming of other databases.

This case is important; I don't think that having multiple workers
fully alleviates the problem condition.

Pointedly, you need to have a way of picking up tables often enough to
avoid the XID rollover problem.  That may simply require that on some
periodic basis, a query is run to queue up tables that are getting
close to having an XID problem.
-- 
(reverse (concatenate 'string ofni.secnanifxunil @ enworbbc))
http://linuxfinances.info/info/finances.html
Rules of  the Evil Overlord #189. I  will never tell the  hero Yes I
was the one who  did it, but you'll never be able  to prove it to that
incompetent  old fool.  Chances  are, that  incompetent  old fool  is
standing behind the curtain.  http://www.eviloverlord.com/



Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Each worker, including the initial one, starts vacuuming tables
 according to pgstat data.  They recheck the pgstat data after finishing
 each table, so that a table vacuumed by another worker is not processed
 twice (maybe problematic: a table with high update rate may be vacuumed
 more than once.  Maybe this is a feature, not a bug).

How are you going to make that work without race conditions?  ISTM
practically guaranteed that all the workers will try to vacuum the same
table.

 Once autovacuum_naptime has passed, if the workers have not finished
 yet, the launcher wants to vacuum another database.

This seems a rather strange design, as it will encourage concentrations
of workers in a single database.  Wouldn't it be better to spread them
out among multiple databases by default?

regards, tom lane



Re: [HACKERS] autovacuum next steps

2007-02-16 Thread Ron Mayer
Alvaro Herrera wrote:
 
 Once autovacuum_naptime... autovacuum_max_workers...
 How does this sound?

The knobs exposed on autovacuum feel kinda tangential to
what I think I'd really want to control.

IMHO vacuum_mbytes_per_second would be quite a bit more
intuitive than cost_delay, naptime, etc.


ISTM I can relatively easily estimate and/or spec out how
much extra I/O bandwidth I have per device for vacuum;
and would pretty much want vacuum to be constantly
running on whichever table needs it the most, so
long as it can stay under that bandwidth limit.

Could vacuum have a tunable that says X MBytes/second
(perhaps per device) and have it measure how much I/O
it's actually doing and try to stay under that limit?
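
A self-contained sketch of that kind of pacing, using an assumed
vacuum_mbytes_per_second-style budget (names and numbers invented; the
reads are simulated, not backend code):

    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double
    elapsed_secs(const struct timeval *t0)
    {
        struct timeval now;

        gettimeofday(&now, NULL);
        return (now.tv_sec - t0->tv_sec) + (now.tv_usec - t0->tv_usec) / 1e6;
    }

    int main(void)
    {
        const double limit = 4.0 * 1024 * 1024;  /* 4 MB/s budget */
        double  bytes_done = 0;
        struct timeval start;

        gettimeofday(&start, NULL);
        for (int i = 0; i < 1000; i++)
        {
            double  ahead;

            bytes_done += 8192;                  /* one (simulated) page read */
            /* If we are ahead of the budget, sleep until the average
             * rate falls back to the limit. */
            ahead = bytes_done / limit - elapsed_secs(&start);
            if (ahead > 0)
                usleep((useconds_t) (ahead * 1e6));
        }
        printf("average rate: %.2f MB/s\n",
               bytes_done / elapsed_secs(&start) / (1024 * 1024));
        return 0;
    }

With a per-device budget, the same pacing loop would just consult the
limit configured for the tablespace the table lives in.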

For more fine-grained control a cron job could go
around setting different MBytes/second limits during
peak times vs idle times.


If people are concerned about CPU-intensive vacuums
instead of I/O-intensive ones (does anyone experience
that?), another tuneable vacuum_percent_of_cpu would
be more straightforward than delay_cost, cost_page_hit,
etc.  But I'd be a bit surprised if CPU-intensive
vacuums are common.
