Re: [HACKERS] autovacuum next steps, take 3
Tom Lane wrote:
> Alvaro Herrera [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> 1. Grab the AutovacSchedule LWLock exclusively.
>>> 2. Check to see if another worker is currently processing that
>>> table; if so drop LWLock and go to next list entry.
>>> 3. Recompute whether table needs vacuuming; if not, drop LWLock and
>>> go to next entry. (This test covers the case where someone vacuumed
>>> the table since you made your list.)
>>> 4. Put table OID into shared memory, drop LWLock, then vacuum table.
>>> 5. Clear current-table OID from shared memory, then repeat for next
>>> list entry.
>> The point I'm not very sure about is that this proposal means we need
>> to do I/O with the AutovacSchedule LWLock grabbed, to obtain
>> up-to-date stats.
> True. You could probably drop the lock while rechecking stats, at the
> cost of having to recheck for collision (repeat step 2) afterwards. Or
> recheck stats before you start, but if collisions are likely then
> that's a waste of time. But on the third hand, does it matter?
> Rechecking the stats should be much cheaper than a vacuum operation,
> so I'm not seeing that there's going to be a problem. It's not like
> there are going to be hundreds of workers contending for that lock...

Turns out that it does matter, because not only do we need to read pgstats, but we also need to fetch the pg_autovacuum and pg_class rows for the table again. So we must release the AutovacuumSchedule lock before trying to open pg_class etc.

Unless we are prepared to cache (keep a private copy of) the contents of those tuples between the first check (i.e. when building the initial table list) and the recheck? That is possible as well, but it gives me an uneasy feeling.

-- 
Alvaro Herrera                          http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
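The five-step claim protocol under discussion can be sketched as a toy model. This is an illustration, not PostgreSQL code: the AutovacSchedule LWLock and the shared-memory current-table array are stood in for by a Python lock and dictionary, and all names here are invented for the example.

```python
import threading

class AutovacSchedule:
    """Toy model of the shared-memory claim protocol (steps 1-5)."""
    def __init__(self):
        self.lock = threading.Lock()   # stands in for the AutovacSchedule LWLock
        self.current = {}              # worker id -> table OID being vacuumed

    def claim(self, worker, table, needs_vacuum):
        with self.lock:                        # step 1: grab the lock exclusively
            if table in self.current.values():
                return False                   # step 2: another worker has it; skip
            if not needs_vacuum(table):
                return False                   # step 3: vacuumed since list was built
            self.current[worker] = table       # step 4: advertise the claim
        return True

    def release(self, worker):
        with self.lock:
            self.current.pop(worker, None)     # step 5: clear current-table OID

def run_worker(sched, worker, todo, needs_vacuum, vacuumed):
    for table in todo:
        if sched.claim(worker, table, needs_vacuum):
            vacuumed.append((worker, table))   # "vacuum table" runs outside the lock
            sched.release(worker)
```

Note that the recheck in step 3 is what keeps a second worker with the same list from redoing work: the collision check alone (step 2) is not enough once the first worker has released its claim.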
Re: [HACKERS] autovacuum next steps, take 3
Alvaro Herrera wrote:
> worker to-do list
> It removes from its to-do list the tables being processed. Finally,
> it writes the list to disk.

I am worried about the worker to-do list in your proposal. I don't think the worker is the right place to maintain a vacuum task list; it would be better to maintain a unified vacuum task queue in autovacuum shared memory. Here are the basic ideas:

* Why is such a task queue needed?

- The launcher could schedule all vacuum tasks through such a queue. It provides a facility for scheduling tasks smartly in further autovacuum improvements.
- Such a task list could also be viewed easily from a system view. This could be implemented easily in 8.3 with the task queue.

* VACUUM task queue

The VACUUM tasks of the cluster are maintained in a unified, cluster-wide queue in the shared memory of autovacuum:

    global shared TaskInfo tasks[];

It can be viewed as:

    SELECT * FROM pg_autovacuum_tasks;
     dbid | relid | group | worker
    ------+-------+-------+--------
        2 | 20001 |     0 |   1001
        2 | 20002 |     0 |
        3 | 30001 |     0 |   1002

VACUUM tasks belonging to the same database might be divided into several groups, and one worker might be assigned to process one specific task group.

The task queue might be filled by a dedicated task-gathering worker, or it might be filled by an *external task gatherer*, which would allow an external program to implement a more sophisticated vacuum scheme. Based on previous discussion, it appears difficult to implement an all-purpose algorithm that satisfies the requirements of all applications, so it is better to allow users to develop their own vacuum strategies. A *user-defined external program* might fill the task queue and schedule tasks by its own strategy; the launcher would then be responsible for coordinating workers only. This pluggable-vacuum-strategy approach seems like a good solution.
* Status of workers

It is also convenient to let the user monitor the status of the vacuum workers through a system view. A snapshot of the workers can be viewed as:

    SELECT * FROM pg_autovacuum_workers;
     pid  | dbid | relid | group
    ------+------+-------+-------
     1001 |    2 | 20001 |     0
     1002 |    3 | 30001 |     0

Best Regards
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center
Re: [HACKERS] autovacuum next steps, take 3
Galy Lee [EMAIL PROTECTED] writes:
> I am worried about the worker to-do list in your proposal. I don't
> think the worker is the right place to maintain a vacuum task list;
> it would be better to maintain a unified vacuum task queue in
> autovacuum shared memory.

Shared memory is fixed-size.

			regards, tom lane
Re: [HACKERS] autovacuum next steps, take 3
Galy Lee wrote:
> Alvaro Herrera wrote:
>> worker to-do list
>> It removes from its to-do list the tables being processed. Finally,
>> it writes the list to disk.
> I am worried about the worker to-do list in your proposal. I don't
> think the worker is the right place to maintain a vacuum task list;
> it would be better to maintain a unified vacuum task queue in
> autovacuum shared memory.

Galy, thanks for your comments. I like the idea of having a global task queue, but sadly it doesn't work, for a simple reason: the launcher does not have enough information to build it. This is because we need access to catalogs in the database: pg_class and pg_autovacuum in the current code, plus the catalogs related to the maintenance window feature when we implement it in the (hopefully near) future.

Another point, though of less importance, is that we cannot keep such a task list in shared memory, because we aren't able to grow that memory after postmaster start. It is of lesser importance because we could keep the task list in plain files on disk; that is merely a SMOP. The functions exposing the task list to SQL queries would just need to read those files. It would be slower than shared memory, certainly, but I don't think it's a showstopper (given the amount of work VACUUM takes, anyway).

Not having access to the catalogs is a much more serious problem for the scheduling. One could think about dumping the catalogs to plain files readable by the launcher, but this is not very workable: how do you dump pg_class and keep it up to date all the time? You'd have to be writing that file pretty frequently, which doesn't sound like a very good idea.

Another idea I had was a third kind of autovacuum process, namely a schedule builder, which would connect to the database, read the catalogs, compute the needed vacuuming, write it to disk, and exit. This seems similar to your task-gathering worker. The launcher could then dispatch regular workers as appropriate.
Furthermore, the launcher could create a global schedule based on the combination of the schedules for all databases. I dismissed this idea because a schedule gets out of date very quickly as tables continue to be used by regular operation. A worker starting at t0 may find that a task list built at t0 - 5 min is no longer very relevant, so it needs to build a new task list anyway, which raises the question of why not just let the worker itself build its task list. Also, combining schedules is complicated, and you end up wanting to ask the DBA to give each database a priority, which is annoying.

So the idea I am currently playing with is to have workers determine the task list at start, by looking at both the catalogs and the task lists of other workers. I think this is the natural evolution of the other ideas -- the worker is just smarter to start with, and the whole thing is a lot simpler.

> The task queue might be filled by a dedicated task-gathering worker,
> or it might be filled by an *external task gatherer*.

The idea of an external task gatherer is an interesting one which I think would make sense to implement in the future. I think it is not very difficult to implement once the proposal we're currently discussing is done, because it just means modifying the part where each worker decides what needs to be done, and the times at which the launcher decides to start a worker on each database. The rest of the stuff I'm working on is just infrastructure to make it happen.

So I think your basic idea here is still workable, just not right now. Let's discuss it again as soon as I'm done with the current stuff.

-- 
Alvaro Herrera                          http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] autovacuum next steps, take 3
Alvaro Herrera wrote:
> worker to-do list
> When each worker starts, it determines which tables to process in the
> usual fashion: get pg_autovacuum and pgstat data and compute the
> equations. The worker then takes a snapshot of what's currently going
> on in the database, by storing worker PIDs, the corresponding table
> OID that's being currently worked on, and the to-do list for each
> worker. It removes from its to-do list the tables being processed.
> Finally, it writes the list to disk. The table list will be written
> to a file in PGDATA/vacuum/database-oid/todo.worker-pid. The file
> will consist of table OIDs, in the order in which they are going to
> be vacuumed. At this point, vacuuming can begin.
> Before processing each table, it scans the WorkerInfos to see if
> there's a new worker, in which case it reads its to-do list into
> memory. Then it again fetches the tables being processed by other
> workers in the same database, and for each other worker, removes from
> its own in-memory to-do list all those tables mentioned in the other
> lists that appear earlier than the current table being processed
> (inclusive). Then it picks the next non-removed table in the list.
> All of this must be done with the Autovacuum LWLock grabbed in
> exclusive mode, so that no other worker can pick the same table (no
> I/O takes place here, because the whole lists were saved in memory at
> the start.)

Sorry, I confused matters here by not clarifying on-disk to-do lists versus in-memory ones. When we write the to-do list to a file, that's the to-do list that other workers will see, and it will not change; when I say "remove a table from the to-do list", it will be removed from the to-do list in memory, but the file will not get rewritten.

Note that a worker will not remove from its list a table that's in the to-do list of another worker but not yet processed. It will only remove those tables that are currently being processed (i.e. they appear in the shared memory entry for that worker), and any tables that appear _before that one_ in that particular worker's file.

So this behaves very much like what Tom describes in an email downthread, not like what Matthew is thinking. In fact I'm thinking that the above is needlessly complex, and that Tom's proposal is simpler and achieves pretty much the same effect, so I'll have a look at evolving from that instead.

-- 
Alvaro Herrera                          http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
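The removal rule clarified above can be modeled in a few lines. This is an illustrative sketch with invented names: each other worker is represented by its (frozen) on-disk list plus the table it is currently processing, and only tables at or before that current table are treated as claimed.

```python
def prune_todo(my_todo, others):
    """Drop from my_todo every table that appears in another worker's
    on-disk list at or before (inclusive) the table that worker is
    currently processing.

    others: list of (todo_list, current_table) pairs, one per worker.
    Tables *later* in another worker's list are NOT removed: that
    worker may never reach them, so they stay available to us."""
    claimed = set()
    for todo, current in others:
        if current in todo:
            claimed.update(todo[: todo.index(current) + 1])
    return [t for t in my_todo if t not in claimed]
```

For example, if another worker's file lists tables [1, 2, 3, 4] and it is currently on table 2, only tables 1 and 2 are dropped from our list; 3 and 4 remain candidates.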
Re: [HACKERS] autovacuum next steps, take 3
Alvaro Herrera [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> 1. Grab the AutovacSchedule LWLock exclusively.
>> 2. Check to see if another worker is currently processing that
>> table; if so drop LWLock and go to next list entry.
>> 3. Recompute whether table needs vacuuming; if not, drop LWLock and
>> go to next entry. (This test covers the case where someone vacuumed
>> the table since you made your list.)
>> 4. Put table OID into shared memory, drop LWLock, then vacuum table.
>> 5. Clear current-table OID from shared memory, then repeat for next
>> list entry.
> The point I'm not very sure about is that this proposal means we need
> to do I/O with the AutovacSchedule LWLock grabbed, to obtain
> up-to-date stats.

True. You could probably drop the lock while rechecking stats, at the cost of having to recheck for collision (repeat step 2) afterwards. Or recheck stats before you start, but if collisions are likely then that's a waste of time. But on the third hand, does it matter? Rechecking the stats should be much cheaper than a vacuum operation, so I'm not seeing that there's going to be a problem. It's not like there are going to be hundreds of workers contending for that lock...

			regards, tom lane
Re: [HACKERS] autovacuum next steps, take 3
Hi, Alvaro

Alvaro Herrera wrote:
> keep such a task list in shared memory, because we aren't able to
> grow that memory after postmaster start.

We can use fixed-size shared memory to maintain such a queue. The maximum number of tasks is the number of tables, so the size of the queue can be the same as max_fsm_relations, which is usually larger than the number of tables and indexes in the cluster. That is sufficient to contain most vacuum tasks. Even if the queue overflows, the task gatherer scans the whole cluster every autovacuum_naptime, so the dropped tasks are picked up again quickly. We don't need to write anything to an external file, so there is no problem using fixed-size shared memory to maintain a global queue.

> Another idea I had was a third kind of autovacuum process, namely a
> schedule builder

If we have such a global queue, a task-gathering worker can connect to every database every naptime to gather tasks in time. The task-gathering worker won't build the schedule; the launcher or an external program is responsible for that activity. How to dispatch tasks to workers is just a scheduling problem; a good dispatching algorithm needs to ensure that each worker can finish its tasks on time, which might resolve the headache of the hot-table problem. But this is a further issue to be discussed after 8.3.

Best Regards
Galy Lee
lee.galy _at_ oss.ntt.co.jp
NTT Open Source Software Center
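A minimal sketch of the fixed-size queue idea, including the overflow behavior described above: tasks that do not fit are simply dropped, and the periodic gathering scan re-adds them on the next naptime. All names are hypothetical; a real implementation would live in shared memory behind a lock.

```python
from collections import deque

class FixedTaskQueue:
    """Sketch of a fixed-size vacuum task queue (capacity would be
    sized like max_fsm_relations in the proposal)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tasks = deque()

    def gather(self, needed_tables):
        """Called every naptime: enqueue each needed table not already
        queued.  Returns the tables dropped for lack of space; they
        will simply be offered again on the next scan."""
        dropped = []
        for t in needed_tables:
            if t in self.tasks:
                continue
            if len(self.tasks) < self.capacity:
                self.tasks.append(t)
            else:
                dropped.append(t)
        return dropped

    def take(self):
        """A worker pulls the next task, if any."""
        return self.tasks.popleft() if self.tasks else None
```

The overflow path is the crux of Tom's objection downthread: between scans, a dropped table is invisible to every worker, whereas with per-worker lists it would at least be on someone's list.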
Re: [HACKERS] autovacuum next steps, take 3
Galy Lee [EMAIL PROTECTED] writes:
> We can use fixed-size shared memory to maintain such a queue. The
> maximum number of tasks is the number of tables, so the size of the
> queue can be the same as max_fsm_relations, which is usually larger
> than the number of tables and indexes in the cluster.

The trouble with that analogy is that the system can still operate reasonably sanely when max_fsm_relations is exceeded (at least, the excess relations behave no worse than they did before we had FSM). If there are relations that autovacuum ignores indefinitely because they don't fit in a fixed-size work queue, that will be a big step backward from the prior behavior.

In any case, I still haven't seen a good case made for why a global work queue will provide better behavior than each worker keeping a local queue. The need for small hot tables to be visited more often than big tables suggests to me that a global queue will actually be counterproductive, because you'll have to contort the algorithm in some hard-to-understand way to get it to do that.

			regards, tom lane
Re: [HACKERS] autovacuum next steps, take 3
Tom Lane [EMAIL PROTECTED] wrote:
> In any case, I still haven't seen a good case made for why a global
> work queue will provide better behavior than each worker keeping a
> local queue. The need for small hot tables to be visited more often
> than big tables suggests to me that a global queue will actually be
> counterproductive, because you'll have to contort the algorithm in
> some hard-to-understand way to get it to do that.

If we have external vacuum schedulers, we need to be able to see and modify the contents of the work queue; that's why he suggested the shared work queue. I think the present autovacuum strategy is not enough in some heavily-used cases and needs more sophisticated schedulers, even if the optimization for hot tables is added. Also, the best vacuum strategies are highly system-dependent, so I don't think we can supply one monolithic strategy that fits all purposes. This was a proposal for the infrastructure for interaction between autovacuum and user-land vacuum schedulers.

Of course, we can supply a simple scheduler for not-so-high-load systems, but I need a kind of autovacuum that can be controlled from an external program that knows the user application well. We could use a completely separate autovacuum daemon, like contrib/pg_autovacuum in 8.0, but I think it is good for us to share some of the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [HACKERS] autovacuum next steps, take 3
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
> Tom Lane [EMAIL PROTECTED] wrote:
>> In any case, I still haven't seen a good case made why a global work
>> queue will provide better behavior than each worker keeping a local
>> queue.
> If we have external vacuum schedulers, we need to be able to see and
> modify the contents of the work queue.

Who said anything about external schedulers? I remind you that this is AUTOvacuum. If you want to implement manual scheduling you can still use plain 'ol vacuum commands.

			regards, tom lane
Re: [HACKERS] autovacuum next steps, take 3
Tom Lane [EMAIL PROTECTED] wrote:
> Who said anything about external schedulers? I remind you that this
> is AUTOvacuum. If you want to implement manual scheduling you can
> still use plain 'ol vacuum commands.

I think we can split autovacuum into two (or more?) functions: task gatherers and task workers. We don't have to stick with the monolithic style of the current autovacuum.

Galy said:
> The task queue might be filled by a dedicated task-gathering worker,
> or it might be filled by an *external task gatherer*.

Alvaro said:
> The idea of an external task gatherer is an interesting one which I
> think would make sense to implement in the future. I think it is not
> very difficult to implement once the proposal we're currently
> discussing is done

I said:
> We could use a completely separate autovacuum daemon, like
> contrib/pg_autovacuum in 8.0, but I think it is good for us to share
> some of the code between a built-in scheduler and external ones.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [HACKERS] autovacuum next steps, take 3
My initial reaction is that this looks good to me, but still a few comments below.

Alvaro Herrera wrote:
> Here is a low-level, very detailed description of the implementation
> of the autovacuum ideas we have so far.
> launcher's dealing with databases
> [snip]
> launcher and worker interactions
> [snip]
> worker to-do list
> When each worker starts, it determines which tables to process in the
> usual fashion: get pg_autovacuum and pgstat data and compute the
> equations. The worker then takes a snapshot of what's currently going
> on in the database, by storing worker PIDs, the corresponding table
> OID that's being currently worked, and the to-do list for each
> worker.

Does a new worker really care about the PID of other workers, or what table they are currently working on?

> It removes from its to-do list the tables being processed. Finally,
> it writes the list to disk.

Just to be clear: the new worker removes from its to-do list all the tables mentioned in the to-do lists of all the other workers?

> The table list will be written to a file in
> PGDATA/vacuum/database-oid/todo.worker-pid
> The file will consist of table OIDs, in the order in which they are
> going to be vacuumed. At this point, vacuuming can begin.

This all sounds good to me so far.

> Before processing each table, it scans the WorkerInfos to see if
> there's a new worker, in which case it reads its to-do list to
> memory.

It's not clear to me why a worker cares that there is a new worker, since the new worker is going to ignore all the tables that are already claimed by the other workers' to-do lists.

> Then it again fetches the tables being processed by other workers in
> the same database, and for each other worker, removes from its own
> in-memory to-do all those tables mentioned in the other lists that
> appear earlier than the current table being processed (inclusive).
> Then it picks the next non-removed table in the list. All of this
> must be done with the Autovacuum LWLock grabbed in exclusive mode, so
> that no other worker can pick the same table (no I/O takes place
> here, because the whole lists were saved in memory at the start.)

Again, it's not clear to me what this is gaining us. It seems to me that when a worker starts up and writes out its to-do list, it should just work through it; I don't see the value in workers constantly updating their to-do lists. Maybe I'm just missing something -- can you enlighten me?

> other things to consider
> This proposal doesn't deal with the hot tables stuff at all, but that
> is very easy to bolt on later: just change the first phase, where the
> initial to-do list is determined, to exclude cold tables. That way,
> the vacuuming will be fast. Determining what is a cold table is still
> an exercise left to the reader...

I think we can make this algorithm naturally favor small / hot tables with one small change: have workers remove tables that they have just vacuumed from their to-do lists and re-write their to-do lists to disk. Assuming the to-do lists are ordered by size, ascending, smaller tables will be made available for inspection by newer workers sooner rather than later.
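The small change suggested here can be sketched with two hypothetical helpers, under the stated assumptions: to-do lists are ordered ascending by table size, and the on-disk list is re-written (minus the finished table) after each vacuum.

```python
def build_todo(table_sizes, claimed):
    """Order a worker's to-do list ascending by table size, so small
    (often hot) tables surface first; skip tables already claimed on
    other workers' on-disk lists."""
    return sorted((oid for oid in table_sizes if oid not in claimed),
                  key=lambda oid: table_sizes[oid])

def after_vacuum(todo, finished_oid):
    """Re-write the list minus the just-vacuumed table, which releases
    that table for inspection by newer workers."""
    return [oid for oid in todo if oid != finished_oid]
```

Because the smallest tables are vacuumed and released first, a worker that starts shortly afterwards can already reconsider them, which is the "naturally favor small / hot tables" effect.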
Re: [HACKERS] autovacuum next steps, take 3
Matthew T. O'Connor matthew@zeut.net writes:
> Does a new worker really care about the PID of other workers, or what
> table they are currently working on?

As written, it needs the PIDs so it can read in the other workers' to-do lists (which are in files named by PID).

> It's not clear to me why a worker cares that there is a new worker,
> since the new worker is going to ignore all the tables that are
> already claimed by the other workers' to-do lists.

That seems wrong to me, since it means that new workers will ignore tables that are scheduled for processing by an existing worker, no matter how far in the future that schedule extends. As an example, suppose you have half a dozen large tables in need of vacuuming. The first worker in will queue them all up, and subsequent workers will do nothing useful, at least not till the first worker is done with the first table. Having the first worker update its to-do list file after each table allows the earlier tables to be exposed for reconsideration, but that's expensive, and it does nothing for the later tables.

I suggest that maybe we don't need exposed to-do lists at all. Rather, the workers could have internal to-do lists that are priority-sorted in some way, and expose only their current table OID in shared memory. Then the algorithm for processing each table in your list is:

1. Grab the AutovacSchedule LWLock exclusively.
2. Check to see if another worker is currently processing that table; if so drop the LWLock and go to the next list entry.
3. Recompute whether the table needs vacuuming; if not, drop the LWLock and go to the next entry. (This test covers the case where someone vacuumed the table since you made your list.)
4. Put the table OID into shared memory, drop the LWLock, then vacuum the table.
5. Clear the current-table OID from shared memory, then repeat for the next list entry.

This creates a "whoever gets to it first" behavior, rather than allowing workers to claim tables that they actually won't be able to service any time soon.

			regards, tom lane
Re: [HACKERS] autovacuum next steps, take 3
Tom Lane wrote:
> Matthew T. O'Connor matthew@zeut.net writes:
>> It's not clear to me why a worker cares that there is a new worker,
>> since the new worker is going to ignore all the tables that are
>> already claimed by the other workers' to-do lists.
> That seems wrong to me, since it means that new workers will ignore
> tables that are scheduled for processing by an existing worker, no
> matter how far in the future that schedule extends. As an example,
> suppose you have half a dozen large tables in need of vacuuming. The
> first worker in will queue them all up, and subsequent workers will
> do nothing useful, at least not till the first worker is done with
> the first table. Having the first worker update its to-do list file
> after each table allows the earlier tables to be exposed for
> reconsideration, but that's expensive, and it does nothing for later
> tables.

Well, the big problem we have is not that large tables are being starved, so this doesn't bother me too much; plus, there is only so much I/O, so one worker working sequentially through the big tables seems OK to me.

> I suggest that maybe we don't need exposed to-do lists at all.
> Rather, the workers could have internal to-do lists that are
> priority-sorted in some way, and expose only their current table OID
> in shared memory. Then the algorithm for processing each table in
> your list is:
> 1. Grab the AutovacSchedule LWLock exclusively.
> 2. Check to see if another worker is currently processing that table;
> if so drop the LWLock and go to the next list entry.
> 3. Recompute whether the table needs vacuuming; if not, drop the
> LWLock and go to the next entry. (This test covers the case where
> someone vacuumed the table since you made your list.)
> 4. Put the table OID into shared memory, drop the LWLock, then vacuum
> the table.
> 5. Clear the current-table OID from shared memory, then repeat for
> the next list entry.
> This creates a "whoever gets to it first" behavior, rather than
> allowing workers to claim tables that they actually won't be able to
> service any time soon.

Right, but you could wind up with as many workers working concurrently as you have tables in the database, which doesn't seem like a good idea either. One thing I like about the to-do list setup Alvaro had is that new workers will be assigned fewer tables to work on and hence exit sooner. We are going to fire off a new worker every autovac_naptime, so availability of new workers isn't going to be a problem.
Re: [HACKERS] autovacuum next steps, take 2
On Tue, Feb 27, 2007 at 01:26:00AM -0500, Matthew T. O'Connor wrote:
> Tom Lane wrote:
>> Matthew T. O'Connor matthew@zeut.net writes:
>>> I'm not sure what you are saying here; are you now saying that
>>> partial vacuum won't work for autovac? Or are you saying that
>>> saving state as Jim is describing above won't work?
>> I'm saying that I don't like the idea of trying to stop on a dime by
>> saving the current contents of vacuum's dead-TID array to disk with
>> the idea that we can trust those values 100% later. Saving the array
>> is expensive both in runtime and code complexity, and I don't
>> believe we can trust it later --- at least not without even more
>> expensive-and-complex measures, such as WAL-logging every such save
>> :-(
>> I'm for stopping only after completing an index-cleaning pass, at
>> the point where we empty the dead-TID array anyway. If you really
>> have to have stop on a dime, just kill -INT the process, accepting
>> that you will have to redo your heap scan since the last restart
>> point.
> OK, so if I understand correctly, a vacuum of a table with 10 indexes
> on it can be interrupted 10 times, once after each index-cleaning
> pass? That might have some value, especially breaking up the work
> required to vacuum a large table. Or am I still not getting it?

It'd stop after scanning the heap, scanning all the indexes, and then cleaning the heap. After that's done it no longer needs any of the dead-TID info; anytime before that, it does need that info, and Tom's objection is that trying to store that info is a bad idea.

The problem with this is that it typically takes a long time to go through a complete vacuum cycle: minutes at least, and preferably longer. Decreasing that cycle time will greatly increase the amount of I/O required for vacuuming a table with any indexes, because every time you cycle through you have to read the entire index. That's why I don't see it as being useful at all for getting autovac to work on hot tables -- if you actually got that cycle time low enough, you'd kill the system with all the extra index scanning.

--
Jim Nasby                   [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
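That I/O argument can be put into a back-of-the-envelope model. The formula and names below are illustrative assumptions, not vacuum's actual accounting: each index-cleaning cycle rescans every index in full, and the number of cycles is driven by how many dead TIDs fit in memory at once (maintenance_work_mem in reality).

```python
import math

def vacuum_io_pages(heap_pages, index_pages, dead_tids, tids_per_cycle):
    # Rough model: one full pass over the heap overall, plus one full
    # scan of all the indexes per index-cleaning cycle.  Shrinking the
    # per-cycle dead-TID budget multiplies the index term.
    cycles = max(1, math.ceil(dead_tids / tids_per_cycle))
    return heap_pages + cycles * index_pages
```

With 10,000 dead TIDs, a budget of 10,000 per cycle costs one index pass; a budget of 1,000 costs ten index passes, which is exactly the "kill the system with extra index scanning" effect.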
Re: [HACKERS] autovacuum next steps, take 2
On Tue, Feb 27, 2007 at 12:54:28AM -0500, Matthew T. O'Connor wrote:
> Jim C. Nasby wrote:
>> On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote:
>>> Here is a worst case example: a DB with 6 tables, all of which are
>>> highly active and will need to be vacuumed constantly. While this
>>> is totally hypothetical, it is how I envision things working
>>> (without the threshold).
>> I fail to see how a simple 6 table case is 'worst case'. It's common
>> to see hundreds of tables, and I've run across more than one
>> database with thousands of tables (think partitioning). In cases
>> like those it's certainly possible, perhaps even likely, that you
>> would get many daemons running in the database at one time just from
>> different tables suddenly needing vacuuming and appearing at a
>> higher point in the list than other tables. With 100 ~1GB tables
>> getting updates it certainly wouldn't be hard to end up with 10 of
>> those being vacuumed all at the same time.
> Yes, 6 tables is small; the worst-case part of the example was that
> all the tables would need to be vacuumed constantly. Most databases
> only have a few hot tables. Most tables only need to be vacuumed
> every once in a while.

It's not the hot tables that are the issue; it's how many large tables (hot or not) can come up for vacuuming in order. For example, if A-Z are all large tables (ie: a few GB), with A being the largest and Z the smallest, think about what happens here:

Round 1: A needs vacuuming. Daemon gets to it and starts working.
Round 2: B now needs vacuuming. It's slightly smaller than A, so daemon 2 gets to it.
Round 3: C now needs vacuuming. Daemon 3.
...
Round 26: Z now needs vacuuming. Daemon 26 picks it up.

You now have 26 daemons running in the database. Now, we can argue about how likely that scenario is, but I don't think it's relevant. What matters is that it *is* possible, and as long as that's the case you'd have to have some kind of limit. (While this simple 26-table example is definitely worst-case, if you've got hundreds of tables that are all multiple GB in size, I think it wouldn't be hard at all for you to end up with a dozen or more daemons all hammering away.)

>> I do like the idea since it should be easier to tune, but I think we
>> still need some limit on it. Perhaps as a first pass we could just
>> have a hard limit and log a message and/or set a flag any time we
>> hit it. That would hopefully allow us to get information about how
>> big a problem it really is. We could go one step further and say
>> that the last daemon that can start in a database will only vacuum
>> tables that can be done quickly; that's essentially what we've been
>> talking about, except the limit we've been discussing would be
>> hard-coded at 2.
> I'm confused, what limit would be set at 2? The number of concurrent
> workers?

I've never said that. The point I was making is that the proposal about limiting the 2nd daemon to only processing tables it can do in a short period of time is akin to setting a limit of only 2 daemons in a database at a time.

--
Jim Nasby                   [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
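The round-by-round scenario can be simulated in a few lines. This is a toy model under the stated assumptions: one large table comes due per round, each vacuum runs for a fixed number of rounds, and there is no cap on the number of daemons.

```python
def peak_daemons(rounds, vacuum_duration):
    """Each round, one more large table comes due and a fresh daemon
    picks it up; a vacuum takes `vacuum_duration` rounds to finish.
    Returns the peak number of concurrently running daemons."""
    running = []   # remaining rounds of work for each active daemon
    peak = 0
    for _ in range(rounds):
        running = [r - 1 for r in running if r > 1]  # daemons at 1 finish
        running.append(vacuum_duration)              # new table, new daemon
        peak = max(peak, len(running))
    return peak
```

If the vacuums are long relative to the arrival rate, the daemon count grows without bound up to the number of tables (the 26-daemon case); if each vacuum fits within two rounds, the count stays at 2.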
Re: [HACKERS] autovacuum next steps, take 2
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. BTW, another issue that I don't think we can ignore: we actually need to do this on a per-tablespace level, or at least have the ability to disable or somehow limit it. While it's not common, there are users that run a hundred or more databases in a single cluster; it would be ugly if we suddenly had 100 vacuums trying to run on the same set of drives concurrently. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. BTW, another issue that I don't think we can ignore: we actually need to do this on a per-tablespace level, or at least have the ability to disable or somehow limit it. While it's not common, there are users that run a hundred or more databases in a single cluster; it would be ugly if we suddenly had 100 vacuums trying to run on the same set of drives concurrently. I think we all agree that autovacuum needs to become tablespace aware at some point, but I think that is further down the line; we're having enough trouble figuring things out without that additional complication.
Re: [HACKERS] autovacuum next steps, take 2
On Tue, Feb 27, 2007 at 12:12:22PM -0500, Matthew T. O'Connor wrote: Jim C. Nasby wrote: On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. BTW, another issue that I don't think we can ignore: we actually need to do this on a per-tablespace level, or at least have the ability to disable or somehow limit it. While it's not common, there are users that run a hundred or more databases in a single cluster; it would be ugly if we suddenly had 100 vacuums trying to run on the same set of drives concurrently. I think we all agree that autovacuum needs to become tablespace aware at some point, but I think that is further down the line; we're having enough trouble figuring things out without that additional complication. Sure, we just need a way to disable the multiple autovac daemon stuff then. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
On Feb 26, 2007, at 12:49 PM, Alvaro Herrera wrote: Jim C. Nasby wrote: That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds. If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it. Ok, I think we may be actually getting somewhere. I propose to have two different algorithms for choosing the tables to work on. The worker would behave differently, depending on whether there is one or more workers on the database already or not. The first algorithm is the plain threshold equation stuff we use today. If a worker connects and determines that no other worker is in the database, it uses the plain worker mode. A worker in this mode would examine pgstats, determine what tables to vacuum/analyze, sort them by size (smaller to larger), and goes about its work. This kind of worker can take a long time to vacuum the whole database -- we don't impose any time limit or table size limit to what it can do. The second mode is the hot table worker mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses. 
Perhaps this has already been proposed, but maybe some combination of the following inputs could be used to determine which table most needs vacuuming:
- The proportion of tuples in a table that are dead (updated rows since last vacuum / estimated row count). This would favor hot tables naturally, regardless of size.
- The time since the last vacuum, so that larger tables are eventually vacuumed even if hot tables totally dominate.
Of course, tables that did not pass the minimum parameters specified in postgresql.conf would not even get considered. I'm being intentionally vague here on the exact algorithm, since you all have thought about this more than I have. One thing I like about the above is that it is independent of table size, and doesn't require anyone to determine which tables are hot manually. -Casey
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Saving the array is expensive both in runtime and code complexity, and I don't believe we can trust it later --- at least not without even more expensive-and-complex measures, such as WAL-logging every such save :-( I don't fully understand what you are worrying about. If we find that we cannot trust the saved file, or the file has been corrupted, then we can drop it and scan from the beginning of the heap block. If something like CLUSTER or PITR has changed the physical layout of the heap, then we can simply drop the files. Why do we need WAL for it? I don't see any point in it. Also, I don't think it is expensive. If it is combined with a maintenance window, stopping once a day, writing out 256MB/2 = 128MB can hardly be called expensive. Of course, this feature isn't for autovacuum to use every minute; autovacuum can use it once it has adopted maintenance windows.
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds. If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it. Ok, I think we may be actually getting somewhere. I propose to have two different algorithms for choosing the tables to work on. The worker would behave differently, depending on whether there is one or more workers on the database already or not. The first algorithm is the plain threshold equation stuff we use today. If a worker connects and determines that no other worker is in the database, it uses the plain worker mode. A worker in this mode would examine pgstats, determine what tables to vacuum/analyze, sort them by size (smaller to larger), and goes about its work. This kind of worker can take a long time to vacuum the whole database -- we don't impose any time limit or table size limit to what it can do. The second mode is the hot table worker mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses. 
-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
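The two-mode worker scheme Alvaro proposes above can be sketched roughly as follows (a hypothetical illustration only, not autovacuum code; the table data, the `max_hot_pages` cutoff, and the function name are invented for the example):

```python
# Sketch of the two-mode proposal: the first worker in a database takes
# every table that needs work, smallest first; a later worker ("hot table"
# mode) restricts itself to tables small enough to finish quickly.

def build_todo(tables, other_worker_present, max_hot_pages):
    """tables: list of (name, relpages, needs_vacuum) tuples."""
    todo = [t for t in tables if t[2]]          # plain threshold check
    todo.sort(key=lambda t: t[1])               # smaller tables first
    if other_worker_present:                    # hot-table mode:
        todo = [t for t in todo if t[1] <= max_hot_pages]  # skip big tables
    return [name for name, pages, needed in todo]

tables = [("queue", 40, True), ("log", 90000, True), ("users", 500, False)]
print(build_todo(tables, other_worker_present=False, max_hot_pages=600))
print(build_todo(tables, other_worker_present=True, max_hot_pages=600))
```

Because the plain worker sorts smallest-first, the hot-table worker's list is a prefix of what the plain worker would have chosen, which is why the two modes rarely collide.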
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Jim C. Nasby wrote: That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds. If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it. Ok, I think we may be actually getting somewhere. Me too. I propose to have two different algorithms for choosing the tables to work on. The worker would behave differently, depending on whether there is one or more workers on the database already or not. The first algorithm is the plain threshold equation stuff we use today. If a worker connects and determines that no other worker is in the database, it uses the plain worker mode. A worker in this mode would examine pgstats, determine what tables to vacuum/analyze, sort them by size (smaller to larger), and goes about its work. This kind of worker can take a long time to vacuum the whole database -- we don't impose any time limit or table size limit to what it can do. Right, I like this. The second mode is the hot table worker mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses. How can you determine what tables can be vacuumed within autovacuum_naptime? 
I agree that large tables should be excluded, but I don't know how we can do that calculation based on autovacuum_naptime. So at:
t=0*autovacuum_naptime: worker1 gets started on DBX
t=1*autovacuum_naptime: worker2 gets started on DBX
worker2 determines all tables that need to be vacuumed, worker2 excludes tables that are too big from its to-do list, worker2 gets started working, worker2 exits when it either:
a) Finishes its entire to-do list.
b) Catches up to worker1
I think the questions are:
1) What is the exact math you are planning on using to determine which tables are too big?
2) Do we want worker2 to exit when it catches worker1, or does the fact that we have excluded tables that are too big mean that we don't have to worry about this?
Re: [HACKERS] autovacuum next steps, take 2
Matthew T. O'Connor wrote: Alvaro Herrera wrote: The second mode is the hot table worker mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses. How can you determine what tables can be vacuumed within autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum. This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no). A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. So at:
t=0*autovacuum_naptime: worker1 gets started on DBX
t=1*autovacuum_naptime: worker2 gets started on DBX
worker2 determines all tables that need to be vacuumed, worker2 excludes tables that are too big from its to-do list, worker2 gets started working, worker2 exits when it either:
a) Finishes its entire to-do list.
b) Catches up to worker1
I think the questions are:
1) What is the exact math you are planning on using to determine which tables are too big?
2) Do we want worker2 to exit when it catches worker1, or does the fact that we have excluded tables that are too big mean that we don't have to worry about this? Right, I think the fact that we excluded big tables means that this won't be a problem most of the time, but we'll need some sort of protection anyway. I think this is easy to achieve -- store the table each worker is currently processing in shared memory, and have all workers check all other workers. If a plain worker finds that another worker is processing the table already, it skips that table and continues with the next one. A hot table worker instead exits right away (caught up). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
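The shared-memory collision rule described above amounts to a small decision table: each worker advertises its current table, a plain worker skips a claimed table, and a hot-table worker treats a claim as "caught up" and exits. A sketch (invented structures; the real shared state would hold table OIDs, not names):

```python
# What a worker does when it reaches the next table on its to-do list,
# given the set of tables other workers have advertised in shared memory.
def next_action(table, tables_in_progress, hot_mode):
    if table not in tables_in_progress:
        return "vacuum"                      # nobody else has claimed it
    return "exit" if hot_mode else "skip"    # hot worker: caught up; plain: move on

print(next_action("t1", {"t2"}, hot_mode=False))  # vacuum
print(next_action("t2", {"t2"}, hot_mode=False))  # skip
print(next_action("t2", {"t2"}, hot_mode=True))   # exit
```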
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Matthew T. O'Connor wrote: How can you determine what tables can be vacuumed within autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum. This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no). I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better? A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Thoughts?
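Plugging the default values into the formula quoted above reproduces the 600-page / 4800 kB figure (a back-of-envelope check of the estimate proposed in the thread, not anything autovacuum actually computes):

```python
# Estimated vacuum time (ms) ~= relpages * vacuum_cost_page_miss * vacuum_cost_delay,
# treating every page as a cache miss, per the proposal above.
vacuum_cost_delay = 10        # ms of sleep per cost unit charged, per the sketch
vacuum_cost_page_miss = 10    # cost units per page miss
autovacuum_naptime = 60_000   # 60 s in ms

max_pages = autovacuum_naptime // (vacuum_cost_page_miss * vacuum_cost_delay)
print(max_pages)       # 600 pages fit within one naptime
print(max_pages * 8)   # 4800 kB, at 8 kB per page
```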
Re: [HACKERS] autovacuum next steps, take 2
Matthew T. O'Connor wrote: Alvaro Herrera wrote: Matthew T. O'Connor wrote: How can you determine what tables can be vacuumed within autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum. This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no). I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better? Well, this would only happen the first time, until the plain worker processed the table; next time it would be picked up by the hot table worker. But yeah, we can build a better estimate using the same trick the planner uses: estimate tuple density as reltuples/relpages times the actual number of blocks on disk. A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
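The planner-style density trick Alvaro mentions can be illustrated in a few lines (a sketch under the assumptions in the message; the function name and figures are invented for the example):

```python
# Estimate current live tuples as (reltuples / relpages) from the last
# VACUUM/ANALYZE, scaled by the table's actual size on disk right now.
def estimated_live_tuples(reltuples, relpages, current_blocks):
    if relpages == 0:
        return float(reltuples)
    density = reltuples / relpages       # tuples per page at last stats update
    return density * current_blocks      # scaled to today's block count

# Matthew's bloated table: stats say 100 tuples spread over 1000 pages.
# The density-based estimate stays ~100 tuples despite the 1000 pages,
# so a reltuples-style check would still classify the table as small/hot.
print(estimated_live_tuples(reltuples=100, relpages=1000, current_blocks=1000))
```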
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Matthew T. O'Connor wrote: I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better? Well, this would only happen the first time, until the plain worker processed the table; next time it would be picked up by the hot table worker. But yeah, we can build a better estimate using the same trick the planner uses: estimate tuple density as reltuples/relpages times the actual number of blocks on disk. Well, even skipping it the first time isn't good; anything that further delays a hot table from getting vacuumed is bad. Also, I'm not sure it would just be the first time, since plain VACUUM isn't going to reclaim most of the space, just mark it as reusable. This is moot however if we use a good metric; I have no idea if what you suggest above would be good enough. A table over 600 pages does not sound like a good candidate for hot, so this seems more or less reasonable to me. On the other hand, maybe we shouldn't tie this to the vacuum cost delay stuff. I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. Simple is better, at least until proven otherwise.
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera [EMAIL PROTECTED] writes: Matthew T. O'Connor wrote: I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety). I'm not certain exactly what "key off" needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them. I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing. BTW, it's probably necessary to treat shared catalogs specially ... regards, tom lane
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Matthew T. O'Connor wrote: I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety). So far we are only talking about one parameter, the hot_table_size_threshold, which I agree would be a guess by an admin, but if we went in this direction, I would also advocate adding a column to the pg_autovacuum table that allows an admin to explicitly define a table as hot or not. Also I think each worker should be mostly independent, the only caveat being that (assuming each worker works in size order) if we catch up to an older worker (get to the table they are currently working on) we exit. Personally I think this is all we need, but others felt the additional threshold was needed. What do you think? Or what do you think might be better? I'm not certain exactly what "key off" needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them. My initial design didn't have any threshold at all, but others felt this would/could result in too many workers working concurrently in the same DB.
I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing. Maybe we never made that clear; I was always working on the assumption that two workers would never try to work on the same table at the same time. BTW, it's probably necessary to treat shared catalogs specially ... Certainly.
Re: [HACKERS] autovacuum next steps, take 2
On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Matthew T. O'Connor wrote: I'm not sure it's a good idea to tie this to the vacuum cost delay settings either, so let me ask you this: how is this better than just allowing the admin to set a new GUC variable like autovacuum_hot_table_size_threshold (or something shorter) which we can assign a decent default of say 8MB. Yeah, maybe that's better -- it's certainly simpler. I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The shared memory state could include the OID of the table each worker is currently working on, and we could keep the to-do list in some simple flat file for instance (since we don't care about crash safety). I'm not certain exactly what "key off" needs to mean; perhaps each worker should make its own to-do list and then discard items that are either in-progress or recently done by another worker when it gets to them. The real problem is trying to set that up in such a fashion that keeps hot tables frequently vacuumed; that means that the 2nd daemon in a database either needs to avoid large tables that will take 'a significant' length of time to vacuum, or you need to allow any number of daemons in each database (which sounds like a good way to thrash the machine). I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing. Isn't there a special lock acquired on a relation by vacuum? Can't we just check for that?
Seems much simpler than building out the ability for daemons to see what each other is doing (and that still wouldn't take manual vacuums into account). -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
On Mon, Feb 26, 2007 at 08:11:44PM -0300, Alvaro Herrera wrote: Matthew T. O'Connor wrote: Alvaro Herrera wrote: The second mode is the hot table worker mode, enabled when the worker detects that there's already a worker in the database. In this mode, the worker is limited to those tables that can be vacuumed in less than autovacuum_naptime, so large tables are not considered. Because of this, it'll generally not compete with the first mode above -- the tables in plain worker were sorted by size, so the small tables were among the first vacuumed by the plain worker. The estimated time to vacuum may be calculated according to autovacuum_vacuum_delay settings, assuming that all pages constitute cache misses. How can you determine what tables can be vacuumed within autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum. Need to take vacuum_cost_limit into account. The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: Alvaro Herrera wrote: Matthew T. O'Connor wrote: How can you determine what tables can be vacuumed within autovacuum_naptime? My assumption is that pg_class.relpages * vacuum_cost_page_miss * vacuum_cost_delay = time to vacuum. This is of course not the reality, because the delay is not how long it takes to fetch the pages. But it lets us have a value with which we can do something. With the default values, vacuum_cost_delay=10, vacuum_cost_page_miss=10, autovacuum_naptime=60s, we'll consider tables of under 600 pages, 4800 kB (should we include indexes here in the relpages count? My guess is no). I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while): now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better? The entire point of this is to ensure that the second daemon will only vacuum tables that it can finish very quickly. If you let a table bloat so it's too big, then you just can't vacuum it very frequently without risking all your other hot tables bloating because they're no longer getting vacuumed. The reality is that you can actually vacuum a pretty good-sized table in 60 seconds with typical cost-delay settings (ie: defaults except cost_delay set to 10). That means you can do 9 pages ~100 times a second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, that's still 27k pages per minute. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
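Jim's throughput figures can be checked with simple arithmetic (a hedged sketch: it assumes vacuum processes a batch of pages, then sleeps cost_delay ms, and ignores the actual I/O time; the "9 pages per round" batch size is taken from his message, not derived here):

```python
# Pages vacuumed per minute, given a fixed batch of pages between sleeps
# and a sleep frequency implied by vacuum_cost_delay.
def pages_per_minute(pages_per_round, sleeps_per_second):
    return pages_per_round * sleeps_per_second * 60

print(pages_per_minute(9, 100))  # cost_delay=10ms -> ~100 sleeps/s -> 54000
print(pages_per_minute(9, 50))   # cost_delay=20ms -> ~50 sleeps/s  -> 27000
```

Both results match the 54k and 27k pages/minute quoted in the message.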
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that I realized it's probably not the best idea. For example, I've seen clusters that are running dozens to hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby [EMAIL PROTECTED] writes: On Mon, Feb 26, 2007 at 09:22:42PM -0500, Tom Lane wrote: I'm not liking any of these very much, as they seem critically dependent on impossible-to-tune parameters. I think it'd be better to design this around having the first worker explicitly expose its state (list of tables to process, in order) and having subsequent workers key off that info. The real problem is trying to set that up in such a fashion that keeps hot tables frequently vacuumed; Certainly, but it's not clear where that behavior emerges from Alvaro's or Matthew's proposals, either. Are we assuming that no single worker instance will vacuum a given table more than once? (That's not a necessary assumption, certainly, but without it there are so many degrees of freedom that I'm not sure how it should act.) Given that assumption, the maximum vacuuming rate for any table is once per autovacuum_naptime, and most of the magic lies in the launcher's algorithm for deciding which databases to launch workers into. I'm inclined to propose an even simpler algorithm in which every worker acts alike; its behavior is 1. On startup, generate a to-do list of tables to process, sorted in priority order. 2. For each table in the list, if the table is still around and has not been vacuumed by someone else since you started (including the case of a vacuum-in-progress), then vacuum it. Detecting already vacuumed since you started is a bit tricky; you can't really rely on the stats collector since its info isn't very up-to-date. That's why I was thinking of exposing the to-do lists explicitly; comparing those with an advertised current-table would allow accurate determination of what had just gotten done. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
[ oh, I forgot to respond to this: ] Jim C. Nasby [EMAIL PROTECTED] writes: Isn't there a special lock acquired on a relation by vacuum? Can't we just check for that? I think you're thinking that ConditionalLockRelation solves the problem, but it does not, because it will fail if someone has taken a (quasi) exclusive lock unrelated to vacuuming. You don't want an application that frequently takes short-term ExclusiveLocks on a table to thereby cause autovacuum to frequently skip that table. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, and I don't see where that behavior will emerge from your proposal; whereas it's fairly easy to make it happen if non-first workers pay attention to what other workers are doing. FWIW, I've always considered this to be a very important and obvious issue, and I think I've neglected mentioning it (maybe I did too few times). But I think this is pretty easy to do: just have each worker advertise the current table it's working on in shared memory, and add a recheck loop to the table-pick algorithm (with appropriate grabs of the autovacuum LWLock), to make sure no one starts to vacuum the same table you're going to process at the same time. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
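The advertise-and-recheck scheme Alvaro describes can be sketched in miniature. The Python stand-ins below are illustrative only: a threading.Lock plays the role of the AutovacSchedule LWLock and a dict plays the per-worker shared-memory slots; `try_claim` and `release` are invented names.

```python
import threading

# Sketch only: a Lock stands in for the autovacuum LWLock and a dict
# stands in for the per-worker "currently vacuuming" shared-memory slots.
autovac_lock = threading.Lock()
current_table = {}   # worker id -> OID of the table it is vacuuming

def try_claim(worker_id, table_oid):
    """Advertise a table before vacuuming it; fail if a concurrent worker
    already advertised the same OID (the recheck loop's collision test)."""
    with autovac_lock:
        if table_oid in current_table.values():
            return False          # someone else is on it; caller skips ahead
        current_table[worker_id] = table_oid
        return True

def release(worker_id):
    # Clear the advertisement once the table is done.
    with autovac_lock:
        current_table.pop(worker_id, None)
```

A worker that fails `try_claim` simply moves on to the next entry in its to-do list, which is exactly the "drop lock and go to next entry" behavior from Tom's earlier numbered steps.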
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while); now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better? The entire point of this is to ensure that the second daemon will only vacuum tables that it can finish very quickly. If you let a table bloat so it's too big, then you just can't vacuum it very frequently without risking all your other hot tables bloating because they're no longer getting vacuumed. The reality is that you can actually vacuum a pretty good-sized table in 60 seconds with typical cost-delay settings (ie: defaults except cost_delay set to 10). That means you can do 9 pages ~100 times a second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, that's still 27k pages per minute. At the risk of sounding like a broken record, I still think the size limit threshold is unnecessary. Since all workers will be working on tables in size order, younger workers will typically catch older workers fairly quickly, since the tables will be either small or recently vacuumed and not in need of work. And since younger workers exit when they catch up to an older worker, there is some inherent stability in the number of workers. Here is a worst case example: a DB with 6 tables, all of which are highly active and will need to be vacuumed constantly. While this is totally hypothetical, it is how I envision things working (without the threshold).
table1: 10 rows
table2: 100 rows
table3: 1,000 rows
table4: 10,000 rows
table5: 100,000 rows
table6: 1,000,000 rows

time=0*naptime: No workers in the DB.
time=1*naptime: worker1 starts on table1.
time=2*naptime: worker1 has finished table1, table2 and table3 and is now working on table4; worker2 starts on table1.
time=3*naptime: worker1 is on table5, worker2 is working on table4, worker3 starts on table1.
time=4*naptime: worker1 is still on table5; worker2 has caught up to worker1 and exits; worker3 also catches up to worker1 since tables 2-4 didn't require vacuum at this time, so it exits; worker4 starts on table1.
time=5*naptime: worker1 is working on table6, worker4 is up to table4, worker5 starts on table1.
time=6*naptime: worker1 is working on table6; worker4 catches up to worker1 and exits; worker5 finds no additional work to be done and exits; worker6 starts at table1.
time=7*naptime: worker1 still working on table6, worker6 is up to table4, worker7 starts at table1.
time=8*naptime: worker1 still working on table6, worker6 still working on table4, worker7 working on table3, worker8 starting on table1.
time=9*naptime: worker1 still working on table6; worker6 working on table5; worker7 catches worker6 and exits; worker8 finds nothing more to do and exits; worker9 starts on table1.
time=10*naptime: worker1 still working on table6, worker9 working on table4, worker10 starts on table1.
---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera [EMAIL PROTECTED] writes: Tom Lane wrote: I think an absolute minimum requirement for a sane design is that no two workers ever try to vacuum the same table concurrently, FWIW, I've always considered this to be a very important and obvious issue, and I think I've neglected mentioning it (maybe I did too few times). But I think this is pretty easy to do, just have each worker advertise the current table it's working on in shared memory, and add a recheck loop on the table-pick algorithm (with appropriate grabs of the autovacuum lwlock), to make sure no one starts to vacuum the same table you're going to process, at the same time. Well, any of these proposals need that at the bottom level, to prevent race conditions. But I'd prefer a design that wasn't positively encouraging multiple workers to try to pick the same table concurrently. Not only is that wasteful, but it makes it harder to predict what is the real behavior that emerges after race conditions are backed off from. BTW, to what extent might this whole problem be simplified if we adopt chunk-at-a-time vacuuming (compare current discussion with Galy Lee)? If the unit of work has a reasonable upper bound regardless of table size, maybe the problem of big tables starving small ones goes away. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: The real problem is trying to set that up in such a fashion that keeps hot tables frequently vacuumed; Are we assuming that no single worker instance will vacuum a given table more than once? (That's not a necessary assumption, certainly, but without it there are so many degrees of freedom that I'm not sure how it should act.) Given that assumption, the maximum vacuuming rate for any table is once per autovacuum_naptime, and most of the magic lies in the launcher's algorithm for deciding which databases to launch workers into. Yes, I have been working under the assumption that a worker goes through the list of tables once and exits, and yes, the maximum vacuuming rate for any table would be once per autovacuum_naptime. We can lower the default if necessary; as far as I'm concerned it's (or should be) fairly cheap to fire off a worker and have it find that there isn't anything to do and exit. I'm inclined to propose an even simpler algorithm in which every worker acts alike; its behavior is 1. On startup, generate a to-do list of tables to process, sorted in priority order. 2. For each table in the list, if the table is still around and has not been vacuumed by someone else since you started (including the case of a vacuum-in-progress), then vacuum it. That is what I'm proposing, except for one difference: when you catch up to an older worker, exit. This has the benefit of reducing the number of workers concurrently working on big tables, which I think is a good thing. Detecting "already vacuumed since you started" is a bit tricky; you can't really rely on the stats collector since its info isn't very up-to-date. That's why I was thinking of exposing the to-do lists explicitly; comparing those with an advertised current-table would allow accurate determination of what had just gotten done.
Sounds good, but I have very little insight into how we would implement "already vacuumed since you started" or "have I caught up to another worker". ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
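One illustrative way the per-worker loop being debated could behave, purely as a sketch: `older_worker_position` and `already_vacuumed_since` are hypothetical hooks standing in for the advertised to-do lists and current-table comparisons discussed above, and the early-exit rule is Matthew's variant of Tom's two-step algorithm.

```python
# Hypothetical hooks: older_worker_position() returns the to-do-list index
# an older worker has advertised (None if we are alone), and
# already_vacuumed_since(t) answers "was t vacuumed after I built my list?".
def worker_main(todo, older_worker_position, already_vacuumed_since, vacuum):
    for i, table in enumerate(todo):
        pos = older_worker_position()
        if pos is not None and i >= pos:
            return "caught up, exiting"   # Matthew's early-exit rule
        if not already_vacuumed_since(table):
            vacuum(table)
    return "list complete"
```

Under Tom's original proposal the early-exit check would simply be dropped and the worker would walk its whole list, skipping anything already handled.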
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: BTW, to what extent might this whole problem be simplified if we adopt chunk-at-a-time vacuuming (compare current discussion with Galy Lee)? If the unit of work has a reasonable upper bound regardless of table size, maybe the problem of big tables starving small ones goes away. So if we adopted chunk-at-a-time, then perhaps each worker processes the list of tables in OID order (or some unique and stable order) and does one chunk per table that needs vacuuming. This way an equal amount of bandwidth is given to all tables. That does sound simpler. Is chunk-at-a-time a realistic option for 8.3? Matt ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
Matthew T. O'Connor matthew@zeut.net writes: Tom Lane wrote: I'm inclined to propose an even simpler algorithm in which every worker acts alike; That is what I'm proposing except for one difference, when you catch up to an older worker, exit. No, that's a bad idea, because it means that any large table starves even-larger tables. (Note: in all this I assume we're all using size as a shorthand for some sort of priority metric that considers number of dirty tuples not only size. We don't want every worker insisting on passing over every small read-only table every time, for instance.) regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] autovacuum next steps, take 2
Matthew T. O'Connor matthew@zeut.net writes: That does sound simpler. Is chunk-at-a-time a realistic option for 8.3? It seems fairly trivial to me to have a scheme where you do one fill-workmem-and-scan-indexes cycle per invocation, and store the next-heap-page-to-scan in some handy place (new pg_class column updated along with relpages/reltuples, likely). Galy is off in left field with some far more complex ideas :-( but I don't see that there's all that much needed to support this behavior ... especially if we don't expose it to the SQL level but only support it for autovac's use. Then we're not making any big commitment to support the behavior forever. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
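A toy model of that fill-workmem-and-scan-indexes cycle, hedged: this is not the real implementation. The `next_page` key stands in for the hypothetical new pg_class column, `dead_by_page` is a stand-in for what a heap scan would find, and the index-cleaning pass is only indicated by a comment.

```python
def vacuum_one_chunk(table, max_dead_tids):
    """One invocation: scan the heap from the saved restart point until the
    dead-TID array fills or the heap ends, then (notionally) clean indexes
    and advance the restart point. Returns True when the table is done."""
    start = table.get("next_page", 0)
    dead, page = [], start
    while page < table["relpages"] and len(dead) < max_dead_tids:
        dead.extend(table["dead_by_page"].get(page, []))
        page += 1
    # The index-cleaning pass for `dead` would happen here; afterwards the
    # scanned pages are clean and the restart point moves forward.
    for p in range(start, page):
        table["dead_by_page"].pop(p, None)
    finished = page >= table["relpages"]
    table["next_page"] = 0 if finished else page
    return finished
```

The point of the design is visible in the state: everything vacuum needs to resume later is the single saved page number, emptied dead-TID array and all, which is why no dead-TID state ever has to be written to disk.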
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: Tom Lane wrote: I'm inclined to propose an even simpler algorithm in which every worker acts alike; That is what I'm proposing except for one difference, when you catch up to an older worker, exit. No, that's a bad idea, because it means that any large table starves even-larger tables. True, but the assumption I'm making is that there is a finite amount of bandwidth available, and more concurrent activity will have a net negative effect on the time it takes to vacuum all tables. I'm willing to pay that price to prevent small hot tables from getting starved, but less willing to pay the same price for large tables, where the percentage of bloat will be much smaller. (Note: in all this I assume we're all using size as a shorthand for some sort of priority metric that considers the number of dirty tuples, not only size. We don't want every worker insisting on passing over every small read-only table every time, for instance.) I was using size to mean reltuples. The whole concept of sorting by size was designed to ensure that smaller (more susceptible to bloat) tables got priority. It might be useful for workers to sort their to-do lists by some other metric, but I don't have a clear vision of what that might be. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: That does sound simpler. Is chunk-at-a-time a realistic option for 8.3? It seems fairly trivial to me to have a scheme where you do one fill-workmem-and-scan-indexes cycle per invocation, and store the next-heap-page-to-scan in some handy place (new pg_class column updated along with relpages/reltuples, likely). Galy is off in left field with some far more complex ideas :-( but I don't see that there's all that much needed to support this behavior ... especially if we don't expose it to the SQL level but only support it for autovac's use. Then we're not making any big commitment to support the behavior forever. Well, if we can make it happen soon, it might be the best thing for autovacuum. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] autovacuum next steps, take 2
On Tue, Feb 27, 2007 at 12:00:41AM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: The advantage to keying this to autovac_naptime is that it means we don't need another GUC, but after I suggested that before I realized that's probably not the best idea. For example, I've seen clusters that are running dozens-hundreds of databases; in that environment you really need to turn naptime way down (to like a second). In that case you wouldn't want to key to naptime. Actually, I've been thinking that it would be a good idea to change the semantics of autovacuum_naptime so that it means the average time to start a worker in any given database. That way, the time between autovac runs is not dependent on the number of databases you have. Hrm... how would that work? BTW, another thought is to only sleep if you've scanned through every database and found nothing to do. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] autovacuum next steps, take 2
On Mon, Feb 26, 2007 at 10:48:49PM -0500, Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: That does sound simpler. Is chunk-at-a-time a realistic option for 8.3? It seems fairly trivial to me to have a scheme where you do one fill-workmem-and-scan-indexes cycle per invocation, and store the next-heap-page-to-scan in some handy place (new pg_class column updated along with relpages/reltuples, likely). Galy is off in left field with some far more complex ideas :-( but I don't see that there's all that much needed to support this behavior ... especially if we don't expose it to the SQL level but only support it for autovac's use. Then we're not making any big commitment to support the behavior forever. The problem I see there is that the case we're trying to fix is tables that need to be vacuumed every few minutes. As I posted elsewhere, it's reasonable to assume a vacuum rate of ~1000 pages/second for a small table that's going to be in memory (assuming that vacuum dirties every page). That means that you can only dirty about 60k pages per cycle, which seems way too small to be practical unless we come up with a way to avoid scanning the indexes on every cycle. The proposal to save enough state to be able to resume a vacuum at pretty much any point in its cycle might work; we'd have to benchmark it. With the default maintenance_work_mem of 128M it would mean writing out 64M of state every minute on average, which is likely to take several seconds to fsync (though maybe we wouldn't need to fsync it...) -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby [EMAIL PROTECTED] writes: The proposal to save enough state to be able to resume a vacuum at pretty much any point in its cycle might work; we'd have to benchmark it. With the default maintenance_work_mem of 128M it would mean writing out 64M of state every minute on average, which is likely to take several seconds to fsync (though maybe we wouldn't need to fsync it...) Which is exactly why we needn't bother benchmarking it. Even if it weren't complex and unsafe, it will be a net loss when you consider the fact that it adds I/O instead of removing it. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] autovacuum next steps, take 2
On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote: Jim C. Nasby wrote: On Mon, Feb 26, 2007 at 06:23:22PM -0500, Matthew T. O'Connor wrote: I'm not sure how pg_class.relpages is maintained, but what happens to a bloated table? For example, a 100 row table that is constantly updated and hasn't been vacuumed in a while (say the admin disabled autovacuum for a while); now that small 100 row table has 1000 pages in it, most of which are just bloat. Will we miss this table? Perhaps basing this on reltuples would be better? The entire point of this is to ensure that the second daemon will only vacuum tables that it can finish very quickly. If you let a table bloat so it's too big, then you just can't vacuum it very frequently without risking all your other hot tables bloating because they're no longer getting vacuumed. The reality is that you can actually vacuum a pretty good-sized table in 60 seconds with typical cost-delay settings (ie: defaults except cost_delay set to 10). That means you can do 9 pages ~100 times a second, or 54k pages a minute. Even with a vacuum_cost_delay of 20, that's still 27k pages per minute. At the risk of sounding like a broken record, I still think the size limit threshold is unnecessary. Since all workers will be working on tables in size order, younger workers will typically catch older workers fairly quickly, since the tables will be either small or recently vacuumed and not in need of work. And since younger workers exit when they catch up to an older worker, there is some inherent stability in the number of workers. Here is a worst case example: a DB with 6 tables, all of which are highly active and will need to be vacuumed constantly. While this is totally hypothetical, it is how I envision things working (without the threshold). I fail to see how a simple 6 table case is 'worst case'. It's common to see hundreds of tables, and I've run across more than one database with thousands of tables (think partitioning).
In cases like those it's certainly possible, perhaps even likely that you would get many daemons running in the database at one time just from different tables suddenly needing vacuuming and appearing at a higher point in the list than other tables. With 100 ~1G tables getting updates it certainly wouldn't be hard to end up with 10 of those being vacuumed all at the same time. I do like the idea since it should be easier to tune, but I think we still need some limit on it. Perhaps as a first-pass we could just have a hard limit and log a message and/or set a flag any time we hit it. That would hopefully allow us to get information about how big a problem it really is. We could go one step further and say that the last daemon that can start in a database will only vacuum tables that can be done quickly; that's essentially what we've been talking about, except the limit we've been discussing would be hard-coded at 2. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] autovacuum next steps, take 2
On Tue, Feb 27, 2007 at 12:37:42AM -0500, Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: The proposal to save enough state to be able to resume a vacuum at pretty much any point in its cycle might work; we'd have to benchmark it. With the default maintenance_work_mem of 128M it would mean writing out 64M of state every minute on average, which is likely to take several seconds to fsync (though maybe we wouldn't need to fsync it...) Which is exactly why we needn't bother benchmarking it. Even if it weren't complex and unsafe, it will be a net loss when you consider the fact that it adds I/O instead of removing it. Well, it depends on how often you're doing that. Adding extra IO at the end of 4 hours of vacuuming isn't going to make any real difference, but once a minute... Looks like partial vacuum won't help this problem. :( -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Mon, Feb 26, 2007 at 10:18:36PM -0500, Matthew T. O'Connor wrote: Jim C. Nasby wrote: Here is a worst case example: a DB with 6 tables, all of which are highly active and will need to be vacuumed constantly. While this is totally hypothetical, it is how I envision things working (without the threshold). I fail to see how a simple 6 table case is 'worst case'. It's common to see hundreds of tables, and I've run across more than one database with thousands of tables (think partitioning). In cases like those it's certainly possible, perhaps even likely, that you would get many daemons running in the database at one time just from different tables suddenly needing vacuuming and appearing at a higher point in the list than other tables. With 100 ~1G tables getting updates it certainly wouldn't be hard to end up with 10 of those being vacuumed all at the same time. Yes, 6 tables is small; the worst-case part of the example was that all the tables would need to be vacuumed constantly. Most databases only have a few hot tables. Most tables only need to be vacuumed every once in a while. I do like the idea since it should be easier to tune, but I think we still need some limit on it. Perhaps as a first pass we could just have a hard limit and log a message and/or set a flag any time we hit it. That would hopefully allow us to get information about how big a problem it really is. We could go one step further and say that the last daemon that can start in a database will only vacuum tables that can be done quickly; that's essentially what we've been talking about, except the limit we've been discussing would be hard-coded at 2. I'm confused; what limit would be set at 2? The number of concurrent workers? I've never said that. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Jim C. Nasby [EMAIL PROTECTED] writes: The proposal to save enough state to be able to resume a vacuum at pretty much any point in it's cycle might work; we'd have to benchmark it. With the default maintenance_work_mem of 128M it would mean writing out 64M of state every minute on average, which is likely to take several seconds to fsync (though, maybe we wouldn't need to fsync it...) Which is exactly why we needn't bother benchmarking it. Even if it weren't complex and unsafe, it will be a net loss when you consider the fact that it adds I/O instead of removing it. I'm not sure what you are saying here, are you now saying that partial vacuum won't work for autovac? Or are you saying that saving state as Jim is describing above won't work? ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] autovacuum next steps, take 2
Matthew T. O'Connor matthew@zeut.net writes: I'm not sure what you are saying here; are you now saying that partial vacuum won't work for autovac? Or are you saying that saving state as Jim is describing above won't work? I'm saying that I don't like the idea of trying to stop on a dime by saving the current contents of vacuum's dead-TID array to disk with the idea that we can trust those values 100% later. Saving the array is expensive both in runtime and code complexity, and I don't believe we can trust it later --- at least not without even more expensive-and-complex measures, such as WAL-logging every such save :-( I'm for stopping only after completing an index-cleaning pass, at the point where we empty the dead-TID array anyway. If you really have to have stop on a dime, just kill -INT the process, accepting that you will have to redo your heap scan since the last restart point. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] autovacuum next steps, take 2
Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: I'm not sure what you are saying here; are you now saying that partial vacuum won't work for autovac? Or are you saying that saving state as Jim is describing above won't work? I'm saying that I don't like the idea of trying to stop on a dime by saving the current contents of vacuum's dead-TID array to disk with the idea that we can trust those values 100% later. Saving the array is expensive both in runtime and code complexity, and I don't believe we can trust it later --- at least not without even more expensive-and-complex measures, such as WAL-logging every such save :-( I'm for stopping only after completing an index-cleaning pass, at the point where we empty the dead-TID array anyway. If you really have to have stop on a dime, just kill -INT the process, accepting that you will have to redo your heap scan since the last restart point. OK, so if I understand correctly, a vacuum of a table with 10 indexes on it can be interrupted 10 times, once after each index-cleaning pass? That might have some value, especially breaking up the work required to vacuum a large table. Or am I still not getting it? ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote: I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly since small tables will vacuum fairly quickly. The reason that won't necessarily happen is because you can get large tables popping up as needing vacuuming at any time. Right. We know that a table that needs frequent vacuum necessarily has to be small -- so maybe have the second worker exit when it catches up with the first, or when the next table is above 1 GB, whichever happens first. That way, only the first worker can be processing the huge tables. The problem with this is that if one of your hot tables grows a bit larger than 1 GB, you suddenly have a change in autovacuuming behavior, for no really good reason. And while your second worker is processing the tables in the hundreds-MB range, your high-update 2 MB tables are neglected :-( -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] autovacuum next steps, take 2
On Fri, Feb 23, 2007 at 01:22:17PM -0300, Alvaro Herrera wrote: Jim C. Nasby wrote: On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote: I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly since small tables will vacuum fairly quickly. The reason that won't necessarily happen is because you can get large tables popping up as needing vacuuming at any time. Right. We know that a table that needs frequent vacuum necessarily has to be small -- so maybe have the second worker exit when it catches up with the first, or when the next table is above 1 GB, whichever happens first. That way, only the first worker can be processing the huge tables. The problem with this is that if one of your hot tables grows a bit larger than 1 GB, you suddenly have a change in autovacuuming behavior, for no really good reason. And while your second worker is processing the tables in the hundreds-MB range, your high-update 2 MB tables are neglected :-( That's why I'm thinking it would be best to keep the maximum size of stuff for the second worker small. It probably also makes sense to tie it to time and not size, since the key factor is that you want it to hit the high-update tables every X number of seconds. If we wanted to get fancy, we could factor in how far over the vacuum threshold a table is, so even if the table is on the larger size, if it's way over the threshold the second vacuum will hit it. You know, maybe the best way to handle this is to force both vacuums to exit after a certain amount of time, probably with a longer time limit for the first vacuum in a database. That would mean that after processing a large table for 10 minutes, the first vacuum would exit/re-evaluate what work needs to be done.
That would mean medium-sized tables wouldn't get completely starved. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
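The time-limit idea above can be sketched as a worker loop that abandons its stale to-do list once a deadline passes. This is an illustrative Python sketch, not autovacuum source; all names here are hypothetical.

```python
import time

def run_worker(build_todo_list, vacuum_table, time_limit_secs=600):
    # A worker that re-evaluates its to-do list once a time limit
    # expires, so that medium-sized tables are not starved behind one
    # huge table. build_todo_list returns the tables currently over
    # their vacuum threshold; vacuum_table processes one table.
    deadline = time.monotonic() + time_limit_secs
    todo = build_todo_list()
    done = []
    for table in todo:
        if time.monotonic() >= deadline:
            # Exit / re-evaluate instead of plodding through a stale list.
            break
        vacuum_table(table)
        done.append(table)
    return done
```

With a longer limit for the first worker in a database and a shorter one for later workers, as suggested above, one large table can no longer pin a worker down indefinitely.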
Re: [HACKERS] autovacuum next steps, take 2
vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing. It is easy to determine how much IO you have, and what you can spare. The tricky part is what metric to use. IMHO, IO per second would be good. In a typical DB scenario that is the IO bottleneck, not the MB/s. Andreas
Re: [HACKERS] autovacuum next steps, take 2
On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote: My Proposal: If we require admins to identify hot tables, then: 1) Launcher fires off a worker1 into database X. 2) worker1 deals with hot tables first, then regular tables. 3) Launcher continues to launch workers to DB X every autovac naptime. 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as worker1 did above. If worker1 is still working in DB X then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits. Rather than requiring people to manually identify hot tables, what if we just prioritize based on table size? So if a second autovac process hits a specific database, it would find the smallest table in need of vacuuming that it should be able to complete before the next naptime and vacuum that. It could even continue picking tables until it can't find one that it could finish within the naptime. Granted, it would have to make some assumptions about how many pages it would dirty. ISTM that's a lot easier than forcing admins to mark specific tables. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote: My Proposal: If we require admins to identify hot tables, then: 1) Launcher fires off a worker1 into database X. 2) worker1 deals with hot tables first, then regular tables. 3) Launcher continues to launch workers to DB X every autovac naptime. 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as worker1 did above. If worker1 is still working in DB X then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits. Rather than requiring people to manually identify hot tables, what if we just prioritize based on table size? So if a second autovac process hits a specific database, it would find the smallest table in need of vacuuming that it should be able to complete before the next naptime and vacuum that. It could even continue picking tables until it can't find one that it could finish within the naptime. Granted, it would have to make some assumptions about how many pages it would dirty. ISTM that's a lot easier than forcing admins to mark specific tables. So the heuristic would be: * Launcher fires off workers into a database at a given interval (perhaps configurable?) * Each worker works on tables in size order. * If a worker ever catches up to an older worker, then the younger worker exits. This sounds simple and workable to me; perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active. Thoughts?
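The three-bullet heuristic above can be sketched as follows (an illustrative Python sketch, not autovacuum source; names are hypothetical): tables are visited smallest-first, and a younger worker exits as soon as it reaches the table an older worker is still busy on.

```python
def plan_worker_run(tables_smallest_first, older_worker_current_table):
    # Visit tables in size order; stop the moment we catch up to the
    # table the older worker is still processing. Pass None for
    # older_worker_current_table if we are alone in the database.
    processed = []
    for table in tables_smallest_first:
        if table == older_worker_current_table:
            break  # caught up to the older worker: exit
        processed.append(table)
    return processed
```

The catch-up rule is what keeps the worker count bounded in the common case: a younger worker chews through the small tables quickly and then disappears.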
Re: [HACKERS] autovacuum next steps, take 2
On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote: Jim C. Nasby wrote: On Wed, Feb 21, 2007 at 05:40:53PM -0500, Matthew T. O'Connor wrote: My Proposal: If we require admins to identify hot tables, then: 1) Launcher fires off a worker1 into database X. 2) worker1 deals with hot tables first, then regular tables. 3) Launcher continues to launch workers to DB X every autovac naptime. 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as worker1 did above. If worker1 is still working in DB X then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits. Rather than requiring people to manually identify hot tables, what if we just prioritize based on table size? So if a second autovac process hits a specific database, it would find the smallest table in need of vacuuming that it should be able to complete before the next naptime and vacuum that. It could even continue picking tables until it can't find one that it could finish within the naptime. Granted, it would have to make some assumptions about how many pages it would dirty. ISTM that's a lot easier than forcing admins to mark specific tables. So the heuristic would be: * Launcher fires off workers into a database at a given interval (perhaps configurable?) * Each worker works on tables in size order. * If a worker ever catches up to an older worker, then the younger worker exits. This sounds simple and workable to me; perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active. That would likely result in a number of workers running in one database, unless you limited how many workers per database. And if you did that, you wouldn't be addressing the frequently updated table problem.
A second vacuum in a database *must* exit after a fairly short time so that we can go back in and vacuum the important tables again (well, or the 2nd vacuum has to periodically re-evaluate what tables need to be vacuumed). -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
On Thu, Feb 22, 2007 at 09:35:45AM +0100, Zeugswetter Andreas ADI SD wrote: vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing. It is easy to determine how much IO you have, and what you can spare. The tricky part is what metric to use. IMHO, IO per second would be good. In a typical DB scenario that is the IO bottleneck, not the MB/s. Well, right now they're one and the same... but yeah, IO/sec probably does make more sense. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing. It is easy to determine how much IO you have, and what you can spare. The tricky part is what metric to use. IMHO, IO per second would be good. In a typical DB scenario that is the IO bottleneck, not the MB/s. Well, right now they're one and the same... but yeah, IO/sec probably does make more sense. Hopefully not :-) Else you have no readahead. And that is IMHO the problem. You need to anticipate how many physical IOs your logical IOs cause, and this is near impossible unless we group IOs in pg itself. Andreas
Re: [HACKERS] autovacuum next steps, take 2
Jim C. Nasby wrote: On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote: So the heuristic would be: * Launcher fires off workers into a database at a given interval (perhaps configurable?) * Each worker works on tables in size order. * If a worker ever catches up to an older worker, then the younger worker exits. This sounds simple and workable to me; perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active. That would likely result in a number of workers running in one database, unless you limited how many workers per database. And if you did that, you wouldn't be addressing the frequently updated table problem. A second vacuum in a database *must* exit after a fairly short time so that we can go back in and vacuum the important tables again (well, or the 2nd vacuum has to periodically re-evaluate what tables need to be vacuumed). I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly since small tables will vacuum fairly quickly.
Re: [HACKERS] autovacuum next steps, take 2
On Thu, Feb 22, 2007 at 10:32:44PM -0500, Matthew T. O'Connor wrote: Jim C. Nasby wrote: On Thu, Feb 22, 2007 at 09:32:57AM -0500, Matthew T. O'Connor wrote: So the heuristic would be: * Launcher fires off workers into a database at a given interval (perhaps configurable?) * Each worker works on tables in size order. * If a worker ever catches up to an older worker, then the younger worker exits. This sounds simple and workable to me; perhaps we can later modify this to include some max_workers variable so that a worker would only exit if it catches an older worker and there are max_workers currently active. That would likely result in a number of workers running in one database, unless you limited how many workers per database. And if you did that, you wouldn't be addressing the frequently updated table problem. A second vacuum in a database *must* exit after a fairly short time so that we can go back in and vacuum the important tables again (well, or the 2nd vacuum has to periodically re-evaluate what tables need to be vacuumed). I'm not sure this is a great idea, but I don't see how this would result in large numbers of workers working in one database. If workers work on tables in size order, and exit as soon as they catch up to an older worker, I don't see the problem. Newer workers are going to catch up to older workers pretty quickly since small tables will vacuum fairly quickly. The reason that won't necessarily happen is because you can get large tables popping up as needing vacuuming at any time. Round 1: Fire up autovac worker; starts working and soon hits 100G table. Round 2: Another worker starts. Since round 1, a 98G table now needs vacuuming, which this worker soon hits. Round 3: 89G table now needs vacuuming. Worker 3 starts up and soon hits it. So now we have 3 workers, all hammering away in the same database, and likely causing a lot of random IO.
-- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [HACKERS] autovacuum next steps, take 2
Ok, scratch that :-) Another round of braindumping below. Launcher starts one worker in each database. This worker is not going to do vacuum work, just report how much vacuum effort is needed in the database. Vacuum effort is measured as the total number of pages in need of vacuum, being the sum of relpages of all tables and indexes needing vacuum. (Note: we weight heap pages the same as index pages. Is this OK?) Create a plan for vacuuming all those databases within the constraints of max_workers. Databases needing the most work are vacuumed first. One worker per database. Thus max_workers databases are being vacuumed in parallel at this time. When one database is finished, the launcher starts a worker in the next database in the list. When the plan is complete (i.e. the list is empty) we can do the whole thing again, excluding the databases that are still being vacuumed. Perhaps we should wait autovacuum_naptime seconds between finishing one vacuum round in all databases and starting the next. How do we measure this: do we start sleeping when the last worker finishes, or when the list is empty? Perhaps we should reserve a worker for vacuuming hot tables. Launcher then uses max_workers-1 workers for the above plan, and the spare worker continuously connects to one database, vacuums hot tables, goes away, and is started again by the launcher to connect to the next database. Definitional problem: how to decide what's a hot table? One idea (the simplest) is to let the DBA define it. Thus, at most two workers are on any database: one of them is working on normal tables, the other on hot tables. (This idea can be complemented by having another GUC var, autovacuum_hot_workers, which allows the DBA to have more than one worker on hot tables (just for the case where there are too many hot tables). This may be overkill.) Ron Mayer expressed the thought that we're needlessly complicating the UI for vacuum_delay, naptime, etc.
He proposes that instead of having cost_delay etc, we have a mbytes_per_second parameter of some sort. This strikes me as a good idea, but I think we could do that after this proposal is implemented. So this take 2 could be implemented, and then we could switch the cost_delay stuff to using a MB/s kind of measurement somehow (he says, waving his hands wildly). Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) If we consider Ron Mayer's idea of measuring MB/s, but we do it per tablespace, then we would inflict the correct amount of vacuum pain to each tablespace, sleeping as appropriate. I think this would require workers of different databases to communicate what tablespaces they are using, so that all of them can utilize the correct amount of bandwidth. I'd like to know if this responds to the mentioned people's objections. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
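The launcher's scheduling pass described above can be sketched like this (an illustrative Python sketch, not autovacuum source): each database's "vacuum effort" is the sum of relpages of tables and indexes needing vacuum, the neediest databases get the first max_workers slots, and the rest wait for a worker to free up.

```python
def plan_databases(effort_by_db, max_workers):
    # effort_by_db maps database name -> total pages needing vacuum.
    # Databases needing the most work are vacuumed first, one worker
    # per database, so max_workers databases run in parallel.
    ordered = sorted(effort_by_db, key=effort_by_db.get, reverse=True)
    return ordered[:max_workers], ordered[max_workers:]
```

When a worker finishes, the launcher pops the head of the queued list and starts a worker there; when both lists are empty, the round is over and the naptime sleep begins.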
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera wrote: Ok, scratch that :-) Another round of braindumping below. I still think this is a solution in search of a problem. The main problem we have right now is that hot tables can be starved of vacuum. Most of this proposal doesn't touch that. I would like to see that problem solved first; then we can talk about adding multiple workers per database or per tablespace etc... (This idea can be complemented by having another GUC var, autovacuum_hot_workers, which allows the DBA to have more than one worker on hot tables (just for the case where there are too many hot tables). This may be overkill.) I think this is more along the lines of what we need first. Ron Mayer expressed the thought that we're needlessly complicating the UI for vacuum_delay, naptime, etc. He proposes that instead of having cost_delay etc, we have a mbytes_per_second parameter of some sort. This strikes me as a good idea, but I think we could do that after this proposal is implemented. So this take 2 could be implemented, and then we could switch the cost_delay stuff to using a MB/s kind of measurement somehow (he says, waving his hands wildly). I agree this is probably a good idea in the long run, but it is lower on the priority list and should come next. Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) If we consider Ron Mayer's idea of measuring MB/s, but we do it per tablespace, then we would inflict the correct amount of vacuum pain to each tablespace, sleeping as appropriate. I think this would require workers of different databases to communicate what tablespaces they are using, so that all of them can utilize the correct amount of bandwidth. I agree that in the long run it might be better to have multiple workers with a MB/s throttle and tablespace awareness, but we don't have any of that infrastructure right now.
I think the piece of low-hanging fruit that your launcher concept can solve is hot-table starvation. My Proposal: If we require admins to identify hot tables, then: 1) Launcher fires off a worker1 into database X. 2) worker1 deals with hot tables first, then regular tables. 3) Launcher continues to launch workers to DB X every autovac naptime. 4) worker2 (or 3 or 4 etc...) sees it is alone in DB X, if so it acts as worker1 did above. If worker1 is still working in DB X then worker2 looks for hot tables that are being starved because worker1 got busy. If worker2 finds no hot tables that need work, then worker2 exits. This seems a very simple solution (given your launcher work) that can solve the starvation problem. Thoughts?
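The four-step proposal above boils down to one decision per worker, sketched here in illustrative Python (names hypothetical, not autovacuum source): a lone worker takes hot tables first and then regular tables, while a second worker only rescues starved hot tables and exits if there are none.

```python
def choose_work(am_alone, hot_tables_needing_vacuum, regular_tables_needing_vacuum):
    # A lone worker behaves like worker1: hot tables first, then the rest.
    if am_alone:
        return hot_tables_needing_vacuum + regular_tables_needing_vacuum
    # A second worker only rescues starved hot tables; an empty result
    # means it should exit immediately.
    return hot_tables_needing_vacuum
```

Because the second worker never touches regular tables, it cannot get stuck behind a huge table the way worker1 can, which is what makes the scheme starvation-free for hot tables.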
Re: [HACKERS] autovacuum next steps, take 2
Alvaro Herrera [EMAIL PROTECTED] writes: Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) I did say that. But your comment about using a high cost_delay was fairly convincing too. It would be a simpler design, and I think you're right. As long as you raise both cost_delay and cost_limit by enough, you should get pretty much the same sequential I/O rate and not step on each other's toes too much. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
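The arithmetic behind that observation can be sketched with a back-of-the-envelope model (this is an illustrative simplification, not the actual PostgreSQL cost accounting): vacuum sleeps for cost_delay milliseconds each time it accrues cost_limit units of cost, so the long-run rate depends only on the ratio of the two knobs.

```python
def pages_per_second(cost_limit, cost_delay_ms, cost_per_page=1):
    # Simplified model assuming the sleeps dominate the runtime:
    # one "cycle" of cost_limit units is followed by one sleep of
    # cost_delay_ms, so rate = (cost_limit / cost_per_page) per cycle
    # times (1000 / cost_delay_ms) cycles per second.
    pages_per_cycle = cost_limit / cost_per_page
    return pages_per_cycle * 1000.0 / cost_delay_ms

# Scaling both knobs by the same factor leaves the average rate
# unchanged while lengthening each sequential burst between sleeps,
# which is what reduces the random-I/O penalty of concurrent workers:
assert pages_per_second(200, 20) == pages_per_second(2000, 200)
```

So raising cost_limit and cost_delay together buys longer sequential runs per worker at the same aggregate bandwidth, which is the point Greg is conceding above.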
Re: [HACKERS] autovacuum next steps, take 2
Ron Mayer expressed the thought that we're needlessly complicating the UI for vacuum_delay, naptime, etc. He proposes that instead of having cost_delay etc, we have a mbytes_per_second parameter of some sort. This strikes me as a good idea, but I think we could do that after this proposal is implemented. So this take 2 could be implemented, and then we could switch the cost_delay stuff to using a MB/s kind of measurement somehow (he says, waving his hands wildly). vacuum should be a process with the least amount of voodoo. If we can just have vacuum_delay and vacuum_threshold, where threshold allows an arbitrary setting of how much bandwidth we will allot to the process, then that is a beyond wonderful thing. It is easy to determine how much IO you have, and what you can spare. Joshua D. Drake Greg Stark and Matthew O'Connor say that we're misdirected in having more than one worker per tablespace. I say we're not :-) If we consider Ron Mayer's idea of measuring MB/s, but we do it per tablespace, then we would inflict the correct amount of vacuum pain to each tablespace, sleeping as appropriate. I think this would require workers of different databases to communicate what tablespaces they are using, so that all of them can utilize the correct amount of bandwidth. I'd like to know if this responds to the mentioned people's objections. -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Re: [HACKERS] autovacuum next steps
I'm wondering if we can do one better... Since what we really care about is I/O responsiveness for the rest of the system, could we just time how long I/O calls take to complete? I know that gettimeofday can have a non-trivial overhead, but do we care that much about it in the case of autovac? On Fri, Feb 16, 2007 at 05:37:26PM -0800, Ron Mayer wrote: Alvaro Herrera wrote: Once autovacuum_naptime... autovacuum_max_workers... How does this sound? The knobs exposed on autovacuum feel kinda tangential to what I think I'd really want to control. IMHO vacuum_mbytes_per_second would be quite a bit more intuitive than cost_delay, naptime, etc. ISTM I can relatively easily estimate and/or spec out how much extra I/O bandwidth I have per device for vacuum, and would pretty much want vacuum to be constantly running on whichever table needs it the most, so long as it can stay under that bandwidth limit. Could vacuum have a tunable that says X MBytes/second (perhaps per device) and have it measure how much I/O it's actually doing and try to stay under that limit? For more fine-grained control, a cron job could go around setting different MBytes/second limits during peak times vs idle times. If people are concerned about CPU-intensive vacuums instead of I/O-intensive ones (does anyone experience that?), another tunable vacuum_percent_of_cpu would be more straightforward than delay_cost, cost_page_hit, etc. But I'd be a bit surprised if CPU-intensive vacuums are common. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
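Ron's proposed knob could be implemented as a simple self-clocking throttle. The sketch below is illustrative Python, not autovacuum source, and vacuum_mbytes_per_second is a hypothetical parameter name from the message above: compute how long the I/O done so far *should* have taken at the target bandwidth, and sleep off any surplus speed.

```python
def required_sleep(bytes_done, elapsed_secs, mbytes_per_second):
    # How long the work done so far should have taken at the target
    # bandwidth, minus how long it actually took. The caller sleeps
    # for the returned number of seconds after each batch of I/O.
    target_elapsed = bytes_done / (mbytes_per_second * 1024.0 * 1024.0)
    return max(0.0, target_elapsed - elapsed_secs)
```

A cron job changing the limit between peak and idle hours, as suggested above, would just update the mbytes_per_second value this function is called with.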
Re: [HACKERS] autovacuum next steps
One option that I've heard before is to have vacuum, after a single iteration (ie, after it fills maintenance_work_mem and does the index cleanup and the second heap pass), remember where it was and pick up from that point next time. From my experience this is not acceptable... I have tables for which the index cleanup takes hours, so no matter how low I would set the maintenance_work_mem (in fact I set it high enough so there's only one iteration), it will take too much time, so the queue tables get overly bloated (not happening either; they now get special cluster treatment). Cheers, Csaba.
Re: [HACKERS] autovacuum next steps
Gregory Stark wrote: If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle of vacuums on smaller tables, then resume, that problem would go away. That sounds too difficult though, but perhaps we could do something nearly as good. I think giving vacuum this interrupt-and-resume capability would be quite useful for large tables. It can provide more flexibility for autovacuum to create a good scheduling scheme. Sometimes it takes a whole day to vacuum a large table (a hundreds-of-GB table may qualify); setting cost_delay can make it last even for several days. If the system has a maintenance window, the vacuum task for the large table can be split to fit into the maintenance window by the interrupt-and-resume feature. One option that I've heard before is to have vacuum after a single iteration (ie, after it fills maintenance_work_mem and does the index cleanup and the second heap pass), remember where it was and pick up from that point next time. Even a single iteration may take a long time, so it is not very useful to break only at iteration boundaries. I think it is not so difficult to get vacuum to remember where it left off and start from there next time. The following is a basic idea. A typical vacuum process mainly has the following phases: Phase 1. scan heap Phase 2. scan and sweep index Phase 3. sweep heap Phase 4. update FSM Phase 5. truncate CLOG When vacuum is interrupted, we can just save the collected information to disk, and restore it later when vacuum restarts: the dead tuple list and the block number it has scanned in phase 1; the indexes it has cleaned up in phase 2; the tuples it has swept in phase 3. Before exiting from vacuum, we can also merge the free space information into the FSM. We are working on this feature now. I will propose it later to discuss with you.
Best Regards, Galy Lee -- NTT Open Source Software Center
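The save-and-restore idea above can be sketched as a pair of checkpoint routines (illustrative Python, not the actual patch; the state layout and names are hypothetical): persist the phase, the scan position, and the collected dead-tuple list, so a restarted vacuum can pick up where the interrupted one left off.

```python
import json
import os

def save_progress(path, phase, last_scanned_block, dead_tuples):
    # Persist the collected vacuum state when interrupted.
    state = {"phase": phase,
             "last_scanned_block": last_scanned_block,
             "dead_tuples": dead_tuples}
    with open(path, "w") as f:
        json.dump(state, f)

def load_progress(path):
    # Returns None when there is no saved state (start from scratch).
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

In the real thing the dead-tuple array lives in maintenance_work_mem and could be large, so spilling it to disk is the main cost of an interrupt; the phase number tells the restarted vacuum which of the five phases to re-enter.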
Re: [HACKERS] autovacuum next steps
Alvaro Herrera wrote: After staring at my previous notes for autovac scheduling, it has become clear that the basics of it are not really going to work as specified. So here is a more realistic plan: [Snip Detailed Description] How does this sound? On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed. VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, since you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces). In general the only time it's a good idea to have multiple vacuums running at the same time is when a big table is starving a small hot table and causing bloat. I think we can extend the current autovacuum stats to add one more column that specifies is hot or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While a worker is busy with hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if that worker doesn't find any hot tables that need work, then it exits, leaving the original worker to continue plodding along. Thoughts?
Re: [HACKERS] autovacuum next steps
In an ideal world I think you want precisely one vacuum process running per tablespace, on the assumption that each tablespace represents a distinct physical device. The cases where we currently find ourselves wanting more are where small tables are due for vacuuming more frequently than the time it takes for a large table to receive a single full pass. If we could have autovacuum interrupt a vacuum in mid-sweep, perform a cycle of vacuums on smaller tables, then resume, that problem would go away. That sounds too difficult though, but perhaps we could do something nearly as good. One option that I've heard before is to have vacuum, after a single iteration (ie, after it fills maintenance_work_mem and does the index cleanup and the second heap pass), remember where it was and pick up from that point next time. If instead autovacuum could tell vacuum exactly how long to run for (or calculate how many pages that represented based on cost_delay), then it could calculate when it will next need to schedule another table in the same tablespace and try to arrange for the vacuum of the large table to be done by then. Once there are no smaller, more frequently vacuumed tables due to be scheduled, it would start vacuum for the large table again and it would resume from where the first one left off. This only works if the large tables really don't need to be vacuumed so often that autovacuum can't keep up. Our current situation is that there is a size at which this happens. But arranging to have only one vacuum process per tablespace will only make that less likely to happen, rather than more. I think the changes to vacuum itself are pretty small to get it to remember where it left off last time and start from mid-table. I'm not sure how easy it would be to get autovacuum to juggle all these variables, though. Of course users may not create separate tablespaces for physical devices, or they may set cost_delay so high you really do need more vacuum processes, etc.
So you probably still need a num_vacuum_daemons but the recommended setting would be the same as the number of physical devices and autovacuum could try to divide them equally between tablespaces which would amount to the same thing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
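The scheduling arithmetic Greg describes can be sketched as a page budget (illustrative Python using the same simplified cost model as the cost_delay throttle, not autovacuum source): given the seconds until the next small hot table is due, compute how many pages of the large table can be processed by then, so vacuum knows where to pause.

```python
def pages_before_deadline(secs_until_next_hot_vacuum, cost_limit,
                          cost_delay_ms, cost_per_page=1):
    # Under the cost_delay throttle, each sleep of cost_delay_ms buys
    # one "cycle" of cost_limit cost units; budget how many such
    # cycles fit before the next small-table vacuum is due.
    cycles = secs_until_next_hot_vacuum * 1000.0 / cost_delay_ms
    return int(cycles * (cost_limit / cost_per_page))
```

The large-table vacuum would process this many pages, checkpoint its position, yield to the small tables, and then resume from mid-table as Greg suggests.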
Re: [HACKERS] autovacuum next steps
Matthew T. O'Connor wrote: Alvaro Herrera wrote: After staring at my previous notes for autovac scheduling, it has become clear that the basics of it are not really going to work as specified. So here is a more realistic plan: [Snip Detailed Description] How does this sound? On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed. VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, since you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces). Yeah, I understand that. However, I think that can be remedied by using a reasonable autovacuum_vacuum_cost_delay setting, so that each worker uses less than the total I/O available. The main point of the proposal is to allow multiple workers on a DB while also allowing multiple databases to be processed in parallel. I think we can extend the current autovacuum stats to add one more column that specifies is hot or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While a worker is busy with hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if that worker doesn't find any hot tables that need work, then it exits, leaving the original worker to continue plodding along. How would you define what's a hot table? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] autovacuum next steps
Alvaro Herrera wrote: Matthew T. O'Connor wrote: On first blush, I'm not sure I like this, as it doesn't directly attack the table starvation problem, and I think it could be a net loss of speed. VACUUM is I/O bound; as such, just sending multiple vacuum commands at a DB isn't going to make things faster, since you are now going to have multiple processes reading from multiple tables at the same time. I think in general this is a bad thing (unless we someday account for I/O made available from multiple tablespaces). Yeah, I understand that. However, I think that can be remedied by using a reasonable autovacuum_vacuum_cost_delay setting, so that each worker uses less than the total I/O available. The main point of the proposal is to allow multiple workers on a DB while also allowing multiple databases to be processed in parallel. So you are telling people to choose an autovacuum_delay so high that they need to run multiple autovacuums at once to keep up? I'm probably being too dramatic, but it seems inconsistent. I think we can extend the current autovacuum stats to add one more column that specifies is hot or something to that effect. Then when the AV launcher sends a worker to a DB, it will first look for tables marked as hot and work on them. While a worker is busy with hot tables, the launcher need not send any additional workers to this database. If the launcher notices that a worker is working on regular tables, it can send another worker which will look for hot tables to work on; if that worker doesn't find any hot tables that need work, then it exits, leaving the original worker to continue plodding along. How would you define what's a hot table? I wasn't clear; I would have the admin specify it, and we can store it as an additional column in the pg_autovacuum_settings table. Or perhaps if the table is below some size threshold and autovacuum sees that it needs to be vacuumed every time it checks, 10 times in a row, or something like that.
Re: [HACKERS] autovacuum next steps
[EMAIL PROTECTED] (Alvaro Herrera) writes: When there is a single worker processing a database, it does not recheck pgstat data after each table. This is to prevent a high-update-rate table from starving the vacuuming of other databases.

This case is important; I don't think that having multiple workers fully alleviates the problem condition. Pointedly, you need to have a way of picking up tables often enough to avoid the XID rollover problem. That may simply require that, on some periodic basis, a query is run to queue up tables that are getting close to having an XID problem.
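The periodic anti-wraparound check amounts to computing each table's XID age with 32-bit wraparound arithmetic and queueing tables past a safety threshold. A minimal sketch of that idea (illustrative only, not the actual pg_class/pgstat code; the names and threshold are hypothetical):

```python
# Illustrative sketch: transaction IDs are 32-bit counters that wrap
# around, so "age" must be computed modulo 2**32. Tables whose frozen XID
# is older than some threshold get queued for vacuuming regardless of
# update activity, to avoid XID rollover.

XID_MODULUS = 2 ** 32

def xid_age(current_xid, relfrozenxid):
    """Age of a table's frozen XID, accounting for 32-bit wraparound."""
    return (current_xid - relfrozenxid) % XID_MODULUS

def tables_needing_freeze(tables, current_xid, max_age):
    """tables: dict of table name -> relfrozenxid. Returns names past the threshold."""
    return [name for name, frozen in tables.items()
            if xid_age(current_xid, frozen) > max_age]
```

A periodic query over pg_class could feed a list like this into the worker queue independently of the usual update-driven scheduling.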
Re: [HACKERS] autovacuum next steps
Alvaro Herrera [EMAIL PROTECTED] writes: Each worker, including the initial one, starts vacuuming tables according to pgstat data. They recheck the pgstat data after finishing each table, so that a table vacuumed by another worker is not processed twice (maybe problematic: a table with a high update rate may be vacuumed more than once. Maybe this is a feature, not a bug).

How are you going to make that work without race conditions? ISTM it is practically guaranteed that all the workers will try to vacuum the same table.

Once autovacuum_naptime has passed, if the workers have not finished yet, the launcher wants to vacuum another database.

This seems a rather strange design, as it will encourage concentrations of workers in a single database. Wouldn't it be better to spread them out among multiple databases by default?

regards, tom lane
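One way to avoid the race Tom describes was sketched earlier in the thread: claim the table's OID in shared memory under an exclusive lock before vacuuming, and skip tables already claimed. A rough model of that protocol, with threading.Lock standing in for the AutovacSchedule LWLock and a set standing in for the shared-memory OID array (all names hypothetical):

```python
import threading

# Rough model of the proposed claim protocol: before vacuuming a table, a
# worker takes the schedule lock, skips the table if another worker already
# claimed it, rechecks that the table still needs vacuuming, then records
# its OID and releases the lock. The set stands in for shared memory.

schedule_lock = threading.Lock()   # stands in for the AutovacSchedule LWLock
in_progress = set()                # stands in for the shared-memory OID array

def try_claim(table_oid, still_needs_vacuum):
    """Return True if we claimed the table and should vacuum it."""
    with schedule_lock:
        if table_oid in in_progress:
            return False           # another worker got there first
        if not still_needs_vacuum(table_oid):
            return False           # someone vacuumed it since we built our list
        in_progress.add(table_oid)
        return True

def release(table_oid):
    """Clear the claim after vacuuming finishes."""
    with schedule_lock:
        in_progress.discard(table_oid)
```

The open question from the thread remains: rechecking stats under the lock means doing I/O while holding it, while rechecking outside the lock means repeating the collision check afterwards.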
Re: [HACKERS] autovacuum next steps
Alvaro Herrera wrote: Once autovacuum_naptime... autovacuum_max_workers... How does this sound?

The knobs exposed on autovacuum feel kinda tangential to what I think I'd really want to control. IMHO vacuum_mbytes_per_second would be quite a bit more intuitive than cost_delay, naptime, etc. ISTM I can relatively easily estimate and/or spec out how much extra I/O bandwidth I have per device for vacuum, and would pretty much want vacuum to be constantly running on whichever table needs it the most, so long as it can stay under that bandwidth limit.

Could vacuum have a tunable that says X MBytes/second (perhaps per device) and have it measure how much I/O it's actually doing and try to stay under that limit? For more fine-grained control, a cron job could go around setting different MBytes/second limits during peak times vs idle times.

If people are concerned about CPU-intensive vacuums instead of I/O-intensive ones (does anyone experience that?), another tunable, vacuum_percent_of_cpu, would be more straightforward than cost_delay, cost_page_hit, etc. But I'd be a bit surprised if CPU-intensive vacuums are common.
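The vacuum_mbytes_per_second idea amounts to measuring bytes read and sleeping long enough to stay under the target rate. A minimal sketch of that throttle, assuming a hypothetical parameter name (no such GUC exists; this only illustrates the arithmetic):

```python
# Minimal sketch of rate-limiting vacuum I/O to a target MB/s. After
# reading `bytes_read` bytes in `elapsed` seconds, sleep long enough that
# the overall rate drops back to `target_mb_per_sec`. Names hypothetical.

def throttle_sleep(bytes_read, elapsed, target_mb_per_sec):
    """Return seconds to sleep so that bytes_read / total_time <= target."""
    min_time = bytes_read / (target_mb_per_sec * 1024 * 1024)
    return max(0.0, min_time - elapsed)
```

A vacuum loop would call this after each batch of pages; a cron job changing the target at peak vs idle hours then maps directly onto the knob the mail asks for.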