Re: [HACKERS] GSoC - proposal - Materialized Views in PostgreSQL
On 11.04.10 20:47 , Robert Haas wrote: On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas wrote: Robert Haas wrote: 2010/4/10 Andrew Dunstan: Heikki Linnakangas wrote: 1. Keep the materialized view up-to-date when the base tables change. This can be further divided into many steps, you can begin by supporting automatic updates only on very simple views with e.g. a single table and a where clause. Then extend that to support joins, aggregates, subqueries etc. Keeping it really limited, you could even require the user to write the required triggers himself. That last bit doesn't strike me as much of an advance. Isn't the whole point of this to automate it? Creating greedy materialized views is usually not terribly difficult now, but you do have to write the triggers. Yeah, I agree. It doesn't accomplish anything interesting on its own. But if you do the planner changes to automatically use the materialized view to satisfy queries (item 2. in my previous email), it's useful. But you can't do that with a snapshot view, only a continuously updated one. If continuous updates prove to be too hard initially, you could instead update the view on select if it's outdated. Such a materialized view would be a kind of inter-session cache for subselects. The hard part would probably be to figure out how to decide whether the view is outdated or not, and to deal with two concurrent transactions trying to use an outdated view (and both trying to refresh it). What makes the second problem hard is that you wouldn't want one of the transactions to wait for the other to complete, because this is not how SELECTs traditionally behave. best regards, Florian Pflug
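The refresh-on-select idea can be sketched concisely. The following is only an illustrative model in Python (all names - RefreshOnSelectView, invalidate, select - are invented; a real implementation would live inside the executor): it shows how a second concurrent reader can fall back to the stale copy instead of waiting for the refresher, matching how SELECTs traditionally behave.

```python
import threading

class RefreshOnSelectView:
    """Illustrative inter-session cache: refresh the materialized result
    on SELECT if it is outdated, without making a concurrent reader wait
    for the transaction that is already refreshing it."""
    def __init__(self, compute):
        self._compute = compute            # recomputes the view's contents
        self._data = None
        self._outdated = True
        self._refresh_lock = threading.Lock()

    def invalidate(self):
        # Would be triggered by changes to the base tables.
        self._outdated = True

    def select(self):
        if self._outdated:
            # Only one reader refreshes; a concurrent one falls back to
            # the stale copy instead of blocking.
            if self._refresh_lock.acquire(blocking=False):
                try:
                    self._data = self._compute()
                    self._outdated = False
                finally:
                    self._refresh_lock.release()
        return self._data

base = [1, 2, 3]
view = RefreshOnSelectView(lambda: sum(base))
print(view.select())   # 6
base.append(4)
view.invalidate()
print(view.select())   # 10
```

Deciding when invalidate() must fire - i.e. detecting that a base table changed - is exactly the hard part mentioned above; here it is simply called by hand.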
Re: [HACKERS] Project suggestion: benchmark utility for PostgreSQL
Mickael DELOISON wrote: For a programming language, as it would be for GSoC, it has to be realized in three months and I believe the utility has to be cross-platform (anyway I want it to be). So I think Java would be good. I am very used to Java and Swing programming. What do you think about that choice? If you feel Java is a bad choice, there is C++/Boost/wxWidgets (like pgAdmin). But with wxWidgets, I am not sure that a GUI which works under Windows and Linux will work under MacOS without hacks. I just wanted to comment on the wxWidgets-under-OSX part. It works pretty well, but of course there are some bugs in wxWidgets that plague only a particular platform. And since OSX seems to be one of the lesser-used ones, there appear to be more bugs of this kind. But if you grep through the pgAdmin3 sources, you'll find only a little OSX-specific code. One nice thing about wxWidgets is that applications that use it feel like native applications on the platforms that wxWidgets supports. Java GUIs often feel rather alien, at least IMHO. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] modifying the table function
Gregory Stark wrote: "Andrew Dunstan" <[EMAIL PROTECTED]> writes: Yes, more or less. That's what made me think of it. OTOH, before we rush out and do it someone needs to show that it's a net win. I agree with Tom that making tuplestore faster would probably be a much better investment of time. I don't think the problem with the tuplestore is a matter of speed. It's a matter of scalability and flexibility. It limits the types of applications that can use SRFs and the amount of data they can manipulate before it becomes impractical. Consider applications like dblink that have SRFs that read data from slow network sources. Or that generate more data than the server can actually store at any one time. Or that overflow work_mem but are used in queries that could return quickly based on the first few records. Unfortunately, I don't think there's a simple fix that'll work for all PLs using the current interface. Even languages with iterators themselves (python, I think) probably don't expect to be called externally while an iterator is in progress. Just a thought - I believe that there are portable user-space thread implementations that contain little or no machine-specific code. What if postgres used one of those to switch from the PL into the executor and back after, say, 1000 rows were returned by the SRF? What would be needed is basically some enhanced version of setjmp/longjmp that actually saves the stack, and not just resets the stack pointer. Since context switching would occur only at two well-defined places (some return_next_row function that PLs call when an SRF returns a row, and in the executor if no more previously returned rows from that SRF are available), this wouldn't introduce the usual multithreading headache, but would still allow switching in and out of the PL interpreter.
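The control transfer described above is essentially coroutine switching, and languages with generators show why it pays off. Here is a hedged sketch in Python (srf and executor_limit are invented stand-ins, not PostgreSQL APIs): each yield plays the role of the proposed return_next_row call, suspending the "PL" until the "executor" pulls again, so a query with LIMIT never forces the whole result set to be materialized.

```python
def srf():
    """An SRF that produces rows incrementally.  Each `yield` plays the
    role of a return_next_row() call: the function is suspended and
    control returns to the executor, with no tuplestore in between."""
    for i in range(1000000):
        yield i   # suspend here; resume only when another row is pulled

def executor_limit(source, n):
    """Stand-in for an executor node with LIMIT n: it pulls only the
    rows it needs, so the producer never runs past them."""
    out = []
    for row in source:
        out.append(row)
        if len(out) >= n:
            break
    return out

print(executor_limit(srf(), 3))   # [0, 1, 2] - the rest is never produced
```

The stack-switching proposal is what it would take to get this behaviour for PLs whose interpreters cannot themselves be suspended mid-call.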
greetings, Florian Pflug
Re: [HACKERS] modifying the table function
Andrew Dunstan wrote: Florian G. Pflug wrote: Just a thought - I believe that there are portable user-space thread implementations that contain little or no machine-specific code. What if postgres used one of those to switch from the PL into the executor and back after, say, 1000 rows were returned by the SRF? What would be needed is basically some enhanced version of setjmp/longjmp that actually saves the stack, and not just resets the stack pointer. Since context switching would occur only at two well-defined places (some return_next_row function that PLs call when an SRF returns a row, and in the executor if no more previously returned rows from that SRF are available), this wouldn't introduce the usual multithreading headache, but would still allow switching in and out of the PL interpreter. This just sounds horribly fragile. Why would it be? It's about the same as running postgresql in one thread, and some PL in another. This should only cause trouble if both use some non-reentrant libc functions. But even that wouldn't matter because of the well-defined context switching points. Here is a paper about portable userspace threads that I just googled. http://www.gnu.org/software/pth/rse-pmt.ps Are we really sure that this isn't a solution in search of a problem? I think this really depends on how you define "problem". Some people might think that "select * from myfunc(...) limit 1" should stop and return a result after myfunc(...) has returned one row. Others will say "well, just use a different software design that doesn't depend on this optimization". greetings, Florian Pflug
Re: [HACKERS] modifying the table function
Gregory Stark wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Since context switching would occur only at two well-defined places (some return_next_row function that PLs call when an SRF returns a row, and in the executor if no more previously returned rows from that SRF are available), this wouldn't introduce the usual multithreading headache... Yes it would. Consider what happens if the PL function calls into SPI to execute a query I don't think that would cause trouble. Postgres and the PL would still not run concurrently - in fact, from a control-flow point of view the PL would run at _exactly_ the same place as it does now: precisely when someone tries to pull rows out of some executor node. The only difference is that its execution would be stopped and restarted multiple times by some sort of stack-switching. There might be trouble if a second function has to be executed with the same PL as an already running (but currently "stopped") function. This would only work for PLs that are thread-safe in some way. greetings, Florian Pflug
Re: [HACKERS] modifying the table function
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Here is a paper about portable userspace threads that I just googled. http://www.gnu.org/software/pth/rse-pmt.ps I'm not impressed --- the list of unsolved problems is a bit too long. One that seems a showstopper for our purposes is lack of ability to deal reliably with stack overflow on alternate stacks. If we're going to be doing anything as loosely defined as running a third-party language interpreter in these stacks, that just won't do. Another little problem is that it's LGPL. I don't think using that GnuPT library the paper mentions is a viable approach for postgres. I just posted that link because it shows that this is not impossible to do. What would actually be interesting is to find out whether the ucontext functions getcontext/makecontext/swapcontext/setcontext mentioned in the paper are available on the platforms where postgres is used in production for larger datasets. I'd assume that people who need this PL optimization will run a reasonably up-to-date version of their particular OS, so maybe the portability problems of those functions wouldn't be a problem for postgres - the PL optimization would just be disabled at configure time if they are not available. The main question is: IF the stability problems like stack overflow can be addressed, would this in principle be considered a feature that people would like to have? Or is it considered not worth the effort? greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: On 3/21/07, Merlin Moncure <[EMAIL PROTECTED]> wrote: On 3/21/07, Pavan Deolasee <[EMAIL PROTECTED]> wrote: > It seems much simpler to me to do something like this. But the important > question is whether the restriction that CREATE INDEX can not > be run in a transaction block is acceptable? yikes -- this is a huge, huge price to pay, IMHO. Think about DDL that implies index creation, such as adding a unique constraint to a table; many postgresql users (including me) take advantage of that when applying updates to production databases. I didn't understand that quite well. How does it help to run CREATE INDEX inside a transaction? The problem is not so much CREATE INDEX per se, but other DDL commands that implicitly create an index, for example adding a PRIMARY KEY to a table. Some applications that I have written would fail badly if CREATE INDEX was disallowed inside a transaction - mostly because they use plpgsql functions to manipulate database objects, and disallowing CREATE INDEX inside a transaction presumably means disallowing it from inside stored procedures. greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: There is a slight hole in that SERIALIZABLE transactions won't be able to use any indexes they build during their transaction, since they may need to be able to see prior data, but I don't think anybody is going to complain about that restriction. Anyone? Oh, I did not see that. If there are no HOT-chains in the table, we can set the xid to InvalidTransactionId so that the index is usable immediately after it is created, in the current transaction as well as in those transactions which were started before the CREATE INDEX. We can possibly further improve it by checking whether there are no HOT-chains except those created by this transaction, and then set xid to InvalidTransactionId. IMO with that we shall address most of the use cases. There are a few which might still get impacted, but even for them there won't be any correctness problem. Why exactly can't a SERIALIZABLE transaction use the index it created itself? If you add a pointer to the root of all HOT update chains where either the HEAD is alive, or some tuple is visible to the transaction creating the index, shouldn't this be sufficient for using the index in the creating transaction? greetings, Florian Pflug
Re: [HACKERS] tsearch_core for inclusion
Teodor Sigaev wrote: For a given schema and server's locale, it's possible to have several FTS configurations, but only one (with a special flag enabled) can be used as the default. The current (active) FTS configuration is contained in the GUC variable tsearch_conf_name. If it's not defined, then an FTS configuration matching the server's locale and with the default flag enabled is looked up in search_path. Isn't the real problem that only _one_ configuration per locale should be marked as DEFAULT at any time, no matter what schema it is in? Having one DEFAULT configuration per schema per locale will necessarily cause confusion if search_path is not set carefully, I think. greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: On 3/23/07, Florian G. Pflug <[EMAIL PROTECTED]> wrote: Why exactly can't a SERIALIZABLE transaction use the index it created itself? If you add a pointer to the root of all HOT update chains where either the HEAD is alive, or some tuple is visible to the transaction creating the index, shouldn't this be sufficient for using the index in the creating transaction? Tuples in the HOT-chain may not share the same index keys with respect to the new index being built (they share the same keys for all existing indexes though). So we don't know which index key to use while building the index. Ah, of course - thanks for pointing that out. greetings, Florian Pflug
Re: [HACKERS] tsearch_core for inclusion
Oleg Bartunov wrote: On Fri, 23 Mar 2007, Florian G. Pflug wrote: Teodor Sigaev wrote: For a given schema and server's locale, it's possible to have several FTS configurations, but only one (with a special flag enabled) can be used as the default. The current (active) FTS configuration is contained in the GUC variable tsearch_conf_name. If it's not defined, then an FTS configuration matching the server's locale and with the default flag enabled is looked up in search_path. Isn't the real problem that only _one_ configuration per locale should be marked as DEFAULT at any time, no matter what schema it is in? I'm not sure I understand you correctly (a bit complex :), but it's allowed to have only _one_ DEFAULT configuration per schema/per locale. So, visibility is defined by search_path for a given locale. Yes, but why is that needed? Wouldn't one DEFAULT configuration per database be sufficient, and avoid the search_path problems? Sorry if I'm being stupid - I just can't see what having a different DEFAULT configuration per schema buys you. greetings, Florian Pflug
Re: [HACKERS] Reduction in WAL for UPDATEs
Kenneth Marshall wrote: We use DSPAM as one of our anti-spam options. Its UPDATE pattern is to increment a spam counter or a not-spam counter while keeping the user and token information the same. This would benefit from this optimization. Currently we are forced to use MySQL with MyISAM tables to support the update load, although PostgreSQL 8.2 performance is right at the I/O break-even point for switching databases. With HOT and more optimized UPDATE I/O, 8.3 would give us enough I/O headroom to switch to PostgreSQL. Interesting. I've switched from MySQL to PostgreSQL for dspam, because of concurrency issues with MyISAM which caused bad performance. I am eager to see how much HOT speeds up my setup, though ;-) BTW, the "COMMIT NOWAIT" feature Simon Riggs proposed should provide a huge speedup too, since dspam runs one transaction for each token it has to update. greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Simon Riggs wrote: On Wed, 2007-03-28 at 22:24 +0530, Pavan Deolasee wrote: Just when I thought we had nailed down CREATE INDEX, I realized that there is something more to worry about. The problem is with the HOT-chains created by our own transaction which is creating the index. We thought it would be enough to index the tuple at the head-of-the-chain since that would be the visible copy once the transaction commits. We thought of keeping the index unavailable for queries in pre-existing transactions by setting a new "xid" attribute in pg_index. The question is what value to assign to "xid". I thought we would assign ReadNewTransactionId(). > If you are indexing a table that hasn't just been created by you, set the xcreate field on pg_index at the *end* of the build using ReadNewTransactionId(). Any xid less than that sees the index as invalid. If you created the table in this transaction (i.e. createSubId != 0) then set xcreate to the creating xid. Couldn't you store the creating transaction's xid in pg_index, and let other transactions check that against their snapshot like they would for any tuple's xmin or xmax? (With one exception - the creating transaction would consider indices it built itself invalid, which is not how things usually work for xmin/xmax). This would mean that any transaction that believes that the creating transaction has committed also considers the index to be valid. greetings, Florian Pflug
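For illustration only, here is a much-simplified model of such a snapshot check in Python (it deliberately ignores aborted transactions, subtransactions and clog lookups, and the (xmin, xmax, running) snapshot representation is an assumption of the sketch, not PostgreSQL's actual layout):

```python
def xid_visible(xid, snapshot):
    """Crude model of checking an xid against a snapshot.  A snapshot is
    (xmin, xmax, running): xids below xmin finished before the snapshot,
    xids at or above xmax started after it, and xids in `running` were
    still in progress when it was taken."""
    xmin, xmax, running = snapshot
    if xid >= xmax:          # started after the snapshot was taken
        return False
    if xid in running:       # was still in progress
        return False
    return True              # assumed committed in this sketch

# An index created by xid 100, seen from a snapshot taken while 100 ran:
assert not xid_visible(100, (90, 110, {95, 100}))
# The same index, seen from a snapshot taken after 100 committed:
assert xid_visible(100, (101, 120, set()))
```

A transaction whose snapshot sees the creating xid as committed would treat the index as valid, which is exactly the rule proposed above.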
Re: [HACKERS] CREATE INDEX and HOT - revised design
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Couldn't you store the creating transaction's xid in pg_index, and let other transactions check that against their snapshot like they would for any tuple's xmin or xmax? What snapshot? I keep having to remind people that system catalog operations are SnapshotNow. In the particular context here, the place where doing something else would break down is that the planner has no idea when it makes a plan what snapshot(s) the plan might later be used with. Sorry - now that you say it, I remember that you've said that multiple times already... So the question is, why did this work until now, and why do CREATE INDEX and HOT just not seem to fit into this scheme? I think the answer is that all other DDL statements manage to ensure that any database objects they create or modify are usable for everybody else immediately after they are committed. This usually implies pretty strong locking requirements - for example, I think that the core reason why TRUNCATE needs an exclusive lock is precisely this guarantee it has to make. Maybe this could somehow be relaxed? Could, for example, the planner be allowed to base some of its decisions on the SerializableSnapshot that every transaction (even a read-only one) possesses? It seems that this would prevent plans from living longer than a transaction, but maybe plan invalidation could help here? greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: On 3/28/07, Tom Lane <[EMAIL PROTECTED]> wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: > Couldn't you store the creating transaction's xid in pg_index, and > let other transactions check that against their snapshot like they > would for any tuple's xmin or xmax? What snapshot? I keep having to remind people that system catalog operations are SnapshotNow. In the particular context here, the place where doing something else would break down is that the planner has no idea when it makes a plan what snapshot(s) the plan might later be used with. Tom, please correct me if I am wrong. But ISTM that this idea might work in this context. In get_relation_info(), we would check if the "xcreate" xid stored in pg_index for the index under consideration is seen as committed with respect to the snapshot at that point of time. Even if the snapshot changes later and the index becomes valid, we might not replan and hence not use the index. But that doesn't seem like a big problem to me. The problem is use cases like: PREPARE my_plan; BEGIN; EXECUTE my_plan; COMMIT; Is that PREPARE even run inside a transaction? Even if it is, it probably won't have created a snapshot... I think allowing the use of some sort of snapshot from inside the planner would allow some locking to be relaxed, but there seem to be a lot of corner cases to consider :-( OTOH, if you manage to make this work, a TRUNCATE that doesn't block concurrent selects might become possible to do. This would for example allow dropping and rebuilding subscriptions on a slony node while it is in use. greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: In this specific context, this particular case is easy to handle because we are only concerned about the serializable transactions started before CREATE INDEX commits. If PREPARE can see the new index, it implies that the CI transaction is committed. So a transaction starting after that can only see the tuple version that we have indexed. Yes, but the non-index plan PREPARE generated will be used until the end of the session, not only until the end of the transaction. Imagine that it wasn't explicitly PREPAREd (where you might say this is acceptable), but rather just a query inside a plpgsql function, maybe even called from some app using connection pooling. This means that the non-index-using plan might get used for quite a long time, which contradicts the work Tom did on plan invalidation, I think. Maybe Tom can comment on whether it's possible to use plan invalidation to eventually get rid of a stale plan in this context? greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: On 3/29/07, Florian G. Pflug <[EMAIL PROTECTED]> wrote: Yes, but the non-index plan PREPARE generated will be used until the end of the session, not only until the end of the transaction. Frankly I don't know how this works, but are you sure that the plan will be used until the end of the session? Even if that's the case, it can happen even today if we create a new index, but the existing sessions will use the stale plan (assuming what you said is true) I've checked that:

test=# prepare myplan as select * from test where id=1 ;
PREPARE
test=# explain execute myplan ;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22897.70 rows=5421 width=36)
   Filter: (id = 1)
(2 rows)

Now I create an index in another session.

test=# explain select * from test where id=1 ;
                              QUERY PLAN
-----------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=95.11..8248.45 rows=5000 width=36)
   Recheck Cond: (id = 1)
   ->  Bitmap Index Scan on idx  (cost=0.00..93.86 rows=5000 width=0)
         Index Cond: (id = 1)
(4 rows)

test=# explain execute myplan ;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22897.70 rows=5421 width=36)
   Filter: (id = 1)
(2 rows)

!!! The index got used by the "select ..." but not by "execute myplan" !!!

test=# prepare myplan2 as select * from test where id=1 ;
PREPARE
test=# explain execute myplan2 ;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using idx on test  (cost=0.00..8.38 rows=1 width=37)
   Index Cond: (id = 1)
(2 rows)

!!! A newly prepared plan of course uses the index !!!

So yes, plans get cached until the end of the session, and yes, 8.2 won't notice index creation either ;-) The open question is how CVS HEAD with plan invalidation behaves. If it replans after the index-creating transaction commits, then basing index validity on a snapshot will break this, because upon replan the index might not be usable, but later on it may very well be (and the plan invalidation machinery won't realize that). So this might not introduce a regression compared to 8.2, but to a future 8.3 with plan invalidation...
Sorry for being so unclear in my previous emails - I had confused myself ;-) greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Tom Lane wrote: "Simon Riggs" <[EMAIL PROTECTED]> writes: ISTM that the run-another-transaction-afterwards idea is the only one that does everything I think we need. I really do wish we could put in a wait, like CIC, but I just think it will break existing programs. Actually, there's a showstopper objection to that: plain CREATE INDEX has to be able to run within a larger transaction. (To do otherwise breaks "pg_dump --single-transaction", just for starters.) This means it can *not* commit partway through. I believe the original idea was to invent some kind of "on commit, run this transaction" hook - similar to how files are deleted on commit, I think. At least that's how I understood the "run another transaction on commit" idea... greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Simon Riggs wrote: On Thu, 2007-03-29 at 17:27 -0400, Tom Lane wrote: "Simon Riggs" <[EMAIL PROTECTED]> writes: ISTM that the run-another-transaction-afterwards idea is the only one that does everything I think we need. I really do wish we could put in a wait, like CIC, but I just think it will break existing programs. Actually, there's a showstopper objection to that: plain CREATE INDEX has to be able to run within a larger transaction. (To do otherwise breaks "pg_dump --single-transaction", just for starters.) This means it can *not* commit partway through. The idea is to make note that the transaction has created an index within a transaction block, so that after the top-level transaction commits we sneak in an extra hidden transaction to update the pg_index tuple with the xcreate of the first transaction. The only other alternative is to forcibly throw a relcache inval event in the same circumstances without running the additional transaction, but the solution is mostly the same. I think one alternative might be to store a list of xids together with a cached plan, and replan if the commit status (as perceived by the transaction the plan will be executed in) of one of those xids changes. greetings, Florian Pflug
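A rough sketch of that bookkeeping, with invented names and the planner reduced to a stand-in that merely records which indexes it may use (the assumption being that a plan remembers the xids of index-creating transactions it treated as uncommitted, and is rebuilt once a snapshot sees one of them as committed):

```python
def choose_plan(usable_indexes):
    # Stand-in for the planner: the "plan" just records which indexes
    # it may use.  Invented for illustration.
    return sorted(usable_indexes)

class CachedPlan:
    """A cached plan that remembers which index-creating xids it
    assumed to be uncommitted, and replans once a snapshot sees one of
    them commit."""
    def __init__(self, indexes, committed):
        # indexes: {index_name: creating_xid}
        # committed: the set of xids the current snapshot sees as committed
        self.indexes = indexes
        self.pending = {xid for xid in indexes.values()
                        if xid not in committed}
        self.plan = choose_plan(name for name, xid in indexes.items()
                                if xid in committed)

    def get(self, committed):
        # Replan if the perceived commit status of an assumed xid changed.
        if self.pending & committed:
            self.__init__(self.indexes, committed)
        return self.plan

plan = CachedPlan({"idx": 100}, committed={50})
print(plan.get({50}))        # [] - the index's creator is still in progress
print(plan.get({50, 100}))   # ['idx'] - xid 100 now seen committed: replan
```

The cost is one set intersection per execution, in exchange for not needing a relcache inval to notice the status change.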
Re: [HACKERS] CREATE INDEX and HOT - revised design
Tom Lane wrote: "Pavan Deolasee" <[EMAIL PROTECTED]> writes: How about storing the snapshot which we used during planning in CachedPlanSource, if at least one index was seen as unusable because its CREATE INDEX transaction was seen as in progress? I'm getting tired of repeating this, but: the planner doesn't use a snapshot. System catalogs run on SnapshotNow. But it would still do that - it would just compare the createxid of the index against some snapshot, and the query would be replanned if the cached result of this comparison differs from the one the current snapshot yields. It might well be that this won't work because the planner is invoked in situations where there is no active snapshot - I'm not sure if your comment refers to that case, or not. greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Tom Lane wrote: "Pavan Deolasee" <[EMAIL PROTECTED]> writes: What I am suggesting is to use the ActiveSnapshot (actually Florian's idea) to decide whether the transaction that created the index was still running when we started. Isn't it the case that some snapshot will be "active" when we plan? I do not think you can assume that the plan won't be used later with some older snapshot. Consider recursive plpgsql functions for a counterexample: the inner occurrence might be the first to arrive at a given line of the function, hence the first to plan it, yet when we return to the outer instance we might revert to an older snapshot. So maybe we'd need to use the SerializableSnapshot created at the start of each transaction for this check, and not the ActiveSnapshot? Could that work? What about doing PREPARE myplan select ... ; outside of a transaction? Will this be executed inside a transaction? Is a query always planned upon its first execution, and not when PREPARE is issued? greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Tom Lane wrote: I do not think you can assume that the plan won't be used later with some older snapshot. So maybe we'd need to use the SerializableSnapshot created at the start of each transaction for this check, and not the ActiveSnapshot? Could that work? That might work, but it doesn't seem to address the core objection: there's no mechanism to cause the query to be replanned once the snapshot is new enough, because no relcache inval will happen. So most likely existing backends will keep using old plans that don't consider the index. Pavan suggested storing the IndexSnapshot in the cached plan, and comparing it to the IndexSnapshot when the query is executed. If those two snapshots differ, the query would be replanned. My idea was to store, together with the cached plan, a list of xids that are assumed to be uncommitted according to the IndexSnapshot. The query is replanned if upon execution the IndexSnapshot assumes that one of these xids is committed. Those two ideas seem to be mostly equivalent; mine seems to be a bit more fine-grained, but at the cost of more work upon each query execution. greetings, Florian Pflug
Re: [HACKERS] CREATE INDEX and HOT - revised design
Pavan Deolasee wrote: On 3/30/07, Florian G. Pflug <[EMAIL PROTECTED]> wrote: My idea was to store, together with the cached plan, a list of xids that are assumed to be uncommitted according to the IndexSnapshot. The query is replanned if upon execution the IndexSnapshot assumes that one of these xids is committed. Actually, if we are using the Serializable Snapshot then there is no chance to replan the query before the transaction completes, and the next transaction to start in the session must see the index, and hence we must replan. So it would be enough just to associate a transaction id with the cached plan. If this xid is set and our transaction id is different from it, we replan. I believe this is true for the CREATE INDEX scenario. However, comparing either the snapshot or the result of xid checks seems like it might be useful for other things besides CREATE INDEX. I'm specifically thinking about TRUNCATE here - the CREATE INDEX + HOT problems sound quite similar to the problems a non-exclusive-locking TRUNCATE would face. greetings, Florian Pflug
Re: [HACKERS] Minor changes to Recovery related code
Simon Riggs wrote: On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote: "Simon Riggs" <[EMAIL PROTECTED]> writes: 2. pg_stop_backup() should wait until all archive files are safely archived before returning Not sure I agree with that one. If it fails, you can't tell whether the action is done and it failed while waiting for the archiver, or if you need to redo it. There's a slight delay between pg_stop_backup() completing and the archiver doing its stuff. Currently if somebody does a -m fast straight after the pg_stop_backup() the backup may be unusable. We need a way to plug that small hole. I suggest that pg_stop_backup() polls once per second until pg_xlog/archive_status/LOG.ready disappears, in which case it ends successfully. If it does this for more than 60 seconds it ends successfully but produces a WARNING. I fear that ending successfully despite not having archived all WALs will make this feature less worthwhile. If a DBA knows what he is doing, he can code a perfectly safe backup script using 8.2 too. He'll just have to check the current WAL position after pg_stop_backup() (there is a function for that, right?), and wait until the corresponding WAL file has been archived. In reality, however, I fear that most people will just create a script that does 'echo "select pg_stop_backup()" | psql' or something similar. If they're a bit more careful, they will enable ON_ERROR_STOP and check the return value of psql. I believe that those are the people who would really benefit from a pg_stop_backup() that waits for archiving to complete. But they probably won't check for WARNINGs. Maybe doing it the other way round would be an option? pg_stop_backup() could wait for the archiver to complete forever, but spit out a warning every 60 seconds or so: "WARNING: Still waiting for wal archiving of wal ??? to complete". If someone really wants a 60-second timeout, he can just use statement_timeout. 
Anyway, just my 0.02 eurocents - maybe I'm totally mistaken about the PostgreSQL DBAs out there... greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
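The "wait forever, warn periodically" behaviour proposed above can be sketched as a small polling loop. This is a hedged illustration only: the directory layout (pg_xlog/archive_status with *.ready markers) follows the convention mentioned in the thread, but the function and its parameters are invented for the example, not a PostgreSQL API.

```python
# Sketch: after pg_stop_backup(), poll the archive_status directory until
# no *.ready files remain (i.e. the archiver has caught up), emitting a
# warning periodically instead of giving up with a timeout.

import glob
import os
import time

def wait_for_archiver(archive_status_dir, warn_every=60, poll_interval=1,
                      clock=time.time, sleep=time.sleep):
    """Block until no *.ready files are left; warn every warn_every seconds.

    Returns the number of seconds spent waiting."""
    start = clock()
    last_warn = start
    while glob.glob(os.path.join(archive_status_dir, "*.ready")):
        now = clock()
        if now - last_warn >= warn_every:
            print("WARNING: still waiting for WAL archiving to complete")
            last_warn = now
        sleep(poll_interval)
    return clock() - start
```

A caller who really wants a hard timeout can wrap this in whatever timeout mechanism the environment provides, which mirrors the statement_timeout argument made above.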
[HACKERS] Fate of pgsnmpd
Hi Does anyone know if pgsnmpd is still actively developed? The last version (0.1b1) is about 15 months old. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] "select ('{}'::text[])[1]" returns NULL -- is it correct?
Nikolay Samokhvalov wrote: On 4/10/07, Tom Lane <[EMAIL PROTECTED]> wrote: "Nikolay Samokhvalov" <[EMAIL PROTECTED]> writes: > I remember several cases when people (e.g. me :-) ) were spending some > time trying to find an error in some pl/pgsql function, and the reason > lay in incorrect handling of arrays (i.e. messages like "index is out > of bounds" and "index cannot be negative number" would help, surely). Well, if indexes *couldn't* be negative numbers then that might be helpful, but they can. Ooops :-) OK, my proposal is narrowing to a very simple one: what about triggering WARNINGs when a user tries to access a nonexistent element of an array? Please don't ;-) There are two sane options - return an error, or return NULL. Both are sensible, and different programming languages make different choices. The only reason for a WARNING would be a long-term plan to change the existing behaviour. But this would cause lots of pain for no real gain, because no matter which behaviour you pick, there are always situations where the other would be more convenient. Just look at the mess PHP has created by altering fundamental aspects of the language (4.4 -> 5.0). greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
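The two "sane options" contrasted above can be shown side by side. This uses Python rather than SQL purely as an illustration; the helper names are invented for the example.

```python
# Two defensible behaviours for an out-of-range subscript:
# 1. raise an error (what many languages do for list indexing),
# 2. yield a "no value" result (analogous to SQL arrays yielding NULL).

def fetch_strict(arr, i):
    """Out-of-range access raises an error."""
    return arr[i]  # raises IndexError when i is out of range

def fetch_null(arr, i):
    """Out-of-range access yields None, akin to SQL's NULL behaviour."""
    return arr[i] if 0 <= i < len(arr) else None

assert fetch_null([10, 20], 5) is None   # silent "no value"
try:
    fetch_strict([10, 20], 5)            # hard error
except IndexError:
    pass
```

Either convention is workable; the pain comes from switching between them after code already depends on one, which is the point being made about PHP.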
Re: [HACKERS] [DOCS] uuid type not documented
Neil Conway wrote: On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote: The problem is that most of the standard methods are platform dependent, as they require MAC addresses or a "good" random source, for instance. http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php ISTM random() or similar sources are a sufficient PRNG for the purposes of UUID generation -- I can't see anything in the RFC that would contradict that. Maybe a short-term solution could be a UUID-generating function that takes some kind of seed as a parameter. People not concerned about collisions could just pass some random value, while others could use the MAC address of the client or something similar. greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
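The seed-parameter idea can be sketched as follows. This is an illustration of the concept only, not a proposed PostgreSQL API: it derives the random bytes of an RFC 4122 version-4 UUID from a caller-supplied seed mixed with a timestamp, so a caller worried about collisions can pass in, say, a MAC address.

```python
# Sketch: generate a version-4-shaped UUID whose bytes are derived from
# a caller-supplied seed plus the current time, instead of a system RNG.

import hashlib
import time
import uuid

def uuid_from_seed(seed: str) -> uuid.UUID:
    material = f"{seed}:{time.time_ns()}".encode()
    raw = bytearray(hashlib.sha256(material).digest()[:16])
    raw[6] = (raw[6] & 0x0F) | 0x40   # set version nibble to 4
    raw[8] = (raw[8] & 0x3F) | 0x80   # set RFC 4122 variant bits
    return uuid.UUID(bytes=bytes(raw))

u = uuid_from_seed("00:1b:63:84:45:e6")  # e.g. a client MAC address
assert u.version == 4
```

Whether a hash of seed+time counts as "good enough" randomness for a given application is exactly the judgment call the thread is about; the sketch only shows where a seed parameter would plug in.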
[HACKERS] What tools do people use to hack on PostgreSQL?
Hi I'm very excited that my project for implementing read-only queries on PITR slaves was accepted for GSoC, and I'm now trying to work out what tools I'll use for that job. I'd like to be able to create some sort of branches and tags for my own work (only inside my local repository, of course). I've considered using git, but I couldn't make the cvs->git gateway work - neither using the postgresql CVS repository directly, nor with a private copy obtained with CVSup. There is also svk, but I think I'd need an svn repo that mirrors the postgresql CVS for that to work. I think Joshua Drake created one once, but I don't know if it is kept up-to-date. What do you guys use for your development work? greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] What tools do people use to hack on PostgreSQL?
Joshua D. Drake wrote: Alexey Klyukin wrote: Alvaro Herrera wrote: But if you have a checked out tree, does it work to do an update after the tree has been regenerated? As far as I know, the repo is generated completely every few hours, so it wouldn't surprise me if the checked out copy were not compatible with the new repo. I admit I haven't tried. I have tried, and svn up worked without issues. As a note, we will be updating this to Subversion 1.4 shortly so people can do svnsync too. Do I read this correctly as "This repository will stay around for a while, and isn't just an experiment that might be stopped tomorrow"? If so, I'll try using it - and many thanks for providing it! greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Build-Problem with pgc.c on OSX 10.4
Hi When I try to build CVS HEAD on OSX 10.4, compiling src/interfaces/ecpg/preproc/preproc.c fails with:

In file included from preproc.y:6951:
pgc.l:3:20: error: config.h: No such file or directory
In file included from pgc.l:28, from preproc.y:6951:
preproc.h:996: error: conflicting types for 'base_yylloc'
y.tab.c:18673: error: previous declaration of 'base_yylloc' was here
In file included from preproc.y:6951:
pgc.l:43: error: 'MAX_PARSE_BUFFER' undeclared here (not in a function)

If I delete pgc.c, it is rebuilt automatically, and then preproc.c compiles just fine. I'm using gcc 4.0.1, flex 2.5.4 and bison 2.3. greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: When I try to build CVS HEAD on OSX 10.4, compiling src/interfaces/ecpg/preproc/preproc.c fails with: ... If I delete pgc.c, it is rebuilt automatically, and then preproc.c compiles just fine. ... I'm using gcc 4.0.1, flex 2.5.4 and bison 2.3 Perhaps you changed bison versions and didn't force a rebuild? Those line numbers don't seem to sync up with my copies of the derived files. I just realized that this file isn't even in the postgresql CVS repo. But it _is_ part of the SVN mirror at https://projects.commandprompt.com/public/pgsql/repo. The version that shows up in the trunk of the SVN repo is the revision 1.5 from CVS (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/ecpg/preproc/Attic/pgc.c?rev=1.5;content-type=text%2Fplain;hideattic=0) This is the same as https://projects.commandprompt.com/public/pgsql/repo/trunk/pgsql/src/interfaces/ecpg/preproc/pgc.c modulo the expansion of the $Header macro. Seems to be a bug in the CVS->SVN conversion process... greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4
Alvaro Herrera wrote: Ah, it seems the SVN repo just got its first user ;-) Congratulations. Ask Joshua to send you a Command Prompt tee shirt, maybe he is excited enough. I hope the fact that I use the SVN repo just to get the changes into git doesn't reduce my chances of getting that t-shirt ;-) greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4
Alvaro Herrera wrote: Florian G. Pflug wrote: Alvaro Herrera wrote: Ah, it seems the SVN repo just got its first user ;-) Congratulations. Ask Joshua to send you a Command Prompt tee shirt, maybe he is excited enough. I hope the fact that I use the SVN repo just to get the changes into git doesn't reduce my chances of getting that t-shirt ;-) Hum, why don't you just use the CVS directly then? That'd avoid this sort of infelicity. git-cvsimport didn't work for me - neither with the main CVS repo, nor with an rsync'ed copy. It complained about all sorts of problems - I don't have enough CVS knowhow to judge whether those were actual problems with the repo, or just deficiencies of git-cvsimport. Plus, I didn't find a way to import the current version of HEAD as one revision, and only go incrementally from there. It always wanted to mirror the whole history stored in the CVS in my git repo, which is overkill. For SVN, there is git-svn, which does just what I want - I started with some revision a few days ago, and it just incrementally imports updates from there into a special branch of my git repo, and doesn't care about what happened before that revision. It's all not perfect, but I think for me it works better than just doing my changes in a CVS checkout. greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] What tools do people use to hack on PostgreSQL?
Joshua D. Drake wrote: http://projects.commandprompt.com/public/pgsql/browser or do the anonymous checkout with: svn co http://projects.commandprompt.com/public/pgsql/repo/ But if you have a checked out tree, does it work to do an update after the tree has been regenerated? As far as I know, the repo is generated completely every few hours, so it wouldn't surprise me if the checked out copy were not compatible with the new repo. I admit I haven't tried. Not a clue. Anyone try it yet? git-svn seems to work fine against the SVN repo, apart from the problem with the files deleted in CVS which still show up in SVN. It's only been running for about two days, though... greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Hacking on PostgreSQL via GIT
Martin Langhoff wrote: Hi Florian, I am right now running an rsync of the Pg CVS repo to my work machine to get a git import underway. I'm rather keen on seeing your cool PITR Pg project go well, and I have some git+cvs fu I can apply here (being one of the git-cvsimport maintainers) ;-) Cool - I'm new to git, so I really appreciate any help that I can get. For the kind of work you'll be doing (writing patches that you'll want to be rebasing onto the latest HEAD for merging later) git is probably the best tool. That's what I use it for... tracking my experimental / custom branches of projects that use CVS or SVN :-) That's how I figured I'd work - though I don't yet understand what the advantage of "rebase" is over "merge". Currently, I've set up a git repo that pulls in the changes from the SVN repo and pushes them to my main soc git repo. On that main repo I have two branches, master and pgsql-head, and I call "cg-merge pgsql-head" if I want to merge with CVS HEAD. Initially, I'll post it on http://git.catalyst.net.nz/ and I can run a daily import for you - once that's in place you can probably get a repo with your work on http://repo.or.cz/ Having a git mirror of the pgsql CVS would be great. BTW, I've just checked out repo.or.cz, and noticed that there is already a git mirror of the pgsql CVS: http://repo.or.cz/w/PostgreSQL.git greetings + thanks Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Hacking on PostgreSQL via GIT
Aidan Van Dyk wrote: Martin Langhoff wrote: Well, now that more than one of us are working with git on PostgreSQL... I've had a repo conversion running for a while... I've only got it to what I consider "stable" last week: http://repo.or.cz/w/PostgreSQL.git git://repo.or.cz/PostgreSQL.git Ah - that's what I just stumbled over ;-) For those interested in the conversion process, I've used a slightly modified version of fromcvs (a ruby cvs-to-git/Hg tool), and it runs on all of pgsql in about 20 minutes. I gave up on git-svn (because of both speed and my inability to easily "filter" out Keywords, etc.) and git-cvsimport (because cvsps doesn't seem to like pgsql's repo). Yeah, git-cvsimport didn't work for me either... I "update" the git repo daily, based on an anonymous rsync of the cvsroot. If the anon-rsync is updated much more frequently, and people think my git conversion should match it, I have no problem having cron run it more than daily. Also - note that I give *no* guarantees of its integrity, etc. I've "diffed" a CVS checkout and a git checkout, and they are *almost* identical. Almost, because it seems like my git repository currently has 3 files that a cvs checkout doesn't: backend/parser/gram.c |12088 +++ interfaces/ecpg/preproc/pgc.c | 2887 ++ interfaces/ecpg/preproc/preproc.c |16988 ++ And at this point, I haven't been bothered to see where those files came from (and where they disappear) in CVS and why my import isn't picking that up... I could probably be pushed if others find this repo really useful, but those files problematic... That's interesting - the SVN mirror of the pgsql CVS at http://projects.commandprompt.com/public/pgsql/browser has exactly the same problem with those 3 files, as I found out the hard way ;-) In the case of pgc.c, I've compared the revisions in CVS with the one in SVN. The SVN trunk includes CVS revision 1.5 of this file, which seems to be the last version of that file in CVS HEAD. 
Interestingly, http://developer.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/ecpg/preproc/Attic/pgc.c shows no trace of the file being deleted from HEAD either - it just shows that it was removed from WIN32_DEV. But still a CVS checkout doesn't include that file... Since 3 tools (cvsweb, git-cvsimport and whatever commandprompt uses to create the SVN mirror) all come to the same conclusion regarding this file, I think that this is caused by some corruption of the CVS repository - but I don't have the cvs-fu to debug this... greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Re: IDENTITY/GENERATED v36 Re: [PATCHES] Final version of IDENTITY/GENERATED patch
Zoltan Boszormenyi wrote: Tom Lane wrote: Zoltan Boszormenyi <[EMAIL PROTECTED]> writes: Also, the current grammar is made to give a syntax error if you say "colname type GENERATED BY DEFAULT AS ( expr )". But it makes the grammar unbalanced, and gives me: bison -y -d gram.y conflicts: 2 shift/reduce It's been quite a while since I last used bison, but as far as I remember, you can tell it to write a rather detailed log of its analysis of the grammar. That log should include more detailed information about those conflicts - maybe that helps to figure out their exact cause, and to find a workaround. greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Hacking on PostgreSQL via GIT
Tom Lane wrote: Alvaro Herrera <[EMAIL PROTECTED]> writes: These files are generated (from gram.y, pgc.l and preproc.y respectively) and are not present in the CVS repo, though I think they have been at some point. It's strange that other generated files (that have also been in the repo in the past) like preproc.h are not showing up. The weird thing about these files is that the CVS history shows commits on HEAD later than the file removal commit. I don't recall if Vadim unintentionally re-added the files before making those commits ... but if he did, you'd think it'd have taken another explicit removal to get rid of them in HEAD. More likely, there was some problem in his local tree that allowed a "cvs commit" to think it should update the repository with copies of the derived files he happened to have. I think this is a corner case that CVS handles in a particular way and the tools people are using to read the repository handle in a different way. Which would be a bug in those tools, since CVS's interpretation must be right by definition. The question is whether it'd be acceptable to manually remove that last commit from the repository. I guess simply re-adding and then removing the files again should do the trick, though it'd be cleaner to remove the offending commit in the first place. Should postgres ever decide to switch to another version control system (which I don't advocate), that'd be one obstacle less to deal with... Or is the risk of causing breakage too high? greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] RESET command seems pretty disjointed now
Tom Lane wrote: Mark Kirkwood <[EMAIL PROTECTED]> writes: Tom Lane wrote: The current documentation for RESET exhibits a certain lack of, um, intellectual cohesiveness:

Synopsis
RESET configuration_parameter
RESET ALL
RESET { PLANS | SESSION | TEMP | TEMPORARY }

Maybe DISCARD for the plans etc. might be more intuitive than extending RESET? DISCARD PLANS and DISCARD TEMP seem pretty reasonable, but DISCARD SESSION sounds a bit odd --- it seems like it might mean "disconnect", which of course is exactly what we're trying to avoid. But possibly we could rename RESET SESSION as DISCARD ALL. Leastwise I haven't got any better ideas. Anyone have another proposal? What about

RESET parameter
RESET { PLANS | TEMP | TEMPORARY }
RESET ALL { PARAMETERS | STATE }

RESET ALL would become an abbreviation of RESET ALL PARAMETERS (for backwards compatibility), while RESET SESSION would be renamed to RESET ALL STATE. greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Hacking on PostgreSQL via GIT
Martin Langhoff wrote: Aidan Van Dyk wrote: And remember the warning I gave that my conversion is *not* a direct CVS import - I intentionally *unexpand* all Keywords before stuffing them into GIT so that merging and branching can ignore all the Keyword conflicts... My import is unexpanding those as well, to support rebasing and merging better. So - if you are committed to providing your gateway long term to Florian, I'm happy to drop my gateway in favour of yours. There seem to be other people besides me who are interested in a git mirror. Maybe we could declare one of those mirrors the "official" one - I guess things would be easier if all people interested in using git used the same mirror... What do you guys think? (Florian, before basing your code on either you should get a checkout of Aidan's and mine and check that the tips of the branches you are working on match the cvs branches -- the cvsimport code is good but wherever CVS is involved, there's a lot of interpretation at play, a sanity check is always good). I actually hoped that I could just take my current git repo and rebase my branch onto one of those two repos - or does rebase only work from an ancestor to a descendant? greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync
Hi I believe I have discovered the following problem in pgsql 8.2 and HEAD, concerning warm standbys using WAL log shipping. The problem is that after a crash, the master might complete incomplete actions via rm_cleanup() - but since it won't WAL-log those changes, the slave won't know about them. This will at least prevent the creation of any further restart points on the slave (because safe_restartpoint() will never return true again) - it might even cause data corruption, if subsequent WAL records are interpreted wrongly by the slave because it sees different data than the master did when it generated them. Attached is a patch that lets RecoveryRestartPoint call all rm_cleanup() methods and create a restart point whenever it encounters a shutdown checkpoint in the WAL (because those are generated after recovery). This ought not to cause a performance degradation, because shutdown checkpoints occur very infrequently. The patch is per discussion with Simon Riggs. I've not yet had a chance to test this patch, I only made sure that it compiles. I'm sending this out now because I hope this might make it into 8.2.4. greetings, Florian Pflug

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
 	 * Perform a checkpoint to update all our recovery activity to disk.
 	 *
 	 * Note that we write a shutdown checkpoint rather than an on-line
-	 * one. This is not particularly critical, but since we may be
-	 * assigning a new TLI, using a shutdown checkpoint allows us to have
-	 * the rule that TLI only changes in shutdown checkpoints, which
-	 * allows some extra error checking in xlog_redo.
+	 * one. A slave will always create a restart point if it sees a
+	 * shutdown checkpoint, and will call all rm_cleanup() methods before
+	 * it does so. This guarantees that any actions taken by the master
+	 * in rm_cleanup will also be carried out on the slave.
+	 * Additionally, we may be assigning a new TLI, so using a shutdown
+	 * checkpoint allows us to have the rule that TLI only changes in shutdown
+	 * checkpoints, which allows some extra error checking in xlog_redo.
 	 */
 	CreateCheckPoint(true, true);

@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
  * restartpoint is needed or not.
  */
 static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
 {
 	int			elapsed_secs;
 	int			rmid;

 	/*
-	 * Do nothing if the elapsed time since the last restartpoint is less than
-	 * half of checkpoint_timeout. (We use a value less than
-	 * checkpoint_timeout so that variations in the timing of checkpoints on
-	 * the master, or speed of transmission of WAL segments to a slave, won't
-	 * make the slave skip a restartpoint once it's synced with the master.)
-	 * Checking true elapsed time keeps us from doing restartpoints too often
-	 * while rapidly scanning large amounts of WAL.
+	 * If the checkpoint we saw in the wal was a shutdown checkpoint, it might
+	 * have been written after the recovery following a crash of the master.
+	 * In that case, the master will have completed any actions that were
+	 * incomplete when it crashed *during recovery*, and these completions
+	 * are therefore *not* logged in the wal.
+	 * To prevent getting out of sync, we follow what the master did, and
+	 * call the rm_cleanup() methods. To be on the safe side, we then create
+	 * a RestartPoint, regardless of the time elapsed. Note that asking
+	 * the resource managers if they have partial state would be redundant
+	 * after calling rm_cleanup().
 	 */
-	elapsed_secs = time(NULL) - ControlFile->time;
-	if (elapsed_secs < CheckPointTimeout / 2)
-		return;
+	if (shutdownCheckpoint) {
+		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+		{
+			if (RmgrTable[rmid].rm_cleanup != NULL)
+				RmgrTable[rmid].rm_cleanup();
+		}
+	}
+	else {
+		/*
+		 * Do nothing if the elapsed time since the last restartpoint is less than
+		 * half of checkpoint_timeout. (We use a value less than
+		 * checkpoint_timeout so that variations in the timing of checkpoints on
+		 * the master, or speed of transmission of WAL segments to a slave, won't
+		 * make the slave skip a restartpoint once it's synced with the master.)
+		 * Checking true elapsed time keeps us from doing restartpoints too often
+		 * while rapidly scanning large amounts of WAL.
+		 */
+		elapsed_secs = time(NULL) - ControlFile->time;
+		if (elapsed_secs < CheckPointTimeout / 2)
+			return;

-	/*
-	 * Is it safe to checkpoint? We must ask each of the resource managers
-	 * whether they have any partial state information that might prevent a
-	 * correct restart from this point. If so, we skip this opportunity, but
-	 * return at the next checkpoint record for another try.
-	 */
-	for (rmid = 0; rmid <= RM
Re: [HACKERS] [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Simon Riggs wrote: On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote: The problem is that after a crash, the master might complete incomplete actions via rm_cleanup() - but since it won't WAL-log those changes, the slave won't know about them. This will at least prevent the creation of any further restart points on the slave (because safe_restartpoint() will never return true again) - it might even cause data corruption, if subsequent WAL records are interpreted wrongly by the slave because it sees different data than the master did when it generated them. I agree the problem exists. It is somewhat rare because the idea is that if the Primary crashes we would failover to the Standby, which would mean that both Primary and Standby have executed rm_cleanup(), if needed. So in the case where the Primary fails and we choose *not* to failover, there is a potential difficulty on the Standby. It occurred to me today that this might plague 8.1 too. As you explain below, the problem is not really connected to restartpoints - failing to create them is merely a symptom of this. On 8.1, this might still lead to rm_cleanup() being called much "later" (if you consider the WAL position to be the "time") on the slave than on the master. I dunno if this really causes trouble - I don't yet understand the btree code well enough to judge this. The rationale for this fix could be described somewhat differently: When we shutdown, we know for certain that safe_restartpoint() is true. However, we don't know whether it is true because we successfully did a clean shutdown, or because we crashed, recovered and then issued a shutdown checkpoint following recovery. In the latter case we *must* execute rm_cleanup() on the standby because it has been executed on the primary. Not doing so at this point *might* be safe, but is not certain to be safe. We don't *need* to log a restartpoint at this time, but it seems sensible to do so. 
While creating the patch, I wondered whether it might be worthwhile to note in the ShutdownCheckpoint that we just did recovery (or to create a new checkpoint type, RecoveryCheckpoint). This would allow for more error checking, because the slave could then check that safe_restartpoint() is true for all ShutdownCheckpoints that do not follow a recovery. We need to check that rm_cleanup() routines don't assume that they will only ever be called once, or this will clearly fail. There is also no need to call rm_cleanup() unless rm_safe_restartpoint() is false. But a non-idempotent rm_cleanup() routine will cause trouble anyway, if postgres crashes after having called rm_cleanup() but before creating the ShutdownCheckpoint. greetings, Florian Pflug ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Hacking on PostgreSQL via GIT
Martin Langhoff wrote: So - if you are committed to providing your gateway long term to Florian, I'm happy to drop my gateway in favour of yours. (Florian, before basing your code on either you should get a checkout of Aidan's and mine and check that the tips of the branches you are working on match the cvs branches -- the cvsimport code is good but wherever CVS is involved, there's a lot of interpretation at play, a sanity check is always good). Sorry for responding so late - I was rather busy during the last 1 1/2 weeks with university stuff, and had only very little time to spend on SoC. I've tried to switch my repo to both git mirrors, but something strange seems to be happening. The checkout pulls a _lot_ of objects (a few hundred thousand), and then takes ages to unpack them all, bloating my local repository (just rm-ing my local repo takes a few minutes after the checkout). It seems as if git pulls all revisions of all files during the pull - which it shouldn't do, as far as I understand things - it should only pull those objects referenced by some head, no? The interesting thing is that exactly the same problem occurs with both of your mirrors... Any ideas? Or is this just how things are supposed to work? greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Hacking on PostgreSQL via GIT
Aidan Van Dyk wrote: * Florian G. Pflug <[EMAIL PROTECTED]> [070430 08:58]: It seems as if git pulls all revisions of all files during the pull - which it shouldn't do as far as I understand things - it should only pull those objects referenced by some head, no? Git pulls full history up to a common ancestor on the clone/pull. So the first pull into a repo *will* necessarily pull in the full object history. So unless you have a recent common ancestor, it will pull lots. Note that because git uses crypto hashes to identify objects, my conversion and Martin's probably do not have a recent common ancestor (because my header munging probably doesn't match Martin's exactly). Ah, OK - that explains things. The interesting thing is that exactly the same problem occurs with both of your mirrors... Any ideas? Or is this just how things are supposed to work? Until you have a local repository of it, you'll need to go through the full pull/clone. If you're really not interested in history you can "truncate" history with the --depth option to git clone. That will give you a "shallow repository", which you can use, develop, branch, etc. in, but won't give you all the history locally. I'll retry with the "--depth" option - I'm doing development on my powerbook, and OSX seems to cope badly with lots of little files - the initial unpacking took hours, literally... Also - what version of GIT are you using? I *really* recommend using at least 1.5 (1.5.2.X is current stable). Please, do yourself a favour, and don't use 1.4.4. I'm using 1.5.0 currently - it was the latest stable release when I began to experiment with git. greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Heap page diagnostic functions
Zdenek Kotala wrote: I did not find "forensics" in my translator; it is mentioned in the Oxford dictionary, but the explanation is not clear to me. I agree with Bruce that it is not a good name. What about a short form of diagnostic, "diag"? Doesn't forensics basically mean finding the cause of something *after* it happened, based on traces that the event left behind? Like finding the culprit of a crime from, for example, the fingerprints he left, or tracing the actions of an intruder by analyzing logfiles or modified binaries? In that case, it doesn't accurately describe those functions anyway, I think, because you call them from inside the database while it's running, not from the outside after it was stopped or crashed. Just the 2 eurocents of a non-native speaker... Greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Concurrently updating an updatable view
Richard Huxton wrote: Richard Huxton wrote: Heikki Linnakangas wrote: The problem is that the new tuple version is checked only against the condition in the update rule, id=OLD.id, but not the condition in the original update clause, dt='a'. Yeah, that's confusing :(. Bit more than just normal rule confusion I'd say. Try the following two statements in parallel (assuming you've just run the previous): UPDATE test SET dt='c'; UPDATE test SET dt='x' FROM test t2 WHERE test.id=t2.id AND t2.dt='b'; This isn't a problem with the view mechanism - it's a problem with re-checking clauses involving subqueries or joins I'd guess. I'm trying to decide if it's unexpected or just plain wrong, and I think I'd have to argue wrong. Or perhaps I'd not argue that :-/ Well, src/backend/executor/README agrees with you that it's wrong: "Note a fundamental bogosity of this approach: if the relation containing the original tuple is being used in a self-join, the other instance(s) of the relation will be treated as still containing the original tuple, whereas logical consistency would demand that the modified tuple appear in them too. But we'd have to actually substitute the modified tuple for the original, while still returning all the rest of the relation, to ensure consistent answers. Implementing this correctly is a task for future work." This is really about MVCC in read committed mode, and the "just right for simpler cases": http://www.postgresql.org/docs/8.2/interactive/transaction-iso.html#XACT-READ-COMMITTED Clearly there needs to be a change to the sentence: "Because of the above rule, it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands that affected the same rows it is trying to update" Not true if there's a subquery/join involved. If the cited part of the README is correct, then all joins and subqueries are fine, except if they refer to the table being updated. 
I think there should be a big, fat warning that self-referential updates have highly non-obvious behaviour in read-committed mode, and should be avoided. greetings, Florian Pflug ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Concurrently updating an updatable view
Richard Huxton wrote: Hiroshi Inoue wrote: Florian G. Pflug wrote: I think there should be a big, fat warning that self-referential updates have highly non-obvious behaviour in read-committed mode, and should be avoided. It seems pretty difficult for the PostgreSQL rule system to avoid such kinds of updates. I'm suspicious whether UPDATABLE VIEWS can be implemented using the rule system. Remember this affects all self-referential joins on an UPDATE (and DELETE?), not just views. It's just that a rule is more likely to produce that type of query. Is there consensus what the correct behaviour should be for self-referential updates in read-committed mode? Does the SQL Spec have anything to say about this? greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Not ready for 8.3
Andrew Dunstan wrote: What would making a branch actually do for you? The only advantage I can see is that it will give you a way of checkpointing your files. As I remarked upthread, I occasionally use RCS for that. But mostly I don't actually bother. I don't see how you can do it reasonably off a local cvs mirror - rsync will just blow away any changes you have checked in next time you sync with the master. I don't think we can make CVS behave like a distributed SCM system, and the ability to create local branches seems to me one of the fundamental points of such systems. If that's what the demand is for, then we should look again at moving to something like Mercurial. I think the great thing about DVCS systems is that not everybody necessarily needs to use the *same* system. And it doesn't really matter what the central repository runs on - there are gateways from/to nearly everything available... I currently use GIT for my SoC project, and it works quite well - I can create an arbitrary number of local branches, and syncing the currently active branch with CVS is achieved by just doing "cg-update pgsql-head". greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: Do we need a TODO? (was Re: [HACKERS] Concurrently updating anupdatable view)
Simon Riggs wrote: On Mon, 2007-05-28 at 19:56 -0400, Bruce Momjian wrote: Added to TODO: * Fix self-referential UPDATEs seeing inconsistent row versions in read-committed mode http://archives.postgresql.org/pgsql-hackers/2007-05/msg00507.php I'm sorry guys but I don't agree this is a TODO item. Maybe the suggested TODO has too narrow a focus, but I think that *something* has to be done about this. IMHO this follows documented behaviour, even if y'all are shocked. Yes, but documented != sensible && documented != intuitive && documented != logical. If you don't want the example cases to fail you can - use SERIALIZABLE mode to throw an error if inconsistency is detected - use SELECT FOR SHARE to lock the rows in the subselect e.g. UPDATE foo SET pkcol = 'x' WHERE pkcol IN (SELECT pkcol FROM foo FOR SHARE); In the case of concurrent UPDATEs the second UPDATE will normally perform the subSELECT then hang waiting to perform the UPDATE. If you use FOR SHARE the query will hang on the subSELECT (i.e. slightly earlier), which makes the second query return zero rows, as some of you were expecting. Sure, but with a similar argument you could question the whole update-in-read-committed-mode logic. After all, you wouldn't need that logic if you always obtained a share lock on the rows to be updated *before* you started updating them. Maybe we need a way of specifying that the non-UPDATE relation should be locked FOR SHARE in a self-referencing UPDATE? Though that syntax could look pretty weird from here, so I'd say cover this situation in a code example and be done. Also, methinks we should have agreed behaviour before we make something a TODO item. That would help us uncover this type of thing in more detail, or at least force TODO to read "investigate whether ...". Ack. That's why I initially asked if there was consensus on what the correct behaviour is. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] recovery_target_xid & crashes on the master
Hi I'm currently working on splitting StartupXLog into smaller parts, because I need to reuse some of the parts for concurrent wal recovery (for my GSoC project). The function recoveryStopsHere in xlog.c checks if we should stop recovery due to the values of recovery_target_xid and recovery_target_time. For recovery_target_xid, we stop if we see a commit or abort record for the given xid. Now I wonder what happens if an (admittedly rather confused) DBA uses an xid of a transaction that was aborted because of a crash of the master as recovery_target_xid. The way I read the code, postgres will just recover until it reaches the end of the xlog in that case, because neither a COMMIT nor an ABORT for that xid exists in the WAL. I'm not sure if this is worth fixing - it seems like a rather contrived corner case - but I thought I'd bring it up... greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
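To make the corner case concrete, here is a toy model of the recovery-target check (names and record kinds are simplified assumptions, not the actual xlog.c code): if the target transaction was cut short by a crash on the master, neither a commit nor an abort record exists, so the scan falls off the end of the log without ever stopping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef enum { REC_HEAPOP, REC_COMMIT, REC_ABORT } RecKind;
typedef struct { RecKind kind; TransactionId xid; } WalRecord;

/* Scan the log; return true (and set *stop_at) if a commit/abort record
 * for target_xid is found.  If the transaction never committed or aborted
 * (e.g. the master crashed), we recover to the end of the xlog instead. */
static bool
recovery_stops_here(const WalRecord *log, size_t n,
                    TransactionId target_xid, size_t *stop_at)
{
    for (size_t i = 0; i < n; i++)
    {
        if ((log[i].kind == REC_COMMIT || log[i].kind == REC_ABORT) &&
            log[i].xid == target_xid)
        {
            *stop_at = i;
            return true;
        }
    }
    return false;   /* fell off the end: full recovery, no stop point */
}
```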
[HACKERS] [RFC] GSoC Work on readonly queries done so far
Hi This is a description of the GSoC work I've done so far on readonly support for PITR slaves. I'm looking for any kind of comments on this - I want to make sure that I work in a direction that the community approves. Work done so far: - .) Added a new GUC operational_mode, which can be set to either readwrite or readonly. If it is set to readwrite (the default), postgres behaves as usual. All the following changes are only in effect if operational_mode is set to readonly. .) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR) if postgres is not in readwrite mode. This macro protects the following functions to make sure that no writes occur in readonly mode. SimpleLruWritePage, SLruPhysicalWritePage EndPrepare, FinishPreparedTransaction XLogInsert, XLogWrite, ShutdownXLog CreateCheckpoint MarkBufferDirty. .) All transactions are set to readonly mode (An implicit SET TRANSACTION READONLY), and are not allowed to do SET TRANSACTION READWRITE. .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay is started, and it takes over the role that bgwriter plays in the shutdown process. .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be "later" than any other xid. .) A global ReadOnlySnapshot is maintained in shared memory. This is copied into backend local memory by GetReadonlySnapshotData (which replaces GetSnapshotData in readonly mode). .) Crash recovery is not performed in readonly mode - instead, postgres PANICs, and tells the DBA to restart in readwrite mode. Archive recovery of course *will* be allowed, but I'm not that far yet. Open Problems: -- .) Protecting MarkBufferDirty with ASSUME_OPMODE_READWRITE is troublesome, because callers usually call MarkBufferDirty from within a critical section, and thus elog(ERROR) is turned into elog(PANIC). This e.g. happens with my patch if you call nextval() in readonly mode. 
Does anyone see a better solution than adding checks into all callers that are not otherwise protected from being called in readonly mode? .) Since the slave needs to track a snapshot in shared memory, it cannot resize that snapshot to accommodate however many concurrent transactions might have been running on the master. My current plan is to detect if that global snapshot overflows, and to lock out readonly queries on the slave (and therefore remove the need of keeping the snapshot current) until the number of active xids on the master has dropped below max_connections on the slave. A warning will be written to the postgres log that suggests that the DBA increase the max_connections value on the slave. Please tell me what you think about this approach, and especially if you see any problems that I overlooked. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
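The write guard described in the proposal can be sketched roughly as follows. This is a toy model, not the actual patch: the real ASSUME_OPMODE_READWRITE raises elog(ERROR), which longjmps out of the caller; the sketch records the violation in a flag instead so that it stays self-contained. The variable and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

static bool OperationalModeReadWrite = true;   /* set from the GUC at startup */
static bool write_in_readonly_mode   = false;  /* stand-in for the elog(ERROR) path */

/* Guard placed at the top of every function that writes to disk or WAL */
#define ASSUME_OPMODE_READWRITE() \
    do { \
        if (!OperationalModeReadWrite) \
            write_in_readonly_mode = true; /* real code: elog(ERROR, ...) */ \
    } while (0)

/* A guarded function, in the spirit of XLogInsert / MarkBufferDirty */
static void
xlog_insert_stub(void)
{
    ASSUME_OPMODE_READWRITE();
    /* ... would append a WAL record here ... */
}
```

Note that this illustrates the critical-section problem from the mail: if the guard fires inside a critical section, the error is promoted to a PANIC, which is why MarkBufferDirty is awkward to protect this way.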
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Matthew T. O'Connor wrote: Florian G. Pflug wrote: Work done so far: - .) Don't start autovacuum and bgwriter. Do table stats used by the planner get replicated on a PITR slave? I assume so, but if not, you would need autovac to do analyzes. Yes - everything that gets wal-logged on the master gets replicated to the slave. In my design, it isn't possible to do "analyze" on the slave, because all datafiles are strictly readonly (well, with the small exception of hint-bit updates actually). greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Jeff Davis wrote: On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote: .) Since the slave needs to track a snapshot in shared memory, it cannot resize that snapshot to accommodate however many concurrent transactions might have been running on the master. My current plan is to detect if that global snapshot overflows, and to lock out readonly queries on the slave (and therefore remove the need of keeping the snapshot current) until the number of active xids on the master has dropped below max_connections on the slave. A warning will be written to the postgres log that suggests that the DBA increase the max_connections value on the slave. If we did lock the slave while waiting for transactions to complete on the master, we'd need to document some stronger warnings against idle transactions so that administrators could notice and correct the problem. It's not exactly locking until it completes on the master, it's locking the slave until we reach a position in the wal on the slave with less than max_connections concurrent transactions. But yes, I agree, this will need to be documented. Are you referring to the size of the xip array being a problem? Would it help to tie the size of the xip array to max_connections? I understand that max_connections might be greater on the master, but maybe something similar? That's what I currently do - the xip array on the slave is sized to hold max_connections entries (Actually, it's max_connections + max_prepared_xacts I think). The problem occurs exactly if those values are set too small on the slave - and since shared mem objects are not resizeable, I don't see how the slave can handle an xip overflow gracefully other than by not publishing the information in shared memory as long as it doesn't fit there. 
Thinking about it further - maybe locking out transactions isn't even necessary - they would just continue to see the old global snapshot, so time wouldn't advance for them until the number of concurrent transactions decreases again. greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
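The overflow rule being discussed amounts to a simple capacity check; here is a minimal sketch under stated assumptions (the struct and names are hypothetical, and the real xip array lives in shared memory and is sized to max_connections + max_prepared_xacts on the slave):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the slave's shared-snapshot capacity check: the snapshot
 * can publish at most slave_slots xids.  If the master had more concurrent
 * transactions than that, the snapshot is marked overflowed and readonly
 * queries are locked out (or, per the follow-up idea, keep seeing the old
 * snapshot) until the count fits again. */
typedef struct
{
    int  slave_slots;        /* capacity of the shared xip array */
    int  active_on_master;   /* xids running on the master at this WAL position */
    bool overflowed;
} SharedSnapshotState;

static bool
snapshot_usable(SharedSnapshotState *s)
{
    /* overflow as soon as the master's transactions no longer fit ... */
    s->overflowed = (s->active_on_master > s->slave_slots);
    /* ... and the published snapshot is usable only while they do */
    return !s->overflowed;
}
```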
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Simon Riggs wrote: On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote: .) Added a new GUC operational_mode, which can be set to either readwrite or readonly. If it is set to readwrite (the default), postgres behaves as usual. All the following changes are only in effect if operational_mode is set to readonly. Do we need this? We are already InArchiveRecovery. If I understand you correctly, you suggest that readonly queries are always allowed during archive recovery - so upon startup postgres steps through these states: .) Initial recovery (Until we reach a consistent state) .) Allow readonly queries .) Finish recovery in the background (might mean recovering "forever" on a PITR slave) .) Allow readwrite queries My plan was to have a global switch, which lets you choose between .) All queries are readonly (Until the next postmaster restart at least), but you get background replay .) No background replay, but once replay is done, readwrite queries can be executed (Just what PG does now). The main reason why I invented that global switch operational_mode was to remove the need to switch between readonly mode and readwrite mode on the fly. .) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR) if postgres is not in readwrite mode. This macro protects the following functions to make sure that no writes occur in readonly mode. SimpleLruWritePage, SLruPhysicalWritePage EndPrepare, FinishPreparedTransaction XLogInsert, XLogWrite, ShutdownXLog CreateCheckpoint MarkBufferDirty. These are Asserts? The macro ASSUME_OPMODE_READWRITE just does if (!OperationalModeReadWrite) elog(ERROR, ...) .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay is started, and it takes over the role that bgwriter plays in the shutdown process. Autovacuum -> understood. What does bgreplay do? Why not just start bgwriter earlier and disable some of its other functionality while InRecovery? See above - it seemed simpler to clearly separate .) 
Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be "later" than any other xid. So you are bumping FirstNormalTransactionId up by one for this? In fact I changed MaxTransactionId to 0xFFFFFFFE, and set ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed IsNormalTransactionId to test not only for >= FirstNormalTransactionId, but also for <= MaxTransactionId. You're assuming then that we will "freeze" replay while we run a query? No. My plan is to first get to a point where replay freezes while queries are running, and to then figure out a more intelligent way to do this. I already have a few ideas how to do this, but I want to complete the "simple version" before I start with that work. Otherwise doing this will mean the snapshot changes as a query executes. Why? It's only the xid of the transaction, not its xmin and xmax, that is set to ReadOnlyTransactionId. .) A global ReadOnlySnapshot is maintained in shared memory. This is copied into backend local memory by GetReadonlySnapshotData (which replaces GetSnapshotData in readonly mode). .) Crash recovery is not performed in readonly mode - instead, postgres PANICs, and tells the DBA to restart in readwrite mode. Archive recovery of course *will* be allowed, but I'm not that far yet. This is the very heart of the matter. This isn't just a technical issue, it goes to the heart of the use case for this feature. Can we recover while running queries? Yes. My comment applies only to crash recovery - i.e., recovery that happens *without* a recovery.conf present, after a crash. It only really matters if you do the following .) Start pg in readwrite mode. .) Kill it / It crashes .) Restart in readonly mode. The main difference between crash recovery, and recovery from a filesystem-level backup, is the additional information that the backup label gives us in the second case - more specifically, the minRecoveryLoc that we read from the backup label. 
Only with that knowledge is "recovering until we reach a consistent state" a well-defined operation. And readonly queries can only be executed *after* we did this "minimal recovery". So if there is crash recovery to be done, the best we could do is to recover, and then start in readonly mode. If this is *really* what the DBA wants, he can just start in readwrite mode first, then cleanly shut PG down, and restart in readonly mode. > If not, how much time will we spend in replay mode v query mode? Will we be able to run long running queries *and* maintain a reasonable time to recover? Is this a mechanism for providing HA and additional query capacity, or is it just a mechanism for additional query capacity only? Those are open questions to which I don't have any answer
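The xid-range change discussed above (reserving the very top of the 32-bit xid space for ReadOnlyTransactionId) can be sketched like this. FirstNormalTransactionId is 3 in PostgreSQL; the lowered MaxTransactionId and the two-sided range check follow the description in the mail, and the lowercase function name is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)
/* MaxTransactionId lowered by one to make room for the dummy xid at the top */
#define MaxTransactionId         ((TransactionId) 0xFFFFFFFE)
#define ReadOnlyTransactionId    ((TransactionId) 0xFFFFFFFF)

/* IsNormalTransactionId now has to check both ends of the range, so that
 * the dummy readonly xid is never mistaken for a normal transaction. */
static bool
is_normal_transaction_id(TransactionId xid)
{
    return xid >= FirstNormalTransactionId && xid <= MaxTransactionId;
}
```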
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Heikki Linnakangas wrote: Florian G. Pflug wrote: Jeff Davis wrote: Are you referring to the size of the xip array being a problem? Would it help to tie the size of the xip array to max_connections? I understand that max_connections might be greater on the master, but maybe something similar? That's what I currently do - the xip array on the slave is sized to hold max_connections entries (Actually, it's max_connections + max_prepared_xacts I think). The problem occurs exactly if those values are set too small on the slave - and since shared mem objects are not resizeable, I don't see how the slave can handle an xip overflow gracefully other than by not publishing the information in shared memory as long as it doesn't fit there. You could store the value of max_connections in the checkpoint xlog record, and read it from there in the slave. Though one could still change it on the master and restart without restarting the slave as well. But AFAIK shmem allocation happens before recovery starts... Even if this was solved, it would only be a partial solution since, as you note, the master might be restarted while the slave keeps running. So I think it's better not to add too much complexity, and just tell the DBA to increase max_connections on the slave, together with a comment in the documentation never to set max_connections smaller on the slave than on the master. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far
Jeff Davis wrote: On Wed, 2007-06-06 at 22:36 +0100, Simon Riggs wrote: .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that is considered to be "later" than any other xid. So you are bumping FirstNormalTransactionId up by one for this? You're assuming then that we will "freeze" replay while we run a query? Otherwise doing this will mean the snapshot changes as a query executes. Is it possible to put a normal xmax for the snapshot? It wouldn't be a real transaction on the slave, and also the master will use that ID for a real transaction itself. However, I don't see a real problem on the slave because it would only be used for the purpose of the snapshot we need at that moment. My plan is the following: .) Initially, queries and recovery will run interleaved, but not concurrently. For that, an "empty" snapshot is sufficient, with xmin=xid=xmax=ReadOnlyTransactionId. .) Then, I'll work on running them concurrently. The replay process will publish a "current" snapshot in shared memory, using "real" xmin and xmax values it generates by maintaining a list of currently active (as in: running when the wal was written on the master) transactions. In that case, only xid is set to ReadOnlyTransactionId. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
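The "empty" snapshot from the first step of that plan has a simple consequence worth spelling out: with xmin = xid = xmax = ReadOnlyTransactionId, every normal xid lies below xmin, so every committed tuple version is visible and no in-progress list is needed. A toy visibility check under those assumptions (ignoring the xip array and subtransactions entirely, and not the real tqual.c logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define ReadOnlyTransactionId ((TransactionId) 0xFFFFFFFF)

typedef struct { TransactionId xmin, xid, xmax; } Snapshot;

/* Visibility of a tuple written by a *committed* transaction `xid` under
 * `snap`.  With the empty readonly snapshot the middle range [xmin, xmax)
 * is empty, so the xip lookup (elided here) never happens. */
static bool
committed_xid_visible(const Snapshot *snap, TransactionId xid)
{
    if (xid >= snap->xmax) return false;  /* started at/after the snapshot */
    if (xid < snap->xmin)  return true;   /* finished before the snapshot */
    return false;                         /* real code: consult the xip array */
}
```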
Re: [HACKERS] Is this a feature?
Joshua D. Drake wrote: Take the following: INFO: analyzing "pg_catalog.pg_authid" INFO: "pg_authid": scanned 1 of 1 pages, containing 5 live rows and 0 dead rows; 5 rows in sample, 5 estimated total rows The above is completely redundant. Why not just say: INFO: "pg_authid": scanned 1 of 1 pages, containing 5 live rows and 0 dead rows; 5 rows in sample, 5 estimated total rows If the first line is meant to be an indicator, then make the above line do this: INFO: analyzing "pg_catalog.pg_authid" : Don't add a new line, and when the next step of information comes up append it to the existing line to get: INFO: analyzing "pg_catalog.pg_authid": scanned 1 of 1 pages, containing 5 live rows and 0 dead rows; 5 rows in sample, 5 estimated total rows But then the line could only be pushed to the client *after* the analysis of the table has finished, while with the current output you know what postgres is currently doing, because you get "analyzing ..." *before* the operation starts. greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Selecting a constant question
Dann Corbit wrote: -Original Message- From: Hannu Krosing [mailto:[EMAIL PROTECTED] Since libpq function PQfsize returns -2 for all constant character strings in SQL statements ... What is the proper procedure to determine the length of a constant character column after query execution but before fetching the first row of data? Why not just get the first row and determine the width from it before you actually use any of that data? What if the second row is 1000x longer? That's exactly the point. Consider select mytext from mytable ; How can PostgreSQL possibly know the maximum length of the returned values *before* it has scanned the whole table? greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] EXPLAIN omits schema?
Gregory Stark wrote: "Joshua D. Drake" <[EMAIL PROTECTED]> writes: I agree. XML seems like a fairly natural fit for this. Just as people should not try to shoehorn everything into XML, neither should they try to shoehorn everything into a relational format. Now all we need is an XML schema for it ;-) Well I am not a big fan of XML but it certainly seems applicable in this case. I'm not a fan either so perhaps I'm biased, but this seems like a good example of where it would be an *awful* idea. Once you have an XML plan what can you do with it? All you can do is parse it into constituent bits and display it. You can't do any sort of comparison between plans, aggregate results, search for plans matching constraints, etc. How would I, with XML output, do something like: SELECT distinct node.relation FROM plan_table WHERE node.expected_rows < node.actual_rows*2; or SELECT node.type, average(node.ms/node.cost) FROM plan_table GROUP BY node.type; I believe that XQuery actually supports such queries. So if postgres supported XQuery (or does it already? I honestly don't know), writing such a query wouldn't be that hard I think. The execution probably won't be super-efficient, but for query plans that seems OK. greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Controlling Load Distributed Checkpoints
Heikki Linnakangas wrote: Jim C. Nasby wrote: On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: Thinking about this whole idea a bit more, it occurred to me that the current approach to write all, then fsync all is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world. How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. If we extended relations by more than 8k at a time, we would know a lot more about disk layout, at least on filesystems with a decent amount of free space. I doubt it makes that much difference. If there was a significant amount of fragmentation, we'd hear more complaints about seq scan performance. OTOH, extending a relation that uses N pages by something like min(ceil(N/1024), 1024) pages might help some filesystems to avoid fragmentation, and hardly introduce any waste (about 0.1% in the worst case). So if it's not too hard to do it might be worthwhile, even if it turns out that most filesystems deal well with the current allocation pattern. greetings, Florian Pflug ---(end of broadcast)--- TIP 6: explain analyze is your friend
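The growth rule suggested in that last paragraph is easy to write down; a minimal sketch, with the function name as an assumption (the 1024-page cap corresponds to 8MB with 8k pages, and the worst-case waste of one extra chunk is about N/1024 relative to N, i.e. roughly 0.1%):

```c
#include <assert.h>
#include <stdint.h>

/* Extend a relation of npages by min(ceil(N/1024), 1024) pages,
 * but always by at least one page. */
static uint32_t
extension_size(uint32_t npages)
{
    uint32_t chunk = (npages + 1023) / 1024;   /* ceil(N/1024) */
    if (chunk < 1)    chunk = 1;               /* empty relation: one page */
    if (chunk > 1024) chunk = 1024;            /* cap at 1024 pages (8MB) */
    return chunk;
}
```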
Re: [HACKERS] Worries about delayed-commit semantics
PFC wrote: On Fri, 22 Jun 2007 16:43:00 +0200, Bruce Momjian <[EMAIL PROTECTED]> wrote: Simon Riggs wrote: On Fri, 2007-06-22 at 14:29 +0100, Gregory Stark wrote: > "Joshua D. Drake" <[EMAIL PROTECTED]> writes: > > Tom Lane wrote: > >> untrustworthy disk hardware, for instance. I'd much rather use names > >> derived from "deferred commit" or "delayed commit" or some such. > > > > Honestly, I prefer these names as well as it seems directly related versus > > transaction guarantee which sounds to be more like us saying, if we turn it off > > our transactions are bogus. That was the intention..., but name change accepted. > Hm, another possibility: "synchronous_commit = off" Ooo, I like that. Any other takers? Yea, I like that too but I am now realizing that we are not really deferring or delaying the "COMMIT" command but rather the recovery of the commit. GUC as full_commit_recovery? commit_waits_for_fsync = force_yes: makes all commits "hard" yes: commits are "hard" unless specified otherwise [default] no: commits are "soft" unless specified otherwise [should replace fsync=off use case] force_no: makes all commits "soft" (controller with write cache "emulator") I think you got the last line backwards - without the fsync() after a commit, you can't be sure that the data made it into the controller cache. To be safe you *always* need the fsync() - but it will probably be much cheaper if your controller doesn't have to actually write to the disks, but can cache in battery-backed ram instead. Therefore, if you own such a controller, you probably don't need deferred commits. BTW, I like synchronous_commit too - but maybe asynchronous_commit would be even better, with inverted semantics of course. Then you'd have "asynchronous_commit = off" as default. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Worries about delayed-commit semantics
Richard Huxton wrote: Bruce Momjian wrote: Tom Lane wrote: What's wrong with synchronous_commit? It's accurate and simple. That is fine too. My concern would be that it can be read two ways: 1. When you commit, sync (something or other - unspecified) 2. Synchronise commits (to each other? to something else?)* It's obvious to people on the -hackers list what we're talking about, but is it so clear to a newbie, perhaps non-English speaker? * I can see people thinking this means something like "commit_delay". OTOH, the concept of synchronous vs. asynchronous (function) calls should be pretty well-known among database programmers and administrators. And (at least to me), this is really what this is about - the commit happens asynchronously, at the convenience of the database, and not the instant that I requested it. greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Still recommending daily vacuum...
Michael Paesold wrote: Alvaro Herrera wrote: So what you are proposing above amounts to setting scale factor = 0.05. The threshold is unimportant -- in the case of a big table it matters not if it's 0 or 1000, it will be almost irrelevant in calculations. In the case of small tables, then the table will be vacuumed in almost every iteration if the threshold is 0, which is fine because the table is small anyway. So why not let the threshold be 0 and be done with it? For very small tables, setting a threshold of 0 could mean a vacuum after every single row update (or every other row). I think that is just burning cycles. What about a threshold of 10 or 50, to have at least some sanity limit? Even though the cost of vacuum of a small table is low, it is still not free, IMHO, no? A bit off-topic (because probably not realistic in an 8.3 timeframe) - but maybe the threshold should be specified in terms of "expected number of pages to be freed", instead of specifying a bias for the number of modified rows as it is done now. Then "1" would probably be a reasonable default, because a vacuum that won't free at least one page seems not really worth the effort - it won't save any future IO bandwidth. Just an idea I got while following this thread... greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
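For reference, the trigger rule being tuned in this thread is the standard autovacuum formula: vacuum when dead tuples exceed threshold + scale_factor * reltuples. A minimal sketch (function name is illustrative), which shows why threshold = 0 fires on almost every update of a tiny table while a constant like 50 acts as a sanity limit:

```c
#include <assert.h>
#include <stdbool.h>

/* Autovacuum-style trigger check: with scale_factor = 0.05, a table is
 * vacuumed once ~5% of its rows are dead -- plus a flat threshold that
 * mainly matters for very small tables. */
static bool
needs_vacuum(double dead_tuples, double reltuples,
             double threshold, double scale_factor)
{
    return dead_tuples > threshold + scale_factor * reltuples;
}
```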
Re: [HACKERS] [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Tom Lane wrote: [ back to dealing with this patch, finally ] "Florian G. Pflug" <[EMAIL PROTECTED]> writes: While creating the patch, I've been thinking if it might be worthwhile to note that we just did recovery in the ShutdownCheckpoint (or create a new checkpoint type RecoveryCheckpoint). This would allow for more error checking, because then the slave could check that safe_restartpoint() is true for all ShutdownCheckpoints that were not after recovering. I concur that this is a good idea --- we should have a third checkpoint record type that shows that a crash recovery occurred. However, we can probably only do that for 8.3 and beyond. If we try to do it in existing release branches then there's likelihood of trouble due to WAL incompatibility between master and standby. While we do advise people to update their standbys first, I don't think it's worth risking such problems just to add some more error checking. > Conclusion: we should apply Florian's patch as-is in 8.2, do something morally equivalent in 8.1 and before, and invent a CrashRecoveryCheckpoint record type in HEAD. Sounds good. Do you want me to code up such patches for 8.1 and 8.3 in the next few days, or is someone else already working on it? greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Tom Lane wrote: Conclusion: we should apply Florian's patch as-is in 8.2, do something morally equivalent in 8.1 and before, and invent a CrashRecoveryCheckpoint record type in HEAD. Sounds good. Actually, now that I look closer, this patch seems completely wrong. It's predicated on an assumption that rm_cleanup won't write WAL entries describing what it did ... but, at least in the btree case, it does. (I think gist/gin might not, but that's a bug in those AMs not in xlog.) I'm therefore wondering what test case led you to think there was something wrong. It wasn't a testcase - I was trying to understand the xlog code while working on my concurrent walreplay patch, and wondered what happens if the master crashes and then recovers while the slave keeps running. I've re-read my original email to Simon, and it seems that I believed that rm_cleanup methods wouldn't be able to write to the xlog because they are called during recovery. But StartupXLOG *does* make the WAL appendable *before* the rm_cleanup methods are called. So I now think (at least for btree) that everything is fine, and that I was just being stupid. Sorry for the noise, guys. greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] ReadRecord, EndRecPtr and XLOG_SWITCH
Hi When ReadRecord encounters an XLOG_SWITCH record, it does EndRecPtr.xrecoff += XLogSegSize - 1; EndRecPtr.xrecoff -= EndRecPtr.xrecoff % XLogSegSize; which seems to set xrecoff to either 0 (if it was 0) or to XLogSegSize (if it was > 0). Note that xrecoff == XLogSegSize is kind of "denormalized" - the normalized version would have xrecoff == 0, and xlogid = xlogid+1. Passing this "denormalized" EndRecPtr to ReadRecord again to read the next record then triggers a PANIC ("invalid record offset at ??/1000"). Passing NULL to ReadRecord to read the next record works, because it takes care to align the EndRecPtr to the next page, thereby fixing the "denormalization". Is there a reason not to do the same for non-NULL arguments to ReadRecord? Or is there some failure case that the current behaviour protects against? The reason I stumbled over this is that I want to restart archive recovery from a "bgreplay" process - I tried passing the EndRecPtr via shmem, and using it as my initial argument to ReadRecord, and thereby stumbled over this behaviour. greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ReadRecord, EndRecPtr and XLOG_SWITCH
Florian G. Pflug wrote: Please disregard - I was confusing xlogid with xlog segments, so most of my mail was nonsense. I've fixed my problem by storing not the EndRecPtr, but rather the ReadRecPtr, in shmem and rereading the last already applied record in my bgreplay process. Then I can just use ReadRecord(NULL), and things Just Work. Sorry for the noise & greetings Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[HACKERS] "Running readonly queries on PITR slaves" status update
Hi After struggling with understanding xlog.c and friends far enough to be able to refactor StartupXLOG to suit the needs of concurrent recovery, I think I've finally reached a workable (but still a bit hacky) solution. My design is centered around the idea of a bgreplay process that takes over the role of the bgwriter in readonly mode, and continuously replays WALs as they arrive. But since recovery during startup is still necessary (We need to bring a filesystem-level backup into a consistent state - past minRecoveryLoc - before allowing connections), this means doing recovery in two steps, from two different processes. I've changed StartupXLOG to only recover up to minRecoveryLoc in readonly mode, and to skip all steps that are not required if no writes to the database will be done later (Especially creating a checkpoint at the end of recovery). Instead, it posts the pointer to the last recovered xlog record to shared memory. bgreplay then uses that pointer for an initial call to ReadRecord to set up WAL reading for the bgreplay process. Afterwards, it repeatedly calls ReplayXLOG (new function), which always replays at least one record (if there is one, otherwise it returns false), until it reaches a safe restart point. Currently, in my test setup, I can start a slave in readonly mode and it will do initial recovery, bring postgres online, and continuously recover from inside bgreplay. There isn't yet any locking between WAL replay and queries. I'll add that locking during the next few days, which should result in a very early prototype. The next steps will then be finding a way to flush backend caches after replaying records that modified system tables, and (related) finding a way to deal with the flatfiles. I'd appreciate any comments on this, especially those pointing out problems that I overlooked. greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
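The two-phase control flow described above can be sketched as a toy model (Python, purely illustrative; the names StartupXLOG, ReplayXLOG and minRecoveryLoc come from the mail, everything else - record shapes, positions as list indices - is an assumption):

```python
def startup_recover(records, min_recovery_loc):
    """Phase 1 (StartupXLOG in readonly mode): replay only up to
    minRecoveryLoc, then return the replay position that the real
    design would post to shared memory for bgreplay to pick up."""
    applied = []
    pos = 0
    while pos < min_recovery_loc:
        applied.append(records[pos])
        pos += 1
    return pos, applied

def replay_xlog(records, pos, applied):
    """Phase 2 (bgreplay's ReplayXLOG): replay at least one record if
    one is available, otherwise report False."""
    if pos >= len(records):
        return pos, False
    applied.append(records[pos])
    return pos + 1, True
```

bgreplay would then sit in a loop calling replay_xlog until it returns False, wait for more WAL to arrive, and repeat - which is what lets the slave come online after phase 1 while recovery continues in the background.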
Re: [HACKERS] 2PC-induced lockup
Tom Lane wrote: "Simon Riggs" <[EMAIL PROTECTED]> writes: I'd be much more comfortable if LOCK TABLE caused a message to the log if it is executed on any system table. Enabled by "set training_wheels = on", perhaps? This is really pretty silly to be getting worked up about. The command in question wouldn't have been allowed at all except to a superuser, and there are plenty of ways to catastrophically destroy your database when you are superuser; most of which we will never consider blocking for the same reasons that Unix systems have never tried to block root from doing "rm -rf /". I'd say the real design flaw in Peter's referenced application is that they're running it as superuser. Yeah.. though "lock pg_auth; prepare" looks quite innocent, much more so than say "delete from pg_database" or "rm -rf whatever". At least to the untrained eye. I fully agree that special-casing this particular way to shoot yourself in the foot is not worth it - but maybe pursuing a more general solution would be worthwhile? Maybe superuser connections could e.g. ignore any errors that occur while reading a system table, together with a big, fat warning, but still allow a logon? That of course depends on the assumption that basic authentication is possible using just the information from the flatfiles and pg_hba.conf, which I'm not sure about. greetings, Florian Pflug ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] xlog switch & last record before the switch
Hi To test my PITR-slave readonly-query patch, I continuously do insert into test ... pg_switch_xlog() sleep 1 on the master, and let the slave process the generated xlogs. The log output on the slave looks like the following (unnecessary lines removed):
LOG: restored log file "0001016E" from archive
LOG: REDO @ 0/16E00020; LSN 0/16E0004C: prev 0/16D00098; xid 1346: Transaction - commit: 2007-07-12 15:43:00.686056+02
LOG: REDO @ 0/16E0004C; LSN 0/16E0008C: prev 0/16E00020; xid 0: XLOG - checkpoint: redo 0/16E00020; tli 1; xid 0/1347; oid 24576; multi 1; offset 0; online
LOG: REDO @ 0/16E0008C; LSN 0/16E00140: prev 0/16E0004C; xid 1350: Sequence - log: rel 1663/1/16384
LOG: REDO @ 0/16E00140; LSN 0/16E00B88: prev 0/16E0008C; xid 1350; bkpb1: Heap - insert: rel 1663/1/16386; tid 2/49
LOG: REDO @ 0/16E00B88; LSN 0/16F0: prev 0/16E00140; xid 1350: XLOG - xlog switch
LOG: restored log file "0001016F" from archive
LOG: REDO @ 0/16F00020; LSN 0/16F0004C: prev 0/16E00B88; xid 1350: Transaction - commit: 2007-07-12 15:43:02.159717+02
LOG: REDO @ 0/16F0004C; LSN 0/16F00098: prev 0/16F00020; xid 1352: Heap - insert: rel 1663/1/16386; tid 2/50
LOG: REDO @ 0/16F00098; LSN 0/1700: prev 0/16F0004C; xid 1352: XLOG - xlog switch
LOG: restored log file "00010170" from archive
LOG: REDO @ 0/1720; LSN 0/174C: prev 0/16F00098; xid 1352: Transaction - commit: 2007-07-12 15:43:02.26456+02
LOG: REDO @ 0/174C; LSN 0/1798: prev 0/1720; xid 1356: Heap - insert: rel 1663/1/16386; tid 2/51
LOG: REDO @ 0/1798; LSN 0/1710: prev 0/174C; xid 1356: XLOG - xlog switch
As you can see, the COMMIT records seem to end up being logged *after* the xlog switch. I would have expected the order "heap-insert, commit, switch, heap-insert, commit, switch, ...", not "heap-insert, switch, commit, heap-insert, switch, commit, ...". Is this the expected behaviour, or just an artefact of the implementation of xlog switches? greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? 
http://archives.postgresql.org
Re: [HACKERS] xlog switch & last record before the switch
Simon Riggs wrote: On Thu, 2007-07-12 at 16:17 +0200, Florian G. Pflug wrote: To test my PITR-slave readonly-query patch, I continuously do insert into test ... pg_switch_xlog() sleep 1 on the master, and let the slave process the generated xlogs. The log output on the slave looks like the following (unnecessary lines removed) [...] As you can see, the COMMIT records seem to end up being logged *after* the xlog switch. I would have expected the order "heap-insert, commit, switch, heap-insert, commit, switch, ...", not "heap-insert, switch, commit, heap-insert, switch, commit, ...". Is this the expected behaviour, or just an artefact of the implementation of xlog switches? Can you show the exact SQL executed? If you do INSERT ... ;select pg_switch_xlog() then the COMMIT will occur after the switch. If you do: INSERT ...; select pg_switch_xlog(); then the COMMIT will occur before the switch. Ah, you caught me - I had forgotten that psql -c "" executes the statement in one transaction. I was just going to suggest that pg_start_backup, pg_stop_backup and pg_switch_xlog emit a warning or even an error if called from within a larger transaction, because that's quite certainly not what the user wants. But since those are just plain functions, I guess checking for that might not be trivial... If it were otherwise this would imply statements were executed prior to the previous commit, which I hope and pray is never the case. You can relax, the bug was located in front of the screen :-) greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Full page images in WAL & Cache Invalidation
Hi I'm currently working on correctly flushing the catalog/relation/smgr caches on a readonly PITR slave during recovery. These are the things that currently cause me headaches. 1) It seems that the btree code sends out relcache invalidation msgs during normal operation (No DDL statements are executed). This lets any simple flush-all-caches-if-DDL-was-executed scheme fail. 2) When a full page image is written to the WAL, the information about what tuple was updated is lost. So synthesizing cache invalidation msgs from the WAL records would need to reverse-engineer a full page image, which seems hard and error-prone. 3) Most cache invalidations seem to be generated by heap_insert (via PrepareForTupleInvalidation). Those seem to be reconstructable from the WAL quite easily. Those sent out via CacheInvalidateRelcache*, however, seem to leave no trace in the WAL. What I'm wondering is how much performance is lost if I just let the slave flush all its caches whenever it replays a commit record of a transaction that executed DDL. To me it looks like that would only seriously harm performance if a lot of temporary tables are created on the master. Since there seem to be quite a few people who are unhappy about the current temptable implementation, optimizing for that case might prove worthless if 8.4 or 8.5 will change the way that temptables are handled. If this brute-force approach turns out to perform really badly, does anyone see an elegant way around (2) and (3)? (2) seems solvable by writing logical and physical records to the WAL - similar to what that xlog compression idea needs (I, however, lost track of what came out of that discussion). But (3) seems to be messy.. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Full page images in WAL & Cache Invalidation
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: I'm currently working on correctly flushing the catalog/relation/smgr caches on a readonly PITR slave during recovery. I don't believe there is any workable solution to that short of logging cache-flush operations in WAL. I still don't fully understand if and what problems are caused by overly aggressive cache flushing - what puzzles me is that DDL statements seem to be considered something rare on a production system by most people on this list, yet the caches seem to be highly optimized to avoid unnecessary invalidates. Leaving aside the btree issues, are you worried about performance problems, or can aggressive cache flushing hurt correctness? The reason that I dislike WAL-logging of the flush operations so much is that, since people are concerned about the amount of WAL traffic postgres generates, such a solution would introduce yet another GUC. And to make this reasonably foolproof, the slave would need a way to detect if that GUC is set correctly on the master. All in all, that seems to be quite hackish... greetings, Florian Pflug ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Full page images in WAL & Cache Invalidation
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Leaving aside the btree issues, are you worried about performance problems, or can aggressive cache flushing hurt correctness? It shouldn't hurt correctness, but I don't think you entirely grasp the magnitude of the performance hit you'll take. The last time I tried running the regression tests with CLOBBER_CACHE_ALWAYS enabled, they took about one hundred times longer than normal. Now you are evidently hoping not to disable the caches entirely, but I don't understand how you are going to handle the various random CacheInvalidateRelcache calls that are here and there in the system (not only in btree). I must be missing something... A quick grep for CacheInvalidateRelcache turned up these places:
src/backend/rewrite/rewriteDefine.c (EnableDisableRule)
src/backend/rewrite/rewriteSupport.c (SetRelationRuleStatus)
src/backend/access/nbtree/nbtinsert.c (_bt_insertonpg, _bt_newroot)
src/backend/access/nbtree/nbtpage.c (_bt_getroot, _bt_pagedel)
src/backend/commands/trigger.c (renametrig, EnableDisableTrigger)
src/backend/commands/cluster.c (mark_index_clustered)
src/backend/commands/indexcmds.c (DefineIndex)
src/backend/commands/tablecmds.c (setRelhassubclassInRelation)
src/backend/commands/vacuum.c (vac_update_relstats)
src/backend/catalog/heap.c (SetRelationNumChecks)
src/backend/catalog/index.c (index_drop, index_update_stats)
For CacheInvalidateHeapTuple, there is an additional hit in src/backend/commands/vacuum.c (move_chain_tuple, move_plain_tuple). Note that move_chain_tuple and move_plain_tuple are only called in repair_frag, which in turn is only used in full_vacuum_rel. Now, to me all of these - with the exception of the btree functions, vac_update_relstats and move_chain/plain_tuple - look like they are only called during DDL statements. My basic assumption is that DDL is something quite uncommon on a production system. 
This is obviously *totally* *wrong* for the regression tests, and I don't doubt that my scheme will show quite bad performance if you use those as a benchmark. But if you, say, use pgbench for testing, then the picture will be quite different, I imagine. My strategy would be the following: 1) Mark the commit record if a transaction generated any invalidation events apart from the btree ones. The only other source of inval events seems to be "VACUUM FULL" on a system relation, which won't happen on a modestly well-tuned system I think - any VACUUM FULL will need special treatment anyway. 2) At replay time, the caches are flushed after that record was replayed. greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
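The two-step strategy above can be sketched as a toy replay loop (illustrative Python only; the record dictionaries and flag name are assumptions, not the actual WAL record layout):

```python
def replay_with_flushes(records, relcache):
    """Model of the proposed scheme: each commit record carries a flag
    saying whether its transaction generated (non-btree) inval events;
    replaying such a commit flushes the slave's caches wholesale."""
    flushes = 0
    for rec in records:
        # ... the record's actual changes would be applied here ...
        if rec.get("type") == "commit" and rec.get("had_inval_events"):
            relcache.clear()   # brute force: system-wide cache flush
            flushes += 1
    return flushes
```

Under the assumption that DDL is rare, most commit records have the flag unset and the caches survive replay untouched; only the occasional flagged commit pays the full-flush cost.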
Re: [HACKERS] Full page images in WAL & Cache Invalidation
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: My basic assumption is that DDL is something quite uncommon on a production system. I'm not sure I believe that, because of temp tables. There's also the problem that plain VACUUM (or ANALYZE) causes a relcache flush to update the relation-size statistics. > The real problem with the scheme you propose is that it turns a cache flush on one table into a system-wide cache flush. Yes.. It really builds on the idea that those flushes happen not too frequently. We might be able to do something about the temp-table case upstream: AFAICS there's no reason for backends to broadcast cache flushes for their own temp tables to other backends. But that's just a sketch of a thought at the moment. I was actually hoping that some day temptables wouldn't be stored in pg_class and friends at all - I was wondering if it wouldn't be possible to keep the information about them solely in the catcache and relcache (making the word cache a big lie). I didn't check if that is feasible at all, though - just an idea I got at one point. Anyway, if you believe that DDL is infrequent, why are you resistant to the idea of WAL-logging cache flushes? For multiple reasons. First, cache invalidations are not the only problem caused by replaying system-table updates. The whole SnapshotNow business doesn't exactly make things easier either. So it feels like a lot of added complexity and code for little gain - unless a *lot* more things (like locking requests) are logged too. Second, I'm sure that people would insist on a GUC to turn logging those records off if they don't need them in their setup. Similar to that make_wal_compressable GUC that was proposed a few weeks ago. And if it's a GUC, the slave should be able to verify that it was set correctly on the master, otherwise this becomes a huge footgun. 
Third, I try to keep the changes necessary on the master at a minimum - I feel that this will make merging the code at some point easier, because the risk of breaking something is smaller. Bugs in the slave code will maybe cause crashes and wrong results, but at least they won't cause data corruption on the master. And last but not least, I have only limited time for this project - so I try to find the simplest workable solution, and maybe tune things later when practical experience shows where the real bottlenecks are. greetings, Florian Pflug ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Full page images in WAL & Cache Invalidation
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Anyway, if you believe that DDL is infrequent, why are you resistant to the idea of WAL-logging cache flushes? First, cache invalidations are not the only problem caused by replaying system-table updates. The whole SnapshotNow business doesn't exactly make things easier too. So it feels like a lot of added complexity and code for little gain - unless a *lot* more things (like locking requests) are logged too. The mention of locking requests brought to mind the following gedankenexperiment: 1. slave server backend is running some long-running query on table X. 2. WAL-reading process receives and executes DROP TABLE X. (It doesn't even have to be a DROP; most varieties of ALTER are enough to create problems for a concurrently-running query.) It's really hard to see how to defend against that without a fairly complete simulation of locking on the slave side. Well, locking on the slave is a bit easier than on the master, for two reasons: 1) Queries running on the slave only lock in AccessShareLock mode - therefore, you only need to know if a certain mode conflicts with AccessShareLock - and there seems to be only one that does, namely AccessExclusiveLock. So we really only need to care about AccessExclusiveLock locks on the master. 2) As far as I can see, the point of an AccessExclusiveLock is *not* actually preventing queries from running while a DDL statement is *executed*, but rather preventing queries from running while the statement is *committed*. This fits the fact that system tables are read using SnapshotNow (not SnapshotDirty) - while the DDL is running, everybody is happily using the old information; the trouble would only start after the commit, because with SnapshotNow you suddenly see the new state. 
I'm not yet 100% sure that (2) holds (with the exception of VACUUM FULL) - but I'm fairly confident, because if (2) were wrong, then how would the system survive a crash during the execution of a DDL statement? So after a bit more thought (And reading. Thanks for all the replies, guys! They are greatly appreciated.), I came up with the following plan for both inval events and locks: .) Store two flags in the commit record of a transaction, for "transaction generated inval events" and "transaction held an access exclusive lock". .) Upon replay, block until no transactions are running (for "transaction held an access exclusive lock") before replaying the record, or flush the caches after replaying it (for "transaction generated inval events"). This scheme has two really nice properties: First, it can be extended fairly easily to store not a simple flag, but a list of OIDs, and use that to make the flushing and blocking more fine-grained - without changing the basic way in which things work. And second, it stores all information needed in the *commit* record. That removes problems with transactions that are aborted due to a crash, for which the WAL therefore never records the abort. VACUUM FULL will need some special care - but at least VACUUM FULL is already such a disruptive operation that it probably won't surprise anyone if it's disruptive on the slave too. (And now that CLUSTER is MVCC-safe from what I read, the use case for VACUUM FULL seems to be pretty slim). The btree metapage caching will need special treatment too - probably some flags in the WAL record that changes the metapage that instruct the slave to synthesize a suitable inval event. What stays open is regular VACUUM (and maybe inline vacuuming - is that part of the latest version of HOT, or did it get removed?). Here, the key is logging the return value of GetOldestXmin(), I think. 
Since that value is what decides in the end if a tuple can be killed or not, having it available on the slave should allow the slave to block replay until no slave query depends on the tuple anymore. greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
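Point (1) of the lock argument reduces to a one-line predicate: since slave queries only ever hold AccessShareLock, the replay process needs to react to exactly one lock mode. A toy check (the mode list mirrors PostgreSQL's table lock levels; the function itself is an illustrative assumption, not backend code):

```python
# PostgreSQL's regular table lock modes, weakest to strongest.
MODES = ["AccessShareLock", "RowShareLock", "RowExclusiveLock",
         "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
         "ExclusiveLock", "AccessExclusiveLock"]

def blocks_slave_query(master_lock_mode):
    # A slave query only ever holds AccessShareLock, and the only mode
    # that conflicts with AccessShareLock is AccessExclusiveLock.
    return master_lock_mode == "AccessExclusiveLock"
```

Every other mode - even plain ExclusiveLock - coexists with AccessShareLock, which is why only AccessExclusiveLock acquisitions on the master are interesting to the slave.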
Re: [HACKERS] Full page images in WAL & Cache Invalidation
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: I came up with the following plan for both inval events and locks .) Store two flags in the commit record of a transaction, for "transaction generated inval events" and "transaction held an access exclusive lock". .) Upon replay, block until no transactions are running (for "transaction held an exclusive lock") before replaying the record, or flush the caches after replaying it (for "transaction generated inval events"). This does not work; the lock has to be taken earlier than that. (See for instance VACUUM's truncate calls.) Not to mention that you have converted "exclusive lock on one table" to "exclusive lock on every table", which is even worse than the idea of converting per-table cache flushes to system-wide ones. I'll check what VACUUM is doing.. I primarily had CLUSTER and TRUNCATE in mind. That "exclusive lock on one table becomes exclusive lock on all tables" issue can (as I wrote in the part of my mail that you snipped) be solved, I think, by storing a list of OIDs instead of a flag for the locks and inval events. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Full page images in WAL & Cache Invalidation
Simon Riggs wrote: On Sun, 2007-07-22 at 19:58 +0200, Florian G. Pflug wrote: Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: I'm currently working on correctly flushing the catalog/relation/smgr caches on a readonly PITR slave during recovery. I don't believe there is any workable solution to that short of logging cache-flush operations in WAL. The reason that I dislike WAL-logging of the flush operations so much is that, since people are concerned about the amount of WAL traffic postgres generates, such a solution would introduce yet another GUC. And to make this reasonably foolproof, the slave would need a way to detect if that GUC is set correctly on the master. All in all, that seems to be quite hackish... Seems like we should WAL log flush operations first. It's fairly straightforward to do that and we can then measure its effect on the primary easily enough. Your other suggestions seem much more complex. I think we have a reasonable tolerance for increases in WAL and as you said earlier, we may balance that out with other optimisations. Or we may find a more efficient way of doing it later. Let's aim to get that first query running, then go back and tune it later. I've so far added an LWLock that makes replay and queries mutually exclusive. Simple testcases seem to work, but I haven't really beaten the system yet... Of course, my current version falls over as soon as you do DDL on the master - working on fixing that, and on subsequently removing that lock again :-) greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] Tracking a snapshot on PITR slaves
Hi Since my attempts to find a simple solution for the read-only query locking problems (One that doesn't need full WAL logging of lock requests) haven't been successful yet, I've decided to turn to the problem of tracking a snapshot on the slaves for now. (Because first, such a snapshot is needed for any kind of concurrent recovery anyway, and second, because any non-simplistic solution of the locking problems will quite likely benefit from such a snapshot). The idea is to create a special kind of snapshot that works basically like an MVCC snapshot, but with the meaning of the xip array inverted. Usually, if a xid is *not* in the xip array of a snapshot, and greater than the xmin of that snapshot, the clog state of the xid determines tuple visibility. This is not well suited for queries running during replay, because the effects of a xlog record with a (to the slave) previously unknown xid shouldn't be visible to concurrently running queries. Therefore, a flag xip_inverted will be added to SnapshotData that causes HeapTupleSatisfiesMVCC to assume that any xid >= xmin and *not* in the xip array is in progress. This allows the following to work: .) Store RecentXmin with every xlog record, in a new field xl_xmin. (Wouldn't be needed in *every* record, but for now keeping it directly inside XLogRecord makes things easier, and it's just 4 bytes) .) Maintain a global snapshot template in shmem during replay, with the xmin being the highest xmin seen so far in any xlog record. That template is copied whenever a readonly query needs to obtain a snapshot. .) Upon replaying a COMMIT or COMMIT_PREPARED record, the xid of the to-be-committed transaction is added to the xip array of the global snapshot, making the commit visible to all further copies of that snapshot. If you can shoot this down, you're welcome to do so ;-) greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
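A minimal model of the inverted snapshot described above (Python, purely illustrative; subtransactions, xid wraparound, and the real HeapTupleSatisfiesMVCC logic are all omitted, and the xid-below-xmin branch simply assumes a committed clog state):

```python
class InvertedSnapshot:
    """Toy model of the proposed slave-side snapshot: xip holds xids
    whose commit records have been replayed; any xid >= xmin NOT in
    xip is treated as in-progress - the inverse of a normal MVCC
    snapshot's xip semantics."""
    def __init__(self, xmin):
        self.xmin = xmin
        self.xip = set()

    def xid_visible(self, xid):
        if xid < self.xmin:
            return True      # in the real code, the clog decides here
        return xid in self.xip   # inverted: not listed => in progress

def replay_commit(template, xid):
    # Replaying a COMMIT (or COMMIT_PREPARED) record adds the xid to the
    # global template, so all snapshots copied afterwards see its effects.
    template.xip.add(xid)
```

The key property is the default-invisible behaviour: a xid the slave has never heard of stays invisible until its commit record is actually replayed, which is exactly what queries running concurrently with replay need.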
Re: [HACKERS] XID wraparound and busy databases
Tom Lane wrote: Bruce Momjian <[EMAIL PROTECTED]> writes: Is enlarging the xid field something we should consider for 8.4? No. We just got the tuple header down to 24 bytes, we are not going to give that back and then some. If you are processing 6K transactions per second, you can afford to vacuum every couple days... and probably need to vacuum much more often than that anyway, to avoid table bloat. Possibly your respondent should think about trying to do more than one thing per transaction? I'm wondering how many of those 6k xacts/second are actually modifying data. If a large percentage of those are readonly queries, then the need for vacuuming could be reduced if postgres assigned an xid only if that xid really hits the disk. Otherwise (for purely select-type queries) it could use some special xid value. This is what I'm doing in my Readonly-Queries-On-PITR-Slave patch. greetings, Florian Pflug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] SQL feature requests
Ben Tilly wrote: On 8/22/07, Gregory Stark <[EMAIL PROTECTED]> wrote: "Ben Tilly" <[EMAIL PROTECTED]> writes: 2. Why is 'non-integer constant in GROUP BY' an error? Hm... I was a bit surprised by this warning myself. IIRC there was an implementation convenience issue. If your implementation accepts: group by case when true then 'foo' end how much harder can it be to accept: group by 'foo' This is not about hardness of the implementation, but rather about non-confusing behaviour, I think. AFAIK, "group by 1" means "group by the first selected column", not "group all rows together". But "group by 'foo'" would carry the second meaning - "group all rows together". This is so totally counter-intuitive that it's not even funny... "group by case when true then 'foo' end" looks different enough compared to "group by 1" to make this less of a footgun. Seems that the "group by 1" syntax predates the appearance of aliases in the SQL standard... greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] SQL feature requests
Ben Tilly wrote: On 8/22/07, Michael Glaesemann <[EMAIL PROTECTED]> wrote: On Aug 22, 2007, at 20:49 , Ben Tilly wrote: If your implementation accepts: group by case when true then 'foo' end What would that mean? Regardless of whether or not it's accepted, it should have *some* meaning. To my eyes it has a very clear meaning, we're grouping on an expression that happens to be a constant. Which happens to be the same for all rows. Which is a spectacularly useless thing to actually do, but the ability to do it happens to be convenient when I'm looking for something to terminate a series of commas in a dynamically built query. Which is the same very clear meaning that "group by 1" has - we're grouping on an expression which happens to be the constant 1. Hey, wait a second. This isn't what "group by 1" means at all - it rather means group by whatever the first column in the select list is. So, yes, "group by 'foo'" *seems* to have a very clear meaning - but that clearness vanishes as soon as you take into account what "group by 1" means. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] SQL feature requests
Kevin Grittner wrote: On Thu, Aug 23, 2007 at 3:01 PM, in message <[EMAIL PROTECTED]>, Tom Lane <[EMAIL PROTECTED]> wrote: The only argument I've heard that carries much weight with me is that it eases porting from other DBMS's that allow this. Are there any others besides Oracle? select * from (select f1 from t) In Sybase: com.sybase.jdbc2.jdbc.SybSQLException: The derived table expression is missing a correlation name. Check derived table syntax in the Reference Manual. Error code: 11753 SQL state: Z The really funny thing is that pgsql, mysql and at least sybase *explicitly* disallow the no-alias case. Which shows that .) This seems to be a common source of confusion and errors. .) Alias-less subqueries wouldn't lead to ambiguous grammars in those databases. Otherwise, you'd expect to get some more generic syntax error, and not the very explicit "No alias, but expected one". I agree with Tom - knowing *why* the standards committee disallows that syntax - and why everybody except Oracle chose to agree with it - would be quite interesting. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] simple replication
Edoardo wrote: On Aug 24, 4:27 pm, [EMAIL PROTECTED] ("Heikki Linnakangas") wrote: Edoardo wrote: - Google Summer of Code effort by Florian Pflug I would like to talk about the last one: I couldn't find any website / explanation or further post. The google link http://code.google.com/soc/postgres/appinfo.html?csaid=6545828A8197EBC6 is broken - does anybody know something about it? He's working on allowing you to perform read-only queries on a warm standby server. Currently, the warm standby can't be used for queries until you fail over to it. Even so, it is worth considering this solution in my scenario - do you think I can have a look at his work? The archive of the pgsql-hackers list contains some discussion of my work. I can also send you the latest version - though it's absolutely not ready for production use at this point. It's more of a prototype, to find out where the problems lie, and how to overcome them. Work on this won't stop with the end of GSoC, and I'm not the only one interested in making this happen - so there is a realistic chance that something like this will show up in postgres sooner or later - no promises, though. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions
I'm resending this without the actual patch attached, since my first mail didn't get through. The patch can be found here: http://soc.phlo.org/lazyxidassign.patch Hi Lately, there was some interest in assigning XIDs for toplevel transactions only when the transaction actually needs one (that is, it writes something to the disk), like we currently do for subtransactions. This is obviously also a crucial part of my work on read-only transactions on PITR slaves - since there is no way to assign "real" XIDs on a PITR slave. I've spent the last few days factoring out that work, and turning it into a general solution. The result is this patch, which basically does the following .) It defines a special TemporaryTransactionId that is used as an xact's xid until the xact calls GetCurrentTransactionId / GetTopTransactionId. .) It introduces a new macro TransactionIdIsPermanent, which tells if an XID is valid, and not equal to TemporaryTransactionId. .) It lets GetTopTransactionId assign an XID on-demand, similar to how GetCurrentTransactionId handles that for subtransactions. .) Each transaction gets an "rid" (ResourceOwnerId) assigned when it starts, and obtains a lock on that rid, similar to how the xid is locked. This can be used to wait for a transaction's toplevel resource owner to release all its locks, and serves as a unique identifier for a running transaction. This is needed for concurrent index builds to wait until all transactions holding a ShareLock on the target relation have ended. The patch passes the regression test, though there are currently two issues that need to be resolved. 1) The second waiting phase of concurrent index builds fails to wait for xacts that haven't been assigned an XID when the reference snapshot was taken. The "rid" doesn't help here, because it's not currently stored in the snapshot. 2) I'm not entirely sure yet how to handle the two flags MyXactMadeTempRelUpdates and MyXactMadeXLogEntry, and the MyLastRecPtr variable. 
Those seem to be made partly redundant by this patch - checking whether an xact has a permanent xid assigned already tells whether the transaction made any writes. (1) could be easily solved by querying the list of currently active RIDs after taking the reference snapshot. But since AFAIK HOT introduces a new method for guaranteeing that a transaction won't use an index that might not contain all tuples that xact is interested in, I wanted to get feedback on how HOT currently handles this. It's probably not the best time to come up with new patches, since everybody seems to be busy working on getting 8.3 out. But this patch is a quite natural fallout of my work on read-only queries on PITR slaves, and I'd be very interested to know if the general direction this patch takes is deemed acceptable. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
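The on-demand assignment in the third bullet can be sketched in C. This is a hedged, single-backend toy under stated assumptions, not the patch itself: the names GetTopTransactionIdDemo, GetTopTransactionIdIfAnyDemo, topXid and nextXid are illustrative stand-ins for the real per-backend and shared-memory machinery.

```c
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Stand-ins for per-backend / shared state (illustrative only). */
static TransactionId topXid  = InvalidTransactionId;
static TransactionId nextXid = 100;   /* pretend counter, normally in shared memory */

/* Peek at the toplevel XID without forcing assignment. */
TransactionId GetTopTransactionIdIfAnyDemo(void)
{
    return topXid;
}

/* Assign an XID on first use only - the lazy-assignment idea:
 * a transaction that never writes never calls this, and so
 * never consumes an XID. */
TransactionId GetTopTransactionIdDemo(void)
{
    if (topXid == InvalidTransactionId)
        topXid = nextXid++;
    return topXid;
}
```

Repeated calls return the same XID; only the first call advances the counter, mirroring how the mail describes GetCurrentTransactionId behaving for subtransactions.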
Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: I've spent the last few days factoring out that work, and turning it into a general solution. The result is this patch, which basically does the following .) It defines a special TemporaryTransactionId that is used as an xact's xid until the xact calls GetCurrentTransactionId / GetTopTransactionId. [ squint... ] Why do you need that? The behavior we use with subtransactions is just to leave the XID as InvalidOid until there's a reason to assign it, and I think we should do the same with top-level XIDs. Having an extra TemporaryTransactionId seems ugly, mainly because it's not clear how XID comparison should handle it. I invented the TemporaryTransactionId because it seemed worthwhile to distinguish between backends which do not currently run a transaction (xid == InvalidTransactionId), and those which run a transaction that doesn't yet have an xid assigned (xid == TemporaryTransactionId). Currently, the TemporaryTransactionId is treated as later than any other xid value. I'm not wedded to those TemporaryTransactionIds though - they just seemed like a good idea when I started with my readonly-queries on PITR-slaves work, and it allows for a few more assertions. To leave XID at 0, you will need to teach GetSnapshotData and maybe some other places that a proc could be advertising nonzero xmin even when its XID is still 0. This does not seem like a big problem though. Yeah - TemporaryTransactionId removed the need for a few special cases in that area - but at the cost of having to distinguish between TransactionIdIsValid and TransactionIdIsPermanent (meaning valid && !temporary). .) Each transaction gets an "rid" (ResourceOwnerId) assigned when it starts, and obtains a lock on that rid, similar to how the xid is locked. This can be used to wait for a transaction's toplevel resource owner to release all its locks, and serves as a unique identifier for a running transaction. 
This seems like inventing a concept we could do without, also overhead we could do without (assigning globally unique RIDs would require as much mechanism and contention as XID assignment does). What about locking the backend's PID instead? I do not see that long-term uniqueness is needed, we just want to be able to wait for the backend's current transaction to end. If you do think it's important to distinguish the other guy's current and next transaction (which maybe it is), then possibly we could lock a combination of the PID and a *local*, per-backend transaction counter (there should be plenty of room in LOCKTAG for this). This counter value would have to be advertised in PGPROC, but there wouldn't be any contention involved to assign a new value. I wanted some (single) value that would fit into some standard C datatype. Since I guess using "long long" in postgres code is a bad idea (it's not supported on all 32-bit platforms, I think), I wanted to come up with some 32-bit identifier. If the PID were guaranteed to be 16-bit we could use the other 16 bits as a counter - but all modern Unixen have moved past a limit of 65535 processes, I fear... While writing this mail I realized that my RID generation algorithm - while being quite lightweight, I think - has a small race condition. The algorithm is for(;;) { rid = ShmemVariableCache->nextRid++; if (ResourceOwnerIdIsValid(rid) && ResourceOwnerLockTableInsert(rid)) break; } I just realized that if two backends manage to obtain the same rid, and one is then paged out long enough for the other to lock the rid, run its transaction and commit, then the second backend will get the same rid :-(. So it's back to the drawing board anyway... It's slightly annoying that this scheme involves taking two lmgr locks per read-write transaction. I wonder whether we couldn't dispense with the notion of locking one's XID per se. 
This would mean that where we try to wait for another transaction by XID, we have to trawl through the ProcArray to find that XID and see what PID/localID it maps to; but if we're in that path we're already going to be blocking, so more cycles there might be a good tradeoff for fewer cycles in transaction start. Yeah - I do not really like that dual-locking thing either. But it makes prepared transaction handling much easier - if we were to only lock the RID, we'd have to store the rid<->xid mapping for prepared transactions somewhere *and* guarantee that we won't assign that RID to another transaction - even after a server restart... 1) The second waiting phase of concurrent index builds fails to wait for xacts that haven't been assigned an XID when the reference snapshot was taken. The "rid" doesn't help here, because it's not currently s
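The unlocked "nextRid++" race discussed above can be replayed deterministically by splitting the increment into its read and write halves. This is a single-threaded sketch under the assumption that the shared-memory counter can be reduced to a plain global; simulate_race, racy_read and racy_writeback are invented names for illustration.

```c
#include <stdint.h>

static uint32_t nextRid = 1;   /* stand-in for ShmemVariableCache->nextRid */

/* The racy "rid = nextRid++" split into its read and write halves,
 * so a bad interleaving can be replayed deterministically. */
static uint32_t racy_read(void)               { return nextRid; }
static void     racy_writeback(uint32_t seen) { nextRid = seen + 1; }

/* Both backends read before either writes back: they obtain the same
 * rid, which is the duplicate-rid scenario the mail describes. */
void simulate_race(uint32_t *ridA, uint32_t *ridB)
{
    uint32_t a = racy_read();   /* backend A reads nextRid */
    uint32_t b = racy_read();   /* backend B reads the same value */
    racy_writeback(a);          /* A writes back */
    racy_writeback(b);          /* B overwrites with the same value */
    *ridA = a;
    *ridB = b;
}
```

The point of the sketch is only that without a lock (or an atomic fetch-and-add) around the increment, two backends can come away holding the same rid, and the subsequent ResourceOwnerLockTableInsert retry loop cannot repair that once the first holder has already committed and released its lock.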
Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions
Tom Lane wrote: I wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Yeah - I do not really like that dual-locking thing either. But it makes prepared transaction handling much easier - if we were to only lock the RID, we'd have to store the rid<->xid mapping for prepared transactions Hmmm, that's a good point. Not sure how to include prepared xacts in the scheme. After further thought I think we should avoid depending on PIDs as part of a lock tag --- their behavior is too system-dependent, in particular we have no right to assume that when a backend exits the same PID won't be reassigned shortly after, possibly leading to confusion. Instead, I suggest that we keep a session counter in shared memory and have each backend assign itself a session ID at startup using that. A 32-bit session ID in combination with a 32-bit locally assigned transaction number should be sufficiently unique to identify a transaction (prepared or otherwise) for the purposes of locking. These "transient XIDs" only need to be unique as long as the transaction exists plus shortly thereafter (to avoid race conditions when someone waits for a transaction that actually terminated a moment before). So wraparound of the counters isn't a problem, although we probably want to reserve zero as an invalid value. I think we only need a transient XID of this sort for top-level transactions, not subtransactions. Sounds good, if we decide to go with the transient XID idea. See below for an alternative that I just came up with. To make CREATE INDEX CONCURRENTLY work, we'd need two things: * GetLockConflicts would need to report the transient XIDs of the conflicting xacts, not regular XIDs, since they might not have regular XIDs. Then we'd wait on those locks instead of regular-XID locks. Yes. This is exactly what my patch does today. * The second phase where we wait out transactions that can still see old tuples doesn't work because such transactions won't necessarily be listed in the snapshot. 
Instead, what we have to do is look through the ProcArray for transactions whose advertised xmin is less than the xmax of our reference snapshot. When we find one, wait for it using its transient XID. AFAICT, C.I.C. is currently the only place in the system where we really need transient XIDs at all. Everyplace else that we need to wait for a transaction, it's because we found its regular XID in a tuple we want to lock or modify. So the whole thing is a bit annoying. Maybe we could get rid of the extra overhead with some shenanigans inside the lock manager, like not bothering to create a data structure representing the holding of a transient-XID lock until such time as C.I.C. actually tries to wait for it. But again, that seems like a second-pass optimization. I've given some thought to that. There are two distinct things we need to be able to wait for 1) Until all current holders of a lock whose grantmask conflicts with a given locklevel have dropped their lock 2) Until all currently in-use snapshots have an xmin larger than some given value (The xmax of the reference snapshot). (1) Could be solved directly in the lock manager. We'd need some mechanism to wake up a process whenever someone releases a certain lock. (2) Could be done by acquiring a ShareLock (with a new locktype LOCKTYPE_XMIN) on the xmin of a transaction's serializable snapshot when it's created. The second waiting phase of concurrent index builds would then be a) Find the oldest xmin in the ProcArray. b) If that xmin is equal to or greater than the xmax of our reference snapshot, we're done. c) Wait until the ExclusiveLock (for LOCKTYPE_TRANSACTION) is released on that xmin. After that point, new transactions will compute an xmin greater than the oldest one we found in the ProcArray, because the limiting transaction has exited, and because ReadNewTransactionId returns a value greater than that xmin too (Otherwise, we'd have exited in (b)). 
d) Wait for all current holders of LOCKTYPE_XMIN to release their locks. (Using the machinery needed for (1)). No new holders can show up, because new snapshots will compute a larger xmin. e) Goto a). I could code (2), but I'd need help with (1) - the details of the locking subsystem are still somewhat of a mystery to me. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
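Steps (a) and (b) of the loop above reduce to a scan over the advertised xmins. The following is a minimal sketch under simplifying assumptions: the ProcArray is replaced by a plain array, the lock waits of steps (c)-(e) are left out, and plain integer comparison stands in for the wraparound-aware TransactionIdPrecedes the real code would need; the function names are invented.

```c
#include <stdint.h>

/* Step (a): find the oldest advertised xmin. The array stands in for
 * the ProcArray; an empty array means no snapshot is currently in use. */
uint32_t oldest_xmin(const uint32_t *xmins, int n)
{
    uint32_t oldest = UINT32_MAX;
    for (int i = 0; i < n; i++)
        if (xmins[i] < oldest)
            oldest = xmins[i];
    return oldest;
}

/* Step (b): the second waiting phase is finished once every in-use
 * snapshot's xmin is at or past the reference snapshot's xmax.
 * Otherwise the real code would wait on the limiting xmin's lock
 * (steps (c)-(e)) and rescan. */
int waiting_phase_done(const uint32_t *xmins, int n, uint32_t ref_xmax)
{
    return oldest_xmin(xmins, n) >= ref_xmax;
}
```

With no in-use snapshots the scan trivially reports "done", which matches the intent of step (b): nothing can still see tuples older than the reference snapshot's xmax.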
Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Sounds good, if we decide to go with the transient XID idea. See below for an alternative that I just came up with. This proposal appears to require taking and releasing a brand-new lock type every time a snapshot is made or destroyed. That is certainly not going to be less overhead than the transient-XID scheme. At least in READ COMMITTED mode, there are normally multiple snapshots taken per transaction. Only for the serializable snapshot, I'd have thought, making the overhead bearable (it is surely the oldest of all the xact's snapshots) but ... (Something worth noting here is that I expect soon, probably 8.4, we will fix things so that what a backend advertises in MyProc->xmin is the xmin of its oldest still-live snapshot. That means that xmin will change intra-transaction in READ COMMITTED mode, and thus that we would indeed need to take and release the sort of lock you are suggesting each time.) ... with this in mind, my proposal looks pretty bad :-( What do you think about solving the requirements of the *first* waiting phase (Where we wait for current ShareLock holders) inside the lock manager? The only real downside I can see is that I feel uneasy about messing with that code... It seems to be subtle, and quick to anger ;-) For the second phase, I see two options .) Go forward with the transient XIDs / RIDs approach .) Do something similar to the indcreatexid idea the HOT patch implements. This essentially puts the burden of deciding an index's usability on the using xact, not on the one creating the index. greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: What do you think about solving the requirements of the *first* waiting phase (Where we wait for current ShareLock holders) inside the lock manager? The only real downside I can see is that I feel uneasy about messing with that code... It seems to be subtle, and quick to anger ;-) It sounded pretty dubious to me. The problem is that we don't want to wait until we could actually *get* the lock, we only want to wait out the conflicting xacts that existed when we looked. This is important because there might be a steady stream of new xacts acquiring conflicting locks (ie, a steady stream of writers), and we don't want to either block them, or have to hope for a window where there are none. But the lock manager does not currently track who acquired a lock when, and I think it would add a lot of usually-wasted overhead to do that. I spent a fair amount of time yesterday trying to think of alternative solutions, and didn't really think of much. The core reason why C.I.C. is implemented the way it is is that it's entirely possible that there will be a deadlock with some other process (ie, the other process is old enough that we must wait for it, but it's blocked trying to acquire some lock that conflicts with our ShareUpdateExclusiveLock). Thus, waiting by trying to acquire XID locks is a good idea because it automatically detects deadlock, and may even be able to escape the deadlock by wait-queue rearrangement. (I'm not certain that the latter is applicable in any situation C.I.C. would get into, but I'm not certain it's not, either.) Schemes involving "sleep awhile and check the ProcArray again" are right out because they fail to detect deadlock. Everything else I could think of involved special new lockmanager features that would have to still preserve the ability to handle deadlocks, which didn't sound like something I want to tackle for this. 
So on the whole the extra transaction identifier seems to be the way to go. I haven't looked at how that interacts with HOT though. Ok, I'll update the patch to use your global id plus local id idea then, and remove TemporaryTransactionIds. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Hi When we lazily assign XIDs, we gain another flag beside the existing MyXactMadeTempRelUpdates, MyXactMadeXLogEntry, MyLastRecPtr and smgr's pendingDeletes to tell what kind of actions a transaction performed. Adding TransactionIdIsValid(GetCurrentTransactionIdIfAny()) on top of that makes things quite impenetrable - at least for me. So I'm trying to wrap my head around that logic, and simplify it a bit if possible. (Nowadays, async commit even adds a bit more complexity) Currently, we write out a COMMIT record if a transaction either created any transaction-controlled XLOG entries (MyLastRecPtr.xrecoff != 0), or scheduled the deletion of files on commit. Afterwards, the xlog is flushed to the end of the last record created by the session (ProcLastRecEnd) if the transaction created any xlog record at all. If we either made transaction-controlled XLOG entries, or temporary relation updates, we update the XID status in the CLOG. An ABORT record is currently created if a transaction either created transaction-controlled XLOG entries or scheduled the deletion of files on abort. If we scheduled file deletions, we flush the XLOG up to the ABORT record. If we either made transaction-controlled XLOG entries, updated temporary relations, or scheduled deletions, we update the XID status in the CLOG. For subtransaction commit, a COMMIT record is emitted if we either made transaction-controlled XLOG entries, or updated temporary relations. No XLOG flush is performed. Subtransaction ABORTs are handled the same way as regular transaction aborts. For toplevel transaction commits, we defer flushing the xlog if synchronous_commit = off, and we didn't schedule any file deletions. Now, with lazy XID assignment I believe the following holds true .) If we didn't assign a valid XID, we cannot have made transaction-controlled XLOG entries (Can be checked by asserting that the current transaction id is valid if XLOG_NO_TRAN isn't set in XLogInsert). .) 
We cannot have scheduled files for deletion (on either COMMIT or ABORT) if we don't have a valid XID, since any such deletion will happen together with a catalog update. Note that this is already assumed to be true for subtransactions, since they only call RecordSubTransaction{Commit|Abort} if they have an XID assigned. I propose to do the following in my lazy XID assignment patch - can anyone see a hole in that? .) Get rid of MyLastRecPtr and MyXactMadeTempRelUpdates. Those are superseded by TransactionIdIsValid(GetCurrentTransactionIdIfAny()). .) Get rid of MyXactMadeXLogEntry. Instead, just reset ProcLast .) Rename ProcLastRecEnd to XactLastRecEnd, and reset when starting a new toplevel transaction. Transaction COMMIT: Write a COMMIT record if and only if we have a valid XID. Then, flush the XLOG to XactLastRecEnd if that is set, and synchronous_commit=on. Afterwards, update the CLOG if and only if we have a valid XID. Transaction ABORT: Write an ABORT record if and only if we have a valid XID. Then, flush the XLOG to XactLastRecEnd if that is set, and we scheduled on-abort deletions. Update the CLOG if and only if we have a valid XID. Subtransaction COMMIT: Update the CLOG if and only if we have a valid XID. Subtransaction ABORT: Write an ABORT record if and only if we have a valid XID. Then, flush the XLOG to XactLastRecEnd if that is set, and we scheduled on-abort deletions. Update the CLOG if and only if we have a valid XID. I think we might go even further, and *never* flush the XLOG on abort, since if we crash just before, the abort won't log anything either. But if we leak the leftover files in such a case, that's probably a bad idea. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
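The proposed end-of-transaction rules condense into four boolean predicates. This is a sketch that only restates the proposal's logic; XactDemoState and the predicate names are invented, and the real patch would of course consult the actual backend state rather than a struct of flags.

```c
#include <stdbool.h>

/* Invented summary of the state the proposal keys off. */
typedef struct XactDemoState
{
    bool has_valid_xid;           /* TransactionIdIsValid(GetCurrentTransactionIdIfAny()) */
    bool made_xlog_entries;       /* XactLastRecEnd is set */
    bool scheduled_abort_deletes; /* on-abort file deletions pending */
    bool synchronous_commit;      /* the synchronous_commit setting */
} XactDemoState;

/* COMMIT: write a record iff we have a valid XID ... */
bool commit_writes_record(const XactDemoState *s) { return s->has_valid_xid; }

/* ... and flush iff we wrote any xlog and synchronous_commit is on. */
bool commit_flushes_xlog(const XactDemoState *s)
{
    return s->made_xlog_entries && s->synchronous_commit;
}

/* ABORT (toplevel or sub): record iff valid XID; flush only when
 * we wrote xlog and on-abort deletions were scheduled. */
bool abort_writes_record(const XactDemoState *s) { return s->has_valid_xid; }
bool abort_flushes_xlog(const XactDemoState *s)
{
    return s->made_xlog_entries && s->scheduled_abort_deletes;
}
```

The interesting property falls out directly: a read-only transaction (no valid XID, no xlog) writes no COMMIT record, flushes nothing, and touches neither the CLOG nor the disk at commit.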
[HACKERS] Representation of ResourceOwnerIds (transient XIDs) in system views (lazy xid assignment)
Hi Since generating transient XIDs (named ResourceOwnerIDs in my patch, since their lifetime is coupled to the lifetime of a transaction's toplevel resource owner) seems to be the way to go for lazy xid assignment, I need to find a way to represent them in the pg_locks view. ResourceOwnerIds are a structure composed of two uint32s, a processID (could be the PID of the backend, but to make sure that it isn't reused too quickly, it's actually a synthetic ID generated at backend start), and localTransactionId which is just incremented whenever a new transaction is started in a backend. This design was the result of my discussion with Tom - its main advantage is that it needs no lock to generate a new ResourceOwnerId. I see 4 possibilities to represent this in system views A) Make ResourceOwnerID a full-blown type, with in and out methods, very similar to tids. "processId/localTransactionId" would be a natural string representation. B) Just convert the ResourceOwnerId into a string in pg_lock_status. Looks quite similar to (A) from a user's point of view, but the implementation is much shorter. C) Combine the two uint32 fields of ResourceOwnerId into an int8. Might be more efficient than (B). The main disadvantage is that some ResourceOwnerIds will be represented by *negative* integers, which is pretty ugly. D) Just make them two int4 fields. This has the same "negativity" issue that (C) has, and might cause confusion if users don't read the docs carefully. I'm leaning towards (A), but it adds a lot of new code (although most of it would be copied nearly 1-to-1 from tid.c) for maybe too little gain. If (A) is deemed not appropriate, doing (C) and restricting processIds to < 0x80000000 might be an option. greetings, Florian Pflug ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
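Option (C)'s arithmetic, and the sign problem it mentions, can be checked directly. A minimal sketch; rid_to_int8 is an illustrative name, and int64_t stands in for the SQL int8.

```c
#include <stdint.h>

/* Option (C): fold the two uint32 halves into one int8 (int64) value,
 * (2^32) * processId + localTransactionId. Once the processId's high
 * bit is set, the reinterpreted signed value goes negative - the
 * "pretty ugly" case the mail complains about. */
int64_t rid_to_int8(uint32_t processId, uint32_t localXid)
{
    return (int64_t) (((uint64_t) processId << 32) | (uint64_t) localXid);
}
```

Restricting processIds to below 2^31 keeps the high bit clear, which is exactly why the mail suggests that restriction as a companion to option (C).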
Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: I propose to do the following in my lazy XID assignment patch - can anyone see a hole in that? One comment is that at the time we make an entry into smgr's pending-deletes list, I think we might not have acquired an XID yet --- if I understand your patch correctly, a CREATE TABLE would acquire an XID when it makes its first catalog insertion, and that happens after creating the on-disk table file. So it seems like a good idea for smgr itself to trigger acquisition of an XID before it makes a pending-deletes entry. This ensures that you can't have a situation where you have deletes to record and no XID; otherwise, an elog between smgr insertion and catalog insertion would lead to just that. I wonder a bit about the whole special-casing of COMMITs/ABORTs with pending delete, though. A crash might always leave stray files around, so there ought to be a way to clean them up anyway. Still, for now I'll go with your suggestion, and force XID assignment in the smgr. .) Rename ProcLastRecEnd to XactLastRecEnd, and reset when starting a new toplevel transaction. I'm not very happy with that name for the variable, because it looks like it might refer to the last transaction-controlled record we emitted, rather than the last record of any type. Don't have a really good suggestion though --- CurXactLastRecEnd is the best I can do. Hm.. don't have a good suggestion, either - the reason I want to rename it is that ProcLastRecEnd doesn't sound like it'd be reset at transaction start. One thought here is that it's not clear that we really need a concept of transaction-controlled vs not-transaction-controlled xlog records anymore. In CVS HEAD, the *only* difference no_tran makes is whether to set MyLastRecPtr, and you propose removing that variable. 
This seems sane to me --- the reason for having the distinction at all was Vadim's plan to implement transaction UNDO by scanning its xlog records backwards, and that idea is as dead as a doornail. So we could simplify matters conceptually if we got rid of any reference to such a distinction. I've been thinking about keeping XLOG_NO_TRAN, and doing if (!no_tran) Assert(TransactionIdIsValid(GetCurrentTransactionIdIfAny())) in xlog.c as a safety measure. We can't make that assertion unconditionally, I think, because nextval() won't force XID assignment, but might do XLogInsert. greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Representation of ResourceOwnerIds (transient XIDs) in system views (lazy xid assignment)
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Since generating transient XIDs (named ResourceOwnerIDs in my patch, since their lifetime is coupled to the lifetime of a transaction's toplevel resource owner) seems to be the way to go for lazy xid assignment, I need to find a way to represent them in the pg_locks view. This is going very far towards gilding the lily. Try to avoid loading the patch down with a new datatype. I'm inclined to think that it'd be sufficient to show the high half of the ID (that is, the session number) in pg_locks, because there will never be cases where there are concurrently existing locks on different localTransactionIds. Hm.. I'm not too happy with that. If you, for example, join pg_locks to pg_stat_activity (which would need to show the RID too), then you *might* get a bogus result if a transaction ends and a new one starts on the same backend between the time pg_lock_status is called, and the time the proc array is read. This could probably be displayed in the transactionID columns, which would mean we're abusing the user-visible xid datatype, but I don't see much harm in it. I'm even more unhappy with that, because the session id of a RID might coincide with a currently in-use XID. What about the following. .) Remove the right-hand side XID from pg_locks (The one holder or waiter of the lock). It seems to make more sense to store a RID here, and let the user fetch the XID via a join to pg_stat_activity. We could also show both the XID (if set) and the RID, but that might lead people to believe that their old views or scripts on top of pg_locks still work correctly when they actually do not. .) On the left-hand side (The locked object), add a RID column of type int8, containing (2^32)*sessionID + localTransactionId. .) To prevent the int8 from being negative, we limit the sessionID to 31 bits - which is still more than enough. 
greetings, Florian Pflug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: Tom Lane wrote: One thought here is that it's not clear that we really need a concept of transaction-controlled vs not-transaction-controlled xlog records anymore. I've been thinking about keeping XLOG_NO_TRAN, and doing if (!no_tran) Assert(TransactionIdIsValid(GetCurrentTransactionIdIfAny())) in xlog.c as a safety measure. Why do you think this is a safety measure? All that it is checking is whether the caller has preserved an entirely useless distinction. The real correctness property is that you can't write your XID into a heap tuple or XLOG record if you haven't acquired an XID, but that seems nearly tautological. I was confused. I wanted to protect against the case where an XID hits the disk, but doesn't show up in any xl_xid field, and therefore might be reused after crash recovery. But of course, to make that happen you'd have to actually *store* the XID into the data you pass to XLogInsert, which is kind of hard if you haven't asked for it first. So, I now agree, XLOG_NO_TRAN should be buried. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] int8 & INT64_IS_BUSTED
Hi I'm confused about whether int8s work on a machine on which INT64_IS_BUSTED. My reading of the code suggests that int8 will be available, but be, well, busted on such a machine. For example, int8mul seems as if it'd just return the wrong answer on such a machine. Or are platforms with INT64_IS_BUSTED no longer supported, and are all those #ifdefs only legacy code? Please enlighten a poor linux+gcc user who can't remember ever using a compiler without a "long long" datatype after leaving TurboC under DOS. greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Representation of ResourceOwnerIds (transient XIDs) in system views (lazy xid assignment)
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: What about the following. .) Remove the right-hand side XID from pg_locks (The one holder or waiter of the lock). It seems to make more sense to store a RID here, Yeah, we have to do that since there might not *be* an XID holding the lock. But I still think the session ID would be sufficient here. (Perhaps we don't need the PID either, although then we'd need to change pg_stat_activity to provide session id as a join key...) Yeah, the PID seems to be redundant if we add the RID. But OTOH it does no harm to leave it there - other than the xid, which gives a false sense of security. Don't know what our policy for system-catalog backwards-compatibility is, though... .) On the left-hand side (The locked object), add a RID column of type int8, containing (2^32)*sessionID + localTransactionId. I'm a bit uncomfortable with that since it renders the view completely useless if you don't have a working int8 type. Yeah, I only now realized that int8 really *is* busted if INT64_IS_BUSTED is defined. I always thought that there is some kind of emulation code in place, but apparently there isn't. :-( So there goes this idea .) To prevent the int8 from being negative, we limit the sessionID to 31 bytes - which is still more then enough. Hmm ... actually, that just begs the question of how many bits we need at all. Could we display, say, 24 bits of sessionID and 8 bits of localXID merged into a column of nominal XID type? There's a theoretical risk of false join matches but it seems pretty theoretical, and a chance match would not break any system functionality anyway since all internal operations would be working with full-width counters. Hm.. If we go down that router, we could just calculate some hash value from sessionID and localTransactionId that fits into 31 bits, and use an int4. Or 32 bits, and use xid. I am, however a bit reluctant to do this. 
I'd really hate to spend a few hours tracking down some locking problem, only to find out that I'd been looking in the wrong place because of some id aliasing... I know it's only a 1-in-4-billion chance, but it still gives me an uneasy feeling.

What about a string representation? Something like sessionId/localTransactionId? Should we ever decide that this indeed *should* get its own datatype, a string representation would allow for a very painless transition...

greetings, Florian Pflug

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] int8 & INT64_IS_BUSTED
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: I'm confused about whether int8s work on a machine on which INT64_IS_BUSTED is defined. My reading of the code suggests that int8 will be available, but be, well, busted on such a machine. The datatype exists, but it's really only int32. For example, int8mul seems as if it'd just return the wrong answer on such a machine. Well, obviously it's gonna overflow sooner than you'd think, but it will give valid answers as long as you never try to compute a value that doesn't fit in int32; and it will correctly complain if you do.

I still think int8mul is buggy. It calculates the result as arg1 * arg2, and then checks for an overflow by dividing again and seeing if the right answer comes out. Which sounds good. But it *skips* that check if both arguments fit into an int32 - the check is (arg1 == (int64) ((int32) arg1) && arg2 == (int64) ((int32) arg2)). Which for INT64_IS_BUSTED seems to be equivalent to (arg1 == arg1 && arg2 == arg2), and thus the check will never fire in that case. I didn't test this, though - so maybe I'm just reading it wrong.

greetings, Florian Pflug
Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Tom Lane wrote: One comment is that at the time we make an entry into smgr's pending-deletes list, I think we might not have acquired an XID yet --- if I understand your patch correctly, a CREATE TABLE would acquire an XID when it makes its first catalog insertion, and that happens after creating the on-disk table file. So it seems like a good idea for smgr itself to trigger acquisition of an XID before it makes a pending-deletes entry. This ensures that you can't have a situation where you have deletes to record and no XID; otherwise, an elog between smgr insertion and catalog insertion would lead to just that.

Hm.. I was just going to implement this, but I'm now wondering if that's really worth it. For smgrcreate, this would catch the following case:
.) CREATE something
.) smgrcreate: Creates the file, and puts it onto the delete-on-abort list
.) We elog() *before* acquiring an XID
.) RecordTransactionAbort or RecordSubTransactionAbort: We don't write an ABORT record.
.) We crash *before* actually deleting the file

Compare the probability of that happening (the elog *and* the crash) with the probability of:
.) CREATE something
.) smgrcreate: Creates the file
.) We crash *before* we have the chance to commit or abort.

The window in which a crash causes us to leak the file seems to be much wider in the second case, yet forcing XID assignment will not help to prevent it, unless I'm overlooking something.

In the smgrunlink case, there is no reason at all to force XID assignment, because if we abort or crash, we don't want to unlink anyway, and if we survive until we commit, we'll assign the XID during the inevitable catalog update.
The only thing the forced XID assignment would buy is the ability to stick

if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) Assert(nrels == 0);

into the various Record{Sub|}Transaction{Commit|Abort} functions.

So unless I'm overlooking something, I believe for now it's best to ignore this issue, and to do a proper fix in the long run that removes *all* possible leakages.

greetings, Florian Pflug
Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Tom Lane wrote: "Florian G. Pflug" <[EMAIL PROTECTED]> writes: One comment is that at the time we make an entry into smgr's pending-deletes list, I think we might not have acquired an XID yet Hm.. I was just going to implement this, but I'm now wondering if that's really worth it. Basically what you'd give up is the ability to Assert() that there are no deletable files if there's no XID, which seems to me to be an important cross-check ... although maybe making smgr do that turns this "cross-check" into a tautology ... hmm. I guess the case that's bothering me is where we reach commit with deletable files and no XID. But that should probably be an error condition anyway, ie, we should error out and turn it into an abort. On the abort side we'd consider it OK to have files and no XID.

Seems reasonable to me. I've done that now, and it turned out nicely. There is an assertion on "(nrels == 0) || xid assigned" in the COMMIT path, but not in the ABORT path. Seems reasonable and safe. And I'm quite tempted to not flush the XLOG at all during ABORT, and to only force synchronous commits if one of the to-be-deleted files is non-temporary. The last idea widens the leakage window quite a bit, though, so maybe I should resist that temptation... OTOH, it'd allow asynchronous commits for transactions that created temporary tables.

The only way we could make this more robust is if we had a WAL-before-data rule for file *creation*, but I think that's not possible given that we don't know what relfilenode number we will use until we've successfully created a file. So there will always be windows where a crash leaks unreferenced files. There's been some debate about having crash recovery search for and delete such files, but so far I've resisted it on the grounds that it sounds like data loss waiting to happen --- someday it'll delete a file you wished it'd kept. It seems doable, but it's not pretty.
One possible scheme would be to emit a record *after* choosing a name but *before* creating the file, and then a second record when the file has actually been created successfully. Then, during replay, we could remember a list of xids and filenames, and remove those files for which we haven't seen a "created successfully" record, or whose creating xid has no COMMIT record. With this scheme, it'd be natural to force XID assignment in smgrcreate, because we'd actually depend on logging the xid there.

greetings, Florian Pflug