Re: [HACKERS] Ad-hoc table type?
On Sun, Sep 28, 2008 at 09:24:48PM -0700, David E. Wheeler wrote:
> On Sep 28, 2008, at 17:46, Tom Lane wrote:
>> BTW, I think it is (or should be) possible to create an index on hstore->'mycol', so at least one of the reasons why you should *need* to switch to a real database column seems bogus.
> [...] I'm not sure what that means. Can you create normal btree or hash indexes on hstore columns? And is the index useful for both `@` and `?`?

That means that those operations are supported by a GiST (or GIN) index; that is, "find the records where col contains 'foo = 1, bar = 2'" is supported by the index. Likewise for "is contained in" and "has key". It's a bit like having mini-indexes on all keys (although I guess not that efficient). Pretty cool, I'd say.

Regards
-- tomás

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
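A minimal sketch of the index support being described (table and column names invented; the operator names here follow the contrib/hstore module of that era, where `@` is "contains" and `?` is "has key"):

```sql
CREATE TABLE items (id serial PRIMARY KEY, attrs hstore);

-- a GiST (or GIN) index over the whole hstore column
CREATE INDEX items_attrs_idx ON items USING gist (attrs);

-- "find the records where attrs contains foo=>1, bar=>2" -- indexable
SELECT id FROM items WHERE attrs @ 'foo=>1, bar=>2';

-- "has key" -- also indexable
SELECT id FROM items WHERE attrs ? 'foo';
```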
Re: [HACKERS] Proposal: move column defaults into pg_attribute along with attacl
Hi,

Stephen Frost wrote:
> * Markus Wanner ([EMAIL PROTECTED]) wrote:
>> What does the subobject column for pg_shdepend buy us?
> Tracking column-level ACL dependencies rather than having those dependencies only be at the table-level. This complicates pg_shdepend some, but simplifies the dependency handling in the ACL area and in handling table/column drops.

With a separate table? Or as part of pg_attribute (which can handle NULL and VARLENA attributes now)?

Regards

Markus Wanner
Re: [HACKERS] Null row vs. row of nulls in plpgsql
2008/9/29 Tom Lane [EMAIL PROTECTED]:
> Greg Stark [EMAIL PROTECTED] writes:
>> On 27 Sep 2008, at 09:56 PM, Tom Lane [EMAIL PROTECTED] wrote:
>>> ISTM that the fundamental problem is that plpgsql doesn't distinguish properly between a null row value (eg, null::somerowtype) and a row of null values (eg, row(null,null,...)::somerowtype). When that code was designed, our main SQL engine was pretty fuzzy about the difference too, but now there is a clear semantic distinction.
>> IIRC the reason for this fuzziness came from the SQL spec definition of IS NULL for rows. As long as you maintain that level of spec-compliance I don't think there are any other important constraints on pg behaviour.
> I started to poke into this and found out that it was a bit subtler than I thought. It'd be possible to associate a rowisnull state value with a row variable, but the problem is that plpgsql treats the row fields as independent variables that can be accessed without touching the row. In particular you can assign null or nonnull values to individual fields. So consider:
>
>   -- presumably, this'll set rowisnull to TRUE:
>   rowvar := NULL;
>   -- this had better cause rowisnull to become FALSE:
>   rowvar.field1 := 42;
>   -- does this cause it to become TRUE again?
>   rowvar.field1 := NULL;

This sequence is wrong. In SQL, a row variable behaves like a pointer: when you want to fill in rowvar, you should call a constructor first. Something like:

  rowvar := NULL;              -- null value
  rowvar := constructor(null);
  rowvar := constructor();
  rowvar.field := 42;

regards
Pavel Stehule

> There are a bunch of implementation problems with making any such behavior happen, since the row field variables don't currently know that they are members of a row, and indeed it's possible for the same variable to be a member of more than one row. But the core issue is that this interaction seems to fuzz the distinction between "row is null" and "all the row's elements are null".
> In particular, if you think that rowisnull should be TRUE after the above sequence, then I think you are saying they are the same thing. So maybe the spec authors are smarter than we are. Thoughts? What would a consistent behavior look like?
>
> regards, tom lane
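The spec rule being referenced can be sketched in plain SQL (the composite type here is invented for illustration):

```sql
CREATE TYPE tworow AS (a int, b int);

SELECT (NULL::tworow) IS NULL;            -- true: a null row value
SELECT (ROW(NULL, NULL)::tworow) IS NULL; -- also true per the SQL spec,
                                          -- which is exactly the fuzziness
                                          -- under discussion
SELECT (ROW(42, NULL)::tworow) IS NULL;     -- false
SELECT (ROW(42, NULL)::tworow) IS NOT NULL; -- also false: IS NOT NULL
                                            -- requires every field non-null
```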
Re: [HACKERS] Proposal: move column defaults into pg_attribute along with attacl
* Markus Wanner ([EMAIL PROTECTED]) wrote:
> Stephen Frost wrote:
>> * Markus Wanner ([EMAIL PROTECTED]) wrote:
>>> What does the subobject column for pg_shdepend buy us?
>> Tracking column-level ACL dependencies rather than having those dependencies only be at the table-level. This complicates pg_shdepend some, but simplifies the dependency handling in the ACL area and in handling table/column drops.
> With a separate table? Or as part of pg_attribute (which can handle NULL and VARLENA attributes now)?

As part of pg_attribute.. Having a separate table would be an alternative to adding a column to pg_shdepend.

Stephen
Re: [HACKERS] Proposal: move column defaults into pg_attribute along with attacl
Hi,

Stephen Frost wrote:
> As part of pg_attribute.. Having a separate table would be an alternative to adding a column to pg_shdepend.

Aha. Hm... I thought tracking dependencies between tables and attributes complicates DROP TABLE? Why doesn't that concern apply here?

And why do we keep the attribute defaults in their own table with their own OID, instead of merging them into pg_attribute? (Or put the other way around: why do these need their own dependency tracking, while the ACLs don't?)

Or do we just want to keep the column-level privileges patch simple here and defer other work to another patch?

Regards

Markus Wanner
Re: [HACKERS] Ad-hoc table type?
> What you're talking about is a document-based database like StrokeDB or CouchDB. With hstore you don't need to parse the content of the 'aggregate' column; it provides the necessary methods. Also, we tried to speed up selects using indexes. Probably we need to refresh our interest in hstore. Do you have any actual proposals?

Proposals, not at this point. I'm trying to decide (a) if I have the time and (b) whether I do it with Postgres or SQLite. The hstore module, as I said, looks really cool; I've contemplated something like it. I have a module that provides a set of accessors for an XML text column that works similarly, but it parses the XML on each access and the application has to create the XML. (I have XML creation modules for Java, PHP, C++, and standard C bindings.)

It is more a conflict of data ideology, IMHO. There is a class of data that is logically on the same level as other data, but is forced into a secondary storage methodology. It isn't a pressing need, as there are workarounds, but don't you think a cleaner interface makes sense? Also, what is the overhead of the secondary storage mechanism? I think it would make the life of application developers easier.

Oleg

On Sun, 28 Sep 2008, [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] writes:
>> Something like this:
>>   create adhoc table foo ();
>>   insert into foo (name, rank, serial) values ('joe', 'sargent', '42');
>> In an ad-hoc table type, when an insert is made and a column is not found, then a new varchar column is added. I know the idea has a lot of holes, and is probably a bad idea, but it answers an important problem of easily mapping programmatic types to a database.
> Seems like a table with one contrib/hstore column might be more relevant to this guy's idea of how to do database design.

That's actually a very cool module, I hadn't seen it before. I've considered writing something like it, but more XML-centric, but I'm not sure it answers the concept.
I'm not sure if you have dealt with web site sessions and object persistence crap, but it's a pain to get up and running, and improving performance is a drag. Web guys tend to know very little about databases and tend, sadly, not to be very inquisitive about such things. Web session and user attribute objects are typically stored in a database as XML, JSON, or some other aggregated format in a single column (hstore). That works great when you just need to access the data by key, but if you want to use the data outside the web application for something like OLAP, you have to decide which attributes reside in the aggregate column and which get promoted to a full-fledged column. That's why you'll see tables with username, passwdhash, email, etc. in addition to an aggregated column of things like screen template, age, etc.

So, how do you have a table of a generally arbitrary number of columns without creating some sort of aggregate column? With an aggregate column, the data isn't on the same level as real column data, so you need to parse the aggregate to extract a value, and you have to do that for each value. On top of that, you then have to explain your aggregate strategy to the web guys. Being able to insert arbitrary named values, and extract them similarly, IMHO works better and more naturally than some external aggregate system built on a column. I know it is a little outside-the-box thinking; what do you think?
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
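The "aggregate column" pattern being debated might look like this with hstore (a sketch; table and key names are invented, and the `->` fetch operator is per the contrib module):

```sql
CREATE TABLE sessions (
    username   text PRIMARY KEY,
    passwdhash text,
    email      text,
    attrs      hstore    -- the "aggregate" column: template, age, ...
);

INSERT INTO sessions (username, passwdhash, email, attrs)
VALUES ('joe', '...', 'joe@example.com', 'template=>dark, age=>42');

-- extracting one attribute goes through the hstore accessor, which is
-- the "not on the same level as real column data" complaint above:
SELECT username, attrs -> 'template' FROM sessions;
```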
Re: [HACKERS] parallel pg_restore - WIP patch
Andrew Dunstan wrote:
> Andrew Dunstan wrote:
>>> this works better but there is something fishy still - using the same dump file I get a proper restore using pg_restore normally. If I however use -m for a parallel one I only get parts (in this case only 243 of the 709 tables) of the database restored ...
>> Yes, there are several funny things going on, including some stuff with dependencies. I'll have a new patch tomorrow with luck. Thanks for testing.
> OK, in this version a whole heap of bugs are fixed, mainly those to do with dependencies and saved state. I get identical row counts in the source and destination now, quite reliably.

this looks much better (for a restore that usually takes 180min I can get down to 72min using -m 4) - however especially with higher concurrency I'm sometimes running into restore failures due to deadlocks happening during constraint restoration (slightly redacted):

pg_restore: [archiver (db)] Error from TOC entry 7765; 2606 1460743180 FK CONSTRAINT fk_av_relations_av db_owner
pg_restore: [archiver (db)] could not execute query: ERROR: deadlock detected
DETAIL: Process 18100 waits for AccessExclusiveLock on relation 1460818342 of database 1460815284; blocked by process 18103. Process 18103 waits for AccessExclusiveLock on relation 1460818336 of database 1460815284; blocked by process 18100.
HINT: See server log for query details.
ALTER TABLE ONLY foo ADD CONSTRAINT fk_av_relations_av FOREIGN KEY (vs_id) REFERENCES bar ...
Re: [HACKERS] parallel pg_restore - WIP patch
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
> pg_restore: [archiver (db)] could not execute query: ERROR: deadlock detected
> DETAIL: Process 18100 waits for AccessExclusiveLock on relation 1460818342 of database 1460815284; blocked by process 18103. Process 18103 waits for AccessExclusiveLock on relation 1460818336 of database 1460815284; blocked by process 18100.
> HINT: See server log for query details.
> ALTER TABLE ONLY foo ADD CONSTRAINT fk_av_relations_av FOREIGN KEY (vs_id) REFERENCES bar ...

Hmm, I'll bet the restore code doesn't realize that this can't run in parallel with index creation on either table ...

regards, tom lane
Re: [HACKERS] Proposal: move column defaults into pg_attribute along with attacl
Markus Wanner [EMAIL PROTECTED] writes:
> And why do we keep the attribute defaults in their own table with their own OID, instead of merging them into pg_attribute?

That has already been explained multiple times in this thread, but: the default expression is a separate entity from the attribute itself, so there needs to be some different representation for it in pg_depend. Otherwise we couldn't handle the concept that dropping some entity (like a function) forces discarding of the default, not the whole column the default is attached to. Now admittedly giving it its own OID and classid = pg_attrdef is probably not the only way to do that. But merging it into the pg_attribute row leaves no obvious way to do it within the object-identity representation that's been chosen for pg_depend.

> (Or put the other way around: why do these need their own dependency tracking, while the ACLs don't?)

pg_shdepend is already designed to track ACLs: an ACL dependency says that there's some privilege that this role has been granted on this object. So as long as you can identify the object you're okay; you don't need a separate identity for the ACL.

> Or do we just want to keep the column-level privileges patch simple here and defer other work to another patch?

Stephen was arm-waving about getting rid of pg_attrdef, but trying to hold the column privileges patch hostage to that would be a serious error. It's an independent problem, so it ought to be addressed in a separate patch; and it has no clear solution, so it's not entirely obvious that it can or should be done at all.

regards, tom lane
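The "dropping some entity forces discarding of the default, not the column" behavior can be illustrated directly (a sketch; the function and table names are invented):

```sql
CREATE FUNCTION next_code() RETURNS int LANGUAGE sql AS 'SELECT 1';
CREATE TABLE t (id int, code int DEFAULT next_code());

-- The default expression -- a pg_attrdef entry with its own OID --
-- depends on next_code(), so dropping the function with CASCADE
-- discards only the default, not the column it was attached to:
DROP FUNCTION next_code() CASCADE;
-- t.code still exists, but now has no default
```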
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
On Sun, 2008-09-28 at 21:16 -0400, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
> It does nothing AFAICS for the problem that when restarting archive recovery from a restartpoint, it's not clear when it is safe to start letting in backends. You need to get past the highest LSN that has made it out to disk, and there is no good way to know what that is.
>> AFAICS when we set minRecoveryLoc we *never* unset it. It's recorded in the controlfile, so whenever we restart we can see that it has been set previously and now we are beyond it.
> Right ...
>> So if we crash during recovery and then restart *after* we reached minRecoveryLoc then we resume in safe mode almost immediately.
> Wrong.

OK, I see where you're coming from now. A solution is needed, I agree.

> What minRecoveryLoc is is an upper bound for the LSNs that might be on-disk in the filesystem backup that an archive recovery starts from. (Defined as such, it never changes during a restartpoint crash/restart.) Once you pass that, the on-disk state as modified by any dirty buffers inside the recovery process represents a consistent database state. However, the on-disk state alone is not guaranteed consistent. As you flush some (not all) of your shared buffers you enter other not-certainly-consistent on-disk states. If we crash in such a state, we know how to use the last restartpoint plus WAL replay to recover to another state in which disk + dirty buffers are consistent. However, we reach such a state only when we have read WAL to beyond the highest LSN that has reached disk --- and in recovery mode there is no clean way to determine what that was. Perhaps a solution is to make XLogFlush not be a no-op in recovery mode, but have it scribble a highest-LSN somewhere on stable storage (maybe scribble on pg_control itself, or maybe better someplace else). I'm not totally sure about that. But I am sure that doing nothing will be unreliable.

No need to write highest LSN to disk constantly...
If we restart from a restartpoint then initially the current apply LSN will be potentially/probably earlier than the latest on-disk LSN, as you say. But once we have completed the next restartpoint *after* the value pg_control says, then we will be guaranteed that the two LSNs are the same, since otherwise we would have restarted at a later point.

That kinda works, but the problem is that restartpoints are time-based, not log-based. We need them to be deterministic for us to rely upon them in the above way. If we crash and then replay, we can only be certain we are safe when we have found a restartpoint that the previous recovery will definitely have reached. So we must have log-based restartpoints, using either a constant LSN offset, or a parameter like checkpoint_segments. But if it is changeable then it needs to be written into the control file, so we don't make a mistake about it.

So we need to:
* add an extra test to delay the safe point if required
* write the restart_segments value to the control file
* force a restartpoint on the first valid checkpoint WAL record after we have passed restart_segments worth of log

-- Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] parallel pg_restore - WIP patch
Tom Lane wrote:
> Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
>> pg_restore: [archiver (db)] could not execute query: ERROR: deadlock detected
>> DETAIL: Process 18100 waits for AccessExclusiveLock on relation 1460818342 of database 1460815284; blocked by process 18103. Process 18103 waits for AccessExclusiveLock on relation 1460818336 of database 1460815284; blocked by process 18100.
>> HINT: See server log for query details.
>> ALTER TABLE ONLY foo ADD CONSTRAINT fk_av_relations_av FOREIGN KEY (vs_id) REFERENCES bar ...
> Hmm, I'll bet the restore code doesn't realize that this can't run in parallel with index creation on either table ...

Yeah. Of course, it's never needed to bother with stuff like that till now. The very simple fix is probably to run a separate parallel cycle just for FKs, after the index creation. A slightly more elegant fix would probably be to add dependencies from each index that might cause this to the FK constraint. I'll work on the first for now.

Is there any chance that the locks we're taking here are too strong? Intuitively it looks a bit like it.

cheers

andrew
Re: [HACKERS] parallel pg_restore - WIP patch
Andrew Dunstan [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> Hmm, I'll bet the restore code doesn't realize that this can't run in parallel with index creation on either table ...
> Yeah. Of course, it's never needed to bother with stuff like that till now. The very simple fix is probably to run a separate parallel cycle just for FKs, after the index creation.

Um, FKs could conflict with each other too, so that by itself isn't gonna fix anything.

regards, tom lane
Re: [HACKERS] [REVIEW] Prototype: In-place upgrade v02
Abbas napsal(a):
> Hi, I have gone through the following stuff:
> 1) previous emails on the patch
> 2) http://wiki.postgresql.org/wiki/In-place_upgrade
> 3) http://www.pgcon.org/2008/schedule/attachments/57_pg_upgrade_2008.pdf
> 4) http://wiki.postgresql.org/wiki/In-place_upgrade:Storage
> Here is what I have understood so far (correct me if I am wrong): The on-disk representation of data has changed from version to version over the years. For some strange reason (performance, maybe) the newer versions of pg were not backwards compatible, meaning that the newer version would not read data written by an older version if the on-disk representation had changed in between. The end user would be required to port the data stored using an older version to the newer version's format using offline import/export. This project aims at upgrades from older to newer versions on the fly. The on-disk representation is not the only change that the system should accommodate; it should also accommodate catalog changes, conf file changes etc.

That is correct.

> Of the available design choices I think you have chosen to go with on-line data conversion, meaning that pg would now be aware of all the previous page layouts and, based on a switch on page version, would handle each page layout. This will only be done to read old data; newer data will be written in the newer format.

Yes.

> I am supposed to test the patch, and for that I have downloaded pg versions 7.4, 8.0, 8.1, 8.2 and 8.3. I plan to create a data directory using each of the versions and then try to read the same using 8.4 with your patch applied.

That does not work. The patch is only a prototype. It contains a framework for implementing old page layout versions, and it contains a partial version 3. The main purpose of this prototype is to make a decision on whether this approach is acceptable or not, or whether some part of it is acceptable -- it contains, for example, a useful page API rework and implementation which is (in my opinion) useful in general.
> What database objects should I create in the test database? Should I just create objects of my choice? Do the sizes (both length and breadth) of tables matter?

These tests do not make sense at this moment. I have a test script (created by Nidhi) for the catalog upgrade already done. However, it currently uses Sun's internal framework. I will modify it and release it.

> Do I have to perform performance tests too?

Yes, please. My colleague tested it and got a 5% performance drop, but that was not the complete version. I tested the full patch on Friday and it was a surprise for me ... I got slightly better throughput (about 0.5%) with the patch. I'm going to retest it, but it would be good to get results from others too.

thanks

Zdenek

> Regards Abbas
>
> On Fri, 2008-09-19 at 14:28 +0200, Zdenek Kotala wrote:
>> thanks
>> Abbas napsal(a):
>>> Even with that, a hunk failed for bufpage.c, but I applied that part manually to move on. Regards Abbas
>>> On Thu, 2008-09-18 at 12:17 +0200, Zdenek Kotala wrote:
>>>> Abbas napsal(a):
>>>>> Hi, I downloaded the latest postgresql source code from git clone git://git.postgresql.org/git/postgresql.git and tried to apply the patch http://archives.postgresql.org/pgsql-hackers/2008-09/gza1fGXLvf3L.gz It does not apply cleanly, see the failures in the attached file.
>>>> It clashes with the hash index patch which was committed four days ago. Try to use a slightly older revision from git (without the hash index modification).
>>>> Zdenek

-- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
Simon Riggs [EMAIL PROTECTED] writes:
> ... That kinda works, but the problem is that restartpoints are time-based, not log-based. We need them to be deterministic for us to rely upon them in the above way.

Right, but the performance disadvantages of making them strictly log-distance-based are pretty daunting. We don't really want slaves doing that while they're in catchup mode.

regards, tom lane
Re: [HACKERS] Proposal: move column defaults into pg_attribute along with attacl
Hi,

thank you for your patience in explaining. Rest assured that I've read the relevant messages multiple times.

Tom Lane wrote:
> the default expression is a separate entity from the attribute itself,

That was the point I didn't understand...

> .. Otherwise we couldn't handle the concept that dropping some entity (like a function) forces discarding of the default, not the whole column the default is attached to.

...and that finally provided the missing piece for my puzzle: it's not the dependency between the attribute and its default which matters here, but the possible dependencies of the default (and not the attribute itself) on other entities.

> Now admittedly giving it its own OID and classid = pg_attrdef is probably not the only way to do that. But merging it into the pg_attribute row leaves no obvious way to do it within the object identity representation that's been chosen for pg_depend.

Understood, makes sense.

> pg_shdepend is already designed to track ACLs: an ACL dependency says that there's some privilege that this role has been granted on this object. So as long as you can identify the object you're okay, you don't need a separate identity for the ACL.

Sure.

> Stephen was arm-waving about getting rid of pg_attrdef, but trying to hold the column privileges patch hostage to that would be a serious error. It's an independent problem, so it ought to be addressed in a separate patch; and it has no clear solution so it's not entirely obvious that it can or should be done at all.

Agreed.

Regards

Markus Wanner
Re: [HACKERS] parallel pg_restore - WIP patch
Tom Lane wrote:
> Andrew Dunstan [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> Hmm, I'll bet the restore code doesn't realize that this can't run in parallel with index creation on either table ...
> Um, FKs could conflict with each other too, so that by itself isn't gonna fix anything.

Good point. Looks like we'll need to make a list of "can't run in parallel with" items as well as strict dependencies.

cheers

andrew
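A toy sketch of the scheduling idea (this is not pg_restore's actual code; item names, lock sets, and the batch-based approach are all invented for illustration): each restore item carries both hard dependencies and a set of relations it must lock exclusively, and items sharing a lock target never run in the same parallel batch.

```python
def schedule(items, deps, locks):
    """Group restore items into parallel batches.

    items: list of item names, in dump order.
    deps:  name -> set of prerequisite names (strict dependencies).
    locks: name -> set of relations the item locks exclusively
           ("can't run in parallel with" information).
    Items within one returned batch may safely run concurrently.
    """
    done = set()
    batches = []
    remaining = list(items)
    while remaining:
        batch, held = [], set()
        for item in remaining:
            ready = deps.get(item, set()) <= done
            # refuse to co-schedule items whose exclusive locks collide
            conflict_free = locks.get(item, set()).isdisjoint(held)
            if ready and conflict_free:
                batch.append(item)
                held |= locks.get(item, set())
        if not batch:
            raise RuntimeError("unsatisfiable dependencies")
        batches.append(batch)
        done |= set(batch)
        remaining = [i for i in remaining if i not in done]
    return batches
```

With this rule, the FK on (foo, bar) from the deadlock report above would simply be deferred until no index build holds either table, and two FKs touching the same table would land in different batches.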
Re: [HACKERS] parallel pg_restore - WIP patch
Andrew Dunstan [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> Um, FKs could conflict with each other too, so that by itself isn't gonna fix anything.
> Good point. Looks like we'll need to make a list of "can't run in parallel with" items as well as strict dependencies.

Yeah, I was just thinking about that. The current archive format doesn't really carry enough information for this. I think there are two basic solutions we could adopt:

* Extend the archive format to provide some indication that "restoring this object requires exclusive access to these dependencies".

* Hardwire knowledge into pg_restore that certain types of objects require exclusive access to their dependencies.

The former seems more flexible, as well as more in tune with the basic design assumption that pg_restore shouldn't have a lot of knowledge about individual archive object types. But it would mean that you couldn't use parallel restore with any pre-8.4 dumps. In the long run that's no big deal, but in the short run it's annoying.

Another angle is that it's not clear what happens if the need for exclusive access changes over time. You were just speculating about reducing the lock strength required for ALTER TABLE ADD FOREIGN KEY. I don't know if that's workable or not, but certainly reducing the lock strength for some types of ALTER TABLE might be in our future. Contrarily, we don't currently try hard to lock any non-table objects (schemas, functions, etc) while building dependent objects; but that's obviously not really right, and someday we might decide to fix it. So having pg_dump prepare the list of exclusive dependencies at dump time might be the wrong thing --- it would reflect the behavior of the source server version, not the target, which is what matters.

Thoughts?

regards, tom lane
Re: [HACKERS] FSM rewrite: doc changes
Tom Lane wrote:
> FreeSpaceMapTruncateRel seems to have a bug in its early-exit test: in the case where the number of FSM blocks stays the same, it fails to zero out slots in the last block. I also think it's got an off-by-one problem in figuring the number of FSM blocks: for the normal case where the new heap end is in the middle of a FSM block, shouldn't new_nfsmblocks be one larger than it is? The case where nblocks is an exact multiple of SlotsPerFSMPage would need to be special-cased to be exactly correct, though I see no real harm in letting the FSM be left one page too big in that case. fsm_truncate_avail seems quite broken: it's clearing the whole page always.

Yep, I noticed these myself on Friday after sending the patch.

> I do not like the kluge in heap_xlog_clean one bit, and think it's unnecessary anyway since we are not relying on the FSM to be accurate. Suggest reverting the heapam.c changes except for heap_sync().

The point is to have a more up-to-date FSM after recovery, PITR recovery in a warm standby server in particular. I'll take it out for now, but it needs more discussion. In fact, I think we should update the FSM even more aggressively, on inserts and updates as well as vacuums. Probably not on all inserts and updates, though, to keep the overhead minimal, but there's a tradeoff somewhere in between.

> The new search algorithm in fsm_search_avail still doesn't work. Consider what happens when the target is the rightmost slot on the page; it certainly won't wrap properly.

Crap, you're right.

> In fsm_rebuild_page, surely we needn't check if (lchild < NodesPerPage).

Yes, we do. There can be a number of completely unused upper nodes on the right. The upper levels of the tree are complete, but the bottom level is missing enough nodes to make room for the page header. So the tree looks something like this:

        X
    X       X
  X   X   X   X
 X X X X X X . . .

where "." is a missing node. The parents that are missing both children will always be zero.
> This test in fsm_space_needed_to_cat:
>
>   if (needed >= (FSM_CATEGORIES - 1) * FSM_CAT_STEP)
>       elog(ERROR, "invalid FSM request size");
>
> reveals a rather fundamental problem: it is clearly possible for this test to fail on valid request sizes, because the page header overhead is less than FSM_CAT_STEP (especially if BLCKSZ is more than 8K). I'm not sure about a really clean solution here. We could offset the needed_to_cat and avail_to_cat calculations so that category 255 corresponds exactly to the maximum possible free space, but that requires assuming that FSM knows exactly what that is, which is a bit unpleasant. Thoughts?

Hmph. The other alternative is to use 2 bytes instead of one per page, and track the free space exactly. But I'd rather not do that just to deal with the very special case of huge requests.

Or we could just return -1 instead of throwing an error. Requests higher than the limit would then always have to extend the heap. That's not good, but I think we already have that problem for tuples of exactly MaxHeapTupleSize bytes. Since PageGetFreeSpace subtracts the size of a new line pointer, only newly extended pages that have never had any tuples on them have enough space, as determined by PageGetFreeSpace, to fit a tuple of MaxHeapTupleSize bytes.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
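The quantization scheme under discussion can be sketched as follows (an illustrative model only, not the patch's actual code; the constants are plausible values for an 8K page, and the rounding directions are the important part -- free space rounds down, requests round up, so a page's category never overstates what it can hold):

```python
BLCKSZ = 8192
FSM_CATEGORIES = 256
FSM_CAT_STEP = BLCKSZ // FSM_CATEGORIES   # 32 bytes per category

def avail_to_cat(avail):
    """Map actual free bytes on a page to a one-byte category,
    rounding down (never promise more than is there)."""
    return min(avail // FSM_CAT_STEP, FSM_CATEGORIES - 1)

def needed_to_cat(needed):
    """Map a request size to the smallest category certain to satisfy
    it, rounding up.  Requests at or above the top category cannot be
    represented -- the overflow case debated above; here we return -1
    (caller extends the relation) instead of raising an error."""
    if needed >= (FSM_CATEGORIES - 1) * FSM_CAT_STEP:
        return -1
    return -(-needed // FSM_CAT_STEP)     # ceiling division
```

Note the conservative gap this creates: a page with, say, 100 free bytes lands in category 3, while a 100-byte request asks for category 4, so small deltas of free space never flip a page's category and force parent updates.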
Re: [HACKERS] FSM rewrite: doc changes
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: In fsm_rebuild_page, surely we needn't check if (lchild < NodesPerPage). Yes, we do.

But the loop starting point is such that you must be visiting a parent with at least one child, no?

reveals a rather fundamental problem: it is clearly possible for this test to fail on valid request sizes, because the page header overhead is less than FSM_CAT_STEP (especially if BLCKSZ is more than 8K). I'm not sure about a really clean solution here.

Hmph. The other alternative is to use 2 bytes instead of one per page, and track the free space exactly. But I'd rather not do that just to deal with the very special case of huge requests.

Yeah, I thought about that too. It's got another problem besides the sheer space cost: it would result in a whole lot more update traffic for upper levels of the tree. The quantization of possible values in the current design is good because it avoids updates of parents for relatively small deltas of free space.

Or we could just return -1 instead of throwing an error. Requests higher than the limit would then always have to extend the heap. That's not good, but I think we already have that problem for tuples of exactly MaxHeapTupleSize bytes. Since PageGetFreeSpace subtracts the size of a new line pointer, only newly extended pages that have never had any tuples on them have enough space, as determined by PageGetFreeSpace, to fit a tuple of MaxHeapTupleSize bytes.

That seems like something we'll want to fix sometime, rather than hardwiring into the FSM design. I suppose an alternative possibility is to set MaxHeapTupleSize at 255/256's of a block by definition, so that no request will ever exceed what the FSM stuff can handle. But I'm sure that'd make somebody unhappy --- somewhere out there is a table with tuples wider than that. Probably the least bad alternative here is to allow FSM's category scaling to depend on MaxHeapTupleSize. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
On Mon, 2008-09-29 at 08:46 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: ... That kinda works, but the problem is that restartpoints are time based, not log based. We need them to be deterministic for us to rely upon them in the above way. Right, but the performance disadvantages of making them strictly log-distance-based are pretty daunting. We don't really want slaves doing that while they're in catchup mode. I don't think we need to perform restartpoints actually, now I think about it. It's only the LSN that is important. I think we can get away with writing the LSN value to disk, as you suggested, but only every so often. No need to do it after every WAL record, just consistently every so often, so it gives us a point at which we know we are safe. We will need to have Startup process block momentarily while the value is written. Propose Startup process writes/flushes LSN to pg_control every time we change xlogid. That's independent of WAL file size and fairly clear. When we reach that LSN + 1 we will know that no LSNs higher than that value can have reached disk. OK? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallel pg_restore - WIP patch
On Monday 29 September 2008, Tom Lane wrote: * Extend the archive format to provide some indication that restoring this object requires exclusive access to these dependencies. * Hardwire knowledge into pg_restore that certain types of objects require exclusive access to their dependencies.

Well, it seems to me that currently the FK needs in terms of existing indexes and locks, and some other object lock needs, are all hardwired. Is it even safe to consider having the locks needed for certain commands not be hardwired? Provided I'm not all wrong here, I don't see how having something more flexible at restore time than at build time is a win. The drawback is that whenever you change a lock need in commands, you have to remember teaching pg_restore about it too. So my vote here is in favor of hardwired knowledge in pg_restore, matching target server code assumptions and needs.

Regards, -- dim
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
Simon Riggs [EMAIL PROTECTED] writes: I think we can get away with writing the LSN value to disk, as you suggested, but only every so often. No need to do it after every WAL record, just consistently every so often, so it gives us a point at which we know we are safe. Huh? How does that make you safe? What you need to know is the max LSN that could possibly be on disk. Hmm, actually we could get away with tying this to fetching WAL files from the archive. When switching to a new WAL file, write out the *ending* WAL address of that file to pg_control. Then process the WAL records in it. Whether or not any of the affected pages get to disk, we know that there is no LSN on disk exceeding what we already put in pg_control. If we crash and restart, we'll have to get to the end of this file before we start letting backends in; which might be further than we actually got before the crash, but not too much further because we already know the whole WAL file is available. Or is that the same thing you were saying? The detail about using the end address seems fairly critical, and you didn't mention it... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
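The scheme Tom sketches above reduces to a small invariant: record the fetched segment's *ending* address before replaying it, and refuse to admit backends until replay has passed that recorded point. The following is a self-contained toy of that idea; the type, names, and 16 MB segment size are assumptions for illustration, not PostgreSQL's actual control-file code:

```c
#include <stdint.h>

/* Sketch of the safe-restart-point scheme under discussion.  Before
 * replaying a WAL segment fetched from the archive, the slave records the
 * segment's *ending* address as the minimum recovery point: no page
 * flushed while replaying that segment can carry an LSN beyond it. */
#define WAL_SEG_SIZE (16 * 1024 * 1024)   /* assumed 16 MB segments */

typedef struct
{
    uint64_t    min_recovery_point;   /* stand-in for a pg_control field */
} ControlFileSketch;

/* A segment's end address is the start of the next segment. */
static uint64_t
segment_end_lsn(uint64_t seg_no)
{
    return (seg_no + 1) * (uint64_t) WAL_SEG_SIZE;
}

/* Called when a new segment is fetched, before any of it is replayed. */
static void
note_segment_fetched(ControlFileSketch *cf, uint64_t seg_no)
{
    cf->min_recovery_point = segment_end_lsn(seg_no);
    /* ...here the real code would write and fsync pg_control... */
}

/* After a crash and restart, backends may only be admitted once replay
 * has reached the recorded point, even if the previous run crashed
 * somewhere in the middle of the segment. */
static int
ok_to_admit_backends(const ControlFileSketch *cf, uint64_t replayed_up_to)
{
    return replayed_up_to >= cf->min_recovery_point;
}
```

This also shows why the end address (rather than the start) matters: recording the start would leave a window where flushed pages carry LSNs beyond what pg_control promises.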
[HACKERS] Fatal Errors
Is it possible to have a FATAL error that crashes a backend and for it to *not* have written an abort WAL record for any previously active transaction? I think yes, but haven't managed to create this situation while testing for it. If we either *always* write a WAL record, or PANIC then that makes some coding easier, so seems sensible to check. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallel pg_restore - WIP patch
Dimitri Fontaine wrote: On Monday 29 September 2008, Tom Lane wrote: * Extend the archive format to provide some indication that restoring this object requires exclusive access to these dependencies. * Hardwire knowledge into pg_restore that certain types of objects require exclusive access to their dependencies. Well, it seems to me that currently the FK needs in terms of existing indexes and locks, and some other object lock needs, are all hardwired. Is it even safe to consider having the locks needed for certain commands not be hardwired? Provided I'm not all wrong here, I don't see how having something more flexible at restore time than at build time is a win. The drawback is that whenever you change a lock need in commands, you have to remember teaching pg_restore about it too. So my vote here is in favor of hardwired knowledge in pg_restore, matching target server code assumptions and needs.

Well, I've had to use some knowledge of various item types already, and I have been trying not to disturb pg_dump also, so I'm inclined to build this knowledge into pg_restore. ISTM that things that will have lock conflicts are different and more target-version-dependent than things that logically depend on other things, so we can still rely on pg_dump to some extent to provide the latter while building the former at restore time.

cheers

andrew

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fatal Errors
Simon Riggs [EMAIL PROTECTED] writes: Is it possible to have a FATAL error that crashes a backend and for it to *not* have written an abort WAL record for any previously active transaction? Well, a FATAL error will still go through transaction abort before exiting, IIRC. The problem case is a PANIC or an actual core dump. If we either *always* write a WAL record, or PANIC then that makes some coding easier, Like what? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fatal Errors
On Mon, 2008-09-29 at 10:30 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Is it possible to have a FATAL error that crashes a backend and for it to *not* have written an abort WAL record for any previously active transaction? Well, a FATAL error will still go through transaction abort before exiting, IIRC. The problem case is a PANIC or an actual core dump. If we either *always* write a WAL record, or PANIC then that makes some coding easier, Like what? For constructing snapshots during standby. I need a data structure where emulated-as-running transactions can live. If backend birth/death is intimately tied to WAL visible events then I can use dummy PGPROC structures. If not, then I will have to create a special area that can expand to cater for the possibility that a backend dies and WAL replay won't know about it - which also means I would need to periodically dump a list of running backends into WAL. PANIC isn't a problem case because we'll end up generating a shutdown checkpoint which shows the backends have been terminated. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
On Mon, 2008-09-29 at 10:13 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: I think we can get away with writing the LSN value to disk, as you suggested, but only every so often. No need to do it after every WAL record, just consistently every so often, so it gives us a point at which we know we are safe. Huh? How does that make you safe? What you need to know is the max LSN that could possibly be on disk. Hmm, actually we could get away with tying this to fetching WAL files from the archive. When switching to a new WAL file, write out the *ending* WAL address of that file to pg_control. Then process the WAL records in it. Whether or not any of the affected pages get to disk, we know that there is no LSN on disk exceeding what we already put in pg_control. If we crash and restart, we'll have to get to the end of this file before we start letting backends in; which might be further than we actually got before the crash, but not too much further because we already know the whole WAL file is available. Or is that the same thing you were saying? The detail about using the end address seems fairly critical, and you didn't mention it...

Same! I just said the safe point was LSN + 1, and the end is the next start. Looks like we've got a solution, no matter how it's described. (I actually have a more detailed proof of safety using snapshot/MVCC considerations, so I wasn't overly worried, but what we've discussed is much easier to understand and agree on. Proof of safety is all we need, and this simpler proof is more secure.)

Don't want to make it per file though. Big systems can whizz through WAL files very quickly, so we either make it a big number, e.g. 255 files per xlogid, or we make it settable (and recorded in pg_control).

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fatal Errors
Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2008-09-29 at 10:30 -0400, Tom Lane wrote: Like what? For constructing snapshots during standby. I need a data structure where emulated-as-running transactions can live. If backend birth/death is intimately tied to WAL visible events then I can use dummy PGPROC structures. If not, then I will have to create a special area that can expand to cater for the possibility that a backend dies and WAL replay won't know about it - which also means I would need to periodically dump a list of running backends into WAL. Mph. I find the idea of assuming that there must be an abort record to be unacceptably fragile. Consider the possibility that the transaction gets an error while trying to run AbortTransaction. Some of that code is a CRITICAL_SECTION, but I don't think I like the idea that all of it has to be one. PANIC isn't a problem case because we'll end up generating a shutdown checkpoint which shows the backends have been terminated. Thought you were trying to get rid of the shutdown checkpoint during restart? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2008-09-29 at 10:13 -0400, Tom Lane wrote: ... If we crash and restart, we'll have to get to the end of this file before we start letting backends in; which might be further than we actually got before the crash, but not too much further because we already know the whole WAL file is available. Don't want to make it per file though. Big systems can whizz through WAL files very quickly, so we either make it a big number e.g. 255 files per xlogid, or we make it settable (and recorded in pg_control). I think you are missing the point I made above. If you set the okay-to-resume point N files ahead, and then the master stops generating files so quickly, you've got a problem --- it might be a long time until the slave starts letting backends in after a crash/restart. Fetching a new WAL segment from the archive is expensive enough that an additional write/fsync per cycle doesn't seem that big a problem to me. There's almost certainly a few fsync-equivalents going on in the filesystem to create and delete the retrieved segment files. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fatal Errors
On Mon, 2008-09-29 at 11:18 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2008-09-29 at 10:30 -0400, Tom Lane wrote: Like what? For constructing snapshots during standby. I need a data structure where emulated-as-running transactions can live. If backend birth/death is intimately tied to WAL-visible events then I can use dummy PGPROC structures. If not, then I will have to create a special area that can expand to cater for the possibility that a backend dies and WAL replay won't know about it - which also means I would need to periodically dump a list of running backends into WAL. Mph. I find the idea of assuming that there must be an abort record to be unacceptably fragile. Consider the possibility that the transaction gets an error while trying to run AbortTransaction. Some of that code is a CRITICAL_SECTION, but I don't think I like the idea that all of it has to be one.

Aware of the possible fragility, hence the post. A few thoughts:

* Is it close enough that we can get away with having a few spare slots to cater for that possibility?
* Might we make AbortTransaction critical just as far as the END_CRIT_SECTION after XLogInsert in RecordTransactionAbort(), but no further?

I don't expect a yes, but it seems worth recording the thoughts.

PANIC isn't a problem case because we'll end up generating a shutdown checkpoint which shows the backends have been terminated.

Thought you were trying to get rid of the shutdown checkpoint during restart?

Yes, but if I do there would still be a WAL record of some kind there to allow us to confirm the change of tli. Anyway, I thought you wanted me to keep it now?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCHES] [HACKERS] Infrastructure changes for recovery
On Mon, 2008-09-29 at 11:24 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2008-09-29 at 10:13 -0400, Tom Lane wrote: ... If we crash and restart, we'll have to get to the end of this file before we start letting backends in; which might be further than we actually got before the crash, but not too much further because we already know the whole WAL file is available. Don't want to make it per file though. Big systems can whizz through WAL files very quickly, so we either make it a big number e.g. 255 files per xlogid, or we make it settable (and recorded in pg_control). I think you are missing the point I made above. If you set the okay-to-resume point N files ahead, and then the master stops generating files so quickly, you've got a problem --- it might be a long time until the slave starts letting backends in after a crash/restart. Fetching a new WAL segment from the archive is expensive enough that an additional write/fsync per cycle doesn't seem that big a problem to me. There's almost certainly a few fsync-equivalents going on in the filesystem to create and delete the retrieved segment files. Didn't miss yer point, just didn't agree. :-) I'll put it at one (1) and then wait for any negative perf reports. No need to worry about things like that until later. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] CTE patch versus UNION type determination rules
Currently, the CTE patch assumes (without checking) that the output rowtype of a recursive WITH item is the same as the output rowtype determined by inspecting its non-recursive term. Unfortunately this is not always the case. Consider

WITH RECURSIVE q AS (
    SELECT int4_col FROM sometable
    UNION ALL
    SELECT int8_col FROM ... something referencing q ...
);

The output of this UNION will in fact be int8. However I see no way to determine that without performing parse analysis of the recursive term, and we can't do that without having assigned an output rowtype for q (else we have no idea what to do with the recursive reference to q). So it seems like we have to throw an error for this, and insist that the user explicitly do

WITH RECURSIVE q AS (
    SELECT int4_col::int8 FROM sometable
    UNION ALL
    SELECT int8_col FROM ... something referencing q ...
);

Can anyone see a way around that?

I'm inclined to go a bit further and have the code assume that the output typmods are all -1, even if some more-specific typmod can be determined from the non-recursive term. Otherwise you'd need to explicitly cast in situations like

WITH RECURSIVE q AS (
    SELECT varchar_10_col FROM sometable
    UNION ALL
    SELECT varchar_12_col FROM ... something referencing q ...
);

On the other hand this rule would lose typmod information even in cases where both UNION arms emit the same typmod, so maybe it's debatable. Comments?

Another point is that the patch assumes that the non-recursive term must be the left child of the topmost UNION operator. In SQL2008, 7.13 syntax rule 2.g.i.3 appears to allow either child to be the non-recursive term. However, rule 2.g.v.3 seems to say that a recursive query is considered expandable only if the left child is the non-recursive term, and in any case it'd be a bit odd to write a recursive query the other way; it seems more likely to be a mistake than intentional. Is everybody happy with making this restriction? 
If we don't make it, then we also have some issues with the output column names of the UNION not necessarily being what we derive from inspecting just the non-recursive term. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Ad-hoc table type?
On Sep 28, 2008, at 23:46, [EMAIL PROTECTED] wrote: I'm not sure what that means. Can you create normal btree or hash indexes on hstore columns? And is the index useful for both `@` and `?`? That means that those operations are supported by a GiST (or GIN) index, that is: find the records where col contains 'foo = 1, bar = 2' is supported by the index. Likewise for is contained in and has key. It's a bit like having mini-indexes on all keys (although I guess not that efficient). Pretty cool, I'd say. Yeah, that does sound good. I look forward to having an excuse for playing with this type… Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fatal Errors
Simon Riggs [EMAIL PROTECTED] writes: * Might we make AbortTransaction critical just as far as the END_CRIT_SECTION after XLogInsert in RecordTransactionAbort(), but no further? Don't expect yes, but seems worth recording thoughts. The problem is that pretty much everything that proc_exit runs would have to become critical, AFAICS. And a lot of that code is explicitly intended not to be critical --- that's why we split it up into multiple proc_exit callbacks. If one fails we pick up with the next, after a recursive call to elog(). In any case it is clear that there will be failure cases where an abort record cannot be written --- out of disk space for WAL being one obvious example. Are we sure that we can, or want to, guarantee that those all result in PANIC? (We do already PANIC on out of disk space for WAL, but I'm not so sure about generalizing that to any possible failure.) Thought you were trying to get rid of the shutdown checkpoint during restart? Yes, but if I do there would still be a WAL record of some kind there to allow us to confirm the change of tli. Anyway, I thought you wanted me to keep it now? No, I don't have a strong opinion one way or the other on that bit. But an ordinary crash and restart shouldn't generate a tli change. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
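The split-into-callbacks design described above (a failure in one exit callback must not stop the remaining ones from running) can be mimicked with setjmp/longjmp. This is only a rough sketch of the control flow, with invented names; it is not how PostgreSQL's proc_exit or elog actually look:

```c
#include <setjmp.h>

/* Toy model of exit processing split into independent callbacks: a
 * callback that "errors out" longjmps back to the dispatch loop, which
 * simply continues with the next callback, mimicking the recovery path
 * after a recursive call to elog(). */
#define MAX_EXIT_CB 8

typedef void (*exit_cb)(void);

static exit_cb callbacks[MAX_EXIT_CB];
static int ncallbacks = 0;
static jmp_buf exit_error_ctx;      /* stand-in for elog's error recovery */

static void
register_exit_cb(exit_cb cb)
{
    callbacks[ncallbacks++] = cb;
}

/* Run callbacks LIFO; a longjmp from a failing callback lands back here
 * and we pick up with the next one. */
static void
proc_exit_sketch(void)
{
    while (ncallbacks > 0)
    {
        exit_cb cb = callbacks[--ncallbacks];

        if (setjmp(exit_error_ctx) == 0)
            cb();
        /* on longjmp: fall through and continue with the next callback */
    }
}

/* Demo: three callbacks, the middle one fails. */
static int completed = 0;
static void cb_ok(void)   { completed++; }
static void cb_fail(void) { longjmp(exit_error_ctx, 1); }

static int
run_demo(void)
{
    register_exit_cb(cb_ok);
    register_exit_cb(cb_fail);
    register_exit_cb(cb_ok);
    proc_exit_sketch();
    return completed;       /* both cb_ok calls survive the failure */
}
```

The point of the structure, as Tom notes, is precisely that nothing here has to be a critical section: each callback's failure is contained by the dispatch loop.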
Re: [HACKERS] Fatal Errors
On Mon, 2008-09-29 at 12:14 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: * Might we make AbortTransaction critical just as far as the END_CRIT_SECTION after XLogInsert in RecordTransactionAbort(), but no further? Don't expect yes, but seems worth recording thoughts. The problem is that pretty much everything that proc_exit runs would have to become critical, AFAICS. And a lot of that code is explicitly intended not to be critical --- that's why we split it up into multiple proc_exit callbacks. If one fails we pick up with the next, after a recursive call to elog(). OK...next idea. If we didn't PANIC, then Postmaster knows about child death and fumbles around with the ProcArray. Will it be OK to simply WAL log ProcArrayAdd() and ProcArrayRemove()? Methinks postmaster can't do this. But might be able to ask somebody else to do it for him? The next person to run ProcArrayAdd() could be left a message to say last user of this proc index didn't clean up and we need to log it. That way we can WAL log the ProcArrayRemove() and the ProcArrayAdd() in one message. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_upgrade performance test
I ran a performance test on the in-place patch prototype which I sent for review, and I got a nice result:

Original:
---------
MQThL (Maximum Qualified Throughput LIGHT):  2202.12 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 4706.60 tpm
MQThH (Maximum Qualified Throughput HEAVY):  3956.64 tpm

TRANSACTION MIX
Total number of transactions = 330457
TYPE         TX. COUNT    MIX
----         ---------    ---
Light:       55053        16.66%
Medium:      117665       35.61%
DSS:         36825        11.14%
Heavy:       98916        29.93%
Connection:  21998         6.66%

RESPONSE TIMES   AVG.    MAX.    90TH
Light            0.093   1.080   0.400
Medium           0.096   1.663   0.400
DSS              0.092   0.990   0.020
Heavy            0.095   1.262   2.000
Connections      0.092   1.030   0.400

Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 17.154543

pg_upgrade:
-----------
MQThL (Maximum Qualified Throughput LIGHT):  2185.16 tpm
MQThM (Maximum Qualified Throughput MEDIUM): 4674.04 tpm
MQThH (Maximum Qualified Throughput HEAVY):  3915.28 tpm

TRANSACTION MIX
Total number of transactions = 327811
TYPE         TX. COUNT    MIX
----         ---------    ---
Light:       54629        16.66%
Medium:      116851       35.65%
DSS:         36367        11.09%
Heavy:       97882        29.86%
Connection:  22082         6.74%

RESPONSE TIMES   AVG.    MAX.    90TH
Light            0.095   1.073   0.400
Medium           0.098   1.167   0.400
DSS              0.093   1.049   0.010
Heavy            0.098   1.173   2.000
Connections      0.097   1.115   0.400

Number of users = 60
Sum of Avg. RT * TPS for all Tx. Types = 17.385362

It looks like there is only about a 1% performance gap, and it is good to mention that this is not an optimized version. It would be good if somebody ran a different performance test on it and verified my results. I used the iGen OLTP test with 60 concurrent users and ran it for 30 minutes.

Zdenek

-- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
KaiGai Kohei wrote: As I have repeated several times, SE-PostgreSQL applies the security policy of SELinux to achieve consistency in access controls. This feature enables restricting a client's privileges on accesses to database objects, as if it were accessing filesystem objects. Its background is our nightmare of web application flaws. The major purpose of this feature is to provide the most important component to run enterprise-class web applications with a least privilege set which is consistent across the whole system.

How important is this consistency goal in reality? We typically recommend that database applications run entirely in the database, for transaction integrity reasons and so on. Unless you are doing wild and fun things with server-side copy or untrusted procedural languages, there really shouldn't be that much use for consistency of access control between PostgreSQL and something else. In fact, on top of the transactional integrity criterion, having consistent access control is one of the reasons to have all your production data in the database system and nowhere else. Of course, this is an ideal state, and we all have to break it once in a while. But hence the honest question: how often does that really happen and to what extent, and does that justify the significant investment that is being proposed here?

In recent years, web application flaws have been a nightmare for us. A recent report said that web application flaws accounted for 95% of significant incidents in the first half of 2008 in Japan, and 75% of them were SQL injection. My ultimate goal is to help the situation with mandatory access control and a least privilege set for the whole of the LAPP stack.

As I had previously mentioned, we have to distinguish these two goals: consistent access controls and mandatory access controls. Then, how does MAC help with SQL injections? Using the existing role-based system you can already define least-privilege users that are essentially powerless even if SQL injections were to happen. 
I am not aware that important flaws or gaps in our role-based access control system have been pointed out that would make it impossible to create applications with security levels similar to those achievable with a MAC system. Now, if you come back to your original goal of consistency in access control, then it may in fact turn out that an FLASK/SELinux/SE-PostgreSQL-based system is the best common ground for such a system, but I am unconvinced that MAC by itself is necessary. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallel pg_restore - WIP patch
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: Um, FKs could conflict with each other too, so that by itself isn't gonna fix anything. Good point. Looks like we'll need to make a list of can't run in parallel with items as well as strict dependencies. Yeah, I was just thinking about that. The current archive format doesn't really carry enough information for this. I think there are two basic solutions we could adopt: * Extend the archive format to provide some indication that restoring this object requires exclusive access to these dependencies. * Hardwire knowledge into pg_restore that certain types of objects require exclusive access to their dependencies. The former seems more flexible, as well as more in tune with the basic design assumption that pg_restore shouldn't have a lot of knowledge about individual archive object types. But it would mean that you couldn't use parallel restore with any pre-8.4 dumps. In the long run that's no big deal, but in the short run it's annoying. hmm not sure how much of a problem that really is - we usually recommend to use the pg_dump version of the target database anyway. Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Peter, How important is this consistency goal in reality? It's actually the primary point of SE-Linux. Its audience wants a centralized policy manager which applies access policies to everything on the network, regardless of whether it's a file, a port, or a database. Oracle has not been consistent, and as a result the security geeks have been very frustrated with Oracle. --Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Peter Eisentraut wrote: The major purpose of this feature is to provide the most important component to run enterprise-class web applications with a least privilege set which is consistent across the whole system. How important is this consistency goal in reality? We typically recommend that database applications run entirely in the database, for transaction integrity reasons and so on. Unless you are doing wild and fun things with server-side copy or untrusted procedural languages, there really shouldn't be that much use for consistency of access control between PostgreSQL and something else. In fact, on top of the transactional integrity criterion, having consistent access control is one of the reasons to have all your production data in the database system and nowhere else. Of course, this is an ideal state, and we all have to break it once in a while. But hence the honest question: how often does that really happen and to what extent, and does that justify the significant investment that is being proposed here? Then, how does MAC help with SQL injections? Using the existing role-based system you can already define least-privilege users that are essentially powerless even if SQL injections were to happen. I am not aware that important flaws or gaps in our role-based access control system have been pointed out that would make it impossible to create applications with security levels similar to those achievable with a MAC system.

I think there are two goals here. At the SQL level, we will have per-role row and column permissions (which seem valuable on their own), and SE-PostgreSQL allows those permissions to be controlled at the operating system level rather than at the database level. I think your major question is how often you have users that you need to control at both the SQL _and_ operating system level. I guess the answer is that security policy suggests controlling things at the lowest level, and bubbling that security up into the database and applications. 
-- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
In the past I considered implementing polyinstantiated tables as part of SE-PostgreSQL, but it required an unacceptable scale of changes, so I dropped the idea. The TrustedSolaris folks would like polyinstantiation, but I don't know if they actually have anyone working on Postgres anymore. --Josh
Re: [HACKERS] parallel pg_restore - WIP patch
Stefan Kaltenbrunner wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: Um, FKs could conflict with each other too, so that by itself isn't gonna fix anything. Good point. Looks like we'll need to make a list of can't-run-in-parallel-with items as well as strict dependencies. Yeah, I was just thinking about that. The current archive format doesn't really carry enough information for this. I think there are two basic solutions we could adopt: * Extend the archive format to provide some indication that restoring this object requires exclusive access to these dependencies. * Hardwire knowledge into pg_restore that certain types of objects require exclusive access to their dependencies. The former seems more flexible, as well as more in tune with the basic design assumption that pg_restore shouldn't have a lot of knowledge about individual archive object types. But it would mean that you couldn't use parallel restore with any pre-8.4 dumps. In the long run that's no big deal, but in the short run it's annoying. Hmm, not sure how much of a problem that really is - we usually recommend using the pg_dump version of the target database anyway. We don't really need a huge amount of hardwiring as it turns out. Here is a version of the patch that tries to do what's needed in this area. 
cheers

andrew

Index: pg_backup.h
===================================================================
RCS file: /cvsroot/pgsql/src/bin/pg_dump/pg_backup.h,v
retrieving revision 1.47
diff -c -r1.47 pg_backup.h
*** pg_backup.h	13 Apr 2008 03:49:21 -0000	1.47
--- pg_backup.h	29 Sep 2008 23:34:57 -0000
***************
*** 123,128 ****
--- 123,130 ----
  	int			suppressDumpWarnings;	/* Suppress output of WARNING entries
  										 * to stderr */
  	bool		single_txn;
+ 	int			number_of_threads;
+ 	bool		truncate_before_load;
  	bool	   *idWanted;		/* array showing which dump IDs to emit */
  } RestoreOptions;
***************
*** 165,170 ****
--- 167,173 ----
  extern void CloseArchive(Archive *AH);
  
  extern void RestoreArchive(Archive *AH, RestoreOptions *ropt);
+ extern void RestoreArchiveParallel(Archive *AH, RestoreOptions *ropt);
  
  /* Open an existing archive */
  extern Archive *OpenArchive(const char *FileSpec, const ArchiveFormat fmt);

Index: pg_backup_archiver.c
===================================================================
RCS file: /cvsroot/pgsql/src/bin/pg_dump/pg_backup_archiver.c,v
retrieving revision 1.158
diff -c -r1.158 pg_backup_archiver.c
*** pg_backup_archiver.c	5 Sep 2008 23:53:42 -0000	1.158
--- pg_backup_archiver.c	29 Sep 2008 23:34:58 -0000
***************
*** 27,38 ****
--- 27,51 ----
  
  #include <unistd.h>
  
+ #include <sys/types.h>
+ #include <sys/wait.h>
+ 
  #ifdef WIN32
  #include <io.h>
  #endif
  
  #include "libpq/libpq-fs.h"
  
+ typedef struct _parallel_slot
+ {
+ 	pid_t		pid;
+ 	TocEntry   *te;
+ 	DumpId		dumpId;
+ 	DumpId		tdeps[2];
+ } ParallelSlot;
+ 
+ #define NO_SLOT (-1)
  
  const char *progname;
***************
*** 70,76 ****
--- 83,100 ----
  static void dumpTimestamp(ArchiveHandle *AH, const char *msg, time_t tim);
  static OutputContext SetOutput(ArchiveHandle *AH, char *filename, int compression);
  static void ResetOutput(ArchiveHandle *AH, OutputContext savedContext);
+ static bool work_is_being_done(ParallelSlot *slot, int n_slots);
+ static int	get_next_slot(ParallelSlot *slots, int n_slots);
+ static TocEntry *get_next_work_item(ArchiveHandle *AH, ParallelSlot *slots, int n_slots);
+ static void prestore(ArchiveHandle *AH, TocEntry *te);
+ static void mark_work_done(ArchiveHandle *AH, pid_t worker, ParallelSlot *slots, int n_slots);
+ static int	_restore_one_te(ArchiveHandle *ah, TocEntry *te, RestoreOptions *ropt, bool is_parallel);
+ static void _reduce_dependencies(ArchiveHandle *AH, TocEntry *te);
+ static void _fix_dependency_counts(ArchiveHandle *AH);
+ static void _inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
+ 
+ static ArchiveHandle *GAH;
  
  /*
   * Wrapper functions.
***************
*** 125,137 ****
  /* Public */
  void
  RestoreArchive(Archive *AHX, RestoreOptions *ropt)
  {
  	ArchiveHandle *AH = (ArchiveHandle *) AHX;
  	TocEntry   *te;
  	teReqs		reqs;
  	OutputContext sav;
- 	bool		defnDumped;
  
  	AH->ropt = ropt;
  	AH->stage = STAGE_INITIALIZING;
--- 149,633 ----
  /* Public */
  void
+ RestoreArchiveParallel(Archive *AHX, RestoreOptions *ropt)
+ {
+ 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+ 	ParallelSlot *slots;
+ 	int			next_slot;
+ 	TocEntry   *next_work_item = NULL;
+ 	int			work_status;
+ 	pid_t		ret_child;
+ 	int			n_slots = ropt->number_of_threads;
+ 	TocEntry   *te;
+ 	teReqs		reqs;
+ 
+ 	AH->debugLevel = 99;
+ 	/* some routines that use ahlog() don't get passed AH */
+ 	GAH = AH;
+ 
+ 	ahlog(AH, 1, "entering RestoreARchiveParallel\n");
+ 
+ 	slots = (ParallelSlot *) calloc(sizeof(ParallelSlot), n_slots);
+ 	AH->ropt = ropt;
+ 
+ 	/*
+
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Josh Berkus wrote: In the past I considered implementing polyinstantiated tables as part of SE-PostgreSQL, but it required an unacceptable scale of changes, so I dropped the idea. The TrustedSolaris folks would like polyinstantiation, but I don't know if they actually have anyone working on Postgres anymore. I think polyinstantiation is an issue that will need much more research before deciding on implementing it. -- Bruce Momjian
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Peter Eisentraut wrote: KaiGai Kohei wrote: As I have repeated several times, SE-PostgreSQL applies the security policy of SELinux to achieve consistency in access controls. This feature makes it possible to restrict a client's privileges on accesses to database objects, just as for accesses to filesystem objects. The background is our nightmare of web application flaws. The major purpose of this feature is to provide the most important component for running enterprise-class web applications with a least-privilege set that is consistent across the whole system. How important is this consistency goal in reality? We typically recommend that database applications run entirely in the database, for transaction integrity reasons and so on. That misses the point a bit. The consistency in this sense also applies to references to static web content, like JPEG images, PDF documents and so on. These are not accessed via the web application and database, because httpd serves them directly, but most web applications are built from both dynamic and static content. Have you seen the example? http://kaigai.myhome.cx/index.php (id: foo/var/baz pass: sepgsql) It returns different query results based on the security context associated with the HTTP-authenticated user. Likewise, your accesses to the three listed JPEG files are also restricted. In both cases, the user foo cannot access anything except objects labeled FooData or unlabeled. Of course, this is an ideal state, and we all have to break it once in a while. But hence the honest question: how often does that really happen and to what extent, and does that justify the significant investment that is being proposed here? I think it is a very common case. For example, it is useful for a web-based document management system to apply a consistent access control policy to database tuples and to PDF documents on the filesystem. Then, how does MAC help with SQL injections? 
Using the existing role-based system you can already define least-privilege users that are essentially powerless even if SQL injections were to happen. I am not aware that important flaws or gaps in our role-based access control system have been pointed out that would make it impossible to create applications with security levels similar to those achievable with a MAC system. This feature is mainly aimed at fine-grained access controls. If 50% of the tuples are invisible, the scope of the damage is limited. However, please consider the case where a connection string is hijacked and malicious SQL is injected. As is widely known, it is almost impossible to eliminate all bugs within the various kinds of web applications. Now, if you come back to your original goal of consistency in access control, then it may in fact turn out that a FLASK/SELinux/SE-PostgreSQL-based system is the best common ground for such a system, but I am unconvinced that MAC by itself is necessary. The meaning of "MAC by itself" is unclear to me. Are you asking about the mandatory access control policy as applied by SE-PostgreSQL, rather than by the Linux kernel? If so, please consider the relationship between a client, an object manager, and a method of access. In the operating system case, a client (process) invokes a system call to access a filesystem object managed by the operating system. That is similar to a client sending an SQL query to access a database object managed by the database management system. # Accesses to database objects are entirely invisible to the operating system. :) Thanks, -- OSS Platform Development Division, NEC KaiGai Kohei [EMAIL PROTECTED]
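The per-tuple decision KaiGai describes can be sketched conceptually. This is an illustration only, not SE-PostgreSQL's actual implementation (which consults the SELinux policy through the kernel); the function and label names here are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Conceptual sketch: a tuple is visible when its security label
 * matches the client's security context, or when it is unlabeled.
 * With such a filter applied by the server itself, even injected
 * SQL on a hijacked connection can only reach the rows that the
 * client's security context permits.
 */
static bool
tuple_visible(const char *client_label, const char *tuple_label)
{
    if (tuple_label == NULL)    /* unlabeled: visible to everyone */
        return true;
    return strcmp(client_label, tuple_label) == 0;
}
```

In the demo above, a client running as FooData would see tuples labeled FooData and unlabeled tuples, but nothing labeled BarData, regardless of what SQL it sends.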
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Bruce Momjian wrote: I think there are two goals here. At the SQL level, we will have per-role row and column permissions (which seem valuable on their own), and SE-PostgreSQL allows those permissions to be controlled at the operating system level rather than at the database level. Yes, that is correct. As someone noted, SQL-level fine-grained access controls are also a useful feature; I understand that. SE-PostgreSQL makes its decisions based on the security policy stored in the operating system, for the sake of consistency. However, database objects are invisible to the operating system, so we have to add an option to the RDBMS. I think your major question is how often you have users that you need to control at both the SQL _and_ operating system level. I guess the answer is that security policy suggests controlling things at the lowest level and bubbling that security up into the database and applications. As I mentioned in the previous message, it is a very frequent case that a single web application accesses both filesystem objects and database objects at the same time. The important thing is to turn it off at the main. A smaller amount of security-sensitive code is better for consistency and completeness in security. Thanks, -- OSS Platform Development Division, NEC KaiGai Kohei [EMAIL PROTECTED]
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Bruce Momjian wrote: Josh Berkus wrote: In the past I considered implementing polyinstantiated tables as part of SE-PostgreSQL, but it required an unacceptable scale of changes, so I dropped the idea. The TrustedSolaris folks would like polyinstantiation, but I don't know if they actually have anyone working on Postgres anymore. I think polyinstantiation is an issue that will need much more research before deciding on implementing it. As of this April, they didn't have a plan to implement polyinstantiation at the row level, but were considering it at per-table granularity. Of course, they also understood that row-level polyinstantiation is tough work. Thanks, -- OSS Platform Development Division, NEC KaiGai Kohei [EMAIL PROTECTED]
Re: [HACKERS] Updates of SE-PostgreSQL 8.4devel patches
Have you seen the example? http://kaigai.myhome.cx/index.php (id: foo/var/baz pass: sepgsql) ^ This means you can choose one of foo, var or baz, and they all have the same password. I'm sorry if that was a confusing way of putting it. It returns different query results based on the security context associated with the HTTP-authenticated user. Likewise, your accesses to the three listed JPEG files are also restricted. In both cases, the user foo cannot access anything except objects labeled FooData or unlabeled. -- OSS Platform Development Division, NEC KaiGai Kohei [EMAIL PROTECTED]
Re: [HACKERS] parallel pg_restore - WIP patch
+ 			if (strcmp(te->desc, "CONSTRAINT") == 0 ||
+ 				strcmp(te->desc, "FK CONSTRAINT") == 0 ||
+ 				strcmp(te->desc, "CHECK CONSTRAINT") == 0 ||
+ 				strcmp(te->desc, "TRIGGER") == 0 ||
+ 				strcmp(slots[i].te->desc, "CONSTRAINT") == 0 ||
+ 				strcmp(slots[i].te->desc, "FK CONSTRAINT") == 0 ||
+ 				strcmp(slots[i].te->desc, "CHECK CONSTRAINT") == 0 ||
+ 				strcmp(slots[i].te->desc, "TRIGGER") == 0)

Really just an observation from the peanut gallery here, but every time pg_restore hard-codes this kind of thing, it introduces yet another possible side-effect bug when someone, eg, adds a new TOC type. Would it substantially decrease the benefits of the patch to skip *any* TOC entry that shares dependencies with another? (rather than just those listed above)
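The maintenance worry about repeated strcmp chains could be reduced with a table-driven predicate; a minimal sketch (the function and table names here are hypothetical, not part of the patch):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical: TOC entry types whose restore takes locks that can
 * conflict with other work on the same dependencies. Supporting a
 * new type becomes one new table row instead of another strcmp
 * chain scattered through pg_restore.
 */
static const char *const lock_heavy_types[] = {
    "CONSTRAINT",
    "FK CONSTRAINT",
    "CHECK CONSTRAINT",
    "TRIGGER",
    NULL                        /* sentinel */
};

static bool
is_lock_heavy(const char *desc)
{
    for (int i = 0; lock_heavy_types[i] != NULL; i++)
        if (strcmp(desc, lock_heavy_types[i]) == 0)
            return true;
    return false;
}
```

With this, the quoted condition collapses to `is_lock_heavy(te->desc) || is_lock_heavy(slots[i].te->desc)`, and the type list lives in exactly one place.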
Re: [HACKERS] parallel pg_restore - WIP patch
Philip Warner wrote:

+ 			if (strcmp(te->desc, "CONSTRAINT") == 0 ||
+ 				strcmp(te->desc, "FK CONSTRAINT") == 0 ||
+ 				strcmp(te->desc, "CHECK CONSTRAINT") == 0 ||
+ 				strcmp(te->desc, "TRIGGER") == 0 ||
+ 				strcmp(slots[i].te->desc, "CONSTRAINT") == 0 ||
+ 				strcmp(slots[i].te->desc, "FK CONSTRAINT") == 0 ||
+ 				strcmp(slots[i].te->desc, "CHECK CONSTRAINT") == 0 ||
+ 				strcmp(slots[i].te->desc, "TRIGGER") == 0)

Really just an observation from the peanut gallery here, but every time pg_restore hard-codes this kind of thing, it introduces yet another possible side-effect bug when someone, eg, adds a new TOC type. Would it substantially decrease the benefits of the patch to skip *any* TOC entry that shares dependencies with another? (rather than just those listed above) Unfortunately, it quite possibly would. You would not be able to build two indexes on the same table in parallel, even though they wouldn't have conflicting locks. cheers andrew
Re: [HACKERS] parallel pg_restore - WIP patch
Andrew Dunstan wrote: Unfortunately, it quite possibly would. You would not be able to build two indexes on the same table in parallel, even though they wouldn't have conflicting locks. I suppose so, but: 1. By the same logic it might speed things up; it might build two completely separate indexes and thereby avoid (some kind of) contention. In any case, it would most likely do *something* else. It should only reduce performance if (a) it can do nothing or (b) there is a benefit in building multiple indexes on the same table at the same time. 2. Perhaps if there is a limited number of items that share dependencies but are known to be OK (i.e. indexes), maybe list them in the inner loop as exceptions and allow them to run in parallel. This would mean that failing to list a new TOC item type would result in worse performance rather than a crash.
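Point 2 above could look something like the following sketch, with default-deny scheduling and an explicit allow-list. The names are hypothetical, not from the patch; the idea is simply that an unlisted new TOC type degrades to serial restore instead of crashing:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical allow-list: TOC entry types known to be safe to
 * restore concurrently even when they depend on the same table.
 * Two CREATE INDEX commands on one table take only self-compatible
 * share locks, so indexes qualify.
 */
static bool
is_parallel_safe(const char *desc)
{
    static const char *const safe_types[] = {"INDEX", NULL};

    for (int i = 0; safe_types[i] != NULL; i++)
        if (strcmp(desc, safe_types[i]) == 0)
            return true;
    return false;
}

/*
 * Default-deny: two entries sharing a dependency may run together
 * only if both are on the allow-list. A TOC type nobody thought to
 * list therefore runs serially rather than failing.
 */
static bool
can_run_together(const char *desc_a, const char *desc_b,
                 bool share_dependency)
{
    if (!share_dependency)
        return true;
    return is_parallel_safe(desc_a) && is_parallel_safe(desc_b);
}
```

Under this scheme two INDEX entries on the same table run in parallel, while an FK CONSTRAINT sharing a dependency with anything waits its turn, which matches the failure mode Philip asks for: worse performance, never a crash.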