Re: [HACKERS] [COMMITTERS] pgsql: New SQL functions pg_backup_in_progress() and pg_backup_start_tim
Daniel Farina wrote:
> After mulling over this some more, I am less on the fence about how unfortunate it is that Postgres cannot restart when doing an exclusive base backup: I think it is a severe anti-feature that should gradually be retired.

This anti-feature was introduced with
http://archives.postgresql.org/pgsql-committers/2008-04/msg00275.php
following discussions
http://archives.postgresql.org/pgsql-hackers/2007-11/msg00800.php
and
http://archives.postgresql.org/pgsql-hackers/2008-03/msg01033.php

The problem (at least for me) is that without this the server will often refuse to restart after a clean shutdown, namely when the WAL segment containing the required checkpoint has already been archived.

Yours,
Laurenz Albe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Skip checkpoint on promoting from streaming replication
Hello,

This is the new version of the patch.

Your patch introduced the new WAL record type XLOG_END_OF_RECOVERY to mark the change point of the TLI. But I think that information is already stored in the history files and ready to use in the current code.

I looked into your first patch and looked over the discussion on it, and found that my understanding, that the TLI switch is operable for crash recovery as well as archive recovery, was half wrong. The correct half was that it can work for crash recovery if we properly set TimeLineID in StartupXLOG(). To achieve this, I added a new field 'latestTLI' (a more proper name is welcome) and made it always catch up with the latest TLI, with no relation to checkpoints. The recovery target in StartupXLOG() is then set by referring to it.

Additionally, in the previous patch I checked only checkpoint intervals, but this ended up having no effect, as you said, because the WAL files in pg_xlog are preserved as long as they are required for crash recovery, as I knew...

The new patch seems to work correctly for a change of TLI with no checkpoint following, and archive recovery and PITR also seem to work correctly. The test script for the former is attached too.

The new patch consists of two parts. These should perhaps be treated as two separate ones..

1. Allow_TLI_Increment_without_Checkpoint_20120618.patch

   Removes the assumption, following the 'convention' that the TLI should be incremented only on a shutdown checkpoint. This actually seems to cause no problem, as the comment says ('This is not particularly critical').

2. Skip_Checkpoint_on_Promotion_20120618.patch

   Skips the checkpoint if the redo record can be read in-place.

3. Test script for the TLI increment patch.

   This is only to show how the patch is tested. The point is creating a TLI increment point not followed by any kind of checkpoint.

pg_controldata shows the following after running this test script. 'Latest timeline ID' is the new field.

  pg_control version number:            923
  Database cluster state:               in production
! Latest timeline ID:                   2
  Latest checkpoint location:           0/258
  Prior checkpoint location:            0/258
  Latest checkpoint's REDO location:    0/220
! Latest checkpoint's TimeLineID:       1

We will see this changing as follows after crash recovery,

  Latest timeline ID:                   2
  Latest checkpoint location:           0/54D9918
  Prior checkpoint location:            0/258
  Latest checkpoint's REDO location:    0/54D9918
  Latest checkpoint's TimeLineID:       2

Then, we should see both two 'ABCDE...'s and two 'VWXYZ...'s in the table after the crash recovery.

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d68760..70b4972 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5276,6 +5276,7 @@ BootStrapXLOG(void)
 	ControlFile->system_identifier = sysidentifier;
 	ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->time = checkPoint.time;
+	ControlFile->latestTLI = ThisTimeLineID;
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
@@ -6083,7 +6084,7 @@ StartupXLOG(void)
 	 * Initialize on the assumption we want to recover to the same timeline
 	 * that's active according to pg_control.
 	 */
-	recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
+	recoveryTargetTLI = ControlFile->latestTLI;

 	/*
 	 * Check for recovery control file, and if so set up state for offline
@@ -6100,11 +6101,11 @@ StartupXLOG(void)
 	 * timeline.
 	 */
 	if (!list_member_int(expectedTLIs,
-						 (int) ControlFile->checkPointCopy.ThisTimeLineID))
+						 (int) ControlFile->latestTLI))
 		ereport(FATAL,
 				(errmsg("requested timeline %u is not a child of database system timeline %u",
 						recoveryTargetTLI,
-						ControlFile->checkPointCopy.ThisTimeLineID)));
+						ControlFile->latestTLI)));

 	/*
 	 * Save the selected recovery target timeline ID and
@@ -6791,9 +6792,12 @@ StartupXLOG(void)
 	 *
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
+
+	ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI);
+
 	if (InArchiveRecovery)
 	{
-		ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
+		ThisTimeLineID++;
 		ereport(LOG,
 				(errmsg("selected new timeline ID: %u", ThisTimeLineID)));
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
@@ -6946,6 +6950,7 @@ StartupXLOG(void)
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	ControlFile->state = DB_IN_PRODUCTION;
 	ControlFile->time = (pg_time_t) time(NULL);
+	ControlFile->latestTLI = ThisTimeLineID;
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
@@ -8710,12 +8715,6 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		SpinLockRelease(&xlogctl->info_lck);
 	}

-	/* TLI
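The timeline selection the patch moves into StartupXLOG() can be sketched as follows. This is an illustrative simplification, not the actual xlog.c code: findNewestTimeLine() is stubbed out with a flat history array, and the function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for findNewestTimeLine(): newest TLI reachable from target
 * according to the (stub) history. */
static unsigned
find_newest_timeline(const unsigned *history, size_t n, unsigned target)
{
	unsigned	newest = target;

	for (size_t i = 0; i < n; i++)
	{
		if (history[i] > newest)
			newest = history[i];
	}
	return newest;
}

/*
 * Pick the timeline to run on at startup.  The key point of the patch:
 * the recovery target comes from the new ControlFile->latestTLI field,
 * which tracks TLI increments independently of checkpoints, and only
 * archive recovery starts a brand new timeline.
 */
static unsigned
select_startup_timeline(unsigned latest_tli, const unsigned *history,
						size_t n, int in_archive_recovery)
{
	unsigned	tli = find_newest_timeline(history, n, latest_tli);

	if (in_archive_recovery)
		tli++;					/* "selected new timeline ID" */
	return tli;
}
```

With this shape, crash recovery simply extends the timeline recorded in pg_control, even when that timeline was never covered by a checkpoint.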
[HACKERS] Libxml2 load error on Windows
Hi,

The PostgreSQL 9.2 Windows build is having trouble with XML support:
http://postgresql.1045698.n5.nabble.com/9-2-beta1-libxml2-can-t-be-loaded-on-Windows-td5710672.html

postgres=# SELECT xml '<foo>bar</foo>';
ERROR:  could not set up XML error handler
HINT:  This probably indicates that the version of libxml2 being used is not compatible with the libxml2 header files that PostgreSQL was built with.

HAVE_XMLSTRUCTUREDERRORCONTEXT should be defined if libxml2 has xmlStructuredErrorContext (introduced after 2.7.3). It is being set for Linux in pg_config.h:

/* Define to 1 if your libxml has xmlStructuredErrorContext. */
#define HAVE_XMLSTRUCTUREDERRORCONTEXT 1

For the Windows build, we need to set it in src/tools/msvc/Solution.pm. I guess there are two approaches to get the libxml2 version:

1) Parse LIBXML_INST/include/libxml/xmlversion.h to get the libxml2 version number; this header file is included in libxml2.

2) Get the version by running a command line utility, i.e. LIBXML_INST/bin/xmllint --version. The xmllint program parses one or more XML files and is also included in libxml2.

I believe it is safe to assume that the libxml2 include folder will contain an xmlversion.h file containing the required version number. It might be a touch slower to parse the file line by line, but similar functionality has been used elsewhere as well. We obviously rely on the xml include files (with xml enabled), so there is a tighter dependency on the include files than on the xmllint executable, which is not mandatory for the build process. Therefore, I used the first approach: get the libxml2 version number and include #define HAVE_XMLSTRUCTUREDERRORCONTEXT in pg_config.h if the libxml2 version is greater than 20703 (which effectively is 2.7.3).

Please find attached patch libxml2_win_v1.patch.
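The header-scanning idea can be sketched in C as follows. This is an assumption-laden illustration, not the actual patch (which does this in Perl inside Solution.pm); for brevity the header text is taken as a string rather than read from the file line by line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Find LIBXML_VERSION in the text of xmlversion.h; returns -1 if the
 * define is not present.  A line looks like: #define LIBXML_VERSION 20708 */
static long
libxml_version_from_header(const char *header_text)
{
	const char *key = "#define LIBXML_VERSION";
	const char *p = strstr(header_text, key);

	if (p == NULL)
		return -1;
	return strtol(p + strlen(key), NULL, 10);
}

/* xmlStructuredErrorContext appeared after 2.7.3, i.e. version 20703,
 * so HAVE_XMLSTRUCTUREDERRORCONTEXT should be set for anything newer. */
static int
have_xmlstructurederrorcontext(long version)
{
	return version > 20703;
}
```

The Perl version in the patch would do the same scan over each line of xmlversion.h and emit the #define into pg_config.h when the predicate holds.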
Regards,
Talha Bin Rizwan

libxml2_win_v1.patch Description: Binary data
Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)
On 17 June 2012 18:30, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
> Gurjeet Singh wrote:
>> Dean Rasheed wrote:
>>
>> in HEAD:
>>   ... (actual time=1390.037..1390.037 rows=0 loops=1)
>>   Trigger for constraint fk_table_e_fkey: time=210.184 calls=9
>>   Total runtime: 1607.626 ms
>>
>> With this patch:
>>   ... (actual time=1489.640..1489.640 rows=0 loops=1)
>>   [no triggers fired]
>>   Total runtime: 1489.679 ms
>>
>> for every row:
>>   ... (actual time=1565.148..1565.148 rows=0 loops=1)
>>   Trigger for constraint fk_table_e_fkey: time=705.962 calls=10
>>   Total runtime: 2279.408 ms
>>
>> with this patch:
>>   ... (actual time=1962.755..1962.755 rows=0 loops=1)
>>   Trigger for constraint fk_table_e_fkey: time=257.845 calls=1
>>   Total runtime: 2221.912 ms
>>
>> I find it interesting that 'actual time' for the top level 'Update on fk_table' is always higher in the patched versions, and yet the 'Total runtime' is lower for the patched versions. I would've expected 'Total runtime' to be proportional to the increase in the top-level row-source's 'actual time'.
>
> I figured that the trigger time was counted separately. It seems to add up pretty well that way. I guess the question is whether there is a case where the increase in seqscan time is *not* compensated by less time in the triggers.
>
> -Kevin

I wouldn't read too much into the individual timings I posted above, since I get massive variations between runs. If I repeat it enough times, I can convince myself that the update times excluding trigger execution are unchanged on average, but the trigger execution times (which are indeed counted separately) are a real savings.

Regards,
Dean
Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)
On 17 June 2012 18:48, Tom Lane t...@sss.pgh.pa.us wrote:
> Gurjeet Singh singh.gurj...@gmail.com writes:
>> On Sat, Jun 16, 2012 at 9:50 AM, Dean Rasheed dean.a.rash...@gmail.com wrote:
>>> I find it interesting that 'actual time' for the top level 'Update on fk_table' is always higher in the patched versions, and yet the 'Total runtime' is lower for the patched versions. I would've expected 'Total runtime' to be proportional to the increase in the top-level row-source's 'actual time'.
>>
>> Even the time consumed by Seq scans is higher in the patched version, so I think the patch's effect on performance needs to be evaluated.
>
> AFAICS, the only way that the given patch could possibly make anything slower is that if the old value of some tested attribute is NULL, the comparison routines used to fall out immediately; now, they will do an additional SPI_getbinval call to extract the new value before making any decision. So that would account for some small increase in the ModifyTable runtime in cases where there are a lot of null keys in FK rows being updated, which accurately describes Dean's test case, if not so much the real world. I don't have a big problem with it, since the point of the patch is to possibly save a great deal more work in exactly these cases.
>
> It strikes me though that we are still leaving some money on the table. The SQL spec says clearly that no RI action need be taken when a null PK key value is updated to non-null, and I think this is right because there cannot possibly be any FK rows that are considered to match the old value. (Note that both the spec and our FK code treat the RI equality operators as strict, even if the underlying functions aren't really.) So we ought to have asymmetric logic in there when making checks on PK rows, such that null -> non-null is not considered an interesting change. If done properly this would remove the above-described slowdown in the PK case.

Yeah, that makes sense.
> Conversely, if an FK value is changed from non-null to null, that is either always OK (if MATCH SIMPLE, or if MATCH FULL and all the FK columns went to null) or a certain failure (if MATCH FULL and we have a mix of nulls and non-nulls). There's no need to queue a trigger event in the always-OK cases, so I think we need some asymmetric logic in the FK case as well.

Makes sense too. I think that the patch already covers the most common use case (in my experience), but we may as well get as much out of it as we can while we're here.

Are you planning to tackle this, or should I move the patch back to waiting on author to give Vik Reykja a chance to update it?

Regards,
Dean
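The asymmetric null-transition rules under discussion could look roughly like this. This is a hypothetical single-column sketch of the decision logic only, not the actual ri_triggers.c code, and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * PK side: decide whether an UPDATE of a PK key value requires RI checks.
 * A NULL old value can never have been matched by any FK row (RI equality
 * operators are treated as strict), so null -> anything needs no check.
 */
static bool
pk_update_needs_ri_check(bool old_is_null, bool new_is_null, bool equal)
{
	if (old_is_null)
		return false;			/* nothing could have matched the old row */
	if (new_is_null)
		return true;			/* non-null -> null: matches may break */
	return !equal;				/* value changed: matches may break */
}

/*
 * FK side, for a transition where some FK columns went from non-null to
 * null: with MATCH SIMPLE this is always OK; with MATCH FULL it is OK
 * only if *all* FK columns went to null.  (A MATCH FULL mix is a certain
 * failure, which can be reported without queuing a trigger event.)
 */
typedef enum { MATCH_SIMPLE, MATCH_FULL } MatchType;

static bool
fk_nulling_needs_trigger(MatchType match, int ncols, int nulled_cols)
{
	if (match == MATCH_SIMPLE)
		return false;				/* always OK, skip the trigger */
	return nulled_cols != ncols;	/* MATCH FULL: a mix must be handled */
}
```

The point of the asymmetry is exactly what the thread describes: skipping the SPI_getbinval work and the trigger queue entry whenever the transition provably cannot invalidate any match.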
Re: [HACKERS] temporal support patch
On 06/15/2012 03:59 PM, Jeff Davis wrote:
> On Wed, 2012-06-13 at 23:10 +0200, Miroslav Šimulčík wrote:
>> I have a working patch for postgresql version 9.0.4, but it needs refactoring before I can submit it, because some parts don't meet the formatting requirements yet. And yes, the changes are large, so it will be better to discuss the design first and then deal with the code. Do you insist on compatibility with standard SQL 2011 as Pavel wrote?
>
> Try to work on solving the problem and identify the use cases. I don't think the standard will cause a major problem; we should be able to make the relevant parts of your patch match the standard. That's one reason to work on it as an extension first: we can get a better sense of the problem space and various use cases without worrying about violating any standard. Then, as you need specific backend support (e.g. special syntax), we can take the standards more seriously.
>
> Regards,
> Jeff Davis

What's wrong with the SPI/timetravel extension for system versioning? http://www.postgresql.org/docs/9.1/static/contrib-spi.html

We are heavily using system-versioned and application-time period tables in our enterprise products (most of them are bi-temporal). However, our implementation is based on triggers and views and is therefore not very convenient to use. There are also some locking issues with foreign keys to application-time period tables.

It would be great if the new temporal SQL features were included in the PostgreSQL core with SQL 2011 syntax support. It is especially important for bi-temporal tables because of the complex internal logic of UPDATE/DELETE and the huge SELECT queries for such tables.
Re: [HACKERS] Resource Owner reassign Locks
On 16.06.2012 09:04, Amit kapila wrote:
>> I don't think so. C doesn't ref count its pointers.
>
> You are right, I have misunderstood.
>
>> I don't think that lock tags have good human readable formats, and just a pointer dump probably wouldn't be much use when something that can never happen has happened. But I'll at least add a reference to the resource owner if this stays in.
>
> I have checked in the lock.c file for the message where lock tags have been used:
>
> elog(ERROR, "lock %s on object %u/%u/%u is already held",
>      lockMethodTable->lockModeNames[lockmode],
>      lock->tag.locktag_field1, lock->tag.locktag_field2,
>      lock->tag.locktag_field3);
>
> This can give more information about the erroneous lock.

A better error message would be nice, but I don't think it's worth that much. resowner.c doesn't currently know about the internals of the LOCALLOCK struct or lock tags, and I'd prefer to keep it that way. Let's just print the pointer's address; that's what we do in many other corresponding error messages in other Forget* functions.

On 11.06.2012 18:21, Jeff Janes wrote:
> On Sun, Jun 10, 2012 at 11:28 PM, Amit Kapila amit.kap...@huawei.com wrote:
>> 2. ResourceOwnerForgetLock() decrements the lock count before removing, so in case it doesn't find the lock in the lock array, the count will be decremented in spite of the array still containing the same number of locks. Should it emit a FATAL rather than an ERROR?
>
> I thought ERROR was sufficient to make the backend quit, as it is not clear how it could meaningfully recover.

ERROR seems right to me; it'll get promoted to FATAL or PANIC if we're in a context where we can't recover. But I agree with Amit that we should not decrement the lock count on error. I'm not sure if it makes any difference, as we'll abort out of the current transaction on error, but it certainly looks wrong.
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] [BUGS] Tab completion of function arguments not working in all cases
On 18 June 2012 04:21, Josh Kupershmidt schmi...@gmail.com wrote:
> +1 for the idea. I find the existing behavior rather confusing, particularly the fact that a schema-qualified function name will be tab-completed, i.e. this works:
>
>   DROP FUNCTION my_schema.my<TAB>
>
> but then, as your second example above shows, no completions are subsequently offered for the function arguments.
>
> As a side note unrelated to this patch, I also dislike how function name tab-completions will not fill in the opening parenthesis, which makes for unnecessary work for the user, as one then has to type the parenthesis and hit tab again to get possible completions for the function arguments. The current behavior causes:
>
>   DROP FUNCTION my_f<TAB>
>
> which completes to:
>
>   DROP FUNCTION my_function
>
> enter parenthesis, and hit tab:
>
>   DROP FUNCTION my_function(<TAB>
>
> which, if there is only one match, could complete to:
>
>   DROP FUNCTION my_function(integer)
>
> when the last three steps could have been consolidated with better tab-completion. Perhaps this could be a TODO.

Hmm, I find that it does automatically fill in the opening parenthesis, but only once there is a space after the completed function name. So DROP FUNCTION my_f<TAB> completes to DROP FUNCTION my_function (note the space at the end). Then pressing TAB one more time gives DROP FUNCTION my_function ( , and pressing TAB again gives the function arguments. Alternatively DROP FUNCTION my_function<TAB> (no space after the function name) first completes to DROP FUNCTION my_function (adding the space), then completes with the opening parenthesis, and then with the function arguments. It's a bit clunky, but I find that repeatedly pressing TAB is easier than typing the opening parenthesis.

The attached patch fixes these problems by introducing a new macro COMPLETE_WITH_ARG, similar to the existing COMPLETE_WITH_ATTR, which seems to be the nearest analogous code that covers all the edge cases.
> Anyway, on to the review of the patch itself:
>
> * Submission *
>
> Patch applies cleanly to git head, and regression tests are not expected for tab-completion enhancements.
>
> * Features Usability *
>
> I've verified that tab-completing of the first argument to functions for DROP FUNCTION and ALTER FUNCTION commands for the most part works as expected. The one catch I noticed was that Query_for_list_of_arguments wasn't restricting its results to currently-visible functions, so with a default search_path, if you have these two functions defined:
>
>   public.doppelganger(text)
>   my_schema.doppelganger(bytea)
>
> and then try:
>
>   DROP FUNCTION doppelganger(<TAB>
>
> you get tab-completions for both text) and bytea), when you probably expected only the former. That's easy to fix though, please see attached v2 patch.

Good catch. I think that's a useful additional test, and is also consistent with the existing code in Query_for_list_of_attributes.

> * Coding *
>
> The new macro COMPLETE_WITH_ARG seems fine. The existing code used malloc() directly for DROP FUNCTION and ALTER FUNCTION (tab-complete.c, around lines 867 and 2190), which AIUI is frowned upon in favor of pg_malloc(). The patch avoids this ugliness by using the new COMPLETE_WITH_ARG macro, so that's a nice fixup.
>
> Overall, a nice fix for an overlooked piece of the tab-completion machinery.
>
> Josh

Thanks for the review.

Cheers,
Dean
Re: [HACKERS] Resource Owner reassign Locks
On 10.06.2012 23:39, Jeff Janes wrote:
> The attached patch fixes that by remembering up to 10 local locks, and pushing them up specifically rather than digging through the entire lock table. If the list overflows, then it reverts to the old behavior of digging through the entire local lock table.

I'm a bit uncomfortable with having a fixed limit like that, after which you fall off the cliff. But I guess we can live with that, since we've had the quadratic behavior for years, and haven't heard complaints about it until recently. We can devise some more complex scheme later, if people still run into this.

One idea for a more complex scheme, should we need one in the future, is to make the cache in the ResourceOwner expandable, like all the other arrays in ResourceOwner. It would not overflow when new entries are added to it. However, if you try to forget a lock, and it's not found in the bottom 10 entries or so of the cache, mark the cache as overflowed at that point. That way reassigning the locks would be fast, even if the current resource owner holds a lot of locks, as long as you haven't tried to release any of them out of LIFO order.

> When it needs to forget a lock, it searches backwards in the list of released locks and then just moves the last lock into the place of the one to be forgotten. Other ResourceOwner Forget functions slide the entire list down to close the gap, rather than using the selection-sort-like method. I don't understand why they do that. If Forgets are strictly LIFO, then it would not make a difference. If they are mostly LIFO but occasionally not, maybe the insertion method would win over the selection method.

Yeah, I think that's the reason. We try to keep the arrays in LIFO order. If you move the last entry into the removed slot, then even a single out-of-order removal will spoil the heuristic that removals are done in LIFO order.
Imagine that you insert 5 entries in order: 1 2 3 4 5. Now imagine that you first remove 1 (out of order), and then 5, 4, 3, and 2 (in order). The removal of 1 moves 5 to the beginning of the array. Then you remove 5, and that has to traverse the whole list, because 5 is now at the beginning. Then you swap 4 into the beginning of the list, and so forth. Every subsequent removal scans the list from bottom to top, because of the single out-of-order removal.

> From what I can tell, locks are mostly released in bulk anyway at transaction end, and rarely released explicitly.

Hmm, if they are released in bulk, then it doesn't matter either way. So the decision on whether to do the slide or the swap has to be made on the assumption that some locks are released out of order. With an array of 10-15 entries, it probably doesn't make any difference either way, though.

> For degrading performance in other cases, I think the best test case is pgbench -P (implemented in another patch in this commitfest), which has a loop which pushes one or two locks up from a portal to the parent (which already owns them, due to previous rounds of the same loop) very frequently. There might be a performance degradation of 0.5% or so, but it is less than the run-to-run variation. I plan to run some longer tests to get a better estimate. If there is a degradation in that range, how important is that?

I think that would be acceptable.

I found the interface between resowner.c and lock.c a bit confusing. resowner.c would sometimes call LockReassignCurrentOwner() to reassign all the locks, and sometimes it would call LockReassignCurrentOwner() on each individual lock, with a net effect that's the same as calling LockReleaseCurrentOwner(). And it doesn't seem right for ResourceOwnerRemember/ForgetLock to have to accept a NULL owner. I rearranged that so that there's just a single LockReassignCurrentOwner() function, like before this patch.
But it takes as an optional argument a list of locks held by the current resource owner. If the caller knows it, it can pass them to make the call faster, but if it doesn't, it can just pass NULL and the function will traverse the hash table the old-fashioned way. I think that's a better API.

Please take a look to see if I broke something.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9717075..9dd9c33 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -345,6 +345,7 @@ static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
 static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
+static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 			PROCLOCK *proclock, LockMethod lockMethodTable);
 static void CleanUpLock(LOCK *lock, PROCLOCK
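The small LIFO lock cache being discussed can be sketched like this. It is a simplified illustration of the overflow idea and the slide-vs-swap removal question, not the actual resowner.c code; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_RESOWNER_LOCKS 10

typedef struct
{
	void	   *locks[MAX_RESOWNER_LOCKS];	/* small cache of held locks */
	int			nlocks;
	bool		overflowed;		/* too many locks: fall back to scanning
								 * the whole local lock table */
} OwnerLockCache;

static void
cache_remember_lock(OwnerLockCache *cache, void *lock)
{
	if (cache->overflowed)
		return;
	if (cache->nlocks >= MAX_RESOWNER_LOCKS)
	{
		cache->overflowed = true;	/* give up tracking individual locks */
		return;
	}
	cache->locks[cache->nlocks++] = lock;
}

/*
 * Forget a lock.  Search from the top, because releases are usually LIFO,
 * and close the gap by sliding entries down (rather than swapping the last
 * entry in), which keeps the array in LIFO order even after an occasional
 * out-of-order release.
 */
static void
cache_forget_lock(OwnerLockCache *cache, void *lock)
{
	if (cache->overflowed)
		return;					/* the full hash-table scan handles it */

	for (int i = cache->nlocks - 1; i >= 0; i--)
	{
		if (cache->locks[i] == lock)
		{
			memmove(&cache->locks[i], &cache->locks[i + 1],
					(cache->nlocks - i - 1) * sizeof(void *));
			cache->nlocks--;
			return;
		}
	}
	/* not found: the real code would elog(ERROR) here */
}
```

Reassigning locks to the parent owner then only has to walk this small array (or fall back to the full table scan when `overflowed` is set), which is what removes the quadratic behavior.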
[HACKERS] gistchoose vs. bloat
Hackers,

While experimenting with gistchoose I obtained interesting results about the relation between gistchoose behaviour and GiST index bloat. I've created the following testcase for reproducing GiST index bloat:

1) Create a test table with 100 points from geonames

# create table geotest (id serial, point point);
# insert into geotest (point) (select * from geonames order by random() limit 100);

2) Create the gist index and a statistics table. Insert initial statistics (row count, index size) into the table.

# create index geotest_idx on geotest using gist (point) with (buffering = off);
# create table geotest_stats (id serial, count bigint, size bigint);
# insert into geotest_stats (count, size) values ((select count(*) from geotest), pg_relation_size('geotest_idx'));

3) Delete 10% of geotest and vacuum it.

delete from geotest where id in (select id from geotest order by random() limit 10);
vacuum geotest;

4) Insert 10 new points from geonames.

insert into geotest (point) (select * from geonames order by random() limit 10);

5) Insert current statistics (row count, index size) into the table.

insert into geotest_stats (count, size) values ((select count(*) from geotest), pg_relation_size('geotest_idx'));

6) Repeat steps 3-5 50 times.
I get the following results with the current gistchoose implementation:

test=# select * from geotest1_stats;
 id | count |   size
----+-------+-----------
  1 |   100 |  75767808
  2 |   100 |  76046336
  3 |   100 |  76840960
  4 |   100 |  77799424
  5 |   100 |  78766080
  6 |   100 |  79757312
  7 |   100 |  80494592
  8 |   100 |  81125376
  9 |   100 |  81985536
 10 |   100 |  82804736
 11 |   100 |  83378176
 12 |   100 |  84115456
 13 |   100 |  84819968
 14 |   100 |  85598208
 15 |   100 |  86302720
 16 |   100 |  87023616
 17 |   100 |  87703552
 18 |   100 |  88342528
 19 |   100 |  88915968
 20 |   100 |  89513984
 21 |   100 |  90152960
 22 |   100 |  90742784
 23 |   100 |  91324416
 24 |   100 |  91742208
 25 |   100 |  92258304
 26 |   100 |  92758016
 27 |   100 |  93241344
 28 |   100 |  93683712
 29 |   100 |  93970432
 30 |   100 |  94396416
 31 |   100 |  94740480
 32 |   100 |  95068160
 33 |   100 |  95444992
 34 |   100 |  95780864
 35 |   100 |  96313344
 36 |   100 |  96731136
 37 |   100 |  96968704
 38 |   100 |  97222656
 39 |   100 |  97509376
 40 |   100 |  97779712
 41 |   100 |  98074624
 42 |   100 |  98344960
 43 |   100 |  98639872
 44 |   100 |  99000320
 45 |   100 |  99229696
 46 |   100 |  99459072
 47 |   100 |  99672064
 48 |   100 |  99901440
 49 |   100 | 100114432
 50 |   100 | 100261888
 51 |   100 | 100458496
(51 rows)

The index grew from about 75 MB to 100 MB.

The current implementation of gistchoose selects the first index tuple which has minimal penalty. It is possible for several tuples to have the same minimal penalty. I've created a simple patch which selects randomly from among them. Then I got the following results for the same testcase.
test=# select * from geotest_stats;
 id | count |   size
----+-------+----------
  1 |   100 | 74694656
  2 |   100 | 74743808
  3 |   100 | 74891264
  4 |   100 | 75120640
  5 |   100 | 75374592
  6 |   100 | 75612160
  7 |   100 | 75833344
  8 |   100 | 76144640
  9 |   100 | 76333056
 10 |   100 | 76595200
 11 |   100 | 76718080
 12 |   100 | 76939264
 13 |   100 | 77070336
 14 |   100 | 77332480
 15 |   100 | 77520896
 16 |   100 | 77750272
 17 |   100 | 77996032
 18 |   100 | 78127104
 19 |   100 | 78307328
 20 |   100 | 78512128
 21 |   100 | 78610432
 22 |   100 | 78774272
 23 |   100 | 78929920
 24 |   100 | 79060992
 25 |   100 | 79216640
 26 |   100 | 79331328
 27 |   100 | 79454208
 28 |   100 | 79593472
 29 |   100 | 79708160
 30 |   100 | 79822848
 31 |   100 | 79921152
 32 |   100 | 80035840
 33 |   100 | 80076800
 34 |   100 | 80175104
 35 |   100 | 80207872
 36 |   100 | 80322560
 37 |   100 | 80363520
 38 |   100 | 80445440
 39 |   100 | 80494592
 40 |   100 | 80576512
 41 |   100 | 8024
 42 |   100 | 80764928
 43 |   100 | 80805888
 44 |   100 | 80912384
 45 |   100 | 80994304
 46 |   100 | 81027072
 47 |   100 | 81100800
 48 |   100 | 81174528
 49 |   100 | 81297408
 50 |   100 | 81371136
 51 |   100 | 81420288
(51 rows)

Now the index grew from about 75 MB to only 81 MB: almost 4 times less index bloat!

My explanation is as follows. If index tuples overlap, then the probability of insertion into the last of them is much lower than the probability of insertion into the first. Thereby, freed space underlying the last tuples of an inner page is not likely to be reused. Selecting a random tuple increases the chance of reuse. Obviously, it also makes insertion more expensive; I need some more benchmarks to measure that. Probably, we would need a user-visible option to choose between cheaper inserts and less bloat.

-- With best
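The tie-breaking change can be sketched as follows. This illustrates the general technique (a single-pass, reservoir-style uniform choice among equal-penalty entries), not the actual gistchoose patch; the random source is a toy LCG so the sketch stays self-contained.

```c
#include <assert.h>
#include <float.h>

/* Toy pseudo-random generator, illustration only. */
static unsigned int
next_rand(unsigned int *state)
{
	*state = *state * 1103515245u + 12345u;
	return (*state >> 16) & 0x7fff;
}

/*
 * Pick the entry with minimal penalty; when several entries share the
 * minimal penalty, choose uniformly at random among them in one pass.
 */
static int
choose_min_penalty(const double *penalty, int n, unsigned int *seed)
{
	double		best = DBL_MAX;
	int			which = -1;
	int			nties = 0;

	for (int i = 0; i < n; i++)
	{
		if (penalty[i] < best)
		{
			best = penalty[i];
			which = i;
			nties = 1;
		}
		else if (penalty[i] == best)
		{
			/* keep each of the (nties + 1) candidates equally likely */
			nties++;
			if (next_rand(seed) % nties == 0)
				which = i;
		}
	}
	return which;
}
```

Because the replacement of the current winner happens with probability 1/nties at the k-th tie, every tied entry ends up selected with equal probability, which is what spreads insertions across equally good subtrees instead of always hitting the first one.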
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Monday, June 18, 2012 02:43:26 AM Steve Singer wrote:
> On 12-06-13 01:27 PM, Andres Freund wrote:
>> The previous mail contained a patch with a mismerge caused by reordering commits. Corrected version attached. Thanks to Steve Singer for noticing this quickly.
>
> Attached is a more complete review of this patch.
>
> I agree that we will need to identify the node a change originated at. We will not only want this for multi-master support but it might also be very helpful once we introduce things like cascaded replicas. Using a 16 bit integer for this purpose makes sense to me.

Good.

> This patch (with the previous numbered patches already applied) still doesn't compile:
>
> gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include -D_GNU_SOURCE -c -o xact.o xact.c
> xact.c: In function 'xact_redo_commit':
> xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn'
> make[4]: *** [xact.o] Error 1
>
> Your complete patch set did compile. origin_lsn gets added as part of your 12th patch. Managing so many related patches is going to be a pain, but it beats one big patch. I don't think this patch actually requires the origin_lsn change.

Hrmpf #666. I will go through the series commit-by-commit again to make sure everything compiles again. Reordering this late definitely wasn't a good idea...

I pushed a rebased version with all those fixups (and removal of the zeroRecPtr patch).
> Code Review
>
> src/backend/utils/misc/guc.c
>
> @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] =
>  	},
>  	{
> +		{"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
> +			gettext_noop("node id for multimaster."),
> +			NULL
> +		},
> +		&guc_replication_origin_id,
> +		InvalidMultimasterNodeId, InvalidMultimasterNodeId, MaxMultimasterNodeId,
> +		NULL, assign_replication_node_id, NULL
> +	},
>
> I'd rather see us refer to this as the 'node id for logical replication' over the multimaster node id. I think that terminology will be less controversial.

You're right. 'replication_node_id' or such should be ok?

> BootStrapXLOG in xlog.c creates a XLogRecord structure and should it set xl_origin_id to InvalidMultimasterNodeId? WriteEmptyXLOG in pg_resetxlog.c might also need to set xl_origin_id to a well defined value. I think InvalidMultimasterNodeId should be safe even for a no-op record from a node that actually has a node_id set on real records.

Good catches.

> backend/replication/logical/logical.c:
>
> XLogRecPtr current_replication_origin_lsn = {0, 0};
>
> This variable isn't used/referenced by this patch; it probably belongs in a later patch.

Yea, just as the usage of origin_lsn in the above compile failure.

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] s/UNSPECIFIED/SIMPLE/ in foreign key code?
On Sat, Jun 16, 2012 at 5:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: A small flaw in this plan is that in pg_constraint.confmatchtype, MATCH_UNSPECIFIED is stored as 'u'. In a green field I'd just rename that to 's' for SIMPLE, but it seems possible that this would confuse client-side code such as pg_dump or psql. A quick look shows that neither of those programs actually look directly at pg_constraint.confmatchtype, instead relying on backend functions when they want to deconstruct a foreign key constraint. But there could well be other client code that would notice the change. So I'm a bit torn as to whether to change it and create a release-note-worthy compatibility issue, or to leave it as-is (with documentation notes that 'u' for MATCH_SIMPLE is a historical accident). Thoughts? For some reason I thought this change would break pg_upgrade, but then I came to my senses and realized pg_upgrade does not migrate/upgrade system tables. The other candidate to look for possible breakage would be pgAdmin. As Amit says, there might be code out in the wild that does look at this column, so not worth breaking them for this small gain. Regards, -- Gurjeet Singh EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH 06/16] Add support for a generic wal reading facility dubbed XLogReader
On 15.06.2012 00:38, Andres Freund wrote: On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote: On 13.06.2012 14:28, Andres Freund wrote: Features: - streaming reading/writing - filtering - reassembly of records Reusing the ReadRecord infrastructure in situations where the code that wants to do so is not tightly integrated into xlog.c is rather hard and would require changes to rather integral parts of the recovery code, which doesn't seem to be a good idea. It would be nice to refactor ReadRecord and its subroutines out of xlog.c. That file has grown over the years to be really huge, and separating the code to read WAL sounds like it should be a pretty natural split. I don't want to duplicate all the WAL reading code, so we really should find a way to reuse that. I'd suggest rewriting ReadRecord into a thin wrapper that just calls the new xlogreader code. I agree that it is not very nice to duplicate it. But I also don't want to go the route of replacing ReadRecord with it for a while; we can replace ReadRecord later if we want. As long as it is in flux like it is right now I don't really see the point in investing energy in it. Also I am not that sure how a callback-oriented API fits into the xlog.c workflow? If the API doesn't fit the xlog.c workflow, I think that's a pretty good indication that the API is not good. The xlog.c workflow is basically: while (!end-of-wal) { record = ReadRecord(); redo(record); } There's some pretty complicated logic within the bowels of ReadRecord (see XLogPageRead(), for example), but it seems to me those parts actually fit your callback API pretty well. The biggest change is that the xlog-reader thingie should return to the caller after each record, instead of calling a callback for each read record and only returning at end-of-wal. 
Or we could put the code to call the redo-function into the callback, but there's also some logic there to exit early if you reach the desired recovery point, for example, so the callback API would need to be extended to allow such early exit. I think a return-after-each-record API would be better, though. Can this be used for writing WAL, as well as reading? If so, what do you need the write support for? It currently can replace records which are not interesting (e.g. index changes in the case of logical rep). Filtered records are replaced with XLOG_NOOP records with correct length currently. In future the actual amount of data should really be reduced. I don't yet know how to map LSNs of uncompressed/compressed streams onto each other... The filtered data is then passed to a writeout callback (in a streaming fashion). The whole writing out part is pretty ugly at the moment and I just bolted it on top because it was convenient for the moment. I am not yet sure how the API for that should look. Can we just leave that out for now? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] s/UNSPECIFIED/SIMPLE/ in foreign key code?
Gurjeet Singh singh.gurj...@gmail.com writes: The other candidate to look for possible breakage would be pgAdmin. As Amit says, there might be code out in the wild that does look at this column, so not worth breaking them for this small gain. Well, I already did it the other way ;-). It's probably not that big a deal either way, since in practice it's a sure thing that 9.3 will have many bigger and more-painful-to-deal-with catalog changes than this one. But the fact that both pg_dump and psql go through pg_get_constraintdef rather than looking directly at the catalogs, and always have (where always = since the pg_constraint catalog was invented, in 7.3) makes me think that other clients probably do likewise. It's rather painful to get the list of FK column names out of pg_constraint's array representation in pure SQL. regards, tom lane
Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)
Dean Rasheed dean.a.rash...@gmail.com writes: I think that the patch already covers the most common use case (in my experience) but we may as well get as much out of it as we can while we're here. Yeah. The cases involving nulls are probably really rather unlikely altogether, but it seems a tad silly to fix only some of them when we can fix all of them for marginally more effort. Are you planning to tackle this, or should I move the patch back to waiting on author to give Vik Reykja a chance to update it? It doesn't look like much else is ready for committer yet, so I think I'll keep hacking on this one. The whole of ri_triggers is looking a bit old and creaky to me; for instance the SET DEFAULT triggers believe that they can't cache plans involving DEFAULT, which was fixed years ago (I'm pretty sure that the plancache level should take care of that automatically, since an ALTER TABLE ... DEFAULT command would force a relcache flush). regards, tom lane
Re: [HACKERS] [PATCH 06/16] Add support for a generic wal reading facility dubbed XLogReader
On Monday, June 18, 2012 01:51:45 PM Heikki Linnakangas wrote: On 15.06.2012 00:38, Andres Freund wrote: On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote: On 13.06.2012 14:28, Andres Freund wrote: Features: - streaming reading/writing - filtering - reassembly of records Reusing the ReadRecord infrastructure in situations where the code that wants to do so is not tightly integrated into xlog.c is rather hard and would require changes to rather integral parts of the recovery code, which doesn't seem to be a good idea. It would be nice to refactor ReadRecord and its subroutines out of xlog.c. That file has grown over the years to be really huge, and separating the code to read WAL sounds like it should be a pretty natural split. I don't want to duplicate all the WAL reading code, so we really should find a way to reuse that. I'd suggest rewriting ReadRecord into a thin wrapper that just calls the new xlogreader code. I agree that it is not very nice to duplicate it. But I also don't want to go the route of replacing ReadRecord with it for a while; we can replace ReadRecord later if we want. As long as it is in flux like it is right now I don't really see the point in investing energy in it. Also I am not that sure how a callback-oriented API fits into the xlog.c workflow? If the API doesn't fit the xlog.c workflow, I think that's a pretty good indication that the API is not good. The xlog.c workflow is basically: while (!end-of-wal) { record = ReadRecord(); redo(record); } There's some pretty complicated logic within the bowels of ReadRecord (see XLogPageRead(), for example), but it seems to me those parts actually fit your callback API pretty well. The biggest change is that the xlog-reader thingie should return to the caller after each record, instead of calling a callback for each read record and only returning at end-of-wal. I did it that way at the start but it seemed to move too much knowledge to the call sites. 
It's also something of an efficiency/complexity thing: either you always reread the current page on entering XLogReader (because it could have changed if it was the last one at some point), which is noticeable performance-wise, or you make the contract more complex by requiring callers to notify the XLogReader that the end of valid data changed, which doesn't seem to be a good idea because I think we will rather quickly grow more call sites. Or we could put the code to call the redo-function into the callback, but there's also some logic there to exit early if you reach the desired recovery point, for example, so the callback API would need to be extended to allow such early exit. I think a return-after-each-record API would be better, though. Adding an early exit would be easy. Can this be used for writing WAL, as well as reading? If so, what do you need the write support for? It currently can replace records which are not interesting (e.g. index changes in the case of logical rep). Filtered records are replaced with XLOG_NOOP records with correct length currently. In future the actual amount of data should really be reduced. I don't yet know how to map LSNs of uncompressed/compressed streams onto each other... The filtered data is then passed to a writeout callback (in a streaming fashion). The whole writing out part is pretty ugly at the moment and I just bolted it on top because it was convenient for the moment. I am not yet sure how the API for that should look. Can we just leave that out for now? Not easily, I think. I should probably get busy writing up a PoC for separating that. Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Standbys, txid_current_snapshot, wraparound
On Thu, Mar 22, 2012 at 5:26 PM, Daniel Farina dan...@heroku.com wrote: Some time ago I reported bug 6291[0], which reported a Xid wraparound, both as reported in pg_controldata and by txid_current_snapshot. Unfortunately, nobody could reproduce it. Today, the same system of ours just passed the wraparound mark successfully at this time, incrementing the epoch. However, two standbys have not done the same: they have wrapped to a low txid. At this time, pg_controldata does report the correct epoch, as I read it, unlike the original case. Hmm. Here we are again, to revive this old thread (whose archives can be seen at http://archives.postgresql.org/pgsql-hackers/2012-03/msg01437.php) So the system *has* wrapped this time -- and so has its standby -- but not to the zero-epoch, but rather to the epoch that it was previously in, 2^33 to 2^32. That's not something I've seen before. There was another report of something similar in that thread also. So, that's...different. However, since we started using txids for some streaming/incremental copying of data roughly a year ago (and thus started to be alerted to this problem), two of the three known wraparounds have been eventful, and the third exposed a potentially unrelated bug in the standby whereby epochs were not being loaded. This database is itself descended from an old timeline on a version of postgres that undoubtedly showed this defect. Version: 9.0.6 Ubuntu 10.04 LTS amd64 -- fdr
Re: [HACKERS] Resource Owner reassign Locks
A better error message would be nice, but I don't think it's worth that much. resowner.c doesn't currently know about the internals of the LOCALLOCK struct or lock tags, and I'd prefer to keep it that way. Let's just print the pointer's address; that's what we do in many other corresponding error messages in other Forget* functions. I have checked: there also don't exist any functions that expose internal lock parameters like tag values. So we can modify as you suggested. -Original Message- From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] Sent: Monday, June 18, 2012 4:25 PM To: Amit kapila; Jeff Janes Cc: pgsql-hackers Subject: Re: Resource Owner reassign Locks On 16.06.2012 09:04, Amit kapila wrote: I don't think so. C doesn't ref count its pointers. You are right, I had misunderstood. I don't think that lock tags have good human-readable formats, and just a pointer dump probably wouldn't be much use when something that can never happen has happened. But I'll at least add a reference to the resource owner if this stays in. I have checked in the lock.c file for the message where lock tags have been used: elog(ERROR, "lock %s on object %u/%u/%u is already held", lockMethodTable->lockModeNames[lockmode], lock->tag.locktag_field1, lock->tag.locktag_field2, lock->tag.locktag_field3); This can give more information about the erroneous lock. A better error message would be nice, but I don't think it's worth that much. resowner.c doesn't currently know about the internals of the LOCALLOCK struct or lock tags, and I'd prefer to keep it that way. Let's just print the pointer's address; that's what we do in many other corresponding error messages in other Forget* functions. On 11.06.2012 18:21, Jeff Janes wrote: On Sun, Jun 10, 2012 at 11:28 PM, Amit Kapilaamit.kap...@huawei.com wrote: 2. 
ResourceOwnerForgetLock(), it decrements the lock count before removing, so in case it doesn't find the lock in the lock array, the count will be decremented even though the array still contains the same number of locks. Should it emit a FATAL rather than an ERROR? I thought ERROR was sufficient to make the backend quit, as it is not clear how it could meaningfully recover. ERROR seems right to me; it'll get promoted to FATAL or PANIC if we're in a context where we can't recover. But I agree with Amit that we should not decrement the lock count on error. I'm not sure if it makes any difference, as we'll abort out of the current transaction on error, but it certainly looks wrong. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
Misa Simic misa.si...@gmail.com wrote: 2012/6/17 Kevin Grittner kevin.gritt...@wicourts.gov Can someone provide a practical example of a foreign key with array use case? The only situations I'm able to think of right now are the same cases where you would now use a table with primary keys of two tables to provide a many-to-many linkage. Does this proposed feature handle other cases or handle this type of case better? I can't imagine any other usability... Just a many-to-one linkage... or having a many-to-many link with fewer rows in the middle table... The many-to-one case seems like it is better handled in the other direction -- with the referenced table holding the set of valid keys and the referencing table holding the single key. (I believe the general case of this is what Jeff called an inclusion constraint -- a feature he wants to add at some point.) I can't think of a use case where that would not be better for this type of relationship than putting the array on the referencing side. The many-to-many relationship does seem like a potentially useful feature, at least in some cases. For example, a message board where a limited number of tags can be attached to each message -- the message could contain an array of tag IDs. Clearly this could be done as a table holding message_id and tag_id, but it seems potentially more convenient from a programming perspective to have an array of tag_id values in the message table. Logically, you are associating each of these with the primary key of the row it is in. Are there other obvious use-cases? If nobody can put one forward, this seems like a good reality test for proposed behaviors in any corner cases where correct behavior isn't obvious -- what would you want to do to the data if it were in the separate table with just the primary keys to link the two tables? Which is better, I think, should be measured... 
It would make an interesting test to compare performance of a suitably sized data set and reasonable workload for both techniques. Of course, that's a non-trivial amount of work. It seems like people may be able to justify this just on ease-of-use, versus claiming that it's an optimization. -Kevin
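For concreteness, the two shapes being compared might look like this. The junction table is standard SQL; the array side is only sketched in a comment, since its exact syntax belongs to the proposed patch and is not released PostgreSQL:

```sql
-- Conventional many-to-many: a junction table holding the primary keys
-- of both sides, as Kevin describes.
CREATE TABLE message (message_id int PRIMARY KEY, body text);
CREATE TABLE tag     (tag_id int PRIMARY KEY, label text);
CREATE TABLE message_tag (
    message_id int REFERENCES message,
    tag_id     int REFERENCES tag,
    PRIMARY KEY (message_id, tag_id)
);

-- The array-FK alternative under discussion would fold the junction table
-- into the message row itself, roughly (hypothetical, per the patch):
--   CREATE TABLE message (
--       message_id int PRIMARY KEY,
--       body       text,
--       tag_ids    int[]   -- each element constrained to reference tag(tag_id)
--   );
```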
Re: [HACKERS] transforms
Hi Peter, On Friday, June 15, 2012 12:42:12 AM Peter Eisentraut wrote: Here is my first patch for the transforms feature. This is a mechanism to adapt data types to procedural languages. The previous proposal was here: http://archives.postgresql.org/pgsql-hackers/2012-05/msg00728.php At the surface, this contains: - New catalog pg_transform - CREATE/DROP TRANSFORM Cursory code review: * pstrndup already exists as pnstrdup (hstore_plperl.c) * PyString_FromStringAndSize return value not decreffed? PyDict_SetItem doesn't steal references * In plpython_to_hstore I would move 'pairs' and some related variables into the PG_TRY block, so the reader doesn't need to check whether they should be volatile * In the same function, 'items' needs to be volatile to fit into longjmp semantics * I don't think recording dependencies on transforms used when creating functions is a good idea, as the transform might get created after the function already exists. That seems to be a pretty confusing behaviour. * I am a bit wary that some place can be used to call functions accepting INTERNAL indirectly; I couldn't think of any immediately though. Will look into this a bit, but I am not experienced in the area, so ... * I foresee the need for multiple transforms for the same type/language pair to coexist. The user would need to manually choose/call the transform in that case. This currently isn't easily possible... *) No psql backslash commands yet. Could be added. Doesn't really seem necessary to me. Not many people will need to look at this, and the list of commands is already rather long. *) Permissions: Transforms don't have owners, a bit like casts. Currently, you are allowed to drop a transform if you own both the type and the language. That might be too strict; maybe owning the type and having privileges on the language would be enough. Seems sensible enough to me. 
*) If we want to offer the ability to write transforms to end users, then we need to install all the header files for the languages and the types. This actually worked quite well; including hstore.h and plperl.h, for example, gave you what you needed. In other cases, some headers might need cleaning up. Also, some magic from the plperl and plpython build systems might need to be exposed, for example to find the header files. See existing modules for how this currently looks. Doesn't look too bad to me. I don't know how this could be made easier. *) There is currently some syntax schizophrenia. The grammar accepts CREATE TRANSFORM FOR hstore LANGUAGE plperl ( FROM SQL WITH hstore_to_plperl, TO SQL WITH plperl_to_hstore ); but pg_dump produces CREATE TRANSFORM FOR hstore LANGUAGE plperl ( FROM SQL WITH hstore_to_plperl(hstore), TO SQL WITH plperl_to_hstore(internal) ); The SQL standard allows both. (In the same way that it allows 'DROP FUNCTION foo' without arguments, if it is not ambiguous.) Precedent is that CREATE CAST requires arguments, but CREATE LANGUAGE does not. I don't find that problematic personally. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote: This adds a new configuration parameter multimaster_node_id which determines the id used for wal originating in one cluster. Looks good, and it seems this aspect at least is committable in this CF. Design decisions I think we need to review are * Naming of field. I think origin is the right term, borrowing from Slony. * Can we add the origin_id dynamically to each WAL record? Probably no need, but let's consider why and document that. * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional node ids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. * Do we want origin_id as a parameter or as a setting in pg_control? IIRC we go to a lot of trouble elsewhere to avoid problems with changing on/off parameter values. I think we need some discussion to validate where that should live. * Is there any overhead from CRC of WAL record because of this? I'd guess not, but just want to double check thinking. Presumably there is no issue wrt Heikki's WAL changes? I assume not, but ask since I know you're reviewing that also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [BUGS] Tab completion of function arguments not working in all cases
On Mon, Jun 18, 2012 at 3:56 AM, Dean Rasheed dean.a.rash...@gmail.com wrote: On 18 June 2012 04:21, Josh Kupershmidt schmi...@gmail.com wrote: As a side note unrelated to this patch, I also dislike how function name tab-completions will not fill in the opening parenthesis, which makes for unnecessary work for the user, as one then has to type the parenthesis and hit tab again to get possible completions for the function arguments. The current behavior causes: DROP FUNCTION my_fTAB which completes to: DROP FUNCTION my_function enter parenthesis, and hit tab: DROP FUNCTION my_function(TAB which, if there is only one match, could complete to: DROP FUNCTION my_function(integer) when the last three steps could have been consolidated with better tab-completion. Perhaps this could be a TODO. Hmm, I find that it does automatically fill in the opening parenthesis, but only once there is a space after the completed function name. So DROP FUNCTION my_fTAB completes to DROP FUNCTION my_function (note the space at the end). Then pressing TAB one more time gives DROP FUNCTION my_function ( , and then pressing TAB again gives the function arguments. Alternatively DROP FUNCTION my_functionTAB (no space after function name) first completes to DROP FUNCTION my_function (adding the space), and then completes with the opening parenthesis, and then with the function arguments. It's a bit clunky, but I find that repeatedly pressing TAB is easier than typing the opening bracket. Interesting, I see the behavior you describe on Linux, using psql + libreadline, and the behavior I showed (i.e. repeatedly pressing tab won't automatically fill in the function arguments after the function name is completed, seemingly because no space is deposited after the completed function name) is with OS X 10.6, using psql + libedit. [snip] you get tab-completions for both text) and bytea(, when you probably expected only the former. That's easy to fix though, please see attached v2 patch. Good catch. 
I think that's a useful additional test, and is also consistent with the existing code in Query_for_list_of_attributes. OK, I'll mark Ready for Committer in the CF. Josh
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Hi Simon, On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote: This adds a new configuration parameter multimaster_node_id which determines the id used for wal originating in one cluster. Looks good, and it seems this aspect at least is committable in this CF. I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to replication_node_id. I don't particularly like either. Other suggestions? Design decisions I think we need to review are * Naming of field. I think origin is the right term, borrowing from Slony. I think it fits as well. * Can we add the origin_id dynamically to each WAL record? Probably no need, but let's consider why and document that. Not sure what you mean? It's already set in XLogInsert to current_replication_origin_id, which defaults to the value of the GUC. * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional node ids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. I wish we had some flag bits available before as well. I find 256 nodes a pretty low value to start with, though; 4096 sounds better, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? * Do we want origin_id as a parameter or as a setting in pg_control? IIRC we go to a lot of trouble elsewhere to avoid problems with changing on/off parameter values. I think we need some discussion to validate where that should live. Hm. 
I don't really foresee any need to have it in pg_control. What do you want to protect against with that? It would need to be changeable anyway, because otherwise it would need to become a parameter for initdb, which would suck for anybody migrating to use replication at some point. Do you want to protect against problems in replication setups after changing the value? * Is there any overhead from CRC of WAL record because of this? I'd guess not, but just want to double check thinking. I cannot imagine that there is. The actual size of the record didn't change because of alignment padding (both on 32 and 64 bit systems). Presumably there is no issue wrt Heikki's WAL changes? I assume not, but ask since I know you're reviewing that also. It might clash minimally because of offset changes or such, but otherwise there shouldn't be much. Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [PATCH 02/16] Add zeroRecPtr as a shortcut for initializing a local variable to {0, 0}
On Thursday, June 14, 2012 03:50:28 PM Robert Haas wrote: On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund and...@2ndquadrant.com wrote: This is locally defined in lots of places and would get introduced frequently in the next commits. It is expected that this can be defined in a header-only manner as soon as the XLogInsert scalability groundwork from Heikki gets in. This appears to be redundant with the existing InvalidXLogRecPtr, defined in access/transam.h. I dropped the patch from the series in the git repo and replaced every usage with the version in transam.h Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] sortsupport for text
On 18 June 2012 00:38, Tom Lane t...@sss.pgh.pa.us wrote: The only reason we test a = b and not a < b || a > b is that the latter is at least twice as expensive to evaluate. Perhaps I've been unclear. I meant to write (!(a < b) && !(b < a)), and not (a < b && b < a). The former is not tautological when the trichotomy law holds, but the latter is. My mistake - I was a little tired when I wrote that (though you seemed to know what I meant). You can sort things just fine if you only have a less-than comparator that returns a boolean, which is what I sought to illustrate with that example: There is no requirement that the comparator indicate anything more than whether one element is less than the other. In particular, it need not tell the sort function that two elements are equal, though the sort function will still (almost invariably in practice) put equal elements beside each other. Consider std::set, a highly generic template class that enforces uniqueness, usually implemented using a red-black tree (so its elements are sorted), that only requires that the datatype it is instantiated with have an operator<. The datatype used may not even have an operator== for std::set to consult if it wanted to. So, in principle, std::set could figure out if any two given elements were equivalent. That is to say, it could determine: if (!(a < b) && !(b < a)) It could not, even in principle given its implicit interface/requirements for datatypes, know for sure that: if (a == b) Despite the fact that equivalency and equality can be expected to coincide the vast majority of the time, they are not quite the same thing. It's perfectly reasonable that it might actually not hold that a and b are equal, even though they are equivalent, for certain datatypes. Certainly not for integers. 
But, for example, it could make sense to have a string datatype where with a certain locale a certain Hungarian word was equivalent to the same string with a minor variation in diacritical marks or whatever (an entirely culturally defined equivalency), even though it was not equal according to this datatype's own idiosyncratic notion of equality, which is bitwise equality. That would be a datatype that std::set would be entirely capable of rolling with, because it doesn't care about equality. It might be better if our text type had the same semantics as this hypothetical string datatype, because: 1. A culture has gone so far as to make these variations of diacritical marks or whatever equivalent. The Unicode consortium consulted with the Hungarian government, or maybe the Hungarian version of the Académie française, and they asked for this, which seems pretty serious to me. Maybe they'd like to have a unique constraint reject what they consider to be equivalent strings. If this sounds pedantic to you, google Han unification controversy, because this sounds like that in reverse (Hungarian separation controversy). The fact that when a unique constraint is violated, it isn't actually necessary to check the equality operator function (equivalence will do; no need to call texteq as when selecting with the same value in the qual) suggests that this wouldn't be too hard to facilitate, though again, I admit my understanding here is more shallow than I'd like. Now, I'll grant you that I'm not currently losing any sleep about our cultural insensitivity, and apparently neither are any Hungarians; this just serves to illustrate that my position is The Right Thing (I hope). 
Here is a description of the Hungarian problem: http://blogs.msdn.com/b/michkap/archive/2005/08/10/449909.aspx It says: the feature is such that since dzs is a compression within Hungarian, that if you see ddzs it should be treated as equivalent to dzsdzs. We're not the first people to have to deal with this: http://bugs.mysql.com/bug.php?id=12519 Here is Tom's original analysis, which led to this patch: http://archives.postgresql.org/pgsql-general/2005-12/msg00803.php 2. This way, it's still possible to do the strxfrm() optimisation more or less for free, without breaking hash methods on text. That is not to be sniffed at - sorting text is very common and important. Most people don't know about the C locale, and couldn't use it if they did. The last section of src/backend/access/nbtree/README has some notes that you might find relevant. I assume that you're referring to Notes to Operator Class Implementors. It says: An = operator must be an equivalence relation; that is, for all non-null values A, B, C of the datatype: A = A is true (reflexive law); if A = B, then B = A (symmetric law); if A = B and B = C, then A = C (transitive law). This would still hold; we'd just have to change the = operators here to something
Re: [HACKERS] sortsupport for text
On 18 June 2012 16:59, Peter Geoghegan pe...@2ndquadrant.com wrote: Perhaps more importantly, I cannot recreate any of these problems on my Fedora 16 machine. Even with hu_HU on LATIN2, Tom's original test case (from 2005, on a Fedora 4 machine) cannot be recreated. So it may be that they've tightened these things up in some way. It's far from clear why that should be. It could be worth Ugh, hit send too soon. I meant to close with the observation that it may be safe to rely on strcoll() / strxfrm() + strcmp() not ever returning 0 on certain platforms, if my inability to recreate this is anything to go by. There could be a test run by configure that determines this, enabling us to then use the strxfrm() optimisation. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
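To illustrate the optimisation under discussion, here is a rough C++ sketch (not the PostgreSQL code; the names are made up) of sorting by pre-computed strxfrm() keys with a byte-wise tie-break - the tie-break is what keeps the order total even on platforms where strcoll()/strxfrm() can report two unequal strings as equal:

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Transform a string once up front; sorting then needs only cheap
// byte-wise comparisons of the transformed blobs.
struct SortKey {
    std::string orig;
    std::string xfrm;   // result of strxfrm(orig) in the current locale
};

SortKey make_key(const std::string &s) {
    size_t n = std::strxfrm(nullptr, s.c_str(), 0);  // query needed size
    std::string buf(n + 1, '\0');
    std::strxfrm(&buf[0], s.c_str(), n + 1);
    buf.resize(n);
    return SortKey{s, buf};
}

bool key_less(const SortKey &a, const SortKey &b) {
    int c = a.xfrm.compare(b.xfrm);
    if (c != 0)
        return c < 0;
    // Tie-break on the raw bytes, so equivalent-but-unequal strings
    // still get a deterministic total order.
    return std::strcmp(a.orig.c_str(), b.orig.c_str()) < 0;
}
```

In the default C locale strxfrm() is essentially the identity, so the sketch is easy to check; the interesting behaviour appears under locales like hu_HU.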
Re: [HACKERS] initdb and fsync
On Wednesday, June 13, 2012 06:53:17 PM Jeff Davis wrote: On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote: The --help output for the -N option was copy-and-pasted wrongly. The message issued when using -N is also a bit content-free. Maybe something like Running in nosync mode. The data directory might become corrupt if the operating system crashes.\n Thank you, fixed. Which leads to the question, how does one get out of this state? Is running sync(1) enough? Is starting the postgres server enough? sync(1) calls sync(2), and the man page says: According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.) So it looks like sync is enough if you are on Linux *and* you have any unprotected write cache disabled. Protection can include write barriers; it doesn't need to be a BBU. I don't think starting the postgres server is enough. Agreed. Before, I think we were safe because we could assume that the OS would flush the buffers before you had time to store any important data. But now, that window can be much larger. I think to a large degree we didn't see any problem because of the old ext3 sync-the-whole-world type of behaviour which was common for a very long time. So on ext3 (with data=ordered, the default) any fsync (checkpoints, commit, ...) would lead to all the other files being synced as well, which means starting the server once would have been enough on such a system... Quick review: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified?
- I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname instead of cluttering the main function with it Looks good otherwise! Thanks, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pgsql_fdw in contrib
On Thu, Jun 14, 2012 at 7:29 AM, Shigeru HANADA shigeru.han...@gmail.com wrote: I'd like to propose pgsql_fdw, FDW for PostgreSQL, as a contrib module in core, again. This patch is basically a rebased version of the patches proposed in the 9.2 development cycle, and contains some additional changes such as concern about volatile variables for PG_CATCH blocks. In addition, I combined the old three patches (pgsql_fdw core functionality, push-down support, and analyze support) into one patch for ease of review. (I think these features are necessary for pgsql_fdw use.) After the development for the 9.2 cycle, pgsql_fdw can gather statistics from the remote side by providing a sampling function to the analyze module in the backend core, use them to estimate selectivity of WHERE clauses which will be pushed down, and retrieve query results from the remote side with a custom row processor which prevents memory exhaustion from huge result sets. It would be possible to add some more features, such as ORDER BY push-down with index information support, without changing existing APIs, but adding a relatively simple pgsql_fdw first and enhancing it later seems better. In addition, once pgsql_fdw has been merged, it would help in implementing proofs-of-concept of SQL/MED-related features. I can't help but wonder (having been down the contrib/core/extension road myself) if it isn't better to improve the facilities to register and search for qualified extensions (like Perl CPAN) so that people looking for code to improve their backends can find it. That way, you're free to release/do xyz/abandon your project as you see fit without having to go through -hackers. This should also remove a lot of the stigma of not being in core, since if stock postgres installations can access the necessary modules via CREATE EXTENSION, I think it will make it easier for projects like this to get used, with the additional benefit of decentralizing project management.
merlin
Re: [HACKERS] Skip checkpoint on promoting from streaming replication
On Mon, Jun 18, 2012 at 5:42 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: What do you think about this? What happens if the server skips an end-of-recovery checkpoint, is promoted to the master, runs some write transactions, crashes and restarts automatically before it completes a checkpoint? In this case, the server needs to do crash recovery from the last checkpoint record with the old timeline ID to the latest WAL record with the new timeline ID. How does crash recovery do recovery beyond the timeline switch? Regards, -- Fujii Masao
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
Excerpts from Tom Lane's message of Sat Jun 16 02:41:00 -0400 2012: Amit kapila amit.kap...@huawei.com writes: The suggested patch improves the logic to recover a corrupt control file. So that is the reason I felt it will be relevant to do this patch. Well, we invented pg_resetxlog with the thought that it might be useful for such situations, but I'm not sure offhand that we've ever seen a field report of corrupted pg_control files. For instance, a quick search in the archives for incorrect checksum in control file turns up only cases of pilot error, such as supposing that a 32-bit database could be used with a 64-bit server or vice versa. Actual hardware failures on the pg_control file could be expected to result in something like could not read from control file: I/O error, which I find no evidence for at all in the archives. Hm, what about the situation where pg_control is lost completely to a filesystem failure? I remember doing disaster recovery on this problem once ... As far as I recall the pg_xlog files were in a separate partition so they weren't lost. Some other files in the main data partition were lost as well. (I don't remember what it is that we had to do to create a fake pg_control). Maybe, even if Amit's code does not end up in pg_resetxlog, it could be useful as a DR tool, assuming the code does not cause endless maintenance burden. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Tom Lane's message of Sat Jun 16 02:41:00 -0400 2012: Well, we invented pg_resetxlog with the thought that it might be useful for such situations, but I'm not sure offhand that we've ever seen a field report of corrupted pg_control files. Hm, what about the situation where pg_control is lost completely to a filesystem failure? I remember doing disaster recovery on this problem once ... As far as I recall the pg_xlog files were in a separate partition so they weren't lost. Some other files in the main data partition were lost as well. Hm ... well, as long as we have a clear idea of a use-case, I'm not opposed in principle to working on this area. (I don't remember what it is that we had to do to create a fake pg_control). AFAIR you can create pg_control from scratch already with pg_resetxlog. The hard part is coming up with values for the counters, such as the next WAL location. Some of them such as next OID are pretty harmless if you don't guess right, but I'm worried that wrong next WAL could make things worse not better. regards, tom lane
Re: [HACKERS] temporal support patch
On Mon, 2012-06-18 at 19:34 +0900, Vlad Arkhipov wrote: What's wrong with the SPI/timetravel extension for system versioning? http://www.postgresql.org/docs/9.1/static/contrib-spi.html We are heavily using system-versioned and application-time period tables in our enterprise products (most of them are bi-temporal). However, our implementation is based on triggers and views and therefore is not very convenient to use. There are also some locking issues with foreign keys to application-time period tables. It would be great if the new temporal SQL features were included in the PostgreSQL core with SQL:2011 syntax support. It is especially important for bi-temporal tables because of the complex internal logic of UPDATE/DELETE and huge SELECT queries for such tables. I've already pointed out some missing features in this thread, but the big ones in my mind are: 1. It doesn't use 9.2 Range Types, which would help in a lot of ways (like making the SELECT queries a lot simpler and faster). 2. It's missing a lot of options, like storing the user that modified a row or the changed columns. Regards, Jeff Davis
Re: [HACKERS] libpq compression
On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote: The fly in the ointment with any of these ideas is that the configure list is not a list of exact cipher names, as per Magnus' comment that the current default includes tests like !aNULL. I am not sure that we know how to evaluate such conditions if we are applying an after-the-fact check on the selected cipher. Does OpenSSL expose any API for evaluating whether a selected cipher meets such a test? I'm not sure whether there's an API for it, but you can certainly check manually with openssl ciphers -v, for example: $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP' NULL-SHA SSLv3 Kx=RSA Au=RSA Enc=None Mac=SHA1 NULL-MD5 SSLv3 Kx=RSA Au=RSA Enc=None Mac=MD5 ...etc... So unless openssl includes the code twice there must be a way to extract the list from the library. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
Excerpts from Boszormenyi Zoltan's message of Fri May 11 03:54:13 -0400 2012: Hi, another rebasing, and I applied the GIT changes in ada8fa08fc6cf5f199b6df935b4d0a730aaa4fec to the Windows implementation of PGSemaphoreTimedLock. Hi, I gave the framework patch a look. One thing I'm not sure about is the way you've defined the API. It seems a bit strange to have a nice and clean, generic interface in timeout.h; and then have the internal implementation of the API cluttered with details such as what to do when the deadlock timeout expires. Wouldn't it be much better if we could keep the details of CheckDeadLock itself within proc.c, for example? Same for the other specific Check*Timeout functions. It seems to me that the only timeout.c-specific requirement is to be able to poke base_timeouts[].indicator and fin_time. Maybe that can get done in timeout.c and then have the generic checker call the module-specific checker. ... I see that you have things to do before and after setting indicator. Maybe you could split the module-specific Check functions in two and have timeout.c call each in turn. Other than that I don't see that this should pose any difficulty. Also, I see you're calling GetCurrentTimestamp() in the checkers; maybe more than once per signal if you have multiple timeouts enabled. Maybe it'd be better to call it once in handle_sig_alarm and then pass the value down, if necessary (AFAICS if you restructure the code as I suggest above, you don't need to get the value down to the module-specific code). As for the Init*Timeout() and Destroy*Timeout() functions, they seem a bit pointless -- I mean if they just always call the generic InitTimeout and DestroyTimeout functions, why not just define the struct in a way that makes the specific functions unnecessary? Maybe you even already have all that's necessary ... I think you just change base_timeouts[i].timeout_destroy(false); to DestroyTimeout(i, false) and so on.
Minor nitpick: the Interface: comment in timeout.c seems useless. That kind of thing tends to get overlooked and obsolete over time. We have examples of such things all over the place. I'd just rip it and have timeout.h be the interface documentation. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
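For concreteness, the API shape being suggested might look roughly like this (a hypothetical C++ sketch, not the patch; all names are illustrative): timeout.c keeps the generic bookkeeping and calls one module-specific callback per expired timeout, passing the timestamp down so the clock is read only once per signal.

```cpp
#include <cassert>
#include <ctime>
#include <functional>
#include <vector>

// One registered timeout: the generic fields live in "timeout.c",
// the module-specific behaviour is just a callback.
struct Timeout {
    std::time_t fin_time = 0;               // when this timeout fires
    bool indicator = false;                 // set by the generic code
    bool enabled = false;
    std::function<void(std::time_t)> check; // e.g. CheckDeadLock, in proc.c
};

static std::vector<Timeout> base_timeouts;

// What handle_sig_alarm() would do: one clock read, shared by all
// expired timeouts, passed down to the module-specific checkers.
void handle_alarm(std::time_t now) {
    for (auto &t : base_timeouts) {
        if (t.enabled && now >= t.fin_time) {
            t.indicator = true;   // generic bookkeeping stays here
            t.enabled = false;
            t.check(now);         // module-specific part stays elsewhere
        }
    }
}
```

The point of the shape is the one Alvaro makes: the generic module pokes indicator/fin_time itself, and the per-module logic never needs to fetch the timestamp again.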
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
AFAIR you can create pg_control from scratch already with pg_resetxlog. The hard part is coming up with values for the counters, such as the next WAL location. Some of them such as next OID are pretty harmless if you don't guess right, but I'm worried that wrong next WAL could make things worse not better. I believe that if the WAL files are intact, as mentioned in Alvaro's mail, the proposed logic should generate correct values. Do you see any problem in the logic proposed in my original mail? Can I resume my work on this feature? With Regards, Amit Kapila.
Re: [HACKERS] WAL format changes
On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund and...@2ndquadrant.com wrote: 1. Use a 64-bit segment number, instead of the log/seg combination. And don't waste the last segment on each logical 4 GB log file. The concept of a logical log file is now completely gone. XLogRecPtr is unchanged, but it should now be understood as a plain 64-bit value, just split into two 32-bit integers for historical reasons. On disk, this means that there will be log files ending in FF, those were skipped before. What's the reason for keeping that awkward split now? There aren't that many users of xlogid/xrecoff and many of those would be better served by using helper macros. I wondered that, too. There may be a good reason for keeping it split up that way, but we at least oughta think about it a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL format changes
On 18.06.2012 21:00, Robert Haas wrote: On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund and...@2ndquadrant.com wrote: 1. Use a 64-bit segment number, instead of the log/seg combination. And don't waste the last segment on each logical 4 GB log file. The concept of a logical log file is now completely gone. XLogRecPtr is unchanged, but it should now be understood as a plain 64-bit value, just split into two 32-bit integers for historical reasons. On disk, this means that there will be log files ending in FF, those were skipped before. What's the reason for keeping that awkward split now? There aren't that many users of xlogid/xrecoff and many of those would be better served by using helper macros. I wondered that, too. There may be a good reason for keeping it split up that way, but we at least oughta think about it a bit. The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
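For reference, the arithmetic being discussed is simple either way; a sketch (assumed 16 MB segment size and illustrative names - the real definitions live in PostgreSQL's headers, not here) of converting between the historical split representation and a flat 64-bit position:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t XLOG_SEG_SIZE = 16 * 1024 * 1024;  // assumed 16 MB segments

// The historical two-field layout of a WAL position.
struct XLogRecPtrSplit {
    uint32_t xlogid;    // high 32 bits ("logical log file")
    uint32_t xrecoff;   // low 32 bits (byte offset within it)
};

uint64_t to_uint64(XLogRecPtrSplit p) {
    return (static_cast<uint64_t>(p.xlogid) << 32) | p.xrecoff;
}

XLogRecPtrSplit from_uint64(uint64_t v) {
    return { static_cast<uint32_t>(v >> 32), static_cast<uint32_t>(v) };
}

// With no skipped segments (no more "FF hole"), the 64-bit segment
// number is just a division of the flat position.
uint64_t seg_no(uint64_t v) { return v / XLOG_SEG_SIZE; }
```

With the flat representation the calculations collapse to plain integer arithmetic, which is the simplification Heikki alludes to.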
Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed
On Fri, Jun 15, 2012 at 9:22 AM, Ants Aasma a...@cybertec.at wrote: Exactly. I think the first question for this patch should be whether this use-case is worth the complexity of the patch. I can't imagine any really compelling use cases that need an arbitrary distinct subset of results. Me neither. The original complaint on -performance [1] didn't specify a real world use case, but it seemed to be a case of an ORM generating suboptimal queries. On the other hand, the patch itself is in my opinion rather simple, so it might be worth it. Yeah. It has one outstanding issue: query_planner chooses the cheapest path based on total cost. This can be suboptimal when that path happens to have high startup cost. It seems to me that enabling the query_planner to find the cheapest unsorted path returning a limited amount of tuples would require some major surgery to the planner. To be clear, this is only a case of missed optimization, not a regression. I'm confused by this remark, because surely the query planner does it this way only if there's no LIMIT. When there is a LIMIT, we choose based on the startup cost plus the estimated fraction of the total cost we expect to pay based on dividing the LIMIT by the overall row count estimate. Or is this not what you're talking about? It won't help set returning functions because the tuplestore for those is fully materialized when the first row is fetched. That's surely a problem for another day. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
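The fractional costing Robert describes can be sketched like this (illustrative names, not the planner's actual functions; the formula mirrors his description in outline):

```cpp
#include <cassert>

// A path's cost estimate, in the planner's two-part form.
struct PathCost {
    double startup;   // cost incurred before the first tuple comes out
    double total;     // cost to return every tuple
    double rows;      // estimated row count of the path
};

// Expected cost of fetching only `limit` tuples from this path:
// pay the startup cost in full, then a LIMIT/rows fraction of the
// per-tuple (run) cost.
double limited_cost(const PathCost &p, double limit) {
    if (limit >= p.rows)
        return p.total;
    double fraction = limit / p.rows;
    return p.startup + fraction * (p.total - p.startup);
}
```

Under this rule a path with low startup cost can beat one with lower total cost once a small LIMIT is present, which is exactly the trade-off being debated.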
Re: [HACKERS] WAL format changes
On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote: On 18.06.2012 21:00, Robert Haas wrote: On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund and...@2ndquadrant.com wrote: 1. Use a 64-bit segment number, instead of the log/seg combination. And don't waste the last segment on each logical 4 GB log file. The concept of a logical log file is now completely gone. XLogRecPtr is unchanged, but it should now be understood as a plain 64-bit value, just split into two 32-bit integers for historical reasons. On disk, this means that there will be log files ending in FF, those were skipped before. What's the reason for keeping that awkward split now? There aren't that many users of xlogid/xrecoff and many of those would be better served by using helper macros. I wondered that, too. There may be a good reason for keeping it split up that way, but we at least oughta think about it a bit. The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. I am out of my depth here, not having read any of the relevant code, but couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without marking the page dirty? There is the valid argument that you would lose some information when pages with hint bits are written out again, but on the other hand you would also gain the information that it was a hint-bit write... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] WAL format changes
On Mon, Jun 18, 2012 at 2:08 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 18.06.2012 21:00, Robert Haas wrote: On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund and...@2ndquadrant.com wrote: 1. Use a 64-bit segment number, instead of the log/seg combination. And don't waste the last segment on each logical 4 GB log file. The concept of a logical log file is now completely gone. XLogRecPtr is unchanged, but it should now be understood as a plain 64-bit value, just split into two 32-bit integers for historical reasons. On disk, this means that there will be log files ending in FF, those were skipped before. What's the reason for keeping that awkward split now? There aren't that many users of xlogid/xrecoff and many of those would be better served by using helper macros. I wondered that, too. There may be a good reason for keeping it split up that way, but we at least oughta think about it a bit. The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. Ugh. Well, that's a good reason for thinking twice before changing it, if not abandoning the idea altogether. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL format changes
On 18.06.2012 21:13, Andres Freund wrote: On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote: The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. I am out of my depth here, not having read any of the relevant code, but couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without marking the page dirty? There is the valid argument that you would lose some information when pages with hint bits are written out again, but on the other hand you would also gain the information that it was a hint-bit write... Sorry, I don't understand that. Where would you replace the LSN from disk with InvalidXLogRecPtr? (and no, it probably won't work ;-) ) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] initdb and fsync
On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... The regression test suite also starts postgres with fsync off. This is good, and I don't want to have more overhead in the tests. It's not like we already have awesome test coverage for OS interaction.
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote: Quick review: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... I removed the -N as you suggest. How much does performance matter on the buildfarm? - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. If you feel there will be a maintainability problem, I can give it another shot. - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname instead of cluttering the main function with it Done. Regards, Jeff Davis initdb-fsync-20120618.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] WAL format changes
On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote: On 18.06.2012 21:13, Andres Freund wrote: On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote: The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. I am out of my depth here, not having read any of the relevant code, but couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without marking the page dirty? There is the valid argument that you would lose some information when pages with hint bits are written out again, but on the other hand you would also gain the information that it was a hint-bit write... Sorry, I don't understand that. Where would you replace the LSN from disk with InvalidXLogRecPtr? (and no, it probably won't work ;-) ) In ReadBuffer_common or such, after reading a page from disk and verifying that the page has a valid header, it seems to be possible to replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page dirty. If the page isn't modified the LSN on disk won't be changed, so you don't lose debugging information in that case. It will be zero if it has been written by a hint-bit write, and that's arguably a win. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote: On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... The regression test suite also starts postgres with fsync off. This is good, and I don't want to have more overhead in the tests. It's not like we already have awesome test coverage for OS interaction. OK, changing back to using -N during regression tests due to performance concerns. Regards, Jeff Davis initdb-fsync-20120618-2.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] initdb and fsync
On Monday, June 18, 2012 08:39:47 PM Jeff Davis wrote: On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote: Quick review: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... I removed the -N as you suggest. How much does performance matter on the buildfarm? I don't think the difference in initdb cost is relevant when running the regression tests. Should it prove to be, we can re-add -N after a week or two in the buildfarm machines. I just remember that there were several OS-specific regressions when adding the pre-syncing for createdb. - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. Ok. Sensible reasons. I dislike that we now have two files using different logic (copydir.c only using fadvise, initdb using sync_file_range if available). Maybe we should just move the fadvise and sync_file_range calls into their own common function? If you feel there will be a maintainability problem, I can give it another shot. It's not too bad yet I guess, so ... - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname instead of cluttering the main function with it Done. Looks good to me. Btw, I just want to have said this, although I don't think it's particularly relevant as it doesn't affect correctness: it's possible to have a system where sync_file_range is in the system headers but the kernel during runtime doesn't support it. It is relatively new (2.6.17). It would be possible to fall back to posix_fadvise, which has been around far longer, in that case...
Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
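The runtime fallback Andres mentions could look roughly like this (a hypothetical sketch, not the patch; error handling elided): try sync_file_range(), and if the running kernel predates it (ENOSYS), fall back to posix_fadvise(), which also nudges dirty pages toward the disk.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical pre-sync helper: hint the kernel to start writing out
// a file's dirty pages, without waiting for completion (an fsync()
// later does the actual durability work).
int pre_sync_fd(int fd) {
#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    // Preferred path on Linux >= 2.6.17: initiate writeback only.
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) == 0)
        return 0;
    if (errno != ENOSYS)          // real error: report it
        return -1;
    // ENOSYS: headers had it, running kernel doesn't - fall through.
#endif
#if defined(POSIX_FADV_DONTNEED)
    // Older, more portable hint; also tends to trigger writeback.
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    return 0;                     // nothing useful on this platform
#endif
}
```

The compile-time guards handle the headers; the ENOSYS check handles the header-newer-than-kernel case that the review points out.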
Re: [HACKERS] WAL format changes
On 18.06.2012 21:45, Andres Freund wrote: On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote: On 18.06.2012 21:13, Andres Freund wrote: On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote: The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. I am out of my depth here, not having read any of the relevant code, but couldn't we simply replace the LSN from disk with InvalidXLogRecPtr without marking the page dirty? There is the valid argument that you would lose some information when pages with hint bits are written out again, but on the other hand you would also gain the information that it was a hint-bit write... Sorry, I don't understand that. Where would you replace the LSN from disk with InvalidXLogRecPtr? (and no, it probably won't work ;-) ) In ReadBuffer_common or such, after reading a page from disk and verifying that the page has a valid header, it seems to be possible to replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page dirty. If the page isn't modified, the LSN on disk won't be changed, so you don't lose debugging information in that case. It will be zero if it has been written by a hint-bit write, and that's arguably a win. We use the LSN to decide whether a full-page image needs to be xlogged if the page is modified. If you reset the LSN every time you read in a page, you'll be doing full page writes every time a page is read from disk and modified, whether or not it's the first time the page is modified after the last checkpoint. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012: Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr and SvPVUTF8() when turning a perl string into a cstring. Hmm, this patch belongs into back branches too, right? Not just the current development tree? -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
[HACKERS] [Review] Add SPI_gettypmod() to return a field's typemod from a TupleDesc
On Wed, Feb 1, 2012 at 5:23 AM, Chetan Suttraway chetan(dot)suttraway(at)enterprisedb(dot)com wrote: Hi All, This is regarding the TODO item: Add SPI_gettypmod() to return a field's typemod from a TupleDesc The related message is: http://archives.postgresql.org/pgsql-hackers/2005-11/msg00250.php This basically talks about having an SPI_gettypmod() which returns the typmod of a field of a tupdesc. Please refer to the attached patch based on the suggested implementation. Here's my review of this patch. Basic stuff: - Patch applies OK - Compiles cleanly with no warnings - Regression tests pass. - Documentation changes missing: section 43.2 (Interface Support Functions) needs an entry and a description for SPI_gettypmod. What it does: -- A new SPI interface function is exposed, used for returning the atttypmod of the specified column. If the attribute number is out of range, it sets SPI_result to SPI_ERROR_NOATTRIBUTE and returns -1; otherwise it returns the attribute's typmod. Testing: --- Created a C function and validated the typmod for basic output data types - numeric, varchar with valid typmod =C-Function=

PG_FUNCTION_INFO_V1(test_SPI_gettypmod);
Datum
test_SPI_gettypmod(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    Oid         AttidNo = PG_GETARG_OID(1);
    Relation    rel;
    TupleDesc   tupdesc;        /* tuple description */
    int4        retval;

    rel = relation_open(relid, NoLock);
    if (!rel)
        PG_RETURN_INT32(0);

    tupdesc = rel->rd_att;
    retval = SPI_gettypmod(tupdesc, AttidNo);
    relation_close(rel, NoLock);
    PG_RETURN_INT32(retval);
}

==Function creation== CREATE FUNCTION test_spi_gettypmod (oid, int) RETURNS int AS '/lib/mytest.so','test_SPI_gettypmod' LANGUAGE C STRICT; ===Output=

postgres=# \d tbl
       Table "public.tbl"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 a      | integer                     |
 b      | character varying(100)      |
 c      | numeric(10,5)               |
 d      | numeric                     |
 e      | character varying           |
 f      | timestamp without time zone |

postgres=# \d tt1
       Table "public.tt1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 t      | tbl                    |
 b      | character varying(200) |

postgres=# select atttypmod, attname from pg_attribute where attrelid = 24577;
 atttypmod | attname
-----------+----------
        -1 | tableoid
        -1 | cmax
        -1 | xmax
        -1 | cmin
        -1 | xmin
        -1 | ctid
        -1 | a
       104 | b
    655369 | c
(9 rows)

postgres=# select test_spi_gettypmod(24577, 3);
 test_spi_gettypmod
--------------------
             655369
(1 row)

postgres=# alter table tbl add column d numeric;
ALTER TABLE
postgres=# alter table tbl add column e varchar;
ALTER TABLE
postgres=# select test_spi_gettypmod(24577, 4);
 test_spi_gettypmod
--------------------
                 -1
(1 row)

postgres=# select test_spi_gettypmod(24577, 5);
 test_spi_gettypmod
--------------------
                 -1
(1 row)

postgres=# select test_spi_gettypmod( 24592, 1);
 test_spi_gettypmod
--------------------
                 -1
(1 row)

postgres=# alter table tt1 add column b varchar(200);
ALTER TABLE
postgres=# select test_spi_gettypmod( 24592, 2);
 test_spi_gettypmod
--------------------
                204
(1 row)

Conclusion: --- The patch changes are okay but need a documentation update, so I am marking it as Waiting On Author. 1. For this kind of patch it is okay even if no new regression test is added. 2. Documentation needs to be updated. 3. In case the attribute number is not in the valid range, it currently returns -1, but -1 is a valid typmod. The committer can verify if this is appropriate. 4. In functions exec_eval_datum and exec_get_datum_type_info, in my opinion it would be better to use SPI_gettypmod() to get the attribute typmod, because a. there is a comment /* XXX there's no SPI_gettypmod, for some reason */ and the code uses SPI_gettypeid to get the typeid. b. it will maintain consistency with the nearby code such as it will use
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote: - defaulting to initdb -N in the regression suite is not a good idea imo, because that way the buildfarm won't catch problems in that area... I removed the -N as you suggest. How much does performance matter on the buildfarm? I don't think the difference in initdb cost is relevant when running the regression tests. Should it prove to be, we can re-add -N after a week or two in the buildfarm machines. I just remember that there were several OS-specific regressions when adding the pre-syncing for createdb. That sounds reasonable to me. Both patches are out there, so we can figure out what the consensus is. - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. Ok. Sensible reasons. I dislike that we now have two files using different logic (copydir.c only using fadvise, initdb using sync_file_range if available). Maybe we should just move the fadvise and sync_file_range calls into their own common function? I don't see fadvise in copydir.c, it looks like it just uses fsync. It might speed it up to use a pre-sync call there, too -- database creation does take a while. If that's in the scope of this patch, I'll do it. Btw, I just want to have said this, although I don't think it's particularly relevant as it doesn't affect correctness: It's possible to have a system where sync_file_range is in the system headers but the kernel during runtime doesn't support it. It is relatively new (2.6.17). It would be possible to fall back to posix_fadvise, which has been around far longer, in that case... Interesting point, but I'm not too worried about it. 
Regards, Jeff Davis
[HACKERS] Transactions over pathological TCP connections
Out of (mostly idle) curiosity, when exactly does a transaction commit, especially with respect to a TCP connection that a pathological demon will cut off at the worst possible moment? The thing is, I'm using PostgreSQL as a queue, using asynchronous notifications and following the advice of Marko Tiikkaja in this post: http://johtopg.blogspot.com/2010/12/queues-in-sql.html I'm using a stored procedure to reduce the round trips between the database and client, and then running it in a bare transaction, that is, as SELECT dequeue_element(); with an implicit BEGIN/COMMIT to mark a row in the queue as taken and return it. My question is, would it be theoretically possible for an element of a queue to become marked but not delivered, or delivered and not marked, if the TCP connection between the backend and client was interrupted at the worst possible moment? Will the backend wait for the delivery of the row to be acknowledged before the transaction is committed? Or should the truly paranoid use an explicit transaction block and not consider the row taken until confirmation that the transaction has committed has been received? Best, Leon
Re: [HACKERS] initdb and fsync
On Monday, June 18, 2012 09:32:25 PM Jeff Davis wrote: - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be unified? There's a lot of backend-specific code in the copydir versions, like using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at unifying them before, and concluded that it wouldn't add to the readability, so I just commented where they came from. Ok. Sensible reasons. I dislike that we now have two files using different logic (copydir.c only using fadvise, initdb using sync_file_range if available). Maybe we should just move the fadvise and sync_file_range calls into their own common function? I don't see fadvise in copydir.c, it looks like it just uses fsync. It might speed it up to use a pre-sync call there, too -- database creation does take a while. If that's in the scope of this patch, I'll do it. It calls pg_flush_data inside of copy_file, which does the posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? Greetings, Andres
Re: [HACKERS] initdb and fsync
Excerpts from Jeff Davis's message of lun jun 18 15:32:25 -0400 2012: On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote: Btw, I just want to have said this, although I don't think it's particularly relevant as it doesn't affect correctness: It's possible to have a system where sync_file_range is in the system headers but the kernel during runtime doesn't support it. It is relatively new (2.6.17). It would be possible to fall back to posix_fadvise, which has been around far longer, in that case... Interesting point, but I'm not too worried about it. Yeah. 2.6.17 was released in June 2006. The latest stable release prior to 2.6.17 was 2.6.16.62 in 2008, and it was abandoned at that time in favor of 2.6.27 as the new stable branch. I don't think this is a problem any sane person is going to face; and telling them to upgrade to a newer kernel seems an acceptable answer. -- Álvaro Herrera alvhe...@commandprompt.com
[HACKERS] [Review] Prevent the specification of conflicting transaction read/write options
Chetan Suttraway chetan(dot)suttraway(at)enterprisedb(dot)com wrote: This is regarding the TODO item: Prevent the specification of conflicting transaction read/write options listed at: http://wiki.postgresql.org/wiki/Todo Here's my review of this patch. Basic stuff: - Patch applies OK - Compiles cleanly with no warnings - changes in gram.y do not introduce any conflicts. - Regression tests pass. - Documentation changes not required as the scenario is obvious. What it does: - Prevents the specification of conflicting transaction mode options in commands such as START/BEGIN TRANSACTION, SET TRANSACTION, SET SESSION CHARACTERISTICS AS TRANSACTION transaction_mode. Conflicting transaction modes include 1. READ WRITE | READ ONLY 2. ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } 3. DEFERRABLE | NOT DEFERRABLE The changes are done in the gram.y file to add a new function which checks for conflicting transaction modes. Testing: postgres=# BEGIN TRANSACTION read only read write; ERROR: conflicting options LINE 1: BEGIN TRANSACTION read only read write; postgres=# SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED ISOLATION LEVEL READ unCOMMITTED; ERROR: conflicting options LINE 1: ...ON CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMI... 
postgres=# start transaction deferrable not deferrable; ERROR: conflicting options LINE 1: start transaction deferrable not deferrable; ^ postgres=# start transaction deferrable read only not deferrable; ERROR: conflicting options LINE 1: start transaction deferrable read only not deferrable; ^ postgres=# start transaction deferrable, read only not deferrable; ERROR: conflicting options LINE 1: start transaction deferrable, read only not deferrable; ^ postgres=# start transaction deferrable, read only, not deferrable; ERROR: conflicting options LINE 1: start transaction deferrable, read only, not deferrable; postgres=# start transaction deferrable, read only, ISOLATION LEVEL READ COMMITTED; START TRANSACTION postgres=# commit; COMMIT postgres=# start transaction deferrable, read only, ISOLATION LEVEL READ UNCOMMITTED; START TRANSACTION postgres=# commit; COMMIT postgres=# start transaction deferrable, read only, ISOLATION LEVEL READ UNCOMMITTED, read only; START TRANSACTION ^ postgres=# SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED ISOLATION LEVEL READ COMMITTED; SET ^ Code Review Comments -- 1. The complexity of the function check_trans_mode is high (O(n^2)); it can be reduced to O(n) if we do not add duplicate elements in transaction_mode_list: transaction_mode_item { $$ = list_make1($1); } | transaction_mode_list ',' transaction_mode_item { check_trans_mode((List *)$1, (DefElem *)$3, yyscanner); $$ = lappend($1, $3); } | transaction_mode_list transaction_mode_item { check_trans_mode((List *)$1, (DefElem *)$2, yyscanner); $$ = lappend($1, $2); } i.e. if start transaction read only read only is given, then add the mode read only to the list only the first time. That way the length of a valid list is limited to at most 3 elements (modes) (1 ISOLATION LEVEL, 1 READ WRITE/READ ONLY, 1 DEFERRABLE). This reduces the search complexity of check_trans_mode to 3n = O(n). 2. During parse, when you call check_trans_mode((List *)$1, (DefElem *)$3, yyscanner); what is the need of casting the first and second parameters? 3. The comment describing the function can be more specific: patch - checks for conflicting transaction modes by looking up current element in the given list. suggested - checks for conflicting transaction modes by looking up the current transaction mode element in the given transaction mode list. 4. The names used for function parameters and variables can be more meaningful. a. check_trans_mode(List *list, DefElem *elem, core_yyscan_t yyscanner) For the first parameter, instead of list you can use modes; it better suggests the meaning of the parameter. For the second parameter, instead of elem you can use mode or input_mode, or something better if it comes to your mind. b. A_Const *elem_arg;
Re: [HACKERS] initdb and fsync
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote: It calls pg_flush_data inside of copy_file, which does the posix_fadvise... So maybe just put the sync_file_range in pg_flush_data? Oh, I didn't notice that, thank you. In that case, it may be good to combine them if possible. I will look into it. There may be performance implications when used on a larger amount of data, though. I can do some brief testing. Regards, Jeff Davis
Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework
BTW it doesn't seem that datatype/timestamp.h is really necessary in timeout.h. -- Álvaro Herrera alvhe...@commandprompt.com
Re: [HACKERS] WAL format changes
On Monday, June 18, 2012 09:19:48 PM Heikki Linnakangas wrote: On 18.06.2012 21:45, Andres Freund wrote: On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote: On 18.06.2012 21:13, Andres Freund wrote: On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote: The page header contains an XLogRecPtr (LSN), so if we change it we'll have to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as the on-disk representation, and convert between the 64-bit integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog calculations would admittedly be simpler if it was a uint64. I am out of my depth here, not having read any of the relevant code, but couldn't we simply replace the LSN from disk with InvalidXLogRecPtr without marking the page dirty? There is the valid argument that you would lose some information when pages with hint bits are written out again, but on the other hand you would also gain the information that it was a hint-bit write... Sorry, I don't understand that. Where would you replace the LSN from disk with InvalidXLogRecPtr? (and no, it probably won't work ;-) ) In ReadBuffer_common or such, after reading a page from disk and verifying that the page has a valid header, it seems to be possible to replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page dirty. If the page isn't modified, the LSN on disk won't be changed, so you don't lose debugging information in that case. It will be zero if it has been written by a hint-bit write, and that's arguably a win. We use the LSN to decide whether a full-page image needs to be xlogged if the page is modified. If you reset the LSN every time you read in a page, you'll be doing full page writes every time a page is read from disk and modified, whether or not it's the first time the page is modified after the last checkpoint. Yea, I somehow disregarded that hint-bit writes would be a problem with full page writes in that case. 
Normal writes wouldn't be a problem... This should be fixable, but the result would be too ugly. So it's either converting the on-disk representation or keeping the duplicated representation. pd_lsn seems to be well enough abstracted; doing the conversion in PageSet/GetLSN seems to be easy enough. Andres
Re: [HACKERS] Testing 9.2 in ~production environment
Compare the output of pg_config --configure from both installations. The only differences are 9.1 vs 9.2 in the paths. Can you check the collations of the two databases? I'm wondering if 9.1 is in C collation and 9.2 is something else. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund and...@2ndquadrant.com wrote: * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional nodeids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with though, 4096 sounds better though, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? What's the cost of going a lot higher? Because if one makes enough numerical space available, one can assign node identities without a coordinator, a massive decrease in complexity. -- fdr
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote: On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund and...@2ndquadrant.com wrote: * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional nodeids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with though, 4096 sounds better though, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? What's the cost of going a lot higher? Because if one makes enough numerical space available, one can assign node identities without a coordinator, a massive decrease in complexity. It would increase the size of every WAL record. We just have 16 bits left there by chance... Andres
Re: [HACKERS] Testing 9.2 in ~production environment
JB == Josh Berkus j...@agliodbs.com writes: JB Can you check the collations of the two databases? I'm wondering if 9.1 JB is in C collation and 9.2 is something else. Thanks! pg_dump -C tells me these two differences: -SET client_encoding = 'SQL_ASCII'; +SET client_encoding = 'UTF8'; -CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LC_COLLATE = 'C' LC_CTYPE = 'C'; +CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'C' LC_CTYPE = 'en_US.UTF-8'; for every db in the clusters. I presume that lc_ctype is the significant difference? LC_CTYPE *is* specified as 'C' in the dump from which I created the 9.2 cluster, so it must have been overridden by pg_restore. I see that my dist's /etc rc script now sets LC_CTYPE. Would that explain why lc_ctype changed between the two clusters? Is there any way to alter a db's lc_ctype w/o dumping and restoring? I want to preserve some of the changes made since I copied the 9.1 cluster. Alter database reports that lc_ctype cannot be changed. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote: Hi Simon, On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote: On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote: This adds a new configuration parameter multimaster_node_id which determines the id used for wal originating in one cluster. Looks good and it seems this aspect at least is commitable in this CF. I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to replication_node_id. I don't particularly like either. Other suggestions? I wonder if it should be origin_node_id? That is the term Slony uses. Design decisions I think we need to review are * Naming of field. I think origin is the right term, borrowing from Slony. I think it fits as well. * Can we add the origin_id dynamically to each WAL record? Probably no need, but lets consider why and document that. Not sure what you mean? It's already set in XLogInsert to current_replication_origin_id, which defaults to the value of the guc? * Size of field. 16 bits is enough for 32,000 master nodes, which is quite a lot. Do we need that many? I think we may have need for a few flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe 8 bits. Even if we don't need them in this release, I'd like to have them. If they remain unused after a few releases, we may choose to redeploy some of them as additional nodeids in future. I don't foresee complaints that 256 master nodes is too few anytime soon, so we can defer that decision. I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with; 4096 sounds better, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... 
Even though the number of nodes that can reasonably participate in replication is likely to be not too terribly large, it might be good to allow larger values, in case someone is keen on encoding something descriptive in the node number. If you restrict the number to a tiny range, then you'll be left wanting some other mapping. At one point, I did some work trying to get a notion of named nodes implemented in Slony; gave up on it, as the coordination process was wildly too bug-prone. In our environment, at Afilias, we have used quasi-symbolic node numbers that encoded something somewhat meaningful about the environment. That seems better to me than the risky kludge of saying: - The first node I created is node #1 - The second one is node #2. - The third and fourth are #3 and #4 - I dropped node #2 due to a problem, and thus the new node 2 is #5. That numbering scheme gets pretty anti-intuitive fairly quickly, from whence we took the approach of having a couple digits indicating data centre followed by a digit indicating which node in that data centre. If that all sounds incoherent, well, the more nodes you have around, the more difficult it becomes to make sure you *do* have a coherent picture of your cluster. I recall the Slony-II project having a notion of attaching a permanent UUID-based node ID to each node. As long as there is somewhere decent to find a symbolically significant node name, I like the idea of the ID *not* being in a tiny range, and being UUID/OID-like... Any opinions from others on this? * Do we want origin_id as a parameter or as a setting in pg_control? IIRC we go to a lot of trouble elsewhere to avoid problems with changing on/off parameter values. I think we need some discussion to validate where that should live. Hm. I don't really foresee any need to have it in pg_control. What do you want to protect against with that? 
It would need to be changeable anyway, because otherwise it would need to become a parameter for initdb which would suck for anybody migrating to use replication at some point. Do you want to protect against problems in replication setups after changing the value? In Slony, changing the node ID is Not Something That Is Done. The ID is captured in *way* too many places to be able to have any hope of updating it in a coordinated way. I should be surprised if it wasn't similarly troublesome here. -- When confronted by a difficult problem, solve it by reducing it to the question, How would the Lone Ranger handle this?
Re: [HACKERS] Event Triggers reduced, v1
Excerpts from Dimitri Fontaine's message of vie jun 15 16:27:50 -0400 2012: The attached patch contains all the infrastructure for event triggers and also a first implementation of them for the event command_start, implemented in a single place in utility.c. The infrastructure is about: - new catalog - grammar for new commands - documentation skeleton - pg_dump support - psql support - ability to actually run user triggers - written in core languages (pl/c, pl/pgsql, pl/python, pl/perl, pl/tcl) - limited subcommand handling Did you try REASSIGN OWNED and DROP OWNED with a role that has defined some event triggers? Look, it's an easy little skinny patch to review, right: git --no-pager diff --shortstat master 62 files changed, 4546 insertions(+), 108 deletions(-) Skinny ... right. I started to give it a look -- I may have something useful to comment later. This patch includes regression tests that we worked on with Thom last round; remember that they only run in the serial schedule, that means with `make installcheck` only. Adding noisy output at random while the parallel schedule runs is a good way to break all the regression testing, so I've been avoiding that. Hmm, I don't like the idea of a test that only runs in serial mode. Maybe we can find some way to make it work in parallel mode as well. I don't have anything useful to comment right now. -- Álvaro Herrera alvhe...@commandprompt.com
Re: [HACKERS] [PATCH] Support for foreign keys with arrays
2012/6/18 Kevin Grittner kevin.gritt...@wicourts.gov The many-to-one case seems like it is better handled in the other direction -- with the referenced table holding the set of valid keys and the referencing table holding the single key. (I believe the general case of this is what Jeff called an inclusion constraint -- a feature he wants to add at some point.) I can't think of a use case where that would not be better for this type of relationship than putting the array on the referencing side. Hi Kevin, Well, from my point of view, many-to-one or one-to-many depends more on which side we are looking at the relationship from... But let me better explain what I thought... Example 1) Say we have one table with master data, TableA(ID, other properties...), and TableB(tableAIDFK, propA, propB, propC) - the PK of TableB is irrelevant at this point - and say a lot of TableA tuples could share the same TableB properties... So we can either have as many tuples in TableB as there are such TableA tuples, each with the same values in propA, propB and propC and an FK of the same type as in TableA, or have 1 tuple in TableB with an array-typed FK field, so that one tuple in TableB can point to many tuples in TableA... And at the same time an element can exist in TableA but does not have to exist in TableB... Whether tests would show any gain in this approach, I don't know... but I think they should - especially if propA, propB and propC need to be updated for all of them... Example 2) On the other side, what Jeff proposes, and what is also useful but a different thing, is to have the main data in TableA, but with a range datatype as the key field, where each element belonging to the range could later have a related tuple in TableB (also of the same range datatype - but smaller, contained by the master one... or a simple datatype that is a subtype of the range) - which is the other way around... The opposite of example 1 - but different from a functional point of view... depending on what the master is... 
Also, for example 1 - data in the FK does not need to be in a range... so basically IDs [1, 2, 4, 7] could have 1 tuple with its properties, and [3,5,6] in a second tuple with different properties... I am not sure Example 2) is what Jeff called an Inclusion Constraint - Jeff can explain it better :) Based on what Simon Riggs wrote: Do we need something like Exclusion FKs? i.e. the FK partner of Exclusion Constraints? Yes, Inclusion Constraints. I've known we need something like that since I did Exclusion Constraints, but I haven't gotten further than that. Regards, Jeff Davis I have understood it as: TableA(ID, properties...) TableB(ID, properties...) Now if we define an FK on TableB to TableA... It means that a row inserted in TableB must already have a row with the same ID value in TableA... But what would be useful is to define an Exclude FK to TableA, so we prevent inserting a new row in TableB with an ID value that already exists in TableA... btw, if anyone is happy to point me in the right direction, and there is a common feeling it is a useful feature, I am happy to code it... Actually that is something I will code anyway for an in-house solution - but it would be good to do it under Postgres standards... Kind Regards, Misa
Re: [HACKERS] patch: avoid heavyweight locking on hash metapage
On Fri, Jun 15, 2012 at 1:58 PM, Jeff Janes jeff.ja...@gmail.com wrote: Do we need the retry flag (applies to two places)? If oldblkno is still InvalidBlockNumber then it can't equal blkno. I was a bit concerned that blkno = BUCKET_TO_BLKNO(metap, bucket) might set blkno to InvalidBlockNumber, thus possibly messing up the algorithm. This way seemed better in terms of not requiring any assumption in that area. In the README, the pseudocode probably needs to mention re-locking the meta page in the loop. Good point. Fixed. Also, page is used to mean either the disk page or the shared buffer currently holding that page, depending on context. This is confusing. Maybe we should clarify Lock the meta page buffer. Of course this gripe precedes your patch and applies to other parts of the code as well, but since we mingle LW locks (on buffers) and heavy locks (on names of disk pages) it might make sense to be more meticulous here. I attempted to improve this somewhat in the README, but I think it may be more than I can do completely. exclusive-lock page 0 (assert the right to begin a split) is no longer true, nor is release X-lock on page 0 I think this is fixed. Also in the README, section To prevent deadlock we enforce these coding rules: would need to be changed as those rules are being changed. But, should we change them at all? In _hash_expandtable, the claim But since we are only trylocking here it should be OK doesn't completely satisfy me. Even a conditional heavy-lock acquire still takes a transient non-conditional LW Lock on the lock manager partition. Unless there is a global rule that no one can take a buffer content lock while holding a lock manager partition lock, this seems dangerous. Could this be redone to follow the pattern of heavy locking with no content lock, then reacquiring the buffer content lock to check to make sure we locked the correct things? I don't know if it would be better to loop, or just give up, if the meta page changed underneath us. 
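The lock-then-recheck pattern Jeff is describing - map the key to a bucket while holding the metapage, take the heavyweight bucket lock without it, then re-read the metapage and loop if a concurrent split moved the bucket - can be sketched generically. This is a minimal illustration of the retry structure only; all names here (lock_bucket, the dict-based metapage, the set of heavy locks) are hypothetical stand-ins, not PostgreSQL's hash index API:

```python
# Sketch of the optimistic lock-then-recheck pattern discussed above.
# 'meta' stands in for the metapage contents read under its buffer lock;
# 'heavy_locks' stands in for the heavyweight lock manager.

def lock_bucket(meta, heavy_locks, key):
    """Map key -> bucket under the (simulated) metapage lock, take the
    heavyweight bucket lock with the metapage lock released, then
    re-read the mapping and loop if a concurrent split changed it."""
    oldblkno = None      # plays the role of InvalidBlockNumber
    retry = False
    while True:
        # Read the mapping while "holding" the metapage lock.
        bucket = key % meta["nbuckets"]
        blkno = meta["bucket_to_blkno"][bucket]
        # Metapage lock is dropped here, before heavyweight locking.
        if retry and blkno == oldblkno:
            return bucket, blkno         # mapping unchanged: done
        if oldblkno is not None:
            heavy_locks.discard(oldblkno)  # release the stale lock
        heavy_locks.add(blkno)             # acquire heavyweight lock
        oldblkno = blkno
        retry = True
        # Loop: reacquire the metapage lock and verify the mapping.
```

With a static metapage the loop exits on the second pass, exactly the fast path of the pattern; the retry flag avoids treating the initial "no lock held yet" state as a successful recheck.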
Hmm. That was actually a gloss I added on existing code to try to convince myself that it was safe; I don't think that the changes I made make that any more or less safe than it was before. But the question of whether or not it is safe is an interesting one. I do believe that your statement is true, though: lock manager locks - or backend locks, for the fast-path - should be the last thing we lock. In some cases we lock more than one lock manager partition lock, but we never lock anything else afterwards, AFAIK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company hash-avoid-heavyweight-metapage-locks-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] performance regression in 9.2 when loading lots of small tables
There was a regression introduced in 9.2 that affects the creation and loading of lots of small tables in a single transaction. It affects the loading of a pg_dump file which has a large number of small tables (10,000 schemas, one table per schema, 10 rows per table). I did not test other schema configurations, so these specifics might not be needed to invoke the problem. It causes the loading of a dump with psql -1 -f to run at half the previous speed. Speed of loading without the -1 is not changed. The regression was introduced in 39a68e5c6ca7b41b, Fix toast table creation. Perhaps the slowdown is an inevitable result of fixing the bug. The regression was removed from 9_1_STABLE at commit dff178f8017e4412, More cleanup after failed reduced-lock-levels-for-DDL feature. It is still present in 9_2_STABLE. I don't really understand what is going on in these patches, but it seems that either 9_1_STABLE now has a bug that was fixed and then unfixed, or that 9_2_STABLE is slower than it needs to be. The dump file I used can be obtained like this: perl -le 'print "set client_min_messages=warning;"; print "create schema foo$_; create table foo$_.foo$_ (k integer, v integer); insert into foo$_.foo$_ select * from generate_series(1,10);" foreach $ARGV[0]..$ARGV[0]+$ARGV[1]-1' 0 1 | psql > /dev/null ; pg_dump > 1.dump Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture
On Sat, Jun 16, 2012 at 3:03 PM, Steve Singer st...@ssinger.info wrote: I feel that in-core support to capture changes and turn them into change records that can be replayed on other databases, without relying on triggers and log tables, would be good to have. I think we want something flexible enough that people can write consumers of the LCRs to do conflict resolution for multi-master, but I am not sure that the conflict resolution support actually belongs in core. I agree, on both counts. Anyone else want to chime in here? Most of the complexity of slony (both in terms of lines of code, and issues people encounter using it) comes not from the log triggers or replay of the logged data but comes from the configuration of the cluster. Controlling things like * Which tables replicate from a node to which other nodes * How do you change the cluster configuration on a running system (adding nodes, removing nodes, moving the origin of a table, adding tables to replication etc...) Not being as familiar with Slony as I probably ought to be, I hadn't given this much thought, but it's an interesting point. The number of logical replication policies that someone might want to implement, and the ways in which they might want to change them as the situation develops, is very large. Whole cluster, whole database, one or several schemas, individual tables, perhaps even more fine-grained than per-table. Trying to figure all of that out is going to require a lot of work and, frankly, I question the value of having that stuff in core anyway. I see three catalogs in play here. 1. The catalog on the origin 2. The catalog on the proxy system (this is the catalog used to translate the WAL records to LCR's). The proxy system will need essentially the same pgsql binaries (same architecture, important compile flags etc...) as the origin 3. The catalog on the destination system(s). Catalog 2 must be in sync with catalog 1; catalog 3 shouldn't need to be in sync with catalog 1. 
I think catalogs 2 and 3 are combined in the current patch set (though I haven't yet looked at the code closely). I think the performance optimizations Andres has implemented to update tuples through low-level functions should be left for later and that we should be generating SQL in the apply cache so we don't start assuming much about catalog 3. +1. Although there is a lot of performance benefit to be had there, it seems better to me to get the basics working and then do performance optimization later. That is, if we can detect that the catalogs are in sync, then by all means ship around the binary tuple to make things faster. But requiring that (without having any way to know whether it actually holds) strikes me as a mess. Part of what people expect from a robust in-core solution is that it should work with the other in-core features. If we have to list a bunch of in-core types as being incompatible with logical replication then people will look at logical replication with the same 'there be dragons here' attitude that scares many people away from the existing third party replication solutions. Non-core or third party user defined types are a slightly different matter because we can't control what they do. I agree, although I don't think either Andres or I are saying anything else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture
On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund and...@2ndquadrant.com wrote: Hm. Yes, you could do that. But I have to say I don't really see a point. Maybe the fact that I do envision multimaster systems at some point is clouding my judgement though, as it's far less easy in that case. Why? I don't think that particularly changes anything. Because it makes conflict detection very hard. I also don't think it's a feature worth supporting. What's the use-case of updating records you cannot properly identify? Don't ask me; I just work here. I think it's something that some people want, though. I mean, if you don't support replicating a table without a primary key, then you can't even run pgbench in a replication environment. Is that an important workload? Well, objectively, no. But I guarantee you that other people with more realistic workloads than that will complain if we don't have it. Absolutely required on day one? Probably not. Completely useless appendage that no one wants? Not that, either. In my view, a logical replication solution is precisely one in which the catalogs don't need to be in sync. If the catalogs have to be in sync, it's not logical replication. ISTM that what you're talking about is sort of a hybrid between physical replication (pages) and logical replication (tuples) - you want to ship around raw binary tuple data, but not entire pages. OK, that's a valid point. Simon argued at the cluster summit that everything that's not physical is logical. Which has some appeal because it seems hard to agree on what exactly logical rep is. So definition by exclusion makes kind of sense ;) Well, the words are fuzzy, but I would define logical replication to be something which is independent of the binary format in which stuff gets stored on disk. If it's not independent of the disk format, then you can't do heterogeneous replication (between versions, or between products). 
That precise limitation is the main thing that drives people to use anything other than SR in the first place, IME. I think what you categorized as hybrid logical/physical rep solves an important use-case that's very hard to solve at the moment. Before my 2ndquadrant days I had several clients which had huge problems using the trigger-based solutions because their overhead simply was too big a burden on the master. They couldn't use SR either because every consuming database kept loads of local data. I think such scenarios are getting more and more common. I think this is to some extent true, but I also think you're conflating two different things. Change extraction via triggers introduces overhead that can be eliminated by reconstructing tuples from WAL in the background rather than forcing them to be inserted into a shadow table (and re-WAL-logged!) in the foreground. I will grant that shipping the tuple as a binary blob rather than as text eliminates additional overhead on both ends, but it also closes off a lot of important use cases. As I noted in my previous email, I think that ought to be a performance optimization that we do, if at all, when it's been proven safe, not a baked-in part of the design. Even a solution that decodes WAL to text tuples and ships those around and reinserts them via pure SQL should be significantly faster than the replication solutions we have today; if it isn't, something's wrong. The problem with that is it's going to be tough to make robust. Users could easily end up with answers that are total nonsense, or probably even crash the server. Why? Because the routines that decode tuples don't include enough sanity checks to prevent running off the end of the block, or even the end of memory completely. Consider a corrupt TOAST pointer that indicates that there is a gigabyte of data stored in an 8kB block. One of the common symptoms of corruption IME is TOAST requests for -3 bytes of memory. 
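The hazard described here - a length word that promises far more data than the page can hold - is easy to demonstrate with any length-prefixed format. The sketch below is a generic decoder, not PostgreSQL's varlena/TOAST code; the point is simply that a decoder which validates the embedded length against the buffer can fail cleanly on a "1 GB in an 8kB block" or "-3 bytes" header, where a trusting one would run off the end of memory:

```python
import struct

BLCKSZ = 8192  # assumed page size, as in a default PostgreSQL build

def decode_checked(buf, offset):
    """Read a 4-byte little-endian length followed by that many payload
    bytes, refusing lengths that cannot fit in the remaining buffer."""
    (length,) = struct.unpack_from("<i", buf, offset)
    if length < 0 or offset + 4 + length > len(buf):
        raise ValueError("corrupt length word: %d" % length)
    return buf[offset + 4 : offset + 4 + length]

# Analogues of the corruption described above: datums on an 8kB page
# whose headers claim a gigabyte of data, or -3 bytes.
page = bytearray(BLCKSZ)
struct.pack_into("<i", page, 0, 1 << 30)   # claims 1 GB of payload
struct.pack_into("<i", page, 100, -3)      # claims -3 bytes
```

A decoder without that if-check would happily slice (or, in C, memcpy) far past the page boundary, which is exactly the crash scenario being discussed.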
And, of course, even if you could avoid crashing, interpreting what was originally intended as a series of int4s as a varlena isn't likely to produce anything terribly meaningful. Tuple data isn't self-identifying; that's why this is such a hard problem. To step back and talk about DDL more generally, you've mentioned a few times the idea of using an SR instance that has been filtered down to just the system catalogs as a means of generating logical change records. However, as things stand today, there's no reason to suppose that replicating anything less than the entire cluster is sufficient. For example, you can't translate enum labels to strings without access to the pg_enum catalog, which would be there, because enums are built-in types. But someone could supply a similar user-defined type that uses a user-defined table to do those lookups, and now you've got a problem. I think this is a contractual problem, not a technical one. From the point of view of logical replication, it would be nice if type output functions were basically guaranteed to
Re: [HACKERS] Pg default's verbosity?
On Sat, Jun 16, 2012 at 2:56 PM, Fabien COELHO coe...@cri.ensmp.fr wrote: The argument for defaulting to NOTICE is the same as it's always been: that those messages are really intended for novices, and a pretty good definition of a novice is somebody who doesn't know how to (or that he should) change the verbosity setting. So if we don't show notices by default, they will be unavailable to exactly the people who need them. Your proposal does not overcome this argument. I'm sceptical about what a real novice is expected to do about the incredible flow of useless information displayed when loading a significant script. For a start, s?he should be an incredibly fast reader:-) For a non-novice it just hides what is important and should be seen. There might be something to the idea of demoting a few of the things we've traditionally had as NOTICEs, though. IME, the following two messages account for a huge percentage of the chatter: NOTICE: CREATE TABLE will create implicit sequence foo_a_seq for serial column foo.a NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index foo_pkey for table foo I'm not going to claim that nobody in the history of the world has ever benefited from those notices ... but I would be willing to bet that a large majority of the people, in a large majority of the cases, do not care. And getting rid of them would surely make warnings and notices that might actually be of interest to the user a lot more visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] measuring spinning
On Sat, Jun 16, 2012 at 6:25 PM, Jeff Janes jeff.ja...@gmail.com wrote: Well, this fell through the cracks, because I forgot to add it to the January CommitFest. Here it is again, rebased. This applies and builds cleanly and passes make check (under enable-cassert). No tests or docs are needed for a patch of this nature. It does what it says, and we want it. I wondered if the change in the return signature of s_lock would have an effect on performance. So I've run a series of pgbench -T 30 -P -c8 -j8, at a scale of 30 which fits in shared_buffers, using an Amazon c1.xlarge (8 cores). I ran both HEAD, and HEAD+patch (without LWLOCK_STATS in both cases), in random ordering. The patch was 0.37% slower: an average of 298483 selects per second patched vs. 299582 for HEAD. The difference is probably real (p value 0.042, one sided) but is also pretty much negligible and could just be due to where the executable code falls in the cache lines, which could move around with other changes to the code. I found this a bit surprising, so I retested on the IBM POWER7 box at concurrencies from 1 to 64 and found some test runs faster and some test runs slower - maybe a few more faster than slower. I could do a more exact analysis, but I'm inclined to think that if there is an effect here, it's probably just in the noise, and that the real effect you measured was, as you say, the result of cache-line shifting or some other uninteresting artifact that could just as easily move back the other way on some other commit. Really, s_lock should not be getting called often enough to matter much, and ignoring the return value of that function shouldn't cost anyone much. Two suggestions: In your original email you say number of pg_usleep() calls that are required to acquire each LWLock, but nothing in the code says this. 
Just reading lwlock.c I would naively assume it is reporting the number of TAS spins, not the number of spin-delays (and in fact that is what I did assume until I read your email more carefully). A comment somewhere in lwlock.c would be helpful. Yeah, or maybe I should just change the printout to say spindelay instead of spin, and rename the variables likewise. Also in lwlock.c, if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_counts[i]) I don't think we can have spins (or blocks, for that matter) unless we have some acquires to have caused them, so the last two tests in that line seem to be noise. Perhaps so, but I think it's probably safe and clear to just follow the existing code pattern. Since my suggestions are minor, should I go ahead and mark this ready for committer? If you think it should be committed, yes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-18 07:30 AM, Andres Freund wrote: Hrmpf #666. I will go through the series commit-by-commit again to make sure everything compiles again. Reordering this late definitely wasn't a good idea... I pushed a rebased version with all those fixups (and removal of the zeroRecPtr patch). Where did you push that rebased version to? I don't see an attachment, or an updated patch in the commitfest app and your repo at http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary hasn't been updated in 5 days. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node
On 12-06-18 11:50 AM, Andres Freund wrote: Hi Simon, I think we need to agree on the parameter name. It currently is 'multimaster_node_id'. In the discussion with Steve we got to replication_node_id. I don't particularly like either. Other suggestions? Other things that come to mind (for naming this parameter in the postgresql.conf) node_id origin_node_id local_node_id I wished we had some flag bits available before as well. I find 256 nodes a pretty low value to start with, though; 4096 sounds better, so I would be happy with 4 flag bits. I think for cascading setups and such you want to add node ids for every node, not only masters... Any opinions from others on this? 256 sounds a bit low to me as well. Sometimes the use case of a retail chain comes up where people want each store to have a postgresql instance and replicate back to a central office. I can think of many chains with more than 256 stores. Thanks, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
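To make the size trade-off in this subthread concrete: with a 16-bit origin field, spending 4 bits on flags leaves 12 bits for the node id, i.e. 4096 distinct nodes versus 256 for an 8-bit id. A minimal packing sketch (purely illustrative - this is not the patch's actual WAL record layout, and the field widths are the ones under discussion, not settled):

```python
# Hypothetical split of a 16-bit origin field: 12-bit node id + 4 flag bits.
NODEID_BITS = 12
FLAG_BITS = 4
MAX_NODES = 1 << NODEID_BITS          # 4096 distinct node ids
FLAG_MASK = (1 << FLAG_BITS) - 1

def pack_origin(node_id, flags):
    """Pack a node id and flag bits into one 16-bit value."""
    assert 0 <= node_id < MAX_NODES and 0 <= flags <= FLAG_MASK
    return (node_id << FLAG_BITS) | flags

def unpack_origin(value):
    """Recover (node_id, flags) from the packed 16-bit value."""
    return value >> FLAG_BITS, value & FLAG_MASK
```

An 8-bit id (256 nodes) would already be exceeded by the retail-chain example above, which is the argument for the wider field.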
Re: [HACKERS] Libxml2 load error on Windows
On Mon, Jun 18, 2012 at 5:08 AM, Talha Bin Rizwan talha.riz...@gmail.com wrote: PostgreSQL 9.2 Windows build is having trouble with the XML support: http://postgresql.1045698.n5.nabble.com/9-2-beta1-libxml2-can-t-be-loaded-on-Windows-td5710672.html postgres=# SELECT xml '<foo>bar</foo>'; ERROR: could not set up XML error handler HINT: This probably indicates that the version of libxml2 being used is not compatible with the libxml2 header files that PostgreSQL was built with. HAVE_XMLSTRUCTUREDERRORCONTEXT should be defined if libxml2 has xmlStructuredErrorContext (introduced after 2.7.3). It is being set for Linux in pg_config.h: /* Define to 1 if your libxml has xmlStructuredErrorContext. */ #define HAVE_XMLSTRUCTUREDERRORCONTEXT 1 For the Windows build, we need to set it in src/tools/msvc/Solution.pm. I guess there are two approaches to get the libxml2 version: 1) Parse LIBXML_INST/include/libxml/xmlversion.h to get the libxml2 version number; this header file is included in libxml2. 2) We may also get the version by running a command line utility, i.e. LIBXML_INST/bin/xmllint --version. The xmllint program parses one or more XML files and it is also included in libxml2. I believe it is safe to assume that the libxml2 include folder would contain the xmlversion.h file containing the required version number. It might be a touch slower to parse the file line by line, but similar functionality has been used elsewhere as well. We obviously rely on xml include files (with xml enabled), so there is definitely a tighter dependency on the include files than on the xmllint executable, which is not mandatory for the build process. Therefore, I used the first approach to get the libxml2 version number and include the #define HAVE_XMLSTRUCTUREDERRORCONTEXT in pg_config.h if the libxml2 version is greater than 20703 (which effectively is 2.7.3). Please find attached patch libxml2_win_v1.patch. 
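The first approach amounts to finding the numeric LIBXML_VERSION define in xmlversion.h and comparing it against 20703. A sketch of that check (in Python for illustration; the patch itself does this in Solution.pm's Perl, but the #define name and the 20703 threshold come straight from libxml2's xmlversion.h and the discussion above):

```python
import re

def libxml2_version(header_text):
    """Extract the numeric LIBXML_VERSION from the contents of
    xmlversion.h, e.g. '#define LIBXML_VERSION 20703' -> 20703."""
    m = re.search(r"^\s*#define\s+LIBXML_VERSION\s+(\d+)",
                  header_text, re.MULTILINE)
    if m is None:
        raise ValueError("LIBXML_VERSION not found in xmlversion.h")
    return int(m.group(1))

def have_structured_error_context(header_text):
    # xmlStructuredErrorContext was introduced after 2.7.3 (20703),
    # so only define HAVE_XMLSTRUCTUREDERRORCONTEXT beyond that.
    return libxml2_version(header_text) > 20703
```

In the real build script the header text would come from reading LIBXML_INST/include/libxml/xmlversion.h, and a True result would emit the #define into pg_config.h.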
Please add this patch here so that it doesn't get lost in the shuffle: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql_fdw in contrib
On Mon, Jun 18, 2012 at 12:24 PM, Merlin Moncure mmonc...@gmail.com wrote: I can't help but wonder (having been down the contrib/core/extension road myself) if it isn't better to improve the facilities to register and search for qualified extensions (like Perl CPAN) so that people looking for code to improve their backends can find it. That way, you're free to release/do xyz/abandon your project as you see fit without having to go through -hackers. This should also remove a lot of the stigma with not being in core since if stock postgres installations can access the necessary modules via CREATE EXTENSION, I think it will make it easier for projects like this to get used with the additional benefit of decentralizing project management. Well, if you're the type that likes to install everything via RPMs (and I am) then you wouldn't want this behavior, especially not automagically. It seems to open up a host of security risks, too: I believe Tom has previously stated that Red Hat (and other distro) packaging guidelines forbid packages from installing software in places where they can then turn around and run it. I suppose CPAN must have some sort of exception to this policy, though, so maybe it's not ironclad. Still, it seems like a bit of a distraction in this case: I think we want to have at least one FDW in core that actually talks to some other database server, rather than just to a file, and it seems like pgsql_fdw is the obvious choice by a mile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: autocomplete for functions
On Sun, Feb 19, 2012 at 12:10 PM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello I found so this extremely simple patch should be useful. It helps for pattern SELECT fx(); There was thread about it. Hi Pavel, I signed up to be reviewer for this patch, and finally got around to taking a look. This thread, and the thread about Peter's earlier patch along the same lines have gotten a bit muddled, so allow me some recap for my own sanity. The first point to be addressed, is that Pavel's patch is basically a subset of Peter's earlier[1] patch. Pavel's patch autocompletes SELECT TAB with possible function names. Peter's patch autocompletes both possible column names and possible function names. So, which version, if any, would we want? Both Tom[2] and depesz[3] have asked for column names to be autocompleted if we're going to go down this road, and I personally would find completion of column names handy. Others [5][6] have asked for function names to be (also?) autocompleted here, so it seems reasonable that we'd want to include both. I counted two general objections to Peter's patch in these threads, namely: 1.) Complaints about the tab-completion not covering all cases, possibly leading to user frustration at our inconsistency. [2] [4] 2.) Concerns that the tab-completion wouldn't be useful given how many results we'd have from system columns and functions [7] I do agree these are two legitimate concerns. However, for #1, our tab-completion is and has always been incomplete. I take the basic goal of the tab-completion machinery to be offer a suggestion when we're pretty sure we know what the user wants, otherwise stay quiet. There are all sorts of places where our reliance on inspecting back only a few words into the current line and not having a true command parser prevents us from being able to make tab-completion guesses, but that's been accepted so far, and I don't think it's fair to raise the bar for this patch. 
Re: concern #2, Tom complained about there being a bunch of possible column and function completions in the regression test database. That may be true, but if you look at this slightly-modified version of the query Peter's patch proposes:

SELECT substring(name, 1, 3) AS sub, COUNT(*)
FROM (SELECT attname FROM pg_attribute WHERE NOT attisdropped
      UNION
      SELECT proname || '(' FROM pg_proc p
      WHERE pg_catalog.pg_function_is_visible(p.oid)) t (name)
GROUP BY sub ORDER BY COUNT(*) DESC;

I count only 384 distinct 3-length prefixes in an empty database, thanks to many built-in columns and functions sharing the same prefix (e.g. int or pg_). Obviously, there is plenty of room left for 3-length prefixes, out of the 27^3 or more possibilities. In some casual mucking around in my own databases, I found the column-completion rather useful, and typing 3 characters of a column-name to be sufficient to give matches which were at least non-builtin attributes, and often a single unique match. There were some ideas down-thread about how we might filter out some of the noise in these completions, which would be interesting. I'd be happy with the patch as-is though, modulo the attisdropped and pg_function_is_visible() tweaks to the query. Josh [1] http://archives.postgresql.org/message-id/1328820579.11241.4.camel%40vanquo.pezone.net [2] http://archives.postgresql.org/message-id/7745.1328855069%40sss.pgh.pa.us [3] http://www.depesz.com/2011/07/08/wish-list-for-psql/ [4] http://archives.postgresql.org/message-id/13612.1328887227%40sss.pgh.pa.us [5] http://archives.postgresql.org/message-id/CA%2BTgmoY7wRGgBbFhKwfASqrNOPwZwCjfm1AhL82769Xx-SyduA%40mail.gmail.com [6] http://archives.postgresql.org/message-id/20120210140637.GB19783%40ldn-qws-004.delacy.com [7] http://archives.postgresql.org/message-id/9966.1331920074%40sss.pgh.pa.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
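The prefix-collision measurement in that email is easy to reproduce outside the database: given any list of candidate completions, count how many candidates fall under each 3-character prefix, just as the substring(name, 1, 3) ... GROUP BY query does. A sketch with a toy name list standing in for the pg_attribute/pg_proc results:

```python
from collections import Counter

def prefix_histogram(names, n=3):
    """Count completion candidates per n-character prefix, mirroring
    the GROUP BY substring(name, 1, 3) query quoted above."""
    return Counter(name[:n] for name in names)

# Toy stand-in for the catalog query's output; real runs would feed in
# every visible column and function name.
names = ["int4pl", "int4mi", "int8pl", "pg_class", "pg_proc",
         "attname", "attnum", "relname"]
hist = prefix_histogram(names)
```

The number of distinct keys in the histogram is the "384 distinct 3-length prefixes" figure; a prefix with count 1 means three typed characters already identify a unique completion.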
Re: [HACKERS] patch: autocomplete for functions
On Mon, Mar 19, 2012 at 1:01 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: I'm rather of the contrary opinion -- surely if we're going to complete function names, we should only complete those that are in schemas in the path; similarly for column names. I think it makes sense to only include currently-visible functions, but *not* only columns from currently visible tables, since we won't know yet whether the user intends to schema-qualify the table name. (BTW I didn't check but does this completion work if I schema-qualify a column name?) Peter's proposed tab-completion only kicks in for the column-name itself. Keep in mind, the user might be trying to enter: SELECT schema.table.column ... SELECT table.column ... SELECT table_alias.column ... SELECT column ... and presumably want to tab-complete the second token somehow. I'm a bit leery about trying to tab-complete those first two, and the third is right out. Just having the fourth would make me happy. Josh -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata
Amit Kapila amit.kap...@huawei.com writes: AFAIR you can create pg_control from scratch already with pg_resetxlog. The hard part is coming up with values for the counters, such as the next WAL location. Some of them such as next OID are pretty harmless if you don't guess right, but I'm worried that wrong next WAL could make things worse not better. I believe if the WAL files are proper as mentioned in Alvaro's mail, the proposed logic should generate correct values. I've got a problem with the assumption that, when pg_control is trash, megabytes or gigabytes of WAL can still be relied on completely. I'm almost inclined to suggest that we not get next-LSN from WAL, but by scanning all the pages in the main data store and computing the max observed LSN. This is clearly not very attractive from a performance standpoint, but it would avoid the obvious failure mode where you lost some recent WAL segments along with pg_control. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
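The alternative Tom sketches - derive next-LSN by scanning every data page and taking the maximum page LSN - is conceptually a single pass over the relation files. A minimal illustration of that scan (the 8-byte little-endian LSN at the start of each page is an assumption for demonstration purposes; the real PageHeaderData layout differs in detail, and a real tool would also have to skip zeroed or torn pages):

```python
import struct

BLCKSZ = 8192  # default PostgreSQL block size

def max_page_lsn(relfile_bytes):
    """Walk a relation file page by page and return the highest page
    LSN observed (0 for an empty file). Assumes, for illustration,
    an 8-byte LSN at offset 0 of each page."""
    best = 0
    for off in range(0, len(relfile_bytes), BLCKSZ):
        page = relfile_bytes[off : off + BLCKSZ]
        if len(page) < 8:
            break  # ignore a truncated trailing fragment
        (lsn,) = struct.unpack_from("<Q", page, 0)
        best = max(best, lsn)
    return best
```

Running this over every file in the data directory and taking the overall maximum gives a floor for next-LSN that does not depend on any WAL segment surviving, which is exactly the failure mode the scan is meant to sidestep.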
Re: [HACKERS] Transactions over pathological TCP connections
Leon Smith leon.p.sm...@gmail.com writes: Out of (mostly idle) curiousity, when exactly does a transaction commit, especially with respect to a TCP connection that a pathological demon will cut off at the worst possible moment? It's committed when the XLOG_XACT_COMMIT WAL record reaches disk, if by committed you mean is guaranteed to still be visible following recovery from a subsequent database or operating system crash. If you have some other definition of committed in mind, please say what. My question is, would it be theoretically possible for an element of a queue to become marked but not delivered, or delivered and not marked, if the TCP connection between the backend and client was interrupted at the worst possible moment? The transaction would be committed before a command success report is delivered to the client, so I don't think delivered-and-not-marked is possible. The other direction is possible, if the connection drops after the transaction is committed but before the completion report can be delivered to the client. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers