Re: [HACKERS] [COMMITTERS] pgsql: New SQL functons pg_backup_in_progress() and pg_backup_start_tim

2012-06-18 Thread Albe Laurenz
Daniel Farina wrote:
 After mulling over this some more, I am less on the fence about how
 unfortunate it is that Postgres cannot restart when doing an
 exclusive base backup: I think it is a severe anti-feature that
 should gradually be retired.

This anti-feature was introduced with
http://archives.postgresql.org/pgsql-committers/2008-04/msg00275.php
following discussions
http://archives.postgresql.org/pgsql-hackers/2007-11/msg00800.php
and
http://archives.postgresql.org/pgsql-hackers/2008-03/msg01033.php

The problem (at least for me) is that without this the server will
often refuse to restart after a clean shutdown, namely when the
WAL segment containing the required checkpoint has already been
archived.

Yours,
Laurenz Albe
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Skip checkpoint on promoting from streaming replication

2012-06-18 Thread Kyotaro HORIGUCHI
Hello, This is the new version of the patch.

Your patch introduced a new WAL record type, XLOG_END_OF_RECOVERY, to
mark the change point of the TLI. But I think that information is
already stored in the history files and is already available to the
current code.

I looked into your first patch and looked over the discussion on
it, and found that my understanding that a TLI switch works for
crash recovery as well as archive recovery was half wrong. The
correct half is that it can work for crash recovery if we properly
set TimeLineID in StartupXLOG().

To achieve this, I added a new field 'latestTLI' (a better name is
welcome) to pg_control and made it always track the latest TLI,
independently of checkpoints. The recovery target in StartupXLOG()
is then set by referring to it. Additionally, the previous patch
checked only checkpoint intervals, but that had no effect, as you
said, because pg_xlog keeps as many WAL files as are required for
crash recovery, as I learned...


The new patch seems to work correctly for a TLI change that is not
followed by a checkpoint.  Archive recovery and PITR also seem to
work correctly. A test script for the former is attached as well.

The new patch consists of two parts. These should perhaps be
treated as two separate patches.

1. Allow_TLI_Increment_without_Checkpoint_20120618.patch

  Removes the assumption, following the old convention, that the TLI
  should be incremented only on a shutdown checkpoint. This actually
  seems to cause no problem, as the comment says ('This is not
  particularly critical').

2. Skip_Checkpoint_on_Promotion_20120618.patch

  Skips the checkpoint if the redo record can be read in place.

3. Test script for TLI increment patch.

  This is only to show how the patch was tested. The point is to
  create a TLI increment point that is not followed by any kind of
  checkpoint.  pg_controldata shows the following after running this
  test script; 'Latest timeline ID' is the new field.

pg_control version number:923
Database cluster state:   in production
  ! Latest timeline ID:   2
Latest checkpoint location:   0/258
Prior checkpoint location:0/258
Latest checkpoint's REDO location:0/220
  ! Latest checkpoint's TimeLineID:   1

  We will see this changing as follows after crash recovery,

Latest timeline ID:   2
Latest checkpoint location:   0/54D9918
Prior checkpoint location:0/258
Latest checkpoint's REDO location:0/54D9918
Latest checkpoint's TimeLineID:   2

  Then we should see two 'ABCDE...' rows and two 'VWXYZ...' rows in
  the table after the crash recovery.

What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d68760..70b4972 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5276,6 +5276,7 @@ BootStrapXLOG(void)
 	ControlFile->system_identifier = sysidentifier;
 	ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->time = checkPoint.time;
+	ControlFile->latestTLI = ThisTimeLineID;
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
 
@@ -6083,7 +6084,7 @@ StartupXLOG(void)
 	 * Initialize on the assumption we want to recover to the same timeline
 	 * that's active according to pg_control.
 	 */
-	recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
+	recoveryTargetTLI = ControlFile->latestTLI;
 
 	/*
 	 * Check for recovery control file, and if so set up state for offline
@@ -6100,11 +6101,11 @@ StartupXLOG(void)
 	 * timeline.
 	 */
 	if (!list_member_int(expectedTLIs,
-		 (int) ControlFile->checkPointCopy.ThisTimeLineID))
+		 (int) ControlFile->latestTLI))
 		ereport(FATAL,
 (errmsg("requested timeline %u is not a child of database system timeline %u",
 		recoveryTargetTLI,
-		ControlFile->checkPointCopy.ThisTimeLineID)));
+		ControlFile->latestTLI)));
 
 	/*
 	 * Save the selected recovery target timeline ID and
@@ -6791,9 +6792,12 @@ StartupXLOG(void)
 	 *
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
+
+	ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI);
+	
 	if (InArchiveRecovery)
 	{
-		ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
+		ThisTimeLineID++;
 		ereport(LOG,
 (errmsg("selected new timeline ID: %u", ThisTimeLineID)));
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
@@ -6946,6 +6950,7 @@ StartupXLOG(void)
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	ControlFile-state = DB_IN_PRODUCTION;
 	ControlFile-time = (pg_time_t) time(NULL);
+	ControlFile->latestTLI = ThisTimeLineID;
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
@@ -8710,12 +8715,6 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 			SpinLockRelease(xlogctl-info_lck);
 		}
 
-		/* TLI 

[HACKERS] Libxml2 load error on Windows

2012-06-18 Thread Talha Bin Rizwan
Hi,

PostgreSQL 9.2 Windows build is having trouble with the XML support:
http://postgresql.1045698.n5.nabble.com/9-2-beta1-libxml2-can-t-be-loaded-on-Windows-td5710672.html

postgres=# SELECT xml '<foo>bar</foo>';
ERROR:  could not set up XML error handler
HINT: This probably indicates that the version of libxml2 being used is not
compatible with the libxml2 header files that PostgreSQL was built with.

HAVE_XMLSTRUCTUREDERRORCONTEXT should be defined if libxml2 has
xmlStructuredErrorContext
(introduced after 2.7.3). It is being set for Linux in pg_config.h:

/* Define to 1 if your libxml has xmlStructuredErrorContext. */
#define HAVE_XMLSTRUCTUREDERRORCONTEXT 1

For Windows build, we need to set it in src/tools/msvc/Solution.pm.

I guess there are two approaches to get the libxml2 version:

1) Parse LIBXML_INST/include/libxml/xmlversion.h to get the libxml2 version
number, this header file is included in libxml2.

2) We may also get the version by running a command-line utility, i.e.
LIBXML_INST/bin/xmllint --version. The xmllint program parses one or more
XML files and is also included in libxml2.

I believe it is safe to assume that the libxml2 include folder contains the
xmlversion.h file with the required version number. It might be a touch
slower to parse the file line by line, but similar functionality has been
used elsewhere as well. We obviously rely on the XML include files (with
XML enabled), so there is a tighter dependency on the include files than on
the xmllint executable, which is not mandatory for the build process.

Therefore, I used the first approach to get the libxml2 version number and
include #define HAVE_XMLSTRUCTUREDERRORCONTEXT in pg_config.h if the libxml2
version is greater than 20703 (which effectively means newer than 2.7.3).
Please find the attached patch libxml2_win_v1.patch.
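
To make the condition concrete, the define effectively amounts to the
following (illustration only; the attached patch implements the check in the
MSVC build scripts rather than in C, and xmlversion.h is the libxml2 header
that defines LIBXML_VERSION):

    /* Illustration of the effective condition, not part of the patch. */
    #include <libxml/xmlversion.h>

    #if LIBXML_VERSION > 20703      /* i.e. newer than 2.7.3 */
    #define HAVE_XMLSTRUCTUREDERRORCONTEXT 1
    #endif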

Regards,
Talha Bin Rizwan


libxml2_win_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)

2012-06-18 Thread Dean Rasheed
On 17 June 2012 18:30, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Gurjeet Singh  wrote:
 Dean Rasheed wrote:

 in HEAD:
 ... (actual time=1390.037..1390.037 rows=0 loops=1)
 Trigger for constraint fk_table_e_fkey: time=210.184 calls=9
 Total runtime: 1607.626 ms

 With this patch:
 ... (actual time=1489.640..1489.640 rows=0 loops=1)
 [no triggers fired]
 Total runtime: 1489.679 ms

 for every row:
 ... (actual time=1565.148..1565.148 rows=0 loops=1)
 Trigger for constraint fk_table_e_fkey: time=705.962 calls=10
 Total runtime: 2279.408 ms

 with this patch
 ... (actual time=1962.755..1962.755 rows=0 loops=1)
 Trigger for constraint fk_table_e_fkey: time=257.845 calls=1
 Total runtime: 2221.912 ms

 I find it interesting that 'actual time' for top level 'Update on
 fk_table' is always higher in patched versions, and yet the 'Total
 runtime' is lower for the patched versions. I would've expected
 'Total runtime' to be proportional to the increase in top-level
 row-source's 'actual time'.

 I figured that the trigger time was counted separately.  It seems to
 add up pretty well that way.  I guess the question is whether there
 is a case where the increase in seqscan time is *not* compensated by
 less time in the triggers.

 -Kevin

I wouldn't read too much into the individual timings I posted above,
since I get massive variations between runs. If I repeat it enough
times, I can convince myself that the update times excluding trigger
execution are unchanged on average, but the trigger execution times
(which are indeed counted separately) are a real savings.

Regards,
Dean

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)

2012-06-18 Thread Dean Rasheed
On 17 June 2012 18:48, Tom Lane t...@sss.pgh.pa.us wrote:
 Gurjeet Singh singh.gurj...@gmail.com writes:
 On Sat, Jun 16, 2012 at 9:50 AM, Dean Rasheed 
 dean.a.rash...@gmail.comwrote:
 I find it interesting that 'actual time' for top level 'Update on fk_table'
 is always higher in patched versions, and yet the 'Total runtime' is lower
 for the patched versions. I would've expected 'Total runtime' to be
 proportional to the increase in top-level row-source's 'actual time'.
 Even the time consumed by Seq Scans is higher in the patched version, so I
 think the patch's effect on performance needs to be evaluated.

 AFAICS, the only way that the given patch could possibly make anything
 slower is that if the old value of some tested attribute is NULL, the
 comparison routines used to fall out immediately; now, they will do an
 additional SPI_getbinval call to extract the new value before making
 any decision.  So that would account for some small increase in the
 ModifyTable runtime in cases where there are a lot of null keys in FK
 rows being updated, which accurately describes Dean's test case, if not
 so much the real world.  I don't have a big problem with it, since the
 point of the patch is to possibly save a great deal more work in exactly
 these cases.

 It strikes me though that we are still leaving some money on the table.
 The SQL spec says clearly that no RI action need be taken when a null
 PK key value is updated to non-null, and I think this is right because
 there cannot possibly be any FK rows that are considered to match the
 old value.  (Note that both the spec and our FK code treat the RI
 equality operators as strict, even if the underlying functions aren't
 really.)  So we ought to have asymmetric logic in there when making
 checks on PK rows, such that null->non-null is not considered an
 interesting change.  If done properly this would remove the above-
 described slowdown in the PK case.


Yeah, that makes sense.

 Conversely, if an FK value is changed from non-null to null, that is
 either always OK (if MATCH SIMPLE, or if MATCH FULL and all the FK
 columns went to null) or a certain failure (if MATCH FULL and we
 have a mix of nulls and non-nulls).  There's no need to queue a
 trigger event in the always OK cases, so I think we need some
 asymmetric logic in the FK case as well.


Makes sense too.
I think that the patch already covers the most common use case (in my
experience) but we may as well get as much out of it as we can while
we're here.
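
For illustration, here is a rough sketch of the asymmetric test Tom
describes; the helper names are hypothetical and this is not the actual
ri_triggers.c code:

    /*
     * Sketch only: decide whether an update of a PK column needs RI checks.
     * datum_equal_using() is a hypothetical helper standing in for a call
     * through the RI equality operator.
     */
    static bool
    pk_key_change_needs_check(Datum oldval, bool oldisnull,
                              Datum newval, bool newisnull,
                              Oid eq_opr)
    {
        /*
         * Old value NULL: no FK row can have matched it (the RI equality
         * operators are treated as strict), so even NULL -> non-NULL needs
         * no RI action.
         */
        if (oldisnull)
            return false;

        /* non-NULL -> NULL: the referenced value disappeared, must check. */
        if (newisnull)
            return true;

        /* Otherwise, check only if the value actually changed. */
        return !datum_equal_using(eq_opr, oldval, newval);
    }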

Are you planning to tackle this, or should I move the patch back to
waiting on author to give Vik Reykja a chance to update it?

Regards,
Dean

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-06-18 Thread Vlad Arkhipov

On 06/15/2012 03:59 PM, Jeff Davis wrote:

On Wed, 2012-06-13 at 23:10 +0200, Miroslav Šimulčík wrote:


I have a working patch for PostgreSQL version 9.0.4, but it needs
refactoring before I can submit it, because some parts don't
meet the formatting requirements yet. And yes, the changes are large, so it
will be better to discuss the design first and then deal with code. Do you
insist on compatibility with the SQL:2011 standard, as Pavel wrote?


Try to work on solving the problem and identify the use cases. I don't
think the standard will cause a major problem; we should be able to make
the relevant parts of your patch match the standard.

That's one reason to work on it as an extension first: we can get a
better sense of the problem space and various use cases without worrying
about violating any standard. Then, as you need specific backend support
(e.g. special syntax), we can take the standards more seriously.

Regards,
Jeff Davis



What's wrong with SPI/timetravel extension for system versioning?
http://www.postgresql.org/docs/9.1/static/contrib-spi.html

We are heavily using system-versioned and application-time period tables 
in our enterprise products (most of them are bi-temporal). However, our 
implementation is based on triggers and views and is therefore not very 
convenient to use. There are also some locking issues with foreign keys 
to application-time period tables. It would be great if the new temporal 
SQL features were included in the PostgreSQL core with SQL:2011 
syntax support. It is especially important for bi-temporal tables 
because of the complex internal logic of UPDATE/DELETE and the huge SELECT 
queries for such tables.


Re: [HACKERS] Resource Owner reassign Locks

2012-06-18 Thread Heikki Linnakangas

On 16.06.2012 09:04, Amit kapila wrote:

I don't think so.  C doesn't ref count its pointers.

You are right I have misunderstood.


I don't think that lock tags have good human readable formats, and just
a pointer dump probably wouldn't be much use when something that can
never happen has happened.  But I'll at least add a reference to the
resource owner if this stays in.


I have checked the lock.c file for the message where lock tags are used:
elog(ERROR, "lock %s on object %u/%u/%u is already held",
 lockMethodTable->lockModeNames[lockmode],
 lock->tag.locktag_field1, lock->tag.locktag_field2,
 lock->tag.locktag_field3);

This can give more information about the erroneous lock.


A better error message would be nice, but I don't think it's worth that 
much. resowner.c doesn't currently know about the internals of the LOCALLOCK 
struct or lock tags, and I'd prefer to keep it that way. Let's just 
print the pointer's address; that's what we do in many other 
corresponding error messages in other Forget* functions.


On 11.06.2012 18:21, Jeff Janes wrote:

On Sun, Jun 10, 2012 at 11:28 PM, Amit Kapilaamit.kap...@huawei.com  wrote:

2. ResourceOwnerForgetLock() decrements the lock count before removing,
so in case it doesn't find the lock in the lock array, the count will be
decremented even though the array still contains the same number of locks.



Should it emit a FATAL rather than an ERROR?  I thought ERROR was
sufficient to make the backend quit, as it is not clear how it could
meaningfully recover.


ERROR seems right to me, it'll get promoted to FATAL or PANIC if we're 
in a context where we can't recover. But I agree with Amit that we 
should not decrement the lock count on error. I'm not sure if it makes 
any difference, as we'll abort out of the current transaction on error, 
but it certainly looks wrong.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Tab completion of function arguments not working in all cases

2012-06-18 Thread Dean Rasheed
On 18 June 2012 04:21, Josh Kupershmidt schmi...@gmail.com wrote:
 +1 for the idea. I find the existing behavior rather confusing,
 particularly the fact that a schema-qualified function name will be
 tab-completed, i.e. this works.

  DROP FUNCTION my_schema.my<TAB>

 but then, as your second example above shows, no completions are
 subsequently offered for the function arguments.

 As a side note unrelated to this patch, I also dislike how function
 name tab-completions will not fill in the opening parenthesis, which
 makes for unnecessary work for the user, as one then has to type the
 parenthesis and hit tab again to get possible completions for the
 function arguments. The current behavior causes:
  DROP FUNCTION my_f<TAB>

 which completes to:
  DROP FUNCTION my_function

 enter parenthesis, and hit tab:
  DROP FUNCTION my_function(<TAB>

 which, if there is only one match, could complete to:
  DROP FUNCTION my_function(integer)

 when the last three steps could have been consolidated with better
 tab-completion. Perhaps this could be a TODO.


Hmm, I find that it does automatically fill in the opening
parenthesis, but only once there is a space after the completed
function name. So
"DROP FUNCTION my_f<TAB>" completes to "DROP FUNCTION my_function "
(note the space at the end). Then pressing TAB one more time gives
"DROP FUNCTION my_function (", and then pressing TAB again gives
the function arguments.

Alternatively "DROP FUNCTION my_function<TAB>" (no space after the
function name) first completes to "DROP FUNCTION my_function " (adding
the space), and then completes with the opening parenthesis, and then
with the function arguments.

It's a bit clunky, but I find that repeatedly pressing TAB is easier
than typing the opening bracket.


 The attached patch fixes these problems by introducing a new macro
 COMPLETE_WITH_ARG, similar to the existing COMPLETE_WITH_ATTR, which
 seems to be the nearest analogous code that covers all the edge cases.

 Anyway, on to the review of the patch itself:

 * Submission *

 Patch applies cleanly to git head, and regression tests are not
 expected for tab-completion enhancements.

 * Features & Usability *

 I've verified that tab-completing of the first argument to functions
 for DROP FUNCTION and ALTER FUNCTION commands for the most part works
 as expected. The one catch I noticed was that
 Query_for_list_of_arguments wasn't restricting its results to
 currently-visible functions, so with a default search_path, if you
 have these two functions defined:

  public.doppelganger(text)
  my_schema.doppelganger(bytea)

 and then try:

  DROP FUNCTION doppelganger(<TAB>

 you get tab-completions for both "text)" and "bytea)", when you
 probably expected only the former. That's easy to fix though, please
 see attached v2 patch.


Good catch.
I think that's a useful additional test, and is also consistent with
the existing code in Query_for_list_of_attributes.


 * Coding *

 The new macro COMPLETE_WITH_ARG seems fine. The existing code used
 malloc() directly for DROP FUNCTION and ALTER FUNCTION
 (tab-complete.c, around lines 867 and 2190), which AIUI is frowned
 upon in favor of pg_malloc(). The patch avoids this ugliness by using
 the new COMPLETE_WITH_ARG macro, so that's a nice fixup.

 Overall, a nice fix for an overlooked piece of the tab-completion machinery.

 Josh

Thanks for the review.

Cheers,
Dean

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Resource Owner reassign Locks

2012-06-18 Thread Heikki Linnakangas

On 10.06.2012 23:39, Jeff Janes wrote:

The attached patch fixes that by remembering up to 10 local locks, and
pushing them up specifically rather than digging through the entire
lock table.  If the list overflows, then it reverts to the old
behavior of digging through the entire local lock table.


I'm a bit uncomfortable with having a fixed limit like that, after which 
you fall off the cliff. But I guess we can live with that, since we've 
had the quadratic behavior for years, and haven't heard complaints about 
it until recently. We can devise some more complex scheme later, if 
people still run into this.


One idea for a more complex scheme, should we need one in the future, is 
to make the cache in the ResourceOwner expandable, like all the other 
arrays in ResourceOwner. It would not overflow when new entries are 
added to it. However, if you try to forget a lock, and it's not found in 
the bottom 10 entries or so of the cache, mark the cache as overflowed 
at that point. That way reassigning the locks would be fast, even if the 
current resource owner holds a lot of locks, as long as you haven't 
tried to release any of them out of LIFO order.
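
For concreteness, a rough sketch of that idea, with assumed field names
(not the actual resowner.c code):

    #define FORGET_SEARCH_DEPTH 10

    typedef struct LockCache
    {
        LOCALLOCK **locks;       /* palloc'd array, repalloc'd as it grows */
        int         nlocks;
        bool        overflowed;  /* true => fall back to scanning the hash table */
    } LockCache;

    static void
    cache_forget_lock(LockCache *cache, LOCALLOCK *locallock)
    {
        int     stop = Max(0, cache->nlocks - FORGET_SEARCH_DEPTH);
        int     i;

        if (cache->overflowed)
            return;              /* cache is no longer maintained */

        /* Only look at the most recently remembered entries. */
        for (i = cache->nlocks - 1; i >= stop; i--)
        {
            if (cache->locks[i] == locallock)
            {
                /* slide the tail down to preserve LIFO order */
                memmove(&cache->locks[i], &cache->locks[i + 1],
                        (cache->nlocks - i - 1) * sizeof(LOCALLOCK *));
                cache->nlocks--;
                return;
            }
        }

        /* An out-of-LIFO-order forget: give up and mark the cache overflowed. */
        cache->overflowed = true;
    }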



When it needs to  forget a lock, it searches backwards in the list of
released lock and then just moves the last lock into the place of the
one to be forgotten.  Other ResourceOwner Forget functions slide the
entire list down to close the gap, rather than using the
selection-sort-like method.  I don't understand why they do that.  If
Forgets are strictly LIFO, then it would not make a difference.  If
they are mostly LIFO but occasionally not, maybe the insertion method
would win over the selection method.


Yeah, I think that's the reason. We try to keep the arrays in LIFO 
order. If you move the last entry into the removed slot, then even a 
single out-of-order removal will spoil the heuristic that removals are 
done in LIFO order. Imagine that you insert 5 entries in order:


1
2
3
4
5

Now imagine that you first remove 1 (out-of-order), and then 5, 4, 3, 
and 2 (in order). The removal of 1 moves 5 to the beginning of the 
array. Then you remove 5, and that has to traverse the whole list 
because 5 is now at the beginning. Then you swap 4 into the beginning of 
the list, and so forth. Every subsequent removal scans the list from 
bottom to top, because of the single out-of-order removal.
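
To illustrate the two strategies being compared (generic sketch, not the 
actual ResourceOwner code):

    /* "Swap" removal: constant time, but reorders the array. */
    static void
    remove_by_swap(void **arr, int *n, int idx)
    {
        arr[idx] = arr[*n - 1];
        (*n)--;
    }

    /* "Slide" removal: linear in the worst case, but keeps LIFO order intact. */
    static void
    remove_by_slide(void **arr, int *n, int idx)
    {
        int     i;

        for (i = idx; i < *n - 1; i++)
            arr[i] = arr[i + 1];
        (*n)--;
    }

When forgets are mostly LIFO, the entry to remove is usually found at the 
end of the array, so the slide loop does little or no work; after a single 
swap, though, later LIFO forgets can end up scanning from the far end, as 
in the 1..5 example above.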



 From what I can tell, Locks are
mostly released in bulk anyway at transaction end, and rarely released
explicitly.


Hmm, if they are released in bulk, then it doesn't matter either way. So 
the decision on whether to do the slide or swap has to be made on the 
assumption that some locks are released out-of-order. With an array of 
10-15 entries, it probably doesn't make any difference either way, though.



For degrading performance in other cases, I think the best test case
is pgbench -P (implemented in another patch in this commitfest)
which has a loop which pushes one or two locks up from a portal to the
parent (which already owns them, due to previous rounds of the same
loop) very frequently.  There might be a performance degradation of
0.5% or so, but it is less than the run to run variation.  I plan to
run some longer test to get a better estimate.  If there is a
degradation in that range, how important is that?


I think that would be acceptable.

I found the interface between resowner.c and lock.c a bit confusing. 
resowner.c would sometimes call LockReassignCurrentOwner() to reassign 
all the locks, and sometimes it would call LockReassignCurrentOwner() on 
each individual lock, with the net effect that's the same as calling 
LockReleaseCurrentOwner(). And it doesn't seem right for 
ResourceOwnerRemember/ForgetLock to have to accept a NULL owner.


I rearranged that so that there's just a single 
LockReassignCurrentOwner() function, like before this patch. But it 
takes as an optional argument a list of locks held by the current 
resource owner. If the caller knows it, it can pass them to make the 
call faster, but if it doesn't it can just pass NULL and the function 
will traverse the hash table the old-fashioned way. I think that's a 
better API.
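
In other words, the call pattern would look roughly like this (argument and 
variable names here are assumptions, not necessarily what the final patch 
uses):

    extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);

    static void
    reassign_example(LOCALLOCK **cached_locks, int ncached, bool cache_overflowed)
    {
        if (!cache_overflowed)
            LockReassignCurrentOwner(cached_locks, ncached);  /* fast path */
        else
            LockReassignCurrentOwner(NULL, 0);   /* scan the whole local lock table */
    }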


Please take a look to see if I broke something.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9717075..9dd9c33 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -345,6 +345,7 @@ static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
 static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
 static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
+static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
 static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode,
 			PROCLOCK *proclock, LockMethod lockMethodTable);
 static void CleanUpLock(LOCK *lock, PROCLOCK 

[HACKERS] gistchoose vs. bloat

2012-06-18 Thread Alexander Korotkov
Hackers,

While experimenting with gistchoose I obtained interesting results about the
relation between gistchoose behaviour and GiST index bloat.
I've created the following test case for reproducing GiST index bloat:
1) Create test table with 100 points from geonames
# create table geotest (id serial, point point);
# insert into geotest (point) (select * from geonames order by random()
limit 100);
2) Create gist index and statistics table. Insert initial statistics (rows
count, index size) into table.
# create index geotest_idx on geotest using gist (point) with (buffering =
off);
# create table geotest_stats (id serial, count bigint, size bigint);
# insert into geotest_stats (count, size) values ((select count(*) from
geotest), pg_relation_size('geotest_idx'));
3) Delete 10% of geotest and vacuum it.
delete from geotest where id in (select id from geotest order by random()
limit 10);
vacuum geotest;
4) Insert 10 new points from geonames.
insert into geotest (point) (select * from geonames order by random() limit
10);
5) Insert current statistics (rows count, index size) into table.
insert into geotest_stats (count, size) values ((select count(*) from
geotest), pg_relation_size('geotest_idx'));
6) Repeat 3-5 steps 50 times.

I get the following results with the current gistchoose implementation:
test=# select * from geotest1_stats;
 id |  count  |   size
----+---------+-----------
  1 | 100 |  75767808
  2 | 100 |  76046336
  3 | 100 |  76840960
  4 | 100 |  77799424
  5 | 100 |  78766080
  6 | 100 |  79757312
  7 | 100 |  80494592
  8 | 100 |  81125376
  9 | 100 |  81985536
 10 | 100 |  82804736
 11 | 100 |  83378176
 12 | 100 |  84115456
 13 | 100 |  84819968
 14 | 100 |  85598208
 15 | 100 |  86302720
 16 | 100 |  87023616
 17 | 100 |  87703552
 18 | 100 |  88342528
 19 | 100 |  88915968
 20 | 100 |  89513984
 21 | 100 |  90152960
 22 | 100 |  90742784
 23 | 100 |  91324416
 24 | 100 |  91742208
 25 | 100 |  92258304
 26 | 100 |  92758016
 27 | 100 |  93241344
 28 | 100 |  93683712
 29 | 100 |  93970432
 30 | 100 |  94396416
 31 | 100 |  94740480
 32 | 100 |  95068160
 33 | 100 |  95444992
 34 | 100 |  95780864
 35 | 100 |  96313344
 36 | 100 |  96731136
 37 | 100 |  96968704
 38 | 100 |  97222656
 39 | 100 |  97509376
 40 | 100 |  97779712
 41 | 100 |  98074624
 42 | 100 |  98344960
 43 | 100 |  98639872
 44 | 100 |  99000320
 45 | 100 |  99229696
 46 | 100 |  99459072
 47 | 100 |  99672064
 48 | 100 |  99901440
 49 | 100 | 100114432
 50 | 100 | 100261888
 51 | 100 | 100458496
(51 rows)

Index grow about from 75 MB to 100 MB.

The current implementation of gistchoose selects the first index tuple that
has the minimal penalty. It is possible for several tuples to have the same
minimal penalty. I've created a simple patch which selects a random one
among them. I then got the following results for the same test case.

test=# select * from geotest_stats;
 id |  count  |   size
----+---------+----------
  1 | 100 | 74694656
  2 | 100 | 74743808
  3 | 100 | 74891264
  4 | 100 | 75120640
  5 | 100 | 75374592
  6 | 100 | 75612160
  7 | 100 | 75833344
  8 | 100 | 76144640
  9 | 100 | 76333056
 10 | 100 | 76595200
 11 | 100 | 76718080
 12 | 100 | 76939264
 13 | 100 | 77070336
 14 | 100 | 77332480
 15 | 100 | 77520896
 16 | 100 | 77750272
 17 | 100 | 77996032
 18 | 100 | 78127104
 19 | 100 | 78307328
 20 | 100 | 78512128
 21 | 100 | 78610432
 22 | 100 | 78774272
 23 | 100 | 78929920
 24 | 100 | 79060992
 25 | 100 | 79216640
 26 | 100 | 79331328
 27 | 100 | 79454208
 28 | 100 | 79593472
 29 | 100 | 79708160
 30 | 100 | 79822848
 31 | 100 | 79921152
 32 | 100 | 80035840
 33 | 100 | 80076800
 34 | 100 | 80175104
 35 | 100 | 80207872
 36 | 100 | 80322560
 37 | 100 | 80363520
 38 | 100 | 80445440
 39 | 100 | 80494592
 40 | 100 | 80576512
 41 | 100 | 8024
 42 | 100 | 80764928
 43 | 100 | 80805888
 44 | 100 | 80912384
 45 | 100 | 80994304
 46 | 100 | 81027072
 47 | 100 | 81100800
 48 | 100 | 81174528
 49 | 100 | 81297408
 50 | 100 | 81371136
 51 | 100 | 81420288
(51 rows)

Now the index grows from about 75 MB to 81 MB. That is almost 4 times less
index bloat!
I have the following explanation for it. If index tuples are overlapping,
then the probability of insertion into the last tuple is much lower than the
probability of insertion into the first tuple. Thereby, the freed space
underlying the last tuples of an inner page is not likely to be reused.
Selecting a random tuple increases the chances of that.
Obviously, it makes insertion more expensive. I need some more benchmarks
to measure that. Probably we would need a user-visible option to choose
between cheaper inserts and less bloat.
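
For illustration, the tie-breaking idea can be sketched like this
(penalty_of() is a hypothetical helper; this is not the actual patch):

    static OffsetNumber
    choose_min_penalty_random(OffsetNumber maxoff)
    {
        OffsetNumber best = FirstOffsetNumber;
        float        best_penalty = penalty_of(best);
        int          ties = 1;
        OffsetNumber off;

        for (off = OffsetNumberNext(FirstOffsetNumber);
             off <= maxoff;
             off = OffsetNumberNext(off))
        {
            float   p = penalty_of(off);

            if (p < best_penalty)
            {
                best = off;
                best_penalty = p;
                ties = 1;
            }
            else if (p == best_penalty)
            {
                /*
                 * Keep each equal-penalty tuple with probability 1/ties,
                 * which picks uniformly among all tuples tied for the minimum.
                 */
                ties++;
                if (random() % ties == 0)
                    best = off;
            }
        }
        return best;
    }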

--
With best 

Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 02:43:26 AM Steve Singer wrote:
 On 12-06-13 01:27 PM, Andres Freund wrote:
  The previous mail contained a patch with a mismerge caused by reordering
  commits. Corrected version attached.
  
  Thanks to Steve Singer for noticing this quickly.
 
 Attached is a more complete review of this patch.
 
 I agree that we will need to identify the node a change originated at.
 We will not only want this for multi-master support but it might also be
 very helpful once we introduce things like cascaded replicas. Using a 16
 bit integer for this purpose makes sense to me.
Good.

 This patch (with the previous numbered patches already applied), still
 doesn't compile.
 
 gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
 -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
 -Wformat-security -fno-strict-aliasing -fwrapv -I../../../../src/include
 -D_GNU_SOURCE   -c -o xact.o xact.c
 xact.c: In function 'xact_redo_commit':
 xact.c:4678: error: 'xl_xact_commit' has no member named 'origin_lsn'
 make[4]: *** [xact.o] Error 1
 
 Your complete patch set did compile.  origin_lsn gets added as part of
 your 12th patch.  Managing so many related patches is going to be a
 pain, but it beats one big patch.  I don't think this patch actually
 requires the origin_lsn change.
Hrmpf #666. I will go through the series commit-by-commit again to 
make sure everything compiles again. Reordering this late definitely wasn't a 
good idea...

I pushed a rebased version with all those fixups (and removal of the 
zeroRecPtr patch).
 
 Code Review
 -
 src/backend/utils/misc/guc.c
 @@ -1598,6 +1600,16 @@ static struct config_int ConfigureNamesInt[] =
   },
 
   {
 +{"multimaster_node_id", PGC_POSTMASTER, REPLICATION_MASTER,
 +gettext_noop("node id for multimaster."),
 +NULL
 +},
 +&guc_replication_origin_id,
 +InvalidMultimasterNodeId, InvalidMultimasterNodeId,
 MaxMultimasterNodeId,
 +NULL, assign_replication_node_id, NULL
 +},
 
 I'd rather see us refer to this as the 'node id for logical replication'
 over the multimaster node id.  I think that terminology will be less
 controversial. 
You're right. 'replication_node_id' or some such should be OK?

 BootStrapXLOG in xlog.c
 creates an XLogRecord structure; should it set xl_origin_id to
 InvalidMultimasterNodeId?
 WriteEmptyXLOG in pg_resetxlog.c perhaps should also set xl_origin_id to a
 well-defined value.  I think InvalidMultimasterNodeId should be safe
 even for a no-op record from a node that actually has a node_id set
 on real records.
Good catches.


 backend/replication/logical/logical.c:
 XLogRecPtr current_replication_origin_lsn = {0, 0};
 
 This variable isn't used/referenced by this patch; it probably belongs as
 part of the later patch.
Yea, just as the usage of origin_lsn in the above compile failure.


-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s/UNSPECIFIED/SIMPLE/ in foreign key code?

2012-06-18 Thread Gurjeet Singh
On Sat, Jun 16, 2012 at 5:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:


 A small flaw in this plan is that in pg_constraint.confmatchtype,
 MATCH_UNSPECIFIED is stored as 'u'.  In a green field I'd just rename
 that to 's' for SIMPLE, but it seems possible that this would confuse
 client-side code such as pg_dump or psql.  A quick look shows that
 neither of those programs actually look directly at
 pg_constraint.confmatchtype, instead relying on backend functions when
 they want to deconstruct a foreign key constraint.  But there could well
 be other client code that would notice the change.  So I'm a bit torn
 as to whether to change it and create a release-note-worthy
 compatibility issue, or to leave it as-is (with documentation notes that
 u for MATCH_SIMPLE is a historical accident).

 Thoughts?


For some reason I thought this change would break pg_upgrade, but then I
came to my senses and realized pg_upgrade does not migrate/upgrade system
tables.

The other candidate to look for possible breakage would be pgAdmin. As Amit
says, there might be code out in the wild that does look at this column, so
not worth breaking them for this small gain.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] [PATCH 06/16] Add support for a generic wal reading facility dubbed XLogReader

2012-06-18 Thread Heikki Linnakangas

On 15.06.2012 00:38, Andres Freund wrote:

On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote:

On 13.06.2012 14:28, Andres Freund wrote:

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that
wants to do so is not tightly integrated into xlog.c is rather hard and
would require changes to rather integral parts of the recovery code
which doesn't seem to be a good idea.

It would be nice refactor ReadRecord and its subroutines out of xlog.c.
That file has grown over the years to be really huge, and separating the
code to read WAL sounds like it should be a pretty natural split. I
don't want to duplicate all the WAL reading code, so we really should
find a way to reuse that. I'd suggest rewriting ReadRecord into a thin
wrapper that just calls the new xlogreader code.

I agree that it is not very nice to duplicate it. But I also don't want to go
the route of replacing ReadRecord with it for a while; we can replace
ReadRecord later if we want. As long as it is in flux like it is right now I
don't really see the point in investing energy in it.
Also I am not that sure how a callback oriented API fits into the xlog.c
workflow?


If the API doesn't fit the xlog.c workflow, I think that's a pretty good 
indication that the API is not good. The xlog.c workflow is basically:


while(!end-of-wal)
{
  record = ReadRecord();
  redo(record);
}

There's some pretty complicated logic within the bowels of ReadRecord 
(see XLogPageRead(), for example), but it seems to me those parts 
actually fit your callback API pretty well. The biggest change is 
that the xlog-reader thingie should return to the caller after each 
record, instead of calling a callback for each read record and only 
returning at end-of-wal.


Or we could put the code to call the redo-function into the callback, 
but there's also some logic there to exit early if you reach the 
desired recovery point, for example, so the callback API would need to 
be extended to allow such early exit. I think a return-after-each-record 
API would be better, though.
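
For concreteness, a return-after-each-record loop might look roughly like 
this; every name here (XLogReaderAllocate, XLogReadOneRecord, and so on) is 
hypothetical and only illustrates the shape of the API being discussed:

    static void
    replay_loop_sketch(void)
    {
        XLogReaderState *state = XLogReaderAllocate(my_read_page, my_private_data);
        XLogRecord *record;
        char       *errormsg;

        for (;;)
        {
            record = XLogReadOneRecord(state, &errormsg);
            if (record == NULL)
                break;          /* end of WAL, or an invalid record */

            redo(record);

            if (recovery_target_reached(record))
                break;          /* early exit stays under the caller's control */
        }

        XLogReaderFree(state);
    }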



Can this be used for writing WAL, as well as reading? If so, what do you
need the write support for?

It currently can replace records which are not interesting (e.g. index changes
in the case of logical rep). Filtered records are replaced with XLOG_NOOP
records of the correct length currently. In future the actual amount of data
should really be reduced. I don't yet know how to map LSNs of the
uncompressed/compressed streams onto each other...
The filtered data is then passed to a writeout callback (in a streaming
fashion).

The whole writing-out part is pretty ugly at the moment and I just bolted it
on top because it was convenient for the moment. I am not yet sure how the API
for that should look.


Can we just leave that out for now?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s/UNSPECIFIED/SIMPLE/ in foreign key code?

2012-06-18 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 The other candidate to look for possible breakage would be pgAdmin. As Amit
 says, there might be code out in the wild that does look at this column, so
 not worth breaking them for this small gain.

Well, I already did it the other way ;-).  It's probably not that big a
deal either way, since in practice it's a sure thing that 9.3 will have
many bigger and more-painful-to-deal-with catalog changes than this one.
But the fact that both pg_dump and psql go through pg_get_constraintdef
rather than looking directly at the catalogs, and always have (where
always = since the pg_constraint catalog was invented, in 7.3)
makes me think that other clients probably do likewise.  It's rather
painful to get the list of FK column names out of pg_constraint's
array representation in pure SQL.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)

2012-06-18 Thread Tom Lane
Dean Rasheed dean.a.rash...@gmail.com writes:
 I think that the patch already covers the most common use case (in my
 experience) but we may as well get as much out of it as we can while
 we're here.

Yeah.  The cases involving nulls are probably really rather unlikely
altogether, but it seems a tad silly to fix only some of them when
we can fix all of them for marginally more effort.

 Are you planning to tackle this, or should I move the patch back to
 waiting on author to give Vik Reykja a chance to update it?

It doesn't look like much else is ready for committer yet, so I think
I'll keep hacking on this one.  The whole of ri_triggers is looking a
bit old and creaky to me; for instance the SET DEFAULT triggers believe
that they can't cache plans involving DEFAULT, which was fixed years
ago (I'm pretty sure that the plancache level should take care of that
automatically, since an ALTER TABLE ... DEFAULT command would force a
relcache flush).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 06/16] Add support for a generic wal reading facility dubbed XLogReader

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 01:51:45 PM Heikki Linnakangas wrote:
 On 15.06.2012 00:38, Andres Freund wrote:
  On Thursday, June 14, 2012 11:19:00 PM Heikki Linnakangas wrote:
  On 13.06.2012 14:28, Andres Freund wrote:
  Features:
  - streaming reading/writing
  - filtering
  - reassembly of records
  
  Reusing the ReadRecord infrastructure in situations where the code that
  wants to do so is not tightly integrated into xlog.c is rather hard and
  would require changes to rather integral parts of the recovery code
  which doesn't seem to be a good idea.
  
  It would be nice refactor ReadRecord and its subroutines out of xlog.c.
  That file has grown over the years to be really huge, and separating the
  code to read WAL sounds like it should be a pretty natural split. I
  don't want to duplicate all the WAL reading code, so we really should
  find a way to reuse that. I'd suggest rewriting ReadRecord into a thin
  wrapper that just calls the new xlogreader code.
  
  I agree that it is not very nice to duplicate it. But I also don't want
  to go the route of replacing ReadRecord with it for a while; we can
  replace ReadRecord later if we want. As long as it is in flux like it is
  right now I don't really see the point in investing energy in it.
  Also I am not that sure how a callback oriented API fits into the xlog.c
  workflow?
 
 If the API doesn't fit the xlog.c workflow, I think that's a pretty good
 indication that the API is not good. The xlog.c workflow is basically:
 
 while(!end-of-wal)
 {
record = ReadRecord();
    redo(record);
 }
 
 There's some pretty complicated logic within the bowels of ReadRecord
 (see XLogPageRead(), for example), but it seems to me those parts
 actually fit your callback API pretty well. The biggest change is
 that the xlog-reader thingie should return to the caller after each
 record, instead of calling a callback for each read record and only
 returning at end-of-wal.
I did it that way at the start but it seemed to move too much knowledge to the 
call sites.
It's also something of an efficiency/complexity trade-off: either you always reread 
the current page on entering XLogReader (because it could have changed if it 
was the last one at some point), which is noticeably slower, or you 
make the contract more complex by requiring callers to notify the XLogReader about 
the fact that the end of valid data has changed, which doesn't seem to be a good 
idea because I think we will rather quickly grow more call sites.

 Or we could put the code to call the redo-function into the callback,
 but there's some also some logic there to exit early if you reach the
 desired recovery point, for example, so the callback API would need to
 be extended to allow such early exit. I think a return-after-each-record
 API would be better, though.
Adding an early exit would be easy.

  Can this be used for writing WAL, as well as reading? If so, what do you
  need the write support for?
  
  It currently can replace records which are not interesting (e.g. index
  changes in the case of logical rep). Filtered records are replaced with
  XLOG_NOOP records with correct length currently. In future the actual
  amount of data should really be reduced. I don't yet know how to
  map LSNs of uncompressed/compressed stream onto each other...
  The filtered data is then passed to a writeout callback (in a streaming
  fashion).
  
  The whole writing out part is pretty ugly at the moment and I just bolted
  it on top because it was convenient for the moment. I am not yet sure how
  the API for that should look.

 Can we just leave that out for now?
Not easily I think. I should probably get busy writing up a PoC for separating 
that.

Thanks,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standbys, txid_current_snapshot, wraparound

2012-06-18 Thread Daniel Farina
On Thu, Mar 22, 2012 at 5:26 PM, Daniel Farina dan...@heroku.com wrote:
 Some time ago I reported bug 6291[0], which reported a Xid wraparound,
 both as reported in pg_controldata and by txid_current_snapshot.
 Unfortunately, nobody could reproduce it.

 Today, the same system of ours just passed the wraparound mark
 successfully at this time, incrementing the epoch.  However, two
 standbys have not done the same: they have wrapped to a low txid.  At
 this time, pg_controldata does report the correct epoch, as I read it,
 unlike the original case.

Hmm. Here we are again, to revive this old thread (whose archives can
be seen at http://archives.postgresql.org/pgsql-hackers/2012-03/msg01437.php)

So the system *has* wrapped this time -- and so has its standby -- but
not to the zero-epoch, but rather the epoch that it was previously in,
2^33 to 2^32.  That's not something I've seen before.  There was
another report of something similar in that thread also. So,
that's...different.

However, since we started using txids for some
streaming/incremental copying of data roughly a year ago (and thus
started to be alerted to this problem), two of the three known
wraparounds have been eventful, and the third exposed a potentially
unrelated bug in the standby whereby epochs were not being loaded.
This database is itself descended from an old timeline on a
version of Postgres that undoubtedly showed this defect.

Version: 9.0.6
Ubuntu 10.04 LTS
amd64

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Resource Owner reassign Locks

2012-06-18 Thread Amit Kapila
 A better error message would be nice, but I don't think it's worth that 
 much. resowner.c doesn't currently know about the internals of the LOCALLOCK 
 struct or lock tags, and I'd prefer to keep it that way. Let's just 
 print the pointer's address, that's what we do in many other 
 corresponding error messages in other Forget* functions.

I have checked there also doesn't exist any functions which expose lock
internal parameters like tag values.
So we can modify as you suggested.

-Original Message-
From: Heikki Linnakangas [mailto:heikki.linnakan...@enterprisedb.com] 
Sent: Monday, June 18, 2012 4:25 PM
To: Amit kapila; Jeff Janes
Cc: pgsql-hackers
Subject: Re: Resource Owner reassign Locks

On 16.06.2012 09:04, Amit kapila wrote:
 I don't think so.  C doesn't ref count its pointers.
 You are right I have misunderstood.

 I don't think that lock tags have good human readable formats, and just
 a pointer dump probably wouldn't be much use when something that can
 never happen has happened.  But I'll at least add a reference to the
 resource owner if this stays in.

 I have checked in lock.c file for the message where lock tags have been
used.
 elog(ERROR, "lock %s on object %u/%u/%u is already held",
  lockMethodTable->lockModeNames[lockmode],
  lock->tag.locktag_field1, lock->tag.locktag_field2,
  lock->tag.locktag_field3);

 This can give more information about erroneous lock.

A better error message would be nice, but I don't think it's worth that 
much. resowner.c doesn't currently know about the internals of the LOCALLOCK 
struct or lock tags, and I'd prefer to keep it that way. Let's just 
print the pointer's address; that's what we do in many other 
corresponding error messages in other Forget* functions.

On 11.06.2012 18:21, Jeff Janes wrote:
 On Sun, Jun 10, 2012 at 11:28 PM, Amit Kapilaamit.kap...@huawei.com
wrote:
 2. ResourceOwnerForgetLock(), it decrements the lock count before
removing,
 so in case it doesn't find the lock in the lock array, the count will be
 decremented even though the array still contains the same number of locks.
 
 Should it emit a FATAL rather than an ERROR?  I thought ERROR was
 sufficient to make the backend quit, as it is not clear how it could
 meaningfully recover.

ERROR seems right to me, it'll get promoted to FATAL or PANIC if we're 
in a context where we can't recover. But I agree with Amit that we 
should not decrement the lock count on error. I'm not sure if it makes 
any difference, as we'll abort out of the current transaction on error, 
but it certainly looks wrong.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-06-18 Thread Kevin Grittner
Misa Simic misa.si...@gmail.com wrote:
 2012/6/17 Kevin Grittner kevin.gritt...@wicourts.gov
 
 Can someone provide a practical example of a foreign key with
 array use case?  The only situations I'm able to think of right
 now are the same cases where you would now use a table with
 primary keys of two tables to provide a many-to-many linkage. 
 Does this proposed feature handle other cases or handle this type
 of case better?
 
 I can't imagine any other usability... Just a many-to-one
 linkage... or having a many-to-many link with fewer rows in the middle
 table...
 
The many-to-one case seems like it is better handled in the other
direction -- with the referenced table holding the set of valid keys
and the referencing table holding the single key.  (I believe the
general case of this is what Jeff called an inclusion constraint
-- a feature he wants to add at some point.)  I can't think of a use
case where that would not be better for this type of relationship
than putting the array on the referencing side.
 
The many-to-many relationship does seem like a potentially useful
feature, at least in some cases.  For example, a message board where
a limited number of tags can be attached to each message -- the
message could contain an array of tag IDs.  Clearly this could be
done as a table holding message_id and tag_id, but it seems
potentially more convenient from a programming perspective to have
an array of tag_id values in the message table.  Logically, you are
associating each of these with the primary key of the row it is in. 
Are there other obvious use-cases?  If nobody can put one forward,
this seems like a good reality test for proposed behaviors in any
corner cases where correct behavior isn't obvious -- what would
you want to do to the data if it were in the separate table with
just the primary keys to link the two tables?
 
 What is better - I think should be measured...
 
It would make an interesting test to compare performance of a
suitably sized data set and reasonable workload for both techniques.
Of course, that's a non-trivial amount of work.  It seems like
people may be able to justify this just on ease-of-use, versus
claiming that it's an optimization.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] transforms

2012-06-18 Thread Andres Freund
Hi Peter,

On Friday, June 15, 2012 12:42:12 AM Peter Eisentraut wrote:
 Here is my first patch for the transforms feature.  This is a mechanism
 to adapt data types to procedural languages.  The previous proposal was
 here: http://archives.postgresql.org/pgsql-hackers/2012-05/msg00728.php
 
 At the surface, this contains:
 
 - New catalog pg_transform
 
 - CREATE/DROP TRANSFORM
Cursory code review:
* pstrndup already exists as pnstrdup (hstore_plperl.c)
* PyString_FromStringAndSize return value not decreffed? PyDict_SetItem 
doesn't steal references
* In plpython_to_hstore I would move the 'pairs' and some related variables in 
the PG_TRY block, so the reader doesn't need to check whether it should be 
volatile
* In the same function items needs to be volatile to fit into longjmp 
semantics
* I don't think recording dependencies on transforms used when creating 
functions is a good idea, as the transform might get created after the 
function already exists. That seems to be pretty confusing behaviour.
* I am a bit wary that some place could be used to call functions accepting 
INTERNAL indirectly; I couldn't think of any immediately though. Will look into 
this a bit, but I am not experienced in the area, so ...
* I foresee the need for multiple transforms for the same type/language pair to 
coexist. The user would need to manually choose/call the transform in that 
case. This currently isn't easily possible...


 *) No psql backslash commands yet.  Could be added.
Doesn't really seem necessary to me. Not many people will need to look at this 
and the list of commands already is rather long.

 *) Permissions: Transforms don't have owners, a bit like casts.
 Currently, you are allowed to drop a transform if you own both the type
 and the language.  That might be too strict, maybe own the type and have
 privileges on the language would be enough.
Seems sensible enough to me.

 *) If we want to offer the ability to write transforms to end users,
 then we need to install all the header files for the languages and the
 types.  This actually worked quite well; including hstore.h and plperl.h
 for example, gave you what you needed.  In other cases, some headers
 might need cleaning up.  Also, some magic from the plperl and plpython
 build systems might need to be exposed, for example to find the header
 files.  See existing modules for how this currently looks.
Doesn't look too bad to me. I don't know how this could be made easier.

 *) There is currently some syntax schizophrenia.  The grammar accepts
 
 CREATE TRANSFORM FOR hstore LANGUAGE plperl (
 FROM SQL WITH hstore_to_plperl,
 TO SQL WITH plperl_to_hstore
 );
 
 but pg_dump produces
 
 CREATE TRANSFORM FOR hstore LANGUAGE plperl (
 FROM SQL WITH hstore_to_plperl(hstore),
 TO SQL WITH plperl_to_hstore(internal)
 );
 
 The SQL standard allows both.  (In the same way that it allows 'DROP
 FUNCTION foo' without arguments, if it is not ambiguous.)  Precedent is
 that CREATE CAST requires arguments, but CREATE LANGUAGE does not.
I don't find that problematic personally.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Simon Riggs
On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:

 This adds a new configuration parameter multimaster_node_id which determines
 the id used for wal originating in one cluster.

Looks good, and it seems this aspect at least is committable in this CF.

Design decisions I think we need to review are

* Naming of field. I think origin is the right term, borrowing from Slony.

* Can we add the origin_id dynamically to each WAL record? Probably no
need, but let's consider why and document that.

* Size of field. 16 bits is enough for 32,000 master nodes, which is
quite a lot. Do we need that many? I think we may have need for a few
flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
8 bits. Even if we don't need them in this release, I'd like to have
them. If they remain unused after a few releases, we may choose to
redeploy some of them as additional nodeids in future. I don't foresee
complaints that 256 master nodes is too few anytime soon, so we can
defer that decision.

* Do we want origin_id as a parameter or as a setting in pgcontrol?
IIRC we go to a lot of trouble elsewhere to avoid problems with
changing on/off parameter values. I think we need some discussion to
validate where that should live.

* Is there any overhead from CRC of WAL record because of this? I'd
guess not, but just want to double check thinking.

Presumably there is no issue wrt Heikki's WAL changes? I assume not,
but ask since I know you're reviewing that also.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Tab completion of function arguments not working in all cases

2012-06-18 Thread Josh Kupershmidt
On Mon, Jun 18, 2012 at 3:56 AM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 18 June 2012 04:21, Josh Kupershmidt schmi...@gmail.com wrote:

 As a side note unrelated to this patch, I also dislike how function
 name tab-completions will not fill in the opening parenthesis, which
 makes for unnecessary work for the user, as one then has to type the
 parenthesis and hit tab again to get possible completions for the
 function arguments. The current behavior causes:
  DROP FUNCTION my_f<TAB>

 which completes to:
  DROP FUNCTION my_function

 enter parenthesis, and hit tab:
  DROP FUNCTION my_function(<TAB>

 which, if there is only one match, could complete to:
  DROP FUNCTION my_function(integer)

 when the last three steps could have been consolidated with better
 tab-completion. Perhaps this could be a TODO.


 Hmm, I find that it does automatically fill in the opening
 parenthesis, but only once there is a space after the completed
 function name. So
 "DROP FUNCTION my_f<TAB>" completes to "DROP FUNCTION my_function "
 (note the space at the end). Then pressing TAB one more time gives
 "DROP FUNCTION my_function (", and then pressing TAB again gives
 the function arguments.

 Alternatively "DROP FUNCTION my_function<TAB>" (no space after
 function name) first completes to "DROP FUNCTION my_function " (adding
 the space), and then completes with the opening parenthesis, and then
 with the function arguments.

 It's a bit clunky, but I find that repeatedly pressing TAB is easier
 than typing the opening bracket.

Interesting, I see the behavior you describe on Linux, using psql +
libreadline, and the behavior I showed (i.e. repeatedly pressing tab
won't automatically fill in the function arguments after the function
name is completed, seemingly because no space is deposited after the
completed function name)  is with OS X 10.6, using psql + libedit.

[snip]
 you get tab-completions for both "text)" and "bytea(", when you
 probably expected only the former. That's easy to fix though, please
 see attached v2 patch.

 Good catch.
 I think that's a useful additional test, and is also consistent with
 the existing code in Query_for_list_of_attributes.

OK, I'll mark Ready for Committer in the CF.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Andres Freund
Hi Simon,

On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote:
 On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:
  This adds a new configuration parameter multimaster_node_id which
  determines the id used for wal originating in one cluster.
 
 Looks good and it seems this aspect at least is commitable in this CF.
I think we need to agree on the parameter name. It currently is 
'multimaster_node_id'. In the discussion with Steve we got to 
replication_node_id. I don't particularly like either.

Other suggestions?

 Design decisions I think we need to review are
 
 * Naming of field. I think origin is the right term, borrowing from Slony.
I think it fits as well.

 * Can we add the origin_id dynamically to each WAL record? Probably no
 need, but lets consider why and document that.
Not sure what you mean? It's already set in XLogInsert to 
current_replication_origin_id, which defaults to the value of the GUC?

 * Size of field. 16 bits is enough for 32,000 master nodes, which is
 quite a lot. Do we need that many? I think we may have need for a few
 flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
 8 bits. Even if we don't need them in this release, I'd like to have
 them. If they remain unused after a few releases, we may choose to
 redeploy some of them as additional nodeids in future. I don't foresee
 complaints that 256 master nodes is too few anytime soon, so we can
 defer that decision.
I wished we had some flag bits available before as well. I find 256 nodes a 
pretty low value to start with though; 4096 sounds better, so I would 
be happy with 4 flag bits. I think for cascading setups and such you want to 
add node ids for every node, not only masters...
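
Purely to illustrate the bit budget (not code from either patch; all names
invented), a 4/12 split of the 16-bit field could look like this:

#include <stdint.h>

#define ORIGIN_NODE_BITS    12
#define ORIGIN_NODE_MASK    ((1 << ORIGIN_NODE_BITS) - 1)      /* 0x0fff */
#define ORIGIN_FLAGS_MASK   (~ORIGIN_NODE_MASK & 0xffff)       /* 0xf000 */

static inline uint16_t
origin_pack(uint16_t node_id, uint16_t flags)
{
    return (uint16_t) ((flags << ORIGIN_NODE_BITS) | (node_id & ORIGIN_NODE_MASK));
}

static inline uint16_t
origin_node_id(uint16_t field)
{
    return field & ORIGIN_NODE_MASK;
}

static inline uint16_t
origin_flags(uint16_t field)
{
    return (field & ORIGIN_FLAGS_MASK) >> ORIGIN_NODE_BITS;
}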

Any opinions from others on this?

 * Do we want origin_id as a parameter or as a setting in pgcontrol?
 IIRC we go to a lot of trouble elsewhere to avoid problems with
 changing on/off parameter values. I think we need some discussion to
 validate where that should live.
Hm. I don't really foresee any need to have it in pg_control. What do you want 
to protect against with that?
It would need to be changeable anyway, because otherwise it would need to 
become a parameter for initdb which would suck for anybody migrating to use 
replication at some point.

Do you want to protect against problems in replication setups after changing 
the value?

 * Is there any overhead from CRC of WAL record because of this? I'd
 guess not, but just want to double check thinking.
I cannot imagine that there is. The actual size of the record didn't change 
because of alignment padding (both on 32 and 64 bit systems). 

 Presumably there is no issue wrt Heikki's WAL changes? I assume not,
 but ask since I know you're reviewing that also.
It might clash minimally because of offset changes or such, but otherwise 
there shouldn't be much.

Thanks,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 02/16] Add zeroRecPtr as a shortcut for initializing a local variable to {0, 0}

2012-06-18 Thread Andres Freund
On Thursday, June 14, 2012 03:50:28 PM Robert Haas wrote:
 On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  This is locally defined in lots of places and would get introduced
  frequently in the next commits. It is expected that this can be defined
  in a header-only manner as soon as the XLogInsert scalability groundwork
  from Heikki gets in.
 
 This appears to be redundant with the existing InvalidXLogRecPtr,
 defined in access/transam.h.
I dropped the patch from the series in the git repo and replaced every usage 
with the version in transam.h

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sortsupport for text

2012-06-18 Thread Peter Geoghegan
On 18 June 2012 00:38, Tom Lane t...@sss.pgh.pa.us wrote:
 The only reason we test a = b and not a < b || a > b
 is that the latter is at least twice as expensive to evaluate.

Perhaps I've been unclear. I meant to write (!(a < b) && !(b < a)),
and not (a < b && b < a). The former is not tautological when the
trichotomy law holds, but the latter is. My mistake - I was a little
tired when I wrote that (though you seemed to know what I meant).

You can sort things just fine if you only have a less-than comparator
that returns a boolean, which is what I sought to illustrate with that
example: There is no requirement that the comparator indicate anything
more than whether one element is less than the other. In particular,
it need not tell the sort function that two elements are equal, though
the sort function will still (almost invariably in practice) put equal
elements beside each other. Consider std::set, a highly-generic
template class that enforces uniqueness, usually implemented using a
red-black tree (so its elements are sorted), that only requires that
the datatype it is instantiated with have an operator<. The datatype
used may not even have an operator== for std::set to consult if it
wanted to.

So, in principle, std::set could figure out if any two given elements
were equivalent. That is to say, it could determine:

if (!(a  b)  !(b  a))

It could not, even in principle given its implicit
interface/requirements for datatypes, know for sure that:

if (a == b)

Despite the fact that equivalency and equality can be expected to
coincide the vast majority of the time, they are not quite the same
thing. It's perfectly reasonable that it might actually not hold that
a and b are equal, even though they are equivalent, for certain
datatypes. Certainly not for integers. But, for example, it could make
sense to have a string datatype where with a certain locale a certain
Hungarian word was equivalent to the same string with a minor
variation in diacritical marks or whatever (an entirely culturally
defined equivalency), even though it was not equal according to this
datatype's own idiosyncratic notion of equality, which is bitwise
equality.
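
To make the distinction concrete, here is a minimal C sketch (function names
are made up; strcoll() stands in for the locale-aware less-than test). With
only a less-than test available, "equivalent" and "equal" are genuinely
different predicates:

#include <stdbool.h>
#include <string.h>

/* "a sorts before b" under the current LC_COLLATE setting */
static bool
str_less(const char *a, const char *b)
{
    return strcoll(a, b) < 0;
}

/* equivalence: neither value sorts before the other */
static bool
str_equivalent(const char *a, const char *b)
{
    return !str_less(a, b) && !str_less(b, a);
}

/* equality in the bitwise sense */
static bool
str_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}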

That would be a datatype that std::set would be entirely capable of
rolling with, because it doesn't care about equality. It might be
better if our text type had the same semantics as this hypothetical
string datatype, because:

1. A culture has gone so far as to make these variations of
diacritical marks or whatever equivalent. The Unicode consortium
consulted with the Hungarian government, or maybe the Hungarian
version of the Académie française, and they asked for this, which
seems pretty serious to me. Maybe they'd like to have a unique
constraint reject what they consider to be equivalent strings. If this
sounds pedantic to you, google Han unification controversy, because
this sounds like that in reverse (Hungarian separation controversy).
The fact that when a unique constraint is violated, it isn't actually
necessary to check the equality operator function (equivalence will
do; no need to call texteq as when selecting with the same value in
the qual) suggests that this wouldn't be too hard to facilitate,
though again, I admit my understanding here is more shallow than I'd
like. Now, I'll grant you that I'm not currently losing any sleep
about our cultural insensitivity, and apparently neither are any
Hungarians; this just serves to illustrate that my position is The
Right Thing (I hope).

Here is a description of the Hungarian problem:

http://blogs.msdn.com/b/michkap/archive/2005/08/10/449909.aspx

It says "the feature is such that since dzs is a compression within
Hungarian, that if you see ddzs it should be treated as equivalent to
dzsdzs".

We're not the first people to have to deal with this:

http://bugs.mysql.com/bug.php?id=12519

Here is Tom's original analysis, that lead to this patch:

http://archives.postgresql.org/pgsql-general/2005-12/msg00803.php

2. This way, it's still possible to do the strxfrm() optimisation more
or less for free, without breaking hash methods on text. That is not
to be sniffed at - sorting text is very common and important. Most
people don't know about the C locale, and couldn't use it if they did.

 The last section of src/backend/access/nbtree/README has some notes
 that you might find relevant.

I assume that you're referring to Notes to Operator Class
Implementors. It says:


An = operator must be an equivalence relation; that is, for all non-null
values A,B,C of the datatype:

A = A is true                           reflexive law
if A = B, then B = A                    symmetric law
if A = B and B = C, then A = C          transitive law


This would still hold; we'd just have to change the = operators here
to something 

Re: [HACKERS] sortsupport for text

2012-06-18 Thread Peter Geoghegan
On 18 June 2012 16:59, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Perhaps more importantly, I cannot recreate any of these problems on
 my Fedora 16 machine. Even with hu_HU on LATIN2, Tom's original test
 case (from 2005, on a Fedora 4 machine) cannot be recreated. So it may
 be that they've tightened these things up in some way. It's far from
 clear why that should be.

 It could be worth

Ugh, hit send too soon. I meant to close with the observation that it
may be safe to rely on strcoll() / strxfrm() + strcmp() not ever
returning 0 on certain platforms, if my inability to recreate this is
anything to go by. There could be a test run by configure that
determines this, enabling us to then use the strxfrm() optimisation.
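
Sketching what such a probe could look like (the locale name and test strings
below are only placeholders; a real configure test would pick inputs known to
exercise the problem on the platform being tested):

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *a = "ddzs";         /* placeholder inputs */
    const char *b = "dzsdzs";
    char        xa[128], xb[128];   /* sizes unchecked; fine for a throwaway probe */

    if (setlocale(LC_COLLATE, "hu_HU.UTF-8") == NULL)
    {
        fprintf(stderr, "locale not available\n");
        return 1;
    }

    /* does the locale consider the two strings equal? */
    printf("strcoll(a, b) = %d\n", strcoll(a, b));

    /* does the strxfrm() + strcmp() shortcut agree? */
    strxfrm(xa, a, sizeof(xa));
    strxfrm(xb, b, sizeof(xb));
    printf("strcmp(strxfrm(a), strxfrm(b)) = %d\n", strcmp(xa, xb));

    /*
     * If strcoll() can report 0 for byte-wise different strings on this
     * platform, a comparison built on strxfrm() alone cannot reproduce the
     * tie-break on the raw bytes, and the optimisation would be unsafe.
     */
    return 0;
}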

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Andres Freund
On Wednesday, June 13, 2012 06:53:17 PM Jeff Davis wrote:
 On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote:
  The --help output for the -N option was copy-and-pasted wrongly.
  
  The message issued when using -N is also a bit content-free.  Maybe
  something like
  
  Running in nosync mode.  The data directory might become corrupt if the
  operating system crashes.\n
 
 Thank you, fixed.
 
  Which leads to the question, how does one get out of this state?  Is
  running sync(1) enough?  Is starting the postgres server enough?
 
 sync(1) calls sync(2), and the man page says:
 
 According to the standard specification  (e.g.,  POSIX.1-2001),  sync()
 schedules the writes, but may return before the actual writing is done.
 However, since version 1.3.20 Linux does actually  wait.   (This  still
 does not guarantee data integrity: modern disks have large caches.)
 
 So it looks like sync is enough if you are on linux *and* you have any
 unprotected write cache disabled.
Protection can include write barriers, it doesn't need to be a BBU

 I don't think starting the postgres server is enough.
Agreed.

 Before, I think we were safe because we could assume that the OS would
 flush the buffers before you had time to store any important data. But
 now, that window can be much larger.
I think to a large degree we didn't see any problem because of the old ext3 
"sync the whole world" type of behaviour which was common for a very long 
time.
So on ext3 (with data=ordered, the default) any fsync (checkpoints, commit, 
...) would lead to all the other files being synced as well which means 
starting the server once would have been enough on such a system...


Quick review:
- defaulting to initdb -N in the regression suite is not a good idea imo, because 
that way the buildfarm won't catch problems in that area...
- could the copydir.c and initdb.c versions of walkdir/sync_fname et al be 
unified?
- I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname 
instead of cluttering the main function with it

Looks good otherwise!

Thanks,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw in contrib

2012-06-18 Thread Merlin Moncure
On Thu, Jun 14, 2012 at 7:29 AM, Shigeru HANADA
shigeru.han...@gmail.com wrote:
 I'd like to propose pgsql_fdw, FDW for PostgreSQL, as a contrib module
 in core, again.  This patch is basically rebased version of the patches
 proposed in 9.2 development cycle, and contains some additional changes
 such as concern about volatile variables for PG_CATCH blocks.  In
 addition, I combined old three patches (pgsql_fdw core functionality,
 push-down support, and analyze support) into one patch for ease of
 review. (I think these features are necessary for pgsql_fdw use.)

 After the development for 9.2 cycle, pgsql_fdw can gather statistics
 from remote side by providing sampling function to analyze module in
 backend core, use them to estimate selectivity of WHERE clauses which
 will be pushed-down, and retrieve query results from remote side with
 custom row processor which prevents memory exhaustion from huge result set.

 It would be possible to add some more features, such as ORDER BY
 push-down with index information support, without changing existing
 APIs, but at first add relatively simple pgsql_fdw and enhance it seems
 better.  In addition, once pgsql_fdw has been merged, it would help
 implementing proof-of-concept of SQL/MED-related features.

I can't help but wonder (having been down the contrib/core/extension
road myself) if it isn't better to improve the facilities to register
and search for qualified extensions (like Perl CPAN) so that people
looking for code to improve their backends can find it.  That way,
you're free to release/do xyz/abandon your project as you see fit
without having to go through -hackers.  This should also remove a lot
of the stigma with not being in core since if stock postgres
installations can access the necessary modules via CREATE EXTENSION, I
think it will make it easier for projects like this to get used with
the additional benefit of decentralizing project management.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Skip checkpoint on promoting from streaming replication

2012-06-18 Thread Fujii Masao
On Mon, Jun 18, 2012 at 5:42 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 What do you think about this?

What happens if the server skips an end-of-recovery checkpoint, is promoted to
the master, runs some write transactions, crashes and restarts automatically
before it completes a checkpoint? In this case, the server needs to do crash
recovery from the last checkpoint record with old timeline ID to the latest WAL
record with new timeline ID. How does crash recovery do recovery beyond
timeline?

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-18 Thread Alvaro Herrera

Excerpts from Tom Lane's message of sáb jun 16 02:41:00 -0400 2012:
 Amit kapila amit.kap...@huawei.com writes:

   The suggested patch improves the logic to recover corrupt control file. So 
  that is the reason I felt it will be relevant to do this patch.
 
 Well, we invented pg_resetxlog with the thought that it might be useful
 for such situations, but I'm not sure offhand that we've ever seen a
 field report of corrupted pg_control files.  For instance, a quick
 search in the archives for incorrect checksum in control file turns up
 only cases of pilot error, such as supposing that a 32-bit database
 could be used with a 64-bit server or vice versa.  Actual hardware
 failures on the pg_control file could be expected to result in something
 like could not read from control file: I/O error, which I find no
 evidence for at all in the archives.

Hm, what about the situation where pg_control is lost completely to a
filesystem failure?  I remember doing disaster recovery on this problem
once ... As far as I recall the pg_xlog files were in a separate
partition so they weren't lost.  Some other files in the main data
partition were lost as well.  (I don't remember what is it that we had
to do to create a fake pg_control).

Maybe, even if Amit's code does not end up in pg_resetxlog, it could be
useful as a DR tool, assuming the code does not cause endless
maintenance burden.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-18 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of sáb jun 16 02:41:00 -0400 2012:
 Well, we invented pg_resetxlog with the thought that it might be useful
 for such situations, but I'm not sure offhand that we've ever seen a
 field report of corrupted pg_control files.

 Hm, what about the situation where pg_control is lost completely to a
 filesystem failure?  I remember doing disaster recovery on this problem
 once ... As far as I recall the pg_xlog files were in a separate
 partition so they weren't lost.  Some other files in the main data
 partition were lost as well.

Hm ... well, as long as we have a clear idea of a use-case, I'm not
opposed in principle to working on this area.

 (I don't remember what is it that we had
 to do to create a fake pg_control).

AFAIR you can create pg_control from scratch already with pg_resetxlog.
The hard part is coming up with values for the counters, such as the
next WAL location.  Some of them such as next OID are pretty harmless
if you don't guess right, but I'm worried that wrong next WAL could
make things worse not better.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] temporal support patch

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 19:34 +0900, Vlad Arkhipov wrote:
 What's wrong with SPI/timetravel extension for system versioning?
 http://www.postgresql.org/docs/9.1/static/contrib-spi.html
 
 We are heavily using system-versioned and application-time period
 tables in our enterprise products (most of them are bi-temporal).
 However our implementation is based on triggers and views and
 therefore is not very convenient to use. There are also some locking
 issues with foreign keys to application-time period tables. It will be
 great if the new temporal SQL features will be included in the
 Postgresql core with SQL 2011 syntax support. It is especially
 important for bi-temporal tables because of complex internal logic of
 UPDATE/DELETE and huge SELECT queries for such tables.

I've already pointed out some missing features in this thread, but the
big ones in my mind are:

1. It doesn't use 9.2 Range Types, which would help in a lot of ways
(like making the SELECT queries a lot simpler and faster).

2. It's missing a lot of options, like storing the user that modified a
row or the changed columns.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-18 Thread Martijn van Oosterhout
On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
 The fly in the ointment with any of these ideas is that the configure
 list is not a list of exact cipher names, as per Magnus' comment that
 the current default includes tests like !aNULL.  I am not sure that
 we know how to evaluate such conditions if we are applying an
 after-the-fact check on the selected cipher.  Does OpenSSL expose any
 API for evaluating whether a selected cipher meets such a test?

I'm not sure whether there's an API for it, but you can certainly check
manually with openssl ciphers -v, for example:

$ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5

...etc...

So unless the openssl includes the code twice there must be a way to
extract the list from the library.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-18 Thread Alvaro Herrera

Excerpts from Boszormenyi Zoltan's message of vie may 11 03:54:13 -0400 2012:
 Hi,
 
 another rebasing and applied the GIT changes in
 ada8fa08fc6cf5f199b6df935b4d0a730aaa4fec to the
 Windows implementation of PGSemaphoreTimedLock.

Hi,

I gave the framework patch a look.  One thing I'm not sure about is the
way you've defined the API.  It seems a bit strange to have a nice and
clean, generic interface in timeout.h; and then have the internal
implementation of the API cluttered with details such as what to do when
the deadlock timeout expires.  Wouldn't it be much better if we could
keep the details of CheckDeadLock itself within proc.c, for example?  

Same for the other specific Check*Timeout functions.  It seems to me
that the only timeout.c specific requirement is to be able to poke
base_timeouts[].indicator and fin_time.  Maybe that can get done in
timeout.c and then have the generic checker call the module-specific
checker.  ... I see that you have things to do before and after setting
indicator.  Maybe you could split the module-specific Check functions
in two and have timeout.c call each in turn.  Other than that I don't
see that this should pose any difficulty.
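
Roughly (all names invented here, just to show the shape of the split:
timeout.c owns the indicator/fin_time bookkeeping and calls a module-supplied
callback):

#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;            /* stand-in for the real type */

typedef struct TimeoutEntry
{
    bool        indicator;              /* has this timeout triggered? */
    TimestampTz fin_time;               /* when it is due to trigger */
    void      (*module_check) (TimestampTz now);    /* e.g. deadlock check, lives in proc.c */
} TimeoutEntry;

#define NUM_TIMEOUTS 2
static TimeoutEntry base_timeouts[NUM_TIMEOUTS];

/* generic part, driven from the SIGALRM handler in timeout.c */
static void
check_timeouts(TimestampTz now)
{
    int         i;

    for (i = 0; i < NUM_TIMEOUTS; i++)
    {
        TimeoutEntry *t = &base_timeouts[i];

        if (!t->indicator && now >= t->fin_time)
        {
            t->indicator = true;        /* generic bookkeeping */
            if (t->module_check)
                t->module_check(now);   /* module-specific work */
        }
    }
}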

Also, I see you're calling GetCurrentTimestamp() in the checkers; maybe
more than once per signal if you have multiple timeouts enabled.  Maybe
it'd be better to call it once in handle_sig_alarm and then pass the
value down, if necessary (AFAICS if you restructure the code as I
suggest above, you don't need to get the value down to the
module-specific code).

As for the Init*Timeout() and Destroy*Timeout() functions, they seem
a bit pointless -- I mean if they just always call the generic
InitTimeout and DestroyTimeout functions, why not just define the struct
in a way that makes the specific functions unnecessary?  Maybe you even
already have all that's necessary ... I think you just change
base_timeouts[i].timeout_destroy(false); to DestroyTimeout(i, false) and
so on.

Minor nitpick: the Interface: comment in timeout.c seems useless.
That kind of thing tends to get overlooked and obsolete over time.  We
have examples of such things all over the place.  I'd just rip it and
have timeout.h be the interface documentation.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-18 Thread Amit Kapila
 AFAIR you can create pg_control from scratch already with pg_resetxlog.
 The hard part is coming up with values for the counters, such as the
 next WAL location.  Some of them such as next OID are pretty harmless
 if you don't guess right, but I'm worried that wrong next WAL could
 make things worse not better.

I believe if the WAL files are intact, as mentioned in Alvaro's mail, the proposed 
logic should generate correct values.
Do you see any problem in the logic proposed in my original mail?
Can I resume my work on this feature?


With Regards,
Amit Kapila.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Robert Haas
On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund and...@2ndquadrant.com wrote:
 1. Use a 64-bit segment number, instead of the log/seg combination. And
 don't waste the last segment on each logical 4 GB log file. The concept
 of a logical log file is now completely gone. XLogRecPtr is unchanged,
 but it should now be understood as a plain 64-bit value, just split into
 two 32-bit integers for historical reasons. On disk, this means that
 there will be log files ending in FF, those were skipped before.
 What's the reason for keeping that awkward split now? There aren't that many
 users of xlogid/xrecoff and many of those would be better served by using
 helper macros.

I wondered that, too.  There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Heikki Linnakangas

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freundand...@2ndquadrant.com  wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a logical log file is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

What's the reason for keeping that awkward split now? There aren't that many
users of xlogid/xrecoff and many of those would be better served by using
helper macros.


I wondered that, too.  There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.


The page header contains an XLogRecPtr (LSN), so if we change it we'll 
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr 
around as the on-disk representation, and convert between the 64-bit 
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - 
many xlog calculations would admittedly be simpler if it was an uint64.
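
A minimal sketch of that conversion (names approximate, not the actual patch;
this assumes the plain-64-bit interpretation above, where the two halves are
simply the high and low words):

#include <stdint.h>

typedef struct
{
    uint32_t    xlogid;         /* high 32 bits */
    uint32_t    xrecoff;        /* low 32 bits */
} XLogRecPtrOnDisk;             /* on-disk layout stays as today */

static inline uint64_t
lsn_from_ondisk(XLogRecPtrOnDisk p)
{
    return ((uint64_t) p.xlogid << 32) | p.xrecoff;
}

static inline XLogRecPtrOnDisk
lsn_to_ondisk(uint64_t lsn)
{
    XLogRecPtrOnDisk p;

    p.xlogid = (uint32_t) (lsn >> 32);
    p.xrecoff = (uint32_t) lsn;
    return p;
}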


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed

2012-06-18 Thread Robert Haas
On Fri, Jun 15, 2012 at 9:22 AM, Ants Aasma a...@cybertec.at wrote:
 Exactly. I think the first question for this patch should be whether
 this use-case is worth the complexity of the patch. I can't imagine
 any really compelling use cases that need an arbitrary distinct subset
 of results.

Me neither.

 The original complaint on -performance [1], didn't specify
 a real world use case, but it seemed to be a case of an ORM generating
 suboptimal queries. On the other hand, the patch itself is in my
 opinion rather simple, so it might be worth it.

Yeah.

 It has one outstanding issue, query_planner chooses the cheapest path
 based on total cost. This can be suboptimal when that path happens to
 have high startup cost. It seems to me that enabling the query_planner
 to find the cheapest unsorted path returning a limited amount of
 tuples would require some major surgery to the planner. To be clear,
 this is only a case of missed optimization, not a regression.

I'm confused by this remark, because surely the query planner does it
this way only if there's no LIMIT.  When there is a LIMIT, we choose
based on the startup cost plus the estimated fraction of the total
cost we expect to pay based on dividing the LIMIT by the overall row
count estimate.  Or is this not what you're talking about?
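
For example (made-up numbers, just the arithmetic described above, not the
planner's actual code):

#include <stdio.h>

int
main(void)
{
    double      startup_cost = 10.0;    /* hypothetical path */
    double      total_cost = 100000.0;
    double      rows = 1000000.0;
    double      limit = 10.0;

    /* charge startup plus the LIMIT'ed fraction of the run cost */
    double      fraction = limit / rows;
    double      cost_under_limit =
        startup_cost + fraction * (total_cost - startup_cost);

    printf("cost under LIMIT %.0f: %.3f (total cost %.0f)\n",
           limit, cost_under_limit, total_cost);
    return 0;
}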

 It won't help set returning functions because the tuplestore for those
 is fully materialized when the first row is fetched.

That's surely a problem for another day.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
 On 18.06.2012 21:00, Robert Haas wrote:
  On Thu, Jun 14, 2012 at 5:58 PM, Andres Freundand...@2ndquadrant.com  
wrote:
  1. Use a 64-bit segment number, instead of the log/seg combination. And
  don't waste the last segment on each logical 4 GB log file. The concept
  of a logical log file is now completely gone. XLogRecPtr is
  unchanged, but it should now be understood as a plain 64-bit value,
  just split into two 32-bit integers for historical reasons. On disk,
  this means that there will be log files ending in FF, those were
  skipped before.
  
  What's the reason for keeping that awkward split now? There aren't that
  many users of xlogid/xrecoff and many of those would be better served
  by using helper macros.
  
  I wondered that, too.  There may be a good reason for keeping it split
  up that way, but we at least oughta think about it a bit.
 
 The page header contains an XLogRecPtr (LSN), so if we change it we'll
 have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
 around as the on-disk representation, and convert between the 64-bit
 integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
 many xlog calculations would admittedly be simpler if it was an uint64.
I am out of my depth here, not having read any of the relevant code, but 
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without 
marking the page dirty?

There is the valid argument that you would lose some information when pages 
with hint bits are written out again, but on the other hand you would also 
gain the information that it was a hint-bit write...

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Robert Haas
On Mon, Jun 18, 2012 at 2:08 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 18.06.2012 21:00, Robert Haas wrote:
 On Thu, Jun 14, 2012 at 5:58 PM, Andres Freundand...@2ndquadrant.com
  wrote:

 1. Use a 64-bit segment number, instead of the log/seg combination. And
 don't waste the last segment on each logical 4 GB log file. The concept
 of a logical log file is now completely gone. XLogRecPtr is unchanged,
 but it should now be understood as a plain 64-bit value, just split into
 two 32-bit integers for historical reasons. On disk, this means that
 there will be log files ending in FF, those were skipped before.

 What's the reason for keeping that awkward split now? There aren't that
 many
 users of xlogid/xrecoff and many of those would be better served by
 using
 helper macros.

 I wondered that, too.  There may be a good reason for keeping it split
 up that way, but we at least oughta think about it a bit.

 The page header contains an XLogRecPtr (LSN), so if we change it we'll have
 to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as
 the on-disk representation, and convert between the 64-bit integer and
 XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog
 calculations would admittedly be simpler if it was an uint64.

Ugh.  Well, that's a good reason for thinking twice before changing
it, if not abandoning the idea altogether.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Heikki Linnakangas

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without
marking the page dirty?

There is the valid argument that you would lose some information when pages
with hint bits are written out again, but on the other hand you would also
gain the information that it was a hint-bit write...


Sorry, I don't understand that. Where would you replace the LSN from 
disk with InvalidXLogRecPtr ? (and no, it probably won't work ;-) )


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Peter Eisentraut
On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
 - defaulting to initdb -N in the regression suite is not a good idea imo,
 because that way the buildfarm won't catch problems in that area...
 
The regression test suite also starts postgres with fsync off.  This is
good, and I don't want to have more overhead in the tests.  It's not
like we already have awesome test coverage for OS interaction.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
 Quick review:
  - defaulting to initdb -N in the regression suite is not a good idea imo, because 
 that way the buildfarm won't catch problems in that area...

I removed the -N as you suggest. How much does performance matter on the
buildfarm?

 - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be 
 unified?

There's a lot of backend-specific code in the copydir versions, like
using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
unifying them before, and concluded that it wouldn't add to the
readability, so I just commented where they came from.

If you feel there will be a maintainability problem, I can give it
another shot.

 - I personally would find it way nicer to put USE_PRE_SYNC into 
 pre_sync_fname 
 instead of cluttering the main function with it

Done.

Regards,
Jeff Davis


initdb-fsync-20120618.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:
 On 18.06.2012 21:13, Andres Freund wrote:
  On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
  The page header contains an XLogRecPtr (LSN), so if we change it we'll
  have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
  around as the on-disk representation, and convert between the 64-bit
  integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
  many xlog calculations would admittedly be simpler if it was an uint64.
  
  I am out of my depth here, not having read any of the relevant code, but
  couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
  without marking the page dirty?
  
  There is the valid argument that you would lose some information when
  pages with hint bits are written out again, but on the other hand you
  would also gain the information that it was a hint-bit write...
 
 Sorry, I don't understand that. Where would you replace the LSN from
 disk with InvalidXLogRecPtr ? (and no, it probably won't work ;-) )
In ReadBuffer_common or such, after reading a page from disk and verifying 
that the page has a valid header it seems to be possible to replace pd_lsn *in 
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified the lsn on disk won't be changed, so you don't lose 
debugging information in that case. It will be zero if it has been written by 
a hint-bit write, and that's arguably a win.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote:
 On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
  - defaulting to initdb -N in the regression suite is not a good idea imo,
  because that way the buildfarm won't catch problems in that area...
  
 The regression test suite also starts postgres with fsync off.  This is
 good, and I don't want to have more overhead in the tests.  It's not
 like we already have awesome test coverage for OS interaction.

OK, changing back to using -N during regression tests due to performance
concerns.

Regards,
Jeff Davis


initdb-fsync-20120618-2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 08:39:47 PM Jeff Davis wrote:
 On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
  Quick review:
  - defaulting to initdb -N in the regression suite is not a good idea imo,
  because that way the buildfarm won't catch problems in that area...
 I removed the -N as you suggest. How much does performance matter on the
 buildfarm?
I don't think the difference in initdb cost is relevant when running the 
regression tests. Should it prove to be, we can re-add -N after a week or two 
in the buildfarm machines. I just remember that there were several OS-specific 
regressions when adding the pre-syncing for createdb.

  - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
  be unified?
 There's a lot of backend-specific code in the copydir versions, like
 using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
 unifying them before, and concluded that it wouldn't add to the
 readability, so I just commented where they came from.
Ok. Sensible reasons. I dislike that we now have two files using different 
logic (copydir.c only using fadvise, initdb using sync_file_range if 
available). Maybe we should just move the fadvise and sync_file_range calls 
into their own common function?

 If you feel there will be a maintainability problem, I can give it
 another shot.
It's not too bad yet I guess, so ...

  - I personally would find it way nicer to put USE_PRE_SYNC into
  pre_sync_fname instead of cluttering the main function with it
 Done.

Looks good to me.

Btw, I just want to have said this, although I don't think it's particularly 
relevant as it doesn't affect correctness: it's possible to have a system where 
sync_file_range is in the system headers but the kernel at runtime doesn't 
support it. It is relatively new (2.6.17). It would be possible to fall back to 
posix_fadvise, which has been around far longer, in that case...
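
A rough sketch of what such a runtime fallback could look like (the configure
macro names here are assumptions, not necessarily the ones the patch uses):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

static void
pre_sync_fd(int fd)
{
#ifdef HAVE_SYNC_FILE_RANGE             /* assumed configure macro */
    /* ask the kernel to start writeback now; non-blocking */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) == 0)
        return;
    if (errno != ENOSYS)
        return;                         /* syscall exists, failed for another reason */
    /* ENOSYS: built against new headers, running on an old kernel */
#endif
#ifdef POSIX_FADV_DONTNEED
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
}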

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Heikki Linnakangas

On 18.06.2012 21:45, Andres Freund wrote:

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.


I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
without marking the page dirty?

There is the valid argument that you would lose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...


Sorry, I don't understand that. Where would you replace the LSN from
disk with InvalidXLogRecPtr ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and verifying
that the page has a valid header it seems to be possible to replace pd_lsn *in
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified the lsn on disk won't be changed, so you don't lose
debugging information in that case. It will be zero if it has been written by
a hint-bit write, and that's arguably a win.


We use the LSN to decide whether a full-page image needs to be xlogged if 
the page is modified. If you reset the LSN every time you read in a page, 
you'll be doing full page writes every time a page is read from disk and 
modified, whether or not it's the first time the page is modified after 
the last checkpoint.
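
In sketch form (simplified to the bare comparison; names invented):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LSN;           /* simplified stand-in for XLogRecPtr */

/*
 * A full-page image is needed the first time a page is touched after a
 * checkpoint, i.e. while its LSN does not yet exceed the checkpoint's redo
 * pointer.  Zeroing the in-memory LSN on every read would make this test
 * fire on every first modification after a read, which is why the idea
 * above doesn't work.
 */
static bool
needs_full_page_image(LSN page_lsn, LSN redo_ptr)
{
    return page_lsn <= redo_ptr;
}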


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-06-18 Thread Alvaro Herrera

Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:

 Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
 and SvPVUTF8() when turning a perl string into a cstring.

Hmm, this patch belongs into back branches too, right?  Not just the
current development tree?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [Review] Add SPI_gettypmod() to return a field's typemod from a TupleDesc

2012-06-18 Thread Amit Kapila
On Wed, Feb 1, 2012 at 5:23 AM, Chetan Suttraway

chetan(dot)suttraway(at)enterprisedb(dot)com wrote:

 Hi All,

 

 This is regarding the TODO item :

 Add SPI_gettypmod() to return a field's typemod from a TupleDesc

 

 The related message is:

 http://archives.postgresql.org/pgsql-hackers/2005-11/msg00250.php

 

 This basically talks about having an SPI_gettypemod() which returns the

 typmod of a field of tupdesc

 

 Please refer the attached patch based on the suggested implementation.

 

Here's my review of this patch. 

Basic stuff: 
 
- Patch applies OK 
- Compiles cleanly with no warnings 
- Regression tests pass. 
- Documentation changes missing 
- Section 43.2 Interface Support Functions needs to add information about
SPI_gettypmod, 
  as well as a description of SPI_gettypmod. 

What it does: 
-- 
New SPI interface function is exposed. 

It is used for returning the atttypmod of the specified column. 

If the attribute number is out of range then SPI_result is set to
SPI_ERROR_NOATTRIBUTE and -1 is returned; otherwise the attribute's typmod is returned. 

Testing: 
--- 
Created a C function and validated the typmod for 
- output basic data types 
- numeric, varchar with valid typmod 

=C-Function= 
PG_FUNCTION_INFO_V1(test_SPI_gettypmod);
Datum
test_SPI_gettypmod(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    Oid         AttidNo = PG_GETARG_OID(1);
    Relation    rel;
    TupleDesc   tupdesc;        /* tuple description */
    int4        retval;

    rel = relation_open(relid, NoLock);
    if (!rel)
    {
        PG_RETURN_INT32(0);
    }

    tupdesc = rel->rd_att;
    retval = SPI_gettypmod(tupdesc, AttidNo);
    relation_close(rel, NoLock);
    PG_RETURN_INT32(retval);
}

==Function creation== 
CREATE FUNCTION test_spi_gettypmod (oid, int) RETURNS int AS
'/lib/mytest.so','test_SPI_gettypmod' LANGUAGE C STRICT; 

===Output= 
postgres=# \d tbl 
Table public.tbl 
 Column |Type | Modifiers 
+-+--- 
 a  | integer | 
 b  | character varying(100)  | 
 c  | numeric(10,5)   | 
 d  | numeric | 
 e  | character varying   | 
 f  | timestamp without time zone | 

postgres=# \d tt1 
 Table public.tt1 
 Column |  Type  | Modifiers 
++--- 
 t  | tbl| 
 b  | character varying(200) | 

postgres=# select atttypmod, attname from pg_attribute where attrelid =
24577; 
 atttypmod | attname 
---+-- 
-1 | tableoid 
-1 | cmax 
-1 | xmax 
-1 | cmin 
-1 | xmin 
-1 | ctid 
-1 | a 
   104 | b 
655369 | c 
(9 rows) 

postgres=# select test_spi_gettypmod(24577, 3); 
 test_spi_gettypmod 
--------------------
             655369
(1 row) 

postgres=# alter table tbl add column d numeric; 
ALTER TABLE 
postgres=# alter table tbl add column e varchar; 
ALTER TABLE 
postgres=# select test_spi_gettypmod(24577, 4); 
 test_spi_gettypmod 
--------------------
                 -1
(1 row) 

postgres=# select test_spi_gettypmod(24577, 5); 
 test_spi_gettypmod 
--------------------
                 -1
(1 row) 

postgres=# select test_spi_gettypmod( 24592, 1); 
 test_spi_gettypmod 
--------------------
                 -1
(1 row) 


postgres=# alter table tt1 add column b varchar(200); 
ALTER TABLE 
postgres=# select test_spi_gettypmod( 24592, 2); 
 test_spi_gettypmod 
--------------------
                204
(1 row) 
 

Conclusion: 
--- 
 The patch changes are okay but need a documentation update, so I am marking
it as Waiting On Author.


1. For this kind of patch, it is okay even if no new regression test is
added. 
2. Documentation need to be updated. 
3. In case the attribute number is not in the valid range, currently it returns
-1, but -1 is a valid typmod. 
Committer can verify if this is appropriate. 
4. In functions exec_eval_datum & exec_get_datum_type_info, in my opinion
it would be better to use SPI_gettypmod() to get the 
attribute typmod, because 
    a. there is a comment /* XXX there's no SPI_gettypmod, for some
reason */ and the code uses SPI_gettypeid to get the typeid. 
b. it will maintain consistency of nearby code such as it will use

Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
   - defaulting to initdb -N in the regression suite is not a good idea imo,
   because that way the buildfarm won't catch problems in that area...
  I removed the -N as you suggest. How much does performance matter on the
  buildfarm?
 I don't think the difference in initdb cost is relevant when running the 
 regression tests. Should it prove to be, we can re-add -N after a week or two 
 in the buildfarm machines. I just remember that there were several OS-specific 
 regressions when adding the pre-syncing for createdb.

That sounds reasonable to me. Both patches are out there, so we can
figure out what the consensus is.

   - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
   be unified?
  There's a lot of backend-specific code in the copydir versions, like
  using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
  unifying them before, and concluded that it wouldn't add to the
  readability, so I just commented where they came from.
 Ok. Sensible reasons. I dislike that we now have two files using different 
 logic (copydir.c only using fadvise, initdb using sync_file_range if 
 available). Maybe we should just move the fadvise and sync_file_range calls 
 into their own common function?

I don't see fadvise in copydir.c, it looks like it just uses fsync. It
might speed it up to use a pre-sync call there, too -- database creation
does take a while.

If that's in the scope of this patch, I'll do it.

 Btw, I just want to have said this, although I don't think it's particularly 
 relevant as it doesn't affect correctness: it's possible to have a system 
 where 
 sync_file_range is in the system headers but the kernel at runtime 
 doesn't 
 support it. It is relatively new (2.6.17). It would be possible to fall back 
 to 
 posix_fadvise, which has been around far longer, in that case...

Interesting point, but I'm not too worried about it.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Transactions over pathological TCP connections

2012-06-18 Thread Leon Smith
Out of (mostly idle) curiosity, when exactly does a transaction commit,
especially with respect to a TCP connection that a pathological demon will
cut off at the worst possible moment?

The thing is,  I'm using PostgreSQL as a queue,  using asynchronous
notifications and following the advice of Marko Tiikkaja in this post:

http://johtopg.blogspot.com/2010/12/queues-in-sql.html

I'm using a stored procedure to reduce the round trips between the database
and client,  and then running it in a bare transaction,  that is,  as
SELECT dequeue_element();  with an implicit BEGIN/COMMIT to mark a row in
the queue as taken and return it.

My question is,  would it be theoretically possible for an element of a
queue to become marked but not delivered,  or delivered and not marked,  if
the TCP connection between the backend and client was interrupted at the
worst possible moment?   Will the backend wait for the delivery of the row
to be acknowledged before the transaction is committed? Or should the
truly paranoid use an explicit transaction block and not consider the row
taken until confirmation that the transaction has committed has been
received?
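
For what it's worth, the paranoid variant would look roughly like this in
libpq (a sketch only: connection handling is omitted and dequeue_element() is
the placeholder function from above). The element is only considered taken
once the COMMIT itself comes back as successful:

#include <libpq-fe.h>

/* returns 1 only if an element was fetched *and* the COMMIT was acknowledged */
static int
dequeue_one(PGconn *conn)
{
    PGresult   *res;
    int         got = 0;
    int         committed = 0;

    res = PQexec(conn, "BEGIN");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        PQclear(res);
        return 0;
    }
    PQclear(res);

    res = PQexec(conn, "SELECT dequeue_element()");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        got = 1;                /* value available via PQgetvalue(res, 0, 0) */
    PQclear(res);

    res = PQexec(conn, "COMMIT");
    committed = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);

    return got && committed;
}

Even this leaves the window where the server commits but the acknowledgement
is lost on the wire, so the marking still has to be reconciled (or made
idempotent) after a reconnect.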

Best,
Leon


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 09:32:25 PM Jeff Davis wrote:
- could the copydir.c and initdb.c versions of walkdir/sync_fname et
al be unified?
   
   There's a lot of backend-specific code in the copydir versions, like
   using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
   unifying them before, and concluded that it wouldn't add to the
   readability, so I just commented where they came from.
  
  Ok. Sensible reasons. I dislike that we know have two files using
  different logic (copydir.c only using fadvise, initdb using
  sync_file_range if available). Maybe we should just move the fadvise and
  sync_file_range calls into its own common function?
 
 I don't see fadvise in copydir.c, it looks like it just uses fsync. It
 might speed it up to use a pre-sync call there, too -- database creation
 does take a while.
 
 If that's in the scope of this patch, I'll do it.
It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
maybe just put the sync_file_range in pg_flush_data?

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-06-18 Thread Alvaro Herrera

Excerpts from Jeff Davis's message of lun jun 18 15:32:25 -0400 2012:
 On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:

  Btw, I just want to have said this, although I don't think its particularly 
  relevant as it doesn't affect correctness: Its possible to have a system 
  where 
  sync_file_range is in the system headers but the kernel during runtime 
  doesn't 
  support it. It is relatively new (2.6.17). It would be possible to fallback 
  to 
  posix_fadvise which has been around far longer in that case...
 
 Interesting point, but I'm not too worried about it.

Yeah.  2.6.17 was released in June 2006.  The latest stable release
prior to 2.6.17 was 2.6.16.62 in 2008 and it was abandoned at that time
in favor of 2.6.27 as the new stable branch.

I don't think this is a problem any sane person is going to face; and
telling them to upgrade to a newer kernel seems an acceptable answer.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [Review] Prevent the specification of conflicting transaction read/write options

2012-06-18 Thread Amit Kapila
Chetan Suttraway chetan(dot)suttraway(at)enterprisedb(dot)com wrote:

 This is regarding the TODO item:

 Prevent the specification of conflicting transaction read/write

 options

 

 listed at:

 http://wiki.postgresql.org/wiki/Todo

 

Here's my review of this patch. 

Basic stuff: 
 

- Patch applies OK 
- Compiles cleanly with no warnings 
- changes in gram.y do not introduce any conflicts. 
- Regression tests pass. 
- Documentation changes not required as the scenario is obvious. 

What it does: 
- 
Prevent the specification of conflicting transaction mode options 
in commands such as 
START/BEGIN TRANSACTION, SET TRANSACTION, SET SESSION CHARACTERISTICS AS
TRANSACTION transaction_mode. 

Conflicting transaction modes include 
1. READ WRITE | READ ONLY 

2. ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ
UNCOMMITTED } 

3. DEFERRABLE | NOT DEFERRABLE 

 

The changes are done in the gram.y file, adding a new function which checks for
conflicting transaction modes.
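
As an illustration of the kind of check being described, here is a standalone sketch (not the submitted gram.y code; the grouping and all names are made up): each transaction mode item falls into one of three groups, a group seen again with a different value is a conflict, and exact duplicates are simply ignored:

#include <stdio.h>
#include <string.h>

/*
 * Standalone illustration only; names and grouping are made up and this is
 * not the submitted gram.y code.  Each transaction_mode_item belongs to one
 * of three groups; an item conflicts if its group was already seen with a
 * different value, while exact duplicates are simply ignored, so a valid
 * list never holds more than three entries.
 */
enum mode_group { GROUP_ISOLATION, GROUP_READ_WRITE, GROUP_DEFERRABLE, N_GROUPS };

static enum mode_group
group_of(const char *mode)
{
	if (strncmp(mode, "isolation", 9) == 0)
		return GROUP_ISOLATION;
	if (strcmp(mode, "read only") == 0 || strcmp(mode, "read write") == 0)
		return GROUP_READ_WRITE;
	return GROUP_DEFERRABLE;
}

int
main(void)
{
	/* e.g. START TRANSACTION DEFERRABLE READ ONLY NOT DEFERRABLE */
	const char *items[] = { "deferrable", "read only", "not deferrable" };
	const char *seen[N_GROUPS] = { NULL, NULL, NULL };
	int			i;

	for (i = 0; i < 3; i++)
	{
		enum mode_group g = group_of(items[i]);

		if (seen[g] == NULL)
			seen[g] = items[i];			/* first item of this group */
		else if (strcmp(seen[g], items[i]) != 0)
		{
			printf("ERROR:  conflicting options: %s vs %s\n",
				   seen[g], items[i]);
			return 1;
		}
		/* else: exact duplicate, ignore it */
	}
	printf("no conflicting options\n");
	return 0;
}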
  

Testing: 
 

postgres=# BEGIN TRANSACTION read only read write; 
ERROR:  conflicting options 
LINE 1: BEGIN TRANSACTION read only read write; 

postgres=# SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ
COMMITTED ISOLATION LEVEL READ unCOMMITTED; 
ERROR:  conflicting options 
LINE 1: ...ON CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMI... 

postgres=# start transaction deferrable not deferrable; 
ERROR:  conflicting options 
LINE 1: start transaction deferrable not deferrable; 
  ^ 
postgres=# start transaction deferrable read only not deferrable; 
ERROR:  conflicting options 
LINE 1: start transaction deferrable read only not deferrable; 
  ^ 
postgres=# start transaction deferrable, read only not deferrable; 
ERROR:  conflicting options 
LINE 1: start transaction deferrable, read only not deferrable; 
  ^ 
postgres=# start transaction deferrable, read only, not deferrable; 
ERROR:  conflicting options 
LINE 1: start transaction deferrable, read only, not deferrable; 


postgres=# start transaction deferrable, read only, ISOLATION LEVEL READ
COMMITTED; 
START TRANSACTION 
postgres=# commit; 
COMMIT 
postgres=# start transaction deferrable, read only, ISOLATION LEVEL READ
UNCOMMITTED; 
START TRANSACTION 
postgres=# commit; 
COMMIT 
postgres=# start transaction deferrable, read only, ISOLATION LEVEL READ
UNCOMMITTED, read only; 
START TRANSACTION 
 ^ 
postgres=# SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ
COMMITTED ISOLATION LEVEL READ COMMITTED; 
SET 
  ^ 


Code Review Comments
-- 

1. The complexity of function check_trans_mode is high (O(n^2)); it can be reduced to O(n):

   we should not add duplicate elements to 
   transaction_mode_list: 
transaction_mode_item 
{ $$ = list_make1($1); } 
| transaction_mode_list ',' transaction_mode_item 
{ 
check_trans_mode((List *)$1,
(DefElem *)$3, yyscanner); 
$$ = lappend($1, $3); 
} 

| transaction_mode_list transaction_mode_item 
{ 
check_trans_mode((List *)$1,
(DefElem *)$2, yyscanner); 
$$ = lappend($1, $2); 
} 

   i.e. if start transaction read only read only is given, then add the mode
read only to the list only the first time. 
   If so, the length of a valid list is limited to at most 3 elements (modes): 1
ISOLATION LEVEL, 1 READ WRITE/READ ONLY, 1 DEFERRABLE. 
   This reduces the search complexity of check_trans_mode to 3n, i.e. O(n). 

2. During parsing, when you call check_trans_mode((List *)$1, (DefElem *)$3,
yyscanner), what is the need 
   of casting the first and second parameters? 

3. The comment describing the function can be more specific. 
   patch - checks for conflicting transaction modes by looking up the current
element in the given list. 
   suggested - checks for conflicting transaction modes by looking up the current
transaction mode element 
   in the given transaction mode list. 
  
4. Names used for function parameters and variables can be more meaningful. 
   a. check_trans_mode(List *list, DefElem *elem, core_yyscan_t yyscanner) 
   For the first parameter, instead of list you can use modes; it better
suggests the meaning of the parameter. 
   For the second parameter, instead of elem you can use mode or input_mode, or
something better if it comes to mind that suggests the meaning more clearly. 

   b. A_Const *elem_arg; 

Re: [HACKERS] initdb and fsync

2012-06-18 Thread Jeff Davis
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
 It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
 maybe just put the sync_file_range in pg_flush_data?

Oh, I didn't notice that, thank you.

In that case, it may be good to combine them if possible. I will look
into it. There may be performance implications when used on a larger
amount of data though. I can do some brief testing.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-18 Thread Alvaro Herrera


BTW it doesn't seem that datatype/timestamp.h is really necessary in
timeout.h.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format changes

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 09:19:48 PM Heikki Linnakangas wrote:
 On 18.06.2012 21:45, Andres Freund wrote:
  On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:
  On 18.06.2012 21:13, Andres Freund wrote:
  On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:
  The page header contains an XLogRecPtr (LSN), so if we change it we'll
  have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
  around as the on-disk representation, and convert between the 64-bit
  integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
  many xlog calculations would admittedly be simpler if it was an
  uint64.
  
  I am out of my depth here, not having read any of the relevant code,
  but couldn't we simply replace the lsn from disk with
  InvalidXLogRecPtr without marking the page dirty?
  
  There is the valid argument that you would lose some information when
  pages with hint bits are written out again, but on the other hand you
  would also gain the information that it was a hint-bit write...
  
  Sorry, I don't understand that. Where would you replace the LSN from
  disk with InvalidXLogRecPtr ? (and no, it probably won't work ;-) )
  
  In ReadBuffer_common or such, after reading a page from disk and
  verifying that the page has a valid header it seems to be possible to
  replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page
  dirty.
  If the page isn't modified the lsn on disk won't be changed so you don't
  lose debugging information in that case. It will be zero if it has been
  written by a hint-bit write, and that's arguably a win.
 
 We use the LSN to decide whether a full-page image need to be xlogged if
 the page is modified. If you reset LSN every time you read in a page,
 you'll be doing full page writes every time a page is read from disk and
 modified, whether or not it's the first time the page is modified after
 the last checkpoint.
Yea, I somehow disregarded that hint-bit writes would be a problem with full 
page writes in that case. Normal writes wouldn't be a problem...
This should be fixable but the result would be too ugly. So it's either 
converting the on-disk representation or keeping the duplicated 
representation.

pd_lsn seems to be well enough abstracted, doing the conversion in 
PageSet/GetLSN seems to be easy enough.
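
Roughly like this, as an illustration only (the accessor names are hypothetical and the mapping glosses over the fact that the historical xrecoff did not use its full 32-bit range):

#include <stdint.h>

/*
 * Illustrative only: hypothetical accessor names.  The on-disk pd_lsn stays
 * a two-field XLogRecPtr; callers see a single 64-bit value.
 */
typedef struct XLogRecPtrOnDisk
{
	uint32_t	xlogid;			/* log file #, the high half */
	uint32_t	xrecoff;		/* byte offset within the log file */
} XLogRecPtrOnDisk;

static inline uint64_t
PageGetLSN64(const XLogRecPtrOnDisk *pd_lsn)
{
	return ((uint64_t) pd_lsn->xlogid << 32) | pd_lsn->xrecoff;
}

static inline void
PageSetLSN64(XLogRecPtrOnDisk *pd_lsn, uint64_t lsn)
{
	pd_lsn->xlogid = (uint32_t) (lsn >> 32);
	pd_lsn->xrecoff = (uint32_t) lsn;
}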

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing 9.2 in ~production environment

2012-06-18 Thread Josh Berkus

 PE Compare the output of pg_config --configure from both installations.
 
 The only differences are 9.1 vs 9.2 in the paths.

Can you check the collations of the two databases?  I'm wondering if 9.1
is in C collation and 9.2 is something else.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Daniel Farina
On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund and...@2ndquadrant.com wrote:

 * Size of field. 16 bits is enough for 32,000 master nodes, which is
 quite a lot. Do we need that many? I think we may have need for a few
 flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
 8 bits. Even if we don't need them in this release, I'd like to have
 them. If they remain unused after a few releases, we may choose to
 redeploy some of them as additional nodeids in future. I don't foresee
 complaints that 256 master nodes is too few anytime soon, so we can
 defer that decision.
 I wished we had some flag bits available before as well. I find 256 nodes a
 pretty low value to start with though, 4096 sounds better though, so I would
 be happy with 4 flag bits. I think for cascading setups and such you want to
 add node ids for every node, not only masters...

 Any opinions from others on this?

What's the cost of going a lot higher?  Because if one makes enough
numerical space available, one can assign node identities without a
coordinator, a massive decrease in complexity.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Andres Freund
On Monday, June 18, 2012 11:51:27 PM Daniel Farina wrote:
 On Mon, Jun 18, 2012 at 8:50 AM, Andres Freund and...@2ndquadrant.com 
wrote:
  * Size of field. 16 bits is enough for 32,000 master nodes, which is
  quite a lot. Do we need that many? I think we may have need for a few
  flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
  8 bits. Even if we don't need them in this release, I'd like to have
  them. If they remain unused after a few releases, we may choose to
  redeploy some of them as additional nodeids in future. I don't foresee
  complaints that 256 master nodes is too few anytime soon, so we can
  defer that decision.
  
  I wished we had some flag bits available before as well. I find 256 nodes
  a pretty low value to start with though, 4096 sounds better though, so I
  would be happy with 4 flag bits. I think for cascading setups and such
  you want to add node ids for every node, not only masters...
  
  Any opinions from others on this?
 
 What's the cost of going a lot higher?  Because if one makes enough
 numerical space available, one can assign node identities without a
 coordinator, a massive decrease in complexity.
It would increase the size of every wal record. We just have 16bit left there 
by chance...
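
For concreteness, a sketch of the packing being debated (the names are hypothetical): 4 flag bits plus a 12-bit origin node id in those spare 16 bits, i.e. 2^12 = 4096 node ids:

#include <stdint.h>

/*
 * Hypothetical packing for illustration: 4 flag bits plus a 12-bit origin
 * node id in the spare 16 bits of the record header.
 */
#define XLR_ORIGIN_BITS		12
#define XLR_ORIGIN_MASK		((1u << XLR_ORIGIN_BITS) - 1)	/* 0x0fff */

static inline uint16_t
pack_origin(uint16_t node_id, uint16_t flags)
{
	return (uint16_t) ((flags << XLR_ORIGIN_BITS) | (node_id & XLR_ORIGIN_MASK));
}

static inline uint16_t
origin_id_of(uint16_t packed)
{
	return (uint16_t) (packed & XLR_ORIGIN_MASK);
}

static inline uint16_t
flags_of(uint16_t packed)
{
	return (uint16_t) (packed >> XLR_ORIGIN_BITS);
}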

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing 9.2 in ~production environment

2012-06-18 Thread James Cloos
 JB == Josh Berkus j...@agliodbs.com writes:

JB Can you check the collations of the two databases?  I'm wondering if 9.1
JB is in C collation and 9.2 is something else.

Thanks!

pg_dump -C tells me these two differences:

 -SET client_encoding = 'SQL_ASCII';
 +SET client_encoding = 'UTF8';

 -CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' 
LC_COLLATE = 'C' LC_CTYPE = 'C';
 +CREATE DATABASE dbm WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 
'C' LC_CTYPE = 'en_US.UTF-8';

for every db in the clusters.

I presume that lc_ctype is the significant difference?

LC_CTYPE *is* specified as 'C' in the dump from which I created the 9.2
cluster, so it must have been overridden by pg_restore.  I see that my
dist's /etc rc script now sets LC_CTYPE.  Would that explain why lc_ctype
changed between the two clusters?

Is there any way to alter a db's lc_ctype w/o dumping and restoring?  I
want to preserve some of the changes made since I copied the 9.1 cluster.
Alter database reports that lc_ctype cannot be changed.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Christopher Browne
On Mon, Jun 18, 2012 at 11:50 AM, Andres Freund and...@2ndquadrant.com wrote:
 Hi Simon,

 On Monday, June 18, 2012 05:35:40 PM Simon Riggs wrote:
 On 13 June 2012 19:28, Andres Freund and...@2ndquadrant.com wrote:
  This adds a new configuration parameter multimaster_node_id which
  determines the id used for wal originating in one cluster.

 Looks good and it seems this aspect at least is committable in this CF.
 I think we need to agree on the parameter name. It currently is
 'multimaster_node_id'. In the discussion with Steve we got to
 replication_node_id. I don't particularly like either.

 Other suggestions?

I wonder if it should be origin_node_id?  That is the term Slony uses.

 Design decisions I think we need to review are

 * Naming of field. I think origin is the right term, borrowing from Slony.
 I think it fits as well.

 * Can we add the origin_id dynamically to each WAL record? Probably no
 need, but lets consider why and document that.
 Not sure what you mean? Its already set in XLogInsert to
 current_replication_origin_id which defaults to the value of the guc?

 * Size of field. 16 bits is enough for 32,000 master nodes, which is
 quite a lot. Do we need that many? I think we may have need for a few
 flag bits, so I'd like to reserve at least 4 bits for flag bits, maybe
 8 bits. Even if we don't need them in this release, I'd like to have
 them. If they remain unused after a few releases, we may choose to
 redeploy some of them as additional nodeids in future. I don't foresee
 complaints that 256 master nodes is too few anytime soon, so we can
 defer that decision.
 I wished we had some flag bits available before as well. I find 256 nodes a
 pretty low value to start with though, 4096 sounds better though, so I would
 be happy with 4 flag bits. I think for cascading setups and such you want to
 add node ids for every node, not only masters...

Even though the number of nodes that can reasonably participate in
replication is likely to be not too terribly large, it might be good
to allow larger values, in case someone is keen on encoding something
descriptive in the node number.

If you restrict the number to a tiny range, then you'll be left
wanting some other mapping.  At one point, I did some work trying to
get a notion of named nodes implemented in Slony; gave up on it, as
the coordination process was wildly too bug-prone.

In our environment, at Afilias, we have used quasi-symbolic node
numbers that encoded something somewhat meaningful about the
environment.  That seems better to me than the risky kludge of
saying:
- The first node I created is node #1
- The second one is node #2.
- The third and fourth are #3 and #4
- I dropped node #2 due to a problem, and thus the new node 2 is #5.

That numbering scheme gets pretty anti-intuitive fairly quickly, from
whence we took the approach of having a couple digits indicating data
centre followed by a digit indicating which node in that data centre.

If that all sounds incoherent, well, the more nodes you have around,
the more difficult it becomes to make sure you *do* have a coherent
picture of your cluster.

I recall the Slony-II project having a notion of attaching a permanent
UUID-based node ID to each node.  As long as there is somewhere decent
to find a symbolically significant node name, I like the idea of the
ID *not* being in a tiny range, and being UUID/OID-like...

 Any opinions from others on this?

 * Do we want origin_id as a parameter or as a setting in pgcontrol?
 IIRC we go to a lot of trouble elsewhere to avoid problems with
 changing on/off parameter values. I think we need some discussion to
 validate where that should live.
 Hm. I don't really foresee any need to have it in pg_control. What do you want
 to protect against with that?
 It would need to be changeable anyway, because otherwise it would need to
 become a parameter for initdb which would suck for anybody migrating to use
 replication at some point.

 Do you want to protect against problems in replication setups after changing
 the value?

In Slony, changing the node ID is Not Something That Is Done.  The ID
is captured in *way* too many places to be able to have any hope of
updating it in a coordinated way.  I should be surprised if it wasn't
similarly troublesome here.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers reduced, v1

2012-06-18 Thread Alvaro Herrera

Excerpts from Dimitri Fontaine's message of Fri Jun 15 16:27:50 -0400 2012:

 The attached patch contains all the infrastructure for event triggers
 and also a first implementation of them for the event command_start,
 implemented in a single place in utility.c.
 
 The infrastructure is about:
 
  - new catalog
  - grammar for new commands
  - documentation skeleton
  - pg_dump support
  - psql support
  - ability to actually run user triggers
  - written in core languages
 (pl/c, pl/pgsql, pl/python, pl/perl, pl/tcl)
  - limited subcommand handling

Did you try REASSIGN OWNED and DROP OWNED with a role that has defined
some event triggers?

 Look, it's an easy little skinny patch to review, right:
 
 git --no-pager diff --shortstat master
  62 files changed, 4546 insertions(+), 108 deletions(-)

Skinny ... right.  I started to give it a look -- I may have something
useful to comment later.

 This patch includes regression tests that we worked on with Thom last
 rounds, remember that they only run in the serial schedule, which means
 with `make installcheck` only. Adding noisy output at random while the
 parallel schedule runs is a good way to break all the regression testing,
 so I've been avoiding that.

Hmm, I don't like the idea of a test that only runs in serial mode.
Maybe we can find some way to make it work in parallel mode as well.
I don't have anything useful to comment right now.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-06-18 Thread Misa Simic
2012/6/18 Kevin Grittner kevin.gritt...@wicourts.gov

 The many-to-one case seems like it is better handled in the other
 direction -- with the referenced table holding the set of valid keys
 and the referencing table holding the single key.  (I believe the
 general case of this is what Jeff called an inclusion constraint
 -- a feature he wants to add at some point.)  I can't think of a use
 case where that would not be better for this type of relationship
 than putting the array on the referencing side.


Hi Kevin,

Well, from my point of view, many-to-one or one-to-many depends more on
which side we are looking at the thing from... But let me better explain
what I meant...

Example 1)

If we have one table with master data, TableA(ID, other properties...), and
TableB(tableAIDFK, propA, propB, propC) - the PK of TableB is irrelevant at this
point... and let's say a lot of TableA tuples could have the same TableB
properties... Then we can either have as many tuples in TableB as there are such
common TableA tuples, all with the same values in propA, propB, and propC, with an FK of
the same type as in TableA, or have 1 tuple in TableB with an array type
as the FK field, so that it (1 tuple in TableB) can point to many tuples in
TableA... And at the same time an element can exist in TableA, but
does not have to exist in TableB...

Whether tests would show any gain from this approach - I don't know...
but I think they should, especially if propA, propB, and propC have to be updated
for all of them...

Example 2)
From other side, what Jeff propose, and what is also usefull, but different
thing is... to have the main data in TableA, but key field is an range
datatype... what later each element what belong to the range, could have
related tuple in TableB (Also, as the same range datatype - but smaller...
contained by Master one... or simple datatype subtype of the range) - which
is other way around... Opposite from exmaple 1 - but differnet from
functional point of view... Depending what is the Master... Also, for
example 1 - data in FK do not need to be in range.. so basicaly ID [1, 2 ,4
,7] could have 1 tuple with its properties, and [3,5,6] in second tuple
with different properties...

I am not sure whether Example 2 is what Jeff called an Inclusion Constraint - Jeff can
explain it better :)

Based on

Simon Riggs wrote:
 Do we need something like Exclusion FKs? i.e. the FK partner of
 Exclusion Constraints?

Yes, Inclusion Constraints. I've known we need something like that
since I did Exclusion Constraints, but I haven't gotten further than
that.

Regards,
   Jeff Davis

I have understood it as:

TableA(ID, properties...)
TableB(ID, properties...)

Now if we define an FK on TableB to TableA... it means that a row inserted in
TableB must already have a row with the same ID value in TableA...

But what would be useful is to define an Exclude FK to TableA, so that we prevent
inserting a new row in TableB with an ID value that already exists in TableA...

btw, if anyone is happy to point me in the right direction, and there is a common
feeling that it is a useful feature, I am happy to code it... Actually that is
something I will code anyway for an in-house solution - but it would be
good to do it under Postgres standards...

Kind Regards,

Misa


Re: [HACKERS] patch: avoid heavyweight locking on hash metapage

2012-06-18 Thread Robert Haas
On Fri, Jun 15, 2012 at 1:58 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Do we need the retry flag (applies to two places)?  If oldblkno is
 still InvalidBlockNumber then it can't equal blkno.

I was a bit concerned that blkno = BUCKET_TO_BLKNO(metap, bucket)
might set blkno to InvalidBlockNumber, thus possibly messing up the
algorithm.  This way seemed better in terms of not requiring any
assumption in that area.

 In the README, the pseudocode probably needs to mention re-locking
 the meta page in the loop.

Good point.  Fixed.

 Also, page is used to mean either the disk page or the shared buffer
 currently holding that page, depending on context.  This is confusing.
  Maybe we should clarify Lock the meta page buffer.  Of course this
 gripe precedes your patch and applies to other parts of the code as
 well, but since we mingle LW locks (on buffers) and heavy locks (on
 names of disk pages) it might make sense to be more meticulous here.

I attempted to improve this somewhat in the README, but I think it may
be more than I can do completely.

 exclusive-lock page 0 (assert the right to begin a split) is no
 longer true, nor is release X-lock on page 0

I think this is fixed.

 Also in the README, section To prevent deadlock we enforce these
 coding rules: would need to be changed as those rules are being
 changed.  But, should we change them at all?

 In _hash_expandtable, the claim But since we are only trylocking here
 it should be OK doesn't completely satisfy me.  Even a conditional
 heavy-lock acquire still takes a transient non-conditional LW Lock on
 the lock manager partition.  Unless there is a global rule that no one
 can take a buffer content lock while holding a lock manager partition
 lock, this seems dangerous.  Could this be redone to follow the
 pattern of heavy locking with no content lock, then reacquiring the
 buffer content lock to check to make sure we locked the correct
 things?  I don't know if it would be better to loop, or just give up,
 if the meta page changed underneath us.

Hmm.  That was actually a gloss I added on existing code to try to
convince myself that it was safe; I don't think that the changes I
made make that any more or less safe than it was before.  But the
question of whether or not it is safe is an interesting one.  I do
believe that your statement is true, though: lock manager locks - or
backend locks, for the fast-path - should be the last thing we lock.
In some cases we lock more than one lock manager partition lock, but
we never lock anything else afterwards, AFAIK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


hash-avoid-heavyweight-metapage-locks-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-18 Thread Jeff Janes
There was a regression introduced in 9.2 that affects the creation and
loading of lots of small tables in a single transaction.

It affects the loading of a pg_dump file which has a large number of
small tables (10,000 schemas, one table per schema, 10 rows per
table).  I did not test other schema configurations, so these
specifics might not be needed to invoke the problem.

It causes the loading of a dump with psql -1 -f  to run at half the
previous speed.  Speed of loading without the -1 is not changed.

The regression was introduced in 39a68e5c6ca7b41b, Fix toast table
creation.  Perhaps the slowdown is an inevitable result of fixing the
bug.

The regression was removed from 9_1_STABLE at commit dff178f8017e4412,
More cleanup after failed reduced-lock-levels-for-DDL feature.  It
is still present in 9_2_STABLE.

I don't really understand what is going on in these patches, but it
seems that either 9_1_STABLE now has a bug that was fixed and then
unfixed, or that 9_2_STABLE is slower than it needs to be.


The dump file I used can be obtained like this:

perl -le 'print "set client_min_messages=warning;"; print "create
schema foo$_; create table foo$_.foo$_ (k integer, v integer); insert
into foo$_.foo$_ select * from generate_series(1,10);" foreach
$ARGV[0]..$ARGV[0]+$ARGV[1]-1' 0 1 | psql > /dev/null ; pg_dump > 1.dump


Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-18 Thread Robert Haas
On Sat, Jun 16, 2012 at 3:03 PM, Steve Singer st...@ssinger.info wrote:
 I feel that in-core support to capture changes and turn them into change
 records that can be replayed on other databases, without relying on triggers
 and log tables, would be good to have.

 I think we want some flexible enough that people write consumers of the LCRs
 to do conflict resolution for multi-master but I am not sure that the
 conflict resolution support actually belongs in core.

I agree, on both counts.  Anyone else want to chime in here?

 Most of the complexity of slony (both in terms of lines of code, and issues
 people encounter using it) comes not from the log triggers or replay of the
 logged data but comes from the configuration of the cluster.
 Controlling things like

 * Which tables replicate from a node to which other nodes
 * How do you change the cluster configuration on a running system (adding
 nodes, removing nodes, moving the origin of a table, adding tables to
 replication etc...)

Not being as familiar with Slony as I probably ought to be, I hadn't
given this much thought, but it's an interesting point.  The number of
logical replication policies that someone might want to implement, and
the ways in which they might want to change them as the situation
develops, is very large.  Whole cluster, whole database, one or
several schemas, individual tables, perhaps even more fine-grained
than per-table.  Trying to figure all of that out is going to require
a lot of work and, frankly, I question the value of having that stuff
in core anyway.

 I see three catalogs in play here.
 1. The catalog on the origin
 2. The catalog on the proxy system (this is the catalog used to translate
 the WAL records to LCR's).  The proxy system will need essentially the same
 pgsql binaries (same architecture, important compile flags etc..) as the
 origin
 3. The catalog on the destination system(s).

 The catalog 2 must be in sync with catalog 1, catalog 3 shouldn't need to be
 in-sync with catalog 1.   I think catalogs 2 and 3 are combined in the
 current patch set (though I haven't yet looked at the code closely).   I
 think the performance optimizations Andres has implemented to update tuples
 through low-level functions should be left for later and that we should  be
 generating SQL in the apply cache so we don't start assuming much about
 catalog 3.

+1.  Although there is a lot of performance benefit to be had there,
it seems better to me to get the basics working and then do
performance optimization later.  That is, if we can detect that the
catalogs are in sync, then by all means ship around the binary tuple
to make things faster.  But requiring that (without having any way to
know whether it actually holds) strikes me as a mess.

 Part of what people expect from a robust in-core solution is that it should
 work with the other in-core features.  If we have to list a bunch of
 in-core type as being incompatible with logical replication then people will
 look at logical replication with the same 'there be dragons here' attitude
 that scare many people away from the existing third party replication
 solutions.   Non-core or third party user defined types are a slightly
 different matter because we can't control what they do.

I agree, although I don't think either Andres or I are saying anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-18 Thread Robert Haas
On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund and...@2ndquadrant.com wrote:
  Hm. Yes, you could do that. But I have to say I don't really see a point.
  Maybe the fact that I do envision multimaster systems at some point is
  clouding my judgement though as it's far less easy in that case.
 Why?  I don't think that particularly changes anything.
 Because it makes conflict detection very hard. I also don't think it's a
 feature worth supporting. What's the use-case of updating records you cannot
 properly identify?

Don't ask me; I just work here.  I think it's something that some
people want, though.  I mean, if you don't support replicating a table
without a primary key, then you can't even run pgbench in a
replication environment.  Is that an important workload?  Well,
objectively, no.  But I guarantee you that other people with more
realistic workloads than that will complain if we don't have it.
Absolutely required on day one?  Probably not.  Completely useless
appendage that no one wants?  Not that, either.

 In my view, a logical replication solution is precisely one in which
 the catalogs don't need to be in sync.  If the catalogs have to be in
 sync, it's not logical replication.  ISTM that what you're talking
 about is sort of a hybrid between physical replication (pages) and
 logical replication (tuples) - you want to ship around raw binary
 tuple data, but not entire pages.
 Ok, that's a valid point. Simon argued at the cluster summit that everything
 that's not physical is logical. Which has some appeal because it seems hard to
 agree on what exactly logical rep is. So definition by exclusion makes kind of
 sense ;)

Well, the words are fuzzy, but I would define logical replication to
be something which is independent of the binary format in which stuff
gets stored on disk.  If it's not independent of the disk format, then
you can't do heterogenous replication (between versions, or between
products).  That precise limitation is the main thing that drives
people to use anything other than SR in the first place, IME.

 I think what you categorized as hybrid logical/physical rep solves an
 important use-case that's very hard to solve at the moment. Before my
 2ndquadrant days I had several clients which had huge problems using the trigger 
 based solutions because their overhead simply was too big a burden on the 
 master. They couldn't use SR either because every consuming database kept
 loads of local data.
 I think such scenarios are getting more and more common.

I think this is to some extent true, but I also think you're
conflating two different things.  Change extraction via triggers
introduces overhead that can be eliminated by reconstructing tuples
from WAL in the background rather than forcing them to be inserted
into a shadow table (and re-WAL-logged!) in the foreground.  I will
grant that shipping the tuple as a binary blob rather than as text
eliminates additional overhead on both ends, but it also closes off a
lot of important use cases.  As I noted in my previous email, I think
that ought to be a performance optimization that we do, if at all,
when it's been proven safe, not a baked-in part of the design.  Even a
solution that decodes WAL to text tuples and ships those around and
reinserts them via pure SQL should be significantly faster than the
replication solutions we have today; if it isn't, something's wrong.

 The problem with that is it's going to be tough to make robust.  Users could
 easily end up with answers that are total nonsense, or probably even crash
 the server.
 Why?

Because the routines that decode tuples don't include enough sanity
checks to prevent running off the end of the block, or even the end of
memory completely.  Consider a corrupt TOAST pointer that indicates
that there is a gigabyte of data stored in an 8kB block.  One of the
common symptoms of corruption IME is TOAST requests for -3 bytes of
memory.

And, of course, even if you could avoid crashing, interpreting what
was originally intended as a series of int4s as a varlena isn't likely
to produce anything terribly meaningful.  Tuple data isn't
self-identifying; that's why this is such a hard problem.

 To step back and talk about DDL more generally, you've mentioned a few
 times the idea of using an SR instance that has been filtered down to
 just the system catalogs as a means of generating logical change
 records.  However, as things stand today, there's no reason to suppose
 that replicating anything less than the entire cluster is sufficient.
 For example, you can't translate enum labels to strings without access
 to the pg_enum catalog, which would be there, because enums are
 built-in types.  But someone could supply a similar user-defined type
 that uses a user-defined table to do those lookups, and now you've got
 a problem.  I think this is a contractual problem, not a technical
 one.  From the point of view of logical replication, it would be nice
 if type output functions were basically guaranteed to 

Re: [HACKERS] Pg default's verbosity?

2012-06-18 Thread Robert Haas
On Sat, Jun 16, 2012 at 2:56 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 The argument for defaulting to NOTICE is the same as it's always been:
 that those messages are really intended for novices, and a pretty good
 definition of a novice is somebody who doesn't know how to (or that he
 should) change the verbosity setting.  So if we don't show notices by
 default, they will be unavailable to exactly the people who need them.
 Your proposal does not overcome this argument.

 I'm sceptical about what a real novice is expected to do about the
 incredible flow of useless information displayed when loading a significant
 script. For a start, s?he should be an incredibly fast reader:-)

 For a non-novice it just hides what is important and should be seen.

There might be something to the idea of demoting a few of the things
we've traditionally had as NOTICEs, though.  IME, the following two
messages account for a huge percentage of the chatter:

NOTICE:  CREATE TABLE will create implicit sequence foo_a_seq for
serial column foo.a
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
foo_pkey for table foo

I'm not going to claim that nobody in the history of the world has
ever benefited from those notices ... but I would be willing to bet
that a large majority of the people, in a large majority of the cases,
do not care.  And getting rid of them would surely make warnings and
notices that might actually be of interest to the user a lot more
visible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] measuring spinning

2012-06-18 Thread Robert Haas
On Sat, Jun 16, 2012 at 6:25 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Well, this fell through the cracks, because I forgot to add it to the
 January CommitFest.  Here it is again, rebased.

 This applies and builds cleanly and passes make check (under enable-cassert).

 No tests or docs are needed for a patch of this nature.

 It does what it says, and we want it.

 I wondered if the change in the return signature of s_lock would have
 an effect on performance.  So I've run a series of pgbench -T 30 -P
 -c8 -j8, at a scale of 30 which fits in shared_buffers, using an
 Amazon c1.xlarge
 (8 cores).  I ran both HEAD, and HEAD+patch (without LWLOCK_STATS in
 both cases), in random ordering.  The patch was 0.37% slower, average
 298483 selects per second patched vs. 299582 for HEAD.  The difference is
 probably real (p value 0.042, one sided.) but is also pretty much
 negligible and could just be due to where the executable code falls in
 the cache lines which could move around with other changes to the
 code.

I found this a bit surprising, so I retested on the IBM POWER7 box at
concurrencies from 1 to 64 and found some test runs faster and some
test runs slower - maybe a few more faster than slower.   I could do a
more exact analysis, but I'm inclined to think that if there is an
effect here, it's probably just in the noise, and that the real effect
you measured was, as you say, the result of cache-line shifting or
some other uninteresting artifact that could just as easily move back
the other way on some other commit.  Really, s_lock should not be
getting called often enough to matter much, and ignoring the return
value of that function shouldn't cost anyone much.

 Two suggestions:

 In your original email you say number of pg_usleep() calls that are
 required to acquire each LWLock, but nothing in the code says this.
 Just reading lwlock.c I would naively assume it is reporting the
 number of TAS spins, not the number of spin-delays (and in fact that
 is what I did assume until I read your email more carefully).  A
 comment somewhere in lwlock.c would be helpful.

Yeah, or maybe I should just change the printout to say spindelay
instead of spin, and rename the variables likewise.

 Also in lwlock.c,

        if (sh_acquire_counts[i] || ex_acquire_counts[i] ||
 block_counts[i] || spin_counts[i])

 I don't think we can have spins (or blocks, for that matter) unless we
 have some acquires to have caused them, so the last two tests in that
 line seem to be noise.

Perhaps so, but I think it's probably safe and clear to just follow
the existing code pattern.

 Since my suggestions are minor, should I go ahead and mark this ready
 for committer?

If you think it should be committed, yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Steve Singer

On 12-06-18 07:30 AM, Andres Freund wrote:


Hrmpf #666. I will go through through the series commit-by-commit again to
make sure everything compiles again. Reordering this late definitely wasn't a
good idea...

I pushed a rebased version with all those fixups (and removal of the
zeroRecPtr patch).


Where did you push that rebased version to? I don't see an attachment, 
or an updated patch in the commitfest app and your repo at 
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary 
hasn't been updated in 5 days.




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-18 Thread Steve Singer

On 12-06-18 11:50 AM, Andres Freund wrote:

Hi Simon,



I think we need to agree on the parameter name. It currently is
'multimaster_node_id'. In the discussion with Steve we got to
replication_node_id. I don't particularly like either.

Other suggestions?

Other things that come to mind (for naming this parameter in the 
postgresql.conf)


node_id
origin_node_id
local_node_id

I wished we had some flag bits available before as well. I find 256 nodes a
pretty low value to start with though, 4096 sounds better though, so I would
be happy with 4 flag bits. I think for cascading setups and such you want to
add node ids for every node, not only masters...

Any opinions from others on this?



256 sounds a bit low to me as well.  Sometimes the use case of a retail 
chain comes up where people want each store to have a postgresql 
instance and replicate back to a central office.  I can think of many 
chains with more than 256 stores.





Thanks,

Andres



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Libxml2 load error on Windows

2012-06-18 Thread Robert Haas
On Mon, Jun 18, 2012 at 5:08 AM, Talha Bin Rizwan
talha.riz...@gmail.com wrote:
 PostgreSQL 9.2 Windows build is having trouble with the XML support:
 http://postgresql.1045698.n5.nabble.com/9-2-beta1-libxml2-can-t-be-loaded-on-Windows-td5710672.html

 postgres=# SELECT xml '<foo>bar</foo>';
 ERROR:  could not set up XML error handler
 HINT: This probably indicates that the version of libxml2 being used is not
 compatible with the libxml2 header files that PostgreSQL was built with.

 HAVE_XMLSTRUCTUREDERRORCONTEXT should be defined if libxml2
 has xmlStructuredErrorContext (introduced after 2.7.3). It is being set for
 Linux in pg_config.h:

 /* Define to 1 if your libxml has xmlStructuredErrorContext. */
 #define HAVE_XMLSTRUCTUREDERRORCONTEXT 1

 For Windows build, we need to set it in src/tools/msvc/Solution.pm.

 I guess there are two approaches to get the libxml2 version:

 1) Parse LIBXML_INST/include/libxml/xmlversion.h to get the
 libxml2 version number, this header file is included in libxml2.

 2) We may also get the version by running a command line utility
 i.e LIBXML_INST/bin/xmllint --version. The xmllint program parses one
 or more XML files and it is also included in libxml2.

 I believe it is safe to assume that the libxml2 include folder would contain
 the xmlversion.h file containing the required version number. It might be a
 touch slower to parse the file line by line, but similar functionality has
 been used elsewhere as well. We obviously rely on xml include files (with
 xml enabled), so there is definitely a tighter dependency on include files
 than on the xmllint executable, as it is not mandatory for the build process.

 Therefore, I used the first approach to get the libxml2 version number and include
 the #define HAVE_XMLSTRUCTUREDERRORCONTEXT in pg_config.h if
 the libxml2 version is greater than 20703 (which effectively is 2.7.3). Please
 find attached patch libxml2_win_v1.patch.
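
 The equivalent compile-time gate, shown in C for illustration (the patch itself derives this from xmlversion.h in src/tools/msvc/Solution.pm rather than in C):

 #include <libxml/xmlversion.h>

 /* xmlStructuredErrorContext appeared after libxml2 2.7.3 (LIBXML_VERSION 20703) */
 #if LIBXML_VERSION > 20703
 #define HAVE_XMLSTRUCTUREDERRORCONTEXT 1
 #endif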

Please add this patch here so that it doesn't get lost in the shuffle:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw in contrib

2012-06-18 Thread Robert Haas
On Mon, Jun 18, 2012 at 12:24 PM, Merlin Moncure mmonc...@gmail.com wrote:
 I can't help but wonder (having been down the contrib/core/extension
 road myself) if it isn't better to improve the facilities to register
 and search for qualified extensions (like Perl CPAN) so that people
 looking for code to improve their backends can find it.  That way,
 you're free to release/do xyz/abandon your project as you see fit
 without having to go through -hackers.  This should also remove a lot
 of the stigma with not being in core since if stock postgres
 installations can access the necessary modules via CREATE EXTENSION, I
 think it will make it easier for projects like this to get used with
 the additional benefit of decentralizing project management.

Well, if you're the type that likes to install everything via RPMs
(and I am) then you wouldn't want this behavior, especially not
automagically.  It seems to open up a host of security risks, too: I
believe Tom has previously stated that Red Hat (and other distro)
packaging guidelines forbid packages from installing software in
places where they can then turn around and run it.  I suppose CPAN
must have some sort of exception to this policy, though, so maybe it's
not ironclad.

Still, it seems like a bit of a distraction in this case: I think we
want to have at least one FDW in core that actually talks to some
other database server, rather than just to a file, and it seems like
pgsql_fdw is the obvious choice by a mile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: autocomplete for functions

2012-06-18 Thread Josh Kupershmidt
On Sun, Feb 19, 2012 at 12:10 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Hello

 I found so this extremely simple patch should be useful.

 It helps for pattern SELECT fx();

 There was thread about it.

Hi Pavel,
I signed up to be reviewer for this patch, and finally got around to
taking a look. This thread, and the thread about Peter's earlier patch
along the same lines have gotten a bit muddled, so allow me some recap
for my own sanity.

The first point to be addressed is that Pavel's patch is basically a
subset of Peter's earlier[1] patch. Pavel's patch autocompletes

  SELECT <TAB>

with possible function names. Peter's patch autocompletes both
possible column names and possible function names. So, which version,
if any, would we want? Both Tom[2] and depesz[3] have asked for column
names to be autocompleted if we're going to go down this road, and I
personally would find completion of column names handy. Others [5][6]
have asked for function names to be (also?) autocompleted here, so it
seems reasonable that we'd want to include both.

I counted two general objections to Peter's patch in these threads, namely:

 1.) Complaints about the tab-completion not covering all cases,
possibly leading to user frustration at our inconsistency. [2] [4]
 2.) Concerns that the tab-completion wouldn't be useful given how
many results we'd have from system columns and functions [7]

I do agree these are two legitimate concerns. However, for #1, our
tab-completion is and has always been incomplete. I take the basic
goal of the tab-completion machinery to be: offer a suggestion when
we're pretty sure we know what the user wants, otherwise stay quiet.
There are all sorts of places where our reliance on inspecting back
only a few words into the current line and not having a true command
parser prevents us from being able to make tab-completion guesses, but
that's been accepted so far, and I don't think it's fair to raise the
bar for this patch.

Re: concern #2, Tom complained about there being a bunch of possible
column and function completions in the regression test database. That
may be true, but if you look at this slightly-modified version of the
query Peter's patch proposes:

SELECT substring(name, 1, 3) AS sub, COUNT(*)
  FROM (
SELECT attname FROM pg_attribute WHERE NOT attisdropped
UNION
SELECT proname || '(' FROM pg_proc p WHERE
pg_catalog.pg_function_is_visible(p.oid)) t (name)
  GROUP BY sub ORDER BY COUNT(*) DESC;

I count only 384 distinct 3-length prefixes in an empty database,
thanks to many built-in columns and functions sharing the same prefix
(e.g. int or pg_). Obviously, there is plenty of room left for
3-length prefixes, out of the 27^3 or more possibilities. In some
casual mucking around in my own databases, I found the
column-completion rather useful, and typing 3 characters of a
column-name to be sufficient to give matches which were at least
non-builtin attributes, and often a single unique match.

There were some ideas down-thread about how we might filter out some
of the noise in these completions, which would be interesting. I'd be
happy with the patch as-is though, modulo the attisdropped and
pg_function_is_visible() tweaks to the query.

Josh


[1] 
http://archives.postgresql.org/message-id/1328820579.11241.4.camel%40vanquo.pezone.net
[2] http://archives.postgresql.org/message-id/7745.1328855069%40sss.pgh.pa.us
[3] http://www.depesz.com/2011/07/08/wish-list-for-psql/
[4] http://archives.postgresql.org/message-id/13612.1328887227%40sss.pgh.pa.us
[5] 
http://archives.postgresql.org/message-id/CA%2BTgmoY7wRGgBbFhKwfASqrNOPwZwCjfm1AhL82769Xx-SyduA%40mail.gmail.com
[6] 
http://archives.postgresql.org/message-id/20120210140637.GB19783%40ldn-qws-004.delacy.com
[7] http://archives.postgresql.org/message-id/9966.1331920074%40sss.pgh.pa.us

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: autocomplete for functions

2012-06-18 Thread Josh Kupershmidt
On Mon, Mar 19, 2012 at 1:01 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 I'm rather of the contrary opinion -- surely if we're going to complete
 function names, we should only complete those that are in schemas in the
 path; similarly for column names.

I think it makes sense to only include currently-visible functions,
but *not* only columns from currently visible tables, since we won't
know yet whether the user intends to schema-qualify the table name.

  (BTW I didn't check but does this
 completion work if I schema-qualify a column name?)

Peter's proposed tab-completion only kicks in for the column-name
itself. Keep in mind, the user might be trying to enter:
  SELECT  schema.table.column ...
  SELECT  table.column ...
  SELECT  table_alias.column ...
  SELECT  column ...

and presumably want to tab-complete the second token somehow. I'm a
bit leery about trying to tab-complete those first two, and the third
is right out. Just having the fourth would make me happy.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-18 Thread Tom Lane
Amit Kapila amit.kap...@huawei.com writes:
 AFAIR you can create pg_control from scratch already with pg_resetxlog.
 The hard part is coming up with values for the counters, such as the
 next WAL location.  Some of them such as next OID are pretty harmless
 if you don't guess right, but I'm worried that wrong next WAL could
 make things worse not better.

 I believe if WAL files are proper as mentioned in Alvaro's mail, the
 proposed logic should generate correct values.

I've got a problem with the assumption that, when pg_control is trash,
megabytes or gigabytes of WAL can still be relied on completely.

I'm almost inclined to suggest that we not get next-LSN from WAL, but
by scanning all the pages in the main data store and computing the max
observed LSN.  This is clearly not very attractive from a performance
standpoint, but it would avoid the obvious failure mode where you lost
some recent WAL segments along with pg_control.
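
To make that concrete, a rough standalone sketch of such a scan over a single relation segment file, assuming the 9.2-era page layout where pd_lsn ({xlogid, xrecoff}) occupies the first 8 bytes of each 8kB page; a real tool would walk every segment under the data directory and cope with zeroed or torn pages:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/*
 * Rough sketch only, not a supported tool: track the highest pd_lsn over
 * the 8kB pages of one relation segment file.
 */
int
main(int argc, char **argv)
{
	FILE	   *fp;
	char		page[BLCKSZ];
	uint32_t	max_xlogid = 0;
	uint32_t	max_xrecoff = 0;

	if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: maxlsn relation-segment-file\n");
		return 1;
	}

	while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
	{
		uint32_t	xlogid;
		uint32_t	xrecoff;

		memcpy(&xlogid, page, sizeof(xlogid));			/* pd_lsn.xlogid */
		memcpy(&xrecoff, page + 4, sizeof(xrecoff));	/* pd_lsn.xrecoff */

		if (xlogid > max_xlogid ||
			(xlogid == max_xlogid && xrecoff > max_xrecoff))
		{
			max_xlogid = xlogid;
			max_xrecoff = xrecoff;
		}
	}
	fclose(fp);

	printf("max observed LSN: %X/%X\n",
		   (unsigned int) max_xlogid, (unsigned int) max_xrecoff);
	return 0;
}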

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions over pathological TCP connections

2012-06-18 Thread Tom Lane
Leon Smith leon.p.sm...@gmail.com writes:
 Out of (mostly idle) curiousity,  when exactly does a transaction commit,
 especially with respect to a TCP connection that a pathological demon will
 cut off at the worst possible moment?

It's committed when the XLOG_XACT_COMMIT WAL record reaches disk,
if by committed you mean is guaranteed to still be visible following
recovery from a subsequent database or operating system crash.  If you
have some other definition of committed in mind, please say what.

 My question is,  would it be theoretically possible for an element of a
 queue to become marked but not delivered,  or delivered and not marked,  if
 the TCP connection between the backend and client was interrupted at the
 worst possible moment?

The transaction would be committed before a command success report is
delivered to the client, so I don't think delivered-and-not-marked is
possible.  The other direction is possible, if the connection drops
after the transaction is committed but before the completion report
can be delivered to the client.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers