[HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Peter Geoghegan
While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():

/*
 * If this page was incompletely split, finish the split now. We
 * do this while holding a lock on the left sibling, which is not
 * good because finishing the split could be a fairly lengthy
 * operation.  But this should happen very seldom.
 */
if (P_INCOMPLETE_SPLIT(lpageop))
{
_bt_finish_split(rel, rbuf, stack);
rbuf = InvalidBuffer;
continue;
}

The left sibling referred to here is the first page this key could
be on, an important concept for unique index enforcement. It's the
first sibling iff we're on our first iteration of the nested for(;;)
loop in _bt_findinsertloc(). So the buffer lock held on this left
sibling may constitute what in the past I've called a value lock;
we've established the right to insert our value into the unique index
at this point, and the lock will only be released when we're done
(regardless of whether or not that buffer/page value lock is on the
buffer/page we'll ultimately insert into, or an earlier one).

Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.

* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a value lock in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original value
lock throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original value lock. A physical insertion has yet to occur.

* A third inserter of the value comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this). It exclusive locks that later page/buffer before the second guy
gets a chance to lock it once again. It establishes the right to
insert with _bt_check_unique(), undeterred by the second inserter's
buffer lock/value lock. The value lock is effectively skipped over.

* Both the second and third inserters have established the right to
insert the same value, 7, and both do so. The unique index has an
MVCC-snapshot-wise spurious duplicate, and so is corrupt.

Regardless of whether or not I have these details exactly right (that
is, regardless of whether or not this scenario is strictly possible) I
suggest as a code-hardening measure that _bt_findinsertloc() release
its value lock, upon realizing it must complete splits, and then
complete the split or splits known to be required. It would finally
report that it couldn't find an insertion location to
_bt_doinsert(), which would then retry from the start, just like when
_bt_check_unique() finds an inconclusive conflict. The only difference
is that we don't have an xact to wait on. We haven't actually done
anything extra that makes this later "goto top;" any different from the
existing one.

This should occur so infrequently that it isn't worth trying harder,
or worth differentiating between the UNIQUE_CHECK_NO and
!UNIQUE_CHECK_NO cases when retrying. This also removes the more
general risk of holding an extra buffer lock during page split
completion.

It kind of looks like _bt_findinsertloc() doesn't have this bug,
because in my scenario _bt_finish_split() is called with both the
value lock and its right page locked (so the right page is the left
page for _bt_finish_split()'s purposes). But take another look:
_bt_finish_split() releases its locks, so once again only the original
value lock is held when _bt_finish_split() returns, and the new
downlink then makes it possible to skip past the value-locked page.
So the bug does exist (assuming that I haven't failed to consider some
third factor and am not otherwise mistaken).

Thoughts?
-- 
Peter Geoghegan



Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Simon Riggs
On 25 May 2014 17:52, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 Here's an idea I tried to explain to Andres and Simon at the pub last night,
 on how to reduce the spikes in the amount of WAL written at beginning of a
 checkpoint that full-page writes cause. I'm just writing this down for the
 sake of the archives; I'm not planning to work on this myself.
...

Thanks for that idea, and dinner. It looks useful.

I'll call this idea Background FPWs

 Now, I'm sure there are issues with this scheme I haven't thought about, but
 I wanted to get this written down. Note this does not reduce the overall WAL
 volume - on the contrary - but it ought to reduce the spike.

The requirements we were discussing were around

A) reducing WAL volume
B) reducing foreground overhead of writing FPWs - which spikes badly
after checkpoint and the overhead is paid by the user processes
themselves
C) need for FPWs during base backup

So that gives us a few approaches

* Compressing FPWs gives A
* Background FPWs gives us B
   which look like we can combine both ideas

* Double-buffering would give us A and B, but not C
   and would be incompatible with other two ideas

Will think some more.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-05-27 Thread ash

David Fetter da...@fetter.org writes:
 
  Also worth considering: functions which take any part of the view
  as a parameter.
 
 Sorry, I don't get it: do you suggest we should re-create dependent
 functions too?

 I'd throw an error in cases where such functions had an obvious and
 deterministic dependency on the views, ideally having gone through all
 such functions first and enumerated them in the error message.

Then it would also make sense to start with checking function dependency
on the tables themselves, not only the joining views:

psql=> CREATE TABLE t(id INT);
CREATE TABLE
psql=> CREATE FUNCTION func1() RETURNS SETOF INT AS $$ SELECT id FROM t; $$
LANGUAGE SQL;
CREATE FUNCTION
psql=> ALTER TABLE t ALTER COLUMN id TYPE BIGINT;
ALTER TABLE
-- Would complain on func1 right away

psql=> SELECT func1();
ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is bigint.
CONTEXT:  SQL function "func1" during startup

psql=> CREATE FUNCTION func2() RETURNS SETOF INT AS $$ SELECT id FROM t; $$
LANGUAGE SQL;
ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is bigint.
CONTEXT:  SQL function "func2"

--
Alex




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Matteo Beccati
Hi Tom,

On 27/05/2014 03:07, Tom Lane wrote:
 I've verified functionality of this patch on these scenarios:
 
 (1) --with-ossp-uuid on RHEL6, using uuid-1.6.1-10.el6.x86_64
 (2) --with-linux-uuid on RHEL6, using libuuid-2.17.2-12.14.el6_5.x86_64
 (3) --with-linux-uuid on OS X 10.9.3, Intel
 (4) --with-linux-uuid on OS X 10.4.11, PPC (hence, bigendian)
 
 I do not have a machine on which to try --with-bsd-uuid, so it's
 possible I broke that portion of Matteo's patch.  Would someone try
 that case on a FreeBSD box?

I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. I
could fire up some virtual machines with FreeBSD and other BSD flavours,
but maybe some buildfarm animals could be used for that.

I'm attaching a little patch to be applied on top of yours.

I didn't notice that "buf ? 13 : 0" was raising a warning about the
condition being always true on BSD. I guess it's safe to hardcode 13, as
the argument is ignored anyway with ossp, so I've fixed that.

I've also fixed v1mc generation on linux to match the OSSP and BSD
variants and added a regression test for it.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/
diff --git a/contrib/uuid-ossp/expected/uuid_ossp.out 
b/contrib/uuid-ossp/expected/uuid_ossp.out
index f393e86..c14db22 100644
--- a/contrib/uuid-ossp/expected/uuid_ossp.out
+++ b/contrib/uuid-ossp/expected/uuid_ossp.out
@@ -77,3 +77,18 @@ SELECT uuid_generate_v4()::text ~ 
'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]
  t
 (1 row)
 
+DO $_$
+DECLARE
+  u text;
+  i int;
+  c int;
+BEGIN
+  FOR i in 1..32 LOOP
+    u := substr(uuid_generate_v1mc()::text, 25, 2);
+    EXECUTE 'SELECT x''' || u || '''::int & 3' INTO c;
+    IF c <> 3 THEN
+      RAISE WARNING 'v1mc broken';
+    END IF;
+  END LOOP;
+END;
+$_$;
diff --git a/contrib/uuid-ossp/sql/uuid_ossp.sql 
b/contrib/uuid-ossp/sql/uuid_ossp.sql
index 8f22417..61a44a8 100644
--- a/contrib/uuid-ossp/sql/uuid_ossp.sql
+++ b/contrib/uuid-ossp/sql/uuid_ossp.sql
@@ -17,3 +17,20 @@ SELECT uuid_generate_v3(uuid_ns_dns(), 'www.widgets.com');
 SELECT uuid_generate_v5(uuid_ns_dns(), 'www.widgets.com');
 
 SELECT uuid_generate_v4()::text ~ 
'^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$';
+
+DO $_$
+DECLARE
+  u text;
+  i int;
+  c int;
+BEGIN
+  FOR i in 1..32 LOOP
+    u := substr(uuid_generate_v1mc()::text, 25, 2);
+    EXECUTE 'SELECT x''' || u || '''::int & 3' INTO c;
+    IF c <> 3 THEN
+      RAISE WARNING 'v1mc broken';
+    END IF;
+  END LOOP;
+END;
+$_$;
+
diff --git a/contrib/uuid-ossp/uuid-ossp.c b/contrib/uuid-ossp/uuid-ossp.c
index bc29ade..7803dbe 100644
--- a/contrib/uuid-ossp/uuid-ossp.c
+++ b/contrib/uuid-ossp/uuid-ossp.c
@@ -460,6 +460,10 @@ uuid_generate_v1mc(PG_FUNCTION_ARGS)
uuid_t  uu;
 
uuid_generate_random(uu);
+
+   /* set IEEE802 multicast and local-admin bits */
+   ((dce_uuid_t *)uu)->node[0] |= 0x03;
+
uuid_unparse(uu, strbuf);
buf = strbuf + 24;
 #else  /* BSD */
@@ -472,7 +476,7 @@ uuid_generate_v1mc(PG_FUNCTION_ARGS)
 #endif
 
return uuid_generate_internal(UUID_MAKE_V1 | UUID_MAKE_MC, NULL,
-   buf, buf ? 13 : 0);
+   buf, 13);
 }
 
 



Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Fujii Masao
On Tue, May 27, 2014 at 3:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 25 May 2014 17:52, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 Here's an idea I tried to explain to Andres and Simon at the pub last night,
 on how to reduce the spikes in the amount of WAL written at beginning of a
 checkpoint that full-page writes cause. I'm just writing this down for the
 sake of the archives; I'm not planning to work on this myself.
 ...

 Thanks for that idea, and dinner. It looks useful.

 I'll call this idea Background FPWs

 Now, I'm sure there are issues with this scheme I haven't thought about, but
 I wanted to get this written down. Note this does not reduce the overall WAL
 volume - on the contrary - but it ought to reduce the spike.

 The requirements we were discussing were around

 A) reducing WAL volume
 B) reducing foreground overhead of writing FPWs - which spikes badly
 after checkpoint and the overhead is paid by the user processes
 themselves
 C) need for FPWs during base backup

 So that gives us a few approaches

 * Compressing FPWs gives A
 * Background FPWs gives us B
which look like we can combine both ideas

 * Double-buffering would give us A and B, but not C
and would be incompatible with other two ideas

Double-buffering would allow us to disable FPWs safely, but it would make
recovery slow. So if we adopt double-buffering, I think that we would also
need to overhaul recovery.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Heikki Linnakangas

On 05/26/2014 11:15 PM, Robert Haas wrote:

On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

I don't think we know that. The server might have crashed before that
second record got generated.  (This appears to be an unfixable flaw in
this proposal.)


The second record is generated before the checkpoint is finished and the 
checkpoint record is written.  So it will be there.

(if you crash before the checkpoint is finished, the in-progress checkpoint is 
no good for recovery anyway, and won't be used)


Hmm, I see.

It's not great to have to generate WAL at buffer-eviction time,
though.  Normally, when we go to evict a buffer, the WAL is already
written.  We might have to wait for it to be flushed, but if the WAL
writer is doing its job, hopefully not.  But here we'll definitely
have to wait for the WAL flush.


Yeah. You would want to batch the flushes somehow, instead of flushing 
the WAL for every buffer being flushed. For example, after writing the 
FPW WAL record, just continue with the checkpoint without flushing the 
buffer, and do a second pass later doing buffer flushes.
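
A minimal standalone sketch of that two-pass idea, as a toy model in C (not
PostgreSQL code; emit_fpw_record() and xlog_flush() are hypothetical stand-ins
for the real WAL machinery):

#include <stdio.h>
#include <stdbool.h>

#define NBUFFERS 8

static long wal_insert_lsn = 100;   /* pretend WAL insert position */
static long wal_flushed_lsn = 0;

/* hypothetical: log a full-page image for one buffer, return its LSN */
static long emit_fpw_record(int buf)
{
    (void) buf;
    return ++wal_insert_lsn;
}

/* hypothetical: flush WAL up to the given LSN */
static void xlog_flush(long upto)
{
    if (upto > wal_flushed_lsn)
        wal_flushed_lsn = upto;
}

int main(void)
{
    bool dirty[NBUFFERS] = {true, false, true, true, false, true, false, true};
    long page_lsn[NBUFFERS] = {0};
    long max_lsn = 0;
    int  i;

    /* Pass 1: emit FPW records for every dirty buffer, defer the data writes */
    for (i = 0; i < NBUFFERS; i++)
        if (dirty[i])
        {
            page_lsn[i] = emit_fpw_record(i);
            if (page_lsn[i] > max_lsn)
                max_lsn = page_lsn[i];
        }

    /* One batched WAL flush instead of one flush per evicted buffer */
    xlog_flush(max_lsn);

    /* Pass 2: the buffers themselves can now be written out safely */
    for (i = 0; i < NBUFFERS; i++)
        if (dirty[i] && page_lsn[i] <= wal_flushed_lsn)
            printf("writing buffer %d (FPW at LSN %ld)\n", i, page_lsn[i]);

    return 0;
}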


- Heikki




Re: [HACKERS] postgres_fdw and connection management

2014-05-27 Thread Sandro Santilli
On Tue, May 27, 2014 at 12:32:50AM -0300, Fabrízio de Royes Mello wrote:
 On Mon, May 26, 2014 at 11:47 PM, Shigeru Hanada shigeru.han...@gmail.com
 wrote:
 
  2014-05-24 0:09 GMT+09:00 Sandro Santilli s...@keybit.net:
   Indeed I tried DISCARD ALL in hope it would have helped, so I find
   good your idea of allowing extensions to register an hook there.
  
   Still, I'd like the FDW handler itself to possibly be configured
   to disable the pool completely as a server-specific configuration.
 
  Connection management seems FDW-specific feature to me.  How about to
  add FDW option, say pool_connection=true|false, to postgres_fdw which
  allows per-server configuration?

Yes, that's what I had in mind.
I'll try something along those lines.

 Makes sense... but if we use pool_connection=true and want to close the
 opened connection. How can we do that?

Right, I still consider hooks on DISCARD a useful addition.

--strk;




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Heikki Linnakangas

On 05/26/2014 02:26 PM, Greg Stark wrote:

On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas hlinnakan...@vmware.com

wrote:



The second record is generated before the checkpoint is finished and the
checkpoint record is written.  So it will be there.

(if you crash before the checkpoint is finished, the in-progress
checkpoint is no good for recovery anyway, and won't be used)


Another idea would be to have separate checkpoints for each buffer
partition. You would have to start recovery from the oldest checkpoint of
any of the partitions.


Yeah. Simon suggested that when we talked about this, but I didn't 
understand how that works at the time. I think I do now. The key to 
making it work is distinguishing, when starting recovery from the latest 
checkpoint, whether a record for a given page can be replayed safely. I 
used flags on WAL records in my proposal to achieve this, but using 
buffer partitions is simpler.


For simplicity, let's imagine that we have two Redo-pointers for each 
checkpoint record: one for even-numbered pages, and another for 
odd-numbered pages. When checkpoint begins, we first update the 
Even-redo pointer to the current WAL insert location, and then flush all 
the even-numbered buffers in the buffer cache. Then we do the same for Odd.


Recovery begins at the Even-redo pointer. Replay works as normal, but 
until you reach the Odd-pointer, you refrain from replaying any changes 
to Odd-numbered pages. After reaching the odd-pointer, you replay 
everything as normal.
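
A minimal sketch of that replay rule, with stand-in types (illustrative only,
not actual PostgreSQL code; replay_allowed() is a made-up name):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real type */
typedef uint32_t BlockNumber;   /* stand-in for the real type */

/*
 * Recovery starts reading WAL at the Even-redo pointer.  A record touching a
 * page is replayed only once the read position has passed the redo pointer
 * for that page's parity.
 */
bool
replay_allowed(BlockNumber blkno, XLogRecPtr read_ptr,
               XLogRecPtr redo_even, XLogRecPtr redo_odd)
{
    XLogRecPtr redo = (blkno % 2 == 0) ? redo_even : redo_odd;

    return read_ptr >= redo;
}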


Hmm, that seems actually doable...

- Heikki




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Greg Stark
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 On 05/26/2014 02:26 PM, Greg Stark wrote:

 Another idea would be to have separate checkpoints for each buffer
 partition. You would have to start recovery from the oldest checkpoint of
 any of the partitions.

 Yeah. Simon suggested that when we talked about this, but I didn't understand 
 how that works at the time. I think I do now. The key to making it work is 
 distinguishing, when starting recovery from the latest checkpoint, whether a 
 record for a given page can be replayed safely. I used flags on WAL records 
 in my proposal to achieve this, but using buffer partitions is simpler.

Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions
per se since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block
space that xlog replay can reliably reproduce regardless of the size
of the buffer lock partition table. It might make sense to set it up
so it coincidentally ensures all the buffers being flushed are in the
same partition or maybe the reverse would be better. Probably it
doesn't actually matter.

 For simplicity, let's imagine that we have two Redo-pointers for each 
 checkpoint record: one for even-numbered pages, and another for odd-numbered 
 pages. When checkpoint begins, we first update the Even-redo pointer to the 
 current WAL insert location, and then flush all the even-numbered buffers in 
 the buffer cache. Then we do the same for Odd.

Hm, I had convinced myself that the LSN on the pages would mean you
skip the replay anyways but I think I was wrong and you would need to
keep a bitmap of which partitions were in recovery mode as you replay
and keep adding partitions until they're all in recovery mode and then
keep going until you've seen the checkpoint record for all of them.

I'm assuming you would keep N checkpoint positions in the control
file. That also means we can double the checkpoint timeout with only a
marginal increase in the worst case recovery time. Since the worst
case will be (1 + 1/n)*timeout's worth of wal to replay rather than
2*n. The amount of time for recovery would be much more predictable.

 Recovery begins at the Even-redo pointer. Replay works as normal, but until 
 you reach the Odd-pointer, you refrain from replaying any changes to 
 Odd-numbered pages. After reaching the odd-pointer, you replay everything as 
 normal.

 Hmm, that seems actually doable...



-- 
greg




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Ronan Dunklau
 Additionally, I don't really see how that would be useful in a general case.
 With an in-core defined meaning of type transformation, any FDW that doesn't
 fit exactly into the model would have a hard time. For example, what happens
 if an FDW is only ever capable of returning text ?

 That's actually the case where it's most important to have the feature
 all the way down to the column level.

I'm not sure configuration about specific columns from specific tables would 
be that useful: if you already know the structure of the table, you can either 
create it yourself, or run an alter column statement just after creating it.

Alternativeley, with the arbitrary options clause proposed by Tom and detailed 
below, one could use the LIMIT TO / EXCEPT clauses to fine-tune what options 
should apply.

 That's why I suggested doing it via CREATE/ALTER with JSONB or similar
 containing the details rather than inventing new SQL grammar, an
 option I included only to dismiss it.

Hum, adding a simple TYPE MAPPING is already inventing new SQL grammar, but 
it's less invasive.

 Between this and the type-mapping questions, it seems likely that
 we're going to need a way for IMPORT FOREIGN SCHEMA to accept
 user-supplied control options, which would in general be specific
 to the FDW being used.  (Another thing the SQL committee failed to
 think about.)

So, without extending the syntax too much, we could add options:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
[ { LIMIT TO | EXCEPT } (table_name [, ...])]
[ OPTIONS (option_list) ]

This option list could then contain fdw-specific options, including type 
mapping.

Or we could add a specific clause:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
[ { LIMIT TO | EXCEPT } (table_name [, ...])]
[ OPTIONS (option_list) ]
[ TYPE MAPPING some mapping_definition ]

With mapping_definition being either a tuple, or a well-defined jsonb format
common across FDWs.

A third solution, which I don't like but doesn't modify the SQL grammar, would 
be to encode the options in the remote_schema name.

Would one of those solutions be acceptable ?

-- 
Ronan Dunklau
http://dalibo.com - http://dalibo.org



Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 09:17 AM, Peter Geoghegan wrote:

While speccing out a new B-Tree verification tool, I had the
opportunity to revisit a thought I had during review concerning
_bt_findinsertloc(): that the new coding is unsafe in the event of
deferred split completion of a leaf page of a unique index. To recap,
we now do the following in _bt_findinsertloc():

/*
  * If this page was incompletely split, finish the split now. We
  * do this while holding a lock on the left sibling, which is not
  * good because finishing the split could be a fairly lengthy
  * operation.  But this should happen very seldom.
  */
if (P_INCOMPLETE_SPLIT(lpageop))
{
 _bt_finish_split(rel, rbuf, stack);
 rbuf = InvalidBuffer;
 continue;
}

The left sibling referred to here is the first page this key could
be on, an important concept for unique index enforcement.


No, it's not the first page this key could be on.

_bt_findinsertloc() does *not* hold a lock on the first valid page the 
key could go to. It merely ensures that when it steps to the next page, 
it releases the lock on the previous page only after acquiring the lock 
on the next page. Throughout the operation, it will hold a lock on 
*some* page that could legally hold the inserted value, and it acquires 
the locks in left-to-right order. This is sufficient for the uniqueness 
checks, because _bt_check_unique() scans all the pages, and 
_bt_check_unique() *does* hold the first page locked while it scans the 
rest of the pages. But _bt_findinsertloc() does not.
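
A minimal sketch of that stepping rule (illustrative only, not the actual
_bt_findinsertloc() code; the helper name and the simplified loop are made up,
though the nbtree calls themselves are real):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Step right until a page with enough free space is found, always acquiring
 * the lock on the right sibling before releasing the lock currently held,
 * so that some legal target page stays locked at all times.
 */
static Buffer
step_right_until_fits(Relation rel, Buffer buf, Size itemsz)
{
    Page         page = BufferGetPage(buf);
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    while (PageGetFreeSpace(page) < itemsz && !P_RIGHTMOST(opaque))
    {
        /* lock the right sibling first ... */
        Buffer  rbuf = _bt_getbuf(rel, opaque->btpo_next, BT_WRITE);

        /* ... and only then release the page we were holding */
        _bt_relbuf(rel, buf);

        buf = rbuf;
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    }
    return buf;
}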


Also note that _bt_moveright() also finishes any incomplete splits it 
encounters (when called for an insertion). So _bt_findinsertloc() never 
gets called on a page with the incomplete-split flag set. It might 
encounter one when it moves right, but never the first page.



Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked throughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.


If I understood correctly, the tree looks like this before the insertion:

Parent page:
+-------------+
|             |
| 9 -> A      |
+-------------+

Leaf A
+-------------+
| HI-key: 9   |
|             |
| 7 8 9       |
+-------------+

And after insertion and incomplete split:

Parent page
+-------------+
|             |
| 9 -> A      |
+-------------+

Leaf A            Leaf B
+--------------+  +-------------+
| HI-key: 8    |  | HI-key: 9   |
| (INCOMPLETE_ |  |             |
| SPLIT)       |->|             |
|              |  |             |
|  7   7*   8  |  |   9         |
+--------------+  +-------------+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next 
paragraph you said the new downlink has value 8, which implies the above 
split)



* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a value lock in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original value
lock throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original value lock. A physical insertion has yet to occur.


Hmm, I think you got confused at this step. When inserting a 7, you 
cannot look further right to find a page with more space, because the 
HI-key, 8, on the first page stipulates that 7 must go on that page, not 
some later page.



* A third inserter of the value comes along. It gets to a later page
than the one locked by the second inserter, preferring the newer
downlink with value 8 (the internal-page _bt_binsrch() logic ensures
this).


That's a contradiction: the downlink with value 8 points to the first 
page, not some later page. After the split is finished, the tree looks 
like this:


Parent page
+-------------+
| 8 -> A      |
| 9 -> B      |
+-------------+

Leaf A             Leaf B
+-------------+    +-------------+
| HI-key: 8   |    | HI-key: 9   |
|             | -> |             |
|  7   7*  8  |    |   9         |
+-------------+    +-------------+


Regardless of whether or not I have 

Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 02:42 PM, Greg Stark wrote:

On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:


On 05/26/2014 02:26 PM, Greg Stark wrote:



Another idea would be to have separate checkpoints for each buffer

partition. You would have to start recovery from the oldest checkpoint of
any of the partitions.


Yeah. Simon suggested that when we talked about this, but I didn't understand 
how that works at the time. I think I do now. The key to making it work is 
distinguishing, when starting recovery from the latest checkpoint, whether a 
record for a given page can be replayed safely. I used flags on WAL records in 
my proposal to achieve this, but using buffer partitions is simpler.


Interesting. I just thought of it independently.

Incidentally you wouldn't actually want to use the buffer partitions
per se since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block
space that xlog replay can reliably reproduce regardless of the size
of the buffer lock partition table. It might make sense to set it up
so it coincidentally ensures all the buffers being flushed are in the
same partition or maybe the reverse would be better. Probably it
doesn't actually matter.


Since you will be flushing the buffers one redo partition at a time, 
you would want to allow the OS to merge the writes within a partition 
as much as possible. So my even-odd split would in fact be pretty bad. 
Some sort of striping, e.g. mapping each contiguous 1 MB chunk to the 
same partition, would be better.
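
A minimal sketch of such a mapping (illustrative only, with made-up names and
stand-in constants; the real code would use BLCKSZ and whatever redo partition
count was chosen):

#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for the real type */

#define BLOCK_SIZE          8192                        /* assume 8 kB pages */
#define STRIPE_BYTES        (1024 * 1024)               /* 1 MB stripes */
#define BLOCKS_PER_STRIPE   (STRIPE_BYTES / BLOCK_SIZE)

/*
 * Map a block number to a redo partition using only fixed constants, so that
 * WAL replay can reproduce the mapping regardless of the size of the buffer
 * lock partition table on the server doing recovery.
 */
int
redo_partition_for_block(BlockNumber blkno, int num_redo_partitions)
{
    return (int) ((blkno / BLOCKS_PER_STRIPE) % num_redo_partitions);
}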



I'm assuming you would keep N checkpoint positions in the control
file. That also means we can double the checkpoint timeout with only a
marginal increase in the worst case recovery time. Since the worst
case will be (1 + 1/n)*timeout's worth of wal to replay rather than
2*n. The amount of time for recovery would be much more predictable.


Good point.

- Heikki




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Simon Riggs
On 27 May 2014 03:49, Fujii Masao masao.fu...@gmail.com wrote:

 So that gives us a few approaches

 * Compressing FPWs gives A
 * Background FPWs gives us B
which look like we can combine both ideas

 * Double-buffering would give us A and B, but not C
and would be incompatible with other two ideas

 Double-buffering would allow us to disable FPWs safely, but it would make
 recovery slow. So if we adopt double-buffering, I think that we would also
 need to overhaul recovery.

Which is also true of Background FPWs

So our options are

1. Compressed FPWs only

2. Compressed FPWs plus BackgroundFPWs plus Recovery Buffer Prefetch

3. Double Buffering plus Recovery Buffer Prefetch

IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
the reason you also discussed changing the WAL record format to allow
us to identify the blocks touched by recovery more easily?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 03:18 PM, Simon Riggs wrote:

IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
the reason you also discussed changing the WAL record format to allow
us to identify the blocks touched by recovery more easily?


Yeah, that was one use case I had in mind for the WAL format changes. 
See http://www.postgresql.org/message-id/533d6cbf.6080...@vmware.com.


- Heikki




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Simon Riggs
On 27 May 2014 07:42, Greg Stark st...@mit.edu wrote:
 On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 On 05/26/2014 02:26 PM, Greg Stark wrote:

 Another idea would be to have separate checkpoints for each buffer
 partition. You would have to start recovery from the oldest checkpoint of
 any of the partitions.

 Yeah. Simon suggested that when we talked about this, but I didn't 
 understand how that works at the time. I think I do now. The key to making 
 it work is distinguishing, when starting recovery from the latest 
 checkpoint, whether a record for a given page can be replayed safely. I used 
 flags on WAL records in my proposal to achieve this, but using buffer 
 partitions is simpler.

 Interesting. I just thought of it independently.

Actually, I heard it from Doug Tolbert in 2005, based on how another
DBMS coped with that issue.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Stephen Frost
* David Fetter (da...@fetter.org) wrote:
 - We make type mappings settable at the level of:
 - FDW
 - Instance (a.k.a. cluster)
 - Database
 - Schema
 - Table
 - Column

While I like the general idea, you seem to be making this appear much
more complex than it actually is.  For example, I see no value in a
table-level uint4 -> int8 definition.  If you want something which is
not the default, just set it on the individual columns of the foreign
table, exactly how we handle column name differences.

There might be some value in being able to redefine what the default is
at the FOREIGN SERVER level, as perhaps MySQL DB X and MySQL DB Y could
have different default mappings for some reason, but adding in the rest
of those levels doesn't add value imv.

 This would consist of:
 - The remote type
 - The local type to which it maps
 - The inbound transformation (default identity)
 - The outbound transformation (default identity)

The remote type and the local type are known already to the FDW, no?
The FDW will need to know how to do whatever conversions we're talking
about also, right?  (If you want to do it in core PG instead of the FDW
then I'm thinking you should probably use a view over top of the
foreign table..).  What's nice is that this all falls under what an FDW
can do *already*, if it wants (and, actually, it could do it on the
table-level technically too, if the FDW supports that, but schema?
database?  these things don't make sense...).

For the IMPORT bit, it should just be doing whatever the defaults are
unless you've set some different defaults for a given foreign server
(also something which could be set using our existing system...).

 ALTER FOREIGN TABLE foo ADD (mapping '{
 datetime: text,
 inbound: IDENTITY,
 outbound: IDENTITY
 }')

I'm very much against having two different command languages where one
is used when convenient.  If we wanted to do this, we should build a
full-spec mapping from whatever JSON language you come up with for our
utility commands to what we actually implement in the grammar.

Thanks,

Stephen




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Stephen Frost
* David Fetter (da...@fetter.org) wrote:
 In the easy case of PostgreSQL, you might also be able to establish
 whether the UDT in the already defined locally case above is
 identical to the one defined remotely, but I don't think it's possible
 even in principle for non-PostgreSQL remote systems, possibly not even
 for ones with non-identical architecture, PostgreSQL major version,
 etc.

For my 2c, it'd be very handy if IMPORT had an option or similar to also
copy over all (in the schema) or at least any depended-upon UDTs.  It'd
really be nice if we had an ability to check the remote UDT vs. the
local one at some point also, since otherwise you're going to get bitten
at run-time when querying the foreign table.

Thanks,

Stephen




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Matteo Beccati p...@beccati.com writes:
 On 27/05/2014 03:07, Tom Lane wrote:
 I do not have a machine on which to try --with-bsd-uuid, so it's
 possible I broke that portion of Matteo's patch.  Would someone try
 that case on a FreeBSD box?

 I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box.

Ah, cool.  I had documented this option as only working for FreeBSD,
but that's obviously too conservative.  Anyone know about whether it
works on OpenBSD?

 I'm attaching a little patch to be applied on top of yours.

Thanks, will incorporate this.

regards, tom lane




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Albe Laurenz laurenz.a...@wien.gv.at writes:
  Oracle follows the SQL standard in folding table names to upper case.
  So this would normally result in a lot of PostgreSQL foreign tables
  with upper case names, which would be odd and unpleasant.
 
  I cannot see a way out of that, but I thought I should mention it.
 
 It seems like it would often be desirable for the Oracle FDW to smash
 all-upper-case names to all-lower-case while importing, so that no quoting
 is needed on either side.  I doubt though that this is so desirable that
 it should happen unconditionally.

The oracle FDW would simply need a foreign-server level option which
says smash everything to lowercase on import.

 Between this and the type-mapping questions, it seems likely that
 we're going to need a way for IMPORT FOREIGN SCHEMA to accept
 user-supplied control options, which would in general be specific
 to the FDW being used.  (Another thing the SQL committee failed to
 think about.)

I can see value in having specific mappings defined at the foreign
server level for what IMPORT does by default, but as IMPORT is intended
to be a bulk process, I don't see the value in trying to teach it how to
do a specific mapping for a specific table or column inside the schema.
You run the IMPORT for the entire schema and then go fix up any special
cases or things you wanted differently.

That said, I'm not against having a way to pass to IMPORT a similar
key/value pair set which we already support for the FDW options on
tables, columns, etc, by way of safe-guarding any potential use case for
such.

Thanks,

Stephen




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Stephen Frost
* Ronan Dunklau (ronan.dunk...@dalibo.com) wrote:
 So, without extending the syntax too much, we could add options:
 
 IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
 [ { LIMIT TO | EXCEPT } (table_name [, ...])]
 [ OPTIONS (option_list) ]
 
 This option list could then contain fdw-specific options, including type 
 mapping.

Yup, that looks reasonable to me.

 Or we could add a specific clause:
 
 IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
 [ { LIMIT TO | EXCEPT } (table_name [, ...])]
 [ OPTIONS (option_list) ]
 [ TYPE MAPPING some mapping_definition ]

-1 on this one.

 With mapping_definition being either a tuple, or well-defined jsonb format 
 common accross FDWs.
 
 A third solution, which I don't like but doesn't modify the SQL grammar, 
 would 
 be to encode the options in the remote_schema name.

Yeah, don't like that one either.

Thanks,

Stephen




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Petr Jelinek

On 27/05/14 13:49, Ronan Dunklau wrote:

Between this and the type-mapping questions, it seems likely that
we're going to need a way for IMPORT FOREIGN SCHEMA to accept
user-supplied control options, which would in general be specific
to the FDW being used.  (Another thing the SQL committee failed to
think about.)


So, without extending the syntax too much, we could add options:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
[ { LIMIT TO | EXCEPT } (table_name [, ...])]
[ OPTIONS (option_list) ]

This option list could then contain fdw-specific options, including type
mapping.



This one looks like a good option to me.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Alvaro Herrera
Stephen Frost wrote:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
  Albe Laurenz laurenz.a...@wien.gv.at writes:
   Oracle follows the SQL standard in folding table names to upper case.
   So this would normally result in a lot of PostgreSQL foreign tables
   with upper case names, which would be odd and unpleasant.
  
   I cannot see a way out of that, but I thought I should mention it.
  
  It seems like it would often be desirable for the Oracle FDW to smash
  all-upper-case names to all-lower-case while importing, so that no quoting
  is needed on either side.  I doubt though that this is so desirable that
  it should happen unconditionally.
 
 The oracle FDW would simply need a foreign-server level option which
 says smash everything to lowercase on import.

That's not the same thing though -- consider what happens to the quoting
needs for names with mixed case.  If you change mixed case to
all-lowercase, references to such objects using quotes in the
application code would fail because the name is now all-lowercase in
Postgres.  A tri-valued enum could do the trick:
lowercase_names={wholly_uppercase_only, all, none}
with the first one being the most sensible and default.

  Between this and the type-mapping questions, it seems likely that
  we're going to need a way for IMPORT FOREIGN SCHEMA to accept
  user-supplied control options, which would in general be specific
  to the FDW being used.  (Another thing the SQL committee failed to
  think about.)
 
 I can see value in having specific mappings defined at the foreign
 server level for what IMPORT does by default, but as IMPORT is intended
 to be a bulk process, I don't see the value in trying to teach it how to
 do a specific mapping for a specific table or column inside the schema.
 You run the IMPORT for the entire schema and then go fix up any special
 cases or things you wanted differently.

Yes, it seems better to offer the ability to create transactional
scripts around IMPORT (so it must be able to run in a transaction block
-- IMPORT first, then do a bunch of ALTERs), than complicating IMPORT
itself infinitely to support hundreds of possible options.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Stephen Frost wrote:
   It seems like it would often be desirable for the Oracle FDW to smash
   all-upper-case names to all-lower-case while importing, so that no quoting
   is needed on either side.  I doubt though that this is so desirable that
   it should happen unconditionally.
  
  The oracle FDW would simply need a foreign-server level option which
  says smash everything to lowercase on import.
 
 That's not the same thing though -- consider what happens to the quoting
 needs for names with mixed case.  If you change mixed case to
 all-lowercase, references to such objects using quotes in the
 application code would fail because the name is now all-lowercase in
 Postgres.  A tri-valued enum could do the trick:
 lowercase_names={wholly_uppercase_only, all, none}
 with the first one being the most sensible and default.

Sure, I was being a bit over-simplistic.  As was mentioned up-thread,
the option would rather be flip all-uppercase to lowercase and all-
lowercase to uppercase, quote any which are mixed, or something along
those lines.  What I was trying to get at is that it's up to the FDW
what options it wants to support in this regard and we already have a
way for the admin to pass in useful information to the FDW by way of the
FOREIGN SERVER FDW options.

This, plus the generic ability to pass an OPTIONS clause to the IMPORT
(allowing you to have different defaults for different IMPORT
statements) and having it be transactional, as you mention, appears to
be covering all the relevant bases.

Thanks,

stephen




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Matteo Beccati
Hi Tom,

On 27/05/2014 15:52, Tom Lane wrote:
 Matteo Beccati p...@beccati.com writes:
 On 27/05/2014 03:07, Tom Lane wrote:
 I do not have a machine on which to try --with-bsd-uuid, so it's
 possible I broke that portion of Matteo's patch.  Would someone try
 that case on a FreeBSD box?
 
 I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box.
 
 Ah, cool.  I had documented this option as only working for FreeBSD,
 but that's obviously too conservative.  Anyone know about whether it
 works on OpenBSD?

I've tried to google man uuid openbsd and I got the e2fs package
(which contains uuid/uuid.h and libuuid) instead of a man page, so I
believe that could be another use case for --with-linux-uuid.

If it's confirmed to be working, that makes two BSD-derived systems
requiring linux-uuid, so --with-e2fs-uuid or similar would be more
appropriate.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Sure, I was being a bit over-simplistic.  As was mentioned up-thread,
 the option would rather be flip all-uppercase to lowercase and all-
 lowercase to uppercase, quote any which are mixed, or something along
 those lines.  What I was trying to get at is that it's up to the FDW
 what options it wants to support in this regard and we already have a
 way for the admin to pass in useful information to the FDW by way of the
 FOREIGN SERVER FDW options.

 This, plus the generic ability to pass an OPTIONS clause to the IMPORT
 (allowing you to have different defaults for different IMPORT
 statements) and having it be transactional, as you mention, appears to
 be covering all the relevant bases.

Yeah, as far as the example of coping with differing case goes, I agree
that we'd want IMPORT to just follow whatever the FDW's default or
configured behavior is, since obviously the FDW will have to know how to
reverse the conversion when sending queries later on.  So there's no
apparent need for an IMPORT-level option *for that example*.  I'm not
sure if we need OPTIONS for IMPORT for any other uses.

regards, tom lane




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Alvaro Herrera
Tom Lane wrote:
 David E. Wheeler da...@justatheory.com writes:
  On May 26, 2014, at 9:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  How about --with-unix-uuid? Or --with-ext2-uuid?
 
  Meh.  Unix certainly subsumes BSD, so that doesn't seem like a very
  useful distinction.  I guess we could use ext2 but that would just
  confuse most people.
 
  --with-uuid?
 
 I thought about that but figured we'd regret it ...

Why not --with-uuid-implementation=impl, and have impl be one of
e2utils, bsd, ossp, with the latter being default?  We could also
offer the value "list" or "help", which would list the available options.
That way, if we come up with a new implementation in the future, this is
easily extensible.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Matteo Beccati p...@beccati.com writes:
 On 27/05/2014 15:52, Tom Lane wrote:
 Ah, cool.  I had documented this option as only working for FreeBSD,
 but that's obviously too conservative.  Anyone know about whether it
 works on OpenBSD?

 I've tried to google man uuid openbsd and I got the e2fs package
 (which contains uuid/uuid.h and libuuid) instead of a man page, so I
 believe that could be another use case for --with-linux-uuid.

Yeah, googling led me to the same conclusion: e2fsprogs libuuid is
definitely available for OpenBSD, but I found no evidence that it
has the built-in UUID functions that FreeBSD/NetBSD have.

 If it's confirmed to be working, that makes two BSD-derived systems
 requiring linux-uuid, so --with-e2fs-uuid or similar would be more
 appropriate.

Still don't like that name much, but maybe that's the best we can do.

regards, tom lane




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Andres Freund
On 2014-05-27 16:36:45 +0200, Matteo Beccati wrote:
 On 27/05/2014 15:52, Tom Lane wrote:
  Ah, cool.  I had documented this option as only working for FreeBSD,
  but that's obviously too conservative.  Anyone know about whether it
  works on OpenBSD?
 
 I've tried to google man uuid openbsd and I got the e2fs package
 (which contains uuid/uuid.h and libuuid) instead of a man page, so I
 believe that could be another use case for --with-linux-uuid.
 
 If it's confirmed to be working, that makes two BSD-derived systems
 requiring linux-uuid, so --with-e2fs-uuid or similar would be more
 appropriate.

When I looked at the e2fs uuid implementation a year or so back, it even
had some form of windows support. So +1 to naming it non-os specifically.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Why not --with-uuid-implementation=impl, and have impl be one of
 e2utils, bsd, ossp, with the latter being default?  We could also have
 offer the value list or help which would list the available options.
 That way, if we come up with a new implementation in the future, this is
 easily extensible.

The problem is that the long-established spelling is --with-ossp-uuid.
I don't think we can break that case.  While we could set up something
like what you propose alongside it, it doesn't seem like there's any
advantage to doing so compared to inventing --with-foo-uuid as needed.

In either case, the problem remains of exactly what to call the
e2fsprogs-derived implementation.  It does seem that people who are
familiar with these libraries call it that, but I'm worried that such
a name will confuse those not so familiar.

regards, tom lane




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Stephen Frost sfr...@snowman.net writes:
  This, plus the generic ability to pass an OPTIONS clause to the IMPORT
  (allowing you to have different defaults for different IMPORT
  statements) and having it be transactional, as you mention, appears to
  be covering all the relevant bases.
 
 Yeah, as far as the example of coping with differing case goes, I agree
 that we'd want IMPORT to just follow whatever the FDW's default or
 configured behavior is, since obviously the FDW will have to know how to
 reverse the conversion when sending queries later on.  So there's no
 apparent need for an IMPORT-level option *for that example*.  I'm not
 sure if we need OPTIONS for IMPORT for any other uses.

I'm waffling a bit on this particular point.  IMPORT is for bulk
operations, of course, but perhaps on one remote server you want to
import everything from the dimension tables using the default mapping
but everything from the fact or perhaps aggregate tables in some schema
as text.  You could do that with something like- import #1, change the
server-level options, import #2, or you could do just one big import and
then go back and fix everything, but...

I'd rather introduce this options clause for IMPORT *now*, which means
we don't need to care about if this specific example is really relevant
or not, than not have it in 9.5 and have FDW authors unhappy because
it's missing (whether they need it for this use case or some other).

Thanks,

Stephen




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Why not --with-uuid-implementation=impl, and have impl be one of
  e2utils, bsd, ossp, with the latter being default?  We could also have
  offer the value list or help which would list the available options.
  That way, if we come up with a new implementation in the future, this is
  easily extensible.
 
 The problem is that the long-established spelling is --with-ossp-uuid.
 I don't think we can break that case.  While we could set up something
 like what you propose alongside it, it doesn't seem like there's any
 advantage to doing so compared to inventing --with-foo-uuid as needed.

I was thinking that --with-ossp-uuid would still be required to enable
UUID generators at all; the other one just selects the implementation to
use, which defaults to OSSP to maintain backwards compatibility.  Maybe
introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of
that.

As a more sophisticated idea: if you say --with-uuid then
--with-uuid-implementation is required; if you say --with-ossp-uuid then
--with-uuid-implementation is optional and defaults to OSSP.

 In either case, the problem remains of exactly what to call the
 e2fsprogs-derived implementation.  It does seem that people who are
 familiar with these libraries call it that, but I'm worried that such
 a name will confuse those not so familiar.

I vote e2fsprogs.  In the help text for that option, mention that it
works on Linux and some other OSes, or something like that so that
people on Linux try that one first, and people on other OSes can
web-search to see whether it's available.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 The problem is that the long-established spelling is --with-ossp-uuid.
 I don't think we can break that case.  While we could set up something
 like what you propose alongside it, it doesn't seem like there's any
 advantage to doing so compared to inventing --with-foo-uuid as needed.

 I was thinking that --with-ossp-uuid would still be required to enable
 UUID generators at all; the other one just selects the implementation to
 use, which defaults to OSSP to maintain backwards compatibility.  Maybe
 introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of
 that.

If we were going to do it like that, I'd vote for

   --with-uuid={ossp,e2fs,bsd}

with no default at present (ie you can't say just --with-uuid,
though we'd have the option to allow that in future).  But I doubt
this is better than the --with-foo-uuid spelling.

regards, tom lane




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread David E. Wheeler
On May 27, 2014, at 7:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 In either case, the problem remains of exactly what to call the
 e2fsprogs-derived implementation.  It does seem that people who are
 familiar with these libraries call it that, but I'm worried that such
 a name will confuse those not so familiar.

--with-libuuid?

D





Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Heikki Linnakangas
On 27 May 2014 18:33:48 EEST, Tom Lane t...@sss.pgh.pa.us wrote:
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 The problem is that the long-established spelling is --with-ossp-uuid.
 I don't think we can break that case.  While we could set up something
 like what you propose alongside it, it doesn't seem like there's any
 advantage to doing so compared to inventing --with-foo-uuid as needed.

 I was thinking that --with-ossp-uuid would still be required to enable
 UUID generators at all; the other one just selects the implementation to
 use, which defaults to OSSP to maintain backwards compatibility.  Maybe
 introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of
 that.

If we were going to do it like that, I'd vote for

   --with-uuid={ossp,e2fs,bsd}

with no default at present (ie you can't say just --with-uuid,
though we'd have the option to allow that in future).  But I doubt
this is better than the --with-foo-uuid spelling.

FWIW, --with-uuid=foo looks much clearer to me.


- Heikki




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-27 Thread Ronan Dunklau
On Tuesday 27 May 2014 09:52:27, Stephen Frost wrote:
 * David Fetter (da...@fetter.org) wrote:
  In the easy case of PostgreSQL, you might also be able to establish
  whether the UDT in the already defined locally case above is
  identical to the one defined remotely, but I don't think it's possible
  even in principle for non-PostgreSQL remote systems, possibly not even
  for ones with non-identical architecture, PostgreSQL major version,
  etc.
 
 For my 2c, it'd be very handy if IMPORT had an option or similar to also
 copy over all (in the schema) or at least any depended-upon UDTs.  It'd
 really be nice if we had an ability to check the remote UDT vs. the
 local one at some point also, since otherwise you're going to get bitten
 at run-time when querying the foreign table.

The specification only talks about importing tables, not types, views or (why 
not ?) foreign tables.

If we want to extend the spec further, we could accept more than 
CreateForeignTableStmt as return values from the API:
 - CreateDomainStmt
 - CompositeTypeStmt
 - AlterTableCmd


The core code would be responsible for executing them, and making sure the 
destination schema is correctly positioned on all of those.

The postgres_fdw behaviour could then be controlled with options such as 
include_types (tri-valued enum accepting none, all, as_needed), 
relation_kinds (default to 'r', can support multiple kinds with a string 'rfv' 
for tables, foreign tables and views).
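
To make that concrete, a hypothetical invocation with those options might
look something like this (option names and syntax are just a sketch of the
proposal, nothing that exists yet):

    IMPORT FOREIGN SCHEMA remote_schema
        FROM SERVER pgserver
        INTO local_schema
        OPTIONS (include_types 'as_needed', relation_kinds 'rfv');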

I think we're drifting further away from the standard with that kind of stuff, 
but I'd be happy to implement it if that's the path we choose.

-- 
Ronan Dunklau
http://dalibo.com - http://dalibo.org



Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Jeff Janes
On Mon, May 26, 2014 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 I don't think we know that. The server might have crashed before that
 second record got generated.  (This appears to be an unfixable flaw in
 this proposal.)
 
  The second record is generated before the checkpoint is finished and the
 checkpoint record is written.  So it will be there.
 
  (if you crash before the checkpoint is finished, the in-progress
 checkpoint is no good for recovery anyway, and won't be used)

 Hmm, I see.

 It's not great to have to generate WAL at buffer-eviction time,
 though.  Normally, when we go to evict a buffer, the WAL is already
 written.  We might have to wait for it to be flushed, but if the WAL
 writer is doing its job, hopefully not.  But here we'll definitely
 have to wait for the WAL flush.


I'm not sure we do need to flush it.  If the checkpoint finishes, then the
WAL surely got flushed as part of the process of recording the end of the
checkpoint.  If the checkpoint does not finish, recovery will start from
the previous checkpoint, which does contain the FPW (because if it didn't,
the page would not be eligible for this treatment) and so the possibly torn
page will get overwritten in full.

Cheers,

Jeff


Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 27 May 2014 18:33:48 EEST, Tom Lane t...@sss.pgh.pa.us wrote:
 If we were going to do it like that, I'd vote for
 
 --with-uuid={ossp,e2fs,bsd}
 
 with no default at present (ie you can't say just --with-uuid,
 though we'd have the option to allow that in future).  But I doubt
 this is better than the --with-foo-uuid spelling.

 FWIW, --with-uuid=foo looks much clearer to me.

OK, I'll make it so.

regards, tom lane




Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Peter Geoghegan
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The left sibling referred to here is the first page this key could
 be on, an important concept for unique index enforcement.

 No, it's not the first page this key could be on.

Well, it may be initially. I could have been more cautious about the
terminology here.

 Also note that _bt_moveright() also finishes any incomplete splits it
 encounters (when called for an insertion). So _bt_findinsertloc() never gets
 called on a page with the incomplete-split flag set. It might encounter one
 when it moves right, but never the first page.

Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?). It's a nice property, since it
makes the extra locking while completing a split within
_bt_findinsertloc() particularly infrequent. But, that might also be a
bad thing, when considered from a different perspective.

 If I understood correctly, the tree looks like this before the insertion:

 Parent page:
 +-+
 | |
 | 9 -> A  |
 +-+

 Leaf A
 +-+
 | HI-key: 9   |
 | |
 | 7 8 9   |
 +-+

 And after insertion and incomplete split:

 Parent page
 +-+
 | |
 | 9 -> A  |
 +-+

 Leaf A  Leaf B
 +--+ +-+
 | HI-key: 8| | HI-key: 9   |
 | (INCOMPLETE_ | | |
 | SPLIT)   | -> | |
 |  | | |
 |  7   7*   8  | |   9 |
 +--+ +-+

 After the split is finished, the tree looks like this:

 Parent page
 +-+
 | 8 -> A  |
 | 9 -> B  |
 +-+

 Leaf A  Leaf B
 +-+ +-+
 | HI-key: 8   | | HI-key: 9   |
 | | -> | |
 |  7   7*  8  | |   9 |
 +-+ +-+

How did the parent page change between before and after the final
atomic operation (page split completion)? What happened to 9 -> A?

 Regardless of whether or not I have these details exactly right (that
 is, regardless of whether or not this scenario is strictly possible) I
 suggest as a code-hardening measure that _bt_findinsertloc() release
 its value lock, upon realizing it must complete splits, and then
 complete the split or splits known to be required. It would finally
 report that it couldn't find an insertion location to
 _bt_doinsert(), which would then retry from the start, just like when
 _bt_check_unique() finds an inconclusive conflict.

 Yeah, that would work too. It seems safe enough as it is, though, so I don't
 see the point.

Well, it would be nice to not have to finish the page split in what is
a particularly critical path, with that extra buffer lock. It's not
strictly necessary, but then it is theoretically safer, and certainly
much clearer. The fact that this code is so seldom executed is one
issue that made me revisit this. On the other hand, I can see why
you'd want to avoid cluttering up the relatively comprehensible
_bt_doinsert() function if it could be avoided. I defer to you.

 PS. Thanks for looking into this again! These B-tree changes really need
 thorough review.

You're welcome. Hopefully my questions will lead you in a useful
direction, even if my concerns turn out to be, in the main, unfounded.
 :-)

It previously wasn't in evidence that you'd considered these
interactions, and I feel better knowing that you have.
-- 
Peter Geoghegan




[HACKERS] PG Manual: Clarifying the repeatable read isolation example

2014-05-27 Thread Evan Jones
Feel free to flame me if I should be posting this elsewhere, but after reading 
the "submitting a patch" guide, it appears I should ask for guidance here.


I was reading the Postgres MVCC documentation today (which is generally 
fantastic BTW), and am slightly confused by a single sentence example, 
describing possible read-only snapshot isolation anomalies. I would like to 
submit a patch to clarify this example, since I suspect others may be also 
confused, but to do that I need help understanding it. The example was added as 
part of the Serializable Snapshot Isolation patch.

Link to the commit: 
http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21


I'm referring to the following sentence of 13.2.2, which is still in the source 
tree:

http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ

For example, even a read only transaction at this level may see a control 
record updated to show that a batch has been completed but not see one of the 
detail records which is logically part of the batch because it read an earlier 
revision of the control record.


I do not understand how this example anomaly is possible. I'm imagining 
something like the following:

1. Do a bunch of work, possibly in parallel in multiple transactions, that 
insert/update a bunch of detail records.
2. After all that work commits, insert or update a record in the control 
table indicating that the batch completed.

Or maybe:

1. Do a batch of work and update the control table in a single transaction.
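
Concretely, I'm imagining something along these lines (purely illustrative
schema and statements):

    create table control (batch_id int primary key, completed boolean);
    create table detail  (batch_id int, payload text);

    -- step 1: one or more transactions insert the detail rows
    insert into detail values (1, 'some work');

    -- step 2: run only after everything in step 1 has committed
    update control set completed = true where batch_id = 1;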


The guarantee that I believe REPEATABLE READ will give you in either of these 
cases is that if you see the control table record, you will read all the 
detail records, because the control record is only written if the updated 
detail records have been committed. What am I not understanding?


The most widely cited read-only snapshot isolation example is the bank 
withdrawl example from this paper: 
http://www.sigmod.org/publications/sigmod-record/0409/2.ROAnomONeil.pdf . 
However, I suspect we can present an anomaly that doesn't require as much 
explanation?

Thanks,

Evan Jones

--
Work: https://www.mitro.co/    Personal: http://evanjones.ca/





Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 09:47 PM, Peter Geoghegan wrote:

On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Also note that _bt_moveright() also finishes any incomplete splits it
encounters (when called for an insertion). So _bt_findinsertloc() never gets
called on a page with the incomplete-split flag set. It might encounter one
when it moves right, but never the first page.


Fair enough, but I don't think that affects correctness either way (I
don't think that you meant to imply that this was a necessary
precaution that you'd taken - right?).


Right.


If I understood correctly, the tree looks like this before the insertion:

Parent page:
+-+
| |
| 9 -> A  |
+-+

Leaf A
+-+
| HI-key: 9   |
| |
| 7 8 9   |
+-+

And after insertion and incomplete split:

Parent page
+-+
| |
| 9 -> A  |
+-+

Leaf A  Leaf B
+--+ +-+
| HI-key: 8| | HI-key: 9   |
| (INCOMPLETE_ | | |
| SPLIT)   | -> | |
|  | | |
|  7   7*   8  | |   9 |
+--+ +-+



After the split is finished, the tree looks like this:

Parent page
+-+
| 8 -> A  |
| 9 -> B  |
+-+

Leaf A  Leaf B
+-+ +-+
| HI-key: 8   | | HI-key: 9   |
| | -> | |
|  7   7*  8  | |   9 |
+-+ +-+


How did the parent page change between before and after the final
atomic operation (page split completion)? What happened to 9 -> A?


Ah, sorry, I got that wrong. The downlinks store the *low* key of the 
child page, not the high key as I depicted. Let me try again:


On 05/27/2014 09:17 AM, Peter Geoghegan wrote:

Anyway, the concern is that there may be problems when we call
_bt_finish_split() with that left sibling locked thoughout (i.e.
finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and
itself has a right sibling from the incomplete split (which is usually
the value lock page's right-right sibling)). I'm not concerned about
performance, since as the comment says it ought to be an infrequent
occurrence. I also believe that there are no deadlock hazards. But
consider this scenario:

* We insert the value 7 into an int4 unique index. We need to split
the leaf page. We run out of memory or something, and ours is an
incomplete page split. Our transaction is aborted. For the sake of
argument, suppose that there are also already a bunch of dead tuples
within the index with values of 7, 8 and 9.


So, initially the tree looks like this:

Parent page:
+-+
| |
| 7 -> A  |
+-+

Leaf A
+-+
| HI-key: 9   |
| |
| 7 8 9   |
+-+

And after insertion and incomplete split:

Parent page
+-+
| |
| 7 -> A  |
+-+

Leaf A  Leaf B
+--+ +-+
| HI-key: 8| | HI-key: 9   |
| (INCOMPLETE_ | | |
| SPLIT)   | -> | |
|  | | |
|  7   7*  8   | |9|
+--+ +-+

where 7* is the newly inserted key, with value 7.

(you didn't mention at what point the split happens, but in the next
paragraph you said the new downlink has value 8, which implies the above 
split)



* Another inserter of the value 7 comes along. It follows exactly the
same downlink as the first, now aborted inserter (suppose the
downlink's value is 9). It also locks the same leaf page to establish
a value lock in precisely the same manner. Finding no room on the
first page, it looks further right, maintaining its original value
lock throughout. It finishes the first inserter's split of the
non-value-lock page - a new downlink is inserted into the parent page,
with the value 8. It then releases all buffer locks except the first
one/original value lock. A physical insertion has yet to occur.


The downlink of the original page cannot contain 9. Because, as I now 
remember ;-), the downlinks contain low-keys, not high keys.


- Heikki




[HACKERS] Why is pg_lsn marking itself a preferred type?

2014-05-27 Thread Tom Lane
One of these doesn't belong:

postgres=# select typname, typcategory from pg_type where typispreferred;
   typname   | typcategory 
-------------+-------------
 bool        | B
 text        | S
 oid         | N
 float8      | N
 inet        | I
 timestamptz | D
 interval    | T
 varbit      | V
 pg_lsn      | U
(9 rows)

Was there any actual rationale to this, or was it just somebody who did
not understand what that bit is for?

I think it's probably mostly harmless given the lack of casts to or from
pg_lsn, but it's still a darn bad idea to have any preferred types in the
'U' category.  If we leave it like this it will bite us in the rear
eventually.

The most expedient response at this late date seems to be to change the
entry in pg_type.h without bumping catversion.  That way at least it
will be right in databases initdb'd after beta2.

regards, tom lane




Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 10:12 PM, Evan Jones wrote:

I was reading the Postgres MVCC documentation today (which is
generally fantastic BTW), and am slightly confused by a single
sentence example, describing possible read-only snapshot isolation
anomalies. I would like to submit a patch to clarify this example,
since I suspect others may be also confused, but to do that I need
help understanding it. The example was added as part of the
Serializable Snapshot Isolation patch.

Link to the commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21



I'm referring to the following sentence of 13.2.2, which is still in
the source tree:

http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ

 For example, even a read only transaction at this level may see a
control record updated to show that a batch has been completed but
not see one of the detail records which is logically part of the
batch because it read an earlier revision of the control record.


Hmm, that seems to be a super-summarized description of what Kevin & Dan 
called the "receipts" problem. There's an example of that in the 
isolation test suite, see src/test/isolation/specs/receipt-report.spec. 
Googling for it, I also found an academic paper written by Kevin & Dan 
that illustrates it: http://arxiv.org/pdf/1208.4179.pdf, 2.1.2 Example 
2: Batch Processing. (Nice work, I didn't know of that paper until now!)
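
Condensed a lot (and adapted from memory, so treat the tables and values as
illustrative only), the anomaly goes roughly like this, with all three
sessions at REPEATABLE READ:

    create table control  (deposit_no int);
    insert into control values (1);
    create table receipts (deposit_no int, amount numeric);

    -- T1 (new receipt): reads the current batch number and inserts into it
    begin isolation level repeatable read;
    select deposit_no from control;               -- sees 1
    insert into receipts values (1, 100.00);      -- not committed yet

    -- T2 (close the batch): commits first
    begin isolation level repeatable read;
    update control set deposit_no = 2;
    commit;

    -- T3 (report): runs after T2's commit but before T1's
    begin isolation level repeatable read;
    select deposit_no from control;               -- sees 2, so batch 1 looks closed
    select * from receipts where deposit_no = 1;  -- does NOT include T1's receipt
    commit;

    -- T1 commits last: batch 1 now holds a receipt the report never saw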


I agree that's too terse. I think it would be good to actually spell out 
a complete example of the Receipt problem in the manual. That chapter in 
the manual contains examples of anomalies in Read Committed mode, so 
it would be good to give a concrete example of an anomaly in Repeatable 
Read mode too. Want to write up a docs patch?


- Heikki




Re: [HACKERS] pg_llog not mentioned in Database File Layout

2014-05-27 Thread Fabrízio de Royes Mello
On Mon, May 26, 2014 at 12:33 PM, Amit Langote amitlangot...@gmail.com wrote:

 Hi,

 Just noticed pg_llog is not mentioned in the Database File Layout
 section. Wonder if it's an oversight?


Yes, it is an oversight. Patch attached.

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog sobre TI: http://fabriziomello.blogspot.com
 Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 57e7f09..132169d 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -78,6 +78,11 @@ Item
 </row>
 
 <row>
+ <entry><filename>pg_llog</></entry>
+ <entry>Subdirectory containing logical replication status data</entry>
+</row>
+
+<row>
  <entry><filename>pg_multixact</></entry>
  <entry>Subdirectory containing multitransaction status data
   (used for shared row locks)</entry>



Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example

2014-05-27 Thread Evan Jones
Oh yeah, I shared an office with Dan so I should have thought to check their 
paper. Oops. Thanks for the suggestion; I'll try to summarize this into 
something that is similar to the Read Committed and Serializable mode examples. 
It may take me a week or two to find the time, but thanks for the suggestions.

Evan


On May 27, 2014, at 15:32 , Heikki Linnakangas hlinnakan...@vmware.com wrote:

 I agree that's too terse. I think it would be good to actually spell out a 
 complete example of the Receipt problem in the manual. That chapter in the 
 manual contains examples of anomalities in Read Committed mode, so it would 
 be good to give a concrete example of an anomaly in Repeatable Read mode too. 
 Want to write up a docs patch?


--
Work: https://www.mitro.co/    Personal: http://evanjones.ca/





[HACKERS] json casts

2014-05-27 Thread Andrew Dunstan
I've been on the receiving end of a couple of mumbles about the fact 
that the JSON rendering code ignores casts of builtin types to JSON. 
This was originally done as an optimization to avoid doing cache lookups 
for casts for things we knew quite well how to turn into JSON values 
(unlike, say, hstore). However, there is at least one concrete case 
where this has some possibly undesirable consequences, namely 
timestamps. Many JSON processors, especially JavaScript/ECMAScript 
processors, require timestamp values to be in ISO 8601 format, with a 
'T' between the date part and the time part, and thus they barf on the 
output we produce for such values.
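
For instance (illustrative output, with DateStyle set to ISO):

    select to_json(now());
    --  "2014-05-27 15:53:12.123456-04"

Note the space between the date and time parts; the ECMAScript-style parsers
in question expect something more like 2014-05-27T15:53:12.123-04:00.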


I'm therefore wondering if we should just remove this optimization. I 
don't have any performance figures, but it seems unlikely to be terribly 
expensive. Perhaps in 9.5 we should look at caching a lot of the info 
the json rendering code gathers, but that's something we can't look at now.


cheers

andrew




Re: [HACKERS] json casts

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 10:53 PM, Andrew Dunstan wrote:

I've been on the receiving end of a couple of mumbles about the fact
that the JSON rendering code ignores casts of builtin types to JSON.
This was originally done as an optimization to avoid doing cache lookups
for casts for things we knew quite well how to turn into JSON values
(unlike, say, hstore). However, there is at least one concrete case
where this has some possibly undesirable consequences, namely
timestamps. Many JSON processors, especially JavaScript/ECMAScript
processors, require timestamp values to be in ISO 8601 format, with a
'T' between the date part and the time part, and thus they barf on the
output we produce for such values.


I don't understand what ignoring casts of builtin types to JSON means. 
Can you give an example?


- Heikki




Re: [HACKERS] json casts

2014-05-27 Thread Andrew Dunstan


On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:

On 05/27/2014 10:53 PM, Andrew Dunstan wrote:

I've been on the receiving end of a couple of mumbles about the fact
that the JSON rendering code ignores casts of builtin types to JSON.
This was originally done as an optimization to avoid doing cache lookups
for casts for things we knew quite well how to turn into JSON values
(unlike, say, hstore). However, there is at least one concrete case
where this has some possibly undesirable consequences, namely
timestamps. Many JSON processors, especially JavaScript/ECMAScript
processors, require timestamp values to be in ISO 8601 format, with a
'T' between the date part and the time part, and thus they barf on the
output we produce for such values.


I don't understand what ignoring casts of builtin types to JSON means. 
Can you give an example?



See src/backend/utils/adt/json.c:json_categorize_type() lines 1280-1300.

When rendering some value as part of a json string, if a cast exists 
from the data type to json, then the cast function is used to render the 
json instead of the type's normal output function, but only if it's not 
a builtin type.
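
A quick illustration with a non-builtin type (this assumes the hstore
extension; hstore 1.2 ships a cast from hstore to json, if I recall
correctly):

    create extension hstore;
    select to_json('a=>1'::hstore);
    --  {"a": "1"}          (rendered via the hstore-to-json cast)

Without such a cast, the value would simply be run through hstore's output
function and emitted as a single JSON string.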


cheers

andrew







Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Peter Geoghegan
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
 page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.

-- 
Peter Geoghegan




Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Heikki Linnakangas

On 05/27/2014 11:30 PM, Peter Geoghegan wrote:

On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Ah, sorry, I got that wrong. The downlinks store the *low* key of the child
page, not the high key as I depicted. Let me try again:


Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.


Sure:

Parent page
+-+
| 7 -> A  |
| 8 -> B  |
+-+

Leaf A  Leaf B
+--+ +-+
| HI-key: 8| | HI-key: 9   |
|  | | |
|  | -> | |
|  | | |
|  7   7*  8   | |9|
+--+ +-+

- Heikki




Re: [HACKERS] json casts

2014-05-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
 On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
 I've been on the receiving end of a couple of mumbles about the fact
 that the JSON rendering code ignores casts of builtin types to JSON.
 This was originally done as an optimization to avoid doing cache lookups
 for casts for things we knew quite well how to turn into JSON values
 (unlike, say, hstore). However, there is at least one concrete case
 where this has some possibly undesirable consequences, namely
 timestamps. Many JSON processors, especially JavaScript/ECMAScript
 processors, require timestamp values to be in ISO 8601 format, with a
 'T' between the date part and the time part, and thus they barf on the
 output we produce for such values.

 I don't understand what ignoring casts of builtin types to JSON means. 
 Can you give an example?

 See src/backend/utils/adt/json.c:json_categorize_type() lines 1280-1300.

 When rendering some value as part of a json string, if a cast exists 
 from the data type to json, then the cast function is used to render the 
 json instead of the type's normal output function, but only if it's not 
 a builtin type.

How exactly would disabling that code have any effect on timestamp
rendering?  There's no cast to json from timestamps (nor any other
builtin type, except jsonb).

I'd be inclined to think a more useful answer to this issue would be to
make json.c special-case timestamps, as it already does for numerics.
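
(For comparison, the existing numeric special case means, roughly:

    select to_json(1.5), to_json('NaN'::numeric);
    --  1.5  and  "NaN"

i.e. values that are legal JSON numbers are emitted unquoted and anything
else is quoted.  A timestamp special case could similarly emit ISO 8601 with
the T separator.)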

regards, tom lane




Re: [HACKERS] 9.5 commit fest schedule

2014-05-27 Thread Peter Eisentraut
On 5/22/14, 7:50 AM, Abhijit Menon-Sen wrote:
 At 2014-05-22 07:22:42 -0400, pete...@gmx.net wrote:

 We need a volunteer to manage the first commit fest.
 
 I volunteer.

Congratulations, you're the new commit fest manager.

I trust that you know your way around, but if you have any questions
about anything, you can ask me, and probably some of the other previous
commit fest managers.






Re: [HACKERS] json casts

2014-05-27 Thread Hannu Krosing
On 05/27/2014 11:00 PM, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
 On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
 I've been on the receiving end of a couple of mumbles about the fact
 that the JSON rendering code ignores casts of builtin types to JSON.
 This was originally done as an optimization to avoid doing cache lookups
 for casts for things we knew quite well how to turn into JSON values
 (unlike, say, hstore). However, there is at least one concrete case
 where this has some possibly undesirable consequences, namely
 timestamps. Many JSON processors, especially JavaScript/ECMAScript
 processors, require timestamp values to be in ISO 8601 format, with a
 'T' between the date part and the time part, and thus they barf on the
 output we produce for such values.
 I don't understand what ignoring casts of builtin types to JSON means. 
 Can you give an example?
 See src/backend/utils/adt/json.c:json_categorize_type() lines 1280-1300.
 When rendering some value as part of a json string, if a cast exists 
 from the data type to json, then the cast function is used to render the 
 json instead of the type's normal output function, but only if it's not 
 a builtin type.
 How exactly would disabling that code have any effect on timestamp
 rendering?  There's no cast to json from timestamps (nor any other
 builtin type, except jsonb).
I think Andrew's idea was that, if a cast were used, one could fix the above
problem by defining a correct cast.



 I'd be inclined to think a more useful answer to this issue would be to
 make json.c special-case timestamps, as it already does for numerics.

   regards, tom lane
But I agree that special-casing the code to use the de-facto json standard
of using ISO 8601 date representation is a better solution.

Just make sure you get the TZ part right - this is another place where
PostgreSQL often differs from other systems' understanding of ISO
timestamps.
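
For example (illustrative; the exact output depends on the TimeZone setting):

    set timezone = 'UTC';
    select to_char(now(), 'YYYY-MM-DD"T"HH24:MI:SS.USOF');
    --  2014-05-27T22:43:00.123456+00

OF prints the bare offset (+00 here), whereas many ECMAScript parsers accept
only +00:00 or Z, so the offset formatting needs care.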

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example

2014-05-27 Thread David G Johnston
Heikki Linnakangas-6 wrote
 On 05/27/2014 10:12 PM, Evan Jones wrote:
 I was reading the Postgres MVCC documentation today (which is
 generally fantastic BTW), and am slightly confused by a single
 sentence example, describing possible read-only snapshot isolation
 anomalies. I would like to submit a patch to clarify this example,
 since I suspect others may be also confused, but to do that I need
 help understanding it. The example was added as part of the
 Serializable Snapshot Isolation patch.

 Link to the commit:
 http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21



 I'm referring to the following sentence of 13.2.2, which is still in
 the source tree:

 http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ

  For example, even a read only transaction at this level may see a
 control record updated to show that a batch has been completed but
 not see one of the detail records which is logically part of the
 batch because it read an earlier revision of the control record.
 
 Hmm, that seems to be a super-summarized description of what Kevin & Dan 
 called the "receipts" problem. There's an example of that in the 
 isolation test suite, see src/test/isolation/specs/receipt-report.spec. 
 Googling for it, I also found an academic paper written by Kevin & Dan 
 that illustrates it: http://arxiv.org/pdf/1208.4179.pdf, 2.1.2 Example 
 2: Batch Processing. (Nice work, I didn't know of that paper until now!)
 
 I agree that's too terse. I think it would be good to actually spell out 
 a complete example of the Receipt problem in the manual. That chapter in 
 the manual contains examples of anomalies in Read Committed mode, so 
 it would be good to give a concrete example of an anomaly in Repeatable 
 Read mode too. Want to write up a docs patch?

While this is not a doc patch I decided to give it some thought.  The bank
example was understandable enough for me so I simply tried to make it more
accessible.  I also didn't go and try to get it to conform to other,
existing, examples.  This is intended to replace the entire For example...
paragraph noted above.


While Repeatable Read provides for stable in-transaction reads, logical query
anomalies can result because commit order is not restricted and
serialization errors only occur if two transactions attempt to modify the
same record.

Consider a rule that, upon updating r1 OR r2, if r1+r2 < 0 then subtract an
additional 1 from the corresponding row.
Initial State: r1 = 0; r2 = 0
Transaction 1 Begins: reads (0,0); adds -10 to r1, notes r1 + r2 will be -10
and subtracts an additional 1
Transaction 2 Begins: reads (0,0); adds 20 to r2, notes r1 + r2 will be +20;
no further action needed
Commit 2
Transaction 3: reads (0,20) and commits
Commit 1
Transaction 4: reads (-11,20) and commits

However, if Transaction 2 commits first then, logically, the calculation of
r1 + r2 in Transaction 1 should result in a false outcome and the additional
subtraction of 1 should not occur - leaving T4 reading (-10,20).  

The ability for out-of-order commits is what allows T3 to read the pair
(0,20) which is logically impossible in the T2-before-T1 commit order with
T4 reading (-11,20).

Neither transaction fails since a serialization failure only occurs if a
concurrent update occurs to [ r1 (in T1) ] or to [ r2 (in T2) ]; The update
of [ r2 (in T1) ] is invisible - i.e., no failure occurs if a read value
undergoes a change.
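
Rendered as SQL, the schedule would be something like this (r1 and r2 as
single-row tables; purely illustrative):

    create table r1 (v int);  insert into r1 values (0);
    create table r2 (v int);  insert into r2 values (0);

    -- T1
    begin isolation level repeatable read;
    select (select v from r1) + (select v from r2);  -- sees 0; adding -10 goes negative
    update r1 set v = v - 10;
    update r1 set v = v - 1;                         -- the extra subtraction

    -- T2, concurrently
    begin isolation level repeatable read;
    select (select v from r1) + (select v from r2);  -- also sees 0; adding 20 stays positive
    update r2 set v = v + 20;
    commit;

    -- T3: select * from r1, r2;   sees (0, 20)
    -- T1: commit;
    -- T4: select * from r1, r2;   sees (-11, 20)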


Inspired by:
http://www.sigmod.org/publications/sigmod-record/0409/2.ROAnomONeil.pdf -
Example 1.3


David J.








Re: [HACKERS] json casts

2014-05-27 Thread Andrew Dunstan


On 05/27/2014 05:43 PM, Hannu Krosing wrote:

On 05/27/2014 11:00 PM, Tom Lane wrote:


See src/backend/utils/adt/json.c:json_categorize_type() lines 1280-1300.
When rendering some value as part of a json string, if a cast exists
from the data type to json, then the cast function is used to render the
json instead of the type's normal output function, but only if it's not
a builtin type.

How exactly would disabling that code have any effect on timestamp
rendering?  There's no cast to json from timestamps (nor any other
builtin type, except jsonb).

I think Andrew's idea was that, if a cast were used, one could fix the above
problem by defining a correct cast.



Right.





I'd be inclined to think a more useful answer to this issue would be to
make json.c special-case timestamps, as it already does for numerics.




OK, that's another approach.


But I agree that special-casing the code to use the de-facto json standard
of using ISO 8601 date representation is a better solution.

Just make sure you get the TZ part right - this is another place where
PostgreSQL often differs from other systems' understanding of ISO
timestamps.



The target would be something like:

   to_char($1,'YYYY-MM-DD"T"HH:MI:SS.USOF')


AIUI that should be legal ISO 8601. But I'm happy to defer to experts.


Given that this would be a hard coded behaviour change, is it too late 
to do this for 9.4?


cheers

andrew







Re: [HACKERS] json casts

2014-05-27 Thread Stephen Frost
* Andrew Dunstan (and...@dunslane.net) wrote:
 I'd be inclined to think a more useful answer to this issue would be to
 make json.c special-case timestamps, as it already does for numerics.
 
 
 
 OK, that's another approach.

I'm all for doing this for JSON, but it'd sure be nice to have the
option to get actual ISO 8601 from the built-ins too...

 Given that this would be a hard coded behaviour change, is it too
 late to do this for 9.4?

No, for my 2c.

Thanks,

Stephen




Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)

2014-05-27 Thread Peter Geoghegan
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Fair enough, but I don't think that affects correctness either way (I
 don't think that you meant to imply that this was a necessary
 precaution that you'd taken - right?).


 Right.

So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no
reason to not do so in all circumstances though, assuming that it's
desirable to do so as soon as possible (which I *don't* actually
assume). If I'm not mistaken, it's also true that it would be strictly
speaking correct to never do it there. Do you think it would be a fair
stress-test if I was to hack Postgres so that this call never happens
(within _bt_moveright())? I'd also have an incomplete page split occur
at random with a probability of, say, 1% per split. The mechanism
would be much the same as your original test-case for the split patch
- I'd throw an error at the wrong place, although only 1% of the time,
and over many hours.

The concern with the _bt_moveright() call of _bt_finish_split() is
that it might conceal a problem without reliably fixing it,
potentially making isolating that problem much harder.

-- 
Peter Geoghegan




Re: [HACKERS] json casts

2014-05-27 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Andrew Dunstan (and...@dunslane.net) wrote:
 Given that this would be a hard coded behaviour change, is it too
 late to do this for 9.4?

 No, for my 2c.

If we do it by adding casts then it'd require an initdb, so I'd vote
against that for 9.4.  If we just change behavior in json.c then that
objection doesn't apply, so I wouldn't complain.

regards, tom lane




Re: [HACKERS] json casts

2014-05-27 Thread Andrew Dunstan


On 05/27/2014 07:17 PM, Tom Lane wrote:

Stephen Frost sfr...@snowman.net writes:

* Andrew Dunstan (and...@dunslane.net) wrote:

Given that this would be a hard coded behaviour change, is it too
late to do this for 9.4?

No, for my 2c.

If we do it by adding casts then it'd require an initdb, so I'd vote
against that for 9.4.  If we just change behavior in json.c then that
objection doesn't apply, so I wouldn't complain.





I wasn't proposing to add a cast, just to allow users to add one if they 
wanted. But I'm quite happy to go with special-casing timestamps, and 
leave the larger question for another time.


cheers

andrew




Re: [HACKERS] Why is pg_lsn marking itself a preferred type?

2014-05-27 Thread Michael Paquier
On Wed, May 28, 2014 at 4:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 One of these doesn't belong:

 postgres=# select typname, typcategory from pg_type where typispreferred;
    typname   | typcategory
 -------------+-------------
  bool        | B
  text        | S
  oid         | N
  float8      | N
  inet        | I
  timestamptz | D
  interval    | T
  varbit      | V
  pg_lsn      | U
 (9 rows)

 Was there any actual rationale to this, or was it just somebody who did
 not understand what that bit is for?
Looks like an oversight of the pg_lsn patch. You could blame me for
that I suppose...

 I think it's probably mostly harmless given the lack of casts to or from
 pg_lsn, but it's still a darn bad idea to have any preferred types in the
 'U' category.  If we leave it like this it will bite us in the rear
 eventually.
 The most expedient response at this late date seems to be to change the
 entry in pg_type.h without bumping catversion.  That way at least it
 will be right in databases initdb'd after beta2.
Agreed. Attached patch fixes that, but I am sure that you already
figured it out.
-- 
Michael
diff --git a/src/include/catalog/pg_type.h b/src/include/catalog/pg_type.h
index a8abc0f..b7d9256 100644
--- a/src/include/catalog/pg_type.h
+++ b/src/include/catalog/pg_type.h
@@ -578,7 +578,7 @@ DESCR(UUID datatype);
 DATA(insert OID = 2951 ( _uuid			PGNSP PGUID -1 f b A f t \054 0 2950 0 array_in array_out array_recv array_send - - array_typanalyze i x f 0 -1 0 0 _null_ _null_ _null_ ));
 
 /* pg_lsn */
-DATA(insert OID = 3220 ( pg_lsn			PGNSP PGUID 8 FLOAT8PASSBYVAL b U t t \054 0 0 3221 pg_lsn_in pg_lsn_out pg_lsn_recv pg_lsn_send - - - d p f 0 -1 0 0 _null_ _null_ _null_ ));
+DATA(insert OID = 3220 ( pg_lsn			PGNSP PGUID 8 FLOAT8PASSBYVAL b U f t \054 0 0 3221 pg_lsn_in pg_lsn_out pg_lsn_recv pg_lsn_send - - - d p f 0 -1 0 0 _null_ _null_ _null_ ));
 DESCR(PostgreSQL LSN datatype);
 #define LSNOID			3220
 DATA(insert OID = 3221 ( _pg_lsn			PGNSP PGUID -1 f b A f t \054 0 3220 0 array_in array_out array_recv array_send - - array_typanalyze d x f 0 -1 0 0 _null_ _null_ _null_ ));



Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Pushed; thanks for working on this!

regards, tom lane




Re: [HACKERS] jsonb failed assertions

2014-05-27 Thread Peter Geoghegan
On Wed, May 21, 2014 at 4:53 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Hmm. The patch looks correct as far as it goes. But that function is still a
 bit funny. When it compares two identical arrays (or objects), and reaches
 the WJB_END_ARRAY token, it will still fall into the code that checks what
 the va and vb types are, and compares the last scalar values in that array
 again. That's wrong, and will fail if the compiler decides to clobber the
 local va or vb variables between iterations of the do-while loop,
 because JsonbIteratorNext() does not set the value when returning
 WJB_END_ARRAY.

That's an obsolete assumption that once actually applied during development.

Attached revision also adds handling for that case.

Thanks
-- 
Peter Geoghegan
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*** static void uniqueifyJsonbObject(JsonbVa
*** 62,74 
   *
   * There isn't a JsonbToJsonbValue(), because generally we find it more
   * convenient to directly iterate through the Jsonb representation and only
!  * really convert nested scalar values.  formIterIsContainer() does this, so
!  * that clients of the iteration code don't have to directly deal with the
!  * binary representation (JsonbDeepContains() is a notable exception, although
!  * all exceptions are internal to this module).  In general, functions that
!  * accept a JsonbValue argument are concerned with the manipulation of scalar
!  * values, or simple containers of scalar values, where it would be
!  * inconvenient to deal with a great amount of other state.
   */
  Jsonb *
  JsonbValueToJsonb(JsonbValue *val)
--- 62,74 
   *
   * There isn't a JsonbToJsonbValue(), because generally we find it more
   * convenient to directly iterate through the Jsonb representation and only
!  * really convert nested scalar values.  JsonbIteratorNext() does this, so that
!  * clients of the iteration code don't have to directly deal with the binary
!  * representation (JsonbDeepContains() is a notable exception, although all
!  * exceptions are internal to this module).  In general, functions that accept
!  * a JsonbValue argument are concerned with the manipulation of scalar values,
!  * or simple containers of scalar values, where it would be inconvenient to
!  * deal with a great amount of other state.
   */
  Jsonb *
  JsonbValueToJsonb(JsonbValue *val)
*** compareJsonbContainers(JsonbContainer *a
*** 137,149 
  		ra = JsonbIteratorNext(ita, va, false);
  		rb = JsonbIteratorNext(itb, vb, false);
  
- 		/*
- 		 * To a limited extent we'll redundantly iterate over an array/object
- 		 * while re-performing the same test without any reasonable
- 		 * expectation of the same container types having differing lengths
- 		 * (as when we process a WJB_BEGIN_OBJECT, and later the corresponding
- 		 * WJB_END_OBJECT), but no matter.
- 		 */
  		if (ra == rb)
  		{
  			if (ra == WJB_DONE)
--- 137,142 
*** compareJsonbContainers(JsonbContainer *a
*** 152,157 
--- 145,162 
  break;
  			}
  
+ 			if (ra == WJB_END_ARRAY || ra == WJB_END_OBJECT)
+ 			{
+ /*
+  * There is no array or object to safely compare at this stage
+  * of processing.  jbvArray/jbvObject values are only
+  * considered initially, during processing of WJB_BEGIN_ARRAY
+  * and WJB_BEGIN_OBJECT tokens.  This doesn't matter, because a
+  * second comparison would be redundant.
+  */
+ continue;
+ 			}
+ 
  			if (va.type == vb.type)
  			{
  switch (va.type)
*** compareJsonbContainers(JsonbContainer *a
*** 194,212 
  		else
  		{
  			/*
! 			 * It's safe to assume that the types differed.
  			 *
  			 * If the two values were the same container type, then there'd
  			 * have been a chance to observe the variation in the number of
! 			 * elements/pairs (when processing WJB_BEGIN_OBJECT, say).  They
! 			 * can't be scalar types either, because then they'd have to be
! 			 * contained in containers already ruled unequal due to differing
! 			 * numbers of pairs/elements, or already directly ruled unequal
! 			 * with a call to the underlying type's comparator.
  			 */
  			Assert(va.type != vb.type);
! 			Assert(va.type == jbvArray || va.type == jbvObject);
! 			Assert(vb.type == jbvArray || vb.type == jbvObject);
  			/* Type-defined order */
  			res = (va.type  vb.type) ? 1 : -1;
  		}
--- 199,227 
  		else
  		{
  			/*
! 			 * It's safe to assume that the types differed, and that the va and
! 			 * vb values passed were set.
  			 *
+ 			 * We don't have to consider the WJB_END_ARRAY and WJB_END_OBJECT
+ 			 * cases here, because in order to process those tokens we'd first
+ 			 * have to process WJB_BEGIN_ARRAY or WJB_BEGIN_OBJECT, and that
+ 			 * would be sufficient reason to end up here (with reliable values
+ 			 * set for va and vb), which always results in finishing the
+ 			 * comparison.
+ 			 */
+ 			Assert(ra != 

Re: [HACKERS] [9.4] Minor SSL/ECDH related doc fixes

2014-05-27 Thread Bruce Momjian
On Sat, May 17, 2014 at 10:36:59PM +0300, Marko Kreen wrote:
 - Clarify ECDH description in release notes.
 - Fix default value - it's 'prime256v1'.
 - List curves with good cross-platform support explicitly
   (NIST P-256 / P-384 / P-521).
 
 The -list_curves output is full of garbage, it's hard to know which
 ones make sense to use.  Only those three curves are supported
 cross-platform - OpenSSL/Java/Windows - so list them explicitly.
 
 Only reason to tune this value is changing overall security
 level up/down, so now this can be done safely and quickly.
 
 Only upwards though.  We could also list here NIST P-192/P-224
 (prime192v1, secp224r1), but those are not supported by Windows.
 And prime256v1 is quite fast already.
 
 In the future it might make philosophical sense to list
 also Brainpool curves (RFC7027), or some new curves from
 http://safecurves.cr.yp.to/ when they are brought to TLS.
 But currently only NIST/NSA curves are working option,
 so let's keep it simple for users.

Attached patch applied.  I shortened the release note description.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index d9e5985..4a666d0
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** include 'filename'
*** 1020,1032 
/term
listitem
 para
! Specifies the name of the curve to use in ECDH key exchanges.  The
! default is literalprime256p1/.
 /para
  
 para
! The list of available curves can be shown with the command
! literalopenssl ecparam -list_curves/literal.
 /para
/listitem
   /varlistentry
--- 1020,1042 
/term
listitem
 para
! Specifies the name of the curve to use in ECDH key exchange.
! It needs to be supported by all clients that connect.
! It does not need to be same curve as used by server's
! Elliptic Curve key.  The default is literalprime256v1/.  
 /para
  
 para
! OpenSSL names for most common curves:
! literalprime256v1/ (NIST P-256),
! literalsecp384r1/ (NIST P-384),
! literalsecp521r1/ (NIST P-521).
!/para
! 
!para
! The full list of available curves can be shown with the command
! literalopenssl ecparam -list_curves/literal.  Not all of them
! are usable in TLS though.
 /para
/listitem
   /varlistentry
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
new file mode 100644
index 24862fe..ae059f8
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***
*** 616,632 
 /para
  
 para
! Such keys are faster and have improved security over previous
! options. The new configuration
! parameter link linkend=guc-ssl-ecdh-curvevarnamessl_ecdh_curve//link
! controls which curve is used.
 /para
/listitem
  
listitem
 para
  Improve the default link
! linkend=guc-ssl-ciphersvarnamessl_ciphers//link ciphers
  (Marko Kreen)
 /para
/listitem
--- 616,633 
 /para
  
 para
! This allows use of Elliptic Curve keys for server authentication.
! Such keys are faster and have improved security over acronymRSA/ keys.
! The new configuration parameter
! link linkend=guc-ssl-ecdh-curvevarnamessl_ecdh_curve//link
! controls which curve is used for acronymECDH/.
 /para
/listitem
  
listitem
 para
  Improve the default link
! linkend=guc-ssl-ciphersvarnamessl_ciphers//link value
  (Marko Kreen)
 /para
/listitem



Re: [HACKERS] Proposing pg_hibernate

2014-05-27 Thread Gurjeet Singh
Please find attached the updated code of Postgres Hibernator. Notable
changes since the first proposal are:

.) The name has been changed to pg_hibernator (from pg_hibernate), to
avoid confusion with the ORM Hibernate.
.) Works with Postgres 9.4
.) Uses DynamicBackgroundWorker infrastructure.
.) Ability to restore one database at a time, to avoid random-read
storms. Can be disabled by parameter.
.) A couple of bug-fixes.
.) Detailed documentation.

I am pasting the README here (also included in the attachment).

Best regards,

Postgres Hibernator
===

This Postgres extension is a set-it-and-forget-it solution to save and restore
the Postgres shared-buffers contents, across Postgres server restarts.

It performs the automatic save and restore of database buffers, integrated with
database shutdown and startup, hence reducing the durations of
database maintenance
windows, in effect increasing the uptime of your applications.

Postgres Hibernator automatically saves the list of shared buffers to the disk
on database shutdown, and automatically restores the buffers on
database startup.
This acts pretty much like your Operating System's hibernate feature, except,
instead of saving the contents of the memory to disk, Postgres Hibernator saves
just a list of block identifiers. And it uses that list after startup to restore
the blocks from data directory into Postgres' shared buffers.

Why
--

DBAs are often faced with the task of performing some maintenance on their
database server(s) which requires shutting down the database. The maintenance
may involve anything from a database patch application, to a hardware upgrade.
One ugly side-effect of restarting the database server/service is that all the
data currently in database server's memory will be all lost, which was
painstakingly fetched from disk and put there in response to application queries
over time. And this data will have to be rebuilt as applications start querying
database again. The query response times will be very high until all the hot
data is fetched from disk and put back in memory again.

People employ a few tricks to get around this ugly truth, which range from
running a `select * from app_table;`, to `dd if=table_file ...`, to using
specialized utilities like pgfincore to prefetch data files into OS cache.
Wouldn't it be ideal if the database itself could save and restore its memory
contents across restarts!

The duration for which the server is building up caches, and trying to reach its
optimal cache performance is called ramp-up time. Postgres Hibernator is aimed
at reducing the ramp-up time of Postgres servers.

How
--

Compile and install the extension (you'll need a Postgres installation and its
`pg_config` in `$PATH`):

$ cd pg_hibernator
$ make install

Then.

1. Add `pg_hibernator` to the `shared_preload_libraries` variable in
`postgresql.conf` file.
2. Restart the Postgres server.
3. You are done.

How it works
--

This extension uses the `Background Worker` infrastructure of
Postgres, which was
introduced in Postgres 9.3. When the server starts, this extension registers
background workers; one for saving the buffers (called `Buffer Saver`) when the
server shuts down, and one for each database in the cluster (called
`Block Readers`)
for restoring the buffers saved during previous shutdown.

When the Postgres server is being stopped/shut down, the `Buffer
Saver` scans the
shared-buffers of Postgres, and stores the unique block identifiers of
each cached
block to the disk. This information is saved under the `$PGDATA/pg_hibernator/`
directory. For each database whose blocks are resident in shared buffers,
one file is created; e.g.: `$PGDATA/pg_hibernator/2.postgres.save`.

During the next startup sequence, the `Block Reader` threads are
registered, one for
each file present under `$PGDATA/pg_hibernator/` directory. When the
Postgres server
has reached stable state (that is, it's ready for database connections), these
`Block Reader` processes are launched. The `Block Reader` process
reads the save-files
looking for block-ids to restore. It then connects to the respective database,
and requests Postgres to fetch the blocks into shared-buffers.

Configuration
--

This extension can be controlled via the following parameters. These parameters
can be set in postgresql.conf or on postmaster's command-line.

- `pg_hibernator.enabled`

Setting this parameter to false disables the hibernator features. That is,
on server startup the BlockReader processes will not be launched, and on
server shutdown the list of blocks in shared buffers will not be saved.

    Note that the BufferSaver process exists at all times, even when this
parameter is set to `false`. This is to allow the DBA to enable/disable the
extension without having to restart the server. The BufferSaver process
checks this parameter during server startup and right before shutdown, 

Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-05-27 Thread Robert Haas
On Mon, May 26, 2014 at 10:39 AM, Stephen Frost sfr...@snowman.net wrote:
 * ash (a...@commandprompt.com) wrote:
 This came up recently on general list (and I've just hit the same issue 
 today):
   
 http://www.postgresql.org/message-id/cab7npqtlmmn1ltb5we0v0do57ip0u73ykwzbzytaxdf1caw...@mail.gmail.com

 Why couldn't postgres re-create the dependent views automatically?  I
 mean it knows which views depend on the altered column and keeps the
 view definition, no?

 Might be pretty complicated in the end..

 Would a patch likely to be accepted?  How hard do you feel this might be
 to implement?  Any caveat that defeats the purpose of such feature?

 It'd need to be explicitly requested, eg a 'CASCADE' option.

Why?  Would any sane person NOT want this behavior?

I think the question here is whether there's any way to make this work
at all, not whether we'd want it if we could get it.  Consider:

CREATE OR REPLACE VIEW v AS SELECT a + 0 FROM t;

If we drop the view, change the column type of t.a, and re-execute the
view, + might resolve to a different operator than before (or no
operator, at all).  Furthermore, the operator to which it resolves
will depend on the search path at the time the CREATE OR REPLACE VIEW
command is executed.
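
To make that concrete, here is a sketch of the hazard (the schema app and its +
operator are invented for the illustration; t and v are the table and view from
the example above):

CREATE SCHEMA app;
CREATE FUNCTION app.text_plus_int(text, integer) RETURNS text
    LANGUAGE sql AS $$ SELECT $1 || $2::text $$;
CREATE OPERATOR app.+ (LEFTARG = text, RIGHTARG = integer,
                       PROCEDURE = app.text_plus_int);

CREATE TABLE t (a integer);
CREATE VIEW v AS SELECT a + 0 FROM t;    -- resolves to integer + integer

-- the naive drop / alter / recreate approach:
DROP VIEW v;
ALTER TABLE t ALTER COLUMN a TYPE text USING a::text;
CREATE VIEW v AS SELECT a + 0 FROM t;
-- with search_path = public:       ERROR:  operator does not exist: text + integer
-- with search_path = public, app:  silently picks app.+ instead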

Now, consider the situation in which we want to achieve the same
result without having to drop and recreate v.  When the column type of
t.a is changed, we can use the dependencies to figure out that v might
be impacted.  We can trace through the rewrite rule to find out where
column t.a is referenced.  And ... then what?  All we know about t.a
is that we're applying some operator to it, which is specified by OID.
 The rewrite rule doesn't tell us the actual *name* by which the
operator was referenced in the original view text, nor does it tell us
the search path that was in effect at that time.  If it did, we could
pick the same operator for + that would have been used had t.a been of
the new type originally, but as it is, we can't.

Now maybe there are options other than trying to reproduce what the
original CREATE OR REPLACE statement would have done against the new
type.  For example, we could look through views that depend on t.a and
rewrite each reference to that column to t.a::oldtype.  This might
lead to odd results with multiple nested casts and generally funny
behavior if the column is re-typed multiple times; but maybe there's
some way to fix that.  Also, it might not really be the semantics you
want if you were hoping the type update would truly cascade.  But it
might still be better than a sharp stick in the eye, which is kinda
what we offer today.
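
For instance (a sketch of that alternative, not of anything that exists today):
if t.a started out as integer and is changed to bigint, the dependent view would
effectively be rewritten from

CREATE VIEW v AS SELECT a + 0 FROM t;

to

CREATE VIEW v AS SELECT (a::integer) + 0 FROM t;

so the original integer + integer operator keeps being used even though the
column is now bigint.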

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Peter Eisentraut
On 5/26/14, 1:25 PM, Tom Lane wrote:
 Assuming this works as advertised, does anyone have an objection to
 pushing it into 9.4?

Yes, it's way too late for that.




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Peter Eisentraut
On 5/27/14, 2:41 PM, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 27 May 2014 18:33:48 EEST, Tom Lane t...@sss.pgh.pa.us wrote:
 If we were going to do it like that, I'd vote for

 --with-uuid={ossp,e2fs,bsd}

 with no default at present (ie you can't say just --with-uuid,
 though we'd have the option to allow that in future).  But I doubt
 this is better than the --with-foo-uuid spelling.
 
 FWIW, --with-uuid=foo looks much clearer to me.
 
 OK, I'll make it so.

I'm shocked that this new feature has been committed post beta with less
than 48 hours of review time over a holiday weekend.  This issue has
been known for years.  Why does it need to be solved right now?






Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-27 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 I'm shocked that this new feature has been committed post beta with less
 than 48 hours of review time over a holiday weekend.  This issue has
 been known for years.  Why does it need to be solved right now?

As per the commit message: our packagers were screaming about the fact
that they couldn't build uuid-ossp anymore now that autoconf 2.69 has
tightened up its validity checks for headers.  If you don't like this
change, we can revert it and also revert the upgrade to 2.69.

I'm not terribly happy about pushing such a change post-beta1 either,
but it's not like this isn't something we've known was needed.  Anyway,
what's the worst case if we find a bug here?  Tell people not to use
uuid-ossp?

regards, tom lane




Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-05-27 Thread David G Johnston
Alexander Shulgin wrote
 Hi Hackers,
 
 This came up recently on general list (and I've just hit the same issue
 today):
  http://www.postgresql.org/message-id/CAB7nPqTLmMn1LTb5WE0v0dO57iP0U73yKwzbZytAXDF1CAWLZg@.gmail

 
 Why couldn't postgres re-create the dependent views automatically?  I
 mean it knows which views depend on the altered column and keeps the
 view definition, no?
 
 Would a patch be likely to be accepted?  How hard do you feel this might be
 to implement?  Any caveat that defeats the purpose of such a feature?
 
 Thanks.

Would it be possible to handle the specific case of varchar(n) to
varchar/text by just ignoring the error?

David J.








Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-05-27 Thread ash

Robert Haas robertmh...@gmail.com writes:

 It'd need to be explicitly requested, eg a 'CASCADE' option.

 Why?  Would any sane person NOT want this behavior?

 I think the question here is whether there's any way to make this work
 at all, not whether we'd want it if we could get it.  Consider:

 CREATE OR REPLACE VIEW v AS SELECT a + 0 FROM t;

 If we drop the view, change the column type of t.a, and re-execute the
 view, + might resolve to a different operator than before (or no
 operator, at all).  Furthermore, the operator to which it resolves
 will depend on the search path at the time the CREATE OR REPLACE VIEW
 command is executed.

 Now, consider the situation in which we want to achieve the same
 result without having to drop and recreate v.  When the column type of
 t.a is changed, we can use the dependencies to figure out that v might
 be impacted.  We can trace through the rewrite rule to find out where
 column t.a is referenced.  And ... then what?  All we know about t.a
 is that we're applying some operator to it, which is specified by OID.
  The rewrite rule doesn't tell us the actual *name* by which the
 operator was referenced in the original view text, nor does it tell us
 the search path that was in effect at that time.  If it did, we could
 pick the same operator for + that would have been used had t.a been of
 the new type originally, but as it is, we can't.

This could be a showstopper indeed.  We can look up the view definition in the
pg_views view, but it doesn't include any schema references unless they were
explicit in the CREATE VIEW statement.
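
For instance (a sketch; v is the view from the example upthread, and the exact
deparsed text varies by server version):

SELECT definition FROM pg_views WHERE viewname = 'v';
-- equivalently:
SELECT pg_get_viewdef('v'::regclass, true);

Either way, what comes back is the definition as deparsed right now, not the
text that was originally typed.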

On the other hand, pg_dump *can* work around this: if you dump a view that was
defined while a specific search_path was in effect, you'll get a correct
definition in the schema dump.

So why can't we try to learn from pg_dump?

 Now maybe there are options other than trying to reproduce what the
 original CREATE OR REPLACE statement would have done against the new
 type.  For example, we could look through views that depend on t.a and
 rewrite each reference to that column to t.a::oldtype.  This might
 lead to odd results with multiple nested casts and generally funny
 behavior if the column is re-typed multiple times; but maybe there's
 some way to fix that.  Also, it might not really be the semantics you
 want if you were hoping the type update would truly cascade.  But it
 might still be better than a sharp stick in the eye, which is kinda
 what we offer today.

No, casting back to oldtype totally defeats the purpose, at least for my
use case.

--
Alex




Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-05-27 Thread ash

David G Johnston david.g.johns...@gmail.com writes:

 Would it be possible to handle the specific case of varchar(n) to
 varchar/text by just ignoring the error?

Just for reference, my case is INT to BIGINT.

--
Alex




Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?

2014-05-27 Thread Robert Haas
On Tue, May 27, 2014 at 11:20 PM, ash a...@commandprompt.com wrote:
 Now, consider the situation in which we want to achieve the same
 result without having to drop and recreate v.  When the column type of
 t.a is changed, we can use the dependencies to figure out that v might
 be impacted.  We can trace through the rewrite rule to find out where
 column t.a is referenced.  And ... then what?  All we know about t.a
 is that we're applying some operator to it, which is specified by OID.
  The rewrite rule doesn't tell us the actual *name* by which the
 operator was referenced in the original view text, nor does it tell us
 the search path that was in effect at that time.  If it did, we could
 pick the same operator for + that would have been used had t.a been of
 the new type originally, but as it is, we can't.

 This could be a showstopper indeed.  We can look up view def in pg_views
 view, but it doesn't include any schema references unless they were
 explicit in the CREATE VIEW statement.

 On the other hand, pg_dump *can* work around this: if you dump a view
 that has been defined when a specific search_path was in effect, you'll
 get correct definition in the schema dump.

 So why can't we try to learn from pg_dump?

Well, pg_dump is trying to do something different than what you're
trying to do here.  pg_dump wants to make sure that the view, when fed
back into psql, creates the same view that exists now, regardless of
whether that's what the user created originally.  For example, if a
view is created referring to table foo, and table foo is later renamed
to bar, then pg_dump wants to (and does) dump a statement referring to
bar, not foo - even if there's a *new* table called foo against which
the view could have been defined.  Similarly, pg_dump will
schema-qualify functions and operators, or not, based on whether
that's necessary to reference the exact same operators that were
selected when the original CREATE VIEW command was run, regardless of
whether the original references were schema-qualified.  None of that
involves answering hypothetical questions; but what you want to do
does, and that I think is the problem in a nutshell.
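
The rename case is easy to see even without running pg_dump, since pg_dump gets
the view text from essentially the same deparsing machinery (a sketch; the table
and view names are made up):

CREATE TABLE foo (x integer);
CREATE VIEW v2 AS SELECT x FROM foo;
ALTER TABLE foo RENAME TO bar;

SELECT pg_get_viewdef('v2'::regclass, true);
-- returns something like "SELECT bar.x FROM bar", i.e. the view now
-- mentions bar, not foo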

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] json casts

2014-05-27 Thread Robert Haas
On Tue, May 27, 2014 at 5:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'd be inclined to think a more useful answer to this issue would be to
 make json.c special-case timestamps, as it already does for numerics.

I wonder if anyone besides me is nervous about changing the semantics
here.  It seems like the sort of backward-compatibility break we
normally avoid.  If we do make such a compatibility break, it should
certainly be release-noted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Spreading full-page writes

2014-05-27 Thread Amit Kapila
On Tue, May 27, 2014 at 1:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, May 27, 2014 at 3:57 PM, Simon Riggs si...@2ndquadrant.com
wrote:
  The requirements we were discussing were around
 
  A) reducing WAL volume
  B) reducing foreground overhead of writing FPWs - which spikes badly
  after checkpoint and the overhead is paid by the user processes
  themselves
  C) need for FPWs during base backup
 
  So that gives us a few approaches
 
  * Compressing FPWs gives A
  * Background FPWs gives us B
 which look like we can combine both ideas
 
  * Double-buffering would give us A and B, but not C
 and would be incompatible with other two ideas

 Double-buffering would allow us to disable FPW safely, but it would make
 recovery slow.

Is that because, during recovery, it needs to check the contents of the
double buffer as well as the page in its original location for consistency,
or is there something else that would make recovery slow?

Won't DBW (double-buffer write) reduce the number of pages that need to be
read from disk compared to FPW, which should make up for any performance
degradation from other effects?

IIUC, in the DBW mechanism we need a temporary sequential log file of fixed
size, to which data is written before it gets written to its actual location
in the tablespace.  Since the temporary log file is of fixed size, the number
of pages that need to be read during recovery should be smaller than with FPW,
because with FPW we have to read all the pages written to the WAL log after
the last successful checkpoint.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com