Re: [HACKERS] Spreading full-page writes
On 27 May 2014 18:18, Jeff Janes wrote:
> On Mon, May 26, 2014 at 8:15 PM, Robert Haas wrote:
>>
>> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas wrote:
>> >> I don't think we know that. The server might have crashed before that
>> >> second record got generated. (This appears to be an unfixable flaw in
>> >> this proposal.)
>> >
>> > The second record is generated before the checkpoint is finished and the
>> > checkpoint record is written. So it will be there.
>> >
>> > (if you crash before the checkpoint is finished, the in-progress
>> > checkpoint is no good for recovery anyway, and won't be used)
>>
>> Hmm, I see.
>>
>> It's not great to have to generate WAL at buffer-eviction time,
>> though. Normally, when we go to evict a buffer, the WAL is already
>> written. We might have to wait for it to be flushed, but if the WAL
>> writer is doing its job, hopefully not. But here we'll definitely
>> have to wait for the WAL flush.
>
> I'm not sure we do need to flush it. If the checkpoint finishes, then the
> WAL surely got flushed as part of the process of recording the end of the
> checkpoint. If the checkpoint does not finish, recovery will start from
> the previous checkpoint, which does contain the FPW (because if it didn't,
> the page would not be eligible for this treatment) and so the possibly
> torn page will get overwritten in full.

I think Robert is correct: you would need to flush WAL before writing the
disk buffer. That is the current invariant of WAL-before-data.

However, we don't need to do this in the naive way (FPW, flush, write
buffer, one page at a time); we can do it with more buffering. So it seems
like a reasonable idea to do this using a 64-buffer BufferAccessStrategy
object and flush the WAL once every 64 buffers. That's beginning to look
more like double buffering, though...

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
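To make the batching idea concrete, here is a minimal C sketch. EvictionItem,
evict_batch and write_buffer_to_disk are hypothetical names invented for
illustration; XLogFlush() is the real routine that enforces the
WAL-before-data invariant:

    #define BATCH_SIZE 64           /* evictions per WAL flush */

    typedef struct EvictionItem
    {
        Buffer      buf;            /* dirty buffer to write out */
        XLogRecPtr  fpw_lsn;        /* LSN of the FPW record emitted for it */
    } EvictionItem;

    static void
    evict_batch(EvictionItem *items, int nitems)
    {
        XLogRecPtr  max_lsn = InvalidXLogRecPtr;
        int         i;

        /* Find the highest LSN in the batch; one flush covers them all. */
        for (i = 0; i < nitems; i++)
        {
            if (items[i].fpw_lsn > max_lsn)
                max_lsn = items[i].fpw_lsn;
        }
        XLogFlush(max_lsn);

        /* WAL is durable up to max_lsn; data pages may now be written. */
        for (i = 0; i < nitems; i++)
            write_buffer_to_disk(items[i].buf); /* hypothetical helper */
    }

The trade-off is the one named above: the pages sit in the batch until the
single flush happens, which is halfway to a double-write buffer.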
Re: [HACKERS] Spreading full-page writes
On 27 May 2014 13:20, Heikki Linnakangas wrote:
> On 05/27/2014 03:18 PM, Simon Riggs wrote:
>>
>> IIRC Koichi had a patch for prefetch during recovery. Heikki, is that
>> the reason you also discussed changing the WAL record format to allow
>> us to identify the blocks touched by recovery more easily?
>
> Yeah, that was one use case I had in mind for the WAL format changes. See
> http://www.postgresql.org/message-id/533d6cbf.6080...@vmware.com.

Those proposals suggest some very big changes to the way WAL works.
Prefetch can work easily enough for most records - do we really need that
much churn? You mentioned Btree vacuum records, but I'm planning to
optimize those another way.

Why don't we just have the prefetch code in core and forget the WAL format
changes?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
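For reference, recovery prefetch can be sketched without any WAL format
change, at the cost of per-record-type block decoding. BlockRef and
decode_block_refs() below are hypothetical; smgropen() and smgrprefetch()
are the real storage-manager calls usable during recovery:

    typedef struct BlockRef
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
    } BlockRef;

    #define MAX_BLOCK_REFS 8

    static void
    prefetch_record_blocks(XLogRecord *record)
    {
        BlockRef    refs[MAX_BLOCK_REFS];
        int         nrefs;
        int         i;

        /*
         * Hypothetical decoder: today this needs knowledge of every record
         * type; the WAL format changes under discussion would make it
         * generic.
         */
        nrefs = decode_block_refs(record, refs);

        for (i = 0; i < nrefs; i++)
        {
            SMgrRelation reln = smgropen(refs[i].rnode, InvalidBackendId);

            /* Hint the kernel to start reading the block; non-blocking. */
            smgrprefetch(reln, refs[i].forknum, refs[i].blkno);
        }
    }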
Re: [HACKERS] json casts
On 05/27/2014 11:55 PM, Robert Haas wrote:
> On Tue, May 27, 2014 at 5:00 PM, Tom Lane wrote:
>> I'd be inclined to think a more useful answer to this issue would be to
>> make json.c special-case timestamps, as it already does for numerics.
>
> I wonder if anyone besides me is nervous about changing the semantics
> here. It seems like the sort of backward-compatibility break we
> normally avoid. If we do make such a compatibility break, it should
> certainly be release-noted.

Yes, certainly it should be.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposing pg_hibernate
On Wed, May 28, 2014 at 7:31 AM, Gurjeet Singh wrote:
> Caveats
> -------
>
> - Buffer list is saved only when Postgres is shutdown in "smart" and
>   "fast" modes.
>
>   That is, the buffer list is not saved when the database crashes, or on
>   "immediate" shutdown.
>
> - A reduction in `shared_buffers` is not detected.
>
>   If `shared_buffers` is reduced across a restart, and if the combined
>   saved buffer list is larger than the new shared_buffers, Postgres
>   Hibernator continues to read and restore blocks even after
>   `shared_buffers` worth of buffers have been restored.

How about the cases when shared buffers already contain some data:

a. Before the Readers start filling shared buffers, if this cluster wishes
to join replication as a slave and receive data from the master, then this
utility might need to evict some buffers filled during the startup phase.

b. As soon as the server completes startup (reaches the consistent point),
it allows new connections, which can also use some shared buffers before a
Reader process could use them - or are you planning to change the time
when users can connect to the database?

I am not sure if replacing shared-buffer contents in such cases can always
be considered useful.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
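For what it's worth, the missing check described in the second caveat would
amount to something like the following C sketch (SavedBlock and
restore_one_block() are hypothetical; NBuffers is the real global holding
the size of shared_buffers):

    static void
    restore_saved_blocks(SavedBlock *blocks, int nsaved)
    {
        int     restored = 0;
        int     i;

        for (i = 0; i < nsaved; i++)
        {
            /*
             * Stop once shared_buffers worth of blocks have been read, so
             * that a shrunken shared_buffers doesn't make the restore evict
             * its own earlier work (or, per the concerns above, blocks
             * already pulled in by recovery and early connections).
             */
            if (restored >= NBuffers)
                break;

            restore_one_block(&blocks[i]);  /* hypothetical helper */
            restored++;
        }
    }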
Re: [HACKERS] Spreading full-page writes
On Tue, May 27, 2014 at 1:19 PM, Fujii Masao wrote:
> On Tue, May 27, 2014 at 3:57 PM, Simon Riggs wrote:
> > The requirements we were discussing were around
> >
> > A) reducing WAL volume
> > B) reducing foreground overhead of writing FPWs - which spikes badly
> >    after checkpoint and the overhead is paid by the user processes
> >    themselves
> > C) need for FPWs during base backup
> >
> > So that gives us a few approaches
> >
> > * Compressing FPWs gives A
> > * Background FPWs gives us B
> >   which look like we can combine both ideas
> >
> > * Double-buffering would give us A and B, but not C
> >   and would be incompatible with other two ideas
>
> Double-buffering would allow us to disable FPW safely but which would
> make a recovery slow.

Is that because, during recovery, it needs to check the contents of the
double buffer as well as the page in its original location for
consistency, or is there something else that would make recovery slow?

Won't DBW (double-buffer write) reduce the number of pages that need to be
read from disk compared to FPW, which should offset any performance
degradation from other causes? IIUC, in the DBW mechanism we need a
temporary sequential log file of fixed size, which is used to write data
before the data gets written to its actual location in the tablespace.
Since the temporary log file is of fixed size, the number of pages that
need to be read during recovery should be smaller than with FPW, because
with FPW recovery needs to read all the pages written to WAL after the
last successful checkpoint.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
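A rough C sketch of the recovery pass being described - this is not an
existing PostgreSQL mechanism, and page_checksum_ok() and
write_page_to_home_location() are hypothetical; only the small fixed-size
double-write file is scanned, rather than every FPW in WAL since the last
checkpoint:

    #include <unistd.h>

    static void
    recover_from_double_write_file(int dw_fd)
    {
        char    page[BLCKSZ];

        while (read(dw_fd, page, BLCKSZ) == BLCKSZ)
        {
            /*
             * A torn copy inside the DW file means the crash happened
             * before the home location was overwritten, so the home page
             * is still intact; skip it.
             */
            if (!page_checksum_ok(page))
                continue;

            /* DW copy is good; the home page may be torn, so restore it. */
            write_page_to_home_location(page);
        }
    }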
Re: [HACKERS] json casts
On Tue, May 27, 2014 at 5:00 PM, Tom Lane wrote:
> I'd be inclined to think a more useful answer to this issue would be to
> make json.c special-case timestamps, as it already does for numerics.

I wonder if anyone besides me is nervous about changing the semantics
here. It seems like the sort of backward-compatibility break we normally
avoid. If we do make such a compatibility break, it should certainly be
release-noted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
On Tue, May 27, 2014 at 11:20 PM, ash wrote:
>> Now, consider the situation in which we want to achieve the same
>> result without having to drop and recreate v. When the column type of
>> t.a is changed, we can use the dependencies to figure out that v might
>> be impacted. We can trace through the rewrite rule to find out where
>> column t.a is referenced. And ... then what? All we know about t.a
>> is that we're applying some operator to it, which is specified by OID.
>> The rewrite rule doesn't tell us the actual *name* by which the
>> operator was referenced in the original view text, nor does it tell us
>> the search path that was in effect at that time. If it did, we could
>> pick the same operator for + that would have been used had t.a been of
>> the new type originally, but as it is, we can't.
>
> This could be a showstopper indeed. We can look up the view def in the
> pg_views view, but it doesn't include any schema references unless they
> were explicit in the CREATE VIEW statement.
>
> On the other hand, pg_dump *can* work around this: if you dump a view
> that has been defined when a specific search_path was in effect, you'll
> get the correct definition in the schema dump.
>
> So why can't we try to learn from pg_dump?

Well, pg_dump is trying to do something different than what you're trying
to do here. pg_dump wants to make sure that the view, when fed back into
psql, creates the same view that exists now, regardless of whether that's
what the user created originally. For example, if a view is created
referring to table foo, and table foo is later renamed to bar, then
pg_dump wants to (and does) dump a statement referring to bar, not foo -
even if there's a *new* table called foo against which the view could
have been defined. Similarly, pg_dump will schema-qualify functions and
operators, or not, based on whether that's necessary to reference the
exact same operators that were selected when the original CREATE VIEW
command was run, regardless of whether the original references were
schema-qualified. None of that involves answering hypothetical questions;
but what you want to do does, and that I think is the problem in a
nutshell.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
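A quick SQL illustration of the rename-following behavior described above
(hypothetical session; the exact pg_get_viewdef() formatting varies):

    CREATE TABLE foo (x int);
    CREATE VIEW v AS SELECT x FROM foo;
    ALTER TABLE foo RENAME TO bar;
    CREATE TABLE foo (x int);   -- a brand-new table that happens to be foo

    SELECT pg_get_viewdef('v'::regclass);
    --  SELECT bar.x FROM bar;  -- the view tracked the rename by OID, so
    --                          -- pg_dump dumps it against bar, not foo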
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
David G Johnston writes:
> Would it be possible to handle the specific case of varchar(n) to
> varchar/text by just ignoring the error?

Just for reference, my case is INT to BIGINT.

--
Alex

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
Robert Haas writes:
>>> It'd need to be explicitly requested, eg a 'CASCADE' option.
>
> Why? Would any sane person NOT want this behavior?
>
> I think the question here is whether there's any way to make this work
> at all, not whether we'd want it if we could get it. Consider:
>
> CREATE OR REPLACE VIEW v AS SELECT a + 0 FROM t;
>
> If we drop the view, change the column type of t.a, and re-execute the
> view, + might resolve to a different operator than before (or no
> operator, at all). Furthermore, the operator to which it resolves
> will depend on the search path at the time the CREATE OR REPLACE VIEW
> command is executed.
>
> Now, consider the situation in which we want to achieve the same
> result without having to drop and recreate v. When the column type of
> t.a is changed, we can use the dependencies to figure out that v might
> be impacted. We can trace through the rewrite rule to find out where
> column t.a is referenced. And ... then what? All we know about t.a
> is that we're applying some operator to it, which is specified by OID.
> The rewrite rule doesn't tell us the actual *name* by which the
> operator was referenced in the original view text, nor does it tell us
> the search path that was in effect at that time. If it did, we could
> pick the same operator for + that would have been used had t.a been of
> the new type originally, but as it is, we can't.

This could be a showstopper indeed. We can look up the view def in the
pg_views view, but it doesn't include any schema references unless they
were explicit in the CREATE VIEW statement.

On the other hand, pg_dump *can* work around this: if you dump a view
that has been defined when a specific search_path was in effect, you'll
get the correct definition in the schema dump.

So why can't we try to learn from pg_dump?

> Now maybe there are options other than trying to reproduce what the
> original CREATE OR REPLACE statement would have done against the new
> type. For example, we could look through views that depend on t.a and
> rewrite each reference to that column to t.a::oldtype. This might
> lead to odd results with multiple nested casts and generally funny
> behavior if the column is re-typed multiple times; but maybe there's
> some way to fix that. Also, it might not really be the semantics you
> want if you were hoping the type update would truly cascade. But it
> might still be better than a sharp stick in the eye, which is kinda
> what we offer today.

No, casting back to oldtype totally defeats the purpose, at least for my
use case.

--
Alex

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
Alexander Shulgin wrote
> Hi Hackers,
>
> This came up recently on general list (and I've just hit the same issue
> today):
> http://www.postgresql.org/message-id/CAB7nPqTLmMn1LTb5WE0v0dO57iP0U73yKwzbZytAXDF1CAWLZg@.gmail
>
> Why couldn't postgres re-create the dependent views automatically? I
> mean it knows which views depend on the altered column and keeps the
> view definition, no?
>
> Would a patch likely be accepted? How hard do you feel this might be
> to implement? Any caveat that defeats the purpose of such feature?
>
> Thanks.

Would it be possible to handle the specific case of varchar(n) to
varchar/text by just ignoring the error?

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-create-dependent-views-on-ALTER-TABLE-ALTER-COLUMN-TYPE-tp5804972p5805191.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Peter Eisentraut writes:
> I'm shocked that this new feature has been committed post beta with less
> than 48 hours of review time over a holiday weekend. This issue has
> been known for years. Why does it need to be solved right now?

As per the commit message: our packagers were screaming about the fact
that they couldn't build uuid-ossp anymore now that autoconf 2.69 has
tightened up its validity checks for headers.

If you don't like this change, we can revert it and also revert the
upgrade to 2.69. I'm not terribly happy about pushing such a change
post-beta1 either, but it's not like this isn't something we've known was
needed.

Anyway, what's the worst case if we find a bug here? Tell people not to
use uuid-ossp?

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 5/27/14, 2:41 PM, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 27 May 2014 18:33:48 EEST, Tom Lane wrote:
>>> If we were going to do it like that, I'd vote for
>>>
>>> --with-uuid={ossp,e2fs,bsd}
>>>
>>> with no default at present (ie you can't say just "--with-uuid",
>>> though we'd have the option to allow that in future). But I doubt
>>> this is better than the --with-foo-uuid spelling.
>
>> FWIW, --with-uuid=foo looks much clearer to me.
>
> OK, I'll make it so.

I'm shocked that this new feature has been committed post beta with less
than 48 hours of review time over a holiday weekend. This issue has been
known for years. Why does it need to be solved right now?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 5/26/14, 1:25 PM, Tom Lane wrote:
> Assuming this works as advertised, does anyone have an objection to
> pushing it into 9.4?

Yes, it's way too late for that.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
On Mon, May 26, 2014 at 10:39 AM, Stephen Frost wrote:
> * ash (a...@commandprompt.com) wrote:
>> This came up recently on general list (and I've just hit the same issue
>> today):
>>
>> http://www.postgresql.org/message-id/cab7npqtlmmn1ltb5we0v0do57ip0u73ykwzbzytaxdf1caw...@mail.gmail.com
>>
>> Why couldn't postgres re-create the dependent views automatically? I
>> mean it knows which views depend on the altered column and keeps the
>> view definition, no?
>
> Might be pretty complicated in the end..
>
>> Would a patch likely be accepted? How hard do you feel this might be
>> to implement? Any caveat that defeats the purpose of such feature?
>
> It'd need to be explicitly requested, eg a 'CASCADE' option.

Why? Would any sane person NOT want this behavior?

I think the question here is whether there's any way to make this work at
all, not whether we'd want it if we could get it. Consider:

CREATE OR REPLACE VIEW v AS SELECT a + 0 FROM t;

If we drop the view, change the column type of t.a, and re-execute the
view, + might resolve to a different operator than before (or no
operator, at all). Furthermore, the operator to which it resolves will
depend on the search path at the time the CREATE OR REPLACE VIEW command
is executed.

Now, consider the situation in which we want to achieve the same result
without having to drop and recreate v. When the column type of t.a is
changed, we can use the dependencies to figure out that v might be
impacted. We can trace through the rewrite rule to find out where column
t.a is referenced. And ... then what? All we know about t.a is that we're
applying some operator to it, which is specified by OID. The rewrite rule
doesn't tell us the actual *name* by which the operator was referenced in
the original view text, nor does it tell us the search path that was in
effect at that time. If it did, we could pick the same operator for +
that would have been used had t.a been of the new type originally, but as
it is, we can't.

Now maybe there are options other than trying to reproduce what the
original CREATE OR REPLACE statement would have done against the new
type. For example, we could look through views that depend on t.a and
rewrite each reference to that column to t.a::oldtype. This might lead to
odd results with multiple nested casts and generally funny behavior if
the column is re-typed multiple times; but maybe there's some way to fix
that. Also, it might not really be the semantics you want if you were
hoping the type update would truly cascade. But it might still be better
than a sharp stick in the eye, which is kinda what we offer today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
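To make the operator-resolution problem concrete, a hypothetical session
(the types and names are illustrative, and the error text is roughly what
current servers emit):

    CREATE TABLE t (a int4);
    CREATE VIEW v AS SELECT a + 0 AS a0 FROM t;  -- "+" resolves to
                                                 -- int4 + int4; that
                                                 -- operator's OID is frozen
                                                 -- into v's rewrite rule

    ALTER TABLE t ALTER COLUMN a TYPE numeric;
    -- ERROR:  cannot alter type of a column used by a view or rule

    -- If the view were transparently recreated instead, "+" would have to
    -- be re-resolved by name, now to numeric + numeric - but the rewrite
    -- rule records only the old operator's OID, not the name "+" or the
    -- search path that was in effect when the view was created.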
Re: [HACKERS] Proposing pg_hibernate
Please find attached the updated code of Postgres Hibernator. Notable
changes since the first proposal are:

.) The name has been changed to pg_hibernator (from pg_hibernate), to
   avoid confusion with the ORM Hibernate.
.) Works with Postgres 9.4.
.) Uses the DynamicBackgroundWorker infrastructure.
.) Ability to restore one database at a time, to avoid random-read
   storms. Can be disabled by parameter.
.) A couple of bug-fixes.
.) Detailed documentation.

I am pasting the README here (also included in the attachment).

Best regards,

Postgres Hibernator
===================

This Postgres extension is a set-it-and-forget-it solution to save and
restore the Postgres shared-buffers contents across Postgres server
restarts.

It performs the automatic save and restore of database buffers,
integrated with database shutdown and startup, hence reducing the
durations of database maintenance windows, in effect increasing the
uptime of your applications.

Postgres Hibernator automatically saves the list of shared buffers to
disk on database shutdown, and automatically restores the buffers on
database startup. This acts pretty much like your Operating System's
hibernate feature, except, instead of saving the contents of the memory
to disk, Postgres Hibernator saves just a list of block identifiers. And
it uses that list after startup to restore the blocks from the data
directory into Postgres' shared buffers.

Why
---

DBAs are often faced with the task of performing some maintenance on
their database server(s) which requires shutting down the database. The
maintenance may involve anything from a database patch application to a
hardware upgrade. One ugly side-effect of restarting the database
server/service is that all the data currently in the database server's
memory will be lost, which was painstakingly fetched from disk and put
there in response to application queries over time. And this data will
have to be rebuilt as applications start querying the database again. The
query response times will be very high until all the "hot" data is
fetched from disk and put back in memory again.

People employ a few tricks to get around this ugly truth, which range
from running a `select * from app_table;`, to `dd if=table_file ...`, to
using specialized utilities like pgfincore to prefetch data files into
the OS cache. Wouldn't it be ideal if the database itself could save and
restore its memory contents across restarts!

The duration for which the server is building up caches, and trying to
reach its optimal cache performance, is called ramp-up time. Postgres
Hibernator is aimed at reducing the ramp-up time of Postgres servers.

How
---

Compile and install the extension (you'll need a Postgres installation
and its `pg_config` in `$PATH`):

    $ cd pg_hibernator
    $ make install

Then:

1. Add `pg_hibernator` to the `shared_preload_libraries` variable in the
   `postgresql.conf` file.
2. Restart the Postgres server.
3. You are done.

How it works
------------

This extension uses the `Background Worker` infrastructure of Postgres,
which was introduced in Postgres 9.3. When the server starts, this
extension registers background workers; one for saving the buffers
(called `Buffer Saver`) when the server shuts down, and one for each
database in the cluster (called `Block Readers`) for restoring the
buffers saved during the previous shutdown.

When the Postgres server is being stopped/shut down, the `Buffer Saver`
scans the shared-buffers of Postgres, and stores the unique block
identifiers of each cached block to the disk.
This information is saved under the `$PGDATA/pg_hibernator/` directory.
For each of the databases whose blocks are resident in shared buffers,
one file is created; e.g. `$PGDATA/pg_hibernator/2.postgres.save`.

During the next startup sequence, the `Block Reader` threads are
registered, one for each file present under the `$PGDATA/pg_hibernator/`
directory. When the Postgres server has reached a stable state (that is,
it's ready for database connections), these `Block Reader` processes are
launched. The `Block Reader` process reads the save-files looking for
block-ids to restore. It then connects to the respective database, and
requests Postgres to fetch the blocks into shared-buffers (a minimal
sketch of that loop follows below).

Configuration
-------------

This extension can be controlled via the following parameters. These
parameters can be set in postgresql.conf or on the postmaster's
command-line.

- `pg_hibernator.enabled`

  Setting this parameter to false disables the hibernator features. That
  is, on server startup the BlockReader processes will not be launched,
  and on server shutdown the list of blocks in shared buffers will not
  be saved.

  Note that the BufferSaver process exists at all times, even when this
  parameter is set to `false`. This is to allow the DBA to enable/disable
  the extension without having to restart the server. The BufferSaver
  process checks this parameter during server startup and right before
  shutdown,
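The Block Reader loop referenced above could look roughly like this in C.
SavedBlock and read_next_saved_block() are hypothetical stand-ins for
pg_hibernator's actual save-file format; relation_open(),
ReadBufferExtended() and ReleaseBuffer() are the real server APIs:

    typedef struct SavedBlock
    {
        Oid         relid;
        ForkNumber  forknum;
        BlockNumber blocknum;
    } SavedBlock;

    static void
    restore_blocks_for_database(FILE *savefile)
    {
        SavedBlock  sb;

        while (read_next_saved_block(savefile, &sb))    /* hypothetical */
        {
            Relation    rel = relation_open(sb.relid, AccessShareLock);
            Buffer      buf;

            /* Pull the block into shared_buffers, then drop our pin. */
            buf = ReadBufferExtended(rel, sb.forknum, sb.blocknum,
                                     RBM_NORMAL, NULL);
            ReleaseBuffer(buf);

            relation_close(rel, AccessShareLock);
        }
    }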
Re: [HACKERS] [9.4] Minor SSL/ECDH related doc fixes
On Sat, May 17, 2014 at 10:36:59PM +0300, Marko Kreen wrote:
> - Clarify ECDH description in release notes.
> - Fix default value - it's 'prime256v1'.
> - List curves with good cross-platform support explicitly
>   (NIST P-256 / P-384 / P-521).
>
> The -list_curves output is full of garbage, it's hard to know which
> ones make sense to use. Only those three curves are supported
> cross-platform - OpenSSL/Java/Windows - so list them explicitly.
>
> Only reason to tune this value is changing overall security
> level up/down, so now this can be done safely and quickly.
>
> Only upwards though. We could also list here NIST P-192/P-224
> (prime192v1, secp224r1), but those are not supported by Windows.
> And prime256v1 is quite fast already.
>
> In the future it might make philosophical sense to list
> also Brainpool curves (RFC7027), or some new curves from
> http://safecurves.cr.yp.to/ when they are brought to TLS.
> But currently only NIST/NSA curves are a working option,
> so let's keep it simple for users.

Attached patch applied. I shortened the release note description.

--
  Bruce Momjian                   http://momjian.us
  EnterpriseDB                    http://enterprisedb.com

  + Everyone has their own god. +

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index d9e5985..4a666d0
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** include 'filename'
*** 1020,1032 ****
!        Specifies the name of the curve to use in ECDH key exchanges. The
!        default is prime256p1.
!        The list of available curves can be shown with the command
!        openssl ecparam -list_curves.
--- 1020,1042 ----
!        Specifies the name of the curve to use in ECDH key exchange.
!        It needs to be supported by all clients that connect.
!        It does not need to be same curve as used by server's
!        Elliptic Curve key. The default is prime256v1.
!
!        OpenSSL names for most common curves:
!        prime256v1 (NIST P-256),
!        secp384r1 (NIST P-384),
!        secp521r1 (NIST P-521).
!
!        The full list of available curves can be shown with the command
!        openssl ecparam -list_curves. Not all of them
!        are usable in TLS though.
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
new file mode 100644
index 24862fe..ae059f8
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***************
*** 616,632 ****
!        Such keys are faster and have improved security over previous
!        options. The new configuration
!        parameter ssl_ecdh_curve
!        controls which curve is used.

         Improve the default ssl_ciphers ciphers (Marko Kreen)
--- 616,633 ----
!        This allows use of Elliptic Curve keys for server authentication.
!        Such keys are faster and have improved security over RSA keys.
!        The new configuration parameter
!        ssl_ecdh_curve
!        controls which curve is used for ECDH.

         Improve the default ssl_ciphers value (Marko Kreen)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
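Per the guidance above, moving the security level up from the default is a
one-line configuration change (illustrative postgresql.conf excerpt):

    # postgresql.conf
    ssl = on
    ssl_ecdh_curve = 'secp384r1'   # NIST P-384; default is 'prime256v1'
                                   # (NIST P-256)

All three listed curves (prime256v1, secp384r1, secp521r1) appear in the
output of `openssl ecparam -list_curves`, which is how a chosen name can be
checked against the local OpenSSL build.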
Re: [HACKERS] jsonb failed assertions
On Wed, May 21, 2014 at 4:53 AM, Heikki Linnakangas wrote:
> Hmm. The patch looks correct as far as it goes. But that function is
> still a bit funny. When it compares two identical arrays (or objects),
> and reaches the WJB_END_ARRAY token, it will still fall into the code
> that checks what the va and vb types are, and compares the last scalar
> values in that array again. That's wrong, and will fail if the compiler
> decides to clobber the local "va" or "vb" variables between iterations
> of the do-while loop, because JsonbIteratorNext() does not set the value
> when returning WJB_END_ARRAY.

That's an obsolete assumption that once actually applied during
development. Attached revision also adds handling for that case.

Thanks
--
Peter Geoghegan

*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*************** static void uniqueifyJsonbObject(JsonbVa
*** 62,74 ****
   *
   * There isn't a JsonbToJsonbValue(), because generally we find it more
   * convenient to directly iterate through the Jsonb representation and only
!  * really convert nested scalar values. formIterIsContainer() does this, so
!  * that clients of the iteration code don't have to directly deal with the
!  * binary representation (JsonbDeepContains() is a notable exception, although
!  * all exceptions are internal to this module). In general, functions that
!  * accept a JsonbValue argument are concerned with the manipulation of scalar
!  * values, or simple containers of scalar values, where it would be
!  * inconvenient to deal with a great amount of other state.
   */
  Jsonb *
  JsonbValueToJsonb(JsonbValue *val)
--- 62,74 ----
   *
   * There isn't a JsonbToJsonbValue(), because generally we find it more
   * convenient to directly iterate through the Jsonb representation and only
!  * really convert nested scalar values. JsonbIteratorNext() does this, so that
!  * clients of the iteration code don't have to directly deal with the binary
!  * representation (JsonbDeepContains() is a notable exception, although all
!  * exceptions are internal to this module). In general, functions that accept
!  * a JsonbValue argument are concerned with the manipulation of scalar values,
!  * or simple containers of scalar values, where it would be inconvenient to
!  * deal with a great amount of other state.
   */
  Jsonb *
  JsonbValueToJsonb(JsonbValue *val)
*************** compareJsonbContainers(JsonbContainer *a
*** 137,149 ****
  		ra = JsonbIteratorNext(&ita, &va, false);
  		rb = JsonbIteratorNext(&itb, &vb, false);

- 		/*
- 		 * To a limited extent we'll redundantly iterate over an array/object
- 		 * while re-performing the same test without any reasonable
- 		 * expectation of the same container types having differing lengths
- 		 * (as when we process a WJB_BEGIN_OBJECT, and later the corresponding
- 		 * WJB_END_OBJECT), but no matter.
- 		 */
  		if (ra == rb)
  		{
  			if (ra == WJB_DONE)
--- 137,142 ----
*************** compareJsonbContainers(JsonbContainer *a
*** 152,157 ****
--- 145,162 ----
  				break;
  			}

+ 			if (ra == WJB_END_ARRAY || ra == WJB_END_OBJECT)
+ 			{
+ 				/*
+ 				 * There is no array or object to safely compare at this stage
+ 				 * of processing. jbvArray/jbvObject values are only
+ 				 * considered initially, during processing of WJB_BEGIN_ARRAY
+ 				 * and WJB_BEGIN_OBJECT tokens. This doesn't matter, because a
+ 				 * second comparison would be redundant.
+ 				 */
+ 				continue;
+ 			}
+
  			if (va.type == vb.type)
  			{
  				switch (va.type)
*************** compareJsonbContainers(JsonbContainer *a
*** 194,212 ****
  		else
  		{
  			/*
! 			 * It's safe to assume that the types differed.
 			 *
 			 * If the two values were the same container type, then there'd
 			 * have been a chance to observe the variation in the number of
! 			 * elements/pairs (when processing WJB_BEGIN_OBJECT, say). They
! 			 * can't be scalar types either, because then they'd have to be
! 			 * contained in containers already ruled unequal due to differing
! 			 * numbers of pairs/elements, or already directly ruled unequal
! 			 * with a call to the underlying type's comparator.
 			 */
 			Assert(va.type != vb.type);
! 			Assert(va.type == jbvArray || va.type == jbvObject);
! 			Assert(vb.type == jbvArray || vb.type == jbvObject);

 			/* Type-defined order */
 			res = (va.type > vb.type) ? 1 : -1;
 		}
--- 199,227 ----
 		else
 		{
 			/*
! 			 * It's safe to assume that the types differed, and that the va and
! 			 * vb values passed were set.
 			 *
+ 			 * We don't have to consider the WJB_END_ARRAY and WJB_END_OBJECT
+ 			 * cases here, because in order to process those tokens we'd first
+ 			 * have to process WJB_BEGIN_ARRAY or WJB_BEGIN_OBJECT, and that
+ 			 * would be sufficient reason to end up here (with reliable values
+ 			 * set for va and vb), which always results in finishing the
+ 			 * comparison.
+ 			 */
+ 			Assert(ra != WJB_END_
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Pushed; thanks for working on this!

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why is pg_lsn marking itself a preferred type?
On Wed, May 28, 2014 at 4:27 AM, Tom Lane wrote:
> One of these doesn't belong:
>
> postgres=# select typname, typcategory from pg_type where typispreferred;
>    typname   | typcategory
> -------------+-------------
>  bool        | B
>  text        | S
>  oid         | N
>  float8      | N
>  inet        | I
>  timestamptz | D
>  interval    | T
>  varbit      | V
>  pg_lsn      | U
> (9 rows)
>
> Was there any actual rationale to this, or was it just somebody who did
> not understand what that bit is for?

Looks like an oversight of the pg_lsn patch. You could blame me for that
I suppose...

> I think it's probably mostly harmless given the lack of casts to or from
> pg_lsn, but it's still a darn bad idea to have any preferred types in the
> 'U' category. If we leave it like this it will bite us in the rear
> eventually.
>
> The most expedient response at this late date seems to be to change the
> entry in pg_type.h without bumping catversion. That way at least it
> will be right in databases initdb'd after beta2.

Agreed. Attached patch fixes that, but I am sure that you already figured
it out.

--
Michael

diff --git a/src/include/catalog/pg_type.h b/src/include/catalog/pg_type.h
index a8abc0f..b7d9256 100644
--- a/src/include/catalog/pg_type.h
+++ b/src/include/catalog/pg_type.h
@@ -578,7 +578,7 @@ DESCR("UUID datatype");
 DATA(insert OID = 2951 ( _uuid PGNSP PGUID -1 f b A f t \054 0 2950 0 array_in array_out array_recv array_send - - array_typanalyze i x f 0 -1 0 0 _null_ _null_ _null_ ));

 /* pg_lsn */
-DATA(insert OID = 3220 ( pg_lsn PGNSP PGUID 8 FLOAT8PASSBYVAL b U t t \054 0 0 3221 pg_lsn_in pg_lsn_out pg_lsn_recv pg_lsn_send - - - d p f 0 -1 0 0 _null_ _null_ _null_ ));
+DATA(insert OID = 3220 ( pg_lsn PGNSP PGUID 8 FLOAT8PASSBYVAL b U f t \054 0 0 3221 pg_lsn_in pg_lsn_out pg_lsn_recv pg_lsn_send - - - d p f 0 -1 0 0 _null_ _null_ _null_ ));
 DESCR("PostgreSQL LSN datatype");
 #define LSNOID 3220
 DATA(insert OID = 3221 ( _pg_lsn PGNSP PGUID -1 f b A f t \054 0 3220 0 array_in array_out array_recv array_send - - array_typanalyze d x f 0 -1 0 0 _null_ _null_ _null_ ));

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
On 05/27/2014 07:17 PM, Tom Lane wrote:
> Stephen Frost writes:
>> * Andrew Dunstan (and...@dunslane.net) wrote:
>>> Given that this would be a hard coded behaviour change, is it too
>>> late to do this for 9.4?
>> No, for my 2c.
>
> If we do it by adding casts then it'd require an initdb, so I'd vote
> against that for 9.4. If we just change behavior in json.c then that
> objection doesn't apply, so I wouldn't complain.

I wasn't proposing to add a cast, just to allow users to add one if they
wanted. But I'm quite happy to go with special-casing timestamps, and
leave the larger question for another time.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
Stephen Frost writes:
> * Andrew Dunstan (and...@dunslane.net) wrote:
>> Given that this would be a hard coded behaviour change, is it too
>> late to do this for 9.4?
> No, for my 2c.

If we do it by adding casts then it'd require an initdb, so I'd vote
against that for 9.4. If we just change behavior in json.c then that
objection doesn't apply, so I wouldn't complain.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas wrote:
>> Fair enough, but I don't think that affects correctness either way (I
>> don't think that you meant to imply that this was a necessary
>> precaution that you'd taken - right?).
>
> Right.

So, the comments within _bt_search() suggest that the _bt_moveright()
call will perform a _bt_finish_split() call opportunistically iff it's
called from _bt_doinsert() (i.e. access == BT_WRITE). There is no reason
not to do so in all circumstances though, assuming that it's desirable to
do so as soon as possible (which I *don't* actually assume). If I'm not
mistaken, it's also true that it would be strictly speaking correct to
never do it there.

Do you think it would be a fair stress-test if I were to hack Postgres so
that this call never happens (within _bt_moveright())? I'd also have an
incomplete page split occur at random with a probability of, say, 1% per
split. The mechanism would be much the same as your original test-case
for the split patch - I'd throw an error at the wrong place, although
only 1% of the time, and over many hours. The concern with the
_bt_moveright() call of _bt_finish_split() is that it might conceal a
problem without reliably fixing it, potentially making isolating that
problem much harder.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
* Andrew Dunstan (and...@dunslane.net) wrote:
> >> I'd be inclined to think a more useful answer to this issue would be
> >> to make json.c special-case timestamps, as it already does for
> >> numerics.
>
> OK, that's another approach.

I'm all for doing this for JSON, but it'd sure be nice to have the option
to get actual ISO 8601 from the built-ins too...

> Given that this would be a hard coded behaviour change, is it too
> late to do this for 9.4?

No, for my 2c.

Thanks,

Stephen
Re: [HACKERS] json casts
On 05/27/2014 05:43 PM, Hannu Krosing wrote:
> On 05/27/2014 11:00 PM, Tom Lane wrote:
>>> See src/backend/utils/adt/json.c:json_categorize_type(), lines
>>> 1280-1300. When rendering some value as part of a json string, if a
>>> cast exists from the data type to json, then the cast function is used
>>> to render the json instead of the type's normal output function, but
>>> only if it's not a builtin type.
>> How exactly would disabling that code have any effect on timestamp
>> rendering? There's no cast to json from timestamps (nor any other
>> builtin type, except jsonb).
> I think Andrew's idea was that if casts were used, one could fix the
> above problem by defining a correct cast.

Right.

>> I'd be inclined to think a more useful answer to this issue would be to
>> make json.c special-case timestamps, as it already does for numerics.

OK, that's another approach.

> But I agree that special-casing the code to use the de-facto json
> standard of using ISO 8601 date representation is a better solution.
> Just make sure you get the TZ part right - this is another place where
> PostgreSQL often differs from other systems' understanding of ISO
> timestamps.

The target would be something like:

    to_char($1, '\"YYYY-MM-DD"T"HH24:MI:SS.USOF\"')

AIUI that should be legal ISO 8601. But I'm happy to defer to experts.

Given that this would be a hard coded behaviour change, is it too late to
do this for 9.4?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
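For reference, the kind of user-level cast that json.c currently ignores
for builtin types would look something like the following sketch
(illustrative only; here to_json() supplies the surrounding JSON quoting,
so the \" literals from the format above aren't needed):

    CREATE FUNCTION timestamptz_to_json(timestamptz) RETURNS json
        LANGUAGE sql STABLE
        AS $$ SELECT to_json(to_char($1, 'YYYY-MM-DD"T"HH24:MI:SS.USOF')) $$;

    CREATE CAST (timestamptz AS json)
        WITH FUNCTION timestamptz_to_json(timestamptz);

    -- If json.c honored this cast (or special-cased timestamps itself),
    -- SELECT row_to_json(r) FROM (SELECT now() AS ts) r;
    -- would render ts as e.g. "2014-05-27T17:43:00.123456-04"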
Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example
Heikki Linnakangas-6 wrote
> On 05/27/2014 10:12 PM, Evan Jones wrote:
>> I was reading the Postgres MVCC documentation today (which is
>> generally fantastic BTW), and am slightly confused by a single
>> sentence example, describing possible read-only snapshot isolation
>> anomalies. I would like to submit a patch to clarify this example,
>> since I suspect others may be also confused, but to do that I need
>> help understanding it. The example was added as part of the
>> Serializable Snapshot Isolation patch.
>>
>> Link to the commit:
>> http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21
>>
>> I'm referring to the following sentence of 13.2.2, which is still in
>> the source tree:
>> http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ
>>
>> "For example, even a read only transaction at this level may see a
>> control record updated to show that a batch has been completed but
>> not see one of the detail records which is logically part of the
>> batch because it read an earlier revision of the control record."
>
> Hmm, that seems to be a super-summarized description of what Kevin & Dan
> called the "receipts problem". There's an example of that in the
> isolation test suite, see src/test/isolation/specs/receipt-report.spec.
> Googling for it, I also found an academic paper written by Kevin & Dan
> that illustrates it: http://arxiv.org/pdf/1208.4179.pdf, "2.1.2 Example
> 2: Batch Processing". (Nice work, I didn't know of that paper until now!)
>
> I agree that's too terse. I think it would be good to actually spell out
> a complete example of the Receipt problem in the manual. That chapter in
> the manual contains examples of anomalies in Read Committed mode, so it
> would be good to give a concrete example of an anomaly in Repeatable
> Read mode too. Want to write up a docs patch?

While this is not a doc patch I decided to give it some thought. The
"bank" example was understandable enough for me, so I simply tried to
make it more accessible. I also didn't try to make it conform to other,
existing examples.

This is intended to replace the entire "For example..." paragraph noted
above:

    While Repeatable Read provides for stable in-transaction reads,
    logical query anomalies can result because commit order is not
    restricted, and serialization errors occur only if two transactions
    attempt to modify the same record.

    Consider a rule that, upon updating r1 or r2, if r1 + r2 < 0, then an
    additional 1 is subtracted from the row being updated.

    Initial state: r1 = 0; r2 = 0

    Transaction 1 begins: reads (0,0); adds -10 to r1; notes that
        r1 + r2 will be -10 and subtracts an additional 1.
    Transaction 2 begins: reads (0,0); adds 20 to r2; notes that
        r1 + r2 will be +20; no further action needed.
    Commit 2.
    Transaction 3: reads (0,20) and commits.
    Commit 1.
    Transaction 4: reads (-11,20) and commits.

    However, if Transaction 2 logically commits first, then the
    calculation of r1 + r2 in Transaction 1 should come out positive and
    the additional subtraction of 1 should not occur - leaving T4 reading
    (-10,20). The ability for out-of-order commits is what allows T3 to
    read the pair (0,20), which is logically impossible in the
    T2-before-T1 commit order if T4 reads (-11,20). Neither transaction
    fails, since a serialization failure occurs only if a concurrent
    update touches r1 (in T1) or r2 (in T2); the concurrent update of r2,
    which T1 only read, goes unnoticed - i.e., no failure occurs when a
    value that was merely read undergoes a change.
Inspired by:
http://www.sigmod.org/publications/sigmod-record/0409/2.ROAnomONeil.pdf
- Example 1.3

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/PG-Manual-Clarifying-the-repeatable-read-isolation-example-tp5805152p5805170.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
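A hypothetical psql interleaving of the example above (table name, row
layout, and session labels are illustrative, not from the original mail):

    CREATE TABLE tab (id int PRIMARY KEY, val int);
    INSERT INTO tab VALUES (1, 0), (2, 0);        -- row 1 is r1, row 2 is r2

    -- Session T1
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT sum(val) FROM tab;                     -- sees 0 + 0 = 0
    UPDATE tab SET val = val - 10 WHERE id = 1;   -- r1 := -10
    UPDATE tab SET val = val - 1  WHERE id = 1;   -- r1 + r2 = -10 < 0 in
                                                  -- T1's snapshot, so
                                                  -- subtract 1 more

    -- Session T2
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT sum(val) FROM tab;                     -- also sees 0
    UPDATE tab SET val = val + 20 WHERE id = 2;   -- r1 + r2 = +20, no
                                                  -- extra subtraction
    COMMIT;                                       -- T2 commits first

    -- Session T3
    SELECT val FROM tab ORDER BY id;              -- (0, 20)

    -- Session T1 again
    COMMIT;                     -- no serialization failure in REPEATABLE
                                -- READ: T1 and T2 wrote different rows
    SELECT val FROM tab ORDER BY id;              -- (-11, 20): impossible
                                                  -- in any serial order
                                                  -- consistent with T3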
Re: [HACKERS] json casts
On 05/27/2014 11:00 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
>>> On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
>>>> I've been on the receiving end of a couple of mumbles about the fact
>>>> that the JSON rendering code ignores casts of builtin types to JSON.
>>>> This was originally done as an optimization to avoid doing cache
>>>> lookups for casts for things we knew quite well how to turn into JSON
>>>> values (unlike, say, hstore). However, there is at least one concrete
>>>> case where this has some possibly undesirable consequences, namely
>>>> timestamps. Many JSON processors, especially JavaScript/ECMAScript
>>>> processors, require timestamp values to be in ISO 8601 format, with a
>>>> 'T' between the date part and the time part, and thus they barf on
>>>> the output we produce for such values.
>>> I don't understand what ignoring casts of builtin types to JSON means.
>>> Can you give an example?
>> See src/backend/utils/adt/json.c:json_categorize_type(), lines
>> 1280-1300. When rendering some value as part of a json string, if a
>> cast exists from the data type to json, then the cast function is used
>> to render the json instead of the type's normal output function, but
>> only if it's not a builtin type.
> How exactly would disabling that code have any effect on timestamp
> rendering? There's no cast to json from timestamps (nor any other
> builtin type, except jsonb).

I think Andrew's idea was that if casts were used, one could fix the
above problem by defining a correct cast.

> I'd be inclined to think a more useful answer to this issue would be to
> make json.c special-case timestamps, as it already does for numerics.

But I agree that special-casing the code to use the de-facto json
standard of using ISO 8601 date representation is a better solution.
Just make sure you get the TZ part right - this is another place where
PostgreSQL often differs from other systems' understanding of ISO
timestamps.

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5 commit fest schedule
On 5/22/14, 7:50 AM, Abhijit Menon-Sen wrote:
> At 2014-05-22 07:22:42 -0400, pete...@gmx.net wrote:
>>
>> We need a volunteer to manage the first commit fest.
>
> I volunteer.

Congratulations, you're the new commit fest manager. I trust that you
know your way around, but if you have any questions about anything, you
can ask me, and probably some of the other previous commit fest managers.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
Andrew Dunstan writes:
> On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
>> On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
>>> I've been on the receiving end of a couple of mumbles about the fact
>>> that the JSON rendering code ignores casts of builtin types to JSON.
>>> This was originally done as an optimization to avoid doing cache
>>> lookups for casts for things we knew quite well how to turn into JSON
>>> values (unlike, say, hstore). However, there is at least one concrete
>>> case where this has some possibly undesirable consequences, namely
>>> timestamps. Many JSON processors, especially JavaScript/ECMAScript
>>> processors, require timestamp values to be in ISO 8601 format, with a
>>> 'T' between the date part and the time part, and thus they barf on the
>>> output we produce for such values.
>> I don't understand what ignoring casts of builtin types to JSON means.
>> Can you give an example?
> See src/backend/utils/adt/json.c:json_categorize_type(), lines
> 1280-1300. When rendering some value as part of a json string, if a
> cast exists from the data type to json, then the cast function is used
> to render the json instead of the type's normal output function, but
> only if it's not a builtin type.

How exactly would disabling that code have any effect on timestamp
rendering? There's no cast to json from timestamps (nor any other
builtin type, except jsonb).

I'd be inclined to think a more useful answer to this issue would be to
make json.c special-case timestamps, as it already does for numerics.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On 05/27/2014 11:30 PM, Peter Geoghegan wrote:
> On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas wrote:
>> Ah, sorry, I got that wrong. The downlinks store the *low* key of the
>> child page, not the high key as I depicted. Let me try again:
>
> Would you mind humoring me, and including a corrected final
> post-downlink-insert diagram, when the split is fully complete? You
> omitted that.

Sure:

Parent page
+----------+
| 7 -> A   |
| 8 -> B   |
+----------+

Leaf A             Leaf B
+------------+     +------------+
| HI-key: 8  |     | HI-key: 9  |
|            |     |            |
|            | <-> |            |
|            |     |            |
| 7 7* 8     |     | 9          |
+------------+     +------------+

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On Tue, May 27, 2014 at 12:19 PM, Heikki Linnakangas wrote:
> Ah, sorry, I got that wrong. The downlinks store the *low* key of the
> child page, not the high key as I depicted. Let me try again:

Would you mind humoring me, and including a corrected final
post-downlink-insert diagram, when the split is fully complete? You
omitted that.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
On 05/27/2014 03:57 PM, Heikki Linnakangas wrote:
> On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
>> I've been on the receiving end of a couple of mumbles about the fact
>> that the JSON rendering code ignores casts of builtin types to JSON.
>> This was originally done as an optimization to avoid doing cache
>> lookups for casts for things we knew quite well how to turn into JSON
>> values (unlike, say, hstore). However, there is at least one concrete
>> case where this has some possibly undesirable consequences, namely
>> timestamps. Many JSON processors, especially JavaScript/ECMAScript
>> processors, require timestamp values to be in ISO 8601 format, with a
>> 'T' between the date part and the time part, and thus they barf on the
>> output we produce for such values.
>
> I don't understand what ignoring casts of builtin types to JSON means.
> Can you give an example?

See src/backend/utils/adt/json.c:json_categorize_type(), lines 1280-1300.
When rendering some value as part of a json string, if a cast exists from
the data type to json, then the cast function is used to render the json
instead of the type's normal output function, but only if it's not a
builtin type.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] json casts
On 05/27/2014 10:53 PM, Andrew Dunstan wrote:
> I've been on the receiving end of a couple of mumbles about the fact
> that the JSON rendering code ignores casts of builtin types to JSON.
> This was originally done as an optimization to avoid doing cache lookups
> for casts for things we knew quite well how to turn into JSON values
> (unlike, say, hstore). However, there is at least one concrete case
> where this has some possibly undesirable consequences, namely
> timestamps. Many JSON processors, especially JavaScript/ECMAScript
> processors, require timestamp values to be in ISO 8601 format, with a
> 'T' between the date part and the time part, and thus they barf on the
> output we produce for such values.

I don't understand what ignoring casts of builtin types to JSON means.
Can you give an example?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] json casts
I've been on the receiving end of a couple of mumbles about the fact that
the JSON rendering code ignores casts of builtin types to JSON. This was
originally done as an optimization to avoid doing cache lookups for casts
for things we knew quite well how to turn into JSON values (unlike, say,
hstore). However, there is at least one concrete case where this has some
possibly undesirable consequences, namely timestamps. Many JSON
processors, especially JavaScript/ECMAScript processors, require
timestamp values to be in ISO 8601 format, with a 'T' between the date
part and the time part, and thus they barf on the output we produce for
such values.

I'm therefore wondering if we should just remove this optimization. I
don't have any performance figures, but it seems unlikely to be terribly
expensive. Perhaps in 9.5 we should look at caching a lot of the info the
json rendering code gathers, but that's something we can't look at now.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example
Oh yeah, I shared an office with Dan, so I should have thought to check
their paper. Oops.

Thanks for the suggestion; I'll try to summarize this into something that
is similar to the Read Committed and Serializable mode examples. It may
take me a week or two to find the time, but thanks for the suggestions.

Evan

On May 27, 2014, at 15:32, Heikki Linnakangas wrote:
> I agree that's too terse. I think it would be good to actually spell
> out a complete example of the Receipt problem in the manual. That
> chapter in the manual contains examples of anomalies in Read Committed
> mode, so it would be good to give a concrete example of an anomaly in
> Repeatable Read mode too. Want to write up a docs patch?

--
Work: https://www.mitro.co/    Personal: http://evanjones.ca/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_llog not mentioned in "Database File Layout"
On Mon, May 26, 2014 at 12:33 PM, Amit Langote wrote:
> Hi,
>
> Just noticed pg_llog is not mentioned in the "Database File Layout"
> section. Wonder if it's an oversight?

Yes, it is an oversight. Patch attached.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello

diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 57e7f09..132169d 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -78,6 +78,11 @@ Item
+ pg_llog
+ Subdirectory containing logical replication status data
+
+
  pg_multixact
  Subdirectory containing multitransaction status data (used for shared row locks)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PG Manual: Clarifying the repeatable read isolation example
On 05/27/2014 10:12 PM, Evan Jones wrote:
> I was reading the Postgres MVCC documentation today (which is generally
> fantastic BTW), and am slightly confused by a single sentence example,
> describing possible read-only snapshot isolation anomalies. I would like
> to submit a patch to clarify this example, since I suspect others may be
> also confused, but to do that I need help understanding it. The example
> was added as part of the Serializable Snapshot Isolation patch.
>
> Link to the commit:
> http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21
>
> I'm referring to the following sentence of 13.2.2, which is still in the
> source tree:
> http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ
>
> "For example, even a read only transaction at this level may see a
> control record updated to show that a batch has been completed but not
> see one of the detail records which is logically part of the batch
> because it read an earlier revision of the control record."

Hmm, that seems to be a super-summarized description of what Kevin & Dan
called the "receipts problem". There's an example of that in the
isolation test suite, see src/test/isolation/specs/receipt-report.spec.
Googling for it, I also found an academic paper written by Kevin & Dan
that illustrates it: http://arxiv.org/pdf/1208.4179.pdf, "2.1.2 Example
2: Batch Processing". (Nice work, I didn't know of that paper until now!)

I agree that's too terse. I think it would be good to actually spell out
a complete example of the Receipt problem in the manual. That chapter in
the manual contains examples of anomalies in Read Committed mode, so it
would be good to give a concrete example of an anomaly in Repeatable Read
mode too. Want to write up a docs patch?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why is pg_lsn marking itself a preferred type?
One of these doesn't belong:

postgres=# select typname, typcategory from pg_type where typispreferred;
   typname   | typcategory
-------------+-------------
 bool        | B
 text        | S
 oid         | N
 float8      | N
 inet        | I
 timestamptz | D
 interval    | T
 varbit      | V
 pg_lsn      | U
(9 rows)

Was there any actual rationale to this, or was it just somebody who did not understand what that bit is for? I think it's probably mostly harmless given the lack of casts to or from pg_lsn, but it's still a darn bad idea to have any preferred types in the 'U' category. If we leave it like this it will bite us in the rear eventually. The most expedient response at this late date seems to be to change the entry in pg_type.h without bumping catversion. That way at least it will be right in databases initdb'd after beta2. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
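For readers unfamiliar with what typispreferred buys you: when operator or function resolution is ambiguous, candidates taking the category's preferred type win, so a preferred type quietly attracts implicit casts from its whole category. A quick behavior sketch using the numeric category, where float8 is legitimately preferred (output shown in comments, abbreviated):

-- there is no "float4 + numeric" operator; resolution falls back to the
-- category's preferred type, so both sides are implicitly cast to float8
SELECT '1.5'::float4 + '2'::numeric AS sum,
       pg_typeof('1.5'::float4 + '2'::numeric) AS type;
-- sum: 3.5, type: double precision

A preferred pg_lsn would pull category-'U' types toward it in the same way, if implicit casts to it ever appeared.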
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On 05/27/2014 09:47 PM, Peter Geoghegan wrote: On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas wrote: Also note that _bt_moveright() also finishes any incomplete splits it encounters (when called for an insertion). So _bt_findinsertloc() never gets called on a page with the incomplete-split flag set. It might encounter one when it moves right, but never the first page. Fair enough, but I don't think that affects correctness either way (I don't think that you meant to imply that this was a necessary precaution that you'd taken - right?). Right.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A
+------------+
| HI-key: 9  |
|            |
| 7  8  9    |
+------------+

And after insertion and incomplete split:

Parent page
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A               Leaf B
+--------------+     +------------+
| HI-key: 8    |     | HI-key: 9  |
| (INCOMPLETE_ |     |            |
|  SPLIT)      | <-> |            |
|              |     |            |
| 7  7*  8     |     | 9          |
+--------------+     +------------+

After the split is finished, the tree looks like this:

Parent page
+------------+
| 8 -> A     |
| 9 -> B     |
+------------+

Leaf A               Leaf B
+------------+     +------------+
| HI-key: 8  |     | HI-key: 9  |
|            | <-> |            |
| 7  7*  8   |     | 9          |
+------------+     +------------+

How did the parent page change between before and after the final atomic operation (page split completion)? What happened to "9 -> A"? Ah, sorry, I got that wrong. The downlinks store the *low* key of the child page, not the high key as I depicted. Let me try again:

On 05/27/2014 09:17 AM, Peter Geoghegan wrote: Anyway, the concern is that there may be problems when we call _bt_finish_split() with that left sibling locked throughout (i.e. finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and itself has a right sibling from the incomplete split (which is usually the value lock page's right-right sibling)). I'm not concerned about performance, since as the comment says it ought to be an infrequent occurrence. I also believe that there are no deadlock hazards. But consider this scenario: * We insert the value 7 into an int4 unique index. We need to split the leaf page. We run out of memory or something, and ours is an incomplete page split. Our transaction is aborted. For the sake of argument, suppose that there are also already a bunch of dead tuples within the index with values of 7, 8 and 9.

So, initially the tree looks like this:

Parent page:
+------------+
|            |
| 7 -> A     |
+------------+

Leaf A
+------------+
| HI-key: 9  |
|            |
| 7  8  9    |
+------------+

And after insertion and incomplete split:

Parent page
+------------+
|            |
| 7 -> A     |
+------------+

Leaf A               Leaf B
+--------------+     +------------+
| HI-key: 8    |     | HI-key: 9  |
| (INCOMPLETE_ |     |            |
|  SPLIT)      | <-> |            |
|              |     |            |
| 7  7*  8     |     | 9          |
+--------------+     +------------+

where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split) * Another inserter of the value 7 comes along. It follows exactly the same downlink as the first, now aborted inserter (suppose the downlink's value is 9). It also locks the same leaf page to establish a "value lock" in precisely the same manner. Finding no room on the first page, it looks further right, maintaining its original "value lock" throughout. It finishes the first inserter's split of the non-value-lock page - a new downlink is inserted into the parent page, with the value 8. It then releases all buffer locks except the first one/original "value lock". A physical insertion has yet to occur. The downlink of the original page cannot contain 9. Because, as I now remember ;-), the downlinks contain low-keys, not high keys. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PG Manual: Clarifying the repeatable read isolation example
Feel free to flame me if I should be posting this elsewhere, but after reading the "submitting a patch" guide, it appears I should ask for guidance here. I was reading the Postgres MVCC documentation today (which is generally fantastic BTW), and am slightly confused by a single sentence example, describing possible read-only snapshot isolation anomalies. I would like to submit a patch to clarify this example, since I suspect others may also be confused, but to do that I need help understanding it. The example was added as part of the Serializable Snapshot Isolation patch. Link to the commit: http://git.postgresql.org/gitweb/?p=postgresql.git;h=dafaa3efb75ce1aae2e6dbefaf6f3a889dea0d21 I'm referring to the following sentence of 13.2.2, which is still in the source tree: http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-REPEATABLE-READ "For example, even a read only transaction at this level may see a control record updated to show that a batch has been completed but not see one of the detail records which is logically part of the batch because it read an earlier revision of the control record." I do not understand how this example anomaly is possible. I'm imagining something like the following:

1. Do a bunch of work, possibly in parallel in multiple transactions, that inserts/updates a bunch of detail records.
2. After all that work commits, insert or update a record in the "control" table indicating that the batch completed.

Or maybe:

1. Do a batch of work and update the "control" table in a single transaction.

The guarantee that I believe REPEATABLE READ will give you in either of these cases is that if you see the "control" table record, you will read all the detail records, because the control record is only written if the updated detail records have been committed. What am I not understanding? The most widely cited read-only snapshot isolation example is the bank withdrawal example from this paper: http://www.sigmod.org/publications/sigmod-record/0409/2.ROAnomONeil.pdf. However, I suspect we can present an anomaly that doesn't require as much explanation? Thanks, Evan Jones -- Work: https://www.mitro.co/ Personal: http://evanjones.ca/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On Tue, May 27, 2014 at 4:54 AM, Heikki Linnakangas wrote: >> The "left sibling" referred to here is "the first page this key could >> be on", an important concept for unique index enforcement. > > No, it's not "the first page this key could be on". Well, it may be initially. I could have been more cautious about the terminology here. > Also note that _bt_moveright() also finishes any incomplete splits it > encounters (when called for an insertion). So _bt_findinsertloc() never gets > called on a page with the incomplete-split flag set. It might encounter one > when it moves right, but never the first page. Fair enough, but I don't think that affects correctness either way (I don't think that you meant to imply that this was a necessary precaution that you'd taken - right?). It's a nice property, since it makes the extra locking while completing a split within _bt_findinsertloc() particularly infrequent. But, that might also be a bad thing, when considered from a different perspective.

> If I understood correctly, the tree looks like this before the insertion:
>
> Parent page:
> +------------+
> |            |
> | 9 -> A     |
> +------------+
>
> Leaf A
> +------------+
> | HI-key: 9  |
> |            |
> | 7  8  9    |
> +------------+
>
> And after insertion and incomplete split:
>
> Parent page
> +------------+
> |            |
> | 9 -> A     |
> +------------+
>
> Leaf A               Leaf B
> +--------------+     +------------+
> | HI-key: 8    |     | HI-key: 9  |
> | (INCOMPLETE_ |     |            |
> |  SPLIT)      | <-> |            |
> |              |     |            |
> | 7  7*  8     |     | 9          |
> +--------------+     +------------+
>
> After the split is finished, the tree looks like this:
>
> Parent page
> +------------+
> | 8 -> A     |
> | 9 -> B     |
> +------------+
>
> Leaf A               Leaf B
> +------------+     +------------+
> | HI-key: 8  |     | HI-key: 9  |
> |            | <-> |            |
> | 7  7*  8   |     | 9          |
> +------------+     +------------+

How did the parent page change between before and after the final atomic operation (page split completion)? What happened to "9 -> A"? >> Regardless of whether or not I have these details exactly right (that >> is, regardless of whether or not this scenario is strictly possible) I >> suggest as a code-hardening measure that _bt_findinsertloc() release >> its "value lock", upon realizing it must complete splits, and then >> complete the split or splits known to be required. It would finally >> report that it "couldn't find" an insertion location to >> _bt_doinsert(), which would then retry from the start, just like when >> _bt_check_unique() finds an inconclusive conflict. > > Yeah, that would work too. It seems safe enough as it is, though, so I don't > see the point. Well, it would be nice to not have to finish the page split in what is a particularly critical path, with that extra buffer lock. It's not strictly necessary, but then it is theoretically safer, and certainly much clearer. The fact that this code is so seldom executed is one issue that made me revisit this. On the other hand, I can see why you'd want to avoid cluttering up the relatively comprehensible _bt_doinsert() function if it could be avoided. I defer to you. > PS. Thanks for looking into this again! These B-tree changes really need > thorough review. You're welcome. Hopefully my questions will lead you in a useful direction, even if my concerns turn out to be, in the main, unfounded. :-) It previously wasn't in evidence that you'd considered these interactions, and I feel better knowing that you have. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Heikki Linnakangas writes: > On 27 May 2014 18:33:48 EEST, Tom Lane wrote: >> If we were going to do it like that, I'd vote for >> >> --with-uuid={ossp,e2fs,bsd} >> >> with no default at present (ie you can't say just "--with-uuid", >> though we'd have the option to allow that in future). But I doubt >> this is better than the --with-foo-uuid spelling. > FWIW, --with-uuid=foo looks much clearer to me. OK, I'll make it so. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On Mon, May 26, 2014 at 8:15 PM, Robert Haas wrote: > On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas > wrote: > >>I don't think we know that. The server might have crashed before that > >>second record got generated. (This appears to be an unfixable flaw in > >>this proposal.) > > > > The second record is generated before the checkpoint is finished and the > checkpoint record is written. So it will be there. > > > > (if you crash before the checkpoint is finished, the in-progress > checkpoint is no good for recovery anyway, and won't be used) > > Hmm, I see. > > It's not great to have to generate WAL at buffer-eviction time, > though. Normally, when we go to evict a buffer, the WAL is already > written. We might have to wait for it to be flushed, but if the WAL > writer is doing its job, hopefully not. But here we'll definitely > have to wait for the WAL flush. > I'm not sure we do need to flush it. If the checkpoint finishes, then the WAL surely got flushed as part of the process of recording the end of the checkpoint. If the checkpoint does not finish, recovery will start from the previous checkpoint, which does contain the FPW (because if it didn't, the page would not be eligible for this treatment) and so the possibly torn page will get overwritten in full. Cheers, Jeff
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
On Tuesday 27 May 2014 09:52:27, Stephen Frost wrote: > * David Fetter (da...@fetter.org) wrote: > > In the "easy" case of PostgreSQL, you might also be able to establish > > whether the UDT in the "already defined locally" case above is > > identical to the one defined remotely, but I don't think it's possible > > even in principle for non-PostgreSQL remote systems, possibly not even > > for ones with non-identical architecture, PostgreSQL major version, > > etc. > > For my 2c, it'd be very handy if IMPORT had an option or similar to also > copy over all (in the schema) or at least any depended-upon UDTs. It'd > really be nice if we had an ability to check the remote UDT vs. the > local one at some point also, since otherwise you're going to get bitten > at run-time when querying the foreign table. The specification only talks about importing tables, not types, views or (why not?) foreign tables. If we want to extend the spec further, we could accept more than CreateForeignTableStmt as return values from the API:

- CreateDomainStmt
- CompositeTypeStmt
- AlterTableCmd

The core code would be responsible for executing them, and making sure the destination schema is correctly positioned on all of those. The postgres_fdw behaviour could then be controlled with options such as include_types (a tri-valued enum accepting "none", "all", "as_needed") and relation_kinds (defaulting to 'r', with support for multiple kinds via a string such as 'rfv' for tables, foreign tables and views). I think we're drifting further away from the standard with that kind of stuff, but I'd be happy to implement it if that's the path we choose. -- Ronan Dunklau http://dalibo.com - http://dalibo.org signature.asc Description: This is a digitally signed message part.
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 27 May 2014 18:33:48 EEST, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> The problem is that the long-established spelling is --with-ossp-uuid. >>> I don't think we can break that case. While we could set up something >>> like what you propose alongside it, it doesn't seem like there's any >>> advantage to doing so compared to inventing --with-foo-uuid as needed. >> I was thinking that --with-ossp-uuid would still be required to enable >> UUID generators at all; the other one just selects the implementation to >> use, which defaults to OSSP to maintain backwards compatibility. Maybe >> introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of >> that. > If we were going to do it like that, I'd vote for > --with-uuid={ossp,e2fs,bsd} > with no default at present (ie you can't say just "--with-uuid", > though we'd have the option to allow that in future). But I doubt > this is better than the --with-foo-uuid spelling. FWIW, --with-uuid=foo looks much clearer to me. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On May 27, 2014, at 7:44 AM, Tom Lane wrote: > In either case, the problem remains of exactly what to call the > e2fsprogs-derived implementation. It does seem that people who are > familiar with these libraries call it that, but I'm worried that such > a name will confuse those not so familiar. --with-libuuid? D signature.asc Description: Message signed with OpenPGP using GPGMail
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Alvaro Herrera writes: > Tom Lane wrote: >> The problem is that the long-established spelling is --with-ossp-uuid. >> I don't think we can break that case. While we could set up something >> like what you propose alongside it, it doesn't seem like there's any >> advantage to doing so compared to inventing --with-foo-uuid as needed. > I was thinking that --with-ossp-uuid would still be required to enable > UUID generators at all; the other one just selects the implementation to > use, which defaults to OSSP to maintain backwards compatibility. Maybe > introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of > that. If we were going to do it like that, I'd vote for --with-uuid={ossp,e2fs,bsd} with no default at present (ie you can't say just "--with-uuid", though we'd have the option to allow that in future). But I doubt this is better than the --with-foo-uuid spelling. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Tom Lane wrote: > Alvaro Herrera writes: > > Why not --with-uuid-implementation=<impl>, and have <impl> be one of > > e2utils, bsd, ossp, with the latter being the default? We could also > > offer the value "list" or "help" which would list the available options. > > That way, if we come up with a new implementation in the future, this is > > easily extensible. > > The problem is that the long-established spelling is --with-ossp-uuid. > I don't think we can break that case. While we could set up something > like what you propose alongside it, it doesn't seem like there's any > advantage to doing so compared to inventing --with-foo-uuid as needed. I was thinking that --with-ossp-uuid would still be required to enable UUID generators at all; the other one just selects the implementation to use, which defaults to OSSP to maintain backwards compatibility. Maybe introduce --with-uuid and have --with-ossp-uuid a deprecated synonym of that. As a more sophisticated idea: if you say --with-uuid then --with-uuid-implementation is required; if you say --with-ossp-uuid then --with-uuid-implementation is optional and defaults to OSSP. > In either case, the problem remains of exactly what to call the > e2fsprogs-derived implementation. It does seem that people who are > familiar with these libraries call it that, but I'm worried that such > a name will confuse those not so familiar. I vote "e2fsprogs". In the help text for that option, mention that it works on Linux and some other OSes, or something like that, so that people on Linux try that one first, and people on other OSes can web-search to see whether it's available. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > This, plus the generic ability to pass an OPTIONS clause to the IMPORT > > (allowing you to have different defaults for different IMPORT > > statements) and having it be transactional, as you mention, appears to > > be covering all the relevant bases. > > Yeah, as far as the example of coping with differing case goes, I agree > that we'd want IMPORT to just follow whatever the FDW's default or > configured behavior is, since obviously the FDW will have to know how to > reverse the conversion when sending queries later on. So there's no > apparent need for an IMPORT-level option *for that example*. I'm not > sure if we need OPTIONS for IMPORT for any other uses. I'm waffling a bit on this particular point. IMPORT is for bulk operations, of course, but perhaps on one remote server you want to import everything from the dimension tables using the default mapping but everything from the fact or perhaps aggregate tables in some schema as text. You could do that with something like: import #1, change the server-level options, import #2; or you could do just one big import and then go back and fix everything, but... I'd rather introduce this options clause for IMPORT *now*, which means we don't need to care about whether this specific example is really relevant or not, than not have it in 9.5 and have FDW authors unhappy because it's missing (whether they need it for this use case or some other). Thanks, Stephen signature.asc Description: Digital signature
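The two-pass workflow described above might then collapse to something like this under the OPTIONS clause discussed in this thread (import_as_text is a purely hypothetical option name for illustration):

-- dimension tables, using the server's default type mapping
IMPORT FOREIGN SCHEMA sales FROM SERVER dw INTO staging
    LIMIT TO (dim_customer, dim_product);

-- fact tables, imported with every column as text instead
IMPORT FOREIGN SCHEMA sales FROM SERVER dw INTO staging
    LIMIT TO (fact_orders)
    OPTIONS (import_as_text 'true');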
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Alvaro Herrera writes: > Why not --with-uuid-implementation=<impl>, and have <impl> be one of > e2utils, bsd, ossp, with the latter being the default? We could also > offer the value "list" or "help" which would list the available options. > That way, if we come up with a new implementation in the future, this is > easily extensible. The problem is that the long-established spelling is --with-ossp-uuid. I don't think we can break that case. While we could set up something like what you propose alongside it, it doesn't seem like there's any advantage to doing so compared to inventing --with-foo-uuid as needed. In either case, the problem remains of exactly what to call the e2fsprogs-derived implementation. It does seem that people who are familiar with these libraries call it that, but I'm worried that such a name will confuse those not so familiar. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
On 2014-05-27 16:36:45 +0200, Matteo Beccati wrote: > On 27/05/2014 15:52, Tom Lane wrote: > > Ah, cool. I had documented this option as only working for FreeBSD, > > but that's obviously too conservative. Anyone know about whether it > > works on OpenBSD? > > I've tried to google "man uuid openbsd" and I got the e2fs package > (which contains uuid/uuid.h and libuuid) instead of a man page, so I > believe that could be another use case for --with-linux-uuid. > > If it's confirmed to be working, that makes two BSD-derived systems > requiring "linux-uuid", so --with-e2fs-uuid or similar would be more > appropriate. When I looked at the e2fs uuid implementation a year or so back, it even had some form of windows support. So +1 to naming it non-os specifically. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Matteo Beccati writes: > On 27/05/2014 15:52, Tom Lane wrote: >> Ah, cool. I had documented this option as only working for FreeBSD, >> but that's obviously too conservative. Anyone know about whether it >> works on OpenBSD? > I've tried to google "man uuid openbsd" and I got the e2fs package > (which contains uuid/uuid.h and libuuid) instead of a man page, so I > believe that could be another use case for --with-linux-uuid. Yeah, googling led me to the same conclusion: e2fsprogs libuuid is definitely available for OpenBSD, but I found no evidence that it has the built-in UUID functions that FreeBSD/NetBSD have. > If it's confirmed to be working, that makes two BSD-derived systems > requiring "linux-uuid", so --with-e2fs-uuid or similar would be more > appropriate. Still don't like that name much, but maybe that's the best we can do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Tom Lane wrote: > "David E. Wheeler" writes: > > On May 26, 2014, at 9:30 PM, Tom Lane wrote: > >>> How about --with-unix-uuid? Or --with-ext2-uuid? > > >> Meh. "Unix" certainly subsumes BSD, so that doesn't seem like a very > >> useful distinction. I guess we could use "ext2" but that would just > >> confuse most people. > > > --with-uuid? > > I thought about that but figured we'd regret it ... Why not --with-uuid-implementation=<impl>, and have <impl> be one of e2utils, bsd, ossp, with the latter being the default? We could also offer the value "list" or "help" which would list the available options. That way, if we come up with a new implementation in the future, this is easily extensible. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
Stephen Frost writes: > Sure, I was being a bit over-simplistic. As was mentioned up-thread, > the option would rather be "flip all-uppercase to lowercase and all- > lowercase to uppercase, quote any which are mixed", or something along > those lines. What I was trying to get at is that it's up to the FDW > what options it wants to support in this regard and we already have a > way for the admin to pass in useful information to the FDW by way of the > FOREIGN SERVER FDW options. > This, plus the generic ability to pass an OPTIONS clause to the IMPORT > (allowing you to have different defaults for different IMPORT > statements) and having it be transactional, as you mention, appears to > be covering all the relevant bases. Yeah, as far as the example of coping with differing case goes, I agree that we'd want IMPORT to just follow whatever the FDW's default or configured behavior is, since obviously the FDW will have to know how to reverse the conversion when sending queries later on. So there's no apparent need for an IMPORT-level option *for that example*. I'm not sure if we need OPTIONS for IMPORT for any other uses. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Hi Tom, On 27/05/2014 15:52, Tom Lane wrote: > Matteo Beccati writes: >> On 27/05/2014 03:07, Tom Lane wrote: >>> I do not have a machine on which to try --with-bsd-uuid, so it's >>> possible I broke that portion of Matteo's patch. Would someone try >>> that case on a FreeBSD box? > >> I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. > > Ah, cool. I had documented this option as only working for FreeBSD, > but that's obviously too conservative. Anyone know about whether it > works on OpenBSD? I've tried to google "man uuid openbsd" and I got the e2fs package (which contains uuid/uuid.h and libuuid) instead of a man page, so I believe that could be another use case for --with-linux-uuid. If it's confirmed to be working, that makes two BSD-derived systems requiring "linux-uuid", so --with-e2fs-uuid or similar would be more appropriate. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > Stephen Frost wrote: > > > It seems like it would often be desirable for the Oracle FDW to smash > > > all-upper-case names to all-lower-case while importing, so that no quoting > > > is needed on either side. I doubt though that this is so desirable that > > > it should happen unconditionally. > > > > The oracle FDW would simply need a foreign-server level option which > > says "smash everything to lowercase on import". > > That's not the same thing though -- consider what happens to the quoting > needs for names with mixed case. If you change mixed case to > all-lowercase, references to such objects using quotes in the > application code would fail because the name is now all-lowercase in > Postgres. A tri-valued enum could do the trick: > lowercase_names={wholly_uppercase_only, all, none} > with the first one being the most sensible and default. Sure, I was being a bit over-simplistic. As was mentioned up-thread, the option would rather be "flip all-uppercase to lowercase and all- lowercase to uppercase, quote any which are mixed", or something along those lines. What I was trying to get at is that it's up to the FDW what options it wants to support in this regard and we already have a way for the admin to pass in useful information to the FDW by way of the FOREIGN SERVER FDW options. This, plus the generic ability to pass an OPTIONS clause to the IMPORT (allowing you to have different defaults for different IMPORT statements) and having it be transactional, as you mention, appears to be covering all the relevant bases. Thanks, stephen signature.asc Description: Digital signature
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
Stephen Frost wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > Albe Laurenz writes: > > > Oracle follows the SQL standard in folding table names to upper case. > > > So this would normally result in a lot of PostgreSQL foreign tables > > > with upper case names, which would be odd and unpleasant. > > > > > I cannot see a way out of that, but I thought I should mention it. > > > > It seems like it would often be desirable for the Oracle FDW to smash > > all-upper-case names to all-lower-case while importing, so that no quoting > > is needed on either side. I doubt though that this is so desirable that > > it should happen unconditionally. > > The oracle FDW would simply need a foreign-server level option which > says "smash everything to lowercase on import". That's not the same thing though -- consider what happens to the quoting needs for names with mixed case. If you change mixed case to all-lowercase, references to such objects using quotes in the application code would fail because the name is now all-lowercase in Postgres. A tri-valued enum could do the trick: lowercase_names={wholly_uppercase_only, all, none} with the first one being the most sensible and default. > > Between this and the type-mapping questions, it seems likely that > > we're going to need a way for IMPORT FOREIGN SCHEMA to accept > > user-supplied control options, which would in general be specific > > to the FDW being used. (Another thing the SQL committee failed to > > think about.) > > I can see value in having specific mappings defined at the foreign > server level for what IMPORT does by default, but as IMPORT is intended > to be a bulk process, I don't see the value in trying to teach it how to > do a specific mapping for a specific table or column inside the schema. > You run the IMPORT for the entire schema and then go fix up any special > cases or things you wanted differently. Yes, it seems better to offer the ability to create transactional scripts around IMPORT (so it must be able to run in a transaction block -- IMPORT first, then do a bunch of ALTERs), than complicating IMPORT itself infinitely to support hundreds of possible options. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
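As a sketch of how that enum might surface to the DBA (every name here is illustrative, and the lowercase_names option itself is only a proposal at this point):

CREATE SERVER ora_srv FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//ora-host/XE',
             lowercase_names 'wholly_uppercase_only');  -- hypothetical option

-- an all-uppercase remote name like EMPLOYEES would then import as
-- "employees" and need no quoting on either side, while a mixed-case
-- name like "AcctBalance" would be imported unchanged and keep its quotes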
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
On 27/05/14 13:49, Ronan Dunklau wrote: Between this and the type-mapping questions, it seems likely that we're going to need a way for IMPORT FOREIGN SCHEMA to accept user-supplied control options, which would in general be specific to the FDW being used. (Another thing the SQL committee failed to think about.) So, without extending the syntax too much, we could add options:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
    [ { LIMIT TO | EXCEPT } (table_name [, ...]) ]
    [ OPTIONS (option_list) ]

This option list could then contain fdw-specific options, including type mapping. This one looks like a good option to me. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Ronan Dunklau (ronan.dunk...@dalibo.com) wrote: > So, without extending the syntax too much, we could add options: > > IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema > [ { LIMIT TO | EXCEPT } (table_name [, ...])] > [ OPTIONS (option_list) ] > > This option list could then contain fdw-specific options, including type > mapping. Yup, that looks reasonable to me. > Or we could add a specific clause: > > IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema > [ { LIMIT TO | EXCEPT } (table_name [, ...])] > [ OPTIONS (option_list) ] > [ TYPE MAPPING some mapping_definition ] -1 on this one. > With mapping_definition being either a tuple, or well-defined jsonb format > common accross FDWs. > > A third solution, which I don't like but doesn't modify the SQL grammar, > would > be to encode the options in the remote_schema name. Yeah, don't like that one either. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Albe Laurenz writes: > > Oracle follows the SQL standard in folding table names to upper case. > > So this would normally result in a lot of PostgreSQL foreign tables > > with upper case names, which would be odd and unpleasant. > > > I cannot see a way out of that, but I thought I should mention it. > > It seems like it would often be desirable for the Oracle FDW to smash > all-upper-case names to all-lower-case while importing, so that no quoting > is needed on either side. I doubt though that this is so desirable that > it should happen unconditionally. The oracle FDW would simply need a foreign-server level option which says "smash everything to lowercase on import". > Between this and the type-mapping questions, it seems likely that > we're going to need a way for IMPORT FOREIGN SCHEMA to accept > user-supplied control options, which would in general be specific > to the FDW being used. (Another thing the SQL committee failed to > think about.) I can see value in having specific mappings defined at the foreign server level for what IMPORT does by default, but as IMPORT is intended to be a bulk process, I don't see the value in trying to teach it how to do a specific mapping for a specific table or column inside the schema. You run the IMPORT for the entire schema and then go fix up any special cases or things you wanted differently. That said, I'm not against having a way to pass to IMPORT a similar key/value pair set which we already support for the FDW options on tables, columns, etc, by way of safe-guarding any potential use case for such. Thanks, Stephen signature.asc Description: Digital signature
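That fix-up pass is just ordinary DDL against the imported tables, e.g. (table and column names here are illustrative):

-- adjust the one column whose default mapping wasn't what we wanted
ALTER FOREIGN TABLE staging.orders
    ALTER COLUMN created_at TYPE timestamptz;

-- per-column FDW options work the same way, as with postgres_fdw's column_name
ALTER FOREIGN TABLE staging.orders
    ALTER COLUMN flags OPTIONS (ADD column_name 'FLAGS');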
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Matteo Beccati writes: > On 27/05/2014 03:07, Tom Lane wrote: >> I do not have a machine on which to try --with-bsd-uuid, so it's >> possible I broke that portion of Matteo's patch. Would someone try >> that case on a FreeBSD box? > I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. Ah, cool. I had documented this option as only working for FreeBSD, but that's obviously too conservative. Anyone know about whether it works on OpenBSD? > I'm attaching a little patch to be applied on top of yours. Thanks, will incorporate this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* David Fetter (da...@fetter.org) wrote: > In the "easy" case of PostgreSQL, you might also be able to establish > whether the UDT in the "already defined locally" case above is > identical to the one defined remotely, but I don't think it's possible > even in principle for non-PostgreSQL remote systems, possibly not even > for ones with non-identical architecture, PostgreSQL major version, > etc. For my 2c, it'd be very handy if IMPORT had an option or similar to also copy over all (in the schema) or at least any depended-upon UDTs. It'd really be nice if we had an ability to check the remote UDT vs. the local one at some point also, since otherwise you're going to get bitten at run-time when querying the foreign table. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
* David Fetter (da...@fetter.org) wrote:
> - We make type mappings settable at the level of:
>   - FDW
>   - Instance (a.k.a. cluster)
>   - Database
>   - Schema
>   - Table
>   - Column

While I like the general idea, you seem to be making this appear much more complex than it actually is. For example, I see no value in a table-level "uint4 -> int8" definition. If you want something which is not the default, just set it on the individual columns of the foreign table, exactly how we handle column name differences. There might be some value in being able to redefine what the default is at the FOREIGN SERVER level, as perhaps MySQL DB X and MySQL DB Y could have different default mappings for some reason, but adding in the rest of those levels doesn't add value imv.

> This would consist of:
>   - The remote type
>   - The local type to which it maps
>   - The inbound transformation (default identity)
>   - The outbound transformation (default identity)

The remote type and the local type are known already to the FDW, no? The FDW will need to know how to do whatever conversions we're talking about also, right? (If you want to do it in core PG instead of the FDW then I'm thinking you should probably use a view over top of the foreign table..). What's nice is that this all falls under what an FDW can do *already*, if it wants (and, actually, it could do it on the table-level technically too, if the FDW supports that, but schema? database? these things don't make sense...). For the IMPORT bit, it should just be doing whatever the defaults are unless you've set some different defaults for a given foreign server (also something which could be set using our existing system...).

> ALTER FOREIGN TABLE foo ADD (mapping '{
>   "datetime": "text",
>   "inbound": "IDENTITY",
>   "outbound": "IDENTITY"
> }')

I'm very much against having two different command languages where one is used "when convenient". If we wanted to do this, we should build a full-spec mapping from whatever JSON language you come up with for our utility commands to what we actually implement in the grammar. Thanks, Stephen signature.asc Description: Digital signature
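To make the view suggestion concrete, here is a minimal sketch (the table and column names are made up): if an FDW can only surface a remote column as text, a local view can present the desired type, with the conversion happening entirely on the local side:

-- hypothetical foreign table, as an FDW that only returns text might define it
CREATE FOREIGN TABLE remote_events (
    id      text,   -- remote type with no exact local equivalent
    payload text
) SERVER myserver OPTIONS (table_name 'events');

-- a view over the foreign table applies the mapping without FDW involvement
CREATE VIEW events AS
    SELECT id::int8 AS id, payload
    FROM remote_events;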
Re: [HACKERS] Spreading full-page writes
On 27 May 2014 07:42, Greg Stark wrote: > On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas > wrote: >> >> On 05/26/2014 02:26 PM, Greg Stark wrote: >>> Another idea would be to have separate checkpoints for each buffer >>> partition. You would have to start recovery from the oldest checkpoint of >>> any of the partitions. >> >> Yeah. Simon suggested that when we talked about this, but I didn't >> understand how that works at the time. I think I do now. The key to making >> it work is distinguishing, when starting recovery from the latest >> checkpoint, whether a record for a given page can be replayed safely. I used >> flags on WAL records in my proposal to achieve this, but using buffer >> partitions is simpler. > > Interesting. I just thought of it independently. Actually, I heard it from Doug Tolbert in 2005, based on how another DBMS coped with that issue. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 05/27/2014 03:18 PM, Simon Riggs wrote: IIRC Koichi had a patch for prefetch during recovery. Heikki, is that the reason you also discussed changing the WAL record format to allow us to identify the blocks touched by recovery more easily? Yeah, that was one use case I had in mind for the WAL format changes. See http://www.postgresql.org/message-id/533d6cbf.6080...@vmware.com. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 27 May 2014 03:49, Fujii Masao wrote: >> So that gives us a few approaches >> >> * Compressing FPWs gives A >> * Background FPWs gives us B >> which look like we can combine both ideas >> >> * Double-buffering would give us A and B, but not C >> and would be incompatible with other two ideas > > Double-buffering would allow us to disable FPW safely, but it would make > recovery slow. So if we adopt double-buffering, I think that we would also > need to overhaul recovery. Which is also true of Background FPWs. So our options are:

1. Compressed FPWs only
2. Compressed FPWs plus Background FPWs plus Recovery Buffer Prefetch
3. Double Buffering plus Recovery Buffer Prefetch

IIRC Koichi had a patch for prefetch during recovery. Heikki, is that the reason you also discussed changing the WAL record format to allow us to identify the blocks touched by recovery more easily? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 05/27/2014 02:42 PM, Greg Stark wrote: On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas wrote: On 05/26/2014 02:26 PM, Greg Stark wrote: Another idea would be to have separate checkpoints for each buffer partition. You would have to start recovery from the oldest checkpoint of any of the partitions. Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler. Interesting. I just thought of it independently. Incidentally you wouldn't actually want to use the buffer partitions per se since the new server might start up with a different number of partitions. You would want an algorithm for partitioning the block space that xlog replay can reliably reproduce regardless of the size of the buffer lock partition table. It might make sense to set it up so it coincidentally ensures all the buffers being flushed are in the same partition or maybe the reverse would be better. Probably it doesn't actually matter. Since you will be flushing the buffers one "redo partition" at a time, you would want to allow the OS to merge the writes within a partition as much as possible. So my even-odd split would in fact be pretty bad. Some sort of striping, e.g. mapping each contiguous 1 MB chunk to the same partition, would be better. I'm assuming you would keep N checkpoint positions in the control file. That also means we can double the checkpoint timeout with only a marginal increase in the worst case recovery time. Since the worst case will be (1 + 1/n)*timeout's worth of wal to replay rather than 2*n. The amount of time for recovery would be much more predictable. Good point. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race condition within _bt_findinsertloc()? (new page split code)
On 05/27/2014 09:17 AM, Peter Geoghegan wrote: While speccing out a new B-Tree verification tool, I had the opportunity to revisit a thought I had during review concerning _bt_findinsertloc(): that the new coding is unsafe in the event of deferred split completion of a leaf page of a unique index. To recap, we now do the following in _bt_findinsertloc():

/*
 * If this page was incompletely split, finish the split now. We
 * do this while holding a lock on the left sibling, which is not
 * good because finishing the split could be a fairly lengthy
 * operation. But this should happen very seldom.
 */
if (P_INCOMPLETE_SPLIT(lpageop))
{
    _bt_finish_split(rel, rbuf, stack);
    rbuf = InvalidBuffer;
    continue;
}

The "left sibling" referred to here is "the first page this key could be on", an important concept for unique index enforcement. No, it's not "the first page this key could be on". _bt_findinsertloc() does *not* hold a lock on the first valid page the key could go to. It merely ensures that when it steps to the next page, it releases the lock on the previous page only after acquiring the lock on the next page. Throughout the operation, it will hold a lock on *some* page that could legally hold the inserted value, and it acquires the locks in left-to-right order. This is sufficient for the uniqueness checks, because _bt_check_unique() scans all the pages, and _bt_check_unique() *does* hold the first page locked while it scans the rest of the pages. But _bt_findinsertloc() does not. Also note that _bt_moveright() also finishes any incomplete splits it encounters (when called for an insertion). So _bt_findinsertloc() never gets called on a page with the incomplete-split flag set. It might encounter one when it moves right, but never the first page. Anyway, the concern is that there may be problems when we call _bt_finish_split() with that left sibling locked throughout (i.e. finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and itself has a right sibling from the incomplete split (which is usually the value lock page's right-right sibling)). I'm not concerned about performance, since as the comment says it ought to be an infrequent occurrence. I also believe that there are no deadlock hazards. But consider this scenario: * We insert the value 7 into an int4 unique index. We need to split the leaf page. We run out of memory or something, and ours is an incomplete page split. Our transaction is aborted. For the sake of argument, suppose that there are also already a bunch of dead tuples within the index with values of 7, 8 and 9.

If I understood correctly, the tree looks like this before the insertion:

Parent page:
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A
+------------+
| HI-key: 9  |
|            |
| 7  8  9    |
+------------+

And after insertion and incomplete split:

Parent page
+------------+
|            |
| 9 -> A     |
+------------+

Leaf A               Leaf B
+--------------+     +------------+
| HI-key: 8    |     | HI-key: 9  |
| (INCOMPLETE_ |     |            |
|  SPLIT)      | <-> |            |
|              |     |            |
| 7  7*  8     |     | 9          |
+--------------+     +------------+

where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split) * Another inserter of the value 7 comes along. It follows exactly the same downlink as the first, now aborted inserter (suppose the downlink's value is 9). It also locks the same leaf page to establish a "value lock" in precisely the same manner. Finding no room on the first page, it looks further right, maintaining its original "value lock" throughout. It finishes the first inserter's split of the non-value-lock page - a new downlink is inserted into the parent page, with the value 8. It then releases all buffer locks except the first one/original "value lock". A physical insertion has yet to occur. Hmm, I think you got confused at this step. When inserting a 7, you cannot "look further right" to find a page with more space, because the HI-key, 8, on the first page stipulates that 7 must go on that page, not some later page. * A third inserter of the value comes along. It gets to a later page than the one locked by the second inserter, preferring the newer downlink with value 8 (the internal-page _bt_binsrch() logic ensures this). That's a contradiction: the downlink with value 8 points to the first page, not some later page. After the split is finished, the tree looks like this:

Parent page
+------------+
| 8 -> A     |
| 9 -> B     |
+------------+

Leaf A               Leaf B
+------------+     +------------+
| HI-key: 8  |     | HI-key: 9  |
|            | <-> |            |
| 7  7*  8   |     | 9          |
+------------+     +------------+

Regardless of whether or not I have these details exactly right (that is, regardless of whether or not this scenario is strictly possible) I suggest as a code-hardening measure that _bt_findinsertloc() release its "value lock", upon realizing it must complete splits, and then complete the split or splits known to be required. It would finally report that it "couldn't find" an insertion location to _bt_doinsert(), which would then retry from the start, just like when _bt_check_unique() finds an inconclusive conflict. Yeah, that would work too. It seems safe enough as it is, though, so I don't see the point. PS. Thanks for looking into this again! These B-tree changes really need thorough review. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IMPORT FOREIGN SCHEMA statement
>> Additionally, I don't really see how that would be useful in a general case. >> With an "in-core" defined meaning of type transformation, any FDW that doesn't >> fit exactly into the model would have a hard time. For example, what happens >> if an FDW is only ever capable of returning text? > That's actually the case where it's most important to have the feature > all the way down to the column level. I'm not sure configuration about specific columns from specific tables would be that useful: if you already know the structure of the table, you can either create it yourself, or run an alter column statement just after creating it. Alternatively, with the arbitrary options clause proposed by Tom and detailed below, one could use the LIMIT TO / EXCEPT clauses to fine-tune what options should apply. > That's why I suggested doing it via CREATE/ALTER with JSONB or similar > containing the details rather than inventing new SQL grammar, an > option I included only to dismiss it. Hum, adding a simple TYPE MAPPING is already inventing new SQL grammar, but it's less invasive. > Between this and the type-mapping questions, it seems likely that > we're going to need a way for IMPORT FOREIGN SCHEMA to accept > user-supplied control options, which would in general be specific > to the FDW being used. (Another thing the SQL committee failed to > think about.) So, without extending the syntax too much, we could add options:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
    [ { LIMIT TO | EXCEPT } (table_name [, ...]) ]
    [ OPTIONS (option_list) ]

This option list could then contain fdw-specific options, including type mapping. Or we could add a specific clause:

IMPORT FOREIGN SCHEMA remote_schema FROM SERVER server_name INTO local_schema
    [ { LIMIT TO | EXCEPT } (table_name [, ...]) ]
    [ OPTIONS (option_list) ]
    [ TYPE MAPPING some mapping_definition ]

With mapping_definition being either a tuple, or a well-defined jsonb format common across FDWs. A third solution, which I don't like but which doesn't modify the SQL grammar, would be to encode the options in the remote_schema name. Would one of those solutions be acceptable? -- Ronan Dunklau http://dalibo.com - http://dalibo.org signature.asc Description: This is a digitally signed message part.
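Under the first variant, an invocation might look like this (the include_types and relation_kinds option names come from the earlier suggestion in this thread; everything here is hypothetical until the syntax is settled):

IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_pg
    INTO local_mirror
    LIMIT TO (orders, order_lines)
    OPTIONS (include_types 'as_needed', relation_kinds 'rv');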
Re: [HACKERS] Spreading full-page writes
On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas wrote: > > On 05/26/2014 02:26 PM, Greg Stark wrote: >> >>> Another idea would be to have separate checkpoints for each buffer >> partition. You would have to start recovery from the oldest checkpoint of >> any of the partitions. > > Yeah. Simon suggested that when we talked about this, but I didn't understand > how that works at the time. I think I do now. The key to making it work is > distinguishing, when starting recovery from the latest checkpoint, whether a > record for a given page can be replayed safely. I used flags on WAL records > in my proposal to achieve this, but using buffer partitions is simpler. Interesting. I just thought of it independently. Incidentally you wouldn't actually want to use the buffer partitions per se since the new server might start up with a different number of partitions. You would want an algorithm for partitioning the block space that xlog replay can reliably reproduce regardless of the size of the buffer lock partition table. It might make sense to set it up so it coincidentally ensures all the buffers being flushed are in the same partition or maybe the reverse would be better. Probably it doesn't actually matter. > For simplicity, let's imagine that we have two Redo-pointers for each > checkpoint record: one for even-numbered pages, and another for odd-numbered > pages. When checkpoint begins, we first update the Even-redo pointer to the > current WAL insert location, and then flush all the even-numbered buffers in > the buffer cache. Then we do the same for Odd. Hm, I had convinced myself that the LSN on the pages would mean you skip the replay anyways but I think I was wrong and you would need to keep a bitmap of which partitions were in recovery mode as you replay and keep adding partitions until they're all in recovery mode and then keep going until you've seen the checkpoint record for all of them. I'm assuming you would keep N checkpoint positions in the control file. That also means we can double the checkpoint timeout with only a marginal increase in the worst case recovery time. Since the worst case will be (1 + 1/n)*timeout's worth of wal to replay rather than 2*n. The amount of time for recovery would be much more predictable. > Recovery begins at the Even-redo pointer. Replay works as normal, but until > you reach the Odd-pointer, you refrain from replaying any changes to > Odd-numbered pages. After reaching the odd-pointer, you replay everything as > normal. > > Hmm, that seems actually doable... -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spreading full-page writes
On 05/26/2014 02:26 PM, Greg Stark wrote: On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas wrote: The second record is generated before the checkpoint is finished and the checkpoint record is written. So it will be there. (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used) Another idea would be to have separate checkpoints for each buffer partition. You would have to start recovery from the oldest checkpoint of any of the partitions. Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler. For simplicity, let's imagine that we have two Redo-pointers for each checkpoint record: one for even-numbered pages, and another for odd-numbered pages. When checkpoint begins, we first update the Even-redo pointer to the current WAL insert location, and then flush all the even-numbered buffers in the buffer cache. Then we do the same for Odd. Recovery begins at the Even-redo pointer. Replay works as normal, but until you reach the Odd-pointer, you refrain from replaying any changes to Odd-numbered pages. After reaching the odd-pointer, you replay everything as normal. Hmm, that seems actually doable... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] postgres_fdw and connection management
On Tue, May 27, 2014 at 12:32:50AM -0300, Fabrízio de Royes Mello wrote: > On Mon, May 26, 2014 at 11:47 PM, Shigeru Hanada > wrote: > > > > 2014-05-24 0:09 GMT+09:00 Sandro Santilli : > > > Indeed I tried "DISCARD ALL" in the hope it would help, so I like your > > > idea of allowing extensions to register a hook there. > > > > > > Still, I'd like the FDW handler itself to possibly be configured > > > to disable the pool completely as a server-specific configuration. > > > > Connection management seems an FDW-specific feature to me. How about > > adding an FDW option, say pool_connection=true|false, to postgres_fdw which > > allows per-server configuration? > Makes sense... but if we use "pool_connection=true" and want to close the > opened connection, how can we do that? Right, I still consider hooks on DISCARD a useful addition. --strk; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
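If the per-server option suggested above were adopted, disabling the pool might look like this (the pool_connection option is hypothetical at this point, and the server name is illustrative):

-- postgres_fdw would then open and close a remote connection per statement
-- instead of caching it for the life of the local session
ALTER SERVER geo_server OPTIONS (ADD pool_connection 'false');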
Re: [HACKERS] Spreading full-page writes
On 05/26/2014 11:15 PM, Robert Haas wrote:
> On Mon, May 26, 2014 at 1:22 PM, Heikki Linnakangas wrote:
>>> I don't think we know that. The server might have crashed before that second record got generated. (This appears to be an unfixable flaw in this proposal.)
>>
>> The second record is generated before the checkpoint is finished and the checkpoint record is written. So it will be there.
>>
>> (if you crash before the checkpoint is finished, the in-progress checkpoint is no good for recovery anyway, and won't be used)
>
> Hmm, I see.
>
> It's not great to have to generate WAL at buffer-eviction time, though. Normally, when we go to evict a buffer, the WAL is already written. We might have to wait for it to be flushed, but if the WAL writer is doing its job, hopefully not. But here we'll definitely have to wait for the WAL flush.

Yeah. You would want to batch the flushes somehow, instead of flushing the WAL for every buffer being flushed. For example, after writing the FPW WAL record, just continue with the checkpoint without flushing the buffer, and do a second pass later doing the buffer flushes.

- Heikki

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
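The batching Heikki describes could look roughly like this: emit the FPW records for a batch of buffers first, then issue one WAL flush that covers all of them before writing any buffer to disk. Every function name here is an invented stand-in for this sketch, not PostgreSQL API.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

/* Hypothetical helpers standing in for xlog.c / bufmgr.c internals. */
extern XLogRecPtr EmitFullPageRecord(int buf_id);
extern void XLogFlushUpTo(XLogRecPtr upto);
extern void WriteBufferToDisk(int buf_id);

static void
flush_buffers_batched(const int *buf_ids, size_t n)
{
    XLogRecPtr  max_lsn = 0;

    /* Pass 1: emit an FPW record per buffer, remembering the highest LSN. */
    for (size_t i = 0; i < n; i++)
    {
        XLogRecPtr  lsn = EmitFullPageRecord(buf_ids[i]);

        if (lsn > max_lsn)
            max_lsn = lsn;
    }

    /* One WAL flush for the whole batch instead of one per buffer. */
    XLogFlushUpTo(max_lsn);

    /* Pass 2: now the buffers themselves can be written safely. */
    for (size_t i = 0; i < n; i++)
        WriteBufferToDisk(buf_ids[i]);
}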
Re: [HACKERS] Spreading full-page writes
On Tue, May 27, 2014 at 3:57 PM, Simon Riggs wrote:
> On 25 May 2014 17:52, Heikki Linnakangas wrote:
>> Here's an idea I tried to explain to Andres and Simon at the pub last night, on how to reduce the spikes in the amount of WAL written at the beginning of a checkpoint that full-page writes cause. I'm just writing this down for the sake of the archives; I'm not planning to work on this myself.
> ...
> Thanks for that idea, and dinner. It looks useful.
>
> I'll call this idea "Background FPWs".
>
>> Now, I'm sure there are issues with this scheme I haven't thought about, but I wanted to get this written down. Note this does not reduce the overall WAL volume - on the contrary - but it ought to reduce the spike.
>
> The requirements we were discussing were around
>
> A) reducing WAL volume
> B) reducing foreground overhead of writing FPWs - which spikes badly after checkpoint, and the overhead is paid by the user processes themselves
> C) need for FPWs during base backup
>
> So that gives us a few approaches:
>
> * Compressing FPWs gives A
> * Background FPWs gives us B
> which looks like we can combine both ideas
>
> * Double-buffering would give us A and B, but not C, and would be incompatible with the other two ideas

Double-buffering would allow us to disable FPWs safely, but it would make recovery slow. So if we adopt double-buffering, I think we would also need to overhaul recovery.

Regards,

-- Fujii Masao

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
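For context on the double-buffering approach: one common form of it is a doublewrite buffer (as used by InnoDB, for example). A minimal sketch of the core trick, assuming plain POSIX I/O and names invented purely for illustration: durably stage each page copy before overwriting the page in place, so a torn in-place write can be repaired from the staged copy at recovery instead of from a full-page image in WAL.

#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192

static int
double_buffered_write(int staging_fd, int data_fd, off_t offset,
                      const char *page)
{
    /* 1. Durably stage the page copy first. */
    if (pwrite(staging_fd, page, PAGE_SIZE, 0) != PAGE_SIZE)
        return -1;
    if (fsync(staging_fd) != 0)
        return -1;

    /*
     * 2. Only now overwrite the page in place.  If this write tears in a
     * crash, recovery restores the intact staged copy, so no full-page
     * image is needed in WAL.  The cost is writing every page twice,
     * which is why recovery and I/O volume are the concerns here.
     */
    if (pwrite(data_fd, page, PAGE_SIZE, offset) != PAGE_SIZE)
        return -1;

    return 0;
}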
Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD
Hi Tom,

On 27/05/2014 03:07, Tom Lane wrote:
> I've verified functionality of this patch on these scenarios:
>
> (1) --with-ossp-uuid on RHEL6, using uuid-1.6.1-10.el6.x86_64
> (2) --with-linux-uuid on RHEL6, using libuuid-2.17.2-12.14.el6_5.x86_64
> (3) --with-linux-uuid on OS X 10.9.3, Intel
> (4) --with-linux-uuid on OS X 10.4.11, PPC (hence, bigendian)
>
> I do not have a machine on which to try --with-bsd-uuid, so it's possible I broke that portion of Matteo's patch. Would someone try that case on a FreeBSD box?

I've tested on NetBSD i386 and --with-bsd-uuid worked out of the box. I could fire up some virtual machines with FreeBSD and other BSD flavours, but maybe some buildfarm animals could be used for that.

I'm attaching a little patch to be applied on top of yours. I didn't like that "buf ? 13 : 0" was raising a warning about the condition always being true on BSD. I guess it's safe to hardcode 13, as the argument is ignored anyway with OSSP, so I've fixed that.

I've also fixed v1mc generation on "linux" to match the OSSP and BSD variants, and added a regression test for it.

Cheers

--
Matteo Beccati
Development & Consulting - http://www.beccati.com/

diff --git a/contrib/uuid-ossp/expected/uuid_ossp.out b/contrib/uuid-ossp/expected/uuid_ossp.out
index f393e86..c14db22 100644
--- a/contrib/uuid-ossp/expected/uuid_ossp.out
+++ b/contrib/uuid-ossp/expected/uuid_ossp.out
@@ -77,3 +77,18 @@ SELECT uuid_generate_v4()::text ~ '^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]
  t
 (1 row)
 
+DO $_$
+DECLARE
+  u text;
+  i int;
+  c int;
+BEGIN
+  FOR i in 1..32 LOOP
+    u := substr(uuid_generate_v1mc()::text, 25, 2);
+    EXECUTE 'SELECT x''' || u || '''::int & 3' INTO c;
+    IF c <> 3 THEN
+      RAISE WARNING 'v1mc broken';
+    END IF;
+  END LOOP;
+END;
+$_$;
diff --git a/contrib/uuid-ossp/sql/uuid_ossp.sql b/contrib/uuid-ossp/sql/uuid_ossp.sql
index 8f22417..61a44a8 100644
--- a/contrib/uuid-ossp/sql/uuid_ossp.sql
+++ b/contrib/uuid-ossp/sql/uuid_ossp.sql
@@ -17,3 +17,20 @@ SELECT uuid_generate_v3(uuid_ns_dns(), 'www.widgets.com');
 SELECT uuid_generate_v5(uuid_ns_dns(), 'www.widgets.com');
 
 SELECT uuid_generate_v4()::text ~ '^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$';
+
+DO $_$
+DECLARE
+  u text;
+  i int;
+  c int;
+BEGIN
+  FOR i in 1..32 LOOP
+    u := substr(uuid_generate_v1mc()::text, 25, 2);
+    EXECUTE 'SELECT x''' || u || '''::int & 3' INTO c;
+    IF c <> 3 THEN
+      RAISE WARNING 'v1mc broken';
+    END IF;
+  END LOOP;
+END;
+$_$;
+
diff --git a/contrib/uuid-ossp/uuid-ossp.c b/contrib/uuid-ossp/uuid-ossp.c
index bc29ade..7803dbe 100644
--- a/contrib/uuid-ossp/uuid-ossp.c
+++ b/contrib/uuid-ossp/uuid-ossp.c
@@ -460,6 +460,10 @@ uuid_generate_v1mc(PG_FUNCTION_ARGS)
 	uuid_t		uu;
 
 	uuid_generate_random(uu);
+
+	/* set IEEE802 multicast and local-admin bits */
+	((dce_uuid_t *) &uu)->node[0] |= 0x03;
+
 	uuid_unparse(uu, strbuf);
 	buf = strbuf + 24;
 #else			/* BSD */
@@ -472,7 +476,7 @@ uuid_generate_v1mc(PG_FUNCTION_ARGS)
 #endif
 
 	return uuid_generate_internal(UUID_MAKE_V1 | UUID_MAKE_MC, NULL,
-								  buf, buf ? 13 : 0);
+								  buf, 13);
 }

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
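As a standalone illustration of what the node[0] |= 0x03 line does (this is not part of the patch): in a v1mc UUID's node field, bit 0 of the first octet is the IEEE 802 multicast bit and bit 1 is the locally-administered bit, which is exactly what the regression test above checks with "& 3".

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint8_t node0 = 0x52;       /* pretend first node octet of a random MAC */

    node0 |= 0x03;              /* force multicast + locally-administered */
    printf("node[0] = 0x%02x, low two bits = %d\n", node0, node0 & 3);
    return 0;                   /* prints: node[0] = 0x53, low two bits = 3 */
}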
Re: [HACKERS] Re-create dependent views on ALTER TABLE ALTER COLUMN ... TYPE?
David Fetter writes:
>>> Also worth considering: functions which take any part of the view as a parameter.
>>
>> Sorry, I don't get it: do you suggest we should re-create dependent functions too?
>
> I'd throw an error in cases where such functions had an obvious and deterministic dependency on the views, ideally having gone through all such functions first and enumerated them in the error message.

Then it would also make sense to start by checking function dependencies on the tables themselves, not only on the joining views:

psql=> CREATE TABLE t(id INT);
CREATE TABLE
psql=> CREATE FUNCTION func1() RETURNS SETOF INT AS $$ SELECT id FROM t; $$ LANGUAGE SQL;
CREATE FUNCTION
psql=> ALTER TABLE t ALTER COLUMN id TYPE BIGINT;
ALTER TABLE    -- Would complain about func1 right away
psql=> SELECT func1();
ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is bigint.
CONTEXT:  SQL function "func1" during startup
psql=> CREATE FUNCTION func2() RETURNS SETOF INT AS $$ SELECT id FROM t; $$ LANGUAGE SQL;
ERROR:  return type mismatch in function declared to return integer
DETAIL:  Actual return type is bigint.
CONTEXT:  SQL function "func2"

-- Alex

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers