Re: [HACKERS] Redesigning checkpoint_segments

2015-02-23 Thread Andres Freund
On 2015-02-22 21:24:56 -0500, Robert Haas wrote:
 On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek p...@2ndquadrant.com wrote:
  I am wondering a bit about interaction with wal_keep_segments.
  One thing is that wal_keep_segments is still specified as a number of
  segments and not in size units; maybe it would be worth changing that too?
  And the other thing is that, if set, wal_keep_segments is the real
  max_wal_size from the user's perspective (not from the perspective of the
  algorithm in this patch, but the user does not really care about that),
  which is somewhat weird given the naming.
 
 It seems like wal_keep_segments is more closely related to
 wal_*min*_size.  The idea of both settings is that each is a minimum
 amount of WAL we want to keep around for some purpose.  But they're
 not quite the same, I guess, because wal_min_size just forces us to
 keep that many files around - they can be overwritten whenever.
 wal_keep_segments is an amount of actual WAL data we want to keep
 around.
 
 Would it make sense to require that wal_keep_segments = wal_min_size?

I don't think so. Right now checkpoint_segments is a useful tool to
relatively effectively control the amount of WAL that needs to be
replayed in the event of a crash. wal_keep_segments in contrast doesn't
have much to do with the normal working of the system, except that it
delays recycling of WAL segments a bit.

With a condition like the above, how would you set things up so that you
have 50k segments around for replication (say a good day's worth), but will
never have to replay more than ~800 segments (i.e. something like
checkpoint_segments = 800)?
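
[Editorial aside, not part of the original mail: the kind of setup described
above can be expressed with the pre-patch GUCs roughly like this, the numbers
being purely hypothetical:

    ALTER SYSTEM SET wal_keep_segments = 50000;  -- ~a day's worth kept for standbys
    ALTER SYSTEM SET checkpoint_segments = 800;  -- bounds WAL replayed after a crash
    SELECT pg_reload_conf();

Both GUCs can be changed with a reload.]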

Am I missing something?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Abbreviated keys for Numeric

2015-02-23 Thread Andrew Gierth
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:

 Tomas Interesting, but I think Gavin was asking about how much
 Tomas variation was there for each tested case (e.g. query executed on
 Tomas the same code / dataset). And in those cases the padding /
 Tomas alignment won't change, and thus the effects you describe won't
 Tomas influence the results, no?

My point is exactly the fact that since the result is not affected, this
variation between runs of the same code is of no real relevance to the
question of whether a given change in performance can properly be
attributed to a patch.

Put it this way: suppose I have a test that when run repeatedly with no
code changes takes 6.10s (s=0.025s), and I apply a patch that changes
that to 6.26s (s=0.025s). Did the patch have an impact on performance?

Now suppose that instead of applying the patch I insert random amounts
of padding in an unused function and find that my same test now takes a
mean of 6.20s (s=0.058s) when I take the best timing for each padding
size and calculate stats across sizes. Now it looks obvious that the
actual code of the patch probably wasn't responsible for any change...

The numbers used here aren't theoretical; they are obtained by testing a
single query - select * from d_flt order by v offset 1000 where
d_flt contains 5 million float8 values - over 990 times with 33
different random padding sizes (uniform in 0-32767). Here's a scatter
plot, with 3 runs of each padding size so you can see the repeatability:
http://tinyurl.com/op9qg8a
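
[Editorial aside, not part of the original mail: the "best timing per padding
size, stats across sizes" calculation described above could be done with a
query along these lines, assuming the raw timings were loaded into a
hypothetical table timings(padding_bytes int, run int, seconds float8):

    SELECT avg(best) AS mean, stddev_samp(best) AS s
    FROM (SELECT padding_bytes, min(seconds) AS best
          FROM timings
          GROUP BY padding_bytes) AS per_padding;
]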

-- 
Andrew.




Re: [HACKERS] Parallel Seq Scan

2015-02-23 Thread Kohei KaiGai
 Amit and I had a long discussion about this on Friday while in Boston
 together.  I previously argued that the master and the slave should be
 executing the same node, ParallelSeqScan.  However, Amit argued
 persuasively that what the master is doing is really pretty different
 from what the worker is doing, and that they really ought to be
 running two different nodes.  This led us to cast about for a better
 design, and we came up with something that I think will be much
 better.

 The basic idea is to introduce a new node called Funnel.  A Funnel
 node will have a left child but no right child, and its job will be to
 fire up a given number of workers.  Each worker will execute the plan
 which is the left child of the funnel.  The funnel node itself will
 pull tuples from all of those workers, and can also (if there are no
 tuples available from any worker) execute the plan itself.  So a
 parallel sequential scan will look something like this:

 Funnel
   Workers: 4
   -> Partial Heap Scan on xyz

 What this is saying is that each worker is going to scan part of the
 heap for xyz; together, they will scan the whole thing.


What is the best way to determine the number of workers? A fixed number is
one idea. It may also make sense to add a new common field to the Path node
describing what portion of the node's execution can be parallelized, or
whether it cannot run in parallel at all. Alternatively, instead of at plan
time, we may be able to determine the number at execution time, according to
the number of concurrently running workers and the number of CPU cores.

 The neat thing about this way of separating things out is that we can
 eventually write code to push more stuff down into the funnel.  For
 example, consider this:

 Nested Loop
   -> Seq Scan on foo
   -> Index Scan on bar
        Index Cond: bar.x = foo.x

 Now, if a parallel sequential scan is cheaper than a regular
 sequential scan, we can instead do this:

 Nested Loop
   -> Funnel
        -> Partial Heap Scan on foo
   -> Index Scan on bar
        Index Cond: bar.x = foo.x

 The problem with this is that the nested loop/index scan is happening
 entirely in the master.  But we can have logic that fixes that by
 knowing that a nested loop can be pushed through a funnel, yielding
 this:

 Funnel
   -> Nested Loop
        -> Partial Heap Scan on foo
        -> Index Scan on bar
             Index Cond: bar.x = foo.x

 Now that's pretty neat.  One can also imagine doing this with
 aggregates.  Consider:

I guess the planner enhancement would live around add_paths_to_joinrel().
When any of the underlying join paths support multi-node execution, the new
code would add a Funnel node on top of these join paths. Just my thought.

 HashAggregate
   -> Funnel
        -> Partial Heap Scan on foo
             Filter: x = 1

 Here, we can't just push the HashAggregate down through the Funnel, but
 given suitable infrastructure we could convert that to something like this:

 HashAggregateFinish
   -> Funnel
        -> HashAggregatePartial
             -> Partial Heap Scan on foo
                  Filter: x = 1

 That'd be swell.

 You can see that something like this will also work for breaking off
 an entire plan tree and shoving it down into a worker.  The master
 can't participate in the computation in that case, but it's otherwise
 the same idea.

I believe this is the overall vision we've discussed in the thread about
combining aggregate functions, although people have primarily considered
applying this feature to aggregate push-down across joins.

One key piece of infrastructure may be the capability to define a combining
function for an aggregate. It would inform the planner that the given
aggregate supports two-stage execution. In addition to this, we may need a
planner enhancement to inject the partial aggregate node during path
construction.

Probably, we also have to set a flag to inform the later stage (which
constructs the Agg plan) that the underlying scan/join node performs partial
aggregation, so the final aggregation has to expect transition-state data
instead of the usual row-by-row arguments.
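
[Editorial aside, not part of the original mail: in plain SQL terms, the
two-stage split amounts to computing per-worker partial aggregates and then
combining them, conceptually something like the sketch below, where the
GROUP BY is only a stand-in for the per-worker partitioning of the heap and
the names come from the example plans above:

    SELECT sum(partial_sum) AS total        -- "finish" step combines partial results
    FROM (SELECT sum(x) AS partial_sum      -- each worker computes a partial aggregate
          FROM foo
          GROUP BY x % 4                    -- hypothetical stand-in for 4 workers' chunks
         ) AS partials;

A real implementation would combine serialized transition states via the
combining function described above, rather than rely on sum() being
trivially decomposable.]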

Also, I think a HashJoin with a very large outer relation but a much smaller
inner one is a good candidate for distribution across multiple nodes. Even
if a multi-node HashJoin has to read the small inner relation N times,
splitting up the very large outer relation should still be a win.

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp




Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Florian Weimer
On 02/22/2015 02:05 PM, Andres Freund wrote:
 On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote:
 I honestly wonder why postgres uses renegotiation at all. The motivation
 that cryptoanalysis is easier as more data is sent seems quite
 far-fetched.
 
 I don't think so. There's a fair number of algorithms that can/could be
 attacked much more easily with lots of data available, especially if you
 can guess/know/control some of the data.  Additionally, renegotiating
 regularly helps to constrain a possible key leakage to a certain amount
 of time. With backend connections often being alive for weeks at a time,
 that's not a bad thing.

Renegotiation will be removed from future TLS versions because it is
considered unnecessary with modern ciphers:

  https://github.com/tlswg/tls13-spec/issues/38

If ciphers require rekeying, that mechanism will be provided at the TLS
layer in the future.

I think you could remove renegotiation from PostgreSQL as long as you
offer something better than RC4 in the TLS handshake.
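
[Editorial aside, not part of the original mail: "something better than RC4"
is already controllable through the existing ssl_ciphers GUC using OpenSSL
cipher-list syntax, e.g. something along the lines of

    ALTER SYSTEM SET ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL:!RC4';

noting that in releases of this era the setting only takes effect at server
start.]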

-- 
Florian Weimer / Red Hat Product Security




Re: [HACKERS] pg_dump gets attributes from tables in extensions

2015-02-23 Thread Michael Paquier
On Mon, Feb 23, 2015 at 5:57 PM, Rushabh Lathia rushabh.lat...@gmail.com
wrote:
 Thanks to the handy test case, I was able to replicate the scenario in my
 local environment. And yes, the tbinfo->dobj.ext_member check in
 getTableAttrs() does fix the issue.

 Looking more into the pg_dump code, I found that PG generally doesn't use a
 tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind
 of check using tbinfo->dobj.dump (look at dumpTable() for reference). Do
 you have any particular reason for choosing dobj.ext_member over dobj.dump?

Hm. I have been pondering this code more, and I am dropping the patch:
adding a check based on either the flag tracking dumpable objects or on
ext_member visibly breaks the logic that binary upgrade and tables in
extensions rely on. In particular, this portion of getExtensionMembership()
makes me think so:
        /*
         * Normally, mark the member object as not to be dumped.  But in
         * binary upgrades, we still dump the members individually, since the
         * idea is to exactly reproduce the database contents rather than
         * replace the extension contents with something different.
         */
        if (!dopt->binary_upgrade)
            dobj->dump = false;
        else
            dobj->dump = refdobj->dump;
Regards,
-- 
Michael


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Fujii Masao
On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
    */
   WaitLatch(&XLogCtl->recoveryWakeupLatch,
             WL_LATCH_SET | WL_TIMEOUT,
 -           5000L);
 +           wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where the large interval is set
 and the user want to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. It is interesting to accelerate the check of WAL
 availability from a source in some cases for replication, but the
 opposite is true as well as mentioned by Alexey at the beginning of
 the thread to reduce the number of requests when requesting WAL
 archives from an external storage type AWS. Hence a correct solution
 would be to check periodically for the trigger file with a maximum
 one-time wait of 5s to ensure backward-compatible behavior. We could
 reduce it to 1s or something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing 
 something.
 The timeout of the WaitLatch is just the interval to check for the trigger 
 file
 while waiting for more WAL to arrive from streaming replication. Not 
 related to
 the retry time to restore WAL from the archive.

 [Re-reading the code...]
 Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
 maximum of 5s then. I also noticed in previous patch that the wait was
 maximized to 5s. To begin with, a loop should have been used if it was
 a sleep, but as now patch uses a latch this limit does not make much
 sense... Patch updated is attached.

 On second thought, the interval to check the trigger file is very different
 from the wait time to retry to retrieve WAL. So it seems strange and even
 confusing to control them by one parameter. If we really want to make
 the interval for the trigger file configurable, we should invent new GUC for 
 it.
 But I don't think that it's worth doing that. If someone wants to react the
 trigger file more promptly for fast promotion, he or she basically can use
 pg_ctl promote command, instead. Thought?

 Hm, OK.

 Attached is the updated version of the patch. I changed the parameter so that
 it doesn't affect the interval of checking the trigger file.

 -static pg_time_t last_fail_time = 0;
 -pg_time_t   now;
 +TimestampTz now = GetCurrentTimestamp();
 +TimestampTz last_fail_time = now;

 I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
 to be called for each WaitForWALToBecomeAvailable().

 +WaitLatch(&XLogCtl->recoveryWakeupLatch,
 +          WL_LATCH_SET | WL_TIMEOUT,
 +          wait_time / 1000);

 Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.

 Yeah, though I am wondering why this was not added after 89fd72cb.

 +{"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,

 WAL_SETTINGS should be REPLICATION_STANDBY? Changed.

 Sure.

So I pushed the patch.

Regards,

-- 
Fujii Masao




[HACKERS] ERROR: cannot GetMultiXactIdMembers() during recovery

2015-02-23 Thread Marko Tiikkaja

Hi,

Andres asked me on IRC to report this here.  Since we upgraded our 
standby servers to 9.1.15 (though the master is still running 9.1.14), 
we've seen the error in $SUBJECT a number of times.  I managed to 
reproduce it today by running the same query over and over again, and 
attached is the back trace.


Let me know if you need any additional information.


.m
#0  GetMultiXactIdMembers (multi=56513428, xids=0x7fff9e3691e8) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/transam/multixact.c:923
#1  0x7f029a917dd4 in MultiXactIdIsRunning (multi=optimized out) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/transam/multixact.c:386
#2  0x7f029abad79d in HeapTupleSatisfiesVacuum (tuple=0x7f0145f2c060, 
OldestXmin=2418699920, buffer=111945) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/utils/time/tqual.c:1184
#3  0x7f029a8f08f5 in index_getnext (scan=scan@entry=0x7f029b66ac78, 
direction=direction@entry=ForwardScanDirection) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/access/index/indexam.c:644
#4  0x7f029a9fe646 in IndexNext (node=node@entry=0x7f029b669550) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeIndexscan.c:78
#5  0x7f029a9f41dc in ExecScanFetch (recheckMtd=0x7f029a9fe5c0 
IndexRecheck, accessMtd=0x7f029a9fe600 IndexNext, node=0x7f029b669550)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:82
#6  ExecScan (node=node@entry=0x7f029b669550, 
accessMtd=accessMtd@entry=0x7f029a9fe600 IndexNext, 
recheckMtd=recheckMtd@entry=0x7f029a9fe5c0 IndexRecheck)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:167
#7  0x7f029a9fe72b in ExecIndexScan (node=node@entry=0x7f029b669550) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeIndexscan.c:146
#8  0x7f029a9ec968 in ExecProcNode (node=node@entry=0x7f029b669550) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:398
#9  0x7f029a9fd3e1 in MultiExecHash (node=node@entry=0x7f029b6690b0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHash.c:103
#10 0x7f029a9eca74 in MultiExecProcNode (node=node@entry=0x7f029b6690b0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:536
#11 0x7f029a9fdd94 in ExecHashJoin (node=node@entry=0x7f029b667aa0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:177
#12 0x7f029a9ec8b8 in ExecProcNode (node=node@entry=0x7f029b667aa0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:447
#13 0x7f029a9fe319 in ExecHashJoinOuterGetTuple (hashvalue=0x7fff9e369554, 
hjstate=0x7f029b666c80, outerNode=0x7f029b667aa0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:656
#14 ExecHashJoin (node=node@entry=0x7f029b666c80) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeHashjoin.c:209
#15 0x7f029a9ec8b8 in ExecProcNode (node=node@entry=0x7f029b666c80) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:447
#16 0x7f029aa05a59 in ExecSort (node=node@entry=0x7f029b666a10) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSort.c:103
#17 0x7f029a9ec898 in ExecProcNode (node=node@entry=0x7f029b666a10) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:458
#18 0x7f029aa096e0 in begin_partition 
(winstate=winstate@entry=0x7f029b665530) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeWindowAgg.c:683
#19 0x7f029aa0b49b in ExecWindowAgg 
(winstate=winstate@entry=0x7f029b665530) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeWindowAgg.c:1287
#20 0x7f029a9ec868 in ExecProcNode (node=0x7f029b665530) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:470
#21 0x7f029a9f41dc in ExecScanFetch (recheckMtd=0x7f029aa084f0 
SubqueryRecheck, accessMtd=0x7f029aa08500 SubqueryNext, node=0x7f029b664e60)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:82
#22 ExecScan (node=node@entry=0x7f029b664e60, 
accessMtd=accessMtd@entry=0x7f029aa08500 SubqueryNext, 
recheckMtd=recheckMtd@entry=0x7f029aa084f0 SubqueryRecheck)
at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execScan.c:167
#23 0x7f029aa08538 in ExecSubqueryScan (node=node@entry=0x7f029b664e60) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSubqueryscan.c:85
#24 0x7f029a9ec938 in ExecProcNode (node=node@entry=0x7f029b664e60) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/execProcnode.c:412
#25 0x7f029aa05a59 in ExecSort (node=node@entry=0x7f029b664bf0) at 
/tmp/buildd/postgresql-9.1-9.1.15/build/../src/backend/executor/nodeSort.c:103
#26 0x7f029a9ec898 in ExecProcNode 

Re: [HACKERS] Enforce creation of destination folders for source files in pg_regress (Was: pg_regress writes into source tree)

2015-02-23 Thread Robert Haas

 On Feb 22, 2015, at 5:41 AM, Michael Paquier michael.paqu...@gmail.com 
 wrote:
 This is up to the maintainer of each extension to manage their code
 tree. However I can imagine that some people would be grateful if we
 allow them to not need sql/ and expected/ containing only one single
 .gitignore file ignoring everything with *, making the code tree of
 their extensions cleaner.

I don't see this as an improvement. What's wrong with a .gitignore file 
ignoring everything?

...Robert



Re: [HACKERS] ERROR: cannot GetMultiXactIdMembers() during recovery

2015-02-23 Thread Andres Freund
Hi,

On 2015-02-23 15:00:35 +0100, Marko Tiikkaja wrote:
 Andres asked me on IRC to report this here.  Since we upgraded our standby
 servers to 9.1.15 (though the master is still running 9.1.14), we've seen
 the error in $SUBJECT a number of times.

FWIW, I think this is just as borked in 9.1.14 and will likely affect
all of 9.0 - 9.2. The problem is that in those releases multixacts
aren't maintained on the standby in a way that allows access.

index_getnext() itself is actually pretty easy to fix, it already checks
whether the scan started while in recovery when using the result of the
error triggering HeapTupleSatisfiesVacuum(), just too late.  I don't
remember other HTSV callers that can run in recovery, given that DDL is
obviously impossible and we don't support serializable while in
recovery.

Alternatively we could make MultiXactIdIsRunning() return false in < 9.3
when in recovery. I think that'd end up fixing things, but it seems
awfully fragile to me.

I do see a HTSU in pgrowlocks.c - that's not really safe during recovery
in < 9.3, given it accesses multixacts. I guess it needs to throw an error.

I wonder if we shouldn't put an Assert() in HTSV/HTSU to prevent such
problems.

Greetings,

Andres Freund
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-23 Thread Fujii Masao
On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,

 Attached is a patch which has following changes,

 As suggested above block ID in xlog structs has been replaced by chunk ID.
 Chunk ID is used to distinguish between different types of xlog record
 fragments.
 Like,
 XLR_CHUNK_ID_DATA_SHORT
 XLR_CHUNK_ID_DATA_LONG
 XLR_CHUNK_BKP_COMPRESSED
 XLR_CHUNK_BKP_WITH_HOLE

 In block references, the block ID follows the chunk ID and retains its
 original functionality.
 This approach adds 1 byte for each block reference in an xlog record, but
 it separates the IDs referring to the different fragments of an xlog record
 from the actual block ID that is used to refer to block references within
 the record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format
because I got the following error and the startup process could not replay
any WAL records when I set up replication and enabled wal_compression.

LOG:  record with invalid length at 0/3B0
LOG:  record with invalid length at 0/3000518
LOG:  Invalid block length in record 0/30005A0
LOG:  Invalid block length in record 0/3000D60
...

Regards,

-- 
Fujii Masao




Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Andres Freund
Hi,

On 2015-02-23 15:25:57 +, Thom Brown wrote:
 I've noticed that if the primary is started and then a base backup is
  immediately taken from it and started as a synchronous standby, it
 doesn't replicate and the primary hangs indefinitely when trying to run any
 WAL-generating statements.  It only recovers when either the primary is
 restarted (which has to use a fast shutdown otherwise it also hangs
 forever), or the standby is restarted.
 
 Here's a way of reproducing it:
 ...
 Note that if you run the commands one by one, there isn't a problem.  If
 you run it as a script, the standby doesn't connect to the primary.  There
 aren't any errors reported by either the standby or the primary.  The
 primary's wal sender process reports the following:
 
 wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158
 
 Anyone know why this would be happening?  And if this could be a problem in
 other scenarios?

Given that normally a walsender doesn't wait for syncrep, I guess this is
the above backend having just done authentication. If you gdb into the
walsender, what's the backtrace?

We previously had discussions about that being rather annoying; I
unfortunately don't remember enough of the thread to reference it
here. If it really is this, I think we should add some more smarts about
only enabling syncrep once a backend is fully up and maybe even remove
it from more scenarios during commits generally (e.g. if no xid was
assigned and we just pruned something).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
Hi,

I've noticed that if the primary is started and then a base backup is
immediately taken from it and started as a synchronous standby, it
doesn't replicate and the primary hangs indefinitely when trying to run any
WAL-generating statements.  It only recovers when either the primary is
restarted (which has to use a fast shutdown otherwise it also hangs
forever), or the standby is restarted.

Here's a way of reproducing it:

---

mkdir -p -m 0700 primary standby1

initdb -N -k -D primary -E 'UTF8'

cat << PRIMARYCONFIG >> primary/postgresql.conf
shared_buffers = 8MB
logging_collector = on
log_line_prefix = '%m - %u - %d'
synchronous_standby_names = 'standby1'
max_connections = 8
wal_level = 'hot_standby'
port = 5530
max_wal_senders = 3
wal_keep_segments = 6
PRIMARYCONFIG

cat << PRIMARYHBA >> primary/pg_hba.conf
local   replication rep_user   trust
hostreplication rep_user   127.0.0.1/32trust
hostreplication rep_user   ::1/128 trust
PRIMARYHBA

pg_ctl start -D primary

psql -p 5530 -h localhost -c "SET SESSION synchronous_commit TO off; CREATE USER rep_user REPLICATION;" -d postgres

pg_basebackup -x -D standby1 -h localhost -p 5530 -U rep_user

cat << STANDBYCONFIG >> standby1/postgresql.conf
port = 5531
hot_standby = on
STANDBYCONFIG

cat << STANDBYRECOVERY >> standby1/recovery.conf
standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_conninfo = 'host=127.0.0.1 user=rep_user port=5530
application_name=standby1'
STANDBYRECOVERY

pg_ctl -D standby1 start

---

Note that if you run the commands one by one, there isn't a problem.  If
you run it as a script, the standby doesn't connect to the primary.  There
aren't any errors reported by either the standby or the primary.  The
primary's wal sender process reports the following:

wal sender process rep_user 127.0.0.1(45243) startup waiting for 0/3000158
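
[Editorial aside, not part of the original mail: while the primary hangs
like this, two quick checks one might run against it are

    SELECT application_name, state, sync_state FROM pg_stat_replication;
    SELECT pid, state, query FROM pg_stat_activity;

to see whether the standby ever registered as a sync standby and what the
stuck connection is doing.]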

Anyone know why this would be happening?  And if this could be a problem in
other scenarios?

Thom


Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Andres Freund
On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
 I unfortunately don't remember enough of the thread to reference it
 here.

Found the right keywords. The threads below
http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
and
http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
On 23 February 2015 at 15:38, Andres Freund and...@2ndquadrant.com wrote:

 Hi,

 On 2015-02-23 15:25:57 +, Thom Brown wrote:
  I've noticed that if the primary is started and then a base backup is
  immediately taken from it and started as a synchronous standby, it
  doesn't replicate and the primary hangs indefinitely when trying to run
 any
  WAL-generating statements.  It only recovers when either the primary is
  restarted (which has to use a fast shutdown otherwise it also hangs
  forever), or the standby is restarted.
 
  Here's a way of reproducing it:
  ...
  Note that if you run the commands one by one, there isn't a problem.  If
  you run it as a script, the standby doesn't connect to the primary.
 There
  aren't any errors reported by either the standby or the primary.  The
  primary's wal sender process reports the following:
 
  wal sender process rep_user 127.0.0.1(45243) startup waiting for
 0/3000158
 
  Anyone know why this would be happening?  And if this could be a problem
 in
  other scenarios?

 Given that normally a walsender doesn't wait for syncrep I guess this is
 the above backend just did authentication. If you gdb into the
 walsender, what's the backtrace?


#0  0x7f66d1725940 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00617faa in WaitLatchOrSocket ()
#2  0x0064741b in SyncRepWaitForLSN ()
#3  0x004bbf8f in CommitTransaction ()
#4  0x004be135 in CommitTransactionCommand ()
#5  0x00757679 in InitPostgres ()
#6  0x00675032 in PostgresMain ()
#7  0x004617ef in ServerLoop ()
#8  0x00627c9c in PostmasterMain ()
#9  0x0046223d in main ()

-- 
Thom


Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Albe Laurenz
Florian Weimer wrote:
 On 02/22/2015 02:05 PM, Andres Freund wrote:
 On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote:
 I honestly wonder why postgres uses renegotiation at all. The motivation
 that cryptoanalysis is easier as more data is sent seems quite
 far-fetched.

 I don't think so. There's a fair number of algorithms that can/could be
 attacked much more easily with lots of data available, especially if you
 can guess/know/control some of the data.  Additionally, renegotiating
 regularly helps to constrain a possible key leakage to a certain amount
 of time. With backend connections often being alive for weeks at a time,
 that's not a bad thing.
 
 Renegotiation will be removed from future TLS versions because it is
 considered unnecessary with modern ciphers:
 
   https://github.com/tlswg/tls13-spec/issues/38
 
 If ciphers require rekeying, that mechanism will be provided at the TLS
 layer in the future.
 
 I think you could remove renegotiation from PostgreSQL as long as you
 offer something better than RC4 in the TLS handshake.

I'd say it is best to wait and see if and how OpenSSL changes their API when
they implement TLS 1.3.

I'd vote against removing renegotiation.  At the very least, if the feature
should prove unnecessary and cumbersome with future versions of OpenSSL,
we should retain ssl_renegotiation_limit and change the default to 0.
It might still be of value with older versions.
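
[Editorial aside, not part of the original mail: retaining the GUC and
changing the default to 0, as suggested above, amounts to

    ALTER SYSTEM SET ssl_renegotiation_limit = 0;   -- 0 disables renegotiation

which is also what individual installations can already do today if
renegotiation causes them trouble.]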

If changing the encryption is so useless, why did the TLS working group
decide to introduce rekeying as a substitute for renegotiation?

Yours,
Laurenz Albe



Re: [HACKERS] [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

2015-02-23 Thread Andres Freund
Hi,

On 2014-07-26 18:16:01 +0200, Andres Freund wrote:
 On 2014-07-26 11:32:24 -0400, Tom Lane wrote:
  MauMau maumau...@gmail.com writes:
  [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
  
  Ugh.
  
  One line of thought is that it's pretty unsafe to be doing anything
  as complicated as transaction start/commit in a signal handler, even one
  that is sure it's not interrupting anything else.
 
 Yea, that's really not nice.

MauMau, we don't do this anymore. Could you verify that the issue is
fixed for you?

I'd completely forgotten that this thread made me work on moving
everything complicated out of signal handlers...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote:

 On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
  I unfortunately don't remember enough of the thread to reference it
  here.

 Found the right keywords. The threads below

 http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
 and

 http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau


Yes, this seems to be virtually the same issue reported.  The trace looks
the same except for RecordTransactionCommit.

-- 
Thom


Re: [HACKERS] multiple backends attempting to wait for pincount 1

2015-02-23 Thread Andres Freund
On 2015-02-17 13:14:00 -0500, Tom Lane wrote:
 Hm, good point.  On the other hand, should we worry about the possibility
 of a lost signal?  Moving the flag-clearing would guard against that,
 which the current code does not.  But we've not seen field reports of such
 issues AFAIR, so this might not be an important consideration.

I think if there were lost signals there'd be much bigger problems, given
that the same (or, in master, similar) mechanics are used for a lot of other
things, including the heavyweight and lightweight lock wait queues.

  ...
  /*
   * Make sure waiter flag is reset - it might not be if
   * ProcWaitForSignal() returned for another reason than
   * UnpinBuffer() signalling us.
   */
  LockBufHdr(bufHdr);
  buf->flags &= ~BM_PIN_COUNT_WAITER;
  Assert(bufHdr->wait_backend_pid == MyProcPid);
  UnlockBufHdr(bufHdr);

  PinCountWaitBuf = NULL;
  /* Loop back and try again */
  }
 
  to the bottom of the loop would suffice.
 
 No, I disagree.  If we maintain the rule that the signaler clears
 BM_PIN_COUNT_WAITER, then once that happens there is nothing to stop a
 third party from trying to LockBufferForCleanup on the same buffer (except
 for table-level locking conventions, which IMO this mechanism shouldn't be
 dependent on).  So this coding would potentially clear the
 BM_PIN_COUNT_WAITER flag belonging to that third party, and then fail the
 Assert --- but only in debug builds, not in production, where it would
 just silently lock up the third-party waiter.  So I think having a test to
 verify that it's still our BM_PIN_COUNT_WAITER flag is essential.

Pushed with a test guarding against that. I still think it might be
slightly better to error out if somebody else waits, but I guess it's
unlikely that we'd mistakenly add code doing that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Query Rewrite with Postgres' materialized views

2015-02-23 Thread Eric Grinstein
Thank you for your answers.
I am very eager to contribute to Postgres, especially in the materialized
views area.
I have created a thread
http://www.postgresql.org/message-id/cak7uwezzxjjsxp9uoxwhpnjjutjajvpkja9skzalsnnrdjs...@mail.gmail.com
proposing
to work on it as my Google Summer of Code project.
The consensus was that implementing the query rewrite feature would be
unfeasible in such a short period of time, but Tomas Vondra suggested I
could work on a first part of it, such as adding a staleness flag for the
MVs. If you have any opinions on the thread, I'd be pleased to read them.

Thank you and regards,

Eric

2015-02-21 16:09 GMT-02:00 Robert Haas robertmh...@gmail.com:

 On Sat, Feb 21, 2015 at 1:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  I'm not really sure what Josh was talking about in that answer.  In
  terms of doing this automatically, I doubt that's likely to happen
  until we have a way to automatically update a materialized view when
  the underlying data changes --- and Kevin Grittner has done a bunch of
  work towards that, but more is needed to get us there.
 
  Even with auto maintenance, it seems unlikely that matviews would ever
  be so transparent that it would be okay for the planner to automatically
  substitute them into a query.  The data in a matview is going to be at
  least a little bit stale, else You're Doing It Wrong.

 Perhaps, but this is a feature we've gotten MANY customer requests
 for.  IDK why, but it's true.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Redesigning checkpoint_segments

2015-02-23 Thread Heikki Linnakangas

On 02/23/2015 01:01 PM, Andres Freund wrote:

On 2015-02-22 21:24:56 -0500, Robert Haas wrote:

On Sat, Feb 21, 2015 at 11:29 PM, Petr Jelinek p...@2ndquadrant.com wrote:

I am wondering a bit about interaction with wal_keep_segments.
One thing is that wal_keep_segments is still specified as a number of
segments and not in size units; maybe it would be worth changing that too?
And the other thing is that, if set, wal_keep_segments is the real
max_wal_size from the user's perspective (not from the perspective of the
algorithm in this patch, but the user does not really care about that),
which is somewhat weird given the naming.


It seems like wal_keep_segments is more closely related to
wal_*min*_size.  The idea of both settings is that each is a minimum
amount of WAL we want to keep around for some purpose.  But they're
not quite the same, I guess, because wal_min_size just forces us to
keep that many files around - they can be overwritten whenever.
wal_keep_segments is an amount of actual WAL data we want to keep
around.

Would it make sense to require that wal_keep_segments = wal_min_size?


I don't think so. Right now checkpoint_segments is a useful tool to
relatively effectively control the amount of WAL that needs to be
replayed in the event of a crash. wal_keep_segments in contrast doesn't
have much to do with the normal working of the system, except that it
delays recycling of WAL segments a bit.

With a condition like the above, how would you set things up so that you
have 50k segments around for replication (say a good day's worth), but will
never have to replay more than ~800 segments (i.e. something like
checkpoint_segments = 800)?


Right. While wal_keep_segments and wal_min_size both set a kind of a 
minimum on the amount of WAL that's kept in pg_xlog, they are different 
things, and a rule that one must be less than or greater than the other 
doesn't make sense.
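
[Editorial aside, not part of the original mail: the point is that the two
kinds of minimum are tuned independently and in different units, e.g. along
the lines of

    ALTER SYSTEM SET max_wal_size = '2GB';       -- bounds checkpoint-related WAL, size units
    ALTER SYSTEM SET wal_min_size = '512MB';     -- floor for recycled WAL, size units
    ALTER SYSTEM SET wal_keep_segments = 100;    -- extra WAL kept for standbys, a segment count

with the GUC names as used in this thread and the values purely
illustrative.]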


Everyone seems to be happy with the names and behaviour of the GUCs, so 
committed.


- Heikki





Re: [HACKERS] mogrify and indent features for jsonb

2015-02-23 Thread Dmitry Dolgov
Hi, Petr, thanks for the review.

 I think it would be better if the indent printing didn't put the start
of array ([) and start of dictionary ({) on a separate line
Did you mean this?

[
    {
        "a": 1,
        "b": 2
    }
]

I tried to verify this in several ways (http://jsonprettyprint.com/,
JSON.stringify, json.tool from python), and I always get this result
(a new line after "[" and "{").

 I don't really understand the point of h_atoi() function.
Initially, this function was meant to combine the conversion logic and some
specific verifications. But I agree, strtol is the more correct way; I
should improve this.

 The code looks ok as well except maybe the replacePath could use a couple
of comments
I have already added several comments for this function in the jsonbx
extension (and it looks like I should add even more in the near future), and
I think we can update this patch one more time with that improvement.

 About the Assert(ARR_ELEMTYPE(path) == TEXTOID);
I based my work on the hstore extension, which contains this kind of
assertion. But I suppose it's not required anymore, so I removed it from the
extension. And I think we can also remove it from the patch.

On 18 February 2015 at 08:32, Petr Jelinek p...@2ndquadrant.com wrote:

 Hi,

 I looked at the patch and have several comments.

 First let me say that modifying the individual paths of the json is the
 feature I miss the most in the current implementation so I am happy that
 this patch was submitted.

 I would prefer slightly the patch split into two parts, one for the indent
 printing and one with the manipulation functions, but it's not too big
 patch so it's not too bad as it is.

 There is one compiler warning that I see:
 jsonfuncs.c:3533:1: warning: no previous prototype for ‘jsonb_delete_path’
 [-Wmissing-prototypes]
  jsonb_delete_path(PG_FUNCTION_ARGS)

 I think it would be better if the indent printing didn't put the start of
 array ([) and start of dictionary ({) on a separate line, since most pretty
 printing implementations outside of Postgres (like JSON.stringify or
 python's json module) don't do that. This is purely about consistency with
 the rest of the world.

 The json_ident might be better named as json_pretty perhaps?

 I don't really understand the point of h_atoi() function. What's wrong
 with using strtol like pg_atoi does? Also there is no integer overflow
 check anywhere in that function.

 There is tons of end of line whitespace mess in jsonb_indent docs.

 Otherwise everything I tried so far works as expected. The code looks ok
 as well except maybe the replacePath could use couple of comments (for
 example the line which uses the h_atoi) to make it easier to follow.

 About the:

 +   /* XXX : why do we need this assertion? The functions is declared
 to take text[] */
 +   Assert(ARR_ELEMTYPE(path) == TEXTOID);


 I wonder about this also, some functions do that, some don't, I am not
 really sure what is the rule there myself.

 --
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services






Re: [HACKERS] Bug in pg_dump

2015-02-23 Thread Gilles Darold
On 17/02/2015 14:44, Michael Paquier wrote:
 On Tue, Jan 20, 2015 at 8:06 PM, Gilles Darold gilles.dar...@dalibo.com 
 wrote:
 On 19/01/2015 14:41, Robert Haas wrote:
 On Thu, Jan 15, 2015 at 6:26 AM, Gilles Darold gilles.dar...@dalibo.com 
 wrote:
 I attach a patch that solves the issue in pg_dump, let me know if it might
 be included in Commit Fest or if the three other solutions are a better
 choice.
 I think a fix in pg_dump is appropriate, so I'd encourage you to add
 this to the next CommitFest.

 Ok, thanks Robert. The patch have been added to next CommitFest under
 the Bug Fixes section.

 I've also made some review of the patch and more test. I've rewritten
 some comments and added a test when TableInfo is NULL to avoid potential
 pg_dump crash.

 New patch attached: pg_dump.c-extension-FK-v2.patch
 So, I looked at your patch and I have some comments...

 The approach taken by the patch looks correct to me as we cannot
 create FK constraints after loading the data in the case of an
 extension, something that can be done with a data-only dump.

 Now, I think that the routine hasExtensionMember may impact
 performance unnecessarily in the case of databases with many tables,
 say thousands or more. And we can easily avoid this performance
 problem by checking the content of each object dumped by doing the
 legwork directly in getTableData. Also, most of the NULL pointer
 checks are not needed as most of those code paths would crash if
 tbinfo is NULL, and actually only keeping the one in dumpConstraint is
 fine.

Yes, that's exactly what we discussed in Moscow; thanks for removing the
hasExtensionMember() routine. About the NULL pointers, I was not sure that
it could not crash in some other parts, so I decided to check everywhere.
If it's ok to just check in dumpConstraint(), that's fine.
 
 On top of those things, I have written a small extension able to
 reproduce the problem that I have included as a test module in
 src/test/modules. The dump/restore check is done using the TAP tests,
 and I have hacked prove_check to install as well the contents of the
 current folder to be able to use the TAP routines with an extension
 easily. IMO, as we have no coverage of pg_dump with extensions I think
 that it would be useful to ensure that this does not break again in
 the future.

Great! I did not have time to do that, thank you so much.

 All the hacking I have done during the review results in the set of
 patch attached:
 - 0001 is your patch, Gilles, with some comment fixes as well as the
 performance issue with hasExtensionMember fix
 - 0002 is the modification of prove_check that makes TAP tests usable
 with extensions
 - 0003 is the test module covering the tests needed for pg_dump, at
 least for the problem of this thread.

 Gilles, how does that look to you?

Looks great to me. I have tested with the postgis_topology extension and
everything works fine.

 (Btw, be sure to generate your patches directly with git next time. :) )

Sorry, some old reminiscence :-)

Thanks for the review.
 
Best regards,

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org





[HACKERS] json_populate_record issue - TupleDesc reference leak

2015-02-23 Thread Pavel Stehule
Hi

When I tested json_populate_record in the test at
http://stackoverflow.com/questions/7711432/how-to-set-value-of-composite-variable-field-using-dynamic-sql/28673097#28673097
I found a small issue:

create type pt as (a int, b int);

postgres=# select json_populate_record('(10,20)'::pt, '{}');
WARNING:  TupleDesc reference leak: TupleDesc 0x7f10fcf41400 (567018,-1)
still referenced
 json_populate_record
----------------------
 (10,20)
(1 row)

jsonb is ok

postgres=# select jsonb_populate_record('(10,20)'::pt, '{}');
 jsonb_populate_record
-----------------------
 (10,20)
(1 row)

Regards

Pavel


Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Andres Freund
On 2015-02-23 15:48:25 +, Thom Brown wrote:
 On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com wrote:
 
  On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
   I unfortunately don't remember enough of the thread to reference it
   here.
 
  Found the right keywords. The threads below
 
  http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
  and
 
  http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau
 
 
 Yes, this seems to be virtually the same issue reported.  The trace looks
 the same except for RecordTransactionCommit.

So, I proposed in
http://www.postgresql.org/message-id/20140707155113.gb1...@alap3.anarazel.de
that we make sequences assign a xid and only wait for syncrep when a xid
is assigned. The biggest blocker was that somebody would have to do some
code reviewing to find other locations that might need similar
treatment.

I did a, quick, grep for XLogInsert() and I think we're otherwise
fine. There's some debatable cases:

* XLOG_STANDBY_LOCK  doesn't force a xid to be assigned. I think it's
  harmless though, as we really only need to wait for that to be
  replicated if the transaction did something relevant (i.e. catalog
  changes). And those will force xid assignment.
* 2pc records don't assign a xid. But twophase.c does its own waiting,
  so that's fine.
* Plain vacuums will not trigger waits. But I think that's good. There's
  really no need to wait if all that's been done is some cleanup without
  visible consequences.
* Fujii brought up that we might want to wait for XLOG_SWITCH - I don't
  really see why.
* XLOG_RESTORE_POINT is a similar candidate - I don't see really valid
  arguments for making 2pc wait.


The attached, untested, patch changes things so that we
a) only wait for syncrep if we both wrote WAL and had a xid assigned
b) use an async commit if we just had a xid assigned, without having
   written WAL, even if synchronous_commit = off
c) acquire a xid when WAL logging sequence changes (arguable at least
   one of the xid assignments is redundant, but it doesn't cost
   anything, so ...)
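
[Editorial aside, not part of the original mail: the case behind (c) is a
transaction whose only WAL comes from a sequence advance, e.g.

    BEGIN;
    SELECT nextval('some_seq');  -- hypothetical sequence; writes WAL without assigning an xid today
    COMMIT;

which under (a) and (b) alone would no longer wait for sync rep, hence
forcing an xid assignment in the sequence code.]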

I think it makes sense to change a) and b) that way because there's no
need to wait for WAL flushes/syncrep waits when all that happened is
manipulations of temporary/unlogged tables or HOT pruning. It's slightly
weird that the on-disk flush and the syncrep wait essentially used two
different mechanisms for deciding when to flush.

Comments? This is obviously just a POC, but I think something like this
does make a great deal of sense.

Thom, does that help?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 16828b249b4d2ebfc04678f2570d6c439981e472 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 23 Feb 2015 17:38:40 +0100
Subject: [PATCH] Reconsider when to flush WAL and wait for syncrep while
 committing.

---
 src/backend/access/transam/xact.c | 30 ++
 src/backend/commands/sequence.c   | 23 +++
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 97000ef..2818a03 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1031,10 +1031,9 @@ RecordTransactionCommit(void)
 
 		/*
 		 * If we didn't create XLOG entries, we're done here; otherwise we
-		 * should flush those entries the same as a commit record.  (An
-		 * example of a possible record that wouldn't cause an XID to be
-		 * assigned is a sequence advance record due to nextval() --- we want
-		 * to flush that to disk before reporting commit.)
+		 * should trigger flushing those entries the same as a commit record
+		 * would.  This will primarily happen for HOT pruning and the like; we
+		 * want these to be flushed to disk in due time.
 		 */
 		if (!wrote_xlog)
 			goto cleanup;
@@ -1153,11 +1152,13 @@ RecordTransactionCommit(void)
 	/*
 	 * Check if we want to commit asynchronously.  We can allow the XLOG flush
 	 * to happen asynchronously if synchronous_commit=off, or if the current
-	 * transaction has not performed any WAL-logged operation.  The latter
-	 * case can arise if the current transaction wrote only to temporary
-	 * and/or unlogged tables.  In case of a crash, the loss of such a
-	 * transaction will be irrelevant since temp tables will be lost anyway,
-	 * and unlogged tables will be truncated.  (Given the foregoing, you might
+	 * transaction has not performed any WAL-logged operation or didn't assign
+	 * a xid.  The transaction can end up not writing any WAL, even if it has
+	 * a xid, if it only wrote to temporary and/or unlogged tables.  It can
+	 * end up having written WAL without and xid if it did HOT pruning.  In
+	 * case of a crash, the loss of such a transaction will be irrelevant;
+	 * temp tables will be lost anyway, 

Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-02-23 Thread David Steele
On 2/18/15 10:25 AM, David Steele wrote:
 On 2/18/15 6:11 AM, Fujii Masao wrote:
 The pg_audit doesn't log BIND parameter values when prepared statement is 
 used.
 Seems this is an oversight of the patch. Or is this intentional?
 
 It's actually intentional - following the model I talked about in my
 earlier emails, the idea is to log statements only.  This also follows
 on 2ndQuadrant's implementation.

Unfortunately, I think it's beyond the scope of this module to log bind
variables.  I'm following not only 2ndQuadrant's implementation, but
Oracle's as well.

 Logging values is interesting, but I'm sure the user would want to
 specify which columns to log, which I felt was beyond the scope of the
 patch.
 
 pg_audit cannot log a statement like SELECT 1 which doesn't access any
 database object. Is this intentional? I think that there are many users
 who want to audit even such statements.
 
 I think I see how to make this work.  I'll work on it for the next
 version of the patch.

This has been fixed in the v2 patch.

 Imagine the case where you call the user-defined function which executes
 many nested statements. In this case, pg_audit logs only top-level statement
 (i.e., issued directly by client) every time nested statement is executed.
 In fact, one call of such UDF can cause lots of *same* log messages. I think
 this is problematic.
 
 I agree - not sure how to go about addressing it, though.  I've tried to
 cut down on the verbosity of the logging in general, but of course it
 can still be a problem.
 
 Using security definer and a different logging GUC for the defining role
 might work.  I'll add that to my unit tests and see what happens.

That didn't work - but I didn't really expect it to.

Here are two options I thought of:

1) Follow Oracle's "as session" option and only log each statement type
against an object the first time it happens in a session.  This would
greatly reduce logging, but might be too little detail.  It would
increase the memory footprint of the module to add the tracking.

2) Only log once per call to the backend.  Essentially, we would only
log the statement you see in pg_stat_activity.  This could be a good
option because it logs what the user accesses directly, rather than
functions, views, etc. which hopefully are already going through a
review process and can be audited that way.

Would either of those address your concerns?

-- 
- David Steele
da...@pgmasters.net





Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
Hi,

On 22.2.2015 22:30, Peter Geoghegan wrote:
 On Sun, Feb 22, 2015 at 1:19 PM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 In short, this fixes all the cases except for the ASC sorted data. I
 haven't done any code review, but I think we want this.

 I'll use data from the i5-2500k, but it applies to the Xeon too, except
 that the Xeon results are more noisy and the speedups are not that
 significant.

 For the 'text' data type, and 'random' dataset, the results are these:

    scale    datum    cost-model
  -------------------------------
       10     328%         323%
      100     392%         391%
      200      96%         565%
      300      97%         572%
      400      97%         571%
      500      98%         570%

 The numbers are speedup vs. master, so 100% means exactly the same
 speed, 200% means twice as fast.

 So while with 'datum' patch this actually caused very nice speedup for
 small datasets - about 3-4x speedup up to 1M rows, for larger datasets
 we've seen small regression (~3% slower). With the cost model fix, we
 actually see a significant speedup (about 5.7x) for these cases.
 
 Cool.
 
 I haven't verified whether this produces the same results, but if it
 does this is very nice.

 For 'DESC' dataset (i.e. data sorted in reverse order), we do get even
 better numbers, with up to 6.5x speedup on large datasets.

 But for 'ASC' dataset (i.e. already sorted data), we do get this:

    scale    datum    cost-model
  -------------------------------
       10      85%          84%
      100      87%          87%
      200      76%          96%
      300      82%          90%
      400      91%          83%
      500      93%          81%

 Ummm, not that great, I guess :-(
 
 You should try it with the data fully sorted like this, but with one 
 tiny difference: The very last tuple is out of order. How does that 
 look?

So here are the results for ASC-ordered dataset, with one 'unsorted' row
added to the end of the dataset. As before the complete scripts are
attached, and the raw results are available in a spreadsheet:

http://bit.ly/18g1nTU
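
[Editorial aside, not part of the original mail: the dataset variant
described here is simply an already-sorted table with a single out-of-place
row appended, e.g. something like

    CREATE TABLE stuff_int_asc AS
        SELECT i AS randint FROM generate_series(1, 1000000) i;
    INSERT INTO stuff_int_asc VALUES (0);  -- the one out-of-order row at the end

with the table name and row count purely illustrative and the column name
taken from the queries below.]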

The durations are much higher than without the single unsorted row added
at the end. Queries often take 20x longer to finish (on the same code),
depending on the scale.

The speedup results (compared to master) look like this:

 scale   query#   datum   numeric   cost model
    10        1    859%      861%         856%
    10        2    811%      814%         805%
    10        3    100%      100%          97%
   100        1    805%      804%         807%
   100        2    769%      773%         770%
   100        3    100%      100%          98%
   200        1     97%       97%         673%
   200        2     96%       97%         646%
   200        3     99%      101%         678%
   300        1     98%       98%         578%
   300        2     96%       97%         557%
   300        3     99%      101%         579%
   400        1     99%       99%         513%
   400        2     97%       98%         497%
   400        3     99%      101%         510%
   500        1     99%       99%         469%
   500        2     97%       98%         456%
   500        3     99%      101%         466%

What's interesting here is that some queries are much faster, but query
#3 is slow until we hit 2M rows:

select * from (select * from stuff_int_desc order by randint
   offset 1000) foo

Looking at the previous tests, I see this is exactly what's happening to
this query with 'random' dataset - it's slightly slower than master up
until 2M rows, when it suddenly jumps to the same speedup as the other
queries. Can we do something about that?

Anyway, I'm wondering what conclusion we can draw from this. I believe the
vast majority of datasets in production won't be perfectly sorted, because
when the table is CLUSTERed by an index we tend to use an index scan to do
the sort (so no problem), or the data are not actually perfectly sorted (and
here we get a significant speedup).


-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


scripts.tgz
Description: application/tgz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for Numeric

2015-02-23 Thread Tomas Vondra
Hi,

On 23.2.2015 11:59, Andrew Gierth wrote:
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:
 
  Tomas Interesting, but I think Gavin was asking about how much
  Tomas variation was there for each tested case (e.g. query executed on
  Tomas the same code / dataset). And in those cases the padding /
  Tomas alignment won't change, and thus the effects you describe won't
  Tomas influence the results, no?
 
 My point is exactly the fact that since the result is not affected,
 this variation between runs of the same code is of no real relevance
 to the question of whether a given change in performance can properly
 be attributed to a patch.
 
 Put it this way: suppose I have a test that when run repeatedly with no
 code changes takes 6.10s (s=0.025s), and I apply a patch that changes
 that to 6.26s (s=0.025s). Did the patch have an impact on performance?
 
 Now suppose that instead of applying the patch I insert random amounts
 of padding in an unused function and find that my same test now takes a
 mean of 6.20s (s=0.058s) when I take the best timing for each padding
 size and calculate stats across sizes. Now it looks obvious that the
 actual code of the patch probably wasn't responsible for any change...
 
 The numbers used here aren't theoretical; they are obtained by testing a
 single query - select * from d_flt order by v offset 1000 where
 d_flt contains 5 million float8 values - over 990 times with 33
 different random padding sizes (uniform in 0-32767). Here's a scatter
 plot, with 3 runs of each padding size so you can see the repeatability:
 http://tinyurl.com/op9qg8a


I think we're talking about slightly different things, then.

I believe Gavin was asking about variability for executions with a
particular code (i.e. with fixed amount of padding), to decide whether
it even makes sense to compare results for different patches or whether
the differences are just random noise.

Interpreting those differences - whether they are due to changes in the
algorithm or a result of some padding somewhere else in the code, that's
of course important too.

I believe the small regressions (1-10%) for small data sets might be
caused by this 'random padding' effect, because that's probably where
L1/L2 cache is most important. For large datasets the caches are
probably not as efficient anyway, so the random padding makes no
difference, and the speedup is just as good as for the other queries.
See for example this:

  http://www.postgresql.org/message-id/54eb580c.2000...@2ndquadrant.com

But I'm speculating here ... time for a profiler, I guess.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Primary not sending to synchronous standby

2015-02-23 Thread Thom Brown
On 23 February 2015 at 16:53, Andres Freund and...@2ndquadrant.com wrote:

 On 2015-02-23 15:48:25 +, Thom Brown wrote:
  On 23 February 2015 at 15:42, Andres Freund and...@2ndquadrant.com
 wrote:
 
   On 2015-02-23 16:38:44 +0100, Andres Freund wrote:
I unfortunately don't remember enough of the thread to reference it
here.
  
   Found the right keywords. The threads below
  
  
 http://archives.postgresql.org/message-id/369698E947874884A77849D8FE3680C2%40maumau
   and
  
  
 http://www.postgresql.org/message-id/5CF4ABBA67674088B3941894E22A0D25@maumau
  
 
  Yes, this seems to be virtually the same issue reported.  The trace looks
  the same except for RecordTransactionCommit.

 So, I proposed in

 http://www.postgresql.org/message-id/20140707155113.gb1...@alap3.anarazel.de
 that we make sequences assign a xid and only wait for syncrep when a xid
 is assigned. The biggest blocker was that somebody would have to do some
 code reviewing to find other locations that might need similar
 treatment.

 I did a, quick, grep for XLogInsert() and I think we're otherwise
 fine. There's some debatable cases:

 * XLOG_STANDBY_LOCK  doesn't force a xid to be assigned. I think it's
   harmless though, as we really only need to wait for that to be
   replicated if the transaction did something relevant (i.e. catalog
   changes). And those will force xid assignment.
 * 2pc records don't assign a xid. But twophase.c does its own waiting,
   so that's fine.
 * Plain vacuums will not trigger waits. But I think that's good. There's
   really no need to wait if all that's been done is some cleanup without
   visible consequences.
 * Fujii brought up that we might want to wait for XLOG_SWITCH - I don't
   really see why.
 * XLOG_RESTORE_POINT is a similar candidate - I don't see really valid
   arguments for making 2pc wait.


 The attached, untested, patch changes things so that we
 a) only wait for syncrep if we both wrote WAL and had a xid assigned
 b) use an async commit if we just had a xid assigned, without having
written WAL, even if synchronous_commit = off
 c) acquire a xid when WAL logging sequence changes (arguable at least
one of the xid assignments is redundant, but it doesn't cost
anything, so ...)

 I think it makes sense to change a) and b) that way because there's no
 need to wait for WAL flushes/syncrep waits when all that happened is
 manipulations of temporary/unlogged tables or HOT pruning. It's slightly
 weird that the on-disk flush and the syncrep wait essentially used two
 different mechanisms for deciding when to flush.
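
To make the distinction concrete, two illustrative statements (a sketch;
object names invented): the first writes WAL but traditionally assigns no
xid, the second assigns an xid but writes no WAL because the table is
temporary.

    CREATE SEQUENCE myseq;
    SELECT nextval('myseq');           -- WAL-logged, historically no xid assigned
    CREATE TEMP TABLE scratch (i int);
    INSERT INTO scratch VALUES (1);    -- xid assigned, but no WAL for the data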

 Comments? This is obviously just a POC, but I think something like this
 does make a great deal of sense.

 Thom, does that help?


Yeah, this appears to eliminate the problem, at least in the case I
reported.

Thanks

-- 
Thom


Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Andres Freund
On 2015-02-23 15:15:31 +0100, Florian Weimer wrote:
 On 02/22/2015 02:05 PM, Andres Freund wrote:
  On 2015-02-22 01:27:54 +0100, Emil Lenngren wrote:
  I honestly wonder why postgres uses renegotiation at all. The motivation
  that cryptoanalysis is easier as more data is sent seems quite
  far-fetched.
  
  I don't think so. There's a fair number of algorithms that can/could be
  attacked much more easily with lots of data available. Especially if you
  can guess/know/control some of the data.  Additionally renegotiating
  regularly helps to constrain a possible key leakage to a certain amount
  of time. With backend connections often being alive for weeks at a time
  that's not a bad thing.
 
 Renegotiation will be removed from future TLS versions because it is
 considered unnecessary with modern ciphers:
 
   https://github.com/tlswg/tls13-spec/issues/38
 
 If ciphers require rekeying, that mechanism will be provided at the TLS
 layer in the future.
 
 I think you could remove renegotiation from PostgreSQL as long as you
 offer something better than RC4 in the TLS handshake.

As far as I understand it, this argument misses an important
point. Without protocol level rekeying, handled by the library in the
background, never changing the session key pretty much breaks PFS. In
HTTP the sessions are almost always short enough that such a
consideration doesn't play a role, but it's far from uncommon to have
database connections that live weeks and transport hundreds of gigabytes
in their lifetime.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSL renegotiation

2015-02-23 Thread Henry B Hotz
Renegotiation should be a best practice. Trouble is it's been broken (at the 
protocol level) three times in the last few years so it's a massive hole in 
practice. 

Ideally we should leave the renegotiate in, and only remove it if configure 
detects a broken version of TLS.

Personal email. hbh...@oxy.edu

 On Feb 23, 2015, at 7:01 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 
 I'd say it is best to wait if and how OpenSSL change their API when they
 implement TLS 1.3.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-02-23 Thread David Steele
Hi Stephen,

Thanks for your review.  All fixed except for comments below:

On 2/17/15 10:34 AM, Stephen Frost wrote:
 +/*
 + * Check privileges granted indirectly via role memberships. We do this 
 in
 + * a separate pass to minimize expensive indirect membership tests.  In
 + * particular, it's worth testing whether a given ACL entry grants any
 + * privileges still of interest before we perform the has_privs_of_role
 + * test.
 + */
 
 I'm a bit on the fence about this..  Can you provide a use-case where
 doing this makes sense..?  Does this mean I could grant admin_role1 to
 audit and then get auditing on everything that user1 has access to?
 That seems like it might be useful for environments where such roles
 already exist though it might end up covering more than is intended..

The idea is that if there are already ready-made roles to be audited
then they don't need to be reconstituted for the audit role.  You could
just do:

grant admin_role to audit;
grant user_role to audit;

Of course, we could list multiple roles in the pg_audit.role GUC, but I
thought this would be easier to use and maintain since there was some
worry about GUCs being fragile when they refer to database objects.

 +
 +# test.pl - pgAudit Unit Tests
 +
 
 This is definitely nice..
 
 diff --git a/doc/src/sgml/pgaudit.sgml b/doc/src/sgml/pgaudit.sgml
 new file mode 100644
 index 000..f3f4ab9
 --- /dev/null
 +++ b/doc/src/sgml/pgaudit.sgml
 @@ -0,0 +1,316 @@
 +<!-- doc/src/sgml/pgaudit.sgml -->
 +
 +<sect1 id="pgaudit" xreflabel="pgaudit">
 +  <title>pg_audit</title>
 
 There seems to be a number of places which are 'pgaudit' and a bunch
 that are 'pg_audit'.  I'm guessing you were thinking 'pg_audit', but
 it'd be good to clean up and make them all consistent.

Fixed, though I still left the file name as pgaudit.sgml since all but
one module follow that convention.

 +<para>
 +  Session auditing allows the logging of all commands that are executed 
 by
 +  a user in the backend.  Each command is logged with a single entry and
 +  includes the audit type (e.g. <literal>SESSION</literal>), command 
 type
 +  (e.g. <literal>CREATE TABLE</literal>, <literal>SELECT</literal>) and
 +  statement (e.g. <literal>select * from test</literal>).
 +
 +  Fully-qualified names and object types will be logged for
 +  <literal>CREATE</literal>, <literal>UPDATE</literal>, and
 +  <literal>DROP</literal> commands on <literal>TABLE</literal>,
 +  <literal>MATVIEW</literal>, <literal>VIEW</literal>,
 +  <literal>INDEX</literal>, <literal>FOREIGN TABLE</literal>,
 +  <literal>COMPOSITE TYPE</literal>, <literal>INDEX</literal>, and
 +  <literal>SEQUENCE</literal> objects as well as function calls.
 +</para>
 
 Ah, you do have a list of what objects you get fully qualified names
 for, nice.  Are there obvious omissions from that list..?  If so, we
 might be able to change what happens with objectAccess to include them..

It seems like these are the objects where having a name really matters.
 I'm more interested in using the deparse code to handle fully-qualified
names for additional objects rather than adding hooks.

 +<sect3>
 +  <title>Examples</title>
 +
 +  <para>
 +Set <literal>pgaudit.log = 'read, ddl'</literal> in
 +<literal>postgresql.conf</literal>.
 +  </para>
 
 Perhaps I missed it, but it'd be good to point out that GUCs can be set
 at various levels.  I know we probably say that somewhere else, but it's
 particularly relevant for this.

Yes, it's very relevant for this patch.  Do you think it's enough to
call out the functionality, or should I provide examples?  Maybe a
separate section just for this concept?
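
For what it's worth, a sketch of the kind of examples such a section could
show (the database and role names here are invented); like any other setting,
the GUC can be attached at the database, role, or session level:

    ALTER DATABASE mydb SET pg_audit.log = 'read, ddl';
    ALTER ROLE auditor IN DATABASE mydb SET pg_audit.log = 'all';
    SET pg_audit.log = 'ddl';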

Patch v2 is attached for all changes except the doc change above.

-- 
- David Steele
da...@pgmasters.net
diff --git a/contrib/Makefile b/contrib/Makefile
index 195d447..d8e75f4 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -29,6 +29,7 @@ SUBDIRS = \
pageinspect \
passwordcheck   \
pg_archivecleanup \
+   pg_audit\
pg_buffercache  \
pg_freespacemap \
pg_prewarm  \
diff --git a/contrib/pg_audit/Makefile b/contrib/pg_audit/Makefile
new file mode 100644
index 000..32bc6d9
--- /dev/null
+++ b/contrib/pg_audit/Makefile
@@ -0,0 +1,20 @@
+# pg_audit/Makefile
+
+MODULE = pg_audit
+MODULE_big = pg_audit
+OBJS = pg_audit.o
+
+EXTENSION = pg_audit
+
+DATA = pg_audit--1.0.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_audit
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_audit/pg_audit--1.0.0.sql 
b/contrib/pg_audit/pg_audit--1.0.0.sql
new file 

Re: [HACKERS] Allow snapshot too old error, to prevent bloat

2015-02-23 Thread Amit Kapila
On Sun, Feb 22, 2015 at 10:18 PM, Kevin Grittner kgri...@ymail.com wrote:

 Amit Kapila amit.kapil...@gmail.com wrote:

  It seems to me that SQL Server also uses similar mechanism to
  avoid the bloat in version store (place to store previous
  versions or record).

  I think if other leading databases provide a way to control the
  bloat, it indicates that most of the customers having
  write-intensive workload would like to see such an option.

 Yeah, I think many users are surprised by the damage that can be
 done to a database by an idle connection or long-running read-only
 query, especially when coming from other databases which have
 protections against that.

 The bad news is that changing to specifying the limit in time
 rather than transaction IDs is far more complicated than I thought.
 Basically, we need to associate times in the past with the value of
 latestCompletedXid (or derived values like snapshot xmin) at those
 times.  I was initially thinking that by adding a timestamp to the
 snapshot we could derive this from active snapshots.  For any one
 connection, it's not that hard for it to scan the pairingheap of
 snapshots and pick out the right values; but the problem is that it
 is process-local memory and this needs to work while the connection
 is sitting idle for long periods of time.

Could you please explain in slightly more detail why it can't work
if we use a timestamp instead of snapshot->xmin in your patch in
the function TestForOldSnapshot()?

 I've got a couple ideas
 on how to approach it, but neither seems entirely satisfying:

 (1) Use a new timer ID to interrupt the process whenever its
 oldest snapshot which hasn't yet crossed the old snapshot
 threshold does so.  The problem is that this seems as though it
 would need to do an awful lot more work (scanning the pairing heap
 for the best match) than anything else is currently doing in a
 timeout handler.  One alternative would be to keep a second pairing
 heap to track snapshots which have not crossed the threshold,
 removing items as they get old -- that would make the work needed
 in the timeout handler closer to other handlers.

 (2) Use a coarse enough granularity on time and a short enough
 maximum for the GUC to just keep a circular buffer of the mappings
 in memory.  We might be able to make this dense enough that one
 minute resolution for up to 60 days could fit in 338kB.  Obviously
 that could be reduced with coarser granularity or a shorter
 maximum.
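
(As a quick sanity check on that figure, assuming one 4-byte xid per
one-minute slot:

    SELECT 60 * 24 * 60              AS one_minute_slots,  -- 86400 slots in 60 days
           60 * 24 * 60 * 4 / 1024.0 AS size_in_kb;        -- ~337.5 kB

which indeed lands at roughly 338kB.)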


How exactly will this allow returning the "snapshot too old" error? Do we
need a check for the page (its LSN) as in your current patch?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] POLA violation with \c service=

2015-02-23 Thread Alvaro Herrera

David Fetter wrote:

 My thinking behind this was that the patch is a bug fix and intended
 to be back-patched, so I wanted to mess with as little infrastructure
 as possible.  A new version of libpq seems like a very big ask for
 such a case.  You'll recall that the original problem was that
 
 \c service=foo
 
 only worked accidentally for some pretty narrow use cases and broke
 without much of a clue for the rest.  It turned out that the general
 problem was that options given to psql on the command line were not
 even remotely equivalent to \c, even though they were documented to
 be.

So, in view of these arguments and those put forward by Pavel
downthread, I think the attached is an acceptable patch for the master
branch.  It doesn't apply to back branches though; 9.4 and 9.3 have a
conflict in tab-complete.c, 9.2 has additional conflicts in command.c,
and 9.1 and 9.0 are problematic all over because they don't have
src/common.  Could you please submit patches adapted for each group of
branches?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Peter Geoghegan
On Mon, Feb 23, 2015 at 11:17 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 So while it's true that for the 3rd query we get much worse results
 compared to the other queries (i.e. we don't get 400% speedup but ~3%
 slowdown compared to master), it's true that master performs
 exceptionally well for this query with small datasets. Once we get to 2M
 rows, the master performance drops significantly but cost-model keeps
 the performance characteristics and the speedup jumps back to ~700%
 which is nice.

 These numbers are for the 'ASC + unsorted row' test, but I do get
 exactly the same pattern for the 'random' tests done previously.

Yeah. Looks like you're comparing a case where the old cost model did
the right thing anyway (i.e. used abbreviation). The difference would
then be entirely explainable as noise. Right?

 It would be nice if we could address the 3% regression for the last
 query, but I guess it's not a big deal. The numbers in general are
 absolutely impressive. Kudos.

Thanks.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POLA violation with \c service=

2015-02-23 Thread Alvaro Herrera
Here's the real attachment.  Previous one was a misguided shell
redirection.  Meh.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 830d41b9d23716af29491cbab59808c35e25ec12 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Fri, 20 Feb 2015 14:36:41 -0300
Subject: [PATCH] Fix psql's \c to work with URIs and conninfo strings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This makes it more consistent with what can be given in the command
line.

Authors: Andrew Dunstan, David Fetter, Andrew Gierth, Pavel Stěhule
---
 doc/src/sgml/ref/psql-ref.sgml| 30 ++---
 src/bin/psql/command.c| 69 +++
 src/bin/psql/tab-complete.c   | 20 ++--
 src/common/Makefile   |  2 +-
 src/common/connstrings.c  | 52 +
 src/include/common/connstrings.h  | 16 +
 src/interfaces/libpq/fe-connect.c | 36 +---
 7 files changed, 162 insertions(+), 63 deletions(-)
 create mode 100644 src/common/connstrings.c
 create mode 100644 src/include/common/connstrings.h

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index a637001..a218af8 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -796,18 +796,23 @@ testdb=gt;
   /varlistentry
 
   varlistentry
-termliteral\c/literal or literal\connect/literal literal[ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ]/literal/term
+termliteral\c   [ { [ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ] | replaceable class=parameterconninfo/replaceable string } ] /literal/term
+termliteral\connect [ { [ replaceable class=parameterdbname/replaceable [ replaceable class=parameterusername/replaceable ] [ replaceable class=parameterhost/replaceable ] [ replaceable class=parameterport/replaceable ] ] | replaceable class=parameterconninfo/replaceable string } ] /literal/term
 listitem
 para
 Establishes a new connection to a productnamePostgreSQL/
-server. If the new connection is successfully made, the
-previous connection is closed. If any of replaceable
+server using positional parameters as described below, a
+parameterconninfo/parameter string, or a acronymURI/acronym.
+If the new connection is successfully made, the
+previous connection is closed.
+When using positional parameters, if any of replaceable
 class=parameterdbname/replaceable, replaceable
 class=parameterusername/replaceable, replaceable
 class=parameterhost/replaceable or replaceable
 class=parameterport/replaceable are omitted or specified
 as literal-/literal, the value of that parameter from the
-previous connection is used. If there is no previous
+previous connection is used.
+If using positional parameters and there is no previous
 connection, the applicationlibpq/application default for
 the parameter's value is used.
 /para
@@ -822,6 +827,23 @@ testdb=gt;
 mechanism that scripts are not accidentally acting on the
 wrong database on the other hand.
 /para
+
+para
+Positional syntax:
+programlisting
+=gt; \c mydb myuser host.dom 6432
+/programlisting
+/para
+
+para
+The conninfo form detailed in xref linkend=LIBPQ-CONNSTRING
+takes two forms: keyword-value pairs, and URIs:
+/para
+programlisting
+=gt; \c service=foo
+=gt; \c host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable
+=gt; \c postgresql://tom@localhost/mydb?application_name=myapp
+/programlisting
 /listitem
   /varlistentry
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 7c9f28d..7d8d5bb 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1607,6 +1607,8 @@ do_connect(char *dbname, char *user, char *host, char *port)
 	PGconn	   *o_conn = pset.db,
 			   *n_conn;
 	char	   *password = NULL;
+	bool		keep_password = true;
+	bool		has_connection_string = false;
 
 	if (!o_conn && (!dbname || !user || !host || !port))
 	{
@@ -1620,15 +1622,31 @@ do_connect(char *dbname, char *user, char *host, char *port)
 		return false;
 	}
 
-	if (!dbname)
-		dbname = PQdb(o_conn);
 	if (!user)
 		user = PQuser(o_conn);
+
 	if (!host)
 		host = PQhost(o_conn);
+
 	if (!port)
 		port = PQport(o_conn);
 
+	if (dbname)
+		has_connection_string = recognized_connection_string(dbname);
+
+	keep_password =
+		

Re: [HACKERS] [RFC] LSN Map

2015-02-23 Thread Heikki Linnakangas

On 01/13/2015 01:22 PM, Marco Nenciarini wrote:

Il 08/01/15 20:18, Jim Nasby ha scritto:

On 1/7/15, 3:50 AM, Marco Nenciarini wrote:

The current implementation tracks only heap LSN. It currently does not
track any kind of indexes, but this can be easily added later.


Would it make sense to do this at a buffer level, instead of at the heap
level? That means it would handle both heap and indexes.
  I don't know if LSN is visible that far down though.


Where exactly you are thinking to handle it?


Dunno, but Jim's got a point. This is a maintenance burden to all 
indexams, if they all have to remember to update the LSN map separately. 
It needs to be done in some common code, like in PageSetLSN or 
XLogInsert or something.


Aside from that, isn't this horrible from a performance point of view? 
The patch doubles the buffer manager traffic, because any update to any 
page will also need to modify the LSN map. This code is copied from the 
visibility map code, but we got away with it there because the VM only 
needs to be updated the first time a page is modified. Subsequent 
updates will know the visibility bit is already cleared, and don't need 
to access the visibility map.


And scalability: Whether you store one value for every N pages, or the 
LSN of every page, this is going to have a huge effect of focusing 
contention to the LSN pages. Currently, if ten backends operate on ten 
different heap pages, for example, they can run in parallel. There will 
be some contention on the WAL insertions (much less in 9.4 than before). 
But with this patch, they will all fight for the exclusive lock on the 
single LSN map page.


You'll need to find a way to not update the LSN map on every update. For 
example, only update the LSN page on the first update after a checkpoint 
(although that would still have a big contention focusing effect right 
after a checkpoint).


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json_populate_record issue - TupleDesc reference leak

2015-02-23 Thread Pavel Stehule
by the way - this feature is undocumented - I thought the value was only
used as a type holder and its contents were not used.

It should be documented better - if I understand correctly, it is the
foundation for implementing the #= hstore operator

some nice example

postgres=# select json_populate_record('(10,20)'::pt, '{"a":30}');
 json_populate_record
--
 (30,20)
(1 row)

The mentioned bug is a corner case - the bugfix is simple:

diff --git a/src/backend/utils/adt/jsonfuncs.c
b/src/backend/utils/adt/jsonfuncs.c
new file mode 100644
index a8cdeaa..6e83f78
*** a/src/backend/utils/adt/jsonfuncs.c
--- b/src/backend/utils/adt/jsonfuncs.c
*** populate_record_worker(FunctionCallInfo
*** 2114,2119 
--- 2114,2122 
 */
if (hash_get_num_entries(json_hash) == 0 && rec)
{
+   if (have_record_arg)
+   ReleaseTupleDesc(tupdesc);
+
hash_destroy(json_hash);
PG_RETURN_POINTER(rec);
}




2015-02-23 18:27 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 When I tested json_populate_function in test
 http://stackoverflow.com/questions/7711432/how-to-set-value-of-composite-variable-field-using-dynamic-sql/28673097#28673097
 I found a small issue

 create type pt as (a int, b int);

 postgres=# select json_populate_record('(10,20)'::pt, '{}');
 WARNING:  TupleDesc reference leak: TupleDesc 0x7f10fcf41400 (567018,-1)
 still referenced
  json_populate_record
 --
  (10,20)
 (1 row)

 jsonb is ok

 postgres=# select jsonb_populate_record('(10,20)'::pt, '{}');
  jsonb_populate_record
 ---
  (10,20)
 (1 row)

 Regards

 Pavel




Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Peter Geoghegan
On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 The durations are much higher than without the single unsorted row added
 at the end. Queries often take 20x longer to finish (on the same code),
 depending on the scale.

I knew this would happen.  :-)

 What's interesting here is that some queries are much faster, but query
 #3 is slow until we hit 2M rows:

 select * from (select * from stuff_int_desc order by randint
offset 1000) foo

Not sure why that would be. Can you see what a #define
DEBUG_ABBREV_KEYS (within varlena.c) build shows happens with the
cost model? Enable debug1 output for that.

 Looking at the previous tests, I see this is exactly what's happening to
 this query with 'random' dataset - it's slightly slower than master up
 until 2M rows, when it suddenly jumps to the same speedup as the other
 queries. Can we do something about that?

Possibly.

 Anyway, I'm wondering what conclusion we can do from this? I believe
 vast majority of datasets in production won't be perfectly sorted,
 because when the table is CLUSTERed by index we tend to use index scan
 to do the sort (so no problem), or the data are not actually perfectly
 sorted (and here we get significant speedup).

One conclusion might be that commit a3f0b3d6 should not have added the
pre-check for perfectly sorted input, which is not in the Bentley &
Mcilroy original - exploiting the fact that the set is already
partially sorted is a fine thing, but we're not doing the necessary
bookkeeping. But that's not really why I suggested that test case.
Rather, I wanted to put the possible regression in the event of
perfectly sorted input in perspective. Maybe that regression isn't
worth worrying about, if in the event of a single tuple being out of
order at the end of the set the outcome dramatically shifts to
strongly favor abbreviation. Another way of looking at it would be
that the pre-sorted case is an argument against strxfrm() sorting in
general more than abbreviated keys in particular, an argument that
seems new and strange to me. The analysis of algorithms focuses on the
worst case and average case for a reason, and (say) the glibc
documentation never considers this when discussing strxfrm().

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mogrify and indent features for jsonb

2015-02-23 Thread Thom Brown
On 15 February 2015 at 03:06, Andrew Dunstan and...@dunslane.net wrote:


 Attached is a patch to provide a number of very useful facilities to jsonb
 that people have asked for. These are based on work by Dmitry Dolgov in his
 jsonbx extension, but I take responsibility for any bugs.

 The facilities are:

 new operations:

 concatenation:    jsonb || jsonb -> jsonb
 deletion:         jsonb - text -> jsonb
 deletion:         jsonb - int -> text

 new functions:

 produce indented text: jsonb_indent(jsonb) -> text
 change an element at a path:  jsonb_replace(jsonb, text[], jsonb) -> jsonb.


 It would be relatively trivial to add:

 delete an element at a path: jsonb_delete(jsonb, text[]) -> json


Would this support deleting "type" and the value 'dd' from the following?:

{"a": 1, "b": 2, "c": {"type": "json", "stuff": "test"}, "d":
["aa","bb","cc","dd"]}
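
Purely as a sketch of what that could look like, if the path-based
jsonb_delete(jsonb, text[]) mentioned above materialises (hypothetical
semantics: keys addressed by name, array elements by zero-based index):

    -- hypothetical path-based form, not part of the posted patch
    SELECT jsonb_delete(j, '{c,type}'),   -- drop the nested "type" key
           jsonb_delete(j, '{d,3}')       -- drop "dd" by its array position
    FROM (SELECT '{"a": 1, "b": 2, "c": {"type": "json", "stuff": "test"},
                   "d": ["aa","bb","cc","dd"]}'::jsonb AS j) s;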



 and I think we should do that for the sake of completeness.

 The docs might need a little extra work, and the indent code definitely
 needs work, which I hope to complete in the next day or two, but I wanted
 to put a stake in the ground.


This is high on my wanted list, so thanks for working on this.

Seems to work well for me with a few tests.

Is there a way to take the json:

'{"a": 1, "b": 2, "c": {"type": "json", "stuff": "test"}, "d":
["aa","bb","cc","dd"]}'

and add "ee" to "d" without replacing it?  I can think of ways of doing it
currently, but it's very convoluted just for pushing a value to an array.
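
One possible spelling with the proposed pieces, assuming || concatenates
arrays the way the jsonbx extension does (a sketch, not tested against the
patch):

    -- assumes the patch's jsonb_replace() and || operator
    SELECT jsonb_replace(j, '{d}', (j -> 'd') || '["ee"]'::jsonb)
    FROM (SELECT '{"a": 1, "d": ["aa","bb","cc","dd"]}'::jsonb AS j) s;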

Also, are there any plans to support the following?:

jsonb - text[] # Provide list of keys to delete in array
jsonb - jsonb # Deduplicate key:value pairs
jsonb & jsonb # Return overlapping jsonb (opposite of jsonb - jsonb)

Thanks

Thom


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
On 23.2.2015 19:22, Peter Geoghegan wrote:
 On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 The durations are much higher than without the single unsorted row added
 at the end. Queries often take 20x longer to finish (on the same code),
 depending on the scale.
 
 I knew this would happen.  :-)
 
 What's interesting here is that some queries are much faster, but query
 #3 is slow until we hit 2M rows:

 select * from (select * from stuff_int_desc order by randint
offset 1000) foo
 
 Not sure why that would be. Can you see what a #define 
 DEBUG_ABBREV_KEYS (within varlena.c) build shows happens with the 
 cost model? Enable debug1 output for that.

test=# select * from (select * from stuff_text_asc order by randtxt
offset 1000) foo;
DEBUG:  abbrev_distinct after 160: 152.862464 (key_distinct: 155.187096,
norm_abbrev_card: 0.955390, prop_card: 0.20)
DEBUG:  abbrev_distinct after 320: 314.784336 (key_distinct: 313.425344,
norm_abbrev_card: 0.983701, prop_card: 0.20)
DEBUG:  abbrev_distinct after 640: 629.050490 (key_distinct: 643.945298,
norm_abbrev_card: 0.982891, prop_card: 0.20)
DEBUG:  abbrev_distinct after 1280: 1194.271440 (key_distinct:
1257.153875, norm_abbrev_card: 0.933025, prop_card: 0.20)
DEBUG:  abbrev_distinct after 2560: 2402.820431 (key_distinct:
2631.772790, norm_abbrev_card: 0.938602, prop_card: 0.20)
DEBUG:  abbrev_distinct after 5120: 4490.142750 (key_distinct:
5261.865309, norm_abbrev_card: 0.876981, prop_card: 0.20)
DEBUG:  abbrev_distinct after 10240: 8000.884171 (key_distinct:
10123.941172, norm_abbrev_card: 0.781336, prop_card: 0.20)
DEBUG:  abbrev_distinct after 20480: 13596.546899 (key_distinct:
20563.364696, norm_abbrev_card: 0.663894, prop_card: 0.13)
DEBUG:  abbrev_distinct after 40960: 23369.899021 (key_distinct:
40500.600680, norm_abbrev_card: 0.570554, prop_card: 0.084500)
DEBUG:  abbrev_distinct after 81920: 30884.544030 (key_distinct:
81466.848436, norm_abbrev_card: 0.377009, prop_card: 0.054925)
 randtxt
-
(0 rows)

 Looking at the previous tests, I see this is exactly what's
 happening to this query with 'random' dataset - it's slightly
 slower than master up until 2M rows, when it suddenly jumps to the
 same speedup as the other queries. Can we do something about that?
 
 Possibly.
 
 Anyway, I'm wondering what conclusion we can do from this? I believe
 vast majority of datasets in production won't be perfectly sorted,
 because when the table is CLUSTERed by index we tend to use index scan
 to do the sort (so no problem), or the data are not actually perfectly
 sorted (and here we get significant speedup).
 
 One conclusion might be that commit a3f0b3d6 should not have added the
 pre-check for perfectly sorted input, which is not in the Bentley &
 Mcilroy original - exploiting the fact that the set is already
 partially sorted is a fine thing, but we're not doing the necessary

Would that pre-check hurt even the random test case? I can imagine it
might behave strangely with almost perfectly sorted data (when only the
very last row is unsorted), but not for random data (but see below).

 bookkeeping. But that's not really why I suggested that test case.
 Rather, I wanted to put the possible regression in the event of
 perfectly sorted input in perspective. Maybe that regression isn't
 worth worrying about, if in the event of a single tuple being out of
 order at the end of the set the outcome dramatically shifts to
 strongly favor abbreviation. Another way of looking at it would be
 that the pre-sorted case is an argument against strxfrm() sorting in
 general more than abbreviated keys in particular, an argument that
 seems new and strange to me. The analysis of algorithms focuses on the
 worst case and average case for a reason, and (say) the glibc
 documentation never considers this when discussing strxfrm().

Yeah, every algorithm has a worst case - either you get a very fast
execution on average with a few rare cases when it's much slower, or you
get very bad average performance.

But after looking at the data a bit more, I actually think this is a red
herring. After looking at the actual runtimes (and not just speedups), I
get this:

  scale   query#   master   cost-model   speedup
     10        1      884          103      856%
    100        1    12153         1506      807%
    200        1    26187         3894      673%
    300        1    38147         6601      578%
    400        1    56332        10976      513%
    500        1    77009        16409      469%

     10        2      883          110      805%
    100        2    12114         1574      770%
    200        2    26104         4038      646%
    300        2    38028         6828      557%
    400        2    56138        11294      497%
    500        2    76720        16829

[HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Alvaro Herrera
I found that the OBJECT_ATTRIBUTE symbol is useless.  I can just remove
it and replace it with OBJECT_COLUMN, and everything continues to work;
no test fails that I can find.

I thought we had a prohibition against ALTER TABLE when used on
composites, but it's not as severe as I thought.  The following commands
fail in master:

ALTER TABLE comptype RENAME TO comptype2;   -- HINT: Use ALTER TYPE
ALTER TABLE comptype SET SCHEMA sch;-- HINT: Use ALTER TYPE

However, the following command works in master:

ALTER TABLE comptype RENAME COLUMN a TO b;
and has the same effect as this:
ALTER TYPE comptype RENAME ATTRIBUTE a TO b;
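
A quick way to see this for yourself (sketch):

CREATE TYPE comptype AS (a int);
ALTER TABLE comptype RENAME COLUMN a TO b;    -- accepted, even though comptype is a type
ALTER TYPE comptype RENAME ATTRIBUTE b TO c;  -- the equivalent ALTER TYPE spelling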

The RENAME ATTRIBUTE case in RenameStmt is the only thing currently
using OBJECT_ATTRIBUTE; therefore, since in precisely that case we do
not prohibit using ALTER TABLE, we can just remove OBJECT_ATTRIBUTE
completely.  That leads to the attached patch, which changes no test
result at all.

This symbol was added in

commit e440e12c562432a2a695b8054964fb34e3bd823e
Author: Peter Eisentraut pete...@gmx.net
Date:   Sun Sep 26 14:41:03 2010 +0300

Add ALTER TYPE ... ADD/DROP/ALTER/RENAME ATTRIBUTE

Like with tables, this also requires allowing the existence of
composite types with zero attributes.

reviewed by KaiGai Kohei

Thoughts?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index d899dd7..1da38c0 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -1591,7 +1591,6 @@ check_object_ownership(Oid roleid, ObjectType objtype, ObjectAddress address,
 			break;
 		case OBJECT_TYPE:
 		case OBJECT_DOMAIN:
-		case OBJECT_ATTRIBUTE:
 		case OBJECT_DOMCONSTRAINT:
 			if (!pg_type_ownercheck(address.objectId, roleid))
 aclcheck_error_type(ACLCHECK_NOT_OWNER, address.objectId);
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 78b54b4..7f85fa4 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -330,7 +330,6 @@ ExecRenameStmt(RenameStmt *stmt)
 			return RenameRelation(stmt);
 
 		case OBJECT_COLUMN:
-		case OBJECT_ATTRIBUTE:
 			return renameatt(stmt);
 
 		case OBJECT_RULE:
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index dcf5b98..d14346f 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -1056,7 +1056,6 @@ EventTriggerSupportsObjectType(ObjectType obtype)
 			/* no support for event triggers on event triggers */
 			return false;
 		case OBJECT_AGGREGATE:
-		case OBJECT_ATTRIBUTE:
 		case OBJECT_CAST:
 		case OBJECT_COLUMN:
 		case OBJECT_COLLATION:
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6c21002..8ad933d 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -7848,7 +7848,7 @@ RenameStmt: ALTER AGGREGATE func_name aggr_args RENAME TO name
 			| ALTER TYPE_P any_name RENAME ATTRIBUTE name TO name opt_drop_behavior
 {
 	RenameStmt *n = makeNode(RenameStmt);
-	n->renameType = OBJECT_ATTRIBUTE;
+	n->renameType = OBJECT_COLUMN;
 	n->relationType = OBJECT_TYPE;
 	n->relation = makeRangeVarFromAnyName($3, @3, yyscanner);
 	n->subname = $6;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6d26986..2b6fc3e 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1615,9 +1615,6 @@ AlterObjectTypeCommandTag(ObjectType objtype)
 		case OBJECT_AGGREGATE:
 			tag = ALTER AGGREGATE;
 			break;
-		case OBJECT_ATTRIBUTE:
-			tag = ALTER TYPE;
-			break;
 		case OBJECT_CAST:
 			tag = ALTER CAST;
 			break;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ac13302..09607eb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1212,7 +1212,6 @@ typedef struct SetOperationStmt
 typedef enum ObjectType
 {
 	OBJECT_AGGREGATE,
-	OBJECT_ATTRIBUTE,			/* type's attribute, when distinct from column */
 	OBJECT_CAST,
 	OBJECT_COLUMN,
 	OBJECT_COLLATION,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Precedence of NOT LIKE, NOT BETWEEN, etc

2015-02-23 Thread Tom Lane
While fooling around with testing operator precedence warnings,
I discovered that there's some existing precedence behavior that's
not at all what I expected.  Consider these examples (all done in
9.4; this is longstanding behavior):

regression=# SELECT 1 < '(3,4)'::point LIKE 42.0;
ERROR:  operator does not exist: point ~~ numeric
LINE 1: SELECT 1 < '(3,4)'::point LIKE 42.0;
  ^

regression=# SELECT 1 < '(3,4)'::point NOT LIKE 42.0;
ERROR:  operator does not exist: integer < point
LINE 1: SELECT 1 < '(3,4)'::point NOT LIKE 42.0;
 ^

regression=# SELECT 1 LIKE '(3,4)'::point < 42.0;
ERROR:  operator does not exist: integer ~~ point
LINE 1: SELECT 1 LIKE '(3,4)'::point < 42.0;
 ^

regression=# SELECT 1 NOT LIKE '(3,4)'::point < 42.0;
ERROR:  operator does not exist: integer !~~ point
LINE 1: SELECT 1 NOT LIKE '(3,4)'::point < 42.0;
 ^

These queries are silly of course, the point is just to expose which
operator the parser thinks binds more tightly.  And what we have above is:
LIKE binds more tightly than <, in either combination.  NOT LIKE binds
less tightly than < if < is on the left, but more tightly than < if <
is on the right.
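
In other words, judging from the error messages, the four statements are
effectively being parenthesized as:

SELECT 1 < ('(3,4)'::point LIKE 42.0);       -- LIKE wins over <
SELECT (1 < '(3,4)'::point) NOT LIKE 42.0;   -- < wins over NOT LIKE
SELECT (1 LIKE '(3,4)'::point) < 42.0;       -- LIKE wins over <
SELECT (1 NOT LIKE '(3,4)'::point) < 42.0;   -- NOT LIKE wins over <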

This seems like a bug; it's certainly not what you'd call intuitive.

The cause, if I'm understanding the Bison manual correctly, is that
the NOT token has lower precedence than '<', so at the critical point
where it has scanned "1 < '(3,4)'::point" and has to either reduce that
(thus binding < more tightly) or shift, it looks at the lookahead token
which is either LIKE or NOT in the first two examples above.  LIKE has
higher priority than '<' so in the first example it shifts; but NOT
has lower priority than '<' so in the second example it reduces.

The other productions starting with a_expr NOT ... also have the
same problem; that includes NOT ILIKE, NOT SIMILAR TO, NOT BETWEEN,
and NOT IN.

I'm not seeing any terribly pleasing ways to fix this.  Aside from
the option of doing nothing, it seems like these are the choices:

1. We could hack base_yylex() to reduce NOT LIKE to a single token
which could be given the same precedence as LIKE.  Ditto for the other
four cases.  This would result in nice behavior for these cases, but as
Robert has correctly pointed out, hacks in base_yylex() risk creating
other poorly-understood behaviors.

2. We could change the precedence levels so that LIKE/ILIKE/SIMILAR,
BETWEEN, and IN all have precedence just above NOT, which would pretty
much mask the problem since there would be no other tokens with in-between
precedence levels.  In the context of the operator precedence changes
I proposed earlier, this would mean inserting the IS tests and comparison
operators between IN_P and POSTFIXOP rather than where the
single-character comparison ops live now.  We would likely also have to
flatten LIKE/BETWEEN/IN into a single %nonassoc precedence level to avoid
having weird interactions among them.  This isn't terribly attractive
because it would risk larger behavioral changes than the previous
proposal.

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reduce pinning in btree indexes

2015-02-23 Thread Heikki Linnakangas

On 02/15/2015 02:19 AM, Kevin Grittner wrote:

Interestingly, the btree README points out that using the old TID
with a new tuple poses no hazard for a scan using an MVCC snapshot,
because the new tuple would not be visible to a snapshot created
that long ago.


The first question is: Do we really need that interlock for the non-MVCC 
snapshots either?


If we do: For non-MVCC snapshots, we need to ensure that all index scans 
that started before the VACUUM did complete before the VACUUM does. I 
wonder if we could find a different mechanism to enforce that. Using the 
pin-interlock for that has always seemed a bit silly to me. Perhaps grab 
a new heavy-weight lock on the table whenever a non-MVCC index scan on 
the table begins, and have VACUUM wait on it.



I found that the LP_DEAD hinting
would be a problem with an old TID, but I figured we could work
around that by storing the page LSN into the scan position structure
when the page contents were read, and only doing hinting if that
matched when we were ready to do the hinting.  That wouldn't work
for an index which was not WAL-logged, so the patch still holds
pins for those.


Or you could use GetFakeLSNForUnloggedRel().


Robert pointed out that the visibility information
for an index-only scan wasn't checked while the index page READ
lock was held, so those scans also still hold the pins.


Why does an index-only scan need to hold the pin?


Finally, there was an "optimization" for marking buffer position
for possible restore that was incompatible with releasing the pin.
I use quotes because the "optimization" adds overhead to every move
to the next page in order set some variables in a structure when a
mark is requested instead of running two fairly small memcpy()
statements.  The two-day benchmark of the customer showed no
performance hit, and looking at the code I would be amazed if the
optimization yielded a measurable benefit.  In general, optimization
by adding overhead to moving through a scan to save time in a mark
operation seems dubious.


Hmm. Did your test case actually exercise mark/restore? The memcpy()s 
are not that small. Especially if it's an index-only scan, you're 
copying a large portion of the page. Some scans call markpos on every 
tuple, so that could add up.



At some point we could consider building on this patch to recheck
index conditions for heap access when a non-MVCC snapshot is used,
check the visibility map for referenced heap pages when the TIDs
are read for an index-only scan, and/or skip LP_DEAD hinting for
non-WAL-logged indexes.  But all those are speculative future work;
this is a conservative implementation that just didn't modify
pinning where there were any confounding factors.


Understood. Still, I'd like to see if we can easily get rid of the 
pinning altogether.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Precedence of NOT LIKE, NOT BETWEEN, etc

2015-02-23 Thread Tom Lane
I wrote:
 I'm not seeing any terribly pleasing ways to fix this.  Aside from
 the option of doing nothing, it seems like these are the choices:

 1. We could hack base_yylex() to reduce NOT LIKE to a single token
 which could be given the same precedence as LIKE.  Ditto for the other
 four cases.  This would result in nice behavior for these cases, but as
 Robert has correctly pointed out, hacks in base_yylex() risk creating
 other poorly-understood behaviors.

 2. We could change the precedence levels so that LIKE/ILIKE/SIMILAR,
 BETWEEN, and IN all have precedence just above NOT, which would pretty
 much mask the problem since there would be no other tokens with in-between
 precedence levels.  In the context of the operator precedence changes
 I proposed earlier, this would mean inserting the IS tests and comparison
 operators between IN_P and POSTFIXOP rather than where the
 single-character comparison ops live now.  We would likely also have to
 flatten LIKE/BETWEEN/IN into a single %nonassoc precedence level to avoid
 having weird interactions among them.  This isn't terribly attractive
 because it would risk larger behavioral changes than the previous
 proposal.

I thought of another possibility:

3. Leave everything as-is but mark the NOT-operator productions as having
the precedence of NOT rather than of LIKE etc.  This would change the
behavior only for the NOT-LIKE-followed-by- example, and would make the
two cases for NOT LIKE consistent though they'd remain inconsistent with
LIKE.  This behavior seems at least somewhat explainable/documentable
(NOT-foo operators have the precedence of NOT), whereas what we have
seems about impossible to justify.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2015-02-23 Thread Rahila Syed
Hello,

Attached is a patch with the following changes:

As suggested above, the block ID in the xlog structs has been replaced by a
chunk ID. The chunk ID is used to distinguish between different types of xlog
record fragments, such as:
XLR_CHUNK_ID_DATA_SHORT
XLR_CHUNK_ID_DATA_LONG
XLR_CHUNK_BKP_COMPRESSED
XLR_CHUNK_BKP_WITH_HOLE

In block references, the block ID follows the chunk ID and retains its
original functionality. This approach increases the data by 1 byte for each
block reference in an xlog record, but it separates the IDs that refer to
different fragments of an xlog record from the actual block ID, which is used
to refer to block references within the record.

Following are WAL numbers for each scenario,

                       WAL
FPW compression on     121.652 MB
FPW compression off    148.998 MB
HEAD                   148.764 MB

Compression remains nearly the same as before. There is some difference in WAL
between HEAD and HEAD+patch+compression OFF; this difference corresponds to the
1-byte increase with each block reference of an xlog record.

Thank you,
Rahila Syed





On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila rahila.s...@nttdata.com
wrote:

 Hello,

 I think we should change the xlog format so that the block_id (which
 currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the
 block id but something like XLR_CHUNK_ID. Which is used as is for
 XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
 XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
 XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
 block id following the chunk id.

 Yes, that'll increase the amount of data for a backup block by 1 byte,
 but I think that's worth it. I'm pretty sure we will be happy about the
 added extensibility pretty soon.

 To clarify my understanding of the above change,

 Instead of a block id to reference different fragments of an xlog record,
 a single byte field "chunk_id" should be used.  "chunk_id" will be the same as
 XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
 But for block references, it will store the following values in order to
 store information about the backup blocks.
 #define XLR_CHUNK_BKP_COMPRESSED  0x01
 #define XLR_CHUNK_BKP_WITH_HOLE 0x02
 ...

 The new xlog format should look as follows:

 Fixed-size header (XLogRecord struct)
 Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
 XLogRecordBlockHeader
 Chunk_id
  XLogRecordBlockHeader
 ...
 ...
 Chunk_id ( rename id field of the XLogRecordDataHeader struct)
 XLogRecordDataHeader[Short|Long]
  block data
  block data
  ...
  main data

 I will post a patch based on this.

 Thank you,
 Rahila Syed

 -Original Message-
 From: Andres Freund [mailto:and...@2ndquadrant.com]
 Sent: Monday, February 16, 2015 5:26 PM
 To: Syed, Rahila
 Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
 Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

 On 2015-02-16 11:30:20 +, Syed, Rahila wrote:
  - * As a trivial form of data compression, the XLOG code is aware that
  - * PG data pages usually contain an unused hole in the middle,
  which
  - * contains only zero bytes.  If hole_length > 0 then we have removed
  - * such a hole from the stored data (and it's not counted in the
  - * XLOG record's CRC, either).  Hence, the amount of block data
  actually
  - * present is BLCKSZ - hole_length bytes.
  + * Block images are able to do several types of compression:
  + * - When wal_compression is off, as a trivial form of compression,
  + the
  + * XLOG code is aware that PG data pages usually contain an unused
 hole
  + * in the middle, which contains only zero bytes.  If length < BLCKSZ
  + * then we have removed such a hole from the stored data (and it is
  + * not counted in the XLOG record's CRC, either).  Hence, the amount
  + * of block data actually present is length bytes.  The hole offset
  + * on page is defined using hole_offset.
  + * - When wal_compression is on, block images are compressed using a
  + * compression algorithm without their hole to improve compression
  + * process of the page. length corresponds in this case to the
  + length
  + * of the compressed block. hole_offset is the hole offset of the
  + page,
  + * and the length of the uncompressed block is defined by
  + raw_length,
  + * whose data is included in the record only when compression is
  + enabled
  + * and with_hole is set to true, see below.
  + *
  + * is_compressed is used to identify if a given block image is
  + compressed
  + * or not. Maximum page size allowed on the system being 32k, the
  + hole
  + * offset cannot be more than 15-bit long so the last free bit is
  + used to
  + * store the compression state of block image. If the maximum page
  + size
  + * allowed is increased to a value higher than that, we should
  + consider
  + * increasing this structure size as well, 

Re: [HACKERS] pg_dump gets attributes from tables in extensions

2015-02-23 Thread Rushabh Lathia
Thanks to the handy test case, I was able to replicate the scenario in my
local environment. And yes, the tbinfo->dobj.ext_member check in
getTableAttrs() does fix the issue.

Looking more into the pg_dump code, I found that PG generally doesn't use a
tbinfo->dobj.ext_member check to ignore an object. Mostly we do this kind of
check using tbinfo->dobj.dump (look at dumpTable() for reference). Do you
have any particular reason for choosing dobj.ext_member over dobj.dump?

On Fri, Feb 20, 2015 at 12:20 PM, Michael Paquier michael.paqu...@gmail.com
 wrote:

 On Fri, Feb 20, 2015 at 5:33 AM, Peter Eisentraut pete...@gmx.net wrote:
  On 2/16/15 2:45 AM, Michael Paquier wrote:
  While looking at the patch to fix pg_dump with extensions containing
  tables referencing each other, I got surprised by the fact that
  getTableAttrs tries to dump table attributes even for tables that are
  part of an extension. Is that normal?
  Attached is a patch that I think makes things right, but not dumping any
  tables that are part of ext_member.
 
  Can you provide an example/test case?  (e.g., which publicly available
  extension contains tables with attributes?)

 Sure. Attached is a simplified version of the extension I used for the
 other patch on pg_dump.
 $ psql -c 'create extension dump_test'
 CREATE EXTENSION
 $ psql -At -c '\dx+ dump_test'
 table aa_tab_fkey
 table bb_tab_fkey
 $ pg_dump -v 21 | grep columns and types
 pg_dump: finding the columns and types of table public.bb_tab_fkey
 pg_dump: finding the columns and types of table public.aa_tab_fkey
 --
 Michael


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Rushabh Lathia


Re: [HACKERS] __attribute__ for non-gcc compilers

2015-02-23 Thread Oskari Saarenmaa
23.02.2015, 04:31, Robert Haas kirjoitti:
 On Tue, Feb 17, 2015 at 8:41 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
 15.01.2015, 21:58, Robert Haas kirjoitti:
 On Wed, Jan 14, 2015 at 5:54 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 I think I'd for now simply not define pg_attribute_aligned() on
 platforms where it's not supported, instead of defining it empty. If we
 need a softer variant we can name it pg_attribute_aligned_if_possible or
 something.

 Sounds sane?

 Yes, that sounds like a much better plan.

 Attached an updated patch rebased on today's git master that never
 defines aligned or packed empty.

 This is also included in the current commitfest,
 https://commitfest.postgresql.org/4/115/
 
 Is this going to play nicely with pgindent?

I ran pgindent on the tree with this patch applied (with a few changes,
pgindent modified atomics headers a bit) and the changes looked ok to
me, mostly pgindent just rewrapped lines like this:

-extern void quickdie(SIGNAL_ARGS) pg_attribute_noreturn;
+extern void
+quickdie(SIGNAL_ARGS) pg_attribute_noreturn;


but there were two cases where it produced a bit weird indentation:

 #ifdef __arm__
-pg_attribute_packed			/* Appropriate whack upside the head for ARM */
+   pg_attribute_packed /* Appropriate whack upside the head for ARM */
 #endif
 ItemPointerData;

and

 void
-pg_attribute_noreturn
+   pg_attribute_noreturn
 plpgsql_yyerror(const char *message)
 {
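
For context, the macro arrangement being discussed is roughly the following
(a simplified sketch under assumed compiler checks, not the actual patch):

/* Define the wrappers only where the compiler supports the attribute;
 * deliberately provide no empty fallback for aligned/packed, so that
 * unsupported platforms fail at compile time instead of silently losing
 * the alignment guarantee. */
#if defined(__GNUC__)
#define pg_attribute_aligned(a)	__attribute__((aligned(a)))
#define pg_attribute_packed		__attribute__((packed))
#define pg_attribute_noreturn	__attribute__((noreturn))
#else
#define pg_attribute_noreturn	/* harmless when empty */
/* no pg_attribute_aligned / pg_attribute_packed defined here */
#endif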


/ Oskari


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] deparsing utility commands

2015-02-23 Thread Stephen Frost
Andres,

* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2015-02-21 14:51:32 -0500, Stephen Frost wrote:
  It'd be *really* nice to be able to pass an object identifier to some
  function and get back the CREATE (in particular, though perhaps DROP, or
  whatever) command for it.  This gets us *awful* close to that without
  actually giving it to us and that's bugging me.  The issue is the
  parsetree, which I understand may be required in some cases (ALTER,
  primarily, it seems?), but isn't always.
  
  The USING and WITH CHECK quals can be extracted from the polForm also,
  of course, it's what psql and pg_dump are doing, after all.
  
  So, why depend on the parsetree?  What about have another argument which
  a user could use without the parsetree, for things that it's absolutely
  required for today (eg: ALTER sub-command differentiation)?
 
 I'm really not wild about pretty massively expanding the scope at the
 eleventh hour.  There's a fair number of commands where this the
 deparsing command will give you a bunch of commands - look at CREATE
 TABLE and CREATE SCHEMA ... for the most extreme examples. For those
 there's no chance to do that without the parse tree available.

For my 2c, I don't agree with either of your assumptions above.  I don't
see this as a massive expansion of the scope, nor that we're in the 11th
hour at this point.  Agreed, it's the last commitfest, but we're near
the beginning of it and Alvaro has already made modifications to the
patch set, as is generally expected to happen for any patch in a
commitfest, without much trouble.  The changes which I'm suggesting are
nearly trivial for the larger body of work and only slightly more than
trivial for the more complicated pieces.

 Yes, it might be possible to use the same code for a bunch of minor
 commands, but not for the interesting/complex stuff.

We can clearly rebuild at least CREATE commands for all objects without
access to the parse tree, obviously pg_dump manages somehow.  I didn't
specify that a single command had to be used.  Further, the actual
deparse_CreateStmt doesn't look like it'd have a terribly hard time
producing something without access to the parsetree.

The whole premise of this approach is that we're using the results of
the catalog changes as the basis for what's needed to recreate the
command.  The places where we're referring to the parsetree look more
like a crutch of convenience than anything done for a particularly good
reason.  Consider this part of 0011, which includes deparse_CreateStmt:

/*
 * Process table elements: column definitions and constraints.  Only
 * the column definitions are obtained from the parse node itself.  To
 * get constraints we rely on pg_constraint, because the parse node
 * might be missing some things such as the name of the constraints.
 */

and then looking down through to deparse_ColumnDef, we see that the
ColumnDef passed in is used almost exclusively for the *column name*
(which is used to look up the entry in pg_attribute).  Looping over
the pg_attribute entries for the table in attnum order would
certainly return the same result, or something has gone wrong.
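
As a simple illustration of that point (this query is just an example of
what the catalogs already provide; it is not part of the patch), the column
definitions can be rebuilt from pg_attribute alone, for whatever relation is
at hand:

SELECT a.attname,
       pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype,
       a.attnotnull
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = 'some_table'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;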

Beyond the inheritance consideration, the only other reference actually
leads to what you argued (correctly, in my view) was a mistake in the
code - where a NOT NULL is omitted for the primary key case.  Weren't you
also complaining about the 'OF type' form being a potential issue?
That'd also go away with the approach I'm advocating.  In short, I
suspect that if this approach had been taken originally, at least some
of the concerns and issued levied against the current implementation
wouldn't exist.  Thankfully, the way the code has been developed, the
majority of the code is general infrastructure and the changes I'm
suggesting are all in code which is simpler, thanks to that
infrastructure already being there.

All I'm suggesting is that we focus on collecting the information from
the catalog and avoid using the parsetree.  For at least the CREATE and
DROP cases, that should be entirely possible.  This does end up being a
bit different from the original goal, which was closer to reproducing
exactly the command that was specified, but as shown above, that
probably wasn't what we ever really wanted.  The original command is
ambiguous on a number of levels and even where it isn't we can get the
canonical information we need straight from the catalog.

  Maybe that's possible and maybe it isn't, but at least for the CREATE
  cases we should be able to avoid forcing a user to provide a built
  parsetree, and that'd be *really* nice and open up this feature to
  quite a few other use-cases that I can think of.  I'd further suggest
  that we provide these command to the SQL level, and then have a
  wrapper which will take the name of an object, resolve it to Oid, and
  then pass back the CREATE command for it.
 
 I think this is a different feature that might end up sharing some of
 the infrastructure, but not more.


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-23 Thread Michael Paquier
On Mon, Feb 23, 2015 at 8:57 PM, Fujii Masao masao.fu...@gmail.com wrote:

 So I pushed the patch.


Thank you.
-- 
Michael


Re: [HACKERS] Bug in pg_dump

2015-02-23 Thread Michael Paquier
On Tue, Feb 24, 2015 at 2:17 AM, Gilles Darold gilles.dar...@dalibo.com
wrote:

 Looks great to me, I have tested with the postgis_topology extension
 everything works fine.


Actually, after looking more in depth at the internals of pg_dump (I did
that last night for another patch), I think that both patches are wrong.
First, this patch marks a table in an extension as always dumpable:
+  /* Mark member table as dumpable */
+  configtbl->dobj.dump = true;
And then many checks on ext_member are added in many code paths to ensure
that the definitions of objects in extensions are never dumped.
But actually this assumption is not true all the time, per this code in
getExtensionMembership():
if (!dopt->binary_upgrade)
    dobj->dump = false;
else
    dobj->dump = refdobj->dump;
So this patch would break binary upgrade, where some extension objects
should be dumped (one reason why I hadn't noticed that before is that
the pg_upgrade tests do not include extensions).

Hence, one idea coming to my mind to fix the problem would be to add some
extra processing directly in getExtensionMembership(), after building the
DO_TABLE_DATA objects with makeTableDataInfo(): check the FK
dependencies and add a dependency link with addObjectDependency(). The good
part of that is that even user tables that reference extension tables
with an FK can be restored correctly, because their constraint is added
*after* their data is loaded. I noticed as well that with this patch the
--data-only mode was dumping tables in the correct order.

Speaking of which, patches implementing this idea are attached. The module
test case has been improved as well, with a check using a table that is not
in an extension but is linked by an FK to a table in an extension.
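
To illustrate the scenario the improved test covers (the table below is just
an example modeled on the test module, not the exact test case):

CREATE EXTENSION dump_test;

-- A table outside the extension referencing a table that belongs to it;
-- its FK constraint must be restored only after the extension table's data.
CREATE TABLE user_tab_fkey (
    id int REFERENCES bb_tab_fkey(id)
);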
-- 
Michael
From 7fb84d68df023d913c448f2498987ca4f0a70595 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 17 Feb 2015 07:39:23 +
Subject: [PATCH 1/3] Fix ordering of tables part of extensions linked with
 constraints

Additional checks are done on FK constraints that potentially link
extension objects to each other, and data dump ordering is ensured. Note
that this does not take into account foreign keys of tables that are
not part of an extension referencing an extension table.
---
 src/bin/pg_dump/pg_dump.c | 56 ++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2b53c72..5b1b240 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15236,7 +15236,8 @@ dumpRule(Archive *fout, DumpOptions *dopt, RuleInfo *rinfo)
 }
 
 /*
- * getExtensionMembership --- obtain extension membership data
+ * getExtensionMembership --- obtain extension membership data and check FK
+ * dependencies among extension tables.
  */
 void
 getExtensionMembership(Archive *fout, DumpOptions *dopt, ExtensionInfo extinfo[],
@@ -15423,6 +15424,59 @@ getExtensionMembership(Archive *fout, DumpOptions *dopt, ExtensionInfo extinfo[]
 	}
 }
 			}
+
+			/*
+			 * Now that all the TableInfoData objects have been created for
+			 * all the extensions, check their FK dependencies and register
+			 * them to ensure correct data ordering.  Note that this is not
+			 * a problem for user tables not included in an extension
+			 * referencing with a FK tables in extensions as their constraint
+			 * is declared after dumping their data. In --data-only mode the
+			 * table ordering is ensured as well thanks to
+			 * getTableDataFKConstraints().
+			 */
+			for (j = 0; j < nconfigitems; j++)
+			{
+int			i_confrelid, k;
+PQExpBuffer query2 = createPQExpBuffer();
+TableInfo  *configtbl;
+Oid			configtbloid = atooid(extconfigarray[j]);
+
+configtbl = findTableByOid(configtbloid);
+if (configtbl == NULL || configtbl->dataObj == NULL)
+	continue;
+
+appendPQExpBuffer(query2,
+	  "SELECT confrelid "
+	  "FROM pg_catalog.pg_constraint "
+	  "WHERE conrelid = '%u' "
+	  "AND contype = 'f'",
+	  configtbloid);
+
+res = ExecuteSqlQuery(fout, query2->data, PGRES_TUPLES_OK);
+ntups = PQntuples(res);
+i_confrelid = PQfnumber(res, "confrelid");
+
+for (k = 0; k < ntups; k++)
+{
+	Oid			confrelid;
+	TableInfo  *reftable;
+
+	confrelid = atooid(PQgetvalue(res, k, i_confrelid));
+	reftable = findTableByOid(confrelid);
+
+	if (reftable == NULL || reftable->dataObj == NULL)
+		continue;
+
+	/*
+	 * Make referencing TABLE_DATA object depend on the
+	 * referenced table's TABLE_DATA object.
+	 */
+	addObjectDependency(&configtbl->dataObj->dobj,
+		reftable->dataObj->dobj.dumpId);
+}
+resetPQExpBuffer(query2);
+			}
 		}
 		if (extconfigarray)
 			free(extconfigarray);
-- 
2.3.0

From ad5cd360243b44e735195c5e94df3c21e8f18e07 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 17 Feb 2015 22:29:28 +0900
Subject: [PATCH 2/3] Make 

Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Kouhei Kaigai
Please see check_object_ownership(). It checks the relation's ownership
for OBJECT_COLUMN; however, the type's ownership is the correct check for
OBJECT_ATTRIBUTE.
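
For reference, the distinction being pointed out is roughly the following
(a simplified sketch assuming the usual backend headers, not the actual body
of check_object_ownership()):

static void
check_column_or_attribute_ownership(ObjectType objtype, Oid relid, Oid typid)
{
	if (objtype == OBJECT_COLUMN)
	{
		/* column of a relation: the relation's owner is checked */
		if (!pg_class_ownercheck(relid, GetUserId()))
			aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_CLASS,
						   get_rel_name(relid));
	}
	else						/* OBJECT_ATTRIBUTE: composite type */
	{
		/* here the type's owner is what ought to be checked */
		if (!pg_type_ownercheck(typid, GetUserId()))
			aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_TYPE,
						   format_type_be(typid));
	}
}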

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Alvaro Herrera
 Sent: Tuesday, February 24, 2015 4:02 AM
 To: Pg Hackers; Peter Eisentraut
 Subject: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE
 for composites)
 
 I found that the OBJECT_ATTRIBUTE symbol is useless.  I can just remove
 it and replace it with OBJECT_COLUMN, and everything continues to work;
 no test fails that I can find.
 
 I thought we had a prohibition against ALTER TABLE when used on
 composites, but it's not as severe as I thought.  The following commands
 fail in master:
 
 ALTER TABLE comptype RENAME TO comptype2; -- HINT: Use ALTER TYPE
 ALTER TABLE comptype SET SCHEMA sch;  -- HINT: Use ALTER TYPE
 
 However, the following command works in master:
 
 ALTER TABLE comptype RENAME COLUMN a TO b;
 and has the same effect as this:
 ALTER TYPE comptype RENAME ATTRIBUTE a TO b;
 
 The RENAME ATTRIBUTE case in RenameStmt is the only thing currently
 using OBJECT_ATTRIBUTE; therefore, since in precisely that case we do
 not prohibit using ALTER TABLE, we can just remove OBJECT_ATTRIBUTE
 completely.  That leads to the attached patch, which changes no test
 result at all.
 
 This symbol was added in
 
 commit e440e12c562432a2a695b8054964fb34e3bd823e
 Author: Peter Eisentraut pete...@gmx.net
 Date:   Sun Sep 26 14:41:03 2010 +0300
 
 Add ALTER TYPE ... ADD/DROP/ALTER/RENAME ATTRIBUTE
 
 Like with tables, this also requires allowing the existence of
 composite types with zero attributes.
 
 reviewed by KaiGai Kohei
 
 Thoughts?
 
 --
 Álvaro Herrerahttp://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_check_dir comments and implementation mismatch

2015-02-23 Thread Noah Misch
On Sun, Feb 22, 2015 at 07:57:41PM -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Fri, Feb 20, 2015 at 12:59 AM, Noah Misch n...@leadboat.com wrote:
  On Mon, Feb 02, 2015 at 03:48:33PM -0500, Robert Haas wrote:
  If readdir() fails and closedir() succeeds, the return will be -1 but
  errno will be 0.
 
  Out of curiosity, have you seen a closedir() implementation behave that 
  way?
  It would violate C99 (The value of errno is zero at program startup, but 
  is
  never set to zero by any library function.) and POSIX.
 
  No.  Good point, I didn't think about that.  I think this way is safer, 
  though.
 
 While the spec forbids library functions from setting errno to zero, there
 is no restriction on them changing errno in other ways despite returning
 success; their exit-time value of errno is only well-defined if they fail.
 So we do need to preserve errno explicitly across closedir(), or we may
 report the wrong failure from readdir().

Yes.  I'm happy with the commit itself.
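
A minimal sketch of the pattern being described (not the committed
pg_check_dir code): save readdir()'s errno before calling closedir(), since
even a successful library call may change errno.

#include <dirent.h>
#include <errno.h>
#include <string.h>

/* Returns 1 if the directory is empty, 0 if not, -1 on error. */
static int
check_dir_empty(const char *path)
{
	DIR		   *dir;
	struct dirent *de;
	int			result = 1;
	int			saved_errno;

	if ((dir = opendir(path)) == NULL)
		return -1;

	errno = 0;
	while ((de = readdir(dir)) != NULL)
	{
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
		{
			result = 0;			/* found a real entry */
			break;
		}
	}
	saved_errno = errno;		/* remember any readdir() failure */

	if (closedir(dir) != 0 && saved_errno == 0)
		saved_errno = errno;	/* closedir() itself failed */

	if (saved_errno != 0)
	{
		errno = saved_errno;
		return -1;
	}
	return result;
}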


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] deparsing utility commands

2015-02-23 Thread Andres Freund
Hi,

On 2015-02-23 19:48:43 -0500, Stephen Frost wrote:
  Yes, it might be possible to use the same code for a bunch of minor
  commands, but not for the interesting/complex stuff.
 
 We can clearly rebuild at least CREATE commands for all objects without
 access to the parse tree, obviously pg_dump manages somehow.

That's not the same. pg_dump recreates a static state, not a running log
of changes that can be replayed.

 I didn't specify that a single command had to be used.

Well, if you want to get 'the create table statement' you'll need to
decide what you want. With ddl event triggers it's clear - something
that, when replayed, results in the same catalog contents. Such an easy
definition doesn't, as far as I know, exist for a set of functions you
seem to imagine.

 Further, the actual deparse_CreateStmt doesn't look like it'd have a
 terribly hard time producing something without access to the
 parsetree.

The interesting bit isn't the deparse_CreateStmt itself, but all the
stuff that's also triggered when you do a CREATE TABLE
nontrivial-stuff;. utility.c will first transform the statement and then
run and stash every created subcommand. And the subcommands will
*also* get deparsed.

Just think about what to do about CREATE TABLE foo(id serial primary
key, data text, bar_id REFERENCES foo.bar); - there's no way you can get
which other objects to dump from the catalog alone. What about a schema,
considering CREATE SCHEMA ... (schema_elements)+;?
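
To make that concrete, a single command like

CREATE TABLE foo (id serial PRIMARY KEY, data text);

implicitly expands into roughly the following (simplified; the exact set of
subcommands stashed by the event trigger machinery may differ):

CREATE SEQUENCE foo_id_seq;
CREATE TABLE foo (
    id   integer NOT NULL DEFAULT nextval('foo_id_seq'),
    data text
);
ALTER SEQUENCE foo_id_seq OWNED BY foo.id;
ALTER TABLE foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);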

Sure, individual subcommands could refer to the catalog instead of the
parse tree. But to get the whole thing you can't easily just refer to
it.

 All I'm suggesting is that we focus on collecting the information from
 the catalog and avoid using the parsetree.  For at least the CREATE and
 DROP cases, that should be entirely possible.

DROP's already in 9.4; the additions in 9.5 were more or less usability
things.  The problem with generating DROPs right now is more about
identifying which one you want to drop and checking the dependencies - the
latter I'm not sure how to do without actually executing the DROP.

   Maybe that's possible and maybe it isn't, but at least for the CREATE
   cases we should be able to avoid forcing a user to provide a built
   parsetree, and that'd be *really* nice and open up this feature to
   quite a few other use-cases that I can think of.  I'd further suggest
   that we provide these command to the SQL level, and then have a
   wrapper which will take the name of an object, resolve it to Oid, and
   then pass back the CREATE command for it.
  
  I think this is a different feature that might end up sharing some of
  the infrastructure, but not more.
 
 I know you've looked through this code also and I really don't get the
 comment that only some of this infrastructure would be shared.  As I
 tried to point out, for the 'CREATE POLICY' case, a few minor
 modifications would have it provide exactly what I'm suggesting and I'm
 sure that most of the cases would be similar.  Simply looking through
 the code with an eye towards avoiding the parsetree in favor of pulling
 information from the catalog would illustrate the point I'm making, I
 believe.

I've no problem at all changing the CREATE POLICY (and some other) deparse
functions to look more at the catalog than at the command. What I don't see
is changing all of the CREATE deparse functions, guaranteeing that they are
usable for getting the DDL of catalog objects, and providing SQL-accessible
infrastructure for that. That's a *massive* undertaking.

What I mean with 'sharing some of the infrastructure' is that I can see
a good portion of the deparse_* functions being reused for what you
desire.

But the decision about which of those to call will be an entirely
separate project. So is the whole new infrastructure for considering locking
and visibility (all the deparse stuff uses continually evolving catalog
snapshots!) that it'll need, since that's a problem the event trigger stuff
has much less cause to care about, because the objects are new rows and thus
can't be updated by other backends.

  I don't think it's all that related, so I'm not surprised. If you
  execute a CREATE TABLE(id serial primary key); you'll get a bunch of
  commands with this facility - it'd be much less clear what exactly shall
  be dumped in the case of some hypothetical GUI when you want to dump the
  table.
 
 I really don't think it's all that strange or complicated to consider
 and we've got a rather nice example of what a good approach would be.

Right. We've got a *massive* program that solves dependencies and doesn't
do all that many useful/correct things if you only let it dump
individual objects. And that dumping of individual objects is what you
want... ;)

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
On 24.2.2015 01:44, Peter Geoghegan wrote:
 On Mon, Feb 23, 2015 at 4:41 PM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 Are you going to add this into the CF? Would be nice to get this into
 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider
 it part of one of the existing sortsupport patches.
 
 It's a bugfix, IMV. I guess I should add it to the current CF, but I
 think I might be forbidden from doing so as a non-CF-manager...


Yes, I consider that a bugfix too, fixing a performance regression.

regards

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about to have relnamespace and relrole?

2015-02-23 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 19 Feb 2015 15:30:53 -0500, Peter Eisentraut pete...@gmx.net wrote in 
54e647fd.5000...@gmx.net
 On 2/18/15 3:44 AM, Kyotaro HORIGUCHI wrote:
  Hello, this is the patchset v2 of this feature.
  
  0001-Add-regrole.patch
  0002-Add-regnamespace.patch
  0003-Check-new-reg-types-are-not-used-as-default-values.patch
  0004-Documentation-for-new-OID-alias-types.patch
 
 Somehow, these patches ended up in the commit fest without an author
 listed.  That should probably not be possible.

Mmm.. I saw the author for this is listed here,

https://commitfest.postgresql.org/4/

Did you fix that manually for me?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Kouhei Kaigai
 Kouhei Kaigai wrote:
  Please see check_object_ownership(). It checks relation's ownership
  if OBJECT_COLUMN, however, type's ownership is the correct check if
  OBJECT_ATTRIBUTE.
 
 Hmm.  Is there any case where the two are different?

AlterObjectTypeCommandTag()?

OBJECT_ATTRIBUTE makes an ALTER TYPE tag, but an ALTER COLUMN tag is
made with OBJECT_COLUMN.

The above two cases are all I could find.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Peter Geoghegan
On Mon, Feb 23, 2015 at 4:41 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 Are you going to add this into the CF? Would be nice to get this into
 9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider
 it part of one of the existing sortsupport patches.

It's a bugfix, IMV. I guess I should add it to the current CF, but I
think I might be forbidden from doing so as a non-CF-manager...
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about to have relnamespace and relrole?

2015-02-23 Thread Michael Paquier
On Tue, Feb 24, 2015 at 11:35 AM, Kyotaro HORIGUCHI 
horiguchi.kyot...@lab.ntt.co.jp wrote:

 Hello,

 At Thu, 19 Feb 2015 15:30:53 -0500, Peter Eisentraut pete...@gmx.net
 wrote in 54e647fd.5000...@gmx.net
  On 2/18/15 3:44 AM, Kyotaro HORIGUCHI wrote:
   Hello, this is the patchset v2 of this feature.
  
   0001-Add-regrole.patch
   0002-Add-regnamespace.patch
   0003-Check-new-reg-types-are-not-used-as-default-values.patch
   0004-Documentation-for-new-OID-alias-types.patch
 
  Somehow, these patches ended up in the commit fest without an author
  listed.  That should probably not be possible.

 Mmm.. I saw the author for this is listed here,

 https://commitfest.postgresql.org/4/

 Did you fix that manually for me?


Looking at the log entry:
2015-02-19 21:11:08 Michael Paquier (michael-kun) Changed authors to
Kyotaro Horiguchi (horiguti)
I do a pass from time to time on the patches...
-- 
Michael


Re: [HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code

2015-02-23 Thread David Steele
On 2/23/15 6:55 PM, Michael Paquier wrote:
 Hi all,
 
 This is a purely informational email...
 I have put my hands on a Raspberry PI 2, and I have found that it takes
 6 minutes to compile the backend code using the 4 cores of the ARMv7
 processor, and close to 20 minutes on a single core (without ccache).
 The test has been done using ArchLinux ARM with a micro SD card of class
 10 (said to be able to manage 50MB/s in write, 60MB/s in read).
 In comparison, the buildfarm machine hamster, which is a Rasp PI 1,
 takes close to 2 hours to do the same operation, using the same OS and a
 similar class 10 card (30MB/s but the I/O is not the bottleneck).

Nice!  Looks like Moore's Law works even on tight budgets.

-- 
- David Steele
da...@pgmasters.net





Re: [HACKERS] pg_dump gets attributes from tables in extensions

2015-02-23 Thread Rushabh Lathia
On Mon, Feb 23, 2015 at 7:45 PM, Michael Paquier michael.paqu...@gmail.com
wrote:



 On Mon, Feb 23, 2015 at 5:57 PM, Rushabh Lathia rushabh.lat...@gmail.com
 wrote:
  Thanks to the easy handy testcase, was able to replicate the test
 scenario
  on my local environment. And yes tbinfo-dobj.ext_member check into
  getTableAttrs() do fix the issue.
 
  Looking more into pg_dump code what I found that, generally PG don't have
  tbinfo-dobj.ext_member check to ignore the object. Mostly we do this
 kind
  of check using tbinfo-dobj.dump (look at dumpTable() for reference). Do
 you
  have any particular reason if choosing dobj.ext_member over dobj.dump ?

 Hm. I have been pondering about this code more and I am dropping the patch
 as either adding a check based on the flag to track dumpable objects or
 ext_member visibly breaks the logic that binary upgrade and tables in
 extensions rely on. Particularly this portion makes me think so in
 getExtensionMembership():
 /*
  * Normally, mark the member object as not to be dumped.  But in
  * binary upgrades, we still dump the members individually, since the
  * idea is to exactly reproduce the database contents rather than
  * replace the extension contents with something different.
  */
 if (!dopt->binary_upgrade)
     dobj->dump = false;
 else
     dobj->dump = refdobj->dump;


Ok. Looking at the above code in getExtensionMembership(), it seems like the
fix you suggested is not correct. I added a new table with a DEFAULT attribute
to the dump_test extension, and pg_dump with --binary-upgrade is failing with
"pg_dump: invalid column number 1 for table bb_tab_fkey".

rushabh@rushabh-centos-vm:dump_test$ cat dump_test--1.0.sql
/* dump_test/dump_test--1.0.sql */

-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION dump_test" to load this file. \quit

CREATE TABLE IF NOT EXISTS bb_tab_fkey (
id int PRIMARY KEY
);

CREATE TABLE IF NOT EXISTS aa_tab_fkey (
id int REFERENCES bb_tab_fkey(id)
);

CREATE TABLE IF NOT EXISTS foo ( a int default 100 );

This gives another strong reason to add an if (!tbinfo->dobj.dump) check
rather than the ext_member check. What do you say?

Regards,


[HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code

2015-02-23 Thread Michael Paquier
Hi all,

This is a purely informational email...
I have put my hands on a Raspberry PI 2, and I have found that it takes 6
minutes to compile the backend code using the 4 cores of the ARMv7
processor, and close to 20 minutes on a single core (without ccache). The
test has been done using ArchLinux ARM with a micro SD card of class 10
(said to be able to manage 50MB/s in write, 60MB/s in read).
In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes
close to 2 hours to do the same operation, using the same OS and a similar
class 10 card (30MB/s but the I/O is not the bottleneck).

Regards,
-- 
Michael


Re: [HACKERS] Raspberry PI vs Raspberry PI 2: time to compile backend code

2015-02-23 Thread Adam Brightwell
Michael,


 This is a purely informational email...
 I have put my hands on a Raspberry PI 2, and I have found that it takes 6
 minutes to compile the backend code using the 4 cores of the ARMv7
 processor, and close to 20 minutes on a single core (without ccache). The
 test has been done using ArchLinux ARM with a micro SD card of class 10
 (said to be able to manage 50MB/s in write, 60MB/s in read).
 In comparison, the buildfarm machine hamster, which is a Rasp PI 1, takes
 close to 2 hours to do the same operation, using the same OS and a similar
 class 10 card (30MB/s but the I/O is not the bottleneck).


Thanks for sharing this info.  I'm looking forward to getting my hands on
an rpi2 soon for some of my other projects.  Having an idea of the
increased performance, especially related to compilation, was certainly
something I was curious about.

-Adam



-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] Abbreviated keys for text cost model fix

2015-02-23 Thread Tomas Vondra
On 23.2.2015 21:52, Peter Geoghegan wrote:
 On Mon, Feb 23, 2015 at 11:17 AM, Tomas Vondra
 tomas.von...@2ndquadrant.com wrote:
 So while it's true that for the 3rd query we get much worse results
 compared to the other queries (i.e. we don't get 400% speedup but ~3%
 slowdown compared to master), it's true that master performs
 exceptionally well for this query with small datasets. Once we get to 2M
 rows, the master performance drops significantly but cost-model keeps
 the performance characteristics and the speedup jumps back to ~700%
 which is nice.

 These numbers are for the 'ASC + unsorted row' test, but I do get
 exactly the same pattern for the 'random' tests done previously.
 
 Yeah. Looks like you're comparing a case where the old cost model
 did the right thing anyway (i.e. used abbreviation). The difference
 would then be entirely explainable as noise. Right?

I don't think it's just noise - on the i5-2500 CPU, the results are very
consistent. For 1M random rows with this query (#3):

select * from (select * from stuff_text order by randtxt
   offset 1000) foo

I do get this (with 20 runs of the query):

             min       max       median    mean      stddev

master       932.449   932.902   932.637   932.639   0.116
cost-model   957.16    958.47    957.497   957.586   0.411

That is super-consistent, IMHO, and the ~2.6% slowdown is clearly
visible. But it might be due to the padding described by Andrew.

Similarly on the Xeon E5450 CPU, although the results there are much
more noisy. Maybe it's more

 It would be nice if we could address the 3% regression for the
 last query, but I guess it's not a big deal. The numbers in general
 are absolutely impressive. Kudos.
 
 Thanks.

Are you going to add this into the CF? Would be nice to get this into
9.5. Strictly speaking it should go to 2015-06 I guess, but I'd consider
it part of one of the existing sortsupport patches.

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT UPDATE and logical decoding

2015-02-23 Thread Peter Geoghegan
On Fri, Feb 20, 2015 at 3:44 PM, Peter Geoghegan p...@heroku.com wrote:
 they'd only see a
 REORDER_BUFFER_CHANGE_INSERT when that was the definitive outcome of
 an UPSERT, or alternatively a REORDER_BUFFER_CHANGE_UPDATE when that
 was the definitive outcome. No need for output plugins to consider
 REORDER_BUFFER_CHANGE_INTERNAL_SUPERDELETE at all.

 Yes. It'd be easiest if the only the final insert/update were actually
 WAL logged as full actions.

 Well, that implies that we'd actually know that we'd succeed when WAL
 logging the speculative heap tuple's insertion. We literally have no
 way of knowing if that's the case at that point, though - that's just
 the nature of value locking scheme #2's optimistic approach.

Attached patch does this. This is just a sketch, to provoke a quick
discussion (which is hopefully all this needs). I did things this way
because it seemed more efficient than rolling this into a whole new
revision of the ON CONFLICT patch series.

Ostensibly, this does the right thing: The test_decoding output plugin
only ever reports either an INSERT or an UPDATE within a transaction
(that contains an INSERT ... ON CONFLICT UPDATE). With jjanes_upsert
running, I have not found any problematic cases where this fails to be
true (although I have certainly been less than thorough so far -
please take that into consideration).

I have two basic concerns:

* Do you think that what I've sketched here is roughly the right
approach? Consolidating super deletions and insertions at transaction
reassembly seemed like the right thing to do, but obviously I'm not
best qualified to judge that. Clearly, it would be possible to
optimize this so REORDER_BUFFER_CHANGE_INSERT peek ahead happens
much more selectively, which I haven't bothered with yet.

* What are the hazards implied by my abuse of the
ReorderBufferIterTXNNext() interface? It looks like it should be okay
to peek ahead like this, because it has a slightly odd convention
around memory management that accidentally works. But honestly, I'm
feeling too lazy to make myself grok the code within
ReorderBufferIterTXNNext() at the moment, and so have not put concerns
about the code being totally broken to rest yet. Could the current
ReorderBufferChange be freed before it is sent to a logical decoding
plugin by way of a call to the rb->apply_change() callback (when the
xact is spooled to disk, for example)?

Thanks
-- 
Peter Geoghegan
From 53784c818b39d39c6a1f3588f7823a9010074869 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Sun, 22 Feb 2015 17:02:14 -0800
Subject: [PATCH 7/7] Logical decoding support for ON CONFLICT UPDATE

---
 src/backend/replication/logical/decode.c|  5 +++-
 src/backend/replication/logical/reorderbuffer.c | 35 +++--
 src/include/replication/reorderbuffer.h |  3 ++-
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e7614bd..a9cd0df 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -682,7 +682,10 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+	if (!(xlrec->flags & XLOG_HEAP_KILLED_SPECULATIVE_TUPLE))
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+	else
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_DELETE;
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 20bb3b7..8f9101e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -401,6 +401,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INSERT:
 		case REORDER_BUFFER_CHANGE_UPDATE:
 		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			if (change->data.tp.newtuple)
 			{
 				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
@@ -1313,7 +1314,12 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 	PG_TRY();
 	{
+		/*
+		 * To merge super delete and insert records, peek-ahead is occasionally
+		 * required
+		 */
 		ReorderBufferChange *change;
+		ReorderBufferChange *nextchange = NULL;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction(replay);
@@ -1323,16 +1329,21 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		rb->begin(rb, txn);
 
 		iterstate = ReorderBufferIterTXNInit(rb, txn);
-		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
+		while ((change = nextchange ? nextchange :
+				ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/* Forget about previous peek ahead */
+			nextchange = NULL;
+
 			switch (change->action)
 		

Re: [HACKERS] Reduce pinning in btree indexes

2015-02-23 Thread Robert Haas
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Robert pointed out that the visibility information
 for an index-only scan wasn't checked while the index page READ
 lock was held, so those scans also still hold the pins.

 Why does an index-only scan need to hold the pin?

Suppose there's a dead tuple in the heap someplace, and an index
pointer pointing to that dead tuple.  An index scan reaches the
relevant page and copies all of the index pointers.  VACUUM removes
the index pointer and marks the heap page all-visible.  The scan now
uses the index pointers copied to backend-local memory and notes that
the heap-page is all-visible, so the scan sees the tuple even though
that tuple is totally gone from the heap by that point.  Holding the
pin prevents this, because we can't reach the second heap pass until
the scan is done with the page, and therefore the dead line pointer in
the heap page can't be marked unused, and therefore the page can't be
marked all-visible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Abbreviated keys for Numeric

2015-02-23 Thread Andrew Gierth
 Tomas == Tomas Vondra tomas.von...@2ndquadrant.com writes:

 Tomas I believe the small regressions (1-10%) for small data sets,
 Tomas might be caused by this 'random padding' effect, because that's
 Tomas probably where L1/L2 cache is most important. For large datasets
 Tomas the caches are probably not as efficient anyway, so the random
 Tomas padding makes no difference,

Except that _your own results_ show that this is not the case.

In your first set of results you claimed a reduction in performance to
91% for a query which is _in no way whatsoever_ affected by any of the
code changes. How is this not noise?

I refer specifically to this case from your spreadsheet:

query     select * from (select * from stuff_text order by randtxt
          offset 1000) foo
type      text
scale     5
master    26.949
datum     28.051
numeric   29.734
datum     96%
numeric   91%

This query does not invoke any code path touched by either the datum or
numeric patches! The observed slowdown is therefore just noise (assuming
here that your timings are correct).

Whether that case can be improved by tweaking the _text_ abbreviation
code is another question, one which is not relevant to either of the
patches currently in play.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] LSN Map

2015-02-23 Thread Robert Haas
On Mon, Feb 23, 2015 at 12:52 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Dunno, but Jim's got a point. This is a maintenance burden to all indexams,
 if they all have to remember to update the LSN map separately. It needs to
 be done in some common code, like in PageSetLSN or XLogInsert or something.

 Aside from that, isn't this horrible from a performance point of view? The
 patch doubles the buffer manager traffic, because any update to any page
 will also need to modify the LSN map. This code is copied from the
 visibility map code, but we got away with it there because the VM only needs
 to be updated the first time a page is modified. Subsequent updates will
 know the visibility bit is already cleared, and don't need to access the
 visibility map.

 Ans scalability: Whether you store one value for every N pages, or the LSN
 of every page, this is going to have a huge effect of focusing contention to
 the LSN pages. Currently, if ten backends operate on ten different heap
 pages, for example, they can run in parallel. There will be some contention
 on the WAL insertions (much less in 9.4 than before). But with this patch,
 they will all fight for the exclusive lock on the single LSN map page.

 You'll need to find a way to not update the LSN map on every update. For
 example, only update the LSN page on the first update after a checkpoint
 (although that would still have a big contention focusing effect right after
 a checkpoint).

I think it would make more sense to do this in the background.
Suppose there's a background process that reads the WAL and figures
out which buffers it touched, and then updates the LSN map
accordingly.  Then the contention-focusing effect disappears, because
all of the updates to the LSN map are being made by the same process.
You need some way to make sure the WAL sticks around until you've
scanned it for changed blocks - but that is mighty close to what a
physical replication slot does, so it should be manageable.
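
For comparison, this is the existing facility being referred to; a physical
replication slot makes the server retain WAL that its consumer has not yet
confirmed (shown only as an illustration of that facility; the slot name is
made up):

SELECT * FROM pg_create_physical_replication_slot('lsn_map_reader');
-- once in use, WAL is retained until the slot's restart_lsn advances
SELECT slot_name, restart_lsn FROM pg_replication_slots;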

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OBJECT_ATTRIBUTE is useless (or: ALTER TYPE vs ALTER TABLE for composites)

2015-02-23 Thread Alvaro Herrera
Kouhei Kaigai wrote:
 Please see check_object_ownership(). It checks relation's ownership
 if OBJECT_COLUMN, however, type's ownership is the correct check if
 OBJECT_ATTRIBUTE.

Hmm.  Is there any case where the two are different?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical column ordering

2015-02-23 Thread Tomas Vondra
Hi,

attached is the result of my first attempt to make the logical column
ordering patch work. This touches a lot of code in the executor that is
mostly new to me, so if you see something that looks like an obvious
bug, it probably is (so let me know).   


improvements

The main improvements of this version are that:

* initdb actually works (while before it was crashing)

* regression tests work, with two exceptions

  (a) 'subselect' fails because EXPLAIN prints columns in physical order
  (but we expect logical)

  (b) 'col_order' crashes because of a tuple descriptor mismatch in a
  function call (this actually causes a segfault)

The main change in this patch is that tlist_matches_tupdesc() now checks
the target list vs. physical attribute order, which may result in doing a
projection (in cases where that would not be done previously).

I do not claim this is the best approach - maybe it would be better to
keep the physical tuple and reorder it lazily. That's why I kept a few
pieces of code (fix_physno_mutator) and a few unused fields in Var.

Over time I've heard various use cases for this patch, but in most
cases they were quite speculative. If you have an idea where this might be
useful, can you explain it here, or maybe point me to a place where it's
described?

There's also a few FIXMEs, mostly from Alvaro's version of the patch.
Some of them are probably obsolete, but I wasn't 100% sure by that so
I've left them in place until I understand the code sufficiently.


randomized testing
--
I've also attached a python script for simple randomized testing. Just
execute it like this:

$ python randomize-attlognum.py -t test_1 test_2 \
   --init-script attlognum-init.sql \
   --test-script attlognum-test.sql

and it will do this over and over

$ dropdb test
$ createdb test
$ run init script
$ randomly set attlognums for the tables (test_1 and test_2)
$ run test script

It does not actually check the result, but my experience is that when
there's a bug in handling the descriptor, it results in a segfault pretty
fast (just put some varlena columns into the table).
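
For reference, setting attlognum by hand for this kind of testing might look
like the following (illustrative only: attlognum is the pg_attribute column
added by this patch, the column names are made up, and the script is more
careful about keeping the numbers consistent):

-- as a superuser, swap the logical positions of two columns of test_1
UPDATE pg_attribute SET attlognum = 2
 WHERE attrelid = 'test_1'::regclass AND attname = 'a';
UPDATE pg_attribute SET attlognum = 1
 WHERE attrelid = 'test_1'::regclass AND attname = 'b';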


plans / future
--
After discussing this with Alvaro, we've both agreed that this is far
too high-risk change to commit in the very last CF (even if it was in a
better shape). So while it's added to 2015-02 CF, we're aiming for 9.6
if things go well.


regards

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index c5892d3..7528724 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2368,6 +2368,9 @@ get_attnum_pk_pos(int *pkattnums, int pknumatts, int key)
 	return -1;
 }
 
+/*
+ * FIXME this probably needs to be tweaked.
+ */
 static HeapTuple
 get_tuple_of_interest(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals)
 {
diff --git a/contrib/spi/timetravel.c b/contrib/spi/timetravel.c
index 0699438..30e496c 100644
--- a/contrib/spi/timetravel.c
+++ b/contrib/spi/timetravel.c
@@ -314,6 +314,7 @@ timetravel(PG_FUNCTION_ARGS)
 		Oid		   *ctypes;
 		char		sql[8192];
 		char		separ = ' ';
+		Form_pg_attribute *attrs;
 
 		/* allocate ctypes for preparation */
 		ctypes = (Oid *) palloc(natts * sizeof(Oid));
@@ -322,10 +323,11 @@ timetravel(PG_FUNCTION_ARGS)
 		 * Construct query: INSERT INTO _relation_ VALUES ($1, ...)
 		 */
 		snprintf(sql, sizeof(sql), "INSERT INTO %s VALUES (", relname);
+		attrs = TupleDescGetLogSortedAttrs(tupdesc);
 		for (i = 1; i <= natts; i++)
 		{
-			ctypes[i - 1] = SPI_gettypeid(tupdesc, i);
-			if (!(tupdesc->attrs[i - 1]->attisdropped)) /* skip dropped columns */
+			ctypes[i - 1] = SPI_gettypeid(tupdesc, attrs[i - 1]->attnum);
+			if (!(attrs[i - 1]->attisdropped)) /* skip dropped columns */
 			{
 				snprintf(sql + strlen(sql), sizeof(sql) - strlen(sql), "%c$%d", separ, i);
 				separ = ',';
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 6cd4e8e..14787ce 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -79,6 +79,8 @@
 /*
  * heap_compute_data_size
  *		Determine size of the data area of a tuple to be constructed
+ *
+ * Note: input arrays must be in attnum order.
  */
 Size
 heap_compute_data_size(TupleDesc tupleDesc,
@@ -88,16 +90,23 @@ heap_compute_data_size(TupleDesc tupleDesc,
 	Size		data_length = 0;
 	int			i;
 	int			numberOfAttributes = tupleDesc->natts;
-	Form_pg_attribute *att = tupleDesc->attrs;
+	Form_pg_attribute *att = TupleDescGetPhysSortedAttrs(tupleDesc);
 
+	/*
+	 * We need to consider the attributes in physical order for storage, yet
+	 * our input arrays are in attnum order.  In this loop, i is an index
+	 * into the attphysnum-sorted attribute array, and idx is an index into the
+	 * input arrays.
+	 */
 	for (i = 0; i < numberOfAttributes; i++)