Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-10-21 Thread Michael Paquier
On Sat, Oct 22, 2016 at 9:45 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Sat, Oct 22, 2016 at 12:15 AM, Jim Nasby  wrote:
>>> On 10/21/16 8:47 AM, Tom Lane wrote:
>>>> If we are willing to do that then we don't really have to solve the
>>>> problem on the backend side.  One could expect that autovacuum would
>>>> clean things up within a few minutes after a backend failure.
>
>> I am really thinking that we should just do that and call it a day
>> then, but document the fact that if one wants to look at the content
>> of orphaned tables after a crash he had better turn autovacuum to off
>> for the time of the analysis.
>
> Yeah, agreed.  This also points up the value of Robert's suggestion
> of a "really off" setting.

Okay, so I suggest something like the attached as HEAD-only change
because that's a behavior modification.
-- 
Michael
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index f87f3e0..0a365a0 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -800,6 +800,14 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu

 

+Autovacuum detects orphaned temporary tables that were left behind by a
+crashed backend and forcibly drops them.  If you need to inspect the
+contents of such tables after a backend crash, set autovacuum to off
+beforehand so that they are not dropped while you are still examining
+them.
+   
+
+   
 The default thresholds and scale factors are taken from
 postgresql.conf, but it is possible to override them
 (and many other autovacuum control parameters) on a per-table basis; see
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index e3a6911..3a0f4a8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2037,34 +2037,21 @@ do_autovacuum(void)
 			/* We just ignore it if the owning backend is still active */
 			if (backendID == MyBackendId || BackendIdGetProc(backendID) == NULL)
 			{
+				ObjectAddress object;
+
 				/*
-				 * We found an orphan temp table (which was probably left
-				 * behind by a crashed backend).  If it's so old as to need
-				 * vacuum for wraparound, forcibly drop it.  Otherwise just
-				 * log a complaint.
+				 * We found an orphan temp table which was probably left
+				 * behind by a crashed backend, so forcibly drop it.
 				 */
-				if (wraparound)
-				{
-					ObjectAddress object;
-
-					ereport(LOG,
-							(errmsg("autovacuum: dropping orphan temp table \"%s\".\"%s\" in database \"%s\"",
-									get_namespace_name(classForm->relnamespace),
-									NameStr(classForm->relname),
-									get_database_name(MyDatabaseId))));
-					object.classId = RelationRelationId;
-					object.objectId = relid;
-					object.objectSubId = 0;
-					performDeletion(&object, DROP_CASCADE, PERFORM_DELETION_INTERNAL);
-				}
-				else
-				{
-					ereport(LOG,
-							(errmsg("autovacuum: found orphan temp table \"%s\".\"%s\" in database \"%s\"",
-									get_namespace_name(classForm->relnamespace),
-									NameStr(classForm->relname),
-									get_database_name(MyDatabaseId))));
-				}
+				ereport(LOG,
+						(errmsg("autovacuum: dropping orphan temp table \"%s\".\"%s\" in database \"%s\"",
+								get_namespace_name(classForm->relnamespace),
+								NameStr(classForm->relname),
+								get_database_name(MyDatabaseId))));
+				object.classId = RelationRelationId;
+				objec

Re: [HACKERS] Parallel Index Scans

2016-10-21 Thread Amit Kapila
On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas  wrote:
> On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila  wrote:
>>> I think the parallel_workers reloption should override the degree of
>>> parallelism for any sort of parallel scan on that table.  Had I
>>> intended it to apply only to sequential scans, I would have named it
>>> differently.
>>
>> I think there is big difference of size of relation to scan between
>> parallel sequential scan and parallel (range) index scan which could
>> make it difficult for user to choose the value of this parameter.  Why
>> do you think that the parallel_workers reloption should suffice all
>> type of scans for a table?  I could only think of providing it based
>> on thinking that lesser config knobs makes life easier.
>
> Well, we could do that, but it would be fairly complicated and it
> doesn't seem to me to be the right place to focus our efforts.  I'd
> rather try to figure out some way to make the planner smarter, because
> even if users can override the number of workers on a
> per-table-per-scan-type basis, they're probably still going to find
> using parallel query pretty frustrating unless we make the
> number-of-workers formula smarter than it is today.  Anyway, even if
> we do decide to add more reloptions than just parallel_degree someday,
> couldn't that be left for a separate patch?
>

That makes sense to me.  As of now, the patch doesn't consider reloptions
for parallel index scans.  So I think we can leave it as it is and then
later, in a separate patch, decide whether using the table's reloption or
a separate reloption for the index would be the better choice.
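
For reference, a minimal sketch of the per-table knob under discussion,
assuming the existing parallel_workers reloption (9.6+) and a hypothetical
table named big_table:

-- Override the planner's worker-count formula for all parallel scans on this table.
ALTER TABLE big_table SET (parallel_workers = 4);
-- Return to the formula-based default.
ALTER TABLE big_table RESET (parallel_workers);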

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] make check-world failing for me

2016-10-21 Thread Thom Brown
On 22 October 2016 at 01:52, Tom Lane  wrote:

> Thom Brown  writes:
> > I'm using the latest version of Linux Mint Debian Edition, having recently
> > upgraded from an older version, and now I can't get make check-world to
> > finish successfully.
>
> > The initial theory was that the version of Bison I'm using (3.0.2) is to
> > blame, but I've also tested it against 3.0.1, 2.7.1 and 2.3, but I get the
> > same problem:
>
> Don't see how it would be Bison's fault.  What this looks like to me is
> that you've got a corrupted copy of src/test/perl/PostgresNode.pm,
> because this:
>
> > syntax error at
> > /home/thom/Development/postgresql/src/test/modules/commit_ts/../../../../src/test/perl/PostgresNode.pm
> > line 437, near ")
> > print"
>
> doesn't seem to have anything to do with what's actually at line 437
> in that file (at least not in current HEAD).  I don't see any close
> matches elsewhere in that file, either.
>

I've just taken a fresh clone of the PostgreSQL git repo, and it appears
you're right.  All the tests appear to be passing now, so sorry for the
noise.

Thanks

Thom


Re: [HACKERS] make check-world failing for me

2016-10-21 Thread Tom Lane
Thom Brown  writes:
> I'm using the latest version of Linux Mint Debian Edition, having recently
> upgraded from an older version, and now I can't get make check-world to
> finish successfully.

> The initial theory was that the version of Bison I'm using (3.0.2) is to
> blame, but I've also tested it against 3.0.1, 2.7.1 and 2.3, but I get the
> same problem:

Don't see how it would be Bison's fault.  What this looks like to me is
that you've got a corrupted copy of src/test/perl/PostgresNode.pm,
because this:

> syntax error at
> /home/thom/Development/postgresql/src/test/modules/commit_ts/../../../../src/test/perl/PostgresNode.pm
> line 437, near ")
> print"

doesn't seem to have anything to do with what's actually at line 437
in that file (at least not in current HEAD).  I don't see any close
matches elsewhere in that file, either.

regards, tom lane




Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-10-21 Thread Tom Lane
Michael Paquier  writes:
> On Sat, Oct 22, 2016 at 12:15 AM, Jim Nasby  wrote:
>> On 10/21/16 8:47 AM, Tom Lane wrote:
>>> If we are willing to do that then we don't really have to solve the
>>> problem on the backend side.  One could expect that autovacuum would
>>> clean things up within a few minutes after a backend failure.

> I am really thinking that we should just do that and call it a day
> then, but document the fact that if one wants to look at the content
> of orphaned tables after a crash he had better turn autovacuum to off
> for the time of the analysis.

Yeah, agreed.  This also points up the value of Robert's suggestion
of a "really off" setting.

regards, tom lane




Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Tom Lane
Alvaro Herrera  writes:
> Jim Nasby wrote:
>> It occurs to me that it might be worth embedding the relation name in the
>> free space of the first block. Most people would never notice the missing 64
>> bytes, but it would be incredibly helpful in cases like this...

> Agreed.  The problem is how to install it without breaking pg_upgrade.

Well, that's the first problem.  The second problem is how to cope with
RENAME TABLE.

regards, tom lane




Re: [HACKERS] Draft release notes for next week's releases

2016-10-21 Thread Peter Geoghegan
On Fri, Oct 21, 2016 at 4:51 PM, Tom Lane  wrote:
> Please review ...

Is somebody going to look at the bugfix for the issue where ON
CONFLICT DO NOTHING is used at higher isolation levels [1]? I think
that it's still possible to get it in.

[1] 
https://www.postgresql.org/message-id/cam3swzr6an++h24e6y2nwetjmtcjxdbfgeifoss2jpfpemq...@mail.gmail.com
-- 
Peter Geoghegan




[HACKERS] Draft release notes for next week's releases

2016-10-21 Thread Tom Lane
I have committed draft release notes at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=eacaf6e29fd2a3047aff9738a35a8e9b05e55375

They should be up on the website at
https://www.postgresql.org/docs/devel/static/release-9-6-1.html
as soon as guaibasaurus does its next run (about half an hour
from now, looks like).

We need to consider how to document recovery procedures for the
recently-fixed FSM and VM corruption bugs.  For the moment I've
assumed that the main documentation for this will be on the wiki,
but if so, we need to put in at least placeholder pages for the
release notes to point to.  (Also, somebody want to start filling
in the placeholders?)

For the moment, there are just notes for 9.6.1; I'll slice and
dice them for the back branches later.  Note that quite a few
of the items don't really apply to 9.6.1 because the bug fixes
are already there in 9.6.0.  This is visible in the SGML source
because the comments for them look like

Branch: REL9_6_STABLE Release: REL9_6_0 [36646d3af] 2016-09-03 13:28:53 -0400

Please review ...

regards, tom lane




[HACKERS] tuplesort_gettuple_common() and *should_free argument

2016-10-21 Thread Peter Geoghegan
I noticed that the routine tuplesort_gettuple_common() fails to set
*should_free in all paths in master branch (no bug in 9.6). Consider
the TSS_SORTEDONTAPE case, where we can return false without also
setting "*should_free = false" to instruct caller not to pfree() tuple
memory that tuplesort still owns. I suppose that this is okay because
caller should never pfree() a tuple when NULL pointer was returned at
higher level (as "pointer-to-tuple"). Even still, it seems like a bad
idea to rely on caller testing things such that caller doesn't go on
to pfree() a NULL pointer when seemingly instructed to do so by
tuplesort "get tuple" interface routine (that is, some higher level
routine that calls tuplesort_gettuple_common()).

More importantly, there are no remaining cases where
tuplesort_gettuple_common() sets "*should_free = true", because there
is no remaining need for caller to *ever* pfree() tuple. Moreover, I
don't anticipate any future need for this -- the scheme recently
established around per-tape "slab slots" makes it seem unlikely to me
that the need to set "*should_free = true" will ever arise again. So, for
Postgres 10, why not rip out the "*should_free" arguments that appear
in a bunch of places? This lets us slightly simplify most tuplesort.c
callers.

Now, it is still true that at least some callers to
tuplesort_gettupleslot() (but not any other "get tuple" interface
routine) are going to *require* ownership of memory for returned
tuples, which means a call to heap_copy_minimal_tuple() remains
necessary there (see recent commit
d8589946ddd5c43e1ebd01c5e92d0e177cbfc198 for details). But, that's
beside the point. In the master branch only, the
tuplesort_gettupleslot() test "if (!should_free)" seems tautological,
just as similar tests are for every other tuplesort_gettuple_common()
caller. So, tuplesort_gettupleslot() can safely assume it should
*always* heap_copy_minimal_tuple() in the master branch. (BTW, I'm
going to teach tuplesort_gettuple_common() to avoid this
heap_copy_minimal_tuple() call for callers that don't actually need
it, but that's a discussion for another thread).

-- 
Peter Geoghegan




[HACKERS] Fast Default WIP patch for discussion

2016-10-21 Thread Serge Rielau
As promised and requested find attached a work in progress patch for fast defaults.
This is my first patch, I hope I used the right format…..

The premise of the feature is to avoid a table rewrite when adding a column with a default.
This is done by remembering the default value in pg_attribute instead of updating all the rows.
When a tuple is read from disk which is shorter than the tuple descriptor mandates, the default
is "plugged in" either when filling in the datum/null array of a virtual tuple, or by "expanding"
the tuple to a complete heap or minimal tuple with all attributes.

Example:

CREATE TABLE t (pk INT NOT NULL PRIMARY KEY);
INSERT INTO t VALUES (1);
=> t: (1)

ALTER TABLE t ADD COLUMN c1 INT DEFAULT 5;
=> 1. Stores the default expression in pg_attrdef.adbin (no news)
   2. a. Stores the default values (computed at time of ALTER TABLE) in pg_attribute.att_existdef
      b. Sets pg_attribute.att_hasexistdef to true

SELECT * FROM t;
=> Build a virtual tuple using getAttr() API (no news), but since the tuple has fewer atts
   than the table signature, fill in extra attributes from pg_attribute.att_existdef
   if att_hasexistdef is true
=> (1) becomes (1, 5)

INSERT INTO t VALUES (2);
=> Fill in DEFAULT for t.c1 from pg_attrdef.adbin as usual: (2, 5)
=> t: (1), (2, 5)

ALTER TABLE t ALTER COLUMN c1 SET DEFAULT -1;
=> 1. Drop row from pg_attrdef for c1 (no news)
   2. Add row to pg_attrdef for c1 DEFAULT -1 (no news)
   3. Leave pg_attribute.att_existdef alone!!

INSERT INTO t VALUES (3);
=> Fill in DEFAULT for t.c1 from pg_attrdef.adbin as usual: (3, -1)
=> t: (1), (2, 5), (3, -1)

SELECT * FROM t;
=> Build a virtual tuple using get*Attr() API (no news), but since the tuple has fewer atts
   than the table signature, fill in extra attributes from pg_attribute.att_existdef
   if att_hasexistdef is true
=> (1) becomes (1, 5)
   (2, 5)
   (3, -1)

ALTER TABLE t ALTER COLUMN c1 DROP DEFAULT;
=> 1. Drop row from pg_attrdef for c1 (no news)
   2. Leave pg_attribute.att_existdef alone!!

INSERT INTO t VALUES (4);
=> Fill in default DEFAULT (4, NULL)
=> t: (1), (2, 5), (3, -1), (4, NULL)

SELECT * FROM t;
=> Build a virtual tuple using get*Attr() API (no news), but since the tuple has fewer atts
   than the table signature, fill in extra attributes from pg_attribute.att_existdef
   if att_hasexistdef is true
=> (1) becomes (1, 5)
   (2, 5)
   (3, -1)
   (4, NULL)

You can find a new (incomplete) test file fast_default.sql in regress/sql.

Some key design points requiring discussion:

1. Storage of the "exist" (working name) default
   Right now the patch stores the default value in its binary form, as it would be in the tuple,
   into a BYTEA.  It would be feasible to store the pretty printed literal, however this requires
   calling the io functions when a tuple descriptor is built.
2. The exist default is cached alongside the "current" default in the tuple descriptor's
   constraint structure.  Seems most natural to me, but debatable.
3. Delayed vs. early expansion of the tuples.
   To avoid having to decide when to copy tuple descriptors with or without constraints and
   defaults I have opted to expand the tuple at the core entry points.
   How do I know I have them all? An omission means wrong results!
4. attisnull()
   This routine is used in many places, but to give correct results it must now be accompanied
   by the tuple descriptor.  This becomes moderately messy and it's not always clear where to
   get that.  Interestingly most usages are related to catalog lookups.
   Assuming we have no intention to support fast default for catalog tables, we could keep using
   the existing attisnull() API for catalog lookups and use a new version (name tbd) for user tables.
5. My head hurts looking at the PK/FK code - it's not always clear which tuple descriptor
   belongs to which tuple.
6. Performance of the expansion code.
   The current code needs to walk all defaults and then start padding by filling in values.
   But the outcome is always the same: we will produce the same payload and the same null map.
   It would be feasible to cache an "all defaults tuple", remember the offsets (and VARWIDTH,
   HASNULL) for each attribute and then simply splice the short and default tuples together.
   This ought to be faster, but the meta data to cache is not insignificant and the expansion
   code is messy enough without this already.
7. Grooming
   Obviously we can remove all exist defaults for a table from pg_attribute whenever the table
   is rewritten.  That's easy.  But could we/should we keep track of the short tuples and either
   eliminate them or drop exist defaults once they become obsolete because there is no tuple
   short enough for them to matter?
8. Do we need to worry about toasted defaults?

Cheers
Serge Rielau
Salesforce.com
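
A quick way to see the effect the patch is aiming for is to watch the
relfilenode, which only changes when the table is rewritten (a sketch
against the hypothetical table t from the example above):

SELECT relfilenode FROM pg_class WHERE relname = 't';
ALTER TABLE t ADD COLUMN c1 INT DEFAULT 5;
SELECT relfilenode FROM pg_class WHERE relname = 't';
-- With the patch the value should be unchanged; today an ADD COLUMN with a
-- non-NULL default rewrites the table and assigns a new relfilenode.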

fast_default.patch
Description: Binary data


Re: [HACKERS] Indirect indexes

2016-10-21 Thread Tom Lane
Alvaro Herrera  writes:
> Robert Haas wrote:
>> So, I think that this is a really promising direction, but also that
>> you should try very hard to try to get out from under this 6-byte PK
>> limitation.  That seems really ugly, and in practice it probably means
>> your PK is probably going to be limited to int4, which is kind of sad
>> since it leaves people using int8 or text PKs out in the cold.

> I think we could just add a new type, unsigned 6 byte int, specifically
> for this purpose.

I think that's a really bad idea, because after you've fixed this
hopefully-temporary limitation, we'll still be stuck carrying this
weird type forever.  Besides which, doesn't the existing TID type
already serve the purpose?
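
(For reference, a small sketch of that existing type: every heap row already
exposes a tid value through its ctid system column.)

SELECT ctid, relname FROM pg_class LIMIT 3;    -- values look like (block, offset)
SELECT pg_typeof(ctid) FROM pg_class LIMIT 1;  -- tid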

regards, tom lane




Re: [HACKERS] Indirect indexes

2016-10-21 Thread Alvaro Herrera
Robert Haas wrote:

> So, I think that this is a really promising direction, but also that
> you should try very hard to try to get out from under this 6-byte PK
> limitation.  That seems really ugly, and in practice it probably means
> your PK is probably going to be limited to int4, which is kind of sad
> since it leaves people using int8 or text PKs out in the cold.

I think we could just add a new type, unsigned 6 byte int, specifically
for this purpose.  Little in the way of operators, as it's pointless to
try to do arithmetic with object identifiers.  (It'd be similar to UUID
in spirit, but I wouldn't try to do anything too smart to generate them.)

> I believe Claudio Freire is on to something when he suggests storing
> the PK in the index tuple; one could try to skip storing the TID, or
> always store it as all-zeroes.

That bloats the index a bit.  But then, maybe that is fine ...

> The VACUUM problems seem fairly serious.  It's true that these indexes
> will be less subject to bloat, because they only need updating when
> the PK or the indexed columns change, not when other indexed columns
> change.  On the other hand, there's nothing to prevent a PK from being
> recycled for an unrelated tuple.  We can guarantee that a TID won't be
> recycled until all index references to the TID are gone, but there's
> no such guarantee for a PK.  AFAICT, that would mean that an indirect
> index would have to be viewed as unreliable: after looking up the PK,
> you'd *always* have to recheck that it actually matched the index
> qual.

Yes, recheck is always needed.

As for vacuum, I was thinking this morning that perhaps the answer to
that is just to not vacuum the index at all and instead rely on the
killtuple interface (which removes tuples during scan).  So we don't
need to spend precious maint_work_mem space on a large list of PK
values.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-10-21 Thread Michael Paquier
On Sat, Oct 22, 2016 at 12:15 AM, Jim Nasby  wrote:
> On 10/21/16 8:47 AM, Tom Lane wrote:
>>>
>>> It seems to me that you'd even want to make the drop of orphaned
>>> tables mandatory once it is detected even if it is not a wraparound
>>> autovacuum.
>>
>> If we are willing to do that then we don't really have to solve the
>> problem on the backend side.  One could expect that autovacuum would
>> clean things up within a few minutes after a backend failure.
>
> Unless all the autovac workers are busy working on huge tables... maybe a
> delay of several hours/days is OK in this case, but it's not wise to assume
> autovac will always get to something within minutes.

I am really thinking that we should just do that and call it a day
then, but document the fact that if one wants to look at the content
of orphaned tables after a crash he had better turn autovacuum to off
for the time of the analysis.
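
A minimal sketch of that workflow, assuming a superuser and the ALTER SYSTEM
machinery available since 9.4:

-- Keep autovacuum from dropping the orphaned temp tables while inspecting them.
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
-- Orphaned temp tables live in leftover pg_temp_N schemas; list temp relations to find them.
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 't';
-- Once the analysis is done:
ALTER SYSTEM SET autovacuum = on;
SELECT pg_reload_conf();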
-- 
Michael




Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Alvaro Herrera
Jim Nasby wrote:
> On 10/21/16 2:02 PM, Alvaro Herrera wrote:
> > Merlin Moncure wrote:
> > 
> > > OK, I have some good (very- in the specific case of yours truly) news
> > > to report.  Doing a filesystem level copy to a test server I was able
> > > to relfilenode swap one of the critical tables over the place of the
> > > refilenode of the stored backup.  Not being able know the file to copy
> > > from, I figured out the source node by judging the size and using
> > > 'strings'  utility.  Data recovery for that table at least appears to
> > > be 100%.
> > 
> > FWIW you can use pg_filedump and match based on the number of columns.
> > I suppose you could also use the pageinspect extension, by 'dd'ing a
> > page from the file into the database and feeding into heap_page_items as
> > bytea.
> 
> It occurs to me that it might be worth embedding the relation name in the
> free space of the first block. Most people would never notice the missing 64
> bytes, but it would be incredibly helpful in cases like this...

Agreed.  The problem is how to install it without breaking pg_upgrade.

Note that for DR purposes it is better to put that data in the first
block of *each segment*.  Otherwise if you have many >1GB tables, you
have to go segment by segment ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-21 Thread Michael Paquier
On Sat, Oct 22, 2016 at 4:29 AM, Alvaro Herrera
 wrote:
> David Steele wrote:
>> On 10/21/16 3:12 AM, David G. Johnston wrote:
>>
>> > I have no problem continuing keeping with historical precedent and
>> > allowing mnemonic abbreviations in our directory and file names at this
>> > point.
>>
>> I'm still in favor of pg_xact.  A search of the 9.6 docs brings up a number
>> of hits for "xact": pg_last_xact_replay_timestamp(),
>> pg_advisory_xact_lock(), pg_advisory_xact_lock_shared(),
>> pg_last_committed_xact(),  pg_prepared_xacts(), etc. There are also numerous
>> column names that have "xact" in them.
>
> I'm +1 on pg_clog -> pg_xact.

So let's say that's the winner then.

> Also +1 to renaming pg_subtrans to pg_subxact.

Nice suggestion, good naming for consistency with the rest.
-- 
Michael




Re: [HACKERS] FSM corruption leading to errors

2016-10-21 Thread Michael Paquier
On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby  wrote:
> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>
>> 2) If anything is found, stop the server and delete the files manually.
>> 3) Re-start the server.
>> OK, that's troublesome and costly for large relations, but we know
>> that's the safest way to go for any versions, and there is no need to
>> complicate the code with any one-time repairing extensions.
>
> Wouldn't you need to run around to all your replicas and do that as well?

Yeah...

>> Speaking of which, I implemented a small extension able to truncate
>> the FSM up to the size of the relation as attached, but as I looked at
>> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
>> is rather limited... And I pushed as well a version on github:
>> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
>> The limited range of such an extension is an argument good enough to
>> just rely on the stop/delete-FSM/start method to fix an instance and
>> let VACUUM do the rest of the work. That looks to work but use it at
>> your own risk.
>
>
> Shouldn't the truncation be logged before it's performed?

Doh. Yes, thanks for the reminder. And I commented on that upthread..
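
For reference, a sketch of how to locate the FSM file that the
stop/delete-FSM/start method above would remove, using the existing
pg_relation_filepath() function and a hypothetical table t:

SELECT pg_relation_filepath('t');
-- Returns something like base/16384/16385; the relation's free space map is
-- the file at that same path with an "_fsm" suffix, to be removed only while
-- the server is stopped.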
-- 
Michael




Re: [HACKERS] Indirect indexes

2016-10-21 Thread Jim Nasby

On 10/21/16 4:05 PM, Adam Brusselback wrote:

> My main point is that maybe the reason why most users use int4 pkeys
> (besides conventional wisdom) is because it gets the most support from
> features like this, and it may not be quite as skewed if that same
> support were given to other types.


I think it's a pretty safe bet that they're the most popular simply 
because serial means int4 and not int8. I bet a fair number of users 
don't even realize bigserial exists.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE

2016-10-21 Thread Jim Nasby

On 10/21/16 12:30 PM, Stephen Frost wrote:

> I don't see why we would want to stick 'N/A' in for the header, even if
> we are reporting the details, when we can provide a pretty reasonable
> number.


Because then it's absolutely clear that we don't have a valid rowcount, 
only a guess (and a guess that's potentially off by a lot).


No one is used to seeing "N/A" in explain, so when they do see it 
they'll immediately realize they don't know what's going on and hit 
google or the docs up. Otherwise they'll just think it's an accurate 
rowcount like for any other node...



> In particular, I certainly don't think we would want to report
> N/A sometimes (lossy case) and then an actual number other times (all
> exact case).  That strikes me as much more likely to be confusing.


Fair enough. I'd certainly rather have a constant N/A than a guess at 
the rowcount.
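
For context, a sketch of the plan shape this thread is about: with separate
indexes on a and b, a query of this form will typically produce a BitmapAnd
node whose row count is the number being debated (table and column names are
hypothetical):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tab WHERE a = 1 AND b = 2;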

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Indirect indexes

2016-10-21 Thread Adam Brusselback
Just throwing an anecdote out there, but my company uses UUID for primary
keys on every table in the DB.  While int4 is for sure more popular, it
would be nice if there weren't even more reasons to "force" people in that
direction.  I know I started regretting the decision to go with UUID
primary keys slightly once I realized that we'd need exclusion constraints,
and you have to jump through hoops to use them together.

My main point is that maybe the reason why most users use int4 pkeys
(besides conventional wisdom) is because it gets the most support from
features like this, and it may not be quite as skewed if that same support
were given to other types.

Just my $0.02
-Adam

On Fri, Oct 21, 2016 at 4:46 PM, Jim Nasby  wrote:

> On 10/19/16 7:52 AM, Robert Haas wrote:
>
>> So, I think that this is a really promising direction, but also that
>> you should try very hard to try to get out from under this 6-byte PK
>> limitation.  That seems really ugly, and in practice it probably means
>> your PK is probably going to be limited to int4, which is kind of sad
>> since it leaves people using int8 or text PKs out in the cold.
>>
>
> My impression is that int4 is by far the most popular PK type. Even if the
> initial implementation is limited to that I think it'd have a lot of
> potential.
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
>


Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Jim Nasby

On 10/21/16 2:02 PM, Alvaro Herrera wrote:

> Merlin Moncure wrote:
>
> > OK, I have some good (very- in the specific case of yours truly) news
> > to report.  Doing a filesystem level copy to a test server I was able
> > to relfilenode swap one of the critical tables over the place of the
> > refilenode of the stored backup.  Not being able know the file to copy
> > from, I figured out the source node by judging the size and using
> > 'strings'  utility.  Data recovery for that table at least appears to
> > be 100%.
>
> FWIW you can use pg_filedump and match based on the number of columns.
> I suppose you could also use the pageinspect extension, by 'dd'ing a
> page from the file into the database and feeding into heap_page_items as
> bytea.


It occurs to me that it might be worth embedding the relation name in 
the free space of the first block. Most people would never notice the 
missing 64 bytes, but it would be incredibly helpful in cases like this...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Indirect indexes

2016-10-21 Thread Jim Nasby

On 10/21/16 2:48 PM, Sven R. Kunze wrote:



> > You don't need that limitation (and vacuum will be simpler) if you add
> > the PK as another key, akin to:
> >
> > CREATE INDIRECT INDEX idx ON tab (a, b, c);
> >
> > turns into
> >
> > CREATE INDEX idx ON tab (a, b, c, pk);
>
> I know I am late to this point but I wanted to present my mere user's
> point of view.
>
> First I liked it, as it does not introduce yet another syntax to learn.


I believe you mis-understood what Claudio was saying. He's not 
suggesting an index with the PK on the end magically becomes an indirect 
index; he was saying that a "simple" way to overcome the 6 byte index 
TID limitation would be to store the PK as part of the index key. He 
used existing DDL to illustrate that, but that was just for 
illustration, not how this would actually be implemented.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Indirect indexes

2016-10-21 Thread Jim Nasby

On 10/19/16 7:52 AM, Robert Haas wrote:

> So, I think that this is a really promising direction, but also that
> you should try very hard to try to get out from under this 6-byte PK
> limitation.  That seems really ugly, and in practice it probably means
> your PK is probably going to be limited to int4, which is kind of sad
> since it leaves people using int8 or text PKs out in the cold.


My impression is that int4 is by far the most popular PK type. Even if 
the initial implementation is limited to that I think it'd have a lot of 
potential.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Proposal for changes to recovery.conf API

2016-10-21 Thread Josh Berkus
On 09/28/2016 10:13 AM, Robert Haas wrote:
> On Tue, Sep 6, 2016 at 10:11 AM, David Steele  wrote:
>> On 9/6/16 8:07 AM, Robert Haas wrote:
>>> On Wed, Aug 31, 2016 at 9:45 PM, Simon Riggs 
>>> wrote:
 Related cleanup
 * Promotion signal file is now called "promote.trigger" rather than
 just "promote"
 * Remove user configurable "trigger_file" mechanism - use
 "promote.trigger" for all cases
>>>
>>>
>>> I'm in favor of this.  I don't think that it's very hard for authors
>>> of backup tools to adapt to this new world, and I don't see that
>>> allowing configurability here does anything other than create more
>>> cases to worry about.
>>
>> +1 from a backup tool author.
> 
> It's time to wrap up this CommitFest, and this thread doesn't seem to
> contain anything that looks like a committable patch.  So, I'm marking
> this "Returned with Feedback".  I hope that the fact that there's been
> no discussion for the last three weeks doesn't mean this effort is
> dead; I would like very much to see it move forward.

Has this gone anywhere?  Given that we're in "break all the things" mode
for PostgreSQL 10, it would be the ideal time to consolidate
recovery.conf with pg.conf.


-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] Remove autovacuum GUC?

2016-10-21 Thread Jim Nasby

On 10/20/16 11:50 PM, Craig Ringer wrote:

> Personally what I think is needed here is to make monitoring and bloat
> visibility not completely suck. So we can warn users if tables haven't
> been vac'd in ages and have recent churn. And so they can easily SELECT
> a view to get bloat estimates with an estimate of how much drift there
> could've been since last vacuum.


+10. I've seen people spend a bunch of time screwing around with the 2 
major "bloat queries", both of which have some issues. And there is *no* 
way to actually quantify whether autovac is keeping up with things or not.
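
A rough sketch of the kind of check people end up writing by hand today,
using only the standard statistics views:

SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;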



> Users turn off vacuum because they cannot see that it is doing anything
> except wasting I/O and cpu. So:
>
> * A TL;DR in the docs saying what vac does and why not to turn it off.
> In particular warning that turning autovac off will make a slow SB get
> slower even though it seems to help at first.


IMHO we should also suggest that users who have periods of lower activity
run a manual vacuum during those windows. That reduces the odds of autovac
ruining your day unexpectedly, as well as allowing it to focus on
high-velocity tables that need more vacuuming and not on huge tables
that just happened to surpass their threshold during a site busy period.



> * A comment in the conf file with the same TL;DR. Comments are free,
> let's use a few lines.
>
> * Warn on startup when autovac is off?


Well, I suspect that someone who thinks autovac=off is a good idea 
probably doesn't monitor their logs either, but it wouldn't hurt.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] FSM corruption leading to errors

2016-10-21 Thread Jim Nasby

On 10/20/16 10:15 PM, Michael Paquier wrote:

> 2) If anything is found, stop the server and delete the files manually.
> 3) Re-start the server.
> OK, that's troublesome and costly for large relations, but we know
> that's the safest way to go for any versions, and there is no need to
> complicate the code with any one-time repairing extensions.


Wouldn't you need to run around to all your replicas and do that as well?


> Speaking of which, I implemented a small extension able to truncate
> the FSM up to the size of the relation as attached, but as I looked at
> it SMGR_TRUNCATE_FSM has been introduced in 9.6 so its range of action
> is rather limited... And I pushed as well a version on github:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
> The limited range of such an extension is an argument good enough to
> just rely on the stop/delete-FSM/start method to fix an instance and
> let VACUUM do the rest of the work. That looks to work but use it at
> your own risk.


Shouldn't the truncation be logged before it's performed?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Indirect indexes

2016-10-21 Thread Sven R. Kunze

On 2016-10-18 20:04:32, Claudio Freire wrote:

> You don't need that limitation (and vacuum will be simpler) if you add
> the PK as another key, akin to:
>
> CREATE INDIRECT INDEX idx ON tab (a, b, c);
>
> turns into
>
> CREATE INDEX idx ON tab (a, b, c, pk);


I know I am late to this point but I wanted to present my mere user's 
point of view.


First I liked it, as it does not introduce yet another syntax to learn. 
However, after following the discussion, I see that indirect indexes 
have their disadvantages/slowdowns as well. If adding "pk" to the end of 
the column list just converts the index to an indirect index, I am 
unable to use a direct index which might be better in certain cases.


So, from a "dumb" user's point of view, I wonder if PostgreSQL can make 
the right decision of direct/indirect reliably (which would be great). 
And if not, what would be the alternatives? Introducing CREATE DIRECT INDEX?


Cheers,
Sven

PS: I'm not saying I would be affected by this but IIRC we have (..., pk) 
indexes in production which then would be converted to indirect ones. 
But I cannot tell whether indirect indexes would do good or harm there.





Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-21 Thread Alvaro Herrera
David Steele wrote:
> On 10/21/16 3:12 AM, David G. Johnston wrote:
> 
> > I have no problem continuing keeping with historical precedent and
> > allowing mnemonic abbreviations in our directory and file names at this
> > point.
> 
> I'm still in favor of pg_xact.  A search of the 9.6 docs brings up a number
> of hits for "xact": pg_last_xact_replay_timestamp(),
> pg_advisory_xact_lock(), pg_advisory_xact_lock_shared(),
> pg_last_committed_xact(),  pg_prepared_xacts(), etc. There are also numerous
> column names that have "xact" in them.

I'm +1 on pg_clog -> pg_xact.

Also +1 to renaming pg_subtrans to pg_subxact.

In general, +1 to short mnemonic meaningful names.  -1 to overly long
names, -1 to meaningless names.

Regarding pg_receivexlog I think I'd propose pg_recvwal rather than
pg_receivewal, because of pg_recvlogical.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Alvaro Herrera
Merlin Moncure wrote:

> OK, I have some good (very- in the specific case of yours truly) news
> to report.  Doing a filesystem level copy to a test server I was able
> to relfilenode swap one of the critical tables over the place of the
> refilenode of the stored backup.  Not being able know the file to copy
> from, I figured out the source node by judging the size and using
> 'strings'  utility.  Data recovery for that table at least appears to
> be 100%.

FWIW you can use pg_filedump and match based on the number of columns.
I suppose you could also use the pageinspect extension, by 'dd'ing a
page from the file into the database and feeding into heap_page_items as
bytea.
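
A sketch of the pageinspect route for a relation you can still name (the
'dd a page in as bytea' variant works the same way, just with the raw page
supplied directly instead of via get_raw_page(); the table name here is
hypothetical):

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM heap_page_items(get_raw_page('some_table', 0));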

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Merlin Moncure
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure  wrote:
> On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner  wrote:
>> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:
>>
>>> Most or all the damage seemed to be to the system catalogs with
>>> at least two critical tables dropped or inaccessible in some
>>> fashion.  A lot of the OIDs seemed to be pointing at the wrong
>>> thing.
>>
>> While the oid in pg_class often matches the filename, that is not
>> true after some operations (like CLUSTER or VACUUM FULL).  It is
>> the relfilenode column that is the definitive link to the file.
>
> no such operations happened.  In the first instance at least one table
> dropped from the system catalogs.   I have a hunch that the heap is
> fine (supported by the size of the database on disk).   At this
> precise moment I'm restoring the database to another fileserver in
> order to do some forensic analysis, also in the hopes of getting the
> second database online in order to expedite recovery.
>
> ah -- done. :-)  deleting the init file didn't help, but starting up
> single user allowed the start up to gracefully fail with a FATAL cache
> lookup.

OK, I have some good (very- in the specific case of yours truly) news
to report.  Doing a filesystem level copy to a test server I was able
to relfilenode swap one of the critical tables over the place of the
refilenode of the stored backup.  Not being able know the file to copy
from, I figured out the source node by judging the size and using
'strings'  utility.  Data recovery for that table at least appears to
be 100%.

For those following along, this simple process is only likely to work
easily if the table contains only system types; no user types, enums,
composites, etc, since those have a unique ID for each data restore.

merlin




Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Merlin Moncure
On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner  wrote:
> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:
>
>> Most or all the damage seemed to be to the system catalogs with
>> at least two critical tables dropped or inaccessible in some
>> fashion.  A lot of the OIDs seemed to be pointing at the wrong
>> thing.
>
> While the oid in pg_class often matches the filename, that is not
> true after some operations (like CLUSTER or VACUUM FULL).  It is
> the relfilenode column that is the definitive link to the file.

no such operations happened.  In the first instance at least one table
dropped from the system catalogs.   I have a hunch that the heap is
fine (supported by the size of the database on disk).   At this
precise moment I'm restoring the database to another fileserver in
order to do some forensic analysis, also in the hopes of getting the
second database online in order to expedite recovery.

ah -- done. :-)  deleting the init file didn't help, but starting up
single user allowed the start up to gracefully fail with a FATAL cache
lookup.

merlin




Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-21 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> I don't think that the problem is that people are accidentally typing
> "pg_resetxlog $PGDATA" and pressing return.  They're typing that on
> purpose, and if you change the sequence of characters required to get
> that effect, they'll just type the new thing instead. The problem is
> that they don't understand when it's appropriate to use that command
> and what the consequences are.

I agree that they don't understand, and that's why I believe that making
the command name, or a required option, a bit more ominous would make
them pause and realize that maybe they want to consider other options
first.

This is not exactly unheard of: apt-get requires an entire phrase to
be entered when you're asking it to do something extremely dangerous
(remove an essential package):

---
root@ionith:~# apt-get remove login
Reading package lists... Done
Building dependency tree   
Reading state information... Done
The following packages will be REMOVED:
  login
WARNING: The following essential packages will be removed.
This should NOT be done unless you know exactly what you are doing!
  login
0 upgraded, 0 newly installed, 1 to remove and 71 not upgraded.
After this operation, 1,212 kB disk space will be freed.
You are about to do something potentially harmful.
To continue type in the phrase 'Yes, do as I say!'
 ?] 
---

My thought would be to make pg_resetxlog do something more along the
lines of what pg_control does and have it, by default, just investigate
the state of things and complain about problems and then have it include
an option to actually reset things with an appropriately ominous name.
The goal there being, primarily, to give a way for users to get
information about why PG isn't starting or what it is complaining about,
with some additional information about what happens if they forcibly
reset the xlog, noting that it could very likely cause corruption, etc.

Thanks!

Stephen




Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-21 Thread Robert Haas
On Fri, Oct 21, 2016 at 9:47 AM, Stephen Frost  wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Thu, Oct 20, 2016 at 12:12 PM, Stephen Frost  wrote:
>> > That said, I'd also like to see a --force or similar option or mechanism
>> > put in place to reduce the risk of users trashing their system because
>> > they think pg_resetwal is "safe." ("It's just gonna reset things to make
>> > the database start again, should be fine.").
>>
>> You know we already have that, right?
>
> Yes, but I was meaning an option which would be required to make
> pg_resetxlog actually *do* anything.  In other words, I'd rather have it
> report some info back to the user, if it's run without the
> '--really-force' or what-have-you option, and only proceed with
> clearing the WAL or rewriting pg_control when '--really-force' is used.

I don't think that the problem is that people are accidentally typing
"pg_resetxlog $PGDATA" and pressing return.  They're typing that on
purpose, and if you change the sequence of characters required to get
that effect, they'll just type the new thing instead. The problem is
that they don't understand when it's appropriate to use that command
and what the consequences are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PSA: Systemd will kill PostgreSQL

2016-10-21 Thread Robert Haas
On Fri, Oct 21, 2016 at 8:29 AM, Peter Eisentraut
 wrote:
> On 8/16/16 1:05 PM, Tom Lane wrote:
>> Oh, interesting.  It had occurred to me that we might be able to dodge
>> this issue if we started to recommend using unnamed POSIX semaphores
>> instead of SysV.  (Obviously we'd want to check performance, but it's
>> at least a plausible alternative.)  I had not wanted to go there if
>> it meant that we could have silent loss of SysV shmem with no other
>> symptoms, because as I said upthread, I'm concerned about that breaking
>> the multiple-postmaster interlock.  However, if the cleanup kills only
>> semaphores and not attached-to shmem, then that objection goes away and
>> this becomes something we should seriously consider.
>
> I was digging around this issue the other day again.  We have switched
> to unnamed POSIX semaphores by default now, which will help.  But for
> dynamic shared memory (DSM) we use POSIX shared memory by default, which
> is cleaned up without regard to attachment.  So there is still a
> potential for failures here, possibly more rare or obscure, given the
> usage of DSM.

The reason I did it that way is because System V shared memory is
often subject to very low limits on how much can be allocated, which
can also produce failures.  It would be easy to switch the default
implementation from POSIX to System V, but I suspect that would be a
loser overall -- in other words, I suspect that if we switched the
default, more people would get hosed by not being able to create those
segments in the first place than are currently getting hosed by having
them removed prematurely.
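
For reference, the default being described here is governed by an existing
GUC, and switching implementations is a one-liner (it takes effect at the
next restart):

SHOW dynamic_shared_memory_type;                       -- 'posix' by default on Linux
ALTER SYSTEM SET dynamic_shared_memory_type = 'sysv';  -- the System V alternative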

Also, POSIX shared memory segments at least on Linux are implemented
as files.  If you remove a file, people who have it open can normally
continue to access it.  So it might work OK as long as the file isn't
removed until after everybody involved in a parallel query has already
attached.  That's a dangerous thing to bet on, though, because the DSM
facility also supports server-lifetime DSMs.  We're not using those
capabilities right now in core, but we might start - e.g. Magnus was
suggesting that we could use DSMs plus Thomas Munro's DSA and DHT
patches to replace the stats collector or the temp files used by
pg_stat_statements.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE

2016-10-21 Thread Stephen Frost
* Jim Nasby (jim.na...@bluetreble.com) wrote:
> On 10/21/16 8:21 AM, Stephen Frost wrote:
> >Counting each page as the relation's average number of tuples per page
> >seems entirely reasonable to me, for what that is trying to report.
> 
> My concern is that still leaves a lot of room for confusion when
> interpreting EXPLAIN ANALYZE. Every other node will tell you exactly
> what happened and it's pretty easy to reason about whether rows
> should have gone up or down based on the type of node. You can't do
> that for Bitmap(And|Or) unless you know the details of how
> TIDBitmaps work. Reporting N/A makes it crystal clear that these
> nodes operate very differently than all the others.

I don't see why you think the numbers reported by BitmapAnd based on
this approach wouldn't go up and down in a similar manner to what you
would expect to get, based on that node type.  Reporting N/A is entirely
punting on it when we have perfectly useful information that can be
reported.

> (On a related note, it would also be nice if we reported fractional
> rows when the row count is low and loops is high.)

I can certainly understand that, though I think I'd rather have an
actual 'total' value or similar instead, but that's really a different
discussion.

> >That said, I'm a big fan of how we have more detail for things like a
> >HashJoin (buckets, batches, memory usage) and it might be nice to have
> >more information like that for a BitmapAnd (and friends).  In
> >particular, I'm thinking of memory usage, exact vs. lossy pages, etc.
> >Knowing that the bitmap has gotten to the point of being lossy might
> >indicate that a user could up work_mem, for example, and possibly avoid
> >recheck costs.
> 
> I think that's the best way to handle this: report N/A in the header
> and then provide details on exact vs lossy. That provides a clear
> indication to users that these kinds of nodes are special, as well
> as a reminder as to why they're special. Certainly the node could
> report an exact rowcount in the header if there were no lossy pages
> too.

I don't see why we would want to stick 'N/A' in for the header, even if
we are reporting the details, when we can provide a pretty reasonable
number.  In particular, I certainly don't think we would want to report
N/A sometimes (lossy case) and then an actual number other times (all
exact case).  That strikes me as much more likely to be confusing.

Thanks!

Stephen




Re: [HACKERS] Default setting for autovacuum_freeze_max_age

2016-10-21 Thread Robert Haas
On Fri, Oct 21, 2016 at 1:17 PM, Josh Berkus  wrote:
> Particularly, with 9.6's freeze map, point (2) is even stronger reason
> to *lower* autovacuum_freeze_max_age.  Since there's little duplicate
> work in a freeze scan, a lot of users will find that frequent freezing
> benefits them a lot ...

That's a very good point, although I hope that vacuum is mostly being
triggered by vacuum_freeze_table_age rather than
autovacuum_freeze_max_age.
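
For reference, the two settings in play and their stock defaults, as a quick
sketch:

SHOW autovacuum_freeze_max_age;  -- 200000000 by default
SHOW vacuum_freeze_table_age;    -- 150000000 by default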

On Bruce's original question, there is an answer written into our
documentation: "Vacuum also allows removal of old files from the
pg_clog subdirectory, which is why the default is a relatively low 200
million transactions."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Index Scans

2016-10-21 Thread Robert Haas
On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila  wrote:
>> I think the parallel_workers reloption should override the degree of
>> parallelism for any sort of parallel scan on that table.  Had I
>> intended it to apply only to sequential scans, I would have named it
>> differently.
>
> I think there is a big difference in the size of the relation to scan
> between a parallel sequential scan and a parallel (range) index scan, which
> could make it difficult for the user to choose the value of this parameter.
> Why do you think that the parallel_workers reloption should suffice for all
> types of scans on a table?  I can only think of providing it on the grounds
> that fewer config knobs make life easier.

Well, we could do that, but it would be fairly complicated and it
doesn't seem to me to be the right place to focus our efforts.  I'd
rather try to figure out some way to make the planner smarter, because
even if users can override the number of workers on a
per-table-per-scan-type basis, they're probably still going to find
using parallel query pretty frustrating unless we make the
number-of-workers formula smarter than it is today.  Anyway, even if
we do decide to add more reloptions than just parallel_degree someday,
couldn't that be left for a separate patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Default setting for autovacuum_freeze_max_age

2016-10-21 Thread Josh Berkus
On 10/21/2016 07:44 AM, Tom Lane wrote:
> Bruce Momjian  writes:
>> Why is autovacuum_freeze_max_age's default set to 200 million, rather
>> than something like 2 billion?  It seems 2 billion is half way to
>> wrap-around and would be a better default.  Right now, the default seems
>> to freeze 10x more often than it has to.
> 
> Please see the archives.  I do not remember the reasoning, but there
> was some, and you need to justify why it was wrong not just assert
> that you think it's silly.

IIRC, there were a couple reasons (and I think they're still good
reasons, which is why I haven't asked to change the default):

1. By setting it to 10% of the max space, we give users plenty of room
to raise the number if they need to without getting into crisis territory.

2. Raising this threshold isn't an unalloyed good.  The longer you wait
to freeze, the more work you'll need to do when autovac freeze rolls
around.  There are actually situations where you want to make this
threshold *lower*, although generally scheduled manual vacuum freezes
serve that.

Particularly, with 9.6's freeze map, point (2) is even stronger reason
to *lower* autovacuum_freeze_max_age.  Since there's little duplicate
work in a freeze scan, a lot of users will find that frequent freezing
benefits them a lot ... especially if they can take advantage of
index-only scans.

-- 
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)




Re: [HACKERS] [BUG] pg_basebackup from disconnected standby fails

2016-10-21 Thread Robert Haas
On Sun, Oct 2, 2016 at 8:52 AM, Michael Paquier
 wrote:
>> So, if I understand correctly, then we can mark the version posted by
>> you upthread [1] which includes a test along with Kyotaro's fix can be
>> marked as Ready for committer.  If so, then please change the status
>> of patch accordingly.
>
> Patch moved to next CF 2016-11, still with status "ready for committer".
>
> IMO, this thread has reached as conclusion to use this patch to
> address the problem reported:
> cab7npqtv5gmkqcndofgtgqoqxz2xlz4rrw247oqojzztvy6...@mail.gmail.com

I spent some time reading this thread today and trying to understand
exactly what's going on here.  Here's my attempt to restate the
problem:

1. If a restartpoint flushes no dirty buffers, then the redo location
advances but the minimum recovery point does not; therefore, the
minimum recovery point can fall behind the redo location.  That's
correct behavior, because if the standby is subsequently restarted, it
can correctly begin at the checkpoint record associated with the
restartpoint and need not replay any further.

2. However, it causes a problem when a base backup with the "include
WAL" option is taken from a standby because -- on standby servers only
-- the end location for a backup as returned by do_pg_stop_backup() is
the minimum recovery point.  If this precedes the start point for the
backup -- which is the restart point -- perform_base_backup() will
conclude that no WAL files are required for the backup and fail an
internal sanity check.

3. Removing the sanity check wouldn't help, because it can never be
right to end up with zero WAL files as a result of a base backup.  At
a minimum, we need the WAL file which contains the checkpoint record
from which the control file specifies that redo should start.
Actually, I think that checkpoint record could be spread across
multiple files: it might be big.  We need them all, but the current
code won't ensure that we get them.

The consensus solution on this thread seems to be that we should have
pg_do_stop_backup() return the last-replayed XLOG location as the
backup end point.  If the control file has been updated with a newer
redo location, then the associated checkpoint record has surely been
replayed, so we'll definitely have enough WAL to replay that
checkpoint record (and we don't need to replay anything else, because
we're supposing that this is the case where the minimum recovery point
precedes the redo location).  Although this will work, it might
include more WAL in the backup than is strictly necessary.  If the
additional WAL replayed beyond the minimum recovery point does NOT
include a checkpoint, we don't need any of it; if it does, we need
only the portion up to and including the last checkpoint record, and
not anything beyond that.

I can think of two solutions that would be "tighter":

1. When performing a restartpoint, update the minimum recovery point
to just beyond the checkpoint record.  I think this can't hurt anyone
who is actually restarting recovery on the same machine, because
that's exactly the point where they are going to start recovery.  A
minimum recovery point which precedes the point at which they will
start recovery is no better than one which is equal to the point at
which they will start recovery.

2. In perform_base_backup(), if the endptr returned by
do_pg_stop_backup() precedes the end of the checkpoint record returned
by do_pg_start_backup(), use the latter value instead.  Unfortunately,
that's not so easy: we can't just say if (endptr < startptr) endptr =
startptr; because startptr is the *start* of the checkpoint record, not
the end.  I suspect somebody could figure out a solution to this
problem, though.
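
To make that concrete, a minimal sketch of the kind of clamping I have in
mind for perform_base_backup(), assuming we somehow had the end of the
starting checkpoint record available (checkpointEndPtr is an invented
variable here; do_pg_start_backup() does not return it today):

/*
 * Hypothetical sketch, not a worked-out fix: never let the backup end
 * location precede the end of the checkpoint record we started from.
 */
if (endptr < checkpointEndPtr)
    endptr = checkpointEndPtr;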

If we decide we don't want to aim for one of these tighter solutions
and just adopt the previously-discussed patch, then at the very least
it needs better comments.  The minimum comment change the patch makes
right now doesn't give any explanation of why the new algorithm is
better than the old, and that's not cool.  Without that, the next
person who comes along to maintain this code won't have any easy way
of knowing that we grappled with this problem and what our conclusions
were.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Default setting for autovacuum_freeze_max_age

2016-10-21 Thread Alvaro Herrera
Bruce Momjian wrote:
> On Fri, Oct 21, 2016 at 10:44:41AM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > Why is autovacuum_freeze_max_age's default set to 200 million, rather
> > > than something like 2 billion?  It seems 2 billion is half way to
> > > wrap-around and would be a better default.  Right now, the default seems
> > > to freeze 10x more often than it has to.
> > 
> > Please see the archives.  I do not remember the reasoning, but there
> > was some, and you need to justify why it was wrong not just assert
> > that you think it's silly.
> 
> I think the reasoning was to avoid checking old clog files, but given
> the tuple transaction status bits (e.g. HEAP_XMIN_COMMITTED) that were
> added long ago, I don't remember why that would still matter either.

HEAP_XMIN_COMMITTED existed way before autovacuum, so that doesn't add
up, does it.  As I recall, the reason was to be able to truncate
pg_clog.  I suppose nowadays it's possible to claim that we're not
really bothered by a gigabyte or two of pg_clog?

*If* we're to raise the default then it should not be to 2 billion.
That gives users no breathing room if they find themselves struggling
with the freezing; with the current default, it's possible to increase
it 2x or 4x if you're in trouble, which gives some breathing room until
a permanent solution is found (better vacuuming).  That disappears if
you set the max to its max.

> I remember asking years ago and not getting a good answer, and giving
> up.

[citation needed]

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Specifying the log file name of pgbench -l option

2016-10-21 Thread Masahiko Sawada
Hi all,

The log file generated by the pgbench -l option has a fixed name,
'pgbench_log.<pid>.<thread ID>', and it's a little complicated for a
script that runs pgbench repeatedly to identify the log file name.
The attached patch makes it possible to specify the log file name.  I think
it's useful for users who want to run pgbench repeatedly from a script
and collect and analyze the results.

The one thing I'm concerned about is that this patch changes the -l option
so that it requires an argument.
But changing its behavior seems better than adding a new option.

Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pgbench_log_file_name.patch



Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-10-21 Thread Robert Haas
On Fri, Oct 21, 2016 at 11:20 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Fri, Oct 21, 2016 at 10:47 AM, Tom Lane  wrote:
>>> dromedary seems to have found one, or at least an unstable test result.
>
>> OK, looking at that now.  Thanks.
>
> Looking at further failures, it looks like 32-bit machines in general
> get that result.  Might be just a cost estimation difference.
>
> Also, some of the windows machines are folding "sqrt(2)" to a different
> constant than is hard-wired into the expected-result file.  That's
> slightly annoying because it calls into question whether we can ship
> floating-point computations to the far end at all :-(.

IMHO, not shipping floating-point computations for that reason would
be more annoying than helpful.  To really guarantee that the remote
and local results are identical, you'd need far more
infrastructure than we have - you'd have to make sure the operating
system collations matched, for example.  And we're already assuming
(without any proof) that the default collations match and, for that
matter, that the datatypes match.  If you're pushing down to
PostgreSQL on the other end, you at least have a hope that the other
side might be sufficiently identical to the local side that the
results will be the same, but if somebody implements these pushdowns
for Oracle or SQL server, you almost might as well give up and go
home.  Even for PostgreSQL, getting the same results in every possible
corner case requires a certain degree of optimism.

For my money, the way to handle that is to add more control over
pushdown rather than automatically disabling it in certain cases.  For
example, we could have GUCs that disable all pushdown or certain types
of pushdown - e.g. you could specify that you don't trust the remote
side to sort data properly, or that you don't like its floating-point
implementation.  I'm not sure what controls are most useful here, but
I'd be willing to bet you a nice glass of white wine that many people
will be happier with a system that they can control than they will be
with one that just disables the optimization in every case where it
might hypothetically not work out.  It's not clear that we can even
reasonably foresee all such cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE

2016-10-21 Thread Jim Nasby

On 10/21/16 8:21 AM, Stephen Frost wrote:

> Counting each page as the relation's average number of tuples per page
> seems entirely reasonable to me, for what that is trying to report.


My concern is that still leaves a lot of room for confusion when 
interpreting EXPLAIN ANALYZE. Every other node will tell you exactly 
what happened and it's pretty easy to reason about whether rows should 
have gone up or down based on the type of node. You can't do that for 
Bitmap(And|Or) unless you know the details of how TIDBitmaps work. 
Reporting N/A makes it crystal clear that these nodes operate very 
differently than all the others.


(On a related note, it would also be nice if we reported fractional rows 
when the row count is low and loops is high.)



> That said, I'm a big fan of how we have more detail for things like a
> HashJoin (buckets, batches, memory usage) and it might be nice to have
> more information like that for a BitmapAnd (and friends).  In
> particular, I'm thinking of memory usage, exact vs. lossy pages, etc.
> Knowing that the bitmap has gotten to the point of being lossy might
> indicate that a user could up work_mem, for example, and possibly avoid
> recheck costs.


I think that's the best way to handle this: report N/A in the header and 
then provide details on exact vs lossy. That provides a clear indication 
to users that these kinds of nodes are special, as well as a reminder as 
to why they're special. Certainly the node could report an exact 
rowcount in the header if there were no lossy pages too.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-10-21 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 21, 2016 at 10:47 AM, Tom Lane  wrote:
>> dromedary seems to have found one, or at least an unstable test result.

> OK, looking at that now.  Thanks.

Looking at further failures, it looks like 32-bit machines in general
get that result.  Might be just a cost estimation difference.

Also, some of the windows machines are folding "sqrt(2)" to a different
constant than is hard-wired into the expected-result file.  That's
slightly annoying because it calls into question whether we can ship
floating-point computations to the far end at all :-(.

regards, tom lane




Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-10-21 Thread Jim Nasby

On 10/21/16 8:47 AM, Tom Lane wrote:

>> It seems to me that you'd even want to make the drop of orphaned
>> tables mandatory once it is detected, even if it is not a wraparound
>> autovacuum.
>
> If we are willing to do that then we don't really have to solve the
> problem on the backend side.  One could expect that autovacuum would
> clean things up within a few minutes after a backend failure.


Unless all the autovac workers are busy working on huge tables... maybe 
a delay of several hours/days is OK in this case, but it's not wise to 
assume autovac will always get to something within minutes.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-10-21 Thread Robert Haas
On Fri, Oct 21, 2016 at 10:47 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> I didn't find anything structurally wrong with this patch, so I've
>> committed it with many cosmetic changes.  Judging by what happened
>> with join pushdown, there are probably some residual bugs, but
>> hopefully not too many.
>
> dromedary seems to have found one, or at least an unstable test result.

OK, looking at that now.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Default setting for autovacuum_freeze_max_age

2016-10-21 Thread Bruce Momjian
On Fri, Oct 21, 2016 at 10:44:41AM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Why is autovacuum_freeze_max_age's default set to 200 million, rather
> > than something like 2 billion?  It seems 2 billion is half way to
> > wrap-around and would be a better default.  Right now, the default seems
> > to freeze 10x more often than it has to.
> 
> Please see the archives.  I do not remember the reasoning, but there
> was some, and you need to justify why it was wrong not just assert
> that you think it's silly.

I think the reasoning was to avoid checking old clog files, but given the
tuple transaction status bits (e.g. HEAP_XMIN_COMMITTED) that were added
long ago, I don't remember why that would still matter either.  I remember
asking years ago, not getting a good answer, and giving up.

If no one can give an answer, I suggest we change the default.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-10-21 Thread Tom Lane
Robert Haas  writes:
> I didn't find anything structurally wrong with this patch, so I've
> committed it with many cosmetic changes.  Judging by what happened
> with join pushdown, there are probably some residual bugs, but
> hopefully not too many.

dromedary seems to have found one, or at least an unstable test result.

regards, tom lane




Re: [HACKERS] Default setting for autovacuum_freeze_max_age

2016-10-21 Thread Tom Lane
Bruce Momjian  writes:
> Why is autovacuum_freeze_max_age's default set to 200 million, rather
> than something like 2 billion?  It seems 2 billion is half way to
> wrap-around and would be a better default.  Right now, the default seems
> to freeze 10x more often than it has to.

Please see the archives.  I do not remember the reasoning, but there
was some, and you need to justify why it was wrong not just assert
that you think it's silly.

regards, tom lane




Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-10-21 Thread Robert Haas
On Thu, Oct 20, 2016 at 5:38 AM, Jeevan Chalke
 wrote:
> Changes look good to me.
> Thanks for the detailed review.

I didn't find anything structurally wrong with this patch, so I've
committed it with many cosmetic changes.  Judging by what happened
with join pushdown, there are probably some residual bugs, but
hopefully not too many.  Anyhow, I don't think waiting longer to
commit this is necessarily going to help, so let's see what happens...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-21 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Oct 20, 2016 at 12:12 PM, Stephen Frost  wrote:
> > That said, I'd also like to see a --force or similar option or mechanism
> > put in place to reduce the risk of users trashing their system because
> > they think pg_resetwal is "safe." ("It's just gonna reset things to make
> > the database start again, should be fine.").
> 
> You know we already have that, right?

Yes, but I was meaning an option which would be required to make
pg_resetxlog actually *do* anything.  In other words, I'd rather have it
report some info back to the user, if it's run without the
'--really-force' or what-have-you option, and only proceed with
clearing the WAL or rewriting pg_control when '--really-force' is used.

> > pg_destroydb almost seems like a better choice, though I suppose
> > 'pg_clearwal' would be more acceptable.  Doesn't have quite the same
> > impact though.
> >
> > Not sure on the best answer here, but it's definitely foot-gun that some
> > users have ended up using on themselves with depressing regularity.
> 
> Just to provide some perspective from the other side of this, I
[...]

I wasn't suggesting that we remove the capability.  There are certainly
use-cases for it, but, unfortunately, I've seen a number of cases where
users simply google'd an error that they got back when trying to start
PG and found someone saying "well, I got this error, but then I ran
pg_resetxlog, and now the database starts up again."

It likely doesn't help that the top links tend to be to mailing list
archives where pg_resetxlog was brought up.

Thanks!

Stephen




Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-10-21 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Oct 20, 2016 at 9:30 PM, Constantin S. Pan  wrote:
>> I tried to fix the problem with a new backend not being
>> able to reuse a temporary namespace when it contains
>> thousands of temporary tables. I disabled locking of objects
>> during namespace clearing process. See the patch attached.
>> Please tell me if there are any reasons why this is wrong.

> That's invasive.

Invasive or otherwise, it's *completely unacceptable*.  Without a lock
you have no way to be sure that nothing else is touching the table.

A less broken approach might be to split the cleanup into multiple shorter
transactions, that is, after every N objects stop and commit what you've
done so far.  This shouldn't be that hard to do during backend exit, as
I'm pretty sure we're starting a new transaction just for this purpose
anyway.  I don't know if it'd be possible to do it during the automatic
cleanup when glomming onto a pre-existing temp namespace, because we're
already within a user-started transaction at that point.  But if we solve
the problem where it's being created, maybe that's enough for now.
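
Roughly, I'm imagining something like this for the exit-time cleanup (a
sketch only: GetNextTempNamespaceObject() is a made-up iterator over the
namespace contents, and the batch size is arbitrary):

#define TEMP_CLEANUP_BATCH_SIZE 128

ObjectAddress object;
int         ndropped = 0;

StartTransactionCommand();
while (GetNextTempNamespaceObject(tempNamespaceId, &object))
{
    performDeletion(&object, DROP_CASCADE, PERFORM_DELETION_INTERNAL);

    /* stop and commit what we've done so far, releasing its locks */
    if (++ndropped % TEMP_CLEANUP_BATCH_SIZE == 0)
    {
        CommitTransactionCommand();
        StartTransactionCommand();
    }
}
CommitTransactionCommand();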

>> I also added a GUC variable and changed the condition in
>> autovacuum to let it nuke orphan tables on its way.
>> See another patch attached.

> It seems to me that you'd even want to make the drop of orphaned
> tables mandatory once it is detected, even if it is not a wraparound
> autovacuum.

If we are willing to do that then we don't really have to solve the
problem on the backend side.  One could expect that autovacuum would
clean things up within a few minutes after a backend failure.  We'd
have to be really darn sure that that "orphaned backend" test can
never have any false positives, though.  I'm not sure that it was
ever designed to be race-condition-proof.

regards, tom lane




Re: [HACKERS] Parallel Index Scans

2016-10-21 Thread Amit Kapila
On Thu, Oct 20, 2016 at 10:33 PM, Robert Haas  wrote:
> On Wed, Oct 19, 2016 at 11:07 PM, Amit Kapila  wrote:
>>> Ideally, the parallel_workers storage parameter will rarely be
>>> necessary because the optimizer will generally do the right thing in
>>> all case.
>>
>> Yeah, we can choose not to provide any parameter for parallel index
>> scans, but some users might want to have a parameter similar to
>> parallel table scans, so it could be handy for them to use.
>
> I think the parallel_workers reloption should override the degree of
> parallelism for any sort of parallel scan on that table.  Had I
> intended it to apply only to sequential scans, I would have named it
> differently.
>

I think there is a big difference in the size of the relation to scan
between a parallel sequential scan and a parallel (range) index scan, which
could make it difficult for the user to choose the value of this parameter.
Why do you think that the parallel_workers reloption should suffice for all
types of scans on a table?  I can only think of providing it on the grounds
that fewer config knobs make life easier.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Improve output of BitmapAnd EXPLAIN ANALYZE

2016-10-21 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Changing it in a new major release seems entirely reasonable.
> 
> It's still a crock though.  I wonder whether it wouldn't be better to
> change the nodeBitmap code so that when EXPLAIN ANALYZE is active,
> it expends extra effort to try to produce a rowcount number.

I'm certainly all for doing something better, just didn't think that we
should be worried about making a change to the EXPLAIN ANALYZE output in
a major release because Depesz might have to update the explain site.

> We could certainly run through the result bitmap and count the number
> of exact-TID bits.  I don't see a practical way of doing something
> with lossy page bits, but maybe those occur infrequently enough
> that we could ignore them?  Or we could arbitrarily decide that
> a lossy page should be counted as MaxHeapTuplesPerPage, or a bit
> less arbitrarily, count it as the relation's average number
> of tuples per page.

Counting each page as the relation's average number of tuples per page
seems entirely reasonable to me, for what that is trying to report.
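
Something along these lines, in other words (a sketch only; the exact-tuple
and lossy-page counts are assumed inputs here, however the TIDBitmap code
would end up providing them):

/*
 * Estimate the rowcount for a bitmap node: exact TID bits are counted
 * directly, and each lossy page is charged the relation's average
 * number of tuples per page.
 */
static double
estimate_bitmap_rows(double exact_tuples, double lossy_pages,
                     double rel_pages, double rel_tuples)
{
    double      tuples_per_page;

    if (rel_pages <= 0)
        return exact_tuples;

    tuples_per_page = rel_tuples / rel_pages;
    return exact_tuples + lossy_pages * tuples_per_page;
}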

That said, I'm a big fan of how we have more detail for things like a
HashJoin (buckets, batches, memory usage) and it might be nice to have
more information like that for a BitmapAnd (and friends).  In
particular, I'm thinking of memory usage, exact vs. lossy pages, etc.
Knowing that the bitmap has gotten to the point of being lossy might
indicate that a user could up work_mem, for example, and possibly avoid
recheck costs.

Thanks!

Stephen




[HACKERS] Default setting for autovacuum_freeze_max_age

2016-10-21 Thread Bruce Momjian
Why is autovacuum_freeze_max_age's default set to 200 million, rather
than something like 2 billion?  It seems 2 billion is half way to
wrap-around and would be a better default.  Right now, the default seems
to freeze 10x more often than it has to.

Does it default to 200 million so clog can be trimmed?  Is that
reasonable?  We have tuple status flags of commit status so I assume
changing from a normal xid to a frozen one doesn't have a performance
benefit, does it?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] emergency outage requiring database restart

2016-10-21 Thread Kevin Grittner
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure  wrote:

> Most or all the damage seemed to be to the system catalogs with
> at least two critical tables dropped or inaccessible in some
> fashion.  A lot of the OIDs seemed to be pointing at the wrong
> thing.

While the oid in pg_class often matches the filename, that is not
true after some operations (like CLUSTER or VACUUM FULL).  It is
the relfilenode column that is the definitive link to the file.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [RFC] Transaction management overhaul is necessary?

2016-10-21 Thread Craig Ringer
On 21 October 2016 at 18:57, Pavel Stehule  wrote:
> 2016-10-21 10:24 GMT+02:00 Tsunakawa, Takayuki
> :
>>
>> Hello,
>>
>> From our experience in handling customers' problems, I feel it's necessary
>> to evolve PostgreSQL's transaction management.  The concrete problems are:
>>
>> 1. PostgreSQL cannot end and begin transactions in PL/pgSQL and PL/Java
>> stored functions.
>> This is often the reason people could not migrate to PostgreSQL.

I've run into relatively few for whom this has landed up being a
showstopper, but I agree it's a pain.

There'll probably be more as bigger outfits seek to move from That
Other Database where it's routine to do this.

>> This was because psqlODBC starts and ends a subtransaction for each SQL
>> statement by default to implement statement-level rollback.  And PostgreSQL
>> creates one CurTransactionContext memory context, which is 8KB, for each
>> subtransaction and retain them until the top transaction ends.

Surely that's where to start then. Find a way to pool and re-use,
fully release, or otherwise be done with transaction contexts for
released savepoints.

>>  The total
>> memory used becomes 40GB (8KB * 5 million subtransactions.)  This was
>> avoided by setting the Protocol parameter to 7.4-1, which means
>> transaction-level rollback.

You can control transaction level rollback in psqlODBC directly. You
do not need to fall back to the old protocol. Check the driver
options.

>> The savepoint approach for supporting statement-level rollback is
>> inefficient, because it adds two roundtrips (SAVEPOINT and RELEASE) for each
>> statement.

Right. We can't just fire off each statement wrapped in SAVEPOINT and
RELEASE SAVEPOINT because we need to get the result of the statement
and decide whether to ROLLBACK TO SAVEPOINT or RELEASE SAVEPOINT. It
only requires two round trips if you shove the SAVEPOINT in with the
intended statement, but it's still messy.
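
For reference, the dance being emulated looks roughly like this with libpq
(a sketch only, assuming an already-open transaction, and issuing the
SAVEPOINT as a separate command for clarity):

#include <libpq-fe.h>
#include <stdbool.h>

/* Run one statement with statement-level rollback semantics. */
static bool
exec_with_stmt_rollback(PGconn *conn, const char *sql)
{
    PGresult   *res;
    bool        ok;

    PQclear(PQexec(conn, "SAVEPOINT pg_stmt"));

    res = PQexec(conn, sql);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK ||
          PQresultStatus(res) == PGRES_TUPLES_OK);
    PQclear(res);

    if (ok)
        PQclear(PQexec(conn, "RELEASE SAVEPOINT pg_stmt"));
    else
    {
        /* undo this statement only; the surrounding tx stays usable */
        PQclear(PQexec(conn, "ROLLBACK TO SAVEPOINT pg_stmt"));
        PQclear(PQexec(conn, "RELEASE SAVEPOINT pg_stmt"));
    }
    return ok;
}

Every successful statement still pays for the extra SAVEPOINT/RELEASE
commands, which is exactly the overhead a protocol-level subtransaction
message could avoid.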

I'd like to see an alternative statement with semantics more akin to
COMMIT - one which automatically turns into ROLLBACK if the tx is aborted.
COMMIT SAVEPOINT would be too confusing since it's not truly
committed. I don't know what to call it. But basically something that
does RELEASE SAVEPOINT [named savepoint] unless the subxact is in
aborted state, in which case it does ROLLBACK TO [named savepoint].
Bonus points for letting it remember the last savepoint created and
use that.

Furthermore, we should really add it on the protocol level so drivers
can send subtransaction control messages more compactly, without
needing to go through the parser etc, and without massively spamming
the logs. For this purpose savepoint names would be internally
generated so the driver wouldn't have to send them. We'd log savepoint
boundaries when transaction logging was enabled. Since the client
would send the first such protocol request we could do it on the sly
without a protocol version bump; clients could just check server
version and not use the new messages for older servers. If they send
it to an older server they get a protocol error, which is fine.

> You should implement a CALL statement that can be independent of the outer
> transaction. The behavior inside a procedure called by a CALL statement
> should be the same as on the client side - there you can control
> transactions explicitly, without nesting.

I agree that'd be desirable. Top level "procedures" are necessary for
this, really.

This would also enable us to return multiple result sets.

We'd probably have to start at least one small read-only tx for the
initial cache access to look up the proc and set everything up, but if
we don't allocate xids local transactions are super cheap.

However, I think trying to tackle the memory context bloat reported
upthread would be a more effective starting point since it immediately
targets the problem actually experienced.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PSA: Systemd will kill PostgreSQL

2016-10-21 Thread Peter Eisentraut
On 8/16/16 1:05 PM, Tom Lane wrote:
> Oh, interesting.  It had occurred to me that we might be able to dodge
> this issue if we started to recommend using unnamed POSIX semaphores
> instead of SysV.  (Obviously we'd want to check performance, but it's
> at least a plausible alternative.)  I had not wanted to go there if
> it meant that we could have silent loss of SysV shmem with no other
> symptoms, because as I said upthread, I'm concerned about that breaking
> the multiple-postmaster interlock.  However, if the cleanup kills only
> semaphores and not attached-to shmem, then that objection goes away and
> this becomes something we should seriously consider.

I was digging around this issue the other day again.  We have switched
to unnamed POSIX semaphores by default now, which will help.  But for
dynamic shared memory (DSM) we use POSIX shared memory by default, which
is cleaned up without regard to attachment.  So there is still a
potential for failures here, possibly more rare or obscure, given the
usage of DSM.

(If someone is keeping score, it appears the "safest" combination is
SysV shared memory + POSIX semaphores.)

I have started a wiki page to collect this information:
https://wiki.postgresql.org/wiki/Systemd

To be continued, I suppose ...

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_basebackup stream xlog to tar

2016-10-21 Thread Michael Paquier
On Mon, Oct 17, 2016 at 2:37 PM, Michael Paquier
 wrote:
> Except that it looks in pretty good to me, so I am switching that as
> ready for committer.

+   /*
+* Create pg_xlog/archive_status (and thus pg_xlog) so we can write to
+* basedir/pg_xlog as the directory entry in the tar file may arrive
+* later.
+*/
+   snprintf(statusdir, sizeof(statusdir), "%s/pg_xlog/archive_status",
+basedir);

This part conflicts with f82ec32, where you need to make pg_basebackup
aware of the backend version.  I promise that's the last conflict, at
least I don't have more patches planned in the area.
-- 
Michael




Re: [HACKERS] Stopping logical replication protocol

2016-10-21 Thread Vladimir Gordiychuk
Craig, Andres, what do you think about the previous message?

2016-10-06 0:35 GMT+03:00 Vladimir Gordiychuk :

> > Vladimir? I'm running out of time to spend on this at least until the next
> CF. Think you can make these changes?
>
> Yes, I can. But I think it should be discussed first.
>
> > Terminating COPY BOTH with ERRCODE_QUERY_CANCELED seems fine. If it
> > was expecting the error the client can Sync and do whatever it next
> > wants to do on the walsender protocol, or bail out nicely.
>
> Do you want to return CopyFail with ERRCODE_QUERY_CANCELED instead of
> CopyDone on a client-initiated stop of replication?
>
> > Hm, I'm *VERY* doubtful about doing this via yet another callback. Why
> > don't we just error out in the prepare write callback?
>
> But what about the scenario where the output plugin collects changes in
> memory, applies some transformation, and sends them only inside commit_cb?
> In that case OutputPluginPrepareWrite will not execute until the end of the
> transaction, and we face an overly long stall when stopping replication
> while decoding a huge transaction.
>
> 2016-10-05 13:06 GMT+03:00 Craig Ringer :
>
>> On 5 October 2016 at 05:14, Andres Freund  wrote:
>> > Hi,
>> >
>> > On 2016-09-23 13:01:27 +0800, Craig Ringer wrote:
>> >> From f98f2388c57d938ebbe07ccd2dbe02138312858f Mon Sep 17 00:00:00 2001
>> >> From: Vladimir Gordiychuk 
>> >> Date: Wed, 7 Sep 2016 00:39:18 +0300
>> >> Subject: [PATCH 2/4] Client-initiated CopyDone during transaction
>> decoding in
>> >>  walsender
>> >>
>> >> The prior patch caused the walsender to react to a client-initiated
>> >> CopyDone while it's in the WalSndLoop. That's not enough to let clients
>> >> end logical decoding mid-transaction because we loop in
>> ReorderBufferCommit
>> >> during decoding of a transaction without ever returning to WalSndLoop.
>> >>
>> >> Allow breaking out of ReorderBufferCommit by detecting that client
>> >> input has requested an end to COPY BOTH mode, so clients can abort
>> >> decoding mid-transaction.
>> >
>> > Hm, I'm *VERY* doubtful about doing this via yet another callback. Why
>> > don't we just error out in the prepare write callback?
>>
>> That's sensible.
>>
>> > I don't think it's a good idea to return a non-error state if a command
>> > is interrupted in the middle. We might already have sent a partial
>> > transaction or such.   Also compare this to interrupting a query - we
>> > don't just returning rows, we error out.
>>
>> Good point. It's not usually visible to the user because if it's a
>> client-initiated cancel the client will generally consume the error,
>> but there's still an error generated.
>>
>> Terminating COPY BOTH with ERRCODE_QUERY_CANCELED seems fine. If it
>> was expecting the error the client can Sync and do whatever it next
>> wants to do on the walsender protocol, or bail out nicely.
>>
>> Vladimir? I'm running out of time to spend on this at least until the
>> next CF. Think you can make these changes?
>>
>>
>> --
>>  Craig Ringer   http://www.2ndQuadrant.com/
>>  PostgreSQL Development, 24x7 Support, Training & Services
>>
>
>


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-21 Thread Amit Kapila
On Fri, Oct 21, 2016 at 1:07 PM, Tomas Vondra
 wrote:
> On 10/21/2016 08:13 AM, Amit Kapila wrote:
>>
>> On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas 
>> wrote:
>>>
>>> On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
>>>  wrote:
>
> I then started a run at 96 clients which I accidentally killed shortly
> before it was scheduled to finish, but the results are not much
> different; there is no hint of the runaway CLogControlLock contention
> that Dilip sees on power2.
>
 What shared_buffer size were you using? I assume the data set fit into
 shared buffers, right?
>>>
>>>
>>> 8GB.
>>>
 FWIW as I explained in the lengthy post earlier today, I can actually
 reproduce the significant CLogControlLock contention (and the patches do
 reduce it), even on x86_64.
>>>
>>>
>>> /me goes back, rereads post.  Sorry, I didn't look at this carefully
>>> the first time.
>>>
 For example consider these two tests:

 * http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
 * http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

 However, it seems I can also reproduce fairly bad regressions, like for
 example this case with data set exceeding shared_buffers:

 * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
>>>
>>>
>>> I'm not sure how seriously we should take the regressions.  I mean,
>>> what I see there is that CLogControlLock contention goes down by about
>>> 50% -- which is the point of the patch -- and WALWriteLock contention
>>> goes up dramatically -- which sucks, but can't really be blamed on the
>>> patch except in the indirect sense that a backend can't spend much
>>> time waiting for A if it's already spending all of its time waiting
>>> for B.
>>>
>>
>> Right, I think not only WALWriteLock, but contention on other locks
>> also goes up, as you can see in the table below.  I think there is not
>> much we can do about that with this patch.  One thing which is unclear
>> is why the unlogged tests are showing WALWriteLock at all.
>>
>
> Well, although we don't write the table data to the WAL, we still need to
> write commits and other stuff, right?
>

We do need to write the commit record, but do we need to flush it
immediately to WAL for unlogged tables?  It seems we allow the WAL writer
to do that instead; see the logic in RecordTransactionCommit.
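
From memory, the relevant bit of RecordTransactionCommit() boils down to
something like this (a simplified paraphrase, not the literal code;
"wrote_xlog" stands for "this transaction wrote WAL", and the real test
involves a few more conditions):

if (wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
    XLogFlush(XactLastRecEnd);              /* wait for WAL to reach disk */
else
    XLogSetAsyncXactLSN(XactLastRecEnd);    /* WAL writer flushes it later */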

> And on scale 3000 (which exceeds the
> 16GB shared buffers in this case), there's a continuous stream of dirty
> pages (not to WAL, but evicted from shared buffers), so iostat looks like
> this:
>
>   timetps  wr_sec/s  avgrq-sz  avgqu-sz await   %util
>   08:48:21  81654   1367483 16.75 127264.60   1294.80   97.41
>   08:48:31  41514697516 16.80 103271.11   3015.01   97.64
>   08:48:41  78892   1359779 17.24  97308.42928.36   96.76
>   08:48:51  58735978475 16.66  92303.00   1472.82   95.92
>   08:49:01  62441   1068605 17.11  78482.71   1615.56   95.57
>   08:49:11  55571945365 17.01 113672.62   1923.37   98.07
>   08:49:21  69016   1161586 16.83  87055.66   1363.05   95.53
>   08:49:31  54552913461 16.74  98695.87   1761.30   97.84
>
> That's ~500-600 MB/s of continuous writes. I'm sure the storage could handle
> more than this (will do some testing after the tests complete), but surely
> the WAL has to compete for bandwidth (it's on the same volume / devices).
> Another thing is that we only have 8 WAL insert locks, and maybe that leads
> to contention with such high client counts.
>

Yeah, quite possible, but I don't think increasing that would benefit
in general, because while writing WAL we need to take all the
wal_insert locks. In any case, I think that is a separate problem to
study.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [RFC] Transaction management overhaul is necessary?

2016-10-21 Thread Pavel Stehule
2016-10-21 10:24 GMT+02:00 Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com>:

> Hello,
>
> From our experience in handling customers' problems, I feel it's necessary
> to evolve PostgreSQL's transaction management.  The concrete problems are:
>
> 1. PostgreSQL cannot end and begin transactions in PL/pgSQL and PL/Java
> stored functions.
> This is often the reason people could not migrate to PostgreSQL.


>
> 2. PostgreSQL does not support statement-level rollback.
> When some customer ran a batch app using psqlODBC, one postgres process
> used dozens of GBs of memory and crashed the OS.  The batch app prepares
> some SQL statements with parameters, execute it five millions of times with
> different parameter values in a single transaction.  They didn't experience
> a problem with Oracle.
>
> This was because psqlODBC starts and ends a subtransaction for each SQL
> statement by default to implement statement-level rollback.  And PostgreSQL
> creates one CurTransactionContext memory context, which is 8KB, for each
> subtransaction and retain them until the top transaction ends.  The total
> memory used becomes 40GB (8KB * 5 million subtransactions.)  This was
> avoided by setting the Protocol parameter to 7.4-1, which means
> transaction-level rollback.
>
> The savepoint approach for supporting statement-level rollback is
> inefficient, because it adds two roundtrips (SAVEPOINT and RELEASE) for
> each statement.
>
>
>
> I know autonomous transaction is also discussed, which seems to be
> difficult, so I hope some kind of transaction management overhaul can be
> discussed to cover all these transaction-related features.  How should I
> start?  I found the following item in the TODO list (but I haven't read it
> yet.)  What other discussions should I look at?
>

You should implement a CALL statement that can be independent of the outer
transaction. The behavior inside a procedure called by a CALL statement
should be the same as on the client side - there you can control
transactions explicitly, without nesting.

Regards

Pavel


>
> --
> Implement stored procedures
> This might involve the control of transaction state and the return of
> multiple result sets
> PL/pgSQL stored procedure returning multiple result sets (SELECTs)?
> Proposal: real procedures again (8.4)
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg00542.php
> Gathering specs and discussion on feature (post 9.1)
> --
>
>
> Regards
> Takayuki Tsunakawa
>
>
>
>


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-21 Thread Amit Kapila
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila  wrote:
> On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan  wrote:
>> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila  wrote:
>>
>> I read the following paragraph from the Volcano paper just now:
>>
>> """
>> During implementation and benchmarking of parallel sorting, we added
>> two more features to exchange. First, we wanted to implement a merge
>> network in which some processors produce sorted streams merge
>> concurrently by other processors. Volcano’s sort iterator can be used
>> to generate a sorted stream. A merge iterator was easily derived from
>> the sort module. It uses a single level merge, instead of the cascaded
>> merge of runs used in sort. The input of a merge iterator is an
>> exchange. Differently from other operators, the merge iterator
>> requires to distinguish the input records by their producer. As an
>> example, for a join operation it does not matter where the input
>> records were created, and all inputs can be accumulated in a single
>> input stream. For a merge operation, it is crucial to distinguish the
>> input records by their producer in order to merge multiple sorted
>> streams correctly.
>> """
>>
>> I don't really understand this paragraph, but thought I'd ask: why the
>> need to "distinguish the input records by their producer in order to
>> merge multiple sorted streams correctly"? Isn't that talking about
>> partitioning, where each workers *ownership* of a range matters?
>>
>
> I think so, but it seems from the above text that it is mainly required
> for the merge iterator, which will probably be used in a merge join.
>
>> My
>> patch doesn't care which values belong to which workers. And, it
>> focuses quite a lot on dealing well with the memory bandwidth bound,
>> I/O bound part of the sort where we write out the index itself, just
>> by piggy-backing on tuplesort.c. I don't think that that's useful for
>> a general-purpose executor node -- tuple-at-a-time processing when
>> fetching from workers would kill performance.
>>
>
> Right, but what is written in the text you quoted seems to be doable
> with tuple-at-a-time processing.
>

To be clear, by saying the above I don't mean that we should try that
approach instead of what you are proposing, but it is worth some
discussion to see whether it has any significant merit.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-21 Thread Amit Kapila
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan  wrote:
> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila  wrote:
>
> I read the following paragraph from the Volcano paper just now:
>
> """
> During implementation and benchmarking of parallel sorting, we added
> two more features to exchange. First, we wanted to implement a merge
> network in which some processors produce sorted streams merge
> concurrently by other processors. Volcano’s sort iterator can be used
> to generate a sorted stream. A merge iterator was easily derived from
> the sort module. It uses a single level merge, instead of the cascaded
> merge of runs used in sort. The input of a merge iterator is an
> exchange. Differently from other operators, the merge iterator
> requires to distinguish the input records by their producer. As an
> example, for a join operation it does not matter where the input
> records were created, and all inputs can be accumulated in a single
> input stream. For a merge operation, it is crucial to distinguish the
> input records by their producer in order to merge multiple sorted
> streams correctly.
> """
>
> I don't really understand this paragraph, but thought I'd ask: why the
> need to "distinguish the input records by their producer in order to
> merge multiple sorted streams correctly"? Isn't that talking about
> partitioning, where each workers *ownership* of a range matters?
>

I think so, but it seems from the above text that it is mainly required
for the merge iterator, which will probably be used in a merge join.

> My
> patch doesn't care which values belong to which workers. And, it
> focuses quite a lot on dealing well with the memory bandwidth bound,
> I/O bound part of the sort where we write out the index itself, just
> by piggy-backing on tuplesort.c. I don't think that that's useful for
> a general-purpose executor node -- tuple-at-a-time processing when
> fetching from workers would kill performance.
>

Right, but what is written in the text you quoted seems to be doable
with tuple-at-a-time processing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-21 Thread Amit Kapila
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan  wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas  wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might proceed whichever tuple it chooses to
>> emit.

Right. Now, after again looking at Gather Merge patch, I think I can
better understand how it performs merging.

>>  However, it doesn't need to wait until all of the workers are
>> completely done.  The leader only needs to be at least slightly ahead
>> of the slowest worker.  I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general.  I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't.  If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match.  But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.
>

The tuple queue mechanism might not be super-efficient for *batch output*
(cases where many tuples need to be read and written), but I see no
reason why it would be slower than the disk I/O which I think you are
using in the patch.  IIUC, in the patch each worker, including the
leader, does a tape sort for its share of the tuples, and then finally
the leader merges them and populates the index.  I am not sure the
mechanism used in the patch is preferable to using a tuple queue if the
workers can finish their part of the sorting in memory.

>
> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).
>

I think there could be a simpler way: we can force the Gather Merge
node when all the tuples need to be sorted and measure the time until
it has merged all the tuples.  Similarly, with your patch, we can wait
until the final merge is completed.  However, after an initial study of
both patches, I feel one can construct cases where Gather Merge wins
and also cases where your patch wins.  In particular, Gather Merge can
win where the workers can perform their sorts mostly in memory.  I am
not sure it's easy to get the best of both worlds.

Your patch needs a rebase, and I noticed one warning:
sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-10-21 Thread Constantin S. Pan
On Fri, 21 Oct 2016 14:29:24 +0900
Michael Paquier  wrote:

> That's invasive. I am wondering if a cleaner approach here would be a
> flag in deleteOneObject() that performs the lock cleanup, as that's
> what you are trying to solve here.

The problem occurs earlier, at the findDependentObjects step. All the
objects inside the namespace are being locked before any of them gets
deleted, which leads to the "too many locks" condition.

Cheers,
Constantin Pan




Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-21 Thread Masahiko Sawada
On Mon, Oct 17, 2016 at 10:04 PM, Kevin Grittner  wrote:
> On Sun, Oct 16, 2016 at 9:26 PM, Masahiko Sawada  
> wrote:
>
>> When I set old_snapshot_threshold = 0 I got the error at step #3, which
>> means that the error occurred without table pruning.
>
> The "snapshot too old" error can happen without pruning, but only
> because there is no way to tell the difference between a page that
> has been pruned since the snapshot was taken and a page which has
> had some other kind of modification since the snapshot was taken.
>
> Ignoring false positives for a moment (where the page is updated by
> something other than pruning), what is required for early pruning
> is that the snapshot has expired (which due to "rounding" and
> avoidance of locking could easily take up to a minute or two more
> than the old_snapshot_threshold setting) and then there is page
> pruning due to a vacuum or just HOT pruning from a page read.  At
> some point after that, a read which is part of returning data to
> the user (e.g., not just positioning for index modification) can
> see that the snapshot is too old and that the LSN for the page is
> past the snapshot LSN.  That is when you get the error.
>> We have regression test for this feature but it sets
>> old_snapshot_threshold = 0, I doubt about we can test it properly.
>> Am I missing something?
>
> This is a hard feature to test properly, and certainly hard to test
> without the test running for a long time.  The zero setting is
> really not intended to be used in production, but only to allow
> some half-way decent testing that doesn't take extreme lengths of
> time.  If you add some delays of a few minutes each at key points
> in a test, you should be able to get a test that works with a
> setting of 1min.  It is not impossible that we might need to add a
> memory barrier to one or two places to get such tests to behave
> consistently, but I have not been able to spot where, if anywhere,
> that would be.


Thank you for the explanation! I understand now.
When old_snapshot_threshold = 0, it skips allocating the shared memory
area for the xid array and skips some of the logic in order to avoid
using that shared memory, which is why I was a little concerned.
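
For the archives, here is a minimal sketch of the kind of test I ran (table
and column names are made up; with a nonzero threshold you would also have
to wait past the threshold plus the rounding you describe):

  -- postgresql.conf: old_snapshot_threshold = 0
  -- session 1
  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  SELECT count(*) FROM sto_test;        -- snapshot taken here
  -- session 2
  UPDATE sto_test SET val = val + 1;    -- modifies pages after the snapshot
  VACUUM sto_test;                      -- or rely on HOT pruning on a later read
  -- session 1
  SELECT count(*) FROM sto_test;        -- can now fail with "snapshot too old"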

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Indirect indexes

2016-10-21 Thread Pantelis Theodosiou
On Thu, Oct 20, 2016 at 4:24 PM, Bruce Momjian  wrote:

> On Thu, Oct 20, 2016 at 05:14:51PM +0200, Petr Jelinek wrote:
> > > Also, it seems indirect indexes would be useful for indexing columns
> > > that are not updated frequently on tables that are updated frequently,
> > > and whose primary key is not updated frequently.  That's quite a logic
> > > problem for users to understand.
> > >
> >
> > Which covers like 99.9% of problematic cases I see on daily basis.
> >
> > And by that logic we should not have indexes at all, they are not
> > automatically created and user needs to think about if they need them or
> > not.
>
> Do you have to resort to extreme statements to make your point?  The use
> of indexes is clear to most users, while the use of indirect indexes
> would not be, as I stated earlier.
>

It's not that difficult to explain, I think. We just tell them (the
non-sophisticated users) that they are similar to the non-clustered indexes
that other DBMSes have (SQL Server, MySQL), which add the PK columns to the
non-clustered index when the table is clustered. Just as there, the index
doesn't need to be updated when neither the indexed columns nor the PK are
updated. So we get the same benefit, except that here the feature applies to
our heap tables.
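
Something like this is usually enough to get the idea across (the indirect
index syntax here is purely illustrative, not necessarily what the patch
proposes, and the table is made up):

  CREATE TABLE orders (
      id     bigint PRIMARY KEY,
      status text,
      note   text
  );
  CREATE INDIRECT INDEX orders_status_idx ON orders (status);

  -- Neither "status" nor the PK changes here, so even if the new heap tuple
  -- ends up on another page, the indirect index (which stores the PK value,
  -- not the heap TID) needs no new entry.  A regular index on "status"
  -- would have to be updated unless the update happened to be HOT.
  UPDATE orders SET note = 'expedite' WHERE id = 42;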

I think it's the same for any other feature that is added (partial indexes,
cubes, new syntax like LATERAL and FILTER). People will learn and start to
use it. We can't expect it to be used by everyone the day it's released.


>
> > Also helping user who does not have performance problem by 1% is very
> > different from helping user who has performance problem by 50% even if
> > she needs to think about the solution a bit.
> >
> > WARM can do WARM update 50% of time, indirect index can do HOT update
> > 100% of time (provided the column is not changed), I don't see why we
> > could not have both solutions.
>
> We don't know enough about the limits of WARM to say it is limited to
> 50%.
>
>
>


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-21 Thread David Steele

On 10/21/16 3:12 AM, David G. Johnston wrote:


I have no problem continuing keeping with historical precedent and
allowing mnemonic abbreviations in our directory and file names at this
point.


I'm still in favor of pg_xact.  A search of the 9.6 docs brings up a 
number of hits for "xact": pg_last_xact_replay_timestamp(), 
pg_advisory_xact_lock(), pg_advisory_xact_lock_shared(), 
pg_last_committed_xact(),  pg_prepared_xacts(), etc. There are also 
numerous column names that have "xact" in them.


It's not just an arcane developer term when it shows up in a number of 
our user-facing functions/columns.


--
-David
da...@pgmasters.net


[HACKERS] [RFC] Transaction management overhaul is necessary?

2016-10-21 Thread Tsunakawa, Takayuki
Hello,

From our experience in handling customers' problems, I feel it's necessary to
evolve PostgreSQL's transaction management.  The concrete problems are:

1. PostgreSQL cannot end and begin transactions in PL/pgSQL and PL/Java stored 
functions.
This is often the reason people could not migrate to PostgreSQL.


2. PostgreSQL does not support statement-level rollback.
When one customer ran a batch app using psqlODBC, a single postgres process
used dozens of GB of memory and crashed the OS.  The batch app prepared some
SQL statements with parameters and executed them five million times with
different parameter values in a single transaction.  They didn't experience
a problem with Oracle.

This was because psqlODBC, by default, starts and ends a subtransaction for
each SQL statement to implement statement-level rollback.  PostgreSQL creates
one CurTransactionContext memory context, which is 8KB, for each
subtransaction and retains them until the top transaction ends.  The total
memory used becomes 40GB (8KB * 5 million subtransactions).  This was avoided
by setting the Protocol parameter to 7.4-1, which means transaction-level
rollback.

The savepoint approach to statement-level rollback is inefficient, because it
adds two round trips (SAVEPOINT and RELEASE) for each statement, as sketched
below.
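
For illustration, this is roughly the traffic the driver generates when it
emulates statement-level rollback with subtransactions (statement and
savepoint names are made up):

  BEGIN;
  SAVEPOINT per_stmt;
  INSERT INTO batch_target VALUES (1);
  RELEASE SAVEPOINT per_stmt;
  SAVEPOINT per_stmt;
  INSERT INTO batch_target VALUES (2);
  RELEASE SAVEPOINT per_stmt;
  -- ...repeated five million times...
  COMMIT;
  -- On an error the driver issues ROLLBACK TO SAVEPOINT per_stmt instead of
  -- RELEASE, so only the failing statement is undone rather than the whole
  -- transaction.  Each SAVEPOINT starts a new subtransaction on the server,
  -- which is where the per-subtransaction memory comes from.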



I know autonomous transactions are also being discussed, which seems to be
difficult, so I hope some kind of transaction management overhaul can be
discussed to cover all these transaction-related features.  How should I
start?  I found the following item in the TODO list (but I haven't read it
yet).  What other discussions should I look at?

--
Implement stored procedures 
This might involve the control of transaction state and the return of multiple 
result sets 
PL/pgSQL stored procedure returning multiple result sets (SELECTs)? 
Proposal: real procedures again (8.4) 
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00542.php 
Gathering specs and discussion on feature (post 9.1) 
--


Regards
Takayuki Tsunakawa



Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-21 Thread Tomas Vondra

On 10/21/2016 08:13 AM, Amit Kapila wrote:

On Fri, Oct 21, 2016 at 6:31 AM, Robert Haas  wrote:

On Thu, Oct 20, 2016 at 4:04 PM, Tomas Vondra
 wrote:

I then started a run at 96 clients which I accidentally killed shortly
before it was scheduled to finish, but the results are not much
different; there is no hint of the runaway CLogControlLock contention
that Dilip sees on power2.


What shared_buffer size were you using? I assume the data set fit into
shared buffers, right?


8GB.


FWIW as I explained in the lengthy post earlier today, I can actually
reproduce the significant CLogControlLock contention (and the patches do
reduce it), even on x86_64.


/me goes back, rereads post.  Sorry, I didn't look at this carefully
the first time.


For example consider these two tests:

* http://tvondra.bitbucket.org/#dilip-300-unlogged-sync
* http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

However, it seems I can also reproduce fairly bad regressions, like for
example this case with data set exceeding shared_buffers:

* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip


I'm not sure how seriously we should take the regressions.  I mean,
what I see there is that CLogControlLock contention goes down by about
50% -- which is the point of the patch -- and WALWriteLock contention
goes up dramatically -- which sucks, but can't really be blamed on the
patch except in the indirect sense that a backend can't spend much
time waiting for A if it's already spending all of its time waiting
for B.



Right, I think not only WALWriteLock, but contention on other locks
also goes up as you can see in below table.  I think there is nothing
much we can do for that with this patch.  One thing which is unclear
is why on unlogged tests it is showing WALWriteLock?



Well, although we don't write the table data to the WAL, we still need 
to write commits and other stuff, right? And on scale 3000 (which 
exceeds the 16GB shared buffers in this case), there's a continuous 
stream of dirty pages (not to WAL, but evicted from shared buffers), so 
iostat looks like this:


  time        tps  wr_sec/s  avgrq-sz   avgqu-sz    await  %util
  08:48:21  81654   1367483     16.75  127264.60  1294.80  97.41
  08:48:31  41514    697516     16.80  103271.11  3015.01  97.64
  08:48:41  78892   1359779     17.24   97308.42   928.36  96.76
  08:48:51  58735    978475     16.66   92303.00  1472.82  95.92
  08:49:01  62441   1068605     17.11   78482.71  1615.56  95.57
  08:49:11  55571    945365     17.01  113672.62  1923.37  98.07
  08:49:21  69016   1161586     16.83   87055.66  1363.05  95.53
  08:49:31  54552    913461     16.74   98695.87  1761.30  97.84

That's ~500-600 MB/s of continuous writes. I'm sure the storage could 
handle more than this (will do some testing after the tests complete), 
but surely the WAL has to compete for bandwidth (it's on the same volume 
/ devices). Another thing is that we only have 8 WAL insert locks, and 
maybe that leads to contention with such high client counts.
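
As an aside, a crude way to see which of these locks dominates during a run
is to sample the 9.6 wait_event columns; just a sketch, with the sampling
loop left to taste:

  SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   GROUP BY 1, 2
   ORDER BY count(*) DESC;
  -- run repeatedly (e.g. with psql's \watch 1) and compare how often the
  -- clog, WAL-write and WAL-insert related waits show up

It's much coarser than proper instrumentation, but handy for a quick look
while a test is still running.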


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

