Re: [HACKERS] replicating DROP commands across servers

2015-01-02 Thread Jeff Janes
On Mon, Dec 29, 2014 at 2:15 PM, Alvaro Herrera 
wrote:

> Here's a patch that tweaks the grammar to use TypeName in COMMENT,
> SECURITY LABEL, and DROP for the type and domain cases.  The required
> changes in the code are pretty minimal, thankfully.  Note the slight
> changes to the new object_address test also.
>
> I think I'm pretty much done with this now, so I intend to push this
> first thing tomorrow and then the rebased getObjectIdentityParts patch
> sometime later.
>


This commit 3f88672a4e4d8e648d24ccc65 seems to have broken pg_upgrade for
pg_trgm.

On 9.2.9, freshly initdb'd and started with the default config:

$ createdb jjanes

in psql:

create extension pg_trgm ;
create table foo (x text);
create index on foo using gin (upper(x) gin_trgm_ops);

Then run 9.5's pg_upgrade and it fails at the stage of restoring the
database schema.

The relevant log files are attached.

Cheers,

Jeff


pg_upgrade_server.log
Description: Binary data


pg_upgrade_dump_16384.log
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-02 Thread Amit Kapila
On Fri, Jan 2, 2015 at 8:34 PM, Kevin Grittner  wrote:
>
> Amit Kapila  wrote:
>
> > Notes for Committer -
> > There is one behavioural difference in the handling of --analyze-in-stages
> > switch, when individual tables (by using -t option) are analyzed by
> > using this switch, patch will process (in case of concurrent jobs) all the
> > given tables for stage-1 and then for stage-2 and so on whereas in the
> > unpatched code it will process all the three stages table by table
> > (table-1 all three stages, table-2 all three stages and so on).  I think
> > the new behaviour is okay as the same is done when this utility does
> > vacuum for whole database.
>
> IMV, the change is for the better.  The whole point of
> --analyze-in-stages is to get minimal statistics so that "good
> enough" plans will be built for most queries to allow a production
> database to be brought back on-line quickly, followed by generating
> increasing granularity (which takes longer but should help ensure
> "best plan") while the database is in use with the initial
> statistics.  Doing all three levels for one table before generating
> the rough statistics for the others doesn't help with that, so I
> see this change as fixing a bug.  Is it feasible to break that part
> out as a separate patch?
>

As the patch currently stands, the fix (or new behaviour) is implemented
only for the multiple-jobs option; fixing this in the base code would
require a separate patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] History of WAL_LEVEL (archive vs hot_standby)

2015-01-02 Thread Noah Misch
On Thu, Mar 27, 2014 at 08:16:13PM -0700, David Johnston wrote:
> Slightly tangential but are the locking operations associated with the
> recent bugfix generated in both (all?) modes or only hot_standby?

All modes.

> I thought
> it strange that transient locking operations were output with WAL though I
> get it if they are there to support read-only queries.

It is unintuitive.  This comment in heap_lock_tuple() attempts to explain:

/*
 * XLOG stuff.  You might think that we don't need an XLOG record because
 * there is no state change worth restoring after a crash.  You would be
 * wrong however: we have just written either a TransactionId or a
 * MultiXactId that may never have been seen on disk before, and we need
 * to make sure that there are XLOG entries covering those ID numbers.
 * Else the same IDs might be re-used after a crash, which would be
 * disastrous if this page made it to disk before the crash.  Essentially
 * we have to enforce the WAL log-before-data rule even in this case.
 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
 * entries for everything anyway.)
 */

Another reason not mentioned is torn pages.  Locking a tuple updates t_xmax,
t_infomask2 and t_infomask.  It's possible for t_xmax to fall on one side of a
page tear and the infomasks to fall on the other side.  Writing t_xmax without
writing the corresponding infomasks could cause the tuple to be considered
deleted, not merely locked, after a crash.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg

2015-01-02 Thread Craig Ringer
On 12/30/2014 08:43 AM, Greg Sabino Mullane wrote:
> I am working on enhancing the ping() method of DBD::Pg. The goal of
> that is for a user to be able to determine if the connection to the
> database is still valid.

This entire concept is flawed. IMO the documentation for it should be
marked with a warning pointing out that use of this method to validate
a connection before use implies a race condition; it is never safe to
"validate" a connection then assume it'll actually be valid and
working for the following statement(s).

Correctly written applications must instead tolerate failures,
trapping errors/exceptions at each statement and retrying the transaction.

> The basic way we do this is to send a simple SELECT via PQexec and
> then check for a valid return value (and when in doubt, we check 
> PQstatus).

At the protocol level all that's needed, and all that should be done
IMO, is to send a Sync message and wait for the server's ReadyForQuery
response.

It might be nice to expose a libpq function to do this.

It's only safe to do this when there's no in-progress query, but
that's true of *any* validation method you might choose.

> After some experimenting, the best solution I found is to send the
> PQexec, and then check if PQresultErrorField(result, 'C') is
> '25P02'. If it is, then all is "well", in that the server is still
> alive. If it is not, then we can assume the backend is bad (for
> example, PQerrorMessage gives a "could not receive data from
> server: Bad file descriptor"). Being that we cannot do a rollback
> before calling the PQexec, is this a decent solution? Can we depend
> on really serious errors always trumping the expected 25P02?

I struggle to understand the utility of this.

Your transaction could be in the failed state because of an internal
error caused by the server's storage failing. So it's "alive" in the
sense of responding to queries, but completely useless.

This is another aspect of why the whole approach is wrong. If you want
to find out if the server can run a particular query, run the query.
If you get an error, then it couldn't run that query.
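
For concreteness, a minimal sketch of the PQexec-based probe Greg describes,
using only standard libpq calls (the helper name ping_connection() is purely
illustrative, and the race described above still applies to whatever it
returns):

#include <string.h>
#include <libpq-fe.h>

/* Hypothetical sketch, not DBD::Pg's actual code: returns 1 if the backend
 * still responds to a trivial query, 0 otherwise. */
static int
ping_connection(PGconn *conn)
{
    PGresult   *res;
    int         alive = 0;

    if (PQstatus(conn) != CONNECTION_OK)
        return 0;

    res = PQexec(conn, "SELECT 1");
    if (res != NULL)
    {
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            alive = 1;          /* query ran normally */
        else
        {
            /*
             * Inside an aborted transaction we expect SQLSTATE 25P02
             * (in_failed_sql_transaction): the backend is alive but will
             * refuse new commands until ROLLBACK.
             */
            const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

            if (sqlstate != NULL && strcmp(sqlstate, "25P02") == 0)
                alive = 1;
        }
        PQclear(res);
    }
    return alive;
}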

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)

2015-01-02 Thread Peter Geoghegan
On Thu, Jan 1, 2015 at 11:17 PM, Peter Geoghegan  wrote:
> I've been working on fixing the bugs that Jeff Janes found [1] with
> approach #2 to value locking [2]. Approach #1 was unaffected.

Unfortunately, exclusion constraints (which only value locking
approach #2 has support for, for the IGNORE variant) are broken, even
with my fix for the largely unrelated issues also described in my
opening mail. I now believe that it is inherently impossible to
support exclusion constraints with INSERT ... ON CONFLICT IGNORE, if
we are insistent that there cannot be livelocks, unprincipled
deadlocks or spurious exclusion violations -- and I think you all know
how I feel about those :-). I think that we should drop support for
exclusion constraints, regardless of whatever else happens with value
locking.

Previously, my mental model of approach #2 was that it more or less
handled B-Trees and exclusion constraint related indexes equivalently.
This is untrue, though.

Today, in general, exclusion constraints more or less work by
physically inserting a heap tuple and index tuple relating to the
exclusion constraint (typically this is with a GiST index). There
could well be a conflict due to a concurrent insertion, or even due to
the fact that there was a conflicting tuple all along. The
implementation tries to find a conclusively committed conflicting
tuple. If it does not find such a tuple, it knows that its own already
physically inserted tuple is enough of a reason for others to have to
worry about it, and so it may proceed and commit the xact -- once it
does a whole index scan with only its own tuple found (and not any
overlapping tuple from an unfinished xact), it's clear that it can
proceed.

Approach #2 to value locking makes exclusion constraints do a pre-check
for the IGNORE case, but that isn't really significant
here. The reason that #2 is broken for exclusion constraints but not
for unique indexes is that unique index insertion reports a conflict
in the process of inserting. In contrast, exclusion constraints insert
optimistically, and re-check. What's the difference? Well, even though
(since the implementation passes UNIQUE_CHECK_PARTIAL for unique index
insertion) we still insert when there is a conflict, there is an
important distinction: Value locking. That is to say, even though we
still insert, we also still use buffer locks as value locks in the
usual way that unique indexes do (we hold an exclusive buffer lock on
the first B-Tree leaf page the value could be on, as always). We still
decide that there is no conflict while holding that exclusive buffer lock,
which is a sufficient choke point to make that determination
conclusive. This implies that even with many concurrent insertions
into a newly created table, there will definitely be one inserter that
conclusively does *not* have a conflict, while the others might
(depends on the status of the guy who conclusively had *no* conflict,
or his line of successor xacts that may or may not themselves commit).

In a private e-mail to Heikki, I pointed out that he was mistaken when
he said that there is a pre-existing problem with exclusion
constraints: it is not possible for concurrent inserters to both
spuriously get exclusion violations when only one needs to. Heikki
accepted this. However, I now realize that Heikki was closer to being
right about it than I was; you can't get spurious exclusion
violations, but you can get something that's about the same, which is
a deadlock. That usually doesn't matter that much, because people
usually don't worry about the issue with exclusion constraints. But in
a world where they support INSERT ... ON CONFLICT IGNORE, that isn't
acceptable, IMV.

I have a test case:
https://github.com/petergeoghegan/upsert/blob/master/exclusion.sh

I see both spurious exclusion violations, and unprincipled deadlocks
with this simple IGNORE-based testcase, that uses a GiST + btree_gist
exclusion constraint. I didn't see livelocks, which would have been
really bad, because a mutual need to wait on each other's xact causes
a deadlock, and the deadlock detector detects those. I think that the
spurious exclusion violations are merely an artifact of the
implementation, as opposed to an inherent problem with exclusion
constraints (in other words, I don't withdraw my remarks from the last
paragraph - only (arguably) unprincipled deadlocks occur with ordinary
exclusion constraint enforcement with lots of concurrency).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-02 Thread Claudio Freire
On Fri, Jan 2, 2015 at 7:57 PM, Jim Nasby  wrote:
> On 1/2/15, 4:18 PM, Tom Lane wrote:
>>
>> Heikki Linnakangas  writes:
>>>
>>> >On 01/02/2015 11:41 PM, Tom Lane wrote:

 >>What might be worth trying is establishing a hard-and-fast boundary
 >>between C land and SQL land, with bitwise names in C and bytewise names
 >>in SQL.  This would mean, for example, that int4pl() would be renamed to
 >>int32pl() so far as the C function goes, but the function's SQL name would
 >>remain the same.
>>>
>>> >I don't like that. I read int4pl as the function implementing plus
>>> >operator for the SQL-visible int4 datatype, so int4pl makes perfect
>>> > sense.
>>
>> I agree with that so far as the SQL name for the function goes, which is
>> part of why I don't think we should rename anything at the SQL level.
>> But right now at the C level, it's unclear how things should be named,
>> and I think we don't really want a situation where the most appropriate
>> name is so unclear and potentially confusing.  We're surviving fine with
>> "int32" in C meaning "int4" in SQL so far as the type names go, so why not
>> copy that naming approach for function names?
>
>
> Realistically, how many non-developers actually use the intXX SQL names? I
> don't think I've ever seen it; the only places I recall seeing it done are
> code snippets on developer blogs. Everyone else uses smallint, etc.
>
> I know we're all gun-shy about this after standard_conforming_strings, but
> that affected *everyone*. I believe this change would affect very, very few
> users.
>
> Also, note that I'm not talking about removing anything yet; that would come
> later.

I think it's naive to think the intXX names wouldn't be used just
because they're postgres-specific. Especially when pgadmin3 generates
them.

Many scripts of mine have them because pgadmin3 generated them, many
others have them because I grew accustomed to using them, especially
when I don't care about being postgres-specific: float8 vs double precision
and stuff like that.

Let's not generalize from anecdote (me using them); let's just pay attention
to the fact that lots of tools expose them (pgadmin3 among them).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-02 Thread Jim Nasby

On 1/2/15, 4:18 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

>On 01/02/2015 11:41 PM, Tom Lane wrote:

>>What might be worth trying is establishing a hard-and-fast boundary
>>between C land and SQL land, with bitwise names in C and bytewise names
>>in SQL.  This would mean, for example, that int4pl() would be renamed to
>>int32pl() so far as the C function goes, but the function's SQL name would
>>remain the same.

>I don't like that. I read int4pl as the function implementing plus
>operator for the SQL-visible int4 datatype, so int4pl makes perfect sense.

I agree with that so far as the SQL name for the function goes, which is
part of why I don't think we should rename anything at the SQL level.
But right now at the C level, it's unclear how things should be named,
and I think we don't really want a situation where the most appropriate
name is so unclear and potentially confusing.  We're surviving fine with
"int32" in C meaning "int4" in SQL so far as the type names go, so why not
copy that naming approach for function names?


Realistically, how many non-developers actually use the intXX SQL names? I 
don't think I've ever seen it; the only places I recall seeing it done are code 
snippets on developer blogs. Everyone else uses smallint, etc.

I know we're all gun-shy about this after standard_conforming_strings, but that 
affected *everyone*. I believe this change would affect very, very few users.

Also, note that I'm not talking about removing anything yet; that would come 
later.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2015-01-02 Thread Josh Berkus
On 01/02/2015 01:57 AM, Heikki Linnakangas wrote:
> wal_keep_segments does not affect the calculation of CheckPointSegments.
> If you set wal_keep_segments high enough, checkpoint_wal_size will be
> exceeded. The other alternative would be to force a checkpoint earlier,
> i.e. lower CheckPointSegments, so that checkpoint_wal_size would be
> honored. However, if you set wal_keep_segments high enough, higher than
> checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no
> matter how frequently you checkpoint.

So you're saying that wal_keep_segments is part of the max_wal_size
total, NOT in addition to it?

Just asking for clarification, here.  I think that's a fine idea, I just
want to make sure I understood you.  The importance of wal_keep_segments
will be fading as more people use replication slots.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-02 Thread Tom Lane
Heikki Linnakangas  writes:
> On 01/02/2015 11:41 PM, Tom Lane wrote:
>> What might be worth trying is establishing a hard-and-fast boundary
>> between C land and SQL land, with bitwise names in C and bytewise names
>> in SQL.  This would mean, for example, that int4pl() would be renamed to
>> int32pl() so far as the C function goes, but the function's SQL name would
>> remain the same.

> I don't like that. I read int4pl as the function implementing plus 
> operator for the SQL-visible int4 datatype, so int4pl makes perfect sense.

I agree with that so far as the SQL name for the function goes, which is
part of why I don't think we should rename anything at the SQL level.
But right now at the C level, it's unclear how things should be named,
and I think we don't really want a situation where the most appropriate
name is so unclear and potentially confusing.  We're surviving fine with
"int32" in C meaning "int4" in SQL so far as the type names go, so why not
copy that naming approach for function names?

>> That would introduce visible inconsistency between such
>> functions' pg_proc.proname and pg_proc.prosrc fields, but at least the
>> inconsistency would follow a very clear pattern.  And I doubt that very
>> many user applications are depending on the contents of pg_proc.prosrc.

> Someone might be doing
> DirectFunctionCall2(int4pl, ...)
> in an extension. Well, probably not too likely for int4pl, as you could 
> just use the native C + operator, but it's not inconceivable for 
> something like int4recv().

We don't have a lot of hesitation about breaking individual function calls
in extensions, so long as (1) you'll get a compile error and (2) it's
clear how to update the code.  See for instance the multitude of cases
where we've added new arguments to existing C functions.  So I don't think
this objection holds a lot of water, especially when (as you note) there's
not that much reason to be calling most of these functions directly from C.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-02 Thread Heikki Linnakangas

On 01/02/2015 11:41 PM, Tom Lane wrote:

What might be worth trying is establishing a hard-and-fast boundary
between C land and SQL land, with bitwise names in C and bytewise names
in SQL.  This would mean, for example, that int4pl() would be renamed to
int32pl() so far as the C function goes, but the function's SQL name would
remain the same.


I don't like that. I read int4pl as the function implementing plus 
operator for the SQL-visible int4 datatype, so int4pl makes perfect sense.



 That would introduce visible inconsistency between such
functions' pg_proc.proname and pg_proc.prosrc fields, but at least the
inconsistency would follow a very clear pattern.  And I doubt that very
many user applications are depending on the contents of pg_proc.prosrc.


Someone might be doing

DirectFunctionCall2(int4pl, ...)

in an extension. Well, probably not too likely for int4pl, as you could 
just use the native C + operator, but it's not inconceivable for 
something like int4recv().


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final Patch for GROUPING SETS

2015-01-02 Thread Jim Nasby

On 12/31/14, 3:05 PM, Noah Misch wrote:

On Wed, Dec 31, 2014 at 05:33:43PM +, Andrew Gierth wrote:

> >"Noah" == Noah Misch  writes:

>
>  Noah> Suppose one node orchestrated all sorting and aggregation.
>
>Well, that has the downside of making it into an opaque blob, without
>actually gaining much.

The opaque-blob criticism is valid.  As for not gaining much, well, the gain I
sought was to break this stalemate.  You and Tom have expressed willingness to
accept the read I/O multiplier of the CTE approach.  You and I are willing to
swallow an architecture disruption, namely a tuplestore acting as a side
channel between executor nodes.  Given your NACK, I agree that it fails to
move us toward consensus and therefore does not gain much.  Alas.


I haven't read the full discussion in depth, but is what we'd want here the
ability to feed tuples to more than one node simultaneously? That would allow
things like

GroupAggregate
   --> Sort(a) ----\
   --> Sort(a,b) ---+--> SeqScan
   --> Hash(b) ----/

That would allow the planner to trade off things like total memory consumption 
vs IO.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-02 Thread Tom Lane
Jim Nasby  writes:
> On 12/31/14, 8:13 AM, Andres Freund wrote:
>> Note that the C datatype has been int32/int64 for a while now, it's just
>> the SQL datatype and the names of its support functions. Given that,
>> afaiu, we're talking about the C datatype it seems pretty clear that it
>> should be named int128.

I don't think there was any debate about calling the C typedef int128.
The question at hand was that there are some existing C functions whose
names follow the int2/int4/int8 convention, and it's not very clear what
to do with their 128-bit analogues.  Having said that, I'm fine with
switching to int128 for the C names of the new functions; but it is
definitely less than consistent with what we're doing elsewhere.

> Should we start down this road with SQL too, by creating int32, 64 and 128 
> (if needed?), and changing docs as needed?

No.  The situation is messy enough without changing our conventions at
the SQL level; that will introduce breakage into user applications,
for no very good reason because we've never had any inconsistency there.

What might be worth trying is establishing a hard-and-fast boundary
between C land and SQL land, with bitwise names in C and bytewise names
in SQL.  This would mean, for example, that int4pl() would be renamed to
int32pl() so far as the C function goes, but the function's SQL name would
remain the same.  That would introduce visible inconsistency between such
functions' pg_proc.proname and pg_proc.prosrc fields, but at least the
inconsistency would follow a very clear pattern.  And I doubt that very
many user applications are depending on the contents of pg_proc.prosrc.
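
For concreteness, the rename would touch only the C symbol; a hypothetical
sketch (not actual patch code, overflow check omitted) of what such a function
would look like, with pg_proc.proname still reading "int4pl":

#include "postgres.h"
#include "fmgr.h"

Datum
int32pl(PG_FUNCTION_ARGS)
{
    int32   arg1 = PG_GETARG_INT32(0);
    int32   arg2 = PG_GETARG_INT32(1);

    PG_RETURN_INT32(arg1 + arg2);   /* the real function also checks overflow */
}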

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-02 Thread Jim Nasby

On 12/31/14, 8:13 AM, Andres Freund wrote:

On 2015-01-01 03:00:50 +1300, David Rowley wrote:

2. References to int16 meaning 16 bytes. I'm really in two minds about this,
it's quite nice to keep the natural flow, int4, int8, int16, but I can't
help think that this will confuse someone one day. I think it'll be a long
time before it confused anyone if we called it int128 instead, but I'm not
that excited about seeing it renamed either. I'd like to hear what others
have to say... Is there a chance that some sql standard in the distant
future will have HUGEINT and we might regret not getting the internal names
nailed down?

Yeah, I think using int16 to mean 16-bytes will be confusing to
someone almost immediately.

hmm, I think it should be changed to int128 then.  Pity int4 was selected
as a name instead of int32 back in the day...

Note that the C datatype has been int32/int64 for a while now, it's just
the SQL datatype and the names of its support functions. Given that,
afaiu, we're talking about the C datatype it seems pretty clear that it
should be named int128.


Should we start down this road with SQL too, by creating int32, 64 and 128 (if 
needed?), and changing docs as needed?

Presumably that would be best as a separate patch...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-01-02 Thread Tom Lane
Ashutosh Bapat  writes:
> While looking at the patch for supporting inheritance on foreign tables, I
> noticed that if a transaction makes changes to more than two foreign
> servers the current implementation in postgres_fdw doesn't make sure that
> either all of them rollback or all of them commit their changes, IOW there
> is a possibility that some of them commit their changes while others
> rollback theirs.

> PFA patch which uses 2PC to solve this problem. In pgfdw_xact_callback() at
> XACT_EVENT_PRE_COMMIT event, it prepares the transaction at all the
> foreign PostgreSQL servers, and at XACT_EVENT_COMMIT or XACT_EVENT_ABORT
> event it commits or aborts those transactions, respectively.

TBH, I think this is a pretty awful idea.

In the first place, this does little to improve the actual reliability
of a commit occurring across multiple foreign servers; and in the second
place it creates a bunch of brand new failure modes, many of which would
require manual DBA cleanup.

The core of the problem is that this doesn't have anything to do with
2PC as it's commonly understood: for that, you need a genuine external
transaction manager that is aware of all the servers involved in a
transaction, and has its own persistent state (or at least a way to
reconstruct its own state by examining the per-server states).
This patch is not that; in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.

As far as failure modes go, one basic reason why this cannot work as
presented is that the remote servers may not even have prepared
transaction support enabled (in fact max_prepared_transactions = 0
is the default in all supported PG versions).  So this would absolutely
have to be a not-on-by-default option.  But the bigger issue is that
leaving it to the DBA to clean up after failures is not a production
grade solution, *especially* not for prepared transactions, which are
performance killers if not closed out promptly.  So I can't imagine
anyone wanting to turn this on without a more robust answer than that.

Basically I think what you'd need for this to be a credible patch would be
for it to work by changing the behavior only in the PREPARE TRANSACTION
path: rather than punting as we do now, prepare the remote transactions,
and report their server identities and gids to an external transaction
manager, which would then be responsible for issuing the actual commits
(along with the actual commit of the local transaction).  I have no idea
whether it's feasible to do that without having to assume a particular
2PC transaction manager API/implementation.
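
As a rough illustration (purely hypothetical, not part of the patch), the
commit half of what such a TM would drive over libpq amounts to issuing
COMMIT PREPARED for the agreed gid on every participant, with anything that
fails recorded for later resolution:

#include <stdio.h>
#include <libpq-fe.h>

/* Hypothetical sketch: commit an already-prepared transaction 'gid' on every
 * participating server.  A real TM must persist the gid list beforehand and
 * retry or resolve any participant that fails here. */
static int
commit_prepared_everywhere(PGconn **conns, int nconns, const char *gid)
{
    char    sql[128];
    int     i;
    int     ok = 1;

    snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);

    for (i = 0; i < nconns; i++)
    {
        PGresult   *res = PQexec(conns[i], sql);

        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            ok = 0;             /* remember this server/gid and resolve later */
        PQclear(res);
    }
    return ok;
}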

It'd be interesting to hear from people who are using 2PC in production
to find out if this would solve any real-world problems for them, and
what the details of the TM interface would need to look like to make it
work in practice.

In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Stephen Frost
* Bruce Momjian (br...@momjian.us) wrote:
> To be specific, it is desirable in streaming replication scenarios that don't
> use SSL compression.  (What percentage is that?)  Is it something we
> should mention in the docs for this feature?

Considering how painful the SSL renegotiation problems were and the CPU
overhead, I'd be surprised if many high-write-volume replication
environments use SSL at all.

There's a lot of win to be had from compression of FPWs, but like most
compression there are trade-offs, and there are environments where it won't
be a win; I believe those cases to be the minority.

Thanks,

Stephen




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-02 Thread Andres Freund
On 2014-12-29 16:59:05 -0500, Bruce Momjian wrote:
> diff --git a/src/backend/storage/buffer/buf_init.c 
> b/src/backend/storage/buffer/buf_init.c
> new file mode 100644
> index ff6c713..c4dce5b
> *** a/src/backend/storage/buffer/buf_init.c
> --- b/src/backend/storage/buffer/buf_init.c
> *** InitBufferPool(void)
> *** 67,72 
> --- 67,73 
>   boolfoundBufs,
>   foundDescs;
>   
> + fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
>   BufferDescriptors = (BufferDesc *)
>   ShmemInitStruct("Buffer Descriptors",
>   NBuffers * sizeof(BufferDesc), &foundDescs);
> diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
> new file mode 100644
> index 2ea2216..669c07f
> *** a/src/backend/storage/ipc/shmem.c
> --- b/src/backend/storage/ipc/shmem.c
> *** ShmemInitStruct(const char *name, Size s
> *** 327,332 
> --- 327,335 
>   ShmemIndexEnt *result;
>   void   *structPtr;
>   
> + if (strcmp(name, "Buffer Descriptors") == 0)
> + size = BUFFERALIGN(size) + 64;
> + 
>   LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
>   
>   if (!ShmemIndex)
> *** ShmemInitStruct(const char *name, Size s
> *** 413,418 
> --- 416,432 
>   " \"%s\" (%zu bytes requested)",
>   name, size)));
>   }
> + if (strcmp(name, "Buffer Descriptors") == 0)
> + {
> + /* align on 32 */
> + if ((int64)structPtr % 32 != 0)
> + structPtr = (void *)((int64)structPtr + 32 - (int64)structPtr % 32);
> + /* align on 32 but not 64 */
> + if ((int64)structPtr % 64 == 0)
> + structPtr = (void *)((int64)structPtr + 32);
> + }
> + fprintf(stderr, "shared memory alignment of %s:  %ld-byte\n", name,
> + (int64)structPtr % 64 == 0 ? 64 : (int64)structPtr % 64);
>   result->size = size;
>   result->location = structPtr;
>   }

> diff --git a/src/backend/storage/buffer/buf_init.c 
> b/src/backend/storage/buffer/buf_init.c
> new file mode 100644
> index ff6c713..c4dce5b
> *** a/src/backend/storage/buffer/buf_init.c
> --- b/src/backend/storage/buffer/buf_init.c
> *** InitBufferPool(void)
> *** 67,72 
> --- 67,73 
>   boolfoundBufs,
>   foundDescs;
>   
> + fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
>   BufferDescriptors = (BufferDesc *)
>   ShmemInitStruct("Buffer Descriptors",
>   NBuffers * sizeof(BufferDesc), &foundDescs);
> diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
> new file mode 100644
> index 2ea2216..50f836e
> *** a/src/backend/storage/ipc/shmem.c
> --- b/src/backend/storage/ipc/shmem.c
> *** ShmemInitStruct(const char *name, Size s
> *** 327,332 
> --- 327,335 
>   ShmemIndexEnt *result;
>   void   *structPtr;
>   
> + if (strcmp(name, "Buffer Descriptors") == 0)
> + size = BUFFERALIGN(size) + 64;
> + 
>   LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
>   
>   if (!ShmemIndex)
> *** ShmemInitStruct(const char *name, Size s
> *** 413,418 
> --- 416,429 
>   " \"%s\" (%zu bytes requested)",
>   name, size)));
>   }
> + if (strcmp(name, "Buffer Descriptors") == 0)
> + {
> + /* align on 64 */
> + if ((int64)structPtr % 64 != 0)
> + structPtr = (void *)((int64)structPtr + 64 - (int64)structPtr % 64);
> + }
> + fprintf(stderr, "shared memory alignment of %s:  %ld-byte\n", name,
> + (int64)structPtr % 64 == 0 ? 64 : (int64)structPtr % 64);
>   result->size = size;
>   result->location = structPtr;
>   }

I can't run tests right now...

What exactly do you want to see from these tests? That's essentially
what I've already benchmarked, plus some fprintfs.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Bruce Momjian
On Fri, Jan  2, 2015 at 02:18:12PM -0300, Claudio Freire wrote:
> On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund  wrote:
> >> , I now see the compression patch as something that has negatives, so
> >> has to be set by the user, and only wins in certain cases.  I am
> >> disappointed, and am trying to figure out how this became such a
> >> marginal win for 9.5.  :-(
> >
> > I find the notion that a multi digit space reduction is a "marginal win"
> > pretty ridiculous and way too narrow focused. Our WAL volume is a
> > *significant* problem in the field. And it mostly consists out of FPWs
> > spacewise.
> 
> One thing I'd like to point out, is that in cases where WAL I/O is an
> issue (ie: WAL archiving), usually people already compress the
> segments during archiving. I know I do, and I know it's recommended on
> the web, and by some consultants.
> 
> So, I wouldn't want this FPW compression, which is desirable in
> replication scenarios if you can spare the CPU cycles (because of
> streaming), adversely affecting WAL compression during archiving.

To be specific, it is desirable in streaming replication scenarios that don't
use SSL compression.  (What percentage is that?)  Is it something we
should mention in the docs for this feature?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Publish autovacuum informations

2015-01-02 Thread Jim Nasby

On 1/1/15, 4:17 PM, Noah Misch wrote:

> I'd be all right with putting the data structure declarations in a file
> named something like autovacuum_private.h, especially if it carried an
> annotation that "if you depend on this, don't be surprised if we break
> your code in future".

Such an annotation would be no more true than it is for the majority of header
files.  If including it makes you feel better, I don't object.


We need to be careful with that. Starting to segregate things into _private 
headers implies that stuff in non-private headers *is* locked down. We'd need 
to set clear expectations.

I do think more clarity would be good here. Right now the only distinction we
have is that things like SPI are spelled out in the docs. Other than that,
there really isn't anything to indicate how safe it is to rely on what's in
the headers. For example, I've got some code that's looking at
fcinfo->flinfo->fn_expr, and I have no idea how likely that is to get broken.
Since it's a parse node, my guess is "likely", but I'm just guessing.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Claudio Freire
On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund  wrote:
>> , I now see the compression patch as something that has negatives, so
>> has to be set by the user, and only wins in certain cases.  I am
>> disappointed, and am trying to figure out how this became such a
>> marginal win for 9.5.  :-(
>
> I find the notion that a multi digit space reduction is a "marginal win"
> pretty ridiculous and way too narrow focused. Our WAL volume is a
> *significant* problem in the field. And it mostly consists out of FPWs
> spacewise.

One thing I'd like to point out, is that in cases where WAL I/O is an
issue (ie: WAL archiving), usually people already compress the
segments during archiving. I know I do, and I know it's recommended on
the web, and by some consultants.

So, I wouldn't want this FPW compression, which is desirable in
replication scenarios if you can spare the CPU cycles (because of
streaming), adversely affecting WAL compression during archiving.

Has anyone tested the compressibility of WAL segments with FPW compression on?

AFAIK, both pglz and lz4 output should still be compressible with
deflate, but I've never tried.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Bruce Momjian
On Fri, Jan  2, 2015 at 06:11:29PM +0100, Andres Freund wrote:
> > My negativity is not that I don't want it, but I want to understand why
> > it isn't better than I remembered.  You are basically telling me it was
> > always a marginal win.  :-(  Boohoo!
> 
> No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently
> are not the bottleneck.

What I have not seen is any recent benchmark that shows it as a win,
while the original email did, so I was confused.  I tried to explain
exactly how I viewed things --- you don't have to like it, but that is how I
look at upcoming features, and where we should focus our time.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Andres Freund
On 2015-01-02 12:06:33 -0500, Bruce Momjian wrote:
> On Fri, Jan  2, 2015 at 05:55:52PM +0100, Andres Freund wrote:
> > On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> > > Why are we not seeing the 33% compression and 15% performance
> > > improvement he saw?  What am I missing here?
> > 
> > To see performance improvements something needs to be the bottleneck. If
> > WAL writes/flushes aren't that in the tested scenario, you won't see a
> > performance benefit. Amdahl's law and all that.
> > 
> > I don't understand your negativity about the topic.
> 
> I remember the initial post from Masao in August 2013 showing a
> performance boost, so I assumed, while we had the concurrent WAL insert
> performance improvement in 9.4, this was going to be our 9.5 WAL
> improvement.

I don't think it makes sense to compare features/improvements that way.

> While the WAL insert performance improvement required no tuning and
> was never a negative

It's actually a negative in some cases.

> , I now see the compression patch as something that has negatives, so
> has to be set by the user, and only wins in certain cases.  I am
> disappointed, and am trying to figure out how this became such a
> marginal win for 9.5.  :-(

I find the notion that a multi digit space reduction is a "marginal win"
pretty ridiculous and way too narrow focused. Our WAL volume is a
*significant* problem in the field. And it mostly consists out of FPWs
spacewise.

> My negativity is not that I don't want it, but I want to understand why
> it isn't better than I remembered.  You are basically telling me it was
> always a marginal win.  :-(  Boohoo!

No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently
are not the bottleneck.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Bruce Momjian
On Fri, Jan  2, 2015 at 05:55:52PM +0100, Andres Freund wrote:
> On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> > Why are we not seeing the 33% compression and 15% performance
> > improvement he saw?  What am I missing here?
> 
> To see performance improvements something needs to be the bottleneck. If
> WAL writes/flushes aren't that in the tested scenario, you won't see a
> performance benefit. Amdahl's law and all that.
> 
> I don't understand your negativity about the topic.

I remember the initial post from Masao in August 2013 showing a
performance boost, so I assumed, while we had the concurrent WAL insert
performance improvement in 9.4, this was going to be our 9.5 WAL
improvement.   While the WAL insert performance improvement required no
tuning and was never a negative, I now see the compression patch as
something that has negatives, so has to be set by the user, and only
wins in certain cases.  I am disappointed, and am trying to figure out
how this became such a marginal win for 9.5.  :-(

My negativity is not that I don't want it, but I want to understand why
it isn't better than I remembered.  You are basically telling me it was
always a marginal win.  :-(  Boohoo!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Andres Freund
On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw?  What am I missing here?

To see performance improvements something needs to be the bottleneck. If
WAL writes/flushes aren't that in the tested scenario, you won't see a
performance benefit. Amdahl's law and all that.

I don't understand your negativity about the topic.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Bruce Momjian
On Fri, Jan  2, 2015 at 10:15:57AM -0600, k...@rice.edu wrote:
> On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
> > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> > > I still don't understand the value of adding WAL compression, given the
> > > high CPU usage and minimal performance improvement.  The only big
> > > advantage is WAL storage, but again, why not just compress the WAL file
> > > when archiving.
> > 
> > before: pg_xlog is 800GB
> > after: pg_xlog is 600GB.
> > 
> > I'm damned sure that many people would be happy with that, even if the
> > *per backend* overhead is a bit higher. And no, compression of archives
> > when archiving helps *zap* with that (streaming, wal_keep_segments,
> > checkpoint_timeout). As discussed before.
> > 
> > Greetings,
> > 
> > Andres Freund
> > 
> 
> +1
> 
> On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
> above you can process 100GB of transaction data at the cost of a bit
> more CPU.

OK, so given your stats, the feature gives a 12.5% reduction in I/O.  If
that is significant, shouldn't we see a performance improvement?  If we
don't see a performance improvement, is I/O reduction worthwhile?  Is it
valuable in that it gives non-database applications more I/O to use?  Is
that all?
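
(Working through Ken's 50:50 assumption with Andres's numbers: 800 GB of table
I/O plus 800 GB of WAL before compression is 1600 GB in total, versus
800 + 600 = 1400 GB after; the 200 GB saved is 200/1600 = 12.5%.)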

I suggest we at least document this feature as mostly useful for
I/O reduction, and maybe say that CPU usage and performance might be
negatively impacted.

OK, here is the email I remember from Fujii Masao, in this same thread, that
showed a performance improvement for WAL compression:


http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hnr7ly516kft5ffqf4pp1q...@mail.gmail.com

Why are we not seeing the 33% compression and 15% performance
improvement he saw?  What am I missing here?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread k...@rice.edu
On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
> On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> > I still don't understand the value of adding WAL compression, given the
> > high CPU usage and minimal performance improvement.  The only big
> > advantage is WAL storage, but again, why not just compress the WAL file
> > when archiving.
> 
> before: pg_xlog is 800GB
> after: pg_xlog is 600GB.
> 
> I'm damned sure that many people would be happy with that, even if the
> *per backend* overhead is a bit higher. And no, compression of archives
> when archiving helps *zap* with that (streaming, wal_keep_segments,
> checkpoint_timeout). As discussed before.
> 
> Greetings,
> 
> Andres Freund
> 

+1

On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
above you can process 100GB of transaction data at the cost of a bit
more CPU.

Regards,
Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-02 Thread Kevin Grittner
Amit Kapila  wrote:

> Notes for Committer -
> There is one behavioural difference in the handling of --analyze-in-stages
> switch, when individual tables (by using -t option) are analyzed by
> using this switch, patch will process (in case of concurrent jobs) all the
> given tables for stage-1 and then for stage-2 and so on whereas in the
> unpatched code it will process all the three stages table by table
> (table-1 all three stages, table-2 all three stages and so on).  I think
> the new behaviour is okay as the same is done when this utility does
> vacuum for whole database.

IMV, the change is for the better.  The whole point of
--analyze-in-stages is to get minimal statistics so that "good
enough" plans will be built for most queries to allow a production
database to be brought back on-line quickly, followed by generating
increasing granularity (which takes longer but should help ensure
"best plan") while the database is in use with the initial
statistics.  Doing all three levels for one table before generating
the rough statistics for the others doesn't help with that, so I
see this change as fixing a bug.  Is it feasible to break that part 
out as a separate patch?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2015-01-02 Thread Heikki Linnakangas

On 01/01/2015 09:17 AM, Abhijit Menon-Sen wrote:

Hi.

OK, here are the patches with the various suggestions applied.

I found that the alignment didn't seem to make much difference for the
CRC32* instructions, so I changed to process (len/8)*8bytes followed by
(len%8)*1bytes, the way the Linux kernel does.


Ok.
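
(For reference, a minimal, hypothetical sketch of that chunking pattern using
the SSE4.2 CRC-32C intrinsics -- not the patch's actual code -- might look
like this:)

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>      /* _mm_crc32_u64/_mm_crc32_u8, needs -msse4.2 */

static uint32_t
crc32c_sse42(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    /* (len / 8) eight-byte chunks... */
    while (len >= 8)
    {
        uint64_t    word;

        memcpy(&word, p, 8);    /* sidesteps alignment concerns */
        crc = (uint32_t) _mm_crc32_u64(crc, word);
        p += 8;
        len -= 8;
    }
    /* ...then the remaining (len % 8) bytes one at a time. */
    while (len > 0)
    {
        crc = _mm_crc32_u8(crc, *p++);
        len--;
    }
    return crc;
}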

In the slicing-by-8 version, I wonder if it would be better to do 
single-byte loads to c0-c7, instead of two 4-byte loads and shifts. 
4-byte loads are presumably faster than single byte loads, but then 
you'd avoid the shifts. And then you could go straight into the 
8-bytes-at-a-time loop, without the initial single-byte processing to 
get the start address aligned. (the Linux implementation doesn't do 
that, so maybe it's a bad idea, but might be worth testing..)


Looking at the Linux implementation, I think it only does the bswap once 
per call, not inside the hot loop. Would it even make sense to keep the 
crc variable in different byte order, and only do the byte-swap once in 
END_CRC32() ?


The comments need some work. I note that there is no mention of the 
slicing-by-8 algorithm anywhere in the comments (in the first patch).


Instead of checking for "defined(__GNUC__) || defined(__clang__)", we
should add an explicit configure test for __builtin_bswap32().


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-02 Thread Amit Kapila
On Fri, Jan 2, 2015 at 11:47 AM, Dilip kumar  wrote:
>
> On 31 December 2014 18:36, Amit Kapila Wrote,
>
> >The patch looks good to me.  I have done couple of
> >cosmetic changes (spelling mistakes, improve some comments,
> >etc.), check the same once and if you are okay, we can move
> >ahead.
>
> Thanks for review and changes, changes looks fine to me..

Okay, I have marked this patch as "Ready For Committer"

Notes for Committer -
There is one behavioural difference in the handling of --analyze-in-stages
switch, when individual tables (by using -t option) are analyzed by
using this switch, patch will process (in case of concurrent jobs) all the
given tables for stage-1 and then for stage-2 and so on whereas in the
unpatched code it will process all the three stages table by table
(table-1 all three stages, table-2 all three stages and so on).  I think
the new behaviour is okay as the same is done when this utility does
vacuum for whole database.  As there was no input from any committer
on this point, I thought it better to move the patch forward rather than
wait longer just for this one point.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-02 Thread Amit Kapila
On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas  wrote:
>
>
> In the meantime, I had a good chat with Heikki on IM yesterday which
> gave me some new ideas on how to fix up the transaction handling in
> here, which I am working on implementing.  So hopefully I will have
> that by then.
>
> I am also hoping Amit will be adapting his parallel seq-scan patch to
> this framework.
>

While working on the parallel seq-scan patch to adapt this framework, I
noticed a few things and have questions regarding them.

1.
Currently the parallel worker just attaches to the error queue; for the tuple
queue, do you expect that to be done in the same place or in the
caller-supplied function?  If the latter, then we need the segment address as
input to that function, to attach the queue to the segment (shm_mq_attach()).
Another question I have in this regard is: if we have redirected
messages to the error queue by using pq_redirect_to_shm_mq, then how can
we set up a tuple queue for the same purpose?  Similarly, I think more
handling is needed for the tuple queue in the master backend, and the answer
to the above will dictate the best way to do it.

2.
Currently there is no interface for wait_for_workers_to_become_ready()
in your patch.  Don't you think it is important that, before we start
fetching tuples, we make sure all workers have started?  What if some
worker fails to start?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] compress method for spgist - 2

2015-01-02 Thread Heikki Linnakangas

On 12/23/2014 03:02 PM, Teodor Sigaev wrote:

>I think we'll need a separate SpGistTypeDesc for the input type. Or perhaps a
>separate SpGistTypeDesc for the reconstructed value and an optional decompress
>method to turn the reconstructedValue back into an actual reconstructed input
>datum. Or something like that.

I suppose that compress and reconstruct are mutually exclusive options.


I would rather not assume that. You might well want to store something 
in the leaf nodes that's different from the original Datum, but 
nevertheless contains enough information to reconstruct the original Datum.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-02 Thread Andres Freund
On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> I still don't understand the value of adding WAL compression, given the
> high CPU usage and minimal performance improvement.  The only big
> advantage is WAL storage, but again, why not just compress the WAL file
> when archiving.

before: pg_xlog is 800GB
after: pg_xlog is 600GB.

I'm damned sure that many people would be happy with that, even if the
*per backend* overhead is a bit higher. And no, compression of archives
when archiving helps *zap* with that (streaming, wal_keep_segments,
checkpoint_timeout). As discussed before.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-02 Thread Thom Brown
On 2 January 2015 at 11:13, Amit Kapila  wrote:

>
> On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown  wrote:
> >
> > On 1 January 2015 at 10:34, Amit Kapila  wrote:
> >>
> >> > Running it again, I get the same issue.  This is with
> parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.
> >> >
> >> > This doesn't happen if I set the pgbench scale to 50.  I suspect this
> is a OOM issue.  My laptop has 16GB RAM, the table is around 13GB at scale
> 100, and I don't have swap enabled.  But I'm concerned it crashes the whole
> instance.
> >> >
> >>
> >> Isn't this a backend crash due to OOM?
> >> And after that server will restart automatically.
> >
> >
> > Yes, I'm fairly sure it is.  I guess what I'm confused about is that 8
> parallel sequential scans in separate sessions (1 per session) don't cause
> the server to crash, but in a single session (8 in 1 session), they do.
> >
>
> It could be possible that the master backend retains some memory
> for a longer period, which causes it to hit the OOM error.  By the way,
> in your test does the master backend always hit OOM, or is it
> random (either master or worker)?
>

Just ran a few tests, and it appears to always be the master that hits OOM,
or at least I don't seem to be able to get an example of the worker hitting
it.


>
> >
> > Will there be a GUC to influence parallel scan cost?  Or does it take
> into account effective_io_concurrency in the costs?
> >
> > And will the planner be able to decide whether or not it'll choose to
> use background workers or not?  For example:
> >
>
> Yes, we are planning to introduce a cost model for parallel
> communication (there is some discussion about the same upthread),
> but it's still not there, and that's why you are seeing it choose a
> parallel plan when it shouldn't. Currently in the patch, if you set
> parallel_seqscan_degree, it will most probably choose a parallel
> plan only.
>

Ah, okay.  Great.

Thanks.

Thom


Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2015-01-02 Thread Dennis Kögel
On 31.12.2014 at 11:40, Michael Paquier wrote:
>> As long as master is fixed, I don't actually care. But I agree with Dennis
>> that it's hard to see what's been commited with all the different issues
>> found, and if any commits were done, in which branch. I'd like to be able to
>> tell my customers: update to this minor release to see if it's fixed, but I
>> can't even do that.
> This bug does not endanger data consistency at all, as only old WAL
> files remain on the standby, so I'm fine as well with just a clean fix
> on master and nothing done on back-branches.

I’d like to point out that this issue is causing severe disk space bloat over 
time, as described in my previous posting. I’d assume that, as time goes by, 
this issue will pop up more and more, as long as back-branches remain unfixed.

And as the general recommendation is to never ever fiddle with pg_xlog/ 
contents, I’m not even sure what to do about it.

- D.



Re: [HACKERS] Parallel Seq Scan

2015-01-02 Thread Amit Kapila
On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown  wrote:
>
> On 1 January 2015 at 10:34, Amit Kapila  wrote:
>>
>> > Running it again, I get the same issue.  This is with
parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.
>> >
>> > This doesn't happen if I set the pgbench scale to 50.  I suspect this
is an OOM issue.  My laptop has 16GB RAM, the table is around 13GB at scale
100, and I don't have swap enabled.  But I'm concerned it crashes the whole
instance.
>> >
>>
>> Isn't this a backend crash due to OOM?
>> And after that server will restart automatically.
>
>
> Yes, I'm fairly sure it is.  I guess what I'm confused about is that 8
parallel sequential scans in separate sessions (1 per session) don't cause
the server to crash, but in a single session (8 in 1 session), they do.
>

It could be that the master backend retains some memory for a longer
period, which causes it to hit the OOM error. By the way, in your test,
does the master backend always hit OOM, or is it random (either master
or worker)?

>
> Will there be a GUC to influence parallel scan cost?  Or does it take
into account effective_io_concurrency in the costs?
>
> And will the planner be able to decide whether or not it'll choose to use
background workers or not?  For example:
>

Yes, we are planning to introduce a cost model for parallel
communication (there is some discussion about the same upthread), but
it's still not there, and that's why you are seeing it choose a
parallel plan when it shouldn't. Currently in the patch, if you set
parallel_seqscan_degree, it will most probably choose a parallel plan
only.
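
Purely as an illustration of the shape such a cost model might take (this
is a sketch, not what the patch does; parallel_setup_cost and
cpu_tuple_comm_cost are assumed parameters, not existing GUCs):

#include "postgres.h"
#include "nodes/relation.h"

static double parallel_setup_cost = 1000.0;   /* assumed per-worker startup charge */
static double cpu_tuple_comm_cost = 0.1;      /* assumed per-tuple transfer charge */

/*
 * Sketch of one plausible costing for a parallel seq scan: split the
 * scan's run cost across the workers plus the master, then charge a
 * per-worker startup cost and a per-tuple cost for shipping tuples back.
 */
static void
cost_parallel_seqscan_sketch(Path *path, Cost seqscan_run_cost,
                             double ntuples, int nworkers)
{
    Cost        startup_cost = nworkers * parallel_setup_cost;
    Cost        run_cost = seqscan_run_cost / (nworkers + 1);

    run_cost += ntuples * cpu_tuple_comm_cost;

    path->startup_cost = startup_cost;
    path->total_cost = startup_cost + run_cost;
}

With something along those lines, a scan that is already cheap would not
be worth the per-worker startup charge, and the planner would keep the
non-parallel plan for cases like the HashAggregate example downthread.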



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-01-02 Thread Thom Brown
On 1 January 2015 at 10:34, Amit Kapila  wrote:

> > Running it again, I get the same issue.  This is with
> parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.
> >
> > This doesn't happen if I set the pgbench scale to 50.  I suspect this is
> an OOM issue.  My laptop has 16GB RAM, the table is around 13GB at scale
> 100, and I don't have swap enabled.  But I'm concerned it crashes the whole
> instance.
> >
>
> Isn't this a backend crash due to OOM?
> And after that server will restart automatically.
>

Yes, I'm fairly sure it is.  I guess what I'm confused about is that 8
parallel sequential scans in separate sessions (1 per session) don't cause
the server to crash, but in a single session (8 in 1 session), they do.


>
> > I also notice that requesting BUFFERS in a parallel EXPLAIN output
> yields no such information.
> > --
>
> Yeah, and the reason for the same is that all the work related to
> BUFFERS is done by the backend workers; the master backend doesn't
> read any pages, so it is not able to accumulate this information.
>
> > Is that not possible to report?
>
> It is not impossible to report such information; we can develop some
> way to share it between the master backend and the workers. I think
> we can do this if required once the patch is more stabilized.
>

Ah great, as I think losing such information to this feature would be
unfortunate.
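
A rough sketch of one way the workers' buffer counters could be handed
back (assuming a shared-memory array with one BufferUsage slot per worker;
the function name is made up):

#include "postgres.h"
#include "executor/instrument.h"

/*
 * Sketch: each worker copies its pgBufferUsage into its slot in shared
 * memory before finishing; the master then folds those slots into its own
 * counters so EXPLAIN (BUFFERS) can report totals for the whole scan.
 */
static void
accumulate_worker_buffer_usage(BufferUsage *worker_slots, int nworkers)
{
    int         i;

    for (i = 0; i < nworkers; i++)
    {
        BufferUsage *wu = &worker_slots[i];

        pgBufferUsage.shared_blks_hit += wu->shared_blks_hit;
        pgBufferUsage.shared_blks_read += wu->shared_blks_read;
        pgBufferUsage.shared_blks_dirtied += wu->shared_blks_dirtied;
        pgBufferUsage.shared_blks_written += wu->shared_blks_written;
        /* ... and likewise for the local and temp block counters ... */
    }
}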

Will there be a GUC to influence parallel scan cost?  Or does it take into
account effective_io_concurrency in the costs?

And will the planner be able to decide whether or not it'll choose to use
background workers or not?  For example:

# explain (analyse, buffers, verbose) select distinct bid from
pgbench_accounts;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=89584.00..89584.05 rows=5 width=4) (actual
time=228.222..228.224 rows=5 loops=1)
   Output: bid
   Group Key: pgbench_accounts.bid
   Buffers: shared hit=83334
   ->  Seq Scan on public.pgbench_accounts  (cost=0.00..88334.00
rows=50 width=4) (actual time=0.008..136.522 rows=50 loops=1)
 Output: bid
 Buffers: shared hit=83334
 Planning time: 0.071 ms
 Execution time: 228.265 ms
(9 rows)

This is a quick plan, but if we tell it that it's allowed 8 background
workers:

# set parallel_seqscan_degree = 8;
SET
Time: 0.187 ms

# explain (analyse, buffers, verbose) select distinct bid from
pgbench_accounts;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12291.75..12291.80 rows=5 width=4) (actual
time=603.042..603.042 rows=1 loops=1)
   Output: bid
   Group Key: pgbench_accounts.bid
   ->  Parallel Seq Scan on public.pgbench_accounts  (cost=0.00..11041.75
rows=50 width=4) (actual time=2.445..529.284 rows=50 loops=1)
 Output: bid
 Number of Workers: 8
 Number of Blocks Per Workers: 10416
 Planning time: 0.049 ms
 Execution time: 663.103 ms
(9 rows)

Time: 663.437 ms

It's significantly slower.  I'd hope the planner would anticipate this and
decide, "I'm just gonna perform a single scan in this instance as it'll be
a lot quicker for this simple case."  So at the moment
parallel_seqscan_degree seems to mean "You *must* use this many threads if
you can parallelise."  Ideally we'd be saying "can use up to this many if
necessary".

Thanks

Thom


Re: [HACKERS] Parallel Seq Scan

2015-01-02 Thread Amit Kapila
On Thu, Jan 1, 2015 at 11:29 PM, Robert Haas  wrote:
>
> On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
>  wrote:
> > Can we check the number of free bgworkers slots to set the max workers?
>
> The real solution here is that this patch can't throw an error if it's
> unable to obtain the desired number of background workers.  It needs
> to be able to smoothly degrade to a smaller number of background
> workers, or none at all.

I think handling it this way can have one side effect: if we degrade to
a smaller number of workers, then the cost of the plan (which the
optimizer decided based on the number of parallel workers) could end up
higher than that of a non-parallel scan.
Ideally, before finalizing the parallel plan we should reserve the
bgworkers required to execute it, but I think for now we can work out a
solution without that.
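
A minimal sketch of the "degrade smoothly" approach, assuming the
shared-memory setup has already been done and ignoring error handling;
the function name is illustrative:

#include "postmaster/bgworker.h"

/*
 * Sketch: try to register up to nworkers_requested dynamic workers and
 * run with however many could actually be registered.  If zero workers
 * are available, the master simply executes the whole scan itself.
 */
static int
launch_scan_workers(BackgroundWorker *worker_template,
                    BackgroundWorkerHandle **handles,
                    int nworkers_requested)
{
    int         nlaunched = 0;
    int         i;

    for (i = 0; i < nworkers_requested; i++)
    {
        if (!RegisterDynamicBackgroundWorker(worker_template,
                                             &handles[nlaunched]))
            break;              /* no free bgworker slots left */
        nlaunched++;
    }
    return nlaunched;           /* caller splits the blocks among these */
}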

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-01-02 Thread Thom Brown
On 1 January 2015 at 17:59, Robert Haas  wrote:

> On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
>  wrote:
> > Can we check the number of free bgworkers slots to set the max workers?
>
> The real solution here is that this patch can't throw an error if it's
> unable to obtain the desired number of background workers.  It needs
> to be able to smoothly degrade to a smaller number of background
> workers, or none at all.  I think a lot of this work will fall out
> quite naturally if this patch is reworked to use the parallel
> mode/parallel context stuff, the latest version of which includes an
> example of how to set up a parallel scan in such a manner that it can
> run with any number of workers.
>

+1

That sounds like exactly what's needed.

Thom


Re: [HACKERS] Redesigning checkpoint_segments

2015-01-02 Thread Heikki Linnakangas

On 01/01/2015 03:24 AM, Josh Berkus wrote:

Please remind me because I'm having trouble finding this in the
archives: how does wal_keep_segments interact with the new settings?


It's not very straightforward. First of all, min_recycle_wal_size has a 
different effect than wal_keep_segments. Raising min_recycle_wal_size 
causes more segments to be recycled rather than deleted, while 
wal_keep_segments causes old segments to be retained as old segments, so 
that they can be used for streaming replication. If you raise 
min_recycle_wal_size, it will not do any good for streaming replication.


wal_keep_segments does not affect the calculation of CheckPointSegments. 
If you set wal_keep_segments high enough, checkpoint_wal_size will be 
exceeded. The other alternative would be to force a checkpoint earlier, 
i.e. lower CheckPointSegments, so that checkpoint_wal_size would be 
honored. However, if you set wal_keep_segments high enough, higher than 
checkpoint_wal_size, it's impossible to honor checkpoint_wal_size no 
matter how frequently you checkpoint.
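
For instance, with settings along these lines (illustrative values only,
using the GUC names proposed in this thread), the WAL retained for
standbys already exceeds the checkpoint target, so the target cannot be
honored:

# illustrative values, not recommendations
checkpoint_wal_size  = 1GB     # target amount of WAL between checkpoints
min_recycle_wal_size = 256MB   # floor on WAL kept around for recycling
wal_keep_segments    = 128     # 128 * 16MB = 2GB retained for standbys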


It's not totally straightforward to calculate what maximum size of WAL a 
given wal_keep_segments setting will force. wal_keep_segments is taken 
into account at a checkpoint, when we recycle old WAL segments. For 
example, imagine that the prior checkpoint started at segment 95, a new 
checkpoint finishes at segment 100, and wal_keep_segments=10. Because of 
wal_keep_segments, we have to retain segments 90-95, which could 
otherwise be recycled. So that forces a WAL size of 10 segments, while 
otherwise 5 would be enough. However, before we reach the next 
checkpoint (let's assume it will complete at segment 105), we will 
consume five more segments, so the actual max WAL size is 15 segments. 
However, we could start recycling segments 90-95 before we reach the 
next checkpoint, because wal_keep_segments stipulates how many segments 
from the current *insert* location need to be retained, with no regard 
to checkpoints. But we only attempt to recycle segments at checkpoints.
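
Putting rough numbers on that example (assuming the default 16 MB segment
size):

  retained because of wal_keep_segments      = 10 segments
  written before the next checkpoint ends    =  5 segments
  worst-case WAL kept                        = 15 segments = 15 * 16 MB = 240 MB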


So that could be made more straightforward if we recycled old segments 
in the background, between checkpoints. That might allow merging 
wal_keep_segments and min_recycle_wal_size settings, too: instead of 
renaming all old recyclable segments at a checkpoint, you could keep 
them around as old segments until they're actually needed for reuse, so 
they could be used for streaming replication up to that point.


- Heikki


