Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-09-15 Thread Thom Brown
On 15 September 2017 at 19:23, Andres Freund  wrote:
> Hi Thom,
>
> Thanks for taking a whack at this!
>
> On 2017-09-15 12:16:22 +0100, Thom Brown wrote:
>> I've run a fairly basic test with a table with 101 columns, selecting
>> a single row from the table and I get the following results:
>>
>>
>> Columns with 1-character names:
>>
>> master (80 jobs, 80 connections, 60 seconds):
>
> FWIW, I don't think it's useful to test this with a lot of concurrency -
> at that point you're likely saturating the machine with context switches
> etc. unless you have a lot of cores. As this isn't related to
> concurrency I'd rather just check a single connection.
>
>
>> transaction type: /tmp/test.sql
>> scaling factor: 1
>> query mode: simple
>
> I think you'd need to use prepared statements / -M prepared to see
> benefits - when parsing statements for every execution the bottleneck is
> elsewhere (hello O(#available_columns * #selected_columns) in
> colNameToVar()).

Okay, I've retested it using a prepared statement executed 100,000
times (which selects a single row from a table with 101 columns, each
column name being 42-43 characters long, and 2 rows in the table), and
I get the following:

master:

real    1m23.485s
user    1m2.800s
sys     0m1.200s


patched:

real    1m22.530s
user    1m2.860s
sys     0m1.140s


Not sure why I'm not seeing the gain.
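
For reference, a minimal sketch of the kind of setup used here (the table
and column names, column type and exact name padding are assumptions, not
my actual script):

SELECT format('CREATE TABLE wide_test (%s)',
              string_agg(format('%I int',
                  'column_with_a_fairly_long_name_number_' || lpad(i::text, 3, '0')),
                  ', ' ORDER BY i))
  FROM generate_series(1, 101) AS i
\gexec

INSERT INTO wide_test (column_with_a_fairly_long_name_number_001) VALUES (1), (2);

-- /tmp/test.sql, then executed repeatedly, e.g. with pgbench -M prepared:
SELECT * FROM wide_test LIMIT 1;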

Thom




Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-09-15 Thread Thom Brown
On 14 September 2017 at 07:34, Andres Freund  wrote:
> Hi,
>
> When running workloads that include fast queries with a lot of columns,
> SendRowDescriptionMessage(), and the routines it calls, becomes a
> bottleneck.  Besides syscache lookups (see [1] and [2]) a major cost of
> that is manipulation of the StringBuffer and the version specific
> branches in the per-attribute loop.  As so often, the performance
> differential of this patch gets bigger when the other performance
> patches are applied.
>
> The issues in SendRowDescriptionMessage() are the following:
>
> 1) All the pq_sendint calls, and also the pq_sendstring, are
>expensive. The amount of calculations done for a single 2/4 byte
>addition to the stringbuffer (particularly enlargeStringInfo()) is
>problematic, as are the reallocations themselves.
>
>I addressed this by adding pq_send*_pre() wrappers, implemented as
>inline functions, that require that memory is pre-allocated.
>Combining that with doing a enlargeStringInfo() in
>SendRowDescriptionMessage() that pre-allocates the maximum required
>memory, yields pretty good speedup.
>
>I'm not yet super sure about the implementation. For one, I'm not
>sure this shouldn't instead be stringinfo.h functions, with very very
>tiny pqformat.h wrappers. But conversely I think it'd make a lot of
>sense for the pqformat integer functions to get rid of the
>continually maintained trailing null-byte - I was hoping the compiler
>could optimize that away, but alas, no luck.  As soon as a single
>integer is sent, you can't rely on 0 terminated strings anyway.
>
> 2) It creates a new StringInfo in every iteration. That causes
>noticeable memory management overhead, and restarts the size of the
>buffer every time. When the description is relatively large, that
>leads to a number of reallocations for every
>SendRowDescriptionMessage() call.
>
>My solution here was to create persistent StringInfo for
>SendRowDescriptionMessage() that never gets deallocated (just
>reset). That in combination with new versions of
>pq_beginmessage/endmessage that keep the buffer alive, yields a nice
>speedup.
>
>Currently I'm using a static variable to allocate a string buffer for
>the function. It'd probably better to manage that outside of a single
>function - I'm also wondering why we're allocating a good number of
>stringinfos in various places, rather than reuse them. Most of them
>can't be entered recursively, and even if that's a concern, we could
>have one reusable per portal or such.
>
> 3) The v2/v3 branches in the attribute loop are noticeable (others too,
>but well...).  I solved that by splitting out the v2 and v3
>per-attribute loops into separate functions.  Imo also a good chunk
>more readable.
>
> Comments?

I've run a fairly basic test with a table with 101 columns, selecting
a single row from the table and I get the following results:


Columns with 1-character names:

master (80 jobs, 80 connections, 60 seconds):

transaction type: /tmp/test.sql
scaling factor: 1
query mode: simple
number of clients: 80
number of threads: 80
duration: 60 s
number of transactions actually processed: 559275
latency average = 8.596 ms
tps = 9306.984058 (including connections establishing)
tps = 11144.622096 (excluding connections establishing)


patched:

transaction type: /tmp/test.sql
scaling factor: 1
query mode: simple
number of clients: 80
number of threads: 80
duration: 60 s
number of transactions actually processed: 585526
latency average = 8.210 ms
tps = 9744.240847 (including connections establishing)
tps = 11454.339301 (excluding connections establishing)


master (80 jobs, 80 connections, 120 seconds):

transaction type: /tmp/test.sql
scaling factor: 1
query mode: simple
number of clients: 80
number of threads: 80
duration: 120 s
number of transactions actually processed: 1108312
latency average = 8.668 ms
tps = 9229.356994 (including connections establishing)
tps = 9987.385281 (excluding connections establishing)


patched:

transaction type: /tmp/test.sql
scaling factor: 1
query mode: simple
number of clients: 80
number of threads: 80
duration: 120 s
number of transactions actually processed: 1130313
latency average = 8.499 ms
tps = 9412.904876 (including connections establishing)
tps = 10319.404302 (excluding connections establishing)



Columns with at least 42-character names:

master (80 jobs, 80 connections, 60 seconds):

transaction type: /tmp/test.sql
scaling factor: 1
query mode: simple
number of clients: 80
number of threads: 80
duration: 60 s
number of transactions actually processed: 197815
latency average = 24.308 ms
tps = 3291.157486 (including connections establishing)
tps = 3825.309489 (excluding connections establishing)



patched:

transaction type: /tmp/test.sql
scaling factor: 1
query mode: simple
number of clients: 80
number of threads: 80
duration: 60 s
number of transac

Re: [HACKERS] [POC] hash partitioning

2017-09-14 Thread Thom Brown
On 14 September 2017 at 09:58, amul sul  wrote:
> On Wed, Sep 13, 2017 at 7:43 PM, Jesper Pedersen
>  wrote:
>>
>> Hi Amul,
>>
>> On 09/08/2017 08:40 AM, amul sul wrote:
>>>
>>> Rebased 0002 against this commit & renamed to 0001, PFA.
>>>
>>
>> This patch needs a rebase.
>>
>
> Thanks for your note.
> Attached is the patch rebased on the latest master head.
> Also added error on creating default partition for the hash partitioned
> table, and updated document & test script for the same.

Sorry, but this needs another rebase as it's broken by commit
77b6b5e9ceca04dbd6f0f6cd3fc881519acc8714.
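
For reference, the change described above amounts to rejecting a default
partition on a hash-partitioned parent, along these lines (a sketch of the
proposed behaviour; the exact error text is assumed):

CREATE TABLE hashed (id int) PARTITION BY HASH (id);
CREATE TABLE hashed_default PARTITION OF hashed DEFAULT;
-- expected to fail, roughly:
-- ERROR:  a hash-partitioned table may not have a default partition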

Thom




Re: [HACKERS] Adding support for Default partition in partitioning

2017-08-17 Thread Thom Brown
On 17 August 2017 at 10:59, Jeevan Ladhe  wrote:
> Hi,
>
> On Tue, Aug 15, 2017 at 7:20 PM, Robert Haas  wrote:
>>
>> On Wed, Jul 26, 2017 at 8:14 AM, Jeevan Ladhe
>>  wrote:
>> > I have rebased the patches on the latest commit.
>>
>> This needs another rebase.
>
>
> I have rebased the patch and addressed your and Ashutosh's comments on the
> last set of patches.
>
> The current set of patches contains 6 patches as below:
>
> 0001:
> Refactoring existing ATExecAttachPartition  code so that it can be used for
> default partitioning as well
>
> 0002:
> This patch teaches the partitioning code to handle the NIL returned by
> get_qual_for_list().
> This is needed because a default partition will not have any constraints in
> case
> it is the only partition of its parent.
>
> 0003:
> Support for default partition with the restriction of preventing addition of
> any
> new partition after default partition. This is a merge of 0003 and 0004 in
> V24 series.
>
> 0004:
> Extend default partitioning support to allow addition of new partitions
> after
> default partition is created/attached. This patch is a merge of patches
> 0005 and 0006 in V24 series to simplify the review process. The
> commit message has more details regarding what all is included.
>
> 0005:
> This patch introduces code to check if the scanning of the default partition
> child can be skipped if its constraints are proven.
>
> 0006:
> Documentation.
>
>
> PFA, and let me know in case of any comments.

Thanks.  Applies fine, and I've been exercising the patch and it is
doing everything it's supposed to do.

I am, however, curious to know why the planner can't optimise the following:

SELECT * FROM mystuff WHERE mystuff = (1::int,'JP'::text,'blue'::text);

This exhaustively checks all partitions, but if I change it to:

SELECT * FROM mystuff WHERE (id, country, content) =
(1::int,'JP'::text,'blue'::text);

It works fine.

The former filters like so: ((mystuff_default_1.*)::mystuff = ROW(1,
'JP'::text, 'blue'::text))

Shouldn't it instead do:

((mystuff_default_1.id, mystuff_default_1.country,
mystuff_default_1.content)::mystuff = ROW(1, 'JP'::text,
'blue'::text))

So it's not really to do with this patch; it's just something I
noticed while testing.
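
For anyone wanting to reproduce the observation, a possible setup looks like
this (a sketch; my actual schema isn't shown above, so the column types and
partition bounds here are assumed):

CREATE TABLE mystuff (id int, country text, content text)
    PARTITION BY LIST (country);
CREATE TABLE mystuff_jp PARTITION OF mystuff FOR VALUES IN ('JP');
CREATE TABLE mystuff_default PARTITION OF mystuff DEFAULT;

-- scans every partition:
EXPLAIN SELECT * FROM mystuff
 WHERE mystuff = (1::int, 'JP'::text, 'blue'::text);

-- prunes as expected:
EXPLAIN SELECT * FROM mystuff
 WHERE (id, country, content) = (1::int, 'JP'::text, 'blue'::text);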

Thom




Re: [HACKERS] pg_dump does not handle indirectly-granted permissions properly

2017-07-25 Thread Thom Brown
On 26 July 2017 at 00:52, Stephen Frost  wrote:
> Thom,
>
> * Thom Brown (t...@linux.com) wrote:
>> This is the culprit:
>>
>> commit 23f34fa4ba358671adab16773e79c17c92cbc870
>> Author: Stephen Frost 
>> Date:   Wed Apr 6 21:45:32 2016 -0400
>
> Thanks!  I'll take a look tomorrow.

I should point out that this commit was made during the 9.6 cycle, and
I get the same issue with 9.6.

Thom




Re: [HACKERS] pg_dump does not handle indirectly-granted permissions properly

2017-07-25 Thread Thom Brown
On 25 July 2017 at 21:47, Stephen Frost  wrote:
> Tom,
>
> On Tue, Jul 25, 2017 at 16:43 Tom Lane  wrote:
>>
>> AFAICT, pg_dump has no notion that it needs to be careful about the order
>> in which permissions are granted.  I did
>>
>> regression=# create user joe;
>> CREATE ROLE
>> regression=# create user bob;
>> CREATE ROLE
>> regression=# create user alice;
>> CREATE ROLE
>> regression=# \c - joe
>> You are now connected to database "regression" as user "joe".
>> regression=> create table joestable(f1 int);
>> CREATE TABLE
>> regression=> grant select on joestable to alice with grant option;
>> GRANT
>> regression=> \c - alice
>> You are now connected to database "regression" as user "alice".
>> regression=> grant select on joestable to bob;
>> GRANT
>>
>> and then pg_dump'd that.  The ACL entry for joestable looks like this:
>>
>> --
>> -- TOC entry 5642 (class 0 OID 0)
>> -- Dependencies: 606
>> -- Name: joestable; Type: ACL; Schema: public; Owner: joe
>> --
>>
>> SET SESSION AUTHORIZATION alice;
>> GRANT SELECT ON TABLE joestable TO bob;
>> RESET SESSION AUTHORIZATION;
>> GRANT SELECT ON TABLE joestable TO alice WITH GRANT OPTION;
>>
>> Unsurprisingly, that fails to restore:
>>
>> db2=# SET SESSION AUTHORIZATION alice;
>> SET
>> db2=> GRANT SELECT ON TABLE joestable TO bob;
>> ERROR:  permission denied for relation joestable
>>
>>
>> I am not certain whether this is a newly introduced bug or not.
>> However, I tried it in 9.2-9.6, and all of them produce the
>> GRANTs in the right order:
>>
>> GRANT SELECT ON TABLE joestable TO alice WITH GRANT OPTION;
>> SET SESSION AUTHORIZATION alice;
>> GRANT SELECT ON TABLE joestable TO bob;
>> RESET SESSION AUTHORIZATION;
>>
>> That might be just chance, but my current bet is that Stephen
>> broke it sometime in the v10 cycle --- especially since we
>> haven't heard any complaints like this from the field.
>
>
> Hmmm. I'll certainly take a look when I get back to a laptop, but I can't
> recall offhand any specific code for handling that, nor what change I might
> have made in the v10 cycle which would have broken it (if anything, I would
> have expected an issue from the rework in 9.6...).
>
> I should be able to look at this tomorrow though.

This is the culprit:

commit 23f34fa4ba358671adab16773e79c17c92cbc870
Author: Stephen Frost 
Date:   Wed Apr 6 21:45:32 2016 -0400

In pg_dump, include pg_catalog and extension ACLs, if changed

Now that all of the infrastructure exists, add in the ability to
dump out the ACLs of the objects inside of pg_catalog or the ACLs
for objects which are members of extensions, but only if they have
been changed from their original values.

The original values are tracked in pg_init_privs.  When pg_dump'ing
9.6-and-above databases, we will dump out the ACLs for all objects
in pg_catalog and the ACLs for all extension members, where the ACL
has been changed from the original value which was set during either
initdb or CREATE EXTENSION.

This should not change dumps against pre-9.6 databases.

Reviews by Alexander Korotkov, Jose Luis Tallon

Thom




Re: [HACKERS] WIP: Failover Slots

2017-07-25 Thread Thom Brown
On 8 April 2016 at 07:13, Craig Ringer  wrote:
> On 6 April 2016 at 22:17, Andres Freund  wrote:
>
>>
>> Quickly skimming 0001 in [4] there appear to be a number of issues:
>> * LWLockHeldByMe() is only for debugging, not functional differences
>> * ReplicationSlotPersistentData is now in an xlog related header
>> * The code and behaviour around name conflicts of slots seems pretty
>>   raw, and not discussed
>> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>>   idea
>> * I doubt that the archive based switches around StartupReplicationSlots
>>   do what they intend. Afaics that'll not work correctly for basebackups
>>   taken with -X, without recovery.conf
>>
>
> Thanks for looking at it. Most of those are my errors. I think this is
> pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> learning about a couple of those mistakes.
>
> Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> as a promotion and wouldn't increment the timeline, copied non-failover
> slots wouldn't get removed. I've never liked that logic at all anyway, I
> just couldn't think of anything better...
>
> LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> support only." So that's just a dumb mistake on my part, and I should've
> added "alreadyLocked" parameters. (Ugly, but works).
>
> But why would it be a bad idea to conditionally take a code path that
> acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> RecoveryInProgress() more than once and doing the acquire and release based
> on separate tests, which would be a problem. I don't really get the problem
> with:
>
> if (!RecoveryInProgress())
> {
> /* first check whether there's something to write out */
> SpinLockAcquire(&slot->mutex);
> was_dirty = slot->dirty;
> slot->just_dirtied = false;
> SpinLockRelease(&slot->mutex);
>
> /* and don't do anything if there's nothing to write */
> if (!was_dirty)
> return;
> }
>
> ... though I think what I really should've done there is just always dirty
> the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

Thanks

Thom




Re: [HACKERS] Create language syntax is not proper in pg_dumpall and not working using pg_upgrade

2017-07-25 Thread Thom Brown
On 25 July 2017 at 12:09, tushar  wrote:
> v9.6
>
> postgres=# CREATE LANGUAGE alt_lang1 HANDLER plpgsql_call_handler;
> CREATE LANGUAGE
> postgres=# \q
>
> v10 , run pg_upgrade - failing with this error -
>
> pg_restore: creating pg_largeobject_metadata "pg_largeobject_metadata"
> pg_restore: creating COMMENT "postgres"
> pg_restore: creating SCHEMA "public"
> pg_restore: creating COMMENT "SCHEMA "public""
> pg_restore: creating PROCEDURAL LANGUAGE "alt_lang1"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 560; 2612 16384 PROCEDURAL
> LANGUAGE alt_lang1 edb
> pg_restore: [archiver (db)] could not execute query: ERROR: unsupported
> language "alt_lang1"
> HINT:  The supported languages are listed in the pg_pltemplate system
> catalog.
> Command was: CREATE OR REPLACE PROCEDURAL LANGUAGE "alt_lang1";
>
>
> take the cluster dump using pg_dumpall
> "
> --
> -- Name: alt_lang1; Type: PROCEDURAL LANGUAGE; Schema: -; Owner: edb
> --
>
> CREATE OR REPLACE PROCEDURAL LANGUAGE alt_lang1;
>
>
> ALTER PROCEDURAL LANGUAGE alt_lang1 OWNER TO edb;
> "
>
> The handler part is missing, and because of that it throws an error if we
> try to execute it in a psql terminal.

I think this is because the handler function is in the pg_catalog
schema, whose contents aren't dumped.  However, it seems they need to be
if a non-pg_catalog language is created that uses them.
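
In other words, for the dump to be restorable on its own, it would need to
emit the handler clause along with the language, roughly (a sketch):

CREATE OR REPLACE PROCEDURAL LANGUAGE alt_lang1
    HANDLER plpgsql_call_handler;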

This seems odd though:

/* Skip if not to be dumped */
if (!plang->dobj.dump || dopt->dataOnly)
    return;

...

funcInfo = findFuncByOid(plang->lanplcallfoid);
if (funcInfo != NULL && !funcInfo->dobj.dump && plang->dobj.dump)
    funcInfo = NULL;    /* treat not-dumped same as not-found */


The reason I think it's odd is because, if it finds that the language
needs to be dumped, it checks whether the functions referenced for the
handler, inline and validator are in the pg_catalog schema too.  If
they are, it doesn't output them, but if we know the language isn't in
pg_catalog, and the functions for these options are, then surely we
*have* to output them?  Should there be any checks for these functions
at all?

Thom




Re: [HACKERS] pg_upgrade failed if view is based on sequence

2017-07-20 Thread Thom Brown
On 20 July 2017 at 14:04, Thom Brown  wrote:
> On 20 July 2017 at 13:23, tushar  wrote:
>> Steps to reproduce -
>>
>> v9.6
>>
>> postgres=# create sequence seq_9166 start 1 increment 1;
>> CREATE SEQUENCE
>> postgres=# create or replace view v3_9166 as select * from seq_9166;
>> CREATE VIEW
>>
>> v10
>>
>> run pg_upgrade , going to fail with this error
>>
>>
>> command: "./pg_restore" --host
>> /home/centos/pg10_14july/postgresql/edbpsql/bin --port 50432 --username edb
>> --exit-on-error --verbose --dbname 'dbname=postgres'
>> "pg_upgrade_dump_13269.custom" >> "pg_upgrade_dump_13269.log" 2>&1
>> pg_restore: connecting to database for restore
>> pg_restore: creating pg_largeobject "pg_largeobject"
>> pg_restore: creating pg_largeobject_metadata "pg_largeobject_metadata"
>> pg_restore: creating COMMENT "postgres"
>> pg_restore: creating SCHEMA "public"
>> pg_restore: creating COMMENT "SCHEMA "public""
>> pg_restore: creating TABLE "public.fb17136_tab1"
>> pg_restore: creating SEQUENCE "public.seq_9166"
>> pg_restore: creating VIEW "public.v3_9166"
>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>> pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16392 VIEW
>> v3_9166 edb
>> pg_restore: [archiver (db)] could not execute query: ERROR:  column
>> seq_9166.sequence_name does not exist
>> LINE 14:  SELECT "seq_9166"."sequence_name",
>>  ^
>> Command was:
>> -- For binary upgrade, must preserve pg_type oid
>> SELECT
>> pg_catalog.binary_upgrade_set_next_pg_type_oid('16394'::pg_catalog.oid);
>>
>>
>> -- For binary upgrade, must preserve pg_type array oid
>> SELECT
>> pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16393'::pg_catalog.oid);
>>
>>
>> -- For binary upgrade, must preserve pg_class oids
>> SELECT
>> pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16392'::pg_catalog.oid);
>>
>> CREATE VIEW "v3_9166" AS
>>  SELECT "seq_9166"."sequence_name",
>> "seq_9166"."last_value",
>> "seq_9166"."start_value",
>> "seq_9166"."increment_by",
>> "seq_9166"."max_value",
>> "seq_9166"."min_value",
>> "seq_9166"."cache_value",
>> "seq_9166"."log_cnt",
>> "seq_9166"."is_cycled",
>> "seq_9166"."is_called"
>>FROM "seq_9166";
>
> This is because sequence_name, start_value, increment_by, max_value,
> min_value, cache_value and is_cycled are no longer output when
> selecting from sequences.  Commit
> 1753b1b027035029c2a2a1649065762fafbf63f3 didn't take into account
> upgrading sequences to 10.

Actually, I'm not sure we need to bother fixing this.  In the view
creation, * has to be expanded to whatever columns exist at the time
of creating the view, and since most of those columns no longer exist
in v10, there's no way to get the view ported over without rewriting
it.  Anything that depends on the output of those columns would be
broken anyway.

Thom




Re: [HACKERS] pg_upgrade failed if view is based on sequence

2017-07-20 Thread Thom Brown
On 20 July 2017 at 13:23, tushar  wrote:
> Steps to reproduce -
>
> v9.6
>
> postgres=# create sequence seq_9166 start 1 increment 1;
> CREATE SEQUENCE
> postgres=# create or replace view v3_9166 as select * from seq_9166;
> CREATE VIEW
>
> v10
>
> run pg_upgrade , going to fail with this error
>
>
> command: "./pg_restore" --host
> /home/centos/pg10_14july/postgresql/edbpsql/bin --port 50432 --username edb
> --exit-on-error --verbose --dbname 'dbname=postgres'
> "pg_upgrade_dump_13269.custom" >> "pg_upgrade_dump_13269.log" 2>&1
> pg_restore: connecting to database for restore
> pg_restore: creating pg_largeobject "pg_largeobject"
> pg_restore: creating pg_largeobject_metadata "pg_largeobject_metadata"
> pg_restore: creating COMMENT "postgres"
> pg_restore: creating SCHEMA "public"
> pg_restore: creating COMMENT "SCHEMA "public""
> pg_restore: creating TABLE "public.fb17136_tab1"
> pg_restore: creating SEQUENCE "public.seq_9166"
> pg_restore: creating VIEW "public.v3_9166"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16392 VIEW
> v3_9166 edb
> pg_restore: [archiver (db)] could not execute query: ERROR:  column
> seq_9166.sequence_name does not exist
> LINE 14:  SELECT "seq_9166"."sequence_name",
>  ^
> Command was:
> -- For binary upgrade, must preserve pg_type oid
> SELECT
> pg_catalog.binary_upgrade_set_next_pg_type_oid('16394'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_type array oid
> SELECT
> pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16393'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_class oids
> SELECT
> pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16392'::pg_catalog.oid);
>
> CREATE VIEW "v3_9166" AS
>  SELECT "seq_9166"."sequence_name",
> "seq_9166"."last_value",
> "seq_9166"."start_value",
> "seq_9166"."increment_by",
> "seq_9166"."max_value",
> "seq_9166"."min_value",
> "seq_9166"."cache_value",
> "seq_9166"."log_cnt",
> "seq_9166"."is_cycled",
> "seq_9166"."is_called"
>FROM "seq_9166";

This is because sequence_name, start_value, increment_by, max_value,
min_value, cache_value and is_cycled are no longer output when
selecting from sequences.  Commit
1753b1b027035029c2a2a1649065762fafbf63f3 didn't take into account
upgrading sequences to 10.
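
To illustrate (as I understand the v10 behaviour, so treat this as a
sketch): selecting from a sequence relation now returns only three columns,
and the old metadata columns have moved to the pg_sequences view:

SELECT * FROM seq_9166;
-- last_value | log_cnt | is_called

SELECT start_value, increment_by, max_value, min_value, cache_size, cycle
  FROM pg_sequences
 WHERE sequencename = 'seq_9166';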

Thom




Re: [HACKERS] pg_upgrade failed if view contain natural left join condition

2017-07-20 Thread Thom Brown
On 20 July 2017 at 13:09, tushar  wrote:
> Steps to reproduce -
>
> v9.6
>
> postgres=# create table t(n int);
> CREATE TABLE
> postgres=# create table t1(a int);
> CREATE TABLE
> postgres=# create view ttt1 as SELECT e.n FROM t e NATURAL LEFT JOIN t1 d;
> CREATE VIEW
>
> v10 -
>
> run pg_upgrade -
>
> going to fail ,with this error -
>
> "
> pg_restore: creating TABLE "public.t"
> pg_restore: creating TABLE "public.t1"
> pg_restore: creating VIEW "public.ttt1"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16390 VIEW ttt1
> edb
> pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
> or near ")"
> LINE 16:  LEFT JOIN "t1" "d");
> ^
> Command was:
> -- For binary upgrade, must preserve pg_type oid
> SELECT
> pg_catalog.binary_upgrade_set_next_pg_type_oid('16392'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_type array oid
> SELECT
> pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16391'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_class oids
> SELECT
> pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16390'::pg_catalog.oid);
>
> CREATE VIEW "ttt1" AS
>  SELECT "e"."n"
>FROM ("t" "e"
>  LEFT JOIN "t1" "d");
>
> "
> I think this issue is present in the older branches as well, but I have not
> checked that.

I get the same result on 9.2 and 10 in pg_dump output.
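
For what it's worth, a restorable equivalent needs an explicit join
condition; as t and t1 have no common columns, the NATURAL join effectively
joins on true, so something like this (a sketch) would round-trip:

CREATE VIEW ttt1 AS
 SELECT e.n
   FROM t e LEFT JOIN t1 d ON true;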

Thom




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-22 Thread Thom Brown
On 22 June 2017 at 22:52, Robert Haas  wrote:
> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy  
> wrote:
>> [ new patch ]
>
> I think this is looking better.  I have some suggestions:
>
> * I suggest renaming launch_autoprewarm_dump() to
> autoprewarm_start_worker().  I think that will be clearer.  Remember
> that user-visible names, internal names, and the documentation should
> all match.

+1

I like related functions and GUCs to be similarly named so that they
have the same prefix.

>
> * I think the GUC could be pg_prewarm.autoprewarm rather than
> pg_prewarm.enable_autoprewarm.  It's shorter and, I think, no less
> clear.

+1

I also think pg_prewarm.dump_interval should be renamed to
pg_prewarm.autoprewarm_interval.

>
> * In the documentation, don't say "This is a SQL callable function
> to".  This is a list of SQL-callable functions, so each thing in
> the list is one.  Just delete this from the beginning of each
> sentence.

I've made a pass at the documentation and ended up removing those
intros.  I haven't made any GUC/function renaming changes, but I have
rewritten some paragraphs for clarity.  Updated patch attached.

One thing I couldn't quite make sense of is:

"The autoprewarm process will start loading blocks recorded in
$PGDATA/autoprewarm.blocks until there is a free buffer left in the
buffer pool."

Is this saying "until there is a single free buffer remaining in
shared buffers"?  I haven't corrected or clarified this as I don't
understand it.

Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
detect an autoprewarm process already running.  I'd want this to
return NULL or an error if called for a 2nd time.
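
A sketch of what I'd expect instead (the function name is the one in the
patch under review; the second call's result is the suggestion, not the
current behaviour):

SELECT launch_autoprewarm_dump();  -- starts the background worker
SELECT launch_autoprewarm_dump();  -- suggested: return NULL or raise an error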

>
> * The reason for the AT_PWARM_* naming is not very obvious.  Does AT
> mean "at" or "auto" or something else?  How about
> AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY,
> AUTOPREWARM_INTERVAL_DEFAULT?
>
> * Instead of defining apw_sigusr1_handler, I think you could just use
> procsignal_sigusr1_handler.  Instead of defining apw_sigterm_handler,
> perhaps you could just use die().  got_sigterm would go away, and
> you'd just CHECK_FOR_INTERRUPTS().
>
> * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
> reset_apw_state(), which might be better named detach_apw_shmem().
> Similarly, init_apw_state() could be init_apw_shmem().
>
> * Instead of load_one_database(), I suggest
> autoprewarm_database_main().  That is more parallel to
> autoprewarm_main(), which you have elsewhere, and makes it more
> obvious that it's the main entrypoint for some background worker.
>
> * Instead of launch_and_wait_for_per_database_worker(), I suggest
> autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
> suggest autoprewarm_buffers().   The motivation for changing prewarm
> to autoprewarm is that we want the names here to be clearly distinct
> from the other parts of pg_prewarm that are not related to
> autoprewarm.  The motivation for changing buffer_pool to buffers is
> just that it's a little shorter.  Personally I  also like the sound it
> of it better, but YMMV.
>
> * prewarm_buffer_pool() ends with a useless return statement.  I
> suggest removing it.
>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et. al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.
>
> * I think that it would be cool if, when autoprewarm completed, it
> printed a message at LOG rather than DEBUG1, and with a few more
> details, like "autoprewarm successfully prewarmed %d of %d
> previously-loaded blocks".  This would require some additional
> tracking that you haven't got right now; you'd have to keep track not
> only of the number of blocks read from the file but how many of those
> some worker actually loaded.  You could do that with an extra counter
> in the shared memory area that gets incremented by the per-database
> workers.
>
> * dump_block_info_periodically() calls ResetLatch() immediately before
> WaitLatch; that's backwards.  See the commit message for commit
> 887feefe87b9099c2967ec31ce20df4dfa9b and the comments it added to
> the top of latch.h for details on how to do this correctly.
>
> * dump_block_info_periodically()'s main loop is a bit confusing.  I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm.  You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix that by moving the got_sighup test to the
> top of t

Re: [HACKERS] logical replication syntax (was DROP SUBSCRIPTION, query cancellations and slot handling)

2017-05-02 Thread Thom Brown
On 2 May 2017 at 12:55, Robert Haas  wrote:
> On Thu, Apr 20, 2017 at 7:46 AM, Petr Jelinek
>  wrote:
>> DROP SUBSCRIPTION mysub NODROP SLOT;
>
> I'm pretty uninspired by this choice of syntax.  Logical replication
> seems to have added a whole bunch of syntax that involves prefixing
> words with "no".  In various places, there's NODROP, NOREFRESH, NOCOPY
> DATA, NOCONNECT, and NOPUBLISH.  But "NO" is not an English prefix,
> and there's no precedent of which I'm aware for such SQL syntax.  In
> most places, we've chosen to name the option and then the user set it
> to "on" or "off".  So for example you type EXPLAIN (ANALYZE, TIMING
> OFF) or EXPLAIN (ANALYZE, TIMING FALSE), not EXPLAIN (ANALYZE,
> NOTIMING).  I think most of the logical replication stuff could have
> been done this way pretty easily, but for some reason it picked a
> completely different approach.

+1 for not upsetting my OCD.
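
To make the comparison concrete, the option-list style Robert refers to
already exists in EXPLAIN, and a hypothetical equivalent for the
subscription case might look like the commented statement below (invented
syntax, purely for illustration):

EXPLAIN (ANALYZE, TIMING OFF) SELECT 1;
-- hypothetical option-list spelling of "NODROP SLOT":
-- DROP SUBSCRIPTION mysub WITH (drop_slot = false);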

Thom




Re: [HACKERS] RFC: ALTER SYSTEM [...] COMMENT

2017-04-26 Thread Thom Brown
On 26 April 2017 at 18:03, Joshua D. Drake  wrote:
> -hackers,
>
> We have had ALTER SYSTEM for some time now. It is awesome to be able to make
> changes that can be system wide without ever having to hit a shell but it
> does lack a feature that seems like an oversight, the ability to comment.
>
> Problem we are trying to solve:
>
> Having documentation for changes to GUC parameters that are modified via
> ALTER SYSTEM.
>
> Why?
>
> Because documentation is good and required for a proper production system.
>
> How?
>
> I propose:
>
> Add a column to pg_settings comment(text)
> Change the grammar to allow:
>
> ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' |
> DEFAULT } COMMENT 'comment'
>
> Example:
>
> ALTER SYSTEM SET maintenance_work_mem TO '1GB' COMMENT IS 'Increased to
> allow autovacuum to be more efficient';
>
> Potential issues:
>
> Does not use existing comment functionality. Alternate solution which would
> decrease functionality is:
>
> COMMENT ON SETTING setting IS 'comment';
>
> Looking forward, we may want to do the following:
>
> 1. Make it so any object can have a comment with creation, e.g;
>
> CREATE TABLE table () COMMENT IS '';
>
> 2. Make it so comments are appended not replaced.

Yeah, this is something I've also suggested previously:

https://www.postgresql.org/message-id/CAA-aLv50MZdjdVk_=Tep6na94dNmi1Y9XkCp3er7FQqvX=d...@mail.gmail.com

Thom




Re: [HACKERS] make async slave to wait for lsn to be replayed

2017-02-22 Thread Thom Brown
On 23 January 2017 at 11:56, Ivan Kartyshov  wrote:
> Thank you for reviews and suggested improvements.
> I rewrote patch to make it more stable.
>
> Changes
> ===
> I've made a few changes:
> 1) WAITLSN now doesn`t depend on snapshot
> 2) Check current replayed LSN rather than in xact_redo_commit
> 3) Add syntax WAITLSN_INFINITE '0/693FF800' - for infinite wait and
> WAITLSN_NO_WAIT '0/693FF800' for check if LSN was replayed as you
> advised.
> 4) Reduce the count of loops with GUCs (WalRcvForceReply() which in 9.5
> doesn`t exist).
> 5) Optimize loop that set latches.
> 6) Add two GUCs that helps us to configure influence on StartupXLOG:
> count_waitlsn (denominator to check not each LSN)
> interval_waitlsn (Interval in milliseconds to additional LSN check)
>
> Feedback
> 
> On 09/15/2016 05:41 AM, Thomas Munro wrote:
>>
>> You hold a spinlock in one arbitrary slot, but that
>> doesn't seem sufficient: another backend may also read it, compute a
>> new value and then write it, while holding a different spin lock.  Or
>> am I missing something?
>
>
> We acquire an individual spinlock on each member of array, so you cannot
> compute new value and write it concurrently.
>
> Tested
> ==
> We have tested it on different servers and OSes, in different cases and
> workloads. The new version is nearly as fast as vanilla on primary and has
> only a tiny influence on standby performance.
>
> Hardware:
> 144 Intel Cores with HT
> 3TB RAM
> all data on ramdisk
> primary + hotstandby  on the same node.
>
> A dataset was created with "pgbench -i -s 1000" command. For each round of
> test we pause replay on standby, make 100 transactions on primary with
> pgbench, start replay on standby and measure how long the replication gap
> takes to disappear under different standby workloads. The workload was "WAITLSN
> ('Very/FarLSN', 1000ms timeout)" followed by "select abalance from
> pgbench_accounts where aid = random_aid;"
> For vanilla 1000ms timeout was enforced on pgbench side by -R option.
> GUC waitlsn parameters was adopted for 1000ms timeout on standby with 35000
> tps rate on primary.
> interval_waitlsn = 500 (ms)
> count_waitlsn = 3
>
> On 200 clients, the slave catches up with the master as vanilla does,
> without significant delay.
> On 500 clients, the slave catches up with the master 3% slower than vanilla.
> On 1000 clients, 12% slower.
> On 5000 clients, 3 times slower, because that is far above our hardware's
> ability.
>
> How to use it
> ==
> WAITLSN ‘LSN’ [, timeout in ms];
> WAITLSN_INFINITE ‘LSN’;
> WAITLSN_NO_WAIT ‘LSN’;
>
> #Wait until LSN 0/303EC60 will be replayed, or 10 second passed.
> WAITLSN ‘0/303EC60’, 1;
>
> #Or same without timeout.
> WAITLSN ‘0/303EC60’;
> orfile:///home/vis/Downloads/waitlsn_10dev_v2.patch
> WAITLSN_INFINITE '0/693FF800';
>
> #To check if LSN is replayed can be used.
> WAITLSN_NO_WAIT '0/693FF800';
>
> Notice: WAITLSN will release on PostmasterDeath or Interruption events
> if they come earlier then target LSN or timeout.
>
>
> Thank you for reading, will be glad to get your feedback.

Could you please rebase your patch as it no longer applies cleanly.

Thanks

Thom




Re: [HACKERS] Hash support for grouping sets

2017-02-22 Thread Thom Brown
On 6 January 2017 at 03:48, Andrew Gierth  wrote:
> Herewith a patch for doing grouping sets via hashing or mixed hashing
> and sorting.
>
> The principal objective is to pick whatever combination of grouping sets
> has an estimated size that fits in work_mem, and minimizes the number of
> sorting passes we need to do over the data, and hash those.  (Yes, this
> is a knapsack problem.)
>
> This is achieved by adding another strategy to the Agg node; AGG_MIXED
> means that both hashing and (sorted or plain) grouping happens.  In
> addition, support for multiple hash tables in AGG_HASHED mode is added.
>
> Sample plans look like this:
>
> explain select two,ten from onek group by cube(two,ten);
>   QUERY PLAN
> --
>  MixedAggregate  (cost=0.00..89.33 rows=33 width=8)
>Hash Key: two, ten
>Hash Key: two
>Hash Key: ten
>Group Key: ()
>->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=8)
>
> explain select two,ten from onek group by two, cube(ten,twenty);
>   QUERY PLAN
> ---
>  HashAggregate  (cost=86.50..100.62 rows=162 width=12)
>Hash Key: two, ten, twenty
>Hash Key: two, ten
>Hash Key: two
>Hash Key: two, twenty
>->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=12)
>
> set work_mem='64kB';
> explain select count(*) from tenk1
>   group by grouping sets (unique1,thousand,hundred,ten,two);
>QUERY PLAN
> 
>  MixedAggregate  (cost=1535.39..3023.89 rows=2 width=28)
>Hash Key: two
>Hash Key: ten
>Hash Key: hundred
>Group Key: unique1
>Sort Key: thousand
>  Group Key: thousand
>->  Sort  (cost=1535.39..1560.39 rows=1 width=20)
>  Sort Key: unique1
>  ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=1 width=20)
> (10 rows)
>
> Columns with unhashable or unsortable data types are handled
> appropriately, though obviously every individual grouping set must end
> up containing compatible column types.
>
> There is one current weakness which I don't see a good solution for: the
> planner code still has to pick a single value for group_pathkeys before
> planning the input path.  This means that we sometimes can't choose a
> minimal set of sorts, because we may have chosen a sort order for a
> grouping set that should have been hashed, for example:
>
> explain select count(*) from tenk1
>   group by grouping sets (two,unique1,thousand,hundred,ten);
>QUERY PLAN
> 
>  MixedAggregate  (cost=1535.39..4126.28 rows=2 width=28)
>Hash Key: ten
>Hash Key: hundred
>Group Key: two
>Sort Key: unique1
>  Group Key: unique1
>Sort Key: thousand
>  Group Key: thousand
>->  Sort  (cost=1535.39..1560.39 rows=1 width=20)
>  Sort Key: two
>  ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=1 width=20)
>
> (3 sorts, vs. 2 in the previous example that listed the same grouping
> sets in different order)
>
> Current status of the work: probably still needs cleanup, maybe some
> better regression tests, and Andres has expressed some concern over the
> performance impact of additional complexity in advance_aggregates; it
> would be useful if this could be tested.  But it should be reviewable
> as-is.

This doesn't apply cleanly to latest master.  Could you please post a
rebased patch?

Thanks

Thom




Re: [HACKERS] CREATE SUBSCRIPTION uninterruptable

2017-02-17 Thread Thom Brown
On 17 February 2017 at 14:14, Peter Eisentraut
 wrote:
> On 2/16/17 09:44, Thom Brown wrote:
>> I've noticed that when creating a subscription, it can't be
>> interrupted.  One must wait until it times out, which takes just over
>> 2 minutes.  I'm guessing ALTER SUBSCRIPTION would have the same
>> problem.
>>
>> Shouldn't we have an interrupt for this?
>
> I think this is addressed by these patches:
> https://commitfest.postgresql.org/13/991/
>
> Please give them a try.

Okay, I've now tested those patches, and they do indeed fix this problem for me.

Thanks

Thom




Re: [HACKERS] VOPS: vectorized executor for Postgres: how to speedup OLAP queries more than 10 times without changing anything in Postgres executor

2017-02-16 Thread Thom Brown
On 16 February 2017 at 17:00, Konstantin Knizhnik
 wrote:
> More progress in the vectorized Postgres extension (VOPS). It is no longer
> required to use special functions in queries.
> You can use vector operators in queries with standard SQL and still get a
> tenfold improvement on some queries.
> The VOPS extension now uses the post-parse-analyze hook to transform the query.
> I am really impressed by the flexibility and extensibility of the Postgres
> type system. User-defined types, operators and casts do most of the work.
>
> It is still the responsibility of the programmer or database administrator to
> create proper projections of the original table. These projections need to
> use tile types for some attributes (vops_float4, ...).
> Then you can query this table using standard SQL, and the query will be
> executed using vector operations!
>
> Example of such TPC-H queries:
>
> Q1:
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
> sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> vops_lineitem_projection
> where
> l_shipdate <= '1998-12-01'::date
> group by
> l_returnflag,
> l_linestatus
> order by
> l_returnflag,
> l_linestatus;
>
>
>
> Q6:
> select
> sum(l_extendedprice*l_discount) as revenue
> from
> lineitem_projection
> where
> l_shipdate between '1996-01-01'::date and '1997-01-01'::date
> and l_discount between 0.08 and 0.1
> and l_quantity < 24;

>
> On 13.02.2017 17:12, Konstantin Knizhnik wrote:
>>
>> Hello hackers,
>>
>> There were many discussions concerning  possible ways of speeding-up
>> Postgres. Different approaches were suggested:
>>
>> - JIT (now we have three different prototype implementations based on
>> LLVM)
>> - Chunked (vectorized) executor
>> - Replacing pull with push
>> - Columnar store (cstore_fdw, IMCS)
>> - Optimizing and improving current executor (reducing tuple deform
>> overhead, function call overhead,...)
>>
>> Obviously the best result can be achieved by combining all of these
>> approaches. But actually they are more or less interchangeable: vectorized
>> execution does not eliminate interpretation overhead, but the overhead is
>> divided by the vector size and becomes less critical.
>>
>> I decided to write small prototype to estimate possible speed improvement
>> of vectorized executor. I created special types representing "tile" and
>> implement standard SQL operators for them. So neither Postgres planer,
>> nether Postgres executor, nether Postgres heap manager are changed. But I
>> was able to reach more than 10 times speed improvement on TPC-H Q1/Q6
>> queries!

Impressive work!

Thom




[HACKERS] Partitioning vs ON CONFLICT

2017-02-16 Thread Thom Brown
Hi,

At the moment, partitioned tables have a restriction that prevents
them allowing INSERT ... ON CONFLICT ... statements:

postgres=# INSERT INTO cities SELECT 1, 'Crawley',105000 ON CONFLICT
(city_id) DO NOTHING;
ERROR:  ON CONFLICT clause is not supported with partitioned tables
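
A minimal setup that hits this (a sketch; the table definition here is
borrowed from the documentation's cities example rather than my original
test):

CREATE TABLE cities (
    city_id    bigserial NOT NULL,
    name       text NOT NULL,
    population bigint
) PARTITION BY LIST (left(lower(name), 1));
CREATE TABLE cities_c PARTITION OF cities FOR VALUES IN ('c');

INSERT INTO cities SELECT 1, 'Crawley', 105000
    ON CONFLICT (city_id) DO NOTHING;
-- ERROR:  ON CONFLICT clause is not supported with partitioned tables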

Why do we have such a restriction?  And what would it take to remove it?

Thanks

Thom




[HACKERS] CREATE SUBSCRIPTION uninterruptable

2017-02-16 Thread Thom Brown
Hi,

I've noticed that when creating a subscription, it can't be
interrupted.  One must wait until it times out, which takes just over
2 minutes.  I'm guessing ALTER SUBSCRIPTION would have the same
problem.

Shouldn't we have an interrupt for this?

Thom




[HACKERS] 2 doc typos

2017-02-16 Thread Thom Brown
Hi,

Please find attached a patch to fix 2 typos.

1) s/mypubclication/mypublication/

2) Removed trailing comma from last column definition in example.

Thanks

Thom
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 59e5ad0..250806f 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -142,7 +142,7 @@ CREATE SUBSCRIPTION subscription_name
Create a subscription to a remote server that replicates tables in
-   the publications mypubclication and
+   the publications mypublication and
insert_only and starts replicating immediately on
commit:
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index e0f7cd9..41c08bb 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1539,7 +1539,7 @@ CREATE TABLE measurement_year_month (
 CREATE TABLE cities (
 city_id  bigserial not null,
 name text not null,
-population   bigint,
+population   bigint
 ) PARTITION BY LIST (left(lower(name), 1));
 
 



Re: [HACKERS] Logical decoding on standby

2017-01-23 Thread Thom Brown
On 5 January 2017 at 01:21, Craig Ringer  wrote:
> On 5 January 2017 at 09:19, Craig Ringer  wrote:
>
>> so here's a rebased series on top of master. No other changes.
>
> Now with actual patches.

Patch 5 no longer applies:

patching file src/include/pgstat.h
Hunk #1 FAILED at 745.
1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej

--- src/include/pgstat.h
+++ src/include/pgstat.h
@@ -745,7 +745,8 @@ typedef enum
WAIT_EVENT_SYSLOGGER_MAIN,
WAIT_EVENT_WAL_RECEIVER_MAIN,
WAIT_EVENT_WAL_SENDER_MAIN,
-   WAIT_EVENT_WAL_WRITER_MAIN
+   WAIT_EVENT_WAL_WRITER_MAIN,
+   WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;

 /* --

Could you rebase?

Thanks

Thom




Re: [HACKERS] Logical Replication WIP

2017-01-23 Thread Thom Brown
On 23 January 2017 at 01:11, Petr Jelinek  wrote:
> On 20/01/17 22:30, Petr Jelinek wrote:
>> Since it's not exactly straightforward to find when these need to be
>> initialized based on commands, I decided to move the initialization code
>> to exec_replication_command(), since that's always called before anything
>> else, which makes it much less error-prone (patch 0003).
>>
>> The 0003 should be backpatched all the way to 9.4 where multiple
>> commands started using those buffers.
>>
>
> Actually there is better place, the WalSndInit().
>
> Just to make it easier for PeterE (or whichever committer picks this up)
> I attached all the logical replication followup fix/polish patches:
>
> 0001 - Changes the libpqrcv_connect to use async libpq api so that it
> won't get stuck forever in case of connect is stuck. This is preexisting
> bug that also affects walreceiver but it's less visible there as there
> is no SQL interface to initiate connection there.
>
> 0002 - Close replication connection when CREATE SUBSCRIPTION gets
> canceled (otherwise walsender on the other side may stay in idle in
> transaction state).
>
> 0003 - Fixes buffer initialization in walsender that I found when
> testing the above two. This one should be back-patched to 9.4 since it's
> broken since then.
>
> 0004 - Fixes the foreign key issue reported by Thom Brown and also adds
> tests for FK and trigger handling.

This fixes the problem for me.  Thanks.
>
> 0005 - Adds support for renaming publications and subscriptions.

Works for me.

I haven't tested the first 3.

Regards

Thom




[HACKERS] Logical replication failing when foreign key present

2017-01-22 Thread Thom Brown
Hi,

There's an issue which I haven't seen documented as expected
behaviour, where replicating data to a table which has a foreign key
results in a replication failure.  This produces the following log
entries:

LOG:  starting logical replication worker for subscription "contacts_sub"
LOG:  logical replication apply for subscription "contacts_sub" has started
ERROR:  AfterTriggerSaveEvent() called outside of query
LOG:  worker process: logical replication worker for subscription
16408 (PID 19201) exited with exit code 1


Reproducible test case:

On both instances:

  CREATE TABLE b (bid int PRIMARY KEY);
  CREATE TABLE a (id int PRIMARY KEY, bid int REFERENCES b (bid));
  INSERT INTO b (bid) VALUES (1);

First instance:
  CREATE PUBLICATION a_pub FOR TABLE a;

Second instance:
  CREATE SUBSCRIPTION a_sub CONNECTION 'host=127.0.0.1 port=5530
user=thom dbname=postgres' PUBLICATION a_pub;

First instance:
 INSERT INTO a (id, bid) VALUES (1,1);


Regards

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-24 Thread Thom Brown
On 20 October 2016 at 01:15, Robert Haas  wrote:
>
> On Wed, Oct 19, 2016 at 12:04 PM, Robert Haas  wrote:
> > On Thu, Oct 13, 2016 at 5:53 AM, Victor Wagner  wrote:
> >> On Thu, 13 Oct 2016 12:30:59 +0530
> >> Mithun Cy  wrote:
> >>> On Fri, Sep 30, 2016 at 2:14 PM, Victor Wagner 
> >>> wrote:
> >>> Okay but for me consistency is also important. Since we agree to
> >>> disagree on some of the comments and others have not expressed any
> >>> problem I am moving it to committer.
> >>
> >> Thank you for your efforts improving my patch
> >
> > I haven't deeply reviewed this patch, but on a quick read-through it
> > certainly seems to need a lot of cleanup work.
>
> Some more comments:
>
> - I am pretty doubtful that the changes to connectOptions2() work as
> intended.  I think that's only going to be called before actualhost
> and actualport could possibly get set.  I don't think we keep
> reinvoking this function for every new host we try.
>
> - It's pretty clear that this isn't going to work if the host list
> includes a mix of hostnames and UNIX socket addresses.  The code that
> handles the UNIX socket address case is totally separate from the code
> that handles the multiple-hostname case.

So should it be the case that it disallows UNIX socket addresses
entirely?  I can't imagine a list of UNIX socket addresses being that
useful.

Thom




Re: [HACKERS] make check-world failing for me

2016-10-21 Thread Thom Brown
On 22 October 2016 at 01:52, Tom Lane  wrote:

> Thom Brown  writes:
> > I'm using the latest version of Linux Mint Debian Edition, having
> recently
> > upgraded from an older version, and now I can't get make check-world to
> > finish successfully.
>
> > The initial theory was that the version of Bison I'm using (3.0.2) is to
> > blame, but I've also tested it against 3.0.1, 2.7.1 and 2.3, but I get
> the
> > same problem:
>
> Don't see how it would be Bison's fault.  What this looks like to me is
> that you've got a corrupted copy of src/test/perl/PostgresNode.pm,
> because this:
>
> > syntax error at
> > /home/thom/Development/postgresql/src/test/modules/
> commit_ts/../../../../src/test/perl/PostgresNode.pm
> > line 437, near ")
> > print"
>
> doesn't seem to have anything to do with what's actually at line 437
> in that file (at least not in current HEAD).  I don't see any close
> matches elsewhere in that file, either.
>

I've just taken a fresh clone of the PostgreSQL git repo, and it appears
you're right.  All the tests appear to be passing now, so sorry for the
noise.

Thanks

Thom


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-10-19 Thread Thom Brown
On 13 October 2016 at 10:53, Victor Wagner  wrote:
> On Thu, 13 Oct 2016 12:30:59 +0530
> Mithun Cy  wrote:
>
>> On Fri, Sep 30, 2016 at 2:14 PM, Victor Wagner 
>> wrote:
>
>> Okay but for me consistency is also important. Since we agree to
>> disagree on some of the comments and others have not expressed any
>> problem I am moving it to committer.
>
>
> Thank you for your efforts improving my patch

I've found various spelling and grammatical issues in the docs
section, which I've corrected.  I've attached a revised patch with
these changes.

One thing I'm wondering is whether we should be using the term
"master", as that's usually paired with "slave", whereas, nowadays,
there's a tendency to refer to them as "primary" and "standby".  I
know we use "master" in some other places in the documentation, but it
seems inconsistent to have "master" as a parameter value, but then
having "primary" used in some other configuration parameters.

I'd also avoid referring to "the library", and just describe what
happens without making reference to what's making it happen.

Regards

Thom
1,6d0
< commit 52270559f28ab673e14124b813b7cdfb772cd1bd
< Author: mithun 
< Date:   Thu Oct 13 12:07:15 2016 +0530
< 
< libpq10
< 
8c2
< index 4e34f00..fd2fa88 100644
---
> index 4e34f00..adbad75 100644
44c38
< +   There can be serveral host specifications, optionally accompanied
---
> +   There can be several host specifications, optionally accompanied
56,59c50,53
< +   the connect string. In this case these hosts would be considered
< +   alternate entries into same database and if connect to first one
< +   fails, library would try to connect second etc. This can be used
< +   for high availability cluster or for load balancing. See
---
> +   the connection string. In this case these hosts would be considered
> +   alternate entries into same database and if connection to first one
> +   fails, the second would be attemped, and so on. This can be used
> +   for high availability clusters or for load balancing. See the
63,66c57,60
< +   Network host name can be accompanied with port number, separated by
< +   colon. If so, this port number is used only when connected to
< +   this host. If there is no port number, port specified in the
< +parameter would be used.
---
> +   The network host name can be accompanied by a port number, separated 
> by
> +   colon. This port number is used only when connected to
> +   this host. If there is no port number, the port specified in the
> +parameter would be used instead.
71c65
< @@ -943,7 +964,43 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
---
> @@ -943,7 +964,42 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
79c73
< +  Specifies how to choose host from list of alternate hosts,
---
> +  Specifies how to choose the host from the list of alternate hosts,
83,86c77,79
< +  If value of this argument is sequential (the
< +  default) library connects to the hosts in order they specified,
< +  and tries to connect second one only if connection to the first
< +  fails.
---
> +  If the value of this argument is sequential (the
> +  default) connections to the hosts will be attempted in the order in
> +  which they are specified.
89,93c82,86
< +  If value is random host to connect is randomly
< +  picked from the list. It allows to balance load between several
< +  cluster nodes. However, currently PostgreSQL doesn't support
< +  multimaster clusters. So, without use of third-party products,
< +  only read-only connections can take advantage from the
---
> +  If the value is random, the host to connect to
> +  will be randomly picked from the list. It allows load balancing between
> +  several cluster nodes. However, PostgreSQL doesn't currently support
> +  multimaster clusters. So, without the use of third-party products,
> +  only read-only connections can take advantage from
102,104c95,97
< +  If this parameter is master upon successful 
connection
< +  library checks if host is in recovery state, and if it is so,
< +  tries next host in the connect string. If this parameter is
---
> +  If this parameter is master, upon successful 
> connection
> +  the host is checked to determine whether it is in a recovery state.  
> If it
> +  is, it then tries next host in the connection string. If this 
> parameter is
115c108
< @@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
---
> @@ -985,7 +1041,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
123c116
< @@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
---
> @@ -996,7 +1051,28 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
132c125
< + Maximum time to cyclically retry all the hosts in connect string.
---
> + Maximum time to cyclically retry all the hosts in the connection string.
138,141c131,134
< + take some time to promote one of

Re: [HACKERS] Add support for restrictive RLS policies

2016-09-01 Thread Thom Brown
On 1 September 2016 at 10:02, Robert Haas  wrote:
> On Thu, Sep 1, 2016 at 12:04 PM, Stephen Frost  wrote:
>> As outlined in the commit message, this adds support for restrictive RLS
>> policies.  We've had this in the backend since 9.5, but they were only
>> available via hooks and therefore extensions.  This adds support for
>> them to be configured through regular DDL commands.  These policies are,
>> essentially "AND"d instead of "OR"d.
>>
>> Includes updates to the catalog, grammar, psql, pg_dump, and regression
>> tests.  Documentation will be added soon, but until then, would be great
>> to get feedback on the grammar, catalog and code changes.
>
> I don't like CREATE RESTRICT POLICY much.  It's not very good grammar,
> for one thing.  I think putting the word RESTRICT, or maybe AS
> RESTRICT, somewhere later in the command would be better.
>
> I also think that it is very strange to have the grammar keyword be
> "restrict" but the internal flag be called "permissive".  It would be
> better to have the sense of those flags match.
>
> (This is not intended as a full review, just a quick comment.)

I had proposed this sort of functionality a couple years back:
https://www.depesz.com/2014/10/02/waiting-for-9-5-row-level-security-policies-rls/#comment-187800

And I suggested CREATE RESTRICTIVE POLICY, but looking back at that,
perhaps you're right, and it would be better to add it later in the
command.
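
To illustrate what I mean, a purely hypothetical sketch (the table, policy
name and predicate are all made up, and the exact grammar is of course up
for debate):

-- "AS RESTRICTIVE" placed after the policy name and table, alongside the
-- existing CREATE POLICY options
CREATE POLICY deny_other_owners ON accounts
    AS RESTRICTIVE
    USING (owner = current_user);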

Thom




Re: [HACKERS] Refactoring of heapam code.

2016-08-05 Thread Thom Brown
On 5 August 2016 at 08:54, Anastasia Lubennikova
 wrote:
> Working on page compression and some other issues related to
> access methods, I found out that the code related to heap
> looks too complicated. Much more complicated, than it should be.
> Since I anyway got into this area, I want to suggest a set of improvements.
>
> There is a number of problems I see in the existing code:
> Problem 1. Heap != Relation.
>
> File heapam.c has a note:
>  *  This file contains the heap_ routines which implement
>  *  the POSTGRES heap access method used for all POSTGRES
>  *  relations.
> This statement is wrong, since we also have index relations,
> that are implemented using other access methods.
>
> Also I think that it's quite strange to have a function heap_create(), that
> is called
> from index_create(). It has absolutely nothing to do with heap access
> method.
>
> And so on, and so on.
>
> One more thing that requires refactoring is "RELKIND_RELATION" name.
> We have a type of relation called "relation"...
>
> Problem 2.
> Some functions are shared between heap and indexes access methods.
> Maybe someday it used to be handy, but now, I think, it's time to separate
> them, because IndexTuples and HeapTuples have really little in common.
>
> Problem 3. The word "heap" is used in many different contexts.
> Heap - a storage where we have tuples and visibility information.
> Generally speaking, it can be implemented over any data structure,
> not just a plain table as it is done now.
>
> Heap - an access method, that provides a set of routines for plain table
> storage, such as insert, delete, update, gettuple, vacuum and so on.
> We use this access method not only for user's tables.
>
> There are many types of relations (pg_class.h):
> #define  RELKIND_RELATION'r'/* ordinary table */
> #define  RELKIND_INDEX   'i'/* secondary index */
> #define  RELKIND_SEQUENCE'S'/* sequence object */
> #define  RELKIND_TOASTVALUE  't'/* for out-of-line
> values */
> #define  RELKIND_VIEW'v'/* view */
> #define  RELKIND_COMPOSITE_TYPE  'c'/* composite type */
> #define  RELKIND_FOREIGN_TABLE   'f'/* foreign table */
> #define  RELKIND_MATVIEW 'm'/* materialized view */
>
> They can be logically separated into three categories:
> "primary storage" - r, S, t, v. They store data and visibility information.
> The only implementation is heapam.c
> "secondary index" - i. They store some data and pointers to primary storage.
> Various existing AMs and managed by AM API.
> "virtual relations" - c, f, m. They have no physical storage, only entries
> in caches and catalogs.
>
> Now, for some reasons, we sometimes use name "heap" for both
> "primary storage" and "virual relations". See src/backend/catalog/heap.c.
> But some functions work only with "primary storage". See pgstat_relation().
> And we constantly have to check relkind everywhere.
> I'd like it would be clear from the code interface and naming.
>
> So there is a couple of patches. They do not cover all mentioned problems,
> but I'd like to get a feedback before continuing.
>
> Patch 1:
> There is a macro "PageGetItem" in bufpage.h that is used to retrieve an item
> from the given page. It is used for both heap and index tuples.
> It doesn't seem a problem, until you try to find anything in this
> code.
>
> The first patch "PageGetItem_refactoring.patch" is intended to fix it.
> It is just renaming:
> (IndexTuple) PageGetItem ---> PageGetItemIndex
> (HeapTupleHeader) PageGetItem ---> PageGetItemHeap
> Other types of tuples (BrinTuple, SpGistInnerTuple, SpGistLeafTuple,
> SpGistDeadTuple)
> still use PageGetItem.
>
> I also changed functions that do not access tuple's data, but only need
> HeapTupleHeader fields. They use a macro PageGetItemHeapHeaderOnly.
>
> I do not insist on these particular names, by the way.
>
> Patch 2.
> heapam.c, hio.c and src/backend/catalog/heap.c have a lot of confusing names
> and outdated comments. The patch is intended to improve it.
> Fun fact: I found several "check it later" comments unchanged since
>  "Postgres95 1.01 Distribution - Virgin Sources" commit.
>
> I wonder if we can find a better name for relation_drop_with_catalog() since,
> again, it's
> not about all kinds of relations. We use another function to drop index
> relations.
>
> I hope these changes will be useful for the future development.
> Suggested patches are mostly about renaming and rearrangement of functions
> between files, so, if nobody has conceptual objections, all I need from
> reviewers
> is an attentive look to find typos, grammar mistakes and overlooked areas.

Could you add this to the commitfest?

Thom




Re: [HACKERS] Confusing recovery message when target not hit

2016-06-12 Thread Thom Brown
On 12 June 2016 at 12:51, Michael Paquier  wrote:

> On Sun, Jun 12, 2016 at 7:52 PM, Thom Brown  wrote:
> > Aren't those already set by recoveryStopsBefore()?
>
> It is possible to exit the main redo loop if a NULL record is found
> after calling ReadRecord, in which case those would not be set, no?
>

I'm apprehensive about initialising those values myself as I don't want to
set them at a point where they may potentially already be set.

And I've noticed that I didn't re-read my own output messages, so I've
corrected them in the attached patch.

Thom
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e4645a3..d724d4b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7105,19 +7105,34 @@ StartupXLOG(void)
 		 * timeline changed.
 		 */
 		if (recoveryTarget == RECOVERY_TARGET_XID)
-			snprintf(reason, sizeof(reason),
-	 "%s transaction %u",
-	 recoveryStopAfter ? "after" : "before",
-	 recoveryStopXid);
+			if (recoveryStopXid == InvalidTransactionId)
+snprintf(reason, sizeof(reason),
+		"recovery finished without reaching recovery target xid of \"%u\"\n",
+		recoveryTargetXid);
+			else
+snprintf(reason, sizeof(reason),
+		 "%s transaction %u",
+		 recoveryStopAfter ? "after" : "before",
+		 recoveryStopXid);
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
-			snprintf(reason, sizeof(reason),
-	 "%s %s\n",
-	 recoveryStopAfter ? "after" : "before",
-	 timestamptz_to_str(recoveryStopTime));
+			if (recoveryStopTime == 0)
+snprintf(reason, sizeof(reason),
+		"recovery finished without reaching recovery target time of \"%s\"\n",
+		timestamptz_to_str(recoveryTargetTime));
+			else
+snprintf(reason, sizeof(reason),
+		 "%s %s\n",
+		 recoveryStopAfter ? "after" : "before",
+		 timestamptz_to_str(recoveryStopTime));
 		else if (recoveryTarget == RECOVERY_TARGET_NAME)
-			snprintf(reason, sizeof(reason),
-	 "at restore point \"%s\"",
-	 recoveryStopName);
+			if (*recoveryStopName == '\0')
+snprintf(reason, sizeof(reason),
+		"recovery finished without reaching recovery target name of \"%s\"\n",
+		recoveryTargetName);
+			else
+snprintf(reason, sizeof(reason),
+		 "at restore point \"%s\"",
+		 recoveryStopName);
 		else if (recoveryTarget == RECOVERY_TARGET_IMMEDIATE)
 			snprintf(reason, sizeof(reason), "reached consistency");
 		else



Re: [HACKERS] Confusing recovery message when target not hit

2016-06-12 Thread Thom Brown
On 11 June 2016 at 13:22, Michael Paquier  wrote:

> On Sat, Jun 11, 2016 at 9:44 AM, Thom Brown  wrote:
> > It may be the wrong way of going about it, but you get the idea of what
> I'm
> > suggesting we output instead.
>
> Surely things could be better. So +1 to be more verbose here.
>
> +if (recoveryStopTime == 0)
> +snprintf(reason, sizeof(reason),
> +"recovery reached consistency before recovery
> target time of \"%s\"\n",
> +timestamptz_to_str(recoveryTargetTime));
> "Reaching consistency" is not exact for here. I'd rather say "finished
> recovery without reaching target blah"
>

Yeah, sounds fine.


>
> +if (recoveryStopXid == 0)
> Checking for InvalidTransactionId is better here.
>

Agreed.


> And it would be good to initialize recoveryStopTime and
> recoveryStopXid as those are set only when a recovery target is
> reached.
>

Aren't those already set by recoveryStopsBefore()?

Revised patch attached, with new wording and covering recovery target name
case.

Thom
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e4645a3..4033a2d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7105,19 +7105,34 @@ StartupXLOG(void)
 		 * timeline changed.
 		 */
 		if (recoveryTarget == RECOVERY_TARGET_XID)
-			snprintf(reason, sizeof(reason),
-	 "%s transaction %u",
-	 recoveryStopAfter ? "after" : "before",
-	 recoveryStopXid);
+			if (recoveryStopXid == InvalidTransactionId)
+snprintf(reason, sizeof(reason),
+		"recovery finished withoutrecovery target xid of \"%u\"\n",
+		recoveryTargetXid);
+			else
+snprintf(reason, sizeof(reason),
+		 "%s transaction %u",
+		 recoveryStopAfter ? "after" : "before",
+		 recoveryStopXid);
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
-			snprintf(reason, sizeof(reason),
-	 "%s %s\n",
-	 recoveryStopAfter ? "after" : "before",
-	 timestamptz_to_str(recoveryStopTime));
+			if (recoveryStopTime == 0)
+snprintf(reason, sizeof(reason),
+		"recovery finished without recovery target time of \"%s\"\n",
+		timestamptz_to_str(recoveryTargetTime));
+			else
+snprintf(reason, sizeof(reason),
+		 "%s %s\n",
+		 recoveryStopAfter ? "after" : "before",
+		 timestamptz_to_str(recoveryStopTime));
 		else if (recoveryTarget == RECOVERY_TARGET_NAME)
-			snprintf(reason, sizeof(reason),
-	 "at restore point \"%s\"",
-	 recoveryStopName);
+			if (*recoveryStopName == '\0')
+snprintf(reason, sizeof(reason),
+		"recovery finished without reaching recovery target name of \"%s\"\n",
+		recoveryTargetName);
+			else
+snprintf(reason, sizeof(reason),
+		 "at restore point \"%s\"",
+		 recoveryStopName);
 		else if (recoveryTarget == RECOVERY_TARGET_IMMEDIATE)
 			snprintf(reason, sizeof(reason), "reached consistency");
 		else



[HACKERS] Confusing recovery message when target not hit

2016-06-10 Thread Thom Brown
Hi all,

When recovery_target_time is set, but recovery finishes before it reaches
that time, it outputs "before 2000-01-01 00:00:00+00" to the .history
file.  This is because it uses recoveryStopTime, which is initialised to 0,
but is never set, and is then passed to timestamptz_to_str, which gives it
this date output.

A similar problem exists for recovery_target_xid.  When recovery finishes
before reaching the specified xid, it outputs "before transaction 0" to the
.history file, which is also confusing.

Could we produce something more meaningful?  I've attached a patch which
changes it to say 'recovery reached consistency before recovery target time
of ""' and 'recovery reached consistency before
recovery target xid of ""'.

It may be the wrong way of going about it, but you get the idea of what I'm
suggesting we output instead.

Thom
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e4645a3..c68697a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7105,15 +7105,25 @@ StartupXLOG(void)
 		 * timeline changed.
 		 */
 		if (recoveryTarget == RECOVERY_TARGET_XID)
-			snprintf(reason, sizeof(reason),
-	 "%s transaction %u",
-	 recoveryStopAfter ? "after" : "before",
-	 recoveryStopXid);
+			if (recoveryStopXid == 0)
+snprintf(reason, sizeof(reason),
+		"recovery reached consistency before recovery target xid of \"%u\"\n",
+		recoveryTargetXid);
+			else
+snprintf(reason, sizeof(reason),
+		 "%s transaction %u",
+		 recoveryStopAfter ? "after" : "before",
+		 recoveryStopXid);
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
-			snprintf(reason, sizeof(reason),
-	 "%s %s\n",
-	 recoveryStopAfter ? "after" : "before",
-	 timestamptz_to_str(recoveryStopTime));
+			if (recoveryStopTime == 0)
+snprintf(reason, sizeof(reason),
+		"recovery reached consistency before recovery target time of \"%s\"\n",
+		timestamptz_to_str(recoveryTargetTime));
+			else
+snprintf(reason, sizeof(reason),
+		 "%s %s\n",
+		 recoveryStopAfter ? "after" : "before",
+		 timestamptz_to_str(recoveryStopTime));
 		else if (recoveryTarget == RECOVERY_TARGET_NAME)
 			snprintf(reason, sizeof(reason),
 	 "at restore point \"%s\"",



Re: [HACKERS] gettimeofday is at the end of its usefulness?

2016-06-08 Thread Thom Brown
On 15 May 2014 at 19:56, Bruce Momjian  wrote:

> On Tue, May 13, 2014 at 06:58:11PM -0400, Tom Lane wrote:
> > A recent question from Tim Kane prompted me to measure the overhead
> > costs of EXPLAIN ANALYZE, which I'd not checked in awhile.  Things
> > are far worse than I thought.  On my current server (by no means
> > lavish hardware: Xeon E5-2609 @2.40GHz) a simple seqscan can run
> > at something like 110 nsec per row:
>
> I assume you ran pg_test_timing too:
>
> Testing timing overhead for 3 seconds.
> Per loop time including overhead: 41.70 nsec
> Histogram of timing durations:
> < usec   % of total  count
>      1     95.83035   68935459
>      2      4.16923    2999133
>      4      0.00037        268
>      8      0.00004         31
>     16      0.00000          1
>     32      0.00000          1
>
> My overhead of 41.70 nsec matches yours.
>

Did this idea die, or is it still worth considering?

Thom


Re: [HACKERS] chkpass_in should not be volatile

2016-06-03 Thread Thom Brown
On 3 June 2016 at 15:26, David G. Johnston 
wrote:

> On Fri, Jun 3, 2016 at 9:56 AM, Tom Lane  wrote:
>
>> Thom Brown  writes:
>> > ...or at least according to the warning message:
>> > postgres=# CREATE EXTENSION chkpass ;
>> > WARNING:  type input function chkpass_in should not be volatile
>>
>> See thread here:
>>
>>
>> https://www.postgresql.org/message-id/flat/CACfv%2BpL2oX08SSZSoaHpyC%3DUbfTFmPt4UmVEKJTH7y%3D2QMRCBw%40mail.gmail.com
>>
>> Given the lack of complaints so far, maybe we could think about redefining
>> the behavior of chkpass_in.  I'm not very sure to what, though.
>>
>
> Thom, how did you end up encountering this?
>

I built the extension and tried to create it.  Not really anything other
than that.

Thom


[HACKERS] Problem with dumping bloom extension

2016-06-03 Thread Thom Brown
Hi,

If a database with the bloom extension installed is dumped and restored,
there's an error with the access method creation:

createdb bloomtest
psql -c 'CREATE EXTENSION bloom;' bloomtest
pg_dump -d bloomtest > ~/tmp/bloom.sql
createdb bloomtest2
psql -d bloomtest2 -f ~/tmp/bloom.sql

The output of the last command produces:

"psql:/home/thom/tmp/bloom.sql:48: ERROR:  access method "bloom" already
exists"

So pg_dump shouldn't be dumping this access method as it's part of the
extension.
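
For reference, a rough way to check whether the access method is actually
recorded as a member of the extension, using only the standard pg_depend
and pg_extension catalogs:

-- objects recorded as belonging to the bloom extension
SELECT classid::regclass AS catalog, objid, deptype
FROM pg_depend
WHERE refclassid = 'pg_extension'::regclass
  AND refobjid = (SELECT oid FROM pg_extension WHERE extname = 'bloom');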

Regards

Thom


[HACKERS] chkpass_in should not be volatile

2016-06-03 Thread Thom Brown
...or at least according to the warning message:

postgres=# CREATE EXTENSION chkpass ;
WARNING:  type input function chkpass_in should not be volatile
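
(For anyone checking, the volatility marking itself is easy to inspect with
a plain catalog query; nothing here is specific to chkpass beyond the
function name:

SELECT proname, provolatile FROM pg_proc WHERE proname = 'chkpass_in';

provolatile is 'v' for volatile, 's' for stable and 'i' for immutable.)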

Thom


Re: [HACKERS] array of domain types

2016-06-02 Thread Thom Brown
On 2 June 2016 at 10:13, konstantin knizhnik 
wrote:

>
> On Jun 1, 2016, at 4:37 PM, Thom Brown wrote:
>
> On 1 June 2016 at 14:20, Konstantin Knizhnik 
> wrote:
>
>> I wonder why domain types can not be used for specification of array
>> element:
>>
>> create domain objref as bigint;
>> create table foo(x objref[]);
>> ERROR:  type "objref[]" does not exist
>> create table foo(x bigint[]);
>> CREATE TABLE
>>
>> Is there some principle problem here or it is just not implemented?
>
>
> It's not implemented, but patches welcome.
>
> Thom
>
>
> The patch is trivial:  just use typbasetype in get_array_type if typtype
> is TYPTYPE_DOMAIN:
>
> diff --git a/src/backend/utils/cache/lsyscache.c
> b/src/backend/utils/cache/lsyscache.c
> index cb26d79..ecfbb20 100644
> --- a/src/backend/utils/cache/lsyscache.c
> +++ b/src/backend/utils/cache/lsyscache.c
> @@ -2486,7 +2486,18 @@ get_array_type(Oid typid)
> if (HeapTupleIsValid(tp))
> {
> result = ((Form_pg_type) GETSTRUCT(tp))->typarray;
> -   ReleaseSysCache(tp);
> +   if (result == InvalidOid && ((Form_pg_type)
> GETSTRUCT(tp))->typtype == TYPTYPE_DOMAIN) {
> +   typid = ((Form_pg_type)
> GETSTRUCT(tp))->typbasetype;
> +   ReleaseSysCache(tp);
> +   tp = SearchSysCache1(TYPEOID,
> ObjectIdGetDatum(typid));
> +   if (HeapTupleIsValid(tp))
> +   {
> +   result = ((Form_pg_type)
> GETSTRUCT(tp))->typarray;
> +   ReleaseSysCache(tp);
> +   }
> +   } else {
> +   ReleaseSysCache(tp);
> +   }
> }
> return result;
>  }
>
>
> Any problems with it?
>
>
Yes, it doesn't work:

# CREATE DOMAIN teenager AS int CHECK (VALUE BETWEEN 13 AND 19);
CREATE DOMAIN

# SELECT 14::teenager;
 teenager
--
   14
(1 row)

# SELECT 20::teenager;
ERROR:  value for domain teenager violates check constraint "teenager_check"

# SELECT '{14,20}'::teenager[];
 teenager
--
 {14,20}
(1 row)

That last one should fail.

Thom


Re: [HACKERS] array of domain types

2016-06-01 Thread Thom Brown
On 1 June 2016 at 14:20, Konstantin Knizhnik 
wrote:

> I wonder why domain types can not be used for specification of array
> element:
>
> create domain objref as bigint;
> create table foo(x objref[]);
> ERROR:  type "objref[]" does not exist
> create table foo(x bigint[]);
> CREATE TABLE
>
> Is there some principle problem here or it is just not implemented?


It's not implemented, but patches welcome.

Thom


Re: [HACKERS] Adding an alternate syntax for Phrase Search

2016-05-22 Thread Thom Brown
On 22 May 2016 at 18:52, Josh berkus  wrote:
> Folks,
>
> This came up at pgCon.
>
> The 'word <-> word <-> word' syntax for phrase search is not
> developer-friendly.  While we need the <-> operator for SQL and for the
> sophisticated cases, it would be really good to support an alternate
> syntax for the simplest case of "words next to each other".  My proposal
> is enclosing the phrase in double-quotes, which would be intuitive to
> users and familiar from search engines.  Thus:
>
> to_tsquery(' Berkus & "PostgreSQL Version 10.0" ')
>
> ... would be equivalent to:
>
> to_tsquery(' Berkus & ( PostgreSQL <-> version <-> 10.0 )')
>
> I realize we're already in beta, but pgCon was actually the first time I
> saw the new syntax.  I think if we don't do this now, we'll be doing it
> for 10.0.

I think it's way too late for that.  I don't see a problem with
including it for 10.0, but with the feature freeze long passed and our
first beta already out, it's too late to be changing the design or
adding functionality, unless there's something that absolutely requires
modification.  This isn't that.

Thom




Re: [HACKERS] 10.0

2016-05-13 Thread Thom Brown
On 13 May 2016 at 16:29, Dave Page  wrote:
> On Fri, May 13, 2016 at 4:23 PM, Thom Brown  wrote:
>>
>> Well, one potential issue is that there may be projects which have
>> already coded in 9.6 checks for feature support.
>
> I suspect that won't be an issue (I never heard of it being for 7.5,
> which was released as 8.0 - but is smattered all over pgAdmin 3 for
> example) - largely because in such apps we're almost always checking
> for a version greater than or less than x.y.
>
> I imagine the bigger issue will be apps that have been written
> assuming the first part of the version number is only a single digit.

Is that likely?  That would be remarkably myopic, but I guess possible.

Thom




Re: [HACKERS] 10.0

2016-05-13 Thread Thom Brown
On 13 May 2016 at 16:19, Bruce Momjian  wrote:
> On Fri, May 13, 2016 at 11:05:23AM -0400, Robert Haas wrote:
>> Hi,
>>
>> There is a long-running thread on pgsql-hackers on whether 9.6 should
>> instead be called 10.0.  Initially, opinions were mixed, but consensus
>> seems now to have emerged that 10.0 is a good choice, with the major
>> hesitation being that we've already released 9.6beta1, and therefore
>> we might not want to change at this point.  That doesn't seem like an
>> insuperable barrier to me, but I think it's now time for the
>> discussion on this topic to move here, because:
>>
>> 1. Some people who have strong opinions may not have followed the
>> discussion on pgsql-advocacy, and
>>
>> 2. If we're going to rebrand this as 10.0, the work will have to get done 
>> here.
>>
>> The major arguments advanced in favor of 10.0 are:
>>
>> - There are a lot of exciting features in this release.
>>
>> - Even if you aren't super-excited by the features in this release,
>> PostgreSQL 9.6/10.0 is a world away from 10.0, and therefore it makes
>
> I think you meant "a world away from 9.0".
>
> Actually, I don't see the distance from 9.0 as a valid argument as 9.5
> was probably also a world away from 9.0.
>
> I prefer calling 9.7 as 10.0 because there will be near-zero-downtime
> major upgrades with pg_logical (?), and parallelism will cover more
> cases.  Built-in logical replication in 9.7 would be big too, and
> another reason to do 9.7 as 10.0.
>
> On the other hand, the _start_ of parallelism in 9.6 could be enough of
> a reason to call it 10.0, with the idea that the 10-series is
> increasingly parallel-aware.  You could argue that parallelism is a much
> bigger deal than near-zero-downtime upgrades.
>
> I think the fundamental issue is whether we want to lead the 10.0 branch
> with parallelism, or wait for an administrative change like
> near-zero-downtime major upgrades and built-in logical replication.
>
>> sense to bump the version based on the amount of accumulated change
>> between then and now.
>>
>> Thoughts?  Is it crazy to go from 9.6beta1 to 10.0beta2?  What would
>> actually be involved in making the change?
>
> Someone mentioned how Postgres 8.5 became 9.0, but then someone else
> said the change was made during alpha releases, not beta.  Can someone
> dig up the details?

We had 8.5 alpha 3, then 9.0 alpha 4:

REL8_5_ALPHA1
REL8_5_ALPHA2
REL8_5_ALPHA3
REL9_0_ALPHA4
REL9_0_ALPHA5
REL9_0_BETA1
REL9_0_BETA2
REL9_0_BETA3
REL9_0_BETA4
REL9_0_RC1

Thom




Re: [HACKERS] 10.0

2016-05-13 Thread Thom Brown
On 13 May 2016 at 16:05, Robert Haas  wrote:
>
> Hi,
>
> There is a long-running thread on pgsql-hackers on whether 9.6 should
> instead be called 10.0.  Initially, opinions were mixed, but consensus
> seems now to have emerged that 10.0 is a good choice, with the major
> hesitation being that we've already released 9.6beta1, and therefore
> we might not want to change at this point.  That doesn't seem like an
> insuperable barrier to me, but I think it's now time for the
> discussion on this topic to move here, because:
>
> 1. Some people who have strong opinions may not have followed the
> discussion on pgsql-advocacy, and
>
> 2. If we're going to rebrand this as 10.0, the work will have to get done 
> here.
>
> The major arguments advanced in favor of 10.0 are:
>
> - There are a lot of exciting features in this release.

True dat.

> - Even if you aren't super-excited by the features in this release,
> PostgreSQL 9.6/10.0 is a world away from 10.0, and therefore it makes
> sense to bump the version based on the amount of accumulated change
> between then and now.

Well, a .6 release would be a first; previous major series ended at:

6.5
7.4
8.4

So 9.6 would be the very first .6.  I think the argument of "accumulated
change" is persuasive.

> Thoughts?  Is it crazy to go from 9.6beta1 to 10.0beta2?  What would
> actually be involved in making the change?

Well, one potential issue is that there may be projects which have
already coded in 9.6 checks for feature support.  I don't know if
there would be any problems from the repo side of things for
beta-testers.
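
For what it's worth, checks that compare the numeric server version rather
than parsing the "9.6"/"10.0" text shouldn't care either way.  A minimal
sketch, assuming only the standard server_version_num setting:

SELECT current_setting('server_version_num')::int >= 90600 AS has_96_features;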

Thom




Re: [HACKERS] [BUGS] Breakage with VACUUM ANALYSE + partitions

2016-05-04 Thread Thom Brown
On 4 May 2016 at 09:59, Andres Freund  wrote:

> On 2016-04-28 17:41:29 +0100, Thom Brown wrote:
> > I've noticed another breakage, which I can reproduce consistently.
>
> Thanks for the testing!  I pushed a fix for this. This wasn't actually
> an issue in the original patch, but a too strict test added in my fix.
>

Re-testing shows that this appears to have solved the issue.

Thanks

Thom


Re: [HACKERS] [BUGS] Breakage with VACUUM ANALYSE + partitions

2016-04-25 Thread Thom Brown
On 22 April 2016 at 18:07, Andres Freund  wrote:
> On 2016-04-14 11:09:29 +0530, Abhijit Menon-Sen wrote:
>> At 2016-04-12 09:00:57 -0400, robertmh...@gmail.com wrote:
>> >
>> > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund  wrote:
>> > >
>> > > 3) Actually handle the case of the last open segment not being
>> > >RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>> >
>> > #3 seems like it's probably about 15 years overdue, so let's do that
>> > anyway.
>>
>> I don't have anything useful to contribute, I'm just trying to
>> understand this fix.
>>
>> _mdfd_getseg() looks like this:
>>
>>   targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
>>   for (nextsegno = 1; nextsegno <= targetseg; nextsegno++)
>>   {
>>   Assert(nextsegno == v->mdfd_segno + 1);
>>
>>   if (v->mdfd_chain == NULL)
>>   {
>>   /*
>>* Normally we will create new segments only if 
>> authorized by the
>>* caller (i.e., we are doing mdextend()). […]
>>*/
>>   if (behavior == EXTENSION_CREATE || InRecovery)
>>   {
>>   if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
>> /* mdextend */
>>   v->mdfd_chain = _mdfd_openseg(reln, forknum, 
>> +nextsegno, O_CREAT);
>>   }
>>   else
>>   {
>>   /* We won't create segment if not existent */
>>   v->mdfd_chain = _mdfd_openseg(reln, forknum, 
>> nextsegno, 0);
>>   }
>>   if (v->mdfd_chain == NULL)
>> /* return NULL or ERROR */
>>   }
>>   v = v->mdfd_chain;
>>   }
>>
>> Do I understand correctly that the case of the "last open segment"
>> (i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
>> (i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
>> _mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
>> InRecovery?
>>
>> And that "We won't create segment if not existent" should happen, but
>> doesn't only because the next segment file wasn't removed earlier, so
>> we have to add an extra check for that case?
>>
>> In other words, is something like the following what's needed here, or
>> is there more to it?
>
> It's a bit more complicated than that, but that's the principle. Thanks
> for the patch, and sorry for my late response. See my attached version
> for a more fleshed out version of this.
>
> Looking at the code around this I found:
> * _mdfd_getseg(), in contrast to mdnblocks() doesn't check segment size,
>   neither whether to continue with a too small, nor to error out with a
>   too big segment
> * For, imo pretty bad, reasons in mdsync() we currently rely on
>   EXTENSION_RETURN_NULL to extend the file in recovery, and to set errno
>   to ENOENT. Brrr.
> * we currently FATAL if a segment is too big - I've copied that
>   behaviour, but why not just ERROR out?
>
> The attached patch basically adds the segment size checks to
> _mdfd_getseg(), and doesn't perform extension, even in recovery, if
> EXTENSION_REALLY_RETURN_NULL is passed.
>
> This fixes the issue for me, both in the originally reported variant,
> and in some more rudely setup version (i.e. rm'ing files).

Fixes it for me too.

Thom




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-03-23 Thread Thom Brown
On 22 March 2016 at 02:30, Etsuro Fujita  wrote:
> On 2016/03/19 3:30, Robert Haas wrote:
>>
>> On Fri, Mar 18, 2016 at 5:15 AM, Etsuro Fujita
>>  wrote:
>>>
>>> Attached is the updated version of the patch.

I've noticed that you now can't cancel a query if there's DML pushdown
to a foreign server.  This previously worked when individual statements
were being sent, as the query could be interrupted and rolled back.

Here's what the local server sees when trying to cancel:

# DELETE FROM remote.contacts;
^CCancel request sent
DELETE 500

This should probably be fixed.

Thom




Re: [HACKERS] snapshot too old, configured by time

2016-03-21 Thread Thom Brown
On 17 March 2016 at 21:15, Kevin Grittner  wrote:
> New patch just to merge in recent commits -- it was starting to
> show some bit-rot.  Tests folded in with main patch.

In session 1, I've run:

# begin transaction isolation level repeatable read ;
BEGIN

*# declare stuff scroll cursor for select * from test where num between 5 and 9;
DECLARE CURSOR

*# fetch forward 5 from stuff;
 id  | num |   thing
-+-+
   2 |   8 | hellofji djf odsjfiojdsif ojdsiof
   3 |   7 | hellofji djf odsjfiojdsif ojdsiof
 112 |   9 | hellofji djf odsjfiojdsif ojdsiof
 115 |   6 | hellofji djf odsjfiojdsif ojdsiof
 119 |   8 | hellofji djf odsjfiojdsif ojdsiof
(5 rows)


In session 2, over a minute later:

# update test set num = 12 where num between 5 and 9 and id between 120 and 250;
ERROR:  snapshot too old


Then back to session 1:

*# fetch forward 5 from stuff;
ERROR:  snapshot too old


Should session 2 be getting that error?

Thom




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2016-03-15 Thread Thom Brown
On 15 March 2016 at 14:00, Thom Brown  wrote:
> On 10 March 2016 at 18:58, Robert Haas  wrote:
>> On Thu, Mar 10, 2016 at 12:18 AM, Amit Kapila  
>> wrote:
>>> On Wed, Mar 9, 2016 at 7:17 PM, Robert Haas  wrote:
>>>>
>>>> On Wed, Mar 9, 2016 at 8:31 AM, Amit Kapila 
>>>> wrote:
>>>> > Thanks for the suggestion.  I have updated the patch to include
>>>> > wait_event_type information in the wait_event table.
>>>>
>>>> I think we should remove "a server process is" from all of these entries.
>>>>
>>>> Also, I think this kind of thing should be tightened up:
>>>>
>>>> + A server process is waiting on any one of the
>>>> commit_timestamp
>>>> + buffer locks to read or write the commit_timestamp page in the
>>>> + pg_commit_ts subdirectory.
>>>>
>>>> I'd just write: Waiting to read or write a commit timestamp buffer.
>>>>
>>>
>>> Okay, changed as per suggestion and fixed the morerows issue pointed by
>>> Thom.
>>
>> Committed with some further editing.  In particular, the way you
>> determined whether we could safely access the tranche information for
>> any given ID was wrong; please check over what I did and make sure
>> that isn't also wrong.
>>
>> Whew, this was a long process, but we got there.  Some initial pgbench
>> testing shows that by far the most common wait event observed on that
>> workload is WALWriteLock, which is pretty interesting: perf -e cs and
>> LWLOCK_STATS let you measure the most *frequent* wait events, but that
>> ignores duration.  Sampling pg_stat_activity tells you which things
>> you're spending the most *time* waiting for, which is awfully neat.
>
> It turns out that I hate the fact that the Wait Event Name column is
> effectively in a random order.  If a user sees a message, and goes to
> look up the value in the wait_event description table, they either
> have to search with their browser/PDF viewer, or scan down the list
> for the item they want, not knowing how far down it
> will be.  The same goes for wait event type.
>
> I've attached a patch to sort the list by wait event type and then
> wait event name.  It also corrects minor SGML indenting issues.

Let's try that again, this time without duplicating one row and omitting another.

Thom


sort_wait_event_doc_table_v2.patch
Description: binary/octet-stream



Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2016-03-15 Thread Thom Brown
On 10 March 2016 at 18:58, Robert Haas  wrote:
> On Thu, Mar 10, 2016 at 12:18 AM, Amit Kapila  wrote:
>> On Wed, Mar 9, 2016 at 7:17 PM, Robert Haas  wrote:
>>>
>>> On Wed, Mar 9, 2016 at 8:31 AM, Amit Kapila 
>>> wrote:
>>> > Thanks for the suggestion.  I have updated the patch to include
>>> > wait_event_type information in the wait_event table.
>>>
>>> I think we should remove "a server process is" from all of these entries.
>>>
>>> Also, I think this kind of thing should be tightened up:
>>>
>>> + A server process is waiting on any one of the
>>> commit_timestamp
>>> + buffer locks to read or write the commit_timestamp page in the
>>> + pg_commit_ts subdirectory.
>>>
>>> I'd just write: Waiting to read or write a commit timestamp buffer.
>>>
>>
>> Okay, changed as per suggestion and fixed the morerows issue pointed by
>> Thom.
>
> Committed with some further editing.  In particular, the way you
> determined whether we could safely access the tranche information for
> any given ID was wrong; please check over what I did and make sure
> that isn't also wrong.
>
> Whew, this was a long process, but we got there.  Some initial pgbench
> testing shows that by far the most common wait event observed on that
> workload is WALWriteLock, which is pretty interesting: perf -e cs and
> LWLOCK_STATS let you measure the most *frequent* wait events, but that
> ignores duration.  Sampling pg_stat_activity tells you which things
> you're spending the most *time* waiting for, which is awfully neat.

It turns out that I hate the fact that the Wait Event Name column is
effectively in a random order.  If a user sees a message, and goes to
look up the value in the wait_event description table, they either
have to search with their browser/PDF viewer, or scan down the list
looking for the item they're looking for, not knowing how far down it
will be.  The same goes for wait event type.

I've attached a patch to sort the list by wait event type and then
wait event name.  It also corrects minor SGML indenting issues.
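
As an aside, the kind of sampling Robert describes above needs nothing more
than polling the view.  A minimal sketch (run repeatedly and aggregate the
samples yourself):

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY 1, 2
ORDER BY count(*) DESC;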

Thom


sort_wait_event_doc_table.patch
Description: binary/octet-stream



Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2016-03-09 Thread Thom Brown
On 9 March 2016 at 13:31, Amit Kapila  wrote:

> On Fri, Mar 4, 2016 at 7:11 PM, Alexander Korotkov 
> wrote:
> >
> >>
> >> If yes, then the only slight worry is that there will lot of repetition
> in wait_event_type column, otherwise it is okay.
> >
> >
> > There is morerows attribute of entry tag in Docbook SGML, it behaves
> like rowspan in HTML.
> >
>
> Thanks for the suggestion.  I have updated the patch to include
> wait_event_type information in the wait_event table.
>
> As asked above by Robert, below is performance data with the patch.
>
> M/C Details
> --
> IBM POWER-8 24 cores, 192 hardware threads
> RAM = 492GB
>
> Performance Data
> 
> min_wal_size=15GB
> max_wal_size=20GB
> checkpoint_timeout=15min
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.9
>
>
> pgbench read-only (median of 3, 5-min runs)
>
> clients BASE PATCH %
> 1 19703.549206 19992.141542 1.4646718364
> 8 120105.542849 127717.835367 6.3380026745
> 64 487334.338764 495861.7211254 1.7498012521
>
>
> The read-only data shows some improvement with patch, but I think this is
> mostly attributed to run-to-run variation.
>
> pgbench read-write (median of 3, 30-min runs)
>
> clients BASE PATCH %
> 1 1703.275728 1696.568881 -0.3937616729
> 8 8884.406185 9442.387472 6.2804567394
> 64 32648.82798 32113.002416 -1.6411785572
>
>
> In the above data, the read-write data shows small regression (1.6%) at
> higher client-count, but when I ran individually that test, the difference
> was 0.5%. I think it is mostly attributed to run-to-run variation as we see
> with read-only tests.
>
>
> Thanks to Mithun C Y for doing performance testing of this patch.
>
> As this patch is adding 4-byte variable to shared memory structure PGPROC,
> so this is susceptible to memory alignment issues for shared buffers as
> discussed in thread [1], but in general the performance data doesn't
> indicate any regression.
>
> [1] -
> http://www.postgresql.org/message-id/caa4ek1+zeb8pmwwktf+3brs0pt4ux6rs6aom0uip8c6shjw...@mail.gmail.com
>
>
The new patch looks fine with regard to grammar and spelling.

However, the new row-spanning layout isn't declared correctly, as you've
over-counted by 1 in each morerows attribute, possibly because you equated
it to the rowspan attribute in HTML; used that way, it spans one row more
than the value you give it.  "morerows" isn't the total number of rows,
but how many additional rows, beyond the current one, the entry spans.

And yes, as Robert mentioned, please can we remove the "A server process
is" repetition?

Thom


Re: [HACKERS] The plan for FDW-based sharding

2016-03-06 Thread Thom Brown
On 6 Mar 2016 8:27 p.m., "Peter Geoghegan"  wrote:
>
> On Fri, Mar 4, 2016 at 4:41 PM, Robert Haas  wrote:
> > Yeah, I agree with that.  I am utterly mystified by why Bruce keeps
> > beating this drum, and am frankly pretty annoyed about it.  In the
> > first place, he seems to think that he invented the idea of using FDWs
> > for sharding in PostgreSQL, but I don't think that's true.  I think it
> > was partly my idea, and partly something that the NTT folks have been
> > working on for years (cf, e.g.,
> > cb1ca4d800621dcae67ca6c799006de99fa4f0a5).  As far as I understand it,
> > Bruce came in near the end of that conversation and now wants to claim
> > credit for something that doesn't really exist yet and, to the extent
> > that it does exist, wasn't even his idea.
>
> I think that it's easy to have the same idea as someone else
> independently. I've had that happen several times myself; ideas that
> other people had that I felt I could have easily had myself, or did in
> fact have. Most of the ideas that I have are fairly heavily based on
> known techniques. I don't think that I've ever creating a PostgreSQL
> feature that was in some way truly original, except perhaps for some
> aspects of how UPSERT works.

Everything is a remix.

Thom


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2016-03-04 Thread Thom Brown
On 4 March 2016 at 13:41, Alexander Korotkov  wrote:
> On Fri, Mar 4, 2016 at 4:20 PM, Amit Kapila  wrote:
>>
>> On Fri, Mar 4, 2016 at 4:01 PM, Alexander Korotkov 
>> wrote:
>>>
>>> On Fri, Mar 4, 2016 at 7:05 AM, Amit Kapila 
>>> wrote:

 > I think the wait event types should be documented - and the wait
 > events too, perhaps.
 >

 As discussed upthread, I have added documentation for all the possible
 wait events and an example.  Some of the LWLocks like AsyncQueueLock and
 AsyncCtlLock are used for quite similar purpose, so I have kept there
 explanation as same.
>>>
>>>
>>> Do you think it worth grouping rows in "wait_event Description" table by
>>> wait event type?
>>
>>
>> They are already grouped (implicitly), do you mean to say that we should
>> add wait event type name as well in that table?
>
>
> Yes.
>
>>
>> If yes, then the only slight worry is that there will lot of repetition in
>> wait_event_type column, otherwise it is okay.
>
>
> There is morerows attribute of entry tag in Docbook SGML, it behaves like
> rowspan in HTML.

+1

Yes, we do this elsewhere in the docs.  And it is difficult to look
through the wait event names at the moment.

I'm also not keen on all the "A server process is waiting" all the way
down the list.

Thom




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2016-03-04 Thread Thom Brown
On 4 March 2016 at 13:35, Thom Brown  wrote:
> On 4 March 2016 at 04:05, Amit Kapila  wrote:
>> On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas  wrote:
>>>
>>> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila 
>>> wrote:
>>> >> I wouldn't bother tinkering with it at this point.  The value isn't
>>> >> going to be recorded on disk anywhere, so it will be easy to change
>>> >> the way it's computed in the future if we ever need to do that.
>>> >>
>>> >
>>> > Okay. Find the rebased patch attached with this mail.  I have moved
>>> > this patch to upcoming CF.
>>>
>>> I would call the functions pgstat_report_wait_start() and
>>> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
>>> pgstat_report_end_waiting().
>>>
>>
>> Changed as per suggestion and made these functions inline.
>>
>>> I think pgstat_get_wait_event_type should not return HWLock, a term
>>> that appears nowhere in our source tree at present.  How about just
>>> "Lock"?
>>>
>>
>> Changed as per suggestion.
>>
>>> I think the wait event types should be documented - and the wait
>>> events too, perhaps.
>>>
>>
>> As discussed upthread, I have added documentation for all the possible wait
>> events and an example.  Some of the LWLocks like AsyncQueueLock and
>> AsyncCtlLock are used for quite similar purpose, so I have kept there
>> explanation as same.
>>
>>> Maybe it's worth having separate wait event type names for lwlocks and
>>> lwlock tranches.  We could report LWLockNamed and LWLockTranche and
>>> document the difference: "LWLockNamed indicates that the backend is
>>> waiting for a specific, named LWLock.  The event name is the name of
>>> that lock.  LWLockTranche indicates that the backend is waiting for
>>> any one of a group of locks with similar function.  The event name
>>> identifies the general purpose of locks in that group."
>>>
>>
>> Changed as per suggestion.
>>
>>> There's no requirement that every session have every tranche
>>> registered.  I think we should consider displaying "extension" for any
>>> tranche that's not built-in, or at least for tranches that are not
>>> registered (rather than "unknown wait event").
>>>
>>> +   if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)
>>>
>>> Isn't the second part automatically true at this point?
>>>
>>
>> Fixed.
>>
>>> The changes to LockBufferForCleanup() don't look right to me.  Waiting
>>> for a buffer pin might be a reasonable thing to define as a wait
>>> event, but it shouldn't reported as if we were waiting on the LWLock
>>> itself.
>>>
>>
>> As discussed upthread, added a new wait event BufferPin for this case.
>>
>>> What happens if an error is thrown while we're in a wait?
>>>
>>
>> As discussed upthread, added in AbortTransaction and from where ever
>> LWLockReleaseAll() gets called, point to note is that we can call this
>> function only when we are sure there is no further possibility of wait on
>> LWLock.
>>
>>> Does this patch hurt performance?
>>>
>>
>> Performance tests are underway.
>
> I've attached a revised version of the patch with the following corrections:
>
> + 
> +  LWLockTranche: The backend is waiting for any one of a
> +  group of locks with similar function.  The wait_event
> +  name for this type of wait identifies the general purpose of locks
> +  in that group.
> + 
>
> s/with similar/with a similar/
>
> +
> + ControlFileLock
> + A server process is waiting to read or update the control 
> file
> + or creation of a new WAL log file.
> +
>
> As the L in WAL stands for "log" anyway, I think the extra "log" word
> can be removed.
>
> +
> + RelCacheInitLock
> + A server process is waiting to read or write to relation 
> cache
> + initialization file.
> +
>
> s/to relation/to the relation/
>
> +
> + BtreeVacuumLock
> +  A server process is waiting to read or update vacuum related
> +  information for Btree index.
> +
>
> s/vacuum related/vacuum-related/
> s/for Btree/for a Btree/
>
>

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2016-03-04 Thread Thom Brown
On 4 March 2016 at 04:05, Amit Kapila  wrote:
> On Wed, Feb 24, 2016 at 7:14 PM, Robert Haas  wrote:
>>
>> On Mon, Feb 22, 2016 at 10:05 AM, Amit Kapila 
>> wrote:
>> >> I wouldn't bother tinkering with it at this point.  The value isn't
>> >> going to be recorded on disk anywhere, so it will be easy to change
>> >> the way it's computed in the future if we ever need to do that.
>> >>
>> >
>> > Okay. Find the rebased patch attached with this mail.  I have moved
>> > this patch to upcoming CF.
>>
>> I would call the functions pgstat_report_wait_start() and
>> pgstat_report_wait_end() instead of pgstat_report_start_waiting() and
>> pgstat_report_end_waiting().
>>
>
> Changed as per suggestion and made these functions inline.
>
>> I think pgstat_get_wait_event_type should not return HWLock, a term
>> that appears nowhere in our source tree at present.  How about just
>> "Lock"?
>>
>
> Changed as per suggestion.
>
>> I think the wait event types should be documented - and the wait
>> events too, perhaps.
>>
>
> As discussed upthread, I have added documentation for all the possible wait
> events and an example.  Some of the LWLocks like AsyncQueueLock and
> AsyncCtlLock are used for quite similar purpose, so I have kept there
> explanation as same.
>
>> Maybe it's worth having separate wait event type names for lwlocks and
>> lwlock tranches.  We could report LWLockNamed and LWLockTranche and
>> document the difference: "LWLockNamed indicates that the backend is
>> waiting for a specific, named LWLock.  The event name is the name of
>> that lock.  LWLockTranche indicates that the backend is waiting for
>> any one of a group of locks with similar function.  The event name
>> identifies the general purpose of locks in that group."
>>
>
> Changed as per suggestion.
>
>> There's no requirement that every session have every tranche
>> registered.  I think we should consider displaying "extension" for any
>> tranche that's not built-in, or at least for tranches that are not
>> registered (rather than "unknown wait event").
>>
>> +   if (lock->tranche == 0 && lockId < NUM_INDIVIDUAL_LWLOCKS)
>>
>> Isn't the second part automatically true at this point?
>>
>
> Fixed.
>
>> The changes to LockBufferForCleanup() don't look right to me.  Waiting
>> for a buffer pin might be a reasonable thing to define as a wait
>> event, but it shouldn't reported as if we were waiting on the LWLock
>> itself.
>>
>
> As discussed upthread, added a new wait event BufferPin for this case.
>
>> What happens if an error is thrown while we're in a wait?
>>
>
> As discussed upthread, added in AbortTransaction and from where ever
> LWLockReleaseAll() gets called, point to note is that we can call this
> function only when we are sure there is no further possibility of wait on
> LWLock.
>
>> Does this patch hurt performance?
>>
>
> Performance tests are underway.

I've attached a revised version of the patch with the following corrections:

+ 
+  LWLockTranche: The backend is waiting for any one of a
+  group of locks with similar function.  The wait_event
+  name for this type of wait identifies the general purpose of locks
+  in that group.
+ 

s/with similar/with a similar/

+
+ ControlFileLock
+ A server process is waiting to read or update the control file
+ or creation of a new WAL log file.
+

As the L in WAL stands for "log" anyway, I think the extra "log" word
can be removed.

+
+ RelCacheInitLock
+ A server process is waiting to read or write to relation cache
+ initialization file.
+

s/to relation/to the relation/

+
+ BtreeVacuumLock
+  A server process is waiting to read or update vacuum related
+  information for Btree index.
+

s/vacuum related/vacuum-related/
s/for Btree/for a Btree/

+
+ AutovacuumLock
+ A server process which could be autovacuum worker is waiting to
+ update or read current state of autovacuum workers.
+

s/could be autovacuum/could be that an autovacuum/
s/read current/read the current/

(I discussed other sources of wait with Amit offline, and he suggested
the autovacuum launcher, so I've added that in too)

+
+ AutovacuumScheduleLock
+ A server process is waiting on another process to ensure that
+ the table it has selected for vacuum still needs vacuum.
+ 
+

s/for vacuum/for a vacuum/
s/still needs vacuum/still needs vacuuming/

+
+ SyncScanLock
+ A server process is waiting to get the start location of scan
+ on table for synchronized scans.
+

s/of scan/of a scan/
s/on table/on a table/

+
+ SerializableFinishedListLock
+ A server process is waiting to access list of finished
+ serializable transactions.
+

s/to access list/to access the list/

+
+ Seriali

Re: [HACKERS] multivariate statistics v10

2016-03-02 Thread Thom Brown
On 2 March 2016 at 14:56, Tomas Vondra  wrote:
>
> Hi,
>
> Attached is v10 of the patch series. There are 9 parts at the moment:
>
>   0001-teach-pull_-varno-varattno-_walker-about-RestrictInf.patch
>   0002-shared-infrastructure-and-functional-dependencies.patch
>   0003-clause-reduction-using-functional-dependencies.patch
>   0004-multivariate-MCV-lists.patch
>   0005-multivariate-histograms.patch
>   0006-multi-statistics-estimation.patch
>   0007-multivariate-ndistinct-coefficients.patch
>   0008-change-how-we-apply-selectivity-to-number-of-groups-.patch
>   0009-fixup-of-regression-tests-plans-changes-by-group-by-.patch
>
> However, the first one is still just a temporary workaround that I plan to 
> address next, and the last 3 are all dealing with the ndistinct coefficients 
> (and shall be squashed into a single chunk).
>
>
> README docs
> ---
>
> Aside from fixing a few bugs, there are several major improvements, the main 
> one being that I've moved most of the comments explaining how it all works 
> into a set of regular README files, located in src/backend/utils/mvstats:
>
> 1) README.stats - Overview of available types of statistics, what
>clauses can be estimated, how multiple statistics are combined etc.
>This is probably the right place to start.
>
> 2) docs for each type of statistics currently available
>
>README.dependencies - soft functional dependencies
>README.mcv  - MCV lists
>README.histogram- histograms
>README.ndistinct- ndistinct coefficients
>
> The READMEs are added and modified through the patch series, so the best 
> thing to do is apply all the patches and start reading.
>
> I have not improved the user-oriented SGML documentation in this patch, 
> that's one of the tasks I'd lie to work on next. But the READMEs should give 
> you a good idea how it's supposed to work, and there are some examples of use 
> in the regression tests.
>
>
> Significantly simplified places
> ---
>
> The patch version also significantly simplifies several places that were 
> needlessly complex in the previous ones - firstly the function evaluating 
> clauses on multivariate histograms was rather needlessly bloated, so I've 
> simplified it a lot. Similarly for the code in clauselist_select() that 
> combines multiple statistics to estimate a list of clauses - that's much 
> simpler now too. And various other pieces.
>
> That being said, I still think the code in clausesel.c can be simplified. I 
> feel there's a lot of cruft, mostly due to unknowingly implementing something 
> that could be solved by an existing function.
>
> A prime example of that is inspecting the expression tree to check if we know 
> how to estimate the clauses using the multivariate statistics. That sounds 
> like a nice match for expression walker, but currently is done by custom 
> code. I plan to look at that next.
>
> Also, I'm not quite sure I understand what the varRelid parameter of 
> clauselist_selectivity is for, so the code may be handling that wrong (seems 
> to be working though).
>
>
> ndistinct coefficients
> --
>
> The one new piece in this patch is the GROUP BY estimation, based on the 
> ndistinct coefficients. So for example you can do this:
>
> CREATE TABLE t AS SELECT mod(i,1000) AS a, mod(i,1000) AS b
> FROM generate_series(1,100) s(i);
> ANALYZE t;
> EXPLAIN SELECT * FROM t GROUP BY a, b;
>
> which currently does this:
>
>   QUERY PLAN
> ---
>  Group  (cost=127757.34..135257.34 rows=6 width=8)
>Group Key: a, b
>->  Sort  (cost=127757.34..130257.34 rows=100 width=8)
>  Sort Key: a, b
>  ->  Seq Scan on t  (cost=0.00..14425.00 rows=100 width=8)
> (5 rows)
>
> but we know that there are only 1000 groups because the columns are 
> correlated. So let's create ndistinct statistics on the two columns:
>
> CREATE STATISTICS s1 ON t (a,b) WITH (ndistinct);
> ANALYZE t;
>
> which results in estimates like this:
>
>QUERY PLAN
> -
>  HashAggregate  (cost=19425.00..19435.00 rows=1000 width=8)
>Group Key: a, b
>->  Seq Scan on t  (cost=0.00..14425.00 rows=100 width=8)
> (3 rows)
>
> I'm not quite sure how to combine this type of statistics with MCV lists and 
> histograms, so for now it's used only for GROUP BY.

Well, firstly, the patches all apply.
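
As a quick sanity check of the new estimate against the real group count, something like this should do (a sketch, reusing the example table t and the ndistinct statistic s1 quoted above):

EXPLAIN ANALYZE SELECT a, b FROM t GROUP BY a, b;           -- planner estimate vs. actual rows
SELECT count(*) FROM (SELECT a, b FROM t GROUP BY a, b) g;  -- should report 1000 groups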

But I have a question (which is coming really late, but I'll ask it
anyway).  Is it intended that CREATE STATISTICS will only be for
multivariate statistics?  Or do you think we could add support for
expression statistics in future too?

e.g.

CREATE STATISTICS stats_comment_length ON comments (length(comment));
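
The motivation being that, without an expression index on length(comment), a predicate over that expression currently just gets a default selectivity. A sketch of the kind of query such a statistic could help (table and column names are from my hypothetical example above):

EXPLAIN SELECT * FROM comments WHERE length(comment) > 100;
-- today this is estimated with a hard-coded default selectivity; an expression
-- statistic (like an expression index) would give the planner real numbers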


I also note that the docs contain this:

CREATE STATISTICS [ IF NOT EXISTS ] statis

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-27 Thread Thom Brown
On 27 February 2016 at 13:20, Michael Paquier  wrote:
> On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown  wrote:
>> On 21 February 2016 at 23:18, Thomas Munro
>>  wrote:
>> The replay_lag is particularly cool.  Didn't think it was possible to
>> glean this information on the primary, but the timings are correct in
>> my tests.
>>
>> +1 for this patch.  Looks like this solves the problem that
>> semi-synchronous replication tries to solve, although arguably in a
>> more sensible way.
>
> Yeah, having extra logic at application layer to check if a certain
> LSN position has been applied or not is doable, but if we can avoid it
> that's a clear plus.
>
> This patch has no documentation. I will try to figure out by myself
> how the new parameters interact with the rest of the syncrep code
> while looking at it but if we want to move on to get something
> committable for 9.6 it would be good to get some documentation soon.

Could we rename "apply" to "remote_apply"?  It seems more consistent
with "remote_write", and matches its own enum entry too.

Thom




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-21 Thread Thom Brown
On 21 February 2016 at 23:18, Thomas Munro
 wrote:
> On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown  wrote:
>> On 3 February 2016 at 10:46, Thomas Munro  
>> wrote:
>>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote
>>>  wrote:
>>>> There seems to be a copy-pasto there - shouldn't that be:
>>>>
>>>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
>>>
>>> Indeed, thanks!  New patch attached.
>>
>> I've given this a test drive, and it works exactly as described.
>
> Thanks for trying it out!
>
>> But one thing which confuses me is when a standby, with causal_reads
>> enabled, has just finished starting up.  I can't connect to it
>> because:
>>
>> FATAL:  standby is not available for causal reads
>>
>> However, this is the same message when I'm successfully connected, but
>> it's lagging, and the primary is still waiting for the standby to
>> catch up:
>>
>> ERROR:  standby is not available for causal reads
>>
>> What is the difference here?  The problem being reported appears to be
>> identical, but in the first case I can't connect, but in the second
>> case I can (although I still can't issue queries).
>
> Right, you get the error at login before it has managed to connect to
> the primary, and for a short time after while it's in 'joining' state,
> or potentially longer if there is a backlog of WAL to apply.
>
> The reason is that when causal_reads = on in postgresql.conf (as
> opposed to being set for an individual session or role), causal reads
> logic is used for snapshots taken during authentication (in fact the
> error is generated when trying to take a snapshot slightly before
> authentication proper begins, in InitPostgres).  I think that's a
> desirable feature: if you have causal reads on and you create/alter a
> database/role (for example setting a new password) and commit, and
> then you immediately try to connect to that database/role on a standby
> where you have causal reads enabled system-wide, then you get the
> causal reads guarantee during authentication: you either see the
> effects of your earlier transaction or you see the error.  As you have
> discovered, there is a small window after a standby comes up where it
> will show the error because it hasn't got a lease yet so it can't let
> you log in yet because it could be seeing a stale catalog (your user
> may not exist on the standby yet, or have been altered in some way, or
> your database may not exist yet, etc).
>
> Does that make sense?

Ah, all clear.  Yes, that makes sense now.  I've been trying to break
it for the past few days, and this was the only thing which I wasn't clear
on.  The parameters all work as described.

The replay_lag is particularly cool.  Didn't think it was possible to
glean this information on the primary, but the timings are correct in
my tests.

+1 for this patch.  Looks like this solves the problem that
semi-synchronous replication tries to solve, although arguably in a
more sensible way.
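
For anyone else testing it, the per-session route (rather than postgresql.conf) is convenient; a minimal sketch, assuming the patch's causal_reads GUC and any table that already exists on the standby:

SET causal_reads = on;
SELECT count(*) FROM pgbench_accounts;  -- either reflects everything already committed
                                        -- on the primary, or fails with
                                        -- "standby is not available for causal reads"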

Thom




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-21 Thread Thom Brown
On 3 February 2016 at 10:46, Thomas Munro  wrote:
> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote
>  wrote:
>> There seems to be a copy-pasto there - shouldn't that be:
>>
>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
>
> Indeed, thanks!  New patch attached.

I've given this a test drive, and it works exactly as described.

But one thing which confuses me is when a standby, with causal_reads
enabled, has just finished starting up.  I can't connect to it
because:

FATAL:  standby is not available for causal reads

However, this is the same message when I'm successfully connected, but
it's lagging, and the primary is still waiting for the standby to
catch up:

ERROR:  standby is not available for causal reads

What is the difference here?  The problem being reported appears to be
identical, but in the first case I can't connect, but in the second
case I can (although I still can't issue queries).

Thom




[HACKERS] Invalid user-level setting = confusing error message

2016-02-11 Thread Thom Brown
Hi,

At the moment, if we try to set up a configuration parameter for a
user which doesn't make sense in that context, we get an error message
that doesn't really tell us that we're not allowed to set it for
users:

# ALTER ROLE moo SET log_line_prefix = '%s';
ERROR:  parameter "log_line_prefix" cannot be changed now

Might I propose an alternative error message for user-level
configuration changes that target parameters of level sighup and
above?

Perhaps something like:

ERROR: parameter "log_line_prefix" cannot be set at the user level.
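
For contrast, a USERSET parameter is accepted at the role level without complaint, which is what makes the current message confusing:

ALTER ROLE moo SET work_mem = '64MB';        -- fine: user-settable parameter
ALTER ROLE moo SET log_line_prefix = '%s';   -- rejected: sighup-or-above parameter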

Thom




[HACKERS] max_parallel_degree context level

2016-02-11 Thread Thom Brown
Hi all,

As it currently stands, max_parallel_degree is set to a superuser
context, but we probably want to discuss whether we want to keep it
this way prior to releasing 9.6.  Might we want to reduce its level so
that users can adjust it accordingly?  They'd still be limited by
max_worker_processes, so they'd at least be constrained by that
setting.
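
For context, this is the behaviour in question (a sketch against 9.6devel):

SET max_parallel_degree = 4;   -- fails for non-superusers while the GUC has superuser context
SHOW max_worker_processes;     -- the overall worker cap that would still constrain them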

Opinions?

Thom




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-02-10 Thread Thom Brown
On 10 February 2016 at 08:00, Rushabh Lathia  wrote:
> Fujita-san, I am attaching update version of the patch, which added
> the documentation update.
>
> Once we finalize this, I feel good with the patch and think that we
> could mark this as ready for committer.

I find this wording a bit confusing:

"If the IsForeignRelUpdatable pointer is set to NULL, foreign tables
are assumed to be insertable, updatable, or deletable either the FDW
provides ExecForeignInsert,ExecForeignUpdate, or ExecForeignDelete
respectively or if the FDW optimizes a foreign table update on a
foreign table using PlanDMLPushdown (PlanDMLPushdown still needs to
provide BeginDMLPushdown, IterateDMLPushdown and EndDMLPushdown to
execute the optimized update.)."

This is a very long sentence, and the word "either" doesn't work here.

Also:

"When the query doesn't has the clause, the FDW must also increment
the row count for the ForeignScanState node in the EXPLAIN ANALYZE
case."

Should read "doesn't have"

The rest looks fine AFAICT.

Regards

Thom




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-02-09 Thread Thom Brown
On 7 January 2016 at 05:24, Amit Kapila  wrote:
> On Fri, Dec 25, 2015 at 6:36 AM, Robert Haas  wrote:
>>
>> On Wed, Dec 23, 2015 at 1:16 AM, Amit Kapila 
>> wrote:
>> >> >
>> >> > Here procArrayGroupXid sounds like Xid at group level, how about
>> >> > procArrayGroupMemberXid?
>> >> > Find the patch with renamed variables for PGProc
>> >> > (rename_pgproc_variables_v1.patch) attached with mail.
>> >>
>> >> I sort of hate to make these member names any longer, but I wonder if
>> >> we should make it procArrayGroupClearXid etc.
>> >
>> > If we go by this suggestion, then the name will look like:
>> > PGProc
>> > {
>> > ..
>> > bool procArrayGroupClearXid, pg_atomic_uint32
>> > procArrayGroupNextClearXid,
>> > TransactionId procArrayGroupLatestXid;
>> > ..
>> >
>> > PROC_HDR
>> > {
>> > ..
>> > pg_atomic_uint32 procArrayGroupFirstClearXid;
>> > ..
>> > }
>> >
>> > I think whatever I sent in last patch were better.  It seems to me it is
>> > better to add some comments before variable names, so that anybody
>> > referring them can understand better and I have added comments in
>> > attached patch rename_pgproc_variables_v2.patch to explain the same.
>>
>> Well, I don't know.  Anybody else have an opinion?
>>
>
> It seems that either people don't have any opinion on this matter or they
> are okay with either of the naming conventions being discussed.  I think
> specifying Member after procArrayGroup can help distinguishing which
> variables are specific to the whole group and which are specific to a
> particular member.  I think that will be helpful for other places as well
> if we use this technique to improve performance.  Let me know what
> you think about the same.
>
> I have verified that previous patches can be applied cleanly and passes
> make check-world.  To avoid confusion, I am attaching the latest
> patches with this mail.

Patches still apply 1 month later.

I don't really have an opinion on the variable naming.  I guess they
only need making longer if there's going to be some confusion about
what they're for, but I'm guessing it's not a blocker here.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-04 Thread Thom Brown
On 4 February 2016 at 15:07, Peter Geoghegan  wrote:
> On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown  wrote:
>>  public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
>>  public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
>>  public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
>
> I see the same.
>
> I use my regular SQL query to see the breakdown of leaf/internal/root pages:
>
> postgres=# with tots as (
>   SELECT count(*) c,
>   avg(live_items) avg_live_items,
>   avg(dead_items) avg_dead_items,
>   u.type,
>   r.oid
>   from (select c.oid,
>   c.relpages,
>   generate_series(1, c.relpages - 1) i
>   from pg_index i
>   join pg_opclass op on i.indclass[0] = op.oid
>   join pg_am am on op.opcmethod = am.oid
>   join pg_class c on i.indexrelid = c.oid
>   where am.amname = 'btree') r,
> lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
>   group by r.oid, type)
> select ct.relname table_name,
>   tots.oid::regclass::text index_name,
>   (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
>   upper(type) page_type,
>   c npages,
>   to_char(avg_live_items, '990.999'),
>   to_char(avg_dead_items, '990.999'),
>   to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
> %' as prop_of_index
>   from tots
>   join pg_index i on i.indexrelid = tots.oid
>   join pg_class ct on ct.oid = i.indrelid
>   where tots.oid = 'pgbench_accounts_pkey'::regclass
>   order by ct.relnamespace, table_name, index_name, npages, type;
>     table_name    │      index_name       │ non_meta_pages │ page_type │ npages │ to_char  │ to_char  │ prop_of_index
> ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ R         │      1 │   97.000 │    0.000 │      0.004 %
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ I         │     97 │  282.670 │    0.000 │      0.354 %
>  pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ L         │ 27,323 │  366.992 │    0.000 │     99.643 %
> (3 rows)
>
> But this looks healthy -- I see the same with master. And since the
> accounts table is listed as 1281 MB, this looks like a plausible ratio
> in the size of the table to its primary index (which I would not say
> is true of an 87MB primary key index).
>
> Are you sure you have the details right, Thom?

*facepalm*

No, I'm not.  I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense.  I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.
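
In other words, something like this is what I should have been measuring (a sketch; the index name is my own):

CREATE INDEX pgbench_accounts_bid_idx ON pgbench_accounts (bid);
SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_bid_idx'));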

Now I've checked it with the latest patch, and can see it working
fine.  Apologies for the confusion.

Thom




Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-04 Thread Thom Brown
On 4 February 2016 at 14:34, Fujii Masao  wrote:
> On Wed, Feb 3, 2016 at 11:00 AM, Robert Haas  wrote:
>> On Tue, Feb 2, 2016 at 8:48 PM, Fujii Masao  wrote:
>>> So you disagree with only third version that I proposed, i.e.,
>>> adding some hooks for sync replication? If yes and you're OK
>>> with the first and second versions, ISTM that we almost reached
>>> consensus on the direction of multiple sync replication feature.
>>> The first version can cover "one local and one remote sync standbys" case,
>>> and the second can cover "one local and at least one from several remote
>>> standbys" case. I'm thinking to focus on the first version now,
>>> and then we can work on the second to support the quorum commit
>>
>> Well, I think the only hard part of the third problem is deciding on
>> what syntax to use.  It seems like a waste of time to me to go to a
>> bunch of trouble to implement #1 and #2 using one syntax and then have
>> to invent a whole new syntax for #3.  Seriously, this isn't that hard:
>> it's not a technical problem.  It's just that we've got a bunch of
>> people who can't agree on what syntax to use.  IMO, you should just
>> pick something.  You're presumably the committer for this patch, and I
>> think you should just decide which of the 47,123 things proposed so
>> far is best and insist on that.  I trust that you will make a good
>> decision even if it's different than the decision that I would have
>> made.
>
> If we use one syntax for every cases, possible approaches that we can choose
> are mini-language, json, etc. Since my previous proposal covers only very
> simple cases, extra syntax needs to be supported for more complicated cases.
> My plan was to add the hooks so that the developers can choose their own
> syntax. But which might confuse users.
>
> Now I'm thinking that mini-language is better choice. A json has some good
> points, but its big problem is that the setting value is likely to be very 
> long.
> For example, when the master needs to wait for one local standby and
> at least one from three remote standbys in London data center, the setting
> value (synchronous_standby_names) would be
>
>   s_s_names = '{"priority":2, "nodes":["local1", {"quorum":1,
> "nodes":["london1", "london2", "london3"]}]}'
>
> OTOH, the value with mini-language is simple and not so long as follows.
>
>   s_s_names = '2[local1, 1(london1, london2, london3)]'
>
> This is why I'm now thinking that mini-language is better. But it's not easy
> to completely implement mini-language. There seems to be many problems
> that we need to resolve. For example, please imagine the case where
> the master needs to wait for at least one from two standbys "tokyo1", "tokyo2"
> in Tokyo data center. If Tokyo data center fails, the master needs to
> wait for at least one from two standbys "london1", "london2" in London
> data center, instead. This case can be configured as follows in mini-language.
>
>   s_s_names = '1[1(tokyo1, tokyo2), 1(london1, london2)]'
>
> One problem here is; what pg_stat_replication.sync_state value should be
> shown for each standbys? Which standby should be marked as sync? potential?
> any other value like quorum? The current design of pg_stat_replication
> doesn't fit complicated sync replication cases, so maybe we need to separate
> it into several views. It's almost impossible to complete those problems.
>
> My current plan for 9.6 is to support the minimal subset of mini-language;
> simple syntax of "[name, ...]". "" specifies the number of
> sync standbys that the master needs to wait for. "[name, ...]" specifies
> the priorities of the listed standbys. This first version supports neither
> quorum commit nor nested sync replication configuration like
> "[name, [name, ...]]". It just supports very simple
> "1-level" configuration.

Whatever the solution, I really don't like the idea of changing the
definition of s_s_names based on the value of another GUC, mainly
because it seems hacky, but also because the name of the GUC stops
making sense.
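
For illustration, if the mini-language were adopted, the example value quoted above would be applied like any other GUC; a sketch only, since the syntax is still a proposal (ALTER SYSTEM is just one way to set it):

ALTER SYSTEM SET synchronous_standby_names = '2[local1, 1(london1, london2, london3)]';
SELECT pg_reload_conf();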

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-02-02 Thread Thom Brown
On 2 February 2016 at 11:47, Anastasia Lubennikova
 wrote:
>
>
> 29.01.2016 20:43, Thom Brown:
>
>> On 29 January 2016 at 16:50, Anastasia Lubennikova
>>   wrote:
>>>
>>> 29.01.2016 19:01, Thom Brown:
>>>>
>>>> On 29 January 2016 at 15:47, Aleksander Alekseev
>>>>   wrote:
>>>>>
>>>>> I tested this patch on x64 and ARM servers for a few hours today. The
>>>>> only problem I could find is that INSERT works considerably slower
>>>>> after
>>>>> applying a patch. Beside that everything looks fine - no crashes, tests
>>>>> pass, memory doesn't seem to leak, etc.
>>>
>>> Thank you for testing. I rechecked that, and insertions are really very
>>> very
>>> very slow. It seems like a bug.
>>>
>>>>>> Okay, now for some badness.  I've restored a database containing 2
>>>>>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>>>>>> rows with a sequential id column.  I get a problem if I try to delete
>>>>>> many rows from it:
>>>>>> # delete from contacts where id % 3 != 0 ;
>>>>>> WARNING:  out of shared memory
>>>>>> WARNING:  out of shared memory
>>>>>> WARNING:  out of shared memory
>>>>>
>>>>> I didn't manage to reproduce this. Thom, could you describe exact steps
>>>>> to reproduce this issue please?
>>>>
>>>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
>>>> -r0), which creates an instance with a custom config, which is as
>>>> follows:
>>>>
>>>> shared_buffers = 8MB
>>>> max_connections = 7
>>>> wal_level = 'hot_standby'
>>>> cluster_name = 'primary'
>>>> max_wal_senders = 3
>>>> wal_keep_segments = 6
>>>>
>>>> Then create a pgbench data set (I didn't originally use pgbench, but
>>>> you can get the same results with it):
>>>>
>>>> createdb -p 5530 pgbench
>>>> pgbench -p 5530 -i -s 100 pgbench
>>>>
>>>> And delete some stuff:
>>>>
>>>> thom@swift:~/Development/test$ psql -p 5530 pgbench
>>>> Timing is on.
>>>> psql (9.6devel)
>>>> Type "help" for help.
>>>>
>>>>
>>>>➤ psql://thom@[local]:5530/pgbench
>>>>
>>>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> ...
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> DELETE 667
>>>> Time: 22218.804 ms
>>>>
>>>> There were 358 lines of that warning message.  I don't get these
>>>> messages without the patch.
>>>>
>>>> Thom
>>>
>>> Thank you for this report.
>>> I tried to reproduce it, but I couldn't. Debug will be much easier now.
>>>
>>> I hope I'll fix these issues within the next few days.
>>>
>>> BTW, I found a dummy mistake, the previous patch contains some unrelated
>>> changes. I fixed it in the new version (attached).
>>
>> Thanks.  Well I've tested this latest patch, and the warnings are no
>> longer generated.  However, the index sizes show that the patch
>> doesn't seem to be doing its job, so I'm wondering if you removed too
>> much from it.
>
>
> Huh, this patch seems to be enchanted) It works fine for me. Did you perform
> "make distclean"?

Yes.  Just tried it again:

git clean -fd
git stash
make distclean
patch -p1 < ~/Downloads/btree_compression_2.0.patch
../dopg.sh (script I've always used to build with)
pg_ctl start
createdb pgbench
pgbench -i -s 100 pgbench

$ psql pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


 ➤ psql://thom@[local]:5488/pgbench

# \di+
List of relations
 Schema |         Name          | Type  | Owner |      Table       |  Size  | Description
--------+-----------------------+-------+-------+------------------+--------+-------------
 public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
 public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
 public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
(3 rows)

Previously, this would show an index size of 87MB for pgbench_accounts_pkey.
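
The same check without \di+, for reference (a sketch):

SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'));
-- 214 MB here; roughly 87 MB is what the working patch produced before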

> Anyway, I'll send a new version soon.
> I just write here to say that I do not disappear and I do remember about the
> issue.
> I even almost fixed the insert speed problem. But I'm very very busy this
> week.
> I'll send an updated patch next week as soon as possible.

Thanks.

> Thank you for attention to this work.

Thanks for your awesome patches.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Thom Brown
On 29 January 2016 at 16:50, Anastasia Lubennikova
 wrote:
> 29.01.2016 19:01, Thom Brown:
>>
>> On 29 January 2016 at 15:47, Aleksander Alekseev
>>  wrote:
>>>
>>> I tested this patch on x64 and ARM servers for a few hours today. The
>>> only problem I could find is that INSERT works considerably slower after
>>> applying a patch. Beside that everything looks fine - no crashes, tests
>>> pass, memory doesn't seem to leak, etc.
>
> Thank you for testing. I rechecked that, and insertions are really very very
> very slow. It seems like a bug.
>
>>>> Okay, now for some badness.  I've restored a database containing 2
>>>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>>>> rows with a sequential id column.  I get a problem if I try to delete
>>>> many rows from it:
>>>> # delete from contacts where id % 3 != 0 ;
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>> WARNING:  out of shared memory
>>>
>>> I didn't manage to reproduce this. Thom, could you describe exact steps
>>> to reproduce this issue please?
>>
>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
>> -r0), which creates an instance with a custom config, which is as
>> follows:
>>
>> shared_buffers = 8MB
>> max_connections = 7
>> wal_level = 'hot_standby'
>> cluster_name = 'primary'
>> max_wal_senders = 3
>> wal_keep_segments = 6
>>
>> Then create a pgbench data set (I didn't originally use pgbench, but
>> you can get the same results with it):
>>
>> createdb -p 5530 pgbench
>> pgbench -p 5530 -i -s 100 pgbench
>>
>> And delete some stuff:
>>
>> thom@swift:~/Development/test$ psql -p 5530 pgbench
>> Timing is on.
>> psql (9.6devel)
>> Type "help" for help.
>>
>>
>>   ➤ psql://thom@[local]:5530/pgbench
>>
>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> ...
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> DELETE 667
>> Time: 22218.804 ms
>>
>> There were 358 lines of that warning message.  I don't get these
>> messages without the patch.
>>
>> Thom
>
>
> Thank you for this report.
> I tried to reproduce it, but I couldn't. Debug will be much easier now.
>
> I hope I'll fix these issues within the next few days.
>
> BTW, I found a dummy mistake, the previous patch contains some unrelated
> changes. I fixed it in the new version (attached).

Thanks.  Well I've tested this latest patch, and the warnings are no
longer generated.  However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-29 Thread Thom Brown
On 29 January 2016 at 15:47, Aleksander Alekseev
 wrote:
> I tested this patch on x64 and ARM servers for a few hours today. The
> only problem I could find is that INSERT works considerably slower after
> applying a patch. Beside that everything looks fine - no crashes, tests
> pass, memory doesn't seem to leak, etc.
>
>> Okay, now for some badness.  I've restored a database containing 2
>> tables, one 318MB, another 24kB.  The 318MB table contains 5 million
>> rows with a sequential id column.  I get a problem if I try to delete
>> many rows from it:
>> # delete from contacts where id % 3 != 0 ;
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>> WARNING:  out of shared memory
>
> I didn't manage to reproduce this. Thom, could you describe exact steps
> to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.


 ➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
...
WARNING:  out of shared memory
WARNING:  out of shared memory
DELETE 667
Time: 22218.804 ms

There were 358 lines of that warning message.  I don't get these
messages without the patch.

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 17:03, Thom Brown  wrote:

>
> On 28 January 2016 at 16:12, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>>
>> 28.01.2016 18:12, Thom Brown:
>>
>> On 28 January 2016 at 14:06, Anastasia Lubennikova <
>> a.lubennik...@postgrespro.ru> wrote:
>>
>>>
>>> 31.08.2015 10:41, Anastasia Lubennikova:
>>>
>>> Hi, hackers!
>>> I'm going to begin work on effective storage of duplicate keys in B-tree
>>> index.
>>> The main idea is to implement posting lists and posting trees for B-tree
>>> index pages as it's already done for GIN.
>>>
>>> In a nutshell, effective storing of duplicates in GIN is organised as
>>> follows.
>>> Index stores single index tuple for each unique key. That index tuple
>>> points to posting list which contains pointers to heap tuples (TIDs). If
>>> too many rows having the same key, multiple pages are allocated for the
>>> TIDs and these constitute so called posting tree.
>>> You can find wonderful detailed descriptions in gin readme
>>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
>>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>>> It also makes possible to apply compression algorithm to posting
>>> list/tree and significantly decrease index size. Read more in presentation
>>> (part 1)
>>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>>
>>> Now new B-tree index tuple must be inserted for each table row that we
>>> index.
>>> It can possibly cause page split. Because of MVCC even unique index
>>> could contain duplicates.
>>> Storing duplicates in posting list/tree helps to avoid superfluous
>>> splits.
>>>
>>>
>>> I'd like to share the progress of my work. So here is a WIP patch.
>>> It provides effective duplicate handling using posting lists the same
>>> way as GIN does it.
>>>
>>> Layout of the tuples on the page is changed in the following way:
>>> before:
>>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>>> (ip_blkid, ip_posid) + key
>>> with patch:
>>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>>
>>> It seems that backward compatibility works well without any changes. But
>>> I haven't tested it properly yet.
>>>
>>> Here are some test results. They are obtained by test functions
>>> test_btbuild and test_ginbuild, which you can find in attached sql file.
>>> i - number of distinct values in the index. So i=1 means that all rows
>>> have the same key, and i=10000000 means that all keys are different.
>>> The other columns contain the index size (MB).
>>>
>>> i          B-tree Old   B-tree New    GIN
>>> 1          214,234375   87,7109375    10,2109375
>>> 10         214,234375   87,7109375    10,71875
>>> 100        214,234375   87,4375       15,640625
>>> 1000       214,234375   86,2578125    31,296875
>>> 10000      214,234375   78,421875     104,3046875
>>> 100000     214,234375   65,359375     49,078125
>>> 1000000    214,234375   90,140625     106,8203125
>>> 10000000   214,234375   214,234375    534,0625
>>> You can note that the last row contains the same index sizes for B-tree,
>>> which is quite logical - there is no compression if all the keys are
>>> distinct.
>>> Other cases looks really nice to me.
>>> Next thing to say is that I haven't implemented posting list compression
>>> yet. So there is still potential to decrease size of compressed btree.
>>>
>>> I'm almost sure, there are still some tiny bugs and missed functions,
>>> but on the whole, the patch is ready for testing.
>>> I'd like to get a feedback about the patch testing on some real
>>> datasets. Any bug reports and suggestions are welcome.
>>>
>>> Here is a couple of useful queries to inspect the data inside the index
>>> pages:
>>> create extension pageinspect;
>>> select * from bt_metap('idx');
>>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>>> n) as bt;
>>> select n, bt.* from generate_series(1,1) as n, lateral
>>> bt_page_items('idx', n) as bt;
>>>
>>> And at last, the list of items I'm going to complete in the near future:
>>> 1. Add storage_parameter 'ena

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 17:09, Peter Geoghegan  wrote:
> On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown  wrote:
>> I'm surprised that efficiencies can't be realised beyond this point.  Your 
>> results show a sweet spot at around 1000 / 1000, with it getting 
>> slightly worse beyond that.  I kind of expected a lot of efficiency where 
>> all the values are the same, but perhaps that's due to my lack of 
>> understanding regarding the way they're being stored.
>
> I think that you'd need an I/O bound workload to see significant
> benefits. That seems unsurprising. I believe that random I/O from
> index writes is a big problem for us.

I was thinking more from the point of view of the index size.  An
index containing 10 million duplicate values is around 40% of the size
of an index with 10 million unique values.
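
A rough way to reproduce the comparison I mean (a sketch; the table and index names are my own):

CREATE TABLE dup_t  AS SELECT 1 AS v FROM generate_series(1, 10000000);
CREATE TABLE uniq_t AS SELECT g AS v FROM generate_series(1, 10000000) g;
CREATE INDEX dup_v_idx  ON dup_t (v);
CREATE INDEX uniq_v_idx ON uniq_t (v);
SELECT relname, pg_size_pretty(pg_relation_size(oid))
  FROM pg_class
 WHERE relname IN ('dup_v_idx', 'uniq_v_idx');
-- with the patch, per the numbers above, the all-duplicates index comes out
-- at roughly 40% of the all-unique one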

Thom




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 16:12, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

>
> 28.01.2016 18:12, Thom Brown:
>
> On 28 January 2016 at 14:06, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>>
>> 31.08.2015 10:41, Anastasia Lubennikova:
>>
>> Hi, hackers!
>> I'm going to begin work on effective storage of duplicate keys in B-tree
>> index.
>> The main idea is to implement posting lists and posting trees for B-tree
>> index pages as it's already done for GIN.
>>
>> In a nutshell, effective storing of duplicates in GIN is organised as
>> follows.
>> Index stores single index tuple for each unique key. That index tuple
>> points to posting list which contains pointers to heap tuples (TIDs). If
>> too many rows having the same key, multiple pages are allocated for the
>> TIDs and these constitute so called posting tree.
>> You can find wonderful detailed descriptions in gin readme
>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>> It also makes possible to apply compression algorithm to posting
>> list/tree and significantly decrease index size. Read more in presentation
>> (part 1)
>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>
>> Now new B-tree index tuple must be inserted for each table row that we
>> index.
>> It can possibly cause page split. Because of MVCC even unique index could
>> contain duplicates.
>> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>>
>>
>> I'd like to share the progress of my work. So here is a WIP patch.
>> It provides effective duplicate handling using posting lists the same way
>> as GIN does it.
>>
>> Layout of the tuples on the page is changed in the following way:
>> before:
>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>> (ip_blkid, ip_posid) + key
>> with patch:
>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>
>> It seems that backward compatibility works well without any changes. But
>> I haven't tested it properly yet.
>>
>> Here are some test results. They are obtained by test functions
>> test_btbuild and test_ginbuild, which you can find in attached sql file.
>> i - number of distinct values in the index. So i=1 means that all rows
>> have the same key, and i=10000000 means that all keys are different.
>> The other columns contain the index size (MB).
>>
>> i          B-tree Old   B-tree New    GIN
>> 1          214,234375   87,7109375    10,2109375
>> 10         214,234375   87,7109375    10,71875
>> 100        214,234375   87,4375       15,640625
>> 1000       214,234375   86,2578125    31,296875
>> 10000      214,234375   78,421875     104,3046875
>> 100000     214,234375   65,359375     49,078125
>> 1000000    214,234375   90,140625     106,8203125
>> 10000000   214,234375   214,234375    534,0625
>> You can note that the last row contains the same index sizes for B-tree,
>> which is quite logical - there is no compression if all the keys are
>> distinct.
>> Other cases looks really nice to me.
>> Next thing to say is that I haven't implemented posting list compression
>> yet. So there is still potential to decrease size of compressed btree.
>>
>> I'm almost sure, there are still some tiny bugs and missed functions, but
>> on the whole, the patch is ready for testing.
>> I'd like to get a feedback about the patch testing on some real datasets.
>> Any bug reports and suggestions are welcome.
>>
>> Here is a couple of useful queries to inspect the data inside the index
>> pages:
>> create extension pageinspect;
>> select * from bt_metap('idx');
>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>> n) as bt;
>> select n, bt.* from generate_series(1,1) as n, lateral
>> bt_page_items('idx', n) as bt;
>>
>> And at last, the list of items I'm going to complete in the near future:
>> 1. Add storage_parameter 'enable_compression' for btree access method
>> which specifies whether the index handles duplicates. default is 'off'
>> 2. Bring back microvacuum functionality for compressed indexes.
>> 3. Improve insertion speed. Insertions became significantly slower with
>> compressed btree, which is obviously not what we do want.
>> 4. Clean the code and comments, add related documentation.
>>
&g

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-01-28 Thread Thom Brown
On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

>
> 31.08.2015 10:41, Anastasia Lubennikova:
>
> Hi, hackers!
> I'm going to begin work on effective storage of duplicate keys in B-tree
> index.
> The main idea is to implement posting lists and posting trees for B-tree
> index pages as it's already done for GIN.
>
> In a nutshell, effective storing of duplicates in GIN is organised as
> follows.
> Index stores single index tuple for each unique key. That index tuple
> points to posting list which contains pointers to heap tuples (TIDs). If
> too many rows having the same key, multiple pages are allocated for the
> TIDs and these constitute so called posting tree.
> You can find wonderful detailed descriptions in gin readme
> 
> and articles .
> It also makes possible to apply compression algorithm to posting list/tree
> and significantly decrease index size. Read more in presentation (part 1)
> .
>
> Now new B-tree index tuple must be inserted for each table row that we
> index.
> It can possibly cause page split. Because of MVCC even unique index could
> contain duplicates.
> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>
>
> I'd like to share the progress of my work. So here is a WIP patch.
> It provides effective duplicate handling using posting lists the same way
> as GIN does it.
>
> Layout of the tuples on the page is changed in the following way:
> before:
> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
> (ip_blkid, ip_posid) + key
> with patch:
> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>
> It seems that backward compatibility works well without any changes. But I
> haven't tested it properly yet.
>
> Here are some test results. They are obtained by test functions
> test_btbuild and test_ginbuild, which you can find in attached sql file.
> i - number of distinct values in the index. So i=1 means that all rows
> have the same key, and i=10000000 means that all keys are different.
> The other columns contain the index size (MB).
>
> i          B-tree Old   B-tree New    GIN
> 1          214,234375   87,7109375    10,2109375
> 10         214,234375   87,7109375    10,71875
> 100        214,234375   87,4375       15,640625
> 1000       214,234375   86,2578125    31,296875
> 10000      214,234375   78,421875     104,3046875
> 100000     214,234375   65,359375     49,078125
> 1000000    214,234375   90,140625     106,8203125
> 10000000   214,234375   214,234375    534,0625
> You can note that the last row contains the same index sizes for B-tree,
> which is quite logical - there is no compression if all the keys are
> distinct.
> Other cases looks really nice to me.
> Next thing to say is that I haven't implemented posting list compression
> yet. So there is still potential to decrease size of compressed btree.
>
> I'm almost sure, there are still some tiny bugs and missed functions, but
> on the whole, the patch is ready for testing.
> I'd like to get a feedback about the patch testing on some real datasets.
> Any bug reports and suggestions are welcome.
>
> Here is a couple of useful queries to inspect the data inside the index
> pages:
> create extension pageinspect;
> select * from bt_metap('idx');
> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
> n) as bt;
> select n, bt.* from generate_series(1,1) as n, lateral
> bt_page_items('idx', n) as bt;
>
> And at last, the list of items I'm going to complete in the near future:
> 1. Add storage_parameter 'enable_compression' for btree access method
> which specifies whether the index handles duplicates. default is 'off'
> 2. Bring back microvacuum functionality for compressed indexes.
> 3. Improve insertion speed. Insertions became significantly slower with
> compressed btree, which is obviously not what we do want.
> 4. Clean the code and comments, add related documentation.
>

This doesn't apply cleanly against current git head.  Have you caught up
past commit 65c5fcd35?

Thom


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-24 Thread Thom Brown
On 24 January 2016 at 20:11, Thom Brown  wrote:
> On 24 January 2016 at 19:53, Victor Wagner  wrote:
>> On Sun, 24 Jan 2016 15:58:10 +0000
>> Thom Brown  wrote:
>>
>>
>>>
>>> Output of \set variables without patch:
>>>
>>> HOST = '127.0.0.1'
>>> PORT =
>>> '5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535'
>>>
>>> And with patch:
>>>
>>> HOST =
>>> '127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535'
>>> PORT = '5488'
>>>
>>> They're both wrong, but I'm hoping we can just show the right
>>> information here.
>>
>> I think we should show right information here, but it is not so simple.
>>
>> Problem is that I never keep symbolic representation of individual
>> host/port pair. And when we connect successfully, we have only struct
>> sockaddr representation of the it, which contain right IP
>> address, but doesn't contain symbolic host name.
>>
>> Moreover, one hostname from connect string can produce more than one
>> addrinfo structures. For example, on the machines with IPv6 support,
>> 'localhost' hostname is resolved into both 127.0.0.1 IPv4 address and
>> [::1] IPv6, and produces two records.
>>
> So would do any name, which have both A and AAAA records in DNS. And
>> nothing prevent domain administrator to put more than one A record for
>> same hostname into DNS zone.
>>
>>
>> So, it is just same information which can be retrieved from the backend
>> via
>>
>> select inet_client_addr();
>> select inet_client_port();
>
> I think you mean:
>
> select inet_server_addr();
> select inet_server_port();
>
>> What is really interesting for HOST and PORT variable - it is the name
>> of host and port number used to make actual connection, as they were
>> specified in the connect string or service file.
>
> And this is probably not the correct thing for it to report.  The
> documentation says "The database server host you are currently
> connected to." and "The database server port to which you are
> currently connected.", so yeah, I'd expect to see those set to
> whatever those 2 functions resolve to.  That being said, if one
> connects via a domain socket, those appear to come back blank with
> those functions, yet HOST and PORT report the correct information in
> those cases (without passing in multiple hosts).  Is that a
> pre-existing bug?

I've just checked, and can see that this doesn't appear to be a bug.
As per network.c:

/*
 * IP address that the server accepted the connection on (NULL if Unix socket)
 */

and

/*
 * port that the server accepted the connection on (NULL if Unix socket)
 */
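
A quick way to see that for yourself is to run the following once over TCP and once over a Unix-domain socket:

SELECT inet_server_addr(), inet_server_port();  -- both NULL on a socket connection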

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-24 Thread Thom Brown
On 24 January 2016 at 19:53, Victor Wagner  wrote:
> On Sun, 24 Jan 2016 15:58:10 +0000
> Thom Brown  wrote:
>
>
>>
>> Output of \set variables without patch:
>>
>> HOST = '127.0.0.1'
>> PORT =
>> '5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535'
>>
>> And with patch:
>>
>> HOST =
>> '127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535'
>> PORT = '5488'
>>
>> They're both wrong, but I'm hoping we can just show the right
>> information here.
>
> I think we should show right information here, but it is not so simple.
>
> Problem is that I never keep symbolic representation of individual
> host/port pair. And when we connect successfully, we have only struct
> sockaddr representation of the it, which contain right IP
> address, but doesn't contain symbolic host name.
>
> Moreover, one hostname from connect string can produce more than one
> addrinfo structures. For example, on the machines with IPv6 support,
> 'localhost' hostname is resolved into both 127.0.0.1 IPv4 address and
> [::1] IPv6, and produces two records.
>
> So would do any name, which have both A and AAAA records in DNS. And
> nothing prevent domain administrator to put more than one A record for
> same hostname into DNS zone.
>
>
> So, it is just same information which can be retrieved from the backend
> via
>
> select inet_client_addr();
> select inet_client_port();

I think you mean:

select inet_server_addr();
select inet_server_port();

> What is really interesting for HOST and PORT variable - it is the name
> of host and port number used to make actual connection, as they were
> specified in the connect string or service file.

And this is probably not the correct thing for it to report.  The
documentation says "The database server host you are currently
connected to." and "The database server port to which you are
currently connected.", so yeah, I'd expect to see those set to
whatever those 2 functions resolve to.  That being said, if one
connects via a domain socket, those appear to come back blank with
those functions, yet HOST and PORT report the correct information in
those cases (without passing in multiple hosts).  Is that a
pre-existing bug?

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-24 Thread Thom Brown
On 24 January 2016 at 15:30, Thom Brown  wrote:
> On 23 January 2016 at 03:32, Thom Brown  wrote:
>> On 22 January 2016 at 19:30, Victor Wagner  wrote:
>>> On Tue, 19 Jan 2016 14:34:54 +0000
>>> Thom Brown  wrote:
>>>
>>>>
>>>> The segfault issue I originally reported now appears to be resolved.
>>>>
>>>> But now I have another one:
>>>>
>>>> psql
>>>> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
>>>> -c 'show port'
>>>
>>> Here is new version of the patch. Now I've reworked hostorder=random and
>>> it seems to work as well as sequential. failover_timeout works too.
>>> I've also found a case when attempt to connect fail when receiving
>>> FATAL message from server which is not properly up yet. So, it is fixed
>>> too.
>>>
>>> Addititonally, error messages from all failed connect attempts are not
>>> accumulated now. Only last one is returned.
>>
>> I can't connect to a standby with the patch applied:
>>
>> thom@swift:~/Development/test$ psql -p 5531 postgres
>> psql: thom@swift:~/Development/test$
>>
>> No error message, nothing in the logs.  I find this is the case with
>> any standby, but doesn't affect primaries.
>>
>> So this has broken existing functionality somewhere.
>
> Okay, I've tested this again with additional logging.  Again, I'm just
> running "psql -p 5531 postgres", which connects to a standby.  This
> immediately exits psql, and the logs show:
>
> 2016-01-24 15:04:59.879 GMT - [unknown] - [unknown] LOG:  0:
> connection received: host=[local]
> 2016-01-24 15:04:59.879 GMT - [unknown] - [unknown] LOCATION:
> BackendInitialize, postmaster.c:4081
> 2016-01-24 15:04:59.880 GMT - thom - postgres LOG:  0: connection
> authorized: user=thom database=postgres
> 2016-01-24 15:04:59.880 GMT - thom - postgres LOCATION:
> PerformAuthentication, postinit.c:272
> 2016-01-24 15:04:59.886 GMT - thom - postgres LOG:  0: statement:
> SELECT pg_catalog.pg_is_in_recovery()
> 2016-01-24 15:04:59.886 GMT - thom - postgres LOCATION:
> exec_simple_query, postgres.c:935
> 2016-01-24 15:04:59.886 GMT - thom - postgres LOG:  0: duration: 0.583 ms
> 2016-01-24 15:04:59.886 GMT - thom - postgres LOCATION:
> exec_simple_query, postgres.c:1164
> 2016-01-24 15:04:59.886 GMT - thom - postgres LOG:  0:
> disconnection: session time: 0:00:00.007 user=thom database=postgres
> host=[local]
> 2016-01-24 15:04:59.886 GMT - thom - postgres LOCATION:
> log_disconnections, postgres.c:4458
>
> This shouldn't be checking whether it's a standby.  I also noticed that with:
>
> psql 
> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535/postgres?hostorder=random&readonly=1'
> -c 'show port'
>
> The standby at port 5533 shows in the logs that it's checking whether
> it's a standby when it happens to hit it.  Shouldn't this be
> unnecessary if we've set "readonly" to 1?  The result of the query
> doesn't appear to be useful for anything.
>
> Another thing I've noticed is that the PORT variable (shown by \set)
> always shows PGPORT, but I expect it to equal the port of whichever
> host we successfully connected to.

Actually, the same goes for the HOST variable, which is currently
showing the list of hosts:ports.

Output of \set variables without patch:

HOST = '127.0.0.1'
PORT = 
'5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535'

And with patch:

HOST = 
'127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535'
PORT = '5488'

They're both wrong, but I'm hoping we can just show the right information here.

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-24 Thread Thom Brown
On 23 January 2016 at 03:32, Thom Brown  wrote:
> On 22 January 2016 at 19:30, Victor Wagner  wrote:
>> On Tue, 19 Jan 2016 14:34:54 +0000
>> Thom Brown  wrote:
>>
>>>
>>> The segfault issue I originally reported now appears to be resolved.
>>>
>>> But now I have another one:
>>>
>>> psql
>>> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
>>> -c 'show port'
>>
>> Here is new version of the patch. Now I've reworked hostorder=random and
>> it seems to work as well as sequential. failover_timeout works too.
>> I've also found a case when attempt to connect fail when receiving
>> FATAL message from server which is not properly up yet. So, it is fixed
>> too.
>>
>> Addititonally, error messages from all failed connect attempts are not
>> accumulated now. Only last one is returned.
>
> I can't connect to a standby with the patch applied:
>
> thom@swift:~/Development/test$ psql -p 5531 postgres
> psql: thom@swift:~/Development/test$
>
> No error message, nothing in the logs.  I find this is the case with
> any standby, but doesn't affect primaries.
>
> So this has broken existing functionality somewhere.

Okay, I've tested this again with additional logging.  Again, I'm just
running "psql -p 5531 postgres", which connects to a standby.  This
immediately exits psql, and the logs show:

2016-01-24 15:04:59.879 GMT - [unknown] - [unknown] LOG:  0:
connection received: host=[local]
2016-01-24 15:04:59.879 GMT - [unknown] - [unknown] LOCATION:
BackendInitialize, postmaster.c:4081
2016-01-24 15:04:59.880 GMT - thom - postgres LOG:  0: connection
authorized: user=thom database=postgres
2016-01-24 15:04:59.880 GMT - thom - postgres LOCATION:
PerformAuthentication, postinit.c:272
2016-01-24 15:04:59.886 GMT - thom - postgres LOG:  0: statement:
SELECT pg_catalog.pg_is_in_recovery()
2016-01-24 15:04:59.886 GMT - thom - postgres LOCATION:
exec_simple_query, postgres.c:935
2016-01-24 15:04:59.886 GMT - thom - postgres LOG:  0: duration: 0.583 ms
2016-01-24 15:04:59.886 GMT - thom - postgres LOCATION:
exec_simple_query, postgres.c:1164
2016-01-24 15:04:59.886 GMT - thom - postgres LOG:  0:
disconnection: session time: 0:00:00.007 user=thom database=postgres
host=[local]
2016-01-24 15:04:59.886 GMT - thom - postgres LOCATION:
log_disconnections, postgres.c:4458

This shouldn't be checking whether it's a standby.  I also noticed that with:

psql 
'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533,127.0.0.1:5534,127.0.0.1:5535/postgres?hostorder=random&readonly=1'
-c 'show port'

The standby at port 5533 shows in the logs that it's checking whether
it's a standby when it happens to hit it.  Shouldn't this be
unnecessary if we've set "readonly" to 1?  The result of the query
doesn't appear to be useful for anything.

Another thing I've noticed is that the PORT variable (shown by \set)
always shows PGPORT, but I expect it to equal the port of whichever
host we successfully connected to.

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-22 Thread Thom Brown
On 23 January 2016 at 03:32, Thom Brown  wrote:
> On 22 January 2016 at 19:30, Victor Wagner  wrote:
>> On Tue, 19 Jan 2016 14:34:54 +0000
>> Thom Brown  wrote:
>>
>>>
>>> The segfault issue I originally reported now appears to be resolved.
>>>
>>> But now I have another one:
>>>
>>> psql
>>> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
>>> -c 'show port'
>>
>> Here is new version of the patch. Now I've reworked hostorder=random and
>> it seems to work as well as sequential. failover_timeout works too.
>> I've also found a case when attempt to connect fail when receiving
>> FATAL message from server which is not properly up yet. So, it is fixed
>> too.
>>
>> Addititonally, error messages from all failed connect attempts are not
>> accumulated now. Only last one is returned.
>
> I can't connect to a standby with the patch applied:
>
> thom@swift:~/Development/test$ psql -p 5531 postgres
> psql: thom@swift:~/Development/test$
>
> No error message, nothing in the logs.  I find this is the case with
> any standby, but doesn't affect primaries.
>
> So this has broken existing functionality somewhere.

Okay, I've tested my original complaints, and they appear to be
resolved.  The timeout works fine, the sequential and random orders
behave as expected, and the readonly setting is also working.

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-22 Thread Thom Brown
On 22 January 2016 at 19:30, Victor Wagner  wrote:
> On Tue, 19 Jan 2016 14:34:54 +0000
> Thom Brown  wrote:
>
>>
>> The segfault issue I originally reported now appears to be resolved.
>>
>> But now I have another one:
>>
>> psql
>> 'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
>> -c 'show port'
>
> Here is new version of the patch. Now I've reworked hostorder=random and
> it seems to work as well as sequential. failover_timeout works too.
> I've also found a case when attempt to connect fail when receiving
> FATAL message from server which is not properly up yet. So, it is fixed
> too.
>
> Addititonally, error messages from all failed connect attempts are not
> accumulated now. Only last one is returned.

I can't connect to a standby with the patch applied:

thom@swift:~/Development/test$ psql -p 5531 postgres
psql: thom@swift:~/Development/test$

No error message, nothing in the logs.  I find this is the case with
any standby, but doesn't affect primaries.

So this has broken existing functionality somewhere.

Thom




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-01-19 Thread Thom Brown
On 21 December 2015 at 14:50, Victor Wagner  wrote:
> On Mon, 21 Dec 2015 17:18:37 +0300
> Teodor Sigaev  wrote:
>
>> Sorry, but there is something wrong with your patch:
>> % patch -p1 -C < ~/Downloads/libpq-failover-5.patch
>
> Really, somehow broken version of the patch got into message.
>
> Here is correct one.

The segfault issue I originally reported now appears to be resolved.

But now I have another one:

psql 
'postgresql://thom@127.0.0.1:5530,127.0.0.1:5531,127.0.0.1:5532,127.0.0.1:5533/postgres?hostorder=random&readonly=1&failover_timeout=5'
-c 'show port'

Segmentation fault

This happens when none of the nodes are available.  If I remove
hostorder=random, or replace it with hostorder=sequential, it doesn't
segfault.
However, it then tries to connect to PGHOST on PGPORT repeatedly, even
if I bring up one of the nodes it should be looking for.  Not only
this, but it seems to do it forever if failover_timeout is greater
than 0.

Thom




[HACKERS] Column merging for inherited tables aren't schema-qualified

2016-01-19 Thread Thom Brown
Hi,

I've noticed that if I alter the parent of an inheritance tree, there
can be ambiguity about which tables the column definitions were merged
with.

For example:

# CREATE SCHEMA remote;
CREATE SCHEMA

# IMPORT public FROM SERVER remote INTO remote;
IMPORT FOREIGN SCHEMA

# CREATE TABLE public.customers (LIKE remote.customers);
CREATE TABLE

# CREATE TABLE master_customers (LIKE remote.customers);
CREATE TABLE

# ALTER TABLE remote.customers INHERIT master_customers;
ALTER TABLE

# ALTER TABLE customers INHERIT master_customers;
ALTER TABLE

# ALTER TABLE customers SET WITH OIDS;
ALTER TABLE

# ALTER TABLE remote.customers SET WITH OIDS;
ALTER TABLE

# ALTER TABLE master_customers SET WITH OIDS;
NOTICE:  merging definition of column "oid" for child "customers"
NOTICE:  merging definition of column "oid" for child "customers"
ALTER TABLE

If we only got one of those merge notices, we wouldn't know which
table the notice referred to, although that's not particularly
important in this case.

I wonder if this should be changed so it would instead read:

# ALTER TABLE master_customers SET WITH OIDS;
NOTICE:  merging definition of column "oid" for child "remote.customers"
NOTICE:  merging definition of column "oid" for child "public.customers"

However, this is only one example of what is probably a fairly common
case of table ambiguity in log messages.
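
(In the meantime, the children of a given parent can at least be
listed unambiguously straight from the catalogs -- a rough
illustration, with the output assuming the default search_path:

# SELECT i.inhrelid::regclass AS child
  FROM pg_catalog.pg_inherits i
  WHERE i.inhparent = 'master_customers'::regclass;
      child
------------------
 remote.customers
 customers
(2 rows)

regclass only schema-qualifies tables that aren't visible on the
current search_path, which is exactly the kind of disambiguation I'd
like the NOTICE itself to have.)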

Thom




Re: Odd behavior in foreign table modification (Was: Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW)

2016-01-19 Thread Thom Brown
On 12 January 2016 at 11:49, Etsuro Fujita  wrote:
> On 2016/01/12 20:36, Thom Brown wrote:
>>
>> On 8 January 2016 at 05:08, Etsuro Fujita 
>> wrote:
>
>
>>>> On 2016/01/06 20:37, Thom Brown wrote:
>>>>>
>>>>> I've run into an issue:
>>>>>
>>>>> *# UPDATE master_customers SET id = 22 WHERE id = 16 RETURNING
>>>>> tableoid::regclass;
>>>>> ERROR:
>>>>> CONTEXT:  Remote SQL command: UPDATE public.customers SET id = 22
>>>>> WHERE ((id = 16)) RETURNING NULL
>
>
>>> While working on this, I noticed that the existing postgres_fdw system
>>> shows
>>> similar behavior, so I changed the subject.
>>>
>>> IIUC, the reason for that is when the local query specifies "RETURNING
>>> tableoid::regclass", the FDW has fmstate->has_returning=false while the
>>> remote query executed at ModifyTable has "RETURNING NULL", as shown in
>>> the
>>> above example; that would cause an abnormal exit in executing the remote
>>> query in postgresExecForeignUpdate, since that the FDW would get
>>> PGRES_TUPLES_OK as a result of the query while the FDW would think that
>>> the
>>> right result to get should be PGRES_COMMAND_OK, from the flag
>>> fmstate->has_returning=false.
>
>
>>> Attached is a patch to fix that.
>
>
>> I can't apply this patch in tandem with FDW DML pushdown patch (either
>> v2 or v3).
>
>
> That patch is for fixing the similar issue in the existing postgres_fdw
> system.  So, please apply that patch without the DML pushdown patch.  If
> that patch is reasonable as a fix for the issue, I'll update the DML
> pushdown patch (v3) on top of that patch.

The patch seems to work for me:

Before:

*# UPDATE master_customers SET id = 22 WHERE id = 1 RETURNING
tableoid::regclass;
ERROR:
CONTEXT:  Remote SQL command: UPDATE public.customers SET id = $2
WHERE ctid = $1 RETURNING NULL

After:

*# UPDATE master_customers SET id = 22 WHERE id = 1 RETURNING
tableoid::regclass;
     tableoid
------------------
 remote.customers
(1 row)

UPDATE 1

Thom




Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-01-18 Thread Thom Brown
On 3 January 2016 at 13:26, Masahiko Sawada  wrote:
> On Fri, Dec 25, 2015 at 7:21 AM, Thomas Munro
>  wrote:
>> On Fri, Dec 25, 2015 at 8:50 AM, Masahiko Sawada  
>> wrote:
>>> On Wed, Dec 23, 2015 at 8:45 AM, Thomas Munro
>>>  wrote:
 On Wed, Dec 23, 2015 at 3:50 PM, Thomas Munro
  wrote:
> If you got rid of SyncRepGetSyncStandbysOnePriority as suggested
> above, then this function could be renamed to SyncRepGetSyncStandbys.
> I think it would be a tiny bit nicer if it also took a Size n argument
> along with the output buffer pointer.
>>>
>>> Sorry, I could not get your point. SyncRepGetSyncStandbysPriority()
>>> function uses synchronous_standby_num which is global variable.
>>> But you mean that the number of synchronous standbys is given as
>>> function argument?
>>
>> Yeah, I was thinking of it as the output buffer size which I would be
>> inclined to make more explicit (I am still coming to terms with the
>> use of global variables in Postgres) but it doesn't matter, please
>> disregard that suggestion.
>>
> As for the body of that function (which I won't paste here), it
> contains an algorithm to find the top K elements in an array of N
> elements.  It does that with a linear search through the top K seen so
> far for each value in the input array, so its worst case is O(KN)
> comparisons.  Some of the sorting gurus on this list might have
> something to say about that but my take is that it seems fine for the
> tiny values of K and N that we're dealing with here, and it's nice
> that it doesn't need any space other than the output buffer, unlike
> some other top-K algorithms which would win for larger inputs.
>>>
>>> Yeah, it's improvement point.
>>> But I'm assumed that the number of synchronous replication is not
>>> large, so I use this algorithm as first version.
>>> And I think that its worst case is O(K(N-K)). Am I missing something?
>>
>> You're right, I was dropping that detail, in the tradition of the
>> hand-wavy school of big-O notation.  (I suppose you could skip the
>> inner loop when the priority is lower than the current lowest
>> priority, giving a O(N) best case when the walsenders are perfectly
>> ordered by coincidence.  Probably a bad idea or just not worth
>> worrying about.)
>
> Thank you for reviewing the patch.
> Yeah, I added the logic that skip the inner loop.
>
>>
>>> Attached latest version patch.
>>
>> +/*
>> + * Obtain currently synced LSN location: write and flush, using priority
>> - * In 9.1 we support only a single synchronous standby, chosen from a
>> - * priority list of synchronous_standby_names. Before it can become the
>> + * In 9.6 we support multiple synchronous standby, chosen from a priority
>>
>> s/standby/standbys/
>>
>> + * list of synchronous_standby_names. Before it can become the
>>
>> s/Before it can become the/Before any standby can become a/
>>
>>   * synchronous standby it must have caught up with the primary; that may
>>   * take some time. Once caught up, the current highest priority standby
>>
>> s/standby/standbys/
>>
>>   * will release waiters from the queue.
>>
>> +bool
>> +SyncRepGetSyncLsnsPriority(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>> +{
>> + int sync_standbys[synchronous_standby_num];
>>
>> I think this should be sync_standbys[SYNC_REP_MAX_SYNC_STANDBY_NUM].
>> (Variable sized arrays are a feature of C99 and PostgreSQL is written
>> in C89.)
>>
>> +/*
>> + * Populate a caller-supplied array which much have enough space for
>> + * synchronous_standby_num. Returns position of standbys currently
>> + * considered as synchronous, and its length.
>> + */
>> +int
>> +SyncRepGetSyncStandbys(int *sync_standbys)
>>
>> s/much/must/ (my bad, in previous email).
>>
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("The number of synchronous standbys must be smaller than the
>> number of listed : %d",
>> + synchronous_standby_num)));
>>
>> How about "the number of synchronous standbys exceeds the length of
>> the standby list: %d"?  Error messages usually start with lower case,
>> ':' is not usually preceded by a space.
>>
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("The number of synchronous standbys must be between 1 and %d : %d",
>>
>> s/The/the/, s/ : /: /
>
> Fixed you mentioned.
>
> Attached latest v5 patch.
> Please review it.

synchronous_standby_num doesn't appear to be a valid GUC name:

LOG:  unrecognized configuration parameter "synchronous_standby_num"
in file "/home/thom/Development/test/primary/postgresql.conf" line 244

All I did was uncomment it and set it to a value.

Thom




Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-01-18 Thread Thom Brown
On 18 January 2016 at 10:46, Ashutosh Bapat
 wrote:
> Hi All,
> PFA patches for postgres_fdw join pushdown, taken care of all TODOs in my
> last mail.
>
> Here is the list of things that have been improved/added new as compared to
> Hanada-san's previous patch at [1].
>
> 1. Condition handling for join
> Patch in [1] allowed a foreign join to be pushed down if only all the
> conditions were safe to push down to the foreign server. This patch
> differentiates these conditions into 1. conditions to be applied while
> joining (ON clause) 2. conditions to be applied after joining (WHERE
> clause). For a join to be safe to pushdown, only conditions in 1 need to be
> all safe to pushdown. The conditions in second category, which are not safe
> to be pushed down can be applied locally. This allows more avenue for join
> pushdown. For an INNER join all the conditions can be applied on the cross
> product. Hence we can push down an INNER join even if one or more of the
> conditions are not safe to be pushed down. This patch includes the
> optimization as well.
>
> 2. Targetlist handling:
> The columns required to evaluate the non-pushable conditions on a join
> relation need to be fetched from the foreign server. In previous patch the
> SELECT clauses were built from rel->reltargetlist, which doesn't contain
> these columns. This patch includes those columns as well.
>
> 3. Column projection:
> Earlier patch required another layer of SQL to project whole-row attribute
> from a base relation. This patch takes care of that while constructing and
> deparsing
> targetlist. This reduces the complexity and length of the query to be sent
> to the foreign server e.g.
>
> With the projection in previous patch the query looked like
> EXPLAIN (COSTS false, VERBOSE)
> SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)
> ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
>   QUERY PLAN
> ... explain output clipped
>Remote SQL: SELECT l.a1, l.a2, l.a3, l.a4, r.a1 FROM (SELECT
> l.a7, ROW(l.a10, l.a11, l.a12, l.a13, l.a14, l.a15, l.a16, l.a17), l.a10,
> l.a12 FROM (SELECT "C 1" a10, c2 a11, c3 a12, c4 a13, c5 a14, c6 a15, c7
> a16, c8 a17, ctid a7 FROM "S 1"."T 1") l) l (a1, a2, a3, a4) INNER JOIN
> (SELECT ROW(r.a9, r.a10, r.a12, r.a13, r.a14, r.a15, r.a16, r.a17), r.a9
> FROM (SELECT "C 1" a9, c2 a10, c3 a12, c4 a13, c5 a14, c6 a15, c7 a16, c8
> a17 FROM "S 1"."T 1") r) r (a1, a2) ON ((l.a3 = r.a2))
>
> With this patch it looks like
> EXPLAIN (COSTS false, VERBOSE)
> SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)
> ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;
>QUERY PLAN
> ... explain output clipped
>Remote SQL: SELECT l.a3, l.a4, l.a1, l.a2, r.a2 FROM (SELECT
> "C 1", c3, ctid, ROW("C 1", c2, c3, c4, c5, c6, c7, c8) FROM "S 1"."T 1") l
> (a1, a2, a3, a4) INNER JOIN (SELECT "C 1", ROW("C 1", c2, c3, c4, c5, c6,
> c7, c8) FROM "S 1"."T 1") r (a1, a2) ON (TRUE) WHERE ((l.a1 = r.a1))
> (9 rows)
>
> 4. Local cost estimation
> Previous patch had a TODO left for estimating join cost locally, when
> use_remote_estimate is false. This patch adds support for the same. The
> relevant
> discussion in mail thread [2], [3].
>
> 5. This patch adds a GUC enable_foreignjoin to enable or disable join
> pushdown through core.
>
> 6. Added more tests to test lateral references, unsafe to push conditions at
> various places in the query,
>
> Many cosmetic improvements like adding static function declarations, comment
> improvements and making code readable.
>
> [1]
> http://www.postgresql.org/message-id/CAEZqfEe9KGy=1_wagh2rgzpg0o4pqgd+iauyaj8wtze+cyj...@mail.gmail.com
> [2]
> http://www.postgresql.org/message-id/cafjfprcqswus+tb5iyp1m3c-w0k3xab6h5mw4+n2q2iuafs...@mail.gmail.com
> [3]
> http://www.postgresql.org/message-id/CAFjFpRepSC2e3mZ1uYSopJD6R19fOZ0dNNf9Z=gnyksb6wg...@mail.gmail.com
>
> I will be working next on (in that order)
> 1. eval_plan_qual fix for foreign join. (Considered as a must-have for 9.6)
> 2. Pushing down ORDER BY clause along with join pushdown
> 3. Parameterization of foreign join paths (Given the complexity of the
> feature this may not make it into 9.6)

It seems you forgot to attach the patch.

Thom




Re: [HACKERS] Truncating/vacuuming relations on full tablespaces

2016-01-15 Thread Thom Brown
On 15 January 2016 at 15:21, Robert Haas  wrote:
> On Fri, Sep 4, 2015 at 8:04 AM, Thom Brown  wrote:
>> Currently, when attempting to vacuum a table on a tablespace with no space
>> left, we get an error:
>>
>> postgres=# vacuum test;
>> ERROR:  could not extend file
>> "pg_tblspc/19605/PG_9.6_201508111/12382/19616_vm": No space left on device
>> HINT:  Check free disk space.
>>
>> This is because it first tries to grow the visibility map file.
>>
>> We also get a similar problem when attempting to truncate with restart
>> identity:
>>
>> postgres=# truncate table test restart identity;
>> ERROR:  canceling autovacuum task
>> CONTEXT:  automatic vacuum of table "postgres.public.test"
>> ERROR:  could not extend file "base/12382/16391": No space left on device
>> HINT:  Check free disk space.
>> STATEMENT:  truncate table test restart identity;
>>
>> I guess a workaround for the 2nd case is to truncate without restarting the
>> identify, then truncate again with restart identify, or just resetting the
>> sequence.  In any case, someone will likely be running this command to free
>> up space, and they can't due to lack of space.
>>
>> But shouldn't we not be creating FSM or VM files when truncating?
>
> That seems like it might possibly be a good thing to avoid, but we're
> not doing it in either of those examples.  So, I am confused.

So am I; reading it back, I'm not sure why I said that now.

The problem is with attempting to extend some file on a full
tablespace during a vacuum or a truncate.  I guess they are different
but related problems.
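
To spell out the workaround I mentioned in the quoted text (the
sequence name test_id_seq is just an assumption -- substitute whatever
the identity column's sequence is actually called):

-- Truncate without RESTART IDENTITY first, which gives the space back:
TRUNCATE TABLE test;
-- ...then reset the sequence as a separate step:
ALTER SEQUENCE test_id_seq RESTART WITH 1;

It's hardly elegant, but it should avoid the failed file extension
that RESTART IDENTITY runs into on a full tablespace.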

Thom




Re: Odd behavior in foreign table modification (Was: Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW)

2016-01-12 Thread Thom Brown
On 8 January 2016 at 05:08, Etsuro Fujita  wrote:
> On 2016/01/07 21:50, Etsuro Fujita wrote:
>>
>> On 2016/01/06 20:37, Thom Brown wrote:
>>>
>>> On 25 December 2015 at 10:00, Etsuro Fujita
>>>  wrote:
>>>>
>>>> Attached is an updated version of the patch, which is
>>>> still WIP, but I'd be happy if I could get any feedback.
>
>
>>> I've run into an issue:
>>>
>>> *# UPDATE master_customers SET id = 22 WHERE id = 16 RETURNING
>>> tableoid::regclass;
>>> ERROR:
>>> CONTEXT:  Remote SQL command: UPDATE public.customers SET id = 22
>>> WHERE ((id = 16)) RETURNING NULL
>
>
>> Will fix.
>
>
> While working on this, I noticed that the existing postgres_fdw system shows
> similar behavior, so I changed the subject.
>
> IIUC, the reason for that is when the local query specifies "RETURNING
> tableoid::regclass", the FDW has fmstate->has_returning=false while the
> remote query executed at ModifyTable has "RETURNING NULL", as shown in the
> above example; that would cause an abnormal exit in executing the remote
> query in postgresExecForeignUpdate, since that the FDW would get
> PGRES_TUPLES_OK as a result of the query while the FDW would think that the
> right result to get should be PGRES_COMMAND_OK, from the flag
> fmstate->has_returning=false.
>
> Attached is a patch to fix that.

I can't apply this patch in tandem with FDW DML pushdown patch (either
v2 or v3).

Thom




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2016-01-06 Thread Thom Brown
On 25 December 2015 at 10:00, Etsuro Fujita  wrote:
> On 2015/12/24 4:34, Robert Haas wrote:
>>
>> On Wed, Dec 23, 2015 at 5:50 AM, Rushabh Lathia
>>  wrote:
>>>
>>> +1.
>>>
>>> I like idea of separate FDW API for the DML Pushdown. Was thinking can't
>>> we
>>> can re-use the  IterateForeignScan(ForeignScanState *node) rather then
>>> introducing IterateDMLPushdown(ForeignScanState *node) new API ?
>
>
>> Yeah, I think we need to ask ourselves what advantage we're getting
>> out of adding any new core APIs.  Marking the scan as a pushed-down
>> update or delete has some benefit in terms of making the information
>> visible via EXPLAIN, but even that's a pretty thin benefit.  The
>> iterate method seems to just complicate the core code without any
>> benefit at all.  More generally, there is very, very little code in
>> this patch that accomplishes anything that could not be done just as
>> well with the existing methods.  So why are we even doing these core
>> changes?
>
>
> From the FDWs' point of view, ISTM that what FDWs have to do for
> IterateDMLPushdown is quite different from what FDWs have to do for
> IterateForeignScan; eg, IterateDMLPushdown must work in accordance with
> presence/absence of a RETURNING list.  (In addition to that,
> IterateDMLPushdown has been designed so that it must make the scan tuple
> available to later RETURNING projection in nodeModifyTable.c.)  So, I think
> that it's better to FDWs to add separate APIs for the DML pushdown, making
> the FDW code much simpler.  So based on that idea, I added the postgres_fdw
> changes to the patch.  Attached is an updated version of the patch, which is
> still WIP, but I'd be happy if I could get any feedback.
>
>> Tom seemed to think that we could centralize some checks in the core
>> code, say, related to triggers, or to LIMIT.  But there's nothing like
>> that in this patch, so I'm not really understanding the point.
>
>
> For the trigger check, I added relation_has_row_level_triggers.  I placed
> that function in postgres_fdw.c in the updated patch, but I think that by
> placing that function in the core, FDWs can share that function.  As for the
> LIMIT, I'm not sure we can do something about that.
>
> I think the current design allows us to handle a pushed-down update on a
> join, ie, "UPDATE foo ... FROM bar ..." where both foo and bar are remote,
> which was Tom's concern, but I'll leave that for another patch.  Also, I
> think the current design could also extend to push down INSERT .. RETURNING
> .., but I'd like to leave that for future work.
>
> I'll add this to the next CF.

I've run into an issue:

*# UPDATE master_customers SET id = 22 WHERE id = 16 RETURNING
tableoid::regclass;
ERROR:
CONTEXT:  Remote SQL command: UPDATE public.customers SET id = 22
WHERE ((id = 16)) RETURNING NULL

However, this works:

*# UPDATE master_customers SET id = 22 WHERE id = 16 RETURNING
tableoid::regclass, *;
    tableoid     | id | name  |    company    | registered_date | expiry_date | active | status  | account_level
-----------------+----+-------+---------------+-----------------+-------------+--------+---------+---------------
 local_customers | 22 | Bruce | Jo's Cupcakes | 2015-01-15      | 2017-01-14  | t      | running | basic
(1 row)

In this example, "local_customers" inherits from the remote table
"public"."customers", which in turn inherits from the local table
"master_customers".

Same issue with DELETE of course, and the ::regclass isn't important here.

Thom




Re: [HACKERS] custom function for converting human readable sizes to bytes

2016-01-04 Thread Thom Brown
On 4 January 2016 at 15:17, Pavel Stehule  wrote:
> Hi
>
> 2016-01-04 12:46 GMT+01:00 Shulgin, Oleksandr
> :
>>
>> On Wed, Dec 30, 2015 at 8:28 PM, Pavel Stehule 
>> wrote:
>>>
>>>
>>>
>>> 2015-12-30 17:33 GMT+01:00 Robert Haas :

 On Mon, Dec 28, 2015 at 8:45 AM, Shulgin, Oleksandr
  wrote:
 > I didn't check out earlier versions of this patch, but the latest one
 > still
 > changes pg_size_pretty() to emit PB suffix.
 >
 > I don't think it is worth it to throw a number of changes together
 > like
 > that.  We should focus on adding pg_size_bytes() first and make it
 > compatible with both pg_size_pretty() and existing GUC units: that is
 > support suffixes up to TB and make sure they have the meaning of
 > powers of
 > 2^10, not 10^3.  Re-using the table present in guc.c would be a plus.
 >
 > Next, we could think about adding handling of PB suffix on input and
 > output,
 > but I don't see a big problem if that is emitted as 1024TB or the user
 > has
 > to specify it as 1024TB in a GUC or argument to pg_size_bytes(): an
 > minor
 > inconvenience only.

 +1 to everything in this email.
>>>
>>>
>>> so I removed support for PB and SI units. Now the
>>> memory_unit_conversion_table is shared.
>>
>>
>> Looks better, thanks.
>>
>> I'm not sure why the need to touch the regression test for
>> pg_size_pretty():
>>
>> !10.5 | 10.5 bytes | -10.5 bytes
>> !  1000.5 | 1000.5 bytes   | -1000.5 bytes
>> !   100.5 | 977 kB | -977 kB
>> !10.5 | 954 MB | -954 MB
>> ! 1.5 | 931 GB | -931 GB
>> !  1000.5 | 909 TB | -909 TB
>>
>
> fixed
>
>>
>> A nitpick, this loop:
>>
>> + while (*cp)
>> + {
>> + if ((isdigit(*cp) || *cp == '.') && ndigits < MAX_DIGITS)
>> + digits[ndigits++] = *cp++;
>> + else
>> + break;
>> + }
>>
>> would be a bit easier to parse if spelled as:
>>
>> + while (*cp && (isdigit(*cp) || *cp == '.') && ndigits < MAX_DIGITS)
>> + digits[ndigits++] = *cp++;
>
>
> fixed
>
>>
>>
>> On the other hand, this seems to truncate the digits silently:
>>
>> + digits[ndigits] = '\0';
>>
>> I don't think we want that, e.g:
>>
>> postgres=# select pg_size_bytes('9223372036854775807.9');
>> ERROR:  invalid unit "9"
>> HINT:  Valid units for this parameter are "kB", "MB", "GB", and "TB".
>>
>> I think making a mutable copy of the input string and truncating it before
>> passing to numeric_in() would make more sense--no need to hard-code
>> MAX_DIGITS.  The same goes for hard-coding MAX_UNIT_LEN, e.g. compare the
>> following two outputs:
>>
>> postgres=# select pg_size_bytes('1 KiB');
>> ERROR:  invalid unit "KiB"
>> HINT:  Valid units for this parameter are "kB", "MB", "GB", and "TB".
>>
>> postgres=# select pg_size_bytes('1024 bytes');
>> ERROR:  invalid format
>>
>
> fixed

Hi,

Some feedback:

+  Converts a size in human-readable format with size units
+  into bytes.  The parameter is case insensitive string. Following
+  units are supported: kB, MB, GB, TB.

May I suggest:

"Converts a size in human-readable format with size units into bytes.
The parameter is case-insensitive, and the supported size units are
are: kB, MB, GB and TB."

+  * Convert human readable size to long int.

Big int?

+  * Due suppor decimal value and case insensitivity of units
+  * a function parse_intcannot be used.

Is this supposed to be saying:

"Due to support for decimal values..."?

and "a function parse_int cannot be used."?

+  * Use buffer as unit if there are not any nonspace char,
+  * else use a original unit string.

s/use a/use an/

+  * Now, the multiplier is in KB unit. It should be multiplied by 1024
+  * before usage

s/unit/units/
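
And just to spell out the semantics as I understand them from this
version of the patch (base-2 units, per the table shared with guc.c)
-- worth double-checking these match what you intend:

# SELECT pg_size_bytes('1.5 GB');
 pg_size_bytes
---------------
    1610612736
(1 row)

# SELECT pg_size_bytes('10 Mb') = 10 * 1024 * 1024;
 ?column?
----------
 t
(1 row)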

Regards

Thom




Re: [HACKERS] Double linking MemoryContext children

2015-12-08 Thread Thom Brown
On 7 December 2015 at 01:30, Jim Nasby  wrote:
> On 9/14/15 7:24 AM, Jan Wieck wrote:
>>
>> On 09/12/2015 11:35 AM, Kevin Grittner wrote:
>>
>>> On the other hand, a grep indicates that there are two places that
>>> MemoryContextData.nextchild is set (and we therefore probably need
>>> to also set the new field), and Jan's proposed patch only changes
>>> one of them.  If we do this, I think we need to change both places
>>> that are affected, so ResourceOwnerCreate() in resowner.c would
>>> need a line or two added.
>>
>>
>> ResourceOwnerCreate() sets ResourceOwnerData.nextchild, not
>> MemoryContextData.nextchild.
>
>
> Anything ever happen with this? 

Yeah, it was committed... a few mins ago.

Thom




Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2015-11-25 Thread Thom Brown
On 13 May 2015 at 04:10, Etsuro Fujita  wrote:
> On 2015/05/13 0:55, Stephen Frost wrote:
>>
>> Etsuro,
>>
>> * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
>>>
>>> Here is an updated version.  In this version, the bug has been
>>> fixed, but any regression tests for that hasn't been added, because
>>> I'm not sure that the above way is a good idea and don't have any
>>> other ideas.
>>>
>>> The EXPLAIN output has also been improved as discussed in [1].
>>
>>
>> While the EXPLAIN output changed, the structure hasn't really changed
>> from what was discussed previously and there's not been any real
>> involvment from the core code in what's happening here.
>>
>> Clearly, the documentation around how to use the FDW API hasn't changed
>> at all and there's been no additions to it for handling bulk work.
>> Everything here continues to be done inside of postgres_fdw, which
>> essentially ignores the prescribed "Update/Delete one tuple" interface
>> for ExecForeignUpdate/ExecForeignDelete.
>>
>> I've spent the better part of the past two days trying to reason my way
>> around that while reviewing this patch and I haven't come out the other
>> side any happier with this approach than I was back in
>> 20140911153049.gc16...@tamriel.snowman.net.
>>
>> There are other things that don't look right to me, such as what's going
>> on at the bottom of push_update_down(), but I don't think there's much
>> point going into it until we figure out what the core FDW API here
>> should look like.  It might not be all that far from what we have now,
>> but I don't think we can just ignore the existing, documented, API.
>
>
> OK, I'll try to introduce the core FDW API for this (and make changes to the
> core code) to address your previous comments.
>
> Thanks for taking the time to review the patch!

Fujita-san,

I'm a bit behind in reading up on this, so maybe it's been covered
since, but is there a discussion of this API on another thread, or a
newer patch available?

Thanks

Thom




Re: [HACKERS] pgbench unusable after crash during pgbench

2015-11-19 Thread Thom Brown
On 19 November 2015 at 16:11, Robert Haas  wrote:
> On Thu, Nov 19, 2015 at 6:03 AM, Thom Brown  wrote:
>> I'm using git master, and if I crash the database whilst it's running
>> pgbench, then restart the database and try to run pgbench again, I
>> can't:
>>
>> thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
>> ...crash database...
>> connection to database "pgbench" failed:
>> could not connect to server: Connection refused
>> Is the server running locally and accepting
>> connections on Unix domain socket "/tmp/.s.PGSQL.5488"?
>> thom@swift:~/Development/postgresql$ pg_ctl start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
>> starting vacuum...end.
>> setrandom: \setrandom maximum is less than minimum
>> client 0 aborted in state 1; execution of meta-command failed
>> transaction type: SELECT only
>> scaling factor: 0
>> query mode: simple
>> number of clients: 1
>> number of threads: 1
>> duration: 20 s
>> number of transactions actually processed: 0
>>
>> I can otherwise use the database without issue.
>
> The only explanation I can think of here is that pgbench on startup
> queries one of the tables to figure out the scale factor, and it seems
> to be coming up with scaling factor 0, suggesting that the table was
> perhaps truncated.  If the tables are unlogged, that's expected.
> Otherwise, it sounds like a serious bug in recovery.

Actually yes, that's something I appear to have omitted.  I was using
unlogged tables, which makes sense now.

Even so, the errors generated are not at all helpful, and I would
expect pgbench to handle this case explicitly.
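
For anyone else who trips over this: the confusing "scaling factor: 0"
comes from the count pgbench does at startup, which is effectively (I
believe this is more or less the exact query it issues):

SELECT count(*) FROM pgbench_branches;
 count
-------
     0
(1 row)

After crash recovery an unlogged pgbench_branches is reset to empty,
so the scale comes out as 0, the \setrandom range in the SELECT script
collapses, and you get "maximum is less than minimum" rather than
anything pointing at the real cause.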

Thom




Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Thom Brown
On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund  
> wrote in <20150326175024.gj...@alap3.anarazel.de>
>> I think the problem here is that the *primary* makes no such
>> assumptions. Init forks are logged via stuff like
>>   smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
> ..
>> i.e. the data is written out directly to disk, circumventing
>> shared_buffers. It's pretty bad that we don't do the same on the
>> standby. For master I think we should just add a bit to the XLOG_FPI
>> record saying the data should be forced out to disk. I'm less sure
>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>> record isn't really an option.
>
> The problem exists only for INIT_FORKNUM. So I suppose it is
> enough to check forknum to decide whether to sync immediately.
>
> Specifically for this instance, syncing buffers of INIT_FORKNUM
> at the end of XLOG_FPI block in xlog_redo fixed the problem.
>
> The another (ugly!) solution sould be syncing only buffers for
> INIT_FORKNUM and is BM_DIRTY in ResetUnlogggedRelations(op =
> UNLOGGED_RELATION_INIT). This is catching-all-at-once solution
> though it is a kind of reversion of fast promotion. But buffers
> to be synced here should be pretty few.

This bug still exists.

Thom




[HACKERS] pgbench unusable after crash during pgbench

2015-11-19 Thread Thom Brown
Hi,

I'm using git master, and if I crash the database whilst it's running
pgbench, then restart the database and try to run pgbench again, I
can't:

thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
...crash database...
connection to database "pgbench" failed:
could not connect to server: Connection refused
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5488"?
thom@swift:~/Development/postgresql$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
server starting
thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
starting vacuum...end.
setrandom: \setrandom maximum is less than minimum
client 0 aborted in state 1; execution of meta-command failed
transaction type: SELECT only
scaling factor: 0
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 0

I can otherwise use the database without issue.

Thom




Re: [HACKERS] [COMMITTERS] pgsql: Add pg_audit, an auditing extension

2015-11-18 Thread Thom Brown
On 10 June 2015 at 14:41, Noah Misch  wrote:
> On Tue, Jun 09, 2015 at 03:54:59PM -0400, David Steele wrote:
>> I've certainly had quite the experience as a first-time contributor
>> working on this patch.  Perhaps I bit off more than I should have and I
>> definitely managed to ruffle a few feathers along the way.  I learned a
>> lot about how the community works, both the good and the bad.  Fear not,
>> though, I'm not so easily discouraged and you'll definitely be hearing
>> more from me.
>
> Glad to hear it.
>
>> The stated purpose of contrib is: "include porting tools, analysis
>> utilities, and plug-in features that are not part of the core PostgreSQL
>> system, mainly because they address a limited audience or are too
>> experimental to be part of the main source tree. This does not preclude
>> their usefulness."
>>
>> Perhaps we should consider modifying that language, because from my
>> perspective pg_audit fit the description perfectly.
>
> "What is contrib?" attracts enduring controversy; see recent thread "RFC:
> Remove contrib entirely" for the latest episode.  However that discussion
> concludes, that documentation passage is not too helpful as a guide to
> predicting contrib patch reception.  (Most recent contrib additions had an
> obvious analogy to an existing module, sidestepping the question.)

Is pg_audit being resubmitted for 9.6 at all?

Thom




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Thom Brown
On 17 November 2015 at 20:08, Robert Haas  wrote:
> On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
>
>> However, the first parallel seq scan shows it getting 170314 rows.
>> Another run shows it getting 194165 rows.  The final result is
>> correct, but as you can see from the rows on the Append node (59094295
>> rows), it doesn't match the number of rows on the Gather node
>> (3000).
>
> Is this the same issue reported in
> http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
> and not yet fixed?  I am inclined to think it probably is.

Yes, that seems to be the same issue.

>> And also, for some reason, I can no longer get this using more than 2
>> workers, even with max_worker_processes = 16 and max_parallel_degree =
>> 12.  I don't know if that's anything to do with this patch though.
>
> The number of workers is limited based on the size of the largest
> table involved in the Append.  That probably needs considerable
> improvement, of course, but this patch is still a step forward over
> not-this-patch.

Ah, okay.  I wasn't aware of this.  I'll bear that in mind in future.

Thom




Re: [HACKERS] Patch (3): Implement failover on libpq connect level.

2015-11-17 Thread Thom Brown
On 26 October 2015 at 07:58, Victor Wagner  wrote:
> On 2015.10.14 at 13:41:51 +0300, Victor Wagner wrote:
>
>> Attached patch which implements client library failover and
>> loadbalancing as was described in the proposal
>> <20150818041850.ga5...@wagner.pp.ru>.
>
> New version of patch
>
> 1. Handles replication connections correctly (i.e doesn't attempt to
> check readwrite mode if replication option is on)
> 2. Some minor improvement recommended by Korry Douglas  in
> <562a9259.4060...@enterprisedb.com>

This patch doesn't apply.  On line 636, this appears:

<<< BEGIN MERGE CONFLICT: local copy shown first <<<
+   if (value[0] ==  't')
=== COMMON ANCESTOR content follows 
+   if (strcmp(value, "true") == 0)
=== MERGED IN content follows ==
+   if (value[0]=='t')
>>> END MERGE CONFLICT >

Thom




Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Thom Brown
On 17 November 2015 at 15:43, Jim Nasby  wrote:
> On 11/17/15 4:41 AM, Thom Brown wrote:
>>
>> Could someone post a TL;DR summary of what the current plan looks
>> like?  I can see there is a huge amount of discussion to trawl back
>> through.  I can see it's something to do with the visibility map.  And
>> does it avoid freezing very large tables like the title originally
>> sought?
>
>
> Basically, it follows the same pattern that all-visible bits do, except
> instead of indicating a page is all-visible, the bit shows that all tuples
> on the page are frozen. That allows a scan_all vacuum to skip those pages.

So the visibility map is being repurposed?  And if a row on a frozen
page is modified, what happens to the visibility of all other rows on
that page, as the bit will be set back to 0?  I think I'm missing a
critical part of this functionality.

Thom




Re: [HACKERS] Freeze avoidance of very large table.

2015-11-17 Thread Thom Brown
On 17 November 2015 at 10:29, Masahiko Sawada  wrote:
>
>
> On Tue, Nov 17, 2015 at 10:45 AM, Robert Haas  wrote:
>> On Sun, Nov 15, 2015 at 1:47 AM, Amit Kapila 
>> wrote:
>>> On Sat, Nov 14, 2015 at 1:19 AM, Andres Freund 
>>> wrote:
 On 2015-10-31 11:02:12 +0530, Amit Kapila wrote:
 > On Thu, Oct 8, 2015 at 11:05 PM, Simon Riggs 
 > wrote:
 > >
 > > On 1 October 2015 at 23:30, Josh Berkus  wrote:
 > >>
 > >> On 10/01/2015 07:43 AM, Robert Haas wrote:
 > >> > On Thu, Oct 1, 2015 at 9:44 AM, Fujii Masao
 > >> > 
 > wrote:
 > >> >> I wonder how much it's worth renaming only the file extension
 > >> >> while
 > >> >> there are many places where "visibility map" and "vm" are used,
 > >> >> for example, log messages, function names, variables, etc.
 > >> >
 > >> > I'd be inclined to keep calling it the visibility map (vm) even
 > >> > if
 > >> > it
 > >> > also contains freeze information.
 > >> >
 >
 > What is your main worry about changing the name of this map, is it
 > about more code churn or is it about that we might introduce new
 > issues
 > or is it about that people are already accustomed to call this map as
 > visibility map?

 Several:
 * Visibility map is rather descriptive, none of the replacement terms
   imo come close. Few people will know what a 'freeze' map is.
 * It increases the size of the patch considerably
 * It forces tooling that knows about the layout of the database
   directory to change their tools

>>>
>>> All these points are legitimate.
>>>
 On the benfit side the only argument I've heard so far is that it allows
 to disambiguate the format. But, uh, a look at the major version does
 that just as well, for far less trouble.

 > It seems to me quite logical for understanding purpose as well.  Any
 > new
 > person who wants to work in this area or is looking into it will
 > always
 > wonder why this map is named as visibility map even though it contains
 > information about visibility of page as well as frozen state of page.

 Being frozen is about visibility as well.

>>>
>>> OTOH being visible doesn't mean page is frozen.  I understand that frozen
>>> is
>>> related to visibility, but still it is a separate state of page and used
>>> for
>>> different
>>> purpose.  I think this is a subjective point and we could go either way,
>>> it
>>> is
>>> just a matter in which way more people are comfortable.
>>
>> I'm stickin' with what I said before, and what I think Andres is
>> saying too: renaming the map is a horrible idea.  It produces lots of
>> code churn for no real benefit.  We usually avoid such changes, and I
>> think we should do so here, too.
>
> I understood.
> I'm going to turn the patch back to visibility map, and just add the logic
> of enhancement of VACUUM FREEZE.
> If we want to add the other status not related to visibility into visibility
> map in the future, it would be worth to consider.

Could someone post a TL;DR summary of what the current plan looks
like?  I can see there is a huge amount of discussion to trawl back
through.  I can see it's something to do with the visibility map.  And
does it avoid freezing very large tables like the title originally
sought?

Thanks

Thom




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Thom Brown
On 13 November 2015 at 22:09, Robert Haas  wrote:
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai  wrote:
>> I'm now designing the parallel feature of Append...
>>
>> Here is one challenge. How do we determine whether each sub-plan
>> allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.  Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
>
> Append
> -> Gather
>   -> Parallel Seq Scan
> -> Gather
>   -> Parallel Seq Scan
> -> Seq Scan
>
> This is a crappy plan.  Each Gather node will try to launch its own
> bunch of workers, which sucks.  The attached patch lets you get this
> plan instead:
>
> Gather
> -> Append
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>
> That's a much better plan.
>
> To make this work, this plan introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan.  Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path; or we can use those partial paths
> to build more partial paths.  The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel.  But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path for form a partial path for the joinrel.
>
> This requires some way of figuring out how many workers to request for
> the append-path, so this patch also adds a parallel_degree field to
> the path object, which is 0 for non-parallel things and the number of
> workers that the path believes to be ideal otherwise.  Right now, it
> just percolates the highest worker count of any child up to the
> appendrel, which might not be right, especially once the append node
> knows how to balance workers, but it seems like a reasonable place to
> start.
>
>> Type-A) It can be performed on background worker context and
>>picked up by multiple worker processes concurrently.
>>   (e.g: Parallel SeqScan)
>> Type-B) It can be performed on background worker context but
>>   cannot be picked up by multiple worker processes.
>>   (e.g: non-parallel aware node)
>> Type-C) It should not be performed on background worker context.
>>(e.g: plan/path node with volatile functions)
>
> At the time that we're forming an AppendPath, we can identify what
> you're calling type-A paths very easily: they're the ones that are
> coming from the partial_pathlist.  We can distinguish between type-B
> and type-C paths coming from the childrel's pathlist based on the
> childrel's consider_parallel flag: if it's false, it's type-C, else
> type-B.  At some point, we might need a per-path flag to distinguish
> cases where a particular path is type-C even though some other plan
> for that relation might be type-B.  But right now that case doesn't
> arise.
>
> Now, of course, it's not good enough to have this information
> available when we're generating the AppendPath; it has to survive
> until execution time.  But that doesn't mean the paths need to be
> self-identifying.  We could, of course, decide to make them so, and
> maybe that's the best design, but I'm thinking about another approach:
> suppose the append node itself, instead of having one list of child
> plans, has a list of type-A plans, a list of type-B plans, and a list
> of type-C plans.  This actually seems quite convenient, because at
> execution time, you presumably want the leader to prioritize type-C
> plans over any of the others, since it's the only one that can execute
> them, and the workers to prioritize type-B plans, since they can only
> take one worker each and are thus prone to be the limiting factor on
> finishing the whole Append.  Having the plans divided in advance
> between these three lists (say, restricted_plans, safe_plans,
> parallel_plans) makes that easy to implement.
>
> Incidentally, I think it's subtly wrong to think of the parallel_aware
> flag as telling you whether the plan can absorb multiple workers.
> That's not really what it's for.  It's to tell you whether the plan is
> doing *something* parallel aware - that is, whether its Estimate,
> InitializeDSM, and InitializeWorker callbacks should do anything.  For
> SeqScan, flipping parallel_aware actually does split the input among
> all the workers, but for Append it's probably just load balances and
> for other nodes it might be something else again.  The term I'm using
> to indicate a path/plan that returns only a subset of the results in
> each worker is "partial".  Whether or not a

Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Thom Brown
On 13 November 2015 at 15:22, Amit Kapila  wrote:
> On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown  wrote:
>>
>> On 13 November 2015 at 13:38, Amit Kapila  wrote:
>> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule
>> > 
>> > wrote:
>> >>
>> >>
>> >> yes - the another little bit unclean in EXPLAIN is number of workers.
>> >> If I
>> >> understand to the behave, the query is processed by two processes if
>> >> workers
>> >> in the explain is one.
>> >>
>> >
>> > You are right and I think that is current working model of Gather
>> > node which seems okay. I think the more serious thing here
>> > is that there is possibility that Explain Analyze can show the
>> > number of workers as more than actual workers working for Gather
>> > node.  We have already discussed that Explain Analyze should
>> > the actual number of workers used in query execution, patch for
>> > the same is still pending.
>>
>> This may have already been discussed before, but in a verbose output,
>> would it be possible to see the nodes for each worker?
>>
>
> There will be hardly any difference in nodes for each worker and it could
> be very long plan for large number of workers.  What kind of additional
> information you want which can't be shown in current format.

For plain EXPLAIN it's not that useful, but for EXPLAIN ANALYSE it
would be useful to see how long each worker took.  And I imagine that,
as more functionality is added to scan partitions and foreign scans,
it will become more useful once the plans are no longer identical (or
would they be?).

>>
>> And perhaps associated PIDs?
>>
>
> Yeah, that can be useful, if others also feel like it is important, I can
> look into preparing a patch for the same.

Thanks.

Thom




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Thom Brown
On 13 November 2015 at 13:38, Amit Kapila  wrote:
> On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule 
> wrote:
>>
>>
>> yes - the another little bit unclean in EXPLAIN is number of workers. If I
>> understand to the behave, the query is processed by two processes if workers
>> in the explain is one.
>>
>
> You are right and I think that is current working model of Gather
> node which seems okay. I think the more serious thing here
> is that there is possibility that Explain Analyze can show the
> number of workers as more than actual workers working for Gather
> node.  We have already discussed that Explain Analyze should
> the actual number of workers used in query execution, patch for
> the same is still pending.

This may have already been discussed before, but in a verbose output,
would it be possible to see the nodes for each worker?

e.g.

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%de%';
                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=105557.59..105557.60 rows=1 width=0) (actual time=400.752..400.752 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175333
   ->  Gather  (cost=1000.00..104931.04 rows=250621 width=0) (actual time=400.748..400.748 rows=0 loops=1)
         Output: content
         Number of Workers: 2
         Buffers: shared hit=175333
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=125310 width=0) (actual time=182.256..398.14 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 626486
               Buffers: shared hit=87666
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=1253101 width=0) (actual time=214.11..325.31 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 6264867
               Buffers: shared hit=876667
 Planning time: 0.085 ms
 Execution time: 414.713 ms
(14 rows)

And perhaps associated PIDs?

Thom




Re: [HACKERS] Parallel Seq Scan

2015-11-13 Thread Thom Brown
On 13 November 2015 at 03:39, Amit Kapila  wrote:
> On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown  wrote:
>>
>> On 12 November 2015 at 15:23, Amit Kapila  wrote:
>> > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule
>> > 
>> > wrote:
>> >>
>> >> Hi
>> >>
>> >> I have a first query
>> >>
>> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
>> >> differen
>> >>
>> >
>> > Thanks for the report.  The reason for this problem is that
>> > instrumentation
>> > information from workers is getting aggregated multiple times.  In
>> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
>> > will wait for workers to finish and then accumulate stats from workers.
>> > Now ExecShutdownGatherWorkers() could be called multiple times
>> > (once we read all tuples from workers, at end of node) and it should be
>> > ensured that repeated calls should not try to redo the work done by
>> > first
>> > call.
>> > The same is ensured for tuplequeues, but not for parallel executor info.
>> > I think we can safely assume that we need to call ExecParallelFinish()
>> > only
>> > when there are workers started by the Gathers node, so on those lines
>> > attached patch should fix the problem.
>>
>> That fixes the count issue for me, although not the number of buffers
>> hit,
>>
>
> The number of shared buffers hit could be different across different runs
> because the read sequence of parallel workers can't be guaranteed, also
> I don't think same is even guaranteed for Seq Scan node, the other
> operations
> in parallel could lead to different number, however the actual problem was
> that in one of the plans shown by you [1], the Buffers hit at Gather node
> (175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201).
> Do you still (after applying above patch) see that Gather node is showing
> lesser hit buffers than Parallel Seq Scan node?

Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there.

> [1]
>  # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->>'title' like '%design%';
>                                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual time=382.987..382.987 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=175288
>    ->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual time=382.983..382.983 rows=0 loops=1)
>          Output: content
>          Number of Workers: 1
>          Buffers: shared hit=175288
>          ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24 rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
>                Output: content
>                Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%design%'::text)
>                Rows Removed by Filter: 1724810
>                Buffers: shared hit=241201
>  Planning time: 0.104 ms
>  Execution time: 403.045 ms
> (14 rows)
>
> Time: 403.596 ms
>
>> or the actual time taken.
>>
>
> Exactly what time you are referring here, Execution Time or actual time
> shown on Parallel Seq Scan node and what problem do you see with
> the reported time?

I'm referring to the Parallel Seq Scan actual time, showing
"379.407..1141.437" with 1 worker, but the total execution time shows
403.045.  If one worker is taking over a second, how come the whole
query was less than half a second?

Thom




Re: [HACKERS] Parallel Seq Scan

2015-11-12 Thread Thom Brown
On 12 November 2015 at 15:23, Amit Kapila  wrote:
> On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule 
> wrote:
>>
>> Hi
>>
>> I have a first query
>>
>> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
>> differen
>>
>
> Thanks for the report.  The reason for this problem is that instrumentation
> information from workers is getting aggregated multiple times.  In
> ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
> will wait for workers to finish and then accumulate stats from workers.
> Now ExecShutdownGatherWorkers() could be called multiple times
> (once we read all tuples from workers, at end of node) and it should be
> ensured that repeated calls should not try to redo the work done by first
> call.
> The same is ensured for tuplequeues, but not for parallel executor info.
> I think we can safely assume that we need to call ExecParallelFinish() only
> when there are workers started by the Gathers node, so on those lines
> attached patch should fix the problem.

That fixes the count issue for me, although not the number of buffers
hit, or the actual time taken.

Thom



