Re: [HACKERS] pgbench stuck with 100% cpu usage

2017-09-28 Thread Pavan Deolasee
On Fri, Sep 29, 2017 at 1:03 AM, Fabien COELHO  wrote:

>
>>> The commit that introduced this code is 12788ae49e1933f463bc. So I am
>>> copying Heikki.
>>>
>>
>> AFAICR the commit was mostly a heavy restructuring of previous
>> unmaintainable spaghetti code. I'm not sure the problem wasn't there
>> before in one form or another.
>>
>> I agree that it should error out & stop the client in this case at least.
>>
>
> Here is a probable "fix", which does what the comment said should be done.
>
>
Looks good to me.


> I could not trigger an infinite loop with various kill -9 and other quick
> stops. Could you try it on your side?
>
>
Ok, I will try. But TBH I did not try to reproduce it either, and I am not
sure if I can. I discovered the problem when my laptop's battery started
draining much more quickly. Having seen the problem, it seems very
obvious though.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pgbench stuck with 100% cpu usage

2017-09-28 Thread Pavan Deolasee
On Fri, Sep 29, 2017 at 12:22 AM, Fabien COELHO  wrote:

>
>> While running some tests, I encountered a situation where pgbench gets
>> stuck in an infinite loop, consuming 100% cpu. The setup was:
>>
>> - Start postgres server from the master branch
>> - Initialise pgbench
>> - Run pgbench -c 10 -T 100
>> - Stop postgres with -m immediate
>>
>
> That is a strange test to run, but it would be better if the behavior were
> not like that.


Well, I think it's a very legitimate test, not for testing performance but
for testing crash recovery, and I use it very often. This particular test
was run to catch another bug, which will be reported separately.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Bug with pg_basebackup and 'shared' tablespace

2017-09-28 Thread Kyotaro HORIGUCHI
Hi,

At Tue, 19 Sep 2017 17:07:10 +0200, Pierre Ducroquet 
 wrote in <1678633.OBaBNztJnJ@pierred-pdoc>
> On Tuesday, September 19, 2017 12:52:37 PM CEST you wrote:
> > On Thu, Sep 14, 2017 at 2:17 AM, Pierre Ducroquet  
> wrote:
> > > All my apologies for the shockingly long time with no answer on this
> > > topic.
> > No problem. That's the concept called life I suppose.
> > 
> > > I will do my best to help review some patches in the current CF.
> > 
> > Thanks for the potential help!
> > 
> > > Attached to this email is the new version of the patch, checked against
> > > HEAD and REL_10_STABLE, with the small change required by Michael (assign
> > > true/false to the boolean instead of 1/0) applied.
> > 
> > Thanks for the new version.
> > 
> > So I have been pondering about this patch, and allowing multiple
> > versions of Postgres to run in the same tablespace is mainly here for
> > upgrade purposes, so I don't think that we should encourage such
> > practices for performance reasons. There is as well
> > --tablespace-mapping which is here to cover a similar need and we know
> > that this solution works, at least people have spent time to make sure
> > it does.
> > 
> > Another problem that I have with this patch is that on HEAD,
> > permission checks are done before receiving any data. I think that
> > it is really saner to do any checks like that first, so that you can
> > know whether the tablespace is available for write access before writing
> > any data to it. With this patch, you may finish by writing a bunch of
> > data to a tablespace path, but then fail on another tablespace because
> > of permission issues, so the backup becomes useless after transferring
> > and writing potentially hundreds of gigabytes. That is no fun to detect
> > potentially too late, and it can impact how quickly instances can be
> > backed up and restored.
> > 
> > All in all, this patch gets a -1 from me in its current shape. If
> > Horiguchi-san or yourself want to proceed further with this patch, of
> > course feel free to do so, but I don't feel that we are actually
> > trying to solve a new problem here. I am switching the patch to
> > "waiting on author".

Hmm. I recall that my opinion on the "problem" was that we should
just prohibit sharing rather than allow it. However, if there is
someone who wants it and the behavior is not too bizarre, I won't
reject the direction.

As far as I can see, the patch just delays digging into the top
directory until the CATVER directory becomes known, without
additional checks. The behavior is identical to what the current
version does on tablespace directories, so the problems pointed out
don't seem to be new ones caused by this patch, and such a situation
seems a bit contrived anyway.

> Hi
> 
> The point of this patch has never been to 'promote' sharing tablespaces 
> between PostgreSQL versions. This is not a documented feature, and it
> would be, IMHO, a bad idea to promote it because of bugs like this one.

That *was* a bug, but it could become a non-bug once we define the
behavior reasonably, and maybe it should then be documented.

> But that bug is a problem for people who are migrating between postgresql
> releases one database at a time on the same server (maybe not a best
> practice, but a real use case nevertheless). That's how I met this bug, and
> that's why I wanted to patch it. I know there is a workaround, but to me it
> seems counter-intuitive that with no warning I can create a postgresql
> cluster that cannot be restored without additional pg_basebackup settings.
> If it is indeed forbidden to share a tablespace between postgresql releases,
> why don't we enforce it? Or at least show a warning when CREATE TABLESPACE
> encounters a non-empty folder?
> 
> Regarding the permission check, I don't really see how this is a real
> problem with the patch. I have to check on master, but it seems to me that
> the permission check can still be done before starting to write data to
> disk. We could just delay the 'empty' folder check, and keep checking the
> folder permissions.
> 
> And I will do the pgindent run, thanks for the nitpick :)

So I'll wait for the new version, but I have a comment on the
patch.

It would practically work, but I don't like the fact that the
patch relies on the specific directory/file ordering in the tar
stream. This is not about the CATVER directory but about lower
directories. Being more strict, it actually makes excessive calls
to verify_dir_is_em..() for lower-level directories, contrary to
expectations.

I think that we can take a more robust way to detect the
CATVER directory: just checking whether it is a top-level directory
will work. That is, making the following change.

-   if (firstfile && !basetablespace)
+   /* copybuf must contain at least one '/' here */
+   if (!basetablespace && strchr(copybuf, '/')[1] == 0)

This condition hits exactly the CATVER directories, without being
disturbed by 

Re: [HACKERS] Refactor handling of database attributes between pg_dump and pg_dumpall

2017-09-28 Thread Vaishnavi Prabakaran
Hi,


On Sat, Sep 9, 2017 at 1:29 PM, Haribabu Kommi 
wrote:

>
> Fixed patch is attached.
>


The patch applies, but with a lot of noise due to indentation with spaces.
I ran the regression tests located in the src/test/regress,
src/test/modules/test_pg_dump, src/bin/pg_dump, and src/bin/pg_upgrade
folders, and no issues were observed.



+  --enable-pgdumpall-behaviour
+  
+   
+This option is for the use of pg_dumpall or
+pg_upgrade utility to dump the database objects
+by pg_dump for a complete dump.
+This option can only be used with -C/--create.
+Its use for other ourposes is not recommended or supported.
+The behavior of the option may change in future releases without
notice.
+   
+  
+ 
+
+ 

s/ourposes/purposes/


The option name "--enable-pgdumpall-behaviour" is very generic; it would be
better to rename it to something that reflects its functionality, like
--skip-default-db-create or --no-default-db-create.

Thanks & Regards,
Vaishnavi,
Fujitsu Australia.


Re: [HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Andres Freund
On 2017-09-28 20:56:57 -0700, Andres Freund wrote:
> +1, I saw the same in a bleeding edge c++ version.

err, gcc version.




Re: [HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Andres Freund
On 2017-09-28 20:01:37 -0400, Tom Lane wrote:
> I wondered how pervasive this behavior is.  AFAICS it is *not* required
> by the C standard; C99 does not say that the left operand of assignment
> must be evaluated first, in fact it says that the order of evaluation is
> unspecified.

FWIW, it's being specified in C++17 ([1]), which seems to make it not
unlikely to end up in C as well.

> but it does suggest that this is worth fixing, and is not just an artifact
> of an old compiler version.

+1, I saw the same in a bleeding edge c++ version.

Greetings,

Andres Freund

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0145r3.pdf




Re: [HACKERS] A design for amcheck heapam verification

2017-09-28 Thread Thomas Munro
On Fri, Sep 29, 2017 at 4:17 PM, Michael Paquier
 wrote:
>> As for DSM, I think that that can come later, and can be written by
>> somebody closer to that problem. There can be more than one
>> initialization function.
>
> I don't completely disagree with that; there could be multiple
> initialization functions. Still, an advantage of designing things
> right from the beginning, with a correct set of APIs, is that we don't
> need extra things later and module maintainers are never bothered.
> I would think that this utility interface should be minimal and
> portable so as to hold up over the long term.

FWIW I think if I were attacking that problem the first thing I'd
probably try would be getting rid of that internal pointer
filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making
the interface look something like this:

extern size_t bloom_estimate(int64 total_elems, int work_mem);
extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem);

Something that allocates new memory as the patch's bloom_init()
function does I'd tend to call 'make' or 'create' or 'new' or
something, rather than 'init'.  'init' has connotations of being the
second phase in an allocate-and-init pattern for me.  Then
bloom_filt_make() would be trivially implemented on top of
bloom_estimate() and bloom_init(), and bloom_init() could be used
directly in DSM, DSA, traditional shmem without having to add any
special DSM support.
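
For illustration, a rough sketch of how that split could be consumed; the
struct layout and the bloom_filt_make() wrapper below are guesses for the
sake of the example, not the patch's actual code:

typedef struct bloom_filter
{
    int         k_hash_funcs;
    uint64      seed;
    uint64      bitset_bits;
    unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];    /* no internal pointer */
} bloom_filter;

/* palloc-based convenience wrapper, layered on the two functions above */
static bloom_filter *
bloom_filt_make(int64 total_elems, int work_mem)
{
    bloom_filter *filter = palloc(bloom_estimate(total_elems, work_mem));

    bloom_init(filter, total_elems, work_mem);
    return filter;
}

For DSM, DSA or traditional shmem the same two calls apply unchanged:
allocate bloom_estimate(...) bytes wherever you want, then bloom_init()
the result in place.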

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] A design for amcheck heapam verification

2017-09-28 Thread Michael Paquier
On Thu, Sep 28, 2017 at 3:32 AM, Peter Geoghegan  wrote:
> On Wed, Sep 27, 2017 at 1:45 AM, Michael Paquier
>  wrote:
>> I have signed up as a reviewer of this patch, and I have looked at the
>> bloom filter implementation for now. This is the kind of facility that
>> people have asked for on this list for many years.
>>
>> One first thing striking me is that there is no test for this
>> implementation, which would be a base stone for other things, it would
>> be nice to validate that things are working properly before moving on
>> with 0002, and 0001 is a feature on its own. I don't think that it
>> would be complicated to have a small module in src/test/modules which
>> plugs in a couple of SQL functions on top of bloomfilter.h.
>
> 0001 is just a library utility. None of the backend lib utilities
> (like HyperLogLog, Discrete knapsack, etc) have dedicated test
> frameworks. Coding up an SQL interface for these things is a
> non-trivial project in itself.
>
> I have tested the implementation myself, but that was something that
> used C code.
>
> Can you tell me what you have in mind here in detail? For example,
> should there be a custom datatype that encapsulates the current state
> of the bloom filter? Or, should there be an aggregate function, that
> takes a containment argument that is tested at the end?

That could be something very simple:
- A function to initialize a bloom filter in a session context, taking the
number of elements as input, and using for example integers.
- A function to add an integer element to it.
- A function to query whether an integer may exist or not.
- A function to free it.
The point is just to stress this code; I don't think that it is such a
"large" project.
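
To make that concrete, a skeleton of such a src/test/modules wrapper could
look like the sketch below; the bloom_* calls and the header location stand
in for whatever 0001 finally exports, so all of those names and signatures
are assumptions:

#include "postgres.h"
#include "fmgr.h"
#include "utils/memutils.h"
#include "lib/bloomfilter.h"        /* assumed location of the 0001 header */

PG_MODULE_MAGIC;

static bloom_filter *test_filter = NULL;    /* one filter per session */

PG_FUNCTION_INFO_V1(test_bloom_create);
Datum
test_bloom_create(PG_FUNCTION_ARGS)
{
    int64       total_elems = PG_GETARG_INT64(0);
    MemoryContext oldcxt;

    /* keep the filter alive across statements for the whole session */
    oldcxt = MemoryContextSwitchTo(TopMemoryContext);
    test_filter = bloom_create(total_elems);    /* hypothetical constructor */
    MemoryContextSwitchTo(oldcxt);

    PG_RETURN_VOID();
}

test_bloom_add(int4), test_bloom_contains(int4) and test_bloom_free() would
then be equally thin wrappers around the corresponding filter calls, which
is enough to stress the hashing and lookup paths from plain SQL.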

>> +#define MAX_HASH_FUNCS 10
>> Being able to define the number of hash functions used at
>> initialization would be nicer. Usually this number is kept way lower
>> than the number of elements to check as part of a set, but I see no
>> reason to not allow people to play with this API in a more extended
>> way. You can then let your module decide what it wants to use.
>
> The number of hash functions used is itself a function of the amount
> of memory available, and an estimate of the overall size of the set
> made ahead of time (in the case of amcheck, this is
> pg_class.reltuples). The typical interface is that the caller either
> specifies the amount of memory, or the required false positive rate
> (either way, they must also provide that estimate).

+   filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
The estimate is from wikipedia-sensei:
https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
Being able to enforce that would be nice, not mandatory perhaps.
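
For reference, the standard calculation boils down to something like the
sketch below; the clamp to MAX_HASH_FUNCS and the exact rounding are
assumptions about how the patch bounds it, not its actual code:

/* optimal k = (m / n) * ln(2), with m = bits in the set, n = elements */
static int
optimal_k(uint64 bitset_bits, int64 total_elems)
{
    /* rint() is from <math.h>; 0.693147 approximates ln(2) */
    int         k = (int) rint((double) bitset_bits / total_elems * 0.693147);

    return Max(1, Min(k, 10));      /* 10 == MAX_HASH_FUNCS */
}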

> The value of MAX_HASH_FUNCS, 10, was chosen based on the fact that we
> never actually use more than that many hash functions in practice,
> given the limitations on the total amount of memory you can use
> (128MB). The formula for determining the optimum number of hash
> functions is pretty standard stuff.

Hm... OK. That could be a default... Not really convinced though.

>> +/*
>> + * Hash function is taken from sdbm, a public-domain reimplementation of the
>> + * ndbm database library.
>> + */
>> Reference link?
>
> http://www.eecs.harvard.edu/margo/papers/usenix91/paper.pdf

That would be nicer if added in the code :)

> I'm probably going to end up using murmurhash32() instead of the sdbm
> hash function anyway, now that Andres has exposed it in a header file.
> This probably won't matter in the next version.

Yeah, that may be a good idea at the end.

>> +   bitset_bytes = bitset_bits / BITS_PER_BYTE;
>> +   filter->bitset = palloc0(bitset_bytes);
>> +   filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
>> +   filter->seed = seed;
>> I think that doing the allocation within the initialization phase is a
>> mistake. Being able to use a DSA would be nice, but I predict as well
>> cases where a module may want to have a bloom filter that persists as
>> well across multiple transactions, so the allocation should be able to
>> live across memory contexts.
>
> Why not just switch memory context before calling? Again, other
> comparable utilities don't provide this in their interface.
>
> As for DSM, I think that that can come later, and can be written by
> somebody closer to that problem. There can be more than one
> initialization function.

I don't completely disagree with that; there could be multiple
initialization functions. Still, an advantage of designing things
right from the beginning, with a correct set of APIs, is that we don't
need extra things later and module maintainers are never bothered.
I would think that this utility interface should be minimal and
portable so as to hold up over the long term.

>> What I think you should do instead to
>> make this bloom implementation more modular is to let the caller give
>> a pointer to a memory area as well as its size. 

Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-28 Thread Bossart, Nathan
On 9/28/17, 8:46 PM, "Michael Paquier"  wrote:
> On Fri, Sep 29, 2017 at 10:44 AM, Tom Lane  wrote:
>> Michael Paquier  writes:
>>> On Fri, Sep 29, 2017 at 2:44 AM, Bossart, Nathan  
>>> wrote:
>>>> Alright, I've added logging for autovacuum in v23.  I ended up needing to
>>>> do a little restructuring to handle the case when the relation was skipped
>>>> because the lock could not be obtained.  While doing so, I became
>>>> convinced that LOG was probably the right level for autovacuum logs.
>>
>>> OK, of course let's not change the existing log levels. This could be
>>> always tuned later on depending on feedback from others. I can see
>>> that guc.c also uses elevel == 0 for some logic, so we could rely on
>>> that as you do.
>>
>> FWIW, I don't think this patch should be mucking with logging behavior
>> at all; that's not within its headline charter, and I doubt many people
>> are paying attention.  I propose to commit it without that.  If you feel
>> hot about changing the logging behavior, you can resubmit that as a new
>> patch in a new thread where it will get some visibility and debate on
>> its own merits.
>
> Okay. I am fine with that as well.

Sure, that seems reasonable to me.

Nathan




Re: [HACKERS] pg_prepared_xact_status

2017-09-28 Thread Michael Paquier
On Fri, Sep 29, 2017 at 1:53 AM, Konstantin Knizhnik
 wrote:
> In Postgres 10 we have the txid_status function, which returns the status
> of a transaction by XID.
> I wonder if it would also be useful to have a similar function for 2PC
> transactions which can operate on a GID?
> The pg_prepared_xacts view allows getting information about prepared
> transactions which are not yet committed or aborted.
> But if a transaction is committed, then there is currently no way to find
> the status of this transaction.

But you need to keep track of the transaction XID of each transaction
happening on the remote nodes which are part of a global 2PC
transaction, no? If you have this data at hand, using txid_status() is
enough to tell whether a prepared transaction has been marked as committed
or is still prepared. And it seems to me that tracking those XIDs is
mandatory anyway for other consistency checks.

> If a crash happens during 2PC commit, then the transaction can be in a
> prepared state at some nodes and committed/aborted at other nodes.

Handling inconsistencies here is a tricky problem, particularly if a
given transaction is marked as both committed and aborted on many
nodes. The only way that I could think of would be to perform PITR to
recover from the inconsistent states. So that's not an easy problem,
becoming even more tricky if more than one transaction is involved and
many transactions are inter-dependent across nodes.

> 3. The same GID can be reused multiple times. In this case the
> pg_prepared_xact_status function will return an incorrect result, because it
> will return information about the first global transaction with such a GID
> after the checkpoint and not the most recent one.

Yeah, this argument alone is why I think that this is a dead-end approach.

> There is actually an alternative approach to recovery of 2PC transactions.
> We can include the coordinator identifier in the GID (we can use
> GetSystemIdentifier() to identify the coordinator's node) along with the XID
> of the coordinator's transaction. In this case we can use txid_status() to
> check the status of the transaction at the coordinator. This eliminates the
> need to scan WAL to determine the status of a prepared transaction.

+GetOldestRestartPoint(, );
+
+xlogreader = XLogReaderAllocate(_local_xlog_page, NULL);
+if (!xlogreader)
So you scan a bunch of records for each GID? This is really costly. I
think that you would have an easier life by tracking the XID of each
transaction involved remotely. In Postgres-XL, this is not a problem
as XIDs are assigned globally and consistently. But you would gain in
performance by keeping track of it on the coordinator node.
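
For reference, the GID encoding proposed above could be as simple as the
sketch below (the format is purely illustrative); any node left in doubt
would then extract the coordinator XID from the GID and check it with
txid_status() against the coordinator:

#include "postgres.h"
#include "access/xlog.h"        /* GetSystemIdentifier() */

static void
build_global_gid(char *gid, size_t gidlen, TransactionId coord_xid)
{
    /* embed the coordinator's system identifier and its local XID */
    snprintf(gid, gidlen, "gtx-" UINT64_FORMAT "-%u",
             GetSystemIdentifier(), coord_xid);
}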
-- 
Michael




Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-28 Thread Michael Paquier
On Fri, Sep 29, 2017 at 10:44 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Fri, Sep 29, 2017 at 2:44 AM, Bossart, Nathan  wrote:
>>> Alright, I've added logging for autovacuum in v23.  I ended up needing to
>>> do a little restructuring to handle the case when the relation was skipped
>>> because the lock could not be obtained.  While doing so, I became
>>> convinced that LOG was probably the right level for autovacuum logs.
>
>> OK, of course let's not change the existing log levels. This could be
>> always tuned later on depending on feedback from others. I can see
>> that guc.c also uses elevel == 0 for some logic, so we could rely on
>> that as you do.
>
> FWIW, I don't think this patch should be mucking with logging behavior
> at all; that's not within its headline charter, and I doubt many people
> are paying attention.  I propose to commit it without that.  If you feel
> hot about changing the logging behavior, you can resubmit that as a new
> patch in a new thread where it will get some visibility and debate on
> its own merits.

Okay. I am fine with that as well.
-- 
Michael




Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-28 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Sep 29, 2017 at 2:44 AM, Bossart, Nathan  wrote:
>> Alright, I've added logging for autovacuum in v23.  I ended up needing to
>> do a little restructuring to handle the case when the relation was skipped
>> because the lock could not be obtained.  While doing so, I became
>> convinced that LOG was probably the right level for autovacuum logs.

> OK, of course let's not change the existing log levels. This could be
> always tuned later on depending on feedback from others. I can see
> that guc.c also uses elevel == 0 for some logic, so we could rely on
> that as you do.

FWIW, I don't think this patch should be mucking with logging behavior
at all; that's not within its headline charter, and I doubt many people
are paying attention.  I propose to commit it without that.  If you feel
hot about changing the logging behavior, you can resubmit that as a new
patch in a new thread where it will get some visibility and debate on
its own merits.

regards, tom lane




Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Amit Langote
On 2017/09/28 22:19, Maksim Milyutin wrote:
> I also noticed ambiguity in printing "No partition constraint" in
> non-verbose mode and "Partition constraint:..." in verbose one for
> partition tables regardless of the type of partition.
> Attached small patch removes any output about partition constraint in
> non-verbose mode.

Patch looks good.

So, we should be looking at partconstraintdef only when verbose is true,
because that's the only case in which we set it to a valid value.  Now, if
partconstraintdef is NULL even when verbose is true, that means the backend
reported that there exists no constraint for that partition, which I thought
would be true for a default partition (because the commit that introduced
default partitions also introduced "No partition constraint"), but it's
really not.

For example, \d and \d+ show contradictory outputs for a default partition.

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table pd partition of p default;

\d pd
 Table "public.pd"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
Partition of: p DEFAULT
No partition constraint

\d+ pd
Table "public.pd"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats
target | Description
+-+---+--+-+-+--+-
 a  | integer |   |  | | plain   |  |
Partition of: p DEFAULT
Partition constraint: (NOT ((a IS NOT NULL) AND (a = ANY (ARRAY[1]))))


Perhaps, there is no case when "No partition constraint" should be output,
but I may be missing something.

Thanks,
Amit





Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Amit Langote
On 2017/09/28 22:29, Jesper Pedersen wrote:
> On 09/28/2017 09:19 AM, Maksim Milyutin wrote:
>>> E.g. "No partition constraint" vs. "Partition constraint:
>>> satisfies_hash_partition(...)".
>>
>> I also noticed ambiguity in printing "No partition constraint" in
>> non-verbose mode and "Partition constraint:..." in verbose one for
>> partition tables regardless of the type of partition.
>> Attached small patch removes any output about partition constraint in
>> non-verbose mode.
>>
> 
> Yeah, that could be one way.
> 
> It should likely be backported to REL_10_STABLE, so the question is if we
> are too late in the release cycle to change that output.

I think the default partition commit [1] introduced some change around
that code, so the behavior is new in 11dev and I think it needs a fix like
the one that Maksim proposed.

When I check with REL_10_STABLE tip, I find things to be normal:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);

\d p1
 Table "public.p1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
Partition of: p FOR VALUES IN (1)

\d+ p1
Table "public.p1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats
target | Description
+-+---+--+-+-+--+-
 a  | integer |   |  | | plain   |  |
Partition of: p FOR VALUES IN (1)
Partition constraint: ((a IS NOT NULL) AND (a = ANY (ARRAY[1])))

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/bin/psql/describe.c;h=d22ec68431e231d9c781c2256a6030d66e0fd09d;hp=6fb9bdd063583fb8b60ad282aeb5256df67942e4;hb=6f6b99d1335be8ea1b74581fc489a97b109dd08a;hpb=2cf15ec8b1cb29bea149559700566a21a790b6d3





Re: [HACKERS] [Proposal] Allow users to specify multiple tables in VACUUM commands

2017-09-28 Thread Michael Paquier
On Fri, Sep 29, 2017 at 2:44 AM, Bossart, Nathan  wrote:
> Alright, I've added logging for autovacuum in v23.  I ended up needing to
> do a little restructuring to handle the case when the relation was skipped
> because the lock could not be obtained.  While doing so, I became
> convinced that LOG was probably the right level for autovacuum logs.

+   if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+   elevel = LOG;
+   else if (!IsAutoVacuumWorkerProcess())
+   elevel = WARNING;
+   else
+   elevel = 0;
OK, of course let's not change the existing log levels. This could be
always tuned later on depending on feedback from others. I can see
that guc.c also uses elevel == 0 for some logic, so we could rely on
that as you do.

@@ -116,6 +116,9 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
int elevel;
AcquireSampleRowsFunc acquirefunc = NULL;
BlockNumber relpages = 0;
+   boolrel_lock;
+
+   Assert(relation != NULL);
I can see that this is new in your patch. Definitely appropriate.

In short, I am switching it back to "ready for committer". We may want
the locking issues that arise when building the relation list to be
settled first.
-- 
Michael




Re: [HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-28 18:39:03 -0400, Tom Lane wrote:
>> + * Note: the reason for using a temporary variable "d", here 
>> and in
>> + * other places, is that some compilers think "*op->resvalue = 
>> f();"
>> + * requires them to evaluate op->resvalue into a register before
>> + * calling f(), just in case f() is able to modify op->resvalue
>> + * somehow.  The extra line of code can save a useless register 
>> spill
>> + * and reload, on architectures without many registers.

> I'd remove the "without many registers" bit - that's really more a
> function-call ABI question (#caller vs #callee saved registers) than
> about the actual architecture.

Fair enough.

I wondered how pervasive this behavior is.  AFAICS it is *not* required
by the C standard; C99 does not say that the left operand of assignment
must be evaluated first, in fact it says that the order of evaluation is
unspecified.  But the latest gcc I have at hand (6.4.1 on Fedora 25) still
does it this way.  OTOH, Apple's latest edition of clang (LLVM version
9.0.0 (clang-900.0.37)) appears to be just fine with waiting till after
the function call to load op->resvalue.  So that's not many data points,
but it does suggest that this is worth fixing, and is not just an artifact
of an old compiler version.

regards, tom lane




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-09-28 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-28 18:52:28 -0400, Tom Lane wrote:
>> Uh, what?  Access to fmgr_nbuiltins shouldn't be part of any critical path
>> anymore after this change.

> Indeed. But the size of the oid -> fmgr_builtins index array is
> relevant now. We could of course just make that dependent on
> FirstBootstrapObjectId, but that'd waste some memory.

Not enough to notice, considering there are pg_proc OIDs up in the 8K
range already.  We blow 2KB of never-accessed space for far less good
reason than this.

>> I'm kind of -0.5 on that.  I believe part of the argument for having
>> things set up as they were was to allow external code to access the
>> fmgr_builtins table (as my speed-test hack earlier today did).

> You could still do that, you'd just end up with a second copy. Doesn't
> seem bad for such an uncommon case.

If I understand what you're proposing, it would involve the extension
containing its *own* copy of the fmgr table, which seems pretty horrid.
It wouldn't necessarily match the actual contents in the core executable.

regards, tom lane




Re: [HACKERS] GSoC 2017: weekly progress reports (week 8)

2017-09-28 Thread Alexander Korotkov
On Thu, Sep 28, 2017 at 2:22 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Fri, Jul 28, 2017 at 7:58 AM, Shubham Barai 
> wrote:
>
>> I am attaching a patch for predicate locking in SP-Gist index
>>
>
> I took a look over this patch.
>
> newLeafBuffer = SpGistGetBuffer(index,
>> GBUF_LEAF | (isNulls ? GBUF_NULLS : 0),
>> Min(totalLeafSizes,
>> SPGIST_PAGE_CAPACITY),
>> );
>> PredicateLockPageSplit(index,
>> BufferGetBlockNumber(current->buffer),
>> BufferGetBlockNumber(newLeafBuffer));
>>
>
>> You move the predicate lock during a split only when a new leaf page is
>> allocated. However, SP-GiST may move items to the free space of another
>> busy page during a split (see other branches in doPickSplit()). Your patch
>> doesn't seem to handle this case correctly.
>

I've more thoughts about this patch.

+ * SPGist searches acquire predicate lock only on the leaf pages.
> + During a page split, a predicate lock is copied from the original
> + page to the new page.


This approach isn't going to work.  When a new tuple is inserted into
SP-GiST, the choose method can return spgAddNode.  In this case a new branch
of the tree is added, and this new branch could well be relevant to scans we
made in the past.  Therefore, you need to do predicate locking on internal
pages too in order to detect all the possible conflicts.
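
As a rough sketch of what that means (the exact call sites in the insert
and scan descents would need care, this only shows the idea), every visited
inner page would need something along these lines in addition to the
leaf-page locks:

    if (!SpGistPageIsLeaf(page))
        PredicateLockPage(index, BufferGetBlockNumber(buffer), snapshot);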

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-09-28 Thread Andres Freund
On 2017-09-28 18:52:28 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I might be worse than you... But anyway, here's a patch doing
> > so. Looking at profiles, it turned out that having the integer limits as
> > extern variables in a different TU isn't a great idea.
> 
> Uh, what?  Access to fmgr_nbuiltins shouldn't be part of any critical path
> anymore after this change.

Indeed. But the size of the oid -> fmgr_builtins index array is
relevant now. We could of course just make that dependent on
FirstBootstrapObjectId, but that'd waste some memory.


> > So I moved what
> > used to be fmgrtab.c to fmgrtab.h, and included it directly in fmgr.c.
> 
> I'm kind of -0.5 on that.  I believe part of the argument for having
> things set up as they were was to allow external code to access the
> fmgr_builtins table (as my speed-test hack earlier today did).

You could still do that, you'd just end up with a second copy. Doesn't
seem bad for such an uncommon case.


Greetings,

Andres Freund




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-09-28 Thread Tom Lane
Andres Freund  writes:
> I might be worse than you... But anyway, here's a patch doing
> so. Looking at profiles, it turned out that having the integer limits as
> extern variables in a different TU isn't a great idea.

Uh, what?  Access to fmgr_nbuiltins shouldn't be part of any critical path
anymore after this change.

> So I moved what
> used to be fmgrtab.c to fmgrtab.h, and included it directly in fmgr.c.

I'm kind of -0.5 on that.  I believe part of the argument for having
things set up as they were was to allow external code to access the
fmgr_builtins table (as my speed-test hack earlier today did).
While I'm not sure that anything really is using that API, I do not
believe we'd gain any performance by removing it, so why do so?
We can leave the table and the fmgr_nbuiltins variable completely as-is,
and just add an index table, which fmgr.c could be aware is of size
exactly "FirstBootstrapObjectId" entries.

> Is this roughly what you were thinking of?

I think you need the "no entry" values to be -1; 0 is a valid index
into the fmgr table.
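
Concretely, the lookup I'd expect is about like this (a sketch; the name of
the generated index array is whatever Gen_fmgrtab.pl ends up emitting):

static const FmgrBuiltin *
fmgr_isbuiltin(Oid id)
{
    int         index;

    if (id >= FirstBootstrapObjectId)
        return NULL;

    /* -1 means "no builtin function with this OID" */
    index = fmgr_builtin_oid_index[id];
    if (index == -1)
        return NULL;

    return &fmgr_builtins[index];
}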

regards, tom lane




Re: [HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Andres Freund
On 2017-09-28 18:39:03 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-09-28 16:21:34 -0400, Tom Lane wrote:
> >> We could save a pointless register spill
> >> and reload if there were a temporary variable in there,
> 
> > Makes sense.  Do you want to make it so, or shall I?
> 
> I just finished testing a patch, as attached.  On my machine (again,
> not latest gcc: 4.4.7 on RHEL6 x86_64), it reduces the code size of
> execExprInterp.o by a fraction of a percent, and it seems to offer
> a slight benefit in "pgbench -S" performance although I'd not put
> much stock in that being reproducible.

Cool.

> +  * Note: the reason for using a temporary variable "d", here 
> and in
> +  * other places, is that some compilers think "*op->resvalue = 
> f();"
> +  * requires them to evaluate op->resvalue into a register before
> +  * calling f(), just in case f() is able to modify op->resvalue
> +  * somehow.  The extra line of code can save a useless register 
> spill
> +  * and reload, on architectures without many registers.

I'd remove the "without many registers" bit - that's really more a
function-call ABI question (#caller vs #callee saved registers) than
about the actual architecture.

Greetings,

Andres Freund




Re: [HACKERS] The case for removing replacement selection sort

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 3:18 PM, Robert Haas  wrote:
> On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan  wrote:
>> With the additional enhancements made to Postgres 10, I doubt that
>> there are any remaining cases where it wins.
>
> I tried my favorite sorting test case -- pgbench -i -s 300 and then
> reindex index pgbench_accounts_pkey.  I set maintenance_work_mem to
> 4MB so that replacement selection would be chosen.
>
> On master, this takes ~21.5 seconds; with replacement_sort_tuples = 0,
> it takes ~19.1 seconds.  So, disabling replacement selection is a win.
>
> On 9.6, this takes ~19.1 seconds; with replacement_sort_tuples = 0, it
> takes ~23 seconds.  So, disabling replacement selection is a loss.
>
> This supports your theory that we should go ahead and remove
> replacement selection.

I'm glad to hear that. But, I should reiterate that if sorting
actually gets faster when my patch is applied, then that's something
that I consider to be a bonus. (This is primarily a refactoring patch,
to remove a bunch of obsolete code.)

> It does however bother me a bit that this test
> case is apparently slower in master than in 9.6, and not just with
> very small values of work_mem.  With default maintenance_work_mem
> (64MB), this takes about 13 seconds on 9.6 but about 15 seconds on
> master -- and with that setting replacement selection is not chosen at
> all.

As I mentioned, the picture changed for replacement selection during
the Postgres 10 development cycle, which caused me to finally push for
it to be killed in this thread. So, anyone that just started following
the thread should note that it's totally expected that replacement
selection still just about pulls its weight in 9.6, but not in 10.

> Any idea what causes this regression?

I don't know. My best guess is that the overall I/O scheduling is now
suboptimal due to commit e94568e (Heikki's preloading thing), because
this is CREATE INDEX, and there is new competitive pressure. You might
find it hard to replicate the problem with a "SELECT COUNT(DISTINCT
aid) FROM pgbench_accounts", which would confirm this explanation. Or,
you could also see what happens with a separate temp tablespace.

It's really, really hard to have a 100%, unambiguous win within
tuplesort.c. We've seen this time and again over the years.

-- 
Peter Geoghegan




Re: [HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Tom Lane
Andres Freund  writes:
> On 2017-09-28 16:21:34 -0400, Tom Lane wrote:
>> We could save a pointless register spill
>> and reload if there were a temporary variable in there,

> Makes sense.  Do you want to make it so, or shall I?

I just finished testing a patch, as attached.  On my machine (again,
not latest gcc: 4.4.7 on RHEL6 x86_64), it reduces the code size of
execExprInterp.o by a fraction of a percent, and it seems to offer
a slight benefit in "pgbench -S" performance although I'd not put
much stock in that being reproducible.

> I'd personally be
> somewhat tempted to keep the branches in sync here...

I was tempted that way too, but it doesn't apply cleanly to v10,
because of Peter's recent cleanup of function pointer invocation
style.  I don't think it's really worth worrying about.

regards, tom lane

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 09abd46..68a1f96 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -501,15 +501,17 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		EEO_CASE(EEOP_INNER_SYSVAR)
 		{
 			int			attnum = op->d.var.attnum;
+			Datum		d;
 
 			/* these asserts must match defenses in slot_getattr */
 			Assert(innerslot->tts_tuple != NULL);
 			Assert(innerslot->tts_tuple != &(innerslot->tts_minhdr));
-			/* heap_getsysattr has sufficient defenses against bad attnums */
 
-			*op->resvalue = heap_getsysattr(innerslot->tts_tuple, attnum,
-			innerslot->tts_tupleDescriptor,
-			op->resnull);
+			/* heap_getsysattr has sufficient defenses against bad attnums */
+			d = heap_getsysattr(innerslot->tts_tuple, attnum,
+innerslot->tts_tupleDescriptor,
+op->resnull);
+			*op->resvalue = d;
 
 			EEO_NEXT();
 		}
@@ -517,15 +519,17 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		EEO_CASE(EEOP_OUTER_SYSVAR)
 		{
 			int			attnum = op->d.var.attnum;
+			Datum		d;
 
 			/* these asserts must match defenses in slot_getattr */
 			Assert(outerslot->tts_tuple != NULL);
 			Assert(outerslot->tts_tuple != &(outerslot->tts_minhdr));
 
 			/* heap_getsysattr has sufficient defenses against bad attnums */
-			*op->resvalue = heap_getsysattr(outerslot->tts_tuple, attnum,
-			outerslot->tts_tupleDescriptor,
-			op->resnull);
+			d = heap_getsysattr(outerslot->tts_tuple, attnum,
+outerslot->tts_tupleDescriptor,
+op->resnull);
+			*op->resvalue = d;
 
 			EEO_NEXT();
 		}
@@ -533,15 +537,17 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		EEO_CASE(EEOP_SCAN_SYSVAR)
 		{
 			int			attnum = op->d.var.attnum;
+			Datum		d;
 
 			/* these asserts must match defenses in slot_getattr */
 			Assert(scanslot->tts_tuple != NULL);
 			Assert(scanslot->tts_tuple != &(scanslot->tts_minhdr));
-			/* heap_getsysattr has sufficient defenses against bad attnums */
 
-			*op->resvalue = heap_getsysattr(scanslot->tts_tuple, attnum,
-			scanslot->tts_tupleDescriptor,
-			op->resnull);
+			/* heap_getsysattr has sufficient defenses against bad attnums */
+			d = heap_getsysattr(scanslot->tts_tuple, attnum,
+scanslot->tts_tupleDescriptor,
+op->resnull);
+			*op->resvalue = d;
 
 			EEO_NEXT();
 		}
@@ -641,13 +647,22 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		 * As both STRICT checks and function-usage are noticeable performance
 		 * wise, and function calls are a very hot-path (they also back
 		 * operators!), it's worth having so many separate opcodes.
+		 *
+		 * Note: the reason for using a temporary variable "d", here and in
+		 * other places, is that some compilers think "*op->resvalue = f();"
+		 * requires them to evaluate op->resvalue into a register before
+		 * calling f(), just in case f() is able to modify op->resvalue
+		 * somehow.  The extra line of code can save a useless register spill
+		 * and reload, on architectures without many registers.
 		 */
 		EEO_CASE(EEOP_FUNCEXPR)
 		{
 			FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
+			Datum		d;
 
 			fcinfo->isnull = false;
-			*op->resvalue = op->d.func.fn_addr(fcinfo);
+			d = op->d.func.fn_addr(fcinfo);
+			*op->resvalue = d;
 			*op->resnull = fcinfo->isnull;
 
 			EEO_NEXT();
@@ -658,6 +673,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
 			bool	   *argnull = fcinfo->argnull;
 			int			argno;
+			Datum		d;
 
 			/* strict function, so check for NULL args */
 			for (argno = 0; argno < op->d.func.nargs; argno++)
@@ -669,7 +685,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 }
 			}
 			fcinfo->isnull = false;
-			*op->resvalue = op->d.func.fn_addr(fcinfo);
+			d = op->d.func.fn_addr(fcinfo);
+			*op->resvalue = d;
 			*op->resnull = fcinfo->isnull;
 
 	strictfail:
@@ -680,11 +697,13 

Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 3:20 PM, Robert Haas  wrote:
> On Thu, Sep 28, 2017 at 5:47 PM, Peter Geoghegan  wrote:
>> In the end, commit 6bfa88a fixed that old recovery bug by making sure
>> the recovery routine heap_xlog_lock() did the right thing. In both
>> cases (Feb 2014 and today), the index wasn't really corrupt -- it just
>> pointed to the root of a HOT chain when it should point to some child
>> tuple (or maybe a successor HOT chain).
>
> Unless I'm very confused, it's really not OK to point at a child tuple
> rather than the root of the HOT chain.

Please see my later clarification.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Robert Haas
On Thu, Sep 28, 2017 at 5:47 PM, Peter Geoghegan  wrote:
> In the end, commit 6bfa88a fixed that old recovery bug by making sure
> the recovery routine heap_xlog_lock() did the right thing. In both
> cases (Feb 2014 and today), the index wasn't really corrupt -- it just
> pointed to the root of a HOT chain when it should point to some child
> tuple (or maybe a successor HOT chain).

Unless I'm very confused, it's really not OK to point at a child tuple
rather than the root of the HOT chain.

Pointing to a successor HOT chain would be OK, as long as you point to
the root tuple thereof.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] The case for removing replacement selection sort

2017-09-28 Thread Robert Haas
On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan  wrote:
> With the additional enhancements made to Postgres 10, I doubt that
> there are any remaining cases where it wins.

I tried my favorite sorting test case -- pgbench -i -s 300 and then
reindex index pgbench_accounts_pkey.  I set maintenance_work_mem to
4MB so that replacement selection would be chosen.

On master, this takes ~21.5 seconds; with replacement_sort_tuples = 0,
it takes ~19.1 seconds.  So, disabling replacement selection is a win.

On 9.6, this takes ~19.1 seconds; with replacement_sort_tuples = 0, it
takes ~23 seconds.  So, disabling replacement selection is a loss.

This supports your theory that we should go ahead and remove
replacement selection.  It does however bother me a bit that this test
case is apparently slower in master than in 9.6, and not just with
very small values of work_mem.  With default maintenance_work_mem
(64MB), this takes about 13 seconds on 9.6 but about 15 seconds on
master -- and with that setting replacement selection is not chosen at
all.

Any idea what causes this regression?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Andres Freund
Hi,

On 2017-09-28 16:21:34 -0400, Tom Lane wrote:
> I noticed the following while poking around with perf:
>
>| fcinfo->isnull = false;
>|b5b:   movb   $0x0,0x1c(%rdx)
>| *op->resvalue = 
> op->d.func.fn_addr(fcinfo);
>   0.02 |   mov0x8(%rbx),%rcx
>   1.19 |   mov%rdx,%rdi
>   0.93 |   mov%rdx,(%rsp)
>|   mov%rcx,0x8(%rsp)
>   0.01 |   callq  *0x28(%rbx)
>   2.17 |   mov0x8(%rsp),%rcx
>|   mov%rax,(%rcx)
>| *op->resnull = fcinfo->isnull;
>   1.18 |   mov(%rsp),%rdx
>   4.32 |   mov0x10(%rbx),%rax
>   0.06 |   movzbl 0x1c(%rdx),%edx
>   9.14 |   mov%dl,(%rax)
>
> It looks to me like gcc believes it is required to evaluate "op->resvalue"
> before invoking the called function, just in case the function somehow has
> access to *op and modifies that.

Yea, the compiler probably doesn't have much choice besides assuming
that. Might be different if we'd annotate function pointers as pure, and
used strict aliasing + perhaps some restrict markers, but ...


> We could save a pointless register spill
> and reload if there were a temporary variable in there, ie
>
>   EEO_CASE(EEOP_FUNCEXPR)
>   {
>   FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
> + Datum   fvalue;
>
>   fcinfo->isnull = false;
> - *op->resvalue = op->d.func.fn_addr(fcinfo);
> + fvalue = op->d.func.fn_addr(fcinfo);
> + *op->resvalue = fvalue;
>   *op->resnull = fcinfo->isnull;
>
>   EEO_NEXT();
>   }
>
> and likewise in the other FUNCEXPR cases.
>
> This is on a rather old gcc, haven't checked on bleeding-edge versions.

Makes sense.  Do you want to make it so, or shall I?  I'd personally be
somewhat tempted to keep the branches in sync here...

Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 2:47 PM, Peter Geoghegan  wrote:
> In the end, commit 6bfa88a fixed that old recovery bug by making sure
> the recovery routine heap_xlog_lock() did the right thing. In both
> cases (Feb 2014 and today), the index wasn't really corrupt -- it just
> pointed to the root of a HOT chain when it should point to some child
> tuple (or maybe a successor HOT chain).

Just to be clear, obviously I don't mean that the index should have
been magically updated to compensate for that 2014 heap_lock_tuple()
recovery bug (say, via an in-place IndexTuple update during recovery).
Certainly, the index AM was entitled to assume that the heap TID it
pointed to would continue to be the root of the same HOT chain, at
least until the next VACUUM. That was what I meant by "the index
wasn't actually corrupt".

All I meant here is that the index scan and sequential scan would have
been in agreement if something like that *had* happened artificially
(say, when the index tuple was edited with a hex editor). And, that
the actual high level problem was that we failed to treat a
HOT-updated tuple as a HOT-updated tuple, leading to consequences that
were mostly noticeable during index scans. Probably on both occasions
(back in 2014, and now).

-- 
Peter Geoghegan




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-09-28 Thread Andres Freund
On 2017-09-27 15:50:05 -0400, Tom Lane wrote:
> ISTM it shouldn't be that hard to get Gen_fmgrtab.pl to emit an index
> array of the sort we're talking about, along with the FmgrBuiltin array
> it already prints out.  I'm the world's worst Perl programmer but
> I'm happy to take a stab at it if you don't want to.

I might be worse than you... But anyway, here's a patch doing
so. Looking at profiles, it turned out that having the integer limits as
extern variables in a different TU isn't a great idea. So I moved what
used to be fmgrtab.c to fmgrtab.h, and included it directly in fmgr.c.

Is this roughly what you were thinking of?

Greetings,

Andres Freund
>From 11fc054fda92fafc8b6c1ac66f70ecf059c61a9d Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 26 Sep 2017 12:40:56 -0700
Subject: [PATCH] Speed up fmgr_isbuiltin() by keeping an oid -> builtin
 mapping.

Turns out we have enough functions that the binary search is quite
noticeable in profiles.

To address that, have Gen_fmgrtab.pl build an oid -> fmgr_builtins
index mapping. With that mapping fmgr_isbuiltin() just consists out of
a comparison and two array lookups.

To avoid having to reference extern variables, move the generated
table to fmgrtab.h and include that from fmgr.c.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914065128.a5sk7z4xde5uy...@alap3.anarazel.de
---
 src/backend/Makefile | 14 +--
 src/backend/utils/Gen_fmgrtab.pl | 82 ++--
 src/backend/utils/Makefile   |  7 ++--
 src/backend/utils/fmgr/fmgr.c| 27 ++---
 src/include/fmgr.h   | 16 
 src/include/utils/fmgrtab.h  | 39 ---
 6 files changed, 102 insertions(+), 83 deletions(-)
 delete mode 100644 src/include/utils/fmgrtab.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index aab676dbbd..f0eb5d8d78 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -142,6 +142,8 @@ utils/errcodes.h: utils/generate-errcodes.pl utils/errcodes.txt
 # see explanation in parser/Makefile
 utils/fmgrprotos.h: utils/fmgroids.h ;
 
+utils/fmgrtab.h: utils/fmgroids.h ;
+
 utils/fmgroids.h: utils/Gen_fmgrtab.pl catalog/Catalog.pm $(top_srcdir)/src/include/catalog/pg_proc.h
 	$(MAKE) -C utils $(notdir $@)
 
@@ -169,7 +171,7 @@ submake-schemapg:
 
 .PHONY: generated-headers
 
-generated-headers: $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/storage/lwlocknames.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/fmgrprotos.h $(top_builddir)/src/include/utils/probes.h
+generated-headers: $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/storage/lwlocknames.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/fmgrprotos.h $(top_builddir)/src/include/utils/fmgrtab.h $(top_builddir)/src/include/utils/probes.h
 
 $(top_builddir)/src/include/parser/gram.h: parser/gram.h
 	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
@@ -201,6 +203,11 @@ $(top_builddir)/src/include/utils/fmgrprotos.h: utils/fmgrprotos.h
 	  cd '$(dir $@)' && rm -f $(notdir $@) && \
 	  $(LN_S) "$$prereqdir/$(notdir $<)" .
 
+$(top_builddir)/src/include/utils/fmgrtab.h: utils/fmgrtab.h
+	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
+	  cd '$(dir $@)' && rm -f $(notdir $@) && \
+	  $(LN_S) "$$prereqdir/$(notdir $<)" .
+
 $(top_builddir)/src/include/utils/probes.h: utils/probes.h
 	cd '$(dir $@)' && rm -f $(notdir $@) && \
 	$(LN_S) "../../../$(subdir)/utils/probes.h" .
@@ -219,7 +226,7 @@ distprep:
 	$(MAKE) -C catalog	schemapg.h postgres.bki postgres.description postgres.shdescription
 	$(MAKE) -C replication	repl_gram.c repl_scanner.c syncrep_gram.c syncrep_scanner.c
 	$(MAKE) -C storage/lmgr	lwlocknames.h
-	$(MAKE) -C utils	fmgrtab.c fmgroids.h fmgrprotos.h errcodes.h
+	$(MAKE) -C utils	fmgroids.h fmgrprotos.h fmgrtab.h errcodes.h
 	$(MAKE) -C utils/misc	guc-file.c
 	$(MAKE) -C utils/sort	qsort_tuple.c
 
@@ -312,6 +319,7 @@ clean:
 		$(top_builddir)/src/include/storage/lwlocknames.h \
 		$(top_builddir)/src/include/utils/fmgroids.h \
 		$(top_builddir)/src/include/utils/fmgrprotos.h \
+		$(top_builddir)/src/include/utils/fmgrtab.h \
 		$(top_builddir)/src/include/utils/probes.h
 ifeq ($(PORTNAME), cygwin)
 	rm -f postgres.dll libpostgres.a
@@ -341,7 +349,7 @@ maintainer-clean: distclean
 	  storage/lmgr/lwlocknames.h \
 	  utils/fmgroids.h \
 	  utils/fmgrprotos.h \
-	  utils/fmgrtab.c \
+	  utils/fmgrtab.h \
 	  utils/errcodes.h \
 	  utils/misc/guc-file.c \
 	  utils/sort/qsort_tuple.c
diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl
index 17851fe2a4..c09f8dedb1 100644
--- a/src/backend/utils/Gen_fmgrtab.pl
+++ 

Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 2:39 PM, Alvaro Herrera  wrote:
>> FWIW, I am reminded a little bit of the MultiXact/recovery bug I
>> reported way back in February of 2014 [1], which also had a HOT
>> interaction that caused index scans to give wrong answers, despite
>> more or less structurally sound indexes.
>>
>> [1] 
>> https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com
>
> Thanks for the reference.  I didn't remember this problem and it's not
> (wasn't) in my list of things to look into.  Perhaps these are both the
> same bug.

I was reminded of that old bug because initially, at the time, it
looked very much like a corrupt index: sequential scans were fine, but
index scans gave wrong answers. This is what I saw today.

In the end, commit 6bfa88a fixed that old recovery bug by making sure
the recovery routine heap_xlog_lock() did the right thing. In both
cases (Feb 2014 and today), the index wasn't really corrupt -- it just
pointed to the root of a HOT chain when it should point to some child
tuple (or maybe a successor HOT chain).

-- 
Peter Geoghegan




Re: [HACKERS] Arrays of domains

2017-09-28 Thread Tom Lane
Andrew Dunstan  writes:
> On 09/28/2017 01:11 PM, Tom Lane wrote:
>>> I wonder if we need to do any benchmarking to assure ourselves that the
>>> changes to ArrayCoerceExpr don't have a significant performance impact?

>> That would likely be a good idea, though I'm not very sure what or
>> how to benchmark.

> Some case where we only expect the current code to produce a single
> ArrayCoerceExpr, I guess. say doing text[] -> int[] ?

I spent some time looking into this.  I settled on int4[] -> int8[]
as the appropriate case to test, because int48() is about as cheap
a cast function as we have.  Q1 is the base case without a cast:

select count(x) from
  (select array[i,i,i,i,i,i,i,i,i,i] as x
   from generate_series(1,1000) i) ss;

Q2 is same with a cast added:

select count(x::int8[]) from
  (select array[i,i,i,i,i,i,i,i,i,i] as x
   from generate_series(1,1000) i) ss;

Q3 and Q4 are the same thing, but testing 100-element instead of
10-element arrays:

select count(x) from
  (select array[
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i
] as x
   from generate_series(1,1000) i) ss;

select count(x::int8[]) from
  (select array[
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i,
i,i,i,i,i,i,i,i,i,i
] as x
   from generate_series(1,1000) i) ss;

I get these query timings in a non-cassert build:

            HEAD            Patch

Q1          5453.235 ms     5440.876 ms
Q2          9340.670 ms     10191.194 ms
Q3          19078.457 ms    18967.279 ms
Q4          48196.338 ms    58547.531 ms

(Timings are reproducible to a few percent.)

So that's a bit disappointing; the per-element overhead is clearly
noticeably more than before.  However, poking into it with "perf"
gives some grounds for optimism; Q4's hotspots with the patch are

  Children      Self  Samples  Command     Shared Object  Symbol
+   33.44%    33.35%    81314  postmaster  postgres       [.] ExecInterpExpr
+   21.88%    21.83%    53223  postmaster  postgres       [.] array_map
+   15.19%    15.15%    36944  postmaster  postgres       [.] CopyArrayEls
+   14.63%    14.60%    35585  postmaster  postgres       [.] ArrayCastAndSet
+    6.07%     6.06%    14765  postmaster  postgres       [.] construct_md_array
+    1.80%     1.79%     4370  postmaster  postgres       [.] palloc0
+    0.77%     0.77%     1883  postmaster  postgres       [.] AllocSetAlloc
+    0.75%     0.74%     1815  postmaster  postgres       [.] int48
+    0.52%     0.52%     1276  postmaster  postgres       [.] advance_aggregates

Surely we could get the amount of time spent in ExecInterpExpr down.

One idea is to make a dedicated evalfunc for the case where the
expression is just EEOP_CASE_TESTVAL + EEOP_FUNCEXPR[_STRICT],
similar to the existing fast-path routines (ExecJust*).  I've not
actually tried to do that, but a reasonable guess is that it'd about
halve that overhead, putting this case back on a par with the HEAD code.
Also, I'd imagine that Andres' planned work on JIT-compiled expressions
would put this back on par with HEAD, if not better, for installations
using that.
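
To make the idea concrete, here is a minimal sketch of such a fast-path
routine, assuming the compiled expression is exactly EEOP_CASE_TESTVAL (with
the test value supplied by pointer, as array_map would arrange) followed by
EEOP_FUNCEXPR_STRICT.  Field names follow execExprInterp.c's existing
ExecJust* helpers; this is an untested illustration, not a proposal:

static Datum
ExecJustCaseTestFunc(ExprState *state, ExprContext *econtext, bool *isnull)
{
	ExprEvalStep *testop = &state->steps[0];
	ExprEvalStep *funcop = &state->steps[1];
	FunctionCallInfo fcinfo = funcop->d.func.fcinfo_data;
	Datum		result;

	/* EEOP_CASE_TESTVAL: copy the caller-supplied value into the arg slot */
	*testop->resvalue = *testop->d.casetest.value;
	*testop->resnull = *testop->d.casetest.isnull;

	/* EEOP_FUNCEXPR_STRICT: strict function, so NULL in means NULL out */
	if (*testop->resnull)
	{
		*isnull = true;
		return (Datum) 0;
	}

	fcinfo->isnull = false;
	result = funcop->d.func.fn_addr(fcinfo);
	*isnull = fcinfo->isnull;
	return result;
}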

Also I believe that Andres has plans to revamp the CaseTestExpr mechanism,
which might shave a few cycles as well.  Right now there's an extra copy
of each array datum + isnull value, because array_map sticks those into
one pair of variables and then the EEOP_CASE_TESTVAL opcode just moves
them somewhere else.  It's reasonable to hope that we could redesign that
so that array_map sticks the values straight into where they're needed,
and then we need only the FUNCEXPR opcode, which'd be a great candidate
for an ExecJust* fast-path.

Assuming that that's going to happen for v11, I'm inclined to leave the
optimization problem until the dust settles around CaseTestExpr.
Does anyone feel that this can't be committed before that's addressed?

regards, tom lane




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Alvaro Herrera
Peter Geoghegan wrote:

> We certainly do still see wrong answers to queries here:
> 
> postgres=# select ctid, xmin, xmax, * from t;
>  ctid  | xmin  | xmax | id | name | x
> -------+-------+------+----+------+---
>  (0,1) | 21171 |    0 |  1 | 111  | 0
>  (0,7) | 21177 |    0 |  3 | 333  | 5
> (2 rows)
>
> postgres=# select * from t where id = 3;
>  id | name | x
> ----+------+---
>   3 | 333  | 5
> (1 row)
>
> postgres=# set enable_seqscan = off;
> SET
> postgres=# select * from t where id = 3;
>  id | name | x
> ----+------+---
> (0 rows)

Yeah, oops.

> FWIW, I am reminded a little bit of the MultiXact/recovery bug I
> reported way back in February of 2014 [1], which also had a HOT
> interaction that caused index scans to give wrong answers, despite
> more or less structurally sound indexes.
> 
> [1] 
> https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com

Thanks for the reference.  I didn't remember this problem and it's not
(wasn't) in my list of things to look into.  Perhaps these are both the
same bug.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2017-09-28 Thread Alexander Korotkov
On Thu, Sep 28, 2017 at 10:52 PM, chenhj  wrote:

> On 2017-09-29 00:43:18,"Alexander Korotkov" 
> wrote:
>
> On Thu, Sep 28, 2017 at 6:44 PM, chenhj  wrote:
>
>> On 2017-09-28 01:29:29,"Alexander Korotkov" 
>> wrote:
>>
>> It appears that your patch conflicts with fc49e24f.  Please, rebase it.
>>
>>
>> Yes, I have rebased it. Please check the new patch.
>>
>
> Good, now it applies cleanly.
>
> else if (strncmp(path, XLOGDIR"/", strlen(XLOGDIR"/")) == 0 &&
>>  IsXLogFileName(path + strlen(XLOGDIR"/")) &&
>>  (strcmp(path + strlen(XLOGDIR"/") + 8, divergence_wal_filename + 8) < 0
>> ||
>>   strcmp(path + strlen(XLOGDIR"/") + 8, last_source_wal_filename + 8) >
>> 0))
>
>
> According to our coding style, you should leave a space between XLOGDIR
> and "/".
> Also, you do a trick by comparing xlog segment numbers using strcmp().
> It's nice, but I would prefer seeing XLogFromFileName() here.  It would
> improve code readability and be less error prone during further
> modifications.
>
>
> Thanks for the advice!
> I have modified it.
>

OK, the patch is getting better.
I also have a more general question.  Why do we need an upper bound for the
segment number (last_source_segno)?  I understand the purpose of the lower
bound (divergence_segno), which saves us from copying extra WAL files, but
what is the upper bound for?  As I understand it, we need to replay the most
recent WAL records anyway to reach a consistent state after pg_rewind.  I
propose to remove last_source_segno unless I'm missing something.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera  wrote:
> Alvaro Herrera wrote:
>
>> Odd that it's not fixed.  I guess there's still some more work to do
>> here ...
>
> Maybe what this means is that we need to do both Dan's initially
> proposed patch (or something related to it) apart from the fixes already
> pushed.  IOW we need to put back some of the "tupkeep" business ...

We certainly do still see wrong answers to queries here:

postgres=# select ctid, xmin, xmax, * from t;
 ctid  | xmin  | xmax | id | name | x
-------+-------+------+----+------+---
 (0,1) | 21171 |    0 |  1 | 111  | 0
 (0,7) | 21177 |    0 |  3 | 333  | 5
(2 rows)

postgres=# select * from t where id = 3;
 id | name | x
----+------+---
  3 | 333  | 5
(1 row)

postgres=# set enable_seqscan = off;
SET
postgres=# select * from t where id = 3;
 id | name | x
----+------+---
(0 rows)

FWIW, I am reminded a little bit of the MultiXact/recovery bug I
reported way back in February of 2014 [1], which also had a HOT
interaction that caused index scans to give wrong answers, despite
more or less structurally sound indexes.

[1] 
https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com

-- 
Peter Geoghegan




Re: [HACKERS] Surjective functional indexes

2017-09-28 Thread Konstantin Knizhnik

On 09/28/2017 10:10 PM, Robert Haas wrote:

On Wed, Sep 13, 2017 at 7:00 AM, Simon Riggs  wrote:

If we do have an option it won't be using fancy mathematical
terminology at all, it would be described in terms of its function,
e.g. recheck_on_update

+1.


I have nothing against renaming the "projection" option to "recheck_on_update"
or whatever else is suggested; just let me know the best version.
From my point of view "recheck_on_update" is too verbose and still not
self-explanatory (to understand the meaning of this option it is necessary to
understand how heap_update works).  "projection"/"non-injective"/... are more
declarative notions, describing a characteristic of the index, while
"recheck_on_update" is a more procedural notion, describing the behavior of
heap_update.





Yes, I'd rather not have an option at all, just some simple code with
useful effect, like we have in many other places.

I think the question we need to be able to answer is: What is the
probability that an update that would otherwise be non-HOT can be made
into a HOT update by performing a recheck to see whether the value has
changed?  It doesn't seem easy to figure that out from any of the
statistics we have available today or could easily get, because it
depends not only on the behavior of the expression which appears in
the index definition but also on the application behavior.  For
example, consider a JSON blob representing a bank account.
b->'balance' is likely to change most of the time, but
b->'account_holder_name' only rarely.  That's going to be hard for an
automated system to determine.

We should clearly check as many of the other criteria for a HOT update
as possible before performing a recheck of this type, so that it only
gets performed when it might help.  For example, if column a is
indexed and b->'foo' is indexed, there's no point in checking whether
b->'foo' has changed if we know that a has changed.  I don't know
whether it would be feasible to postpone deciding whether to do a
recheck until after we've figured out whether the page seems to
contain enough free space to allow a HOT update.

Turning non-HOT updates into HOT updates is really good, so it seems
likely that the rechecks will often be worthwhile.  If we avoid a HOT
update in 25% of cases, that's probably easily worth the CPU overhead
of a recheck assuming the function isn't something ridiculously
expensive to compute; the extra CPU cost will be repaid by reduced
bloat.  However, if we avoid a HOT update only one time in a million,
it's probably not worth the cost of recomputing the expression the
other 999,999 times.  I wonder where the crossover point is -- it
seems like something that could be figured out by benchmarking.

While I agree that it would be nice to have this be a completely
automatic determination, I am not sure that will be practical.  I
oppose overloading some other marker (like function_cost>1) for
this; that's too magical.


I almost agree with you.
Just a few remarks: indexes are rarely created on frequently changed
attributes like b->'balance'.  So with a properly designed schema one can
expect that most updates are HOT updates, i.e. they do not actually affect
any index.  But certainly different attributes may have different
probabilities of being updated, and unfortunately we do not know, without
performing the check, which attribute of the JSON field (or of any other
field used in an indexed expression) has changed.
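
For illustration, the kind of schema I have in mind (table and column names
are made up; the option spelling follows the current patch and may of course
change):

-- functional index on an attribute that applications rarely change;
-- the value recheck is assumed to be enabled by default for such indexes
CREATE TABLE account (id int PRIMARY KEY, info jsonb);
CREATE INDEX account_name_idx ON account ((info->>'name'));

-- if the indexed expression were expensive, the recheck could be disabled:
--   CREATE INDEX ... ON account ((info->>'name')) WITH (projection = false);

-- this update does not change info->>'name', so with the recheck it can
-- remain a HOT update even though the jsonb column itself was modified
UPDATE account SET info = jsonb_set(info, '{balance}', '42') WHERE id = 1;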

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-09-28 Thread Oleg Bartunov
On Thu, May 25, 2017 at 7:30 PM, Konstantin Knizhnik
 wrote:
> Right now Postgres determines whether an update operation touches an index
> or not based only on the set of affected columns.
> But in the case of functional indexes such a policy quite frequently leads
> to unnecessary index updates.
> For example, functional indexes are widely used for indexing JSON data:
> info->>'name'.
>
> JSON data may contain multiple attributes and only a few of them may be
> affected by an update.
> Moreover, indexes are usually built on immutable attributes (like "id",
> "isbn", "name", ...).
>
> Functions like (info->>'name') are named "surjective" in mathematics.
> I have a strong feeling that most functional indexes are based on
> surjective functions.
> For such indexes the current PostgreSQL index update policy is very
> inefficient.  It causes disabling of HOT updates and so leads to a
> significant performance degradation.
>
> Without this patch Postgres is slower than Mongo on YCSB benchmark with (50%
> update,50 % select)  workload.
> And after applying this patch Postgres beats Mongo at all workloads.

I confirm that the patch helps for workload A of YCSB, but it really
just extends the range of #clients where postgres outperforms mongodb
(see attached picture).  If we increase #clients > 100, postgres quickly
degrades not only for workload A, but even for workload B (5%
updates), while mongodb and mysql behave much better; but that is
another problem, which we will discuss in a different thread.

>
> My proposal is to check the value of the function for functional indexes
> instead of just comparing the set of affected attributes.
> Obviously, for some complex functions it may have a negative effect on
> update speed.
> This is why I have added a "surjective" option to the index. By default it
> is switched on for all functional indexes (based on my assumption that most
> functions used in functional indexes are surjective). But it is possible to
> explicitly disable it and make the decision whether the index needs to be
> updated or not based only on the set of affected attributes.
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>




[HACKERS] plpgsql_check future

2017-09-28 Thread Pavel Stehule
Hi

plpgsql_check is a mature project now, and I would like to start a discussion
about its future. It is still a private project, although it is important for
one of the key PostgreSQL features - PL/pgSQL. I would be happy if the
community could take some responsibility for this project. The project is too
small to create its own community and infrastructure the way PostGIS did.

Development is quiet most of the time. There are two issues:

1. It is placed in my personal repository on GitHub. That is far from perfect
from a security and substitutability perspective.

2. There are problems with builds for platforms other than Linux. Builds on
MS Windows are difficult for me, because I don't use Windows.

plpgsql_check is too big to be part of contrib. But I would welcome anything
similar that would help me with the mentioned issues.

Any ideas?

Regards

Pavel


[HACKERS] Minor codegen silliness in ExecInterpExpr()

2017-09-28 Thread Tom Lane
I noticed the following while poking around with perf:

   | fcinfo->isnull = false;
   |b5b:   movb   $0x0,0x1c(%rdx)
   | *op->resvalue = op->d.func.fn_addr(fcinfo);
  0.02 |   mov0x8(%rbx),%rcx
  1.19 |   mov%rdx,%rdi
  0.93 |   mov%rdx,(%rsp)
   |   mov%rcx,0x8(%rsp)
  0.01 |   callq  *0x28(%rbx)
  2.17 |   mov0x8(%rsp),%rcx
   |   mov%rax,(%rcx)
   | *op->resnull = fcinfo->isnull;
  1.18 |   mov(%rsp),%rdx
  4.32 |   mov0x10(%rbx),%rax
  0.06 |   movzbl 0x1c(%rdx),%edx
  9.14 |   mov%dl,(%rax)

It looks to me like gcc believes it is required to evaluate "op->resvalue"
before invoking the called function, just in case the function somehow has
access to *op and modifies that.  We could save a pointless register spill
and reload if there were a temporary variable in there, ie

EEO_CASE(EEOP_FUNCEXPR)
{
FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
+   Datum   fvalue;

fcinfo->isnull = false;
-   *op->resvalue = op->d.func.fn_addr(fcinfo);
+   fvalue = op->d.func.fn_addr(fcinfo);
+   *op->resvalue = fvalue;
*op->resnull = fcinfo->isnull;

EEO_NEXT();
}

and likewise in the other FUNCEXPR cases.

This is on a rather old gcc, haven't checked on bleeding-edge versions.

regards, tom lane




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Odd that it's not fixed.  I guess there's still some more work to do
> here ...

Maybe what this means is that we need to do both Dan's initially
proposed patch (or something related to it) apart from the fixes already
pushed.  IOW we need to put back some of the "tupkeep" business ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Thu, Sep 28, 2017 at 7:47 AM, Alvaro Herrera  
> wrote:
> > Fix freezing of a dead HOT-updated tuple
> 
> If I run Dan Wood's test case again, the obvious symptom (spurious
> duplicates) goes away. However, the enhanced amcheck, and thus CREATE
> INDEX/REINDEX, still isn't happy about this:
> 
> postgres=# select bt_index_check('t_pkey', true);
> DEBUG:  0: verifying presence of required tuples in index "t_pkey"
> LOCATION:  bt_check_every_level, verify_nbtree.c:424
> ERROR:  XX000: failed to find parent tuple for heap-only tuple at
> (0,6) in table "t"
> LOCATION:  IndexBuildHeapRangeScan, index.c:2597
> Time: 3.699 ms

... Rats, I obviously missed the message where you said that amcheck
detected this problem.

Odd that it's not fixed.  I guess there's still some more work to do
here ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2017-09-28 Thread chenhj
On 2017-09-29 00:43:18,"Alexander Korotkov"  wrote:

On Thu, Sep 28, 2017 at 6:44 PM, chenhj  wrote:

On 2017-09-28 01:29:29,"Alexander Korotkov"  wrote:

It appears that your patch conflicts with fc49e24f.  Please, rebase it.



Yes, I have rebased it. Please check the new patch.


Good, now it applies cleanly.


else if (strncmp(path, XLOGDIR"/", strlen(XLOGDIR"/")) == 0 &&
 IsXLogFileName(path + strlen(XLOGDIR"/")) &&
 (strcmp(path + strlen(XLOGDIR"/") + 8, divergence_wal_filename + 8) < 0 ||
  strcmp(path + strlen(XLOGDIR"/") + 8, last_source_wal_filename + 8) > 0))


According to our coding style, you should leave a space between XLOGDIR and "/".
Also, you do a trick by comparing xlog segment numbers using strcmp().  It's
nice, but I would prefer seeing XLogFromFileName() here.  It would improve code
readability and be less error prone during further modifications.




--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Thanks for the advice!
I have modified it.


-
Best Regards,
Chen Huajun






pg_rewind_wal_copy_reduce_v5.patch
Description: Binary data



[HACKERS] initdb w/ restart

2017-09-28 Thread Jesper Pedersen

Hi,

In the case of

initdb /tmp/pgsql

followed by

pg_ctl -D /tmp/pgsql/ -l /tmp/logfile restart

you'll get

pg_ctl: PID file "/tmp/pgsql/postmaster.pid" does not exist
Is server running?
starting server anyway
pg_ctl: could not read file "/tmp/pgsql/postmaster.opts"

The attached patch changes the message to "trying to start server
anyway" to highlight that it is an attempt, not something that is
guaranteed to happen.


Probably not a good idea to change the logic around pg_ctl.c:688, hence 
this suggestion.


Thoughts ?

Best regards,
 Jesper
>From 9e8cdda3173a25f1e14cccd7261877f160d1b0f7 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Thu, 28 Sep 2017 15:31:24 -0400
Subject: [PATCH] Change message for restarting a server from a directory
 without a PID file. This accounts for the case where a restart happens after
 an initdb

Author: Jesper Pedersen 
---
 src/bin/pg_ctl/pg_ctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 4e02c4cea1..f5281a36a8 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -965,7 +965,7 @@ do_restart(void)
 		write_stderr(_("%s: PID file \"%s\" does not exist\n"),
 	 progname, pid_file);
 		write_stderr(_("Is server running?\n"));
-		write_stderr(_("starting server anyway\n"));
+		write_stderr(_("trying to start server anyway\n"));
 		do_start();
 		return;
 	}
-- 
2.13.5




Re: [HACKERS] pgbench stuck with 100% cpu usage

2017-09-28 Thread Fabien COELHO



The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.


AFAICR the commit was mostly a heavy restructuring of previous unmaintainable 
spaghetti code. I'm not sure the problem was not there before under one form 
or another.


I agree that it should error out & stop the client in this case at least.


Here is a probable "fix", which does what the comment said should be done.

I could not trigger an infinite loop with various kill -9 and other quick 
stops. Could you try it on your side?


--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e37496c..f039413 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2194,12 +2194,8 @@ doCustom(TState *thread, CState *st, StatsData *agg)
 {
 	if (!sendCommand(st, command))
 	{
-		/*
-		 * Failed. Stay in CSTATE_START_COMMAND state, to
-		 * retry. ??? What the point or retrying? Should
-		 * rather abort?
-		 */
-		return;
+		commandFailed(st, "SQL command send failed");
+		st->state = CSTATE_ABORTED;
 	}
 	else
 		st->state = CSTATE_WAIT_RESULT;



Re: [HACKERS] Surjective functional indexes

2017-09-28 Thread Robert Haas
On Wed, Sep 13, 2017 at 7:00 AM, Simon Riggs  wrote:
> If we do have an option it won't be using fancy mathematical
> terminology at all, it would be described in terms of its function,
> e.g. recheck_on_update

+1.

> Yes, I'd rather not have an option at all, just some simple code with
> useful effect, like we have in many other places.

I think the question we need to be able to answer is: What is the
probability that an update that would otherwise be non-HOT can be made
into a HOT update by performing a recheck to see whether the value has
changed?  It doesn't seem easy to figure that out from any of the
statistics we have available today or could easily get, because it
depends not only on the behavior of the expression which appears in
the index definition but also on the application behavior.  For
example, consider a JSON blob representing a bank account.
b->'balance' is likely to change most of the time, but
b->'account_holder_name' only rarely.  That's going to be hard for an
automated system to determine.

We should clearly check as many of the other criteria for a HOT update
as possible before performing a recheck of this type, so that it only
gets performed when it might help.  For example, if column a is
indexed and b->'foo' is indexed, there's no point in checking whether
b->'foo' has changed if we know that a has changed.  I don't know
whether it would be feasible to postpone deciding whether to do a
recheck until after we've figured out whether the page seems to
contain enough free space to allow a HOT update.

Turning non-HOT updates into HOT updates is really good, so it seems
likely that the rechecks will often be worthwhile.  If we avoid a HOT
update in 25% of cases, that's probably easily worth the CPU overhead
of a recheck assuming the function isn't something ridiculously
expensive to compute; the extra CPU cost will be repaid by reduced
bloat.  However, if we avoid a HOT update only one time in a million,
it's probably not worth the cost of recomputing the expression the
other 999,999 times.  I wonder where the crossover point is -- it
seems like something that could be figured out by benchmarking.
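
One possible shape for such a benchmark, just as a sketch (schema, key names
and row counts are arbitrary, and it assumes a build where the recheck
behavior can be switched on and off so the two runs can be compared on tps
and dead-tuple counts):

-- setup: expression index on a key the workload never modifies
CREATE TABLE account (id int PRIMARY KEY, b jsonb);
INSERT INTO account
    SELECT g, jsonb_build_object('account_holder_name', g::text, 'balance', 0)
    FROM generate_series(1, 100000) g;
CREATE INDEX account_name_idx ON account ((b->>'account_holder_name'));

-- pgbench custom script: every update touches only b->'balance', so a recheck
-- would turn each of these otherwise non-HOT updates into a HOT update
\set id random(1, 100000)
UPDATE account SET b = jsonb_set(b, '{balance}', to_jsonb(random())) WHERE id = :id;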

While I agree that it would be nice to have this be a completely
automatic determination, I am not sure that will be practical.  I
oppose overloading some other marker (like function_cost>1) for
this; that's too magical.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Domains and arrays and composites, oh my

2017-09-28 Thread Tom Lane
Andrew Dunstan  writes:
> On 09/28/2017 01:02 PM, Tom Lane wrote:
>>> I do think that treating a function returning a domain-over-composite
>>> differently from one returning a base composite is a POLA. We'd be very
>>> hard put to explain the reasons for it to an end user.

>> Do you have any thoughts about how we ought to resolve that?

> Not offhand. Maybe we need to revisit the decision not to modify the
> executor at all.

I think it's more of a parse analysis change: the issue is whether to
smash a function's result type to base when determining whether it emits
columns.  Maybe we could just do that in that context, and otherwise leave
domains alone.

> One thought I had was that we could invent a new return
> type of TYPEFUNC_DOMAIN_COMPOSITE so there would be less danger of a PL
> just treating it as an unconstrained base type as it might do if it saw
> TYPEFUNC_COMPOSITE.

Hmm.  That would be a way of forcing the issue, no doubt ...

regards, tom lane




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-09-28 Thread Tom Lane
I wrote:
> [ we should use an index array ]

Just to prove the point, I threw together the attached trivial test case,
which time-trials the existing fmgr_isbuiltin implementation against both
the proposed hash implementation and a simple index array.  On my machine,
with a repeat count of 1, I get

NOTICE:  bsearch runtime 4234.087 ms
NOTICE:  hash runtime 2542.636 ms
NOTICE:  index runtime 165.184 ms

(These numbers are repeatable within 1% or so.)

It could be argued that trialling OIDs sequentially gives a bit of an
unfair advantage to the bsearch and index methods over the hash method,
because the former are going to suffer fewer cache misses that way.
But I don't see a randomized lookup order changing the conclusion much.

regards, tom lane

#include "postgres.h"

#include "fmgr.h"
#include "access/transam.h"
#include "portability/instr_time.h"
#include "utils/fmgrtab.h"
#include "utils/hashutils.h"

PG_MODULE_MAGIC;


static const FmgrBuiltin *
fmgr_isbuiltin_bsearch(Oid id)
{
	int			low = 0;
	int			high = fmgr_nbuiltins - 1;

	/*
	 * Loop invariant: low is the first index that could contain target entry,
	 * and high is the last index that could contain it.
	 */
	while (low <= high)
	{
		int			i = (high + low) / 2;
		const FmgrBuiltin *ptr = &fmgr_builtins[i];

		if (id == ptr->foid)
			return ptr;
		else if (id > ptr->foid)
			low = i + 1;
		else
			high = i - 1;
	}
	return NULL;
}


/*
 * Hashtable for fast lookup builtin functions.
 */
typedef struct BuiltinOidLookupEntry
{
	Oid			foid;
	int			status;
	const FmgrBuiltin *builtin;
} BuiltinOidLookupEntry;

/* define hashtable mapping function OIDs to BuiltinOidLookupEntry's */
#define SH_PREFIX oid2builtins
#define SH_ELEMENT_TYPE BuiltinOidLookupEntry
#define SH_KEY_TYPE Oid
#define SH_KEY foid
#define SH_HASH_KEY(tb, key) murmurhash32(key)
#define SH_EQUAL(tb, a, b) a == b
#define SH_SCOPE static inline
#define SH_DEFINE
#define SH_DECLARE
#include "lib/simplehash.h"

static oid2builtins_hash * oid2builtins = 0;

static const FmgrBuiltin *
fmgr_isbuiltin_hash(Oid id)
{
	BuiltinOidLookupEntry *entry;

	entry = oid2builtins_lookup(oid2builtins, id);
	if (entry)
		return entry->builtin;
	else
		return NULL;
}

static void
hash_setup(void)
{
	int			i;

	oid2builtins = oid2builtins_create(TopMemoryContext,
	   fmgr_nbuiltins,
	   NULL);
	for (i = 0; i < fmgr_nbuiltins; i++)
	{
		const FmgrBuiltin *ptr = &fmgr_builtins[i];
		BuiltinOidLookupEntry *entry;
		bool		found;

		entry = oid2builtins_insert(oid2builtins, ptr->foid, &found);
		Assert(!found);
		entry->builtin = ptr;
	}
}


static int16 builtins_index[FirstBootstrapObjectId];

static const FmgrBuiltin *
fmgr_isbuiltin_index(Oid id)
{
	int			i;

	if (id >= FirstBootstrapObjectId)
		return NULL;
	i = builtins_index[id];
	if (i < 0)
		return NULL;
	return &fmgr_builtins[i];
}

static void
index_setup(void)
{
	int			i;

	memset(builtins_index, -1, sizeof(builtins_index));
	for (i = 0; i < fmgr_nbuiltins; i++)
	{
		const FmgrBuiltin *ptr = &fmgr_builtins[i];

		builtins_index[ptr->foid] = i;
	}
}


PG_FUNCTION_INFO_V1(test_lookups);

Datum
test_lookups(PG_FUNCTION_ARGS)
{
	int			rep_count = PG_GETARG_INT32(0);
	instr_time	start_time;
	instr_time	end_time;
	int			i;

	INSTR_TIME_SET_CURRENT(start_time);

	for (i = 0; i < rep_count; i++)
	{
		int			ct = 0;
		Oid			fo;

		for (fo = 1; fo < FirstBootstrapObjectId; fo++)
		{
			if (fmgr_isbuiltin_bsearch(fo))
ct++;
		}

		if (ct != fmgr_nbuiltins)
			elog(ERROR, "got %d builtins instead of %d", ct, fmgr_nbuiltins);
	}

	INSTR_TIME_SET_CURRENT(end_time);
	INSTR_TIME_SUBTRACT(end_time, start_time);
	elog(NOTICE, "bsearch runtime %.3f ms",
		 1000.0 * INSTR_TIME_GET_DOUBLE(end_time));

	hash_setup();

	INSTR_TIME_SET_CURRENT(start_time);

	for (i = 0; i < rep_count; i++)
	{
		int			ct = 0;
		Oid			fo;

		for (fo = 1; fo < FirstBootstrapObjectId; fo++)
		{
			if (fmgr_isbuiltin_hash(fo))
ct++;
		}

		if (ct != fmgr_nbuiltins)
			elog(ERROR, "got %d builtins instead of %d", ct, fmgr_nbuiltins);
	}

	INSTR_TIME_SET_CURRENT(end_time);
	INSTR_TIME_SUBTRACT(end_time, start_time);
	elog(NOTICE, "hash runtime %.3f ms",
		 1000.0 * INSTR_TIME_GET_DOUBLE(end_time));

	index_setup();

	INSTR_TIME_SET_CURRENT(start_time);

	for (i = 0; i < rep_count; i++)
	{
		int			ct = 0;
		Oid			fo;

		for (fo = 1; fo < FirstBootstrapObjectId; fo++)
		{
			if (fmgr_isbuiltin_index(fo))
ct++;
		}

		if (ct != fmgr_nbuiltins)
			elog(ERROR, "got %d builtins instead of %d", ct, fmgr_nbuiltins);
	}

	INSTR_TIME_SET_CURRENT(end_time);
	INSTR_TIME_SUBTRACT(end_time, start_time);
	elog(NOTICE, "index runtime %.3f ms",
		 1000.0 * INSTR_TIME_GET_DOUBLE(end_time));

	PG_RETURN_VOID();
}



Re: [HACKERS] Domains and arrays and composites, oh my

2017-09-28 Thread Andrew Dunstan


On 09/28/2017 01:02 PM, Tom Lane wrote:
>
>> I do think that treating a function returning a domain-over-composite
>> differently from one returning a base composite is a POLA. We'd be very
>> hard put to explain the reasons for it to an end user.
> Do you have any thoughts about how we ought to resolve that?
>
>


Not offhand. Maybe we need to revisit the decision not to modify the
executor at all. Obviously that would make the patch a good deal more
invasive ;-(  One thought I had was that we could invent a new return
type of TYPEFUNC_DOMAIN_COMPOSITE so there would be less danger of a PL
just treating it as an unconstrained base type as it might do if it saw
TYPEFUNC_COMPOSITE.

Maybe I'm wrong, but I have a strong suspicion that if we leave it like
this now we'll regret it in the future.

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] pgbench stuck with 100% cpu usage

2017-09-28 Thread Fabien COELHO



While running some tests, I encountered a situation where pgbench gets
stuck in an infinite loop, consuming 100% cpu. The setup was:

- Start postgres server from the master branch
- Initialise pgbench
- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate


That is a strange test to run, but it would be better if the behavior was 
not that one.



Now it seems that pgbench gets stuck and its state machine does not
advance. Attaching it to a debugger, I saw that one of the clients remains
stuck in this loop forever.

   if (!sendCommand(st, command))
   {
   /*
* Failed. Stay in CSTATE_START_COMMAND state, to
* retry. ??? What the point or retrying? Should
* rather abort?
*/


As the comments indicate and your situation shows, stopping the client would
probably be a much better option when the send fails, instead of retrying...
indefinitely.



The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.


AFAICR the commit was mostly a heavy restructuring of previous 
unmaintainable spaghetti code. I'm not sure the problem was not there 
before under one form or another.


I agree that it should error out & stop the client in this case at least.

--
Fabien.




Re: [HACKERS] Arrays of domains

2017-09-28 Thread Andrew Dunstan


On 09/28/2017 01:11 PM, Tom Lane wrote:
>
>> I wonder if we need to do any benchmarking to assure ourselves that the
>> changes to ArrayCoerceExpr don't have a significant performance impact?
> That would likely be a good idea, though I'm not very sure what or
> how to benchmark.
>
>   


Some case where we only expect the current code to produce a single
ArrayCoerceExpr, I guess. Say, doing text[] -> int[]?

cheers

andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] 200 = 199 + 1?

2017-09-28 Thread Tom Lane
Marko Tiikkaja  writes:
> On Wed, Sep 27, 2017 at 5:45 PM, Tom Lane  wrote:
>> Looking at it another way, the main thing that the combination of hashagg
>> outer path + indexscan inner path knows that eqjoinsel_semi didn't account
>> for is that there's a unique index on foo.id.  But that info is available
>> to eqjoinsel_semi, in the sense that it's been given a nondefault estimate
>> that nd1 is equal to the outer relation size.  So the mistake that it's
>> making is to throw up its hands and use an 0.5 selectivity estimate just
>> because it has no info about the inner relation.  I think if we'd pushed
>> through the nd2/nd1 calculation after setting nd2 = size of inner rel,
>> we'd end up with an estimate matching the product of these path sizes.
>> (Caution: inadequate caffeine absorbed yet, this might be all wrong.)

> This sounds very reasonable to me.

I experimented a bit with the attached patch, which modifies
eqjoinsel_semi in two ways.  First, if the number-of-distinct-values
estimate for the inner rel is just a default rather than having any
real basis, it replaces it with inner_rel->rows, effectively assuming
that the inside of the IN or EXISTS is unique.  Second, it drops the
fallback to selectivity 0.5 altogether, just applying the nd1 vs nd2
heuristic all the time.  This gets rid of the discontinuity of behavior
at 200 estimated distinct values.  The behavior in your example is
maybe a bit funny-looking: for LIMITs above 200 you get something like

 Nested Loop  (cost=7.18..869.25 rows=300 width=4)
   ->  HashAggregate  (cost=6.75..8.75 rows=200 width=4)
 Group Key: i.i
 ->  Limit  (cost=0.00..3.00 rows=300 width=4)
   ->  Function Scan on generate_series i  (cost=0.00..10.00 
rows=1000 width=4)
   ->  Index Only Scan using foo_pkey on foo  (cost=0.42..4.30 rows=1 width=4)
 Index Cond: (id = i.i)

The reported rowcount for the HashAggregate is still 200, which
is out of sync with what we did at the join level.  I think that's
just a cosmetic thing though, and am disinclined to try to jury-rig
something to make it look different.

The change causes one plan change that's visible in the regression test
outputs; that change seems OK so I've just included an expected-output
change below.

This could use somebody trying harder to break it than I have.  It might
also be wise to dig into the list archives and see what prompted the
inclusion of the don't-use-the-equations-with-default-ndistinct logic
in the first place.
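
One cheap case to poke at is the empty-inner-rel situation that the removed
clamping comment worried about, e.g. (table names are made up):

CREATE TABLE big AS SELECT g AS id FROM generate_series(1, 100000) g;
CREATE TABLE empty_t (id int);   -- never analyzed, no statistics at all
ANALYZE big;

EXPLAIN SELECT * FROM big b
WHERE EXISTS (SELECT 1 FROM empty_t e WHERE e.id = b.id);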

regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index db1792b..c32ff2b 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*** eqjoinsel_semi(Oid operator,
*** 2570,2575 
--- 2570,2585 
  	memset(&sslot2, 0, sizeof(sslot2));
  
  	/*
+ 	 * If we lack any fact-based estimate for nd2, it seems best to set it
+ 	 * equal to the inner relation size estimate, ie, assume the inner side of
+ 	 * the semijoin is unique.  This may lead to overestimating the size of
+ 	 * the join, but that's usually better than an underestimate.  We don't
+ 	 * make any comparable assumption for the outer side, though.
+ 	 */
+ 	if (isdefault2)
+ 		nd2 = inner_rel->rows;
+ 
+ 	/*
  	 * We clamp nd2 to be not more than what we estimate the inner relation's
  	 * size to be.  This is intuitively somewhat reasonable since obviously
  	 * there can't be more than that many distinct values coming from the
*** eqjoinsel_semi(Oid operator,
*** 2583,2606 
  	 * We can apply this clamping both with respect to the base relation from
  	 * which the join variable comes (if there is just one), and to the
  	 * immediate inner input relation of the current join.
- 	 *
- 	 * If we clamp, we can treat nd2 as being a non-default estimate; it's not
- 	 * great, maybe, but it didn't come out of nowhere either.  This is most
- 	 * helpful when the inner relation is empty and consequently has no stats.
  	 */
  	if (vardata2->rel)
  	{
! 		if (nd2 >= vardata2->rel->rows)
! 		{
  			nd2 = vardata2->rel->rows;
- 			isdefault2 = false;
- 		}
  	}
! 	if (nd2 >= inner_rel->rows)
! 	{
  		nd2 = inner_rel->rows;
- 		isdefault2 = false;
- 	}
  
  	if (HeapTupleIsValid(vardata1->statsTuple))
  	{
--- 2593,2606 
  	 * We can apply this clamping both with respect to the base relation from
  	 * which the join variable comes (if there is just one), and to the
  	 * immediate inner input relation of the current join.
  	 */
  	if (vardata2->rel)
  	{
! 		if (nd2 > vardata2->rel->rows)
  			nd2 = vardata2->rel->rows;
  	}
! 	if (nd2 > inner_rel->rows)
  		nd2 = inner_rel->rows;
  
  	if (HeapTupleIsValid(vardata1->statsTuple))
  	{
*** eqjoinsel_semi(Oid operator,
*** 2701,2723 
  		 * the uncertain rows that a fraction nd2/nd1 have join partners. We
  		 * can discount the known-matched MCVs from the 

Re: [HACKERS] Arrays of domains

2017-09-28 Thread Tom Lane
Andrew Dunstan  writes:
> On 08/11/2017 01:17 PM, Tom Lane wrote:
>> Attached is a patch series that allows us to create arrays of domain
>> types.

> I've reviewed and tested the updated versions of these patches. The
> patches apply but there's an apparent typo in arrayfuncs.c -
> DatumGetAnyArray instead of DatumGetAnyArrayP

Thanks for reviewing!  The DatumGetAnyArrayP thing is another artifact
of 4bd199465 --- sorry for missing that.

> Some of the line breaking in argument lists for some of the code
> affected by these patches is a bit bizarre. It hasn't been made worse by
> these patches but it hasn't been made better either. That's especially
> true of patch 1.

Yeah, perhaps.  A lot of these argument lists are long enough that I'm
not especially thrilled with the idea of making them one-arg-per-line;
that seems like it would consume a lot of vertical space and make it
harder to see context in a finite-size window.  I think there's been
some attempt at grouping the arguments into related groups on single
lines, though I concede it's probably not very obvious nor 100%
consistent.

> I wonder if we need to do any benchmarking to assure ourselves that the
> changes to ArrayCoerceExpr don't have a significant performance impact?

That would likely be a good idea, though I'm not very sure what or
how to benchmark.

regards, tom lane




Re: [HACKERS] Domains and arrays and composites, oh my

2017-09-28 Thread Tom Lane
Andrew Dunstan  writes:
> On 07/13/2017 03:19 PM, Tom Lane wrote:
>> Attached is a draft patch that allows domains over composite types.
>> I think it's probably complete on its own terms, but there are some
>> questions around behavior of functions returning domain-over-composite
>> that could use discussion, and some of the PLs need some follow-on work.

> This is a pretty nice patch, and very small indeed all things
> considered. From a code point of view I have no criticism, although
> maybe we need to be a bit more emphatic in the header file comments
> about the unwisdom of using get_expr_result_tupdesc().

Thanks for reviewing!

> I do think that treating a function returning a domain-over-composite
> differently from one returning a base composite is a POLA. We'd be very
> hard put to explain the reasons for it to an end user.

Do you have any thoughts about how we ought to resolve that?

> I also think we shouldn't commit this until we have accompanying patches
> for the core PLs, at least for plpgsql but I bet there are things that
> should be fixed for the others too.

For my own part, I think it would be reasonable to commit the core patch
once we've resolved the question of what to do with the case of
function-in-FROM returning domain over composite.  That's core parser
behavior so it should be part of the same patch.  I think addressing
each PL separately in followon patches would be fine and would help to
avoid the giant-unreviewable-patch syndrome.  It is important to get
all the related work done in one release cycle, but since we're just
starting v11 I'm not too worried about that.

regards, tom lane




[HACKERS] pg_prepared_xact_status

2017-09-28 Thread Konstantin Knizhnik

Hi,

In Postgres 10 we have the txid_status() function, which returns the status
of a transaction by XID.
I wonder whether it would also be useful to have a similar function for 2PC
transactions which operates on a GID.
The pg_prepared_xacts view allows getting information about prepared
transactions which are not yet committed or aborted.
But once a transaction is committed, there is currently no way to find its
status.


If a crash happens during a 2PC commit, the transaction can be in the
prepared state at some nodes and committed/aborted at other nodes.
Using the pg_prepared_xacts view a DBA can find global transactions which
were not completed.
But there is no way (except pg_waldump) to determine whether such a
transaction needs to be committed or aborted at the rest of the nodes.


Attached please find a small patch with a pg_prepared_xact_status() function.
This function has the following obvious drawbacks:

1. It is not able to extract information about prepared transactions
preceding the last checkpoint. That seems to be enough to perform recovery
in case of failure, unless a checkpoint happened just before the failure or
there is a large gap between prepare and commit.
The only workaround I see at the moment is to pass this function an optional
parameter with a start position in the WAL.

Any better solution?

2. On systems with a huge workload the interval between checkpoints may be
very large. In this case we have to scan a large amount of WAL data to
locate our transaction, which may take a significant amount of time.
We could traverse the WAL in a smarter way, starting from the last segment,
assuming that the in-doubt transaction was prepared just before the crash.
But that significantly complicates the traversal logic.

3. The same GID can be reused multiple times. In this case the
pg_prepared_xact_status() function will return an incorrect result, because
it will return information about the first global transaction with that GID
after the checkpoint, not the most recent one.
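
For reference, intended usage would look roughly like this (a sketch; the
table and GID are made up, and the result strings are the ones used in the
patch):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'gtx_42';
COMMIT PREPARED 'gtx_42';

-- later, e.g. while resolving an in-doubt transaction on another node:
SELECT pg_prepared_xact_status('gtx_42');
-- expected result: 'prepared', 'committed', 'aborted' or 'unknown'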



There is actually an alternative approach to recovery of 2PC transactions.
We can include the coordinator's identifier in the GID (we can use
GetSystemIdentifier() to identify the coordinator's node) together with the
XID of the coordinator's transaction. In this case we can use txid_status()
to check the status of the transaction at the coordinator, which eliminates
the need to scan the WAL to determine the status of a prepared transaction.
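
A sketch of that convention (the GID layout shown here is entirely
hypothetical; GetSystemIdentifier() and txid_status() are the only existing
pieces):

-- participant: GID carries <coordinator system identifier>:<coordinator xid>:<counter>
PREPARE TRANSACTION '6423419867912345678:12345:1';

-- to resolve an in-doubt participant, parse the GID and ask the coordinator:
SELECT txid_status(12345::bigint);   -- run on the coordinator
-- 'committed' -> COMMIT PREPARED on the participant
-- 'aborted'   -> ROLLBACK PREPARED on the participant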


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ae83291..be65ae7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -84,6 +84,7 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlog_internal.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
@@ -2408,3 +2409,87 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 
 	return;
 }
+
+Datum
+pg_prepared_xact_status(PG_FUNCTION_ARGS)
+{
+char const* gid = PG_GETARG_CSTRING(0);
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+	XLogRecPtr lsn;
+	char const* xact_status = "unknown";
+	bool done = false;
+	TimeLineID timeline;
+	TransactionId xid = InvalidTransactionId;
+	XLogRecPtr wal_end = GetFlushRecPtr();
+
+	GetOldestRestartPoint(&lsn, &timeline);
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+(errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+	do
+	{
+		record = XLogReadRecord(xlogreader, lsn, &errormsg);
+		if (record == NULL)
+			break;
+		lsn = InvalidXLogRecPtr; /* continue after the record */
+		if (XLogRecGetRmid(xlogreader) == RM_XACT_ID)
+		{
+			uint32 info = XLogRecGetInfo(xlogreader);
+			switch (info & XLOG_XACT_OPMASK)
+			{
+			case XLOG_XACT_PREPARE:
+{
+	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *)XLogRecGetData(xlogreader);
+	char* xact_gid = (char*)hdr + MAXALIGN(sizeof(TwoPhaseFileHeader));
+	if (strcmp(xact_gid, gid) == 0)
+	{
+		xid = hdr->xid;
+		xact_status = "prepared";
+	}
+	break;
+}
+			case XLOG_XACT_COMMIT_PREPARED:
+{
+	xl_xact_commit *xlrec;
+	xl_xact_parsed_commit parsed;
+
+	xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+	ParseCommitRecord(info, xlrec, &parsed);
+	if (xid == parsed.twophase_xid)
+	{
+		Assert(TransactionIdIsValid(xid));
+		xact_status = "committed";
+		done = true;
+	}
+	break;
+}
+			case XLOG_XACT_ABORT_PREPARED:
+{
+	xl_xact_abort *xlrec;
+	xl_xact_parsed_abort parsed;
+
+	xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+	ParseAbortRecord(info, xlrec, &parsed);
+	if (xid == parsed.twophase_xid)
+	{
+		Assert(TransactionIdIsValid(xid));
+		xact_status = "aborted";
+		done = true;
+	}
+	break;
+}
+

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2017-09-28 Thread Alexander Korotkov
On Thu, Sep 28, 2017 at 6:44 PM, chenhj  wrote:

> On 2017-09-28 01:29:29,"Alexander Korotkov" 
> wrote:
>
> It appears that your patch conflicts with fc49e24f.  Please, rebase it.
>
>
> Yes, I have rebased it. Please check the new patch.
>

Good, now it applies cleanly.

else if (strncmp(path, XLOGDIR"/", strlen(XLOGDIR"/")) == 0 &&
> IsXLogFileName(path + strlen(XLOGDIR"/")) &&
> (strcmp(path + strlen(XLOGDIR"/") + 8, divergence_wal_filename + 8) < 0 ||
>  strcmp(path + strlen(XLOGDIR"/") + 8, last_source_wal_filename + 8) > 0))


According to our coding style, you should leave a space between XLOGDIR and
"/".
Also, you do a trick by comparing xlog segment numbers using strcmp().
It's nice, but I would prefer seeing XLogFromFileName() here.  It would
improve code readability and be less error prone during further
modifications.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] crash-recovery test vs windows

2017-09-28 Thread Andrew Dunstan


On 09/28/2017 12:29 PM, Andrew Dunstan wrote:
> The new crash-recovery check is failing pretty reliably on jacana. It's
> already excluded on MSVC machines, so I'm inclined to exclude it on Msys
> as well.
>
>

Sorry, I meant crash-restart

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





[HACKERS] crash-recovery test vs windows

2017-09-28 Thread Andrew Dunstan

The new crash-recovery check is failing pretty reliably on jacana. It's
already excluded on MSVC machines, so I'm inclined to exclude it on Msys
as well.


Thoughts?


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2017-09-28 Thread chenhj
On 2017-09-28 01:29:29,"Alexander Korotkov"  wrote:





It appears that your patch conflicts with fc49e24f.  Please, rebase it.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company 


Yes, I have rebased it. Please check the new patch.


--
Best Regards,
Chen Huajun







pg_rewind_wal_copy_reduce_v4.patch
Description: Binary data



Re: [HACKERS] Domains and arrays and composites, oh my

2017-09-28 Thread Andrew Dunstan


On 07/13/2017 03:19 PM, Tom Lane wrote:
> I wrote:
>> I started to look into allowing domains over composite types, which is
>> another never-implemented case that there's no very good reason not to
>> allow.  Well, other than the argument that the SQL standard only allows
>> domains over "predefined" (built-in) types ... but we blew past that
>> restriction ages ago.
> Attached is a draft patch that allows domains over composite types.
> I think it's probably complete on its own terms, but there are some
> questions around behavior of functions returning domain-over-composite
> that could use discussion, and some of the PLs need some follow-on work.
>
> The core principle here was to not modify execution-time behavior by
> adding domain checks to existing code paths.  Rather, I wanted the
> parser to insert CoerceToDomain nodes wherever checks would be needed.
> Row-returning node types such as RowExpr and FieldStore only return
> regular composite types, as before; to generate a value of a composite
> domain, we construct a value of the base type and then CoerceToDomain.
> (This is pretty analogous to what happens for domains over arrays.)
> Whole-row Vars can only have regular composite types as vartype,
> TupleDescs can only have regular composite types as tdtypeid, composite
> Datums only have regular composite type OIDs in datum_typeid.  (The last
> would be expected anyway, since the physical representation of a domain
> value is never different from that of its base type.)
>
> Doing it that way led to a nicely small patch, only about 160 net new
> lines of code.  However, not touching the executor meant not touching
> the behavior of functions that return domains, even if the domain is
> domain-over-composite.  In code terms this means that 
> get_call_result_type() and sibling functions will return TYPEFUNC_SCALAR
> not TYPEFUNC_COMPOSITE for such a result type.  The difficulty with
> changing that is that if these functions look through the domain, then
> the calling code (in, usually, a PL) will simply build and return a result
> of the underlying composite type, failing to apply any domain constraints.
> Trying to get out-of-core PLs on board with a change in those requirements
> seems like a risky proposition.
>
> Concretely, consider
>
> create type complex as (r float8, i float8);
> create domain dcomplex as complex;
>
> You can make a SQL-language function to return complex in either of two
> ways:
>
> create function fc() returns complex language sql
> as 'select 1.0::float8, 2.0::float8';
>
> create function fc() returns complex language sql
> as 'select row(1.0::float8, 2.0::float8)::complex';
>
> As the patch stands, though, only the second way works for domains over
> composite:
>
> regression=# create function fdc() returns dcomplex language sql
> as 'select 1.0::float8, 2.0::float8';
> ERROR:  return type mismatch in function declared to return dcomplex
> DETAIL:  Final statement must return exactly one column.
> CONTEXT:  SQL function "fdc"
> regression=# create function fdc() returns dcomplex language sql
> as 'select row(1.0::float8, 2.0)::dcomplex';
> CREATE FUNCTION
>
> Now, maybe that's fine.  SQL-language functions have never been very
> willing to insert implicit casts to get to the function result type,
> and certainly the only way that the first definition could be legal
> is if there were an implicit up-cast to the domain type.  It might be
> OK to just leave it like this, though some documentation about it
> would be a good idea.
>
> plpgsql functions seem generally okay as far as composite domain return
> types go, because they don't have anything corresponding to the row
> return convention of SQL functions.  And plpgsql's greater willingness
> to do implicit coercions reduces the notational burden, too.  But
> there's some work yet to be done to get plpgsql to realize that
> composite domain local variables have substructure.  For example,
> this works:
>
>   declare x complex;
>   ...
>   x.r := 1;
>
> but it fails if x is dcomplex.  But ISTM that that would be better
> handled as a followon feature patch.  I suspect that the other PLs may
> have similar issues where it'd be nice to allow domain-over-composite
> to act like a plain composite for specific purposes; but I've not looked.
>
> Another issue related to function result types is that the parser
> considers a function-in-FROM returning a composite domain to be
> producing a scalar result not a rowtype.  Thus you get this for a
> function returning complex:
>
> regression=# select * from fc();
>  r | i 
> ---+---
>  1 | 2
> (1 row)
>
> but this for a function returning dcomplex:
>
> regression=# select * from fdc();
>   fdc  
> ---
>  (1,2)
> (1 row)
>
> I think that that could be changed with only local changes in parse
> analysis, but do we want to change it?  Arguably, making fdc() act the
> same as fc() here would amount to implicitly downcasting the domain to
> its base type.  But doing 

Re: [HACKERS] PoC: full merge join on comparison clause

2017-09-28 Thread Alexander Kuzmenkov

Hi Ashutosh,

Thanks for the review.

*Jeff*, I'm copying you because this is relevant to our discussion about 
what to do with mergeopfamilies when adding new merge join types.



You have renamed RestrictInfo member mergeopfamilies as
equivopfamilies. I don't think that's a good name; it doesn't convey
that these are opfamilies containing merge operators. The changes in
check_mergejoinable() suggest that an operator may act as an equality
operator in a few operator families and as a comparison operator in others.
That looks odd. Actually an operator family contains operators other
than equality operators, so you may want to retain this member and add
a new member to specify whether the clause is an equality clause or
not.


For mergeopfamilies, I'm not sure what the best thing to do is. I'll try 
to explain my understanding of the situation; please correct me if I'm 
wrong.


Before the patch, mergeopfamilies was used for two things: creating 
equivalence classes and performing merge joins.


For equivalence classes: we look at the restriction clauses, and if they 
have mergeopfamilies set, it means that these clauses are based on an 
equality operator, and the left and right variables must be equal. To 
record this fact, we create an equivalence class. The variables might be 
equal for one equality operator and not equal for another, so we record 
the particular operator families to which our equality operator belongs.


For merge join: we look at the join clauses, and if they have 
mergeopfamilies set, it means that these clauses are based on an 
equality operator, and we can try performing this particular join as a 
merge join. These opfamilies are also used beforehand to create the 
equivalence classes for left and right variables. The equivalence 
classes are used to match the join clauses to pathkeys describing the 
ordering of join inputs.


So, if we want to start doing merge joins for operators other than 
equality, we still need to record their opfamilies, but only use them 
for the second case and not the first. I chose to put these opfamilies 
into different variables, naming the one used for equivalence classes 
'equivopfamilies' and the one 
used for merge joins 'mergeopfamilies'. The equality operators are used 
for both cases, so we put their opfamilies into both of these variables.


I agree this might look confusing. Indeed, we could keep a single 
variable for opfamilies, and add separate flags that show how they can 
be used, be that for equivalence classes, merge joins, range joins or 
some combination of them. This is similar to what Jeff did in his range 
merge join patch [1]. I will think more about this and try to produce an 
updated patch.
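
To make that concrete, here is a hedged illustration (hypothetical tables),
assuming the behavior described above: the join clause below uses a btree
comparison operator rather than equality, so it would get mergeopfamilies
(making a full merge join possible) but no equivopfamilies, since it implies
no equivalence between the two columns.

create table t1 (a int);
create table t2 (b int);
set enable_nestloop = off;
-- with the patch, this inequality join could be planned as a merge join,
-- with both inputs sorted on the joined columns
explain (costs off) select * from t1 join t2 on t1.a <= t2.b;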




In mergejoinscansel() you have just removed Assert(op_strategy ==
BTEqualStrategyNumber); Probably this function was written with only
equality operators in mind. But now that we are using it for other
operators as well, we will need more changes to this function. That may be
the reason why the INNER join in your earlier example doesn't choose the
right costing.


I changed mergejoinscansel() slightly to reflect the fact that the inner 
relation is scanned from the beginning if we have an inequality merge 
clause.




The comment change in final_cost_mergejoin() needs more work. n1, n2,
n3 are number of rows on inner side with values 1, 2, 3 resp. So n1 +
n2 + n3 + ... = size of inner relation is correct. In that context I
am not able to understand your change
+* If the merge clauses contain inequality, (n1 + n2 + ...) ~=
+* (size of inner relation)^2.


I extended the comment in final_cost_mergejoin(). Not sure if that 
approximation makes any sense, but this is the best I could think of.


Style problems are fixed.

Attached please find the new version of the patch that addresses all the 
review comments except mergeopfamilies.


The current commitfest is ending, but I'd like to continue working on 
this patch, so I am moving it to the next one.



[1] 
https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com#camp0ubfwaffw3o_ngkqprpmm56m4wetexjprb2gp_nrdaec...@mail.gmail.com 



--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 32dc4e6301..2958a9e53d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -722,19 +722,19 @@ get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	{
 		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
 
-		/* Consider only mergejoinable clauses */
-		if (restrictinfo->mergeopfamilies == NIL)
+		/* Consider only mergejoinable equality clauses */
+		if (restrictinfo->equivopfamilies == NIL)
 			continue;
 
 		/* Make sure we've got canonical ECs. */
-		update_mergeclause_eclasses(root, restrictinfo);
+		update_equivclause_eclasses(root, restrictinfo);
 
 		/*
-		 * 

Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Maksim Milyutin

On 28.09.2017 16:29, Jesper Pedersen wrote:


On 09/28/2017 09:19 AM, Maksim Milyutin wrote:
E.g. "No partition constraint" vs. "Partition constraint: 
satisfies_hash_partition(...)".


I also noticed ambiguity in printing "No partition constraint" in 
non-verbose mode and "Partition constraint:..." in verbose one for 
partition tables regardless of the type of partition.
Attached small patch removes any output about partition constraint in 
non-verbose mode.




Yeah, that could be one way.

It should likely be backported to REL_10_STABLE, so the question is if 
we are too late in the release cycle to change that output.


I want to prepare a more complete patch for the "Partition constraint" 
output. For example, I encountered primitive output with repetitive 
conjuncts for a subpartition whose parent is partitioned by the same key:


Partition constraint: ((i IS NOT NULL) AND (i >= 30) AND (i < 40) AND 
(i IS NOT NULL) AND (i = ANY (ARRAY[30, 31])))
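
A hedged sketch of the kind of setup that produces such output (hypothetical
names): a list subpartition whose parent is a range partition on the same
column.

create table t (i int) partition by range (i);
create table t_30_40 partition of t
    for values from (30) to (40) partition by list (i);
create table t_30_31 partition of t_30_40 for values in (30, 31);

Here \d+ t_30_31 would show the parent's range constraint and the
subpartition's list constraint simply concatenated, including the duplicated
IS NOT NULL conjunct.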


--
Regards,
Maksim Milyutin



Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Jesper Pedersen

On 09/28/2017 09:19 AM, Maksim Milyutin wrote:
E.g. "No partition constraint" vs. "Partition constraint: 
satisfies_hash_partition(...)".


I also noticed ambiguity in printing "No partition constraint" in 
non-verbose mode and "Partition constraint:..." in verbose one for 
partition tables regardless of the type of partition.
Attached small patch removes any output about partition constraint in 
non-verbose mode.




Yeah, that could be one way.

It should likely be backported to REL_10_STABLE, so the question is if 
we are too late in the release cycle to change that output.


Best regards,
 Jesper




Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Maksim Milyutin

Hi!


On 28.09.2017 16:02, Jesper Pedersen wrote:

Hi,

Using hash partitions I noticed that \d gives

D=# \d T_p63
   Table "public.T_p63"
    Column | Type  | Collation | Nullable | Default
---+---+---+--+-



Partition of: T FOR VALUES WITH (modulus 64, remainder 63)
No partition constraint
Indexes:
    "T_p63" btree (X, Y)

where as \d+ gives

D=# \d+ T_p63
   Table "public.T_p63"
    Column | Type  | Collation | Nullable | Default | 
Storage  | Stats target | Description
---+---+---+--+-+--+--+- 





Partition of: T FOR VALUES WITH (modulus 64, remainder 63)
Partition constraint: satisfies_hash_partition(64, 63, 
hashint4extended(X, '8816678312871386367'::bigint))

Indexes:
    "T_p63" btree (X, Y)

E.g. "No partition constraint" vs. "Partition constraint: 
satisfies_hash_partition(...)".


I also noticed ambiguity in printing "No partition constraint" in 
non-verbose mode and "Partition constraint:..." in verbose one for 
partition tables regardless of the type of partition.
Attached small patch removes any output about partition constraint in 
non-verbose mode.


--
Regards,
Maksim Milyutin

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d22ec68..b301219 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1900,13 +1900,16 @@ describeOneTableDetails(const char *schemaname,
 							  partdef);
 			printTableAddFooter(&cont, tmpbuf.data);
 
-			/* If there isn't any constraint, show that explicitly */
-			if (partconstraintdef == NULL || partconstraintdef[0] == '\0')
-				printfPQExpBuffer(&tmpbuf, _("No partition constraint"));
-			else
-				printfPQExpBuffer(&tmpbuf, _("Partition constraint: %s"),
-								  partconstraintdef);
-			printTableAddFooter(&cont, tmpbuf.data);
+			if (verbose)
+			{
+				/* If there isn't any constraint, show that explicitly */
+				if (partconstraintdef == NULL || partconstraintdef[0] == '\0')
+					printfPQExpBuffer(&tmpbuf, _("No partition constraint"));
+				else
+					printfPQExpBuffer(&tmpbuf, _("Partition constraint: %s"),
+									  partconstraintdef);
+				printTableAddFooter(&cont, tmpbuf.data);
+			}
 
 			PQclear(result);
 		}



[HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Jesper Pedersen

Hi,

Using hash partitions I noticed that \d gives

D=# \d T_p63
   Table "public.T_p63"
Column | Type  | Collation | Nullable | Default
---+---+---+--+-



Partition of: T FOR VALUES WITH (modulus 64, remainder 63)
No partition constraint
Indexes:
"T_p63" btree (X, Y)

where as \d+ gives

D=# \d+ T_p63
   Table "public.T_p63"
Column | Type  | Collation | Nullable | Default | 
Storage  | Stats target | Description

---+---+---+--+-+--+--+-



Partition of: T FOR VALUES WITH (modulus 64, remainder 63)
Partition constraint: satisfies_hash_partition(64, 63, 
hashint4extended(X, '8816678312871386367'::bigint))

Indexes:
"T_p63" btree (X, Y)

E.g. "No partition constraint" vs. "Partition constraint: 
satisfies_hash_partition(...)".


Current master (7769fc000) with [1] and [2].

[1] https://commitfest.postgresql.org/14/1059/
[2] https://commitfest.postgresql.org/14/1089/
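
For anyone wanting to reproduce this, a hedged sketch (hypothetical names) of
a setup that produces this kind of partition, assuming the hash partitioning
patch [1] is applied:

create table t (x int, y int) partition by hash (x);
create table t_p63 partition of t for values with (modulus 64, remainder 63);
create index on t_p63 (x, y);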

Best regards,
 Jesper




Re: [HACKERS] Optional message to user when terminating/cancelling backend

2017-09-28 Thread Yugo Nagata
On Sun, 3 Sep 2017 22:47:10 +0200
Daniel Gustafsson  wrote:

I have reviewed your latest patch. 

I can apply this to the master branch and build this successfully,
and the behavior is as expected. 

However, here are some comments and suggestions.

> 135 +   char   *buffer = palloc0(MAX_CANCEL_MSG);
> 136 +
> 137 +   GetCancelMessage(, MAX_CANCEL_MSG);
> 138 +   ereport(ERROR,
> 139 +   (errcode(ERRCODE_QUERY_CANCELED),
> 140 +errmsg("canceling statement due to user request: 
> \"%s\"",
> 141 +   buffer)));

The memory for buffer is allocated here, but I think we can do this
in GetCancelMessage. Since the size of allocated memory is fixed
to MAX_CANCEL_MSG, it isn't necessary to pass this to the function.
In addition, how about returning the message as the return value?
That is, we can define GetCancelMessage as follows:

  char * GetCancelMessage(void)

> 180 +   r = SetBackendCancelMessage(pid, msg);
> 181 +
> 182 +   if (r != -1 && r != strlen(msg))
> 183 +   ereport(NOTICE,
> 184 +   (errmsg("message is too long and has been 
> truncated")));
> 185 +   }

We can do this error handling in SetBackendCancelMessage. I think this is better
because the truncation of message is done in this function. In addition, 
we should use pg_mbcliplen for this truncation as done in truncate_identifier. 
Otherwise, a multibyte character boundary could be broken, and noise could
slip into log messages.

> 235 -   int r = pg_signal_backend(PG_GETARG_INT32(0), SIGTERM);
> 236 +   int r = pg_signal_backend(pid, SIGTERM, msg);

This line includes an unnecessary indentation change.

> 411 + * returns the length of message actually created. If the returned length

"the length of message" might be "the length of the message"

> 413 + * If the backend wasn't found and no message was set, -1 is returned. 
> If two

This comment is incorrect since this function returns 0 when no message was set.

> 440 +   strlcpy(slot->message, message, sizeof(slot->message));
> 441 +   slot->len = strlen(slot->message);
> 442 +   message_len = slot->len;
> 443 +   SpinLockRelease(&slot->mutex);
> 444 +
> 445 +   return message_len;

This can return slot->len directly and the variable message_len is
unnecessary. However, if we handle the "too long message" error
in this function as suggested above, this does not have to
return anything.

Regards,


-- 
Yugo Nagata 



> > On 02 Sep 2017, at 02:21, Thomas Munro  
> > wrote:
> > 
> > On Fri, Jun 23, 2017 at 1:48 AM, Daniel Gustafsson  wrote:
> >> Good point.  I’ve attached a new version which issues a NOTICE on 
> >> truncation
> >> and also addresses all other comments so far in this thread.  Thanks a lot 
> >> for
> >> the early patch reviews!"
> > 
> > FYI this no longer builds because commit 81c5e46c490e just stole your OIDs:
> > 
> > make[3]: Entering directory
> > `/home/travis/build/postgresql-cfbot/postgresql/src/backend/catalog'
> > cd ../../../src/include/catalog && '/usr/bin/perl' ./duplicate_oids
> > 772
> > 972
> > make[3]: *** [postgres.bki] Error 1
> 
> Thanks, I hadn’t spotted that yet.  Attached is an updated patch using new 
> OIDs
> to make it compile again.
> 
> cheers ./daniel
> 


-- 
Yugo Nagata 




[HACKERS] pgbench stuck with 100% cpu usage

2017-09-28 Thread Pavan Deolasee
Hello,

While running some tests, I encountered a situation where pgbench gets
stuck in an infinite loop, consuming 100% cpu. The setup was:

- Start postgres server from the master branch
- Initialise pgbench
- Run pgbench -c 10 -T 100
- Stop postgres with -m immediate

Now it seems that pgbench gets stuck and its state machine does not
advance. Attaching it to a debugger, I saw that one of the clients remains
stuck in this loop forever.

    if (command->type == SQL_COMMAND)
    {
        if (!sendCommand(st, command))
        {
            /*
             * Failed. Stay in CSTATE_START_COMMAND state, to
             * retry. ??? What the point or retrying? Should
             * rather abort?
             */
            return;
        }
        else
            st->state = CSTATE_WAIT_RESULT;
    }

sendCommand() returns false because the underlying connection is bad
and PQsendQuery returns 0. Reading the comment, it seems that the author
thought about this situation but decided to retry instead of abort. Not
sure what the rationale for that decision was; maybe to deal with
transient failures?

The commit that introduced this code is 12788ae49e1933f463bc. So I am
copying Heikki.

Thanks,
Pavan



-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2017-09-28 Thread Etsuro Fujita

Hi,

Commit 7086be6e3627c1ad797e32ebbdd232905b5f577f addressed mishandling of 
WCO in direct foreign table modification by disabling it when we have 
WCO, but I noticed another oddity in postgres_fdw:


postgres=# create table base_tbl (a int, b int);
postgres=# create function row_before_insupd_trigfunc() returns trigger 
as $$begin new.a := new.a + 10; return new; end$$ language plpgsql;
postgres=# create trigger row_before_insupd_trigger before insert or 
update on base_tbl for each row execute procedure 
row_before_insupd_trigfunc();
postgres=# create server loopback foreign data wrapper postgres_fdw 
options (dbname 'postgres');

postgres=# create user mapping for CURRENT_USER server loopback;
postgres=# create foreign table foreign_tbl (a int, b int) server 
loopback options (table_name 'base_tbl');
postgres=# create view rw_view as select * from foreign_tbl where a < b 
with check option;


So, this should fail, but

postgres=# insert into rw_view values (0, 5);
INSERT 0 1

The reason for that is: this is processed using postgres_fdw's 
non-direct foreign table modification (i.e. ForeignModify), but unlike 
the RETURNING or local after-trigger case, ForeignModify doesn't take 
into account that remote triggers might change the data in that case, so 
the WCO is evaluated using the data supplied, not the data actually 
inserted, which I think is wrong.  (I should have noticed that as well 
while working on the fix, though.)  So, I'd propose to fix that by 
modifying postgresPlanForeignModify so that it handles WCO the same way 
as for the RETURNING case.  Attached is a patch for that.  I'll add the 
patch to the next commitfest.
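
With that change, the WCO would be evaluated against the row as actually
stored on the remote side (a = 10 after the trigger fires), so the INSERT
above would be expected to be rejected, roughly as follows (a hedged
illustration; the exact wording may differ):

postgres=# insert into rw_view values (0, 5);
ERROR:  new row violates check option for view "rw_view"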


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 138,143  static void deparseSubqueryTargetList(deparse_expr_cxt 
*context);
--- 138,144 
  static void deparseReturningList(StringInfo buf, PlannerInfo *root,
 Index rtindex, Relation rel,
 bool trig_after_row,
+List *withCheckOptionList,
 List *returningList,
 List **retrieved_attrs);
  static void deparseColumnRef(StringInfo buf, int varno, int varattno,
***
*** 1548,1554  void
  deparseInsertSql(StringInfo buf, PlannerInfo *root,
 Index rtindex, Relation rel,
 List *targetAttrs, bool doNothing,
!List *returningList, List **retrieved_attrs)
  {
AttrNumber  pindex;
boolfirst;
--- 1549,1556 
  deparseInsertSql(StringInfo buf, PlannerInfo *root,
 Index rtindex, Relation rel,
 List *targetAttrs, bool doNothing,
!List *withCheckOptionList, List *returningList,
!List **retrieved_attrs)
  {
AttrNumber  pindex;
boolfirst;
***
*** 1597,1603  deparseInsertSql(StringInfo buf, PlannerInfo *root,
  
deparseReturningList(buf, root, rtindex, rel,
 rel->trigdesc && 
rel->trigdesc->trig_insert_after_row,
!returningList, 
retrieved_attrs);
  }
  
  /*
--- 1599,1605 
  
deparseReturningList(buf, root, rtindex, rel,
 rel->trigdesc && 
rel->trigdesc->trig_insert_after_row,
!withCheckOptionList, 
returningList, retrieved_attrs);
  }
  
  /*
***
*** 1610,1616  deparseInsertSql(StringInfo buf, PlannerInfo *root,
  void
  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 Index rtindex, Relation rel,
!List *targetAttrs, List *returningList,
 List **retrieved_attrs)
  {
AttrNumber  pindex;
--- 1612,1619 
  void
  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 Index rtindex, Relation rel,
!List *targetAttrs,
!List *withCheckOptionList, List *returningList,
 List **retrieved_attrs)
  {
AttrNumber  pindex;
***
*** 1639,1645  deparseUpdateSql(StringInfo buf, PlannerInfo *root,
  
deparseReturningList(buf, root, rtindex, rel,
 rel->trigdesc && 
rel->trigdesc->trig_update_after_row,
!returningList, 
retrieved_attrs);
  }
  
  /*
--- 1642,1648 
  
deparseReturningList(buf, root, rtindex, rel,
 rel->trigdesc && 

Re: [HACKERS] GSoC 2017: weekly progress reports (week 8)

2017-09-28 Thread Alexander Korotkov
Hi!

On Fri, Jul 28, 2017 at 7:58 AM, Shubham Barai 
wrote:

> I am attaching a patch for predicate locking in SP-Gist index
>

I took a look over this patch.

newLeafBuffer = SpGistGetBuffer(index,
> GBUF_LEAF | (isNulls ? GBUF_NULLS : 0),
> Min(totalLeafSizes,
> SPGIST_PAGE_CAPACITY),
> );
> PredicateLockPageSplit(index,
> BufferGetBlockNumber(current->buffer),
> BufferGetBlockNumber(newLeafBuffer));
>

You move the predicate lock during a split only when a new leaf page is
allocated. However, SP-GiST may move items to the free space of another busy page
during split (see other branches in doPickSplit()).  Your patch doesn't
seem to handle this case correctly.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GSoC 2017 : Patch for predicate locking in Gist index

2017-09-28 Thread Alexander Korotkov
Hi!

On Wed, Jun 21, 2017 at 10:52 AM, Shubham Barai 
wrote:

> Hi,
>
> On 21 June 2017 at 13:11, Heikki Linnakangas  wrote:
>
>> On 06/16/2017 01:24 PM, Shubham Barai wrote:
>>
>>> @@ -497,6 +499,13 @@ gistplacetopage(Relation rel, Size freespace,
>>> GISTSTATE *giststate,
>>> for (ptr = dist->next; ptr; ptr = ptr->next)
>>> UnlockReleaseBuffer(ptr->buffer);
>>> }
>>> +
>>> +   for (ptr = dist; ptr; ptr = ptr->next)
>>> +   PredicateLockPageSplit(rel,
>>> +
>>>  BufferGetBlockNumber(buffer),
>>> +
>>>  BufferGetBlockNumber(ptr->buffer));
>>> +
>>> +
>>>
>>
>> I think this new code needs to go before the UnlockReleaseBuffer() calls
>> above. Calling BufferGetBlockNumber() on an already-released buffer is not
>> cool.
>>
>> - Heikki
>>
>> I know that. This is the old version of the patch. I had sent updated
> patch later. Please have a look at updated patch.
>

I took a look on this patch.

In gistdoinsert() you do CheckForSerializableConflictIn() only if the page
wasn't exclusively locked before (xlocked is false).

if (!xlocked)
> {
> LockBuffer(stack->buffer, GIST_UNLOCK);
> LockBuffer(stack->buffer, GIST_EXCLUSIVE);
> CheckForSerializableConflictIn(r, NULL, stack->buffer);
> xlocked = true;


However, the page might have been exclusively locked before, and in that case
CheckForSerializableConflictIn() would be skipped.  That happens very
rarely (someone fixes an incomplete split before we do), but it is possible nevertheless.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)

2017-09-28 Thread Alexander Korotkov
On Thu, Sep 28, 2017 at 12:45 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Wed, Aug 9, 2017 at 6:30 PM, Shubham Barai 
> wrote:
>
>> Please find the updated patch for predicate locking in gin index here.
>>
>> There was a small issue in the previous patch. I didn't consider the case
>> where only root page exists in the tree, and there is a predicate lock on
>> it,
>> and it gets split.
>>
>> If we treat the original page as a left page and create a new root and
>> right
>> page, then we just need to copy a predicate lock from the left to the
>> right
>> page (this is the case in B-tree).
>>
>> But if we treat the original page as a root and create a new left and
>> right
>> page, then we need to copy a predicate lock on both new pages (in the
>> case of rum and gin).
>>
>> link to updated code and tests: https://github.com/shub
>> hambaraiss/postgres/commit/6172639a104785f051cb4aa0d511c58f2bae65a6
>>
>
> I've assigned to review this patch.  First of all I'd like to understand
> general idea of this patch.
>
> As I get, you're placing predicate locks to both entry tree leaf pages and
> posting tree leaf pages.  But, GIN implements so called "fast scan"
> technique which allows it to skip some leaf pages of posting tree when
> these pages are guaranteed to not contain matching item pointers.  Wherein
> the particular order of posting list scan and skip depends of their
> estimated size (so it's a kind of coincidence).
>
> But thinking about this more generally, I found that proposed locking
> scheme is redundant.  Currently when entry has posting tree, you're locking
> both:
> 1) entry tree leaf page containing pointer to posting tree,
> 2) leaf pages of corresponding posting tree.
> Therefore conflicting transactions accessing same entry would anyway
> conflict while accessing the same entry tree leaf page.  So, there is no
> necessity to lock posting tree leaf pages at all.  Alternatively, if entry
> has posting tree, you can skip locking entry tree leaf page and lock
> posting tree leaf pages instead.
>

I'd like to note that I had the following warnings during compilation with
clang.

gininsert.c:519:47: warning: incompatible pointer to integer conversion
> passing 'void *' to parameter of type 'Buffer' (aka 'int')
> [-Wint-conversion]
> CheckForSerializableConflictIn(index, NULL, NULL);
> ^~~~
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/8.0.0/include/stddef.h:105:16:
> note: expanded from macro 'NULL'
> #  define NULL ((void*)0)
>^~
> ../../../../src/include/storage/predicate.h:64:87: note: passing argument
> to parameter 'buffer' here
> extern void CheckForSerializableConflictIn(Relation relation, HeapTuple
> tuple, Buffer buffer);
>
> ^
> 1 warning generated.
> ginvacuum.c:163:2: warning: implicit declaration of function
> 'PredicateLockPageCombine' is invalid in C99
> [-Wimplicit-function-declaration]
> PredicateLockPageCombine(gvs->index, deleteBlkno, rightlink);
> ^
> 1 warning generated.


Also, I tried to remove predicate locks from posting tree leaf pages.  At least
the isolation tests passed correctly after this change.

However, after a Telegram discussion with Andrew Borodin, we decided that it
would be better to do predicate locking and conflict checking for posting
tree leaf pages, but skip that for entry tree leaf pages (in the case when the
entry has a posting tree).  That would give us more granular locking and fewer
false positives.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Parallel Append implementation

2017-09-28 Thread Amit Khandekar
On 20 September 2017 at 11:32, Amit Khandekar  wrote:
> There is still the open point being
> discussed : whether to have non-parallel-aware partial Append path, or
> always have parallel-aware Append paths.

Attached is the revised patch v16. In previous versions, we used to
add a non-parallel-aware Partial Append path having all partial
subpaths if the Parallel Append path already added does not contain
all-partial subpaths. Now in the patch, when we add such Append Path
containing all partial subpaths, we make it parallel-aware (unless
enable_parallelappend is false). So in this case, there will be a
parallel-aware Append path containing one or more non-partial
subpaths, and there will be another parallel-aware Append path
containing all-partial subpaths.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v16.patch
Description: Binary data



Re: [HACKERS] [POC] hash partitioning

2017-09-28 Thread amul sul
On Thu, Sep 28, 2017 at 11:24 AM, Amit Langote
 wrote:
> On 2017/09/27 22:41, Jesper Pedersen wrote:
>> On 09/27/2017 03:05 AM, amul sul wrote:
> Attached rebased patch, thanks.
>
>
 While reading through the patch I thought it would be better to keep
 MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to
 highlight that these are "keywords" for hash partition.

 Also updated some of the documentation.


>>> Thanks a lot for the patch, included in the attached version.
>>>
>>
>> Thank you.
>>
>> Based on [1] I have moved the patch to "Ready for Committer".
>
> Thanks a lot Amul for working on this.  Like Jesper said, the patch looks
> pretty good overall.  I was looking at the latest version with intent to
> study certain things about hash partitioning the way patch implements it,
> during which I noticed some things.
>

Thanks Amit for looking at the patch.

> +  The modulus must be a positive integer, and the remainder must a
>
> must be a
>

Fixed in the attached version.

> +  suppose you have a hash-partitioned table with 8 children, each of
> which
> +  has modulus 8, but find it necessary to increase the number of
> partitions
> +  to 16.
>

Fixed in the attached version.

> Might it be a good idea to say 8 "partitions" instead of "children" in the
> first sentence?
>
> +  each modulus-8 partition until none remain.  While this may still
> involve
> +  a large amount of data movement at each step, it is still better than
> +  having to create a whole new table and move all the data at once.
> + 
> +
>

Fixed in the attached version.

> I read the paragraph that ends with the above text and started wondering
> if the example to redistribute data in hash partitions by detaching and
> attaching with new modulus/remainder could be illustrated with an example?
> Maybe in the examples section of the ALTER TABLE page?
>

I think the hint in the documentation is more than enough. There are any
number of ways to redistribute the data; the documentation is not meant to
explain all of them.
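
For reference, one such sequence could look roughly like the following hedged
sketch (hypothetical names), splitting a single modulus-8 partition into two
modulus-16 partitions as the quoted documentation describes:

begin;
alter table t detach partition t_m8_r0;
create table t_m16_r0 partition of t for values with (modulus 16, remainder 0);
create table t_m16_r8 partition of t for values with (modulus 16, remainder 8);
insert into t select * from t_m8_r0;  -- rows are routed to the new partitions
drop table t_m8_r0;
commit;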

> +  Since hash operator class provide only equality, not ordering,
> collation
>
> Either "Since hash operator classes provide" or "Since hash operator class
> provides"
>

Fixed in the attached version.

> Other than the above points, patch looks good.
>
>
> By the way, I noticed a couple of things about hash partition constraints:
>
> 1. In get_qual_for_hash(), using
> get_fn_expr_rettype(>partsupfunc[i]), which returns InvalidOid for
> the lack of fn_expr being set to non-NULL value, causes funcrettype of the
> FuncExpr being generated for hashing partition key columns to be set to
> InvalidOid, which I think is wrong.  That is, the following if condition
> in get_fn_expr_rettype() is always satisfied:
>
> if (!flinfo || !flinfo->fn_expr)
> return InvalidOid;
>
> I think we could use get_func_rettype(>partsupfunc[i].fn_oid)
> instead.  Attached patch
> hash-v21-set-funcexpr-funcrettype-correctly.patch, which applies on top
> v21 of your patch.
>

Thanks for the patch, included in the attached version.

> 2. It seems that the reason constraint exclusion doesn't work with hash
> partitions as implemented by the patch is that predtest.c:
> operator_predicate_proof() returns false even without looking into the
> hash partition constraint, which is of the following form:
>
> satisfies_hash_partition(, , ,..)
>
> because the above constraint expression doesn't translate into a binary
> opclause (an OpExpr), which operator_predicate_proof() knows how to work
> with.  So, false is returned at the beginning of that function by the
> following code:
>
> if (!is_opclause(predicate))
> return false;
>
> For example,
>
> create table p (a int) partition by hash (a);
> create table p0 partition of p for values with (modulus 4, remainder 0);
> create table p1 partition of p for values with (modulus 4, remainder 1);
> \d+ p0
> <...>
> Partition constraint: satisfies_hash_partition(4, 0, hashint4extended(a,
> '8816678312871386367'::bigint))
>
> -- both p0 and p1 scanned
> explain select * from p where satisfies_hash_partition(4, 0,
> hashint4extended(a, '8816678312871386367'::bigint));
>  QUERY PLAN
>
> 
>  Append  (cost=0.00..96.50 rows=1700 width=4)
>->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
>  Filter: satisfies_hash_partition(4, 0, hashint4extended(a,
> '8816678312871386367'::bigint))
>->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
>  Filter: satisfies_hash_partition(4, 0, hashint4extended(a,
> '8816678312871386367'::bigint))
> (5 rows)
>
> -- both p0 and p1 scanned
> explain select * from p where satisfies_hash_partition(4, 1,
> hashint4extended(a, '8816678312871386367'::bigint));
>   

Re: [HACKERS] Partition-wise aggregation/grouping

2017-09-28 Thread Ashutosh Bapat
On Wed, Sep 27, 2017 at 3:42 PM, Jeevan Chalke
 wrote:
> Thanks Ashutosh for reviewing.
>
> Attached new patch-set with following changes:
>
> 1. Removed earlier 0007 and 0008 patches which were PoC for supporting
> partial aggregation over fdw. I removed them as it will be a different
> issue altogether and hence I will tackle them separately once this is
> done.
>
> This patch-set now includes support for parallel plans within partitions.
>
> Notes:
> HEAD: 59597e6
> Partition-wise Join Version: 34
>
> (First six patches 0001 - 0006, remains the same functionality-wise)
> 0007 - Refactors partial grouping paths creation into the separate function.
> 0008 - Enables parallelism within the partition-wise aggregation.
>
> This patch also includes a fix for the crash reported by Rajkumar.
> While forcibly applying scan/join target to all the Paths for the scan/join
> rel, earlier I was using apply_projection_to_path() which modifies the path
> in-place which causing this crash as the path finally chosen has been
> updated by partition-wise agg path creation. Now I have used
> create_projection_path() like we do in partial aggregation paths.
>
> Also, fixed issues reported by Ashutosh.

Thanks.

Here are comments on 0004 from last patch set. But most the comments
still apply.

Patch 0001 adds functions create_hash_agg_path() and create_sort_agg_path().
Patch 0004 adds a new argument to those functions for conditions in HAVING
clause. We should move those changes to 0001 and pass parse->havingQual to
these functions in 0001 itself. That will keep all changes to those functions
together and also make 0003 small.

The prologue of try_partition_wise_grouping() mentions a restriction of
partition keys being leading group by clauses. This restriction is not
specified in the prologue of have_grouping_by_partkey(), which actually checks
for this restriction. The requirement per prologue of that function is just to
have partition keys in group clause. I think have_grouping_by_partkey() is
correct, and we should change prologue of try_partition_wise_grouping() to be
in sync with have_grouping_by_partkey(). The prologue explains why
partition-wise aggregation/grouping would be efficient with this restriction,
but it doesn't explain why partial aggregation/grouping per partition would be
efficient. May be it should do that as well. Similar is the case with partial
aggregation/grouping discussion in README.

+/* Do not handle grouping sets for now. */
Is this a real restriction or just restriction for first cut of this feature?
Can you please add more explanation? May be update README as well?

+grouped_rel->part_scheme = input_rel->part_scheme;
Is this true even for partial aggregates? I think not. Since group by clause
does not contain partition keys, the rows from multiple partitions participate
in one group and thus the partition keys of input relation do not apply to the
grouped relation. In this case, it seems that the grouped rel will have
part_rels but will not be partitioned.

+/*
+ * If there is no path for the child relation, then we cannot add
+ * aggregate path too.
+ */
+if (input_child_rel->pathlist == NIL)
+return;
When can this happen? I think, similar to partition-wise join it happens when
there is a dummy parent relation. See [1]. If that's the case, you may want to
do things similar to what partition-wise join is doing. If there's some other
reason for doing this, returning from here half-way is actually a waste of memory
and planning time. Instead, we may want to loop over the part_rels to find if
any of those have empty pathlist and return from there before doing any actual
work.

+extra.pathTarget = child_target;
+extra.inputRows = input_child_rel->cheapest_startup_path->rows;
+extra.havingQual = (Node *) adjust_appendrel_attrs(root,
+   (Node *)
query->havingQual,
+   nappinfos,
+   appinfos);
These lines are updating some fields of "extra" structure in every loop. The
structure is passed to create_child_grouping_paths() in the loop and to
add_paths_to_append_rel() outside the loop. Thus add_paths_to_append_rel() only
gets some member values for the last child. Is that right? Should we split
extra into two structures one to be used within the loop and one outside? Or
may be send the members being updated within the loop separately?

+/*
+ * Forcibly apply scan/join target to all the Paths for the scan/join
+ * rel.
+ *
[ lines clipped ]
+if (subpath == input_child_rel->cheapest_total_path)
+input_child_rel->cheapest_total_path = path;
+}
+}
This code seems to be copied from grouping_planner() almost verbatim. Is there
a way we can 

Re: [HACKERS] path toward faster partition pruning

2017-09-28 Thread David Rowley
On 27 September 2017 at 14:22, Amit Langote
 wrote:
> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>   an author).  I slightly tweaked his patch -- renamed the function
>   get_matching_clause to match_clauses_to_partkey, similar to
>   match_clauses_to_index.

Hi Amit,

I've made a pass over the 0001 patch while trying to get myself up to
speed with the various developments that are going on in partitioning
right now.

These are just my thoughts from reading over the patch. I understand
that there's quite a bit up in the air right now about how all this is
going to work, but I have some thoughts about that too, which I
wouldn't mind some feedback on as my line of thought may be off.

Anyway, I will start with some small typos that I noticed while reading:

partition.c:

+ * telling what kind of NullTest has been applies to the corresponding

"applies" should be "applied"

plancat.c:

* might've occurred to satisfy the query.  Rest of the fields are set in

"Rest of the" should be "The remaining"

And onto more serious stuff:

allpaths.c:

get_rel_partitions()

I wonder if this function does not belong in partition.c. I'd have
expected a function to exist per partition type; RANGE and LIST, I'd
imagine, should each have their own function in partition.c to eliminate
partitions which cannot possibly match and return the remainder. I see people
speaking of HASH partitioning, but we might one day also want
something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
routing records to be processed into foreign tables where you never
need to query them again). It would be good if we could easily expand
this list and not have to touch files all over the optimizer to do
that. Of course, there would be other work to do in the executor to
implement any new partitioning method too.

I know there's mention somewhere of get_rel_partitions() not
being so smart about WHERE partkey > 20 AND partkey > 10; the code
does not understand which qual is more restrictive. I think you could
probably follow the same rules here that are done in
eval_const_expressions(). Over there I see that evaluate_function()
will call anything that's not marked as volatile. I'd imagine, for
each partition key, you'd want to store a Datum with the minimum and
maximum possible value based on the quals processed. If either the
minimum or maximum is still set to NULL, then it's unbounded at that
end. If you encounter partkey = Const, then minimum and maximum can be
set to the value of that Const. From there you can likely ignore any
other quals for that partition key, as if the query did contain
another qual with partkey = SomeOtherConst, then that would have
become a gating qual during the constant folding process. This way if
the user had written WHERE partkey >= 1 AND partkey <= 1 the
evaluation would end up the same as if they'd written WHERE partkey =
1 as the minimum and maximum would be the same value in both cases,
and when those two values are the same then it would mean just one
theoretical binary search on a partition range to find the correct
partition instead of two.

I see in get_partitions_for_keys you've crafted the function to return
something to identify which partitions need to be scanned. I think it
would be nice to see a special element in the partition array for the
NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
would be defined constants somewhere. The signature of that function
is not so pretty and that would likely tidy it up a bit. The matching
partition indexes could be returned as a Bitmapset, yet, I don't
really see any code which handles adding the NULL and DEFAULT
partition in get_rel_partitions() either, maybe I've just not looked
hard enough yet...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] path toward faster partition pruning

2017-09-28 Thread Dilip Kumar
On Thu, Sep 28, 2017 at 1:44 PM, Amit Langote
 wrote:
> On 2017/09/28 13:58, Dilip Kumar wrote:

>> I think the above logic is common between this patch and the runtime
>> pruning.  I think we can make
>> a reusable function.  Here we are preparing minkey and maxkey of Datum
>> because we can directly fetch rightop->constvalue whereas for runtime
>> pruning we are making minkeys and maxkeys of Expr because during
>> planning time we don't have the values for the Param.  I think we can
>> always make these minkey, maxkey array of Expr and later those can be
>> processed in whatever way we want it.  So this path will fetch the
>> constval out of Expr and runtime pruning will Eval that expression at
>> runtime.
>
> I think that makes sense.  In fact we could even move the minkey/maxkey
> collection code to match_clauses_to_partkey() itself.  No need for a
> different function and worrying about defining a separate interface for
> the same.  We match clauses exactly because we want to extract the
> constant or param values out of them.  No need to do the two activities
> independently and in different places.
>

+1


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-09-28 Thread Amit Langote
On 2017/09/28 13:58, Dilip Kumar wrote:
> On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
>  wrote:
> 
> I was looking into the latest patch set, seems like we can reuse some
> more code between this path and runtime pruning[1]
> 
> + foreach(lc1, matchedclauses[i])
> + {
> + Expr   *clause = lfirst(lc1);
> + Const  *rightop = (Const *) get_rightop(clause);
> + Oid opno = ((OpExpr *) clause)->opno,
> + opfamily = rel->part_scheme->partopfamily[i];
> + StrategyNumber strategy;
> +
> + strategy = get_op_opfamily_strategy(opno, opfamily);
> + switch (strategy)
> + {
> + case BTLessStrategyNumber:
> + case BTLessEqualStrategyNumber:
> + if (need_next_max)
> + {
> + maxkeys[i] = rightop->constvalue;
> + if (!maxkey_set[i])
> + n_maxkeys++;
> + maxkey_set[i] = true;
> + max_incl = (strategy == BTLessEqualStrategyNumber);
> + }
> + if (strategy == BTLessStrategyNumber)
> + need_next_max = false;
> 
> I think the above logic is common between this patch and the runtime
> pruning.  I think we can make
> a reusable function.  Here we are preparing minkey and maxkey of Datum
> because we can directly fetch rightop->constvalue whereas for runtime
> pruning we are making minkeys and maxkeys of Expr because during
> planning time we don't have the values for the Param.  I think we can
> always make these minkey, maxkey array of Expr and later those can be
> processed in whatever way we want it.  So this path will fetch the
> constval out of Expr and runtime pruning will Eval that expression at
> runtime.

I think that makes sense.  In fact we could even move the minkey/maxkey
collection code to match_clauses_to_partkey() itself.  No need for a
different function and worrying about defining a separate interface for
the same.  We match clauses exactly because we want to extract the
constant or param values out of them.  No need to do the two activities
independently and in different places.

> Does this make sense or it will cause one level of extra processing
> for this path i.e converting the Expr array to CONST array?

Hm, it's not such a big cost to pay I'd think.

I will update the planner patch accordingly.

Thanks,
Amit





Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Amit Langote
On 2017/09/28 16:13, Ashutosh Bapat wrote:
> On Thu, Sep 28, 2017 at 11:47 AM, Amit Langote wrote:
>> On 2017/09/21 12:42, Robert Haas wrote:
>>> Associate partitioning information with each RelOptInfo.
>>>
>>> This is not used for anything yet, but it is necessary infrastructure
>>> for partition-wise join and for partition pruning without constraint
>>> exclusion.
>>>
>>> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
>>> mostly cosmetic, by me.  Additional review and testing of this patch
>>> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
>>> Raghuwanshi, Thomas Munro, and Dilip Kumar.
>>
>> I noticed that this commit does not add partcollation field to
>> PartitionSchemeData, while it adds parttypcoll.  I think it'd be necessary
>> to have partcollation too, because partitioning would have used the same,
>> not parttypcoll.  For, example see the following code in
>> partition_rbound_datum_cmp:
>>
>> cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i],
>>  key->partcollation[i],
>>  rb_datums[i],
>>  tuple_datums[i]));
>>
>> So, it would be wrong to use parttypcoll, if we are to use the collation
>> to match a clause with the partition bounds when doing partition-pruning.
>> Concretely, a clause's inputcollid should match partcollation for the
>> corresponding column, not the column's parttypcoll.
>>
>> Attached is a patch that adds the same.  I first thought of including it
>> in the partition-pruning patch set [1], but thought we could independently
>> fix this.
>>
> 
> I think PartitionSchemeData structure will grow as we need more
> information about partition key for various things. E.g. partsupfunc
> is not part of this structure right now, but it would be required to
> compare partition bound datums. Similarly partcollation. Please add
> this to the partition pruning patchset. Maybe parttypcoll won't be used
> at all and we may want to remove it altogether.

Okay, I will post it with the partition-pruning patch set.

Thanks,
Amit





Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Ashutosh Bapat
On Thu, Sep 28, 2017 at 11:47 AM, Amit Langote
 wrote:
> On 2017/09/21 12:42, Robert Haas wrote:
>> Associate partitioning information with each RelOptInfo.
>>
>> This is not used for anything yet, but it is necessary infrastructure
>> for partition-wise join and for partition pruning without constraint
>> exclusion.
>>
>> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
>> mostly cosmetic, by me.  Additional review and testing of this patch
>> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
>> Raghuwanshi, Thomas Munro, and Dilip Kumar.
>
> I noticed that this commit does not add partcollation field to
> PartitionSchemeData, while it adds parttypcoll.  I think it'd be necessary
> to have partcollation too, because partitioning would have used the same,
> not parttypcoll.  For, example see the following code in
> partition_rbound_datum_cmp:
>
> cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i],
>  key->partcollation[i],
>  rb_datums[i],
>  tuple_datums[i]));
>
> So, it would be wrong to use parttypcoll, if we are to use the collation
> to match a clause with the partition bounds when doing partition-pruning.
> Concretely, a clause's inputcollid should match partcollation for the
> corresponding column, not the column's parttypcoll.
>
> Attached is a patch that adds the same.  I first thought of including it
> in the partition-pruning patch set [1], but thought we could independently
> fix this.
>

I think PartitionSchemeData structure will grow as we need more
information about partition key for various things. E.g. partsupfunc
is not part of this structure right now, but it would be required to
compare partition bound datums. Similarly partcollation. Please add
this to the partition pruning patchset. Maybe parttypcoll won't be used
at all and we may want to remove it altogether.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-09-28 Thread Amit Khandekar
Below are some performance figures. Overall, there does not appear to
be a noticeable difference between partition key updates
with and without row movement (which is surprising), or between
non-partition-key updates with and without the patch.

All the values are in milliseconds.

Configuration :

shared_buffers = 8GB
maintenance_work_mem = 4GB
synchronous_commit = off
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '
max_wal_size = 5GB
max_connections = 200

The attached files were used to create a partition tree made up of 16
partitioned tables, each containing 125 partitions. First half of the
2000 partitions are filled with 10 million rows. Update row movement
moves the data to the other half of the partitions.

gen.sql : Creates the partitions.
insert.data : This data file is uploaded here [1]. Used "COPY ptab
from '$PWD/insert.data' "
index.sql : Optionally, Create index on column d.

The schema looks like this :

CREATE TABLE ptab (a date, b int, c int, d int) PARTITION BY RANGE (a, b);

CREATE TABLE ptab_1_1 PARTITION OF ptab
for values from ('1900-01-01', 1) to ('1900-01-01', 7501)
PARTITION BY range (c);
CREATE TABLE ptab_1_1_1 PARTITION OF ptab_1_1
for values from (1) to (81);
CREATE TABLE ptab_1_1_2 PARTITION OF ptab_1_1
for values from (81) to (161);
..
..
CREATE TABLE ptab_1_2 PARTITION OF ptab
for values from ('1900-01-01', 7501) to ('1900-01-01', 15001)
PARTITION BY range (c);
..
..

On 20 September 2017 at 00:06, Robert Haas  wrote:
> I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

UPDATE query used : UPDATE ptab set d = d + 1 where d = 1; -- where d
is not a partition key of any of the partitions.
This query updates 8 rows out of 10 million rows.
With HEAD  : 2953.691 , 2862.298 , 2855.286 , 2835.879 (avg : 2876)
With Patch : 2933.719 , 2832.463 , 2749.979 , 2820.416 (avg : 2834)
(All the values are in milliseconds.)

> suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition.

Note that the following figures do not represent a consistent set of
numbers; they keep varying. For example, even though the
partition-key update without row movement appears to have taken a bit
more time with the patch than with HEAD, a new run of the tests might even
end up the other way round.

NPK  : 42089 (patch)
NPKI : 81593 (patch)
PK   : 45250 (patch) , 44944 (HEAD)
PKR  : 46701 (patch)

The above figures are in milliseconds. The explanations of the above
short-forms :

NPK :
Update of column that is not a partition-key.
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

NPKI :
Update of column that is not a partition-key. And this column is
indexed (Used attached file index.sql).
UPDATE query used : UPDATE ptab set d = d + 1 ; This updates *all* rows.

PK :
Update of partition key, but row movement does not occur. There are no
indexed columns.
UPDATE query used : UPDATE ptab set a = a + '1 hour'::interval ;

PKR :
Update of partition key, with all rows moved to other partitions.
There are no indexed columns.
UPDATE query used : UPDATE ptab set a = a + '2 years'::interval ;


[1] 
https://drive.google.com/open?id=0B_YJCqIAxKjeN3hMXzdDejlNYmlpWVJpaU9mWUhFRVhXTG5Z

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


gen.tar.gz
Description: GNU Zip compressed data


index.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Use of RangeVar for partitioned tables in autovacuum

2017-09-28 Thread Amit Langote
Thanks Michael for working on this.

On 2017/09/27 11:28, Michael Paquier wrote:
> Hi all,
> 
> I have been looking more closely at the problem in $subject, that I
> have mentioned a couple of times, like here:
> https://www.postgresql.org/message-id/cab7npqqa37oune_rjzpmwc4exqalx9f27-ma_-rsfl_3mj+...@mail.gmail.com
> 
> As of HEAD, the RangeVar defined in calls of vacuum() is used for
> error reporting, only in two cases: for autoanalyze and autovacuum
> when reporting to users that a lock cannot be taken on a relation. The
> thing is that autovacuum has the good idea to call vacuum() on only
> one relation at the same time, with always a relation OID and a
> RangeVar defined, so the code currently on HEAD is basically fine with
> its error reporting, because get_rel_oids() will always work on the
> relation OID with its RangeVar, and because code paths with manual
> VACUUMs don't use the RangeVar for any reports.

Yeah, autovacuum_do_vac_analyze will never pass partitioned table OIDs to
vacuum, for example, because they're not RELKIND_RELATION relations.

> While v10 is safe, I think that this is wrong in the long-term and
> that may be a cause of bugs if people begin to generate reports for
> manual VACUUMs, which is plausible in my opinion. The patch attached,
> is what one solution would look like if we would like to make things
> more robust as long as manual VACUUM commands can only specify one
> relation at the same time. I have found that tracking the parent OID
> by tweaking a bit get_rel_oids() was the most elegant solution.

The patch basically looks good to me, FWIW.

> Please
> note that this range of problems is something that I think is better
> solved with the infrastructure that support for VACUUM with multiple
> relations includes (last version here
> https://www.postgresql.org/message-id/766556dd-aa3c-42f7-acf4-5dc97f41f...@amazon.com).
> So I don't think that the patch attached should be integrated, still I
> am attaching it to satisfy the curiosity of anybody looking at this
> message.

Yeah, after looking at the code a bit, it seems that the problem won't
really occur for the case we're trying to apply this patch for.

> In conclusion, I think that the open item of $subject should be
> removed from the list, and we should try to get the multi-VACUUM patch
> in to cover any future problems. I'll do so if there are no
> objections.

+1 to this, taking the previous +1 back [1]. :)

> One comment on top of vacuum() is wrong by the way: in the case of a
> manual VACUUM or ANALYZE, a NULL RangeVar is used if no relation is
> specified in the command.

Yep, but it doesn't ever get accessed per our observations.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/8d810dd9-5f64-a5f3-c016-a81f05528fa8%40lab.ntt.co.jp





Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Amit Langote
Sorry, I meant to say PartitionSchem"e"Data in subject.

Thanks,
Amit





[HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Amit Langote
On 2017/09/21 12:42, Robert Haas wrote:
> Associate partitioning information with each RelOptInfo.
> 
> This is not used for anything yet, but it is necessary infrastructure
> for partition-wise join and for partition pruning without constraint
> exclusion.
> 
> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
> mostly cosmetic, by me.  Additional review and testing of this patch
> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
> Raghuwanshi, Thomas Munro, and Dilip Kumar.

I noticed that this commit does not add a partcollation field to
PartitionSchemeData, while it adds parttypcoll.  I think it'd be necessary
to have partcollation too, because that is the collation partitioning
actually used, not parttypcoll.  For example, see the following code in
partition_rbound_datum_cmp:

cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i],
                                         key->partcollation[i],
                                         rb_datums[i],
                                         tuple_datums[i]));

So it would be wrong to use parttypcoll if we are to use the collation
to match a clause against the partition bounds when doing partition
pruning.  Concretely, a clause's inputcollid should match partcollation
for the corresponding column, not the column's parttypcoll.
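
To illustrate (a hedged example; the names and the "en_US" collation are
assumptions, not taken from the patch), here is a case where the two
differ: the column's type collation is "en_US" but the partition key
overrides it with "C", so comparisons against the partition bounds must
use partcollation, not parttypcoll.

-- Hypothetical example; assumes the "en_US" collation is available.
CREATE TABLE t (a text COLLATE "en_US") PARTITION BY RANGE (a COLLATE "C");
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM ('a') TO ('m');

-- For pruning, the constant here must be compared under the partitioning
-- collation "C" (partcollation), not the column's "en_US" (parttypcoll).
SELECT * FROM t WHERE a = 'b';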

Attached is a patch that adds the same.  I first thought of including it
in the partition-pruning patch set [1], but thought we could independently
fix this.

Thoughts?

Thanks,
Amit

[1] https://commitfest.postgresql.org/14/1272/
From 4e25d503a00c5fb42919e73aae037eacb0164af6 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 27 Sep 2017 15:49:50 +0900
Subject: [PATCH 1/5] Add partcollation field to PartitionSchemeData

It copies PartitionKeyData.partcollation.  We need that in addition
to parttypcoll.
---
 src/backend/optimizer/util/plancat.c | 1 +
 src/include/nodes/relation.h | 7 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index cac46bedf9..1273f28069 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1897,6 +1897,7 @@ find_partition_scheme(PlannerInfo *root, Relation relation)
 	part_scheme->partopfamily = partkey->partopfamily;
 	part_scheme->partopcintype = partkey->partopcintype;
 	part_scheme->parttypcoll = partkey->parttypcoll;
+	part_scheme->partcollation = partkey->partcollation;
 	part_scheme->parttyplen = partkey->parttyplen;
 	part_scheme->parttypbyval = partkey->parttypbyval;
 
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 48e6012f7f..2adefd0873 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -342,6 +342,10 @@ typedef struct PlannerInfo
  * partition bounds. Since partition key data types and the opclass declared
  * input data types are expected to be binary compatible (per ResolveOpClass),
  * both of those should have same byval and length properties.
+ *
+ * Since partitioning might be using a collation for a given partition key
+ * column that is not same as the collation implied by column's type, store
+ * the same separately.
  */
 typedef struct PartitionSchemeData
 {
@@ -349,7 +353,8 @@ typedef struct PartitionSchemeData
 	int16		partnatts;		/* number of partition attributes */
 	Oid		   *partopfamily;	/* OIDs of operator families */
 	Oid		   *partopcintype;	/* OIDs of opclass declared input data types */
-	Oid		   *parttypcoll;	/* OIDs of collations of partition keys. */
+	Oid		   *parttypcoll;	/* OIDs of partition key type collation. */
+	Oid		   *partcollation;	/* OIDs of partitioning collation */
 
 	/* Cached information about partition key data types. */
 	int16	   *parttyplen;
-- 
2.11.0




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-09-28 Thread tushar

On 09/27/2017 05:29 PM, tushar wrote:
After discussion with Jeevan Ladhe, we created an SQL query which
contains lots of built-in functions and tested it against pgbench
with master vs. patch, and found an improvement


I tested it again and found around a 2-3% improvement

./pgbench -c 8 -j 8 -f /tmp/mytest.sql -T <TIME>
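
(For context, a hypothetical stand-in for /tmp/mytest.sql, which was not
posted: a single statement calling many built-in functions so that
function lookups dominate the work.)

-- Invented example only; the real script may differ.
SELECT abs(-1), upper('x'), lower('Y'), length('abcdef'),
       sqrt(2.0), md5('pgbench'), substr('postgres', 1, 4),
       btrim('  hackers  '), to_char(now(), 'YYYY-MM-DD'), random();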


After taking the median of 3 runs:

Case 1 – TIME=300

PG HEAD =>tps = 7831.999245 (excluding connections establishing)
PG HEAD+patch =>tps= 8008.895177 (2.26+% vs. head)

Case 2- TIME=500

PG HEAD =>tps = 7817.781756 (excluding connections establishing)
PG HEAD+patch =>tps= 8050.410040(2.98+% vs. head)

Case 3- TIME=1000

PG HEAD =>tps = 7817.173640 (excluding connections establishing)
PG HEAD+patch => tps= 8011.784839(2.48+% vs. head)

Case 4-TIME=1500

PG HEAD =>tps = 7764.607133 (excluding connections establishing)
PG HEAD+patch =>tps= 8013.421628(3.20+% vs. head)
--

regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-09-28 Thread Pavel Stehule
Hi

Now xpath and xpath_exists support the default namespace too.

Updated the doc and fixed all variants of the expected-result test file.
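
As a hedged sketch of the proposed usage (not committed behavior; the
example itself is invented), an empty-string alias in the namespace array
registers the default namespace, so unprefixed names in the XPath
expression resolve against it:

-- Sketch only; follows the proposal's idea of an empty-string alias.
SELECT xpath('/a/text()',
             '<a xmlns="http://example.com">test</a>'::xml,
             ARRAY[ARRAY['', 'http://example.com']]);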

Regards

Pavel
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2f036015cc..610f709933 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10477,7 +10477,8 @@ SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuf
      second the namespace URI. It is not required that aliases provided in
      this array be the same as those being used in the XML document itself (in
      other words, both in the XML document and in the <function>xpath</function>
-     function context, aliases are local).
+     function context, aliases are local). Default namespace has
+     empty name (empty string) and should be only one.
     </para>
 
     <para>
@@ -10496,8 +10497,8 @@ SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
 
      To deal with default (anonymous) namespaces, do something like this: