Re: FW: [Todo item] Add entry creation timestamp column to pg_stat_replication

2018-11-14 Thread Masahiko Sawada
On Thu, Aug 2, 2018 at 6:34 PM MyungKyu LIM  wrote:
>
> I changed field name from 'reply_time' to 'last_msg_send_time'.
> Because 'last_msg_send_time' is used in 
> the pg_stat_wal_receiver and pg_stat_subscription views.
> I think that field has the same meaning.

I got confused by the field name. If we have a 'last_msg_send_time'
field in pg_stat_replication, which carries information about WAL senders,
users would take it to be the time when the WAL sender last sent a
message. However, the field's value actually shows the time when the WAL
receiver last sent a reply message. So perhaps
'last_reply_send_time' would be clearer.
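
For example, with that name a monitoring query against the view would read
naturally (just an illustration; the column name is still under discussion):

    =# SELECT application_name, state, last_reply_send_time
         FROM pg_stat_replication;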

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] generated columns

2018-11-14 Thread Michael Paquier
On Tue, Oct 30, 2018 at 09:35:18AM +0100, Peter Eisentraut wrote:
> 1. (current implementation) New column attgenerated contains 's' for
> STORED, 'v' for VIRTUAL, '\0' for nothing.  atthasdef means "there is
> something in pg_attrdef for this column".
> 
> 2. Alternative: A generated column has attgenerated = s/v but atthasdef
> = false, so that atthasdef means specifically "column has a default".
> Then a column would have a pg_attrdef entry for either attgenerated !=
> '\0' or atthasdef = true.
> 
> 3. Radical alternative: Collapse everything into one new column.  We
> could combine atthasdef and attgenerated and even attidentity into a new
> column.  (Only one of the three can be the case.)  This would give
> client code a clean break, which may or may not be good.  The
> implementation would be uglier than #1 but probably cleaner than #2.  We
> could also get 4 bytes back per pg_attribute row.
> 
> I'm happy with the current choice #1, but it's worth thinking about.

#3 looks very appealing in my opinion, as those columns have no overlap;
the combined column would take five possible values:
- generated always
- generated by default
- default value
- stored expression
- virtual expression

Obviously this requires a first patch to combine the catalog
representation of the existing columns atthasdef and attidentity.
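
To illustrate, the combined column could be a single char with one value per
case, along these lines (the macro names and letters here are only a sketch,
not a proposal for the final catalog spelling):

    /* hypothetical single pg_attribute column replacing atthasdef,
     * attidentity and attgenerated */
    #define ATTRIBUTE_IDENTITY_ALWAYS     'a'   /* GENERATED ALWAYS AS IDENTITY */
    #define ATTRIBUTE_IDENTITY_BY_DEFAULT 'd'   /* GENERATED BY DEFAULT AS IDENTITY */
    #define ATTRIBUTE_DEFAULT_VALUE       'D'   /* plain DEFAULT expression */
    #define ATTRIBUTE_GENERATED_STORED    's'   /* generated column, STORED */
    #define ATTRIBUTE_GENERATED_VIRTUAL   'v'   /* generated column, VIRTUAL */
    /* '\0' would then mean "none of the above" */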

+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("generated colums are not supported on typed tables")));
s/colums/columns/.
--
Michael




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Amit Langote
On 2018/11/15 15:22, Michael Paquier wrote:
>> If there are no partitions, nparts is 0, and other fields are NULL, though
>> rd_partdesc itself is never NULL.
> 
> I find it a bit confusing that both concepts have the same meaning, aka
> that a relation has no partitions, and that it is actually relkind which
> decides whether rd_partdesc should be NULL or set up.  This stuff also does
> empty allocations.
> 
>> If we want to redesign that and allow it to be NULL until some code in the
>> backend wants to use it, then maybe we can consider doing what you say.
>> But, many non-trivial operations on partitioned tables require the
>> PartitionDesc, so there is perhaps not much point to designing it such
>> that rd_partdesc is set only when needed, because it will be referenced
>> sooner than later.  Maybe, we can consider doing that sort of thing for
>> boundinfo, because it's expensive to build, and not all operations want
>> the canonicalized bounds.
> 
> I am fine if that's the consensus of this thread.  But as far as I can
> see it is possible to remove a bit of the memory handling mess by doing
> so.  My 2c.

Perhaps we can discuss this on another thread.  I know this thread contains
important points about partition descriptor creation and modification, but
memory context considerations seem like a separate topic.  The following
message could be a starting point, because there we were talking about
perhaps a similar design to the one you're suggesting:

https://www.postgresql.org/message-id/143ed9a4-6038-76d4-9a55-502035815e68%40lab.ntt.co.jp

Also, while I understood Alvaro's and your comments on the other thread
that the memory handling is messy as is, sorry, it's not clear to me why
you say this patch makes it messier.  It reduces memory context switches,
so that RelationBuildPartitionDesc roughly looks like this after the patch:

Start with CurrentMemoryContext...

1. read catalogs and make bounddescs and oids arrays

2. partition_bounds_create(...)

3. create and switch to rd_pdcxt

4. create PartitionDesc, copy partdesc->oids and partdesc->boundinfo

5. switch back to the old context
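
In code terms, a minimal sketch of that flow (simplified, not the patch text
itself) would be:

    MemoryContext oldcxt = CurrentMemoryContext;

    /* steps 1-2: transient data stays in the caller's context */
    boundinfo = partition_bounds_create(boundspecs, nparts, key, &mapping);

    /* step 3: create the long-lived context and switch to it */
    rel->rd_pdcxt = AllocSetContextCreate(CacheMemoryContext,
                                          "partition descriptor",
                                          ALLOCSET_DEFAULT_SIZES);
    MemoryContextSwitchTo(rel->rd_pdcxt);

    /* step 4: only the copies made here survive in rd_pdcxt */
    partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
    partdesc->nparts = nparts;
    partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
    memcpy(partdesc->oids, oids, nparts * sizeof(Oid));
    partdesc->boundinfo = partition_bounds_copy(boundinfo, key);

    /* step 5: back to where we started */
    MemoryContextSwitchTo(oldcxt);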

Thanks,
Amit




Re: Undo logs

2018-11-14 Thread Dilip Kumar
On Wed, Nov 14, 2018 at 3:48 PM Dilip Kumar  wrote:
>
> On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila  wrote:
> > I think you can keep it with XXX instead of Fixme as there is nothing to 
> > fix.
> Changed
> >
> > Both the patches 0003-undo-interface-v4.patch and
> > 0004-undo-interface-test-v4.patch appears to be same except for the
> > name?
> My bad, please find the updated patch.
>
There was a problem in an assert, and one comment was not aligned
properly, so I have fixed those in the latest patch.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0003-undo-interface-v6.patch
Description: Binary data


Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)

2018-11-14 Thread Kyotaro HORIGUCHI
At Thu, 12 Jul 2018 10:31:46 +1200, Thomas Munro 
 wrote in 

> I suppose someone might argue that even when it's not a hit and it's
> not a read, we might still want to count this buffer interaction in
> some other way.  Perhaps there should be a separate counter?  It may
> technically be a kind of cache miss, but it's nowhere near as
> expensive as a synchronous system call like read() so I didn't propose
> that.

At Thu, 12 Jul 2018 10:19:29 +1000, Haribabu Kommi  
wrote in 
> Thanks for the details. I got your point. But we need to include
> RBM_ZERO_ON_ERROR case read operations, excluding others
> are fine.

FWIW, I agree with Haribabu's comment that the case of RBM_ZERO_*
other than ON_ERROR is a hit. It seems to show zheap's disk I/O
reduction by itself to a certain extent.

At Thu, 12 Jul 2018 13:28:37 +1200, David Rowley  
wrote in 
> On 12 July 2018 at 12:19, Haribabu Kommi  wrote:
> > Yes, I agree that we may need a new counter that counts the buffers that
> > are just allocated (no read or no write). But currently, may be the counter
> > value is very less, so people are not interested.

> The existing counters don't show up in EXPLAIN (ANALYZE, BUFFERS) when
> they're zero. The new counter would surely work the same way, so the
> users wouldn't be burdened by additional explain output when
> they're not affected by it.

I don't object strongly to the new counter either, but I'm not
sure it is distinctive enough from hits, in the view that "a hit
is where we found a buffer for the requested page".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Michael Paquier
On Thu, Nov 15, 2018 at 02:53:47PM +0900, Amit Langote wrote:
> As things stand today, rd_partdesc of a partitioned table must always be
> non-NULL.  In fact, there are many places in the backend code that Assert it:
>
> [...]

I have noticed those, and they actually would not care much if
rd_partdesc were NULL.  I find it interesting that the planner
portion actually does roughly the same thing for a partitioned table
with no partitions as for a non-partitioned table.

> Maybe there are others in a different form.
> 
> If there are no partitions, nparts is 0, and other fields are NULL, though
> rd_partdesc itself is never NULL.

I find it a bit confusing that both concepts have the same meaning, aka
that a relation has no partitions, and that it is actually relkind which
decides whether rd_partdesc should be NULL or set up.  This stuff also does
empty allocations.

> If we want to redesign that and allow it to be NULL until some code in the
> backend wants to use it, then maybe we can consider doing what you say.
> But, many non-trivial operations on partitioned tables require the
> PartitionDesc, so there is perhaps not much point to designing it such
> that rd_partdesc is set only when needed, because it will be referenced
> sooner than later.  Maybe, we can consider doing that sort of thing for
> boundinfo, because it's expensive to build, and not all operations want
> the canonicalized bounds.

I am fine if that's the consensus of this thread.  But as far as I can
see it is possible to remove a bit of the memory handling mess by doing
so.  My 2c.
--
Michael




Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-14 Thread Amit Kapila
On Thu, Nov 15, 2018 at 11:38 AM Amit Kapila  wrote:
>
> On Thu, Nov 15, 2018 at 2:05 AM Haribabu Kommi  
> wrote:
> >
> > On Wed, Nov 14, 2018 at 8:45 PM Amit Kapila  wrote:
> >> Apart from the above, I think we should add a test where all the
> >> parameters are valid as the corresponding code is not covered by any
> >> existing tests.
> >
> >
> > Added another test with all the 3 parameters are valid.
> > Updated patch attached.
> >
>
> +--
> +-- remove query ('SELECT $1 + $2 AS "TWO"') executed by
> regress_stats_user2 in the current_database
> +--
> +SELECT pg_stat_statements_reset((SELECT r.oid FROM pg_roles AS r
> WHERE r.rolname = 'regress_stats_user2'),
> + (SELECT d.oid from pg_database As d where datname = current_database()),
> + (SELECT s.queryid FROM pg_stat_statements AS s WHERE s.query =
> 'SELECT $1 AS "ONE"'));
>
> The query in the comment is different from the one actually used.  And
> how is it able to remove the correct statement from the hash table (it seems you
> intended to remove 'SELECT $1 AS "ONE"', but it removed 'SELECT $1 +
> $2 AS "TWO"')?
>

One more point: the lines in this statement are too long; try to shorten
them by starting the parameters of pg_stat_statements_reset on the next
line, or something like that.
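
Something along these lines, for example (only the layout differs from the
test as posted):

    SELECT pg_stat_statements_reset(
        (SELECT r.oid FROM pg_roles AS r
          WHERE r.rolname = 'regress_stats_user2'),
        (SELECT d.oid FROM pg_database AS d
          WHERE d.datname = current_database()),
        (SELECT s.queryid FROM pg_stat_statements AS s
          WHERE s.query = 'SELECT $1 AS "ONE"'));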

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-14 Thread Amit Kapila
On Thu, Nov 15, 2018 at 2:05 AM Haribabu Kommi  wrote:
>
> On Wed, Nov 14, 2018 at 8:45 PM Amit Kapila  wrote:
>> Apart from the above, I think we should add a test where all the
>> parameters are valid as the corresponding code is not covered by any
>> existing tests.
>
>
> Added another test with all the 3 parameters are valid.
> Updated patch attached.
>

+--
+-- remove query ('SELECT $1 + $2 AS "TWO"') executed by
regress_stats_user2 in the current_database
+--
+SELECT pg_stat_statements_reset((SELECT r.oid FROM pg_roles AS r
WHERE r.rolname = 'regress_stats_user2'),
+ (SELECT d.oid from pg_database As d where datname = current_database()),
+ (SELECT s.queryid FROM pg_stat_statements AS s WHERE s.query =
'SELECT $1 AS "ONE"'));

The query in the comment is different from the one actually used.  And
how is it able to remove the correct statement from the hash table (it seems you
intended to remove 'SELECT $1 AS "ONE"', but it removed 'SELECT $1 +
$2 AS "TWO"')?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Amit Langote
On 2018/11/15 14:38, Michael Paquier wrote:
> On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
>> I've fixed 0001 again to re-order the code so that allocations happen in
>> the correct context, and now the tests pass with the rebased patches.
> 
> I have been looking at 0001, and it seems to me that you make even more
> messy the current situation.  Coming to my point: do we have actually
> any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
> has no partitions?  It seems to me that we had better set rd_pdcxt and
> rd_partdesc to NULL in this case.

As things stand today, rd_partdesc of a partitioned table must always be
non-NULL.  In fact, there are many places in the backend code that Assert it:

tablecmds.c: ATPrepDropNotNull()

if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
PartitionDesc partdesc = RelationGetPartitionDesc(rel);

Assert(partdesc != NULL);

prepunion.c: expand_partitioned_rtentry()

PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);

check_stack_depth();

/* A partitioned table should always have a partition descriptor. */
Assert(partdesc);

plancat.c: set_relation_partition_info()

partdesc = RelationGetPartitionDesc(relation);
partkey = RelationGetPartitionKey(relation);
rel->part_scheme = find_partition_scheme(root, relation);
Assert(partdesc != NULL && rel->part_scheme != NULL);

Maybe there are others in a different form.

If there are no partitions, nparts is 0, and other fields are NULL, though
rd_partdesc itself is never NULL.

If we want to redesign that and allow it to be NULL until some code in the
backend wants to use it, then maybe we can consider doing what you say.
But, many non-trivial operations on partitioned tables require the
PartitionDesc, so there is perhaps not much point to designing it such
that rd_partdesc is set only when needed, because it will be referenced
sooner than later.  Maybe, we can consider doing that sort of thing for
boundinfo, because it's expensive to build, and not all operations want
the canonicalized bounds.

Thanks,
Amit




Re: doc fix for pg_stat_activity.backend_type

2018-11-14 Thread Amit Kapila
On Tue, Nov 13, 2018 at 5:03 PM Amit Kapila  wrote:
>
> On Tue, Nov 13, 2018 at 3:37 PM John Naylor  wrote:
> > > +  In addition, extensions may have additional types.
> > >
> > > How about: "In addition, background workers registered by extensions
> > > may have additional types."?
> >
> > Sounds good to me -- I've included this language.
> >
>
> LGTM.  I will wait for a day or so; if nobody has any comments, I will
> push your patch.
>

Pushed, thanks for working on this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2018-11-14 Thread Laurenz Albe
On Tue, 2018-04-17 at 15:09 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Andres was working on a radix tree structure to fix this problem, but
> > that seems to be abandoned now, and it seems a major undertaking.  While
> > I agree that the proposed solution is a wart, it seems much better than
> > no solution at all.  Can we consider Fujii's proposal as a temporary
> > measure until we fix shared buffers?  I'm +1 on it myself.
> 
> Once we've introduced a user-visible reloption it's going to be
> practically impossible to get rid of it, so I'm -1.  I'd much rather
> see somebody put some effort into the radix-tree idea than introduce
> a kluge that we'll be stuck with, and that doesn't even provide a
> good user experience.  Disabling vacuum truncation is *not* something
> that I think we should recommend.

This new option would not only mitigate the long shared_buffers scan,
but also get rid of the replication conflict caused by the
AccessExclusiveLock taken during truncation, which is discussed in
https://www.postgresql.org/message-id/c9374921e50a5e8fb1ecf04eb8c6ebc3%40postgrespro.ru
and seems to be a more difficult problem than anticipated.

Could that tip the scales in favor of this stop-gap?

FWIW, I have always considered heap truncation on VACUUM to be something
strange anyway.  VACUUM does not get rid of empty pages in the middle of
a relation, so why is it so important to do it at the end of the relation?
If the answer is "just because we can do it easily", then I think it would be
ok to disable the feature in cases where it causes problems.
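
For illustration, usage could then be as simple as the following (the reloption
name here is made up; whatever Fujii's patch actually calls it applies):

    ALTER TABLE busy_table SET (vacuum_truncate_enabled = off);  -- hypothetical name
    VACUUM busy_table;  -- trailing empty pages would now be left in place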

Yours,
Laurenz Albe




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Michael Paquier
On Thu, Nov 15, 2018 at 01:38:55PM +0900, Amit Langote wrote:
> I've fixed 0001 again to re-order the code so that allocations happen in
> the correct context, and now the tests pass with the rebased patches.

I have been looking at 0001, and it seems to me that you make even more
messy the current situation.  Coming to my point: do we have actually
any need to set rel->rd_pdcxt and rel->rd_partdesc at all if a relation
has no partitions?  It seems to me that we had better set rd_pdcxt and
rd_partdesc to NULL in this case.
--
Michael




Re: Synchronous replay take III

2018-11-14 Thread Masahiko Sawada
On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
 wrote:
>
> Hi hackers,
>
> I was pinged off-list by a fellow -hackers denizen interested in the
> synchronous replay feature and wanting a rebased patch to test.  Here
> it goes, just in time for a Commitfest.  Please skip to the bottom of
> this message for testing notes.

Thank you for working on this. The overview and your summary were
helpful for me to understand this feature. I've started to
review this patch for PostgreSQL 12. I've tested it and found
some issues, but let me ask you questions about the high-level design
first. Sorry if these have already been discussed.

>
> In previous threads[1][2][3] I called this feature proposal "causal
> reads".  That was a terrible name, borrowed from MySQL.  While it is
> probably a useful term of art, for one thing people kept reading it as
> "casual", which it ain't, and more importantly this patch is only one
> way to achieve read-follows-write causal consistency.  Several others
> are proposed or exist in forks (user managed wait-for-LSN, global
> transaction manager, ...).
>
> OVERVIEW
>
> For writers, it works a bit like RAID mirroring: when you commit a
> write transaction, it waits until the data has become visible on all
> elements of the array, and if an array element is not responding fast
> enough it is kicked out of the array.  For readers, it's a little
> different because you're connected directly to the array elements
> (rather than going through a central controller), so it uses a system
> of leases allowing read transactions to know instantly whether
> they are running on an element that is currently in the array and are
> therefore able to service synchronous_replay transactions, or should
> raise an error telling you to go and ask some other element.
>
> This is a design choice favouring read-mostly workloads at the expense
> of write transactions.  Hot standbys' whole raison for existing is to
> move *some* read-only workloads off the primary server.  This proposal
> is for users who are prepared to trade increased primary commit
> latency for a guarantee about visibility on the standbys, so that
> *all* read-only work could be moved to hot standbys.

To be clear, what did you mean by read-mostly workloads?

I think there are two kinds of reads on standbys: a read that happens after
writes, and a direct read (e.g. reporting). The former usually
requires causal reads, as you mentioned, in order to read its own
writes, but the latter might be different: it often wants to read the
latest data on the master at that time. IIUC, even if we send a
read-only query directly to a synchronous replay server, we could get a
stale result if the standby is delayed by less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case (i.e. a few writes and many reads that want
to see them), whereas for the latter case perhaps keeping the reads
waiting on the standby seems a more reasonable solution.

Also, I think it's worth considering the cost of both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring them) on the current design. IIUC, the
current design is such that we create something like a
consistent-reads group by specifying servers. For example, if a
transaction doesn't need causal reads, it can send its query to any
server with synchronous_replay = off, but if it does, it must select
a synchronous replay server. It also means that client applications or
routing middleware such as pgpool are required to be aware of the
available synchronous replay standbys. That is, this design puts the cost
on the read-only transactions requiring causal reads. On the other hand,
with token-based causal reads we can send a read-only query to any standby
as long as we can wait for the change to be replayed. Of course, if we
don't want to wait forever, we can time out and switch to either another
standby or the master to execute the query, but we don't need to choose
among the standby servers up front.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



RE: [Todo item] Add entry creation timestamp column to pg_stat_replication

2018-11-14 Thread myungkyu.lim
Hi.
Thanks for your feedback.

> Can you explain the purpose of this feature more, because now we have columns
> to report replication delay times like write_lag, flush_lag and replay_lag
> that can be used for a similar purpose.

I think the elapsed-time stats are very useful on a system with active DML
queries, but those stats are not present on an idle system - one running no
queries, or only SELECTs.

sent_lsn   | 0/5476C88
write_lsn  | 0/5476C88
flush_lsn  | 0/5476C88
replay_lsn | 0/5476C88
write_lag  | 00:00:00.55
flush_lag  | 00:00:00.000855
replay_lag | 00:00:00.000914
sync_priority  | 0
sync_state | async
last_msg_send_time | 2018-11-15 14:04:39.65889+09

state  | streaming
sent_lsn   | 0/5476CC0
write_lsn  | 0/5476CC0
flush_lsn  | 0/5476CC0
replay_lsn | 0/5476CC0
write_lag  |
flush_lag  |
replay_lag |
sync_priority  | 0
sync_state | async
last_msg_send_time | 2018-11-15 14:05:02.935457+09

state  | streaming
sent_lsn   | 0/5476CC0
write_lsn  | 0/5476CC0
flush_lsn  | 0/5476CC0
replay_lsn | 0/5476CC0
write_lag  |
flush_lag  |
replay_lag |
sync_priority  | 0
sync_state | async
last_msg_send_time | 2018-11-15 14:06:23.128947+09

This timestamp column is useful for checking the reaction interval and for
debugging on an idle system.
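
For example, on an idle system one can still see how long ago the standby last
reported in (illustrative query, using the column as shown above):

    =# SELECT application_name, sync_state,
              now() - last_msg_send_time AS since_last_reply
         FROM pg_stat_replication;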

Best regards,
Myungkyu, Lim




Re: [RFC] Removing "magic" oids

2018-11-14 Thread Andres Freund
Hi,

On 2018-11-15 04:57:28 +, Noah Misch wrote:
> On Wed, Nov 14, 2018 at 12:01:52AM -0800, Andres Freund wrote:
> > - one pgbench test tested concurrent insertions into a table with
> >   oids, as some sort of stress test for lwlocks and spinlocks. I *think*
> >   this doesn't really have to be a system oid column, and this was just
> >   because that's how we triggered a bug on some machine. Noah, do I get
> >   this right?
> 
> The point of the test is to exercise OidGenLock by issuing many parallel
> GetNewOidWithIndex() and verifying absence of duplicates.  There's nothing
> special about OidGenLock, but it is important to use an operation that takes a
> particular LWLock many times, quickly.  If the test query spends too much time
> on things other than taking locks, it will catch locking races too rarely.

Sequences ought to do that, too. And if it's borked, we'd hopefully see
unique violations. But it's definitely not a 1:1 replacement.


> >  pgbench(
> > '--no-vacuum --client=5 --protocol=prepared --transactions=25',
> > 0,
> > [qr{processed: 125/125}],
> > [qr{^$}],
> > -   'concurrency OID generation',
> > +   'concurrent insert generation',
> > {
> > -   '001_pgbench_concurrent_oid_generation' =>
> > - 'INSERT INTO oid_tbl SELECT FROM generate_series(1,1000);'
> > +   '001_pgbench_concurrent_insert' =>
> > + 'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);'
> 
> The code for sequences is quite different, so this may or may not be an
> effective replacement.  To study that, you could remove a few barriers from
> lwlock.c, measure how many iterations today's test needs to catch the
> mutation, and then measure the same for this proposal.

Unfortunately it's really hard to hit barrier issues on x86. I think
that's the only arch I currently have access to, but it's possible I
have access to some ppc too.  If you have a better idea for a
replacement test, I'd be all ears.

Thanks!

Andres



Re: [RFC] Removing "magic" oids

2018-11-14 Thread Noah Misch
On Wed, Nov 14, 2018 at 12:01:52AM -0800, Andres Freund wrote:
> - one pgbench test tested concurrent insertions into a table with
>   oids, as some sort of stress test for lwlocks and spinlocks. I *think*
>   this doesn't really have to be a system oid column, and this was just
>   because that's how we triggered a bug on some machine. Noah, do I get
>   this right?

The point of the test is to exercise OidGenLock by issuing many parallel
GetNewOidWithIndex() and verifying absence of duplicates.  There's nothing
special about OidGenLock, but it is important to use an operation that takes a
particular LWLock many times, quickly.  If the test query spends too much time
on things other than taking locks, it will catch locking races too rarely.

> --- a/src/bin/pgbench/t/001_pgbench_with_server.pl
> +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
> @@ -48,28 +48,26 @@ sub pgbench
>   return;
>  }
>  
> -# Test concurrent insertion into table with UNIQUE oid column.  DDL expects
> -# GetNewOidWithIndex() to successfully avoid violating uniqueness for indexes
> -# like pg_class_oid_index and pg_proc_oid_index.  This indirectly exercises
> -# LWLock and spinlock concurrency.  This test makes a 5-MiB table.
> +# Test concurrent insertion into table with serial column.  This
> +# indirectly exercises LWLock and spinlock concurrency.  This test
> +# makes a 5-MiB table.
>  
>  $node->safe_psql('postgres',
> - 'CREATE UNLOGGED TABLE oid_tbl () WITH OIDS; '
> -   . 'ALTER TABLE oid_tbl ADD UNIQUE (oid);');
> + 'CREATE UNLOGGED TABLE insert_tbl (id serial primary key); ');
>  
>  pgbench(
>   '--no-vacuum --client=5 --protocol=prepared --transactions=25',
>   0,
>   [qr{processed: 125/125}],
>   [qr{^$}],
> - 'concurrency OID generation',
> + 'concurrent insert generation',
>   {
> - '001_pgbench_concurrent_oid_generation' =>
> -   'INSERT INTO oid_tbl SELECT FROM generate_series(1,1000);'
> + '001_pgbench_concurrent_insert' =>
> +   'INSERT INTO insert_tbl SELECT FROM generate_series(1,1000);'

The code for sequences is quite different, so this may or may not be an
effective replacement.  To study that, you could remove a few barriers from
lwlock.c, measure how many iterations today's test needs to catch the
mutation, and then measure the same for this proposal.



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Amit Langote
On 2018/11/15 11:03, Amit Langote wrote:
> As Michael pointed out, the first cleanup patch needs to be rebased due to
> a recent commit [1].  I did that to see if something we did in that commit
> made things worse for your patch, but seems fine.  I had to go and change
> things outside RelationBuildPartitionDesc as I rebased, due to the
> aforementioned commit, but they're simple changes such as changing List *
> arguments of some newly added functions to PartitionBoundSpec **.  Please
> find the rebased patches attached with this email.
> 
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2

I noticed that the regression tests containing partitioned tables fail
randomly with the rebased patches I posted, whereas they didn't when I applied
them to HEAD without [1].

It seems to be due to the slightly confused memory context handling in
RelationBuildPartitionDesc after [1], which Alvaro had expressed some
doubts about yesterday [2].

I've fixed 0001 again to re-order the code so that allocations happen in
the correct context, and now the tests pass with the rebased patches.

By the way, I noticed that the oids array added by Robert's original 0001
patch wasn't initialized to NULL, which could lead to calling pfree on a
garbage value of oids after the 2nd patch.

Thanks,
Amit

[2]
https://www.postgresql.org/message-id/20181113135915.v4r77tdthlajdlqq%40alvherre.pgsql
From bff9ca17e43d6538c5e01e5de7a95f6e426e0d55 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Wed, 14 Nov 2018 11:11:58 -0500
Subject: [PATCH 1/2] Reduce unnecessary list construction in
 RelationBuildPartitionDesc.

The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'.  There's no
point to duplicating the list, so avoid that.  Instead, construct
the array representation directly from the original 'inhoids' list.

Also, use an array rather than a list for 'boundspecs'.  We know
exactly how many items we need to store, so there's really no
reason to use a list.  Using an array instead reduces the number
of memory allocations we perform.
---
 src/backend/partitioning/partbounds.c | 66 ++-
 src/backend/utils/cache/partcache.c   | 73 ++-
 src/include/partitioning/partbounds.h |  5 +--
 3 files changed, 67 insertions(+), 77 deletions(-)

diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0b5e0dd89f..a8f4a1a685 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -70,15 +70,12 @@ static int32 qsort_partition_list_value_cmp(const void *a, 
const void *b,
   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
   void *arg);
-static PartitionBoundInfo create_hash_bounds(List *boundspecs,
-  PartitionKey key,
-  int **mapping);
-static PartitionBoundInfo create_list_bounds(List *boundspecs,
-  PartitionKey key,
-  int **mapping);
-static PartitionBoundInfo create_range_bounds(List *boundspecs,
-   PartitionKey key,
-   int **mapping);
+static PartitionBoundInfo create_hash_bounds(PartitionBoundSpec **boundspecs,
+  int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_list_bounds(PartitionBoundSpec **boundspecs,
+  int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_range_bounds(PartitionBoundSpec **boundspecs,
+  int nparts, PartitionKey key, int **mapping);
 static PartitionRangeBound *make_one_partition_rbound(PartitionKey key, int 
index,
  List *datums, bool lower);
 static int32 partition_hbound_cmp(int modulus1, int remainder1, int modulus2,
@@ -169,9 +166,9 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * current memory context.
  */
 PartitionBoundInfo
-partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
+partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
+   PartitionKey key, int **mapping)
 {
-   int nparts = list_length(boundspecs);
int i;
 
Assert(nparts > 0);
@@ -199,13 +196,13 @@ partition_bounds_create(List *boundspecs, PartitionKey 
key, int **mapping)
switch (key->strategy)
{
case PARTITION_STRATEGY_HASH:
-   return create_hash_bounds(boundspecs, key, mapping);
+   return create_hash_bounds(boundspecs, nparts, key, 
mapping);
 
case 

Re: lbound1 default in buildint2vector/buildoidvector

2018-11-14 Thread Tom Lane
Kohei KaiGai  writes:
> I noticed buildint2vector / buildoidvector assigns lbound1=0 as default
> value, but array type shall have lbound1=1 in the default.
> Is there some reasons for the difference?

Backwards compatibility.

regards, tom lane



Re: Pull up sublink of type 'NOT NOT (expr)'

2018-11-14 Thread Richard Guo
Hi Tom,

Thanks for reviewing.

On Tue, Nov 13, 2018 at 10:05 AM, Tom Lane  wrote:

> Richard Guo  writes:
> > Currently for quals in the form of "NOT NOT (SubLink)", this SubLink
> would
> > not be considered when pulling up sublinks.
>
> Yup.
>
> > Should we give it a chance, like the attached does?
>
> What is the argument that this occurs often enough to be worth expending
> extra cycles and code space on?
>
> If we do do something like this, I'd be inclined to make it handle
> any-number-of-consecutive-NOTs, and maybe remove NOT NOT over an ANY,
> not just EXISTS.  But I don't honestly think that it's worth troubling
> over.  Do even the dumbest ORMs generate such code?
>

What this patch does is to recursively remove NOT NOT over a SubLink, so it
actually can handle any-number-of-consecutive-NOTs, both over ANY and over
EXISTS.

Over ANY:

gpadmin=# explain select * from a where not not not not a.i in (select i
from b);
  QUERY PLAN
---
 Hash Join  (cost=42.75..93.85 rows=1130 width=8)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..32.60 rows=2260 width=8)
   ->  Hash  (cost=40.25..40.25 rows=200 width=4)
 ->  HashAggregate  (cost=38.25..40.25 rows=200 width=4)
   Group Key: b.i
   ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)

Over EXISTS:

gpadmin=# explain select * from a where not not not not exists (select 1
from b where a.i = b.i);
  QUERY PLAN
---
 Hash Join  (cost=42.75..93.85 rows=1130 width=8)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a  (cost=0.00..32.60 rows=2260 width=8)
   ->  Hash  (cost=40.25..40.25 rows=200 width=4)
 ->  HashAggregate  (cost=38.25..40.25 rows=200 width=4)
   Group Key: b.i
   ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)


I am not using an ORM; I was just thinking it might be better if
PostgreSQL could do such a pull-up.
Tom, what's your suggestion? Is it worthwhile expending several lines of
code to do this pull-up?

Thanks
Richard



Re: DSM segment handle generation in background workers

2018-11-14 Thread Thomas Munro
On Thu, Nov 15, 2018 at 4:25 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > What do you think about the attached?
>
> I think you need to cast MyStartTimestamp to uint64 before shifting
> to ensure portable behavior of the shifts.  In principle it wouldn't
> matter because the int64 sign bit is nowhere near the part we care
> about, but I've heard some pretty wild claims about what compiler
> writers are entitled to do with "undefined" cases.

Thanks.  Pushed.
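
(For anyone following along, the portability point is simply to cast before
shifting, along these lines - a sketch, not the committed code:)

    uint64 ts = (uint64) MyStartTimestamp; /* cast first: shifting the signed
                                            * int64 directly is less portable */
    uint32 high_bits = (uint32) (ts >> 32);
    uint32 low_bits = (uint32) ts;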

-- 
Thomas Munro
http://www.enterprisedb.com



lbound1 default in buildint2vector/buildoidvector

2018-11-14 Thread Kohei KaiGai
Hello,

I noticed that buildint2vector / buildoidvector assign lbound1=0 as the
default value, but array types have lbound1=1 by default.
Is there some reason for the difference?

I made a simple C function that returns an int2vector carrying the
attribute numbers of the argument columns, and got this:

postgres=# select attnums_of('t0','{aid,cid,bid}');
  attnums_of
---
 [0:2]={3,5,4}
(1 row)

Once lbound1 is set to 1 manually, the result looks like this:

postgres=# select attnums_of('t0','{aid,cid,bid}');
 attnums_of

 {3,5,4}
(1 row)

Maybe the latter is the more natural one.
Of course, these APIs are mostly internal, so the lbound1 setting is not
all that significant.
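
(For reference, the manual adjustment is just one assignment after building the
vector - a sketch, with the variable names invented:)

    int2vector *attnums_vec = buildint2vector(attnums, nattnums);
    attnums_vec->lbound1 = 1;   /* present it as an ordinary 1-based array */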

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Amit Langote
On 2018/11/15 4:27, Robert Haas wrote:
> RelationBuildPartitionDesc doesn't lock the children
> whose relpartbounds it is fetching (!), so unless we're guaranteed to
> have already locked them children earlier for some other reason, we
> could grab the partition bound at this point and then it could change
> again before we get a lock on them.

Hmm, I think that RelationBuildPartitionDesc doesn't need to lock a
partition before fetching its relpartbound, because the latter can't
change if the caller is holding a lock on the parent, which it must be if
we're in RelationBuildPartitionDesc for parent at all.  Am I missing
something?


As Michael pointed out, the first cleanup patch needs to be rebased due to
a recent commit [1].  I did that to see if something we did in that commit
made things worse for your patch, but seems fine.  I had to go and change
things outside RelationBuildPartitionDesc as I rebased, due to the
aforementioned commit, but they're simple changes such as changing List *
arguments of some newly added functions to PartitionBoundSpec **.  Please
find the rebased patches attached with this email.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b52b7dc2
From 3e7642c236f04bc65f3f39e7de98d7ff85166e03 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Wed, 14 Nov 2018 11:11:58 -0500
Subject: [PATCH 1/2] Reduce unnecessary list construction in
 RelationBuildPartitionDesc.

The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'.  There's no
point to duplicating the list, so avoid that.  Instead, construct
the array representation directly from the original 'inhoids' list.

Also, use an array rather than a list for 'boundspecs'.  We know
exactly how many items we need to store, so there's really no
reason to use a list.  Using an array instead reduces the number
of memory allocations we perform.
---
 src/backend/partitioning/partbounds.c | 66 +++
 src/backend/utils/cache/partcache.c   | 36 +++
 src/include/partitioning/partbounds.h |  5 ++-
 3 files changed, 51 insertions(+), 56 deletions(-)

diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 0b5e0dd89f..a8f4a1a685 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -70,15 +70,12 @@ static int32 qsort_partition_list_value_cmp(const void *a, 
const void *b,
   void *arg);
 static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
   void *arg);
-static PartitionBoundInfo create_hash_bounds(List *boundspecs,
-  PartitionKey key,
-  int **mapping);
-static PartitionBoundInfo create_list_bounds(List *boundspecs,
-  PartitionKey key,
-  int **mapping);
-static PartitionBoundInfo create_range_bounds(List *boundspecs,
-   PartitionKey key,
-   int **mapping);
+static PartitionBoundInfo create_hash_bounds(PartitionBoundSpec **boundspecs,
+  int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_list_bounds(PartitionBoundSpec **boundspecs,
+  int nparts, PartitionKey key, int **mapping);
+static PartitionBoundInfo create_range_bounds(PartitionBoundSpec **boundspecs,
+  int nparts, PartitionKey key, int **mapping);
 static PartitionRangeBound *make_one_partition_rbound(PartitionKey key, int 
index,
  List *datums, bool lower);
 static int32 partition_hbound_cmp(int modulus1, int remainder1, int modulus2,
@@ -169,9 +166,9 @@ get_qual_from_partbound(Relation rel, Relation parent,
  * current memory context.
  */
 PartitionBoundInfo
-partition_bounds_create(List *boundspecs, PartitionKey key, int **mapping)
+partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
+   PartitionKey key, int **mapping)
 {
-   int nparts = list_length(boundspecs);
int i;
 
Assert(nparts > 0);
@@ -199,13 +196,13 @@ partition_bounds_create(List *boundspecs, PartitionKey 
key, int **mapping)
switch (key->strategy)
{
case PARTITION_STRATEGY_HASH:
-   return create_hash_bounds(boundspecs, key, mapping);
+   return create_hash_bounds(boundspecs, nparts, key, 
mapping);
 
case PARTITION_STRATEGY_LIST:
-   return create_list_bounds(boundspecs, key, mapping);
+   return create_list_bounds(boundspecs, nparts, key, 
mapping);
 
   

Re: speeding up planning with partitions

2018-11-14 Thread Amit Langote
On 2018/11/15 10:19, Imai, Yoshikazu wrote:
> On Tue, Nov 13, 2018 at 10:29 PM, Amit Langote wrote:
>> On 2018/11/12 13:35, Imai, Yoshikazu wrote:
>>> How amount of memory is used with above tests is...
>>>
>>> without v5 patches, Set1: 242MB
>>> without v5 patches, Set2: 247MB
>>> with v5 patches, Set1: 420MB
>>> with v5 patches, Set2: 820MB
>>
>> I understood why update/delete planning consumed more memory with the
>> patch.  It was due to a problem with the patch that modifies inheritance
>> update/delete planning.  The exact problem was that the query tree would
>> be translated (hence copied) *twice* for every partition!  First during
>> query planning where the query tree would be translated to figure out
>> a targetlist for partitions and then again before calling
>> grouping_planner.
>> Also, the adjust_appendrel_attrs_multilevel made it worse for
>> multi-level partitioning case, because of repeated copying for root to
>> intermediate partitioned tables, as Imai-san pointed out.
>>
>> I've fixed that making sure that query tree is translated only once and
>> saved for later steps to use.  Imai-san, please check the memory
>> consumption with the latest patch.
> 
> Thanks for fixing!
> Now memory consumption is lower than before.
> 
> with v7 patches, Set1: 223MB
> with v7 patches, Set2: 226MB

Thanks for checking.  So at least we no longer have any memory
over-allocation bug with the patch, but perhaps other bugs are still
lurking. :)

Regards,
Amit




RE: speeding up planning with partitions

2018-11-14 Thread Imai, Yoshikazu
Hi Amit,

On Tue, Nov 13, 2018 at 10:29 PM, Amit Langote wrote:
> On 2018/11/12 13:35, Imai, Yoshikazu wrote:
> > adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1
> > adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2
> 
> Ah, I see what you mean.
> 
> The root -> sub1 translation will be repeated for each leaf partition
> if done via adjust_appendrel_attrs_multilevel.  On the other hand, if
> we could do the root to sub1 translation once and pass it to the recursive
> call using sub1 as the parent.
> 
> I've changed the patch use adjust_appendrel_attrs.
> 
> > Since it is difficult to explain my thoughts with words, I will show
> > the performance degration case.
> >
> > Partition tables are below two sets.
> 
> [ ... ]
> 
> > Create a generic plan of updation or deletion.
> >
> > [create a delete generic plan]
> > set plan_cache_mode = 'force_generic_plan'; prepare delete_stmt(int)
> > as delete from rt where b = $1; execute delete_stmt(1);
> 
> [ ... ]
> 
> > How amount of memory is used with above tests is...
> >
> > without v5 patches, Set1: 242MB
> > without v5 patches, Set2: 247MB
> > with v5 patches, Set1: 420MB
> > with v5 patches, Set2: 820MB
> 
> Although I didn't aim to fix planning for the generic plan case where
> no pruning occurs, the above result is not acceptable.  That is, the new
> implementation of inheritance update/delete planning shouldn't consume
> more memory than the previous.  In fact, it should've consumed less,
> because the patch claims that it gets rid of redundant processing per
> partition.
> 
> I understood why update/delete planning consumed more memory with the
> patch.  It was due to a problem with the patch that modifies inheritance
> update/delete planning.  The exact problem was that the query tree would
> be translated (hence copied) *twice* for every partition!  First during
> query planning where the query tree would be translated to figure out
> a targetlist for partitions and then again before calling
> grouping_planner.
> Also, the adjust_appendrel_attrs_multilevel made it worse for
> multi-level partitioning case, because of repeated copying for root to
> intermediate partitioned tables, as Imai-san pointed out.
> 
> I've fixed that making sure that query tree is translated only once and
> saved for later steps to use.  Imai-san, please check the memory
> consumption with the latest patch.

Thanks for fixing!
Now memory consumption is lower than before.

with v7 patches, Set1: 223MB
with v7 patches, Set2: 226MB

Thanks,

--
Yoshikazu Imai


Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Michael Paquier
On Wed, Nov 14, 2018 at 02:27:31PM -0500, Robert Haas wrote:
> Here are a couple of patches to illustrate this approach to this part
> of the overall problem.  0001 is, I think, a good cleanup that may as
> well be applied in isolation; it makes the code in
> RelationBuildPartitionDesc both cleaner and more efficient.  0002
> adjust things so that - I hope - the partition bounds we get for the
> individual partitions has to be as of the same point in the commit
> sequence as the list of children.  As I noted before, Alvaro's patch
> doesn't seem to have tackled this part of the problem.

You may want to rebase these patches as of b52b7dc2, and change the
first argument of partition_bounds_create() so that a list is used as
input...
--
Michael




Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-14 Thread Amit Langote
On 2018/11/15 8:58, Alvaro Herrera wrote:
> On 2018-Nov-15, David Rowley wrote:
> 
>> On 15 November 2018 at 07:10, Alvaro Herrera  
>> wrote:
>>> What's with this comment?
>>>
>>>  * Initially we must only set up 1 PartitionDispatch object; the 
>>> one for
>>>  * the partitioned table that's the target of the command.  If we 
>>> must
>>>  * route a tuple via some sub-partitioned table, then its
>>>  * PartitionDispatch is only built the first time it's required.
>>>
>>> You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
>>> odds with the '1' mentioned in the comment.  Which is wrong?
>>
>> I don't think either is wrong, but I guess something must be
>> misleading, so could perhaps be improved.
> 
> Ah, that makes sense.  Yeah, it seems a bit misleading to me.  No
> worries.

Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
PROUTE_INIT_ALLOCSIZE, to make it clear that it's only the allocation size?

Thanks,
Amit




Re: PostgreSQL Limits and lack of documentation about them.

2018-11-14 Thread Tom Lane
David Rowley  writes:
> [ v4-0001-Add-documentation-section-appendix-detailing-some.patch ]

A few nitpicky gripes on this -

* I don't like inserting this as Appendix B, because that means
renumbering appendixes that have had their same names for a *long*
time; for instance the release notes have been Appendix E since
we adopted the modern division of the docs in 7.4.  So I'd put it
below anything that's commonly-referenced.  Maybe just before
"Acronyms"?

* I think I'd make the title "PostgreSQL Limitations", as it
applies to the product not any one database.

* The repetition of "Maximum" in each table row seems rather
pointless; couldn't we just drop that word?

* Items such as "relations per database" are surely not unlimited;
that's bounded at 4G by the number of distinct OIDs.  (In practice
you'd get unhappy well before that, though I suppose that's true
for many of these.)

* Rows per table is also definitely finite if you are documenting
pages per relation as finite.  But it'd be worth pointing out that
partitioning provides a way to surmount that.

* Many of these values are affected by BLCKSZ.  How much effort
shall we spend on documenting that?

* Max ID length is 63 bytes not characters.

* Don't think I'd bother with mentioning INCLUDE columns in the
"maximum indexed columns" entry.  Also, maybe call that "maximum
columns per index"; as phrased, it could be misunderstood to
mean that only 32 columns can be used in all indexes put together.

* Ordering of the table entries seems a bit random.

> The only other thing that sprung to my mind was the maximum tables per
> query.  This is currently limited to 64999, not including double
> counting partitioned tables and inheritance parents, but I kinda think
> if we feel the need to document it, then we might as well just raise
> the limit.

Can't get excited about documenting that one ... although as things
stand, it implies a limit on the number of partitions you can use
that's way lower than the claimed 256M.

> It seems a bit arbitrarily set at the moment. I don't see
> any reason it couldn't be higher.

It's evidently intended to make sure varnos can fit in uint16.
Whether there's anyplace that's actually doing so, rather than
storing them as ints, I dunno.

regards, tom lane



Re: doc - improve description of default privileges

2018-11-14 Thread Tom Lane
Progress on this patch seems to be blocked on the question of whether
we want to keep enlarging the amount of psql-specific information
in the GRANT reference page, or move that all somewhere else.

FWIW, I think I agree with Peter's position that moving it somewhere
else is the better option.  Section 5.6 "Privileges" seems like a
reasonable choice.

One reason for preferring that is that I don't think putting a table in
a reference page is a great idea.  I tried the patch in HEAD, and what
I see is that the table cross-reference renders as "Table 244", which is
a number that has no chance at all of holding still for any long period.
It's especially weird to read that in "man GRANT.7", where you're
supposedly reading a standalone document.  And while my version of "man"
makes a valiant effort to render the table in ASCII, it still doesn't look
great, and older man versions might do very poorly with that.  So I'm for
moving this to a part of the docs where we only need to worry about two
output formats not three.

A few thoughts on other issues:

* Perhaps we could fix Peter's complaint about the "Owner" column by
relabeling it "All Privileges".  I'd be inclined to label the last
column "Default PUBLIC Privileges", too, if we can fit that in.

* The phrase "relation-like objects" seems way too vague, especially since
one has to read it as excluding sequences, which surely are relations for
most purposes.  Is there a good reason not to just leave that entry as
"TABLE", full stop?  Or maybe it could be "TABLE, VIEW, etc" or some such.

* I don't think the use of "hardcoded" adds anything.

* Is it worth adding another table matching privilege names ("INSERT")
with their aclitem letters ("a"), rather than having the semi-formal
format currently appearing in grant.sgml?  There's also some related
material in 9.25 with the aclitem functions; it'd be worth unifying
that too maybe.

regards, tom lane



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-14 Thread Alvaro Herrera
On 2018-Nov-15, David Rowley wrote:

> On 15 November 2018 at 07:10, Alvaro Herrera  wrote:
> > What's with this comment?
> >
> >  * Initially we must only set up 1 PartitionDispatch object; the 
> > one for
> >  * the partitioned table that's the target of the command.  If we 
> > must
> >  * route a tuple via some sub-partitioned table, then its
> >  * PartitionDispatch is only built the first time it's required.
> >
> > You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
> > odds with the '1' mentioned in the comment.  Which is wrong?
> 
> I don't think either is wrong, but I guess something must be
> misleading, so could perhaps be improved.

Ah, that makes sense.  Yeah, it seems a bit misleading to me.  No
worries.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Constraint documentation

2018-11-14 Thread Tom Lane
Fabien COELHO  writes:
> I've put the patch as "Ready".

I think this could be improved some more.  Perhaps something like this
(I've not bothered with markup...)

 PostgreSQL does not support CHECK constraints that reference table
 data other than the new or updated row being checked.  While a CHECK
 constraint that violates this rule may appear to work in simple
 tests, it cannot guarantee that the database will not reach a state
 in which the constraint condition is false (due to subsequent changes
 of the other row(s) involved).  This would cause a database dump and
 reload to fail.  The reload could fail even when the complete
 database state is consistent with the constraint, due to rows not
 being loaded in an order that will satisfy the constraint.  If
 possible, use UNIQUE, EXCLUDE, or FOREIGN KEY constraints to express
 cross-row and cross-table restrictions.

 If what you desire is a one-time check against other rows at row
 insertion, rather than a continuously-maintained consistency
 guarantee, a custom trigger can be used to implement that.  (This
 approach avoids the dump/reload problem because pg_dump does not
 reinstall triggers until after reloading data, so that the check will
 not be enforced during a dump/reload.)

This is a little verbose maybe, but as the text stands, it sounds like
using a trigger is enough to solve all the consistency problems that
a cross-row CHECK has.  Which it's not of course.
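
(For concreteness, the kind of insertion-time trigger check meant above might
look like the sketch below - the table and column names are invented, and as
just said it is not a full consistency guarantee:)

    CREATE FUNCTION check_room_free() RETURNS trigger AS $$
    BEGIN
        IF EXISTS (SELECT 1 FROM reservations r
                   WHERE r.room = NEW.room AND r.day = NEW.day) THEN
            RAISE EXCEPTION 'room % is already booked on %', NEW.room, NEW.day;
        END IF;
        RETURN NEW;
    END $$ LANGUAGE plpgsql;

    CREATE TRIGGER reservations_room_free
        BEFORE INSERT ON reservations
        FOR EACH ROW EXECUTE PROCEDURE check_room_free();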

I'm also wondering whether it's better to put this in the CREATE TABLE
reference page instead of here.  While there are certainly benefits in
having the caveat here, I'm a bit troubled by the number of forward
references to concepts that are described later.  OTOH, a lot of people
who need the warning might never see it if it's buried in the reference
material.

regards, tom lane



Re: Refactoring the checkpointer's fsync request queue

2018-11-14 Thread Andres Freund
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
> On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
>  wrote:
> > I'm not sure if it matters whether we send the fd before or after the
> > write, but we still need some kind of global ordering of fds that can
> > order a given fd with respect to writes in other processes, so the
> > patch introduces a global shared counter captured immediately after
> > open() (including when reopened in the vfd machinery).
> 
> But how do you make reading that counter atomic with the open() itself?

I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?

Greetings,

Andres Freund



Re: date_trunc() in a specific time zone

2018-11-14 Thread Vik Fearing
On 14/11/2018 21:42, Tom Lane wrote:
> I wrote:
>> Here's a v2 that transposes the code to C so that we can get that
>> optimization.
> 
> Pushed after a bit more testing and documentation-wordsmithing.

Thank you, Tom!
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: Doc patch on psql output formats

2018-11-14 Thread Tom Lane
"Daniel Verite"  writes:
>   Tom Lane wrote:
>> Ugh.  Should we not fix the code so that it complains if there's
>> not a unique match?  I would bet that the code was also written
>> on the assumption that any abbreviation must be unique.

> Here's a patch making "\pset format" reject ambiguous abbreviations.

Pushed.  (I simplified the code a bit by using just one state variable,
and also made the error message more verbose.)

regards, tom lane



Re: In-place updates and serializable transactions

2018-11-14 Thread Kevin Grittner
On Wed, Nov 14, 2018 at 5:43 AM Joshua Yanovski
 wrote:

> This is only a personal anecdote, but from my own experience with 
> serializability, this sort of blind update isn't often contended in realistic 
> workloads.

> So, if this only affects transactions with blind updates, I doubt it will 
> cause much pain in real workloads (even though it might look bad in 
> benchmarks which include a mix of blind writes and rmw operations).  
> Particularly if it only happens if you explicitly opt into zheap storage.

I agree with all of that, but will be very interested in what
failures, if any, kick out from the "isolation" test set when all
tables are created using zheap.  I added all the common failure
patterns I had seen to that set, and others have filled in some corner
cases I missed since then, so if everything there passes I would not
worry about it at all.  If we do see some failures, we can take
another look to see whether any action is needed.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: Refactoring the checkpointer's fsync request queue

2018-11-14 Thread Robert Haas
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
 wrote:
> > That sounds a little like you are proposing to go back to the way
> > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> > the division of labor wouldn't be quite the same.
>
> But is there an argument against it?  The checkpointer would still be
> creating checkpoints including running fsync, but the background
> writer would be, erm, writing, erm, in the background.

I don't know.  I guess the fact that the checkpointer is still
performing the fsyncs is probably a key point.  I mean, in the old
division of labor, fsyncs could interrupt the background writing that
was supposed to be happening.

> I'm not sure if it matters whether we send the fd before or after the
> write, but we still need some kind of global ordering of fds that can
> order a given fd with respect to writes in other processes, so the
> patch introduces a global shared counter captured immediately after
> open() (including when reopened in the vfd machinery).

But how do you make reading that counter atomic with the open() itself?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: In-place updates and serializable transactions

2018-11-14 Thread Kevin Grittner
On Tue, Nov 13, 2018 at 10:45 PM Kuntal Ghosh
 wrote:

> Currently, we're working on the serializable implementations for
> zheap.

Great!

> If transaction T1 reads a row version (thus acquiring a predicate lock
> on it) and a second transaction T2 updates that row version (thus
> creating a rw-conflict graph edge from T1 to T2), must a third
> transaction T3 which re-updates the new version of the row also have a
> rw-conflict in from T1 to prevent anomalies?  In other words,  does it
> matter whether we recognize the edge T1 --rw--> T3?

No.  Keep in mind that extensive study has shown that snapshot
isolation can only be non-serializable if there is a cycle in the
apparent order of execution and that this can only occur if there is a
"dangerous structure" of two adjacent read-write antidependencies
(a/k/a read-write dependencies, a/k/a rw-conflicts) *AND* the
transaction you identify as T3 in that structure *IS THE FIRST
TRANSACTION IN THE CYCLE TO COMMIT*.  Looking at the implied T1/T3
relationship and looking for a T4 to complete the structure is not
necessary, because there are proofs that three *ADJACENT* transactions
are necessary for a serialization anomaly to be seen.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-14 Thread David Rowley
Thanks for picking this up.

On 15 November 2018 at 07:10, Alvaro Herrera  wrote:
> What's with this comment?
>
>  * Initially we must only set up 1 PartitionDispatch object; the one 
> for
>  * the partitioned table that's the target of the command.  If we must
>  * route a tuple via some sub-partitioned table, then its
>  * PartitionDispatch is only built the first time it's required.
>
> You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
> odds with the '1' mentioned in the comment.  Which is wrong?

I don't think either is wrong, but I guess something must be
misleading, so could perhaps be improved.

We're simply allocating enough space for PARTITION_ROUTING_INITSIZE
but we're only initialising 1 item. That leaves space for
PARTITION_ROUTING_INITSIZE - 1 more items before we'd need to
reallocate the array.
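
In code form, the pattern is roughly this (illustrative only; the function
and field names are not necessarily the ones the patch uses):

    /* allocate capacity for PARTITION_ROUTING_INITSIZE dispatch pointers */
    proute->partition_dispatch_info = (PartitionDispatch *)
        palloc0(sizeof(PartitionDispatch) * PARTITION_ROUTING_INITSIZE);
    proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;

    /* ...but only the root table's PartitionDispatch is built up front */
    proute->num_dispatch = 1;
    proute->partition_dispatch_info[0] =
        ExecInitPartitionDispatchInfo(proute, rel);     /* root only */

    /* sub-partitioned tables get theirs built lazily, and the array is
     * only repalloc'd once num_dispatch reaches dispatch_allocsize */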

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: date_trunc() in a specific time zone

2018-11-14 Thread Tom Lane
I wrote:
> Here's a v2 that transposes the code to C so that we can get that
> optimization.

Pushed after a bit more testing and documentation-wordsmithing.

regards, tom lane



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-14 Thread Haribabu Kommi
On Wed, Nov 14, 2018 at 8:45 PM Amit Kapila  wrote:

> On Wed, Nov 14, 2018 at 7:04 AM Haribabu Kommi 
> wrote:
> >
> > On Wed, Nov 14, 2018 at 12:26 AM Amit Kapila 
> wrote:
> >>
> >> On Tue, Nov 13, 2018 at 11:32 AM Haribabu Kommi
> >>  wrote:
> >> >
> >> > On Mon, Nov 12, 2018 at 6:34 PM Amit Kapila 
> wrote:
> >> >> > I can revert it back to void,
> >> >> >
> >> >>
> >> >> +1, as we don't see any good reason to break backward compatibility.
> >> >
> >> >
> >> > Thanks for the review.
> >> > Attached the updated patch with return type as void.
> >> >
> >>
> >> With this patch, we are intending to remove the entries matching
> >> userid, dbid, queryid from hash table (pgss_hash), but the contents of
> >> the file (
> >> pgss_query_texts.stat) will remain unchanged unless all of the entries
> >> are removed from hash table.  This appears fine to me, especially
> >> because there is no easy way to remove the contents from the file.
> >> Does anybody see any problem with this behavior?
> >
> >
> > Adding more info to the above point, even if the file contents are not
> > removed, later if the file size and number of pg_stat_statements entries
> > satisfy the garbage collection, the file will be truncated. So I feel not
> > removing of the contents when the query stats are reset using specific
> > parameters is fine.
> >
>
> I have further reviewed this patch and below are my comments:
>

Thanks for the review.


> 1.
> + if ((!userid || (userid && entry->key.userid == userid)) &&
> + (!dbid || (dbid && entry->key.dbid == dbid)) &&
> + (!queryid || (queryid && entry->key.queryid == queryid)))
>
> I think this check can be simplified as:
> + if ((!userid || entry->key.userid == userid) &&
> + (!dbid || entry->key.dbid == dbid) &&
> + (!queryid || entry->key.queryid == queryid))
>

Yes, the second check is not necessary.


> 2.
> + else
> + {
+ hash_seq_init(&hash_seq, pgss_hash);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
> + {
> + if ((!userid || (userid && entry->key.userid == userid)) &&
> + (!dbid || (dbid && entry->key.dbid == dbid)) &&
> + (!queryid || (queryid && entry->key.queryid == queryid)))
> + {
+ hash_search(pgss_hash, &entry->key, HASH_REMOVE, NULL);
> + num_remove++;
> + }
> + }
> + }
>
> I think this 'if check' is redundant when none of the parameters are
> specified.  We can easily avoid it, see attached.
>

Yes, that removes the unnecessary if check if none of the parameters
are available.

> I have fixed above two observations in the attached patch and edited
> few comments and doc changes.  Kindly review the same.
>

Thanks for the correction, all are fine.


> Apart from the above, I think we should add a test where all the
> parameters are valid as the corresponding code is not covered by any
> existing tests.
>

Added another test where all 3 parameters are valid.
Updated patch attached.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-pg_stat_statements_reset-to-reset-specific-query-use_v11.patch
Description: Binary data


Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-14 Thread Robert Haas
On Fri, Nov 9, 2018 at 9:50 AM Robert Haas  wrote:
> I had another idea, too.  I think we might be able to reuse the
> technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
> That is:
>
> - make a note of SharedInvalidMessageCounter before doing any of the
> relevant catalog lookups
> - do them
> - AcceptInvalidationMessages()
> - if SharedInvalidMessageCounter has changed, discard all the data we
> collected and retry from the top
>
> I believe that is sufficient to guarantee that whatever we construct
> will have a consistent view of the catalogs which is the most recent
> available view as of the time we do the work.  And with this approach
> I believe we can continue to use syscache lookups to get the data
> rather than having to use actual index scans, which is nice.

Here are a couple of patches to illustrate this approach to this part
of the overall problem.  0001 is, I think, a good cleanup that may as
well be applied in isolation; it makes the code in
RelationBuildPartitionDesc both cleaner and more efficient.  0002
adjust things so that - I hope - the partition bounds we get for the
individual partitions has to be as of the same point in the commit
sequence as the list of children.  As I noted before, Alvaro's patch
doesn't seem to have tackled this part of the problem.
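
For reference, the retry pattern described in the quoted text amounts to
something like this (a sketch of the idea only, not the 0002 patch itself;
the lookup shown is just a placeholder for the catalog reads involved):

    for (;;)
    {
        uint64      inval_count = SharedInvalidMessageCounter;
        List       *partoids;

        partoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
        /* ... plus the other syscache lookups needed for the bounds ... */

        AcceptInvalidationMessages();

        if (inval_count == SharedInvalidMessageCounter)
            break;          /* catalog view stayed stable; keep the data */

        /* something changed underneath us; discard and retry from the top */
        list_free(partoids);
    }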

Another part of the problem is finding a way to make sure that if we
execute a query (or plan one), all parts of the executor (or planner)
see the same PartitionDesc for every relation.  In the case of
parallel query, I think it's important to try to get consistency not
only within a single backend but also across backends.  I'm thinking
about perhaps creating an object with a name like
PartitionDescDirectory which can optionally attach to dynamic shared
memory.  It would store an OID -> PartitionDesc mapping in local
memory, and optionally, an additional OID -> serialized-PartitionDesc
in DSA.  When given an OID, it would check the local hash table first,
and then if that doesn't find anything, check the shared hash table if
there is one.  If an entry is found there, deserialize and add to the
local hash table.  We'd then hang such a directory off of the EState
for the executor and the PlannerInfo for the planner.  As compared
with Alvaro's proposal, this approach has the advantage of not
treating parallel query as a second-class citizen, and also of keeping
partitioning considerations out of the snapshot handling, which as I
said before seems to me to be a good idea.
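
A bare-bones sketch of what such a directory might look like (purely
illustrative; none of these names or fields exist today):

    typedef struct PartitionDescDirectory
    {
        HTAB         *local_map;    /* Oid -> PartitionDesc, backend-local */
        dsa_area     *area;         /* optional, for parallel query */
        dshash_table *shared_map;   /* Oid -> serialized PartitionDesc */
    } PartitionDescDirectory;

    /* look up locally first, then in the shared table (deserializing and
     * caching locally on a hit), else build from the catalogs */
    extern PartitionDesc PartitionDirectoryLookup(PartitionDescDirectory *dir,
                                                  Relation rel);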

One thing which was vaguely on my mind in earlier emails but which I
think I can now articulate somewhat more clearly is this: In some
cases, a consistent but outdated view of the catalog state is just as
bad as an inconsistent view of the catalog state.  For example, it's
not OK to decide that a tuple can be placed in a certain partition
based on an outdated list of relation constraints, including the
partitioning constraint - nor is it OK to decide that a partition can
be pruned based on an old view of the partitioning constraint.  I
think this means that whenever we change a partition's partitioning
constraint, we MUST take AccessExclusiveLock on the partition.
Otherwise, a heap_insert() [or a partition pruning decision] can be in
progress on that relation in one backend at the same time that some
other backend is changing the partition constraint, which can't
possibly work.  The best we can reasonably do is to reduce the locking
level on the partitioned table itself.

A corollary is that holding the PartitionDescs that a particular query
sees for a particular relation fixed, whether by the method Alvaro
proposes or by what I am proposing here or by some other method is not
a panacea.  For example, the ExecPartitionCheck call in copy.c
sometimes gets skipped on the theory that if tuple routing has sent us
to partition X, then the tuple being routed must satisfy the partition
constraint for that partition.  But that's not true if we set up tuple
routing using one view of the catalogs, and then things changed
afterwards.  RelationBuildPartitionDesc doesn't lock the children
whose relpartbounds it is fetching (!), so unless we're guaranteed to
have already locked them children earlier for some other reason, we
could grab the partition bound at this point and then it could change
again before we get a lock on them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
Description: Binary data


0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch
Description: Binary data


Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2018-11-14 Thread James Coleman
Also, my apologies for top posting; I forgot to remove the old email
before clicking send.



Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2018-11-14 Thread James Coleman
I've become more confident that this approach is correct after
discussions with others on my team and have added the patch to the
open commitfest.

I'm attaching v2 of the patch here with a regression test (that fails
on current master, but is green both with my patch and with current
master if you remove items from the test array to make the array <=
100 items) as well as a comment detailing the reasoning in predtest.c.


On Sat, Nov 10, 2018 at 4:33 PM James Coleman  wrote:
>
> I've recently been investigating improving our plans for queries like:
> SELECT * FROM t WHERE t.foo IN (1, 2..1000);
> where the table "t" has a partial index on "foo" where "foo IS NOT NULL".
>
> Currently the planner generates an index [only] scan so long as the number of
> items in the IN expression is <= 100, but as soon as you add the 101st item
> it reverts to seq scan. If we add the explicit null check like:
> SELECT * FROM t WHERE t.foo IN (1, 2..1000) AND foo IS NOT NULL;
> then we go back to the desired index scan.
>
> This occurs because predtest.c stops expanding ScalarArrayOpExpr's with
> constant array arguments into OR trees when the array size is > 100. The rest
> of the predicate proving code then becomes unable to infer that foo is not 
> null
> and therefore the planner cannot prove that the partial index is correct to
> use.
>
> (Please pardon technical details in the below background that may be way off;
> I don't have a lot of experience with the Postgres codebase yet, and am still
> trying to build a mental model of things.)
>
> At first I was imagining having the parser keep track of whether an array const
> expr contained any nulls and perhaps adding generated quals (in an equivalence
> class?) to allow the planner to easily prove the index was correct. I'd been
> going down this track because in my mind the issue was because the planner
> needed to verify whether all of the array elements were not null.
>
> But as I started to dig into the predtest.c NOT NULL proofs and add test 
> cases,
> I realized that at least in many normal op cases we can safely infer that foo
> is not null when "foo  " is true even if the array contains null
> elements.
>
> This is such a simple change that it seems like I must be missing a case where
> the above doesn't hold true, but I can't immediately think of any, and indeed
> with the attached patch all existing tests pass (including some additional
> ones I added for predtest to play around with it).
>
> Am I missing something obvious? Is this a valid approach?
>
>
> Other outstanding questions:
>
> Should I add additional tests for predtest? It already seems to cover some 
> null
> test cases with scalar array ops, but I'd be happy to add more if desired.
>
> Should I add a test case for the resulting plan with "foo IN (...)" with an
> array with more than 100 elements?
>
> Thanks,
> James Coleman


saop_is_not_null-v2.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-14 Thread Andrey Borodin
Hi everyone! I didn't notice this thread for too long somehow, sorry.

> On 8 Nov 2018, at 6:46, Peter Geoghegan  wrote:
> I don't think the general "there can't be any inserters at this subtree"
> thing works given that we have to couple buffer locks when moving right
> for other reasons. We call ginStepRight() within ginFinishSplit(), for
> reasons that date back to bug fix commit ac4ab97e from 2013 -- that
> detail is probably important, because it seems to be what breaks the
> subtree design (we don't delete in two phases or anything in GIN).

ginVacuumPostingTreeLeaves() holds LockBufferForCleanup() on subtree root b.
Thus there may be no GinBtreeStack's holding pin on b at the moment.
When you ginStepRight(b) to the parent in ginFinishSplit(), you always get
to the buffer from your stack. Hence you can never have ginFinishSplit()
deadlock with cleanup of a subtree whose root is LockBufferForCleanup()'d.
Is this correct or did I miss something?

But we have a deadlock at hand, I'll think more about it. Something with
the locking protocol is clearly wrong.

> On 11 Nov 2018, at 22:33, chenhj  wrote:
> The order of getting the lwlocks in ginRedoDeletePage() may need to be
> changed from "dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer".
> Is this right?

This looks correct to me.

Best regards, Andrey Borodin.


Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-14 Thread Alvaro Herrera
What's with this comment?

 * Initially we must only set up 1 PartitionDispatch object; the one for
 * the partitioned table that's the target of the command.  If we must
 * route a tuple via some sub-partitioned table, then its
 * PartitionDispatch is only built the first time it's required.

You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
odds with the '1' mentioned in the comment.  Which is wrong?

(I have a few edits on the patch, so please don't send a full v18 -- a
delta patch would be welcome, if you have further changes to propose.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Doc patch on psql output formats

2018-11-14 Thread Daniel Verite
Tom Lane wrote:

> > but "one letter is enough" is not true since 9.3 that added
> > "latex-longtable" sharing the same start as "latex", and then 
> > 9.5 added "asciidoc" with the same first letter as "aligned".
> 
> Yeah, that text has clearly outstayed its welcome.
> 
> > When a non-unique abbreviation is used, psql uses the first
> > match in an arbitrary order defined in do_pset() by
> > a cascade of pg_strncasecmp().
> 
> Ugh.  Should we not fix the code so that it complains if there's
> not a unique match?  I would bet that the code was also written
> on the assumption that any abbreviation must be unique.

Here's a patch making "\pset format" reject ambiguous abbreviations.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index a1ca940..6e6d0f4 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2589,8 +2589,7 @@ lo_import 152801
   latex (uses tabular),
   latex-longtable, troff-ms,
   unaligned, or wrapped.
-  Unique abbreviations are allowed.  (That would mean one letter
-  is enough.)
+  Unique abbreviations are allowed.
   
 
   unaligned format writes all columns of a 
row on one
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 0dea54d..1f23aaf 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3637,28 +3637,52 @@ do_pset(const char *param, const char *value, 
printQueryOpt *popt, bool quiet)
/* set format */
if (strcmp(param, "format") == 0)
{
+   static const struct fmt
+   {
+   const char *name;
+   enum printFormat number;
+   }   formats[] =
+   {
+   {"aligned", PRINT_ALIGNED},
+   {"asciidoc", PRINT_ASCIIDOC},
+   {"html", PRINT_HTML},
+   {"latex", PRINT_LATEX},
+   {"latex-longtable", PRINT_LATEX_LONGTABLE},
+   {"troff-ms", PRINT_TROFF_MS},
+   {"unaligned", PRINT_UNALIGNED},
+   {"wrapped", PRINT_WRAPPED}
+   };
+
if (!value)
;
-   else if (pg_strncasecmp("aligned", value, vallen) == 0)
-   popt->topt.format = PRINT_ALIGNED;
-   else if (pg_strncasecmp("asciidoc", value, vallen) == 0)
-   popt->topt.format = PRINT_ASCIIDOC;
-   else if (pg_strncasecmp("html", value, vallen) == 0)
-   popt->topt.format = PRINT_HTML;
-   else if (pg_strncasecmp("latex", value, vallen) == 0)
-   popt->topt.format = PRINT_LATEX;
-   else if (pg_strncasecmp("latex-longtable", value, vallen) == 0)
-   popt->topt.format = PRINT_LATEX_LONGTABLE;
-   else if (pg_strncasecmp("troff-ms", value, vallen) == 0)
-   popt->topt.format = PRINT_TROFF_MS;
-   else if (pg_strncasecmp("unaligned", value, vallen) == 0)
-   popt->topt.format = PRINT_UNALIGNED;
-   else if (pg_strncasecmp("wrapped", value, vallen) == 0)
-   popt->topt.format = PRINT_WRAPPED;
else
{
-   psql_error("\\pset: allowed formats are aligned, 
asciidoc, html, latex, latex-longtable, troff-ms, unaligned, wrapped\n");
-   return false;
+   int match_count = 0;
+   int first_match = 0;
+
+   /*
+* Count the number of left-anchored matches. Exactly 
one match
+* must be found, otherwise error out.
+*/
+   for (int i = 0; i < lengthof(formats); i++)
+   {
+   if (pg_strncasecmp(formats[i].name, value, 
vallen) == 0)
+   {
+   if (++match_count > 1)
+   {
+   psql_error("\\pset: ambiguous 
abbreviation: \"%s\"\n", value);
+   return false;
+   }
+   first_match = i;
+   }
+   }
+   if (match_count == 0)
+   {
+   psql_error("\\pset: allowed formats are 
aligned, asciidoc, html, latex, latex-longtable, troff-ms, unaligned, 
wrapped\n");
+   return false;
+   }
+  

Re: Add function to release an allocated SQLDA

2018-11-14 Thread Dmitry Dolgov
> On Wed, 13 Jun 2018 at 06:30, Kato, Sho  wrote:
>
> I add a function called ECPGfreeSQLDA() because there is no API for releasing
> the SQLDA storing the result set.
>
> On Mon, 18 Jun 2018 at 07:42, Thomas Munro  
> wrote:
>
> I wonder how other ESQL/C implementations deal with this stuff (...)
> To allow that type of usage, we would need two new functions: (...)

Thank you for the patch. Unfortunately, judging from cfbot.cputube.org, it
can no longer be applied to the current master. Could you please rebase the
code and address the reviewer feedback?



Re: Index Skip Scan

2018-11-14 Thread Dmitry Dolgov
> On Mon, 12 Nov 2018 at 13:55, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Mon, 12 Nov 2018 at 13:29, Sergei Kornilov  wrote:
> >
> > I found a reproducible crash due to an assert failure: FailedAssertion("!(numCols >
> > 0)", File: "pathnode.c", Line: 2795)
> > > create table tablename (i int primary key);
> > > select distinct i from tablename where i = 1;
> > Query is obviously strange, but this is a bug.
>
> Wow, thanks a lot! I can reproduce it too, will fix it.

Yep, we had to check the number of distinct columns too; here is the fixed patch
(with a bit more verbose commit message).


0001-Index-skip-scan-with-fallback-v3.patch
Description: Binary data


Re: [Bug Fix]ECPG: cancellation of significant digits on ECPG

2018-11-14 Thread Tom Lane
I wrote:
> Hmmm ... looking at PGTYPESnumeric_from_asc, it seems like the current
> behavior is different from what was described in that old thread; the only
> case where a digit buffer wouldn't be created is a NaN.  But maybe a crash
> could occur for NaN.  Perhaps we should use "if (num->sign !=
> NUMERIC_NAN)" as a guard?

After sleeping on it, it seems like the right thing to check for is
whether num->buf is NULL, which describes precisely the situation
where we should not try to make a copy of the digit buffer (and the
initial struct memcpy has made a valid copy of the null pointers).
So I fixed it like that.

regards, tom lane



Parametrization and UNION in view

2018-11-14 Thread Ronan Dunklau
Hello,

We've encountered a query which took forever on our database, and after
investigating why I managed to reduce the test case to something simple.

The problem is that the optimizer seems to fail to consider pushing a
predicate down a "unionized" view:

CREATE TABLE t1 AS SELECT i FROM generate_series(1, 10) i;
CREATE INDEX ON t1 (i);
CREATE TABLE t2 AS SELECT i FROM generate_series(1, 10) i;
CREATE INDEX ON t2 (i);
CREATE TABLE t3 AS SELECT i FROM generate_series(1, 2) i;
CREATE INDEX ON t3 (i);

CREATE VIEW v1 AS
  SELECT * FROM t1
  UNION
  SELECT * FROM t2;

explain SELECT * FROM t3 WHERE EXISTS (SELECT i FROM v1 WHERE v1.i = t3.i);
 QUERY PLAN

 Hash Join  (cost=6387.05..11157.05 rows=2000 width=4)
   Hash Cond: (t1.i = t3.i)
   ->  HashAggregate  (cost=6386.00..8386.00 rows=20 width=4)
 Group Key: t1.i
 ->  Append  (cost=0.00..5886.00 rows=20 width=4)
   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=10 width=4)
   ->  Seq Scan on t2  (cost=0.00..1443.00 rows=10 width=4)
   ->  Hash  (cost=1.02..1.02 rows=2 width=4)
 ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4)
(9 rows)

We can see that it chooses to perform a join between the full output of the
view and the really small subset.

The optimizer can be forced to select a proper plan by introducing an
offset clause:

explain SELECT * FROM t3 WHERE EXISTS (SELECT i FROM v1 WHERE v1.i = t3.i
OFFSET 0);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..1966.07 rows=1 width=4)
   Filter: (SubPlan 1)
   SubPlan 1
 ->  HashAggregate  (cost=982.51..992.51 rows=1000 width=4)
   Group Key: t1.i
   ->  Append  (cost=12.17..980.01 rows=1000 width=4)
 ->  Bitmap Heap Scan on t1  (cost=12.17..482.50 rows=500
width=4)
   Recheck Cond: (i = t3.i)
   ->  Bitmap Index Scan on t1_i_idx  (cost=0.00..12.04
rows=500 width=0)
 Index Cond: (i = t3.i)
 ->  Bitmap Heap Scan on t2  (cost=12.17..482.50 rows=500
width=4)
   Recheck Cond: (i = t3.i)
   ->  Bitmap Index Scan on t2_i_idx  (cost=0.00..12.04
rows=500 width=0)
 Index Cond: (i = t3.i)

Notice how the cost is much lower than the one of the previous plan, which
indicates to me that this plan is not even considered. If we raise the cost
of various operations (set enable_hashjoin = off, set enable_material =
off), we end up with a nonsensical nested loop:


explain SELECT * FROM t3 WHERE EXISTS (SELECT i FROM v1 WHERE v1.i = t3.i);
 QUERY PLAN

 Nested Loop  (cost=6386.00..25773.02 rows=2000 width=4)
   Join Filter: (t3.i = t1.i)
   ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4)
   ->  HashAggregate  (cost=6386.00..8386.00 rows=20 width=4)
 Group Key: t1.i
 ->  Append  (cost=0.00..5886.00 rows=20 width=4)
   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=10 width=4)
   ->  Seq Scan on t2  (cost=0.00..1443.00 rows=10 width=4)

Should that be considered a bug ?


Re: In-place updates and serializable transactions

2018-11-14 Thread Joshua Yanovski
This is only a personal anecdote, but from my own experience with
serializability, this sort of blind update isn't often contended in
realistic workloads.  The reason is that (again, IME), most blind writes
are either insertions, or "read-writes in disguise" (the client read an old
value in a different transaction); in the latter case, the data in question
are often logically "owned" by the client, and will therefore rarely be
contended.  I think there are two major exceptions to this: transactions
that perform certain kinds of monotonic updates (for instance, marking a
row complete in a worklist irrespective of whether it was already
completed), and automatic bulk updates.  However, these were exactly the
classes of transactions that we already ran under a lower isolation level
than serializability, since they have tightly constrained shapes and don't
benefit much from the additional guarantees.

So, if this only affects transactions with blind updates, I doubt it will
cause much pain in real workloads (even though it might look bad in
benchmarks which include a mix of blind writes and rmw operations).
Particularly if it only happens if you explicitly opt into zheap storage.

On Wed, Nov 14, 2018 at 5:46 AM Kuntal Ghosh 
wrote:

> In brief, due to in-place updates, in some cases, the false positives
> may increase for serializable transactions. Any thoughts?
>
> [1] src/backend/storage/lmgr/README-SSI
> [2] src/test/isolation/specs/multiple-row-versions.spec
> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>
>


Re: [PATCH] Memory leak in pg_config

2018-11-14 Thread Raúl Marín Rodríguez
Hi,

I understand it, as I said it's not an issue; just annoying when using
sanitizers. Thanks for the information.


--
Raúl Marín Rodríguez
carto.com



Re: Alter index rename concurrently to

2018-11-14 Thread Peter Eisentraut
On 25/10/2018 19:35, Fabrízio de Royes Mello wrote:
>> > OK, I can refine those descriptions/comments.  Do you have any concerns
>> > about the underlying principle of this patch?
>>
>> Patch with updated comments to reflect your input.

> I didn't found any issue, so the patch looks in a very good shape.

Committed, thanks all.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Refactoring the checkpointer's fsync request queue

2018-11-14 Thread Dmitry Dolgov
> On Wed, 14 Nov 2018 at 00:44, Thomas Munro  
> wrote:
>
> Here is a rebased version of the patch, post pread()/pwrite().  I have
> also rewritten the commit message to try to explain the rationale
> concisely, instead of requiring the reader to consult multiple
> discussions that jump between lengthy email threads to understand the
> key points.

Thank you for working on this patch!

> There is one major problem with this patch:  BufferSync(), run in the
> checkpointer, can deadlock against a backend that holds a buffer lock
> and is blocked in SendFsyncRequest().  To break this deadlock, we need
> way out of it on either the sending or receiving side.

Or introduce a third side, but I'm not sure how appropriate it is here.

> 2.  Offload the BufferSync() work to bgwriter, so the checkpointer can
> keep draining the pipe.  Communication between checkpointer and
> bgwriter can be fairly easily multiplexed with the pipe draining work.

I also think it sounds better than other options (although probably it's
partially because these options were formulated while already having some bias
towards one of the solutions).

> > > 2.  Offload the BufferSync() work to bgwriter, so the checkpointer can
> > > keep draining the pipe.  Communication between checkpointer and
> > > bgwriter can be fairly easily multiplexed with the pipe draining work.
> >
> > That sounds a little like you are proposing to go back to the way
> > things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> > belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> > the division of labor wouldn't be quite the same.

I had the same first thought, but then after reading the corresponding mailing
thread I've got an impression that the purpose of this change was rather
technical (to split work between different processes because of performance
reasons) and not exactly relevant to the division of labor - am I wrong here?

While testing this patch with frequent checkpoints I've stumbled upon an
interesting error that happened after I had already finished one test:

TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
2018-11-13 22:06:29.773 CET [7886] LOG:  checkpointer process (PID
7934) was terminated by signal 6: Aborted
2018-11-13 22:06:29.773 CET [7886] LOG:  terminating any other active
server processes
2018-11-13 22:06:29.773 CET [7937] WARNING:  terminating connection
because of crash of another server process
2018-11-13 22:06:29.773 CET [7937] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-11-13 22:06:29.773 CET [7937] HINT:  In a moment you should be
able to reconnect to the database and repeat your command.
2018-11-13 22:06:29.778 CET [7886] LOG:  all server processes
terminated; reinitializing

I assume it shouldn't be like that? I haven't investigated deeply yet, but
backtrace looks like:

>>> bt
#0  0x7f7ee7a3af00 in raise () from /lib64/libc.so.6
#1  0x7f7ee7a3ca57 in abort () from /lib64/libc.so.6
#2  0x560e89d1858e in ExceptionalCondition
(conditionName=conditionName@entry=0x560e89eca333 "!(rc > 0)",
errorType=errorType@entry=0x560e89d6cec8 "FailedAssertion",
fileName=fileName@entry=0x560e89eca2c9 "checkpointer.c",
lineNumber=lineNumber@entry=574) at assert.c:54
#3  0x560e89b5e3ff in CheckpointerMain () at checkpointer.c:574
#4  0x560e8995ef9e in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7ffe05c32f60) at bootstrap.c:460
#5  0x560e89b69c55 in StartChildProcess
(type=type@entry=CheckpointerProcess) at postmaster.c:5369
#6  0x560e89b6af15 in reaper (postgres_signal_arg=)
at postmaster.c:2916
#7  
#8  0x7f7ee7afe00b in select () from /lib64/libc.so.6
#9  0x560e89b6bd20 in ServerLoop () at postmaster.c:1679
#10 0x560e89b6d1bc in PostmasterMain (argc=3, argv=) at postmaster.c:1388
#11 0x560e89acadc6 in main (argc=3, argv=0x560e8ad42c00) at main.c:228



Re: [PATCH] Memory leak in pg_config

2018-11-14 Thread Tomas Vondra

On 11/14/18 3:59 PM, Tom Lane wrote:

Raúl Marín Rodríguez  writes:

I've been trying to run Postgis regress tests under Clang sanitizers and one of
the issues I'm facing is the constant stream of errors during the `configure`
step coming from calls to `pg_config`.


TBH, I do not think we should do anything about this.  It has never been
project policy that short-lived programs should free everything before
exiting, and I don't think we should change that.  initdb, in particular,
would need a huge amount of work to meet such a policy, and it would
really be entirely wasted effort.  Just because you've configured your
tools to enforce an unreasonable policy doesn't make it a reasonable one.



Yeah. Incidentally we had the same discussion about initdb a few days 
ago [1], and the conclusion was pretty much exactly the same.


[1] 
https://www.postgresql.org/message-id/flat/3fe1e38a-fb70-6260-9300-ce67ede21c32%40redhat.com



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [WIP] Special role for subscriptions

2018-11-14 Thread Evgeniy Efimkin
Hello!
I started working on the patch (draft attached). The draft has changes related only to
`CREATE SUBSCRIPTION`.
I also introduce a new status (DEFFERED) for tables in the `FOR TABLE` clause (but
not in the publication).
A new column in pg_subscription (suballtables) will be used in the `REFRESH` clause.

09.11.2018, 15:24, "Evgeniy Efimkin" :
> Hi!
> In order to support create subscription from non-superuser, we need to make 
> it possible to choose tables on the subscriber side:
> 1. add `FOR TABLE` clause in `CREATE SUBSCRIPTION`:
>    ```
> CREATE SUBSCRIPTION subscription_name
> CONNECTION 'conninfo'
> PUBLICATION publication_name [, ...]
> [ FOR TABLE [ ONLY ] table_name [ * ] [, ...]| FOR ALL TABLES ]
> [ WITH ( subscription_parameter [= value] [, ... ] ) ]
>    ```
>    ... where `FOR ALL TABLES` is only allowed for superuser.
>    and table list in `FOR TABLES` clause will be stored in 
> pg_subscription_rel table (maybe another place?)
>
> 2. Each subscription should have "all tables" attribute.
>    For example via a new column in pg_subscription "suballtables".
>
> 3. Add `ALTER SUBSCRIPTION (ADD TABLE | DROP TABLE)`:
>    ```
> ALTER SUBSCRIPTION subscription_name ADD TABLE [ ONLY ] table_name 
> [WITH copy_data];
> ALTER SUBSCRIPTION subscription_name DROP TABLE [ ONLY ] table_name;
>    ```
> 4. On `ALTER SUBSCRIPTION  REFRESH PUBLICATION` should check if 
> table owner equals subscription owner. The check is ommited if subscription 
> owner is superuser.
> 5. If superuser calls `ALTER SUBSCRIPTION REFRESH PUBLICATION` on 
> subscription with table list and non-superuser owner, we should filter tables 
> which owner is not subscription's owner or maybe we need to raise error?
>
> What do you think about it? Any objections?
>
> 07.11.2018, 00:52, "Stephen Frost" :
>>  Greetings,
>>
>>  * Evgeniy Efimkin (efim...@yandex-team.ru) wrote:
>>>   As a first step I suggest we allow CREATE SUBSCRIPTION for table owner 
>>> only.
>>
>>  That's a nice idea but seems like we would want to have a way to filter
>>  what tables a subscription follows then..? Just failing if the
>>  publication publishes tables that we don't have access to or are not the
>>  owner of seems like a poor solution..
>>
>>  Thanks!
>>
>>  Stephen
>
> 
> Ефимкин Евгений

 
Ефимкин Евгений
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index f138e61a8d..5452bd6a55 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -29,6 +29,7 @@
 #include "catalog/pg_subscription.h"
 #include "catalog/pg_subscription_rel.h"
 
+#include "commands/dbcommands.h"
 #include "commands/defrem.h"
 #include "commands/event_trigger.h"
 #include "commands/subscriptioncmds.h"
@@ -321,6 +322,14 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	char		originname[NAMEDATALEN];
 	bool		create_slot;
 	List	   *publications;
+	AclResult	aclresult;
+	bool 		alltables;
+
+	/* must have CREATE privilege on database */
+	aclresult = pg_database_aclcheck(MyDatabaseId, GetUserId(), ACL_CREATE);
+	if (aclresult != ACLCHECK_OK)
+		aclcheck_error(aclresult, OBJECT_DATABASE,
+	   get_database_name(MyDatabaseId));
 
 	/*
 	 * Parse and check options.
@@ -340,11 +349,13 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	 */
 	if (create_slot)
 		PreventInTransactionBlock(isTopLevel, "CREATE SUBSCRIPTION ... WITH (create_slot = true)");
-
-	if (!superuser())
+	alltables = !stmt->tables || !stmt->for_all_tables;
+	/* FOR ALL TABLES requires superuser */
+	if (alltables && !superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-				 (errmsg("must be superuser to create subscriptions"))));
+				 (errmsg("must be superuser to create FOR ALL TABLES subscription"))));
+
 
 	rel = heap_open(SubscriptionRelationId, RowExclusiveLock);
 
@@ -384,6 +395,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 		DirectFunctionCall1(namein, CStringGetDatum(stmt->subname));
 	values[Anum_pg_subscription_subowner - 1] = ObjectIdGetDatum(owner);
 	values[Anum_pg_subscription_subenabled - 1] = BoolGetDatum(enabled);
+	values[Anum_pg_subscription_suballtables - 1] = BoolGetDatum(alltables);
 	values[Anum_pg_subscription_subconninfo - 1] =
 		CStringGetTextDatum(conninfo);
 	if (slotname)
@@ -407,6 +419,27 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 	snprintf(originname, sizeof(originname), "pg_%u", subid);
 	replorigin_create(originname);
 
+
+	if (stmt->tables&&!connect)
+	{
+		ListCell   *lc;
+		char		table_state;
+		foreach(lc, stmt->tables)
+		{
+			RangeVar   *rv = (RangeVar *) lfirst(lc);
+			Oid			relid;
+			relid = RangeVarGetRelid(rv, AccessShareLock, false);
+			/* must be owner */
+			if (!pg_class_ownercheck(relid, GetUserId()))
+	

Re: [PATCH] Memory leak in pg_config

2018-11-14 Thread Tom Lane
Raúl Marín Rodríguez  writes:
> I've been trying to run Postgis regress tests under Clang sanitizers and one 
> of
> the issues I'm facing is the constant stream of errors during the `configure`
> step coming from calls to `pg_config`.

TBH, I do not think we should do anything about this.  It has never been
project policy that short-lived programs should free everything before
exiting, and I don't think we should change that.  initdb, in particular,
would need a huge amount of work to meet such a policy, and it would
really be entirely wasted effort.  Just because you've configured your
tools to enforce an unreasonable policy doesn't make it a reasonable one.

regards, tom lane



Re: Changing SQL Inlining Behaviour (or...?)

2018-11-14 Thread Komяpa
>
>
>> At pgconf.eu, I canvassed this problem and some potential solutions:
>>
>

I wonder if there is a middle ground between #2 and #3. A proper mechanism
for deduplicating entries might be hard, but on the inlining stage we
already know they're going to get duplicated. Can we make a subplan/lateral
join/whatever for arguments and deduplicate just them, as we know about
that duplication from structure already?

Another thing I see here is PostGIS using indexes. Can we just always
inline if we see an index-accelerated operator (or just an operator) on top
level AND inside inlined function?



> * Solution #1 - Quick and dirty and visible: Add an 'INLINE' function
>> decorator, which tells PostgreSQL to just ignore costs and inline the
>> function regardless. Pros: it's not too hard to implement and I'm happy to
>> contribute this. Cons: it adds very specific single-purpose syntax to
>> CREATE FUNCTION.
>>
>> * Solution #2 - Quick and dirty and invisible. Tom suggested a hack that
>> achieves the aims of #1 but without adding syntax to CREATE FUNCTION: have
>> the inlining logic look at the cost of the wrapper and the cost of
>> parameters, and if the cost of the wrapper "greatly exceeded" the cost of
>> the parameters, then inline. So the PostGIS project would just set the cost
>> of our wrappers very high, and we'd get the behaviour we want, while other
>> users who want to use wrappers to force caching of calculations would have
>> zero coded wrapper functions.  Pros: Solves the problem and easy to
>> implement, I'm happy to contribute. Cons: it's so clearly a hack involving
>> hidden (from users) magic.
>>
>> * Solution #3 - Correct and globally helpful. When first presented with
>> this problem last year, both Andres and Tom said [1] "but the right fix is
>> to avoid the double-calculation of identical entries in the target list"
>> because then it would be safe to inline functions with duplicate expensive
>> parameters. This would not only address the proximate PostGIS problem but
>> make a whole class of queries faster. There was some discussion of this
>> approach last week [2]. Pros: The right thing! Improves a whole pile of
>> other performance cases. Cons: Hard! Only experienced PgSQL developers need
>> apply.
>>
>> Naturally, I would love to see #3 implemented, but there's only so much
>> experienced developer time to go around, and it's beyond my current skill
>> set. I would like to be able to start to improve PostGIS parallelism with
>> PgSQL 12, so in order to make that not impossible, I'd like to implement
>> either #1 or #2 in case #3 doesn't happen for PgSQL 12.
>>
>> So my question to hackers is: which is less worse, #1 or #2, to implement
>> and submit to commitfest, in case #3 does not materialize in time for PgSQL
>> 12?
>>
>
> Absent any preferences, I would be inclined to go with #2, having a high
> personal tolerance for ugly hacks... :)
>
> P
>
>
>
>>
>> [1]
>> https://www.postgresql.org/message-id/20171116182208.kcvf75nfaldv36uh%40alap3.anarazel.de
>> [2]
>> https://www.postgresql.org/message-id/10355.1540926295%40sss.pgh.pa.us
>>
>> --
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: proposal: simple query profile and tracing API

2018-11-14 Thread Pavel Stehule
On Wed, 14 Nov 2018 at 14:06, legrand legrand <legrand_legr...@hotmail.com> wrote:

> Pavel Stehule wrote
> > On Tue, 13 Nov 2018 at 20:38, Tomas Vondra <tomas.vondra@> wrote:
> >
> > My idea is very simple.
> >
> > 1. continual collection of data - planning start, execution start, waiting
> > start, waiting end, query end
> >
> > 2. run some callback function after the query is finished. The collected
> > data will be passed there.
> >
> > I think anybody can do something different with these data. Sometimes only
> > a sum can be ok, sometimes you need to increment some sorted counts,
> > sometimes you need to store these data for median or percentile
> > calculation.
> >
> > I think it can be very simple and fast, because you only have to work with
> > just one metrics vector.
>
> the same idea was already proposed for planning time in pg_stat_statements
> by Thomas
>
> https://www.postgresql-archive.org/Planning-counters-in-pg-stat-statements-tt5990933.html#a6002416


Should not be original every time :)

Regards

Pavel


>
> Regards
> PAscal
>
>
>
>
> --
> Sent from:
> http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
>
>


Re: proposal: simple query profile and tracing API

2018-11-14 Thread legrand legrand
Pavel Stehule wrote
> On Tue, 13 Nov 2018 at 20:38, Tomas Vondra <tomas.vondra@> wrote:
> 
> My idea is very simple.
> 
> 1. continual collection of data - planning start, execution start, waiting
> start, waiting end, query end
> 
> 2. run some callback function after the query is finished. The collected data
> will be passed there.
> 
> I think anybody can do something different with these data. Sometimes only a
> sum can be ok, sometimes you need to increment some sorted counts,
> sometimes you need to store these data for median or percentile
> calculation.
> 
> I think it can be very simple and fast, because you only have to work with
> just one metrics vector.

the same idea was already proposed for planning time in pg_stat_statements
by Thomas
https://www.postgresql-archive.org/Planning-counters-in-pg-stat-statements-tt5990933.html#a6002416

Regards
PAscal




--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



[PATCH] Memory leak in pg_config

2018-11-14 Thread Raúl Marín Rodríguez
Hi,

I've been trying to run Postgis regress tests under Clang sanitizers and one of
the issues I'm facing is the constant stream of errors during the `configure`
step coming from calls to `pg_config`.

Example:
```
$ pg_config --cc
clang

=
==14521==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 368 byte(s) in 1 object(s) allocated from:
#0 0x55de20d161d9 in malloc (/usr/bin/pg_config+0xf81d9)
[...]

SUMMARY: AddressSanitizer: 2610 byte(s) leaked in 47 allocation(s).
```

The leaked memory is part of the `configdata` array which isn't freed before
exiting. It doesn't have any long term impact but it's annoying.

A similar thing happens in the `pg_config` SQL function. Since the memory
will be released at the end of the transaction, releasing it is optional but
I've done it anyway.

I'm attaching the patch with the changes.

Greetings,

-- 
Raúl Marín Rodríguez
carto.com
From 0d35d7a21b87554df7ef27b70dcf7e4ea512699f Mon Sep 17 00:00:00 2001
From: Raul Marin 
Date: Wed, 14 Nov 2018 09:08:50 +0100
Subject: [PATCH] pg_config: Avoid leaking configdata

---
 src/backend/utils/misc/pg_config.c |  1 +
 src/bin/pg_config/pg_config.c  |  4 
 src/common/config_info.c   | 15 ++-
 src/include/common/config_info.h   |  2 ++
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/pg_config.c b/src/backend/utils/misc/pg_config.c
index aa434bc3ab..62d1aea287 100644
--- a/src/backend/utils/misc/pg_config.c
+++ b/src/backend/utils/misc/pg_config.c
@@ -78,6 +78,7 @@ pg_config(PG_FUNCTION_ARGS)
 		tuple = BuildTupleFromCStrings(attinmeta, values);
 		tuplestore_puttuple(tupstore, tuple);
 	}
+	free_configdata(configdata, configdata_len);
 
 	/*
 	 * no longer need the tuple descriptor reference created by
diff --git a/src/bin/pg_config/pg_config.c b/src/bin/pg_config/pg_config.c
index a341b756de..c53a802422 100644
--- a/src/bin/pg_config/pg_config.c
+++ b/src/bin/pg_config/pg_config.c
@@ -160,6 +160,7 @@ main(int argc, char **argv)
 	{
 		for (i = 0; i < configdata_len; i++)
 			printf("%s = %s\n", configdata[i].name, configdata[i].setting);
+		free_configdata(configdata, configdata_len);
 		exit(0);
 	}
 
@@ -180,9 +181,12 @@ main(int argc, char **argv)
 			fprintf(stderr, _("%s: invalid argument: %s\n"),
 	progname, argv[i]);
 			advice();
+			free_configdata(configdata, configdata_len);
 			exit(1);
 		}
 	}
 
+	free_configdata(configdata, configdata_len);
+
 	return 0;
 }
diff --git a/src/common/config_info.c b/src/common/config_info.c
index 55e688e656..f8da71c598 100644
--- a/src/common/config_info.c
+++ b/src/common/config_info.c
@@ -27,7 +27,7 @@
  * get_configdata(const char *my_exec_path, size_t *configdata_len)
  *
  * Get configure-time constants. The caller is responsible
- * for pfreeing the result.
+ * for pfreeing the result [free_configdata]
  */
 ConfigData *
 get_configdata(const char *my_exec_path, size_t *configdata_len)
@@ -203,3 +203,16 @@ get_configdata(const char *my_exec_path, size_t *configdata_len)
 
 	return configdata;
 }
+
+
+void
+free_configdata(ConfigData *configdata, size_t configdata_len)
+{
+	int i;
+	for (i = 0; i < configdata_len; i++)
+	{
+		pfree(configdata[i].name);
+		pfree(configdata[i].setting);
+	}
+	pfree(configdata);
+}
diff --git a/src/include/common/config_info.h b/src/include/common/config_info.h
index 72014a915a..26f85e86a9 100644
--- a/src/include/common/config_info.h
+++ b/src/include/common/config_info.h
@@ -18,4 +18,6 @@ typedef struct ConfigData
 extern ConfigData *get_configdata(const char *my_exec_path,
 			   size_t *configdata_len);
 
+extern void free_configdata(ConfigData *configdata, size_t configdata_len);
+
 #endif			/* COMMON_CONFIG_INFO_H */
-- 
2.19.1



Re: csv format for psql

2018-11-14 Thread Fabien COELHO


Bonjour Michaël,


But again COPY is concerned with importing the data that preexists,
even if it's weird, whereas psql output formats are not.


Hm.  I checked the contents of the patch in details which provide output
consistent with COPY, but after looking at the global picture I am
getting cold feet on this patch for a couple of reasons:
- This stuff adds new code paths in the frontend mimicking what the
backend already does for years, both doing the same thing.


 - Backend's COPY cannot write to client space, so the comparison
   is not relevant.

 - "\copy (SQL-query) TO STDOUT CSV;" is kind of a pain, because one has
   to edit around the query, which is not convenient, esp from the
   command line:

   sh> psql --csv -c 'SELECT 1 as one, 2 as two' > out.csv

   vs

   sh> psql -c "\copy (SELECT 1 AS one, 2 as two) TO STDOUT CSV" > out.csv

   or mixing the output file name inside the argument, which is very
   unshell like:

   sh> psql -c "\copy (SELECT 1 AS one, 2 as two) TO 'out.csv' CSV"

If you have a "query.sql" file that you want to output in csv, there is no 
simple way to do that with \copy/COPY, whereas "psql --csv -f query.sql" 
looks pretty straightforward to me. Also, in a makefile, I could write:


   %.csv: %.sql
   psql --csv -f $< > $@

My point is that \copy, COPY and the proposed CSV format do not address 
the same use cases.



- There are already three ways to fetch data in this format with COPY,
\copy and file_fdw, with all three using the same code paths for option
validations (I can see the arguments at the top of the thread for which
COPY SELECT can actually do everything you want with?).
- The experience is confusing, as the psql format uses different options
than the backend to do the same things:
-- tuples_only instead of HEADER.
-- fieldsep_csv instead of DELIMITER
-- null is an equivalent of the one with the same name, which is
actually consistent.
-- encoding is also an equivalent of ENCODING.
-- and all the others missing.
That looks like a lot.


I disagree on this one: the proposed csv format just follows the existing 
psql format pattern used for 8 formats and reuses it for csv.


Note that there are existing command line options for tuples_only (-t), 
encoding is inherited from the shell and does not need to be changed that 
often nowadays, fieldsep_csv is kind-of a pain, but then if someone wants 
"comma-separated-values" NOT separated by commas, probably they can handle 
it.


Basically the proposed patch addresses a simple and convenient use case
which is addressed by neither \copy nor COPY. The fact that more options
are available with these commands does not preclude its usefulness as is.


--
Fabien.

Re: [Todo item] Add entry creation timestamp column to pg_stat_replication

2018-11-14 Thread Surafel Temesgen
Hi.
On Tue, Jul 31, 2018 at 11:57 AM MyungKyu LIM 
wrote:

>
> Feedback and suggestion will be very welcome.
>


Can you explain the purpose of this feature more, because we already have
columns that report replication delay times, like write_lag, flush_lag and
replay_lag, which can be used for a similar purpose?

regards

Surafel


Re: proposal: simple query profile and tracing API

2018-11-14 Thread Pavel Stehule
út 13. 11. 2018 v 20:38 odesílatel Tomas Vondra <
tomas.von...@2ndquadrant.com> napsal:

> On Tue, 2018-11-13 at 13:55 +0100, Pavel Stehule wrote:
> > út 13. 11. 2018 v 13:12 odesílatel legrand legrand <
> > legrand_legr...@hotmail.com> napsal:
> >
> > > Hello Pavel,
> > >
> > > What about using wait events and a trigger on pg_stat_activity ?
> > >
> >
> > pg_stat_activity is not guaranteed to show fresh data. Using
> > pg_stat_activity can be too expensive for fast queries
> >
>
> More importantly, how would you create a trigger on pg_stat_activity,
> considering it's a system view backed by SRF?
>
> > > ...
> > > An other solution: a customized version of pgsentinel (for high
> > > frequency sampling):
> > >
> >
> > I don't believe in the sampling method - I am talking about queries under
> > 10ms; I would like to see a 2-3ms planning time, 2-5ms of waits - and
> > that means sampling at about 2ms, which is expensive
> >
>
> You're quietly assuming that whatever alternative solution you end up
> inventing will be cheaper than this sampling. Which is going to be
> hard, if you want to do that for every execution of even the shortest
> queries. I'd say that's doomed to fail.
>
>
My idea is very simple.

1. continual collection of data - planning start, execution start, waiting
start, waiting end, query end

2. run some callback function after the query is finished. The collected data
will be passed there.

I think anybody can do something different with these data. Sometimes only a
sum can be ok, sometimes you need to increment some sorted counts,
sometimes you need to store these data for median or percentile calculation.

I think it can be very simple and fast, because you only have to work with
just one metrics vector.
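
As a very rough sketch, such a callback interface could look like this (all
names here are hypothetical; nothing like this exists in core today):

    typedef struct QueryTraceMetrics
    {
        TimestampTz plan_start;
        TimestampTz exec_start;
        TimestampTz wait_start;
        TimestampTz wait_end;
        TimestampTz query_end;
    } QueryTraceMetrics;

    typedef void (*query_trace_hook_type) (const char *query_string,
                                           const QueryTraceMetrics *metrics);

    extern PGDLLIMPORT query_trace_hook_type query_trace_hook;

    /* called once when the query finishes, passing the collected vector */
    if (query_trace_hook)
        (*query_trace_hook) (query_string, &metrics);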


> Moreover, the sampling does not need to catch every query execution.
> The idea is to do it "just often enough" for some desired accuracy. For
> example you might pick 10ms interval - it will hit even shorter queries
> if they are executed often enough (and if they're not, who cares about
> them?). And given the sample percentages and total time, you can do
> some estimates for each query / phase.
>

With 10ms sampling there will not be a big error, but 10ms sampling can
use too much CPU. For now I don't see an advantage of a sampling-based method
with more complex processing (because you have to process more rows)
over session-based processing.



>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Re: Undo logs

2018-11-14 Thread Dilip Kumar
On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila  wrote:
> I think you can keep it with XXX instead of Fixme as there is nothing to fix.
Changed
>
> Both the patches 0003-undo-interface-v4.patch and
> 0004-undo-interface-test-v4.patch appears to be same except for the
> name?
My bad, please find the updated patch.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0004-undo-interface-test-v5.patch
Description: Binary data


0003-undo-interface-v5.patch
Description: Binary data


Re: Control your disk usage in PG: Introduction to Disk Quota Extension

2018-11-14 Thread Hubert Zhang
Thanks, Tomas,

Yes, we want to add the hooks into postgres repo.
But before that, we plan to first get some feedback from the community about
the diskquota extension implementation and usage.
Later, we'll modify our license and submit the hooks to the CF.

Thanks
Hubert

On Wed, Nov 14, 2018 at 3:54 AM Tomas Vondra 
wrote:

> On Tue, 2018-11-13 at 16:47 +0800, Hubert Zhang wrote:
> > Hi all,
> >
> > We implement disk quota feature on Postgresql as an extension(link:
> > https://github.com/greenplum-db/diskquota),
> > If you are interested, try and use it to limit the amount of disk
> > space that
> > a schema or a role can use in Postgresql.
> > Any feedback or question are appreciated.
> >
>
> It's not clear to me what's the goal of this thread? I understand what
> quotas are about, but are you sharing it because (a) it's a useful
> extension, (b) you propose adding a couple of new hooks (and keep the
> extension separate) or (c) you propose adding both the hooks and the
> extension (into contrib)?
>
> I assume it's either (b) and (c), in which case you should add it to
> 2019-01 CF: https://commitfest.postgresql.org/21/
>
> In either case, it might be useful to add a LICENSE to the github
> repository, it's not clear what's the situation in this respect. That
> probably matters especially for (b), because for (c) it'd end up with
> PostgreSQL license automatically.
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>

-- 
Thanks

Hubert Zhang


Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-14 Thread Amit Kapila
On Wed, Nov 14, 2018 at 7:04 AM Haribabu Kommi  wrote:
>
> On Wed, Nov 14, 2018 at 12:26 AM Amit Kapila  wrote:
>>
>> On Tue, Nov 13, 2018 at 11:32 AM Haribabu Kommi
>>  wrote:
>> >
>> > On Mon, Nov 12, 2018 at 6:34 PM Amit Kapila  
>> > wrote:
>> >> > I can revert it back to void,
>> >> >
>> >>
>> >> +1, as we don't see any good reason to break backward compatibility.
>> >
>> >
>> > Thanks for the review.
>> > Attached the updated patch with return type as void.
>> >
>>
>> With this patch, we are intending to remove the entries matching
>> userid, dbid, queryid from hash table (pgss_hash), but the contents of
>> the file (
>> pgss_query_texts.stat) will remain unchanged unless all of the entries
>> are removed from hash table.  This appears fine to me, especially
>> because there is no easy way to remove the contents from the file.
>> Does anybody see any problem with this behavior?
>
>
> Adding more info to the above point, even if the file contents are not
> removed, later if the file size and number of pg_stat_statements entries
> satisfy the garbage collection, the file will be truncated. So I feel not
> removing of the contents when the query stats are reset using specific
> parameters is fine.
>

I have further reviewed this patch and below are my comments:

1.
+ if ((!userid || (userid && entry->key.userid == userid)) &&
+ (!dbid || (dbid && entry->key.dbid == dbid)) &&
+ (!queryid || (queryid && entry->key.queryid == queryid)))

I think this check can be simplified as:
+ if ((!userid || entry->key.userid == userid) &&
+ (!dbid || entry->key.dbid == dbid) &&
+ (!queryid || entry->key.queryid == queryid))

2.
+ else
+ {
+ hash_seq_init(&hash_seq, pgss_hash);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if ((!userid || (userid && entry->key.userid == userid)) &&
+ (!dbid || (dbid && entry->key.dbid == dbid)) &&
+ (!queryid || (queryid && entry->key.queryid == queryid)))
+ {
+ hash_search(pgss_hash, &entry->key, HASH_REMOVE, NULL);
+ num_remove++;
+ }
+ }
+ }

I think this 'if check' is redundant when none of the parameters are
specified.  We can easily avoid it, see attached.
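
For clarity, the structure I have in mind is roughly the following (a
sketch only, assuming the hash_seq, entry and num_remove variables already
declared in the function; the attached patch may differ in details):

    if (!userid && !dbid && !queryid)
    {
        /* Unconditional reset: no per-entry filtering needed. */
        hash_seq_init(&hash_seq, pgss_hash);
        while ((entry = hash_seq_search(&hash_seq)) != NULL)
        {
            hash_search(pgss_hash, &entry->key, HASH_REMOVE, NULL);
            num_remove++;
        }
    }
    else
    {
        /* Targeted reset: apply the simplified condition per entry. */
        hash_seq_init(&hash_seq, pgss_hash);
        while ((entry = hash_seq_search(&hash_seq)) != NULL)
        {
            if ((!userid || entry->key.userid == userid) &&
                (!dbid || entry->key.dbid == dbid) &&
                (!queryid || entry->key.queryid == queryid))
            {
                hash_search(pgss_hash, &entry->key, HASH_REMOVE, NULL);
                num_remove++;
            }
        }
    }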

I have fixed the above two observations in the attached patch and edited a
few comments and docs.  Kindly review the same.

Apart from the above, I think we should add a test where all the
parameters are valid, as the corresponding code is not covered by any
existing tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-pg_stat_statements_reset-to-reset-specific-query-use_v10.patch
Description: Binary data


Re: Undo logs

2018-11-14 Thread Amit Kapila
On Wed, Nov 14, 2018 at 2:42 PM Dilip Kumar  wrote:
>
> On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila  wrote:
> >
> > On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar  wrote:
> > >
> > [review for undo record layer (0003-undo-interface-v3)]
> >
> > I might sound repeating myself, but just to be clear, I was involved
> > in the design of this patch as well and I have given a few high-level
> > inputs for this patch.  I have used this interface in the zheap
> > development, but haven't done any sort of detailed review which I am
> > doing now.  I encourage others also to review this patch.
>
> Thanks for the review, please find my reply inline.
> >
> > 1.
> >  * NOTES:
> > + * Handling multilog -
> > + *  It is possible that the undo record of a transaction can be spread 
> > across
> > + *  multiple undo log.  And, we need some special handling while inserting 
> > the
> > + *  undo for discard and rollback to work sanely.
> >
> > I think before describing how the undo record is spread across
> > multiple logs, you can explain how it is laid out when that is not the
> > case.  You can also explain how undo record headers are linked.  I am
> > not sure file header is the best place or it should be mentioned in
> > README, but I think for now we can use file header for this purpose
> > and later we can move it to README if required.
> Added in the header.
>
> >
> > 2.
> > +/*
> > + * FIXME:  Do we want to support undo tuple size which is more than the 
> > BLCKSZ
> > + * if not than undo record can spread across 2 buffers at the max.
> > + */
> >
> > +#define MAX_BUFFER_PER_UNDO2
> >
> > I think here the right question is what is the possibility of undo
> > record to be greater than BLCKSZ?  For zheap, as of today, we don't
> > have any such requirement as the largest undo record is written for
> > update or multi_insert and in both cases we don't exceed the limit of
> > BLCKSZ.  I guess some user other than zheap could probably have such
> > requirement and I don't think it is impossible to enhance this if we
> > have any requirement.
> >
> > If anybody else has an opinion here, please feel to share it.
>
> Should we remove this FIXME or lets wait for some other opinion.  As
> of now I have kept it as it is.
> >

I think you can keep it with XXX instead of FIXME, as there is nothing to fix.

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: DSM segment handle generation in background workers

2018-11-14 Thread Thomas Munro
On Wed, Nov 14, 2018 at 8:52 PM Noah Misch  wrote:
> On Wed, Nov 14, 2018 at 08:22:42PM +1300, Thomas Munro wrote:
> > On Wed, Nov 14, 2018 at 6:34 PM Noah Misch  wrote:
> > > On Wed, Nov 14, 2018 at 05:50:26PM +1300, Thomas Munro wrote:
> > > > On Wed, Nov 14, 2018 at 3:24 PM Noah Misch  wrote:
> > > > > What counts is the ease of predicting a complete seed.  HEAD's 
> > > > > algorithm has
> > > > > ~13 trivially-predictable bits, and the algorithm that stood in 
> > > > > BackendRun()
> > > > > from 98c5065 until 197e4af had no such bits.  You're right that the 
> > > > > other 19
> > > > > bits are harder to predict than any given 19 bits under the old 
> > > > > algorithm, but
> > > > > the complete seed remains more predictable than it was before 197e4af.
> > > >
> > > > However we mix them, given that the source code is well known, isn't
> > > > an attacker's job really to predict the time and pid, two not
> > > > especially well guarded secrets?
> > >
> > > True.  Better to frame the issue as uniform distribution of seed, not
> > > unpredictability of seed selection.
> >
> > What do you think about the attached?
>
> You mentioned that you rewrote the algorithm because the new function had a
> TimestampTz.  But the BackendRun() code, which it replaced, also had a
> TimestampTz.  You can reuse the exact algorithm.  Is there a reason to change?

The old code used a "slightly hacky way to convert timestamptz into
integers" because TimestampTz might have been floating point back in
the day.  Now that TimestampTz is always an integer, we might as well
use it directly and shuffle its bits for the same general effect, no?
The difference between x >> 20 and x / USECS_PER_SEC doesn't seem to
be material.
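
Roughly, the kind of shuffling I mean looks like this (a sketch only: the
exact shifts, and the use of the MyProcPid and MyStartTimestamp globals,
are illustrative rather than a quote of the attached patch):

    uint64      rseed;

    /*
     * Mix the process start time (an int64 count of microseconds) with the
     * pid: spread the fast-moving microsecond bits upward and fold the
     * slower ~second-resolution bits (the >> 20 mentioned above) back in.
     */
    rseed = ((uint64) MyProcPid) ^
        ((uint64) MyStartTimestamp << 12) ^
        ((uint64) MyStartTimestamp >> 20);
    srandom((unsigned int) rseed);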

-- 
Thomas Munro
http://www.enterprisedb.com