Re: New vacuum option to do only freezing

2019-03-31 Thread Masahiko Sawada
On Sat, Mar 30, 2019 at 5:04 AM Robert Haas  wrote:
>
> On Fri, Mar 29, 2019 at 12:27 PM Masahiko Sawada  
> wrote:
> > Yeah, but since multiple relations might be specified in the VACUUM
> > command, we need to process the index_cleanup option after opening each
> > relation. Maybe we need to process all options except for
> > INDEX_CLEANUP in ExecVacuum(), pass VacuumStmt down to vacuum_rel(),
> > and process it there in the manner you suggested after opening the
> > relation. Is that right?
>
> Blech, no, let's not do that.  We'd better use some method that can
> indicate yes/no/default.  Something like psql's trivalue thingy, but
> probably not exactly that.  We can define an enum for this purpose,
> for example - VACUUM_INDEX_CLEANUP_{ENABLED,DISABLED,DEFAULT}.  Or
> maybe there's some other way.  But let's not pass bits of the parse
> tree around any more than really required.

I've defined an enum VacOptTernaryValue representing
enabled/disabled/default and added an index_cleanup variable of that
enum type to VacuumParams. Vacuum options that take their default from
a reloption, such as index cleanup and the new truncation option, can
use this enum and be resolved to either enabled or disabled after the
relation is opened when they are set to default. Please find the
attached patches.
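For reference, here is a minimal sketch of the idea (the names below are
only illustrative; the authoritative definitions are in the attached
patches):

    /*
     * Illustrative sketch -- a three-valued setting lets vacuum_rel()
     * resolve "default" from the relation's reloptions after the
     * relation has been opened.
     */
    typedef enum VacOptTernaryValue
    {
        VACOPT_TERNARY_DEFAULT = 0,
        VACOPT_TERNARY_DISABLED,
        VACOPT_TERNARY_ENABLED
    } VacOptTernaryValue;

    typedef struct VacuumParams
    {
        /* ... existing fields ... */
        VacOptTernaryValue index_cleanup;   /* resolved per relation when DEFAULT */
    } VacuumParams;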

>
> > > I think it would be better to just ignore the INDEX_CLEANUP option
> > > when FULL is specified.
> >
> > Okay, but why do we ignore that in this case while we complain in the
> > case of FULL and DISABLE_PAGE_SKIPPING?
>
> Well, that's a fair point, I guess.  If we go that route, we'll
> need to make sure that setting the reloption doesn't prevent VACUUM
> FULL from working -- the complaint must only affect an explicit option
> on the VACUUM command line.  I think we should have a regression test
> for that.

I've added regression tests. Since we check it before setting
index_cleanup based on reloptions, we don't prevent VACUUM FULL from
working even when vacuum_index_cleanup is false.


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From ca7a9f26283558525130a6c3d072096422aff525 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Thu, 7 Mar 2019 09:45:11 +0900
Subject: [PATCH v12 1/2] Add INDEX_CLEANUP option to VACUUM command

If this option is false, VACUUM does HOT-pruning for live tuples but
doesn't remove dead tuples completely, and index vacuuming is skipped.

vacrelstats->dead_tuples could contain tuples that became dead after
being checked at HOT-pruning time, which are not marked as dead. Per
discussion on pgsql-hackers, we normally record and remove them, but
with this option we don't process them and leave them for the next
vacuum, to simplify the code. That's okay because it's a very rare
condition and those tuples will be processed by the next vacuum.
---
 doc/src/sgml/ref/create_table.sgml | 15 +++
 doc/src/sgml/ref/vacuum.sgml   | 23 ++
 src/backend/access/common/reloptions.c | 13 +-
 src/backend/access/heap/vacuumlazy.c   | 79 ++
 src/backend/commands/vacuum.c  | 37 +++-
 src/backend/postmaster/autovacuum.c|  1 +
 src/bin/psql/tab-complete.c|  6 ++-
 src/include/commands/vacuum.h  | 15 +++
 src/include/utils/rel.h|  1 +
 src/test/regress/expected/vacuum.out   | 11 +
 src/test/regress/sql/vacuum.sql| 11 +
 11 files changed, 191 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 0fcbc66..910db52 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1378,6 +1378,21 @@ WITH ( MODULUS numeric_literal, REM

 

+vacuum_index_cleanup (boolean)
+
+ 
+  Enables or disables index cleanup when VACUUM is
+  run on this table.  The default value is true.
+  Disabling index cleanup can speed up VACUUM very
+  significantly, but may also lead to severely bloated indexes if table
+  modifications are frequent.  The INDEX_CLEANUP
+  parameter to , if specified, overrides
+  the value of this option.
+ 
+
+   
+
+   
 autovacuum_vacuum_threshold, toast.autovacuum_vacuum_threshold (integer)
 
  
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 906d0c2..643e97b 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -32,6 +32,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ]
 DISABLE_PAGE_SKIPPING [ boolean ]
 SKIP_LOCKED [ boolean ]
+INDEX_CLEANUP [ boolean ]
 
 and table_and_columns is:
 
@@ -182,6 +183,28 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ ).  However, if index
+  cleanup is not performed regularly, performance may suffer, because
+  as the table is modified, indexes will accumulate dead line pointers
+  and the table itself will accumulate dead line pointers 

Re: speeding up planning with partitions

2019-03-31 Thread Amit Langote
(I've closed the CF entry: https://commitfest.postgresql.org/22/1778/)

On 2019/04/01 2:04, Tom Lane wrote:
> Amit Langote  writes:
>> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu  
>> wrote:
>>> Certainly, using bitmapset contributes to the performance when scanning
>>> one partition(few partitions) from large partitions.
> 
>> Thanks Imai-san for testing.
> 
> I tried to replicate these numbers with the code as-committed, and
> could not.

Thanks for that.

> What I get, using the same table-creation code as you
> posted and a pgbench script file like
> 
> \set param random(1, :N)
> select * from rt where a = :param;
> 
> is scaling like this:
> 
> N     tps, range      tps, hash
> 
> 2     10520.519932    10415.230400
> 8     10443.361457    10480.987665
> 32    10341.196768    10462.551167
> 128   10370.953849    10383.885128
> 512   10207.578413    10214.049394
> 1024  10042.794340    10121.683993
> 4096  8937.561825     9214.993778
> 8192  8247.614040     8486.728918
> 
> If I use "-M prepared" the numbers go up a bit for lower N, but
> drop at high N:
> 
> N     tps, range      tps, hash
> 
> 2     11449.920527    11462.253871
> 8     11530.513146    11470.812476
> 32    11372.412999    11450.213753
> 128   11289.351596    11322.698856
> 512   11095.428451    11200.683771
> 1024  10757.646108    10805.052480
> 4096  8689.165875     8930.690887
> 8192  7301.609147     7502.806455
> 
> Digging into that, it seems like the degradation with -M prepared is
> mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
> What I think must be happening is that with -M prepared, at some point the
> plancache decides to try a generic plan, which causes opening/locking all
> the partitions, resulting in permanent bloat in the locallock hash table.
> We immediately go back to using custom plans, but hash_seq_search has
> more buckets to look through for the remainder of the process' lifetime.

Ah, we did find this to be a problem upthread [1] and Tsunakawa-san then
even posted a patch which is being discussed at:

https://commitfest.postgresql.org/22/1993/

> I do see some cycles getting spent in apply_scanjoin_target_to_paths
> that look to be due to scanning over the long part_rels array,
> which your proposal would ameliorate.  But (a) that's pretty small
> compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
> is a remarkable display of brute force inefficiency to begin with.
> I think we should see if we can't nuke that function altogether in
> favor of generating the paths with the right target the first time.

That's an option if we can make it work.

Shouldn't we look at *all* of the places that have code that now looks
like this:

  for (i = 0; i < rel->nparts; i++)
  {
  RelOptInfo *partrel = rel->part_rels[i];

  if (partrel == NULL)
  continue;
  ...
  }

Beside apply_scanjoin_target_to_paths(), there are:

create_partitionwise_grouping_paths()
make_partitionedrel_pruneinfo()
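For illustration, if we kept a Bitmapset of the indexes of the surviving
partitions (a hypothetical field, called live_parts below), those loops
could skip the NULL entries entirely:

    /*
     * Hypothetical sketch: rel->live_parts is assumed to be a Bitmapset
     * of indexes into part_rels for partitions that were not pruned.
     */
    int         i = -1;

    while ((i = bms_next_member(rel->live_parts, i)) >= 0)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        /* ... same per-partition work as before, no NULL check needed ... */
    }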


> BTW, the real elephant in the room is the O(N^2) cost of creating
> these tables in the first place.  The runtime for the table-creation
> scripts looks like
> 
> N     range       hash
> 
> 2     0m0.011s    0m0.011s
> 8     0m0.015s    0m0.014s
> 32    0m0.032s    0m0.030s
> 128   0m0.132s    0m0.099s
> 512   0m0.969s    0m0.524s
> 1024  0m3.306s    0m1.442s
> 4096  0m46.058s   0m15.522s
> 8192  3m11.995s   0m58.720s
> 
> This seems to be down to the expense of doing RelationBuildPartitionDesc
> to rebuild the parent's relcache entry for each child CREATE TABLE.
> Not sure we can avoid that, but maybe we should consider adopting a
> cheaper-to-read representation of partition descriptors.  The fact that
> range-style entries seem to be 3X more expensive to load than hash-style
> entries is strange.

I've noticed this many times too, but never prioritized doing something
about it.  I'll try sometime.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-dn1hDZqObwdMrYdV7-cELJwWCPRWet6EQX_WaV8JLgw%40mail.gmail.com





Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-31 Thread Rahila Syed
Hi Alvaro,

Please see few comments below:

1. make check currently fails because the view definition in the expected
rules.out does not reflect the latest changes in progress metrics numbering.

2. +  
+   When creating an index on a partitioned, this column is set to the
+   total number of partitions on which the index is to be created.
+  
+ 
+ 
+  partitions_done
+  bigint
+  
+   When creating an index on a partitioned, this column is set to the

I think there is a typo here 's/partitioned/partitioned table/'

3.
+   if (hscan->rs_base.rs_parallel != NULL)
+   {
+   ParallelBlockTableScanDesc bpscan;
+
+   bpscan = (ParallelBlockTableScanDesc) hscan->rs_base.rs_parallel;
+   startblock = bpscan->phs_startblock;
+   }
+   else
+   startblock = hscan->rs_startblock;
+
+   /*
+* Might have wrapped around the end of the relation, if startblock was
+* not zero.
+*/
+   if (hscan->rs_cblock > startblock)
+   blocks_done = hscan->rs_cblock - startblock;
+   else
+   blocks_done = hscan->rs_nblocks - startblock +
+   hscan->rs_cblock;
+
+   return blocks_done;

I think the parallel-scan equivalent bpscan->phs_nblocks should be used
in place of hscan->rs_nblocks for the parallel case, similar to the
startblock computation above.
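In other words, something like the following (an untested sketch; variable
names follow the quoted patch):

    BlockNumber nblocks;

    if (hscan->rs_base.rs_parallel != NULL)
    {
        ParallelBlockTableScanDesc bpscan;

        bpscan = (ParallelBlockTableScanDesc) hscan->rs_base.rs_parallel;
        startblock = bpscan->phs_startblock;
        nblocks = bpscan->phs_nblocks;
    }
    else
    {
        startblock = hscan->rs_startblock;
        nblocks = hscan->rs_nblocks;
    }

    /* Might have wrapped around the end of the relation */
    if (hscan->rs_cblock > startblock)
        blocks_done = hscan->rs_cblock - startblock;
    else
        blocks_done = nblocks - startblock + hscan->rs_cblock;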

Thank you,
Rahila Syed

On Fri, 29 Mar 2019 at 23:46, Alvaro Herrera 
wrote:

> On 2019-Mar-29, Alvaro Herrera wrote:
>
> > So, CLUSTER and ALTER TABLE rewrites only do non-concurrent index
> > builds; and REINDEX can reuse pretty much the same wait-for metrics
> > columns as CIC.  So I think it's okay if I move only the metrics that
> > conflict for index_build.
>
> The attached version does it that way.  I had to enlarge the param set a
> bit more.  (I suspect those extra columns will be useful to reindex.)
> Also, rebased for recent conflicting changes.
>
>
> I think we should consider a new column of an array type, where we could
> put things like the list of PIDs to be waited for, the list of OIDs of
> index to rebuild, or the list of partitions to build the index on.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Implementing Incremental View Maintenance

2019-03-31 Thread Yugo Nagata
On Thu, 27 Dec 2018 21:57:26 +0900
Yugo Nagata  wrote:

> Hi,
> 
> I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.  

I am now working on an initial patch for implementing IVM on PostgreSQL.
This enables materialized views to be updated incrementally after one
of their base tables is modified.

In the first patch, I want to start with very simple features.

Firstly, this will handle simple view definitions which include only
selection, projection, and join.  Standard aggregates (count, sum, avg,
min, max) are not planned for the first patch, but since they are commonly
used in materialized views, I'll implement them later on.
Views which include sub-queries, outer joins, CTEs, or window functions are
also out of scope for the first patch. Views including self-joins, or views
that reference other views in their definition, are not well considered yet
either.  I need more investigation into these types of views, although I
found some papers explaining how to handle sub-queries and outer joins.

Next, this will handle materialized views with no duplicates in their
tuples. I am thinking of implementing an algorithm to handle duplicates,
called the "counting algorithm", afterward, but I'll start from this
no-duplicates assumption in the first patch for simplicity.

In the first patch, I will implement only "immediate maintenance", that is,
materialized views are updated immediately in the transaction where a base
table is modified.  On the other hand, in "deferred maintenance", materialized
views are updated after the transaction, for example, by a user command
like REFRESH. Although I plan to implement both eventually, I'll start from
"immediate" because it seems to need less code than "deferred". Implementing
"deferred" requires a mechanism to maintain logs that record changes and an
algorithm to compute the delta to be applied to materialized views.
 
I plan to implement the immediate maintenance using AFTER triggers created
automatically on a materialized view's base tables.  In an AFTER trigger
using the transition table feature, the changes that occurred on the base
tables are recorded in ephemeral relations. We can compute the delta to be
applied to materialized views by using these ephemeral relations and the
view definition query, then update the view by applying this delta.

-- 
Yugo Nagata 




GSoC proposal for pgAdmin 4 bytea support

2019-03-31 Thread Haoran Yu
Dear PostgreSQL community,

I have submitted a proposal for the project pgAdmin 4 bytea support. The
project discusses storing media content (images, audio, video) as bytea.
However, I have a quick question: what does bytea data typically look like
when storing media content? What I had in mind is media content that uses a
MIME type and is rendered as part of HTML. For example, the following
is rendered as a red dot:

'data:image/png;base64,iVBORw0KGgoNSUhEUgUA
AAAFCAYAAACNbyblHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwBJRU5ErkJggg==’

This string is decoded to bytea, and I stored it in a bytea column.

What are some other examples of using bytea to store media content, not
necessarily using the MIME type? Is there a way to detect the type of these
media (audio, image) stored in bytea?

Another question I had is that I read there are performance-related issues
with storing media in bytea. Are there practical ways to store bytea data
that do not face performance-related issues? For example, storing large
media content as multiple bytea parts and reassembling them
once retrieved from the database?

Thank you,
Howard

https://docs.google.com/document/d/1ADkdj1Nnhzpy1HTqgs6c6nVPXvPBmwLaysmaNX9eFc0/edit?usp=sharing


Re: Protect syscache from bloating with negative cache entries

2019-03-31 Thread Kyotaro HORIGUCHI
At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190329.172440.199616830.horiguchi.kyot...@lab.ntt.co.jp>
> I ran three artificial test cases. The database is created by
> gen_tbl.pl. Numbers are the average of the fastest five runs in
> successive 15 runs.
> 
> Test cases are listed below.
> 
> 1_0. About 3,000,000 negative entries are created in pg_statstic
>   cache by scanning that many distinct columns. It is 3000 tables
>   * 1001 columns. Pruning scans happen several times while a run
>   but no entries are removed. This emulates the bloating phase of
>   cache. catalog_cache_prune_min_age is default (300s).
>   (access_tbl1.pl)
> 
> 1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
>   which means turning off.
> 
> 2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
>   times. This emulates the stable cache case without having
>   pruning. catalog_cache_prune_min_age is default (300s).
>  (access_tbl2.pl)
> 
> 2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
>   which means turning off.
> 
> 3_0. Scan over the 3,000,000 entries twice with setting prune_age
>   to 10s. A run takes about 18 seconds on my box so fair amount
>   of old entries are removed. This emulates the stable case with
>   continuous pruning. (access_tbl3.pl)
> 
> 3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
>   which means turning off.
> 
> 
> The result follows.
> 
>  | master |  LRU   |  Full  |Full-mod|
> -----|--------+--------+--------+--------+
>  1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
>  1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
>  2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
>  2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
>  3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
>  3_1 | 26.576 | 27.462 | 26.202 | 26.368 |
> 
> The results for 2_0 and 2_1 seem strange, but I'm showing the
> numbers as they are for now.
> 
> - Full-scan seems to have the smallest impact when turned off.
> 
> - Full-scan-mod seems to perform best in total (assuming the
>   Full-mod 2_0 value is wrong..)
> 
> - LRU doesn't seem to outperform full scanning.

I had another, also unstable, result.

 | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 16.312 | 16.540 | 16.482 | 16.348 |
 1_1 | 16.312 | 16.454 | 16.335 | 16.232 |
 2_0 | 16.710 | 16.954 | 17.873 | 17.345 |
 2_1 | 16.710 | 17.373 | 18.499 | 17.563 |
 3_0 | 25.010 | 33.031 | 33.452 | 33.937 |
 3_1 | 25.010 | 24.784 | 24.570 | 25.453 |


Normalizing against master's result and rounding to the nearest 1%, it
looks like this:

 | master |  LRU   |  Full  |Full-mod|  Test description
-----|--------+--------+--------+--------+---------------------------
 1_0 |   100  |   101  |   101  |   100  |   bloating. pruning enabled.
 1_1 |   100  |   101  |   100  |   100  |   bloating. pruning disabled.
 2_0 |   100  |   101  |   107  |   104  |   normal access. pruning enabled.
 2_1 |   100  |   104  |   111  |   105  |   normal access. pruning disabled.
 3_0 |   100  |   132  |   134  |   136  |   pruning continuously running.
 3_1 |   100  |    99  |    98  |   102  |   pruning disabled.

I'm not sure why 2_1 is slower than 2_0, but LRU has the least impact
if the numbers are right.

I will investigate the strange behavior using profiler.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





RE: Timeout parameters

2019-03-31 Thread Nagaura, Ryohei
Hello, Fabien-san.

> From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> I have further remarks after Kirk-san extensive review on these patches.
Yes, your remarks are welcome.
Thank you very much for your review.


> * About TCP interface v18.
> For homogeneity with the surrounding cases, ISTM that "TCP_user_timeout"
> should be ""TCP-user-timeout".
Oops, that would be better. Thanks.
Note that I changed it to "TCP-User-Timeout", not "TCP-user-timeout".
Also, I deleted this
+   /* TCP USER TIMEOUT */


> * About TCP backend v19 patch
> I still disagree with "on other systems, it must be zero.": I do not see why
> "it must be zero", I think that the parameter should simply be ignored if it
> does not apply or is not implemented on a platform?
No, please see below regarding "src/backend/libpq/pqcomm.c".
I made it consistent with the keepalives documentation and implementation,
i.e., if the setting value is non-zero on a system that does not support
these parameters, a "not supported" error is output.
That is why the doc says "on other systems, it must be zero".
Also, the "not supported" error message is needed because if no error is
output when you set that parameter on such a system, the parameter appears
to be available even though it is not.
In my patch, this situation corresponds to line 114.

> If there are consistency constraint with other timeout parameters, probably 
> the
> documentation should mention it?
As I said above.


> * About socket_timeout v12 patch, I'm not sure there is a consensus.
> I still think that there should be an attempt at cancelling before
> severing.
When some accident hits the server, the user may end up waiting forever.
This feature is a way to get around that.
It is not intended to reduce server load.
Since sending a cancellation request would also cause the user to wait [1],
I do not think it follows the intention of this feature.
Do you think we still need to send a cancellation request?

> Robert pointed out that it is not a timeout wrt the query, but this is not
> clearly explained in the documentation nor the comments.
Sorry, I couldn't understand your intention.
Do you mean that a description like "this is not a query timeout" is needed?

> The doc says that
> it is the time for socket read/write operations, but it is somehow the
> time between messages, some of which may not be linked to read/write
> operations. I feel that the documentation is not very precise about what
> it really does.
What socket_timeout really does is time out on poll().
poll() monitors a socket file descriptor,
so the documented "the time for socket read/write operations" is correct.
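As a minimal illustration (this is not the actual libpq code), a
poll()-based wait with a timeout looks roughly like this:

    #include <poll.h>

    /* Wait for the socket to become readable, for at most timeout_ms. */
    static int
    wait_readable(int sock, int timeout_ms)
    {
        struct pollfd pfd;

        pfd.fd = sock;
        pfd.events = POLLIN;
        pfd.revents = 0;

        /* returns >0 if ready, 0 on timeout, <0 on error */
        return poll(&pfd, 1, timeout_ms);
    }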

> ISTM that the implementation could make the cancelling as low as 1 second
> because of rounding. This could be said somewhere, maybe in the doc,
> surely in a comment.
It is stated in the doc, right?
+The minimum allowed timeout is 2 seconds, so a value of 1 is
+interpreted as 2.
Or is "because of rounding" also needed?

> I still think that this parameter should be preservered on psql's
> reconnections when explicitely set to non zero.
Would you please tell me your vision?
Do you want to inherit all parameters with one option,
or one by one?

[1] 
https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBC7FD1%40G01JPEXMBYT05
Best regards,
-
Ryohei Nagaura




socket_timeout_v12.patch
Description: socket_timeout_v12.patch


TCP_backend_v19.patch
Description: TCP_backend_v19.patch


TCP_interface_v19.patch
Description: TCP_interface_v19.patch


Re: DWIM mode for psql

2019-03-31 Thread Amit Langote
Hi Thomas,

Thanks for working on this.

On Mon, Apr 1, 2019 at 5:53 Thomas Munro  wrote:

Hello,
>
> Building on the excellent work begun by commit e529cd4ffa60, I would
> like to propose a do-what-I-mean mode for psql.  Please find a POC
> patch attached.  It works like this:
>
> postgres=# select datnaam from pg_database where ooid = 12917;
> ERROR:  column "datnaam" does not exist
> LINE 1: select datnaam from pg_database where ooid = 12917;
>^
> HINT:  Perhaps you meant to reference the column "pg_database.datname".
> postgres=# YES
>  datname
> --
>  postgres
> (1 row)
>
> As you can see, by "shouting" a new keyword at the computer, it will
> take its own hint and run the corrected query.  To avoid having to do
> this in two steps, you can also shout the whole query for the same
> effect:
>
> postgres=# SELECT DATNAAM FROM PG_DATABASE WHERE OOID = 12917;
>  datname
> --
>  postgres
> (1 row)


Neat.

The next version will be able to fix permissions problems and override
> errors automatically as follows, though that is proving trickier to
> get working.  Example:
>
> postgres=# SUDO DROP TABLE PG_DATABASS;
> NO CARRIER


Have you tried rebooting the machine?

Thanks,
Amit

>


Re: speeding up planning with partitions

2019-03-31 Thread David Rowley
On Sun, 31 Mar 2019 at 05:50, Robert Haas  wrote:
>
> On Sat, Mar 30, 2019 at 12:16 PM Amit Langote  wrote:
> > Fwiw, I had complained when reviewing the run-time pruning patch that
> > creating those maps in the planner and putting them in
> > PartitionPruneInfo might not be a good idea, but David insisted that
> > it'd be good for performance (in the context of using cached plans) to
> > compute this information during planning.
>
> Well, he's not wrong about that, I expect.

I'm aware that there have been a number of objections to either
having these arrays and/or editing them during execution.

I don't recall Amit's complaint, but I do recall Tom's.  He suggested
we not resequence the arrays in the executor and just maintain NULL
elements in the Append/MergeAppend subplans. I did consider this when
writing run-time pruning but found that performance suffers. I
demonstrated this on a thread somewhere.

IIRC, I wrote this code because there was no way to translate the
result of the pruning code into Append/MergeAppend subplan indexes.
Robert has since added a map of Oids to allow the executor to have
those details, so it perhaps would be possible to take the result of
the pruning code then lookup the Oids of the partitions that survived
pruning, then map those to the subplans using the array Robert added.
Using the array for that wouldn't be very efficient due to a lookup
being O(n) per surviving partition. Maybe it could be thrown into a
hashtable to make that faster.  This solution would need to take into
account mixed hierarchy Appends. e.g SELECT * FROM partitioned_table
WHERE partkey = $1 UNION ALL SELECT * FROM something_else; so it would
likely need to be a hashtable per partitioned table.  If the pruning
code returned a partition whose Oid we didn't know about, then it must
be from a partition that was added concurrently since the plan was
built... However, that shouldn't happen today since Robert went to
great lengths for it not to.

Further discussions are likely best put in their own thread. As far as
I know, nothing is broken with the code today.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: idle-in-transaction timeout error does not give a hint

2019-03-31 Thread Tatsuo Ishii
>>From: Tatsuo Ishii [mailto:is...@sraoss.co.jp]
>>
>>> Personally, I don't find this hint particularly necessary.  The
>>> session was terminated because nothing was happening, so the real fix
>>> on the application side is probably more involved than just retrying.
>>> This is different from some of the other cases that were cited, such
>>> as serialization conflicts, where you just got unlucky due to
>>> concurrent events.  In the case of idle-in-transaction-timeout, the
>>> fault is entirely your own.
>>
>>Hum. Sounds like a fair argument. Ideriha-san, what do you think?
>>
> 
> Hi.
> As Peter mentioned, application code generally needs to handle more things 
> than retrying.
> So I'm ok with not adding this hint.

I have changed the patch status to "Withdrawn".

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




RE: idle-in-transaction timeout error does not give a hint

2019-03-31 Thread Ideriha, Takeshi
>From: Tatsuo Ishii [mailto:is...@sraoss.co.jp]
>
>> Personally, I don't find this hint particularly necessary.  The
>> session was terminated because nothing was happening, so the real fix
>> on the application side is probably more involved than just retrying.
>> This is different from some of the other cases that were cited, such
>> as serialization conflicts, where you just got unlucky due to
>> concurrent events.  In the case of idle-in-transaction-timeout, the
>> fault is entirely your own.
>
>Hum. Sounds like a fair argument. Ideriha-san, what do you think?
>

Hi.
As Peter mentioned, application code generally needs to handle more things 
than retrying.
So I'm ok with not adding this hint.

Regards,
Takeshi Ideriha






Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-03-31 Thread Noah Misch
On Fri, Mar 29, 2019 at 09:53:51AM +, Daniel Gustafsson wrote:
> On Saturday, March 9, 2019 8:16 AM, Noah Misch  wrote:
> > I renamed IpcMemoryAnalyze() to PGSharedMemoryAttach() and deleted the old
> > function of that name. Now, this function never calls shmdt(); the caller is
> > responsible for that. I do like things better this way. What do you think?
> 
> I think it makes for a good API for the caller to be responsible, but it does
> warrant a comment on the function to explicitly state that.

The name "PGSharedMemoryAttach" makes that fact sufficiently obvious, I think.

> A few other small comments:
> 
> +   state = PGSharedMemoryAttach((IpcMemoryId) id2, &memAddress);
> +   if (memAddress)
> +   shmdt(memAddress);
> 
> This seems like a case where it would be useful to log a shmdt() error or do
> an Assert() around the success of the operation perhaps?

I'll add the same elog(LOG) we have at other shmdt() sites.  I can't think of
a site where we Assert() about the results of a system call.  While shmdt()
might be a justified exception, elog(LOG) seems reasonable.

> +* Loop till we find a free IPC key.  Trust CreateDataDirLockFile() to
> +* ensure no more than one postmaster per data directory can enter this
> +* loop simultaneously.  (CreateDataDirLockFile() does not ensure that,
> +* but prefer fixing it over coping here.)
> 
> This comment makes it seem like there is a fix to CreateLockFile() missing
> from this patch, is that correct? If so, do you have an idea for that patch?

That comment refers to
https://postgr.es/m/flat/20120803145635.GE9683%40tornado.leadboat.com

> Switching this to Ready for Committer since I can't see anything but tiny 
> things.

Thanks.




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-03-31 Thread Noah Misch
On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> I also liked the design in the https://postgr.es/m/559fa0ba.3080...@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch.  I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that.

Now I am asking for that.  Would anyone like to try implementing that other
design, to see how much simpler it would be?  I now expect the already-drafted
design to need several more iterations before it reaches a finished patch.

Separately, I reviewed v9 of the already-drafted design:

> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
> 
> What is the coding rule for deciding when to call this?  Currently, only
> ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.

This question still applies.  (The function name did change from
RelationRemovePendingSync() to RelationInvalidateWALRequirements().)

On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
> At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch  wrote in 
> <20190321054835.gb3842...@rfd.leadboat.com>
> > On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> > > At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch  wrote 
> > > in <20190311022708.ga2189...@rfd.leadboat.com>
> > > > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > > +   elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u 
> > > > block %u, because sync_above is %u",
> > > 
> > > As you mention upthread, you have many debugging elog()s.  These are too
> > > detailed to include in every binary, but I do want them in the code.  See
> > > CACHE_elog() for a good example of achieving that.
> > 
> > Agreed, will do. They were needed to check the behavior precisely
> > but are usually not needed.
> 
> I removed all such elog()s.

Again, I do want them in the code.  Please restore them, but use a mechanism
like CACHE_elog() so they're built only if one defines a preprocessor symbol.
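For instance, something in the spirit of CACHE_elog() could look like the
sketch below (the symbol, the macro name, and the argument names in the
example call are placeholders of mine, not taken from the patch):

    /* Debug elog()s compile to nothing unless the symbol is defined. */
    #ifdef WAL_SKIP_DEBUG
    #define WAL_SKIP_elog(...)  elog(__VA_ARGS__)
    #else
    #define WAL_SKIP_elog(...)
    #endif

    /* then, at a call site like the one quoted upthread: */
    WAL_SKIP_elog(DEBUG2,
                  "not skipping WAL-logging for rel %u/%u/%u block %u",
                  rnode.spcNode, rnode.dbNode, rnode.relNode, blkno);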

On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
> @@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
>   (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
>errmsg("savepoint \"%s\" does not exist within 
> current savepoint level", name)));
>  
> + smgrProcessWALRequirementInval(s->subTransactionId, true);
> +
>   /*
>* Mark "commit pending" all subtransactions up to the target
>* subtransaction.  The actual commits will happen when control gets to
> @@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
>   (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
>errmsg("savepoint \"%s\" does not exist within 
> current savepoint level", name)));
>  
> + smgrProcessWALRequirementInval(s->subTransactionId, false);

The smgrProcessWALRequirementInval() calls almost certainly belong in
CommitSubTransaction() and AbortSubTransaction(), not in these functions.  By
doing it here, you'd get the wrong behavior in a subtransaction created via a
plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.

> +/*
> + * Process pending invalidation of WAL requirements happened in the
> + * subtransaction
> + */
> +void
> +smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
> +{
> + HASH_SEQ_STATUS status;
> + RelWalRequirement *walreq;
> +
> + if (!walRequirements)
> + return;
> +
> + /* We expect that we don't have walRequirements in almost all cases */
> + hash_seq_init(&status, walRequirements);
> +
> + while ((walreq = hash_seq_search(&status)) != NULL)
> + {
> + /* remove useless entry */
> + if (isCommit ?
> + walreq->invalidate_sxid == sxid :
> + walreq->create_sxid == sxid)
> + hash_search(walRequirements, &walreq->relnode, 
> HASH_REMOVE, NULL);

Do not remove entries during subtransaction commit, because a parent
subtransaction might still abort.  See other CommitSubTransaction() callees
for examples of correct subtransaction handling.  AtEOSubXact_Files() is one
simple example.

> @@ -3567,15 +3602,26 @@ heap_update
>*/
>   if (RelationIsAccessibleInLogicalDecoding(relation))
>   {
> - log_heap_new_cid(relation, &oldtup);
> - log_heap_new_cid(relation, heaptup);
> + if (oldbuf_needs_wal)
> + log_heap_new_cid(relation, &oldtup);
> + if (newbuf_needs_wal)
> + log_heap_new_cid(relation, heaptup);

These if(...) conditions are always true, since they're redundant with

Re: FETCH FIRST clause PERCENT option

2019-03-31 Thread Tom Lane
Andres Freund  writes:
>> offset_clause:
>> @@ -15435,6 +15442,7 @@ reserved_keyword:
>>  | ONLY
>>  | OR
>>  | ORDER
>> +| PERCENT
>>  | PLACING
>>  | PRIMARY
>>  | REFERENCES

> Are we really ok with adding a new reserved keyword 'PERCENT' for this?

I'm not.  It doesn't look like it ought to be necessary to reserve it,
either, given that we don't have to reduce the production right there.

(If that doesn't work right away, try getting rid of row_or_rows
in favor of spelling out those alternatives in separate productions.)

More generally: using an undocumented list as the data structure for
select_limit's result was already a pretty bad idea, I think, and
this patch certainly makes it totally unreadable.  Probably ought
to refactor that to use some sort of struct.
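For instance, something along these lines (purely illustrative; none of
these names come from a committed patch) would be easier to read than the
current undocumented List:

    typedef enum LimitOption
    {
        LIMIT_OPTION_COUNT,         /* FETCH FIRST ... ROWS ONLY */
        LIMIT_OPTION_PERCENT        /* FETCH FIRST ... PERCENT */
    } LimitOption;

    typedef struct SelectLimit
    {
        Node       *limitOffset;    /* OFFSET expression, or NULL */
        Node       *limitCount;     /* LIMIT / FETCH FIRST count, or NULL */
        LimitOption limitOption;    /* which flavor of limit this is */
    } SelectLimit;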

regards, tom lane




Re: DWIM mode for psql

2019-03-31 Thread Andreas Karlsson
On 3/31/19 10:52 PM, Thomas Munro wrote:
> Building on the excellent work begun by commit e529cd4ffa60, I would
> like to propose a do-what-I-mean mode for psql.  Please find a POC
> patch attached.  It works like this:
> 
> postgres=# select datnaam from pg_database where ooid = 12917;
> ERROR:  column "datnaam" does not exist
> LINE 1: select datnaam from pg_database where ooid = 12917;
> ^
> HINT:  Perhaps you meant to reference the column "pg_database.datname".
> postgres=# YES
>   datname
> --
>   postgres
> (1 row)


I think it is potentially confusing that YES and NO do not look like
other psql commands. Let's pick something which is more in line with
existing commands, like \y and \n.


Andreas




Re: FETCH FIRST clause PERCENT option

2019-03-31 Thread Andres Freund
Hi,

On 2019-03-29 12:04:50 +0300, Surafel Temesgen wrote:

> + if (node->limitOption == PERCENTAGE)
> + {
> + while (node->position - node->offset < 
> node->count)
> + {
> + slot = 
> MakeSingleTupleTableSlot(tupleDescriptor, );
> + if 
> (tuplestore_gettupleslot(node->tupleStore, true, true, slot))

There are several blocks like this - how is this not going to be a resource
leak? As far as I can tell you're just going to create new slots, and
their previous contents aren't going to be cleared?   I think there very
rarely are cases where an executor node's *Next function (or its
subsidiaries) creates slots. Normally you ought to create them *once* in
the *Init function.

You might not directly leak memory, because this is probably running in
a short lived context, but you'll leak tuple desc references etc. (Or if
it were a heap buffer slot, buffer pins).
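Concretely, the pattern I mean is roughly the following (a hedged sketch
only; the percentSlot field is just an illustrative name, not something in
your patch):

    /* In ExecInitLimit(), once per plan node: */
    limitstate->percentSlot =
        ExecInitExtraTupleSlot(estate, tupleDescriptor, &TTSOpsMinimalTuple);

    /* In ExecLimit(), per fetched tuple: reuse the slot, no new allocation */
    if (tuplestore_gettupleslot(node->tupleStore, true, true,
                                node->percentSlot))
        node->subSlot = node->percentSlot;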


> + /* In PERCENTAGE case the result is already in 
> tuplestore */
> + if (node->limitOption == PERCENTAGE)
> + {
> + slot = 
> MakeSingleTupleTableSlot(tupleDescriptor, );
> + if (tuplestore_gettupleslot(node->tupleStore, 
> false, false, slot))
> + {
> + node->subSlot = slot;
> + node->lstate = LIMIT_INWINDOW;
> + }
> + else
> + elog(ERROR, "LIMIT subplan failed to 
> run backwards");
> + }
> + else if (node->limitOption == EXACT_NUMBER)
> + {
>   /*
>* Backing up from subplan EOF, so re-fetch previous 
> tuple; there
>* should be one!  Note previous tuple must be in 
> window.
> @@ -194,6 +329,7 @@ ExecLimit(PlanState *pstate)
>   node->subSlot = slot;
>   node->lstate = LIMIT_INWINDOW;
>   /* position does not change 'cause we didn't advance it 
> before */
> + }

The indentation here looks wrong...

>   break;
>  
>   case LIMIT_WINDOWEND:
> @@ -278,17 +414,29 @@ recompute_limits(LimitState *node)
>   /* Interpret NULL count as no count (LIMIT ALL) */
>   if (isNull)
>   {
> - node->count = 0;
> + node->count = 1;
>   node->noCount = true;

Huh?


>   }
>   else
>   {
> - node->count = DatumGetInt64(val);
> - if (node->count < 0)
> - ereport(ERROR,
> - 
> (errcode(ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE),
> -  errmsg("LIMIT must not be 
> negative")));
> - node->noCount = false;
> + if (node->limitOption == PERCENTAGE)
> + {
> + /*
> +  * We expect to return at least one row (unless 
> there
> +  * are no rows in the subplan), and we'll 
> update this
> +  * count later as we go.
> +  */
> + node->count = 0;
> + node->percent = DatumGetFloat8(val);
> + }
> + else
> + {
> + node->count = DatumGetInt64(val);
> + if (node->count < 0)
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE),
> +  errmsg("LIMIT must not 
> be negative")));
> + }

Shouldn't we error out with a negative count, regardless of PERCENT? Or
is that prevented elsewhere?

>   }
>   }
>   else
> @@ -299,8 +447,10 @@ recompute_limits(LimitState *node)
>   }
>  
>   /* Reset position to start-of-scan */
> - node->position = 0;
> + node->position = 0;;

unnecessary


>   if (parse->sortClause)
>   {
> - current_rel = create_ordered_paths(root,
> - 
>current_rel,
> - 
>final_target,
> - 
>final_target_parallel_safe,
> 

Re: DWIM mode for psql

2019-03-31 Thread Corey Huinker
On Sun, Mar 31, 2019 at 5:04 PM Andres Freund  wrote:

> On 2019-04-01 09:52:34 +1300, Thomas Munro wrote:
> > +/*
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + *  the Free Software Foundation, either version 3 of the License, or
> > + *  (at your option) any later version.
>
> Indentation bug. You really need to work a bit more carefully.
>

The patch applies cleanly, and passes "make check", but it generated an
executable called "mongodb".
Should I have run "make maintainer-clean" first?


Re: DWIM mode for psql

2019-03-31 Thread Andres Freund
On 2019-04-01 09:52:34 +1300, Thomas Munro wrote:
> +/*
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation, either version 3 of the License, or
> + *  (at your option) any later version.

Indentation bug. You really need to work a bit more carefully.




DWIM mode for psql

2019-03-31 Thread Thomas Munro
Hello,

Building on the excellent work begun by commit e529cd4ffa60, I would
like to propose a do-what-I-mean mode for psql.  Please find a POC
patch attached.  It works like this:

postgres=# select datnaam from pg_database where ooid = 12917;
ERROR:  column "datnaam" does not exist
LINE 1: select datnaam from pg_database where ooid = 12917;
   ^
HINT:  Perhaps you meant to reference the column "pg_database.datname".
postgres=# YES
 datname
--
 postgres
(1 row)

As you can see, by "shouting" a new keyword at the computer, it will
take its own hint and run the corrected query.  To avoid having to do
this in two steps, you can also shout the whole query for the same
effect:

postgres=# SELECT DATNAAM FROM PG_DATABASE WHERE OOID = 12917;
 datname
--
 postgres
(1 row)

The next version will be able to fix permissions problems and override
errors automatically as follows, though that is proving trickier to
get working.  Example:

postgres=# SUDO DROP TABLE PG_DATABASS;
NO CARRIER

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-do-what-I-mean-mode-to-psql.patch
Description: Binary data


Re: COPY FROM WHEN condition

2019-03-31 Thread Andres Freund
Hi,

On 2019-04-01 02:00:26 +1300, David Rowley wrote:
> On Fri, 29 Mar 2019 at 01:15, Andres Freund  wrote:
> > On 2019-03-28 20:48:47 +1300, David Rowley wrote:
> > > I had a look at this and performance has improved again, thanks.
> > > However, I'm not sure if the patch is exactly what we need, let me
> > > explain.
> >
> > I'm not entirely sure either, I just haven't really seen an alternative
> > that's convincing.
> 
> I wonder if instead of having the array of slots in ResultRelInfo,
> have a struct that's local to copy.c containing the array and the
> number of tuples stored so far.  For partitioned tables, we could
> store this struct in a hashtable by partition Oid. When the partition
> changes check if we've got this partition Oid in the hash table and
> keep adding tuples until the buffer fills.   We could keep a global
> count of the number of tuple stored in all the slot arrays and flush
> all of them when it gets full.
> 
> The trade-off here would be that instead of flushing on each partition
> change, we'd do a hash table lookup on each partition change and
> possibly create a new array of slots.   This would allow us to get rid
> of the code that conditionally switches on/off the batching based on
> how often the partition is changing. The key to it being better would
> hang on the hash lookup + multi-row-inserts being faster than
> single-row-inserts.

It's not clear to me why this separate hashtable is useful /
necessary. We're essentially already doing such a lookup for the
partition routing - what's the point of having a repetition of the same
mapping?  I don't see any benefit of storing the slots separately from
that.


> I'm just not too sure about how to handle getting rid of the slots
> when we flush all the tuples.  Getting rid of them might be a waste,
> but it might also stop the code creating tens of millions of slots in
> the worst case.  Maybe to fix that we could get rid of the slots in
> arrays that didn't get any use at all when we flush the tuples, as
> indicated by a 0 tuple count.  This would require a hash seq scan, but
> maybe we could keep that cheap by flushing early if we get too many
> distinct partitions.

I'd suspect a better approach to this might be to just have a linked
list of partitions with slots in them, that'd avoid unnecessarily going
through all the partitions that got copied into them.  I'm not convinced
this is necessary, but if I were to implement this, I'd leave the slots
in the ResultRelInfo, but additionally keep track of how many slots have
been created. When creating new slots, and some limit has been reached,
I'd just go to the front of that list and delete those slots (probably
delaying doing so to the point of flushing pending insertions).
That seems like it'd not actually be that much changed code over my
current patch, and would avoid deleting slots in the common case.

I'll work on pushing all the other pending tableam patches today -
leaving COPY as the last non-pluggable part. You'd written in a private
email that you might try to work on this on Monday, so I think I'll give
this a shot on Tuesday if you've not gotten around to it by then? I'd like
to push this sooner than the exact end of the freeze...


> That would save the table from getting bloated if
> there happened to be a point in the copy stream where we saw high
> numbers of distinct partitions with just a few tuples each.
> Multi-inserts won't help much in that case anyway.

I thought your benchmarking showed that you saw benefits after even two
tuples?

Greetings,

Andres Freund




Re: Ltree syntax improvement

2019-03-31 Thread Nikolay Shaplov
In a message of Thursday, March 7, 2019, 13:09:49 MSK, Chris Travers
wrote:

>  We maintain an extension (https://github.com/adjust/wltree)
> which has a fixed separator (::) and allows any utf-8 character in the tree.
> 
> In our case we currently use our extended tree to store user-defined
> hierarchies of labels, which might contain whitespace, Chinese, Japanese,
> or Korean characters, etc.
> 
> I would love to be able to deprecate our work on this extension and
> eventually use stock ltree.
I am afraid that it would not be possible to have ltree with a custom
separator, or we would have to go a very long way to get there.

To have custom separator we will need some place to store it.

We could use a GUC variable and set the separator for ltree globally. That
might solve some of the problems, but it would create new ones. If, for
example, we dump a database and then restore it on an instance where that
GUC option is not set, everything will be broken.

It is not an opclass option (those are being discussed here on the list),
because opclass options are not about parsing, but about comparing things
that have already been parsed.

It is not an option of an attribute (even if we could have custom attribute
options), because we can have an ltree value as a variable, without creating
an attribute...

It should be an option of the ltree type itself. I have no idea how we could
implement this, or who would allow us to commit such a strange thing ;-)




Re: speeding up planning with partitions

2019-03-31 Thread Tom Lane
One thing that I intentionally left out of the committed patch was changes
to stop short of scanning the whole simple_rel_array when looking only for
baserels.  I thought that had been done in a rather piecemeal fashion
and it'd be better to address it holistically, which I've now done in the
attached proposed patch.

This probably makes little if any difference in the test cases we've
mostly focused on in this thread, since there wouldn't be very many
otherrels anyway now that we don't create them for pruned partitions.
However, in a case where there's a lot of partitions that we can't prune,
this could be useful.

I have not done any performance testing to see if this is actually
worth the trouble, though.  Anybody want to do that?

regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 727da33..7a9aa12 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -157,7 +157,7 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	 * Construct the all_baserels Relids set.
 	 */
 	root->all_baserels = NULL;
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 
@@ -290,7 +290,7 @@ set_base_rel_sizes(PlannerInfo *root)
 {
 	Index		rti;
 
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 		RangeTblEntry *rte;
@@ -333,7 +333,7 @@ set_base_rel_pathlists(PlannerInfo *root)
 {
 	Index		rti;
 
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 
@@ -1994,7 +1994,7 @@ has_multiple_baserels(PlannerInfo *root)
 	int			num_base_rels = 0;
 	Index		rti;
 
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 61b5b11..723643c 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -828,11 +828,11 @@ generate_base_implied_equalities(PlannerInfo *root)
 	 * This is also a handy place to mark base rels (which should all exist by
 	 * now) with flags showing whether they have pending eclass joins.
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 
-		if (brel == NULL)
+		if (brel == NULL || brel->reloptkind != RELOPT_BASEREL)
 			continue;
 
 		brel->has_eclass_joins = has_relevant_eclass_joinclause(root, brel);
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9798dca..c5459b6 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -145,7 +145,7 @@ add_other_rels_to_query(PlannerInfo *root)
 {
 	int			rti;
 
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
 		RangeTblEntry *rte = root->simple_rte_array[rti];
@@ -312,7 +312,7 @@ find_lateral_references(PlannerInfo *root)
 	/*
 	 * Examine all baserels (the rel array has been set up by now).
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 
@@ -460,7 +460,7 @@ create_lateral_join_info(PlannerInfo *root)
 	/*
 	 * Examine all baserels (the rel array has been set up by now).
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		Relids		lateral_relids;
@@ -580,7 +580,7 @@ create_lateral_join_info(PlannerInfo *root)
 	 * The outer loop considers each baserel, and propagates its lateral
 	 * dependencies to those baserels that have a lateral dependency on it.
 	 */
-	for (rti = 1; rti < root->simple_rel_array_size; rti++)
+	for (rti = 1; rti <= root->last_base_relid; rti++)
 	{
 		RelOptInfo *brel = root->simple_rel_array[rti];
 		Relids		outer_lateral_relids;
@@ -595,7 +595,7 @@ create_lateral_join_info(PlannerInfo *root)
 			continue;
 
 		/* else scan all baserels */
-		for (rti2 = 1; rti2 < root->simple_rel_array_size; rti2++)
+		for (rti2 = 1; rti2 <= root->last_base_relid; rti2++)
 		{
 			RelOptInfo *brel2 = root->simple_rel_array[rti2];
 
@@ -614,7 +614,7 @@ create_lateral_join_info(PlannerInfo *root)
 	 * with the set of relids of rels that reference it laterally (possibly
 	 * indirectly) --- that is, the inverse mapping of lateral_relids.
 	 */
-	for (rti = 1; rti < 

RE: jsonpath

2019-03-31 Thread Tom Turelinckx
Tom Lane wrote:

> I assume this trace is from this run?
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skate=2019-03-31%2006%3A24%3A35

Yes. I did get a core file now, but it wasn't picked up by the buildfarm
script, so I extracted the backtrace manually.

> That looks a whole lot like the previous failure on snapper:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper=2019-03-23%2013%3A01%3A28

That's what I meant.

> Huh ... so that's nowhere near the jsonpath-syntax-error crash that
> we saw before.

Sorry, I wasn't aware there were multiple crashes.

> BTW, what is the difference between skate and snapper?  They look to
> be running on the same machine.

They are. Skate runs with default buildfarm options. Snapper mimics
the options used by the pgdg debian source packages. Both build the
same source (first skate, then snapper). This is to avoid a repeat of [1].
Snapper also runs more tests (UpgradeXversion, CollateLinuxUTF8).

Best regards,
Tom Turelinckx

1. 
https://www.postgresql.org/message-id/20160413175827.dmlbtdf7c3mgmnex%40alap3.anarazel.de






RE: jsonpath

2019-03-31 Thread Tom Turelinckx
Alexander Korotkov wrote:

> Hmm... 550b9d26f just makes jsonpath_gram.y and jsonpath_scan.l
> compile at once.  I've re-read this commit and didn't find anything
> suspicious.
> I've asked Andrew for access to jacana in order to investigate this myself.

Stack trace from skate:

[New LWP 6614]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/sparc-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: pgbf regression [local] SELECT 
 '.
Program terminated with signal 11, Segmentation fault.
#0  strlen () at ../sysdeps/sparc/sparc64/strlen.S:34
34  ldx [%o0], %o5
#0  strlen () at ../sysdeps/sparc/sparc64/strlen.S:34
#1  0x0008a3e4 in printtup (slot=0x834888, self=0x864cc0) at printtup.c:435
#2  0x00259b60 in ExecutePlan (execute_once=, dest=0x864cc0, 
direction=, numberTuples=0, 
sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, 
planstate=0x833eb0, estate=0x833d70)
at execMain.c:1686
#3  standard_ExecutorRun (queryDesc=0x7dcdc0, direction=, 
count=0, execute_once=)
at execMain.c:365
#4  0x00259da0 in ExecutorRun (queryDesc=0x808920, queryDesc@entry=0x7dcdc0, 
direction=direction@entry=ForwardScanDirection, count=8801472, 
execute_once=) at execMain.c:309
#5  0x003e6714 in PortalRunSelect (portal=portal@entry=0x808920, 
forward=forward@entry=true, count=0, 
count@entry=2147483647, dest=dest@entry=0x864cc0) at pquery.c:929
#6  0x003e7d3c in PortalRun (portal=portal@entry=0x808920, 
count=count@entry=2147483647, 
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, 
dest=dest@entry=0x864cc0, 
altdest=altdest@entry=0x864cc0, 
completionTag=completionTag@entry=0xff86e830 "") at pquery.c:770
#7  0x003e32d0 in exec_simple_query (
query_string=0x7ba400 "select '$.g ? (@.a == 1 || @.a == 4 && @.b == 
7)'::jsonpath;") at postgres.c:1215
#8  0x003e4854 in PostgresMain (argc=, argv=argv@entry=0x7e2370, 
dbname=0x7e2148 "regression", 
username=) at postgres.c:4247
#9  0x0007ae6c in BackendRun (port=0x7ddd40) at postmaster.c:4399
#10 BackendStartup (port=0x7ddd40) at postmaster.c:4090
#11 ServerLoop () at postmaster.c:1703
#12 0x00353a68 in PostmasterMain (argc=argc@entry=6, argv=argv@entry=0x7b49a8) 
at postmaster.c:1376
#13 0x0007cc60 in main (argc=6, argv=0x7b49a8) at main.c:228

Does this help?

Best regards,
Tom Turelinckx






Re: jsonpath

2019-03-31 Thread Andrew Dunstan


On 3/31/19 12:21 PM, Tom Turelinckx wrote:
> Tom Lane wrote:
>
>> I assume this trace is from this run?
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skate=2019-03-31%2006%3A24%3A35
> Yes. I did get a core file now, but it wasn't picked up by the buildfarm
> script, so I extracted the backtrace manually.
>
>

I have just committed a patch that should result in stack traces being
picked up in pg_upgrade testing. See



You should be able to drop the updated file in place, get it from



There will be a buildfarm release out very soon that includes this.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Google Summer of Code: question about GiST API advancement project

2019-03-31 Thread Andrey Borodin
Hi!

> On March 31, 2019, at 14:58, GUO Rui wrote:
> 
> I'm Rui Guo, a PhD student focusing on database at the University of 
> California, Irvine. I'm interested in the "GiST API advancement" project for 
> the Google Summer of Code 2019 which is listed at 
> https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 .
> 
> I'm still reading about RR*-tree, GiST and the PostgreSQL source code to have 
> a better idea on my proposal. Meanwhile, I have a very basic and simple 
> question:
> 
> Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are 
> heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the 
> least), is it possible to apply machine learning algorithm to improve it? The 
> only related reference I got is to use deep learning in database join 
> operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use 
> machine learning here or someone already did?

If you are interested in ML and DBs you should definitely look into [0]. You do
not have to base your proposal on mentor ideas; you can use your own.
Implementing learned indexes seems reasonable.

RR*-tree algorithms are heuristic in some specific parts, but in general they
are designed to optimize very clear metrics. Generally, ML algorithms tend to
compose a much bigger pile of heuristics and solve less mathematically clear
tasks than splitting subtrees or choosing a subtree for insertion.
R*-tree algorithms are heuristic only to be faster.

Best regards, Andrey Borodin.

[0] https://arxiv.org/pdf/1712.01208.pdf



Re: speeding up planning with partitions

2019-03-31 Thread Tom Lane
Amit Langote  writes:
> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu  
> wrote:
>> Certainly, using bitmapset contributes to the performance when scanning
>> one partition(few partitions) from large partitions.

> Thanks Imai-san for testing.

I tried to replicate these numbers with the code as-committed, and
could not.  What I get, using the same table-creation code as you
posted and a pgbench script file like

\set param random(1, :N)
select * from rt where a = :param;

is scaling like this:

N       tps, range      tps, hash

2       10520.519932    10415.230400
8       10443.361457    10480.987665
32      10341.196768    10462.551167
128     10370.953849    10383.885128
512     10207.578413    10214.049394
1024    10042.794340    10121.683993
4096    8937.561825     9214.993778
8192    8247.614040     8486.728918

If I use "-M prepared" the numbers go up a bit for lower N, but
drop at high N:

N       tps, range      tps, hash

2       11449.920527    11462.253871
8       11530.513146    11470.812476
32      11372.412999    11450.213753
128     11289.351596    11322.698856
512     11095.428451    11200.683771
1024    10757.646108    10805.052480
4096    8689.165875     8930.690887
8192    7301.609147     7502.806455

Digging into that, it seems like the degradation with -M prepared is
mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
What I think must be happening is that with -M prepared, at some point the
plancache decides to try a generic plan, which causes opening/locking all
the partitions, resulting in permanent bloat in the locallock hash table.
We immediately go back to using custom plans, but hash_seq_search has
more buckets to look through for the remainder of the process' lifetime.

I do see some cycles getting spent in apply_scanjoin_target_to_paths
that look to be due to scanning over the long part_rels array,
which your proposal would ameliorate.  But (a) that's pretty small
compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
is a remarkable display of brute force inefficiency to begin with.
I think we should see if we can't nuke that function altogether in
favor of generating the paths with the right target the first time.

BTW, the real elephant in the room is the O(N^2) cost of creating
these tables in the first place.  The runtime for the table-creation
scripts looks like

N       range       hash

2       0m0.011s    0m0.011s
8       0m0.015s    0m0.014s
32      0m0.032s    0m0.030s
128     0m0.132s    0m0.099s
512     0m0.969s    0m0.524s
1024    0m3.306s    0m1.442s
4096    0m46.058s   0m15.522s
8192    3m11.995s   0m58.720s

This seems to be down to the expense of doing RelationBuildPartitionDesc
to rebuild the parent's relcache entry for each child CREATE TABLE.
Not sure we can avoid that, but maybe we should consider adopting a
cheaper-to-read representation of partition descriptors.  The fact that
range-style entries seem to be 3X more expensive to load than hash-style
entries is strange.

regards, tom lane




Re: [HACKERS][Proposal] LZ4 Compressed Storage Manager

2019-03-31 Thread Nikolay Petrov
31.03.2019, 17:26, "Nikolay Petrov" :
> Hello everyone!
> Thank you for your interest to this topic.
>
> I would like to propose Compressed Storage Manager for PostgreSQL.

The previous thread is here:
https://www.postgresql.org/message-id/flat/op.ux8if71gcigqcu%40soyouz
and the result of the previous investigation is here:
https://www.postgresql.org/message-id/op.uyhszpgkcke6l8%40soyouz

Regards, 
Nikolay P.




Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-03-31 Thread Nikolay Shaplov
In a message dated Wednesday, 20 March 2019, 6:15:38 MSK, Iwata, Aya wrote:

> You told us "big picture" about opclass around the beginning of this thread.
> In my understanding, the purpose of this refactoring is to make reloptions
> more flexible to add opclass. I understand this change may be needed for
> opclass, however I think that this is not the best way to refactor
> reloption.
> 
> How about discussing more about this specification including opclass, and
> finding the best way to refactor reloption? I have not read the previous
> thread tightly, so you may have already started preparing.

The idea is the following: right now there are options that are built into
(and actually nailed to) the postgres core, and there are options that can be
created by Access Methods in extensions. They share some code, but not all of
it, and that is bad. There should be one set of option-related code for both
in-core relations and indexes and for indexes from extensions. If this code
is well written, it can be used for opclass options as well.

So we need to unnail the options code from reloptions.c, so that no options
are nailed to it.
That means moving option definitions (at least for indexes) into the access
method code. But we can't do just that, because some indexes use the
StdRdOptions structure for their options; it is big, and indexes use only
fillfactor from it...
What should we do? Create an individual options structure for each index.
So we did that.
And now StdRdOptions is used only for Heap and Toast, and Toast does not use
all of the variables in the structure either.
Why does every other relation kind have its own options structure, while Heap
and Toast share one, and not in a very efficient way?
To successfully apply a patch I plan to commit, I need a single options
structure for each relation kind. Otherwise I will have to write some hack
code around it.
I do not want to do that. So it is better to get rid of StdRdOptions
completely.

That is the only purpose of this patch: get rid of StdRdOptions and give each
relation kind its own option set.

StdRdOptions is ancient stuff. I guess it was added when fillfactor was the
only option. Now life is more complex: each relation kind has its own set of
options. Let's not mix them together!
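
To give an impression of the resulting shape, here is a rough sketch of what
the per-relation-kind structs could look like (the field lists are
illustrative only and do not reproduce the patch exactly; ToastRelOptions is
my name for the toast-side struct):

typedef struct HeapRelOptions
{
	int32		vl_len_;			/* varlena header (do not touch directly!) */
	int			fillfactor;
	int			toast_tuple_target;
	int			parallel_workers;
	AutoVacOpts autovacuum;			/* autovacuum-related options */
} HeapRelOptions;

typedef struct ToastRelOptions
{
	int32		vl_len_;			/* varlena header (do not touch directly!) */
	AutoVacOpts autovacuum;			/* toast relations only need these */
} ToastRelOptions;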

PS. If you want to get an impression of the refactoring I am planning in the
end, have a look at the patch https://commitfest.postgresql.org/15/992/
It is old, but it gives you the idea.





Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-03-31 Thread Nikolay Shaplov
In a message dated Monday, 18 March 2019, 17:00:24 MSK, Kyotaro HORIGUCHI wrote:

> > So I change status to "Waiting for Author".
> That seems to be a good oppotunity. I have some comments.
> 
> rel.h:
> -#define RelationGetToastTupleTarget(relation, defaulttarg) \
> -	((relation)->rd_options ? \
> -	 ((StdRdOptions *) (relation)->rd_options)->toast_tuple_target : \
> -	 (defaulttarg))
> +#define RelationGetToastTupleTarget(relation, defaulttarg) \
> +	(AssertMacro(relation->rd_rel->relkind == RELKIND_RELATION || \
> +				 relation->rd_rel->relkind == RELKIND_MATVIEW), \
> +	 ((relation)->rd_options ? \
> +	  ((HeapRelOptions *) (relation)->rd_options)->toast_tuple_target : \
> +	  (defaulttarg)))
> 
> Index AMs parse reloptions by their handler
> functions. HeapRelOptions in this patch are parsed in
> relation_reloptions calling heap_reloptions. Maybe at least it
> should be called as table_options. But I'm not sure what is the
> desirable shape of relation_reloptions for the moment.

If we create some TableOptions, or table_options, we will make a bigger mess
than we have now. A table is a relation, and an index is also a relation.
So it is better to follow the rule: we have the heap relation, the toast
relation, and all the varieties of index relations. Each kind of relation has
its own option set, and each kind of relation has its own macros to access
its options. This makes the situation more structured.

> -	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
> -												   HEAP_DEFAULT_FILLFACTOR);
> +	if (IsToastRelation(relation))
> +		saveFreeSpace = ToastGetTargetPageFreeSpace();
> +	else
> +		saveFreeSpace = HeapGetTargetPageFreeSpace(relation);
> 
> This locution appears four times in the patch and that's the all
> where the two macros appear. And it might not be good that
> RelationGetBufferForTuple identifies a heap directly since I
> suppose that currently tost is a part of heap. Thus it'd be
> better that HeapGetTargetPageFreeSpace handles the toast case.
> Similary, it doesn't looks good that RelationGetBufferForTuple
> consults HeapGetTargretPageFreeSpace() directly. Perhaps it
> should be called via TableGetTargetPageFreeSpace(). I'm not sure
> what is the right shape of the relationship among a relation and
> a table and other kinds of relation. extract_autovac_opts
> penetrates through the modularity boundary of toast/heap in a
> similar way.
So I started from the idea I stated above: each relation kind has its own
macros to access its options.
If some options are often used together, it might be good to create helper
macros that combine them. But Alvaro is against adding any new macros (I
understand why) and wants to keep the old ones as intact as possible. So here
I can suggest only one thing: keep to the rule "each relation kind has its
own option access macros" and let the developers of the heap core make their
lives more comfortable with helper functions in whatever way they need.

So I would leave this decision to Alvaro... He knows better...

> plancat.c:
> +	if (relation->rd_rel->relkind == RELKIND_RELATION ||
> +		relation->rd_rel->relkind == RELKIND_MATVIEW)
> +		rel->rel_parallel_workers = RelationGetParallelWorkers(relation, -1);
> +	else
> +		rel->rel_parallel_workers = -1;
> 
> rel.h:
> #define RelationGetParallelWorkers(relation, defaultpw) \
> (AssertMacro(relation->rd_rel->relkind == RELKIND_RELATION ||\
>  relation->rd_rel->relkind == RELKIND_MATVIEW),\
> ((relation)->rd_options ? \
> ((HeapRelOptions *) (relation)->rd_options)->parallel_workers : \
> (defaultpw)))
> 
> These function/macros are doing the same check twice at a call.
No. The first check is the actual check that runs in production; it decides
whether we need to fetch the option value or just use the default.

The second check is an AssertMacro check. If you look at the AssertMacro code
you will see that in production builds it does not perform any real check; it
is just an empty statement.
AssertMacro is there for developer builds, where it will crash if the check
does not pass. It is needed to make sure the macro is always used in the
proper context. You have to build postgres with --enable-cassert to get it to
work. So there is no double check here...
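
For readers who have not looked at it, here is a simplified sketch of that
mechanism (an illustration only, not the literal definitions from c.h; the
real assert-enabled branch reports through ExceptionalCondition()):

#ifdef USE_ASSERT_CHECKING
/* developer builds: evaluate the condition and abort on failure */
#define AssertMacro(condition) \
	((void) ((condition) || (abort(), 0)))		/* abort() needs <stdlib.h> */
#else
/* production builds: the whole check compiles away to a no-op */
#define AssertMacro(condition)	((void) true)
#endif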







Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-03-31 Thread Nikolay Shaplov
In a message dated Monday, 18 March 2019, 3:03:04 MSK, Iwata, Aya wrote:
> Hi Nikolay,
Hi!
Sorry for the long delay. Postgres is not my primary job, so sometimes it
takes a while to get to it.

> This patch does not apply.
Oh... Sorry... here is a new version.

> Please refer to http://commitfest.cputube.org/
> and update it.
Oh! A nice service... I wish it could stream status changes as RSS feeds...
But I can write a script on top of it. That would be easier...

> How about separating your patch by feature or units that
> can be phased commit. For example, adding assert macro at first,
> refactoring StdRdOptions by the next, etc.
No, I think we should not. All the macro changes are driven by removing
StdRdOptions. (AssertMacro is also added there, but I do not think it is a
big enough addition to require a separate patch.)
If it is really necessary I can file the AssertMacro addition as a separate
patch, but I would rather avoid doing so if possible...
 diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index b58a1f7..db5397f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -22,7 +22,7 @@
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "access/reloptions.h"
-#include "access/spgist.h"
+#include "access/spgist_private.h"
 #include "access/tuptoaster.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
@@ -46,9 +46,8 @@
  * upper and lower bounds (if applicable); for strings, consider a validation
  * routine.
  * (ii) add a record below (or use add__reloption).
- * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
- * (iv) add it to the appropriate handling routine (perhaps
- * default_reloptions)
+ * (iii) add it to the appropriate options struct
+ * (iv) add it to the appropriate handling routine
  * (v) make sure the lock level is set correctly for that operation
  * (vi) don't forget to document the option
  *
@@ -1010,7 +1009,7 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = relation_reloptions(classForm->relkind, datum, false);
 			break;
 		case RELKIND_VIEW:
 			options = view_reloptions(datum, false);
@@ -1343,63 +1342,69 @@ fillRelOptions(void *rdopts, Size basesize,
 
 
 /*
- * Option parser for anything that uses StdRdOptions.
+ * Option parsing definition for autovacuum. Used in toast and heap options.
+ */
+
+#define AUTOVACUUM_RELOPTIONS(OFFSET)\
+		{"autovacuum_enabled", RELOPT_TYPE_BOOL, \
+		OFFSET + offsetof(AutoVacOpts, enabled)},\
+		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, vacuum_threshold)},   \
+		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,\
+		OFFSET + offsetof(AutoVacOpts, analyze_threshold)},  \
+		{"autovacuum_vacuum_cost_delay", RELOPT_TYPE_REAL,   \
+		OFFSET + offsetof(AutoVacOpts, vacuum_cost_delay)},  \
+		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,\
+		OFFSET + offsetof(AutoVacOpts, vacuum_cost_limit)},  \
+		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,   \
+		OFFSET + offsetof(AutoVacOpts, freeze_min_age)}, \
+		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,   \
+		OFFSET + offsetof(AutoVacOpts, freeze_max_age)}, \
+		{"autovacuum_freeze_table_age", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, freeze_table_age)},   \
+		{"autovacuum_multixact_freeze_min_age", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, multixact_freeze_min_age)},   \
+		{"autovacuum_multixact_freeze_max_age", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, multixact_freeze_max_age)},   \
+		{"autovacuum_multixact_freeze_table_age", RELOPT_TYPE_INT,   \
+		OFFSET + offsetof(AutoVacOpts, multixact_freeze_table_age)}, \
+		{"log_autovacuum_min_duration", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, log_min_duration)},   \
+		{"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL, \
+		OFFSET + offsetof(AutoVacOpts, vacuum_scale_factor)},\
+		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,\
+		OFFSET + offsetof(AutoVacOpts, analyze_scale_factor)}
+
+/*
+ * Option parser for heap
  */
 bytea *
-default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
+heap_reloptions(Datum reloptions, bool validate)
 {
 	relopt_value *options;
-	StdRdOptions *rdopts;
+	HeapRelOptions *rdopts;
 	int			numoptions;
 	static const relopt_parse_elt tab[] = {
-		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
-		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
-		

Re: jsonpath

2019-03-31 Thread Tom Lane
"Tom Turelinckx"  writes:
> Stack trace from skate:

Huh ... so that's nowhere near the jsonpath-syntax-error crash that
we saw before.

I assume this trace is from this run?

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skate=2019-03-31%2006%3A24%3A35

That looks a whole lot like the previous failure on snapper:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper=2019-03-23%2013%3A01%3A28

right down to it having gotten through "make check" only to fail when the
same regression tests are run again during the pg_upgrade test.  I wonder
if there's any real significance to that, or if it's just that the failure
is not very repeatable.

BTW, what is the difference between skate and snapper?  They look to
be running on the same machine.

regards, tom lane




Re: Problem during Windows service start

2019-03-31 Thread Ramanarayana
Hi,

If wait_for_postmaster returns POSTMASTER_STILL_STARTING, would it be correct
to set the status of the Windows service to SERVICE_START_PENDING?
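
For illustration, a minimal sketch of what reporting that state through the
Win32 service API could look like (the helper name and the surrounding pg_ctl
wiring are assumptions; SetServiceStatus() and SERVICE_START_PENDING are the
standard Windows API):

#include <windows.h>

static SERVICE_STATUS_HANDLE hStatus;	/* assumed to be registered elsewhere */

/* Hypothetical helper: tell the service control manager we are still starting. */
static void
report_start_pending(DWORD checkpoint, DWORD wait_hint_ms)
{
	SERVICE_STATUS status = {0};

	status.dwServiceType = SERVICE_WIN32_OWN_PROCESS;
	status.dwCurrentState = SERVICE_START_PENDING;
	status.dwControlsAccepted = 0;		/* accept no controls while starting */
	status.dwWin32ExitCode = NO_ERROR;
	status.dwCheckPoint = checkpoint;	/* increment while startup progresses */
	status.dwWaitHint = wait_hint_ms;	/* expected time until the next update */
	SetServiceStatus(hStatus, &status);
}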

I would like to take this up if no one is working on this.

Regards,
Ram.


[HACKERS][Proposal] LZ4 Compressed Storage Manager

2019-03-31 Thread Николай Петров
Hello everyone!
Thank you for your interest in this topic.

I would like to propose Compressed Storage Manager for PostgreSQL.

The problem:
When you store log-like data or time-series data in your tables, you may
face high disk space consumption simply because of the amount of data. It is
a good idea to compress such tables, especially if the data is compressible
and the usage scenario is OLAP / WORM (write once, read many).

Current ways to solve this problem:
Today this can be solved with a compressing file system such as BTRFS
or ZFS. This approach has a mixed impact on performance and comes with
administration difficulties.

Other databases' approaches:
Postgres Pro Enterprise has an embedded CFS [1][2] for this purpose.
MySQL InnoDB has two compression options: table-level compression
(zlib only) [3] and transparent page compression (zlib, LZ4) [4]
via hole punching [5].

My proposal:
Implement an LZ4 Compressed Storage Manager. It should compress pages when
writing them to block files and decompress them on reading. I suggest LZ4
first because it has low CPU consumption and it is available under a
BSD 2-clause license.

Compressed Storage Manager operation description (TLDR: the algorithm could
be similar to MySQL table-level compression):
- It should store compressed pages in a block file, but because compressed
  pages vary in size, it also needs an additional file with the offset of
  each page.
- When it reads a page, it translates the file/offset requested by the upper
  PostgreSQL layers into the actual page offset, reads the compressed page
  bytes, decompresses them, and fills the requested buffer with the
  decompressed page.
- Writing new pages is quite simple: compress the page, write it to the
  block file, and write the page offset into the pointers file.
- When a changed page has to be written back, check whether the compressed
  page is smaller than or equal to the previous version. If it is bigger,
  write the page to the end of the block file and update the page pointer.
  The old page version becomes dead.
- A free-space release mechanism is possible; for instance, MySQL uses hole
  punching (with a questionable impact on performance [6]). Initially, dead
  pages could be freed via VACUUM FULL.

pointers file                    block file
  +------+------+------+     +--------+---------+--------+---------+--------+---------+
  |  p1  |  p2  |  p3  | ... | p1 len | p1 data | p2 len | p2 data | p3 len | p3 data |
  +------+------+------+     +--------+---------+--------+---------+--------+---------+
  (each pointer p<n> stores the offset of the corresponding "p<n> len" entry
   in the block file)
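
To make the read path above concrete, here is a very rough sketch (every name
in it is hypothetical except LZ4_decompress_safe(), which is the actual LZ4
API; error handling and caching are glossed over):

#include <lz4.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192					/* assuming the default PostgreSQL page size */

typedef struct PagePointer
{
	off_t		offset;				/* where the compressed page starts */
	uint32_t	compressed_len;		/* its length on disk */
} PagePointer;

/* Read block 'blockno' into the BLCKSZ-sized 'buffer', decompressing on the fly. */
static bool
cfs_read_page(int ptr_fd, int data_fd, uint32_t blockno, char *buffer)
{
	PagePointer ptr;
	char		compressed[LZ4_COMPRESSBOUND(BLCKSZ)];

	/* 1. translate the block number into an offset via the pointers file */
	if (pread(ptr_fd, &ptr, sizeof(ptr), (off_t) blockno * sizeof(ptr)) != sizeof(ptr))
		return false;
	if (ptr.compressed_len > sizeof(compressed))
		return false;

	/* 2. read the compressed bytes from the block file */
	if (pread(data_fd, compressed, ptr.compressed_len, ptr.offset) != (ssize_t) ptr.compressed_len)
		return false;

	/* 3. decompress into the caller-provided page buffer */
	return LZ4_decompress_safe(compressed, buffer,
							   (int) ptr.compressed_len, BLCKSZ) == BLCKSZ;
}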


Test of possible compression (database [7], table ticket_flights [8]):
547M 47087 <- uncompressed
200M 47087.lz4.1.pages.compressed  <-- pages compression (37%)

Pros:
- decreases disk space usage
- decreases disk reads
Cons:
- possibly increases random-access I/O
- increases CPU usage
- possible conflicts with PostgreSQL expectations 
  of Storage Manager behaviour
- could conflict with pg_basebackup and pg_upgrade utilities
- compression requires additional memory

Why should it be implemented at the Storage Manager level instead of using
the Pluggable Storage API [9]?
  - From my point of view, a Storage Manager level implementation allows
    focusing on proper I/O operations and compression, and it allows for a
    much simpler implementation, because the Pluggable Storage API forces
    you to implement more complex interfaces. To be honest, I am really
    hesitant about this point, especially because the Pluggable Storage API
    allows creating an extension without core code modification and
    potentially allows using more effective compression algorithms (a table
    access method lets you know more about the data being stored).

I would like to implement a proof of concept,
and I have a couple of questions:
  - What is your opinion on the necessity of this feature
    (a Compressed Storage Manager)?
  - Is it a good idea to implement database compression at the Storage
    Manager level? Perhaps it is better to use the Pluggable Storage API.
  - Is there any reason to reject this proposal?
  - Are there any circumstances that would prevent implementing a
    Compressed Storage Manager?

Regards, 
Nikolay P.

[1] - https://postgrespro.com/docs/enterprise/9.6/cfs
[2] - 
https://afiskon.github.io/static/2017/postgresql-in-core-compression-pgconf2017.pdf
 (page 17)
[3] - https://dev.mysql.com/doc/refman/8.0/en/innodb-table-compression.html
[4] - https://dev.mysql.com/doc/refman/8.0/en/innodb-page-compression.html
[5] - https://lwn.net/Articles/415889/
[6] - https://www.percona.com/blog/2017/11/20/innodb-page-compression/
[7] - https://postgrespro.com/education/demodb
[8] - 

Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2019-03-31 Thread Michael Paquier
On Sat, Mar 30, 2019 at 10:12:33PM +0900, Michael Paquier wrote:
> Okay, I'll use that then.

And finally committed.  I have changed the debug1 message so that "to
prevent wraparound" is used instead of "anti-wraparound".  I have also
noticed something that was missing from all the patches proposed on this
thread: the progress state was never reset, and we need to call
pgstat_progress_end_command() in order to do that.
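
In other words, the rule is the usual progress-reporting bracket;
schematically (the surrounding vacuum code is elided, this is just the shape
of the fix):

	/* set up the pg_stat_progress_vacuum entry for this relation */
	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
								  RelationGetRelid(onerel));

	/* ... do, or decide to skip, the actual vacuum work ... */

	/* always reset the progress state again, even on the skip path */
	pgstat_progress_end_command();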
--
Michael




Re: [HACKERS] generated columns

2019-03-31 Thread Jeff Janes
On Sat, Mar 30, 2019 at 4:03 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-26 20:50, Pavel Stehule wrote:
> > It is great feature and I'll mark this feature as ready for commit
>
> Committed, thanks.
>

I can't do a same-major-version pg_upgrade across this commit, as the new
pg_dump is trying to dump a nonexistent attgenerated column from the old
database.  Is that acceptable collateral damage?  I thought we usually
avoided that breakage within the dev branch, but I don't know how we do it.

Cheers,

Jeff


Re: COPY FROM WHEN condition

2019-03-31 Thread David Rowley
On Fri, 29 Mar 2019 at 01:15, Andres Freund  wrote:
> On 2019-03-28 20:48:47 +1300, David Rowley wrote:
> > I had a look at this and performance has improved again, thanks.
> > However, I'm not sure if the patch is exactly what we need, let me
> > explain.
>
> I'm not entirely sure either, I just haven't really seen an alternative
> that's convincing.

I wonder if, instead of having the array of slots in ResultRelInfo, we could
have a struct that's local to copy.c containing the array and the number of
tuples stored so far.  For partitioned tables, we could store this struct in
a hash table keyed by partition Oid. When the partition changes, check if
we've got this partition Oid in the hash table and keep adding tuples until
the buffer fills.  We could keep a global count of the number of tuples
stored in all the slot arrays and flush all of them when it gets full.
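
Something along these lines is what I have in mind (a hypothetical sketch,
with all names, including CopyMultiInsertBuffer and MAX_BUFFERED_TUPLES,
invented for illustration):

/* One entry per partition, kept in a hash table keyed by the partition Oid. */
typedef struct CopyMultiInsertBuffer
{
	Oid			relid;				/* hash key: partition Oid */
	ResultRelInfo *resultRelInfo;	/* target partition */
	TupleTableSlot *slots[MAX_BUFFERED_TUPLES];	/* created lazily */
	int			nused;				/* slots currently holding a tuple */
} CopyMultiInsertBuffer;

/* Global count across all buffers; flush every buffer when it fills. */
static int	total_buffered_tuples = 0;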

The trade-off here would be that instead of flushing on each partition
change, we'd do a hash table lookup on each partition change and
possibly create a new array of slots.   This would allow us to get rid
of the code that conditionally switches on/off the batching based on
how often the partition is changing. The key to it being better would
hang on the hash lookup + multi-row-inserts being faster than
single-row-inserts.

I'm just not too sure about how to handle getting rid of the slots
when we flush all the tuples.  Getting rid of them might be a waste,
but it might also stop the code creating tens of millions of slots in
the worst case.  Maybe to fix that we could get rid of the slots in
arrays that didn't get any use at all when we flush the tuples, as
indicated by a 0 tuple count.  This would require a hash seq scan, but
maybe we could keep that cheap by flushing early if we get too many
distinct partitions. That would save the table from getting bloated if
there happened to be a point in the copy stream where we saw high
numbers of distinct partitions with just a few tuples each.
Multi-inserts won't help much in that case anyway.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
>
>
> > Idea: look not on dead tuples, but on changes, just like ANALYZE does.
> > It's my first patch on Postgres, it's probably all wrong but I hope it
> > helps you get the idea.
>
> This was suggested and rejected years ago:
>
> https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30...@2ndquadrant.fr


Thank you for sharing the link. I've read through the thread and see you
posted two patches, the first being similar to but different from mine, and
the second being about a different matter.

I don't see "rejected" there, just the usual distraction of "you should also
consider this" and a time-out leading to "returned with feedback" at the end.

The thing is, we now have large dead production systems and post-mortems
because your patch wasn't pushed back in 2016, so the situation is different.
Let's push at least the first of your two patches, or mine.

Which one is better and why?

I believe mine is, as it just follows a pattern already established and
proven in autoanalyze. With yours, if vacuum comes and is unable to harvest
some dead tuples, it will keep coming back; with mine, it just sleeps until
it sees new dead tuples, which looks better to me: there is no endless loop
in case some dead tuples are stuck forever.
If someone thinks yours is better, should we also consider the same approach
for autoanalyze?


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
>
> If it's months, we probably want limit vacuum to working at a pretty
> slow rate, say 1% of the table size per hour or something.  If it's in
> hours, we need to be a lot more aggressive.  Right now we have no
> information to tell us which of those things is the case, so we'd just
> be shooting in the dark.


The thing is, you don't need to spread your vacuum out over time if the rate
of vacuuming matches the rate of table growth. Can we mark tuples/pages as
all-visible and all-frozen, say, the moment they're pushed out of
shared_buffers?


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
>
> By the way, the Routine Vacuuming chapter of the documentation says:
>
> "The sole disadvantage of increasing autovacuum_freeze_max_age (and
> vacuum_freeze_table_age along with it) is that the pg_xact and
> pg_commit_ts subdirectories of the database cluster will take more space
>
> [...]
>
> If [pg_xact and pg_commit_ts taking 0.5 and 20 GB, respectively]
> is trivial compared to your total database size, setting
> autovacuum_freeze_max_age to its maximum allowed value is recommended."
>
> Maybe this should be qualified with "unless you have trouble with your
> autovacuum keeping up" or so; or generally reworded?


This recommendation is written in the mindset of "wraparound never happens".
If your database is large, you have more chances of hitting it painfully, and
if it's append-only, even more so.

An alternative note along the lines of "if your database is very large and
actively written, you may want to set autovacuum_freeze_max_age to even
smaller values so that the autovacuum load is spread more evenly over time"
may be needed.




-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-03-31 Thread Komяpa
On Thu, Mar 28, 2019 at 6:43 PM Masahiko Sawada 
wrote:

> >> 1. heap vacuum
> >>
> >> 2. HOT pruning
> >
> > Is it worth skipping it if we're writing a page anyway for the sake of
> hint bits and new xids? This will all be no-op anyway on append-only tables
> and happen only when we actually need something?
> >
>
> Yeah, these operations are required only when the table has actual
> garbage. IOW, append-only tables never require them.
>
> >>
> >> 3. freezing tuples
> >> 4. updating visibility map (all-visible and all-frozen)
> >
> > These two are needed, and current autovacuum launch process does not
> take into account that this is also needed for non-dead tuples.
> >
> >>
> >> 5. index vacuum/cleanup
> >
> > There is a separate patch for that. But, since
> https://commitfest.postgresql.org/16/952/ for almost a year already
> Postgres skips index cleanup on tables without new dead tuples, so this
> case is taken care of already?
>
> I think that's not enough. The feature "GUC for cleanup index
> threshold" allows us to skip only index cleanup when there are less
> insertion than the fraction of the total number of heap tuples since
> last index cleanup. Therefore it helps only append-only tables (and
> supporting only btree index for now). We still have to do index
> vacuuming even if the table has just a few dead tuple. The proposed
> patch[1] helps this situation; vacuum can run while skipping index
> vacuuming and index cleanup.
>

So, the patch I posted can technically be applied after
https://commitfest.postgresql.org/22/1817/ gets merged?

The change with my patch is that for a table with 49 insertions and one delete:
 - previously, vacuum would wait for 49 more deletes by default (ignoring
insertions), and only then clean up both the table and its indexes;
 - with the patch, it will freeze and update the VM for the insertions, and
scan the index.

In my experience only a btree index requires a slow full index scan, which is
why only btree was covered by the "GUC for cleanup index threshold" patch. Is
that wrong? Do more index types do a full index scan on vacuum after the
deletion of a single tuple?



> >> 6. truncation
> >
> > This shouldn't be a heavy operation?
> >
>
> I don't think so. This could take AccessExclusiveLock on the table and
> take a long time with large shared buffer as per reported on that
> thread[2].
>

While this can be a useful optimization, I believe it is out of scope for
this patch. I want to fix vacuum never coming to append-only tables, without
breaking other behaviors. Truncation is likely a case where there are enough
dead tuples to trigger a vacuum via the currently existing mechanisms.


> >>
> >>
> >> With the proposed patch[1] we can control to do 5 or not. In addition
> >> to that, another proposed patch[2] allows us to control 6.
> >>
> >> For append-only tables (and similar tables), what we periodically want
> >> to do would be 3 and 4 (possibly we can do 2 as well). So maybe we
> >> need to have both an option of (auto)vacuum to control whether to do 1
> >> and something like a new autovacuum threshold (or an option) to invoke
> >> the vacuum that disables 1, 5 and 6. The vacuum that does only 2, 3
> >> and 4 would be much cheaper than today's vacuum and anti-wraparound
> >> vacuum would be able to skip almost pages.
> >
> >
> > Why will we want to get rid of 1? It's a noop from write perspective and
> saves a scan to do it if it's not noop.
> >
>
> Because that's for tables that have many inserts but have some
> updates/deletes. I think that this strategy would help not only
> append-only tables but also such tables.
>

How much do we save by skipping a heap vacuum on an almost-append-only
table, where the number of updates is below 50, the current threshold?


>
> > Why make it faster in emergency situations when situation can be made
> non-emergency from the very beginning instead?
> >
>
> I don't understand the meaning of "situation can be made non-emergency
> from the very beginning". Could you please elaborate on that?
>

Let's imagine a simple append-only workflow on current default settings in
Postgres. You create a table and start inserting tuples, one per transaction.
Let's imagine a page fits 50 tuples (my case for taxi movement data), and
Amazon gp2 storage, which caps you at, say, 1000 IOPS in non-burst mode.
Anti-wraparound autovacuum (we need a drawing of a misreading of this term,
with a crossed-out car bent in space) will be triggered after
autovacuum_freeze_max_age inserts, 200,000,000 by default. That converts into
4,000,000 pages, or around 32 GB. It will be the first vacuum ever on that
table, since no other mechanism triggers it, and if it steals all the
available IOPS, it will finish in 200,000,000 / 50 / 1000 = 4000 seconds,
killing prod for over an hour.

Telemetry workloads can easily generate 32 GB of data a day (I've seen
more, but let's stick to that number). Production going down for an hour a
day isn't good and I consider it an emergency.

Now, two ways to fix it that reading documentation leads you while you're a
sleepy 

Google Summer of Code: question about GiST API advancement project

2019-03-31 Thread GUO Rui
Hi, hackers,

I'm Rui Guo, a PhD student focusing on database at the University of
California, Irvine. I'm interested in the "GiST API advancement" project
for the Google Summer of Code 2019 which is listed at
https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 .

I'm still reading about RR*-tree, GiST and the PostgreSQL source code to
have a better idea on my proposal. Meanwhile, I have a very basic and
simple question:

Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are
heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the
least), is it possible to apply *machine learning* algorithm to improve it?
The only related reference I got is to use deep learning in database join
operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use
machine learning here or someone already did?
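
(As a concrete illustration of the heuristic I mean, here is a minimal
self-contained sketch of least-enlargement subtree choice for 2-D boxes; the
types and names are illustrative, not the actual GiST or R*-tree code.)

#include <float.h>
#include <stddef.h>

typedef struct MBB { double xlo, ylo, xhi, yhi; } MBB;

static double mbb_area(const MBB *b)
{
	return (b->xhi - b->xlo) * (b->yhi - b->ylo);
}

/* Area growth of 'box' if it had to absorb 'item'. */
static double enlargement(const MBB *box, const MBB *item)
{
	MBB u = {
		box->xlo < item->xlo ? box->xlo : item->xlo,
		box->ylo < item->ylo ? box->ylo : item->ylo,
		box->xhi > item->xhi ? box->xhi : item->xhi,
		box->yhi > item->yhi ? box->yhi : item->yhi
	};

	return mbb_area(&u) - mbb_area(box);
}

/* Greedy chooseSubtree: pick the child whose MBB grows the least. */
static size_t choose_subtree(const MBB *children, size_t nchildren, const MBB *item)
{
	size_t	best = 0;
	double	best_growth = DBL_MAX;

	for (size_t i = 0; i < nchildren; i++)
	{
		double growth = enlargement(&children[i], item);

		if (growth < best_growth)
		{
			best_growth = growth;
			best = i;
		}
	}
	return best;
}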

Thanks,
Rui Guo