Re: Questions/Observations related to Gist vacuum

2019-10-15 Thread Dilip Kumar
On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
>
> On 15/10/2019 09:37, Amit Kapila wrote:
> > While reviewing a parallel vacuum patch [1], we noticed a few things
> > about $SUBJECT implemented in commit -
> > 7df159a620b760e289f1795b13542ed1b3e13b87.
> >
> > 1. A new memory context GistBulkDeleteResult->page_set_context has
> > been introduced, but it doesn't seem to be used.
>
> Oops. internal_page_set and empty_leaf_set were supposed to be allocated
> in that memory context. As things stand, we leak them until end of
> vacuum, in a multi-pass vacuum.

Here is a patch to fix this issue.
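
The fix is essentially to switch into that context while building the two
sets, roughly like this (a sketch of the idea only, assuming the PG 12
gistvacuum.c shape; the attached patch has the details):

    MemoryContext oldcxt;

    /* context that can be reset/deleted to release the sets in a multi-pass vacuum */
    stats->page_set_context = GenerationContextCreate(CurrentMemoryContext,
                                                      "GiST VACUUM page set context",
                                                      16 * 1024);

    /* allocate the integer sets inside that context, not the caller's context */
    oldcxt = MemoryContextSwitchTo(stats->page_set_context);
    stats->internal_page_set = intset_create();
    stats->empty_leaf_set = intset_create();
    MemoryContextSwitchTo(oldcxt);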

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


user_correct_memorycontext_gist_stat.patch
Description: Binary data


Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-15 Thread Dave Cramer
On Sat, 12 Oct 2019 at 05:05, Tom Lane  wrote:

> Andres Freund  writes:
> > On 2019-10-11 16:30:17 -0400, Robert Haas wrote:
> >> But, if it does need to be changed, it seems like a terrible idea to
> >> allow it to be done via SQL. Otherwise, the user can break the driver
> >> by using SQL to set the list to something that the driver's not
> >> expecting, and there's nothing the driver can do to prevent it.
>
> > Uhm. The driver can just ignore GUCs it's not interested in, like our
> > docs have told them for a long time?
>
> Certainly it should do that; but the problematic case is where it
> *doesn't* get told about something it's depending on knowing about.
>
> regards, tom lane
>


Here's an updated patch that addresses some of Andres' concerns;
specifically, it does not use strtok.
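
For illustration, the general pattern is just a moving pointer plus strchr().
This standalone sketch only shows the shape of the parsing, not the patch
itself; the names here are made up:

#include <stdio.h>
#include <string.h>

/* Walk a comma-separated list without modifying the input or using strtok(). */
static void
walk_list(const char *list)
{
    const char *item = list;

    while (item && *item)
    {
        const char *comma = strchr(item, ',');
        int         len = comma ? (int) (comma - item) : (int) strlen(item);

        printf("item: %.*s\n", len, item);      /* handle one GUC name here */
        item = comma ? comma + 1 : NULL;
    }
}

int
main(void)
{
    walk_list("application_name,search_path,TimeZone");
    return 0;
}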

As far as addressing connection poolers goes, one thought is to use the
cancellation key to "validate" the SQL.
This should be known to all drivers and pool implementations. Thoughts?

Dave


0001-Add-a-STARTUP-packet-option-to-set-GUC_REPORT-on-GUC.patch
Description: Binary data


Re: pgbench - extend initialization phase control

2019-10-15 Thread btendouan





Here is rebase v3.


Hi,

Thanks for your new patch.

The regression test fails.
The lowercase "a" in “allowed step characters are” needs to be changed to an
uppercase "A" in the regression test of 002_pgbench_no_server.pl.


The behavior of "g" is different between v12 and the patch, so
backward compatibility is lost.

In v12, BEGIN and COMMIT are implied just by choosing "g";
losing that backward compatibility is a problem.

When using ( and ) with -I, the documentation should indicate that
double quotes are required, and that "v" cannot be enclosed in ( and ).

Regards,

--
Anna




Re: Collation versioning

2019-10-15 Thread Thomas Munro
On Wed, Oct 16, 2019 at 5:33 PM Thomas Munro  wrote:
> On Tue, Oct 15, 2019 at 5:39 PM Thomas Munro  wrote:
> > Here's a version with a small note added to the documentation.  I'm
> > planning to commit this tomorrow.
>
> Done.

The buildfarm is telling me that I didn't test this with the full set
of locales installed, so it fails on some systems.  Will fix.




Re: Collation versioning

2019-10-15 Thread Thomas Munro
On Tue, Oct 15, 2019 at 5:39 PM Thomas Munro  wrote:
> Here's a version with a small note added to the documentation.  I'm
> planning to commit this tomorrow.

Done.

It's not much, but it's a start.  Some things to do:

* handle default collation (probably comes with CF entry 2256?)
* preserve versions of initdb-created collations in pg_upgrade
* ditch collversion and ALTER ... REFRESH VERSION and start tracking
version dependencies per-index (etc)




Re: Ordering of header file inclusion

2019-10-15 Thread vignesh C
On Wed, Oct 16, 2019 at 8:10 AM Amit Kapila  wrote:
>
> Thanks for working on this.  I will look into this in the coming few
> days or during next CF.  Can you please register it for the next CF
> (https://commitfest.postgresql.org/25/)?
>
Thanks, I have added it to the commitfest.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: Copy data to DSA area

2019-10-15 Thread ideriha.take...@fujitsu.com
Hi, 

Sorry for waiting. 
>Thomas Munro  wrote:
>>What do you think about the following?  Even though I know you want to
>>start with much simpler kinds of cache, I'm looking ahead to the lofty
>>end-goal of having a shared plan cache.  No doubt, that involves
>>solving many other problems that don't belong in this thread, but please 
>>indulge me:
>
>My initial motivation came from shared catcache and relcache but I also think 
>shared
>plan cache is one of the big topics and I'd be very excited if it's come true. 
>Sometimes
>making plan at each backend becomes enormous overhead for speed.
>
>>On Wed, Jul 10, 2019 at 6:03 PM Thomas Munro 
>wrote:
>>> Hmm.  I wonder if we should just make ShmContextFree() do nothing!
>>> And make ShmContextAlloc() allocate (say) 8KB chunks (or larger if
>>> needed for larger allocation) and then hand out small pieces from the
>>> 'current' chunk as needed.  Then the only way to free memory is to
>>> destroy contexts, but for the use case being discussed, that might
>>> actually be OK.  I suppose you'd want to call this implementation
>>> something different, like ShmRegionContext, ShmZoneContext or
>>> ShmArenaContext[1].
>>
>>
>>
>>I guess what I said above is only really appropriate for complex things
>>like plans that have their own contexts so that we can delete them
>>easily "in bulk".  I guess it's not true for caches of simpler objects
>>like catcache, that don't want a context for each cached thing and want
>>to free objects "retail" (one by one).  So I guess you might want
>>something more like your current patch for (say) SharedCatCache, and something
>>like the above-quoted idea for (say) SharedPlanCache or SharedRelCache.

I updated the shared memory context for SharedCatCache, which I call
ShmRetailContext. I refactored my previous PoC, in which palloc()
calls dsa_allocate() every time, in a retail way.
I also implemented MemoryContextClone(template_context,
short_lived_parent_context).

>>For an implementation that supports retail free, perhaps you could
>>store the address of the clean-up list element in some extra bytes
>>before the returned pointer, so you don't have to find it by linear
>>search.  

ShmRetailContext is intended to be used for SharedCatCache.
Here are some characteristics of the current CatCache entries:
1. The number of cache entries is generally larger than for the relcache
  and the plan cache. The relcache is proportional to the number of
  tables and indexes, while the catcache has many more kinds of entries,
  and some kinds, like pg_statistic, are proportional to the number of
  attributes of each table.

2. A cache entry (catctup) is built with only one or two calls to palloc().

3. When a cache entry is evicted from the hash table, it is deleted by
   pfree(), one entry at a time.

Because of point 1, I'd rather not have the extra pointer to the
clean-up list element. This pointer would be allocated per catcache entry
and would take space. Also, the clean-up list is not very long because of
my point 2, so a linear search should be fine.

Also because of point 1, I didn't create a MemoryContext header for each
catalog cache. These memory context headers would be located in shared
memory and take space. So I use ShmRetailContextMoveChunk() (which I
previously called ChangeToPermShmContext()) instead of
MemoryContextSetParent(). This moves only the chunks from the locally
allocated parent to the shared parent memory context.

Due to point 3, I think it's also OK not to have a clean-up list in
shared memory for ShmRetailContext; there is no situation where all
chunks are freed at once.

What do you think about the above?

>>Next, I suppose you don't want to leave holes in the middle of
>>the array, so perhaps instead of writing NULL there, you could transfer the 
>>last item
>in the array to this location (with associated concurrency problems).
Done.
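
(In other words, the usual unordered-array removal. A tiny standalone sketch;
the names here are illustrative, not the ones in the patch, and the locking a
shared array needs is omitted:)

#include <assert.h>

/* Remove entries[i] without leaving a hole: move the last element into its slot. */
static void
remove_entry(void **entries, int *nentries, int i)
{
    assert(i >= 0 && i < *nentries);
    entries[i] = entries[*nentries - 1];
    (*nentries)--;
}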

ShmZoneContext for SharedPlan and SharedRelCache is not implemented
but I'm going to do it following your points.

Regards,
Takeshi Ideriha


shm_retail_context-v1.patch
Description: shm_retail_context-v1.patch


Re: Ordering of header file inclusion

2019-10-15 Thread Amit Kapila
On Tue, Oct 15, 2019 at 10:57 PM vignesh C  wrote:
>
> On Wed, Oct 9, 2019 at 11:37 AM Amit Kapila  wrote:
> >
> > On Tue, Oct 8, 2019 at 8:19 PM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > > On Wed, Oct 2, 2019 at 2:57 PM vignesh C  wrote:
> > > >> I noticed that some of the header files inclusion is not ordered as
> > > >> per the usual standard that is followed.
> > > >> The attached patch contains the fix for the order in which the header
> > > >> files are included.
> > > >> Let me know your thoughts on the same.
> > >
> > > > +1.
> > >
> > > FWIW, I'm not on board with reordering system-header inclusions.
> > > Some platforms have (had?) ordering dependencies for those, and where
> > > that's true, it's seldom alphabetical.  It's only our own headers
> > > where we can safely expect that any arbitrary order will work.
> > >
> >
> > Okay, that makes sense.  However, I noticed that ordering for
> > system-header inclusions is somewhat random.  For ex. nodeSubPlan.c,
> > datetime.c, etc. include limits.h first and then math.h whereas
> > knapsack.c, float.c includes them in reverse order.  There could be
> > more such inconsistencies and the probable reason is that we don't
> > have any specific rule, so different people decide to do it
> > differently.
> >
> > > > I think we shouldn't remove the extra line as part of the above change.
> > >
> > > I would take out the blank lines between our own #includes.
> > >
> >
> > Okay, that would be better, but doing it half-heartedly as done in
> > patch might make it worse.  So, it is better to remove blank lines
> > between our own #includes in all cases.
> >
> Attached patch contains the fix based on the comments suggested.
>

Thanks for working on this.  I will look into this in the coming few
days or during next CF.  Can you please register it for the next CF
(https://commitfest.postgresql.org/25/)?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: maintenance_work_mem used by Vacuum

2019-10-15 Thread Masahiko Sawada
On Sat, Oct 12, 2019 at 8:45 PM Amit Kapila  wrote:
>
> On Sat, Oct 12, 2019 at 10:49 AM Masahiko Sawada  
> wrote:
> >
> > On Fri, Oct 11, 2019 at 5:13 PM Amit Kapila  wrote:
> > >
> > > That's right, but OTOH, if the user specifies gin_pending_list_limit
> > > as an option during Create Index with a value greater than GUC
> > > gin_pending_list_limit, then also we will face the same problem.  It
> > > seems to me that we haven't thought enough on memory usage during Gin
> > > pending list cleanup and I don't want to independently make a decision
> > > to change it.  So unless the original author/committer or some other
> > > people who have worked in this area share their opinion, we can leave
> > > it as it is and make a parallel vacuum patch adapt to it.
> >
> > Yeah I totally agreed.
> >
> > Apart from the GIN problem can we discuss whether need to change the
> > current memory usage policy of parallel utility command described in
> > the doc? We cannot control the memory usage in index AM after all but
> > we need to generically consider how much memory is used during
> > parallel vacuum.
> >
>
> Do you mean to say change the docs for a parallel vacuum patch in this
> regard?  If so, I think we might want to do something for
> maintenance_work_mem for parallel vacuum as described in one of the
> emails above [1] and then change the docs accordingly.
>

Yes, agreed. I thought we could discuss that while waiting for other
opinions on the memory usage of gin index pending list cleanup. For
example, one of your suggestions [1] is simple and maybe acceptable, but
I guess it can only deal with gin indexes, not other index AMs that
might consume more memory.

[1] 
https://www.postgresql.org/message-id/CAA4eK1JhpNsTiHj%2BJOy3N8uCGyTBMH8xDhUEtBw8ZeCAPRGp6Q%40mail.gmail.com

Regards,

--
Masahiko Sawada




Understanding TupleQueue impact and overheads?

2019-10-15 Thread Tom Mercha
I have been looking at PostgreSQL's Tuple Queue 
(/include/executor/tqueue.h) which provides functionality for queuing 
tuples between processes through shm_mq. I am still familiarising myself 
with the bigger picture and TupTableStores. I can see that a copy (not a 
reference) of a HeapTuple (obtained from TupleTableSlot or SPI_TupTable 
etc) can be sent to a queue using shm_mq. Then, another process can 
receive these HeapTuples, probably later placing them in 'output' 
TupleTableSlots.
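
(For reference, the API round trip I am describing looks roughly like this,
as far as I can tell for PG 12 -- the DSM/shm_mq setup is omitted, and "mqh",
"slot" and "tupledesc" are assumed to already exist:)

/* sender side: a DestReceiver that copies each tuple into the queue */
DestReceiver *dest = CreateTupleQueueDestReceiver(mqh);
dest->rStartup(dest, CMD_SELECT, tupledesc);
dest->receiveSlot(slot, dest);     /* writes a copy of the slot's tuple into the queue */
dest->rShutdown(dest);

/* receiver side: each call returns a freshly palloc'd copy in this backend */
TupleQueueReader *reader = CreateTupleQueueReader(mqh);
bool        done;
HeapTuple   tup = TupleQueueReaderNext(reader, false, &done);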

What I am having difficulty understanding is what happens to the 
location of the HeapTuple as it moves from one TupleTableSlot to the 
other as described above. Since there most likely is a reference to a 
physical tuple involved, am I incurring a disk-access overhead with each 
copy of a tuple? This would seem like a massive overhead; how can I keep 
such overheads to a minimum?

Furthermore, to what extent can I expect other modules to impact a 
queued HeapTuple? If some external process updates this tuple, when will 
I see the change? Would it be a possibility that the update is not 
reflected on the queued HeapTuple but the external process is not 
blocked/delayed from updating? In other words, like operating on some 
kind of multiple snapshots? When does DBMS logging kick in whilst I am 
transferring a tuple from one TupTableStore to another?

Thanks,
Tom


Re: [HACKERS] Block level parallel vacuum

2019-10-15 Thread Masahiko Sawada
On Tue, Oct 15, 2019 at 6:33 PM Amit Kapila  wrote:
>
> On Tue, Oct 15, 2019 at 1:26 PM Masahiko Sawada  wrote:
> >
> > On Tue, Oct 15, 2019 at 4:15 PM Masahiko Sawada  
> > wrote:
> > >
> > > > > > If we avoid postponing deleting empty pages till the cleanup phase,
> > > > > > then we don't have the problem for gist indexes.
> > > > >
> > > > > Yes. But considering your pointing out I guess that there might be
> > > > > other index AMs use the stats returned from bulkdelete in the similar
> > > > > way to gist index (i.e. using more larger structure of which
> > > > > IndexBulkDeleteResult is just the first field). If we have the same
> > > > > concern the parallel vacuum still needs to deal with that as you
> > > > > mentioned.
> > > > >
> > > >
> > > > Right, apart from some functions for memory allocation/estimation and
> > > > stats copy, we might need something like amcanparallelvacuum, so that
> > > > index methods can have the option to not participate in parallel
> > > > vacuum due to reasons similar to gist or something else.  I think we
> > > > can work towards this direction as this anyway seems to be required
> > > > and till we reach any conclusion for gist indexes, you can mark
> > > > amcanparallelvacuum for gist indexes as false.
> > >
> > > Agreed. I'll create a separate patch to add this callback and change
> > > parallel vacuum patch so that it checks the participation of indexes
> > > and then vacuums on un-participated indexes after parallel vacuum.
> >
> > amcanparallelvacuum is not necessary to be a callback, it can be a
> > boolean field of IndexAmRoutine.
> >
>
> Yes, it will be a boolean.  Note that for parallel-index scans, we
> already have amcanparallel.
>

Attached is an updated patch set. The 0001 patch introduces the new index
AM field amcanparallelvacuum; all index AMs except gist set it to true for
now. The 0002 patch incorporates all the comments I have received so far.
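
(To make the shape concrete -- assuming the flag is just another boolean in
IndexAmRoutine set by each access method's handler, analogous to the existing
amcanparallel; illustrative only, not the exact patch contents:)

Datum
gisthandler(PG_FUNCTION_ARGS)
{
    IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

    /* ... all the existing fields stay as they are ... */
    amroutine->amcanparallelvacuum = false;   /* gist opts out of parallel vacuum for now */

    PG_RETURN_POINTER(amroutine);
}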

Regards,

--
Masahiko Sawada
From 698ba00a46f06a196bc805693b060e9c5b721cf2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 2 Oct 2019 22:46:21 +0900
Subject: [PATCH v30 2/3] Add parallel option to VACUUM command

This change adds a PARALLEL option to the VACUUM command that enables us
to perform index vacuuming and index cleanup with background
workers. Each index is processed by one vacuum
process. Therefore parallel vacuum can be used when the table has at
least two indexes, and the parallel degree cannot be larger than the
number of indexes that the table has.

The parallel degree is either specified by the user or determined based on
the number of indexes that the table has, and is further limited by
max_parallel_maintenance_workers. The table size and index size don't
affect it.
---
 doc/src/sgml/config.sgml  |  14 +-
 doc/src/sgml/ref/vacuum.sgml  |  45 ++
 src/backend/access/heap/vacuumlazy.c  | 984 +++---
 src/backend/access/transam/parallel.c |   4 +
 src/backend/commands/vacuum.c |  45 ++
 src/backend/postmaster/autovacuum.c   |   2 +
 src/bin/psql/tab-complete.c   |   2 +-
 src/include/access/heapam.h   |   3 +
 src/include/commands/vacuum.h |   5 +
 src/test/regress/expected/vacuum.out  |  14 +
 src/test/regress/sql/vacuum.sql   |  10 +
 11 files changed, 1019 insertions(+), 109 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 47b12c6a8f..9012e5549e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2265,13 +2265,13 @@ include_dir 'conf.d'

 
  Sets the maximum number of parallel workers that can be
- started by a single utility command.  Currently, the only
- parallel utility command that supports the use of parallel
- workers is CREATE INDEX, and only when
- building a B-tree index.  Parallel workers are taken from the
- pool of processes established by , limited by .  Note that the requested
+ started by a single utility command.  Currently, the parallel
+ utility commands that support the use of parallel workers are
+ CREATE INDEX only when building a B-tree index,
+ and VACUUM without FULL
+ option. Parallel workers are taken from the pool of processes
+ established by , limited
+ by .  Note that the requested
  number of workers may not actually be available at run time.
  If this occurs, the utility operation will run with fewer
  workers than expected.  The default value is 2.  Setting this
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index f9b0fb8794..ae086b976b 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -34,6 +34,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ]
 INDEX_CLEANUP [ boolean ]
 TRUNCATE [ boolean ]
+PARALLEL [ integer ]
 
 and table_and_columns is:
 
@@ -223,6 +224,32 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ 

Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread David Rowley
On Wed, 16 Oct 2019 at 11:40, Justin Pryzby  wrote:
> It didn't occur to me at the time, but that would also allow
> creating numerous, partial BRIN indices, each of which would have separate
> correlation computed over just their "restricted range", which *might* also
> handle your case, depending on how packed your data is.

Perhaps I've misunderstood you, but the correlation that's used is per
column, not per index. The only way to have it calculate multiple
correlations would be to partition the table. There'd be a correlation
for the column on each partition that way.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread David Rowley
On Wed, 16 Oct 2019 at 05:05, Jeremy Finzel  wrote:
> But perhaps it would be worth exploring if there could be more detailed stats 
> on physical vs logical correlation, such as when ANALYZE takes its samples, 
> noting physical locations as well as logical values, and allowing the 
> correlation to take account of that more detailed analysis.  Of course, 
> sounds like a huge amount of work with uncertain benefits.

Yes. I think improving the statistics could be beneficial. It does
appear like the single value for the entire column is not fine-grained
enough for your use case.

> As the docs state, I do believe that the only use case that will work really 
> well for BRIN is either a truly insert-only table which is never pruned (an 
> idea I dislike as a DBA who wants us to keep OLTP data trim and implement 
> data retention policies :), or a table which is routinely CLUSTERed!

Have you considered partitioning the table so that the retention
policy could be implemented by dropping a partition rather than
performing a bulk DELETE?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread Justin Pryzby
This reminds me of an issue I reported several years ago where Btree index
scans were chosen over a seq scan of a large, INSERT-only table due to very high
correlation, but performed poorly.  I concluded that use of the high "large
scale" correlation on a large 50+GB table caused the planner to fail to account
for a larger number of pages being read nonsequentially (the opposite of your
issue).  I think that's because we were INSERTing data which was at least
approximately sorted on record END time, and the index was on record START
time.  For a large table with a week's data, the correlation of "start time"
was still very high (0.5).  But scanning the index ends up reading pages
nonsequentially, and also making multiple visits per page.

I eked out a patch which made "correlation" a per-index statistic rather than
a per-column one.  That means the planner could distinguish between a
freshly-built btree index and a fragmented one.  (At the time, there was a
hypothesis that our issue was partially due to repeated values of the index
columns.)  It didn't occur to me at the time, but that would also allow
creating numerous, partial BRIN indices, each of which would have separate
correlation computed over just their "restricted range", which *might* also
handle your case, depending on how packed your data is.

https://www.postgresql.org/message-id/flat/20170707234119.GN17566%40telsasoft.com#fdcbebc342b8fb9ad0ff293913f54d11

On Tue, Oct 15, 2019 at 11:05:13AM -0500, Jeremy Finzel wrote:
> I do believe that the only use case that will work really well for BRIN is
> either a truly insert-only table which is never pruned ...  or a table which
> is routinely CLUSTERed!

Or partitioned table, which for large data sets I highly recommend instead of
DELETE.

Justin




Re: v12.0 ERROR: trying to store a heap tuple into wrong type of slot

2019-10-15 Thread Justin Pryzby
On Tue, Oct 15, 2019 at 01:50:09PM -0700, Andres Freund wrote:
> On 2019-10-13 07:51:06 -0700, Andres Freund wrote:
> > On 2019-10-11 16:03:20 -0500, Justin Pryzby wrote:
> > > ts=# CLUSTER huawei_m2000_config_enodebcell_enodeb USING 
> > > huawei_m2000_config_enodebcell_enodeb_coalesce_idx ;
> > The cause of the error is that, while that sounds like it should be the
> > case, a virtual slot isn't sufficient for tuplesort_begin_cluster(). So
> > the fix is pretty trivial.  Will fix.
> 
> I pushed the fix, including a few tests, a few hours ago. I hope that
> fixes the issue for you?

On another server already running REL_12_STABLE, I created an index with the
same definition, which didn't previously exist.  It failed before pulling and
works after.

ts=# CLUSTER huawei_m2000_config_enodebcell_enodeb USING 
huawei_m2000_config_enodebcell_enodeb_coalesce_idx ;
CLUSTER

Thanks,
Justin




Re: v12.0 ERROR: trying to store a heap tuple into wrong type of slot

2019-10-15 Thread Andres Freund
Hi,

On 2019-10-13 07:51:06 -0700, Andres Freund wrote:
> On 2019-10-11 16:03:20 -0500, Justin Pryzby wrote:
> > I'm not sure why we have that index, and my script probably should have 
> > known
> > to choose a better one to cluster on, but still..
> >
> > ts=# CLUSTER huawei_m2000_config_enodebcell_enodeb USING 
> > huawei_m2000_config_enodebcell_enodeb_coalesce_idx ;
> > DEBUG:  0: building index "pg_toast_1840151315_index" on table 
> > "pg_toast_1840151315" serially
> > LOCATION:  index_build, index.c:2791
> > DEBUG:  0: clustering "public.huawei_m2000_config_enodebcell_enodeb" 
> > using sequential scan and sort
> > LOCATION:  copy_table_data, cluster.c:907
> > ERROR:  XX000: trying to store a heap tuple into wrong type of slot
> > LOCATION:  ExecStoreHeapTuple, execTuples.c:1328
> 
> Well, that's annoying. There apparently is not a single test covering
> cluster on expression indexes, that' really ought to not be the
> case. Equally annoying that I just broke this without noticing at all
> :(.
> 
> The cause of the error is that, while that sounds like it should be the
> case, a virtual slot isn't sufficient for tuplesort_begin_cluster(). So
> the fix is pretty trivial.  Will fix.

I pushed the fix, including a few tests, a few hours ago. I hope that
fixes the issue for you?

Greetings,

Andres Freund




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread Michael Lewis
Thanks for closing the loop on the data correlation question. I've been
playing with BRIN indexes on a log table of sorts and this thread helped
clear up some of the behavior I have been seeing.

I am curious, would a partial btree index fit your needs? Perhaps the
maintenance overhead is too significant or this is too off-the-wall, but a
daily job to create new index and drop the old concurrently could give the
performance you need while still saving the extra disk space of the full
btree on the timestamp.

CREATE INDEX CONCURRENTLY log_table_rec_insert_time_partial_10_04 ON
log_table USING btree ( rec_insert_time ) WHERE rec_insert_time >
'2019-10-04'::DATE;
DROP INDEX CONCURRENTLY IF EXISTS log_table_rec_insert_time_partial_10_03;

I would consider including category column as well, but I suspect that
would increase the size of the index significantly. Of course, this depends
on the query planner evaluating that "l.rec_insert_time >= now() - interval
'10 days'" and determining that the index fulfills the need.



v12.0: interrupt reindex CONCURRENTLY: ccold: ERROR: could not find tuple for parent of relation ...

2019-10-15 Thread Justin Pryzby
On a badly-overloaded VM, we hit the previously-reported segfault in progress
reporting.  This left around some *ccold indices.  I tried to drop them but:

sentinel=# DROP INDEX child.alarms_null_alarm_id_idx1_ccold; -- 
child.alarms_null_alarm_time_idx_ccold; -- alarms_null_alarm_id_idx_ccold;
ERROR:  could not find tuple for parent of relation 41351896

Those are children of relkind=I index on relkind=p table.

postgres=# CREATE TABLE t(i int)PARTITION BY RANGE(i);
postgres=# CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (1)TO(100);
postgres=# INSERT INTO t1 SELECT 1 FROM generate_series(1,9);
postgres=# CREATE INDEX ON t(i);

postgres=# begin; SELECT * FROM t; -- DO THIS IN ANOTHER SESSION

postgres=# REINDEX INDEX CONCURRENTLY t1_i_idx; -- cancel this one
^CCancel request sent
ERROR:  canceling statement due to user request

postgres=# \d t1
...
"t1_i_idx" btree (i)
"t1_i_idx_ccold" btree (i) INVALID

postgres=# SELECT inhrelid::regclass FROM pg_inherits WHERE 
inhparent='t_i_idx'::regclass;
inhrelid
t1_i_idx
(1 row)

Not only can't I DROP the _ccold indexes, but also dropping the table doesn't
cause them to be dropped, and then I can't even slash dee them anymore:

jtp=# DROP INDEX t1_i_idx_ccold;
ERROR:  could not find tuple for parent of relation 290818869

jtp=# DROP TABLE t; -- does not fail, but ..

jtp=# \d t1_i_idx_ccold
ERROR:  cache lookup failed for relation 290818865

jtp=# SELECT indrelid::regclass, * FROM pg_index WHERE 
indexrelid='t1_i_idx_ccold'::regclass;
indrelid   | 290818865
indexrelid | 290818869
indrelid   | 290818865
[...]

Justin




Re: BRIN index which is much faster never chosen by planner

2019-10-15 Thread Jeremy Finzel
Thank you for the thorough and thoughtful reply!  Please see below.

On Mon, Oct 14, 2019 at 3:48 PM David Rowley 
wrote:

> Another thing which you might want to look at is the correlation
> column in the pg_stats view for the rec_insert_time column. Previous
> to 7e534adcd, BRIN index were costed based on the selectivity
> estimate. There was no accountability towards the fact that the pages
> for those records might have been spread out over the entire table.
> Post 7e534adcd, we use the correlation estimate to attempt to estimate
> how many pages (more specifically "ranges") we're likely to hit based
> on that and the selectivity estimate. This commit intended to fix the
> issue we had with BRIN indexes being selected far too often.  Of
> course, the correlation is based on the entire table, if there are
> subsets of the table that are perhaps perfectly correlated, then the
> planner is not going to know about that. It's possible that some of
> your older rec_insert_times are spread out far more than the newer
> ones.  As a test, you could try creating a new table and copying the
> records over to it in rec_insert_time order and seeing if the BRIN
> index is selected for that table (after having performed an ANALYZE).
>
> It would be interesting if you could show the pg_stats row for the
> column so that we can see if the correlation is low.
>

So what I said originally (and light bulbs now going off) is that the table
is insert-only, but it is **pruned (deletes) to the past year of data**.  I
think this is the key to what I've missed.  Once vacuum runs, we have a
bunch of old physical space being re-used by new inserts, thus botching
that good correlation between physical and logical order.  So it appears
the physical order of the data is indeed well-correlated in recent history,
but not so when you go back a bit further.  Here are pg_stats:

-[ RECORD 1 ]--+---
schemaname | foo
tablename  | log_table
attname| rec_insert_time
inherited  | f
null_frac  | 0
avg_width  | 8
n_distinct | 1.89204e+06
correlation| 0.193951
most_common_elems  |
most_common_elem_freqs |
elem_count_histogram   |

I have not tried creating a fresh table, but if I modify my OP query to
instead take a window of 10 days, 100 days ago, the BRIN index actually
performs really badly... identically to the seq scan:

Here is with a seq scan:

   QUERY PLAN
-
 WindowAgg  (cost=26822167.70..26822170.60 rows=129 width=120) (actual
time=200730.856..200730.910 rows=62 loops=1)
   ->  Sort  (cost=26822167.70..26822168.02 rows=129 width=104) (actual
time=200730.834..200730.837 rows=62 loops=1)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 Sort Method: quicksort  Memory: 33kB
 ->  Subquery Scan on unique_cases  (cost=26822160.60..26822163.18
rows=129 width=104) (actual time=200730.672..200730.763 rows=62 loops=1)
   ->  Unique  (cost=26822160.60..26822161.89 rows=129
width=124) (actual time=200730.670..200730.753 rows=62 loops=1)
 ->  Sort  (cost=26822160.60..26822160.92 rows=129
width=124) (actual time=200730.668..200730.686 rows=395 loops=1)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   Sort Method: quicksort  Memory: 80kB
   ->  Nested Loop  (cost=0.00..26822156.08
rows=129 width=124) (actual time=200692.321..200730.121 rows=395 loops=1)
 Join Filter: ((l.category)::text =
filter.category)
 Rows Removed by Join Filter: 552210
 ->  Seq Scan on small_join_table filter
 (cost=0.00..26.99 rows=1399 width=8) (actual time=0.013..0.179 rows=1399
loops=1)
 ->  Materialize  (cost=0.00..26818970.84
rows=129 width=99) (actual time=136.942..143.440 rows=395 loops=1399)
   ->  Seq Scan on log_table l
 (cost=0.00..26818970.20 rows=129 width=99) (actual
time=191581.193..200649.013 rows=395 loops=1)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '100
days'::interval)) AND (rec_insert_time <= (now() - '90 days'::interval)))
 Rows Removed by Filter:
315097963
 Planning Time: 1.541 ms
 Execution Time: 200731.273 ms
(19 rows)

Here is with the forced brin index scan:


QUERY PLAN

Re: Questions/Observations related to Gist vacuum

2019-10-15 Thread Heikki Linnakangas

On 15/10/2019 09:37, Amit Kapila wrote:

While reviewing a parallel vacuum patch [1], we noticed a few things
about $SUBJECT implemented in commit -
7df159a620b760e289f1795b13542ed1b3e13b87.

1. A new memory context GistBulkDeleteResult->page_set_context has
been introduced, but it doesn't seem to be used.


Oops. internal_page_set and empty_leaf_set were supposed to be allocated 
in that memory context. As things stand, we leak them until end of 
vacuum, in a multi-pass vacuum.



2. Right now, in gistbulkdelete we make a note of empty leaf pages and
internals pages and then in the second pass during gistvacuumcleanup,
we delete all the empty leaf pages.  I was thinking why unlike nbtree,
we have delayed the deletion of empty pages till gistvacuumcleanup.  I
don't see any problem if we do this during gistbulkdelete itself
similar to nbtree, also I think there is some advantage in marking the
pages as deleted as early as possible.  Basically, if the vacuum
operation is canceled or errored out between gistbulkdelete and
gistvacuumcleanup, then I think the deleted pages could be marked as
recyclable very early in next vacuum operation.  The other advantage
of doing this during gistbulkdelete is we can avoid sharing
information between gistbulkdelete and gistvacuumcleanup which is
quite helpful for a parallel vacuum as the information is not trivial
(it is internally stored as in-memory Btree).   OTOH, there might be
some advantage for delaying the deletion of pages especially in the
case of multiple scans during a single VACUUM command.  We can
probably delete all empty leaf pages in one go which could in some
cases lead to fewer internal page reads.  However, I am not sure if it
is really advantageous to postpone the deletion as there seem to be
some downsides to it as well. I don't see it documented why unlike
nbtree we consider delaying deletion of empty pages till
gistvacuumcleanup, but I might be missing something.


Hmm. The thinking is/was that removing the empty pages is somewhat 
expensive, because it has to scan all the internal nodes to find the 
downlinks to the to-be-deleted pages. Furthermore, it needs to scan all 
the internal pages (or at least until it has found all the downlinks), 
regardless of how many empty pages are being deleted. So it makes sense 
to do it only once, for all the empty pages. You're right though, that 
there would be advantages, too, in doing it after each pass. All things 
considered, I'm not sure which is better.


- Heikki




Re: Fix most -Wundef warnings

2019-10-15 Thread Andrew Gierth
> "Mark" == Mark Dilger  writes:

 >> +#ifdef HSTORE_IS_HSTORE_NEW

 Mark> Checking the current sources, git history, and various older
 Mark> commits, I did not find where HSTORE_IS_HSTORE_NEW was ever
 Mark> defined.

In contrib/hstore, it never was.

The current version of contrib/hstore had a brief life as a separate
extension module called hstore-new, which existed to backport its
functionality into 8.4. The data format for hstore-new was almost
identical to the new contrib/hstore one (and thus different from the old
contrib/hstore), and changed at one point before its final release, so
there were four possible upgrade paths as explained in the comments.

The block comment with the most pertinent explanation seems to have
been a victim of pgindent, but the relevant part is this:

 * [...] So the upshot of all this
 * is that we can treat all the edge cases as "new" if we're being built
 * as hstore-new, and "old" if we're being built as contrib/hstore.

So, HSTORE_IS_HSTORE_NEW was defined if you were building a pgxs module
called "hstore-new" (which was distributed separately on pgfoundry but
the C code was the same), and not if you're building "hstore" (whether
an in or out of tree build).

 Mark> The check on HSTORE_IS_HSTORE_NEW goes back at least as far as
 Mark> 2006, suggesting it was needed for migrating from some version
 Mark> pre-9.0, making me wonder if anybody would need this in the
 Mark> field. Should we drop support for this? I don't have a strong
 Mark> reason to advocate dropping support other than that this #define
 Mark> appears to be undocumented.

The only reason not to remove most of hstore_compat.c is that there is
no way to know what data survives in the wild in each of the three
possible hstore formats (8.4 contrib, pre-final hstore-new, current). I
think it's most unlikely that any of the pre-final hstore-new data still
exists, but how would anyone know?

(The fact that there have been exactly zero field reports of either of
the WARNING messages unfortunately doesn't prove much. Almost all
possible non-current hstore values are unambiguously in one or other of
the possible formats, the ambiguity is only possible because the old
code didn't always set the varlena length to the correct size, but left
unused space at the end.)

-- 
Andrew (irc:RhodiumToad)




Re: tuplesort test coverage

2019-10-15 Thread Peter Geoghegan
On Sun, Oct 13, 2019 at 3:41 PM Andres Freund  wrote:
> - cluster for expression indexes (line 935)

We've never had coverage of this, but perhaps that can be added now.

> - sorts exceeding INT_MAX / 2 memory (line 1337), but that seems hard to
>   test realistically

I don't think that that can be tested, realistically.

> - aborted abbreviated keys (lines 1522, 1608, 1774, 3620, 3739, 3867, 4266)

Also hard to test -- there was a bug here when abbreviated keys first
went in -- that was detected by amcheck.

All of the places where we abort are essentially the same, though.

> - in memory backwards scans (lines 1936, 3042)
> - *any* coverage for TSS_SORTEDONTAPE (line 1964)

That used to exist, but it went away when we killed replacement selection sort.

> - disk sort skiptuples (line 2325)

Couldn't hurt.

> - mergeruns without abbrev key (line 2582)

mergeruns() doesn't use abbreviated keys -- this code disables their
use in the standard way.
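
That is, roughly this block at the top of mergeruns() (quoting from memory,
so the comment wording may differ):

    if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
    {
        /*
         * Abbreviated keys are not written to tape, so once we start merging
         * we must compare full keys from here on.
         */
        state->sortKeys->abbrev_converter = NULL;
        state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;

        /* Not strictly necessary, but be tidy */
        state->sortKeys->abbrev_abort = NULL;
        state->sortKeys->abbrev_full_comparator = NULL;
    }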

> - disk sorts with more than one run (lines 2707, 2789)
> - any disk based tuplesort_begin_heap() (lines 3649, 3676)

I had to convince Tom to get the coverage of external sorts we have
now. Apparently there are buildfarm animals that are very sensitive to
that cost, that could have substantially increased test runtimes were
we to do more. Perhaps this could be revisited.

> - Seems copytup_index currently is essentially dead, because
>   tuplesort_putindextuplevalues() doesn't use COPYTUP (line 4142)

Yeah, that looks like dead code -- it should just be a stub with a
"can't happen" error.

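(Something along these lines, presumably -- a sketch, not a committed change;
the signature matches the other COPYTUP routines in tuplesort.c:)

static void
copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
{
    /* Not reachable: index tuples come in via tuplesort_putindextuplevalues() */
    elog(ERROR, "copytup_index() should not be called");
}
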
> I'm pretty unhappy that tuplesort has been whacked around pretty heavily
> in the last few years, while *reducing* effective test coverage
> noticeably, rather than increasing it.

I don't think that that's true, on balance. There are only 1,000 extra
lines of code in tuplesort.c in master compared to 9.4, even though we
added parallel sorts and abbreviated keys, two huge enhancements. In
many ways, tuplesort is now simpler than ever.

> There's pretty substantial and
> nontrivial areas without any tests - do we have actually have any
> confidence that they work?

Everything that you're talking about has existed since v11 came out a
year ago, and most of it is a year or two older than that. So yeah,
I'm pretty confident that it works.

-- 
Peter Geoghegan




Re: Zedstore - compressed in-core columnar storage

2019-10-15 Thread Ashutosh Sharma
Hi,

I got a chance to spend some time looking into the recent changes done
in the zedstore code, basically the functions for packing datums into
the attribute streams and handling attribute leaf pages. I didn't find
any issues, but there are some minor comments that I noted while
reviewing. I have worked on those and attached is the patch with the
changes. See if the changes look meaningful to you.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Mon, Sep 30, 2019 at 4:08 PM Ashutosh Sharma  wrote:
>
> On Fri, Sep 27, 2019 at 3:09 PM Alexandra Wang  wrote:
> >
> > Hi Ashutosh,
> >
> > Sorry I indeed missed your question, thanks for the reminder!
> >
> > On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma  
> > wrote:
> >>
> >> > Further, the UPDATE operation on zedstore table is very slow. I think
> >> > that's because in case of zedstore table we have to update all the
> >> > btree data structures even if one column is updated and that really
> >> > sucks. Please let me know if there is some other reason for it.
> >> >
> >>
> >> There was no answer for this in your previous reply. It seems like you
> >> missed it. As I said earlier, I tried performing UPDATE operation with
> >> optimised build and found that to update around 10 lacs record in
> >> zedstore table it takes around 24k ms whereas for normal heap table it
> >> takes 2k ms. Is that because in case of zedstore table we have to
> >> update all the Btree data structures even if one column is updated or
> >> there is some other reason for it. If yes, could you please let us
> >> know. FYI, I'm trying to update the table with just two columns.
> >
> >
> > Zedstore UPDATE operation currently fetches the old rows, updates the
> > undo pointers stored in the tid btree, and insert new rows into all
> > the attribute btrees with the new tids. So performance of updating one
> > column makes no difference from updating all the columns. That said,
> > the wider the table is, the longer it takes to update, regardless
> > updating one column or all the columns.
> >
> > However, since your test table only has two columns, and we also
> > tested the same on a one-column table and got similar results as
> > yours, there is definitely room for optimizations. Attached file
> > zedstore_update_flames_lz4_first_update.svg is the profiling results
> > for the update query on a one-column table with 1M records. It spent
> > most of the time in zedstoream_fetch_row() and zsbt_tid_update(). For
> > zedstoream_fetch_row(), Taylor and I had some interesting findings
> > which I'm going to talk about next, I haven't dived into
> > zsbt_tid_update() yet and need to think about it more.
> >
> > To understand what slows down zedstore UDPATE, Taylor and I did the
> > following test and profiling on a zedstore table with only one column.
> >
> > postgres=# create table onecol(a int) using zedstore;
> > postgres=# insert into onecol select i from generate_series(1, 100) i;
> >
> > -- Create view to count zedstore pages group by page types
> > postgres=# CREATE VIEW pg_zs_page_counts AS
> > SELECT
> > c.relnamespace::regnamespace,
> > c.oid,
> > c.relname,
> > pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
> > count(*)
> > FROM pg_am am
> > JOIN pg_class c ON (c.relam = am.oid)
> > WHERE am.amname='zedstore'
> > GROUP BY 1,2,3,4;
> >
> > postgres=# select * from pg_zs_page_counts;
> >  relnamespace |  oid  | relname | pg_zs_page_type | count
> > --+---+-+-+---
> >  public   | 32768 | onecol  | BTREE   |   640
> >  public   | 32768 | onecol  | FREE|90
> >  public   | 32768 | onecol  | META| 1
> > (3 rows)
> >
> > -- Run update query the first time
> > postgres=# update onecol set a = 200; -- profiling attached in 
> > zedstore_update_flames_lz4_first_update.svg
> > Time: 28760.199 ms (00:28.760)
> >
> > postgres=# select * from pg_zs_page_counts;
> >  relnamespace |  oid  | relname | pg_zs_page_type | count
> > --+---+-+-+---
> >  public   | 32768 | onecol  | BTREE   |  6254
> >  public   | 32768 | onecol  | FREE| 26915
> >  public   | 32768 | onecol  | META| 1
> > (6 rows)
> >
>
> Oops, the first UPDATE created a lot of free pages.
>
> Just FYI, when the second update was run, it took around 5 mins (which
> is almost 10-12 times more than what the 1st UPDATE took), but this time
> no more free pages were added; instead, the already available free
> pages were used. Here are the stats observed before and after the second
> update:
>
> before:
> =
> postgres=# select * from pg_zs_page_counts;
>  relnamespace |  oid  | relname | pg_zs_page_type | count
> --+---+-+-+---
>  public   | 16390 | t1  | FREE| 26915
>  public   

Re: configure fails for perl check on CentOS8

2019-10-15 Thread Kyotaro Horiguchi
Hi. Sorry for the delay.

At Thu, 10 Oct 2019 11:51:21 -0400, Tom Lane  wrote in 
> Andrew Dunstan  writes:
> > On 10/10/19 1:46 AM, Kyotaro Horiguchi wrote:
> >> Hello, While I'm moving to CentOS8 environment, I got stuck at
> >> ./configure with the following error.
> >> configure: error: libperl library is requred for Perl
> >> It complains that it needs -fPIC.
> >> Configure uses only $Config{ccflags}, but it seems that
> >> $Config{cccdlflags} is also required. The attached patch make
> >> ./configure success. (configure itself is excluded in the patch.)
> 
> > ./configure --with-perl
> > is working for me on Centos8 (double checked after a `dnf update`)
> 
> Yeah, I'm quite suspicious of this too.  Although we don't seem to have
> any buildfarm members covering exactly RHEL8/CentOS8, we have enough
> coverage of different Fedora releases to make it hard to believe that
> we missed any changes in Red Hat's packaging of Perl.
> 
> Is this error perhaps occurring with a non-vendor Perl installation?
> What's the exact error message?  config.log might contain some useful
> clues, too.

The perl package is the official one. I found the cause; it's my
mistake.

The problematic command line boils down to:

$ ./configure --with-perl CFLAGS=-O0

It is bash-aliased and survived undetected for a long time in my
CentOS 7 environment, but CentOS 8 doesn't allow that.

By the way, is there any official way to specify options like -O0
at configure time?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Block level parallel vacuum

2019-10-15 Thread Amit Kapila
On Tue, Oct 15, 2019 at 1:26 PM Masahiko Sawada  wrote:
>
> On Tue, Oct 15, 2019 at 4:15 PM Masahiko Sawada  wrote:
> >
> > > > > If we avoid postponing deleting empty pages till the cleanup phase,
> > > > > then we don't have the problem for gist indexes.
> > > >
> > > > Yes. But considering your pointing out I guess that there might be
> > > > other index AMs use the stats returned from bulkdelete in the similar
> > > > way to gist index (i.e. using more larger structure of which
> > > > IndexBulkDeleteResult is just the first field). If we have the same
> > > > concern the parallel vacuum still needs to deal with that as you
> > > > mentioned.
> > > >
> > >
> > > Right, apart from some functions for memory allocation/estimation and
> > > stats copy, we might need something like amcanparallelvacuum, so that
> > > index methods can have the option to not participate in parallel
> > > vacuum due to reasons similar to gist or something else.  I think we
> > > can work towards this direction as this anyway seems to be required
> > > and till we reach any conclusion for gist indexes, you can mark
> > > amcanparallelvacuum for gist indexes as false.
> >
> > Agreed. I'll create a separate patch to add this callback and change
> > parallel vacuum patch so that it checks the participation of indexes
> > and then vacuums on un-participated indexes after parallel vacuum.
>
> amcanparallelvacuum is not necessary to be a callback, it can be a
> boolean field of IndexAmRoutine.
>

Yes, it will be a boolean.  Note that for parallel-index scans, we
already have amcanparallel.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: ProcArrayGroupClearXid() compare-exchange style

2019-10-15 Thread Amit Kapila
On Tue, Oct 15, 2019 at 9:23 AM Noah Misch  wrote:
>
> ProcArrayGroupClearXid() has this:
>
> while (true)
> {
> nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
>
> ...
>
> if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
>&nextidx,
>(uint32) proc->pgprocno))
> break;
> }
>
> This, from UnpinBuffer(), is our more-typical style:
>
> old_buf_state = pg_atomic_read_u32(&buf->state);
> for (;;)
> {
> ...
>
> if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
>buf_state))
> break;
> }
>
> That is, we typically put the pg_atomic_read_u32() outside the loop.  After
> the first iteration, it is redundant with the side effect of
> pg_atomic_compare_exchange_u32().  I haven't checked whether this materially
> improves performance, but, for style, I would like to change it in HEAD.
>

+1.  I am not sure if it would improve performance as this whole
optimization was to reduce the number of attempts to acquire LWLock,
but definitely, it makes the code consistent.
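
For reference, the restructured loop would presumably look something like
this (just a sketch, mirroring the UnpinBuffer() style quoted above):

    nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
    for (;;)
    {
        /* Link ourselves to the current head before trying to become the head. */
        pg_atomic_write_u32(&proc->procArrayGroupNext, nextidx);

        /* On CAS failure, nextidx is refreshed to the current value, so no re-read. */
        if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
                                           &nextidx,
                                           (uint32) proc->pgprocno))
            break;
    }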

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-10-15 Thread Dilip Kumar
On Tue, Oct 15, 2019 at 12:25 PM Amit Kapila  wrote:
>

> Right, apart from some functions for memory allocation/estimation and
> stats copy, we might need something like amcanparallelvacuum, so that
> index methods can have the option to not participate in parallel
> vacuum due to reasons similar to gist or something else.  I think we
> can work towards this direction as this anyway seems to be required
> and till we reach any conclusion for gist indexes, you can mark
> amcanparallelvacuum for gist indexes as false.
>
For estimating the size of the stats, I suggest "amestimatestat"
or "amstatsize", and for copying the stats data we can add "amcopystat".  That
would make it easier to extend parallel vacuum to indexes which
have extended stats.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-10-15 Thread Masahiko Sawada
On Tue, Oct 15, 2019 at 4:15 PM Masahiko Sawada  wrote:
>
> On Tue, Oct 15, 2019 at 3:55 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 15, 2019 at 10:34 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Oct 14, 2019 at 6:37 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > 3. Do we really need to give the responsibility of deleting empty
> > > > pages (gistvacuum_delete_empty_pages) to gistvacuumcleanup.  Can't we
> > > > do it in gistbulkdelte?  I see one advantage of postponing it till the
> > > > cleanup phase which is if somehow we can accumulate stats over
> > > > multiple calls of gistbulkdelete, but I am not sure if it is feasible.
> > > > At least, the way current code works, it seems that there is no
> > > > advantage to postpone deleting empty pages till the cleanup phase.
> > > >
> > >
> > > Considering the current strategy of page deletion of gist index the
> > > advantage of postponing the page deletion till the cleanup phase is
> > > that we can do the bulk deletion in cleanup phase which is called at
> > > most once. But I wonder if we can do the page deletion in the similar
> > > way to btree index.
> > >
> >
> > I think there might be some advantage of the current strategy due to
> > which it has been chosen.  I was going through the development thread
> > and noticed some old email which points something related to this.
> > See [1].
>
> Thanks.
>
> >
> > > Or even we use the current strategy I think we can
> > > do that while not passing the pages information from bulkdelete to
> > > vacuumcleanup using by GistBulkDeleteResult.
> > >
> >
> > Yeah, I also think so.  I have started a new thread [2] to know the
> > opinion of others on this matter.
> >
>
> Thank you.
>
> > > > If we avoid postponing deleting empty pages till the cleanup phase,
> > > > then we don't have the problem for gist indexes.
> > >
> > > Yes. But considering your pointing out I guess that there might be
> > > other index AMs use the stats returned from bulkdelete in the similar
> > > way to gist index (i.e. using more larger structure of which
> > > IndexBulkDeleteResult is just the first field). If we have the same
> > > concern the parallel vacuum still needs to deal with that as you
> > > mentioned.
> > >
> >
> > Right, apart from some functions for memory allocation/estimation and
> > stats copy, we might need something like amcanparallelvacuum, so that
> > index methods can have the option to not participate in parallel
> > vacuum due to reasons similar to gist or something else.  I think we
> > can work towards this direction as this anyway seems to be required
> > and till we reach any conclusion for gist indexes, you can mark
> > amcanparallelvacuum for gist indexes as false.
>
> Agreed. I'll create a separate patch to add this callback and change
> parallel vacuum patch so that it checks the participation of indexes
> and then vacuums on un-participated indexes after parallel vacuum.

amcanparallelvacuum is not necessary to be a callback, it can be a
boolean field of IndexAmRoutine.

Regards,

--
Masahiko Sawada




Re: Columns correlation and adaptive query optimization

2019-10-15 Thread Konstantin Knizhnik



On 15.10.2019 1:20, legrand legrand wrote:

Hello Konstantin,

What you have proposed regarding join_selectivity and multicolumn statistics
is very good news!

Regarding your auto_explain modification, maybe an "advisor" mode would also
be helpful (with auto_explain_add_statistics_threshold=-1 for example).
This would allow tracking which missing statistics should be tested (manually
or in another environment).

In my view this advice should be an option of the EXPLAIN command,
which should also permit the auto_explain module to propose a "learning"
phase.

Thank you for the good suggestion. An advisor mode is a really good idea.
I have added an "auto_explain.suggest_only" GUC.
When it is switched on, the suggested CREATE STATISTICS statement is just
printed in the log, and the statistics object is not actually created.
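
(For reference, the GUC is registered in _PG_init() with the usual
boilerplate, roughly:)

static bool auto_explain_suggest_only = false;

void
_PG_init(void)
{
    /* ... existing auto_explain GUCs ... */
    DefineCustomBoolVariable("auto_explain.suggest_only",
                             "Only log suggested CREATE STATISTICS statements, do not create them.",
                             NULL,
                             &auto_explain_suggest_only,
                             false,
                             PGC_SUSET,
                             0,
                             NULL, NULL, NULL);
}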



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 4bf777d..a356df9 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,7 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 #include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 
 /*
@@ -159,6 +160,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause (and it was not estimated yet), just go
@@ -170,6 +174,9 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		return clause_selectivity(root, (Node *) linitial(clauses),
   varRelid, jointype, sjinfo);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Anything that doesn't look like a potential rangequery clause gets
 	 * multiplied into s1 and forgotten. Anything that does gets inserted into
@@ -181,7 +188,6 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -213,6 +219,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -224,6 +231,55 @@ clauselist_selectivity_simple(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+Var* var = (Var*)linitial(expr->args);
+if (!IsA(var, Var) || var->varno != innerRelid)
+{
+	var = (Var*)lsecond(expr->args);
+}
+if (IsA(var, Var) && var->varno == innerRelid)
+{
+	clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+	if (n_clauses_attnums++ != 0)
+	{
+		RelOptInfo* rel = find_base_rel(root, innerRelid);
+		if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+		{
+			StatisticExtInfo *stat = choose_best_statistics(rel->statlist, clauses_attnums,
+			STATS_EXT_DEPENDENCIES);
+			if (stat != NULL)
+			{
+MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+MVDependency *strongest = NULL;
+int i;
+for (i = 0; i < dependencies->ndeps; i++)
+{
+	MVDependency *dependency = dependencies->deps[i];
+	int n_dep_vars = dependency->nattributes - 1;
+	/* Dependency implies attribute */
+	if (var->varattno == dependency->attributes[n_dep_vars])
+	{
+		while (--n_dep_vars >= 0
+			   && bms_is_member(dependency->attributes[n_dep_vars], clauses_attnums));
+		if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+			strongest = dependency;
+	}
+}
+if (strongest)
+{
+	Selectivity dep_sel = (strongest->degree + (1 - strongest->degree) * s1);
+	s1 = Min(dep_sel, s2);
+	continue;
+}
+			}
+		}
+	}
+}
+			}
 
 			if (rinfo)
 			{
@@ -249,7 +305,7 @@ clauselist_selectivity_simple(PlannerInfo *root,
  * selectivity in generically.  But if it's the right oprrest,
  * add the clause to rqlist for later processing.
  */
-switch (get_oprrest(expr->opno))
+switch (oprrest)
 {
 	case F_SCALARLTSEL:
 	case F_SCALARLESEL:
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index a9536c2..915a204 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,25 @@
 #include "postgres.h"
 
 #include 
+#include 
 
 #include 

Re: [HACKERS] Block level parallel vacuum

2019-10-15 Thread Masahiko Sawada
On Tue, Oct 15, 2019 at 3:55 PM Amit Kapila  wrote:
>
> On Tue, Oct 15, 2019 at 10:34 AM Masahiko Sawada  
> wrote:
> >
> > On Mon, Oct 14, 2019 at 6:37 PM Amit Kapila  wrote:
> > >
> >
> > > 3. Do we really need to give the responsibility of deleting empty
> > > pages (gistvacuum_delete_empty_pages) to gistvacuumcleanup.  Can't we
> > > do it in gistbulkdelte?  I see one advantage of postponing it till the
> > > cleanup phase which is if somehow we can accumulate stats over
> > > multiple calls of gistbulkdelete, but I am not sure if it is feasible.
> > > At least, the way current code works, it seems that there is no
> > > advantage to postpone deleting empty pages till the cleanup phase.
> > >
> >
> > Considering the gist index's current page-deletion strategy, the
> > advantage of postponing the page deletion until the cleanup phase is
> > that we can do the bulk deletion in the cleanup phase, which is called
> > at most once. But I wonder if we can do the page deletion in a way
> > similar to the btree index.
> >
>
> I think there might be some advantage to the current strategy due to
> which it was chosen.  I was going through the development thread and
> noticed an old email that points to something related to this.
> See [1].

Thanks.

>
> > Or even if we keep the current strategy, I think we can do that
> > without passing the pages information from bulkdelete to
> > vacuumcleanup via GistBulkDeleteResult.
> >
>
> Yeah, I also think so.  I have started a new thread [2] to know the
> opinion of others on this matter.
>

Thank you.

> > > If we avoid postponing deleting empty pages till the cleanup phase,
> > > then we don't have the problem for gist indexes.
> >
> > Yes. But considering what you pointed out, I guess there might be
> > other index AMs that use the stats returned from bulkdelete in a
> > similar way to the gist index (i.e. using a larger structure of which
> > IndexBulkDeleteResult is just the first field). If we have the same
> > concern, the parallel vacuum still needs to deal with that, as you
> > mentioned.
> >
>
> Right, apart from some functions for memory allocation/estimation and
> stats copy, we might need something like amcanparallelvacuum, so that
> index methods have the option not to participate in parallel vacuum,
> for reasons similar to gist's or for something else.  I think we can
> work in this direction as it seems to be required anyway, and until
> we reach a conclusion for gist indexes you can mark
> amcanparallelvacuum as false for gist indexes.

Agreed. I'll create a separate patch to add this callback and change
the parallel vacuum patch so that it checks which indexes participate,
and then vacuums the non-participating indexes after the parallel
vacuum.
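
To sketch the idea (note that the amcanparallelvacuum field does not
exist yet, so this is purely hypothetical code, not part of any posted
patch): gist would simply opt out in its handler, and the leader would
handle such indexes itself:

#include "postgres.h"
#include "fmgr.h"
#include "nodes/nodes.h"
#include "access/amapi.h"

Datum
gisthandler(PG_FUNCTION_ARGS)
{
	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

	/* ... existing flags and support-function assignments ... */
	amroutine->amcanparallelvacuum = false;	/* hypothetical new field */

	PG_RETURN_POINTER(amroutine);
}

The leader would then check rel->rd_indam->amcanparallelvacuum (or
whatever the field ends up being called) when deciding which indexes
can be handed to the workers.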

Regards,

--
Masahiko Sawada




Clean up MinGW def file generation

2019-10-15 Thread Peter Eisentraut
I was mystified by this comment in Makefile.shlib:

# We need several not-quite-identical variants of .DEF files to build
# DLLs for Windows.  These are made from the single source file
# exports.txt.  Since we can't assume that Windows boxes will have
# sed, the .DEF files are always built and included in distribution
# tarballs.

ifneq (,$(SHLIB_EXPORTS))
distprep: lib$(NAME)dll.def lib$(NAME)ddll.def
...

This doesn't make much sense (anymore?) since MinGW surely has sed and
MSVC doesn't use this (and has Perl).  I think this is a leftover from
various ancient client-only ad-hoc Windows build provisions (those
win32.mak files we used to have around).  Also, the ddll.def (debug
build) isn't used by anything anymore AFAICT.

I think we can clean this up and just have the regular dll.def built
normally at build time if required.
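
For what it's worth, the generated file is trivial.  Taking libpq as
the example (NAME = pq) and a hypothetical exports.txt entry
"PQexample  1", the echo/sed rule produces just:

LIBRARY LIBPQ.dll
EXPORTS
PQexample  @ 1

so generating it at build time, as the attached patch does, seems
entirely reasonable.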

Does anyone know more about this?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 5cf449d76be516b207bb77b7802585f7612a776f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 15 Oct 2019 08:48:35 +0200
Subject: [PATCH] Clean up MinGW def file generation

There were some leftovers from ancient ad-hoc ways to build on
Windows, prior to the standardization on MSVC and MinGW.  We don't
need to build a lib$(NAME)ddll.def (debug build, as opposed to
lib$(NAME)dll.def) for MinGW, since nothing uses that.  We also don't
need to build the regular .def file during distprep, since the MinGW
build environment is perfectly capable of creating that normally at
build time.
---
 src/Makefile.shlib  | 40 ++---
 src/interfaces/ecpg/compatlib/Makefile  |  2 +-
 src/interfaces/ecpg/ecpglib/Makefile|  2 +-
 src/interfaces/ecpg/pgtypeslib/Makefile |  2 +-
 src/interfaces/libpq/Makefile   |  2 +-
 5 files changed, 13 insertions(+), 35 deletions(-)

diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index c4893edfc3..526361f31b 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -49,7 +49,6 @@
 # installdirs-lib   create installation directory $(libdir)
 # uninstall-lib remove the libraries from $(libdir)
 # clean-lib delete the static and shared libraries from the build dir
-# maintainer-clean-lib  delete .def files built for win32
 #
 # Typically you would add `all-lib' to the `all' target so that `make all'
 # builds the libraries.  In the most simple case it would look like this:
@@ -374,6 +373,13 @@ DLL_DEFFILE = lib$(NAME)dll.def
 
 $(shlib): $(OBJS) $(DLL_DEFFILE) | $(SHLIB_PREREQS)
$(CC) $(CFLAGS)  -shared -static-libgcc -o $@  $(OBJS) $(DLL_DEFFILE) $(LDFLAGS) $(LDFLAGS_SL) $(SHLIB_LINK) $(LIBS) -Wl,--out-implib=$(stlib)
+
+UC_NAME = $(shell echo $(NAME) | tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
+
+$(DLL_DEFFILE): $(SHLIB_EXPORTS)
+   echo 'LIBRARY LIB$(UC_NAME).dll' >$@
+   echo 'EXPORTS' >>$@
+   sed -e '/^#/d' -e 's/^\(.*[ ]\)\([0-9][0-9]*\)/\1@ \2/' $< >>$@
 endif
 
 endif # PORTNAME == cygwin
@@ -397,32 +403,6 @@ endif # PORTNAME == cygwin || PORTNAME == win32
echo 'Libs.private: $(sort $(filter-out -L.% -L$(top_srcdir)/%,$(filter -L%,$(LDFLAGS) $(SHLIB_LINK)))) $(filter-out $(PKG_CONFIG_REQUIRES_PRIVATE:lib%=-l%),$(filter -l%,$(SHLIB_LINK)))' >>$@
 
 
-# We need several not-quite-identical variants of .DEF files to build
-# DLLs for Windows.  These are made from the single source file
-# exports.txt.  Since we can't assume that Windows boxes will have
-# sed, the .DEF files are always built and included in distribution
-# tarballs.
-
-ifneq (,$(SHLIB_EXPORTS))
-distprep: lib$(NAME)dll.def lib$(NAME)ddll.def
-
-UC_NAME = $(shell echo $(NAME) | tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
-
-lib$(NAME)dll.def: $(SHLIB_EXPORTS)
-   echo '; DEF file for Makefile.shlib (MinGW)' >$@
-   echo 'LIBRARY LIB$(UC_NAME).dll' >>$@
-   echo 'EXPORTS' >>$@
-   sed -e '/^#/d' -e 's/^\(.*[ ]\)\([0-9][0-9]*\)/\1@ \2/' $< >>$@
-
-lib$(NAME)ddll.def: $(SHLIB_EXPORTS)
-   echo '; DEF file for Makefile.shlib (MinGW)' >$@
-   echo 'LIBRARY LIB$(UC_NAME)D.dll' >>$@
-   echo 'EXPORTS' >>$@
-   sed -e '/^#/d' -e 's/^\(.*[ ]\)\([0-9][0-9]*\)/\1@ \2/' $< >>$@
-
-endif # SHLIB_EXPORTS
-
-
 ##
 ## INSTALL
 ##
@@ -505,8 +485,6 @@ endif # no soname
 .PHONY: clean-lib
 clean-lib:
rm -f $(shlib) $(shlib_bare) $(shlib_major) $(stlib) $(exports_file) lib$(NAME).pc
-
-ifneq (,$(SHLIB_EXPORTS))
-maintainer-clean-lib:
-   rm -f lib$(NAME)dll.def lib$(NAME)ddll.def
+ifneq (,$(DLL_DEFFILE))
+   rm -f $(DLL_DEFFILE)
 endif
diff --git a/src/interfaces/ecpg/compatlib/Makefile b/src/interfaces/ecpg/compatlib/Makefile
index cd9879ba75..b9a4fbe9c0 100644
--- a/src/interfaces/ecpg/compatlib/Makefile
+++ b/src/interfaces/ecpg/compatlib/Makefile
@@ -54,4 +54,4 @@ uninstall: uninstall-lib
 clean distclean: 

Questions/Observations related to Gist vacuum

2019-10-15 Thread Amit Kapila
While reviewing a parallel vacuum patch [1], we noticed a few things
about $SUBJECT implemented in commit -
7df159a620b760e289f1795b13542ed1b3e13b87.

1. A new memory context GistBulkDeleteResult->page_set_context has
been introduced, but it doesn't seem to be used.
2. Right now, in gistbulkdelete we make a note of empty leaf pages and
internal pages, and then in the second pass, during gistvacuumcleanup,
we delete all the empty leaf pages.  I was wondering why, unlike
nbtree, we have delayed the deletion of empty pages until
gistvacuumcleanup.  I don't see any problem with doing this during
gistbulkdelete itself, similar to nbtree; also, I think there is some
advantage in marking the pages as deleted as early as possible.
Basically, if the vacuum operation is canceled or errors out between
gistbulkdelete and gistvacuumcleanup, the deleted pages could still be
marked as recyclable early in the next vacuum operation.  The other
advantage of doing this during gistbulkdelete is that we can avoid
sharing information between gistbulkdelete and gistvacuumcleanup,
which would be quite helpful for a parallel vacuum, as the information
is not trivial (it is internally stored as an in-memory btree).  OTOH,
there might be some advantage to delaying the deletion of pages,
especially in the case of multiple scans during a single VACUUM
command: we could probably delete all empty leaf pages in one go,
which could in some cases lead to fewer internal page reads.  However,
I am not sure it is really advantageous to postpone the deletion, as
there seem to be some downsides as well.  I don't see it documented
why, unlike nbtree, we delay the deletion of empty pages until
gistvacuumcleanup, but I might be missing something.
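
For reference, the information carried between the two passes is
roughly the following (paraphrased from the struct added by the above
commit; treat this as a sketch rather than the exact definition):

/* working state shared by gistbulkdelete() and gistvacuumcleanup() */
typedef struct
{
	IndexBulkDeleteResult stats;	/* must be first, returned to caller */

	/*
	 * Empty leaf pages found by the bulkdelete pass, plus the internal
	 * pages that might reference them; both are consumed when the
	 * cleanup pass deletes the empty leaves.
	 */
	IntegerSet *internal_page_set;	/* see lib/integerset.h */
	IntegerSet *empty_leaf_set;
	MemoryContext page_set_context; /* the unused context from point 1 */
} GistBulkDeleteResult;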

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1JEQ2y3uNucNopDjK8pse6xSe5%3D_oknoWfRQvAF%3DVqsBA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: v12.0: segfault in reindex CONCURRENTLY

2019-10-15 Thread Michael Paquier
On Mon, Oct 14, 2019 at 08:57:16AM +0900, Michael Paquier wrote:
> I need to think about that, but shouldn't we have a way to reproduce
> that case rather reliably with an isolation test?  The patch looks
> good to me; these are also the two places I spotted yesterday after a
> quick lookup.  The only other caller is isTempNamespaceInUse() which
> does its thing correctly.

Actually, reindex-concurrently.spec already stresses that path, except
that in order to reproduce the failure we need to close the connection
within the waiting loop, after the check on VirtualTransactionIdIsValid
but before the progress report is sent.  Using a debugger and a simple
breakpoint I can easily reproduce the crash, but we'd need more than
that to make the test case deterministic, such as a session
termination with the right timing.
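
To make the crash window concrete: the report happens while iterating
over the lock holders, and the fix presumably boils down to a NULL
check of the sort sketched below (this is my reading of the report,
not a quote of the actual patch):

/* inside the wait loop, roughly as in WaitForLockersMultiple() */
if (progress)
{
	PGPROC	   *holder = BackendIdGetProc(lockholders->backendId);

	/*
	 * The conflicting backend may have exited since GetLockConflicts()
	 * ran, in which case BackendIdGetProc() returns NULL; skip the
	 * progress report instead of dereferencing it.
	 */
	if (holder != NULL)
		pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
									 holder->pid);
}
VirtualXactLock(*lockholders, true);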

So, Alvaro, your patch looks good to me.  Could you apply it?
--
Michael


signature.asc
Description: PGP signature