Re: Pluggable storage

2018-06-18 Thread AJG
@Amit

Re: Vacuum etc.

Chrome V8 just released this blog post around concurrent marking, which may
be of interest considering how cpu limited the browser is. Contains
benchmark numbers etc in post as well.

https://v8project.blogspot.com/2018/06/concurrent-marking.html

"This post describes the garbage collection technique called concurrent
marking. The optimization allows a JavaScript application to continue
execution while the garbage collector scans the heap to find and mark live
objects. Our benchmarks show that concurrent marking reduces the time spent
marking on the main thread by 60%–70%"



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: Pluggable storage

2018-06-18 Thread Andres Freund
On 2018-06-18 12:43:57 -0700, AJG wrote:
> @Amit
> 
> Re: Vacuum etc.
> 
> Chrome V8 just released this blog post around concurrent marking, which may
> be of interest considering how cpu limited the browser is. Contains
> benchmark numbers etc in post as well.
> 
> https://v8project.blogspot.com/2018/06/concurrent-marking.html
> 
> "This post describes the garbage collection technique called concurrent
> marking. The optimization allows a JavaScript application to continue
> execution while the garbage collector scans the heap to find and mark live
> objects. Our benchmarks show that concurrent marking reduces the time spent
> marking on the main thread by 60%–70%"

I don't see how in-memory GC techniques have much bearing on the
discussion here?

Greetings,

Andres Freund



Re: Index Skip Scan

2018-06-18 Thread Andrew Dunstan




On 06/18/2018 11:25 AM, Jesper Pedersen wrote:

Hi all,

I would like to start a discussion on Index Skip Scan referred to as 
Loose Index Scan in the wiki [1].



awesome






I wasn't planning on making this a patch submission for the July
CommitFest due to the reasons mentioned above, but can do so if people
think it is best.



New large features are not appropriate for the July CF. September should 
be your goal.


cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-18 Thread Andrew Dunstan




On 06/18/2018 01:31 PM, Andres Freund wrote:

On 2018-06-18 13:21:43 -0400, Alvaro Herrera wrote:

On 2018-Jun-17, Peter Geoghegan wrote:


On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera
 wrote:

I don't necessarily object to the proposed change, but I think you
should generally wait a bit longer for others to react.

What wait period do you think is appropriate in this case?

One which includes at least half a working day in a different timezone.
You asked mid-afternoon on a Friday in a timezone pretty far west.  I
think you could find people in their offices in Easter Island, but not
many more.

I think there's also a question of how much a patch is blocking you /
others.  A shorter question period is more understandable if it's step
3/40, rather than 1/1...





Yeah, but this would still apply for the first in a series of advertised 
patches.


I usually try to wait at least a couple of days. I'm sure every 
committer understands how you can feel the itch to push. I know I do.


cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: libpq compression

2018-06-18 Thread Robbie Harwood
Konstantin Knizhnik  writes:

> On 06.06.2018 02:03, Thomas Munro wrote:
>> On Wed, Jun 6, 2018 at 2:06 AM, Konstantin Knizhnik
>>  wrote:
>>> Thank you for review. Updated version of the patch fixing all reported
>>> problems is attached.
>> Small problem on Windows[1]:
>>
>>C:\projects\postgresql\src\include\common/zpq_stream.h(17): error
>> C2143: syntax error : missing ')' before '*'
>> [C:\projects\postgresql\libpq.vcxproj]
>> 2395
>>
>> You used ssize_t in zpq_stream.h, but Windows doesn't have that type.
>> We have our own typedef in win32_port.h.  Perhaps zpq_stream.c should
>> include postgres.h/postgres_fe.h (depending on FRONTEND) like the
>> other .c files in src/common, before it includes zpq_stream.h?
>> Instead of "c.h".
>>
>> [1] 
>> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.1106
>>
> Thank you very much for reporting the problem.
> I attached new patch with include of postgres_fe.h added to zpq_stream.c

Hello!

Due to being in a similar place, I'm offering some code review.  I'm
excited that you're looking at throughput on the network stack - it's
not usually what we think of in database performance.  Here are some
first thoughts, which have some overlap with what others have said on
this thread already:

###

This build still doesn't pass Windows:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.2277

You can find more about what the bot is doing here:
http://cfbot.cputube.org/

###

I have a few misgivings about pq_configure(), starting with the name.
The *only* thing this function does is set up compression, so it's
mis-named.  (There's no point in making something generic unless it's
needed - it's just confusing.)

I also don't like that you've injected into the *startup* path - before
authentication takes place.  Fundamentally, authentication (if it
happens) consists of exchanging some combination of short strings (e.g.,
usernames) and binary blobs (e.g., keys).  None of this will compress
well, so I don't see it as worth performing this negotiation there - it
can wait.  It's also another message in every startup.  I'd leave it to
connection parameters, personally, but up to you.

###

Documentation!  You're going to need it.  There needs to be enough
around for other people to implement the protocol (or if you prefer,
enough for us to debug the protocol as it exists).

In conjunction with that, please add information on how to set up
compressed vs. uncompressed connections - similarly to how we have
documentation on setting up TLS connections (though presumably compressed
connection documentation will be shorter).

###

Using terminology from https://facebook.github.io/zstd/zstd_manual.html :

Right now you use streaming (ZSTD_{compress,decompress}Stream()) as the
basis for your API.  I think this is probably a mismatch for what we're
doing here - we're doing one-shot compression/decompression of packets,
not sending video or something.

I think our use case is better served by the non-streaming interface, or
what they call the "Simple API" (ZSTD_{decompress,compress}()).
Documentation suggests it may be worth it to keep an explicit context
around and use that interface instead (i.e.,
ZSTD_{compressCCTx,decompressDCtx}()), but that's something you'll have
to figure out.

You may find making this change helpful for addressing the next issue.

###

I don't like where you've put the entry points to the compression logic:
it's a layering violation.  A couple people have expressed similar
reservations I think, so let me see if I can explain using
`pqsecure_read()` as an example.  In pseudocode, `pqsecure_read()` looks
like this:

if conn->is_tls:
n = tls_read(conn, ptr, len)
else:
n = pqsecure_raw_read(conn, ptr, len)
return n

I want to see this extended by your code to something like:

if conn->is_tls:
n = tls_read(conn, ptr, len)
else:
n = pqsecure_raw_read(conn, ptr, len)

if conn->is_compressed:
n = decompress(ptr, n)

return n

In conjunction with the above change, this should also significantly
reduce the size of the patch (I think).
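To make the layering concrete, here is a self-contained C sketch of that pseudocode. raw_read() and decompress_inplace() are hypothetical stand-ins for illustration only — a real patch would call the existing pqsecure_raw_read()/tls_read() and a real one-shot decompressor such as zstd's:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical transport read; stands in for pqsecure_raw_read()/tls_read().
 * It copies from a fixed buffer so the sketch is self-contained. */
static const char wire_data[] = "HELLO";

static ssize_t raw_read(char *ptr, size_t len)
{
    size_t n = sizeof(wire_data) - 1;

    if (n > len)
        n = len;
    memcpy(ptr, wire_data, n);
    return (ssize_t) n;
}

/* Hypothetical one-shot decompressor; a real implementation would call
 * something like ZSTD_decompress().  Here it lower-cases bytes in place
 * so the transformation is observable. */
static ssize_t decompress_inplace(char *ptr, ssize_t n)
{
    for (ssize_t i = 0; i < n; i++)
        if (ptr[i] >= 'A' && ptr[i] <= 'Z')
            ptr[i] += 'a' - 'A';
    return n;
}

/* The proposed layering: transport read first, then an optional
 * decompression pass over whatever the transport produced. */
ssize_t secure_read_sketch(int is_compressed, char *ptr, size_t len)
{
    ssize_t n = raw_read(ptr, len);

    if (n > 0 && is_compressed)
        n = decompress_inplace(ptr, n);
    return n;
}
```

The point is that compression becomes one more optional transform stacked on the existing transport, rather than a parallel code path.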

###

The compression flag has proven somewhat contentious, as you've already
seen on this thread.  I recommend removing it for now and putting it in
a separate patch to be merged later, since it's separable.

###

It's not worth flagging style violations in your code right now, but you
should be aware that there are quite a few style and whitespace
problems.  In particular, please be sure that you're using hard tabs
when appropriate, and that line lengths are fewer than 80 characters
(unless they contain error messages), and that pointers are correctly
declared (`void *arg`, not `void* arg`).

###

Thanks,
--Robbie




Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Claudio Freire
On Mon, Jun 18, 2018 at 4:59 PM Peter Geoghegan  wrote:
> > Note, guaranteed allowable time of index scans (used for quick deletion)
> > will be achieved by storing equal-key index tuples in physical TID order [2]
> > with patch [3].
>
> I now have greater motivation to develop that patch into a real project.
>
> I bet that my heap-tid-sort patch will allow you to refine your
> interface when there are many logical duplicates: You can create one
> explicit scan key, but have a list of heap TIDs that need to be killed
> within the range of matching index tuples. That could be a lot more
> efficient in the event of many non-HOT updates, where most index tuple
> values won't actually change. You can sort the list of heap TIDs that
> you want to kill once, and then "merge" it with the tuples at the leaf
> level as they are matched/killed. It should be safe to avoid
> rechecking anything other than the heap TID values.
>
> [1] 
> http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html
> [2] 
> https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1...@mail.gmail.com

Actually, once btree tids are sorted, you can continue tree descent
all the way to the exact leaf page that contains the tuple to be
deleted.

Thus, the single-tuple interface ends up being quite OK. Sure, you can
optimize things a bit by scanning a range, but only if vacuum is able
to group keys in order to produce the optimized calls, and I don't see
that terribly likely.

So, IMHO the current interface may be quite enough.
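To illustrate the idea, here is a toy C sketch — not btree code; index entries are modelled as a sorted array of hypothetical (key, tid) pairs — showing how the TID tiebreaker lets a single descent land on the exact entry to delete:

```c
#include <assert.h>
#include <stddef.h>

/* Toy index entry: once duplicates are kept in heap-TID order, the TID
 * effectively becomes a trailing tiebreaker column. */
typedef struct
{
    int key;
    int tid;
} IndexEntry;

static int entry_cmp(int key, int tid, const IndexEntry *e)
{
    if (key != e->key)
        return key < e->key ? -1 : 1;
    if (tid != e->tid)
        return tid < e->tid ? -1 : 1;
    return 0;
}

/* Single binary-search "descent" over the sorted entries; returns the
 * position of (key, tid), or -1 if absent. */
int find_exact(const IndexEntry *entries, int n, int key, int tid)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int c = entry_cmp(key, tid, &entries[mid]);

        if (c == 0)
            return mid;
        if (c < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Without the tiebreaker, a lookup among many duplicates would have to scan all equal-key entries to find the one matching the target TID.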



Re: WAL prefetch

2018-06-18 Thread Robert Haas
On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund  wrote:
>> The posix_fadvise approach is not perfect, no doubt about that. But it
>> works pretty well for bitmap heap scans, and it's about 13249x better
>> (rough estimate) than the current solution (no prefetching).
>
> Sure, but investing in an architecture we know might not live long also
> has its cost. Especially if it's not that complicated to do better.

My guesses are:

- Using OS prefetching is a very small patch.
- Prefetching into shared buffers is a much bigger patch.
- It'll be five years before we have direct I/O.
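For reference, the OS-prefetching approach largely boils down to a posix_fadvise(POSIX_FADV_WILLNEED) call per file range; a minimal sketch (prefetch_range() is a made-up helper name, not anything in the tree):

```c
#define _XOPEN_SOURCE 600

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/* Ask the kernel to start reading a file range ahead of time.  Unlike
 * most syscalls, posix_fadvise() reports errors via its return value
 * (0 on success, an errno value otherwise), not via errno. */
int prefetch_range(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

The call is advisory and asynchronous: the kernel may begin readahead into the page cache, but the process never blocks on it, which is why it is such a small patch compared with prefetching into shared buffers.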

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Speedup of relation deletes during recovery

2018-06-18 Thread Fujii Masao
On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund  wrote:
> Hi,
>
> On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
>> > +
>> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
>> > for (i = 0; i < ndelrels; i++)
>> > -   {
>> > -   SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
>> > +   srels[i] = smgropen(delrels[i], InvalidBackendId);
>> >
>> > -   smgrdounlink(srel, false);
>> > -   smgrclose(srel);
>> > -   }
>> > +   smgrdounlinkall(srels, ndelrels, false);
>> > +
>> > +   for (i = 0; i < ndelrels; i++)
>> > +   smgrclose(srels[i]);
>> > +   pfree(srels);
>
> Hm. This will probably cause another complexity issue. If we just
> smgropen() the relation will be unowned. Which means smgrclose() will
> call remove_from_unowned_list(), which is O(open-relations). Which means
> this change afaict creates a new O(ndrels^2) behaviour?
>
> It seems like the easiest fix for that would be to have a local
> SMgrRelationData in that loop, that we assign the relations to?  That's
> a bit hacky, but afaict should work?

The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

   for (i = ndelrels - 1; i >= 0; i--)
   smgrclose(srels[i]);

instead of

   for (i = 0; i < ndelrels; i++)
   smgrclose(srels[i]);

Since add_to_unowned_list() always adds the entry at the head of
the list, the SMgrRelation entries end up in the "unowned" list in
descending order of their identifiers, i.e., an entry with a larger
identifier is placed ahead of one with a smaller identifier.
So if we call remove_from_unowned_list() in descending order of the
entries' identifiers, its search should be O(1); IOW, we can
immediately find the SMgrRelation entry to remove at the head of
the list.
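A toy C model of that argument (list_push()/list_remove() are simplified analogues of add_to_unowned_list()/remove_from_unowned_list(), not the real smgr code); the unlink_steps counter verifies that reverse-order removal never walks past the head:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the "unowned" list: singly linked, head insertion. */
typedef struct Node
{
    int         id;
    struct Node *next;
} Node;

static Node *head = NULL;
static int  unlink_steps = 0;   /* how far list_remove() had to walk */

/* add_to_unowned_list() analogue: always inserts at the head. */
void list_push(Node *n)
{
    n->next = head;
    head = n;
}

/* remove_from_unowned_list() analogue: linear search from the head,
 * counting every link traversed before the target is found. */
void list_remove(Node *n)
{
    Node **p = &head;

    while (*p != n)
    {
        p = &(*p)->next;
        unlink_steps++;
    }
    *p = n->next;
}
```

Pushing entries 0..N-1 and then removing N-1..0 keeps each target at the head, so the walk count stays at zero; removing in insertion order instead would cost O(N) per removal.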

Regards,

-- 
Fujii Masao



Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-18 Thread Nico Williams
On Mon, Jun 18, 2018 at 07:38:13PM +0100, Dent John wrote:
> I commented to Corey (privately) that, while my rewrite extension has
> gotten me a server that responds quickly to aggregate queries, the
> constant need to refresh the supporting MVs means the system’s load
> average is constant and much higher than before. I’m happy with the
> tradeoff for now, but it’s a huge waste of energy, and I’m sure it
> must thrash my disk.
> 
> I’m very interested in what other people think of Corey’s idea.

I've written an alternative materialization extension (entirely as
PlPgSQL) based on PG's internals, but my version has a few big wins that
might help here.  I'm thinking of properly integrating it with PG.  Some
of the features include:

 - you can write triggers that update the materialization

   This is because the materialization is just a regular table in my
   implementation.

 - you can mark a view as needing a refresh (e.g., in a trigger)

 - you can declare a PK, other constraints, and indexes on a
   materialization

   The DMLs used to refresh a view concurrently can take advantage of
   the PK and/or other indexes to go fast.

 - you get a history table which records updates to the materialization

   This is useful for generating incremental updates to external
   systems.

Keeping track of refresh times should help decide whether to use or not
use a materialization in some query, or whether to refresh it first, or
not use it at all.

One of the things I'd eventually like to do is analyze the view query
AST to automatically generate triggers to update materializations or
mark them as needing refreshes.  A first, very very rough sketch of such
an analysis looks like this:

 - if the view query has CTEs
   -> create triggers on all its table sources to mark the
  materialization as needing a refresh

 - else if a table appears more than once as a table source in the view
   query
   -> create triggers on that table that mark the materialization as
  needing a refresh

 - else if a table appears anywhere other than the top-level
   -> create triggers .. mark as needing refresh

 - else if a table is a right-side of a left join
   -> create triggers .. mark as needing refresh

 - else if a table has no PK
   -> create triggers .. mark as needing refresh

 - else if the query has no GROUP BY, or only does a GROUP BY on this
   table and a list of columns prefixed by the table's PK
   -> rewrite the query to have WHERE eq conditions on values for the
  table's PK columns
  
  analyze this query

  if the result shows this table source as the first table in the
  plan
  -> create triggers on this table to update the materialization
 directly from querying the source view

 - else
   -> create triggers .. mark as needing refresh


Nico
-- 



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Peter Geoghegan
On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov
 wrote:
> I have written code for quick index tuple deletion from a relation by heap
> tuple TID. The code relates to the "Retail IndexTuple deletion" enhancement of
> btree indexes on the postgresql wiki [1].

I knew that somebody would eventually read that Wiki page.  :-)

> Currently, index relation cleanup is performed by vacuum, which scans the
> whole index relation for dead entries sequentially, tuple by tuple. When the
> number of dead entries is not large, this simplistic and safe method can be
> significantly outperformed by retail deletions, which use index scans to
> search for dead entries.

I assume that the lazy vacuum bulk delete thing is much faster for the
situation where you have many dead tuples in the index. However,
allowing a B-Tree index to accumulate so much bloat in the first place
often has consequences that cannot be reversed with anything less than
a REINDEX. It's true that "prevention is better than cure" for all
types of bloat. However, this could perhaps be as much as 100x more
important for B-Tree bloat, since we cannot place new tuples in any
place that is convenient. It's very different to bloat in the heap,
even at a high level.

> The code requires holding the DEAD tuple's storage until the scan key is
> created. In this version I add a 'target_index_deletion_factor' option. If it
> is more than 0, heap_page_prune() uses ItemIdMarkDead() instead of
> ItemIdSetDead() to set the DEAD flag while keeping the tuple storage.

Makes sense.
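To make the ItemIdMarkDead()/ItemIdSetDead() distinction concrete, here is a toy model (a simplified struct, not the real ItemIdData bitfields): setting dead reclaims the tuple's storage, while marking dead only flips the flag, so the tuple bytes remain available for building a scan key later.

```c
#include <assert.h>

/* Simplified stand-in for ItemIdData: offset/length of the tuple's
 * storage on the page, plus a flags word. */
typedef struct
{
    unsigned    off;
    unsigned    len;
    unsigned    flags;
} ItemIdSketch;

enum
{
    LP_NORMAL_S = 1,
    LP_DEAD_S = 3
};

/* ItemIdSetDead() analogue: the line pointer is dead and its storage
 * is reclaimed (offset/length zeroed). */
void item_set_dead(ItemIdSketch *it)
{
    it->flags = LP_DEAD_S;
    it->off = 0;
    it->len = 0;
}

/* ItemIdMarkDead() analogue: the line pointer is dead but the tuple
 * storage is kept, so a scan key can still be built from it. */
void item_mark_dead(ItemIdSketch *it)
{
    it->flags = LP_DEAD_S;
}
```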

> Next step is developing background worker which will collect pairs (tid,
> scankey) of DEAD tuples from heap_page_prune() function.

Makes sense.

> | n | t1, s  | t2, s  | speedup |
> |---|---|
> | 0 | 0.3| 0.4476 | 1748.4  |
> | 1 | 0.6| 0.5367 | 855.99  |
> | 2 | 0.0004 | 0.9804 | 233.99  |
> | 3 | 0.0048 | 1.6493 | 34.576  |
> | 4 | 0.5600 | 2.4854 | 4.4382  |
> | 5 | 3.3300 | 3.9555 | 1.2012  |
> | 6 | 17.700 | 5.6000 | 0.3164  |
> |---|---|
> in the table, t1 - measured execution time of lazy_vacuum_index() function
> by Quick-Vacuum strategy; t2 - measured execution time of
> lazy_vacuum_index() function by Lazy-Vacuum strategy;

The speedup looks promising. However, the real benefit should be in
query performance, especially when we have heavy contention. Very
eager retail index tuple deletion could help a lot there. It already
makes sense to make autovacuum extremely aggressive in this case, to
the point when it's running almost constantly. A more targeted cleanup
process that can run much faster could do the same job, but be much
more eager, and so be much more effective at *preventing* bloating of
the key space [1][2].

> Note, guaranteed allowable time of index scans (used for quick deletion)
> will be achieved by storing equal-key index tuples in physical TID order [2]
> with patch [3].

I now have greater motivation to develop that patch into a real project.

I bet that my heap-tid-sort patch will allow you to refine your
interface when there are many logical duplicates: You can create one
explicit scan key, but have a list of heap TIDs that need to be killed
within the range of matching index tuples. That could be a lot more
efficient in the event of many non-HOT updates, where most index tuple
values won't actually change. You can sort the list of heap TIDs that
you want to kill once, and then "merge" it with the tuples at the leaf
level as they are matched/killed. It should be safe to avoid
rechecking anything other than the heap TID values.
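A sketch of that sort-then-merge step (heap TIDs modelled as plain integers; merge_kill_list() is a hypothetical helper, not actual nbtree code): both inputs are sorted, so one linear pass marks every matching leaf tuple dead.

```c
#include <assert.h>
#include <stdbool.h>

/* kill[] is the pre-sorted list of heap TIDs to delete; leaf[] holds the
 * TIDs of the matching index tuples in physical TID order.  dead[i] is
 * set when leaf[i] appears in kill[].  Returns the number killed. */
int merge_kill_list(const int *kill, int nkill,
                    const int *leaf, int nleaf, bool *dead)
{
    int k = 0, killed = 0;

    for (int i = 0; i < nleaf; i++)
    {
        /* advance past kill-list TIDs smaller than this leaf tuple */
        while (k < nkill && kill[k] < leaf[i])
            k++;
        dead[i] = (k < nkill && kill[k] == leaf[i]);
        if (dead[i])
            killed++;
    }
    return killed;
}
```

The merge is O(nkill + nleaf) with a single comparison stream, which is where the savings over per-TID rechecking would come from.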

[1] 
http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html
[2] 
https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1...@mail.gmail.com
-- 
Peter Geoghegan



Re: Speedup of relation deletes during recovery

2018-06-18 Thread Fujii Masao
On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev  wrote:
>> We just had a customer hit this issue. I kind of wonder whether this
>> shouldn't be backpatched: Currently the execution on the primary is
>> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
>> standby - with a lot higher constants to boot.  That means it's very
>> easy to get into situations where the standby starts to lag behind very
>> significantly.
>
> +1, we faced that too

+1 to back-patch. As Horiguchi-san pointed out, this is basically
the fix for oversight of commit 279628a0a7, not new feature.

Regards,

-- 
Fujii Masao



Re: Speedup of relation deletes during recovery

2018-06-18 Thread Andres Freund
On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:
> On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund  wrote:
> > Hi,
> >
> > On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
> >> > +
> >> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
> >> > for (i = 0; i < ndelrels; i++)
> >> > -   {
> >> > -   SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> >> > +   srels[i] = smgropen(delrels[i], InvalidBackendId);
> >> >
> >> > -   smgrdounlink(srel, false);
> >> > -   smgrclose(srel);
> >> > -   }
> >> > +   smgrdounlinkall(srels, ndelrels, false);
> >> > +
> >> > +   for (i = 0; i < ndelrels; i++)
> >> > +   smgrclose(srels[i]);
> >> > +   pfree(srels);
> >
> > Hm. This will probably cause another complexity issue. If we just
> > smgropen() the relation will be unowned. Which means smgrclose() will
> > call remove_from_unowned_list(), which is O(open-relations). Which means
> > this change afaict creates a new O(ndrels^2) behaviour?
> >
> > It seems like the easiest fix for that would be to have a local
> > SMgrRelationData in that loop, that we assign the relations to?  That's
> > a bit hacky, but afaict should work?
> 
> The easier (but tricky) fix for that is to call smgrclose() for
> each SMgrRelation in the reverse order. That is, we should do
> 
>for (i = ndelrels - 1; i >= 0; i--)
>smgrclose(srels[i]);
> 
> instead of
> 
>for (i = 0; i < ndelrels; i++)
>smgrclose(srels[i]);

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned.  I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

Greetings,

Andres Freund



Re: Transform for pl/perl

2018-06-18 Thread Tom Lane
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> [ 0001-Fix-excess-enreferencing-in-plperl-jsonb-transform.patch ]

I tested this a bit more thoroughly by dint of applying Data::Dumper
to the Perl values, and found that we were still getting extra references
to sub-objects, for example

INFO:  $VAR1 = {'1' => \{'2' => \['3','4','5']},'2' => '3'};

where what we want is

INFO:  $VAR1 = {'1' => {'2' => ['3','4','5']},'2' => '3'};

That turns out to be because the newRV() call in JsonbValue_to_SV()
is also superfluous, if we've set up refs around HV and AV scalars.

Pushed with that change and the extra testing technology.  I'll go
push the dereferencing patch I proposed shortly, as well.

The remaining unresolved issue in this thread is Ilmari's suggestion
that we should convert integers to Perl IV (or UV?) if they fit, rather
than always convert to NV as now.  I'm inclined to reject that proposal,
though, and not just because we don't have a patch for it.  What's
bothering me about it is that then the behavior would be dependent
on the width of IV in the particular Perl installation.  I think that
is a platform dependency we can do without.

regards, tom lane



Re: Fix slot's xmin advancement and subxact's lost snapshots in decoding.

2018-06-18 Thread Alvaro Herrera
On 2018-Jun-11, Antonin Houska wrote:

> Arseny Sher  wrote:
> > Please see detailed description of the issues, tests which reproduce
> > them and fixes in the attached patch.
> 
> I've played with your tests and gdb for a while, both w/o and with your
> patch. I think I can understand both problems. I could not invent simpler way
> to fix them.
> 
> One comment about the coding: since you introduce a new transaction list and
> it's sorted by LSN, I think you should make the function AssertTXNLsnOrder()
> more generic and use it to check the by_base_snapshot_lsn list too. For
> example call it after the list insertion and deletion in
> ReorderBufferPassBaseSnapshot().

I've been going over this patch also, and I've made a few minor edits of
the patch and the existing code as I come to understand what it's all
about.

> Also I think it makes no harm if you split the diff into two, although both
> fixes use the by_base_snapshot_lsn field.

Please don't.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-18 Thread Dent John
I commented to Corey (privately) that, while my rewrite extension has gotten me 
a server that responds quickly to aggregate queries, the constant need to 
refresh the supporting MVs means the system’s load average is constant and much 
higher than before. I’m happy with the tradeoff for now, but it’s a huge waste 
of energy, and I’m sure it must thrash my disk.

I’m very interested in what other people think of Corey’s idea.


Re: WAL prefetch

2018-06-18 Thread Andres Freund
On 2018-06-18 16:44:09 -0400, Robert Haas wrote:
> On Sat, Jun 16, 2018 at 3:41 PM, Andres Freund  wrote:
> >> The posix_fadvise approach is not perfect, no doubt about that. But it
> >> works pretty well for bitmap heap scans, and it's about 13249x better
> >> (rough estimate) than the current solution (no prefetching).
> >
> > Sure, but investing in an architecture we know might not live long also
> > has its cost. Especially if it's not that complicated to do better.
> 
> My guesses are:
> 
> - Using OS prefetching is a very small patch.
> - Prefetching into shared buffers is a much bigger patch.

Why?  The majority of the work is standing up a bgworker that does
prefetching (i.e. reads WAL, figures out reads not in s_b, does
prefetch). Allowing a configurable number + some synchronization between
them isn't that much more work.


> - It'll be five years before we have direct I/O.

I think we'll have lost a significant market share by then if that's the
case. Deservedly so.

Greetings,

Andres Freund



Re: [HACKERS] Statement-level rollback

2018-06-18 Thread Alvaro Herrera
On 2018-Jun-16, Robert Haas wrote:

> I'm not sure that really solves the problem, because changing the GUC
> in either direction causes the system to behave differently.  I don't
> see any particular reason to believe that changing the behavior from A
> to B is any more or less likely to break applications than a change
> from B to A.

I suppose the other option is to just disallow a change during a running
session completely.  That is, whatever value it has when you connect is
final.

Maybe a better idea is to make write-once: the application connects,
establishes its desired behavior, and then it cannot be changed anymore.

> I put this feature, which - in this interest of full disclosure - is
> already implemented in Advanced Server and has been for many years,
> more or less in the same category as a hypothetical GUC that changes
> case-folding from lower case to upper case.  That is, it's really nice
> for compatibility, but you can't get around the fact that every bit of
> software that is supposed to run on all PostgreSQL instances has to be
> tested in all possible modes, because otherwise you might find that it
> doesn't work in all of those modes, or doesn't work as expected.  It
> is a behavior-changing GUC par excellence.

Thanks for bringing this up.

While I agree that both your example and the feature being proposed are
behavior-changing, I don't think the parallel is very good, because the
level of effort in order to port from one behavior to the other is much
higher with statement-scoped rollback than with case-folding.  In the
case-folding example, I don't think you need to audit/rewrite all your
applications in order to make them work: most of the time just rename a
few tables, or if not just add a few quotes (and you can probably do it
programmatically.)

With statement-scope rollback what you face is a more thorough review ..
probably adding a savepoint call every other line.  I'm not sure that
for a large codebase this is even reasonable to start talking about.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Statement-level rollback

2018-06-18 Thread Robert Haas
On Mon, Jun 18, 2018 at 4:51 PM, Alvaro Herrera
 wrote:
> On 2018-Jun-16, Robert Haas wrote:
>> I'm not sure that really solves the problem, because changing the GUC
>> in either direction causes the system to behave differently.  I don't
>> see any particular reason to believe that changing the behavior from A
>> to B is any more or less likely to break applications than a change
>> from B to A.
>
> I suppose the other option is to just disallow a change during a running
> session completely.  That is, whatever value it has when you connect is
> final.
>
> Maybe a better idea is to make write-once: the application connects,
> establishes its desired behavior, and then it cannot be changed anymore.

That sounds even worse.  I think if we're going to have this behavior
at all, it should be possible to change the setting.

>> I put this feature, which - in this interest of full disclosure - is
>> already implemented in Advanced Server and has been for many years,
>> more or less in the same category as a hypothetical GUC that changes
>> case-folding from lower case to upper case.  That is, it's really nice
>> for compatibility, but you can't get around the fact that every bit of
>> software that is supposed to run on all PostgreSQL instances has to be
>> tested in all possible modes, because otherwise you might find that it
>> doesn't work in all of those modes, or doesn't work as expected.  It
>> is a behavior-changing GUC par excellence.
>
> Thanks for bringing this up.
>
> While I agree that both your example and the feature being proposed are
> behavior-changing, I don't think the parallel is very good, because the
> level of effort in order to port from one behavior to the other is much
> higher with statement-scoped rollback than with case-folding.  In the
> case-folding example, I don't think you need to audit/rewrite all your
> applications in order to make them work: most of the time just rename a
> few tables, or if not just add a few quotes (and you can probably do it
> programmatically.)
>
> With statement-scope rollback what you face is a more thorough review ..
> probably adding a savepoint call every other line.  I'm not sure that
> for a large codebase this is even reasonable to start talking about.

Yeah.  The real problem is what happens when two code bases collide.
For example, suppose you have a large code base that is using this,
and then you install some extensions that weren't tested with it
enabled.  Or, you install a tool like pgAdmin or pgpool or whatever
that turns out not to understand the new mode, and stuff breaks.  It's
a distributed burden on the ecosystem.  I don't think we can avoid
that.  It's just a matter of whether it is worth it or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Index Skip Scan

2018-06-18 Thread Alexander Korotkov
On Mon, Jun 18, 2018 at 11:20 PM Andrew Dunstan
 wrote:
> On 06/18/2018 11:25 AM, Jesper Pedersen wrote:
> > I wasn't planning on making this a patch submission for the July
> > CommitFest due to the reasons mentioned above, but can do so if people
> > think it is best.
>
> New large features are not appropriate for the July CF. September should
> be your goal.

Assuming this, should we have the possibility to register a patch
for the September CF from now?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 1:42 PM, Claudio Freire  wrote:
> Actually, once btree tids are sorted, you can continue tree descent
> all the way to the exact leaf page that contains the tuple to be
> deleted.
>
> Thus, the single-tuple interface ends up being quite OK. Sure, you can
> optimize things a bit by scanning a range, but only if vacuum is able
> to group keys in order to produce the optimized calls, and I don't see
> that terribly likely.

Andrey talked about a background worker that did processing + index
tuple deletion when handed off work by a user backend after it
performs HOT pruning of a heap page. I consider that idea to be a good
place to go with the patch, because in practice the big problem is
workloads that suffer from so-called "write amplification", where most
index tuples are created despite being "logically unnecessary" (e.g.
one index among several prevents an UPDATE being HOT-safe, making
inserts into most of the indexes "logically unnecessary").

I think that it's likely that only descending the tree once in order
to kill multiple duplicate index tuples in one shot will in fact be
*very* important (unless perhaps you assume that that problem is
solved by something else, such as zheap). The mechanism that Andrey
describes is rather unlike VACUUM as we know it today, but that's the
whole point.
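To make the amortization concrete, here is a toy sketch (hypothetical code, not anything in the tree): given heap TIDs in sorted order, batching consecutive TIDs that fall on the same leaf page lets a retail-deletion routine descend the tree once per page instead of once per tuple.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for an index TID; the "block" field is the page
 * the entry is assumed to fall on.  Names here are invented for the
 * sketch, not PostgreSQL's ItemPointerData. */
typedef struct TidLite
{
    unsigned    block;
    unsigned    offset;
} TidLite;

/* Count the tree descents needed when consecutive same-page TIDs are
 * batched into a single descent. */
static int
count_descents(const TidLite *tids, int ntids)
{
    int         descents = 0;

    for (int i = 0; i < ntids; i++)
    {
        if (i == 0 || tids[i].block != tids[i - 1].block)
            descents++;         /* first entry on a new page: one descent */
    }
    return descents;
}
```

With heavily duplicated keys clustered on few pages, the descent count collapses toward the page count rather than the tuple count, which is where the win Peter describes comes from.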

-- 
Peter Geoghegan



Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-18 Thread Thomas Reiss



Le 18/06/2018 à 10:46, David Rowley a écrit :
> On 12 June 2018 at 01:49, Robert Haas  wrote:
>> On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane  wrote:
>>> Robert Haas  writes:
 That being said, I don't mind a bit if you want to look for further
 speedups here, but if you do, keep in mind that a lot of queries won't
 even use partition-wise join, so all of the arrays will be of length
 1.  Even when partition-wise join is used, it is quite likely not to
 be used for every table in the query, in which case it will still be
 of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
 is probably a regression overall.
>>>
>>> TBH, I am way more concerned about the pessimization introduced for
>>> every pre-existing usage of these functions by putting search loops
>>> into them at all.  I'd like very much to revert that.  If we can
>>> replace those with something along the line of root->index_array[varno]
>>> we'll be better off across the board.
>>
>> I think David's analysis is correct -- this doesn't quite work.  We're
>> trying to identify whether a given varno is one of the ones to be
>> translated, and it's hard to come up with a faster solution than
>> iterating over a (very short) array of those values.  One thing we
>> could do is have two versions of each function, or else an optimized
>> path for the very common nappinfos=1 case.  I'm just not sure it would
>> be worthwhile.  Traversing a short array isn't free, but it's pretty
>> cheap.
> 
> So this is the current situation as far as I see it:
> 
> We could go and add a new version of adjust_appendrel_attrs() and
> adjust_appendrel_attrs_mutator() that accept a Relids for the child
> rels rather than an array of AppendRelInfos. However, that's quite a
> lot of code duplication. We could perhaps cut down on duplication by
> having a callback function stored inside of
> adjust_appendrel_attrs_context which searches for the correct
> AppendRelInfo to use. However, it does not seem to be inline with
> simplifying the code.
> 
> We've not yet heard back from Tom with more details about his
> root->index_array[varno] idea. I can't quite see how this is possible
> and for the moment I assume Tom misunderstood that it's the parent
> varno that's known, not the varno of the child.
> 
> I've attached a patch which cleans up my earlier version and moves the
> setup of the append_rel_array into its own function instead of
> sneaking code into setup_simple_rel_arrays(). I've also now updated
> the comment above find_childrel_appendrelinfo(), which is now an
> unused function.
> 
> I tried the following test case:
> 
> CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
> INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
> PARTITION BY RANGE (date);
> \o /dev/null
> select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
> FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
> hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
> || ' hours')::interval || ''');'
> from generate_Series(0,) x;
> \gexec
> \o
> 
> SELECT * FROM partbench WHERE date = '2018-04-23 00:00:00';
> 
> Patched
> 
> Time: 190.989 ms
> Time: 187.313 ms
> Time: 190.239 ms
> 
> Unpatched
> 
> Time: 775.771 ms
> Time: 781.631 ms
> Time: 762.777 ms
> 
> Is this good enough for v11?

It works pretty well with your last patch. IMHO, this issue should be
addressed in v11. I used a pretty unrealistic test case to show this
regression, but it appears with a reasonable number of partitions: v11 is
slower than v10 even with only 10 partitions.

Thanks a lot !
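For reference, the two lookup strategies discussed in this thread can be sketched as follows. The structs and function names below are illustrative stand-ins, not the real PlannerInfo/AppendRelInfo code: the contrast is between scanning a short array of AppendRelInfos for a matching child varno and indexing an array by child relid directly, as the patch's append_rel_array does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplification of AppendRelInfo. */
typedef struct AppendRelInfoLite
{
    int         child_relid;
    int         parent_relid;
} AppendRelInfoLite;

/* O(nappinfos) scan over a short array, as in the mutator code being
 * discussed; cheap, but paid on every translated Var. */
static AppendRelInfoLite *
find_appinfo_by_scan(AppendRelInfoLite **appinfos, int nappinfos,
                     int child_varno)
{
    for (int i = 0; i < nappinfos; i++)
    {
        if (appinfos[i]->child_relid == child_varno)
            return appinfos[i];
    }
    return NULL;
}

/* O(1) lookup via an array indexed by child relid, built once per
 * query alongside the other simple-rel arrays. */
static AppendRelInfoLite *
find_appinfo_by_index(AppendRelInfoLite **append_rel_array, int child_varno)
{
    return append_rel_array[child_varno];
}
```

The indexed form is what replaces the linear list search in find_childrel_appendrelinfo(); the scan form remains reasonable where only the parent varno is known and the candidate set is tiny.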



Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-18 Thread Rajkumar Raghuwanshi
Hi,

Below test case crashed, when set enable_partitionwise_aggregate to true.

CREATE TABLE part (c1 INTEGER,c2 INTEGER,c3 CHAR(10)) PARTITION BY
RANGE(c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (MINVALUE) TO (500);
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (500) TO (1000);
CREATE TABLE part_p3 PARTITION OF part FOR VALUES FROM (1000) TO (MAXVALUE);
INSERT INTO part SELECT i,i % 250, to_char(i % 4, 'FM') FROM
GENERATE_SERIES(1,1500,2)i;
ANALYSE part;

ALTER TABLE part_p1 SET (parallel_workers = 0);
ALTER TABLE part_p2 SET (parallel_workers = 0);
ALTER TABLE part_p3 SET (parallel_workers = 0);

SET enable_partitionwise_join to on;

set enable_partitionwise_aggregate to off;
EXPLAIN (COSTS OFF)
SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 =
t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;

set enable_partitionwise_aggregate to on;
EXPLAIN (COSTS OFF)
SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 =
t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;

/*
postgres=# set enable_partitionwise_aggregate to off;
SET
postgres=# EXPLAIN (COSTS OFF)
postgres-# SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON
(t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY
1,2;
   QUERY PLAN

 Sort
   Sort Key: (avg(t2.c1)), (sum(t1.c1))
   ->  HashAggregate
 Group Key: t1.c1, t2.c1
 Filter: ((sum(t1.c1) % '125'::bigint) = 0)
 ->  Append
   ->  Hash Join
 Hash Cond: (t1.c1 = t2.c1)
 ->  Seq Scan on part_p1 t1
 ->  Hash
   ->  Seq Scan on part_p1 t2
   ->  Hash Join
 Hash Cond: (t1_1.c1 = t2_1.c1)
 ->  Seq Scan on part_p2 t1_1
 ->  Hash
   ->  Seq Scan on part_p2 t2_1
   ->  Hash Join
 Hash Cond: (t1_2.c1 = t2_2.c1)
 ->  Seq Scan on part_p3 t1_2
 ->  Hash
   ->  Seq Scan on part_p3 t2_2
(21 rows)

postgres=#
postgres=# set enable_partitionwise_aggregate to on;
SET
postgres=# EXPLAIN (COSTS OFF)
postgres-# SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON
(t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY
1,2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
*/

--logfile
TRAP: FailedAssertion("!(parallel_workers > 0)", File: "allpaths.c", Line:
1630)
2018-06-14 23:24:58.375 IST [69650] LOG:  server process (PID 69660) was
terminated by signal 6: Aborted
2018-06-14 23:24:58.375 IST [69650] DETAIL:  Failed process was running:
EXPLAIN (COSTS OFF)
SELECT AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON
(t1.c1 = t2.c1) GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY
1,2;


--core.file
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: edb postgres [local]
EXPLAIN  '.
Program terminated with signal 6, Aborted.
#0  0x003dd2632495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x003dd2632495 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x003dd2633c75 in abort () at abort.c:92
#2  0x00a326da in ExceptionalCondition (conditionName=0xc1a970
"!(parallel_workers > 0)", errorType=0xc1a426 "FailedAssertion",
fileName=0xc1a476 "allpaths.c",
lineNumber=1630) at assert.c:54
#3  0x00797bda in add_paths_to_append_rel (root=0x1d6ff08,
rel=0x1d45d80, live_childrels=0x0) at allpaths.c:1630
#4  0x007d37e1 in create_partitionwise_grouping_paths
(root=0x1d6ff08, input_rel=0x1da5380, grouped_rel=0x1d43520,
partially_grouped_rel=0x1d45d80,
agg_costs=0x7ffceb18dd20, gd=0x0, patype=PARTITIONWISE_AGGREGATE_FULL,
extra=0x7ffceb18dbe0) at planner.c:7120
#5  0x007ce58d in create_ordinary_grouping_paths (root=0x1d6ff08,
input_rel=0x1da5380, grouped_rel=0x1d43520, agg_costs=0x7ffceb18dd20,
gd=0x0, extra=0x7ffceb18dbe0,
partially_grouped_rel_p=0x7ffceb18dc70) at planner.c:4011
#6  0x007ce14b in create_grouping_paths (root=0x1d6ff08,
input_rel=0x1da5380, target=0x1d446d0, target_parallel_safe=true,
agg_costs=0x7ffceb18dd20, gd=0x0)
at planner.c:3783
#7  0x007cb344 in grouping_planner (root=0x1d6ff08,
inheritance_update=false, tuple_fraction=0) 

Re: Concurrency bug in UPDATE of partition-key

2018-06-18 Thread Amit Khandekar
On 18 June 2018 at 17:56, Amit Kapila  wrote:
> On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar  wrote:
>> Should we also create a test case where we can verify that some
>> unnecessary or duplicate triggers are not executed?
>>
>
> I am not sure how much value we will add by having such a test.  In
> general, it is good to have tests that cover various aspects of
> functionality, but OTOH, we have to be careful to not overdo it.

Actually, I am thinking it's not a big deal to add a RAISE statement to
the trigger function in the existing test cases. It will clearly show how
many times the trigger has executed. So I will go ahead and do that.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-18 Thread Amit Khandekar
On 16 June 2018 at 10:44, Amit Kapila  wrote:
> On Thu, Jun 14, 2018 at 10:05 PM, Tom Lane  wrote:
>> I wrote:
>>> This appears to be the fault of commit ab7271677, whose authors I've cc'd:
>>> the stanza starting at about allpaths.c:1672 is bullheadedly creating a
>>> parallel path whether that's allowed or not.  Fixing it might be as simple
>>> as adding "rel->consider_parallel" to the conditions there.
>>
>> Oh, and while I'm bitching: the same commit has created this exceedingly
>> dangerous coding pattern in create_append_path:
>>
>> pathnode->subpaths = list_concat(subpaths, partial_subpaths);
>>
>> foreach(l, subpaths)
>> {
>> Path   *subpath = (Path *) lfirst(l);
>>
>> Because list_concat() modifies its first argument, this will usually
>> have the effect that the foreach traverses the partial_subpaths as
>> well as the main subpaths.  But it's very unclear to the reader whether
>> that's intended or not.  Worse, if we had *only* partial subpaths so
>> that subpaths was initially NIL, then the loop would fail to traverse
>> the partial subpaths, resulting in inconsistent behavior.
>>
>> It looks to me like traversal of the partial subpaths is the right
>> thing here, in which case we should do
>>
>> -   foreach(l, subpaths)
>> +   foreach(l, pathnode->subpaths)
>>
>> or perhaps better
>>
>> -   pathnode->subpaths = list_concat(subpaths, partial_subpaths);
>> +   pathnode->subpaths = subpaths = list_concat(subpaths, 
>> partial_subpaths);
>>
>> to make the behavior clear and consistent.
>>
>
> I agree with your analysis and proposed change.  However, I think in
> practice, it might not lead to any bug as in the loop, we are
> computing parallel_safety and partial_subpaths should be
> parallel_safe.

Will have a look at this soon.
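For anyone following along, the hazard Tom describes can be reproduced with a toy list that mimics the relevant behaviour of list_concat() (this is illustrative code, not the real pg_list.h API): the second list is appended onto the first in place and the head is returned, except when the first list is empty, in which case the second list is returned and the caller's original variable is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Toy singly-linked list cell. */
typedef struct ToyCell
{
    int         value;
    struct ToyCell *next;
} ToyCell;

/* Mimics list_concat(): appends b onto a in place and returns the
 * head, except that concat(NULL, b) simply returns b. */
static ToyCell *
toy_concat(ToyCell *a, ToyCell *b)
{
    ToyCell    *tail;

    if (a == NULL)
        return b;               /* like list_concat(NIL, x) returning x */
    for (tail = a; tail->next != NULL; tail = tail->next)
        ;
    tail->next = b;
    return a;
}

static int
toy_length(const ToyCell *l)
{
    int         n = 0;

    for (; l != NULL; l = l->next)
        n++;
    return n;
}
```

Iterating over the old first-argument variable after the concat usually traverses both lists, but silently traverses nothing when that variable started out empty, which is exactly the inconsistency being fixed.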

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-18 Thread Jeevan Chalke
On Mon, Jun 18, 2018 at 5:02 PM, Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> Hi,
>
> Below test case crashed, when set enable_partitionwise_aggregate to true.
>

I will have a look over this.

Thanks for reporting.


Re: ON CONFLICT DO NOTHING on pg_dump

2018-06-18 Thread Surafel Temesgen
On Sat, Jun 16, 2018 at 11:36 AM, Dilip Kumar  wrote:

>
> @@ -172,6 +172,7 @@ typedef struct _dumpOptions
>   char*outputSuperuser;
>
>   int sequence_data; /* dump sequence data even in schema-only mode */
> + int do_nothing;
>  } DumpOptions;
>
> The new structure member appears out of place, can you move up along
> with other "command-line long options" ?
>
> Done

regards Surafel


pg_dump_onConflect_v3.pach
Description: Binary data


Runtime partition pruning for MergeAppend

2018-06-18 Thread David Rowley
Back in the v11 cycle, there was just not quite enough time to get the
MergeAppend run-time partition pruning patch in.

I've attached v24 of this patch.  The only change from v23 [1] is to
re-base the patch atop current master. There was a bit of churn in the
partition pruning and run-time pruning code last week, and I've
re-aligned the new code to account for the work done there.

I'll add this to the July 'fest.

[1] 
https://www.postgresql.org/message-id/cakjs1f_yfc8vvld3mhp4pxhrgprkb9hqa4zv+5v5pko6f03...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v24-0001-Expand-run-time-partition-pruning-to-work-with-M.patch
Description: Binary data


Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-18 Thread Amit Khandekar
On 16 June 2018 at 19:30, Amit Kapila  wrote:
> On Sat, Jun 16, 2018 at 10:44 AM, Amit Kapila  wrote:
>> Yeah, or perhaps disallow creation of any partial paths in the first
>> place, as in the attached.  This will save us some work as well.
>>
>
> Attached patch contains test case as well.  I have tried to come up
> with some simple test, but couldn't come up with anything much simpler
> than reported by Rajkumar, so decided to use the test case provided by
> him.
>

Thanks for the patch. I had a look at it, and it looks good to me. One
minor comment :

+-- Parallel Append is not be used when the subpath depends on the outer param
"is not be used" => "is not to be used"

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Possible bug in logical replication.

2018-06-18 Thread Michael Paquier
On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
> I confirm that starting reading WAL since restart_lsn as implemented in
> f731cfa fixes this issue, as well as the second issue tushar mentioned
> at [1].

Thanks!

+/*
+ * Start reading WAL at restart_lsn, which certainly points to the valid
+ * record.
+ */
 XLogRecPtrstartlsn = MyReplicationSlot->data.restart_lsn;
 XLogRecPtrretlsn = MyReplicationSlot->data.confirmed_flush;

What does this one actually bring?

 PG_TRY();
 {
-/* restart at slot's confirmed_flush */
+/* start_lsn doesn't matter here, we don't replay xacts at all */
 ctx = CreateDecodingContext(InvalidXLogRecPtr,
 NIL,
 true,

Okay for this one.

-/*
- * The {begin_txn,change,commit_txn}_wrapper callbacks above will
- * store the description into our tuplestore.
- */
+/* Changes are not actually produced in fast_forward mode. */

This one is a good idea.  Now CreateDecodingContext is missing the
description of what fast_forward actually does, aka no changes are
produced.  Could you update your patch to reflect that?  That would be
useful for future callers of CreateDecodingContext as well.

-/* Stop once the moving point wanted by caller has been reached */
-if (moveto <= ctx->reader->EndRecPtr)
-break;
-
 CHECK_FOR_INTERRUPTS();

It seems to me that we still want to have the slot forwarding finish in
this case even if this is interrupted.  Petr, isn't that the intention
here?
--
Michael


signature.asc
Description: PGP signature


Re: Slow planning time for simple query

2018-06-18 Thread Amit Kapila
On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
 wrote:
>> "Tom" == Tom Lane  writes:
>
>  Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill
>  Tom> bits, those bits could propagate to the standby anyway, as a
>  Tom> result of a subsequent WAL action on the index page that gets a
>  Tom> full-page image added.
>
> That's OK as long as we're ignoring those hints on the standby.
>

What if we don't ignore those hints on standby for a specific case
like the one in get_actual_variable_range?  Now, if the user has
enabled wal_log_hints on the master, it could save time from scanning
many dead tuples on standby.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: AtEOXact_ApplyLauncher() and subtransactions

2018-06-18 Thread Amit Khandekar
On 16 June 2018 at 00:03, Amit Khandekar  wrote:
> The way I am implementing this can be seen in attached
> apply_launcher_subtrans_WIP.patch. (check launcher.c changes). I
> haven't started testing it yet though.

Attached patch passes some basic testing I did. Will do some more
testing, and some self-review and code organising, if required. I will
also split the patch into two : one containing the main issue
regarding subtransaction, and the other containing the other issue I
mentioned earlier that shows up without subtransaction as well.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


apply_launcher_subtrans.patch
Description: Binary data


Re: Concurrency bug in UPDATE of partition-key

2018-06-18 Thread Amit Kapila
On Mon, Jun 18, 2018 at 11:28 AM, Dilip Kumar  wrote:
> On Mon, Jun 18, 2018 at 10:21 AM, Amit Khandekar  
> wrote:
>> Attached is v2 version of the patch. It contains the above
>> trigger-related issue fixed.
>>
>> The updated tuple is passed back using the existing newslot parameter
>> of GetTupleForTrigger(). When ExecBRDeleteTriggers() is called using a
>> new epqslot parameter, it means caller wants to skip the trigger
>> execution, because the updated tuple needs to be again checked for
>> constraints. I have added comments of this behaviour in the
>> ExecBRDeleteTriggers() function header.
>>
>
> Thanks for the updated patch.  I have verified the BR trigger
> behaviour, its working fine with the patch.
>
> +  CREATE FUNCTION func_footrg() RETURNS TRIGGER AS $$
> +  BEGIN
> + RETURN OLD;
> +  END $$ LANGUAGE PLPGSQL;
> +  CREATE TRIGGER footrg_ondel BEFORE DELETE ON footrg1
> +   FOR EACH ROW EXECUTE PROCEDURE func_footrg();
>
> Should we also create a test case where we can verify that some
> unnecessary or duplicate triggers are not executed?
>

I am not sure how much value we will add by having such a test.  In
general, it is good to have tests that cover various aspects of
functionality, but OTOH, we have to be careful to not overdo it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-18 Thread Andres Freund
On 2018-06-18 17:10:12 +0530, Jeevan Chalke wrote:
> On Mon, Jun 18, 2018 at 5:02 PM, Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
> 
> > Hi,
> >
> > Below test case crashed, when set enable_partitionwise_aggregate to true.
> >
> 
> I will have a look over this.
> 
> Thanks for reporting.

I've added an v11 open-items entry.

- Andres



Re: [PATCH] Find additional connection service files in pg_service.conf.d directory

2018-06-18 Thread Arthur Zakirov
Hello,

On Thu, Mar 01, 2018 at 01:40:10PM -0500, Curt Tilmes wrote:
> New patch limits files to ".conf".

Isn't it worth checking errno for the stat(), opendir() and readdir() calls?

I think that when some error occurs in searchServiceFileDirectory(), the
error "definition of service \"%s\" not found" will be raised, because
group_found is false. That may be confusing, as it hides the real error,
for example when permission to open the directory is denied.
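A sketch of the kind of errno handling being suggested (function name and messages are illustrative, not the patch's actual code): surface the real cause, such as EACCES, instead of falling through to a generic "service not found" error.

```c
#include <assert.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <dirent.h>

/* Scan a hypothetical service-file directory, reporting the underlying
 * system error instead of swallowing it. */
static int
scan_service_dir(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;

    if (dir == NULL)
    {
        fprintf(stderr, "could not open directory \"%s\": %s\n",
                path, strerror(errno));
        return -1;
    }

    errno = 0;                  /* readdir() reports errors via errno */
    while ((de = readdir(dir)) != NULL)
    {
        /* ... here: check for a ".conf" suffix and parse the file ... */
    }
    if (errno != 0)
    {
        fprintf(stderr, "could not read directory \"%s\": %s\n",
                path, strerror(errno));
        closedir(dir);
        return -1;
    }
    closedir(dir);
    return 0;
}
```

Note that readdir() returns NULL both at end-of-directory and on error, so errno must be cleared beforehand and checked afterwards to tell the two apart.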

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Joe Conway
On 06/18/2018 09:49 AM, Robert Haas wrote:
> On Wed, Jun 13, 2018 at 9:20 AM, Joe Conway  wrote:
>>> Also, if I understand correctly, at unconference session there also
>>> were two suggestions about the design other than the suggestion by
>>> Alexander: implementing TDE at column level using POLICY, and
>>> implementing TDE at table-space level. The former was suggested by Joe
>>> but I'm not sure the detail of that suggestion. I'd love to hear the
>>> deal of that suggestion.
>>
>> The idea has not been extensively fleshed out yet, but the thought was
>> that we create column level POLICY, which would transparently apply some
>> kind of transform on input and/or output. The transforms would
>> presumably be expressions, which in turn could use functions (extension
>> or builtin) to do their work. That would allow encryption/decryption,
>> DLP (data loss prevention) schemes (masking, redacting), etc. to be
>> applied based on the policies.
> 
> It seems to me that column-level encryption is a lot less secure than
> block-level encryption.  I am supposing here that the attack vector is
> stealing the disk.  If all you've got is a bunch of 8192-byte blocks,
> it's unlikely you can infer much about the contents.  You know the
> size of the relations and that's probably about it. 

Not necessarily. Our pages probably have enough predictable bytes to aid
cryptanalysis, compared to user data in a column, which might not be very
predictable.


> If you've got individual values being encrypted, then there's more
> latitude to figure stuff out.  You can infer something about the
> length of particular values.  Perhaps you can find cases where the
> same encrypted value appears multiple times.

This completely depends on the encryption scheme you are using, and the
column level POLICY leaves that entirely up to you.

But in any case most encryption schemes use a random nonce (salt) to
ensure two identical strings do not encrypt to the same result. And
often the encrypted length is padded, so while you might be able to
infer short versus long, you would not usually be able to infer the
exact plaintext length.
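To illustrate the nonce property being described, here is a toy sketch — emphatically NOT a real cipher, and not anything proposed for the tree. The point is only that mixing a fresh nonce into the keystream makes two encryptions of the same plaintext under the same key produce different ciphertexts, while decryption (the same XOR) still works when the nonce is stored alongside the ciphertext.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Trivially weak keystream derivation; a real scheme would use a
 * proper cipher (e.g. a counter mode with a unique nonce). */
static uint8_t
toy_keystream_byte(uint8_t key, uint8_t nonce, size_t i)
{
    return (uint8_t) ((key ^ nonce) + 13 * i);
}

/* XOR stream transform: the same call both encrypts and decrypts. */
static void
toy_xcrypt(const uint8_t *in, uint8_t *out, size_t len,
           uint8_t key, uint8_t nonce)
{
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ toy_keystream_byte(key, nonce, i);
}
```

Encrypting one plaintext twice with distinct nonces yields distinct ciphertexts, so an attacker cannot spot repeated values; applying the transform again with the matching nonce recovers the plaintext.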


> If there's a btree index, you know the ordering of the values under
> whatever ordering semantics apply to that index.  It's unclear to me
> how useful such information would be in practice or to what extent it
> might allow you to attack the underlying cryptography, but it seems
> like there might be cases where the information leakage is
> significant.  For example, suppose you're trying to determine which
> partially-encrypted record is that of Aaron Aardvark... or this guy: 
> https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff,_Sr.

Again, this only applies if your POLICY uses this type of encryption,
e.g. order-preserving encryption. If you use strong encryption you will
not be indexing those columns at all, which is pretty commonly the case.

> Recently, it was suggested to me that a use case for column-level
> encryption might be to prevent casual DBA snooping.  So, you'd want
> the data to appear in pg_dump output encrypted, because the DBA might
> otherwise look at it, but you wouldn't really be concerned about the
> threat of the DBA loading a hostile C module that would steal user
> keys and use them to decrypt all the data, because they don't care
> that much and would be fired if they were caught doing it.

Again completely dependent on the extension you use to do the encryption
for the input policy. The keys don't need to be stored with the data,
and the decryption can be transparent only for certain users or if
certain session variables exist which the DBA does not have access to.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Slow planning time for simple query

2018-06-18 Thread Tom Lane
Amit Kapila  writes:
> On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
>  wrote:
>> That's OK as long as we're ignoring those hints on the standby.

> What if we don't ignore those hints on standby for a specific case
> like the one in get_actual_variable_range?

Yeah, that's the same idea I suggested upthread.  Andrew shot down
my first thought (correctly I think) but the second one still seems
feasible.

> Now, if the user has
> enabled wal_log_hints on the master, it could save time from scanning
> many dead tuples on standby.

We should have the standby set the hint bits for itself, rather than
relying on wal_log_hints.

regards, tom lane



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Joe Conway
On 06/18/2018 10:26 AM, Robert Haas wrote:
> On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:
>> Not necessarily. Our pages probably have enough predictable bytes to aid
>> cryptanalysis, compared to user data in a column which might not be very
>> predicable.
> 
> Really?  I would guess that the amount of entropy in a page is WAY
> higher than in an individual column value.

It isn't about the entropy of the page overall, it is about the
predictability of specific bytes at specific locations on the pages. At
least as far as I understand it.

>> But in any case most encryption schemes use a random nonce (salt) to
>> ensure two identical strings do not encrypt to the same result. And
>> often the encrypted length is padded, so while you might be able to
>> infer short versus long, you would not usually be able to infer the
>> exact plaintext length.
> 
> Sure, that could be done, although it means that equality comparisons
> must be done unencrypted.

Sure. Typically equality comparisons are done on other unencrypted
attributes. Or if you need to do equality on encrypted columns, you can
store non-reversible cryptographic hashes in a separate column.

>> Again completely dependent on the extension you use to do the encryption
>> for the input policy. The keys don't need to be stored with the data,
>> and the decryption can be transparent only for certain users or if
>> certain session variables exist which the DBA does not have access to.
> 
> Not arguing with that.  And to be clear, I'm not trying to attack your
> proposal.  I'm just trying to have a discussion about advantages and
> disadvantages of different approaches.

Understood. Ultimately we might want both page-level encryption and
column level POLICY, as they are each useful for different use-cases.
Personally I believe the former is more generally useful than the
latter, but YMMV.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Claudio Freire
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan  wrote:

> > I think it would be helpful if you could talk more about these
> > regressions (and the wins).
>
> I think that the performance regressions are due to the fact that when
> you have a huge number of duplicates today, it's useful to be able to
> claim space to fit further duplicates from almost any of the multiple
> leaf pages that contain or have contained duplicates. I'd hoped that
> the increased temporal locality that the patch gets would more than
> make up for that. As far as I can tell, the problem is that temporal
> locality doesn't help enough. I saw that performance was somewhat
> improved with extreme Zipf distribution contention, but it went the
> other way with less extreme contention. The details are not that fresh
> in my mind, since I shelved this patch for a while following limited
> performance testing.
>
> The code could certainly use more performance testing, and more
> general polishing. I'm not strongly motivated to do that right now,
> because I don't quite see a clear path to making this patch useful.
> But, as I said, I have an open mind about what the next step should
> be.

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

I never got around to implementing it though, and it does get tricky
if you don't want to allow unbounded latency spikes.



Index Skip Scan

2018-06-18 Thread Jesper Pedersen

Hi all,

I would like to start a discussion on Index Skip Scan referred to as 
Loose Index Scan in the wiki [1].


My use-case is the simplest form of Index Skip Scan (B-Tree only), 
namely going from


CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 1000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
 HashAggregate  (cost=169247.71..169247.74 rows=3 width=4) (actual time=4104.099..4104.099 rows=3 loops=1)
   Output: b
   Group Key: t1.b
   Buffers: shared hit=44248
   ->  Seq Scan on public.t1  (cost=0.00..144247.77 rows=977 width=4) (actual time=0.059..1050.376 rows=1000 loops=1)
         Output: a, b
         Buffers: shared hit=44248
 Planning Time: 0.157 ms
 Execution Time: 4104.155 ms
(9 rows)

to

CREATE TABLE t1 (a integer PRIMARY KEY, b integer);
CREATE INDEX idx_t1_b ON t1 (b);
INSERT INTO t1 (SELECT i, i % 3 FROM generate_series(1, 1000) as i);
ANALYZE;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT DISTINCT b FROM t1;
 Index Skip Scan using idx_t1_b on public.t1  (cost=0.43..1.30 rows=3 width=4) (actual time=0.061..0.137 rows=3 loops=1)
   Output: b
   Heap Fetches: 3
   Buffers: shared hit=13
 Planning Time: 0.155 ms
 Execution Time: 0.170 ms
(6 rows)
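For comparison, the wiki page [1] shows how this skip can be emulated today
with a recursive CTE; a sketch against the same table (each iteration does
one index probe to find the next distinct value of b):

```sql
WITH RECURSIVE skip AS (
    (SELECT b FROM t1 ORDER BY b LIMIT 1)
    UNION ALL
    SELECT (SELECT b FROM t1 WHERE b > skip.b ORDER BY b LIMIT 1)
    FROM skip
    WHERE skip.b IS NOT NULL
)
SELECT b FROM skip WHERE b IS NOT NULL;
```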

I took Thomas Munro's previous patch [2] on the subject, added a GUC, a 
test case, documentation hooks, minor code cleanups, and made the patch 
pass an --enable-cassert make check-world run. So, the overall design is 
the same.


However, as Robert Haas noted in the thread there are issues with the 
patch as is, especially in relationship to the amcanbackward functionality.


A couple of questions to begin with.

Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should 
a new node (T_IndexSkipScan) be created ? If latter, then there likely 
will be functionality that needs to be refactored into shared code 
between the nodes.


Which is the best way to deal with the amcanbackward functionality ? Do 
people see another alternative to Robert's idea of adding a flag to the 
scan.


I wasn't planning on making this a patch submission for the July 
CommitFest due to the reasons mentioned above, but can do so if people 
thinks it is best. The patch is based on master/4c8156.


Any feedback, suggestions, design ideas and help with the patch in 
general is greatly appreciated.


Thanks in advance !

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] 
https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com


Best regards,
 Jesper
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3742..74ed15bfeb 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b60240ecfe..d03d64a4bc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3769,6 +3769,22 @@ ANY num_sync ( 
+  enable_indexskipscan (boolean)
+  
+   enable_indexskipscan configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's use of index-skip-scan plan
+types (see ). This parameter requires
+that enable_indexonlyscan is on.
+The default is on.
+   
+  
+ 
+
  
   enable_material (boolean)
   
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 24c3405f91..bd6a2a7b93 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
 amendscan_function amendscan;
 ammarkpos_function ammarkpos;   /* can be NULL */
 amrestrpos_function amrestrpos; /* can be NULL */
+amskip_function amskip; /* can be NULL */
 
 /* interface functions to support parallel index scans */
 amestimateparallelscan_function amestimateparallelscan;/* can be NULL */
@@ -664,6 +665,14 @@ amrestrpos (IndexScanDesc scan);
 
   
 
+bool
+amskip (IndexScanDesc scan, ScanDirection direction, int prefix);
+
+   TODO
+  
+
+  
+
 Size
 amestimateparallelscan (void);
 
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 14a1aa56cb..5fd9f81f23 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1312,6 +1312,22 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
such cases and allow index-only scans to be generated, but older versions
will not.
   
+
+  
+Index Skip Scans
+
+
+  index
+  index-skip scans
+
+
+  index-skip scan
+
+
+
+  TODO
+
+  
  
 

Re: Index Skip Scan

2018-06-18 Thread Alexander Korotkov
Hi!

On Mon, Jun 18, 2018 at 6:26 PM Jesper Pedersen
 wrote:
> I would like to start a discussion on Index Skip Scan referred to as
> Loose Index Scan in the wiki [1].

Great, I'm glad to see you working on this!

> However, as Robert Haas noted in the thread there are issues with the
> patch as is, especially in relationship to the amcanbackward functionality.
>
> A couple of questions to begin with.
>
> Should the patch continue to "piggy-back" on T_IndexOnlyScan, or should
> a new node (T_IndexSkipScan) be created ? If latter, then there likely
> will be functionality that needs to be refactored into shared code
> between the nodes.

Is skip scan only possible for index-only scans?  I guess not.  We
could also make a plain index scan behave like a skip scan, and it
should be useful for accelerating DISTINCT ON clauses.  Thus, we might
have 4 kinds of index scan: IndexScan, IndexOnlyScan, IndexSkipScan,
IndexOnlySkipScan.  I don't think I like index scan nodes multiplying
this way; it would probably be better to keep the number of nodes
smaller.  But I don't insist on that, and I would like to hear other
opinions too.

> Which is the best way to deal with the amcanbackward functionality ? Do
> people see another alternative to Robert's idea of adding a flag to the
> scan.

Supporting amcanbackward seems to be basically possible, but rather
complicated and not very efficient.  So it seems not worth
implementing, at least in the initial version.  Then the question is
how an index access method should report that it supports both skip
scan and backward scan, but not both together.  What about turning
amcanbackward into a function which takes a (bool skipscan) argument?
That function would then return whether backward scan is supported
depending on whether skip scan is used.
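A standalone sketch of that idea (names are hypothetical; this is not the
actual IndexAmRoutine definition):

```c
#include <stdbool.h>

/* Hypothetical: amcanbackward becomes a function pointer that takes a
 * "skipscan" flag, instead of a plain bool capability field. */
typedef bool (*amcanbackward_function) (bool skipscan);

/* A btree-like AM in this sketch: backward scans are supported, but not
 * in combination with a skip scan. */
static bool
btcanbackward(bool skipscan)
{
    return !skipscan;
}
```

The planner would then call the access method's function with the skipscan
flag, instead of reading a bool field, when deciding whether a backward
scan path is legal.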

> I wasn't planning on making this a patch submission for the July
> CommitFest due to the reasons mentioned above, but can do so if people
> thinks it is best. The patch is based on master/4c8156.

Please, register it on commitfest.  If even there wouldn't be enough
of time for this patch on July commitfest, it's no problem to move it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-18 Thread Andres Freund
On 2018-06-18 13:21:43 -0400, Alvaro Herrera wrote:
> On 2018-Jun-17, Peter Geoghegan wrote:
> 
> > On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera
> >  wrote:
> > > I don't necessarily object to the proposed change, but I think you
> > > should generally wait a bit longer for others to react.
> > 
> > What wait period do you think is appropriate in this case?
> 
> One which includes at least half a working day in a different timezone.
> You asked mid-afternoon on a Friday in a timezone pretty far west.  I
> think you could find people in their offices in Easter Island, but not
> many more.

I think there's also a question of how much a patch is blocking you /
others.  A shorter question period is more understandable if it's step
3/40, rather than 1/1...

Greetings,

Andres Freund



Re: Query Rewrite for Materialized Views (Postgres Extension)

2018-06-18 Thread Corey Huinker
>> Hope it is useful or interesting for someone! Questions or comments are
>> very welcome.
>
> good idea.
>
> Regards
>
> Pavel

In a recent PgConf NYC presentation [1] I was talking about the technical
hurdles to implementing materialized views that could be kept up to date at
all times, and the benefits of having such a thing.

Some use cases can be addressed with eventually-consistent derivative table
structures (Vertica's projections, PipelineDB's continuous views, etc), but
those methods rely on the source data never having deletes or updates, or
confining those updates to the "hot" part of the source tables, so it
generally works for time-series data, but not for other cases.

It has occurred to me that Dave Fetter's work on ASSERTIONS [2] has common
underpinnings with true continuous materialized views. In both cases, the
creation of a system object causes the creations of insert/update/delete
triggers on one or more existing tables. In the case of assertions, those
triggers are run with the goal of raising an error if rows are returned
from a query. In the case of a materialized view, those same triggers would
be used to delete rows from a CMV and insert replacements rows.

If we can get always-up-to-date materialized views, then Denty's work on
query rewrite would have greatly enhanced utility.

[1]
https://postgresconf.org/conferences/2018/program/proposals/a-roadmap-to-continuous-materialized-views-b4644661-8d5a-4186-8c17-4fb82600e147
[2]
http://databasedoings.blogspot.com/2018/06/ive-posted-my-slides-for-my-asssertions.html


Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread Alvaro Herrera
On 2018-Jun-18, Ashutosh Bapat wrote:

> That's a wrong comparison. Inheritance based partitioning works even
> after declarative partitioning feature is added. So, users can
> continue using inheritance based partitioning if they don't want to
> move to declarative partitioning. That's not true here. A user creates
> a BEFORE ROW trigger on each partition that mimics partitioned table
> level BEFORE ROW trigger. The proposed BEFORE ROW trigger on
> partitioned table will cascade the trigger down to each partition. If
> that fails to recognize that there is already an equivalent trigger, a
> partition table will get two triggers doing the same thing silently
> without any warning or notice. That's fine if the trigger changes the
> salaries to $50K but not OK if each of those increases salary by 10%.
> The database has limited ability to recognize what a trigger is doing.

I agree with Robert that nobody in their right minds would be caught by
this problem by adding new triggers without thinking about it and
without testing them.  If you add a partitioned-table-level trigger to
raise salaries by 10% but there was already one in the partition level
that does the same thing, you'll readily notice that salaries have been
increased by 21% instead.

So like Robert I'm inclined to not change the wording in the
documentation.

What does worry me a little bit now, reading this discussion, is whether
we've made the triggers in partitions visible enough.  We'll have this
problem once we implement BEFORE ROW triggers as proposed, and I think
we already have this problem now.  Let's suppose a user creates a
duplicated after trigger:

create table parent (a int) partition by range (a);
create table child partition of parent for values from (0) to (100);
create function noise() returns trigger language plpgsql as $$ begin raise 
notice 'nyaa'; return null; end; $$;
create trigger trig_p after insert on parent for each row execute procedure 
noise();
create trigger trig_c after insert on child for each row execute procedure 
noise();

Now let's try it:

alvherre=# insert into child values (1);
NOTICE:  nyaa
NOTICE:  nyaa
INSERT 0 1

OK, so where does that one come from?

alvherre=# \d child
Tabla «public.child»
 Columna │  Tipo   │ Collation │ Nullable │ Default 
─┼─┼───┼──┼─
 a   │ integer │   │  │ 
Partition of: parent FOR VALUES FROM (0) TO (100)
Triggers:
trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise()

Hmm, there's only one trigger here, why does it appear twice?  To find
out, you have to know where to look:

alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
 tgname │ tgrelid │ tgisinternal 
┼─┼──
 trig_p │ parent  │ f
 trig_p │ child   │ t
 trig_c │ child   │ f
(3 filas)

So there is a trigger in table child, but it's hidden because
tgisinternal.  Of course, you can see it if you look at the parent's
definition:

alvherre=# \d parent
   Tabla «public.parent»
 Columna │  Tipo   │ Collation │ Nullable │ Default 
─┼─┼───┼──┼─
 a   │ integer │   │  │ 
Partition key: RANGE (a)
Triggers:
trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
Number of partitions: 1 (Use \d+ to list them.)

I think it'd be useful to have a list of triggers that have been
inherited from ancestors, or maybe simply a list of internal triggers.

Or maybe this is not something to worry about?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why partition pruning doesn't work?

2018-06-18 Thread Tom Lane
Amit Langote  writes:
> [ 0001-Open-partitioned-tables-during-Append-initialization.patch ]

I took a look at this.  While I'm in agreement with the general idea of
holding open the partitioned relations' relcache entries throughout the
query, I do not like anything about this patch:

* It seems to be outright broken for the case that any of the partitioned
relations are listed in nonleafResultRelations.  If we're going to do it
like this, we have to open every one of the partrels regardless of that.
(I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations
altogether, in favor of merging the related code in InitPlan with this.
That existing code is already a mighty ugly wart, and this patch makes
it worse by adding new, related warts elsewhere.)

* You've got *far* too much intimate knowledge of the possible callers
in ExecOpenAppendPartitionedTables.

Personally, what I would have this function do is return a List of
the opened Relation pointers, and add a matching function to run through
such a List and close the entries again, and make the callers responsible
for stashing the List pointer in an appropriate field in their planstate.
Or maybe what we should do is drop ExecLockNonLeafAppendTables/
ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

regards, tom lane



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to
> counteract that, and possibly improve performance overall, was a
> microvacuum kind of thing that would do some on-demand cleanup to
> remove duplicates or make room before page splits. Since nbtree
> uniqueification enables efficient retail deletions, that could end up
> as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.

-- 
Peter Geoghegan



Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-18 Thread Alvaro Herrera
On 2018-Jun-17, Peter Geoghegan wrote:

> On Sat, Jun 16, 2018 at 8:51 PM, Alvaro Herrera
>  wrote:
> > I don't necessarily object to the proposed change, but I think you
> > should generally wait a bit longer for others to react.
> 
> What wait period do you think is appropriate in this case?

One which includes at least half a working day in a different timezone.
You asked mid-afternoon on a Friday in a timezone pretty far west.  I
think you could find people in their offices in Easter Island, but not
many more.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Tomas Vondra




On 06/18/2018 05:06 PM, Joe Conway wrote:
> On 06/18/2018 10:52 AM, Tom Lane wrote:
>> Robert Haas  writes:
>>> On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:
>>>> Not necessarily. Our pages probably have enough predictable bytes to aid
>>>> cryptanalysis, compared to user data in a column which might not be very
>>>> predicable.
>>
>>> Really?  I would guess that the amount of entropy in a page is WAY
>>> higher than in an individual column value.
>>
>> Depending on the specifics of the encryption scheme, having some
>> amount of known (or guessable) plaintext may allow breaking the
>> cipher, even if much of the plaintext is not known. This is
>> cryptology 101, really.
>
> Exactly
>
>> At the same time, having to have a bunch of
>> independently-decipherable short field values is not real secure
>> either, especially if they're known to all be encrypted with the
>> same key. But what you know or can guess about the plaintext in
>> such cases would be target-specific, rather than an attack that
>> could be built once and used against any PG database.
>
> Again is dependent on the specific solution for encryption. In some
> cases you might do something like generate a single use random key,
> encrypt the payload with that, encrypt the single use key with the
> "global" key, append the two results and store.

Yeah, I suppose we could even have per-page keys, for example.

One topic I haven't seen mentioned in this thread yet is indexes. That's 
a pretty significant side-channel, when built on encrypted columns. Even 
if the indexes are encrypted too, you can often deduce a lot of 
information from them.


So what's the plan here? Disallow indexes on encrypted columns? Index 
encrypted values directly? Something else?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread David G. Johnston
On Mon, Jun 18, 2018 at 9:59 AM, Alvaro Herrera 
wrote:

>
> alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
>  tgname │ tgrelid │ tgisinternal
> ┼─┼──
>  trig_p │ parent  │ f
>  trig_p │ child   │ t
>  trig_c │ child   │ f
> (3 filas)
>
> So there is a trigger in table child, but it's hidden because
> tgisinternal.  Of course, you can see it if you look at the parent's
> definition:
>
> alvherre=# \d parent
>Tabla «public.parent»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─┼─┼───┼──┼─
>  a   │ integer │   │  │
> Partition key: RANGE (a)
> Triggers:
> trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
> Number of partitions: 1 (Use \d+ to list them.)
>
> I think it'd be useful to have a list of triggers that have been
> inherited from ancestors, or maybe simply a list of internal triggers
>
> Or maybe this is not something to worry about?


For the main internal trigger, foreign keys, we don't show the trigger on
the relevant table, but we do indicate its effect by showing the presence of
the foreign key.  We likewise need to show the effect of the inherited
trigger on the child table.  When viewing the output, the display order
should match invocation order (is it that way now?) as a primary goal,
with how the direct or inherited nature is presented depending on that.
I.e., I would like to see separate "Parent Triggers:" and "Triggers:"
sections if possible, but if trig_c is going to be invoked before trig_p
that won't work, and a single "Triggers:" section with "(parent)"
somewhere in the trigger info printout would be preferred.

David J.


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Nico Williams
On Mon, Jun 11, 2018 at 06:22:22PM +0900, Masahiko Sawada wrote:
> As per discussion at PGCon unconference, I think that firstly we need
> to discuss what threats we want to defend database data against. If

We call that a threat model.  There can be many threat models, of
course.

> user wants to defend against a threat that is malicious user who
> logged in OS or database steals an important data on datbase this
> design TDE would not help. Because such user can steal the data by
> getting a memory dump or by SQL. That is of course differs depending
> on system requirements or security compliance but what threats do you
> want to defend database data against? and why?

This design guards (somewhat) against the threat of storage theft
(e.g., because the storage is remote).  It's a fine threat model to
address, but it's also a lot easier to address in the filesystem or
device drivers -- there's no need to do this in PostgreSQL itself except
so as to support it on all platforms regardless of OS capabilities.

Note that unless the pg_catalog is protected against manipulation by
remote storage, then TDE for user tables might be possible to
compromise.  Like so: the attacker manipulates the pg_catalog to
escalate privilege in order to obtain the TDE keys.  This argues for
full database encryption, not just specific tables or columns.  But
again, this is for the threat model where the storage is the threat.

Another similar threat model is dump management, where dumps are sent
off-site where untrusted users might read them, or even edit them in the
hopes that they will be used for restores and thus compromise the
database.  This is most easily addressed by just encrypting the backups
externally to PG.

Threat models where client users are the threat are easily handled by
PG's permissions system.

I think any threat model where DBAs are not the threat is just not that
interesting to address with crypto within postgres itself...

Encryption to public keys for which postgres does not have private keys
would be one way to address DBAs-as-the-threat, but this is easily done
with an extension...  A small amount of syntactic sugar might help:

  CREATE ROLE "bar" WITH (PUBLIC KEY "...");

  CREATE TABLE foo (
name TEXT PRIMARY KEY,
payload TEXT ENCRYPTED TO ROLE "bar" BOUND TO name
  );

but this is just syntactic sugar, so not that valuable.  On the other
hand, just a bit of syntactic sugar can help tick a feature checkbox,
which might be very valuable for marketing reasons even if it's not
valuable for any other reason.

Note that encrypting the payload without a binding to the PK (or similar
name) is very dangerous!  So the encryption option would have to support
some way to indicate what other plaintext to bind in (here the "name"
column).
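As a rough illustration of that binding, using pgcrypto's pgp_pub_encrypt()
(the foo table is from the sketch above; the ':' separator convention and
the key handling are assumptions of this example):

```sql
-- Bind the primary key into the plaintext before encrypting, so that a
-- ciphertext copied into another row fails validation on decryption.
-- :'pub_key' is a psql variable holding an ASCII-armored public key.
INSERT INTO foo (name, payload)
VALUES ('acct-42',
        armor(pgp_pub_encrypt('acct-42' || ':' || 'the secret payload',
                              dearmor(:'pub_key'))));
```

On decryption (client-side, wherever the private key lives), the prefix of
the recovered plaintext must match the row's name column before the value
is trusted.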

Note also that for key management reasons it would be necessary to be
able to write the payload as ciphertext rather than as to-be-encrypted
TEXT.

Lastly, for a symmetric encryption option one would need a remote oracle
to do the encryption, which seems rather complicated, but in some cases
may well perform faster.

Nico
-- 



Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 10:21 AM, Alvaro Herrera
 wrote:
> One which includes at least half a working day in a different timezone.
> You asked mid-afternoon on a Friday in a timezone pretty far west.

It was 11 am PST.

I'll make a note about this. It won't happen again.

-- 
Peter Geoghegan



Re: Removing "Included attributes in B-tree indexes" section from docs

2018-06-18 Thread Robert Haas
On Mon, Jun 18, 2018 at 1:31 PM, Andres Freund  wrote:
> I think there's also a question of how much a patch is blocking you /
> others.  A shorter question period is more understandable if it's step
> 3/40, rather than 1/1...

Agreed.  For non-critical stuff like this it seems like waiting 2 or 3
business days is a good idea.

Doesn't sound like there's any actual controversy in this case, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Claudio Freire
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  
> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently bt_vacuum_one_page does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Joe Conway
On 06/18/2018 10:52 AM, Tom Lane wrote:
> Robert Haas  writes:
>> On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:
>>> Not necessarily. Our pages probably have enough predictable bytes to aid
>>> cryptanalysis, compared to user data in a column which might not be very
>>> predicable.
> 
>> Really?  I would guess that the amount of entropy in a page is WAY
>> higher than in an individual column value.
> 
> Depending on the specifics of the encryption scheme, having some amount
> of known (or guessable) plaintext may allow breaking the cipher, even
> if much of the plaintext is not known.  This is cryptology 101, really.

Exactly

> At the same time, having to have a bunch of independently-decipherable
> short field values is not real secure either, especially if they're known
> to all be encrypted with the same key.  But what you know or can guess
> about the plaintext in such cases would be target-specific, rather than
> an attack that could be built once and used against any PG database.

Again is dependent on the specific solution for encryption. In some
cases you might do something like generate a single use random key,
encrypt the payload with that, encrypt the single use key with the
"global" key, append the two results and store.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:
>> Not necessarily. Our pages probably have enough predictable bytes to aid
>> cryptanalysis, compared to user data in a column which might not be very
>> predicable.

> Really?  I would guess that the amount of entropy in a page is WAY
> higher than in an individual column value.

Depending on the specifics of the encryption scheme, having some amount
of known (or guessable) plaintext may allow breaking the cipher, even
if much of the plaintext is not known.  This is cryptology 101, really.

At the same time, having to have a bunch of independently-decipherable
short field values is not real secure either, especially if they're known
to all be encrypted with the same key.  But what you know or can guess
about the plaintext in such cases would be target-specific, rather than
an attack that could be built once and used against any PG database.

regards, tom lane



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread Robert Haas
On Mon, Jun 18, 2018 at 1:20 AM, Ashutosh Bapat
 wrote:
> That's a wrong comparison. Inheritance based partitioning works even
> after declarative partitioning feature is added. So, users can
> continue using inheritance based partitioning if they don't want to
> move to declarative partitioning. That's not true here. A user creates
> a BEFORE ROW trigger on each partition that mimics partitioned table
> level BEFORE ROW trigger. The proposed BEFORE ROW trigger on
> partitioned table will cascade the trigger down to each partition. If
> that fails to recognize that there is already an equivalent trigger, a
> partition table will get two triggers doing the same thing silently
> without any warning or notice. That's fine if the trigger changes the
> salaries to $50K but not OK if each of those increases salary by 10%.
> The database has limited ability to recognize what a trigger is doing.

I don't even know what to say about that.  You are correct that if a
user creates a new trigger on a partitioned table and doesn't check
what happens to any existing triggers that they already have, bad
things might happen.  But that seems like a pretty stupid thing to do.
If you make *any* kind of critical change to your production database
without testing it, bad things may happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Robert Haas
On Wed, Jun 13, 2018 at 9:20 AM, Joe Conway  wrote:
>> Also, if I understand correctly, at unconference session there also
>> were two suggestions about the design other than the suggestion by
>> Alexander: implementing TDE at column level using POLICY, and
>> implementing TDE at table-space level. The former was suggested by Joe
>> but I'm not sure the detail of that suggestion. I'd love to hear the
>> deal of that suggestion.
>
> The idea has not been extensively fleshed out yet, but the thought was
> that we create column level POLICY, which would transparently apply some
> kind of transform on input and/or output. The transforms would
> presumably be expressions, which in turn could use functions (extension
> or builtin) to do their work. That would allow encryption/decryption,
> DLP (data loss prevention) schemes (masking, redacting), etc. to be
> applied based on the policies.

It seems to me that column-level encryption is a lot less secure than
block-level encryption.  I am supposing here that the attack vector is
stealing the disk.  If all you've got is a bunch of 8192-byte blocks,
it's unlikely you can infer much about the contents.  You know the
size of the relations and that's probably about it.  If you've got
individual values being encrypted, then there's more latitude to
figure stuff out.  You can infer something about the length of
particular values.  Perhaps you can find cases where the same
encrypted value appears multiple times.  If there's a btree index, you
know the ordering of the values under whatever ordering semantics
apply to that index.  It's unclear to me how useful such information
would be in practice or to what extent it might allow you to attack
the underlying cryptography, but it seems like there might be cases
where the information leakage is significant.  For example, suppose
you're trying to determine which partially-encrypted record is that of
Aaron Aardvark... or this guy:
https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff,_Sr.

Recently, it was suggested to me that a use case for column-level
encryption might be to prevent casual DBA snooping.  So, you'd want
the data to appear in pg_dump output encrypted, because the DBA might
otherwise look at it, but you wouldn't really be concerned about the
threat of the DBA loading a hostile C module that would steal user
keys and use them to decrypt all the data, because they don't care
that much and would be fired if they were caught doing it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-18 Thread Robert Haas
On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:
> Not necessarily. Our pages probably have enough predictable bytes to aid
> cryptanalysis, compared to user data in a column which might not be very
> predictable.

Really?  I would guess that the amount of entropy in a page is WAY
higher than in an individual column value.

> But in any case most encryption schemes use a random nonce (salt) to
> ensure two identical strings do not encrypt to the same result. And
> often the encrypted length is padded, so while you might be able to
> infer short versus long, you would not usually be able to infer the
> exact plaintext length.

Sure, that could be done, although it means that equality comparisons
must be done unencrypted.
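(The salting behavior is easy to see with the pgcrypto extension, which works
exactly this way -- each call generates a random session key/salt, so equal
plaintexts never produce equal ciphertexts:

```sql
-- pgcrypto illustration: a random session key/salt means two
-- encryptions of the same plaintext yield different ciphertexts.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT pgp_sym_encrypt('50000', 'secret') = pgp_sym_encrypt('50000', 'secret');
-- returns false
```

which is also why such a column can't be indexed or compared while encrypted.)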

> Again completely dependent on the extension you use to do the encryption
> for the input policy. The keys don't need to be stored with the data,
> and the decryption can be transparent only for certain users or if
> certain session variables exist which the DBA does not have access to.

Not arguing with that.  And to be clear, I'm not trying to attack your
proposal.  I'm just trying to have a discussion about advantages and
disadvantages of different approaches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ON CONFLICT DO NOTHING on pg_dump

2018-06-18 Thread Nico Williams
On Fri, Jun 15, 2018 at 02:20:21AM +, Ideriha, Takeshi wrote:
> >From: Nico Williams [mailto:n...@cryptonector.com]
> >On Tue, Jun 12, 2018 at 09:05:23AM +, Ideriha, Takeshi wrote:
> >> Only the difference of data can be restored.
> >
> >But that's additive-only.  Only missing rows are restored this way, and 
> >differences are
> >not addressed.
> >
> >If you want restore to restore data properly and concurrently (as opposed to 
> >renaming
> >a new database into place or whatever) then you'd want a) MERGE, b) dump to
> >generate MERGE statements.  A concurrent data restore operation would be 
> >rather
> >neat.
> 
> I agree with you, though supporting MERGE or ON-CONFLICT-DO-UPDATE seems like
> hard work.
> The use case for only ON-CONFLICT-DO-NOTHING may be narrow.

Is it narrow, or is it just easy enough to add quickly?

And by the way, you don't need MERGE.  You can just generate INSERT/
UPDATE/DELETE statements -- MERGE is mainly an optimization on that, and
could wait until PG has a MERGE.
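For illustration, dump output under such an option might look like the
following (table and values invented); rows whose keys already exist in the
target are silently skipped, which is the additive-only restore discussed
above:

```sql
-- Hypothetical pg_dump output with an ON CONFLICT DO NOTHING option:
INSERT INTO public.accounts (id, balance) VALUES
    (1, 100),
    (2, 250)
ON CONFLICT DO NOTHING;
```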

Nico
-- 



Re: Invisible Indexes

2018-06-18 Thread Andrew Dunstan




On 06/18/2018 05:44 PM, Peter Geoghegan wrote:

On Mon, Jun 18, 2018 at 2:36 PM, Andrew Dunstan
 wrote:

This is a MySQL feature, where an index is not considered by the planner.
Implementing it should be fairly straightforward, adding a new boolean to
pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would
become a new unreserved keyword.
So, do we want this feature? If we do I'll go ahead and prepare a patch.

I know that it's definitely a feature that I want.



Well, that's encouraging ;-)


Haven't thought
about the syntax, though.





I envisioned:

CREATE INDEX  [NOT VISIBLE] ...;
ALTER INDEX ... [SET [NOT] VISIBLE] ...;


Let the bikeshedding begin :-)

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Invisible Indexes

2018-06-18 Thread Jaime Casanova
On 18 June 2018 at 16:36, Andrew Dunstan  wrote:
>
> This is a MySQL feature, where an index is not considered by the planner.
> Implementing it should be fairly straightforward, adding a new boolean to
> pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would
> become a new unreserved keyword.
>
> The most obvious use case is to see what the planner does when the index is
> not visible, for example which other index(es) it might use. There are
> probably other cases where we might want an index to enforce a constraint
> but not to be used in query planning.
>
> So, do we want this feature? If we do I'll go ahead and prepare a patch.
>

should pg_index.indisvalid work for this? In that case you only need
the syntax for it...
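(For reference, the manual catalog hack this alludes to looks something like
the following -- superuser-only, unsafe on production systems, index name
invented. The planner ignores an invalid index, but it is still maintained on
writes:

```sql
-- Hide an index from the planner by marking it invalid (illustrative
-- only; direct updates of pg_index are not a supported operation).
UPDATE pg_index SET indisvalid = false
WHERE indexrelid = 'my_big_index'::regclass;

-- ... run EXPLAIN on the workload to see plans without the index ...

UPDATE pg_index SET indisvalid = true
WHERE indexrelid = 'my_big_index'::regclass;
```

so proper syntax would mostly be formalizing that escape hatch.)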

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Invisible Indexes

2018-06-18 Thread Andrew Dunstan




On 06/18/2018 05:46 PM, Jaime Casanova wrote:

On 18 June 2018 at 16:36, Andrew Dunstan  wrote:

This is a MySQL feature, where an index is not considered by the planner.
Implementing it should be fairly straightforward, adding a new boolean to
pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE would
become a new unreserved keyword.

The most obvious use case is to see what the planner does when the index is
not visible, for example which other index(es) it might use. There are
probably other cases where we might want an index to enforce a constraint
but not to be used in query planning.

So, do we want this feature? If we do I'll go ahead and prepare a patch.


should pg_index.indisvalid work for this? In that case you only need
the syntax for it...




I thought about that. But I think these are more or less orthogonal.  I 
doubt it will involve lots of extra code, though.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Invisible Indexes

2018-06-18 Thread Tom Lane
Andrew Dunstan  writes:
> This is a MySQL feature, where an index is not considered by the 
> planner. Implementing it should be fairly straightforward, adding a new 
> boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I 
> guess VISIBLE would become a new unreserved keyword.

> The most obvious use case is to see what the planner does when the index 
> is not visible, for example which other index(es) it might use. There 
> are probably other cases where we might want an index to enforce a 
> constraint but not to be used in query planning.

Traditionally the way to do the former is

begin;
drop index unwanted;
explain ;
rollback;

Admittedly, this isn't great in a production environment, but neither
would be disabling the index in the way you suggest.

I think the actually desirable way to handle this sort of thing is through
an "index advisor" sort of plugin, which can hide a given index from the
planner without any globally visible side-effects.

I'm not sure about the "enforce constraint only" argument --- that
sounds like a made-up use-case to me.  It's pretty hard to imagine
a case where a unique index applies to a query and yet you don't want
to use it.

> So, do we want this feature? If we do I'll go ahead and prepare a patch.

On the whole I'm not excited about it, at least not with this approach.
Have you considered an extension or GUC with only local side effects?

regards, tom lane



Re: Invisible Indexes

2018-06-18 Thread Andres Freund
On 2018-06-18 17:50:44 -0400, Andrew Dunstan wrote:
> 
> 
> On 06/18/2018 05:46 PM, Jaime Casanova wrote:
> > On 18 June 2018 at 16:36, Andrew Dunstan  
> > wrote:
> > > This is a MySQL feature, where an index is not considered by the planner.
> > > Implementing it should be fairly straightforward, adding a new boolean to
> > > pg_index, and options to CREATE INDEX and ALTER INDEX. I guess VISIBLE 
> > > would
> > > become a new unreserved keyword.
> > > 
> > > The most obvious use case is to see what the planner does when the index 
> > > is
> > > not visible, for example which other index(es) it might use. There are
> > > probably other cases where we might want an index to enforce a constraint
> > > but not to be used in query planning.
> > > 
> > > So, do we want this feature? If we do I'll go ahead and prepare a patch.
> > > 
> > should pg_index.indisvalid work for this? In that case you only need
> > the syntax for it...
> > 
> 
> 
> I thought about that. But I think these are more or less orthogonal.  I
> doubt it will involve lots of extra code, though.

Be careful about that - currently it's not actually trivially possible
to ever update pg_index rows. No, I'm not kidding
you. pg_index.indexcheckxmin relies on the pg_index row's xmin. If you
have ALTER do a non-in-place update, you'll break things.

Greetings,

Andres Freund



Re: Invisible Indexes

2018-06-18 Thread Andres Freund
Hi,

On 2018-06-18 17:57:04 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > This is a MySQL feature, where an index is not considered by the 
> > planner. Implementing it should be fairly straightforward, adding a new 
> > boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I 
> > guess VISIBLE would become a new unreserved keyword.
> 
> > The most obvious use case is to see what the planner does when the index 
> > is not visible, for example which other index(es) it might use. There 
> > are probably other cases where we might want an index to enforce a 
> > constraint but not to be used in query planning.
> 
> Traditionally the way to do the former is
> 
> begin;
> drop index unwanted;
> explain ;
> rollback;
> 
> Admittedly, this isn't great in a production environment, but neither
> would be disabling the index in the way you suggest.

Yea, I don't think a global action - which'll at least take something
like a share-update-exclusive lock - is a suitable approach for this
kinda thing.


> I think the actually desirable way to handle this sort of thing is through
> an "index advisor" sort of plugin, which can hide a given index from the
> planner without any globally visible side-effects.

Although I'm a bit doubtful that just shoving this into an extension is
really sufficient. This is an extremely common task.

Greetings,

Andres Freund



Re: Invisible Indexes

2018-06-18 Thread Andres Freund
On 2018-06-18 18:05:11 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-06-18 17:57:04 -0400, Tom Lane wrote:
> >> I think the actually desirable way to handle this sort of thing is through
> >> an "index advisor" sort of plugin, which can hide a given index from the
> >> planner without any globally visible side-effects.
> 
> > Although I'm a bit doubtful that just shoving this into an extension is
> > really sufficient. This is an extremely common task.
> 
> Well, what I was thinking about was that this functionality already
> exists (I think) in one or more "index advisor" plugins.

They're doing the opposite, right? I.e. they return "hypothetical
indexes", which then can be used by the planner. None of the ones I've
seen currently mask out an existing index.


> It's possible that they've all bit-rotted for lack of support, which
> would not speak highly of the demand for the feature.

IDK, the DBA / developer crowd hitting issues like this isn't the same
as the crowd willing to update an external plugin that doesn't even do
quite what you want, and was more experimental than anything.

Greetings,

Andres Freund



Re: Invisible Indexes

2018-06-18 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane  wrote:
>> I think the actually desirable way to handle this sort of thing is through
>> an "index advisor" sort of plugin, which can hide a given index from the
>> planner without any globally visible side-effects.

> The globally visible side-effects are the point, though. Some users
> desire cheap insurance against dropping what turns out to be the wrong
> index.

Perhaps there are use-cases where you want globally visible effects,
but the primary use-case Andrew cited (i.e. EXPLAIN experimentation)
would not want that.

Anyway, if we do it with a GUC, the user can control the scope of
the effects.

regards, tom lane



Re: Invisible Indexes

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 3:12 PM, Tom Lane  wrote:
> Perhaps there are use-cases where you want globally visible effects,
> but the primary use-case Andrew cited (i.e. EXPLAIN experimentation)
> would not want that.
>
> Anyway, if we do it with a GUC, the user can control the scope of
> the effects.

I had imagined that those use cases would be the most common. Dropping
an index in production because it very much looks like it is unused is
always a bit nerve-wracking in my experience. It's often hard to be
100% sure due to factors like replicas, the possible loss of statistics
collector stats masking a problem, the possibility that there are very
important queries that do use the index but are only run very
infrequently, and so on.

-- 
Peter Geoghegan



Re: Invisible Indexes

2018-06-18 Thread Tom Lane
Andrew Dunstan  writes:
> On 06/18/2018 06:12 PM, Tom Lane wrote:
>> Anyway, if we do it with a GUC, the user can control the scope of
>> the effects.

> Yeah, but Peter makes the case that people want it for global 
> experimentation. "We think we can safely drop this humungous index that 
> would take us days to rebuild, but before we do let's make it invisible 
> and run for a few days just to make sure." I guess we could do that with 
> a GUC, but it seems ugly.

I find it hard to believe that it's uglier than what you suggested...
and it also does more, and is easier to implement.

regards, tom lane



Invisible Indexes

2018-06-18 Thread Andrew Dunstan



This is a MySQL feature, where an index is not considered by the 
planner. Implementing it should be fairly straightforward, adding a new 
boolean to pg_index, and options to CREATE INDEX and ALTER INDEX. I 
guess VISIBLE would become a new unreserved keyword.


The most obvious use case is to see what the planner does when the index 
is not visible, for example which other index(es) it might use. There 
are probably other cases where we might want an index to enforce a 
constraint but not to be used in query planning.


So, do we want this feature? If we do I'll go ahead and prepare a patch.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Transform for pl/perl

2018-06-18 Thread Tom Lane
I wrote:
> The remaining unresolved issue in this thread is Ilmari's suggestion
> that we should convert integers to Perl IV (or UV?) if they fit, rather
> than always convert to NV as now.

Oh ... after re-reading the thread I realized there was one other point
that we'd all forgotten about, namely the business about ~0 getting
converted to -1 rather than what Perl interprets it as.  Ilmari sent in
a patch for that, against which I'd raised two complaints:

1. Possible compiler complaints about a constant-false comparison,
on machines where type UV is 32 bits.

2. Need for secondary expected-output files, which'd be a pain to
maintain.

I realized that point 1 could be dealt with just by not trying to be
smart, but always using the convert-to-text code path.  Given that it
seems to be hard to produce a UV value in Perl, I doubt it is worth
working any harder than that.  Also, point 2 could be dealt with in
this perhaps crude way:

-- this might produce either 18446744073709551615 or 4294967295
SELECT testUVToJsonb() IN ('18446744073709551615'::jsonb, '4294967295'::jsonb);

Pushed with those adjustments.

regards, tom lane



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan  wrote:
> On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov
>  wrote:
>> Patch '0001-retail-indextuple-deletion' introduce new function
>> amtargetdelete() in access method interface. Patch
>> '0002-quick-vacuum-strategy' implements this function for an alternative
>> strategy of lazy index vacuum, called 'Quick Vacuum'.
>
> My compiler shows the following warnings:

Some real feedback:

What we probably want to end up with here is new lazyvacuum.c code
that does processing for one heap page (and associated indexes) that
is really just a "partial" lazy vacuum. Though it won't do things like
advance relfrozenxid, it will do pruning for the heap page, index
tuple killing, and finally heap tuple killing. It will do all of these
things reliably, just like traditional lazy vacuum. This will be what
your background worker eventually uses.

I doubt that you can use routines like index_beginscan() within
bttargetdelete() at all. I think that you need something closer to
_bt_doinsert() or _bt_pagedel(), that manages its own scan (your code
should probably live in nbtpage.c). It does not make sense to teach
external, generic routines like index_beginscan() about heap TID being
an implicit final index attribute, which is one reason for this (I'm
assuming that this patch relies on my own patch).  Another reason is
that you need to hold an exclusive buffer lock at the point that you
identify the tuple to be killed, until the point that you actually
kill it. You don't do that now.

IOW, the approach you've taken in bttargetdelete() isn't quite correct
because you imagine that it's okay to occasionally "lose" the index
tuple that you originally found, and just move on. That needs to be
100% reliable, or else we'll end up with index tuples that point to
the wrong heap tuples in rare cases with concurrent insertions. As I
said, we want a "partial" lazy vacuum here, which must mean that it's
reliable. Note that _bt_pagedel() actually calls _bt_search() when it
deletes a page. Your patch will not be the first patch that makes
nbtree vacuuming do an index scan. You should be managing your own
insertion scan key, much like _bt_pagedel() does. If you use my patch,
_bt_search() can be taught to look for a specific heap TID.

Finally, doing things this way would let you delete multiple
duplicates in one shot, as I described in an earlier e-mail. Only a
single descent of the tree is needed to delete quite a few index
tuples, provided that they all happen to be logical duplicates. Again,
your background worker will take advantage of this.

This code does not follow the Postgres style:

> -   else
> +   }
> +   else {
> +   if (rootoffnum != latestdead)
> +   heap_prune_record_unused(prstate, latestdead);
> heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
> +   }
> }

Please be more careful about that. I find it very distracting.

-- 
Peter Geoghegan



Re: Invisible Indexes

2018-06-18 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-18 17:57:04 -0400, Tom Lane wrote:
>> I think the actually desirable way to handle this sort of thing is through
>> an "index advisor" sort of plugin, which can hide a given index from the
>> planner without any globally visible side-effects.

> Although I'm a bit doubtful that just shoving this into an extension is
> really sufficient. This is an extremely common task.

Well, what I was thinking about was that this functionality already
exists (I think) in one or more "index advisor" plugins.  It's possible
that they've all bit-rotted for lack of support, which would not speak
highly of the demand for the feature.  But if we feel this is worth
pulling into core, I think something along the lines of a GUC listing
indexes to ignore for planning purposes might be a better design.
It'd certainly dodge the issues you mentioned about lack of mutability
of pg_index entries.
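
Usage might then look like this (GUC name invented for illustration):

```sql
-- Hypothetical session-level GUC hiding indexes from the planner,
-- with no catalog changes and no effect on other sessions:
SET ignored_indexes = 'sales_idx_old, sales_idx_huge';
EXPLAIN SELECT * FROM sales WHERE region = 'EU';  -- planned without them
RESET ignored_indexes;
```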

regards, tom lane



Re: Invisible Indexes

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane  wrote:
> Admittedly, this isn't great in a production environment, but neither
> would be disabling the index in the way you suggest.
>
> I think the actually desirable way to handle this sort of thing is through
> an "index advisor" sort of plugin, which can hide a given index from the
> planner without any globally visible side-effects.

The globally visible side-effects are the point, though. Some users
desire cheap insurance against dropping what turns out to be the wrong
index.

FWIW, this isn't just a MySQL feature. Oracle has a similar feature.

-- 
Peter Geoghegan



Re: Invisible Indexes

2018-06-18 Thread Julien Rouhaud
On Tue, Jun 19, 2018 at 12:05 AM, Tom Lane  wrote:
>
> Well, what I was thinking about was that this functionality already
> exists (I think) in one or more "index advisor" plugins.  It's possible
> that they've all bit-rotted for lack of support, which would not speak
> highly of the demand for the feature.  But if we feel this is worth
> pulling into core, I think something along the lines of a GUC listing
> indexes to ignore for planning purposes might be a better design.
> It'd certainly dodge the issues you mentioned about lack of mutability
> of pg_index entries.

I know only one extension which does exactly that:
https://github.com/postgrespro/plantuner

It seems that it's still maintained.



Re: Invisible Indexes

2018-06-18 Thread Andrew Dunstan




On 06/18/2018 06:12 PM, Tom Lane wrote:

Peter Geoghegan  writes:

On Mon, Jun 18, 2018 at 2:57 PM, Tom Lane  wrote:

I think the actually desirable way to handle this sort of thing is through
an "index advisor" sort of plugin, which can hide a given index from the
planner without any globally visible side-effects.

The globally visible side-effects are the point, though. Some users
desire cheap insurance against dropping what turns out to be the wrong
index.

Perhaps there are use-cases where you want globally visible effects,
but the primary use-case Andrew cited (i.e. EXPLAIN experimentation)
would not want that.

Anyway, if we do it with a GUC, the user can control the scope of
the effects.





Yeah, but Peter makes the case that people want it for global 
experimentation. "We think we can safely drop this humungous index that 
would take us days to rebuild, but before we do let's make it invisible 
and run for a few days just to make sure." I guess we could do that with 
a GUC, but it seems ugly.


To Andres' point about the fragility of pg_index, maybe we'd need a 
separate catalog (pg_invisible_index)?


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread Robert Treat
On Mon, Jun 18, 2018 at 12:59 PM, Alvaro Herrera
 wrote:
> On 2018-Jun-18, Ashutosh Bapat wrote:
>
>> That's a wrong comparison. Inheritance based partitioning works even
>> after declarative partitioning feature is added. So, users can
>> continue using inheritance based partitioning if they don't want to
>> move to declarative partitioning. That's not true here. A user creates
>> a BEFORE ROW trigger on each partition that mimics partitioned table
>> level BEFORE ROW trigger. The proposed BEFORE ROW trigger on
>> partitioned table will cascade the trigger down to each partition. If
>> that fails to recognize that there is already an equivalent trigger, a
>> partition table will get two triggers doing the same thing silently
>> without any warning or notice. That's fine if the trigger changes the
>> salaries to $50K but not OK if each of those increases salary by 10%.
>> The database has limited ability to recognize what a trigger is doing.
>
> I agree with Robert that nobody in their right minds would be caught by
> this problem by adding new triggers without thinking about it and
> without testing them.  If you add a partitioned-table-level trigger to
> raise salaries by 10% but there was already one in the partition level
> that does the same thing, you'll readily notice that salaries have been
> increased by 21% instead.
>
> So like Robert I'm inclined to not change the wording in the
> documentation.
>
> What does worry me a little bit now, reading this discussion, is whether
> we've made the triggers in partitions visible enough.  We'll have this
> problem once we implement BEFORE ROW triggers as proposed, and I think
> we already have this problem now.  Let's suppose a user creates a
> duplicated after trigger:
>
> create table parent (a int) partition by range (a);
> create table child partition of parent for values from (0) to (100);
> create function noise() returns trigger language plpgsql as $$ begin raise 
> notice 'nyaa'; return null; end; $$;
> create trigger trig_p after insert on parent for each row execute procedure 
> noise();
> create trigger trig_c after insert on child for each row execute procedure 
> noise();
>
> Now let's try it:
>
> alvherre=# insert into child values (1);
> NOTICE:  nyaa
> NOTICE:  nyaa
> INSERT 0 1
>
> OK, so where does that one come from?
>
> alvherre=# \d child
> Tabla «public.child»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─┼─┼───┼──┼─
>  a   │ integer │   │  │
> Partition of: parent FOR VALUES FROM (0) TO (100)
> Triggers:
> trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise()
>
> Hmm, there's only one trigger here, why does it appear twice?  To find
> out, you have to know where to look:
>
> alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
>  tgname │ tgrelid │ tgisinternal
> ┼─┼──
>  trig_p │ parent  │ f
>  trig_p │ child   │ t
>  trig_c │ child   │ f
> (3 filas)
>
> So there is a trigger in table child, but it's hidden because
> tgisinternal.  Of course, you can see it if you look at the parent's
> definition:
>
> alvherre=# \d parent
>Tabla «public.parent»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─┼─┼───┼──┼─
>  a   │ integer │   │  │
> Partition key: RANGE (a)
> Triggers:
> trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
> Number of partitions: 1 (Use \d+ to list them.)
>
> I think it'd be useful to have a list of triggers that have been
> inherited from ancestors, or maybe simply a list of internal triggers
>
> Or maybe this is not something to worry about?
>

One of the things I was thinking about while reading this thread is
that the scenario of creating "duplicate" triggers on a table, meaning
two triggers doing the same thing, is entirely possible now, but we
don't really do anything to prevent it, which is OK. I've never seen
much merit in the argument "people should test" (it assumes a certain
infallibility that just isn't true) but we've also generally been
pretty good about exposing what is going on so people can debug this
type of unexpected behavior.

So +1 for thinking we do need to worry about it. I'm not exactly sure
how we want to expose that info; with \d+ we list various "Partition
X:" sections, perhaps adding one for "Partition triggers:" would be
enough, although I am inclined to think it needs exposure at the \d
level. One other thing to consider is firing order of said triggers...
if all parent level triggers fire before child level triggers then the
above separation is straightforward, but if the execution order is, as
I suspect, mixed, then it becomes more complicated.


Robert Treat
http://xzilla.net



Re: Invisible Indexes

2018-06-18 Thread Robert Treat
On Mon, Jun 18, 2018 at 6:11 PM, Andres Freund  wrote:
> On 2018-06-18 18:05:11 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>> > On 2018-06-18 17:57:04 -0400, Tom Lane wrote:
>> >> I think the actually desirable way to handle this sort of thing is through
>> >> an "index advisor" sort of plugin, which can hide a given index from the
>> >> planner without any globally visible side-effects.
>>
>> > Although I'm a bit doubtful that just shoving this into an extension is
>> > really sufficient. This is an extremely common task.
>>
>> Well, what I was thinking about was that this functionality already
>> exists (I think) in one or more "index advisor" plugins.
>
> They're doing the opposite, right? I.e. they return "hypothetical
> indexes", which then can be used by the planner. None of the ones I've
> seen currently mask out an existing index.
>
>
>> It's possible that they've all bit-rotted for lack of support, which
>> would not speak highly of the demand for the feature.
>
> IDK, the DBA / developer crowd hitting issues like this isn't the same
> as the crowd willing to update an external plugin that doesn't even do
> quite what you want, and was more experimental than anything.
>

Indeed. ISTR a conversation I had with someone on slack earlier this
year about the validity of just manually updating indisvalid as a
means for determining if an index could be safely removed (for the
record, I did not recommend it ;-)

DBAs are often willing to take a weedwhacker to things in SQL when the
alternative is to learn C.


Robert Treat
http://xzilla.net



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 4:05 PM, Peter Geoghegan  wrote:
> Finally, doing things this way would let you delete multiple
> duplicates in one shot, as I described in an earlier e-mail. Only a
> single descent of the tree is needed to delete quite a few index
> tuples, provided that they all happen to be logical duplicates. Again,
> your background worker will take advantage of this.

BTW, when you do this you should make sure that there is only one call
to _bt_delitems_vacuum(), so that there aren't too many WAL records.
Actually, that's not quite correct -- there should be one
_bt_delitems_vacuum() call *per leaf page* per bttargetdelete() call,
which is slightly different. There should rarely be more than one or
two calls to _bt_delitems_vacuum() in total, because your background
worker is only going to delete one heap page's duplicates per
bttargetdelete() call, and because there will be locality/correlation
with my TID order patch.
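
To make the batching concrete, here is a toy Python sketch of the grouping
step (illustrative only -- the names and structures are simplified stand-ins,
not the actual B-tree code): collect the matching index entries, bucket them
by leaf page, and issue one batched deletion (hence one WAL record) per page
instead of one per tuple.

```python
from collections import defaultdict

def batch_deletes(items):
    """items: iterable of (leaf_page, item_offset) pairs slated for deletion."""
    by_page = defaultdict(list)
    for leaf_page, offset in items:
        by_page[leaf_page].append(offset)
    calls = []
    for leaf_page, offsets in sorted(by_page.items()):
        # one _bt_delitems_vacuum()-style call (one WAL record) per leaf page
        calls.append((leaf_page, sorted(offsets)))
    return calls

# four tuples to delete, but only two batched calls
calls = batch_deletes([(7, 3), (7, 1), (9, 2), (7, 8)])
```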

-- 
Peter Geoghegan



Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-18 Thread Michael Paquier
On Mon, Jun 18, 2018 at 10:57:34AM +0900, Michael Paquier wrote:
> As there is visibly a consensus for HEAD, for now I propose to just
> process with the previous patch on only the master branch then.  This
> will address the open item.  Any objections to that?

As there is a consensus at least on this part, I have just pushed the
patch to only HEAD.  I'll notify buildfarm members which see failures in
case those link to OpenSSL 1.0.1 or older.
--
Michael


signature.asc
Description: PGP signature


Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread Amit Langote
Hi.

On 2018/06/19 1:59, Alvaro Herrera wrote:
> What does worry me a little bit now, reading this discussion, is whether
> we've made the triggers in partitions visible enough.  We'll have this
> problem once we implement BEFORE ROW triggers as proposed, and I think
> we already have this problem now.  Let's suppose a user creates a
> duplicated after trigger:
> 
> create table parent (a int) partition by range (a);
> create table child partition of parent for values from (0) to (100);
> create function noise() returns trigger language plpgsql as $$ begin raise 
> notice 'nyaa'; return null; end; $$;
> create trigger trig_p after insert on parent for each row execute procedure 
> noise();
> create trigger trig_c after insert on child for each row execute procedure 
> noise();
> 
> Now let's try it:
> 
> alvherre=# insert into child values (1);
> NOTICE:  nyaa
> NOTICE:  nyaa
> INSERT 0 1
> 
> OK, so where does that one come from?
> 
> alvherre=# \d child
> Tabla «public.child»
>  Columna │  Tipo   │ Collation │ Nullable │ Default 
> ─────────┼─────────┼───────────┼──────────┼─────────
>  a       │ integer │           │          │ 
> Partition of: parent FOR VALUES FROM (0) TO (100)
> Triggers:
> trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise()
> 
> Hmm, there's only one trigger here, why does it appear twice?  To find
> out, you have to know where to look:
> 
> alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
>  tgname │ tgrelid │ tgisinternal 
> ────────┼─────────┼──────────────
>  trig_p │ parent  │ f
>  trig_p │ child   │ t
>  trig_c │ child   │ f
> (3 filas)
> 
> So there is a trigger in table child, but it's hidden because
> tgisinternal.  Of course, you can see it if you look at the parent's
> definition:
> 
> alvherre=# \d parent
>Tabla «public.parent»
>  Columna │  Tipo   │ Collation │ Nullable │ Default 
> ─────────┼─────────┼───────────┼──────────┼─────────
>  a       │ integer │           │          │ 
> Partition key: RANGE (a)
> Triggers:
> trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
> Number of partitions: 1 (Use \d+ to list them.)
> 
> I think it'd be useful to have a list of triggers that have been
> inherited from ancestors, or maybe simply a list of internal triggers

In CreateTrigger(), 86f575948c7 did this.

-values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal);
+values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal ||
in_partition);

I'm not sure why it had to be done, but undoing this change doesn't seem
to break any regression tests (neither those added by 86f575948c7 nor of
the partitioned table foreign key commit).  Did we really intend to change
the meaning of tgisinternal like this here?

Thanks,
Amit




Re: Index Skip Scan

2018-06-18 Thread Michael Paquier
On Tue, Jun 19, 2018 at 12:06:59AM +0300, Alexander Korotkov wrote:
> Assuming this, should we have possibility to register patch to
> September CF from now?

There cannot be two commit fests marked as open at the same time as
Magnus mentions here:
https://www.postgresql.org/message-id/cabuevex1k+axzcv2t3weyf5ulg72ybksch_huhfnzq+-kso...@mail.gmail.com

There is no issue in registering new patches in future ones for admins,
but normal users cannot, right?  In this case, could you wait that the
next CF is marked as in progress and that the one of September is
opened?  You could also add it to the July one if you are not willing to
wait, and that will get bumped by one of the CFMs, but this makes the
whole process unnecessarily noisy.
--
Michael




Re: [PATCH] Find additional connection service files in pg_service.conf.d directory

2018-06-18 Thread Michael Paquier
On Mon, Jun 18, 2018 at 07:17:14PM +0300, Arthur Zakirov wrote:
> Isn't it worth to check errno for stat(), opendir() and readdir()
> calls?

Checking for such errors is mandatory.  There are some cases where there
are filters based on errno like ENOENT, but if something unexpected is
happening the user should know about it.  Of course this depends on the
feature discussed and its properties..
--
Michael




Re: Supporting tls-server-end-point as SCRAM channel binding for OpenSSL 1.0.0 and 1.0.1

2018-06-18 Thread Michael Paquier
On Sat, Jun 09, 2018 at 09:28:17AM +0900, Michael Paquier wrote:
> I am still not completely sure what is the correct course of action
> here.  Heikki and Peter and not much in favor of adding more complexity
> here as OpenSSL has a long history of having a non-linear history across
> platforms.  On the other side, PGDG provides packages down to RHEL6, and
> there are surely servers which use it as backend.

As Peter and Heikki have worked as well on all those features with me,
are there any objection to discard this open item?  I looked again at
the patch this morning and it is true that OpenSSL's history makes
things harder, so keeping code consistent and simple with their last LTS
version is appealing.
--
Michael




Re: Partitioning with temp tables is broken

2018-06-18 Thread Amit Langote
Hello.

On 2018/06/18 15:02, Michael Paquier wrote:
> On Mon, Jun 18, 2018 at 01:27:51PM +0900, Amit Langote wrote:
>> On 2018/06/17 22:11, Michael Paquier wrote:
>> Which checks do you think are missing other than those added by the
>> proposed patch?
> 
> I was just wondering how this reacted if trying to attach a foreign
> table to a partition tree which is made of temporary tables, and things
> get checked in MergeAttributes, and you have added a check for that:
> +-- do not allow a foreign table to become a partition of temporary
> +-- partitioned table
> +create temp table temp_parted (a int) partition by list (a);
> 
> Those tests should be upper-case I think to keep consistency with the
> surrounding code.

Ah, sorry about that.

As you may have seen in the changed code, the guard in MergeAttributes
really just checks relpersistance, so the check prevents foreign tables
from being added as a partition of a temporary parent table.  Not sure how
much sense it makes for a foreign table's relpersistence to be
RELPERSISTENCE_PERMANENT, but that's a different matter I guess.  One
cannot create temporary foreign tables because of the lack of syntax for
it, so there's no need to worry about that.

>>> I am quickly looking at forbid-temp-parts-1.patch from previous message
>>> https://postgr.es/m/a6bab73c-c5a8-2c25-f858-5d6d800a8...@lab.ntt.co.jp
>>> and this shines per its lack of tests.  It would be easy enough to test
>>> that temp and permanent relations are not mixed within the same session
>>> for multiple levels of partitioning.  Amit, could you add some?  There
>>> may be tweaks needed for foreign tables or such, but I have not looked
>>> close enough at the problem yet..
>>
>> OK, I have added some tests.  Thanks for looking!
> 
> Okay, I have looked at this patch and tested it manually and I can
> confirm the following restrictions:
>
> - Partition trees are supported for a set of temporary relations within
> the same session.
> - If trying to work with temporary relations from different sessions,
> then failure.
> - If trying to attach a temporary partition to a permanent parent, then
> failure.
> - If trying to attach a permanent relation to a temporary parent, then
> failure.
> 
> +   /* If the parent is permanent, so must be all of its partitions. */
> +   if (is_partition &&
> +   relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
> +   relpersistence == RELPERSISTENCE_TEMP)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
> +RelationGetRelationName(relation)));
>
> I had some second thoughts about this one actually because inheritance
> trees allow a temporary child for a permanent parent:
>
> =# create table aa_parent (a int);
> CREATE TABLE
> =# create temp table aa_temp_child (a int) inherits (aa_parent);
> NOTICE:  0: merging column "a" with inherited definition
> LOCATION:  MergeAttributes, tablecmds.c:2306
> CREATE TABLE
>
> And they also disallow a permanent child to inherit from a temporary
> parent:
> =# create temp table aa_temp_parent (a int);
> CREATE TABLE
> =# create table aa_child (a int) inherits (aa_temp_parent);
> ERROR:  42809: cannot inherit from temporary relation "aa_temp_parent"
>
> I am not seeing any regression tests for that actually so that would be
> a nice addition, with also a test for inheritance of only temporary
> relations.  That's not something for the patch discussed on this thread
> to tackle.
> 
> Still the partition bound checks make that kind of harder to bypass, and
> the use-case is not obvious, so let's keep the restriction as
> suggested...

Yeah, unlike regular inheritance, we access partitions from PartitionDesc
without checking relpersistence in some of the newly added code in PG 11
and also in PG 10 (the latter doesn't crash but gives an unintuitive error
as you said upthread).  If a use case to mix temporary and permanent
tables in partition tree pops up in the future, we can revisit it and
implement it correctly.

> -   /* Ditto for the partition */
> +   /* Ditto for the partition. */
>
> Some noise here.

Oops, fixed.

> Adding a test which makes sure that partition trees made of only
> temporary relations work would be nice.

I added a test to partition_prune.sql.

> Documenting all those restrictions and behaviors would be nice, why not
> adding a paragraph in ddl.sgml, under the section for declarative
> partitioning?

OK, I've tried that in the attached updated patch, but I couldn't write
beyond a couple of sentences that I've added in 5.10.2.3. Limitations.

Thanks,
Amit
>From 8c067c66a4ccf25eee75f6454de7087e7228a259 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 15 Jun 2018 10:20:26 +0900
Subject: [PATCH v3] Disallow mixing partitions of differing relpersistence and
 visibility

---
 doc/src/sgml/ddl.sgml |  8 
 

Adding tests for inheritance trees with temporary tables

2018-06-18 Thread Michael Paquier
Hi all,

While look at the handling of temporary relations with partition trees,
I have noticed that there are no tests for inheritance trees with temp
tables.  This has been mentioned here:
https://www.postgresql.org/message-id/20180618060200.gg3...@paquier.xyz

Attached is a patch to close the gap.  That's of course not v11
material, so I am parking this patch into the next CF.

Thanks,
--
Michael
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index b2b912ed5c..9b1312a0ca 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1710,6 +1710,51 @@ select * from cnullparent where f1 = 2;
 drop table cnullparent cascade;
 NOTICE:  drop cascades to table cnullchild
 --
+-- Check use of temporary tables with inheritance trees
+--
+create table inh_perm_parent (a1 int);
+create temp table inh_temp_parent (a1 int);
+create temp table inh_temp_child (a1 int) inherits (inh_perm_parent); -- ok
+NOTICE:  merging column "a1" with inherited definition
+create table inh_perm_child (a1 int) inherits (inh_temp_parent); -- error
+ERROR:  cannot inherit from temporary relation "inh_temp_parent"
+create temp table inh_temp_child_2 (a1 int) inherits (inh_temp_parent); -- ok
+NOTICE:  merging column "a1" with inherited definition
+insert into inh_perm_parent values (1);
+insert into inh_temp_parent values (2);
+insert into inh_temp_child values (3);
+insert into inh_temp_child_2 values (4);
+select * from inh_perm_parent;
+ a1 
+----
+  1
+  3
+(2 rows)
+
+select * from inh_temp_parent;
+ a1 
+----
+  2
+  4
+(2 rows)
+
+select * from inh_temp_child;
+ a1 
+----
+  3
+(1 row)
+
+select * from inh_temp_child_2;
+ a1 
+----
+  4
+(1 row)
+
+drop table inh_perm_parent cascade;
+NOTICE:  drop cascades to table inh_temp_child
+drop table inh_temp_parent cascade;
+NOTICE:  drop cascades to table inh_temp_child_2
+--
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.
 --
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 5a48376fc0..2c3e35583a 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -635,6 +635,25 @@ select * from cnullparent;
 select * from cnullparent where f1 = 2;
 drop table cnullparent cascade;
 
+--
+-- Check use of temporary tables with inheritance trees
+--
+create table inh_perm_parent (a1 int);
+create temp table inh_temp_parent (a1 int);
+create temp table inh_temp_child (a1 int) inherits (inh_perm_parent); -- ok
+create table inh_perm_child (a1 int) inherits (inh_temp_parent); -- error
+create temp table inh_temp_child_2 (a1 int) inherits (inh_temp_parent); -- ok
+insert into inh_perm_parent values (1);
+insert into inh_temp_parent values (2);
+insert into inh_temp_child values (3);
+insert into inh_temp_child_2 values (4);
+select * from inh_perm_parent;
+select * from inh_temp_parent;
+select * from inh_temp_child;
+select * from inh_temp_child_2;
+drop table inh_perm_parent cascade;
+drop table inh_temp_parent cascade;
+
 --
 -- Check that constraint exclusion works correctly with partitions using
 -- implicit constraints generated from the partition bound information.




Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Peter Geoghegan
On Mon, Jun 18, 2018 at 4:05 PM, Peter Geoghegan  wrote:
> IOW, the approach you've taken in bttargetdelete() isn't quite correct
> because you imagine that it's okay to occasionally "lose" the index
> tuple that you originally found, and just move on. That needs to be
> 100% reliable, or else we'll end up with index tuples that point to
> the wrong heap tuples in rare cases with concurrent insertions.

Attached patch adds a new amcheck check within
bt_index_parent_check(). With the patch, heap TIDs are accumulated in
a tuplesort and sorted at the tail end of verification (before
optional heapallindexed verification runs). This will reliably detect
the kind of corruption I noticed was possible with your patch.

Note that the amcheck enhancement that went along with my
heap-tid-btree-sort patch may not have detected this issue, which is
why I wrote this patch --  the heap-tid-btree-sort amcheck stuff could
detect duplicates, but only when all other attributes happened to be
identical when comparing sibling index tuples (i.e. only when we must
actually compare TIDs across sibling index tuples). If you add this
check, I'm pretty sure that you can detect any possible problem. You
should think about using this to debug your patch.
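
The check itself is easy to model.  Here is a hedged Python sketch of the
idea only (the real patch accumulates the TIDs in a tuplesort inside
bt_index_parent_check(); these names are illustrative): gather every heap
TID the index references, sort them, and flag any TID that two index tuples
point at.

```python
def find_duplicate_tids(tids):
    """tids: list of (block, offset) heap TIDs gathered during the index scan."""
    dups = []
    prev = None
    for tid in sorted(tids):       # the C code sorts via a tuplesort instead
        if tid == prev:
            # two index tuples reference the same heap tuple: corruption
            dups.append(tid)
        prev = tid
    return dups
```

A lost-and-reinserted index tuple as described above would show up here even
when the keys of adjacent sibling tuples differ.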

I may get around to submitting this to a CF, but that isn't a priority
right now.

-- 
Peter Geoghegan


0001-Detect-duplicate-heap-TIDs-using-a-tuplesort.patch
Description: Binary data


Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-18 Thread Michael Paquier
On Tue, Jun 19, 2018 at 09:11:20AM +0900, Michael Paquier wrote:
> On Mon, Jun 18, 2018 at 10:57:34AM +0900, Michael Paquier wrote:
> > As there is visibly a consensus for HEAD, for now I propose to just
> > process with the previous patch on only the master branch then.  This
> > will address the open item.  Any objections to that?
> 
> As there is a consensus at least on this part, I have just pushed the
> patch to only HEAD.  I'll notify buildfarm members which see failures in
> case those link to OpenSSL 1.0.1 or older.

Andrew, bowerbird is complaining with what looks like a library loading
issue with initdb:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=bowerbird&dt=2018-06-19%2001%3A08%3A30&stg=check
pg_regress: initdb failed
Examine G:/prog/bf/root/HEAD/pgsql.build/src/test/regress/log/initdb.log
for the reason.

In my own environment I have a copy of ssleay32.dll and libeay32.dll in
c:\Windows\System32 so as that I don't get any complaints when running
regression tests.  Is your environment using something specific?
--
Michael




Re: Libpq support to connect to standby server as priority

2018-06-18 Thread Haribabu Kommi
On Wed, Jan 24, 2018 at 9:01 AM Jing Wang  wrote:

> Hi All,
>
> Recently I put a proposal to support 'prefer-read' parameter in
> target_session_attrs in libpq. Now I updated the patch with adding content
> in the sgml and regression test case.
>
> Some people may have noticed there is already another patch (
> https://commitfest.postgresql.org/15/1148/ ) which looks similar with
> this. But I would say this patch is more complex than my proposal.
>
> It is better separate these 2 patches to consider.
>

I also feel that prefer-read and read-only need to be treated as two
different options; prefer-read is simpler to support than read-only.

Here I attached an updated patch that is rebased on the latest master and
also fixes some of the corner scenarios.
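
A toy Python model of why prefer-read is the simpler behaviour (illustrative
only, not libpq code; host names and the is_read_only callback are made up):
try the hosts in order, prefer the first one that reports read-only (a
standby), but fall back to a read-write host instead of failing, which is
exactly what a strict read-only mode could not do.

```python
def choose_host(hosts, is_read_only):
    """hosts: ordered candidate hosts; is_read_only: host -> bool."""
    fallback = None
    for host in hosts:
        if is_read_only(host):
            return host        # a standby: exactly what prefer-read wants
        if fallback is None:
            fallback = host    # remember a usable read-write host
    # unlike a strict read-only mode, never reject the primary outright
    return fallback

state = {"pg-primary": False, "pg-standby": True}
preferred = choose_host(["pg-primary", "pg-standby"], state.__getitem__)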

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Allow-target-session-attrs-to-accept-prefer-read-opti.patch
Description: Binary data


Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-18 Thread Amit Kapila
On Mon, Jun 18, 2018 at 3:09 PM, Amit Khandekar  wrote:
> On 16 June 2018 at 19:30, Amit Kapila  wrote:
>> On Sat, Jun 16, 2018 at 10:44 AM, Amit Kapila  
>> wrote:
>>> Yeah, or perhaps disallow creation of any partial paths at the first
>>> place like in attached.   This will save us some work as well.
>>>
>>
>> Attached patch contains test case as well.  I have tried to come up
>> with some simple test, but couldn't come up with anything much simpler
>> than reported by Rajkumar, so decided to use the test case provided by
>> him.
>>
>
> Thanks for the patch. I had a look at it, and it looks good to me. One
> minor comment :
>
> +-- Parallel Append is not be used when the subpath depends on the outer param
> "is not be used" => "is not to be used"
>

Fixed in the attached patch.  I will wait for a day or two to see if
Tom or Robert wants to say something and then commit.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_pa_path_generation_v3.patch
Description: Binary data


RE: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-06-18 Thread Tsunakawa, Takayuki
> On Tue, Jun 12, 2018 at 1:09 PM, Tsunakawa, Takayuki
>  wrote:
> > My colleague is now preparing a patch for that, which adds a function
> ECPGFreeSQLDA() in libecpg.so.  That thread is here:
> >
> https://www.postgresql.org/message-id/25C1C6B2E7BE044889E4FE8643A58BA9
> 63A42097@G01JPEXMBKW03
> 
> Thanks.  I will follow up on that thread.

He's created a separate thread for a new CF entry here:

https://commitfest.postgresql.org/18/1669/



> > I want some remedy for older releases.  Our customer worked around this
> problem by getting a libpq connection in their ECPG application and calling
> PQfreemem().  That's an ugly kludge, and I don't want other users to follow
> it.
> >
> > I don't see a problem with back-patching as-is, because existing users
> who just call free() or don't call free() won't be affected.  I think that
> most serious applications somehow state their supported minor releases like
> "this application supports (or is certified against) PostgreSQL 10.5 or
> later", just like other applications support "RHEL 6.2 or later" or "Windows
> XP Sp2 or later."
> 
> If there is a consensus that we should do that then I'll back-patch,
> but so far no one else has spoken up in support.

I'll follow the community decision.  But I'm afraid that not enough people will 
comment on this to call it a consensus, because this topic will not be 
interesting...  FWIW, I thought back-patching would make committers' future
burden smaller thanks to the smaller difference in the code across major
versions.


Regards
Takayuki Tsunakawa




Re: Partitioning with temp tables is broken

2018-06-18 Thread Michael Paquier
On Mon, Jun 18, 2018 at 01:27:51PM +0900, Amit Langote wrote:
> On 2018/06/17 22:11, Michael Paquier wrote:
> Which checks do you think are missing other than those added by the
> proposed patch?

I was just wondering how this reacted if trying to attach a foreign
table to a partition tree which is made of temporary tables, and things
get checked in MergeAttributes, and you have added a check for that:
+-- do not allow a foreign table to become a partition of temporary
+-- partitioned table
+create temp table temp_parted (a int) partition by list (a);

Those tests should be upper-case I think to keep consistency with the
surrounding code.

>> I am quickly looking at forbid-temp-parts-1.patch from previous message
>> https://postgr.es/m/a6bab73c-c5a8-2c25-f858-5d6d800a8...@lab.ntt.co.jp
>> and this shines per its lack of tests.  It would be easy enough to test
>> that temp and permanent relations are not mixed within the same session
>> for multiple levels of partitioning.  Amit, could you add some?  There
>> may be tweaks needed for foreign tables or such, but I have not looked
>> close enough at the problem yet..
> 
> OK, I have added some tests.  Thanks for looking!

Okay, I have looked at this patch and tested it manually and I can
confirm the following restrictions:
- Partition trees are supported for a set of temporary relations within
the same session.
- If trying to work with temporary relations from different sessions,
then failure.
- If trying to attach a temporary partition to a permanent parent, then
failure.
- If trying to attach a permanent relation to a temporary parent, then
failure.

+   /* If the parent is permanent, so must be all of its partitions. */
+   if (is_partition &&
+   relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+   relpersistence == RELPERSISTENCE_TEMP)
+   ereport(ERROR,
+   (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
+RelationGetRelationName(relation)));
I had some second thoughts about this one actually because inheritance
trees allow a temporary child for a permanent parent:
=# create table aa_parent (a int);
CREATE TABLE
=# create temp table aa_temp_child (a int) inherits (aa_parent);
NOTICE:  0: merging column "a" with inherited definition
LOCATION:  MergeAttributes, tablecmds.c:2306
CREATE TABLE
And they also disallow a permanent child to inherit from a temporary
parent:
=# create temp table aa_temp_parent (a int);
CREATE TABLE
=# create table aa_child (a int) inherits (aa_temp_parent);
ERROR:  42809: cannot inherit from temporary relation "aa_temp_parent"
I am not seeing any regression tests for that actually so that would be
a nice addition, with also a test for inheritance of only temporary
relations.  That's not something for the patch discussed on this thread
to tackle.

Still the partition bound checks make that kind of harder to bypass, and
the use-case is not obvious, so let's keep the restriction as
suggested...

-   /* Ditto for the partition */
+   /* Ditto for the partition. */
Some noise here.

Adding a test which makes sure that partition trees made of only
temporary relations work would be nice.

Documenting all those restrictions and behaviors would be nice, why not
adding a paragraph in ddl.sgml, under the section for declarative
partitioning?
--
Michael




Re: Performance regression with PostgreSQL 11 and partitioning

2018-06-18 Thread David Rowley
On 12 June 2018 at 01:49, Robert Haas  wrote:
> On Fri, Jun 8, 2018 at 3:08 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> That being said, I don't mind a bit if you want to look for further
>>> speedups here, but if you do, keep in mind that a lot of queries won't
>>> even use partition-wise join, so all of the arrays will be of length
>>> 1.  Even when partition-wise join is used, it is quite likely not to
>>> be used for every table in the query, in which case it will still be
>>> of length 1 in some cases.  So pessimizing nappinfos = 1 even slightly
>>> is probably a regression overall.
>>
>> TBH, I am way more concerned about the pessimization introduced for
>> every pre-existing usage of these functions by putting search loops
>> into them at all.  I'd like very much to revert that.  If we can
>> replace those with something along the line of root->index_array[varno]
>> we'll be better off across the board.
>
> I think David's analysis is correct -- this doesn't quite work.  We're
> trying to identify whether a given varno is one of the ones to be
> translated, and it's hard to come up with a faster solution than
> iterating over a (very short) array of those values.  One thing we
> could do is have two versions of each function, or else an optimized
> path for the very common nappinfos=1 case.  I'm just not sure it would
> be worthwhile.  Traversing a short array isn't free, but it's pretty
> cheap.

So this is the current situation as far as I see it:

We could go and add a new version of adjust_appendrel_attrs() and
adjust_appendrel_attrs_mutator() that accept a Relids for the child
rels rather than an array of AppendRelInfos. However, that's quite a
lot of code duplication. We could perhaps cut down on duplication by
having a callback function stored inside of
adjust_appendrel_attrs_context which searches for the correct
AppendRelInfo to use. However, it does not seem to be inline with
simplifying the code.

We've not yet heard back from Tom with more details about his
root->index_array[varno] idea. I can't quite see how this is possible
and for the moment I assume Tom misunderstood that it's the parent
varno that's known, not the varno of the child.

I've attached a patch which cleans up my earlier version and moves the
setup of the append_rel_array into its own function instead of
sneaking code into setup_simple_rel_arrays(). I've also now updated
the comment above find_childrel_appendrelinfo(), which is now an
unused function.
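
A toy model of the speedup (names are illustrative stand-ins, not the actual
planner structs): translating a parent Var used to mean scanning the list of
AppendRelInfos for the matching child, which is O(partitions); the patch
builds an array indexed by child relid so the lookup is a single subscript.

```python
class AppendRelInfo:
    def __init__(self, parent_relid, child_relid):
        self.parent_relid = parent_relid
        self.child_relid = child_relid

def build_append_rel_array(infos, simple_rel_array_size):
    """Mirror of the patch's idea: one slot per child relid."""
    arr = [None] * simple_rel_array_size
    for info in infos:
        arr[info.child_relid] = info   # direct slot, filled once at setup
    return arr

# 10000 partitions under parent relid 1
infos = [AppendRelInfo(1, i) for i in range(2, 10002)]
arr = build_append_rel_array(infos, 10002)
# old way: next(i for i in infos if i.child_relid == 10001)  -- linear scan
info = arr[10001]                      # new way: O(1) regardless of count
```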

I tried the following test case:

CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_Series(0,) x;
\gexec
\o

SELECT * FROM partbench WHERE date = '2018-04-23 00:00:00';

Patched

Time: 190.989 ms
Time: 187.313 ms
Time: 190.239 ms

Unpatched

Time: 775.771 ms
Time: 781.631 ms
Time: 762.777 ms

Is this good enough for v11?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v3-0001-Allow-direct-lookups-of-AppendRelInfo-by-child-re.patch
Description: Binary data


Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-18 Thread Michael Paquier
On Tue, Jun 19, 2018 at 12:01:39AM -0400, Andrew Dunstan wrote:
>> In my own environment I have a copy of ssleay32.dll and libeay32.dll in
>> c:\Windows\System32 so as that I don't get any complaints when running
>> regression tests.  Is your environment using something specific?
> 
> I have adjusted the PATH setting - it should be fixed on the next run.

Thanks!
--
Michael




Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread Ashutosh Bapat
On Mon, Jun 18, 2018 at 10:29 PM, Alvaro Herrera
 wrote:
> On 2018-Jun-18, Ashutosh Bapat wrote:
>
>> That's a wrong comparison. Inheritance based partitioning works even
>> after declarative partitioning feature is added. So, users can
>> continue using inheritance based partitioning if they don't want to
>> move to declarative partitioning. That's not true here. A user creates
>> a BEFORE ROW trigger on each partition that mimics partitioned table
>> level BEFORE ROW trigger. The proposed BEFORE ROW trigger on
>> partitioned table will cascade the trigger down to each partition. If
>> that fails to recognize that there is already an equivalent trigger, a
>> partition table will get two triggers doing the same thing silently
>> without any warning or notice. That's fine if the trigger changes the
>> salaries to $50K but not OK if each of those increases salary by 10%.
>> The database has limited ability to recognize what a trigger is doing.
>
> I agree with Robert that nobody in their right minds would be caught by
> this problem by adding new triggers without thinking about it and
> without testing them.  If you add a partitioned-table-level trigger to
> raise salaries by 10% but there was already one in the partition level
> that does the same thing, you'll readily notice that salaries have been
> increased by 21% instead.
>
> So like Robert I'm inclined to not change the wording in the
> documentation.
>
> What does worry me a little bit now, reading this discussion, is whether
> we've made the triggers in partitions visible enough.  We'll have this
> problem once we implement BEFORE ROW triggers as proposed, and I think
> we already have this problem now.  Let's suppose a user creates a
> duplicated after trigger:
>
> create table parent (a int) partition by range (a);
> create table child partition of parent for values from (0) to (100);
> create function noise() returns trigger language plpgsql as $$ begin raise 
> notice 'nyaa'; return null; end; $$;
> create trigger trig_p after insert on parent for each row execute procedure 
> noise();
> create trigger trig_c after insert on child for each row execute procedure 
> noise();
>
> Now let's try it:
>
> alvherre=# insert into child values (1);
> NOTICE:  nyaa
> NOTICE:  nyaa
> INSERT 0 1
>
> OK, so where does that one come from?
>
> alvherre=# \d child
> Tabla «public.child»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─────────┼─────────┼───────────┼──────────┼─────────
>  a       │ integer │           │          │
> Partition of: parent FOR VALUES FROM (0) TO (100)
> Triggers:
> trig_c AFTER INSERT ON child FOR EACH ROW EXECUTE PROCEDURE noise()
>
> Hmm, there's only one trigger here, why does it appear twice?  To find
> out, you have to know where to look:
>
> alvherre=# select tgname, tgrelid::regclass, tgisinternal from pg_trigger;
>  tgname │ tgrelid │ tgisinternal
> ────────┼─────────┼──────────────
>  trig_p │ parent  │ f
>  trig_p │ child   │ t
>  trig_c │ child   │ f
> (3 filas)
>
> So there is a trigger in table child, but it's hidden because
> tgisinternal.  Of course, you can see it if you look at the parent's
> definition:
>
> alvherre=# \d parent
>Tabla «public.parent»
>  Columna │  Tipo   │ Collation │ Nullable │ Default
> ─────────┼─────────┼───────────┼──────────┼─────────
>  a       │ integer │           │          │
> Partition key: RANGE (a)
> Triggers:
> trig_p AFTER INSERT ON parent FOR EACH ROW EXECUTE PROCEDURE noise()
> Number of partitions: 1 (Use \d+ to list them.)
>
> I think it'd be useful to have a list of triggers that have been
> inherited from ancestors, or maybe simply a list of internal triggers

That's a very good description of the problem people will face. Thanks
for elaborating it this way.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Slow planning time for simple query

2018-06-18 Thread Amit Kapila
On Mon, Jun 18, 2018 at 7:50 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
>>  wrote:
>>> That's OK as long as we're ignoring those hints on the standby.
>
>> What if we don't ignore those hints on standby for a specific case
>> like the one in get_actual_variable_range?
>
> Yeah, that's the same idea I suggested upthread.  Andrew shot down
> my first thought (correctly I think) but the second one still seems
> feasible.
>
>> Now, if the user has
>> enabled wal_log_hints on the master, it could save time from scanning
>> many dead tuples on standby.
>
> We should have the standby set the hint bits for itself, rather than
> relying on wal_log_hints.
>

Yeah, I think that is a good idea unless someone sees a hole in it,
but we have to keep in mind that the bits set by the standby can be
overwritten later, for instance (a) due to an FPI for a change on that
page on the master, or (b) when that page is evicted from shared
buffers, because we don't allow dirtying pages during recovery.
However, I think the same is true for regular hint bits (a hint bit on
a tuple that suggests the transaction is committed).  On the one hand
it would be good if we can do it as a single patch, but it is not
difficult to imagine doing this as two separate patches: the first
patch introduces the idea of not ignoring kill bits on the standby in
a particular case (get_actual_variable_range), and the second patch
introduces something like StandbyGlobalXmin, which would then be used
in get_actual_variable_range.  Both are independently useful.
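To illustrate why honoring kill bits could help in this case, here is a
toy model (Python, not the actual C implementation; the function and data
layout are made up for illustration) of a get_actual_variable_range-style
scan over an index with many dead entries ahead of the first live one:

```python
# Toy model of index "kill" hints: each index entry carries a 'killed'
# flag meaning its heap tuple is known dead.  A scan that ignores the
# hint must pay a heap visit per entry to check visibility.

def find_min_live(index_entries, honor_kill_hints):
    """Return (min live key, number of heap visits performed)."""
    heap_visits = 0
    for key, killed, live in index_entries:  # entries sorted by key
        if honor_kill_hints and killed:
            continue  # skip the entry without touching the heap
        heap_visits += 1  # visibility check requires a heap fetch
        if live:
            return key, heap_visits
    return None, heap_visits

# 10,000 dead entries ahead of the first live one, as after a bulk DELETE.
entries = [(k, True, False) for k in range(10000)] + [(10000, False, True)]

print(find_min_live(entries, honor_kill_hints=False))  # (10000, 10001)
print(find_min_live(entries, honor_kill_hints=True))   # (10000, 1)
```

The second call models a standby that trusts the kill bits: it finds the
range endpoint after a single heap visit instead of one per dead entry.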

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Pluggable storage

2018-06-18 Thread Amit Kapila
On Tue, Jun 19, 2018 at 1:13 AM, AJG  wrote:
> @Amit
>
> Re: Vacuum etc.
>
> Chrome V8 just released this blog post around concurrent marking, which may
> be of interest considering how cpu limited the browser is. Contains
> benchmark numbers etc in post as well.
>
> https://v8project.blogspot.com/2018/06/concurrent-marking.html
>
> "This post describes the garbage collection technique called concurrent
> marking. The optimization allows a JavaScript application to continue
> execution while the garbage collector scans the heap to find and mark live
> objects. Our benchmarks show that concurrent marking reduces the time spent
> marking on the main thread by 60%–70%"
>

Thanks for sharing the link, but I don't think it is directly related
to the work we are doing.  Feel free to discuss it on the zheap thread
[1], or even start a new thread, because it appears to be a general
technique to improve GC (garbage collection) rather than something
directly related to the zheap (or undo) technology.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BYtM5vxzSM2NZm%2BpC37MCwyvtkmJrO_yRBQeZDp9Wa2w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: pg_config.h.win32 missing a set of flags from pg_config.h.in added in v11 development

2018-06-18 Thread Andrew Dunstan




On 06/18/2018 11:13 PM, Michael Paquier wrote:

On Tue, Jun 19, 2018 at 09:11:20AM +0900, Michael Paquier wrote:

On Mon, Jun 18, 2018 at 10:57:34AM +0900, Michael Paquier wrote:

As there is visibly a consensus for HEAD, for now I propose to just
proceed with the previous patch on only the master branch.  This
will address the open item.  Any objections to that?

As there is a consensus at least on this part, I have just pushed the
patch to only HEAD.  I'll notify buildfarm members which see failures in
case those link to OpenSSL 1.0.1 or older.

Andrew, bowerbird is complaining with what looks like a library loading
issue with initdb:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=bowerbird=2018-06-19%2001%3A08%3A30=check
pg_regress: initdb failed
Examine G:/prog/bf/root/HEAD/pgsql.build/src/test/regress/log/initdb.log
for the reason.

In my own environment I have a copy of ssleay32.dll and libeay32.dll in
c:\Windows\System32 so as that I don't get any complaints when running
regression tests.  Is your environment using something specific?



I have adjusted the PATH setting - it should be fixed on the next run.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Partitioning with temp tables is broken

2018-06-18 Thread Michael Paquier
On Tue, Jun 19, 2018 at 10:56:49AM +0900, Amit Langote wrote:
> On 2018/06/18 15:02, Michael Paquier wrote:
>> Those tests should be upper-case I think to keep consistency with the
>> surrounding code.
> 
> As you may have seen in the changed code, the guard in MergeAttributes
> really just checks relpersistence, so the check prevents foreign tables
> from being added as a partition of a temporary parent table.  Not sure
> how much sense it makes to call a foreign table's relpersistence
> RELPERSISTENCE_PERMANENT, but that's a different matter, I guess.

Its existence does not go away when the session finishes, so that makes
sense to me, at least philosophically :)

> One cannot create temporary foreign tables because of the lack of
> syntax for it, so there's no need to worry about that.

This could have its own use-cases.

> Yeah, unlike regular inheritance, we access partitions from PartitionDesc
> without checking relpersistence in some of the newly added code in PG 11
> and also in PG 10 (the latter doesn't crash but gives an unintuitive error
> as you said upthread).  If a use case to mix temporary and permanent
> tables in partition tree pops up in the future, we can revisit it and
> implement it correctly.

Agreed.  Let's keep things simple for now.
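In miniature, the rule the patch enforces can be sketched like this (a
hypothetical Python model for illustration, not the patch's C code; the
function name and error text are made up):

```python
# Toy sketch of the persistence rule under discussion: a partition tree
# must be all-temporary or all-permanent, never a mix of the two.

PERMANENT, TEMP = "p", "t"  # mirrors relpersistence-style one-letter codes

def check_attach(parent_persistence, child_persistence):
    """Reject attaching a partition whose persistence differs from the parent's."""
    if parent_persistence != child_persistence:
        raise ValueError(
            "cannot mix temporary and permanent relations "
            "in one partition tree")

check_attach(TEMP, TEMP)            # ok: all-temporary tree
check_attach(PERMANENT, PERMANENT)  # ok: all-permanent tree
try:
    check_attach(PERMANENT, TEMP)   # rejected, as is the reverse direction
except ValueError as e:
    print(e)
```

Unlike regular inheritance, which tolerates mixing, the check applies in
both directions for partition trees, matching the new tests in
create_table and alter_table.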

>> Adding a test which makes sure that partition trees made of only
>> temporary relations work would be nice.
> 
> I added a test to partition_prune.sql.

I didn't think about that, but it's actually a good idea to keep that
around.  A test case checking that ATTACH works when all the relations
involved are temporary was still missing.

>> Documenting all those restrictions and behaviors would be nice, why not
>> adding a paragraph in ddl.sgml, under the section for declarative
>> partitioning?
> 
> OK, I've tried that in the attached updated patch, but I couldn't write
> beyond a couple of sentences that I've added in 5.10.2.3. Limitations.

Adding the description in this section is a good idea.

+ <listitem>
+  <para>
+   One cannot have both temporary and permanent relations in a given
+   partition tree.  That is, if the root partitioned table is permanent,
+   so must be its partitions at all levels and vice versa.
+  </para>
+ </listitem>

I have reworded a bit that part.

+		/* If the parent is permanent, so must be all of its partitions. */
+		if (is_partition &&
+			relation->rd_rel->relpersistence != RELPERSISTENCE_TEMP &&
+			relpersistence == RELPERSISTENCE_TEMP)
+			ereport(ERROR,
+					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					 errmsg("cannot create a temporary relation as partition of permanent relation \"%s\"",
+							RelationGetRelationName(relation))));

Added a note about inheritance allowing this case, and reduced the diff
noise of the patch.

--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
[...]
+ERROR:  cannot attach a permanent relation as partition of temporary relation "temp_parted"
+drop table temp_parted;

This set of tests does not check that trees made of only temporary
relations can work, so I added a test for that, refactoring the tests a
bit.  The same applies for both create_table and alter_table.

+-- Check pruning for a partition tree containining only temporary relations
+create temp table pp_temp_parent (a int) partition by list (a);
+create temp table pp_temp_part_1 partition of pp_temp_parent for values in (1);
+create temp table pp_temp_part_def partition of pp_temp_parent default;
+explain (costs off) select * from pp_temp_parent where true;
+explain (costs off) select * from pp_temp_parent where a = 2;
+drop table pp_temp_parent;

That's a good idea.  Typo here => s/containining/containing/.

Attached is what I am finishing with after a closer review.  Amit, what
do you think?
--
Michael
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 0258391154..9e5bc06abf 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3353,6 +3353,15 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
on individual partitions, not the partitioned table.
   
  
+
+ <listitem>
+  <para>
+   Mixing temporary and permanent relations in the same partition tree
+   is not allowed.  Hence, if the root partitioned table is permanent,
+   so must be its partitions at all levels and vice versa for temporary
+   relations.
+  </para>
+ </listitem>
 
 
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8b848f91a7..7c0cf0d7ee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1985,6 +1985,19 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
 	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 	 errmsg("inherited relation \"%s\" is not a table or foreign table",
 			parent->relname)));
+
+		/*
+		 * If the parent is permanent, so 

Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-06-18 Thread Ashutosh Bapat
On Tue, Jun 19, 2018 at 3:51 AM, Robert Treat  wrote:
>
> So +1 for thinking we do need to worry about it. I'm not exactly sure
> how we want to expose that info; with \d+ we list various "Partition
> X:" sections, perhaps adding one for "Partition triggers:" would be
> enough, although I am inclined to think it needs exposure at the \d
> level.

Something like this or what Alvaro proposed will be helpful.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Excessive CPU usage in StandbyReleaseLocks()

2018-06-18 Thread Thomas Munro
Hello hackers,

Andres Freund diagnosed a case of $SUBJECT in a customer's 9.6 system.
I've written a minimal reproducer and a prototype patch to address the
root cause.

The problem is that StandbyReleaseLocks() does a linear search of all
known AccessExclusiveLocks when a transaction ends.  Luckily, since
v10 (commit 9b013dc2) that is skipped for transactions that haven't
taken any AELs and aren't using 2PC, but that doesn't help all users.

It's fine if the AEL list is short, but if you do something that takes
a lot of AELs, such as restoring a database with many tables or
truncating a lot of partitions while other transactions are in flight,
then we start doing O(txrate * nlocks * nsubxacts) work, and that can
hurt.

The reproducer script I've attached creates one long-lived transaction
that acquires 6,000 AELs and takes a nap, while 48 connections run
trivial 2PC transactions (I was also able to reproduce the effect
without 2PC by creating a throw-away temporary table in every
transaction, but it was unreliable due to contention slowing
everything down).  For me, the standby's startup process becomes 100%
pegged, replay_lag begins to climb and perf says something like:

+   97.88%    96.96%  postgres  postgres  [.] StandbyReleaseLocks

The attached patch splits the AEL list into one list per xid and
sticks them in a hash table.  That makes perf say something like:

+    0.60%     0.00%  postgres  postgres  [.] StandbyReleaseLocks

This seems like something we'd want to back-patch because the problem
affects all branches (the older releases more severely because they
lack the above-mentioned optimisation).

Thoughts?
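To make the complexity difference concrete, here is a small Python
stand-in (not the actual C patch; class and method names are invented
for illustration) contrasting the flat-list scan with a per-xid hash
table:

```python
# Sketch of the data-structure change: instead of one flat list of
# AccessExclusiveLocks scanned linearly at every transaction end, keep
# one list per xid in a hash table, so a transaction's locks are found
# with one lookup and released in time proportional to its own locks.
from collections import defaultdict

class FlatLockList:
    def __init__(self):
        self.locks = []            # (xid, lock) pairs, one global list
        self.comparisons = 0
    def acquire(self, xid, lock):
        self.locks.append((xid, lock))
    def release_for_xid(self, xid):
        kept = []
        for entry in self.locks:   # linear scan of *all* known AELs
            self.comparisons += 1
            if entry[0] != xid:
                kept.append(entry)
        self.locks = kept

class PerXidLockTable:
    def __init__(self):
        self.locks = defaultdict(list)  # xid -> its own lock list
        self.comparisons = 0
    def acquire(self, xid, lock):
        self.locks[xid].append(lock)
    def release_for_xid(self, xid):
        self.comparisons += 1           # one hash lookup, drop whole list
        self.locks.pop(xid, None)

flat, table = FlatLockList(), PerXidLockTable()
for impl in (flat, table):
    for lock in range(6000):            # long-lived xact holding 6,000 AELs
        impl.acquire(1, lock)
    for xid in range(2, 1002):          # 1,000 short xacts come and go
        impl.acquire(xid, 0)
        impl.release_for_xid(xid)

print(flat.comparisons)   # 6001000: every release rescans all AELs
print(table.comparisons)  # 1000: one lookup per release
```

The flat list pays for the long-lived transaction's 6,000 locks on every
short-transaction commit, which is the pegged-startup-process behavior
in the perf profile above; the per-xid table does not.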

-- 
Thomas Munro
http://www.enterprisedb.com
import psycopg2
import threading
import time

DSN = "dbname=postgres host=localhost"

def init():
  conn = psycopg2.connect(DSN)
  cursor = conn.cursor()
  cursor.execute("""DROP TABLE IF EXISTS foo""")
  cursor.execute("""CREATE TABLE foo (i serial primary key)""")
  conn.commit()

def big_slow():
  conn = psycopg2.connect(DSN)
  cursor = conn.cursor()
  # hold onto many AELs
  for i in range(6000):
cursor.execute("create temp table foo%s (i int) on commit drop" % (i,))
  time.sleep(600)
  conn.rollback()

def many_small_xacts(thread_id):
  conn = psycopg2.connect(DSN)
  cursor = conn.cursor()
  for loops in range(10):
#cursor.execute("create temp table foo1 (i int) on commit drop");
cursor.execute("begin")
cursor.execute("insert into foo values (default)")
cursor.execute("prepare transaction %s", [thread_id])
cursor.execute("commit prepared %s", [thread_id])

init()
threading.Thread(target=big_slow).start()  
for i in range(48):
  threading.Thread(target=many_small_xacts, args=['t' + str(i)]).start()  



0001-Move-RecoveryLockList-into-a-hash-table.patch
Description: Binary data


  1   2   >