Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-10-27 Thread Robert Haas
On Thu, Oct 26, 2017 at 4:11 PM, Masahiko Sawada  wrote:
>> Why do we want the backend to linger behind, once it has added its
>> foreign transaction entries in the shared memory and informed resolver
>> about it? The foreign connections may take their own time and even
>> after that there is no guarantee that the foreign transactions will be
>> resolved in case the foreign server is not available. So, why to make
>> the backend wait?
>
> Because I don't want to break the current user semantics. That is,
> currently it's guaranteed that subsequent reads can see the committed
> result of previous writes even if the previous transactions were
> distributed transactions.

Right, this is very important, and having the backend wait for the
resolver(s) is, I think, the right way to implement it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
Hi,

On 10/28/2017 02:41 AM, Nico Williams wrote:
> On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:
>>> + * We use an optimisation that initially we store the uint32 values directly,
>>> + * without the extra hashing step. And only later filling the bitmap space,
>>> + * we switch to the regular bloom filter mode.
>>>
>>> I don't think that optimization is worthwhile.  If I'm going to be using
>>> a Bloom filter, it's because I expect to have a lot of rows.
>>>
>>> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
>>> table just won't have lots of rows, then I might want this optimization,
>>> but I can always just drop the Bloom filter index, or not include
>>> indexes in the LIKE.)
>>
>> I think you're confusing "rows" and "distinct values". Furthermore, it's
>> rather irrelevant how many rows are in the table because BRIN indexes
>> split the table into "ranges" that are 1MB by default. And you can only
>> squash a certain number of rows into such a range.
>>
>> The idea of the optimization is to efficiently support cases where each
>> range contains only a small number of distinct values - which is quite
>> common in the cases I described in my initial message (with bursts of
>> rows with the same IP / client identifier etc.).
> 
> What range?  It's just bits to set.
> 
> The point is that Bloom filters should ideally be about 50% full -- much
> less than that and you are wasting space, much more than that and your
> false positive rate becomes useless.

The range defined by BRIN indexes. BRIN indexes split the table into a
sequence of page ranges (128 pages by default, i.e. 1MB), and the bloom
filters are built on those table "chunks". So it's irrelevant how many
rows are in the table in total; what matters is just the page range.
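
For reference, the range size here is just the existing BRIN pages_per_range
storage parameter; e.g. on a hypothetical table t with an integer column
client_id, the default corresponds to:

    CREATE INDEX t_brin_idx ON t USING brin (client_id)
        WITH (pages_per_range = 128);   -- 128 x 8kB pages = 1MB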

> 
>>> Filter compression is not worthwhile.  You want to have a fairly uniform
>>> hash distribution, and you want to end up with a Bloom filter that's not
>>> much more than 50% full.  That means that when near 50% full the filter
>>> will not be very sparse at all.  Optimizing for the not common case
>>> doesn't seem worthwhile, and adds complexity.
>>
>> Properly sizing the bloom filter requires knowledge of many variables,
>> in particular the number of distinct values expected to be added to the
>> filter. But we don't really know that number, and we also don't even
>> know many values useful for estimating that (how many rows will fit into
>> a range, number of distinct values in the whole table, etc.)
> 
> This is why Scalable Bloom filters exist: so you can adapt.
> 
>> So the idea was to oversize the bloom filter, and then use the sparse
>> representation to reduce the size.
> 
> A space-efficient representation of sparse bitmaps is not as simple as a
> Scalable Bloom filter.
> 
> And a filter that requires user intervention to size correctly, or which
> requires rebuilding when it gets too full, is also not as convenient as
> a Scalable Bloom filter.
> 

Maybe. For now I'm fine with the simple relopts-based approach and don't
plan to spend much time on these improvements. Which is not to say that
they may not be worthwhile, but it's not the only thing I'm working on.

I also suspect there are practical implications, as some things are only
possible with equally-sized bloom filters. I believe the "union"
(missing in the current patch) will rely on that when merging bloom filters.

>>> + * XXX We can also watch the number of bits set in the bloom filter, and
>>> + * then stop using it (and not store the bitmap, to save space) when the
>>> + * false positive rate gets too high.
>>>
>>> Ah, right, what you want is a "Scalable Bloom Filter".
>>
>> That's not what I had in mind. My idea was to size the bloom filter on
>> "best effort" basis, and then if one range gets a bit too inaccurate
>> then just get rid of the filter. If that happens for many ranges, we
>> should probably allow specifying parameters as relopts for the index.
> 
> Scalable Bloom filters are way more convenient than that.  They're
> always not-too-full, and only the open filter is less-than-full-enough.
> 
> And since you should grow them somewhat exponentially (but not quite as
> much as a doubling in each generation), there should never be too many
> filters to search.  But you can always "vacuum" (rebuild) the filter
> starting with the size of the last filter added prior to the vacuum.
> 
>> I think this is really over-engineering, and I certainly don't plan
>> to extend the patch in this direction.
> 
> Whereas I think a sparse bitmap representation is overly complex and
> "over-engineering".  Scalable Bloom filters are very well understood in
> the literature -- there's nothing terribly complex to them.
> 

That "sparse bitmap" is mentioned merely in an XXX comment as a thing to
consider, not something I plan to work on right now. Perhaps it's not
really a good idea in general, or maybe it's addressing a non-issue. In
which case we don't need scalable bloom 

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Nico Williams
On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:
> > + * We use an optimisation that initially we store the uint32 values directly,
> > + * without the extra hashing step. And only later filling the bitmap space,
> > + * we switch to the regular bloom filter mode.
> > 
> > I don't think that optimization is worthwhile.  If I'm going to be using
> > a Bloom filter, it's because I expect to have a lot of rows.
> > 
> > (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
> > table just won't have lots of rows, then I might want this optimization,
> > but I can always just drop the Bloom filter index, or not include
> > indexes in the LIKE.)
> 
> I think you're confusing "rows" and "distinct values". Furthermore, it's
> rather irrelevant how many rows are in the table because BRIN indexes
> split the table into "ranges" that are 1MB by default. And you can only
> squash a certain number of rows into such a range.
> 
> The idea of the optimization is to efficiently support cases where each
> range contains only a small number of distinct values - which is quite
> common in the cases I described in my initial message (with bursts of
> rows with the same IP / client identifier etc.).

What range?  It's just bits to set.

The point is that Bloom filters should ideally be about 50% full -- much
less than that and you are wasting space, much more than that and your
false positive rate becomes useless.

> > Filter compression is not worthwhile.  You want to have a fairly uniform
> > hash distribution, and you want to end up with a Bloom filter that's not
> > much more than 50% full.  That means that when near 50% full the filter
> > will not be very sparse at all.  Optimizing for the not common case
> > doesn't seem worthwhile, and adds complexity.
> 
> Properly sizing the bloom filter requires knowledge of many variables,
> in particular the number of distinct values expected to be added to the
> filter. But we don't really know that number, and we also don't even
> know many values useful for estimating that (how many rows will fit into
> a range, number of distinct values in the whole table, etc.)

This is why Scalable Bloom filters exist: so you can adapt.

> So the idea was to oversize the bloom filter, and then use the sparse
> representation to reduce the size.

A space-efficient representation of sparse bitmaps is not as simple as a
Scalable Bloom filter.

And a filter that requires user intervention to size correctly, or which
requires rebuilding when it gets too full, is also not as convenient as
a Scalable Bloom filter.

> > + * XXX We can also watch the number of bits set in the bloom filter, and
> > + * then stop using it (and not store the bitmap, to save space) when the
> > + * false positive rate gets too high.
> > 
> > Ah, right, what you want is a "Scalable Bloom Filter".
> 
> That's not what I had in mind. My idea was to size the bloom filter on
> "best effort" basis, and then if one range gets a bit too inaccurate
> then just get rid of the filter. If that happens for many ranges, we
> should probably allow specifying parameters as relopts for the index.

Scalable Bloom filters are way more convenient than that.  They're
always not-too-full, and only the open filter is less-than-full-enough.

And since you should grow them somewhat exponentially (but not quite as
much as a doubling in each generation), there should never be too many
filters to search.  But you can always "vacuum" (rebuild) the filter
starting with the size of the last filter added prior to the vacuum.

> I think this is really over-engineering, and I certainly don't plan
> to extend the patch in this direction.

Whereas I think a sparse bitmap representation is overly complex and
"over-engineering".  Scalable Bloom filters are very well understood in
the literature -- there's nothing terribly complex to them.

> I do not expect these parameters (particularly the number of distinct
> values in a range) to significantly change over time, so the easiest
> solution is to provide a reloption to specify that number in
> CREATE/ALTER index.

Doesn't this depend on the use-case though?  I think a self-tuning
system is better than one that doesn't self-tune.  Call that
over-engineering if you like, but it doesn't make it not desirable :)

> Alternatively, we could allow the summarization process to gather some
> statistics (either on the whole range, or perhaps the whole table), and
> store them somewhere for later use. For example to compute the number of
> distinct values per range (in the existing data), and use that later.

Again, Scalable Bloom filters automatically adapt without needing a
statistics gathering exercise.  All you need is a good hash function
(that's another topic).

Scalable Bloom filters are a trivial extension of Bloom filters.

> > What I'm getting at is that the query planner really needs to understand
> > that a Bloom filter is a probabilistic data structure.
> 
> It does, and we 

Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Nico Williams
On Fri, Oct 27, 2017 at 02:13:27PM -0700, srielau wrote:
> While the standard may not require a unique index for the ON clause I have
> never seen a MERGE statement that did not have this property.  So IMHO this
> is a reasonable restriction.

I don't understand how one could have a conflict upon which to turn
INSERT into UPDATE without having a UNIQUE constraint violated...

The only question is whether one should have control over -or have to
specify- which constraint violations lead to UPDATE vs. which ones lead
to failure vs. which ones lead to doing nothing.

The row to update is the one that the to-be-inserted row conflicted with.
There can only have been one such row if the violated constraint was a
PRIMARY KEY constraint, or if there is a PRIMARY KEY at all; but if there's
no PRIMARY KEY, then there can have been more conflicting rows because of
NULL columns in the to-be-inserted row.  If the to-be-inserted row
conflicts with multiple rows, then just fail, or don't allow MERGE on
tables that have no PK (as you know, many think it makes no sense not to
have a PK on a table in SQL).

In the common case one does not care about which UNIQUE constraint is
violated because there's only one that could have been violated, or
because if the UPDATE should itself cause some other UNIQUE constraint
to be violated, then the whole statement should fail.

PG's UPSERT is fantastic -- it allows very fine-grained control, but it
isn't as pithy as it could be when the author doesn't care to specify
all that detail.

Also, something like SQLite3's INSERT OR REPLACE is very convenient:
pithy, INSERT syntax, upsert-like semantics[*].

I'd like to have this in PG:

  INSERT INTO ..
  ON CONFLICT DO UPDATE;  -- I.e., update all columns of the existing
                          -- row to match the ones from the row that
                          -- would have been inserted had there not been
                          -- a conflict.
                          --
                          -- If an INSERTed row conflicts and then the
                          -- UPDATE it devolves to also conflicts, then
                          -- fail.

and

  INSERT INTO ..
  ON CONFLICT DO UPDATE   -- I.e., update all columns of the existing
                          -- row to match the ones from the row that
                          -- would have been inserted had there not been
                          -- a conflict.
                          --
  ON CONFLICT DO NOTHING; -- If an INSERTed row conflicts and then the
                          -- UPDATE it devolves to also conflicts, then
                          -- DO NOTHING.
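
For comparison, a minimal sketch of how this has to be spelled today, naming
the conflict target and every column explicitly (hypothetical table t with
primary key k and payload columns a and b):

  INSERT INTO t (k, a, b) VALUES (1, 'x', 'y')
  ON CONFLICT (k) DO UPDATE
     SET a = EXCLUDED.a,
         b = EXCLUDED.b;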


[*] SQLite3's INSERT OR REPLACE is NOT an insert-or-update, but an
insert-or-delete-and-insert, and any deletions that occur in the
process do fire triggers.  INSERT OR UPDATE would be much more
useful.


Nico
-- 




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Michael Paquier
On Fri, Oct 27, 2017 at 9:00 AM, Robert Haas  wrote:
> On Fri, Oct 27, 2017 at 4:41 PM, Simon Riggs  wrote:
>> I didn't say it but my intention was to just throw an ERROR if no
>> single unique index can be identified.
>>
>> It could be possible to still run MERGE in that situation but we would
>> need to take a full table lock at ShareRowExclusive. It's quite likely
>> that such statements would throw duplicate update errors, so I
>> wouldn't be aiming to do anything with that for PG11.
>
> Like Peter, I think taking such a strong lock for a DML statement
> doesn't sound like a very desirable way forward.  It means, for
> example, that you can only have one MERGE in progress on a table at
> the same time, which is quite limiting.  It could easily be the case
> that you have multiple MERGE statements running at once but they touch
> disjoint groups of rows and therefore everything works.  I think the
> code should be able to cope with concurrent changes, if nothing else
> by throwing an ERROR, and then if the user wants to ensure that
> doesn't happen by taking ShareRowExclusiveLock they can do that via an
> explicit LOCK TABLE statement -- or else they can prevent concurrency
> by any other means they see fit.

+1. I would expect users to run this kind of query in parallel on the same
table for multiple data sets.
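
For illustration, the explicit-lock approach Robert describes would look
roughly like this, using hypothetical target/source tables and the
SQL-standard MERGE syntax that is being discussed (not something PostgreSQL
has yet):

  BEGIN;
  LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE;  -- serialize concurrent MERGEs
  MERGE INTO target t
      USING source s ON (t.k = s.k)
      WHEN MATCHED THEN UPDATE SET v = s.v
      WHEN NOT MATCHED THEN INSERT (k, v) VALUES (s.k, s.v);
  COMMIT;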

Peter took some time today to explain his arguments to me, and I agree
that it does not sound very appealing to have constraint limitations for
MERGE. In particular, reusing the existing ON CONFLICT structure gives the
feeling of having two grammars for what's basically the same feature, with
pretty much the same restrictions.

By the way, this page sums up nicely the situation about many
implementations of UPSERT taken for all systems:
https://en.wikipedia.org/wiki/Merge_(SQL)
-- 
Michael




Re: [HACKERS] Reading timeline from pg_control on replication slave

2017-10-27 Thread Michael Paquier
On Fri, Oct 27, 2017 at 1:04 AM, Andrey Borodin  wrote:
> I'm working on backups from a replication slave in WAL-G [0].
> Backups used to use the result of pg_walfile_name(pg_start_backup(...)).
> The call to pg_start_backup() works nicely, but "pg_walfile_name() cannot
> be executed during recovery."
> This function takes an LSN as argument and reads the TimeLineID from
> global state.
> So I made a function[1] that, when on a replica, reads the timeline from
> the pg_control file and formats the WAL file name as if it were produced
> by pg_walfile_name(lsn).

ThisTimeLineID is not something you can rely on for standby backends,
as it is not set during recovery. That's the reason pg_walfile_name()
is disabled during recovery. Three things that one could think about
come to mind:
1) Backups cannot be completed when started on a standby in recovery
and when stopped after the standby has been promoted, meaning that its
timeline has changed.
2) After a standby has been promoted, by using pg_start_backup, you
issue a checkpoint which makes sure that the control file gets flushed
with the new information, so when pg_start_backup returns to the
caller you should have the correct timeline number because the outer
function gets evaluated last.
3) Backups taken from cascading standbys, where a direct parent has
been promoted.

1) and 2) are actually not problems given the restrictions I describe
above, but 3) is. If I recall correctly, when a streaming standby does
a timeline jump, a restart point is not immediately generated, so you
could have the timeline on the control file not updated to the latest
timeline value, meaning that you could have the WAL file name you use
here referring to a previous timeline and not the newest one.

In short, yes, what you are doing is definitely risky in my opinion,
particularly for complex cascading setups.
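
For reference, here is a rough sketch of the kind of name construction being
discussed, assuming PG 10 function names and the default 16MB WAL segment
size; it inherits the cascading-standby caveat above, because the timeline
comes straight from the control file:

  -- on a standby: derive a WAL file name from the last replayed LSN plus
  -- the timeline recorded in pg_control
  WITH s AS (
      SELECT (pg_control_checkpoint()).timeline_id AS tli,
             ((pg_last_wal_replay_lsn() - '0/0'::pg_lsn)::bigint
              / 16777216) AS segno                    -- 16MB segments
  )
  SELECT lpad(to_hex(tli), 8, '0')
      || lpad(to_hex(segno / 256), 8, '0')
      || lpad(to_hex(segno % 256), 8, '0') AS walfile_name
  FROM s;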
-- 
Michael




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Serge Rielau
On Fri, Oct 27, 2017 at 2:42 PM, Peter Geoghegan  wrote:
> On Fri, Oct 27, 2017 at 2:13 PM, srielau  wrote:
>> While the standard may not require a unique index for the ON clause I have
>> never seen a MERGE statement that did not have this property. So IMHO this
>> is a reasonable restriction.
>
> The Oracle docs on MERGE say nothing about unique indexes or
> constraints. They don't even mention them in passing. They do say
> "This statement is a convenient way to combine multiple operations. It
> lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."
>
> SQL Server's MERGE docs do mention unique indexes, but only in
> passing, saying something about unique violations, and that unique
> violations *cannot* be suppressed in MERGE, even though that's
> possible with other DML statements (with something called
> IGNORE_DUP_KEY).
>
> What other systems *do* have this restriction? I've never seen one that did.

Not clear what you are leading up to here.

When I did MERGE in DB2 there was also no such limitation: "Each row in the
target can only be operated on once. A row in the target can only be
identified as MATCHED with one row in the result table of the
table-reference."

What there was, however, was a significant amount of code I had to write and
test to enforce that second sentence. IIRC it involved, in the absence of a
proof that the join could not expand, adding a row_number() over() AS rn
over the target leg of the join, and then a
row_number() over(partition by rn) > 1 THEN RAISE_ERROR() to catch
violators. Maybe in PG there is a trivial way to detect an expanding join
and block it at runtime.

So the whole point I'm trying to make is that I haven't seen the need for
the extra work I had to do once the feature appeared in the wild.

Cheers,
Serge
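
For illustration, a rough SQL sketch of the expansion check described above,
over hypothetical tables tgt and src joined on a key k (DB2's RAISE_ERROR()
has no direct equivalent here, so the extra matches are merely flagged):

  WITH t AS (
      SELECT tgt.*, row_number() OVER () AS rn      -- tag each target row
      FROM tgt
  )
  SELECT t.rn, s.k,
         row_number() OVER (PARTITION BY t.rn) AS match_no
  FROM t JOIN src s ON s.k = t.k;
  -- any match_no > 1 means one target row was matched by more than one
  -- source row, which MERGE has to reject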

Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 2:13 PM, srielau  wrote:
> While the standard may not require a unique index for the ON clause I have
> never seen a MERGE statement that did not have this property.  So IMHO this
> is a reasonable restriction.

The Oracle docs on MERGE say nothing about unique indexes or
constraints. They don't even mention them in passing. They do say
"This statement is a convenient way to combine multiple operations. It
lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

SQL Server's MERGE docs do mention unique indexes, but only in
passing, saying something about unique violations, and that unique
violations *cannot* be suppressed in MERGE, even though that's
possible with other DML statements (with something called
IGNORE_DUP_KEY).

What other systems *do* have this restriction? I've never seen one that did.

-- 
Peter Geoghegan




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-27 Thread Tom Lane
Robert Haas  writes:
> On Fri, Oct 27, 2017 at 7:27 PM, Alvaro Herrera  
> wrote:
>> I noticed that RelationBuildPartitionKey is generating a partition key
>> in a temp context, then creating a private context and copying the key
>> into that.  That seems leftover from some previous iteration of some
>> other patch; I think it's pretty reasonable to create the new context
>> right from the start and allocate the key there directly instead.  Then
>> there's no need for copy_partition_key at all.

> We could do that, but the motivation for the current system was to
> avoid leaking memory in a long-lived context.  I think the same
> technique is used elsewhere for similar reasons.  I admit I haven't
> checked whether there would actually be a leak here if we did it as
> you propose, but I wouldn't find it at all surprising.

Another key point is to avoid leaving a corrupted relcache entry behind
if you fail partway through.  I think that the current coding in
RelationBuildPartitionKey is safe, although it's far too undercommented
for my taste.  (The ordering of the last few steps is *critical*.)

It might work to build the new key in a context that's initially a
child of CurrentMemoryContext, then reparent it to be a child of
CacheMemoryContext when done.  But you'd have to look at whether that
would end up wasting long-lived memory, if the build process generates
any cruft in the same context.

regards, tom lane




Re: [HACKERS] Causal reads take II

2017-10-27 Thread Thomas Munro
On Sun, Oct 1, 2017 at 10:03 AM, Thomas Munro
 wrote:
>> I tried a few more times, and I've got it two times out of four attempts on a
>> fresh
>> installation (when all instances were on the same machine). But anyway I'll
>> try
>> to investigate, maybe it has something to do with my environment.
>>
>> ...
>>
>> 2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number  in log
>> segment 00010020, offset 10092544
>
> Hi Dmitry,
>
> Thanks for testing.  Yeah, it looks like the patch may be corrupting
> the WAL stream in some case that I didn't hit in my own testing
> procedure.  I will try to reproduce these failures.

Hi Dmitry,

I managed to reproduce something like this on one of my home lab
machines running a different OS.  Not sure why yet and it doesn't
happen on my primary development box, which is why I hadn't noticed it.
I will investigate and aim to get a fix posted in time for the
Commitfest.  I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread srielau
Simon,

Nice writeup.

While the standard may not require a unique index for the ON clause I have
never seen a MERGE statement that did not have this property.  So IMHO this
is a reasonable restriction.

In fact I have only ever seen two flavors of usage:
* Single row source (most often simply a VALUES clause) in OLTP.
  In that case there was lots of concurrency.
* Massive source which affects a significant portion of the target table
  in DW.
  In this case there were no concurrent MERGEs.

I believe support for returning rows at a later stage would prove to be very
powerful, especially in combination with chaining MERGE statements in CTEs.
To do that would require language extensions to pass the coloring of the
source row through, especially for rows that fell into "do nothing".








[HACKERS] git down

2017-10-27 Thread Erik Rijkers

git.postgresql.org is down/unreachable

( git://git.postgresql.org/git/postgresql.git )





Re: [HACKERS] Parallel safety for extern params

2017-10-27 Thread Robert Haas
On Mon, Oct 16, 2017 at 12:59 PM, Amit Kapila  wrote:
>> I think the Param case should be mentioned after "... but" not before
>> - i.e. referencing the child node's output... but setrefs.c might also
>> have copied a Const or Param as-is.
>
> I am not sure if we can write the comment like that (.. copied a Const
> or Param as-is.) because fix_upper_expr_mutator in setrefs.c has a
> special handling for Var and Param where constants are copied as-is
> via expression_tree_mutator.  Also as proposed, the handling for
> params is more like Var in exec_save_simple_expr.

I committed fix_parallel_mode_nested_execution_v2.patch with some
cosmetic tweaks.  I back-patched it to 10 and 9.6, then had to fix
some issues reported by Tom as followup commits.

With respect to the bit quoted above, I rephrased the comment in a
slightly different way that hopefully is a reasonable compromise,
combined it with the main patch, and pushed it to master.  Along the
way I also got rid of the if statement you introduced and just made
the Assert() more complicated instead, which seems better to me.

When I tried back-porting the patch to v10 I discovered that the
plpgsql changes conflict heavily and that ripping them out all the way
produces regression failures under force_parallel_mode.  I think I see
why that's happening but it's not obvious to me how to adapt
b73f1b5c29e0ace5a281bc13ce09dea30e1b66de to the v10 code.  Do you want
to have a crack at it or should we just leave this as a master-only
fix?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-27 Thread Robert Haas
On Fri, Oct 27, 2017 at 7:27 PM, Alvaro Herrera  wrote:
> I noticed that RelationBuildPartitionKey is generating a partition key
> in a temp context, then creating a private context and copying the key
> into that.  That seems leftover from some previous iteration of some
> other patch; I think it's pretty reasonable to create the new context
> right from the start and allocate the key there directly instead.  Then
> there's no need for copy_partition_key at all.

We could do that, but the motivation for the current system was to
avoid leaking memory in a long-lived context.  I think the same
technique is used elsewhere for similar reasons.  I admit I haven't
checked whether there would actually be a leak here if we did it as
you propose, but I wouldn't find it at all surprising.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
Hi,

On 10/27/2017 05:22 PM, Sokolov Yura wrote:
> 
> Hi, Tomas
> 
> The BRIN bloom index is a really cool feature that definitely should be in
> the core distribution (either in contrib or built in)!!!
> 
> A small suggestion for the algorithm:
> 
> It is a well-known practice not to calculate the whole hash function for
> every point, but to use a double hashing approach.
> An example of a proof of the double hashing approach for bloom filters:
> https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
> 
> So, instead of:
> for (i = 0; i < filter->nhashes; i++)
> {
>     /* compute the hashes with a seed, used for the bloom filter */
>     uint32 h = ((uint32)DatumGetInt64(hash_uint32_extended(value,
> i))) % filter->nbits;
> 
>     /* set or check bit h */
> }
> 
> such construction could be used:
> /* compute the hashes, used for the bloom filter */
> uint32 big_h = ((uint32)DatumGetInt64(hash_uint32_extended(value, 0)));
> uint32 h = big_h % filter->nbits;
> /* ensures d is never >= filter->nbits */
> uint32 d = big_h % (filter->nbits - 1);
> 
> for (i = 0; i < filter->nhashes; i++)
> {
>     /* set or check bit h */
> 
>     /* compute next bit h */
>     h += d++;
>     if (h >= filter->nbits) h -= filter->nbits;
>     if (d == filter->nbits) d = 0;
> }
> 
> Modulo of one 64bit result by two coprime numbers (`nbits` and `nbits-1`)
> gives two good-quality functions usable for double hashing.
> 

OK, thanks for the suggestion.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
Hi,

On 10/27/2017 07:17 PM, Nico Williams wrote:
> On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:
> 
> A bloom filter index would, indeed, be wonderful.
> 
> Comments:
> 
> + * We use an optimisation that initially we store the uint32 values directly,
> + * without the extra hashing step. And only later filling the bitmap space,
> + * we switch to the regular bloom filter mode.
> 
> I don't think that optimization is worthwhile.  If I'm going to be using
> a Bloom filter, it's because I expect to have a lot of rows.
> 
> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
> table just won't have lots of rows, then I might want this optimization,
> but I can always just drop the Bloom filter index, or not include
> indexes in the LIKE.)
> 

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table because BRIN indexes
split the table into "ranges" that are 1MB by default. And you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

> + * XXX Perhaps we could save a few bytes by using different data types, but
> + * considering the size of the bitmap, the difference is negligible.
> 
> A bytea is all that's needed.  See below.
> 
> + * XXX We could also implement "sparse" bloom filters, keeping only the
> + * bytes that are not entirely 0. That might make the "sorted" phase
> + * mostly unnecessary.
> 
> Filter compression is not worthwhile.  You want to have a fairly uniform
> hash distribution, and you want to end up with a Bloom filter that's not
> much more than 50% full.  That means that when near 50% full the filter
> will not be very sparse at all.  Optimizing for the not common case
> doesn't seem worthwhile, and adds complexity.
> 

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we also don't even
know many values useful for estimating that (how many rows will fit into
a range, number of distinct values in the whole table, etc.)

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.
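
For reference, the textbook sizing rule makes the dependence on the number
of distinct values explicit (this is generic bloom filter math, not
something specific to the patch). For n distinct values and a target false
positive rate p:

  m = -n * ln(p) / (ln 2)^2      bits in the filter
  k = (m / n) * ln 2             number of hash functions

  e.g. n = 1000 distinct values per range and p = 0.01 give
  m ~= 9600 bits (about 1.2 kB) and k ~= 7 hash functions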

> + * XXX We can also watch the number of bits set in the bloom filter, and
> + * then stop using it (and not store the bitmap, to save space) when the
> + * false positive rate gets too high.
> 
> Ah, right, what you want is a "Scalable Bloom Filter".
> 

That's not what I had in mind. My idea was to size the bloom filter on
"best effort" basis, and then if one range gets a bit too inaccurate
then just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

> A Scalable Bloom filter is actually a series of Bloom filters where all
> but the newest filter are closed to additions, and the newest filter is
> where you do all the additions.  You generally want to make each new
> filter bigger than the preceding one because when searching a Scalable
> Bloom filter you have to search *all* of them, so you want to minimize
> the number of filters.
> 
> Eventually, when you have enough sub-filters, you may want to re-write
> the whole thing so that you start with a single sub-filter that is large
> enough.
> 
> The format of the bytea might then be something like:
> 
> [[[...]]
> 
> where the last bitmap is the filter that is open to additions.
>

I think this is really over-engineering, and I certainly don't plan
to extend the patch in this direction.

I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER index.

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

> ...
> 
> Of course, for something like:
> 
>   SELECT a.*, b.* FROM a a JOIN b b USING (some_col);
> 
> a Bloom filter can only be used as an optimization to avoid using a
> slower index (or heap scan) on the inner table source.
> 
> What I'm getting at is that the query planner really needs to understand
> that a Bloom filter is a probabilistic data structure.
> 

It does, and we never produce incorrect results. Which seems like the
right thing to do.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] logical decoding of two-phase transactions

2017-10-27 Thread Sokolov Yura

On 2017-10-26 22:01, Sokolov Yura wrote:

On 2017-09-27 14:46, Stas Kelvich wrote:
On 7 Sep 2017, at 18:58, Nikhil Sontakke  
wrote:


Hi,

FYI all, wanted to mention that I am working on an updated version of
the latest patch that I plan to submit to a later CF.



Cool!

So what kind of architecture do you have in mind? The same way as it
was implemented before?
As far as I remember there were two main issues:

* Decoding of aborted prepared transactions.

If such a transaction modified the catalog then we can't read reliable
info with our historic snapshot: since the clog already has the aborted
bit set for our tx, it would break the visibility logic. There are some
ways to deal with that -- by doing the catalog seq scan twice and counting
the number of tuples (details upthread), or by hijacking clog values in
the historic visibility function. But ISTM it is better not to solve this
issue at all =) In most cases the intended usage of decoding a 2PC
transaction is to do some form of distributed commit, so naturally
decoding will happen only for in-progress transactions, and the
commit/abort will happen only after the transaction is decoded, sent, and
a response is received. So we can just have an atomic flag that prevents
commit/abort of the tx currently being decoded. And we can filter
interesting prepared transactions based on GID, to prevent holding this
lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about.

I spent some time trying to find one, but didn't find any. If locking
pg_class in a prepared tx is the only example, then (imho) it is better to
just forbid preparing such transactions. Otherwise, if realistic examples
that can block decoding actually exist, then we probably need to
reconsider the way a tx is decoded. Anyway, this part probably needs
Andres's blessing.


Just rebased patch logical_twophase_v6 to master.

Fixed small issues:
- XactLogAbortRecord wrote DBINFO twice, but it was decoded in
  ParseAbortRecord only once. The second DBINFO was parsed as ORIGIN.
  Fixed by removing the second write of DBINFO.
- SnapBuildPrepareTxnFinish tried to remove the xid from `running` instead
  of `committed`. And it removed only the xid, without subxids.
- test_decoding skipped returning "COMMIT PREPARED" and "ABORT PREPARED",


The big issue was with decoding two-phase transactions that include DDL:
- prepared.out was misleading. We could not reproduce decoding the body of
  "test_prepared#3" with logical_twophase_v6.diff. It was skipped if
  `pg_logical_slot_get_changes` was called without
  `twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED
  test_prepared#3" was decoded.
The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare`
twice:
- first on the "PREPARE" itself,
- second on the "COMMIT PREPARED".
In v6, `pg_filter_prepare` without `with-catalog-changes` answered "true"
the first time (ie it should not be decoded), and the second time (when
the transaction became committed) it answered "false" (ie it should be
decoded). But the second time around, in DecodePrepare,
`ctx->snapshot_builder->start_decoding_at` is already in the future
compared to `buf->origptr` (because it is at the "COMMIT PREPARED" lsn).
Therefore DecodePrepare just called ReorderBufferForget.
If `pg_filter_prepare` is called with `with-catalog-changes`, then it
returns "false" both times, so DecodePrepare decodes the transaction the
first time, and calls `ReorderBufferForget` the second time.

I didn't find a way to fix it gracefully. I just changed
`pg_filter_prepare` to return the same answer both times: "false" if
called `with-catalog-changes` (ie DecodePrepare needs to be called), and
"true" otherwise. With this change, a catalog-changing two-phase
transaction is decoded as a simple one-phase transaction if
`pg_logical_slot_get_changes` is called without `with-catalog-changes`.


Small improvement compared to v7:
- twophase_gid is written with alignment padding in the XactLogCommitRecord
  and XactLogAbortRecord.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

logical_twophase_v8.diff.gz
Description: GNU Zip compressed data



Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Sokolov Yura

On 2017-10-27 20:17, Nico Williams wrote:

> On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:
>
> [...]
>
> What I'm getting at is that the query planner really needs to understand
> that a Bloom filter is a probabilistic data structure.


PostgreSQL has a lot of probabilistic index behaviour already.

Some GiST opclasses return false positives and tell the executor to
recheck their results.
A bitmap index scan, when there are a lot of result tuples, falls back to
storing only page numbers, without actual TIDs, and the executor then
scans those heap pages to find the matching tuples.

A BRIN index, after all, just "recommends that the executor scan that heap
page". Thus a BRIN index as a whole is just an optimisation (regardless of
whether it is `minmax` or `bloom`).
So that is OK.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] Index only scan for cube and seg

2017-10-27 Thread Tom Lane
Robert Haas  writes:
> On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin  wrote:
>> For cube there is new default opclass.

> I seem to recall that changing the default opclass causes unsolvable
> problems with upgrades.  You might want to check the archives for
> previous discussions of this issue; unfortunately, I don't recall the
> details off-hand.

Quite aside from that, replacing the opclass with a new one creates
user-visible headaches that I don't think are justified, i.e. having to
reconstruct indexes in order to get the benefit.

Maybe I'm missing something, but ISTM you could just drop the compress
function and call it good.  This would mean that an IOS scan would
sometimes hand back a toast-compressed value, but what's the problem
with that?

(The only reason for making a decompress function that just detoasts
is that your other support functions then do not have to consider
the possibility that they're handed a toast-compressed value.  I have
not checked whether that really matters for cube.)

regards, tom lane




[HACKERS] ALTER COLUMN TYPE vs. domain constraints

2017-10-27 Thread Tom Lane
I found out that altering a column's type does not play nicely with
domain constraints: tablecmds.c expects that only table constraints
could depend on a column.  Now, it's easy to hit that with domains
over composite, so I propose to fix it in HEAD with the attached
patch.  However, if you really work at it, you can make it fail
in the back branches too:

regression=# create type comptype as (r float8, i float8);
CREATE TYPE
regression=# create domain silly as float8 check ((row(value,0)::comptype).r > 0);
CREATE DOMAIN
regression=# alter type comptype alter attribute r type varchar;
ERROR:  cache lookup failed for relation 0

Before commit 6784d7a1d, the ALTER actually went through, leaving a
mess.  Fortunately it doesn't actually crash afterwards, but you
get things like

regression=# select 0::silly;
ERROR:  ROW() column has type double precision instead of type character varying

We could consider back-patching the attached to cover this, but
I'm not entirely sure it's worth the trouble, because I haven't
thought of any non-silly use-cases in the absence of domains
over composite.  Comments?

regards, tom lane

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3ab8087..fc93c4e 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** static void ATPostAlterTypeParse(Oid old
*** 426,431 
--- 426,433 
  	 bool rewrite);
  static void RebuildConstraintComment(AlteredTableInfo *tab, int pass,
  		 Oid objid, Relation rel, char *conname);
+ static void RebuildDomainConstraintComment(AlteredTableInfo *tab, int pass,
+ 			   Oid objid, List *domname, char *conname);
  static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
  static void TryReuseForeignKey(Oid oldId, Constraint *con);
  static void change_owner_fix_column_acls(Oid relationOid,
*** AlterTableGetLockLevel(List *cmds)
*** 3319,3324 
--- 3321,3327 
  			case AT_ProcessedConstraint:	/* becomes AT_AddConstraint */
  			case AT_AddConstraintRecurse:	/* becomes AT_AddConstraint */
  			case AT_ReAddConstraint:	/* becomes AT_AddConstraint */
+ 			case AT_ReAddDomainConstraint:	/* becomes AT_AddConstraint */
  if (IsA(cmd->def, Constraint))
  {
  	Constraint *con = (Constraint *) cmd->def;
*** ATRewriteCatalogs(List **wqueue, LOCKMOD
*** 3819,3825 
  			rel = relation_open(tab->relid, NoLock);
  
  			foreach(lcmd, subcmds)
! ATExecCmd(wqueue, tab, rel, (AlterTableCmd *) lfirst(lcmd), lockmode);
  
  			/*
  			 * After the ALTER TYPE pass, do cleanup work (this is not done in
--- 3822,3830 
  			rel = relation_open(tab->relid, NoLock);
  
  			foreach(lcmd, subcmds)
! ATExecCmd(wqueue, tab, rel,
! 		  castNode(AlterTableCmd, lfirst(lcmd)),
! 		  lockmode);
  
  			/*
  			 * After the ALTER TYPE pass, do cleanup work (this is not done in
*** ATExecCmd(List **wqueue, AlteredTableInf
*** 3936,3941 
--- 3941,3953 
  ATExecAddConstraint(wqueue, tab, rel, (Constraint *) cmd->def,
  	true, true, lockmode);
  			break;
+ 		case AT_ReAddDomainConstraint:	/* Re-add pre-existing domain check
+ 		 * constraint */
+ 			address =
+ AlterDomainAddConstraint(((AlterDomainStmt *) cmd->def)->typeName,
+ 		 ((AlterDomainStmt *) cmd->def)->def,
+ 		 NULL);
+ 			break;
  		case AT_ReAddComment:	/* Re-add existing comment */
  			address = CommentObject((CommentStmt *) cmd->def);
  			break;
*** ATPostAlterTypeCleanup(List **wqueue, Al
*** 9616,9622 
  		if (!HeapTupleIsValid(tup)) /* should not happen */
  			elog(ERROR, "cache lookup failed for constraint %u", oldId);
  		con = (Form_pg_constraint) GETSTRUCT(tup);
! 		relid = con->conrelid;
  		confrelid = con->confrelid;
  		conislocal = con->conislocal;
  		ReleaseSysCache(tup);
--- 9628,9642 
  		if (!HeapTupleIsValid(tup)) /* should not happen */
  			elog(ERROR, "cache lookup failed for constraint %u", oldId);
  		con = (Form_pg_constraint) GETSTRUCT(tup);
! 		if (OidIsValid(con->conrelid))
! 			relid = con->conrelid;
! 		else
! 		{
! 			/* must be a domain constraint */
! 			relid = get_typ_typrelid(getBaseType(con->contypid));
! 			if (!OidIsValid(relid))
! elog(ERROR, "could not identify relation associated with constraint %u", oldId);
! 		}
  		confrelid = con->confrelid;
  		conislocal = con->conislocal;
  		ReleaseSysCache(tup);
*** ATPostAlterTypeParse(Oid oldId, Oid oldR
*** 9753,9759 
  
  			foreach(lcmd, stmt->cmds)
  			{
! AlterTableCmd *cmd = (AlterTableCmd *) lfirst(lcmd);
  
  if (cmd->subtype == AT_AddIndex)
  {
--- 9773,9779 
  
  			foreach(lcmd, stmt->cmds)
  			{
! AlterTableCmd *cmd = castNode(AlterTableCmd, lfirst(lcmd));
  
  if (cmd->subtype == AT_AddIndex)
  {
*** ATPostAlterTypeParse(Oid oldId, Oid oldR
*** 9781,9789 
  }
  else if (cmd->subtype == 

Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-27 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I'm now working on the ability to build unique indexes (and unique
> constraints) on top of this.

So I think there's not a lot of additional code required to support
unique indexes with the restrictions mentioned; proof-of-concept (with
several holes still) attached.

As before, this is not finished, as there a few things that are wrong
(such as generateClonedIndexStmt taking a RangeVar), and others that I
don't understand (such as why is rd_partkey NULL for partitioned
partitions when DefineIndex cascades to them and the columns are
checked).

I noticed that RelationBuildPartitionKey is generating a partition key
in a temp context, then creating a private context and copying the key
into that.  That seems leftover from some previous iteration of some
other patch; I think it's pretty reasonable to create the new context
right from the start and allocate the key there directly instead.  Then
there's no need for copy_partition_key at all.

Anyway, here is a preliminary submission before I close shop for the
week.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 37e683e352df5f3e6c110d94881abe888e36953e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 16 Oct 2017 12:01:12 +0200
Subject: [PATCH v2 1/7] Tweak index_create/index_constraint_create APIs

I was annoyed by index_create's and index_constraint_create's respective
APIs.  It's too easy to get confused when adding a new behavioral flag
given the large number of boolean flags they already have.  Turning them
into a flags bitmask makes that code easier to read, as in the attached
patch.

I split the flags in two -- one set for index_create directly and
another related to constraints.  index_create() itself receives both,
and then passes down the constraint one to index_constraint_create.  It
is shorter in LOC terms to create a single set mixing all the values,
but it seemed to make more sense this way.  Also, I chose not to turn
the booleans 'allow_system_table_mods' and 'is_internal' into flag bits
because of the different way they interact with this code.
---
 src/backend/catalog/index.c  | 101 +++
 src/backend/catalog/toasting.c   |   3 +-
 src/backend/commands/indexcmds.c |  33 +
 src/backend/commands/tablecmds.c |  13 +++--
 src/include/catalog/index.h  |  29 ++-
 5 files changed, 100 insertions(+), 79 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c7b2f031f0..17faeffada 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -680,19 +680,25 @@ UpdateIndexRelation(Oid indexoid,
  * classObjectId: array of index opclass OIDs, one per index column
  * coloptions: array of per-index-column indoption settings
  * reloptions: AM-specific options
- * isprimary: index is a PRIMARY KEY
- * isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION constraint
- * deferrable: constraint is DEFERRABLE
- * initdeferred: constraint is INITIALLY DEFERRED
+ * flags: bitmask that can include any combination of these bits:
+ * INDEX_CREATE_IS_PRIMARY
+ * the index is a primary key
+ * INDEX_CREATE_ADD_CONSTRAINT:
+ * invoke index_constraint_create also
+ * INDEX_CREATE_SKIP_BUILD:
+ * skip the index_build() step for the moment; caller must do it
+ * later (typically via reindex_index())
+ * INDEX_CREATE_CONCURRENT:
+ * do not lock the table against writers.  The index will be
+ * marked "invalid" and the caller must take additional steps
+ * to fix it up.
+ * INDEX_CREATE_IF_NOT_EXISTS:
+ * do not throw an error if a relation with the same name
+ * already exists.
+ * constr_flags: flags passed to index_constraint_create
+ * (only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
- * skip_build: true to skip the index_build() step for the moment; caller
- * must do it later (typically via reindex_index())
- * concurrent: if true, do not lock the table against writers.  The index
- * will be marked "invalid" and the caller must take additional steps
- * to fix it up.
  * is_internal: if true, post creation hook for new index
- * if_not_exists: if true, do not throw an error if a relation with
- * the same name already exists.
  *
  * Returns the OID of the created index.
  */
@@ -709,15 +715,10 @@ index_create(Relation heapRelation,
 Oid *classObjectId,
 int16 *coloptions,
 Datum reloptions,
-bool isprimary,
-bool 

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Nico Williams
On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:

A bloom filter index would, indeed, be wonderful.

Comments:

+ * We use an optimisation that initially we store the uint32 values directly,
+ * without the extra hashing step. And only later filling the bitmap space,
+ * we switch to the regular bloom filter mode.

I don't think that optimization is worthwhile.  If I'm going to be using
a Bloom filter, it's because I expect to have a lot of rows.

(Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
table just won't have lots of rows, then I might want this optimization,
but I can always just drop the Bloom filter index, or not include
indexes in the LIKE.)

+ * XXX Perhaps we could save a few bytes by using different data types, but
+ * considering the size of the bitmap, the difference is negligible.

A bytea is all that's needed.  See below.

+ * XXX We could also implement "sparse" bloom filters, keeping only the
+ * bytes that are not entirely 0. That might make the "sorted" phase
+ * mostly unnecessary.

Filter compression is not worthwhile.  You want to have a fairly uniform
hash distribution, and you want to end up with a Bloom filter that's not
much more than 50% full.  That means that when near 50% full the filter
will not be very sparse at all.  Optimizing for the not common case
doesn't seem worthwhile, and adds complexity.

+ * XXX We can also watch the number of bits set in the bloom filter, and
+ * then stop using it (and not store the bitmap, to save space) when the
+ * false positive rate gets too high.

Ah, right, what you want is a "Scalable Bloom Filter".

A Scalable Bloom filter is actually a series of Bloom filters where all
but the newest filter are closed to additions, and the newest filter is
where you do all the additions.  You generally want to make each new
filter bigger than the preceding one because when searching a Scalable
Bloom filter you have to search *all* of them, so you want to minimize
the number of filters.

Eventually, when you have enough sub-filters, you may want to re-write
the whole thing so that you start with a single sub-filter that is large
enough.

The format of the bytea might then be something like:

  [sub-filter 1][sub-filter 2]...[sub-filter N]

where the last bitmap is the filter that is open to additions.

I wonder if there are write concurrency performance considerations
here...

It might be better to have a bytea value per-sub-filter so that there is
no lock contention for the closed filters.  The closed sub-filters are
constant, thus not even shared locks are needed for them, and especially
not exclusive locks.

Writing to the filter will necessitate locking the entire open filter,
or else byte-range locking on it.  Something to think about.

> Now, what about query performance? Unlike the "minmax" indexes, the
> "bloom" filter can only handle equality conditions.

A Bloom filter has non-zero false positives for existence, but zero
false positives for non-existence.

This means that a Bloom filter index can only be used for:

a) non-existence tests (with equality predicates, as you point out),

b) as an optimization to avoid slower index checks (or heap scans) when
   the Bloom filter indicates non-existence.

(b) is really just an internal application of (a).

There might be applications where false positives might be ok in a query
like:

  SELECT a.* FROM a a JOIN b b USING (some_col);

but for most real-world queries like that, allowing false positives by
default would be very bad.  An option in the index declaration could be
used to enable existence equality predicates, but I would not include
such an option initially -- let's see if anyone actually has a use case
for it first.

Of course, for something like:

  SELECT a.*, b.* FROM a a JOIN b b USING (some_col);

a Bloom filter can only be used as an optimization to avoid using a
slower index (or heap scan) on the inner table source.

What I'm getting at is that the query planner really needs to understand
that a Bloom filter is a probabilistic data structure.

Nico
-- 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index only scan for cube and seg

2017-10-27 Thread Robert Haas
On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin  wrote:
> For cube there is new default opclass.

I seem to recall that changing the default opclass causes unsolvable
problems with upgrades.  You might want to check the archives for
previous discussions of this issue; unfortunately, I don't recall the
details off-hand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Robert Haas
On Fri, Oct 27, 2017 at 4:41 PM, Simon Riggs  wrote:
> I didn't say it but my intention was to just throw an ERROR if no
> single unique index can be identified.
>
> It could be possible to still run MERGE in that situation but we would
> need to take a full table lock at ShareRowExclusive. It's quite likely
> that such statements would throw duplicate update errors, so I
> wouldn't be aiming to do anything with that for PG11.

Like Peter, I think taking such a strong lock for a DML statement
doesn't sound like a very desirable way forward.  It means, for
example, that you can only have one MERGE in progress on a table at
the same time, which is quite limiting.  It could easily be the case
that you have multiple MERGE statements running at once but they touch
disjoint groups of rows and therefore everything works.  I think the
code should be able to cope with concurrent changes, if nothing else
by throwing an ERROR, and then if the user wants to ensure that
doesn't happen by taking ShareRowExclusiveLock they can do that via an
explicit LOCK TABLE statement -- or else they can prevent concurrency
by any other means they see fit.
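
Concretely, that workaround would be something along these lines (just a
sketch -- "target_table" is a placeholder, and MERGE itself is of course
still only proposed syntax at this point):

BEGIN;
LOCK TABLE target_table IN SHARE ROW EXCLUSIVE MODE;
-- run the MERGE (or other DML) that must not run concurrently with
-- another MERGE on target_table
COMMIT;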

Other problems with taking ShareRowExclusiveLock include (1) probable
lock upgrade hazards and (2) do you really want MERGE to kick
autovacuum off of your giant table?  Probably not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 6:24 AM, Robert Haas  wrote:
> I think one of the reasons why Peter Geoghegan decided to pursue
> INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL
> syntax, he felt free to mandate a non-standard SQL requirement, namely
> the presence of a unique index on the arbiter columns.

That's true, but what I was really insistent on, more than anything
else, was that the user would get a practical guarantee about an
insert-or-update outcome under concurrency. There could be no
"unprincipled deadlocks", nor could there be spurious unique
violations.

This is the kind of thing that the SQL standard doesn't really concern
itself with, and yet it's of significant practical importance to
users. Both Oracle and SQL Server allow these things that I
specifically set out to avoid. I think that that's mostly a good
thing, though; they do a really bad job of explaining what's what, and
don't provide for a very real need ("upsert") in some other way, but
their MERGE semantics do make sense to me.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 7:41 AM, Simon Riggs  wrote:
> Good points.
>
> I didn't say it but my intention was to just throw an ERROR if no
> single unique index can be identified.

You'd also throw an error when there was no "upsert compatible" join
quals, I take it?

I don't see the point in that. That's just mapping one syntax on to another.

> It could be possible to still run MERGE in that situation but we would
> need to take a full table lock at ShareRowExclusive. It's quite likely
> that such statements would throw duplicate update errors, so I
> wouldn't be aiming to do anything with that for PG11.

I would avoid mixing up ON CONFLICT DO UPDATE and MERGE. The
"anomalies" you describe in MERGE are not really anomalies IMV.
They're simply how the feature is supposed to operate, and how it's
possible to make MERGE use alternative join algorithms based only on
the underlying cost. You might use a merge join for a bulk load
use-case, for example.

I think an SQL MERGE feature would be compelling, but I don't think
that it should take much from ON CONFLICT. As I've said many times
[1], they're really two different features (Teradata had features
similar to each, for example). I suggest that you come up with
something that has the semantics that the standard requires, and
therefore makes none of the ON CONFLICT guarantees about an outcome
under concurrency (INSERT or UPDATE). Those guarantees are basically
incompatible with how MERGE needs to work.

In case it matters, I think that the idea of varying relation
heavyweight lock strength based on subtle semantics within a DML
statement is a bad one. Frankly, I think that that's going to be a
nonstarter.

[1] 
https://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=yydgyactzg0xa8-1_fcvo5xm7hrel3...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Sokolov Yura

On 2017-10-19 23:15, Tomas Vondra wrote:

Hi,

The BRIN minmax opclasses work well only for data where the column is
somewhat correlated to physical location in a table. So it works great
for timestamps in append-only log tables, for example. When that is not
the case (non-correlated columns) the minmax ranges get very "wide" and
we end up scanning large portions of the tables.

A typical example are columns with types that are inherently random (or
rather non-correlated) like for example UUIDs, IP or MAC addresses, and
so on. But it can just as easily happen even with simple IDs generated
from a sequence.

This WIP patch implements another type of BRIN index, with "summary"
being represented by a bloom filter - so instead of a [min,max] range,
a bloom filter of the values is built for each page range. The index is
of course larger than the
regular "minmax" BRIN index, but it's still orders of magnitude smaller
than a btree index.

Note: This is different from the index type implemented in the "bloom"
extension. Those are not related to BRIN at all, and the index builds a
bloom filter on individual rows (values in different columns).

BTW kudos to everyone who worked on the BRIN infrastructure and API. I
found it very intuitive and well-documented. Adding the new opclass was
extremely straight-forward, and


To demonstrate this on a small example, consider this table:

CREATE TABLE bloom_test (id uuid, padding text);

INSERT INTO bloom_test
SELECT md5((mod(i,100)/100)::text)::uuid, md5(i::text)
  FROM generate_series(1,2) s(i);

VACUUM ANALYZE bloom_test;

          List of relations
 Schema |    Name    | Type  | Owner | Size  | Description
--------+------------+-------+-------+-------+-------------
 public | bloom_test | table | tomas | 16 GB |
(1 row)

The table was built so that each range contains relatively small number
of distinct UUID values - this is typical e.g. for user activity with
"bursts" of actions from a particular user separated by long periods of
inactivity (with actions from other users).

Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
column.

create index test_brin_idx on bloom_test
 using brin (id);

create index test_bloom_idx on bloom_test
 using brin (id uuid_bloom_ops);

create index test_btree_idx on bloom_test (id);

which gives us this:

                     List of relations
 Schema |      Name      | Type  | Owner |   Table    |  Size
--------+----------------+-------+-------+------------+---------
 public | test_bloom_idx | index | tomas | bloom_test | 12 MB
 public | test_brin_idx  | index | tomas | bloom_test | 832 kB
 public | test_btree_idx | index | tomas | bloom_test | 6016 MB
(3 rows)

So yeah - the "bloom" index is larger than the default "minmax" index,
but it's still orders of magnitude smaller than the regular btree one.

Now, what about query performance? Unlike the "minmax" indexes, the
"bloom" filter can only handle equality conditions.

Let's see a query like this:

select * from bloom_test
 where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

QUERY PLAN

 Bitmap Heap Scan on bloom_test
 (cost=390.31..2130910.82 rows=20593 width=49)
 (actual time=197.974..22707.311 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 19998
   Heap Blocks: lossy=2061856
   ->  Bitmap Index Scan on test_brin_idx
   (cost=0.00..385.16 rows=5493161 width=0)
   (actual time=133.463..133.463 rows=20619520 loops=1)
         Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)

 Planning time: 0.165 ms
 Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of "wide" ranges - the bitmap heap scan had to scan pretty much the
whole table. (A sequential scan takes only about 15 seconds.)

Now, the bloom index:

QUERY PLAN

 Bitmap Heap Scan on bloom_test
 (cost=5898.31..2136418.82 rows=20593 width=49)
 (actual time=24.229..338.324 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 2500448
   Heap Blocks: lossy=25984
   ->  Bitmap Index Scan on test_bloom_idx
 (cost=0.00..5893.16 rows=5493161 width=0)
 (actual time=22.811..22.811 rows=259840 loops=1)
         Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)

 Planning time: 0.201 ms
 Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
take only about ~10ms in this case, but that's mostly thanks to the
whole dataset being entirely 

Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Simon Riggs
On 27 October 2017 at 15:24, Robert Haas  wrote:
> On Fri, Oct 27, 2017 at 10:55 AM, Simon Riggs  wrote:
>> Questions?
>
> I think one of the reasons why Peter Geoghegan decided to pursue
> INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL
> syntax, he felt free to mandate a non-standard SQL requirement, namely
> the presence of a unique index on the arbiter columns.  If MERGE's
> join clause happens to involve equality conditions on precisely the
> same set of columns as some unique index on the target table, then I
> think you can reuse the INSERT .. ON CONFLICT UPDATE infrastructure
> and I suspect there won't be too many problems.

Agreed

> However, if it
> doesn't, then what?  You could decree that such cases will fail, but
> that might not meet your use case for developing the feature.  Or you
> could try to soldier on without the INSERT .. ON CONFLICT UPDATE
> machinery, but that means, I think, that sometimes you will get
> serialization anomalies - at least, I think, you will sometimes obtain
> results that couldn't have been obtained under any serial order of
> execution, and maybe it would end up being possible to fail with
> serialization errors or unique index violations.
>
> In the past, there have been objections to implementations of MERGE
> which would give rise to such serialization anomalies, but I'm not
> sure we should feel bound by those discussions.  One thing that's
> different is that the common and actually-useful case can now be made
> to work in a fairly satisfying way using INSERT .. ON CONFLICT UPDATE;
> if less useful cases are vulnerable to some weirdness, maybe it's OK
> to just document the problems.

Good points.

I didn't say it but my intention was to just throw an ERROR if no
single unique index can be identified.

It could be possible to still run MERGE in that situation but we would
need to take a full table lock at ShareRowExclusive. It's quite likely
that such statements would throw duplicate update errors, so I
wouldn't be aiming to do anything with that for PG11.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Linking libpq statically to libssl

2017-10-27 Thread Daniele Varrazzo
On Fri, Oct 27, 2017 at 2:37 PM, Tom Lane  wrote:
> Daniele Varrazzo  writes:
>> I have a problem building binary packages for psycopg2. Binary
>> packages ship with their own copies of libpq and libssl; however if
>> another python package links to libssl the library will be imported
>> twice with conflicting symbols, likely resulting in a segfault (see
>> https://github.com/psycopg/psycopg2/issues/543). This happens e.g. if
>> a python script both connects to postgres and opens an https resource.
>
> Basically, you're doing it wrong.  Shipping your own copy of libssl,
> rather than depending on whatever packaging the platform provides,
> is just asking for pain --- and not only of this sort.  You're also
> now on the hook to update your package whenever libssl fixes a bug
> or a security vulnerability, which happens depressingly often.
>
> The same applies to libpq, really.  You don't want to be in the
> business of shipping bits that you are not the originator of.
>
> When I worked at Red Hat, there was an ironclad policy against
> building packages that incorporated other packages statically.
> I would imagine that other distros have similar policies for
> similar reasons.  Just because you *can* ignore those policies
> doesn't mean you *should*.

Distros do compile the library from source and against the system
package, and everyone using the package directly can still do so; the
binary packages are only installed by the Python package manager.
Psycopg is more complex to install than the average Python package (it
needs python and postgres dev files, pg_config available somewhere
etc) and a no-dependencies package provides a much smoother
experience. It also happens that the libpq and libssl versions used
tend to be more up-to-date than the system one (people can already use
the new libpq 10 features without waiting for debian packages).

I am confronted with the reality of Python developers as of 201x's,
and shipping binary packages has proven generally a positive feature,
even factoring in the need of shipping updated binary packages when
the need arises. The benefit of a simple-to-use library is for the
Postgres project at large, it is not for my personal gain. So while I
know the shortcomings of binary packages and static libraries, I would
still be interested in knowing the best way to fix the problem in the
subject. Feel free to write in private if you want to avoid the public
shaming.

-- Daniele


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: schema variables

2017-10-27 Thread Pavel Stehule
2017-10-27 15:38 GMT+02:00 Gilles Darold :

> On 26/10/2017 at 09:21, Pavel Stehule wrote:
> > Hi,
> >
> > I propose a new database object - a variable. A variable is a
> > persistent object that holds an unshared, session-based,
> > non-transactional, in-memory value of any type, like variables in
> > other languages. The persistence is required to make static checks
> > possible, but it can be limited to the session - the variables can
> > be temporary.
> >
> > My proposal is related to session variables in Sybase, MSSQL or
> > MySQL (based on the @ or @@ prefix), package variables in Oracle
> > (where access is controlled by scope), and schema variables in DB2.
> > Each design comes from different sources and traditions and has its
> > own advantages and disadvantages. The core of my proposal is the use
> > of schema variables as session variables for stored procedures. It
> > should help people who try to port complex projects to PostgreSQL
> > from other databases.
> >
> > The Sybase (T-SQL) design is good for interactive work, but it is
> > weak for use in stored procedures - static checks are not possible,
> > and it is not possible to set access rights on variables.
> >
> > The ADA design (used by Oracle), based on scope, is great, but our
> > environment is not nested, and we should support PLs other than
> > PL/pgSQL more strongly.
> >
> > There are not many other possibilities - a variable that should be
> > accessible from different PLs and different procedures (over time)
> > has to live somewhere above the PL level, and that is the schema.
> >
> > The variable can be created by a CREATE statement:
> >
> > CREATE VARIABLE public.myvar AS integer;
> > CREATE VARIABLE myschema.myvar AS mytype;
> >
> > CREATE [TEMP] VARIABLE [IF NOT EXISTS] name AS type
> >   [ DEFAULT expression ] [[NOT] NULL]
> >   [ ON TRANSACTION END { RESET | DROP } ]
> >   [ { VOLATILE | STABLE } ];
> >
> > It is dropped by the command DROP VARIABLE [ IF EXISTS ] varname.
> >
> > Access rights are controlled by the usual mechanism - the
> > GRANT/REVOKE commands. The possible rights are: READ, WRITE
> >
> > The variables can be modified by the SQL command SET (this is taken
> > from the standard, and it is natural):
> >
> > SET varname = expression;
> >
> > Unfortunately we use the SET command for a different purpose. But I
> > think we can solve that with a few tricks. The first is moving our
> > GUCs to the pg_catalog schema. We can control the strictness of the
> > SET command: in one variant we can detect a custom GUC and allow it,
> > in another we can disallow custom GUCs and allow only schema
> > variables. A new command LET could be an alternative.
> >
> > The variables should be usable in queries implicitly (without JOIN):
> >
> > SELECT varname;
> >
> > The SEARCH_PATH is used when varname is looked up. The variables can
> > be used everywhere that query parameters are allowed.
> >
> > I hope this proposal is good enough and simple.
> >
> > Comments, notes?
> >
> > regards
> >
> > Pavel
> >
> >
>
> Great feature that will help for migration. How will you handle CONSTANT
> declaration? With Oracle it is possible to declare a constant as follows:
>
>
>   varname CONSTANT INTEGER:= 500;
>
>
> for a variable that can't be changed. Do you plan to add a CONSTANT or
> READONLY keyword or do you want to use GRANT on the object to deal with
> this case?
>

PL/pgSQL declarations already support CONSTANT.
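
For example, a minimal sketch of the existing PL/pgSQL syntax:

DO $$
DECLARE
    c CONSTANT integer := 500;
BEGIN
    RAISE NOTICE 'c = %', c;
    -- c := 501;  -- would be rejected, because c is declared CONSTANT
END
$$;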

I forgot it. Thank you

Pavel



>
> Regards
>
> --
> Gilles Darold
> Consultant PostgreSQL
> http://dalibo.com - http://dalibo.org
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


[HACKERS] bgwriter_lru_maxpages range in postgresql.conf

2017-10-27 Thread Jeff Janes
With v10, the upper limit on bgwriter_lru_maxpages was changed from 1000 to
INT_MAX / 2, but the postgresql.conf.sample was not updated.

#bgwriter_lru_maxpages = 100            # 0-1000 max buffers written/round

I don't see any precedent for including INT_MAX-type limits in the sample
config file, so maybe something like this:

#bgwriter_lru_maxpages = 100            # max buffers written/round, 0 to turn off

Cheers,

Jeff


Re: [HACKERS] proposal: schema variables

2017-10-27 Thread Gilles Darold
On 26/10/2017 at 09:21, Pavel Stehule wrote:
> Hi,
>
> I propose a new database object - a variable. A variable is a
> persistent object that holds an unshared, session-based,
> non-transactional, in-memory value of any type, like variables in
> other languages. The persistence is required to make static checks
> possible, but it can be limited to the session - the variables can
> be temporary.
>
> My proposal is related to session variables in Sybase, MSSQL or
> MySQL (based on the @ or @@ prefix), package variables in Oracle
> (where access is controlled by scope), and schema variables in DB2.
> Each design comes from different sources and traditions and has its
> own advantages and disadvantages. The core of my proposal is the use
> of schema variables as session variables for stored procedures. It
> should help people who try to port complex projects to PostgreSQL
> from other databases.
>
> The Sybase (T-SQL) design is good for interactive work, but it is
> weak for use in stored procedures - static checks are not possible,
> and it is not possible to set access rights on variables.
>
> The ADA design (used by Oracle), based on scope, is great, but our
> environment is not nested, and we should support PLs other than
> PL/pgSQL more strongly.
>
> There are not many other possibilities - a variable that should be
> accessible from different PLs and different procedures (over time)
> has to live somewhere above the PL level, and that is the schema.
>
> The variable can be created by a CREATE statement:
>
> CREATE VARIABLE public.myvar AS integer;
> CREATE VARIABLE myschema.myvar AS mytype;
>
> CREATE [TEMP] VARIABLE [IF NOT EXISTS] name AS type
>   [ DEFAULT expression ] [[NOT] NULL]
>   [ ON TRANSACTION END { RESET | DROP } ]
>   [ { VOLATILE | STABLE } ];
>
> It is dropped by the command DROP VARIABLE [ IF EXISTS ] varname.
>
> Access rights are controlled by the usual mechanism - the
> GRANT/REVOKE commands. The possible rights are: READ, WRITE
>
> The variables can be modified by the SQL command SET (this is taken
> from the standard, and it is natural):
>
> SET varname = expression;
>
> Unfortunately we use the SET command for a different purpose. But I
> think we can solve that with a few tricks. The first is moving our
> GUCs to the pg_catalog schema. We can control the strictness of the
> SET command: in one variant we can detect a custom GUC and allow it,
> in another we can disallow custom GUCs and allow only schema
> variables. A new command LET could be an alternative.
>
> The variables should be usable in queries implicitly (without JOIN):
>
> SELECT varname;
>
> The SEARCH_PATH is used when varname is looked up. The variables can
> be used everywhere that query parameters are allowed.
>
> I hope this proposal is good enough and simple.
>
> Comments, notes?
>
> regards
>
> Pavel
>
>

Great feature that will help for migration. How will you handle CONSTANT
declaration? With Oracle it is possible to declare a constant as follows:


  varname     CONSTANT INTEGER    := 500;


for a variable that can't be changed. Do you plan to add a CONSTANT or
READONLY keyword or do you want to use GRANT on the object to deal with
this case?


Regards

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Linking libpq statically to libssl

2017-10-27 Thread Tom Lane
Daniele Varrazzo  writes:
> I have a problem building binary packages for psycopg2. Binary
> packages ship with their own copies of libpq and libssl; however if
> another python package links to libssl the library will be imported
> twice with conflicting symbols, likely resulting in a segfault (see
> https://github.com/psycopg/psycopg2/issues/543). This happens e.g. if
> a python script both connects to postgres and opens an https resource.

Basically, you're doing it wrong.  Shipping your own copy of libssl,
rather than depending on whatever packaging the platform provides,
is just asking for pain --- and not only of this sort.  You're also
now on the hook to update your package whenever libssl fixes a bug
or a security vulnerability, which happens depressingly often.

The same applies to libpq, really.  You don't want to be in the
business of shipping bits that you are not the originator of.

When I worked at Red Hat, there was an ironclad policy against
building packages that incorporated other packages statically.
I would imagine that other distros have similar policies for
similar reasons.  Just because you *can* ignore those policies
doesn't mean you *should*.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Postgres 10 manual breaks links with anchors

2017-10-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 10/26/17 16:10, Tom Lane wrote:
>> In view of commit 1ff01b390, aren't we more or less locked into
>> lower-case anchors going forward?

> The details are more complicated. ...

Ah.  I'd imagined that we were using the original case for the anchors,
rather than smashing them to upper (or lower) case.

> So the options are simply
> 1) Use the patch and keep indefinitely, keeping anchors compatible back
> to forever and forward indefinitely.
> 2) Don't use the patch, breaking anchors from <=9.6, but keeping them
> compatible going forward.
>
> Considering how small the patch is compared to some other customizations
> we carry, #1 seems reasonable to me.

+1

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Robert Haas
On Fri, Oct 27, 2017 at 10:55 AM, Simon Riggs  wrote:
> Questions?

I think one of the reasons why Peter Geoghegan decided to pursue
INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL
syntax, he felt free to mandate a non-standard SQL requirement, namely
the presence of a unique index on the arbiter columns.  If MERGE's
join clause happens to involve equality conditions on precisely the
same set of columns as some unique index on the target table, then I
think you can reuse the INSERT .. ON CONFLICT UPDATE infrastructure
and I suspect there won't be too many problems.  However, if it
doesn't, then what?  You could decree that such cases will fail, but
that might not meet your use case for developing the feature.  Or you
could try to soldier on without the INSERT .. ON CONFLICT UPDATE
machinery, but that means, I think, that sometimes you will get
serialization anomalies - at least, I think, you will sometimes obtain
results that couldn't have been obtained under any serial order of
execution, and maybe it would end up being possible to fail with
serialization errors or unique index violations.

In the past, there have been objections to implementations of MERGE
which would give rise to such serialization anomalies, but I'm not
sure we should feel bound by those discussions.  One thing that's
different is that the common and actually-useful case can now be made
to work in a fairly satisfying way using INSERT .. ON CONFLICT UPDATE;
if less useful cases are vulnerable to some weirdness, maybe it's OK
to just document the problems.
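
To make that concrete, here is a sketch of the two forms (table and
column names invented; the MERGE spelling follows the SQL-standard
syntax under discussion and is not something PostgreSQL accepts today):

-- works today, given a unique index on target(id)
INSERT INTO target (id, val)
VALUES (1, 'x')
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;

-- the proposed SQL-standard MERGE equivalent
MERGE INTO target t
USING (VALUES (1, 'x')) AS s(id, val)
ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET val = s.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (s.id, s.val);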

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Amit Kapila
On Fri, Oct 27, 2017 at 5:46 PM, Thomas Kellerer  wrote:
> akapila wrote:
>
>> You might want to give a try with the hash index if you are planning
>> to use PG10 and your queries involve equality operations.
>
> But you can't replace the PK index with a hash index, because hash indexes
> don't support uniqueness.
>

That's true, but it hasn't been mentioned in the mail that the hash
index would be used for the primary key.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Amit Kapila
On Fri, Oct 27, 2017 at 5:36 PM, Alvaro Herrera
 wrote:
> Amit Kapila wrote:
>
>> You might want to give a try with the hash index if you are planning
>> to use PG10 and your queries involve equality operations.
>
> So, btree indexes on monotonically increasing sequences don't write tons
> of full page writes because typically the same page is touched many
> times by insertions on each checkpoint cycle; so only one or very few
> full page writes are generated for a limited number of index pages.
>
> With UUID you lose locality of access: each insert goes to a different
> btree page, so you generate tons of full page writes because the number
> of modified index pages is very large.
>
> With hash on monotonically increasing keys, my guess is that you get
> behavior similar to btrees on UUID: the inserts are all over the place
> in the index, so tons of full page writes.  Am I wrong?
>
> With hash on UUID, the same thing should happen.  Am I wrong?
>

If the bucket pages were decided merely based on the hash key, then what
you are saying should be right.  However, we mask the hash key with the
high/low mask, due to which it falls into one of the existing pages of
the hash index.  Also, my suggestion is based on some of the tests we
have done on a UUID column, where most of the time the hash index size
was smaller than the btree size.  See pages 15-17 of the hash index
presentation at the last PGCon [1].

[1] - 
https://www.pgcon.org/2017/schedule/attachments/458_durable-hash-indexes-postgresql.pdf

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Robert Haas
On Fri, Oct 27, 2017 at 2:55 PM, Alvaro Herrera  wrote:
> I was rather thinking that if we can make this very robust against the
> index growing out of proportion, we should consider ditching the
> original minmax and replace it with multirange minmax, which seems like
> it'd have much better behavior.

If the multirange stuff can be done in such a way that it's just an
updated version of the same opclass, and backward-compatible on disk,
then I think this would be OK.  But otherwise I don't think we ditch
what already exists.  That would break upgrades via both pg_upgrade
and pg_dump, which seems like too high a price to pay to get rid of
some arguably-worse code.  It's actually WORSE to drop an opclass
(which will make dumps not restore) than to do something like bump
HASH_VERSION (which doesn't affect pg_dump at all and for pg_upgrade
only requires post-upgrade steps rather than failing outright).

> I don't see any reason to put any of this in contrib.

Well, for one thing, it makes it easier to drop stuff later if we
decide we don't really want it.  I think that whichever BRIN opclasses
are thought to be high quality and of general utility can go into
core, just as we've done with other index AMs.  However, if we think
that something is useful-ish but maybe not something to which we want
to make a permanent commitment, putting it into contrib is good for
that.

Upgrades are easier for things in contrib, too, because there's a
built-in mechanism for people to try updating the SQL extensions
(ALTER EXTENSION .. UPDATE) and if it fails they can adjust things and
try again.  When you just make a hard change to SQL definitions in a
new release, any failures that result from those changes just turn
into upgrade failures, which is IMHO a lot more painful than a failure
to update an extension version while the database is still up and
usable the whole time.
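
In SQL terms the sequence a user goes through is roughly this (extension
name and version are placeholders):

-- after the binary upgrade, at the user's leisure:
ALTER EXTENSION some_extension UPDATE;
-- or pin a specific target version:
ALTER EXTENSION some_extension UPDATE TO '1.5';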

For instance, if pg_stat_activity were bundled in an extension and we
made the C code backward-compatible with old extension versions,
then some of the upgrade pain users have had with that over the years
could have been avoided.  People could update to the new version
without failures and then at their leisure try to update.  If the
update failed due to dependencies, then they would have time to figure
out what to do about it and try again later; in the meantime, they'd
be on the new version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: protocol version negotiation (Re: Libpq PGRES_COPY_BOTH - version compatibility)

2017-10-27 Thread Robert Haas
On Thu, Oct 19, 2017 at 1:35 AM, Badrul Chowdhury  wrote:
> The new functionality is for sending 64bit ints. I think 32bits is sufficient 
> for the information we want to pass around in the protocol negotiation phase, 
> so I left this part unchanged.

No, it isn't.  That commit didn't add any new functionality, but it
changed the preferred interfaces for assembling protocol messages.
Your patch was still using the old ones.

Attached is an updated patch.  I've made a few modifications:

- I wrote documentation.  For future reference, that's really the job
of the patch submitter.

- I changed the message type byte for the new message from 'Y' to 'v'.
'Y' is apparently used by logical replication as a "type message", but
'v' is not currently used for anything.  It's also somewhat mnemonic.

- I deleted the minimum protocol version from the new message.  I know
there were a few votes for including it, but I think it's probably
useless.  The client should always request the newest version it can
support; if that's not new enough for the server, then we're dead
anyway and we might as well just handle that via ERROR. Moreover, it
seems questionable whether we'd ever deprecate 3.0 support in the
server anyway, or if we do, it'll probably be because 4.0 has been
stable for a decade or so.  Desupporting 3.0 while continuing to
support 3.x,x>0 seems like a remote outcome (and, like I say, if it
does happen, ERROR is a good-enough response).  If there's some use
case for having a client request an older protocol version than the
newest one it can support, then this change could be revisited, or we
can just handle that by retrying the whole connection attempt.

- I changed the test for whether to send NegotiateProtocolVersion to
send it only when the client requests a version too new for the
server.  I think that if the client requests a version older than what
the server could support, the server should just silently use the
older version.  That's arguable.  You could argue that the message
should be sent anyway (at least to post-3.0 clients) as a way of
reporting what happened with _pq_. parameters, but I have
two counter-arguments.  Number one, maybe clients just shouldn't send
_pq_. parameters that the protocol version they're using
doesn't support, or if they do, be prepared for them to have no
effect.  Number two, this is exactly the sort of thing that we can
change in future minor protocol versions if we wish.  For example, we
could define protocol version 3.1 or 3.43 or whatever as always
sending a NegotiateProtocolVersion message.  There's no need for the
code to make 3.0 compatible with future versions to decide what
choices those future versions might make.

- Made the prefix matching check for "_pq_." rather than "_pq_".  I
think we're imagining extensions like _pq_.magic_fairy_dust, not
_pq_magic_fairy_dust.

- I got rid of the optional_parameters thing in Port and just passed a
list of unrecognized parameters around directly.  Once some parameters
are recognized, e.g. _pq_.magic_fairy_dust = 1, we'll probably have
dedicated fields in the Port for them (i.e. int magic_fairy_dust)
rather than digging them out of some list.  Moreover, with your
design, we'd end up having to make NegotiateServerProtocol exclude
things that are actually recognized, which would be annoying.

- I got rid of the pq_flush().  There's no need for this because we're
always going to send another protocol message (ErrorResponse or
AuthenticationSomething) afterwards.

- I renamed NegotiateServerProtocol to SendNegotiateProtocolVersion
and moved it to what I think is a more sensible place in the file.

- I removed the error check for 2.x, x != 0.  I may have advocated for
this before, but on reflection, I don't see much advantage in
rejecting things that work today.

- I fixed the above-mentioned failure to use the new interface for
assembling the NegotiateProtocolVersion message.

- I did some work on the comments.

Barring objections, I plan to commit this and back-patch it all the
way.  Of course, this is not a bug fix, but Tom previously proposed
back-patching it and I think that's a good idea, because if we don't,
it will be a very long time before servers with this code become
commonplace in the wild.  Back-patching doesn't completely fix that
problem because not everybody applies upgrades and some people may be
running EOL versions, but it will still help.

Also attached is a patch I used for testing purposes, some version of
which we might eventually use when we actually introduce version 3.1
of the protocol.  It bumps the protocol version that libpq uses from
3.0 to 3.1 without changing what the server thinks the latest protocol
version is, so the server always replies with a
NegotiateProtocolVersion message.  It also teaches libpq to ignore the
NegotiateProtocolVersion message.  With that patch applied, make
check-world passes, which seems to show that the server-side changes
are not totally broken.  

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Alvaro Herrera
Tomas Vondra wrote:

> Not sure "a number of in-core opclasses" is a good reason to (not) add
> new ones. Also, we already have two built-in BRIN opclasses (minmax and
> inclusion).
>
> In general, "BRIN bloom" can be packed as a contrib module (at least I
> believe so). That's not the case for the "BRIN multi-range" which also
> requires some changes to some code in brin.c (but the rest can be moved
> to contrib module, of course).

I was rather thinking that if we can make this very robust against the
index growing out of proportion, we should consider ditching the
original minmax and replace it with multirange minmax, which seems like
it'd have much better behavior.

I don't see any reason to put any of this in contrib.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-10-27 Thread Amit Kapila
On Tue, Sep 19, 2017 at 8:47 AM, Amit Kapila  wrote:
> On Mon, Sep 18, 2017 at 10:00 PM, Tom Lane  wrote:
>> Amit Kapila  writes:
>>> Attached patch fixes these problems.
>>
>> Hmm, this patch adds a kill(notify_pid) after one call to
>> ForgetBackgroundWorker, but the postmaster has several more such calls.
>> Shouldn't they all notify the notify_pid?  Should we move that
>> functionality into ForgetBackgroundWorker itself, so we can't forget
>> it again?
>>
>
> Among other places, we already notify during
> ReportBackgroundWorkerExit().  However, it seems to me that all other
> places except where this patch has added a call to notify don't need
> such a call.  The other places like in DetermineSleepTime and
> ResetBackgroundWorkerCrashTimes is called for a crashed worker which I
> don't think requires notification to the backend as that backend
> itself would have restarted.  The other place where we call
> ForgetBackgroundWorker is in maybe_start_bgworkers when rw_terminate
> is set to true which again seems to be either the case of worker crash
> or when someone has explicitly asked to terminate the worker in which
> case we already send a notification.
>
> I think we need to notify the backend on start, stop and failure to
> start a worker.  OTOH, if it is harmless to send a notification even
> after the worker is crashed, then we can just move that functionality
> into ForgetBackgroundWorker itself as that will simplify the code and
> infact that is the first thing that occurred to me as well, but I
> haven't done that way as I was not sure if we want to send
> notification in all kind of cases.
>

The patch still applies (with some hunks).  I have added it in CF [1]
to avoid losing track.

[1] - https://commitfest.postgresql.org/15/1341/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Alvaro Herrera
Simon Riggs wrote:

> Earlier thoughts on how this could/could not be done were sometimes
> imprecise or inaccurate, so I have gone through the command per
> SQL:2011 spec and produced a definitive spec in the form of an SGML
> ref page. This is what I intend to deliver for PG11.

Nice work.  I didn't verify the SQL spec, just read your HTML page;
some very minor comments based on that:

* use "and" not "where" as initial words in "when_clause" and
  "merge_update" clause definitions

* missing word here: "the DELETE privilege on the if you specify"

* I think the word "match." is leftover from some editing in the phrase
  " that specifies which rows in the data_source match rows in the
  target_table_name. match."  In the same paragraph, it is not clear
  whether all columns must be matched or it can be a partial match.

* In the when_clause note, it is not clear whether you can have multiple
  WHEN MATCHED and WHEN NOT MATCHED clauses.  Obviously you can have one
  of each, but I think your doc says it is possible to have more than one of
  each, with different conditions (WHEN MATCHED AND foo THEN bar WHEN
  MATCHED AND baz THEN qux).  No example shows more than one.

  On the same point: Is there short-circuiting of such conditions, i.e.
  will the execution stop looking for further WHEN matches if some
  rule matches, or will it rather check all rules and raise an error if
  more than one WHEN rules match each given row?

* Your last example uses ELSE but that appears nowhere in the synopsis.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Linking libpq statically to libssl

2017-10-27 Thread Daniele Varrazzo
Hello,

I have a problem building binary packages for psycopg2. Binary
packages ship with their own copies of libpq and libssl; however if
another python package links to libssl the library will be imported
twice with conflicting symbols, likely resulting in a segfault (see
https://github.com/psycopg/psycopg2/issues/543). This happens e.g. if
a python script both connects to postgres and opens an https resource.

A solution to work around the problem could be to change libssl
symbols names, in the binaries or in the source code
(https://github.com/psycopg/psycopg2-wheels/issues/7). But maybe a
simpler workaround could be just to build libpq linking to libssl
statically. Is this possible?

...and other libs too I guess, because the problem was found on libssl
but conflicts with other libs are possible too; ldap comes to mind...

Any help to solve the problem would be appreciated. Thank you very much.

-- Daniele


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Thomas Kellerer
akapila wrote:

> You might want to give a try with the hash index if you are planning
> to use PG10 and your queries involve equality operations.

But you can't replace the PK index with a hash index, because hash indexes
don't support uniqueness.






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Alvaro Herrera
Amit Kapila wrote:

> You might want to give a try with the hash index if you are planning
> to use PG10 and your queries involve equality operations.

So, btree indexes on monotonically increasing sequences don't write tons
of full page writes because typically the same page is touched many
times by insertions on each checkpoint cycle; so only one or very few
full page writes are generated for a limited number of index pages.

With UUID you lose locality of access: each insert goes to a different
btree page, so you generate tons of full page writes because the number
of modified index pages is very large.

With hash on monotonically increasing keys, my guess is that you get
behavior similar to btrees on UUID: the inserts are all over the place
in the index, so tons of full page writes.  Am I wrong?

With hash on UUID, the same thing should happen.  Am I wrong?
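
One rough way to check this empirically (a sketch using psql's \gset;
table and index names are made up, and gen_random_uuid() assumes the
pgcrypto extension on PG 10):

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE t (id uuid);
CREATE INDEX t_btree ON t USING btree (id);

CHECKPOINT;  -- so the first touch of each page afterwards emits a full page image
SELECT pg_current_wal_lsn() AS start_lsn \gset
INSERT INTO t SELECT gen_random_uuid() FROM generate_series(1, 1000000);
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn') AS wal_bytes;
-- repeat with USING hash (id), and with a monotonically increasing key,
-- to compare the WAL volume each combination generates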

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Amit Kapila
On Fri, Oct 27, 2017 at 11:26 AM, sanyam jain  wrote:
> Hi,
>
> I was reading the blog
> https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes .
>
> My queries:
>
> How randomness of UUID will likely to create new leaf page in btree index?
> In my understanding as the size of UUID is 128 bits i.e. twice of BIGSERIAL
> , more number of pages will be required to store the same number of rows and
> hence there can be increase in WAL size due to FPW .
> When compared the index size in local setup UUID index is ~2x greater in
> size.
>

You might want to give a try with the hash index if you are planning
to use PG10 and your queries involve equality operations.
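
For example, something along these lines (a sketch with made-up names;
note that a hash index cannot back a PRIMARY KEY or UNIQUE constraint,
so it only helps for plain equality lookups):

CREATE TABLE events (id uuid NOT NULL, payload text);
CREATE INDEX events_id_hash ON events USING hash (id);

-- equality predicates can use the hash index:
SELECT * FROM events
 WHERE id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';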

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Postgres 10 manual breaks links with anchors

2017-10-27 Thread Peter Eisentraut
On 10/26/17 16:10, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 10/16/17 03:19, Thomas Kellerer wrote:
>>> I don't know if this is intentional, but the Postgres 10 manual started to 
>>> use lowercase IDs as anchors in the manual.
> 
>> Here is a patch that can be applied to PG 10 to put the upper case
>> anchors back.
>> The question perhaps is whether we want to maintain this patch
>> indefinitely, or whether a clean break is better.
> 
> In view of commit 1ff01b390, aren't we more or less locked into
> lower-case anchors going forward?  I'm not sure I see the point
> of changing v10 back to the old way if v11 will be incompatible
> anyhow.

The details are more complicated.

The IDs in DocBook documents have two purposes.

One is to ensure non-broken links between things like <xref> and the
element it points to.  This is set up in the DTD and checked during
parsing (validation, more precisely).  In DocBook SGML, many things
including tag names, attribute names, and IDs are case insensitive.  But
in DocBook XML, everything is case sensitive.  So in order to make
things compatible for a conversion, we had to consolidate some variant
spellings that have accumulated in our sources.  For simplicity, I have
converted everything to lower case.

The other purpose is that the DocBook XSL and DSSSL stylesheets use the
IDs for creating anchors in HTML documents (and also for the HTML file
names themselves).  This is merely a useful choice of those stylesheets.

In PG 9.6 and earlier, we used a straight SGML toolchain, using Jade and
DSSSL.  The internal representation of a DocBook SGML document after
parsing converts all the case insensitive bits to upper case.  (This
might be configured somewhere; I'm not sure.)  So the stylesheets see
all the IDs as upper case to begin with, and that's why all the anchors
come out in upper case in the HTML output.

In PG 10, the build first converts the SGML sources to XML, redeclares
them as DocBook XML, then builds using XSLT.  Because DocBook XML
requires lower-case tags and attribute names, we have to use the osx -x
lower option to convert all the case-insensitive bits to lower case
instead of the default upper case.  That's why the XSLT stylesheets see
the IDs as lower case and that's why they are like that in the output.
(If there were options more detailed than -x lower, that could have been
useful.)

The proposed patch works much later in the build process and converts
IDs to upper case only when they are being considered for making an HTML
anchor.  The structure of the document as far as the XML parser is
concerned stays the same.

For PG 11, the idea is to convert the sources to a pure XML document.
XML is case sensitive, so the XML parser would see the IDs as what
they are.  Without the mentioned patch to convert all IDs to lower case
in the source, the XSL processor would see the IDs in whatever case they
are, and anchors would end up in the HTML output using whatever case
they are.  So the conversion to lower case in the source also ensured
anchor compatibility to PG 10.  Otherwise, someone might well have
complained in a similar manner a year from now.

Applying the proposed patch to master/PG 11 would have the same effect
as in PG 10.  It would convert anchors to upper case in the HTML output
but leave the logical and physical structure of the XML document alone.

So the options are simply

1) Use the patch and keep indefinitely, keeping anchors compatible back
to forever and forward indefinitely.

2) Don't use the patch, breaking anchors from <=9.6, but keeping them
compatible going forward.

Considering how small the patch is compared to some other customizations
we carry, #1 seems reasonable to me.  I just didn't know to what extent
people had actually bookmarked fragment links.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inconsistency in process names - bgworker: logical replication launcher

2017-10-27 Thread Pavel Stehule
2017-10-27 13:03 GMT+02:00 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com>:

> On 10/27/17 04:06, Pavel Stehule wrote:
> > Why does a built-in process have the bgworker prefix?
>
> Implementation detail.  This has been changed in master already.
>

ok

Thank you

Pavel

>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] Add support for tuple routing to foreign partitions

2017-10-27 Thread Etsuro Fujita

On 2017/10/26 16:40, Etsuro Fujita wrote:
Other changes I made 
to the executor are: (1) currently, we set the RT index for the root 
partitioned table to ri_RangeTableIndex of partitions' ResultRelInfos, 
but the proposed EXPLAIN requires that the partition's 
ri_RangeTableIndex is set to the RT index for that partition's RTE, to 
show that partition info in the output.  So, I made that change.


One thing I forgot to mention: that would also be required in order to call
BeginForeignModify, ExecForeignInsert, and EndForeignModify with the 
partition's ResultRelInfo.


I updated docs in doc/src/sgml/ddl.sgml the same way as [1].  (I used 
only the ddl.sgml change proposed by [1], not all the changes.)  I did 
some cleanup as well.  Please find attached an updated version of the patch.


Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/b19a8e2b-e000-f592-3e0b-3e90ba0fa816%40lab.ntt.co.jp
*** a/contrib/file_fdw/input/file_fdw.source
--- b/contrib/file_fdw/input/file_fdw.source
***
*** 176,182  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', 
delimiter ',');
--- 176,188 
  SELECT tableoid::regclass, * FROM pt;
  SELECT tableoid::regclass, * FROM p1;
  SELECT tableoid::regclass, * FROM p2;
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy');
+ \t off
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy');
+ \t off
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
  SELECT tableoid::regclass, * FROM p1;
*** a/contrib/file_fdw/output/file_fdw.source
--- b/contrib/file_fdw/output/file_fdw.source
***
*** 315,321  SELECT tableoid::regclass, * FROM p2;
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter 
','); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
--- 315,321 
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter 
','); -- ERROR
! ERROR:  cannot route copied tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
***
*** 341,348  SELECT tableoid::regclass, * FROM p2;
   p2   | 2 | qux
  (2 rows)
  
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
--- 341,366 
   p2   | 2 | qux
  (2 rows)
  
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy');
+  Insert on public.pt
+Foreign Insert on public.p1
+Insert on public.p2
+->  Result
+  Output: 1, 'xyzzy'::text
+ 
+ \t off
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to foreign table "p1"
! \t on
! EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy');
!  Insert on public.pt
!Foreign Insert on public.p1
!Insert on public.p2
!->  Result
!  Output: 2, 'xyzzy'::text
! 
! \t off
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 7029,7034  NOTICE:  drop cascades to foreign table bar2
--- 7029,7236 
  drop table loct1;
  drop table loct2;
  -- ===
+ -- test tuple routing for foreign-table partitions
+ -- ===
+ create table pt (a int, b int) partition by list (a);
+ create table locp partition of pt for values in (1);
+ create table locfoo (a int check (a in (2)), b int);
+ create foreign table remp partition of pt for values in (2) server loopback 
options (table_name 'locfoo');
+ explain (verbose, costs off)
+ insert into pt values (1, 1), (2, 1);
+QUERY PLAN
+ -
+  Insert on public.pt
+Insert on public.locp
+Foreign Insert on public.remp
+  Remote SQL: INSERT INTO public.locfoo(a, b) VALUES ($1, $2)
+->  Values Scan on "*VALUES*"
+  Output: "*VALUES*".column1, "*VALUES*".column2
+ (6 rows)
+ 
+ insert into pt values (1, 1), (2, 1);
+ select tableoid::regclass, * FROM pt;
+  tableoid | a | b 
+ --+---+---
+  locp | 1 | 1
+  remp | 2 | 1
+ (2 rows)
+ 
+ select tableoid::regclass, * FROM locp;
+  tableoid | a | b 
+ --+---+---
+  locp | 1 | 1
+ (1 row)
+ 
+ select tableoid::regclass, * FROM remp;
+  tableoid | a | b 
+ 

Re: [HACKERS] inconsistency in process names - bgworker: logical replication launcher

2017-10-27 Thread Peter Eisentraut
On 10/27/17 04:06, Pavel Stehule wrote:
> Why does a built-in process have the bgworker prefix?

Implementation detail.  This has been changed in master already.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-27 Thread Jeevan Chalke
Hi,

Attached is a new patch-set. Changes include:

1. Added a separate patch for costing the Append node, as discussed; it is
placed up-front in the patch-set.
2. Since we now cost the Append node, we no longer need the
partition_wise_agg_cost_factor GUC, so it has been removed. The remaining
patch has therefore been merged into the main implementation patch (see the
sketch below for how the feature is now enabled).
3. Updated the row counts in the test cases so that we get partition-wise plans.

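For reviewers who want to try it, exercising the feature after applying the
patch-set is now just a matter of the one GUC. This is only a sketch: the
table and query are invented, and only enable_partition_wise_agg comes from
the patches.

-- Sketch only: "measurements" is a hypothetical table partitioned on column a.
SET enable_partition_wise_agg = on;   -- the cost-factor GUC no longer exists
EXPLAIN (COSTS OFF)
SELECT a, count(*) FROM measurements GROUP BY a;
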
Thanks


On Wed, Oct 18, 2017 at 9:53 AM, Dilip Kumar  wrote:

> On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke
>  wrote:
> >
>
> > I didn't get what you mean by regression here. Can you please explain?
> >
> > I see that PWA plan is selected over regular plan when enabled on the
> basis
> > of costing.
> > Regular planning need a Result node due to which costing increases where
> as
> > PWA don't need that and thus wins.
>
> Sorry for not clearly explaining,  I meant that with normal plan
> execution time is 263.678 ms whereas with PWA its 339.929 ms.
>
> I only set enable_partition_wise_agg=on and it switched to PWA and
> execution time increased by 30%.
> I understand that the this is the worst case for PWA where
> FinalizeAggregate is getting all the tuple.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v6.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel safety for extern params

2017-10-27 Thread Amit Kapila
On Mon, Oct 16, 2017 at 4:29 PM, Amit Kapila  wrote:
> On Sat, Oct 14, 2017 at 1:51 AM, Robert Haas  wrote:
>
>> I think the bug is in ExecGather(Merge): it assumes that if we're in
>> parallel mode, it's OK to start workers.  But actually, it shouldn't
>> do this unless ExecutePlan ended up with use_parallel_mode == true,
>> which isn't quite the same thing.  I think we might need ExecutePlan
>> to set a flag in the estate that ExecGather(Merge) can check.
>>
>
> Attached patch fixes the problem in a suggested way.
>
>

All the patches still apply and pass the regression tests.  I have
added this to the CF to avoid losing track of this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] path toward faster partition pruning

2017-10-27 Thread Rajkumar Raghuwanshi
On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote  wrote:

> 0001: added some new tests
> 0002: no change
> 0003: fixed issue that Rajkumar reported (cope with Params properly)
> 0004: no change
> 0005: fix the case to prune the default partition when warranted (the
>   issue reported by Beena)
>

Thanks for the updated patch. I am getting a server crash with the query below.

CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY
RANGE(c2);
CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY
RANGE(c2);
CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);

INSERT INTO mp VALUES(10, 100, 10);
INSERT INTO mp VALUES(20, 200, 20);
INSERT INTO mp VALUES(21, 150, 30);
INSERT INTO mp VALUES(30, 200, 40);
INSERT INTO mp VALUES(31, 300, 30);
INSERT INTO mp VALUES(40, 400, 40);

EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND
c2 < 300;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Szymon Lipiński
Hey,
It looks quite nice. Personally, I'd also like to have the RETURNING clause
supported, and to have the numbers of deleted and inserted rows reported as
separate counts in the output message.

regards
Szymon Lipiński

On Fri, 27.10.2017, at 10:56, Simon Riggs 
wrote:

> I'm working on re-submitting MERGE for PG11
>
> Earlier thoughts on how this could/could not be done were sometimes
> imprecise or inaccurate, so I have gone through the command per
> SQL:2011 spec and produced a definitive spec in the form of an SGML
> ref page. This is what I intend to deliver for PG11.
>
> MERGE will use the same mechanisms as INSERT ON CONFLICT, so
> concurrent behavior does not require further infrastructure changes,
> just detailed work on the statement itself.
>
> I'm building up the code from scratch based upon the spec, rather than
> trying to thwack earlier attempts into shape. This looks more likely
> to yield a committable patch.
>
> Major spanners or objections, please throw them in now cos I don't see any.
>
> Questions?
>
> --
> Simon Riggs    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
hi,

On 10/27/2017 09:34 AM, Simon Riggs wrote:
> On 27 October 2017 at 07:20, Robert Haas  wrote:
>> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
>>  wrote:
>>> Let's see a query like this:
>>>
>>> select * from bloom_test
>>>  where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>>
>>> The minmax index produces this plan
>>>
>>>Heap Blocks: lossy=2061856
>>>  Execution time: 22707.891 ms
>>>
>>> Now, the bloom index:
>>>
>>>Heap Blocks: lossy=25984
>>>  Execution time: 338.946 ms
>>
>> It's neat to see BRIN being extended like this.  Possibly we could
>> consider making it a contrib module rather than including it in core,
>> although I don't have strong feelings about it.
> 
> I see that SP-GIST includes two operator classes in core, one default.
> 
> Makes sense to do the same thing with BRIN and add this new op class
> as a non-default option in core.
> 

I'm not sure that "a number of in-core opclasses" is a good reason to add
(or not add) new ones. Also, we already have two built-in BRIN opclasses
(minmax and inclusion).

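For context, selecting a non-default BRIN opclass per column uses the
standard CREATE INDEX syntax; with the WIP patch applied it would presumably
look something like the sketch below. The opclass name is illustrative only
(not a committed name); bloom_test and its id column come from the example
quoted above.

-- Hypothetical sketch: choosing a non-default BRIN opclass for one column.
CREATE INDEX ON bloom_test USING brin (id uuid_bloom_ops)
  WITH (pages_per_range = 128);
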
In general, "BRIN bloom" could be packaged as a contrib module (at least I
believe so). That's not the case for "BRIN multi-range", which also requires
some changes to the code in brin.c (though the rest could be moved to a
contrib module, of course).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Tomas Vondra


On 10/27/2017 07:56 AM, sanyam jain wrote:
> Hi,
> 
> I was reading the
> blog https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes .
> 

For the record, I assume you're referring to this part:

With BIGSERIAL new values are sequential, and so get inserted to the
same leaf pages in the btree index. As only the first modification
to a page triggers the full-page write, only a tiny fraction of the
    WAL records are FPIs. With UUID it’s a completely different case, of
    course – the values are not sequential at all; in fact each insert is
    likely to touch a completely new index leaf page (assuming the
index is large enough).

> My queries:
> 
> How does the randomness of UUIDs make it likely that new leaf pages are
> created in the btree index?
> In my understanding, as the size of a UUID is 128 bits, i.e. twice that of
> BIGSERIAL, more pages will be required to store the same number of rows,
> and hence there can be an increase in WAL size due to FPW.
> When I compared the index sizes in a local setup, the UUID index was ~2x
> greater in size.
> 

Perhaps this is just a poor choice of words on my side; I wasn't suggesting
that new leaf pages will be created, merely that each insert will touch a
different (possibly already existing) leaf page. That's a direct consequence
of the inherent UUID randomness.

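A rough way to observe this is to compare the WAL produced by a small batch
of inserts right after a checkpoint, when every newly touched page must be
logged as a full-page image. This is only a sketch: the table names are
invented, gen_random_uuid() assumes the pgcrypto extension, and \gset is a
psql metacommand used to remember the starting LSN.

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE t_bigserial (id bigserial PRIMARY KEY, payload text);
CREATE TABLE t_uuid      (id uuid PRIMARY KEY DEFAULT gen_random_uuid(), payload text);
INSERT INTO t_bigserial (payload) SELECT 'x' FROM generate_series(1, 1000000);
INSERT INTO t_uuid      (payload) SELECT 'x' FROM generate_series(1, 1000000);

-- Sequential keys: the batch touches only a few index leaf pages.
CHECKPOINT;
SELECT pg_current_wal_lsn() AS lsn0 \gset
INSERT INTO t_bigserial (payload) SELECT 'x' FROM generate_series(1, 10000);
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn0') AS wal_bytes_bigserial;

-- Random keys: the batch scatters over many leaf pages, so many more FPIs.
CHECKPOINT;
SELECT pg_current_wal_lsn() AS lsn0 \gset
INSERT INTO t_uuid (payload) SELECT 'x' FROM generate_series(1, 10000);
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn0') AS wal_bytes_uuid;
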
regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Simon Riggs
I'm working on re-submitting MERGE for PG11

Earlier thoughts on how this could/could not be done were sometimes
imprecise or inaccurate, so I have gone through the command per
SQL:2011 spec and produced a definitive spec in the form of an SGML
ref page. This is what I intend to deliver for PG11.

MERGE will use the same mechanisms as INSERT ON CONFLICT, so
concurrent behavior does not require further infrastructure changes,
just detailed work on the statement itself.

I'm building up the code from scratch based upon the spec, rather than
trying to thwack earlier attempts into shape. This looks more likely
to yield a committable patch.

Major spanners or objections, please throw them in now cos I don't see any.

Questions?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Title: MERGE

MERGE — insert, update, or delete rows of a table based upon source data

Synopsis

MERGE INTO target_table_name [ [ AS ] target_alias ]
USING data_source
ON join_condition
when_clause [...]

where data_source is

{ source_table_name |
  ( source_query )
}
[ [ AS ] source_alias ]

where when_clause is

{ WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete } |
  WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING }
}

where merge_update is

UPDATE SET { column_name = { expression | DEFAULT } |
             ( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] )
           } [, ...]

and merge_insert is

INSERT [( column_name [, ...] )]
[ OVERRIDING { SYSTEM | USER } VALUE ]
{ VALUES ( { expression | DEFAULT } [, ...] ) | DEFAULT VALUES }

and merge_delete is

DELETE

Description

MERGE performs actions that modify rows in the target_table_name, using the
data_source.  MERGE provides a single SQL statement that can conditionally
INSERT, UPDATE or DELETE rows, a task that would otherwise require multiple
procedural language statements.

First, the MERGE command performs a left outer join from data_source to
target_table_name, producing zero or more candidate change rows.  For each
candidate change row, WHEN clauses are evaluated in the order specified; if
one of them is activated, the specified action occurs.  No more than one
WHEN clause can be activated for any candidate change row.

MERGE actions have the same effect as regular UPDATE, INSERT, or DELETE
commands of the same names.  The syntax of those commands is different,
notably in that there is no WHERE clause and no table name is specified.
All actions refer to the target_table_name, though modifications to other
tables may be made using triggers.

The ON clause must join on all of the column(s) of the primary key, or, if
other columns are specified, on another unique index of the
target_table_name.  Each candidate change row is locked via speculative
insertion into the chosen unique index, ensuring that the status of MATCHED
or NOT MATCHED is decided just once for each candidate change row,
preventing interference from concurrent transactions.  As a result of these
requirements MERGE cannot yet work against partitioned tables.

There is no MERGE privilege.  You must have the UPDATE privilege on the
column(s) of the target_table_name referred to in the SET clause if you
specify an update action, the INSERT privilege on the target_table_name if
you specify an insert action, and/or the DELETE privilege on the
target_table_name if you specify a delete action.  Privileges are tested
once at statement start and are checked whether or not particular WHEN
clauses are activated during the subsequent execution.  You will require
the SELECT privilege on the data_source and on any column(s) of the
target_table_name referred to in a condition.

Parameters

target_table_name
    The name (optionally schema-qualified) of the target table to merge into.

target_alias
    A substitute name for the target table.  When an alias is provided, it
    completely hides the actual name of the table.  For example, given
    MERGE foo AS f, the remainder of the MERGE statement must refer to this
    table as f, not foo.

source_table_name
    The name (optionally schema-qualified) of the source table, view or
    transition table.

source_query
    A query (SELECT statement or VALUES statement) that supplies the rows to
    be merged into the target_table_name.  Refer to the SELECT statement or
    VALUES statement for a description of the syntax.

source_alias
    A substitute name for the data source.  When an alias is provided, it
    completely hides whether a table or a query was specified.

join_condition
    join_condition is an expression resulting in a value of type boolean
    (similar to a WHERE clause) that specifies which rows 

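To make the proposed synopsis above concrete, a statement following that
grammar might look like the sketch below. MERGE is not implemented yet; the
table, source, and column names are invented, and customer_id is assumed to
be the target's primary key, per the ON clause requirement described above.

MERGE INTO customer_account AS ca
USING recent_transactions AS t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
  UPDATE SET balance = balance + t.amount
WHEN NOT MATCHED THEN
  INSERT (customer_id, balance) VALUES (t.customer_id, t.amount);
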
[HACKERS] inconsistency in process names - bgworker: logical replication launcher

2017-10-27 Thread Pavel Stehule
Hi

Why does a built-in process have the bgworker prefix?

 1907 ?        Ss    13:00 postgres: ides ides_immaj_prac 192.168.1.50(3524) idle
 1941 ?        Ss     0:05 postgres: ides ides_immaj_prac 192.168.1.50(3527) idle
 3706 ?        Ss     0:00 postgres: ides ides_immaj_prac 192.168.1.50(4012) idle
11924 pts/0    S+     0:00 grep postgres
18710 ?        Ss     0:01 postgres: logger process
18712 ?        Ss     0:06 postgres: checkpointer process
18713 ?        Ss     0:02 postgres: writer process
18714 ?        Ss     0:05 postgres: wal writer process
18715 ?        Ss     0:04 postgres: autovacuum launcher process
18716 ?        Ss     0:39 postgres: stats collector process
18717 ?        Ss     0:00 postgres: bgworker: logical replication launcher
32202 ?        Ss     0:00 postgres: ides postgres 192.168.1.50(3109) idle
32226 ?        Ss     0:00 postgres: ides ides_immaj_prac 192.168.1.50(3114) idle
32274 ?        Ss     0:00 postgres: ides ides_jmmaj_akt 192.168.1.50(3124) idle

regards

Pavel


Re: [HACKERS] path toward faster partition pruning

2017-10-27 Thread Rajkumar Raghuwanshi
On Thu, Oct 26, 2017 at 4:47 PM, Amit Langote  wrote:

>
> Meanwhile, attached updated set of patches including fixes for the typos
> you reported in the other message.  Updated 0005 fixes the first bug (the
> Case 1 in your email), while other patches 0002-0004 are updated mostly to
> fix the reported typos.  A couple of tests are added in 0001 to test the
> default partition case a bit more.
>

Hi Amit,

While testing this feature further, I got a bug with partitions that are
foreign tables. A test case is given below; please take a look.

CREATE EXTENSION postgres_fdw;
CREATE SERVER fp_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname
'postgres', port '5432', use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER fp_server;

CREATE TABLE fplt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN ('',
'0001', '0002', '0003') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p1');
CREATE TABLE fplt1_p2 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN ('0004',
'0005', '0006', '0007') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p2');

INSERT INTO fplt1_p1 SELECT i, i, to_char(i/50, 'FM') FROM
generate_series(0, 198, 2) i;
INSERT INTO fplt1_p2 SELECT i, i, to_char(i/50, 'FM') FROM
generate_series(200, 398, 2) i;

--PG-HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
QUERY PLAN
--
 Nested Loop
   ->  Append
 ->  Foreign Scan on ftplt1_p1 t1
 ->  Foreign Scan on ftplt1_p2 t1_1
   ->  Unique
 ->  Append
   ->  Foreign Scan on ftplt1_p1 t2
   ->  Foreign Scan on ftplt1_p2 t2_1
(8 rows)

--PG-HEAD +v5 patches
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;

ERROR:  invalid expression for partition key

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


[HACKERS] Reading timeline from pg_control on replication slave

2017-10-27 Thread Andrey Borodin
Hi, hackers!

I'm working on taking backups from a replication slave in WAL-G [0].
Backups used to use the result of pg_walfile_name(pg_start_backup(...)). The call to
pg_start_backup() works fine, but "pg_walfile_name() cannot be executed during
recovery."
That function takes an LSN as its argument and reads the TimeLineID from global state.
So I made a function[1] that, when running on a replica, reads the timeline from the
pg_control file and formats the WAL file name as if it had been produced by
pg_walfile_name(lsn).

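In SQL terms the workaround boils down to something like the sketch below (an
illustration only, not the WAL-G code; it assumes PostgreSQL 10 function
names, the default 16MB WAL segment size, and uses pg_last_wal_replay_lsn()
as a stand-in for the LSN returned by pg_start_backup()):

-- pg_control_checkpoint() exposes the timeline recorded in pg_control.
-- A WAL file name is three 8-hex-digit fields: timeline, segno/256, segno%256.
WITH input AS (
    SELECT (SELECT timeline_id FROM pg_control_checkpoint()) AS tli,
           pg_last_wal_replay_lsn() AS lsn
), seg AS (
    SELECT tli,
           floor((lsn - '0/0'::pg_lsn) / (16 * 1024 * 1024))::bigint AS segno
    FROM input
)
SELECT upper(lpad(to_hex(tli), 8, '0') ||
             lpad(to_hex(segno / 256), 8, '0') ||
             lpad(to_hex(segno % 256), 8, '0')) AS walfile_name
FROM seg;
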
Are there any serious dangers? Obviously this hack is not crisp and clean. Is
the risk of reading a stale timeline really a problem? By reading the TimeLineID
from the file I'm bypassing the precautions in pg_walfile_name(..), which were
presumably implemented for a reason.

Thanks for reading this. I'll be happy to hear any comments on the matter.

[0] https://github.com/wal-g/wal-g/pull/32
[1] 
https://github.com/wal-g/wal-g/pull/32/files#diff-455316c14fb04c9f9748a0798ad151d9R158



Re: [HACKERS] Re: Is anything preventing us from allowing write to foreign tables from standby?

2017-10-27 Thread Robert Haas
On Thu, Oct 26, 2017 at 11:14 AM, Ashutosh Bapat
 wrote:
> In a nearby thread, we are discussing about atomic commit of
> transactions involving foreign transactions. For maintaining
> consistency, atomicity of transactions writing to foreign server, we
> will need to create local transactions. Will that be possible on
> standby? Obviously, we could add a restriction that no consistency and
> atomic commit is guaranteed when foreign servers are written from a
> standby. I am not sure how easy would be to impose such a restriction
> and whether such a restriction would be practical.

Yeah, that feature definitely can't work on a standby.  But we could
probably allow writes when that feature is not in use.  One thing that
bothers me about Alexander's patch is that there wouldn't be any way
to distinguish between those two cases.  Maybe we need a callback to
ask the FDW "given the configured options, could you work on a
standby?"

However, as Michael also points out, it's arguably wrong to allow a
nominally read-only transaction to write data regardless of whether it
works.  In the case of a standby it could be argued that your
transaction is only read-only because you had no other choice, but
nonetheless that's how it is marked.  I have a feeling that if we
extend the definition of "read only" to mean "sometimes allow writes",
we may regret it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Typos in src/backend/optimizer/README

2017-10-27 Thread Etsuro Fujita

Hi,

This sentence in the Partition-wise joins section of 
src/backend/optimizer/README should be fixed: "This technique of 
breaking down a join between partition tables into join between their 
partitions is called partition-wise join."


(1) s/a join between partition tables/a join between partitioned tables/
(2) s/join between their partitions/joins between their partitions/

It might be okay to leave #2 as-is, but I'd like to propose changing it 
that way to make the meaning clearer.


Attached is a patch for that.

Best regards,
Etsuro Fujita
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 031e503..1e4084d 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -1088,8 +1088,8 @@ broken into joins between the matching partitions. The 
resultant join is
 partitioned in the same way as the joining relations, thus allowing an N-way
 join between similarly partitioned tables having equi-join condition between
 their partition keys to be broken down into N-way joins between their matching
-partitions. This technique of breaking down a join between partition tables
-into join between their partitions is called partition-wise join. We will use
+partitions. This technique of breaking down a join between partitioned tables
+into joins between their partitions is called partition-wise join. We will use
 term "partitioned relation" for either a partitioned table or a join between
 compatibly partitioned tables.
 



Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Simon Riggs
On 27 October 2017 at 07:20, Robert Haas  wrote:
> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
>  wrote:
>> Let's see a query like this:
>>
>> select * from bloom_test
>>  where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>
>> The minmax index produces this plan
>>
>>Heap Blocks: lossy=2061856
>>  Execution time: 22707.891 ms
>>
>> Now, the bloom index:
>>
>>Heap Blocks: lossy=25984
>>  Execution time: 338.946 ms
>
> It's neat to see BRIN being extended like this.  Possibly we could
> consider making it a contrib module rather than including it in core,
> although I don't have strong feelings about it.

I see that SP-GIST includes two operator classes in core, one default.

Makes sense to do the same thing with BRIN and add this new op class
as a non-default option in core.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implementing pg_receivewal --no-sync

2017-10-27 Thread Michael Paquier
On Thu, Oct 26, 2017 at 10:46 PM, Robert Haas  wrote:
> On Wed, Oct 25, 2017 at 10:03 PM, Michael Paquier
>  wrote:
>> This sentence is actually wrong, a feedback message is never sent with
>> the feedback message.
>
> Eh?

"A feedback message is never sent depending on the status interval".

> I think this looks basically fine, though I'd omit the short option
> for it.  There are only so many letters in the alphabet, so let's not
> use them up for developer-convenience options.

No objections to that.
-- 
Michael




Re: [HACKERS] CurTransactionContext freed before transaction COMMIT ???

2017-10-27 Thread Gaddam Sai Ram
Thanks for your responses Michael and Amit Kapila,



Yes, we have incorporated your suggestions into our code and started 
testing to reproduce the issue. If we encounter this issue again, we 
will report back here.



Regards

G. Sai Ram










Re: [HACKERS] proposal: schema variables

2017-10-27 Thread Pavel Stehule
2017-10-27 7:47 GMT+02:00 Tsunakawa, Takayuki <
tsunakawa.ta...@jp.fujitsu.com>:

> From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Pavel Stehule
> > I propose a  new database object - a variable. The variable is persistent
> > object, that holds unshared session based not transactional in memory
> value
> > of any type. Like variables in any other languages. The persistence is
> > required for possibility to do static checks, but can be limited to
> session
> > - the variables can be temporal.
> >
> >
> > My proposal is related to session variables from Sybase, MSSQL or MySQL
> > (based on prefix usage @ or @@), or package variables from Oracle (access
> > is controlled by scope), or schema variables from DB2. Any design is
> coming
> > from different sources, traditions and has some advantages or
> disadvantages.
> > The base of my proposal is usage schema variables as session variables
> for
> > stored procedures. It should to help to people who try to port complex
> > projects to PostgreSQL from other databases.
>
> Very interesting.  I hope I could join the review and testing.
>

You are welcome. I wrote a prototype last year based on envelope functions,
but the integration must be much closer to SQL for this feature to have a
clear benefit, so there is a lot of work to do. I hope to have a prototype
after this winter; that is my plan for the winter.


>
> How do you think this would contribute to easing the port of Oracle PL/SQL
> procedures?  Would the combination of orafce and this feature promote
> auto-translation of PL/SQL procedures?  I'm curious what will be the major
> road blocks after adding the schema variable.
>

It depends on the creativity of the PL/SQL developers. Usually about 80% of
an application can be migrated with the current GUC mechanism - ora2pg does
some of that work. But GUCs are a little slower (not too important), and
there is no simple way to secure them.

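For readers who haven't seen it, the GUC-based emulation mentioned above
usually looks like the sketch below, built on the existing
set_config()/current_setting() functions; the "myapp.counter" setting name
and the wrapper function are invented for illustration.

-- Emulating a session variable with a custom GUC (existing functionality).
SELECT set_config('myapp.counter', '42', false);   -- false: keep it for the whole session
SELECT current_setting('myapp.counter')::int;      -- values are text, so a cast is needed

-- Typical "wrap function" giving the setting a typed interface:
CREATE FUNCTION get_counter() RETURNS int
  LANGUAGE sql STABLE
  AS $$ SELECT current_setting('myapp.counter')::int $$;
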
Working with variables will be similar to working with GUCs, but
significantly more natural (there is no need to build wrapper functions). It
should be much better when the value is of some composite type. Migrations
will still need some intelligence, but there will be less work and the code
will be more readable and cleaner.

I have already talked about "schema pinned" functions (schema private/public
objects), but I haven't thought about that more deeply. There could be a
special access right for schema variables, and the pinned schema could be
preferred over search_path. With this feature a schema would behave very much
like an Oracle module. In other words, we could implement scoped access
rights based on schemas. But that is a far-off horizon. What is important is
that the proposal doesn't block any future enhancement in this area, and it
is consistent with the current state. In the future you could work with
schema-private functions, tables, variables, and sequences, so variables are
nothing special.

Regards

Pavel


Regards
> Takayuki Tsunakawa
>
>
>


Re: [HACKERS] path toward faster partition pruning

2017-10-27 Thread Amit Langote
On 2017/10/27 13:57, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote
>  wrote:
>>> I don't think we really want to get into theorem-proving here, because
>>> it's slow.
>>
>> Just to be clear, I'm saying we could use theorem-proving (if at all) just
>> for the default partition.
> 
> I don't really see why it should be needed there either.  We've got
> all the bounds in order, so we should know where there are any gaps
> that are covered by the default partition in the range we care about.

Sorry, I forgot to add: "...just for the default partition, for cases like
the one in Beena's example."

In normal cases, default partition selection doesn't require any
theorem-proving.  It proceeds in a straightforward manner more or less
like what you said it should.

After thinking about it some more, I think there is a rather straightforward
trick to implement the idea you mentioned and get this working for the case
presented in Beena's example.  It works as follows:

For any non-root partitioned table, we add its partition constraint clauses
to the query-provided list of clauses and use the whole list to drive the
partition-pruning algorithm.  So, when partition pruning runs for tprt_1,
along with (< 1) which the original query provides, we also have (>= 1),
which comes from the partition constraint of tprt_1 (which is >= 1 and < 5).
Note that there is a trick in the new code for the (< 5) coming from the
constraint to be overridden by the more restrictive (< 1) coming from the
original query.

Thanks,
Amit


