Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-12 Thread Tomas Vondra


On 01/12/2018 05:35 PM, Peter Eisentraut wrote:
> On 1/11/18 18:23, Greg Stark wrote:
>> AIUI spilling to disk doesn't affect absorbing future updates, we
>> would just keep accumulating them in memory right? We won't need to
>> unspill until it comes time to commit.
> 
> Once a transaction has been serialized, future updates keep accumulating
> in memory, until perhaps it gets serialized again.  But then at commit
> time, if a transaction has been partially serialized at all, all the
> remaining changes are also serialized before the whole thing is read
> back in (see reorderbuffer.c line 855).
> 
> So one optimization would be to specially keep track of all transactions
> that have been serialized already and pick those first for further
> serialization, because it will be done eventually anyway.
> 
> But this is only a secondary optimization, because it doesn't help in
> the extreme cases that either no (or few) transactions have been
> serialized or all (or most) transactions have been serialized.
> 
>> The real aim should be to try to pick the transaction that will be
>> committed furthest in the future. That gives you the most memory to
>> use for live transactions for the longest time and could let you
>> process the maximum amount of transactions without spilling them. So
>> either the oldest transaction (in the expectation that it's been open
>> a while and appears to be a long-lived batch job that will stay open
>> for a long time) or the youngest transaction (in the expectation that
>> all transactions are more or less equally long-lived) might make
>> sense.
> 
> Yes, that makes sense.  We'd still need to keep a separate ordered list
> of transactions somewhere, but that might be easier if we just order
> them in the order we see them.
> 

Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
don't really commit independently, but as part of the toplevel xact. And
that list is ordered by LSN, which is pretty much exactly the order in
which we see the transactions.

I feel somewhat uncomfortable about evicting the oldest (or youngest)
transactions based on some assumed correlation with the commit order.
I'm pretty sure that will bite us badly for some workloads.

Another somewhat non-intuitive detail is that because ReorderBuffer
switched to the Generation allocator for changes (which usually represent
99% of the memory used during decoding), it does not reuse memory the
way AllocSet does. Actually, it does not reuse memory at all, aiming to
eventually give the memory back to libc (which AllocSet can't do).

Because of this, evicting the youngest transactions seems like quite a
bad idea: those chunks will not be reused, and there may be other chunks
on the same blocks, preventing their release.

Yeah, complicated stuff.

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-12 Thread Tomas Vondra


On 01/11/2018 08:41 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> 
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
> 
> ... snip ...

Thanks for a comprehensive summary of the patch!

> 
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
> 
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
> 

Yeah, when writing that comment I was worried that (a) might get rather
expensive. I was thinking about maintaining a dlist of transactions
sorted by size (ReorderBuffer now only has a hash table), so that we
could evict transactions from the beginning of the list.

But while that speeds up the choice of transactions to evict, the added
cost is rather high, particularly when most transactions are roughly the
same size, because in that case we'd probably have to move nodes around
in the list quite often. So it seems wiser to just walk the list once
when looking for a victim.

What I'm thinking about instead is tracking just an approximate
version of this - it does not really matter whether we evict the truly
largest transaction or one that is a couple of kilobytes smaller. What
we care about is an answer to this question:

Is there some very large transaction that we could evict to free
a lot of memory, or are all transactions fairly small?

So perhaps we can define some "size classes" and track to which of them
each transaction belongs. For example, we could split the memory limit
into 100 buckets, each representing a 1% size increment.

A transaction would not switch classes very often, and it would be
trivial to pick the largest transaction. When all the transactions are
squashed into the smallest classes, we may switch to some alternative
strategy. Not sure.
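
To make that concrete, here is a trivial sketch of the kind of bucketing
I have in mind (just an illustration - the names and the memory_limit
parameter are placeholders, nothing from an actual patch):

#include <stddef.h>

#define N_SIZE_CLASSES 100      /* 1% increments of the memory limit */

/*
 * Sketch only: map a transaction's current size to one of the
 * N_SIZE_CLASSES buckets.  "memory_limit" stands in for the
 * logical_work_mem limit, in bytes.
 */
static int
txn_size_class(size_t txn_size, size_t memory_limit)
{
    size_t      step = memory_limit / N_SIZE_CLASSES;
    size_t      cls = (step == 0) ? 0 : txn_size / step;

    return (cls >= N_SIZE_CLASSES) ? N_SIZE_CLASSES - 1 : (int) cls;
}

Eviction would then just scan the classes from the largest one down and
grab any transaction from the first non-empty class; a transaction only
has to be moved between classes when it crosses a 1% boundary, not on
every change.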

In any case, I don't really know how expensive the selection actually
is, or whether it's an issue. I'll do some measurements.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Possible performance regression with pg_dump of a large number of relations

2018-01-12 Thread Stephen Frost
Greetings Jeff & Luke,

* Jeff Janes (jeff.ja...@gmail.com) wrote:
> Sorry, that query reflects some munging I did to it.  The real part added
> to the query is:
> 
> EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON (c.oid
> = pip.objoid AND pip.classoid = (SELECT oid FROM pg_class WHERE relname =
> 'pg_class') AND pip.objsubid = at.attnum) WHERE at.attrelid = c.oid AND
> ((SELECT array_agg(acl) FROM (SELECT
> unnest(coalesce(at.attacl,acldefault('c',c.relowner))) AS acl EXCEPT SELECT
> unnest(coalesce(pip.initprivs,acldefault('c',c.relowner)))) as foo) IS NOT
> NULL OR (SELECT array_agg(acl) FROM (SELECT
> unnest(coalesce(pip.initprivs,acldefault('c',c.relowner))) AS acl EXCEPT
> SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner)))) as foo) IS
> NOT NULL OR NULL IS NOT NULL OR NULL IS NOT NULL)) AS changed_acl

Yes, this is to check whether any of the rights on the table or any of
its columns have been changed from their initial rights as recorded in
pg_init_privs.

I've been playing around with this a bit tonight trying to think of a
way to avoid doing this work, and it occurs to me that we really only
need to worry about initprivs on objects in schemas that are either
present at init time or created by extensions.  Not all of the objects
in the system can have init-privs, because the only way to get
init-privs is at initdb time or from an extension creating a new object.

As such, I've reworked the query (but not yet put it into pg_dump to
run it through the regression tests) to look like this (for anyone else
who wants to take a look at it and play with it):

EXPLAIN ANALYZE
SELECT c.tableoid, c.oid, c.relname,
  (SELECT pg_catalog.array_agg(acl ORDER BY row_n)
     FROM (SELECT acl, row_n
             FROM pg_catalog.unnest(
                    coalesce(c.relacl,pg_catalog.acldefault(
                      CASE WHEN c.relkind = 'S'
                           THEN 's'
                           ELSE 'r' END::"char",c.relowner)))
                  WITH ORDINALITY AS perm(acl,row_n)
            WHERE NOT EXISTS (
                  SELECT 1 FROM
                    pg_catalog.unnest(
                      coalesce(pip.initprivs,pg_catalog.acldefault(
                        CASE WHEN c.relkind = 'S'
                             THEN 's'
                             ELSE 'r'
                             END::"char",c.relowner)))
                    AS init(init_acl)
                   WHERE acl = init_acl)) as foo
    WHERE nsp.nspname IN ('information_schema','pg_catalog')
       OR ext.oid IS NOT NULL)
    AS relacl,
  (SELECT pg_catalog.array_agg(acl ORDER BY row_n)
     FROM (SELECT acl, row_n
             FROM pg_catalog.unnest(
                    coalesce(pip.initprivs,pg_catalog.acldefault(
                      CASE WHEN c.relkind = 'S'
                           THEN 's'
                           ELSE 'r' END::"char",c.relowner)))
                  WITH ORDINALITY AS initp(acl,row_n)
            WHERE NOT EXISTS (
                  SELECT 1 FROM pg_catalog.unnest(
                    coalesce(c.relacl,pg_catalog.acldefault(
                      CASE WHEN c.relkind = 'S'
                           THEN 's'
                           ELSE 'r'
                           END::"char",c.relowner)))
                    AS permp(orig_acl)
                   WHERE acl = orig_acl)) as foo
    WHERE nsp.nspname IN ('information_schema','pg_catalog')
       OR ext.oid IS NOT NULL)
    as rrelacl,
  NULL AS initrelacl,
  NULL as initrrelacl,
  c.relkind,
  c.relnamespace,
  (SELECT rolname
     FROM pg_catalog.pg_roles
    WHERE oid = c.relowner) AS rolname,
  c.relchecks,
  c.relhastriggers,
  c.relhasindex,
  c.relhasrules,
  c.relhasoids,
  c.relrowsecurity,
  c.relforcerowsecurity,
  c.relfrozenxid,
  c.relminmxid,
  tc.oid AS toid,
  tc.relfrozenxid AS tfrozenxid,
  tc.relminmxid AS tminmxid,
  c.relpersistence,
  c.relispopulated,
  c.relreplident,
  c.relpages,
  CASE WHEN c.reloftype <> 0
       THEN c.reloftype::pg_catalog.regtype
       ELSE NULL END AS reloftype,
  d.refobjid AS owning_tab,
  d.refobjsubid AS owning_col,
  (SELECT spcname
     FROM pg_tablespace t
    WHERE t.oid = c.reltablespace) AS reltablespace,
  array_remove(
    array_remove(
      c.reloptions,'check_option=local'),
    'check_option=cascaded') AS reloptions,
  CASE WHEN 'check_option=local' = ANY (c.reloptions)
       THEN 'LOCAL'::text
       WHEN 'check_option=cascaded' = ANY (c.reloptions)
       THEN 'CASCADED'::text ELSE NULL END AS checkoption,
  tc.reloptions AS toast_reloptions,
  c.relkind = 'S' AND
  EXISTS (SELECT

Re: [HACKERS] [BUGS] BUG #14825: enum type: unsafe use?

2018-01-12 Thread Robert Haas
On Fri, Jan 12, 2018 at 5:06 PM, Andres Freund  wrote:
> OTOH, it seems quite likely that we'll add more transaction-lifetime
> shared data (e.g. combocid), so building per-xact infrastructure
> actually seems like a good idea.

Sure, but there's no urgency about it.  Particularly if the first cut
at this is implemented using dshash, moving the dshash to a different
DSA with a different lifespan later should be easy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PROPOSAL] Shared Ispell dictionaries

2018-01-12 Thread Tomas Vondra
Hi Arthur,

I've done some initial review of the patch today, and here are some
thoughts:

0001-Fix-ispell-memory-handling-v2.patch

This makes sense. The patch simply replaces two cpstrdup() calls with
MemoryContextStrdup, but I see spell.c already has two macros to
allocate memory in the buildCxt. What about adding tmpstrdup to copy a
string into the context? I admit this is mostly nitpicking though.


0002-Retreive-shmem-location-for-ispell-v2.patch

I think the GUC name should make it clear it's a maximum number of
something, just like "max_parallel_workers" and other such GUCs. When I
first saw "shared_dictionaries" in the patch I thought it was a list of
dictionary names, or something like that.

I have a bunch of additional design questions and proposals (not
necessarily required for v1, but perhaps useful for shaping it).


1) Why do we actually need the limit? Is it really necessary / useful?

When I wrote shared_ispell back in 2012, all we had were fixed segments
allocated at start, and so similar limits were a built-in restriction.
But after the DSM stuff was introduced I imagined it would not be necessary.

I realize the current implementation requires that, because the hash
table is still created in an old-style memory context (and only the
dictionaries are in DSM segments).

But that seems fairly straightforward to fix by maintaining the hash
table in a separate DSM segment too. So a lookup of the dictionary DSM
would have to first check what the current hash table segment is, and
then continue as now.

I'm not sure if dynahash can live in a DSM segment, but we already have
a hash table that supports that in dshash.c (which is also concurrent,
although I'm not sure if that's a major advantage for this use case).


2) Do we actually want/need some limits? Which ones?

That is not to say we don't need/want some limits, but the current limit
may not be the droid we're looking for, for a couple of reasons.

Firstly, currently it only matters during startup, when the dynahash is
created. So to change the limit (e.g. to increase it) you actually have
to restart the database, which is obviously a major hassle.

Secondly, dynahash tweaks the values to get proper behavior. For example,
it does not use the value directly but rounds it up to some higher value
of the form 2^N. Which means the limit may not be enforced immediately
when the GUC value is hit, but unexpectedly somewhat later.

And finally, I believe this is log-worthy - right now the dictionary
load silently switches to backend memory (thus incurring all the parsing
overhead). This certainly deserves at least a log message.

Actually, I'm not sure "number of dictionaries" is a particularly useful
limit in the first place - that's not a number I really care about. But
I do care about amount of memory consumed by the loaded dictionaries.

So I do suggest adding such a "max memory for shared dictionaries" limit.
I'm not sure we can enforce it strictly, because when deciding where to
load the dict we haven't parsed it yet and so don't know how much memory
will be required. But I believe a lazy check should be fine (load it,
and if we then exceed the total memory, disable loading additional ones).
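
Roughly the shape I have in mind for that lazy check (a sketch only -
every function and variable name below is made up for illustration, none
of it is in the patch):

/*
 * Sketch of a lazy "max memory for shared dictionaries" check.
 * All names here are hypothetical.
 */
static void *
dict_load(const char *dictfile, size_t max_shared_dict_mem)
{
    void       *dict;
    size_t      size;

    if (shared_dict_total_size() >= max_shared_dict_mem)
    {
        /* over budget: fall back to backend-local memory, and say so */
        elog(LOG, "shared dictionary memory limit reached, "
             "loading \"%s\" into backend memory", dictfile);
        return dict_load_local(dictfile);
    }

    /*
     * Under the limit, so load into shared memory.  The size is only
     * known after parsing, so the last dictionary may push the total
     * past the limit - that's the "lazy" part.
     */
    dict = dict_load_shared(dictfile, &size);
    shared_dict_account(size);
    return dict;
}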


3) How do I unload a dictionary from the shared memory?

Assume we've reached the limit (it does not matter if it's the number of
dictionaries or memory used by them). How do I resolve that without
restarting the database? How do I unload a dictionary (which may be
unused) from shared memory?

ALTER TEXT SEARCH DICTIONARY x UNLOAD


4) How do I reload a dictionary?

Assume I've updated the dictionary files (added new words into the
files, or something like that). How do I reload the dictionary? Do I
have to restart the server, DROP/CREATE everything again, or what?

What about instead having something like this:

ALTER TEXT SEARCH DICTIONARY x RELOAD


5) Actually, how do I list currently loaded dictionaries (and how much
memory they use in the shared memory)?


6) What other restrictions would be useful?

I think it should be possible to specify which ispell dictionaries may
be loaded into shared memory, and which should be always loaded into
local backend memory. That is, something like

CREATE TEXT SEARCH DICTIONARY x (
TEMPLATE = ispell,
DictFile = czech,
AffFile = czech,
StopWords = czech,
SharedMemory = true/false (default: false)
);

because otherwise the dictionaries will compete for shared memory, and
it's unclear which of them will get loaded. For a server with a single
application that may not be a huge issue, but think about servers shared
by multiple applications, etc.

In the extension this was achieved explicitly, by defining a separate
'shared_ispell' template, but if you modify the current one that won't
work, of course.


7) You mentioned you had to get rid of the compact_palloc0 - can you
elaborate a bit on why that was necessary? Also, when benchmarking the
impact of this make sure to measure not only the

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Claudio Freire
On Fri, Jan 12, 2018 at 10:49 PM, Tom Lane  wrote:

> Claudio Freire  writes:
> > On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs 
> wrote:
> >> So we can't completely remove xl_prev field, without giving up some
> >> functionality.
>
> > Or, you can use the lower 16-bits of the previous record's CRC
>
> Hmm ... that is an interesting idea, but I'm not sure it helps much
> towards Simon's actual objective.  AIUI the core problem here is the
> contention involved in retrieving the previous WAL record's address.
> Changing things so that we need the previous record's CRC isn't really
> gonna improve that --- if anything, it'll be worse, because the
> record's address can (in principle) be known sooner than its CRC.
>
> Still, if we were just looking to shave some bits off of WAL record
> headers, it might be possible to do something with this idea.
>
> regards, tom lane
>

I realized that later; that's why I corrected myself to the first record,
not the previous one.

Now, that assumes there's enough entropy in CRC values to actually make
good use of those 16 bits... there may not be. WAL segments are highly
compressible, after all.

So, maybe a hash of the LSN of the first record instead? That should be
guaranteed to have good entropy (given a good hash).

In any case, there are many rather good alternatives to the segment number
that should be reasonably safe from consistent collisions with garbage data.
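
Something along these lines is what I mean (a sketch only - hash_any()
is the existing in-core hash function, the wrapper around it is made up
for illustration):

#include "postgres.h"
#include "access/hash.h"
#include "access/xlogdefs.h"

/*
 * Sketch only: derive a 16-bit identifier from the LSN of the segment's
 * first record, rather than from the segment number.
 */
static uint16
wal_id_from_first_lsn(XLogRecPtr first_lsn)
{
    uint32      h = DatumGetUInt32(hash_any((const unsigned char *) &first_lsn,
                                            sizeof(first_lsn)));

    return (uint16) (h & 0xFFFF);
}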


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Tom Lane
Claudio Freire  writes:
> On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs  wrote:
>> So we can't completely remove xl_prev field, without giving up some
>> functionality.

> Or, you can use the lower 16-bits of the previous record's CRC

Hmm ... that is an interesting idea, but I'm not sure it helps much
towards Simon's actual objective.  AIUI the core problem here is the
contention involved in retrieving the previous WAL record's address.
Changing things so that we need the previous record's CRC isn't really
gonna improve that --- if anything, it'll be worse, because the
record's address can (in principle) be known sooner than its CRC.

Still, if we were just looking to shave some bits off of WAL record
headers, it might be possible to do something with this idea.

regards, tom lane



Re: [HACKERS] Planning counters in pg_stat_statements

2018-01-12 Thread Tom Lane
I wrote:
> Haribabu Kommi  writes:
>> I checked the latest patch and it is working fine and I don't have any
>> further comments. Marked the patch as "ready for committer".

> I started to look at this patch,

... looking further, I'm really seriously unhappy about this bit:

@@ -943,9 +1006,16 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
...
+
+   /*
+* Clear planning_time, so that we only count it once for each
+* replanning of a prepared statement.
+*/
+   queryDesc->plannedstmt->planning_time = 0;
}

What we have here is pgss_ExecutorEnd stomping on the plan cache.
I do *not* find that acceptable.  At the very least, it ruins the
theory that this field is a shared resource.

What we could/should do instead, I think, is have pgss_planner_hook
make its own pgss_store() call to log the planning time results
(which would mean we don't need the added PlannedStmt field at all).
That would increase the overhead of this feature, which might mean
that it'd be worth having a pg_stat_statements GUC to enable it.
But it'd be a whole lot cleaner.
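
Roughly the shape I'm thinking of (a sketch, not working code:
prev_planner_hook is the usual saved-hook variable, and
pgss_store_planning_time() is a stand-in for the real pgss_store()
call, which takes several more arguments):

#include "postgres.h"
#include "optimizer/planner.h"
#include "portability/instr_time.h"

static planner_hook_type prev_planner_hook = NULL;

static PlannedStmt *
pgss_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *result;
    instr_time  start;
    instr_time  duration;

    INSTR_TIME_SET_CURRENT(start);

    if (prev_planner_hook)
        result = prev_planner_hook(parse, cursorOptions, boundParams);
    else
        result = standard_planner(parse, cursorOptions, boundParams);

    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    /* log the planning time against this query's entry (details elided) */
    pgss_store_planning_time(parse, INSTR_TIME_GET_MILLISEC(duration));

    return result;
}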

regards, tom lane



Re: [HACKERS] Planning counters in pg_stat_statements

2018-01-12 Thread Tom Lane
Haribabu Kommi  writes:
> I checked the latest patch and it is working fine and I don't have any
> further comments. Marked the patch as "ready for committer".

I started to look at this patch, and I'm not entirely convinced that
it is a good thing for the planner_hook to bump nested_level.  ISTM
that will change behavior in some ways, in particular when the planner
chooses to evaluate an immutable or stable function that runs the
executor (probably via SPI).  Before, that execution would have been
regarded as a top-level call, now it will not be.  Maybe that's fine,
but did anyone think hard about it?

A possible alternative behavior is for planner_hook to maintain its
own nesting depth counter, separate from the one for execution
nesting depth.  I'm not sure if that's better or not.

Discuss ...

regards, tom lane



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Claudio Freire
On Fri, Jan 12, 2018 at 8:43 PM, Claudio Freire 
wrote:

>
>
> On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs 
> wrote:
>
>> So we can't completely remove xl_prev field, without giving up some
>> functionality. But we don't really need to store the 8-byte previous
>> WAL pointer in order to detect torn pages. Something else which can
>> tell us that the WAL record does not belong to current WAL segno would
>> be enough as well. I propose that we replace it with a much smaller
>> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
>> decide to call it) is the low order 16-bits of the WAL segno to which
>> the WAL record belongs. While reading WAL, we always match that the
>> "xl_walid" value stored in the WAL record matches with the current WAL
>> segno's lower order 16-bits and if not, then consider that as the end
>> of the stream.
>>
>> For this to work, we must ensure that WAL files are either recycled in
>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>> easy to do with the existing infrastructure.
>>
>
> Or, you can use the lower 16-bits of the previous record's CRC
>
>
>
Sorry, missed the whole point. Of the *first* record's CRC


Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2018-01-12 Thread Tom Lane
Fabien COELHO  writes:
> Attached is an attempt at improving the situation:

I took a quick look at this and can't really convince myself that it
improves our situation.  The fact that it prints nothing if the runtime
is outside of a tightly constrained range seems likely to hide the sort
of bug you might hope such a test would detect.  A simple example is
that, if there's an off-by-one bug causing the test to run for 3 seconds
not 2, this test script won't complain about it.  Worse, if pgbench dumps
core instantly on startup when given -p, this test script won't complain
about it.  And the test is of no value at all on very slow buildfarm
critters such as valgrind or clobber-cache-always animals.

> (2) the test now only expects "progress: \d" from the output, so it is enough 
> that one progress is shown, whenever it is shown.

Hm.  Could we get somewhere by making the test look for that, and
adjusting the loop logic inside pgbench so that (maybe only with the
tested switch values) it's guaranteed to print at least one progress
output regardless of timing, because it won't check for exit until after
it's printed a log message?

> For the random-ness related test (--sample-rate), we could add a feature to 
> pgbench which allows to control the random seed, so that the number of samples
> could be known in advance for testing purposes.

Don't see how locking down the seed gets us anywhere.  The behavior of
random() can't be assumed identical across different machines, even
with the same seed.

regards, tom lane



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Andres Freund
On 2018-01-12 17:43:00 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-01-12 17:24:54 -0500, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> Right. I wonder if it would be reasonable to move that to a page's header
> >>> instead of individual records?  To avoid torn page issues we'd have to
> >>> reduce the page size to a sector size, but I'm not sure that's that bad?
> 
> >> Giving up a dozen or two bytes out of every 512 sounds like quite an
> >> overhead.
> 
> > It's not nothing, that's true. But if it avoids 8 bytes in every record,
> > that'd probably at least as much in most usecases.
> 
> Fair point.  I don't have a very good handle on what "typical" WAL record
> sizes are, but we might be fine with that --- some quick counting on the
> fingers says we'd break even with an average record size of ~160 bytes,
> and be ahead below that.

This is far from a definitive answer, but here's some data:

pgbench -i -s 100 -q:
Type      N           (%)   Record size    (%)        FPI size     (%)        Combined size    (%)
----      -           ---   -----------    ---        --------     ---        -------------    ---
Total     308958            1077269060     [84.19%]   202269468    [15.81%]   1279538528       [100%]

So here records are really large, which makes sense, given it's largely
initialization of data. With wal_compression that'd probably look
different, but still commonly spanning multiple pages.


pgbench -M prepared -c 16 -j 16 -T 100
Type      N           (%)   Record size    (%)        FPI size     (%)        Combined size    (%)
----      -           ---   -----------    ---        --------     ---        -------------    ---
Total     14228881          947824170      [100.00%]  8192         [0.00%]    947832362        [100%]

Here we're at about 66 bytes per record...


> We'd need to investigate the page-crossing overhead carefully though.

agreed.

Greetings,

Andres Freund



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-01-12 Thread Tomas Vondra


On 01/12/2018 01:48 AM, Thomas Munro wrote:
> On Thu, Jan 4, 2018 at 1:12 PM, Tomas Vondra
>  wrote:
>> Attached is an updated patch series, where the first patch fixes this by
>> removing the reset of estimatedclauses (and tweaking the comment).
> 
> Hi Tomas,
> 
> FYI, from the annoying robot department:
> 
> ref/create_statistics.sgml:170: parser error : Opening and ending tag
> mismatch: structname line 170 and unparseable
>Create table t2 with two perfectly correlated columns
>  ^
> ref/create_statistics.sgml:195: parser error : Opening and ending tag
> mismatch: structname line 195 and unparseable
>Create table t3 with two strongly correlated columns, and
>  ^

Thanks. Attached is an updated patch fixing all the doc issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Keep-information-about-already-estimated-clauses.patch.gz
Description: application/gzip


0002-multivariate-MCV-lists.patch.gz
Description: application/gzip


0003-multivariate-histograms.patch.gz
Description: application/gzip


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Tom Lane
Andres Freund  writes:
> On 2018-01-12 17:24:54 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>>> Right. I wonder if it would be reasonable to move that to a page's header
>>> instead of individual records?  To avoid torn page issues we'd have to
>>> reduce the page size to a sector size, but I'm not sure that's that bad?

>> Giving up a dozen or two bytes out of every 512 sounds like quite an
>> overhead.

> It's not nothing, that's true. But if it avoids 8 bytes in every record,
> that'd probably at least as much in most usecases.

Fair point.  I don't have a very good handle on what "typical" WAL record
sizes are, but we might be fine with that --- some quick counting on the
fingers says we'd break even with an average record size of ~160 bytes,
and be ahead below that.
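
(Spelling out the finger-counting, taking the added per-sector overhead
to be about 26 bytes, i.e. somewhere in the "dozen or two" range: each
record saves 8 bytes, so we break even when

    8 * (512 / R) = 26   =>   R = 8 * 512 / 26 ~= 158 bytes

which is in the same ballpark as the ~160 bytes above.)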

We'd need to investigate the page-crossing overhead carefully though.

regards, tom lane



Re: [HACKERS] Refactoring identifier checks to consistently use strcmp

2018-01-12 Thread Daniel Gustafsson
> On 11 Jan 2018, at 09:01, Michael Paquier  wrote:
> 
> On Wed, Jan 10, 2018 at 11:49:47PM +0100, Daniel Gustafsson wrote:
>>> On 08 Jan 2018, at 17:27, Robert Haas  wrote:
>>> nor has anyone taken the trouble to list with precision all of the
>>> places where the behavior will change.  I think the latter is
>>> absolutely indispensable,
>> 
>> I had a look at the available commands in postgres and compiled a list
>> of them in options.sql based on whether they have options, and how those
>> options are matched (case sensitively or insensitively).  The queries in
>> the file are nonsensical, they are just meant to show the handling of
>> the options.  The idea was to illustrate the impact of this proposal
>> by running examples. Running this file with and without the patches
>> applied shows the following commands being affected:  
>> 
>> 
>> 
>> The output with the patch is attached as options_patched.out, and the output
>> from master as options_master.out.  Diffing the two files is rather helpful I
>> think.
> 
> Thanks. This is saving me hours of lookups and testing during the
> review, as now reviewers just have to map your test series to the
> modified code.

Well, being the one proposing the patch I should be the one spending those
hours and not you.  Sorry for not including it in the original submission.

> I can't help noticing that tests for code paths with
> contrib modules are missing. This brings up the point: do we want those
> tests to be in the patch?

I left them out since they are versions of ALTER TEXT SEARCH .. rather than
new statements.

> I would like to think that a special section
> dedicated to option compatibility for each command would be welcome to
> track which grammar is supported and which grammar is not supported.

I’m not sure I follow?

>> One open question from this exercise is how to write a good test for
>> this.  It can either be made part of the already existing test queries
>> or a separate suite.  I’m leaning toward the latter, simply because the
>> case-flipping in existing tests seems like something that could be
>> cleaned up years from now accidentally because it looks odd.
> 
> Adding them into src/test/regress/ sounds like a good plan to me.

If there is interest in this patch now that the list exists and aids review, I
can turn the list into a proper test that makes a little more sense than the
current list which is rather aimed at helping reviewers.

>> If you however combine the new and old syntax, the statement works and the 
>> WITH
>> option wins by overwriting any previous value.  The below statement creates 
>> an
>> IMMUTABLE function:
>> 
>>create function int42(cstring) returns int42 AS 'int4in'
>>language internal strict volatile with (isstrict, iscachable);
> 
> Here is another idea: nuking isstrict and iscachable from CREATE
> FUNCTION syntax and forgetting about them. I would be tempted to do
> that before the rest.

That's certainly an option; I don't have enough insight into real-life
production environments to offer much of an opinion.

> 
> -old_aggr_elem:  IDENT '=' def_arg
> +old_aggr_elem:  PARALLEL '=' def_arg
> +   {
> +   $$ = makeDefElem("parallel", (Node *)$3, @1);
> +   }
> +   | IDENT '=' def_arg
> Nit: alphabetical order.

Fixed.

> I have spent a couple of hours reviewing all the calls to pg_strcasecmp,
> and the only thing I have noticed is in transformRelOptions(), where the
> namespace string should be evaluated as well by strcmp, no?

> So per my set of examples, the evaluation of the schema name should fail
> when creating relation a3 with your patch applied. As the patch does
> things, the experience for the user is not consistent, and the schema
> name goes through parser via reloption_elem using makeDefElemExtended,
> so this should use strcmp.

I believe you are entirely correct; the attached v5 patch is updated with
this behavior.  The volatility patch is unchanged by this, so I didn't
submit a new version of that one.

Thanks for the review!

cheers ./daniel



defname_strcmp-v5.patch
Description: Binary data


Re: [HACKERS] [BUGS] BUG #14825: enum type: unsafe use?

2018-01-12 Thread Andres Freund
On 2018-01-12 07:51:34 -0500, Robert Haas wrote:
> On Thu, Jan 11, 2018 at 11:01 PM, Thomas Munro
>  wrote:
> > Are you saying we should do the work now to create a per-transaction
> > DSM segment + DSA area + thing that every backend attaches to?
> 
> No, I was just thinking you could stuff it into the per-parallel-query
> DSM/DSA.  But...
> 
> > I didn't think creating backend local hash tables would be a problem
> > because it's a vanishingly rare occurrence for the hash table to be
> > created at all (ie when you've altered an enum), and if created, to
> > have more than a couple of entries in it.
> 
> ...this is also a fair point.

OTOH, it seems quite likely that we'll add more transaction-lifetime
shared data (e.g. combocid), so building per-xact infrastructure
actually seems like a good idea.

Greetings,

Andres Freund



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Andres Freund
On 2018-01-12 10:45:54 -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs  wrote:
> > I have some reservations about whether this makes the mechanism less
> > reliable.
> 
> Yeah, it scares me too.

Same here.


> The xl_prev field is our only way of detecting that we're looking at
> old WAL data when we cross a sector boundary.

Right. I wonder if it would be reasonable to move that to a page's header
instead of individual records?  To avoid torn page issues we'd have to
reduce the page size to a sector size, but I'm not sure that's that bad?


> > Of course, we also have xl_crc, so I'm not sure whether there's any
> > chance of real harm...
> 
> The CRC only tells you that you have a valid WAL record, it won't clue
> you in that it's old data you shouldn't replay.

Yea, I don't like relying on the CRC alone at all.

Greetings,

Andres Freund



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Andres Freund
On 2018-01-12 18:36:14 -0300, Alvaro Herrera wrote:
> As above -- do we really like our current format so much that we're
> satisfied with minor tweaks?

A definite no from here.  I think especially pg_proc desperately needs
something key=value like to be understandable, and that very clearly
seems to be something we can't do in the current format.

- Andres



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Alvaro Herrera
Peter Eisentraut wrote:
> On 1/12/18 12:24, Alvaro Herrera wrote:
> > Here's a small sample pg_proc entry:
> > 
> >  { oid => '2147', descr => 'number of input rows for which the input 
> > expression is not null',
> >   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 
> > 'any', s => 'aggregate_dummy' },
> > 
> > A pg_amop entry:
> > { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper 
> > => '<(int2,int2)', am => 'btree' },
> > 
> > Notes:
> > 1. this is Perl data; it is read with 'eval' without any external modules.
> > 2. the pg_proc entry has been compressed to two lines, to avoid
> >content-free lines that would easily confuse git merge, but keep line
> >length reasonable.
> 
> I don't think I like this.  I know pg_proc.h is a pain to manage, but at
> least right now it's approachable programmatically.  I recently proposed
> a patch to replace the columns proisagg and proiswindow with a combined
> column prokind.  I could easily write a small Perl script to make that
> change in pg_proc.h, because the format is easy to parse and has one
> line per entry.  With this new format, that approach would no longer
> work, and I don't know what would replace it.

The idea in my mind is that you'd write a Perl program to do such
changes, yeah.  If the code we supply contains enough helpers and a few
samples, it should be reasonably simple for people that don't do much
Perl.

The patch series does contain a few helper programs to write the data
files.  I haven't looked in detail at what they can do and what they cannot.

> > 3. references to objects in other catalogs are by name, such as "int8"
> >or "btree/integer_ops" rather than OID.
> 
> I think we could already do this by making more use of things like
> regtype and regproc.  That should be an easy change to make.

Well, that assumes we *like* the current format, which I think is not a
given ... more the opposite.

> > 4. for each attribute, an abbreviation can be declared.  In the
> >pg_proc sample we have "n" which stands for proname, because we have
> >this line:
> >+   NameDataproname BKI_ABBREV(n);
> 
> I'm afraid a key value system would invite writing the attributes in
> random order and create a mess over time.


Yeah, I share this concern.  But you could fix it if the Perl tooling to
write these files had a hardcoded list to work with.  Maybe we could put
it in a declaration of sorts at the start of each data file.

> But if we want to do it, I think we could also add it to the current BKI
> format.  The same goes for defining default values for some columns.

As above -- do we really like our current format so much that we're
satisfied with minor tweaks?

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread David Fetter
On Fri, Jan 12, 2018 at 04:22:26PM -0500, Peter Eisentraut wrote:
> On 1/12/18 12:24, Alvaro Herrera wrote:
> > Here's a small sample pg_proc entry:
> > 
> >  { oid => '2147', descr => 'number of input rows for which the input 
> > expression is not null',
> >   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 
> > 'any', s => 'aggregate_dummy' },
> > 
> > A pg_amop entry:
> > { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper 
> > => '<(int2,int2)', am => 'btree' },
> > 
> > Notes:
> > 1. this is Perl data; it is read with 'eval' without any external modules.
> > 2. the pg_proc entry has been compressed to two lines, to avoid
> >content-free lines that would easily confuse git merge, but keep line
> >length reasonable.
> 
> I don't think I like this.  I know pg_proc.h is a pain to manage,
> but at least right now it's approachable programmatically.  I
> recently proposed a patch to replace the columns proisagg and
> proiswindow with a combined column prokind.  I could easily write a
> small Perl script to make that change in pg_proc.h, because the
> format is easy to parse and has one line per entry.  With this new
> format, that approach would no longer work, and I don't know what
> would replace it.

How about ingesting with Perl, manipulating there, and spitting back
out as Perl data structures?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] UPDATE of partition key

2018-01-12 Thread Robert Haas
On Fri, Jan 12, 2018 at 12:23 PM, Amit Khandekar  wrote:
>> (1) if they need it by subplan index, first use
>> subplan_partition_offsets to convert it to a per-leaf index
>
> Before that, we need to check if there *is* an offset array. If there
> are no partitions, there is only going to be a per-subplan array,
> there won't be an offsets array. But I guess, you are saying : "do the
> on-demand allocation only for leaf partitions; if there are no
> partitions, the per-subplan maps will always be allocated for each of
> the subplans from the beginning" . So if there is no offset array,
> just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
> without any further checks.

Oops.  I forgot that there might not be partitions.  I was assuming
that mt_per_subplan_tupconv_maps wouldn't exist at all, and we'd
always use subplan_partition_offsets.  But that won't work in the
inheritance case.

> But after that, I am not sure then why is mt_per_sub_plan_maps[] array
> needed ? We are always going to convert the subplan index into leaf
> index, so per-subplan map array will not come into picture. Or are you
> saying, it will be allocated and used only when there are no
> partitions ?  From one of your earlier replies, you did mention about
> trying to share the maps between the two arrays, that means you were
> considering both arrays being used at the same time.

We'd use them both at the same time if we didn't have, or didn't use,
subplan_partition_offsets, but if we have subplan_partition_offsets
and can use it then we don't need mt_per_sub_plan_maps.

I guess I'm inclined to keep mt_per_sub_plan_maps for the case where
there are no partitions, but not use it when partitions are present.
What do you think about that?
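
Something like this is what I'm picturing (a sketch only - the field
names follow the discussion above rather than the actual patch, and
tupconv_map_for_leaf() is a stand-in for the on-demand per-leaf lookup):

/*
 * Return the tuple-conversion map for a given subplan: with partitions
 * present, translate the subplan index into a per-leaf index via
 * subplan_partition_offsets; without partitions, use the per-subplan
 * array directly.
 */
static TupleConversionMap *
tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
{
    if (mtstate->mt_subplan_partition_offsets == NULL)
        return mtstate->mt_per_subplan_tupconv_maps[whichplan];
    else
    {
        int         leaf_index = mtstate->mt_subplan_partition_offsets[whichplan];

        return tupconv_map_for_leaf(mtstate, leaf_index);       /* on demand */
    }
}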

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Peter Eisentraut
On 1/12/18 12:24, Alvaro Herrera wrote:
> Here's a small sample pg_proc entry:
> 
>  { oid => '2147', descr => 'number of input rows for which the input 
> expression is not null',
>   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 
> 'any', s => 'aggregate_dummy' },
> 
> A pg_amop entry:
> { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => 
> '<(int2,int2)', am => 'btree' },
> 
> Notes:
> 1. this is Perl data; it is read with 'eval' without any external modules.
> 2. the pg_proc entry has been compressed to two lines, to avoid
>content-free lines that would easily confuse git merge, but keep line
>length reasonable.

I don't think I like this.  I know pg_proc.h is a pain to manage, but at
least right now it's approachable programmatically.  I recently proposed
a patch to replace the columns proisagg and proiswindow with a combined
column prokind.  I could easily write a small Perl script to make that
change in pg_proc.h, because the format is easy to parse and has one
line per entry.  With this new format, that approach would no longer
work, and I don't know what would replace it.

> 3. references to objects in other catalogs are by name, such as "int8"
>or "btree/integer_ops" rather than OID.

I think we could already do this by making more use of things like
regtype and regproc.  That should be an easy change to make.

> 4. for each attribute, an abbreviation can be declared.  In the
>pg_proc sample we have "n" which stands for proname, because we have
>this line:
>+   NameDataproname BKI_ABBREV(n);

I'm afraid a key value system would invite writing the attributes in
random order and create a mess over time.

But if we want to do it, I think we could also add it to the current BKI
format.  The same goes for defining default values for some columns.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2018-01-12 Thread Mark Rofail
On Mon, Nov 13, 2017 at 2:41 AM, Andreas Karlsson  wrote:

> I think the code would look cleaner if you generate the following query:
>
> SELECT fk.x, fk.ys FROM ONLY t2 fk CROSS JOIN LATERAL
> pg_catalog.unnest(ys) a2 (v) LEFT OUTER JOIN ONLY t1 pk ON pk.x = fk.x AND
> pk.y = a2.v WHERE [...]
>
> rather than:
>
> SELECT fk.k1, fk.ak2 FROM (SELECT x k1, pg_catalog.unnest(ys) k2, ys ak2
> FROM ONLY t2) fk LEFT OUTER JOIN ONLY t1 pk ON pk.x = fk.k1 AND pk.y =
> fk.k2 WHERE [...]
>

Andreas has kindly written the SQL query for us. My problem is generating
it with C code.

Best Regards,
Mark Rofail


Re: [HACKERS] postgres_fdw bug in 9.6

2018-01-12 Thread Robert Haas
On Sun, Dec 3, 2017 at 2:56 PM, Tom Lane  wrote:
> I thought some more about this.  While it seems clear that we don't
> actually need to recheck anything within a postgres_fdw foreign join,
> there's still a problem: we have to be able to regurgitate the join
> row during postgresRecheckForeignScan.  Since, in the current system
> design, the only available data is scan-level rowmarks (that is,
> whole-row values from each base foreign relation), there isn't any
> very good way to produce the join row except to insert the rowmark
> values into dummy scan nodes and then re-execute the join locally.
> So really, that is what the outerPlan infrastructure is doing for us.

I started to look at this patch again today and got cold feet.  It
seems to me that the entire design of the posted patch is based on
your remarks in http://postgr.es/m/13242.1481582...@sss.pgh.pa.us --

# One way that we could make things better is to rely on the knowledge
# that EPQ isn't asked to evaluate joins for more than one row per input
# relation, and therefore using merge or hash join technology is really
# overkill.  We could make a tree of simple nestloop joins, which aren't
# going to care about sort order, if we could identify the correct join
# clauses to apply.

However, those remarks are almost entirely wrong.  It's true that the
relations for which we have EPQ tuples will only ever output a single
tuple, but because of what you said in the quoted text above, this
does NOT imply that the recheck plan never needs to worry about
dealing with a large number of rows, as the following example shows.
(The part about making a tree of simple nestloop joins is wrong not
only for this reason but because it won't work in the executor, as
previously discussed.)

rhaas=# explain select * from foo join (bar join baz on bar.a = baz.a)
on foo.a = bar.a where foo.b = 'hello' for update;
  QUERY PLAN
---
 LockRows  (cost=109.02..121.46 rows=1 width=162)
   ->  Merge Join  (cost=109.02..121.45 rows=1 width=162)
         Merge Cond: (bar.a = foo.a)
         ->  Foreign Scan  (cost=100.58..4914.58 rows=4 width=119)
               Relations: (public.bar) INNER JOIN (public.baz)
               ->  Hash Join  (cost=2556.00..4962.00 rows=4 width=119)
                     Hash Cond: (bar.a = baz.a)
                     ->  Foreign Scan on bar  (cost=100.00..1956.00 rows=4 width=28)
                     ->  Hash  (cost=1956.00..1956.00 rows=4 width=27)
                           ->  Foreign Scan on baz  (cost=100.00..1956.00 rows=4 width=27)
         ->  Sort  (cost=8.44..8.45 rows=1 width=28)
               Sort Key: foo.a
               ->  Index Scan using foo_b_idx on foo  (cost=0.41..8.43 rows=1 width=28)
                     Index Cond: (b = 'hello'::text)

There's really nothing to prevent the bar/baz join from producing
arbitrarily many rows, which means that it's not good to turn the
recheck plan into a Nested Loop unconditionally -- and come to think
of it, I'm pretty sure this kind of case is why
GetExistingLocalJoinPath() tries to use an existing path: it wants a
path that isn't going to suck if it gets executed.

After some thought, it seems that there's a much simpler way that we
could fix the problem you identified in that original email: if the
EPQ path isn't properly sorted, have postgres_fdw's
add_paths_with_pathkeys_for_rel stick a Sort node on top of it.  I
tried this and it does indeed fix Jeff Janes' initial test case.
Patch attached.  One could do better here if one wanted to revise
GetExistingLocalJoinPath() to take List *pathkeys as an additional
argument; we could then try to find the optimal EPQ path for each
possible set of sortkeys.  But I don't feel the need to work on that
right now, and it would be nicer not to back-patch a change to the
signature of GetExistingLocalJoinPath(), as has already been noted.
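
The shape of the change inside add_paths_with_pathkeys_for_rel is
roughly the following (a paraphrase of the idea, not the literal diff):

    /*
     * If the EPQ path we found doesn't already deliver the pathkeys
     * we're about to advertise, put an explicit Sort on top of it.
     */
    if (epq_path != NULL &&
        !pathkeys_contained_in(useful_pathkeys, epq_path->pathkeys))
        epq_path = (Path *) create_sort_path(root, rel, epq_path,
                                             useful_pathkeys, -1.0);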

It is altogether possible that this doesn't fix every issue with the
GetExistingLocalJoinPath interface, and perhaps that is due for a
broader rewrite.  However, I think that more recent discussion has
cast some doubt on the fiery criticism of that previous note.  That
note asserted that we'd have trouble with join nests containing more
than 2 joins, but as was pointed out at the time, that's not really
true, because if we happen to pull a join out of the pathlist that has
ForeignScan children, we'll fetch out the EPQ-recheck subpaths that
they created and use those instead.  And there are other problems with
the analysis in that email too as noted above.  The only problem that
*has actually been demonstrated* is that we can end up using as an EPQ
path something that doesn't generate the right sort order, and that is
easily fixed by making it do so, which the attached patch does.  So I
propose we commit and back-patch this and move on.

-- 
Robert 

Re: WIP: a way forward on bootstrap data

2018-01-12 Thread John Naylor
Tom, everyone,
It's getting late in my timezone, but I wanted to give a few quick
answers. I'll follow up tomorrow. Thanks Alvaro for committing my
refactoring of pg_attribute data creation. I think your modifications
are sensible and I'll rebase soon.

On 1/13/18, Tom Lane  wrote:
> It's not very clear to me what the proposed data format actually is,
> and I don't really want to read several hundred KB worth of patches
> in order to reverse-engineer that information.  Nor do I see
> anything in the patch list that obviously looks like it updates
> doc/src/sgml/bki.sgml to explain things.

Alvaro gave a good overview, so I'll just point out a few things.

- Patches 0002 through 0007 represent a complete one-to-one migration
of data entries. I didn't see much in bki.sgml specific to the current
format, so my documentation changes are confined largely to the
README, in patch 0005.
- Patches 0008 and 0009 implement techniques to make the data lines
shorter. My choices are certainly debatable. There is a brief addition
to the README in patch 0008. The abbreviation technique was only used
in three catalogs, as a demonstration.
- Patches 0010 and 0011 implement human-readable OID references.
- Patches 0012 and 0013 are cosmetic, but invasive.

> Seems like we would almost need a per-catalog convention on how to lay out
> the entries, or else we're going to end up (over time) with lots of cowboy
> coding leading to entries that look randomly different from the ones
> around them.

If I understand your concern correctly, the convention is enforced by
a script (rewrite_dat.pl). At the very least this would be done at the
same time as pg_indent and perltidy. To be sure, because of default
values many entries will look randomly different from the ones around
them regardless. I have a draft patch to load the source data into
tables for viewing, but it's difficult to rebase, so I thought I'd
offer that enhancement later.

> One other question is how we'll verify the conversion.  Is there an
> expectation that the .bki file immediately after the conversion will
> be identical to immediately before?

Not identical. First, as part of the base migration, I stripped almost
all double quotes from the data entries since the new Perl hash values
are already single-quoted. (The exception is macros expanded by
initdb.c.) I made genbki.pl add quotes on output to match what
bootscanner.l expects. Where a simple rule made it possible, it also
matches the original .bki. The new .bki will only diff where the
current data has superfluous quotes (i.e. "0", "sql"). Second, if the
optional cosmetic patch 0013 is applied, the individual index and
toast commands will be in a different order.

> Check.  Where is it coming from --- I suppose we aren't going to try to
> store this in the existing .h files?  What provisions will there be for
> comments?

Yes, they're in ".dat" files. Perl comments (#) on their own line are
supported. I migrated all existing comments from the header files as
part of the conversion. This is scripted, so I can rebase over new
catalog entries that get committed.

> I think single-letter abbreviations here are a pretty bad space vs
> readability tradeoff, particularly for wider catalogs where it risks
> ambiguity.

Ironically, I got that one from you [1] ;-), but if you have a
different opinion upon seeing concrete, explicit examples, I think
that's to be expected.

--
Now is probably a good time to disclose concerns of my own:
1. MSVC dependency tracking is certainly broken until such time as I
can shave that yak and test.
2. Keeping the oid symbols with the data entries required some
Makefile trickery to make them visible to .c files outside the backend
(patch 0007). It builds fine, but the dependency tracking might have
bugs.

--
[1] https://www.postgresql.org/message-id/15697.1479161432%40sss.pgh.pa.us

Thanks,
John Naylor



Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2018-01-12 Thread Tom Lane
I wrote:
> I think that there might be a much simpler solution to this, which
> is to just remove make_inh_translation_list's tests of attinhcount,
> as per attached.

I went ahead and pushed that, with an isolation test based on the
one Horiguchi-san submitted but covering a few related cases.

regards, tom lane



Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2018-01-12 Thread Alvaro Herrera
Mark Rofail wrote:

> > 1) MATCH FULL does not seem to care about NULLS in arrays. In the example
> > below I expected both inserts into the referring table to fail.
> >
> > CREATE TABLE t (x int, y int, PRIMARY KEY (x, y));
> > CREATE TABLE fk (x int, ys int[], FOREIGN KEY (x, EACH ELEMENT OF ys) 
> > REFERENCES t MATCH FULL);
> > INSERT INTO t VALUES (10, 1);
> > INSERT INTO fk VALUES (10, '{1,NULL}');
> > INSERT INTO fk VALUES (NULL, '{1}');
> >
> I understand that Match full should contain nulls in the results. However,
> I don't think that it's semantically correct, so I suggest we don't use
> Match full. What would be the consequences of that ?

Well, I think you could get away with not supporting MATCH FULL with
array FK references (meaning you ought to raise an error if you see it) ...
clearly EACH ELEMENT OF is an extension of the spec so we're not forced
to comply with all the clauses.  On the other hand, it would be better if it
can be made to work.

If I understand correctly, you would need a new operator similar to @>
but which rejects NULLs in order to implement MATCH FULL, right?

> > 2) I think the code in RI_Initial_Check() would be cleaner if you used
> > "CROSS JOIN LATERAL unnest(col)" rather than having unnest() in the target
> > list. This way you would not need to rename all columns and the code paths
> > for the array case could look more like the code path for the normal case.
> >
> I have repeatedly tried to generate the suggested query using C code and I
> failed. I would like some help with it

Well, the way to go about it would be to first figure out what is the
correct SQL query, and only later try to implement it in C.  Is SQL the
problem, or is it C?  I'm sure we can get in touch with somebody that
knows a little bit of SQL.  Can you do a write-up of the query requirements?
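
If the C side is what's blocking you, the usual pattern in the backend
is to assemble the statement with a StringInfo, roughly like this (a
sketch only: the table and column names are hard-coded placeholders,
identifier quoting is omitted, and the WHERE clause is only a guess):

#include "postgres.h"
#include "lib/stringinfo.h"

static char *
build_fk_array_check_query(const char *fkrel, const char *pkrel,
                           const char *fkcol, const char *fkarraycol,
                           const char *pkcol1, const char *pkcol2)
{
    StringInfoData querybuf;

    initStringInfo(&querybuf);
    appendStringInfo(&querybuf,
                     "SELECT fk.%s, fk.%s FROM ONLY %s fk"
                     " CROSS JOIN LATERAL pg_catalog.unnest(fk.%s) a2 (v)"
                     " LEFT OUTER JOIN ONLY %s pk"
                     " ON pk.%s = fk.%s AND pk.%s = a2.v"
                     " WHERE pk.%s IS NULL",
                     fkcol, fkarraycol, fkrel,
                     fkarraycol,
                     pkrel,
                     pkcol1, fkcol, pkcol2,
                     pkcol1);

    return querybuf.data;
}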

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-12 Thread Peter Geoghegan
On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas  wrote:
> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila  wrote:
>> 1.
>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
>>   {
>> - snapshot = RegisterSnapshot(GetTransactionSnapshot());
>> - OldestXmin = InvalidTransactionId; /* not used */
>> + OldestXmin = GetOldestXmin(heapRelation, true);
>>
>> I think leader and workers should have the same idea of oldestXmin for
>> the purpose of deciding the visibility of tuples.  I think this is
>> ensured in all form of parallel query as we do share the snapshot,
>> however, same doesn't seem to be true for Parallel Index builds.
>
> Hmm.  Does it break anything if they use different snapshots?  In the
> case of a query that would be disastrous because then you might get
> inconsistent results, but if the snapshot is only being used to
> determine what is and is not dead then I'm not sure it makes much
> difference ... unless the different snapshots will create confusion of
> some other sort.

I think that this is fine. GetOldestXmin() is only used when we have a
ShareLock on the heap relation, and the snapshot is SnapshotAny. We're
only talking about the difference between HEAPTUPLE_DEAD and
HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't
strictly necessary by the time you got to it is normal.

However, it's not okay that GetOldestXmin()'s second argument is true
in the patch, rather than PROCARRAY_FLAGS_VACUUM. That's due to bitrot
that was not caught during some previous rebase (commit af4b1a08
changed the signature). Will fix.

You've given me a lot more to work through in your most recent mail,
Robert. I will probably get the next revision to you on Monday.
Doesn't seem like there is much point in posting what I've done so
far.

-- 
Peter Geoghegan



Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2018-01-12 Thread Mark Rofail
According to the last review, there are still two main issues with the
patch.

On Mon, Oct 30, 2017 at 3:24 AM, Andreas Karlsson  wrote:

> 1) MATCH FULL does not seem to care about NULLS in arrays. In the example
> below I expected both inserts into the referring table to fail.
>
> CREATE TABLE t (x int, y int, PRIMARY KEY (x, y));
> CREATE TABLE fk (x int, ys int[], FOREIGN KEY (x, EACH ELEMENT OF ys)
> REFERENCES t MATCH FULL);
> INSERT INTO t VALUES (10, 1);
> INSERT INTO fk VALUES (10, '{1,NULL}');
> INSERT INTO fk VALUES (NULL, '{1}');
>
> CREATE TABLE
> CREATE TABLE
> INSERT 0 1
> INSERT 0 1
> ERROR:  insert or update on table "fk" violates foreign key constraint
> "fk_x_fkey"
> DETAIL:  MATCH FULL does not allow mixing of null and nonnull key values.
>
I understand that MATCH FULL is supposed to take the NULLs into account
here. However, I don't think that behaviour is semantically correct for
arrays, so I suggest we don't support MATCH FULL. What would be the
consequences of that?
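
To make the comparison concrete, this is the same scenario without MATCH
FULL (a sketch against the patch's proposed EACH ELEMENT OF syntax; whether
each of the two inserts should then be accepted is exactly the open
question):

  CREATE TABLE t (x int, y int, PRIMARY KEY (x, y));
  CREATE TABLE fk (x int, ys int[],
                   FOREIGN KEY (x, EACH ELEMENT OF ys) REFERENCES t);
  INSERT INTO t VALUES (10, 1);
  INSERT INTO fk VALUES (10, '{1,NULL}');  -- NULL element in the array
  INSERT INTO fk VALUES (NULL, '{1}');     -- NULL scalar key column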

> 2) I think the code in RI_Initial_Check() would be cleaner if you used
> "CROSS JOIN LATERAL unnest(col)" rather than having unnest() in the target
> list. This way you would not need to rename all columns and the code paths
> for the array case could look more like the code path for the normal case.
>
I have repeatedly tried to generate the suggested query using C code and I
failed. I would like some help with it.

Best Regards,
Mark Rofail


Re: CREATE ROUTINE MAPPING

2018-01-12 Thread David Fetter
On Fri, Jan 12, 2018 at 02:29:53PM -0500, Corey Huinker wrote:
> >
> >
> >
> > It goes on from there, but I think there's a reasonable interpretation
> > of this which allows us to use the same syntax as CREATE
> > (FUNCTION|PROCEDURE), apart from the body, e.g.:
> >
> > CREATE ROUTINE MAPPING local_routine_name
> > FOR (FUNCTION | PROCEDURE) remote_routine_name ( [ [ argmode ] [ argname ]
> > argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
> >[ RETURNS rettype
> >  | RETURNS TABLE ( column_name column_type [, ...] ) ]
> > SERVER foreign_server_name
> >[ (option [, ...]) ]
> >
> > Does that seem like too broad an interpretation?
> >
> 
> That's really interesting. I didn't think to look in the definition of
> CREATE FUNCTION to see if a SERVER option popped in there, but seems like a
> more accessible way to introduce the notion of remote functions,

It does indeed.  Adding the functionality to CREATE
(FUNCTION|PROCEDURE) seems like a *much* better idea than trying to
wedge it into the CREATE ROUTINE MAPPING syntax.

> because I talked to a few developers about this before posting to
> the list, and only one had ever heard of ROUTINE MAPPING and had no
> clear recollection of it.  An option on CREATE FUNCTION is going to
> get noticed (and used!) a lot sooner.

+1

> Having said that, I think syntactically we have to implement CREATE ROUTINE
> MAPPING, even if it is just translated to a CREATE FUNCTION call.
> 
> In either case, I suspected that pg_proc would need a nullable srvid column
> pointing to pg_foreign_server, and possibly a new row in pg_language for
> 'external'.

Makes a lot of sense.

> I had entertained having a pg_routine_mappings table like
> pg_user_mappings, and we still could, if the proc's language of
> 'external' clued the planner to look for the mapping. I can see
> arguments for either approach.

It would be good to have them in the catalog somehow if we make CREATE
ROUTINE MAPPING a DDL.  If I've read the standard correctly, there are
parts of information_schema which come into play for those routine
mappings.

> Before anyone asks, I looked for, and did not find, any suggestion of
> IMPORT FOREIGN ROUTINE a la IMPORT FOREIGN SCHEMA, so as of yet there
> wouldn't be any way to grab all the functions that a foreign server is
> offering up.

How about making an option to IMPORT FOREIGN SCHEMA do it?
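
Something along these lines, say, where the import_routines option name is
purely hypothetical:

  IMPORT FOREIGN SCHEMA remote_schema
    FROM SERVER foreign_server
    INTO local_schema
    OPTIONS (import_routines 'true');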

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2018-01-12 Thread Alvaro Herrera
Here's a rebased version of this patch.  I have done nothing other than
fix the conflicts and verify that it passes existing regression tests.
I intend to go over the reviews sometime later and hopefully get it all
fixed for inclusion in pg11.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From d513b449722b6e848b958584d0c81ca19b289d22 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 12 Jan 2018 16:28:51 -0300
Subject: [PATCH v6] Support foreign keys for array elements

---
 doc/src/sgml/catalogs.sgml|  16 +
 doc/src/sgml/ddl.sgml | 110 
 doc/src/sgml/ref/create_table.sgml| 116 +++-
 src/backend/catalog/heap.c|   1 +
 src/backend/catalog/index.c   |   1 +
 src/backend/catalog/pg_constraint.c   |  14 +-
 src/backend/commands/tablecmds.c  | 138 -
 src/backend/commands/trigger.c|   6 +
 src/backend/commands/typecmds.c   |   1 +
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/nodes/outfuncs.c  |   1 +
 src/backend/parser/gram.y |  66 ++-
 src/backend/parser/parse_coerce.c |  84 +++
 src/backend/parser/parse_utilcmd.c|   2 +
 src/backend/utils/adt/ri_triggers.c   | 245 +++--
 src/backend/utils/adt/ruleutils.c |  88 ++-
 src/include/catalog/pg_constraint.h   |  27 +-
 src/include/catalog/pg_constraint_fn.h|   1 +
 src/include/nodes/parsenodes.h|   5 +
 src/include/parser/kwlist.h   |   1 +
 src/include/parser/parse_coerce.h |   1 +
 src/test/regress/expected/element_foreign_key.out | 639 ++
 src/test/regress/parallel_schedule|   7 +-
 src/test/regress/serial_schedule  |   1 +
 src/test/regress/sql/element_foreign_key.sql  | 503 +
 26 files changed, 2000 insertions(+), 76 deletions(-)
 create mode 100644 src/test/regress/expected/element_foreign_key.out
 create mode 100644 src/test/regress/sql/element_foreign_key.sql

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3f02202caf..aa5f9fe21f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2342,6 +2342,16 @@ SCRAM-SHA-256$iteration 
count:
  
 
  
+  confreftype
+  char[]
+  
+  If a foreign key, the reference semantics for each column:
+   p = plain (simple equality),
+   e = each element of referencing array must have a 
match
+  
+ 
+ 
+   
   conpfeqop
   oid[]
   pg_operator.oid
@@ -2387,6 +2397,12 @@ SCRAM-SHA-256$iteration 
count:
   
 
   
+When confreftype indicates array to scalar
+foreign key reference semantics, the equality operators listed in
+conpfeqop etc are for the array's element type.
+  
+ 
+  
In the case of an exclusion constraint, conkey
is only useful for constraint elements that are simple column references.
For other cases, a zero appears in conkey
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index b1167a40e6..5ca685b991 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -882,6 +882,116 @@ CREATE TABLE order_items (

   
 
+  
+   Foreign Key Arrays
+
+   
+Foreign Key Arrays
+   
+
+   
+ELEMENT foreign key
+   
+
+   
+constraint
+Array ELEMENT foreign key
+   
+
+   
+constraint
+ELEMENT foreign key
+   
+
+   
+referential integrity
+   
+
+   
+Another option you have with foreign keys is to use a referencing column
+which is an array of elements with the same type (or a compatible one) as
+the referenced column in the related table. This feature is called
+Foreign Key Arrays and is implemented in PostgreSQL
+with EACH ELEMENT OF foreign key constraints, as
+described in the following example:
+
+
+ CREATE TABLE drivers (
+ driver_id integer PRIMARY KEY,
+ first_name text,
+ last_name text
+ );
+
+ CREATE TABLE races (
+ race_id integer PRIMARY KEY,
+ title text,
+ race_day date,
+ final_positions integer[],
+ FOREIGN KEY  (EACH ELEMENT OF final_positions) REFERENCES 
drivers 
+ );
+
+
+The above example uses an array (final_positions) to
+store the results of a race: for each of its elements a referential
+integrity check is enforced on the drivers table. Note
+that (EACH ELEMENT OF ...) REFERENCES is an extension of
+PostgreSQL and it is not included in the SQL standard.
+   
+
+   
+We currently only support the table constraint form.
+   
+
+   
+Even though the 

Re: CREATE ROUTINE MAPPING

2018-01-12 Thread Corey Huinker
>
>
>
> It goes on from there, but I think there's a reasonable interpretation
> of this which allows us to use the same syntax as CREATE
> (FUNCTION|PROCEDURE), apart from the body, e.g.:
>
> CREATE ROUTINE MAPPING local_routine_name
> FOR (FUNCTION | PROCEDURE) remote_routine_name ( [ [ argmode ] [ argname ]
> argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
>[ RETURNS rettype
>  | RETURNS TABLE ( column_name column_type [, ...] ) ]
> SERVER foreign_server_name
>[ (option [, ...]) ]
>
> Does that seem like too broad an interpretation?
>

That's really interesting. I didn't think to look in the definition of
CREATE FUNCTION to see if a SERVER option popped in there, but seems like a
more accessible way to introduce the notion of remote functions, because I
talked to a few developers about this before posting to the list, and only
one had ever heard of ROUTINE MAPPING and had no clear recollection of it.
An option on CREATE FUNCTION is going to get noticed (and used!) a lot
sooner.

Having said that, I think syntactically we have to implement CREATE ROUTINE
MAPPING, even if it is just translated to a CREATE FUNCTION call.

In either case, I suspected that pg_proc would need a nullable srvid column
pointing to pg_foreign_server, and possibly a new row in pg_language for
'external'. I had entertained having a pg_routine_mappings table like
pg_user_mappings, and we still could, if the proc's language of 'external'
clued the planner to look for the mapping. I can see arguments for either
approach.

Before anyone asks, I looked for, and did not find, any suggestion of
IMPORT FOREIGN ROUTINE a la IMPORT FOREIGN SCHEMA, so as of yet there
wouldn't be any way to grab all the functions that a foreign server is
offering up.


Re: [HACKERS] path toward faster partition pruning

2018-01-12 Thread Robert Haas
On Thu, Jan 11, 2018 at 10:30 PM, David Rowley
 wrote:
> Instead, can you make it:
>
> if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
> keys->n_maxkeys > 0 || n_keynullness > 0)
> return true;
>
> return false;

Or better yet:

return (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
keys->n_maxkeys > 0 || n_keynullness > 0);

It's not really necessary to write if (some condition is true) return
true; return false when you can just write return (boolean-valued
condition).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Rangejoin rebased

2018-01-12 Thread Alexander Kuzmenkov

Hi Jeff,

Just a quick comment -- I ran a slightly modified version of a query 
from the regression tests, and got an assertion failure:


select i1, ir1, i2, ir2
  from (select * from rangejoin_left order by ir1 desc) as a1
  inner join (select * from rangejoin_right order by ir2 desc) as a2
    on (i1 = i2 and ir1 && ir2)
  order by ir1 desc, i1;
TRAP: FailedAssertion("!(!ssup->ssup_reverse)", File: 
"/home/akuzmenkov/pgp-old/build/../postgrespro/src/backend/executor/nodeMergejoin.c", 
Line: 492)


The sort order isn't right for the join, it seems. I remember having 
similar troubles with my full merge join implementation. I tried 
filtering unsuitable paths in try_mergejoin_path, but that was not quite 
enough. The planner tries to use only one sort direction to limit the 
number of paths, so the path we need might not be there at all. This 
optimization was added in commit 834ddc62, see right_merge_direction(). 
Sadly, I have no idea how to solve this.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: CREATE ROUTINE MAPPING

2018-01-12 Thread David Fetter
On Fri, Jan 12, 2018 at 11:11:26AM -0500, Corey Huinker wrote:
> >
> > PostgreSQL allows function overloading, which means that there can
> > be multiple functions with same name differing in argument types.
> > So, the syntax has to include the input parameters or their types
> > at least.
> 
> "local_routine_name" and "remote_routine_spec" were my own
> paraphrasings of what the spec implies. I'm nearly certain that the
> local routine name, which the spec says is just an identifier,
> cannot have a parameter spec on it, which leaves only one other
> place to define it, remote_routine_spec, which wasn't defined at
> all. I _suppose_ parameter definitions could be pushed into options,
> but that'd be ugly.

In my draft of SQL:2011, which I don't think has substantive changes
to what's either in the official SQL:2011 or SQL:2016, it says:

 ::=
CREATE ROUTINE MAPPING  FOR 
SERVER  [  ]
Syntax Rules
1) Let FSN be the . Let RMN be the .
2) The catalog identified by the explicit or implicit catalog name of FSN shall 
include a foreign server
descriptor whose foreign server name is equivalent to FSN.
3) The SQL-environment shall not include a routine mapping descriptor whose 
routine mapping name is
RMN.
4) Let R be the SQL-invoked routine identified by the . R shall identify an SQL-invoked regular function.

It goes on from there, but I think there's a reasonable interpretation
of this which allows us to use the same syntax as CREATE
(FUNCTION|PROCEDURE), apart from the body, e.g.:

CREATE ROUTINE MAPPING local_routine_name
FOR (FUNCTION | PROCEDURE) remote_routine_name ( [ [ argmode ] [ argname ] 
argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
   [ RETURNS rettype
 | RETURNS TABLE ( column_name column_type [, ...] ) ]
SERVER foreign_server_name 
   [ (option [, ...]) ]
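
To make that concrete, a purely illustrative pair of mappings for an
overloaded remote function could look like this (local_upper,
local_upper_bytes, remote_upper and my_foreign_server are made-up names,
and none of this syntax is implemented anywhere yet):

CREATE ROUTINE MAPPING local_upper
FOR FUNCTION remote_upper (s text)
    RETURNS text
SERVER my_foreign_server;

CREATE ROUTINE MAPPING local_upper_bytes
FOR FUNCTION remote_upper (b bytea)
    RETURNS bytea
SERVER my_foreign_server;

The parameter list is what would let the two mappings refer to different
remote routines with the same name, which is the overloading concern raised
elsewhere in the thread.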

Does that seem like too broad an interpretation?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: pgbench - add \if support

2018-01-12 Thread Fabien COELHO



Hm, doesn't the already committed when/case/then/else syntax do the same?


No, not strictly. The "CASE WHEN" is an if *within* an expression:

  \set i CASE WHEN condition THEN val1 ELSE val2 END

The \if is at the script level, like the version already available in psql,
which can change what SQL is sent.


  \if condition
SOME SQL
  \else
OTHER SQL
  \endif

You could achieve the CASE semantics with some \if:

  \if condition
\set i val1
  \else
\set i val2
  \endif

But the reverse is not possible.

--
Fabien.



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-12 Thread Robert Haas
On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan  wrote:
> I've caught up with you again. I just need to take a look at what I
> came up with with fresh eyes, and maybe do some more testing.

More comments:

BufFileView() looks fairly pointless.  It basically creates a copy of
the input and, in so doing, destroys the input, which is a lot like
returning the input parameter except that it uses more cycles.  It
does do a few things.  First, it zeroes the offsets array instead of
copying the offsets.  But as used, those offsets would have been 0
anyway.  Second, it sets the fileset parameter to NULL.   But that
doesn't actually seem to be important for anything: the fileset is
only used when creating new files, and the BufFile must already be
marked read-only, so we won't be doing that.  It seems like this
function can just be entirely removed and replaced by Assert()-ing
some things about the target in BufFileViewAppend, which I would just
rename to BufFileAppend.

In miscadmin.h, I'd put the prototype for the new GUC next to
max_worker_processes, not maintenance_work_mem.

The ereport() in index_build will, I think, confuse people when it
says that there are 0 parallel workers.  I suggest splitting this into
two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
"building index \"%s\" on table \"%s\" serially" ...) else ereport(...
"building index \"%s\" on table \"%s\" in parallel with request for %d
parallel workers" ...).  Might even need three cases to handle
parallel_leader_participation without needing to assemble the message,
unless we drop parallel_leader_participation support.

The logic in IndexBuildHeapRangeScan() around need_register_snapshot
and OldestXmin seems convoluted and not very well-edited to me.  For
example, need_register_snapshot is set to false in a block that is
only entered when it's already false, and the comment that follows is
supposed to be associated with GetOldestXmin() and makes no sense
here.  I suggest that you go back to the original code organization
and then just insert an additional case for a caller-supplied scan, so
that the overall flow looks like this:

if (scan != NULL)
{
   ...
}
else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
{
   ...
}
else
{
   ...
}

Along with that, I'd change the name of need_register_snapshot to
need_unregister_snapshot (it's doing both jobs right now) and
initialize it to false.  If you enter the second of the above blocks
then change it to true just after snapshot =
RegisterSnapshot(GetTransactionSnapshot()).  Then adjust the comment
that begins "Prepare for scan of the base relation." by inserting an
additional sentence just after that one: "If the caller has supplied a
scan, just use it.  Otherwise, in a normal index build..." and the
rest as it is currently.

+ * This support code isn't reliable when called from within a parallel
+ * worker process due to the fact that our state isn't propagated.  This is
+ * why parallel index builds are disallowed on catalogs.  It is possible
+ * that we'll fail to catch an attempted use of a user index undergoing
+ * reindexing due the non-propagation of this state to workers, which is not
+ * ideal, but the problem is not particularly likely to go undetected due to
+ * our not doing better there.

I understand the first two sentences, but I have no idea what the
third one means, especially the part that says "not particularly
likely to go undetected due to our not doing better there".  It sounds
scary that something bad is only "not particularly likely to go
undetected"; don't we need to detect bad things reliably? But also,
you used the word "not" three times and also the prefix "un-", meaning
"not", once.  Four negations in 13 words! Perhaps I'm not entirely in
a position to cast aspersions on overly-complex phraseology -- the pot
calling the kettle black and all that -- but I bet that will be a lot
clearer if you reduce the number of negations to either 0 or 1.

The comment change in standard_planner() doesn't look helpful to me;
I'd leave it out.

+ * tableOid is the table that index is to be built on.  indexOid is the OID
+ * of a index to be created or reindexed (which must be a btree index).

I'd rewrite that first sentence to end "the table on which the index
is to be built".  The second sentence should say "an index" rather
than "a index".

+ * leaderWorker indicates whether leader will participate as worker or not.
+ * This needs to be taken into account because leader process is guaranteed to
+ * be idle when not participating as a worker, in contrast with conventional
+ * parallel relation scans, where the leader process typically has plenty of
+ * other work to do relating to coordinating the scan, etc.  For CREATE INDEX,
+ * leader is usually treated as just another participant for our scaling
+ * calculation.

OK, I get the first sentence.  But the rest of this appears to be
partially irrelevant and partially incorrect.  The degree to which 

Re: master make check fails on Solaris 10

2018-01-12 Thread Tom Lane
Marina Polyakova  writes:
> On 12-01-2018 14:05, Marina Polyakova wrote:
>> - on the previous commit (272c2ab9fd0a604e3200030b1ea26fd464c44935)
>> the same failures occur (see the attached regression diffs and
>> output);
>> - on commit bf54c0f05c0a58db17627724a83e1b6d4ec2712c make check-world 
>> passes.
>> I'll try to find out from what commit it started..

> Binary search has shown that all these failures begin with commit 
> 7518049980be1d90264addab003476ae105f70d4 (Prevent int128 from requiring 
> more than MAXALIGN alignment.).

Hm ... so apparently, that compiler has bugs in handling nondefault
alignment specs.  You said upthread it was gcc, but what version
exactly?

regards, tom lane



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> So could we have an explanation of what it is we're agreeing to?

> Here's a small sample pg_proc entry:

>  { oid => '2147', descr => 'number of input rows for which the input 
> expression is not null',
>   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 
> 'any', s => 'aggregate_dummy' },

> A pg_amop entry:
> { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => 
> '<(int2,int2)', am => 'btree' },

> Notes:
> 1. this is Perl data; it is read with 'eval' without any external modules.

Check.  Where is it coming from --- I suppose we aren't going to try to
store this in the existing .h files?  What provisions will there be for
comments?

> 2. the pg_proc entry has been compressed to two lines, to avoid
>content-free lines that would easily confuse git merge, but keep line
>length reasonable.

Seems like we would almost need a per-catalog convention on how to lay out
the entries, or else we're going to end up (over time) with lots of cowboy
coding leading to entries that look randomly different from the ones
around them.

> 3. references to objects in other catalogs are by name, such as "int8"
>or "btree/integer_ops" rather than OID.

+1

> 4. for each attribute, an abbreviation can be declared.  In the
>pg_proc sample we have "n" which stands for proname, because we have
>this line:
>+   NameDataproname BKI_ABBREV(n);

I think single-letter abbreviations here are a pretty bad space vs
readability tradeoff, particularly for wider catalogs where it risks
ambiguity.  The pg_amop sample you show looks noticeably more legible than
the other one.  Still, this is something we can debate on a case-by-case
basis, it's not a defect in the mechanism.

One other question is how we'll verify the conversion.  Is there an
expectation that the .bki file immediately after the conversion will
be identical to immediately before?

regards, tom lane



Re: master make check fails on Solaris 10

2018-01-12 Thread Marina Polyakova

On 12-01-2018 14:05, Marina Polyakova wrote:

- on the previous commit (272c2ab9fd0a604e3200030b1ea26fd464c44935)
the same failures occur (see the attached regression diffs and
output);
- on commit bf54c0f05c0a58db17627724a83e1b6d4ec2712c make check-world 
passes.

I'll try to find out from what commit it started..


Binary search has shown that all these failures begin with commit 
7518049980be1d90264addab003476ae105f70d4 (Prevent int128 from requiring 
more than MAXALIGN alignment.). On the previous commit 
(91aec93e6089a5ba49cce0aca3bf7f7022d62ea4) make check-world passes.


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] UPDATE of partition key

2018-01-12 Thread Amit Khandekar
On 12 January 2018 at 20:24, Robert Haas  wrote:
> On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar  
> wrote:
>> The reason why I am having map_required field inside a structure along
>> with the map, as against a separate array, is so that we can do the
>> on-demand allocation for both per-leaf array and per-subplan array.
>
> Putting the map_required field inside the structure with the map makes
> it completely silly to do the 0/1/2 thing, because the whole structure
> is going to be on the same cache line anyway.  It won't save anything
> to access the flag instead of a pointer in the same struct.

I see. Got it.

>  Also,
> the uint8 will be followed by 7 bytes of padding, because the pointer
> that follows will need to begin on an 8-byte boundary (at least, on
> 64-bit machines), so this will use more memory.
>
> What I suggest is:
>
> #define MT_CONVERSION_REQUIRED_UNKNOWN0
> #define MT_CONVERSION_REQUIRED_YES1
> #define MT_CONVERSION_REQUIRED_NO  2
>
> In ModifyTableState:
>
> uint8 *mt_per_leaf_tupconv_required;
> TupleConversionMap **mt_per_leaf_tupconv_maps;
>
> In PartitionTupleRouting:
>
> int *subplan_partition_offsets;
>
> When you initialize the ModifyTableState, do this:
>
> mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
> numResultRelInfos);
> mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
> *) * numResultRelInfos);
>

A few points below where I wanted to confirm that we are on the same page ...

> When somebody needs a map, then
>
> (1) if they need it by subplan index, first use
> subplan_partition_offsets to convert it to a per-leaf index

Before that, we need to check if there *is* an offset array. If there
are no partitions, there is only going to be a per-subplan array,
there won't be an offsets array. But I guess you are saying: "do the
on-demand allocation only for leaf partitions; if there are no
partitions, the per-subplan maps will always be allocated for each of
the subplans from the beginning". So if there is no offset array,
just return mtstate->mt_per_subplan_tupconv_maps[subplan_index]
without any further checks.

>
> (2) then write a function that takes the per-leaf index and does this:
>
> switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
> {
> case MT_CONVERSION_REQUIRED_UNKNOWN:
> map = convert_tuples_by_name(...);
> if (map == NULL)
> mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
> MT_CONVERSION_REQUIRED_NO;
> else
> {
> mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
> MT_CONVERSION_REQUIRED_YES;
> mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
> }
> return map;
> case MT_CONVERSION_REQUIRED_YES:
> return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
> case MT_CONVERSION_REQUIRED_NO:
> return NULL;
> }

Yeah, right.

But after that, I am not sure why the mt_per_sub_plan_maps[] array is
needed at all. We are always going to convert the subplan index into a leaf
index, so the per-subplan map array will not come into the picture. Or are
you saying it will be allocated and used only when there are no
partitions?  From one of your earlier replies, you did mention trying to
share the maps between the two arrays, which means you were considering
both arrays being used at the same time.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Others: Now is the time to raise concerns related to the proposed file
> > formats and tooling, so please do have a look when you have a moment.
> > At this stage, the proposed data format seems a good choice to me.
> 
> It's not very clear to me what the proposed data format actually is,
> and I don't really want to read several hundred KB worth of patches
> in order to reverse-engineer that information.  Nor do I see
> anything in the patch list that obviously looks like it updates
> doc/src/sgml/bki.sgml to explain things.
> 
> So could we have an explanation of what it is we're agreeing to?

Here's a small sample pg_proc entry:

 { oid => '2147', descr => 'number of input rows for which the input expression 
is not null',
  n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 'any', 
s => 'aggregate_dummy' },

An pg_amop entry:
{ opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => 
'<(int2,int2)', am => 'btree' },

Notes:
1. this is Perl data; it is read with 'eval' without any external modules.
2. the pg_proc entry has been compressed to two lines, to avoid
   content-free lines that would easily confuse git merge, but keep line
   length reasonable.
3. references to objects in other catalogs are by name, such as "int8"
   or "btree/integer_ops" rather than OID.
4. for each attribute, an abbreviation can be declared.  In the
   pg_proc sample we have "n" which stands for proname, because we have
   this line:
   +   NameDataproname BKI_ABBREV(n);

I think John has gone overboard with some of these choices, but we can
argue the specific choices once we decide that abbreviation is a good
idea.  (Prior discussion seems to suggest we already agreed on that.)

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Possible performance regression with pg_dump of a large number of relations

2018-01-12 Thread Luke Cowell
> On Jan 12, 2018, at 8:01 AM, Jeff Janes  wrote:
> 
> On Thu, Jan 11, 2018 at 5:26 PM, Luke Cowell  wrote:
> I've been troubleshooting an issue with slow pg_dump times on postgres 9.6.6. 
> I believe something changed between 9.5.10 and 9.6.6 that has made dumps 
> significantly slower for databases with a large number of relations. I posted 
> this in irc and someone suggested that I should post this here. I'm sorry if 
> this isn't the right place.
> 
> To simulate the issue I generated 150,000 relations spread across 1000 
> schemas (this roughly reflects my production setup).
> 
> ```ruby
> File.write "many_relations.sql", (150_000 / 150).times.flat_map {|n|
>   [
>"create schema s_#{n};",
>150.times.map do |t|
>  "create table s_#{n}.test_#{t} (id int);"
>end
>]
> }.join("\n")
> ```
> 
> I have 2 identical pieces of hardware. I've installed 9.5 on one and 9.6 on 
> the other. I've run the same generated piece of sql in a fresh database on 
> both systems.
> 
> On my 9.5.10 system:
> > time pg_dump -n s_10 testing > /dev/null
> real0m5.492s
> user0m1.424s
> sys 0m0.184s
> 
> On my 9.6.6 system:
> > time pg_dump -n s_10 testing > /dev/null
> real0m27.342s
> user0m1.748s
> sys 0m0.248s
> 
> I don't get quite as large a regression as you do, from 6s to 19s.  It looks 
> like there are multiple of them, but the biggest is caused by:
> 
> commit 5d589993cad212f7d556d52cc1e42fe18f65b057
> Author: Stephen Frost
> Date:   Fri May 6 14:06:50 2016 -0400
> 
> pg_dump performance and other fixes
> 
> That commit covered a few different things, and I don't what improvement it 
> mentions is the one that motivated this, but the key change was to add this 
> query:
> 
> EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON(c.oid = 
> pip.objoid AND pip.classoid = (SELECT oid FROM pg_class WHERE relname = 
> 'pg_class') AND pip.objsubid = at.attnum)WHERE at.attrelid = c.oid AND 
> at.attnum>0 and ((SELECT count(acl) FROM (SELECT 
> unnest(coalesce(at.attacl,acldefault('c',c.relowner))) AS acl EXCEPT SELECT 
> unnest(coalesce(pip.initprivs,acldefault('c',c.relowner)))) as foo) >1 OR 
> (SELECT count(acl) FROM (SELECT 
> unnest(coalesce(pip.initprivs,acldefault('c',c.relowner))) AS acl EXCEPT 
> SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner)))) as foo) >0))AS 
> changed_acl
> 
> Considering it runs 2 subqueries for every column (including the 6 hidden 
> system columns) of every table, even ones that don't end up getting dumped 
> out, it is no wonder it is slow.
> 
> If you were just dumping the database with 150,000 objects, I wouldn't worry 
> about a 20 second regression.  But I assume you intend to loop over every 
> schema and dump each individually? 
>  
> Cheers,
> 
> Jeff

Hi Jeff, thanks for your attention on this. Yes, that is exactly our use case. 
We dump each schema individually so we would be paying that 20 second penalty 
each time. As a workaround I've been dumping the schemas in batches of 20, but 
this isn't really ideal as we'll lose access to a number of our existing 
workflows.

Luke

Re: [HACKERS] Proposal: Local indexes for partitioned table

2018-01-12 Thread Peter Eisentraut
On 1/11/18 13:52, Alvaro Herrera wrote:
> The delta patch turned out to have at least one stupid bug, and a few
> non-stupid bugs.  Here's an updated version that should behave
> correctly, and addresses all reported problems.

It seems that CompareIndexInfo() still doesn't compare indexes' operator
classes and collations.

Also, some new test cases for pg_dump would be nice.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread David Fetter
On Fri, Jan 12, 2018 at 11:38:54AM -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Others: Now is the time to raise concerns related to the proposed
> > file formats and tooling, so please do have a look when you have a
> > moment.  At this stage, the proposed data format seems a good
> > choice to me.
> 
> It's not very clear to me what the proposed data format actually is,
> and I don't really want to read several hundred KB worth of patches
> in order to reverse-engineer that information.  Nor do I see
> anything in the patch list that obviously looks like it updates
> doc/src/sgml/bki.sgml to explain things.
> 
> So could we have an explanation of what it is we're agreeing to?

That would be awesome.  A walk-through example or two would also help.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Tom Lane
Alvaro Herrera  writes:
> Others: Now is the time to raise concerns related to the proposed file
> formats and tooling, so please do have a look when you have a moment.
> At this stage, the proposed data format seems a good choice to me.

It's not very clear to me what the proposed data format actually is,
and I don't really want to read several hundred KB worth of patches
in order to reverse-engineer that information.  Nor do I see
anything in the patch list that obviously looks like it updates
doc/src/sgml/bki.sgml to explain things.

So could we have an explanation of what it is we're agreeing to?

regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-01-12 Thread Peter Eisentraut
On 1/11/18 18:23, Greg Stark wrote:
> AIUI spilling to disk doesn't affect absorbing future updates, we
> would just keep accumulating them in memory right? We won't need to
> unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again.  But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in
the extreme cases that either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.

> The real aim should be to try to pick the transaction that will be
> committed furthest in the future. That gives you the most memory to
> use for live transactions for the longest time and could let you
> process the maximum amount of transactions without spilling them. So
> either the oldest transaction (in the expectation that it's been open
> a while and appears to be a long-lived batch job that will stay open
> for a long time) or the youngest transaction (in the expectation that
> all transactions are more or less equally long-lived) might make
> sense.

Yes, that makes sense.  We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgbench - add \if support

2018-01-12 Thread Teodor Sigaev

Hi!

Hm, doesn't the already committed when/case/then/else syntax do the same? If not,
could it be added to the existing syntax?



Fabien COELHO wrote:


Here is a rebase. I made some tests use actual expressions instead of just 0 
and 1. No other changes.


Sigh. Better with the attachment. Sorry for the noise.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: CREATE ROUTINE MAPPING

2018-01-12 Thread Corey Huinker
>
> PostgreSQL allows function overloading, which means that there can be
> multiple functions with same name differing in argument types. So, the
> syntax has to include the input parameters or their types at least.
>

"local_routine_name" and "remote_routine_spec" were my own paraphrasings of
what the spec implies. I'm nearly certain that the local routine name,
which the spec says is just an identifier, cannot have a parameter spec on
it, which leaves only one other place to define it, remote_routine_spec,
which wasn't defined at all. I _suppose_ parameter definitions could be
pushed into options, but that'd be ugly.


Re: improve type conversion of SPI_processed in Python

2018-01-12 Thread Tom Lane
Peter Eisentraut  writes:
> On 1/9/18 15:54, Tom Lane wrote:
>> I'd be inclined to code PLyObject_FromUint64() as an exact
>> analog of PLyLong_FromInt64(), ie
>> 
>> /* on 32 bit platforms "unsigned long" may be too small */
>> if (sizeof(uint64) > sizeof(unsigned long))
>>  return PyLong_FromUnsignedLongLong(DatumGetUInt64(d));
>> else
>>  return PyLong_FromUnsignedLong(DatumGetUInt64(d));
>> 
>> and let Python worry about how to optimize the conversion.

> Would that even be necessary?  Why not use the LongLong variant all the
> time then?

I've not looked at the Python internals to see if one is noticeably faster
than the other, but if it isn't, yeah we could simplify that.  In any case
my main point is let's keep these two functions using similar approaches.

regards, tom lane



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs  wrote:
>>> For this to work, we must ensure that WAL files are either recycled in
>>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>>> easy to do with the existing infrastructure.

> I have no faith that we can prevent old WAL data from reappearing in the
> file system across an OS crash, so I find Simon's assertion that we can
> dodge the problem through file manipulation to be simply unbelievable.

Forgot to say: if we *could* do it reliably, it would likely add
significant I/O traffic, eg an extra zeroing pass for every WAL segment.
That's a pretty heavy performance price to set against whatever win
we might get from contention reduction.

regards, tom lane



Re: Possible performance regression with pg_dump of a large number of relations

2018-01-12 Thread Jeff Janes
On Thu, Jan 11, 2018 at 5:26 PM, Luke Cowell  wrote:

> I've been troubleshooting an issue with slow pg_dump times on postgres
> 9.6.6. I believe something changed between 9.5.10 and 9.6.6 that has made
> dumps significantly slower for databases with a large number of relations.
> I posted this in irc and someone suggested that I should post this here.
> I'm sorry if this isn't the right place.
>
> To simulate the issue I generated 150,000 relations spread across 1000
> schemas (this roughly reflects my production setup).
>
> ```ruby
> File.write "many_relations.sql", (150_000 / 150).times.flat_map {|n|
>   [
>"create schema s_#{n};",
>150.times.map do |t|
>  "create table s_#{n}.test_#{t} (id int);"
>end
>]
> }.join("\n")
> ```
>
> I have 2 identical pieces of hardware. I've installed 9.5 on one and 9.6
> on the other. I've run the same generated piece of sql in a fresh database
> on both systems.
>
> On my 9.5.10 system:
> > time pg_dump -n s_10 testing > /dev/null
> real0m5.492s
> user0m1.424s
> sys 0m0.184s
>
> On my 9.6.6 system:
> > time pg_dump -n s_10 testing > /dev/null
> real0m27.342s
> user0m1.748s
> sys 0m0.248s
>

I don't get quite as large a regression as you do, from 6s to 19s.  It
looks like there are multiple of them, but the biggest is caused by:

commit 5d589993cad212f7d556d52cc1e42fe18f65b057
Author: Stephen Frost 
Date:   Fri May 6 14:06:50 2016 -0400

pg_dump performance and other fixes

That commit covered a few different things, and I don't what improvement it
mentions is the one that motivated this, but the key change was to add this
query:

EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON(c.oid
= pip.objoid AND pip.classoid = (SELECT oid FROM pg_class WHERE relname =
'pg_class') AND pip.objsubid = at.attnum)WHERE at.attrelid = c.oid AND
at.attnum>0 and ((SELECT count(acl) FROM (SELECT
unnest(coalesce(at.attacl,acldefault('c',c.relowner))) AS acl EXCEPT SELECT
unnest(coalesce(pip.initprivs,acldefault('c',c.relowner)))) as foo) >1 OR
(SELECT count(acl) FROM (SELECT
unnest(coalesce(pip.initprivs,acldefault('c',c.relowner))) AS acl EXCEPT
SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner)))) as foo)
>0))AS changed_acl

Considering it runs 2 subqueries for every column (including the 6 hidden
system columns) of every table, even ones that don't end up getting dumped
out, it is no wonder it is slow.

If you were just dumping the database with 150,000 objects, I wouldn't
worry about a 20 second regression.  But I assume you intend to loop over
every schema and dump each individually?

Cheers,

Jeff


Re: [HACKERS] PoC: custom signal handler for extensions

2018-01-12 Thread Teodor Sigaev
In perspective, this mechanism provides a low-level instrument for defining
remote procedure calls on the extension side. A simple RPC that determines the
effective userid on a remote backend (the remote_effective_user function) is
attached as an example.


I suppose it's a useful patch. Some notes:

1) AssignCustomProcSignalHandler is an unneeded API function; it is easily replaced
by the
GetCustomProcSignalHandler/ReleaseCustomProcSignal/RegisterCustomProcSignalHandler
functions. On the other hand, reassigning a custom signal handler doesn't look like
a widely used feature.


2) The naming of RegisterCustomProcSignalHandler and ReleaseCustomProcSignal is not
self-consistent. Other parts of the code use Register/Unregister or Acquire/Release
pairs. The Register/Unregister pair seems preferable here.


3) Agreed that Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N)
should be replaced with ereport. By the way, ReleaseCustomProcSignal() is the single
place where there is no range check on the reason argument.


4) In CustomSignalInterrupt(), playing with errno and SetLatch() seems unneeded:
the custom signal handler isn't called there, and SetLatch() will be called several
lines below anyway.


5) Using a special memory context for the handler call looks questionable; I think
complicated handlers can manage memory themselves (and will have to, if they need
to store something between calls). So I would prefer to remove the context.


6) As far as I can see, all signal handlers (not only custom ones) could perfectly
well use this API, or at least be stored in the CustomHandlers array, which could
be renamed to InterruptHandlers, for example. Also, the custom handler type could
be made closer to the built-in handlers by extending its signature to
void (*ProcSignalHandler_type) (ProcSignalReason reason), like the others. That
would allow a single handler to support several reasons.


7) The API allows different handlers in different processes for the same reason,
which could be a source of confusion. I suggest restricting Register/Unregister
calls to shared_preload_libraries, i.e. only during startup.


8) I'd like to see an example of using this API somewhere in contrib, in existing
modules. Ideas?




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Tom Lane
Robert Haas  writes:
> On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs  wrote:
>> So we can't completely remove xl_prev field, without giving up some
>> functionality. But we don't really need to store the 8-byte previous
>> WAL pointer in order to detect torn pages. Something else which can
>> tell us that the WAL record does not belong to current WAL segno would
>> be enough as well. I propose that we replace it with a much smaller
>> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
>> decide to call it) is the low order 16-bits of the WAL segno to which
>> the WAL record belongs. While reading WAL, we always match that the
>> "xl_walid" value stored in the WAL record matches with the current WAL
>> segno's lower order 16-bits and if not, then consider that as the end
>> of the stream.
>> 
>> For this to work, we must ensure that WAL files are either recycled in
>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>> easy to do with the existing infrastructure.

> I have some reservations about whether this makes the mechanism less
> reliable.

Yeah, it scares me too.  The xl_prev field is our only way of detecting
that we're looking at old WAL data when we cross a sector boundary.
I have no faith that we can prevent old WAL data from reappearing in the
file system across an OS crash, so I find Simon's assertion that we can
dodge the problem through file manipulation to be simply unbelievable.

If we could be sure that the WAL page size was no larger than the file
system's write quantum, then checking xlp_pageaddr would be sufficient
to detect stale WAL data.  But I'm afraid 8K is way too big for that;
so we need to be able to recognize page tears within-page.

> Of course, we also have xl_crc, so I'm not sure whether there's any
> chance of real harm...

The CRC only tells you that you have a valid WAL record, it won't clue
you in that it's old data you shouldn't replay.  If the previous WAL
record crossed the torn-page boundary, then you should have gotten a
CRC failure on that record --- but if the previous record ended at a
sector boundary, recognizing that the new record has an old xl_prev is
our ONLY defense against replaying stale data.

regards, tom lane



Re: Jsonb transform for pl/python

2018-01-12 Thread Aleksander Alekseev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

LGTM.

The new status of this patch is: Ready for Committer


Re: master make check fails on Solaris 10

2018-01-12 Thread Marina Polyakova

On 12-01-2018 18:12, Alvaro Herrera wrote:

Marina Polyakova wrote:

- on the previous commit (272c2ab9fd0a604e3200030b1ea26fd464c44935) the same
failures occur (see the attached regression diffs and output);
- on commit bf54c0f05c0a58db17627724a83e1b6d4ec2712c make check-world
passes.
I'll try to find out from what commit it started.. Don't you have any
suspicions?)


Perhaps you can use "git bisect".


Thanks, I'm doing the same thing :)

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP Patch: Pgbench Serialization and deadlock errors

2018-01-12 Thread Fabien COELHO


Hello Marina,


If you want 2 transactions, then you have to put them in two scripts,
which looks fine with me. Different transactions are expected to be
independent, otherwise they should be merged into one transaction.


Therefore if the script consists of several single statements (= several 
transactions), you cannot retry them. For example, the script looked like 
this:


UPDATE xy1 SET x = 1 WHERE y = 1;
UPDATE xy2 SET x = 2 WHERE y = 2;
UPDATE xy3 SET x = 3 WHERE y = 3;

If this restriction is ok for you, I'll simplify the code :)


Yes, that is what I'm suggesting. If you want to restart them, you can put 
them in 3 scripts.



Under these restrictions, ISTM that a retry is something like:

  case ABORTED:
 if (we want to retry) {
// do necessary stats
// reset the initial state (random, vars, current command)
state = START_TX; // loop
 }
 else {
   // count as failed...
   state = FINISHED; // or done.
 }
 break;


If we successfully complete a failed transaction block and process its end 
command in CSTATE_END_COMMAND, we may want to make a retry. So do you think 
that in this case it is ok to go to CSTATE_ABORTED at the end of 
CSTATE_END_COMMAND?..


Dunno.

I'm fine with having END_COMMAND skipping to START_TX if it can be done 
easily and cleanly, esp without code duplication.


ISTM that ABORTED & FINISHED are currently exactly the same. That would
put a particular use to aborted. Also, there are many points where the 
code may go to "aborted" state, so reusing it could help avoid duplicating 
stuff on each abort decision.



Once this works, maybe it could go a step further by restarting at
savepoints. I'd put restrictions there to ease detecting a savepoint
so that it cannot occur in a compound command for instance. This would
probably require a new state. Fine.


We discussed the savepoints earlier in [1]:


Yep. I'm trying to suggest an incremental path, with simple yet quite useful
things first.


--
Fabien.



Re: master make check fails on Solaris 10

2018-01-12 Thread Alvaro Herrera
Marina Polyakova wrote:

> - on the previous commit (272c2ab9fd0a604e3200030b1ea26fd464c44935) the same
> failures occur (see the attached regression diffs and output);
> - on commit bf54c0f05c0a58db17627724a83e1b6d4ec2712c make check-world
> passes.
> I'll try to find out from what commit it started.. Don't you have any
> suspicions?)

Perhaps you can use "git bisect".

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Re: proposal - psql: possibility to specify sort for describe commands, when size is printed

2018-01-12 Thread Pavel Stehule
Hi

2018-01-12 2:35 GMT+01:00 Thomas Munro :

> On Thu, Nov 2, 2017 at 12:44 AM, Pavel Stehule 
> wrote:
> > I am sending updated patch with some basic doc
>
> Hi Pavel,
>
> I am not sure what the status of this patch is, but FYI:
>
> startup.c: In function ‘main’:
> startup.c:284:3: error: too few arguments to function ‘listAllDbs’
>success = listAllDbs(NULL, false);
>^
> In file included from startup.c:21:0:
> describe.h:68:13: note: declared here
>  extern bool listAllDbs(const char *pattern, bool verbose, sortby_type
> sortby, bool sort_desc);
>  ^
>

There is no agreement about the optimal form of this feature, so I'll remove
this patch from the commitfest.

Regards

Pavel


>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Robert Haas
On Sat, Dec 30, 2017 at 5:32 AM, Simon Riggs  wrote:
> So we can't completely remove xl_prev field, without giving up some
> functionality. But we don't really need to store the 8-byte previous
> WAL pointer in order to detect torn pages. Something else which can
> tell us that the WAL record does not belong to current WAL segno would
> be enough as well. I propose that we replace it with a much smaller
> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
> decide to call it) is the low order 16-bits of the WAL segno to which
> the WAL record belongs. While reading WAL, we always match that the
> "xl_walid" value stored in the WAL record matches with the current WAL
> segno's lower order 16-bits and if not, then consider that as the end
> of the stream.
>
> For this to work, we must ensure that WAL files are either recycled in
> such a way that the "xl_walid" of the previous (to be recycled) WAL
> differs from the new WAL or we zero-out the new WAL file. Seems quite
> easy to do with the existing infrastructure.

I have some reservations about whether this makes the mechanism less
reliable.  It seems that an 8-byte field would have almost no chance
of matching by accident even if the location was filled with random
bytes, but with a 2-byte field it's not that unlikely.  Of course, we
also have xl_crc, so I'm not sure whether there's any chance of real
harm...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: General purpose hashing func in pgbench

2018-01-12 Thread Fabien COELHO


Hello Ildar,


Hmm. I do not think that we should want a shared seed value. The seed
should be different for each call so as to avoid undesired
correlations. If wanted, correlation could be obtained by using an
explicit identical seed.


Probably I'm missing something but I cannot see the point. If we change
seed on every invocation then we get a uniform-like distribution (see
attached image). And we don't get the same hash value for the same input
which is the whole point of hash functions. Maybe I didn't understand
you correctly.


I suggest to fix the seed when parsing the script, so that within a given
pgbench invocation each script runs with the same seed for each hash call,
but the seed changes when pgbench is re-invoked, so that the results would
be different across runs.


Also, if hash(:i) and hash(:j) appear in two distinct scripts, ISTM that
we do not necessarily want the same seed, otherwise i == j would correlate
with hash(i) == hash(j), which may not be a desirable property for some use
cases.


Maybe it would be desirable for other use cases, though.



Anyway I've attached a new version with some tests and docs added.


--
Fabien.



Re: [HACKERS] UPDATE of partition key

2018-01-12 Thread Robert Haas
On Fri, Jan 12, 2018 at 5:12 AM, Amit Khandekar  wrote:
> The reason why I am having map_required field inside a structure along
> with the map, as against a separate array, is so that we can do the
> on-demand allocation for both per-leaf array and per-subplan array.

Putting the map_required field inside the structure with the map makes
it completely silly to do the 0/1/2 thing, because the whole structure
is going to be on the same cache line anyway.  It won't save anything
to access the flag instead of a pointer in the same struct.   Also,
the uint8 will be followed by 7 bytes of padding, because the pointer
that follows will need to begin on an 8-byte boundary (at least, on
64-bit machines), so this will use more memory.

What I suggest is:

#define MT_CONVERSION_REQUIRED_UNKNOWN0
#define MT_CONVERSION_REQUIRED_YES1
#define MT_CONVERSION_REQUIRED_NO  2

In ModifyTableState:

uint8 *mt_per_leaf_tupconv_required;
TupleConversionMap **mt_per_leaf_tupconv_maps;

In PartitionTupleRouting:

int *subplan_partition_offsets;

When you initialize the ModifyTableState, do this:

mtstate->mt_per_leaf_tupconv_required = palloc0(sizeof(uint8) *
numResultRelInfos);
mtstate->mt_per_leaf_tupconv_maps = palloc0(sizeof(TupleConversionMap
*) * numResultRelInfos);

When somebody needs a map, then

(1) if they need it by subplan index, first use
subplan_partition_offsets to convert it to a per-leaf index

(2) then write a function that takes the per-leaf index and does this:

switch (mtstate->mt_per_leaf_tupconv_required[leaf_part_index])
{
case MT_CONVERSION_REQUIRED_UNKNOWN:
map = convert_tuples_by_name(...);
if (map == NULL)
mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
MT_CONVERSION_REQUIRED_NO;
else
{
mtstate->mt_per_leaf_tupconv_required[leaf_part_index] =
MT_CONVERSION_REQUIRED_YES;
mtstate->mt_per_leaf_tupconv_maps[leaf_part_index] = map;
}
return map;
case MT_CONVERSION_REQUIRED_YES:
return mtstate->mt_per_leaf_tupconv_maps[leaf_part_index];
case MT_CONVERSION_REQUIRED_NO:
return NULL;
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PATCH] Find additional connection service files in pg_service.conf.d directory

2018-01-12 Thread Curt Tilmes
I love centralizing connection service definitions in
/pg_service.conf, but for a
large enterprise, sometimes we have multiple sets of connection
service definitions
independently managed.

The convention widely adopted for this type of thing is to allow
multiple config files to be in
a directory, usually the '.d' version of the config filename. See, for example:

http://blog.siphos.be/2013/05/the-linux-d-approach/
https://lists.debian.org/debian-devel/2010/04/msg00352.html

This patch adds an extra search for service connection definitions if
the current places fail
to find the service (~/.pg_service.conf /pg_service.conf).
It will then search
every readable file in the directory /pg_service.conf.d.

What do you think?

Curt
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 4e46451..ec4ba77 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -7421,6 +7421,9 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
   
.pg_service.conf
   
+  
+   pg_service.conf.d
+  
 
   
The connection service file allows libpq connection parameters to be
@@ -7444,6 +7447,15 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
   
 
   
+Additional connection service files can be placed in the system-wide
+directory `pg_config --sysconfdir`/pg_service.conf.d
+or in the subdirectory pg_service.conf.d below the
+directory specified by the environment variable
+PGSYSCONFDIR.  These will have the lowest precedence,
+following the files described above.
+  
+
+  
The file uses an INI file format where the section
name is the service name and the parameters are connection
parameters; see  for a list.  For
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 8d54333..43ec0d5 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <dirent.h>
 
 #include "libpq-fe.h"
 #include "libpq-int.h"
@@ -4492,10 +4493,14 @@ parseServiceInfo(PQconninfoOption *options, PQExpBuffer errorMessage)
 {
 	const char *service = conninfo_getval(options, "service");
 	char		serviceFile[MAXPGPATH];
+	char		serviceDirPath[MAXPGPATH];
 	char	   *env;
 	bool		group_found = false;
 	int			status;
 	struct stat stat_buf;
+	DIR		   *serviceDir;
+	struct dirent *direntry;
+
 
 	/*
 	 * We have to special-case the environment variable PGSERVICE here, since
@@ -4539,21 +4544,45 @@ next_file:
 	snprintf(serviceFile, MAXPGPATH, "%s/pg_service.conf",
 			 getenv("PGSYSCONFDIR") ? getenv("PGSYSCONFDIR") : SYSCONFDIR);
 	if (stat(serviceFile, &stat_buf) != 0)
+		goto conf_dir;
+
+	status = parseServiceFile(serviceFile, service, options, errorMessage, &group_found);
+	if (group_found || status != 0)
+		return status;
+
+conf_dir:
+
+	/*
+	 * Try every file in pg_service.conf.d/*
+	 */
+	snprintf(serviceDirPath, MAXPGPATH, "%s/pg_service.conf.d",
+			 getenv("PGSYSCONFDIR") ? getenv("PGSYSCONFDIR") : SYSCONFDIR);
+
+	if (stat(serviceDirPath, &stat_buf) != 0 || !S_ISDIR(stat_buf.st_mode) ||
+		(serviceDir = opendir(serviceDirPath)) == NULL)
 		goto last_file;
 
-	status = parseServiceFile(serviceFile, service, options, errorMessage, &group_found);
-	if (status != 0)
-		return status;
+	while ((direntry = readdir(serviceDir)) != NULL)
+	{
+		snprintf(serviceFile, MAXPGPATH, "%s/%s", serviceDirPath, direntry->d_name);
+
+		if (stat(serviceFile, &stat_buf) != 0 || !S_ISREG(stat_buf.st_mode) ||
+			access(serviceFile, R_OK))
+			continue;
+
+		status = parseServiceFile(serviceFile, service, options, errorMessage, &group_found);
+		if (group_found || status != 0)
+		{
+			closedir(serviceDir);
+			return status;
+		}
+	}
+	closedir(serviceDir);
 
 last_file:
-	if (!group_found)
-	{
-		printfPQExpBuffer(errorMessage,
-		  libpq_gettext("definition of service \"%s\" not found\n"), service);
-		return 3;
-	}
-
-	return 0;
+	printfPQExpBuffer(errorMessage,
+	  libpq_gettext("definition of service \"%s\" not found\n"), service);
+	return 3;
 }
 
 static int


Re: WIP: a way forward on bootstrap data

2018-01-12 Thread Alvaro Herrera
Pushed 0001 with some changes of my own.  I'm afraid I created a few
conflicts for the later patches in your series; please rebase.

I don't think we introduced anything that would fail on old Perls, but
let's see what buildfarm has to say.

Others: Now is the time to raise concerns related to the proposed file
formats and tooling, so please do have a look when you have a moment.
At this stage, the proposed data format seems a good choice to me.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Doc tweak for huge_pages?

2018-01-12 Thread Peter Eisentraut
On 12/1/17 10:08, Peter Eisentraut wrote:
> Part of the confusion is that the huge_pages setting is only for shared
> memory, whereas the kernel settings affect all memory.  Is the same true
> for the proposed Windows patch?

Btw., I'm kind of hoping that the Windows patch would be committed
first, so that we don't have to rephrase this whole thing again after that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP Patch: Pgbench Serialization and deadlock errors

2018-01-12 Thread Marina Polyakova

On 12-01-2018 15:47, Fabien COELHO wrote:

Hello Marina,

A partial answer, to focus on how the feature may be simplified.


Thank you!


I apologise as it seems that I overlooked one of your mails. Changing
the thread has not helped.


I'm wondering whether the whole feature could be simplified by
considering that one script is one "transaction" (it is from the
report point of view at least), and that any retry is for the full
script only, from its beginning. That would remove the trying to 
guess
at transactions begin or end, avoid scanning manually for 
subcommands,

and so on.
 - Would it make sense?
 - Would it be ok for your use case?


I'm not sure if this makes sense: if we retry the script from the very 
beginning including successful transactions, there may be errors..


What I'm suggesting is to simplify the problem by adding some
restrictions on what kind of case is handled, so as to simplify
the code a lot.

I'd start by stating (i.e. documenting) that the feature assumes that
one script is just *one* transaction.

Note that pgbench somehow already assumes that one script is one
transaction when it reports performance anyway.

If you want 2 transactions, then you have to put them in two scripts,
which looks fine with me. Different transactions are expected to be
independent, otherwise they should be merged into one transaction.


Therefore if the script consists of several single statements (= several 
transactions), you cannot retry them. For example, the script looked 
like this:


UPDATE xy1 SET x = 1 WHERE y = 1;
UPDATE xy2 SET x = 2 WHERE y = 2;
UPDATE xy3 SET x = 3 WHERE y = 3;

If this restriction is ok for you, I'll simplify the code :)


Under these restrictions, ISTM that a retry is something like:

  case ABORTED:
 if (we want to retry) {
// do necessary stats
// reset the initial state (random, vars, current command)
state = START_TX; // loop
 }
 else {
   // count as failed...
   state = FINISHED; // or done.
 }
 break;


If we successfully complete a failed transaction block and process its 
end command in CSTATE_END_COMMAND, we may want to make a retry. So do 
you think that in this case it is ok to go to CSTATE_ABORTED at the end 
of CSTATE_END_COMMAND?..



Once this works, maybe it could go a step further by restarting at
savepoints. I'd put restrictions there to ease detecting a savepoint
so that it cannot occur in a compound command for instance. This would
probably require a new state. Fine.


We discussed the savepoints earlier in [1]:

- if there's a failure what savepoint we should rollback to and start 
the

execution again?

...
Maybe to go to the last one, if it is not successful go to the 
previous

one etc. Retrying the entire transaction may take less time..


Well, I do not know that. My 0.02 € is that if there was a savepoint 
then

this is natural the restarting point of a transaction which has some
recoverable error.




A detail:

For file contents, maybe the << 'EOF' here-document syntax would help 
instead of using concatenated backslashed strings everywhere.


Thanks, but I'm not sure that it is better to open file handlers for 
printing in variables..


Perl here-document stuff is just a multiline string, no file is
involved, it is different from the shell.


Oh, now googling was successful, thanks)

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707141637300.7871%40lancre


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-12 Thread Amit Kapila
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan  wrote:
> On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia  
> wrote:
>> I agree that plan_create_index_workers() needs to count the leader as a
>> normal worker for the CREATE INDEX.  So what you proposing is - when
>> parallel_leader_participation is true launch (return value of
>> compute_parallel_worker() - 1)
>> workers. true ?
>
> Almost. We need to not subtract one when only one worker is indicated
> by compute_parallel_worker(). I also added some new stuff there, to
> consider edge cases with the parallel_leader_participation GUC.
>
>>> I'm working on fixing up what you posted. I'm probably not more than a
>>> week away from posting a patch that I'm going to mark "ready for
>>> committer". I've already made the change above, and once I spend time
>>> on trying to break the few small changes needed within buffile.c I'll
>>> have taken it as far as I can, most likely.
>>>
>>
>> Okay, once you submit the patch with changes - I will do one round of
>> review for the changes.
>
> I've attached my revision. Changes include:
>

Few observations while skimming through the patch:

1.
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
  {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- OldestXmin = InvalidTransactionId; /* not used */
+ OldestXmin = GetOldestXmin(heapRelation, true);

I think leader and workers should have the same idea of oldestXmin for
the purpose of deciding the visibility of tuples.  I think this is
ensured in all forms of parallel query as we do share the snapshot;
however, the same doesn't seem to be true for Parallel Index builds.

2.
+
+ /* Wait on worker processes to finish (should be almost instant) */
+ reltuples = _bt_leader_wait_for_workers(buildstate);

Can't we use WaitForParallelWorkersToFinish for this purpose?  The
reason is that if we use a different mechanism here then we might need
a different way to solve the problem related to fork failure.  See
thread [1].  Basically, what if postmaster fails to launch workers due
to fork failure, the leader backend might wait indefinitely.



[1] - https://commitfest.postgresql.org/16/1341/


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Bug in to_timestamp().

2018-01-12 Thread Arthur Zakirov
Hello,

On Fri, Jan 12, 2018 at 02:48:40PM +1300, Thomas Munro wrote:
> 
> I'm guessing that commit 11b623dd0a2c385719ebbbdd42dd4ec395dcdc9d had
> something to do with the following failure, when your patch is
> applied:
> 
>  horology ... FAILED
> 

Thank you a lot for pointing on that.

It seems to me that it happens because the patch eats minus sign "-" before 
"05". And it is wrong to do that.

I attached new version of the patch. Now (expected output):

=# SELECT to_timestamp('2011-12-18 11:38 -05', 'YYYY-MM-DD HH12:MI TZH');
      to_timestamp
------------------------
 2011-12-18 20:38:00+04

But these queries may confuse:

=# SELECT to_timestamp('2011-12-18 11:38 -05', 'YYYY-MM-DD HH12:MI  TZH');
      to_timestamp
------------------------
 2011-12-18 10:38:00+04

=# SELECT to_timestamp('2011-12-18 11:38 -05', 'YYYY-MM-DD HH12:MI -TZH');
      to_timestamp
------------------------
 2011-12-18 10:38:00+04

And these queries don't work anymore using new version of the patch:

=# SELECT to_timestamp('2000 + JUN', 'YYYY MON');
ERROR:  invalid value "+ J" for "MON"
DETAIL:  The given value did not match any of the allowed values for this field.

=# SELECT to_timestamp('2000 +   JUN', 'YYYY MON');
ERROR:  invalid value "+  " for "MON"
DETAIL:  The given value did not match any of the allowed values for this field.

Other queries mentioned in the thread work as before.

Any thoughts? If someone has better approach, feel free to share it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2428434030..545129334c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6174,7 +6174,8 @@ SELECT regexp_match('abc01234xyz', 
'(?:(.*?)(\d+)(.*)){1,1}');
to_timestamp and to_date
skip multiple blank spaces in the input string unless the
FX option is used. For example,
-   to_timestamp('2000JUN', ' 
MON') works, but
+   to_timestamp('2000JUN', ' 
MON') and
+   to_timestamp('2000JUN', 
'MON') work, but
to_timestamp('2000JUN', 'FX 
MON') returns an error
because to_timestamp expects one space only.
FX must be specified as the first item in
@@ -6182,6 +6183,19 @@ SELECT regexp_match('abc01234xyz', 
'(?:(.*?)(\d+)(.*)){1,1}');
   
  
 
+ 
+  
+   to_timestamp and to_date don't
+   skip multiple printable non letter and non digit characters in the input
+   string, but skip them in the formatting string. For example,
+   to_timestamp('2000-JUN', '/MON') and
+   to_timestamp('2000/JUN', '//MON') work, but
+   to_timestamp('2000//JUN', '/MON')
+   returns an error because count of the "/" character in the input string
+   doesn't match count of it in the formatting string.
+  
+ 
+
  
   
Ordinary text is allowed in to_char
diff --git a/src/backend/utils/adt/formatting.c 
b/src/backend/utils/adt/formatting.c
index b8bd4caa3e..dddfc4b1bf 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -171,6 +171,8 @@ typedef struct
 #define NODE_TYPE_END  1
 #define NODE_TYPE_ACTION   2
 #define NODE_TYPE_CHAR 3
+#define NODE_TYPE_SEPARATOR4
+#define NODE_TYPE_SPACE5
 
 #define SUFFTYPE_PREFIX1
 #define SUFFTYPE_POSTFIX   2
@@ -961,6 +963,7 @@ typedef struct NUMProc
 static const KeyWord *index_seq_search(const char *str, const KeyWord *kw,
 const int *index);
 static const KeySuffix *suff_search(const char *str, const KeySuffix *suf, int 
type);
+static bool is_separator_char(const char *str);
 static void NUMDesc_prepare(NUMDesc *num, FormatNode *n);
 static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 const KeySuffix *suf, const int *index, int ver, 
NUMDesc *Num);
@@ -1050,6 +1053,16 @@ suff_search(const char *str, const KeySuffix *suf, int 
type)
return NULL;
 }
 
+static bool
+is_separator_char(const char *str)
+{
+   /* ASCII printable character, but not letter or digit */
+   return (*str > 0x20 && *str < 0x7F &&
+   !(*str >= 'A' && *str <= 'Z') &&
+   !(*str >= 'a' && *str <= 'z') &&
+   !(*str >= '0' && *str <= '9'));
+}
+
 /* --
  * Prepare NUMDesc (number description struct) via FormatNode struct
  * --
@@ -1325,7 +1338,14 @@ parse_format(FormatNode *node, const char *str, const 
KeyWord *kw,
if (*str == '\\' && *(str + 1) == '"')
str++;
chlen = pg_mblen(str);
-   n->type = NODE_TYPE_CHAR;
+
+   if (ver == DCH_TYPE && is_separator_char(str))
+  

Re: [HACKERS] Cached plans and statement generalization

2018-01-12 Thread Konstantin Knizhnik



On 12.01.2018 03:40, Thomas Munro wrote:

On Sun, Jan 7, 2018 at 11:51 AM, Stephen Frost  wrote:

* Konstantin Knizhnik (k.knizh...@postgrespro.ru) wrote:

Updated version of the patch is attached.

This patch appears to apply with just a bit of fuzz and make check
passes, so I'm not sure why this is currently marked as 'Waiting for
author'.

I've updated it to be 'Needs review'.  If that's incorrect, feel free to
change it back with an explanation.

Hi Konstantin,

/home/travis/build/postgresql-cfbot/postgresql/src/backend/tcop/postgres.c:5249:
undefined reference to `PortalGetHeapMemory'

That's because commit 0f7c49e85518dd846ccd0a044d49a922b9132983 killed
PortalGetHeapMemory.  Looks like it needs to be replaced with
portal->portalContext.


Hi  Thomas,

Thank you very much for reporting the problem.
Rebased version of the patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 6c76c41..8922950 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3688,6 +3688,454 @@ raw_expression_tree_walker(Node *node,
 }
 
 /*
+ * raw_expression_tree_mutator --- transform raw parse tree.
+ *
+ * This function is implementing slightly different approach for tree update than expression_tree_mutator().
+ * Callback is given pointer to pointer to the current node and can update this field instead of returning reference to new node.
+ * It makes it possible to remember changes and easily revert them without extra traversal of the tree.
+ *
+ * This function do not need QTW_DONT_COPY_QUERY flag: it never implicitly copy tree nodes, doing in-place update.
+ *
+ * Like raw_expression_tree_walker, there is no special rule about query
+ * boundaries: we descend to everything that's possibly interesting.
+ *
+ * Currently, the node type coverage here extends only to DML statements
+ * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because
+ * this is used mainly during analysis of CTEs, and only DML statements can
+ * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned.
+ */
+bool
+raw_expression_tree_mutator(Node *node,
+			bool (*mutator) (),
+			void *context)
+{
+	ListCell   *temp;
+
+	/*
+	 * The walker has already visited the current node, and so we need only
+	 * recurse into any sub-nodes it has.
+	 */
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex expressions */
+	check_stack_depth();
+
+	switch (nodeTag(node))
+	{
+		case T_SetToDefault:
+		case T_CurrentOfExpr:
+		case T_Integer:
+		case T_Float:
+		case T_String:
+		case T_BitString:
+		case T_Null:
+		case T_ParamRef:
+		case T_A_Const:
+		case T_A_Star:
+			/* primitive node types with no subnodes */
+			break;
+		case T_Alias:
+			/* we assume the colnames list isn't interesting */
+			break;
+		case T_RangeVar:
+			return mutator(&((RangeVar *) node)->alias, context);
+		case T_GroupingFunc:
+			return mutator(&((GroupingFunc *) node)->args, context);
+		case T_SubLink:
+			{
+SubLink	   *sublink = (SubLink *) node;
+
+if (mutator(&sublink->testexpr, context))
+	return true;
+/* we assume the operName is not interesting */
+if (mutator(&sublink->subselect, context))
+	return true;
+			}
+			break;
+		case T_CaseExpr:
+			{
+CaseExpr   *caseexpr = (CaseExpr *) node;
+
+if (mutator(&caseexpr->arg, context))
+	return true;
+/* we assume the mutator doesn't care about CaseWhens, either */
+foreach(temp, caseexpr->args)
+{
+	CaseWhen   *when = (CaseWhen *) lfirst(temp);
+
+	Assert(IsA(when, CaseWhen));
+	if (mutator(&when->expr, context))
+		return true;
+	if (mutator(&when->result, context))
+		return true;
+}
+if (mutator(&caseexpr->defresult, context))
+	return true;
+			}
+			break;
+		case T_RowExpr:
+			/* Assume colnames isn't interesting */
+			return mutator(&((RowExpr *) node)->args, context);
+		case T_CoalesceExpr:
+			return mutator(&((CoalesceExpr *) node)->args, context);
+		case T_MinMaxExpr:
+			return mutator(&((MinMaxExpr *) node)->args, context);
+		case T_XmlExpr:
+			{
+XmlExpr	   *xexpr = (XmlExpr *) node;
+
+if (mutator(&xexpr->named_args, context))
+	return true;
+/* we assume the mutator doesn't care about arg_names */
+if (mutator(&xexpr->args, context))
+	return true;
+			}
+			break;
+		case T_NullTest:
+			return mutator(&((NullTest *) node)->arg, context);
+		case T_BooleanTest:
+			return mutator(&((BooleanTest *) node)->arg, context);
+		case T_JoinExpr:
+			{
+JoinExpr   *join = (JoinExpr *) node;
+
+if (mutator(&join->larg, context))
+	return true;
+if (mutator(&join->rarg, context))
+	return true;
+if (mutator(&join->quals, context))
+	return true;
+if (mutator(&join->alias, context))
+	return true;
+/* using list is deemed uninteresting */
+			}
+			break;
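
To make the callback contract described in the function's header comment
concrete, a hypothetical caller could look like this; the constant-to-
parameter rewrite below is invented purely for illustration and is not
part of the attached patch:

/* Example callback sketch: replace every A_Const with a ParamRef. */
static bool
replace_const_with_param(Node **ref, void *context)
{
	Node	   *node = *ref;

	if (node == NULL)
		return false;

	if (IsA(node, A_Const))
	{
		ParamRef   *param = makeNode(ParamRef);

		param->number = ++(*(int *) context);	/* next $n */
		param->location = ((A_Const *) node)->location;
		*ref = (Node *) param;					/* in-place update */
		return false;							/* continue traversal */
	}

	/* recurse into the children of this node */
	return raw_expression_tree_mutator(node, replace_const_with_param, context);
}

Reverting is then just a matter of remembering the (ref, old node) pairs
that were overwritten, which is the point of the pointer-to-pointer
interface.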

Re: [HACKERS] [BUGS] BUG #14825: enum type: unsafe use?

2018-01-12 Thread Robert Haas
On Thu, Jan 11, 2018 at 11:01 PM, Thomas Munro
 wrote:
> Are you saying we should do the work now to create a per-transaction
> DSM segment + DSA area + thing that every backend attaches to?

No, I was just thinking you could stuff it into the per-parallel-query
DSM/DSA.  But...

> I didn't think creating backend local hash tables would be a problem
> because it's a vanishingly rare occurrence for the hash table to be
> created at all (ie when you've altered an enum), and if created, to
> have more than a couple of entries in it.

...this is also a fair point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP Patch: Pgbench Serialization and deadlock errors

2018-01-12 Thread Fabien COELHO


Hello Marina,

A partial answer, to focus on how the feature may be simplified.

I apologise as it seems that I overlooked one of your mails. Changing the 
thread has not helped.



I'm wondering whether the whole feature could be simplified by
considering that one script is one "transaction" (it is from the
report point of view at least), and that any retry is for the full
script only, from its beginning. That would remove the trying to guess
at transactions begin or end, avoid scanning manually for subcommands,
and so on.
 - Would it make sense?
 - Would it be ok for your use case?


I'm not sure if this makes sense: if we retry the script from the very 
beginning including successful transactions, there may be errors..


What I'm suggesting is to simplify the problem by adding some restrictions 
on what kind of case is handled, so as to simplify the code a lot.


I'd start by stating (i.e. documenting) that the feature assumes that one 
script is just *one* transaction.


Note that pgbench somehow already assumes that one script is one 
transaction when it reports performance anyway.


If you want 2 transactions, then you have to put them in two scripts, 
which looks fine with me. Different transactions are expected to be 
independent, otherwise they should be merged into one transaction.


Under these restrictions, ISTM that a retry is something like:

  case ABORTED:
 if (we want to retry) {
// do necessary stats
// reset the initial state (random, vars, current command)
state = START_TX; // loop
 }
 else {
   // count as failed...
   state = FINISHED; // or done.
 }
 break;

Once this works, maybe it could go a step further by restarting at 
savepoints. I'd put restrictions there to ease detecting a savepoint so 
that it cannot occur in a compound command for instance. This would 
probably require a new state. Fine.



A detail:

For file contents, maybe the << 'EOF' here-document syntax would help 
instead of using concatenated backslashed strings everywhere.


Thanks, but I'm not sure that it is better to open file handlers for 
printing in variables..


Perl here-document stuff is just a multiline string, no file is involved, 
it is different from the shell.


--
Fabien.



Re: General purpose hashing func in pgbench

2018-01-12 Thread Ildar Musin
Hello Fabien,

11/01/2018 19:21, Ildar Musin пишет:
>
> 10/01/2018 21:42, Fabien COELHO пишет:
>> Hmm. I do not think that we should want a shared seed value. The seed
>> should be different for each call so as to avoid undesired
>> correlations. If wanted, correlation could be obtained by using an
>> explicit identical seed.
>>
>> ISTM that the best way to add the seed is to call random() when the
>> second arg is missing in make_func. Also, this means that the executor
>> would always get its two arguments, so it would simplify the code there.
>>
> Ok, I think I understand what you meant. You meant the case like following:
>
> \set x random(1, 100)
> \set h1 hash(:x)
> \set h2 hash(:x)  -- will have different seed from h1
>
> so that different instances of hash function within one script would
> have different seeds. Yes, that is a good idea, I can do that.
>
Added this feature in the attached patch. But on second thought this could
be something that users won't expect. For example, they may want to run
pgbench with two scripts:
- the first one updates row by key that is a hashed random_zipfian value;
- the second one reads row by key generated the same way
(that is actually what YCSB workloads A and B do)

It feels natural to write something like this:
\set rnd random_zipfian(0, 100, 0.99)
\set key abs(hash(:rnd)) % 1000
in both scripts and expect that they both would have the same
distribution. But they wouldn't. We could of course describe this
implicit behaviour in documentation, but ISTM that shared seed would be
more clear.

Thanks!

-- 
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company 

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 3dd492c..c575f19 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1246,6 +1246,27 @@ pgbench  options 
 d
5
   
   
+   hash(a [, 
seed ] )
+   integer
+   alias for hash_murmur2()
+   hash(10, 5432)
+   -5817877081768721676
+  
+  
+   hash_fnv1a(a [, 
seed ] )
+   integer
+   FNV hash
+   hash_fnv1a(10, 5432)
+   -7793829335365542153
+  
+  
+   hash_murmur2(a [, 
seed ] )
+   integer
+   murmur2 hash
+   hash_murmur2(10, 5432)
+   -5817877081768721676
+  
+  

int(x)
integer
cast to int
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index e23ca51..36cad30 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -16,6 +16,10 @@
 
 #include "pgbench.h"
 
+#define PGBENCH_NARGS_VARIABLE (-1)
+#define PGBENCH_NARGS_CASE (-2)
+#define PGBENCH_NARGS_HASH (-3)
+
 PgBenchExpr *expr_parse_result;
 
 static PgBenchExprList *make_elist(PgBenchExpr *exp, PgBenchExprList *list);
@@ -226,9 +230,13 @@ make_uop(yyscan_t yyscanner, const char *operator, 
PgBenchExpr *expr)
 /*
  * List of available functions:
  * - fname: function name, "!..." for special internal functions
- * - nargs: number of arguments
- * -1 is a special value for least & greatest meaning 
#args >= 1
- * -2 is for the "CASE WHEN ..." function, which has #args 
>= 3 and odd
+ * - nargs: number of arguments. Special cases:
+ * - PGBENCH_NARGS_VARIABLE is a special value for least & 
greatest
+ *   meaning #args >= 1;
+ * - PGBENCH_NARGS_CASE is for the "CASE WHEN ..." 
function, which
+ *   has #args >= 3 and odd;
+ * - PGBENCH_NARGS_HASH is for hash functions, which have 
one required
+ *   and one optional argument;
  * - tag: function identifier from PgBenchFunction enum
  */
 static const struct
@@ -259,10 +267,10 @@ static const struct
"abs", 1, PGBENCH_ABS
},
{
-   "least", -1, PGBENCH_LEAST
+   "least", PGBENCH_NARGS_VARIABLE, PGBENCH_LEAST
},
{
-   "greatest", -1, PGBENCH_GREATEST
+   "greatest", PGBENCH_NARGS_VARIABLE, PGBENCH_GREATEST
},
{
"debug", 1, PGBENCH_DEBUG
@@ -347,7 +355,16 @@ static const struct
},
/* "case when ... then ... else ... end" construction */
{
-   "!case_end", -2, PGBENCH_CASE
+   "!case_end", PGBENCH_NARGS_CASE, PGBENCH_CASE
+   },
+   {
+   "hash", PGBENCH_NARGS_HASH, PGBENCH_HASH_MURMUR2
+   },
+   {
+   "hash_murmur2", PGBENCH_NARGS_HASH, PGBENCH_HASH_MURMUR2
+   },
+   {
+   "hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A
},
/* keep as last array element */
{
@@ -423,29 +440,47 @@ elist_length(PgBenchExprList *list)
 static PgBenchExpr *
 make_func(yyscan_t yyscanner, int fnumber, PgBenchExprList *args)
 {
+   int len = elist_length(args);
+

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Alexander Kuzmenkov

Hi Simon, Pavan,

I tried benchmarking your patch. I ran pgbench on a 72-core machine with 
a simple append-only script:


BEGIN;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, 
:bid, :aid, :delta, CURRENT_TIMESTAMP);

--- the insert repeated 20 times total ---
END;

I can see an increase in transaction throughput of about 5%. Please see 
the attached graph 'append-only.png'
On the default tpcb-like benchmark, there is no meaningful change in 
performance, as shown in 'tpcb.png'


Looking at the code, your changes seem reasonable to me. Some nitpicks:

XLogSegNoLowOrderMask -- this macro is not really needed, 
XLogSegNoToSegID is enough (pg_resetwal should be changed to use the 
latter, too).



The loop in findLastCheckpoint looks complex to me. It would be easier 
to read with two loops, the outer one iterating over the segments and 
the inner one iterating over the records.



>+    uint16        xl_walid;        /* lowest 16 bits of the WAL file 
to which this

>+                                   record belongs */

I'd put a high-level description here, the low-level details are hidden 
behind macros anyway. Something like this: /* identifier of the WAL 
segment to which this record belongs. Used to detect stale WAL records 
in a reused segment. */



>+     * Make sure the above heavily-contended spinlock and byte 
positions are
>+     * on their own cache line. In particular, the RedoRecPtr and 
full page


This comment in the definition of XLogCtlInsert mentions the spinlock 
that is no more.



The signature of InstallXLogFileSegment looks overloaded now, not sure 
how to simplify it. We could have two versions, one that looks for the 
free slot and another that reuses the existing file. Also, it seems to 
me that the lock doesn't have to be acquired inside this function, it 
can be done by the caller.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: master make check fails on Solaris 10

2018-01-12 Thread Marina Polyakova

On 11-01-2018 20:39, Andres Freund wrote:

Hi,

On 2018-01-11 20:21:11 +0300, Marina Polyakova wrote:

Hello, hackers! I got a permanent failure of master (commit
ca454b9bd34c75995eda4d07c9858f7c22890c2b) make check on Solaris 10.


Did this use to work?


It has always failed, if that is what you are asking..


If so, could you check whether it worked before
69c3936a1499b772a749ae629fc59b2d72722332?


- on the previous commit (272c2ab9fd0a604e3200030b1ea26fd464c44935) the 
same failures occur (see the attached regression diffs and output);
- on commit bf54c0f05c0a58db17627724a83e1b6d4ec2712c make check-world 
passes.
I'll try to find out from what commit it started.. Don't you have any 
suspicions?)


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company*** /home/buildfarm/mpolyakova/postgresql_master/src/test/regress/expected/random.out	Thu Jan 11 19:37:51 2018
--- /home/buildfarm/mpolyakova/postgresql_master/src/test/regress/results/random.out	Fri Jan 12 12:37:28 2018
***
*** 46,52 
  
  SELECT AVG(random) FROM RANDOM_TBL
HAVING AVG(random) NOT BETWEEN 80 AND 120;
!  avg 
! -
! (0 rows)
  
--- 46,53 
  
  SELECT AVG(random) FROM RANDOM_TBL
HAVING AVG(random) NOT BETWEEN 80 AND 120;
!   avg  
! ---
!  59630855721112644241909088256
! (1 row)
  

==

*** /home/buildfarm/mpolyakova/postgresql_master/src/test/regress/expected/groupingsets.out	Thu Jan 11 19:37:50 2018
--- /home/buildfarm/mpolyakova/postgresql_master/src/test/regress/results/groupingsets.out	Fri Jan 12 12:38:07 2018
***
*** 143,156 
  -- nesting with window functions
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by rollup (a,b) order by rsum, a, b;
!  a | b | sum | rsum 
! ---+---+-+--
!  1 | 1 |   8 |8
!  1 | 2 |   2 |   10
!  1 |   |  10 |   20
!  2 | 2 |   2 |   22
!  2 |   |   2 |   24
!|   |  12 |   36
  (6 rows)
  
  -- nesting with grouping sets
--- 143,156 
  -- nesting with window functions
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by rollup (a,b) order by rsum, a, b;
!  a | b | sum |  rsum  
! ---+---+-+
!  1 | 1 |   8 |  0
!  1 | 2 |   2 |  79489175236044124958001987584
!  1 |   |  10 | 158978350472088249916003975168
!  2 | 2 |   2 | 238467525708132374874005962752
!  2 |   |   2 | 317956700944176499832007950336
!|   |  12 | 397445876180220624790009937920
  (6 rows)
  
  -- nesting with grouping sets
***
*** 544,559 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum | rsum 
! ---+---+-+--
!  1 | 1 |   8 |8
!  1 | 2 |   2 |   10
!  1 |   |  10 |   20
!  2 | 2 |   2 |   22
!  2 |   |   2 |   24
!| 1 |   8 |   32
!| 2 |   4 |   36
!|   |  12 |   48
  (8 rows)
  
  select a, b, sum(c) from (values (1,1,10),(1,1,11),(1,2,12),(1,2,13),(1,3,14),(2,3,15),(3,3,16),(3,4,17),(4,1,18),(4,1,19)) v(a,b,c) group by rollup (a,b);
--- 544,559 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum |  rsum  
! ---+---+-+
!  1 | 1 |   8 |  0
!  1 | 2 |   2 |  79523211102173597567174049792
!  1 |   |  10 | 159046422204347195134348099584
!  2 | 2 |   2 | 238569633306520792701522149376
!  2 |   |   2 | 318092844408694390268696199168
!| 1 |   8 | 397616055510867987835870248960
!| 2 |   4 | 477139266613041585403044298752
!|   |  12 | 556662477715215182970218348544
  (8 rows)
  
  select a, b, sum(c) from (values (1,1,10),(1,1,11),(1,2,12),(1,2,13),(1,3,14),(2,3,15),(3,3,16),(3,4,17),(4,1,18),(4,1,19)) v(a,b,c) group by rollup (a,b);
***
*** 1219,1234 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum | rsum 
! ---+---+-+--
!  1 | 1 |   8 |8
!  1 | 2 |   2 |   10
!  1 |   |  10 |   20
!  2 | 2 |   2 |   22
!  2 |   |   2 |   24
!| 1 |   8 |   32
!| 2 |   4 |   36
!|   |  12 |   48
  (8 rows)
  
  explain (costs off)
--- 1219,1234 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum |  rsum  
! ---+---+-+
!  1 | 1 |   8 |  0
!  1 | 2 |   2 |  79526240205124453265225809920
!  1 |   |  10 | 159052480410248906530451619840
!  2 | 2 |   2 | 238578720615373359795677429760
!  2 |   |   2 | 318104960820497813060903239680
!| 1 |   8 | 

Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2018-01-12 Thread David Rowley
On 12 January 2018 at 22:51, Amit Langote  wrote:
> On 2018/01/12 18:09, David Rowley wrote:
>> On 10 January 2018 at 17:18, David Rowley  
>> wrote:
>>> Basically, the changes to add_paths_to_append_rel() are causing
>>> duplication in partition_rels.
>
>> I also noticed earlier that this is still broken in v19.
>
> I cannot see the duplication here (with v19 + some local changes per your
> latest review, although I had fixed the issue in v18).

I may have made a mistake there. The code I expected to change didn't.
I meant to test the case again, but I got distracted just before I did
and came back a while later and forgot that I hadn't tested.

If you've tested my case and it works, then please don't look any
further. I will look when v20 is ready.

Sorry for the false alarm... I must've been trying to do too many
things at once :-(

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: CREATE ROUTINE MAPPING

2018-01-12 Thread Pavel Stehule
2018-01-12 10:02 GMT+01:00 Ashutosh Bapat :

> On Fri, Jan 12, 2018 at 8:07 AM, Corey Huinker 
> wrote:
> > A few months ago, I was researching ways for formalizing calling
> functions
> > on one postgres instance from another. RPC, basically. In doing so, I
> > stumbled across an obscure part of the the SQL Standard called ROUTINE
> > MAPPING, which is exactly what I'm looking for.
> >
> > The syntax specified is, roughly:
> >
> > CREATE ROUTINE MAPPING local_routine_name FOR remote_routine_spec
> > SERVER my_server [ OPTIONS( ... ) ]
> >
> >
> > Which isn't too different from CREATE USER MAPPING.
> >
> > The idea here is that if I had a local query:
> >
> > SELECT t.x, remote_func1(),  remote_func2(t.y)
> >
> > FROM remote_table t
> >
> > WHERE t.active = true;
> >
> >
> > that would become this query on the remote side:
> >
> > SELECT t.x, local_func1(), local_func2(t.y)
> >
> > FROM local_table t
> >
> > WHERE t.active = true;
> >
>
> I think this is a desired feature. Being able to call a function on
> remote server through local server is often useful.
>
> PostgreSQL allows function overloading, which means that there can be
> multiple functions with same name differing in argument types. So, the
> syntax has to include the input parameters or their types at least.
>

+1

Pavel


> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>
>


Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2018-01-12 Thread Amit Langote
Hi.

On 2018/01/12 18:09, David Rowley wrote:
> On 10 January 2018 at 17:18, David Rowley  
> wrote:
>> Basically, the changes to add_paths_to_append_rel() are causing
>> duplication in partition_rels.

[ ... ]

> I also noticed earlier that this is still broken in v19.

I cannot see the duplication here (with v19 + some local changes per your
latest review, although I had fixed the issue in v18).

create table part (a int, b int) partition by list(a);
create table part1 partition of part for values in (1) partition by list (b);
create table part2 partition of part1 for values in (1);

select * from part;

For the above query, I set a breakpoint all the way in ExecInitAppend() to
see what partitioned_rels list it ends up with and I see no duplication:

   :partitioned_rels (i 1 3)

where 1 and 3 are RT indexes of part and part1, respectively.

With v17, you'd be able to see the duplication:

   :partitioned_rels (i 1 3 3)

Let me confirm again if you were complaining exactly of this duplication?
That the RT index of part1 appears twice due to the bug I claim I fixed in
v18?  Or something else?

Thanks,
Amit




Re: Enhance pg_stat_wal_receiver view to display connected host

2018-01-12 Thread Michael Paquier
On Fri, Jan 12, 2018 at 04:32:54PM +1100, Haribabu Kommi wrote:
> On Fri, Jan 12, 2018 at 4:06 PM, Michael Paquier 
> wrote:
> > On Fri, Jan 12, 2018 at 03:55:04PM +1100, Haribabu Kommi wrote:
> > > Before posting the patch, first I did the same, upon further study
> > > I didn't find any scenario where the value is not present in
> > > conn->connhost[conn->whichhost].host and present in conn->pghost.
> > >
> > > If user provides "host" as connection option, the value is present
> > > in both the variables. Even if the connection is unix domain socket,
> > > there is a value in conn->connhost[conn->whichhost].host.
> > >
> > > In case if user provides only hostaddr and host connection option,
> > > then in that case, both the members are NULL. So even if we add
> > > that case, it will be dead code.
> >
> > Hm. Wouldn't it matter for cases where caller has not yet established a
> > connection to the server but still calls PQhost to get the host string?
> >
> 
> Yes I agree that the above scenario leads to a wrong result with the
> earlier patch,
> Updated patch attached by including the conn->pghost. Thanks for the review.

Could you begin a new thread by the way? As you are the one who
discovered the inconsistency and the one who wrote a patch this looks
adapted to me. Or perhaps I should?
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] path toward faster partition pruning

2018-01-12 Thread David Rowley
On 10 January 2018 at 17:18, David Rowley  wrote:
> Basically, the changes to add_paths_to_append_rel() are causing
> duplication in partition_rels.
>
> A test case is:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list (b);
> create table part2 partition of part1 for values in(1);
>
> select * from part;
>
> partition_rels ends up with 3 items in the list, but there's only 2
> partitions here. The reason for this is that, since planning here is
> recursively calling add_paths_to_append_rel, the list for part ends up
> with itself and part1 in it, then since part1's list already contains
> itself, per set_append_rel_size's "rel->live_partitioned_rels =
> list_make1_int(rti);", then part1 ends up in the list twice.
>
> It would be nicer if you could use a RelIds for this, but you'd also
> need some way to store the target partition relation since
> nodeModifyTable.c does:
>
> /* The root table RT index is at the head of the partitioned_rels list */
> if (node->partitioned_rels)
> {
> Index root_rti;
> Oid root_oid;
>
> root_rti = linitial_int(node->partitioned_rels);
> root_oid = getrelid(root_rti, estate->es_range_table);
> rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
> }
>
> You could also fix it by instead of doing:
>
> /*
> * Accumulate the live partitioned children of this child, if it's
> * itself partitioned rel.
> */
> if (childrel->part_scheme)
> partitioned_rels = list_concat(partitioned_rels,
>childrel->live_partitioned_rels);
>
> do something along the lines of:
>
> if (childrel->part_scheme)
> {
> ListCell *lc;
> ListCell *start = lnext(list_head(childrel->live_partitioned_rels));
>
> for_each_cell(lc, start)
>partitioned_rels = lappend_int(partitioned_rels,
>lfirst_int(lc));
> }
>
> Although it seems pretty fragile. It would probably be better to find
> a nicer way of handling all this.

Hi Amit,

I also noticed earlier that this is still broken in v19.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: CREATE ROUTINE MAPPING

2018-01-12 Thread Ashutosh Bapat
On Fri, Jan 12, 2018 at 8:07 AM, Corey Huinker  wrote:
> A few months ago, I was researching ways for formalizing calling functions
> on one postgres instance from another. RPC, basically. In doing so, I
> stumbled across an obscure part of the the SQL Standard called ROUTINE
> MAPPING, which is exactly what I'm looking for.
>
> The syntax specified is, roughly:
>
> CREATE ROUTINE MAPPING local_routine_name FOR remote_routine_spec
> SERVER my_server [ OPTIONS( ... ) ]
>
>
> Which isn't too different from CREATE USER MAPPING.
>
> The idea here is that if I had a local query:
>
> SELECT t.x, remote_func1(),  remote_func2(t.y)
>
> FROM remote_table t
>
> WHERE t.active = true;
>
>
> that would become this query on the remote side:
>
> SELECT t.x, local_func1(), local_func2(t.y)
>
> FROM local_table t
>
> WHERE t.active = true;
>

I think this is a desired feature. Being able to call a function on
remote server through local server is often useful.

PostgreSQL allows function overloading, which means that there can be
multiple functions with same name differing in argument types. So, the
syntax has to include the input parameters or their types at least.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Transform for pl/perl

2018-01-12 Thread Anthony Bykov
On Fri, 12 Jan 2018 15:19:26 +1300
Thomas Munro  wrote:

> On Thu, Dec 7, 2017 at 10:56 PM, Anthony Bykov
>  wrote:
> >> Please, find a new version of the patch in attachments to this
> >> message.  
> 
> Hi again Anthony,
> 
> I wonder why make check passes for me on my Mac, but when Travis CI
> (Ubuntu Trusty on amd64) runs it, it fails like this:
> 
> test jsonb_plperl ... FAILED
> test jsonb_plperl_relocatability ... ok
> test jsonb_plperlu... FAILED
> test jsonb_plperlu_relocatability ... ok
> 
> = Contents of ./contrib/jsonb_plperl/regression.diffs
> *** 
> /home/travis/build/postgresql-cfbot/postgresql/contrib/jsonb_plperl/expected/jsonb_plperl.out
> 2018-01-11 21:46:35.867584467 +
> --- 
> /home/travis/build/postgresql-cfbot/postgresql/contrib/jsonb_plperl/results/jsonb_plperl.out
> 2018-01-11 21:55:08.564204175 +
> ***
> *** 89,96 
>   (1 row)
> 
>   SELECT testSVToJsonb2('1E+131071');
> ! ERROR:  could not transform to type "jsonb"
> ! DETAIL:  The type you are trying to transform can't be transformed
> to jsonb CONTEXT:  PL/Perl function "testsvtojsonb2"
>   SELECT testSVToJsonb2('-1');
>testsvtojsonb2
> --- 89,95 
>   (1 row)
> 
>   SELECT testSVToJsonb2('1E+131071');
> ! ERROR:  invalid input syntax for type numeric: "inf"
>   CONTEXT:  PL/Perl function "testsvtojsonb2"
>   SELECT testSVToJsonb2('-1');
>testsvtojsonb2
> ==
> *** 
> /home/travis/build/postgresql-cfbot/postgresql/contrib/jsonb_plperl/expected/jsonb_plperlu.out
> 2018-01-11 21:46:35.867584467 +
> --- 
> /home/travis/build/postgresql-cfbot/postgresql/contrib/jsonb_plperl/results/jsonb_plperlu.out
> 2018-01-11 21:55:08.704204228 +
> ***
> *** 89,96 
>   (1 row)
> 
>   SELECT testSVToJsonb2('1E+131071');
> ! ERROR:  could not transform to type "jsonb"
> ! DETAIL:  The type you are trying to transform can't be transformed
> to jsonb CONTEXT:  PL/Perl function "testsvtojsonb2"
>   SELECT testSVToJsonb2('-1');
>testsvtojsonb2
> --- 89,95 
>   (1 row)
> 
>   SELECT testSVToJsonb2('1E+131071');
> ! ERROR:  invalid input syntax for type numeric: "inf"
>   CONTEXT:  PL/Perl function "testsvtojsonb2"
>   SELECT testSVToJsonb2('-1');
>testsvtojsonb2
> ==
> 

Hello, thank you for your message.
The problem was that different Perl builds use different infinity
representations: some of them use "Inf", others use "inf". So, attached
is a new version of the patch.

Thank you again.

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/contrib/Makefile b/contrib/Makefile
index 8046ca4..53d44fe 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -75,9 +75,9 @@ ALWAYS_SUBDIRS += sepgsql
 endif
 
 ifeq ($(with_perl),yes)
-SUBDIRS += hstore_plperl
+SUBDIRS += hstore_plperl jsonb_plperl
 else
-ALWAYS_SUBDIRS += hstore_plperl
+ALWAYS_SUBDIRS += hstore_plperl jsonb_plperl
 endif
 
 ifeq ($(with_python),yes)
diff --git a/contrib/jsonb_plperl/Makefile b/contrib/jsonb_plperl/Makefile
new file mode 100644
index 000..cd86553
--- /dev/null
+++ b/contrib/jsonb_plperl/Makefile
@@ -0,0 +1,40 @@
+# contrib/jsonb_plperl/Makefile
+
+MODULE_big = jsonb_plperl
+OBJS = jsonb_plperl.o $(WIN32RES)
+PGFILEDESC = "jsonb_plperl - jsonb transform for plperl"
+
+PG_CPPFLAGS = -I$(top_srcdir)/src/pl/plperl
+
+EXTENSION = jsonb_plperlu jsonb_plperl
+DATA = jsonb_plperlu--1.0.sql jsonb_plperl--1.0.sql
+
+REGRESS = jsonb_plperl jsonb_plperl_relocatability jsonb_plperlu jsonb_plperlu_relocatability
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/jsonb_plperl
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We must link libperl explicitly
+ifeq ($(PORTNAME), win32)
+# these settings are the same as for plperl
+override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment
+# ... see silliness in plperl Makefile ...
+SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
+else
+rpathdir = $(perl_archlibexp)/CORE
+SHLIB_LINK += $(perl_embed_ldflags)
+endif
+
+# As with plperl we need to make sure that the CORE directory is included
+# last, probably because it sometimes contains some header files with names
+# that clash with some of ours, or with some that we include, notably on
+# Windows.
+override CPPFLAGS := $(CPPFLAGS) $(perl_embed_ccflags) -I$(perl_archlibexp)/CORE
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
new file mode 100644
index 000..9032f7e
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -0,0 +1,235 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing 

Re: master make check fails on Solaris 10

2018-01-12 Thread Marina Polyakova

On 11-01-2018 20:34, Tom Lane wrote:

Marina Polyakova  writes:

Hello, hackers! I got a permanent failure of master (commit
ca454b9bd34c75995eda4d07c9858f7c22890c2b) make check on Solaris 10.
Regression output and diffs are attached.


Hm, buildfarm member protosciurus is running a similar configuration
without problems.  Looking at its configuration, maybe you need to
fool with LD_LIBRARY_PATH and/or LDFLAGS_SL?


I added these parameters with the same values in configure 
(LDFLAGS_SL="-m64" 
LD_LIBRARY_PATH="/lib/64:/usr/lib/64:/usr/sfw/lib/64:/usr/local/lib"), 
there're the same failures :( (see the attached regression diffs and 
output)


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company*** /home/buildfarm/mpolyakova/postgresql_master_other_config/src/test/regress/expected/random.out	Fri Jan 12 10:59:09 2018
--- /home/buildfarm/mpolyakova/postgresql_master_other_config/src/test/regress/results/random.out	Fri Jan 12 11:23:56 2018
***
*** 46,52 
  
  SELECT AVG(random) FROM RANDOM_TBL
HAVING AVG(random) NOT BETWEEN 80 AND 120;
!  avg 
! -
! (0 rows)
  
--- 46,53 
  
  SELECT AVG(random) FROM RANDOM_TBL
HAVING AVG(random) NOT BETWEEN 80 AND 120;
!  avg  
! --
!  13835058055282163712
! (1 row)
  

==

*** /home/buildfarm/mpolyakova/postgresql_master_other_config/src/test/regress/expected/groupingsets.out	Fri Jan 12 10:59:08 2018
--- /home/buildfarm/mpolyakova/postgresql_master_other_config/src/test/regress/results/groupingsets.out	Fri Jan 12 11:24:35 2018
***
*** 143,156 
  -- nesting with window functions
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by rollup (a,b) order by rsum, a, b;
!  a | b | sum | rsum 
! ---+---+-+--
!  1 | 1 |   8 |8
!  1 | 2 |   2 |   10
!  1 |   |  10 |   20
!  2 | 2 |   2 |   22
!  2 |   |   2 |   24
!|   |  12 |   36
  (6 rows)
  
  -- nesting with grouping sets
--- 143,156 
  -- nesting with window functions
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by rollup (a,b) order by rsum, a, b;
!  a | b | sum |  rsum  
! ---+---+-+
!  1 | 1 |   8 |  0
!  1 | 2 |   2 |  79507068872943528402422333440
!  1 |   |  10 | 159014137745887056804844666880
!  2 | 2 |   2 | 238521206618830585207267000320
!  2 |   |   2 | 318028275491774113609689333760
!|   |  12 | 397535344364717642012111667200
  (6 rows)
  
  -- nesting with grouping sets
***
*** 544,559 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum | rsum 
! ---+---+-+--
!  1 | 1 |   8 |8
!  1 | 2 |   2 |   10
!  1 |   |  10 |   20
!  2 | 2 |   2 |   22
!  2 |   |   2 |   24
!| 1 |   8 |   32
!| 2 |   4 |   36
!|   |  12 |   48
  (8 rows)
  
  select a, b, sum(c) from (values (1,1,10),(1,1,11),(1,2,12),(1,2,13),(1,3,14),(2,3,15),(3,3,16),(3,4,17),(4,1,18),(4,1,19)) v(a,b,c) group by rollup (a,b);
--- 544,559 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum |  rsum  
! ---+---+-+
!  1 | 1 |   8 |  0
!  1 | 2 |   2 |  79523145874486552930199535616
!  1 |   |  10 | 159046291748973105860399071232
!  2 | 2 |   2 | 238569437623459658790598606848
!  2 |   |   2 | 318092583497946211720798142464
!| 1 |   8 | 397615729372432764650997678080
!| 2 |   4 | 477138875246919317581197213696
!|   |  12 | 556662021121405870511396749312
  (8 rows)
  
  select a, b, sum(c) from (values (1,1,10),(1,1,11),(1,2,12),(1,2,13),(1,3,14),(2,3,15),(3,3,16),(3,4,17),(4,1,18),(4,1,19)) v(a,b,c) group by rollup (a,b);
***
*** 1219,1234 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum | rsum 
! ---+---+-+--
!  1 | 1 |   8 |8
!  1 | 2 |   2 |   10
!  1 |   |  10 |   20
!  2 | 2 |   2 |   22
!  2 |   |   2 |   24
!| 1 |   8 |   32
!| 2 |   4 |   36
!|   |  12 |   48
  (8 rows)
  
  explain (costs off)
--- 1219,1234 
  
  select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
from gstest2 group by cube (a,b) order by rsum, a, b;
!  a | b | sum |  rsum  
! ---+---+-+
!  1 | 1 |   8 |  0
!  1 | 2 |   2 |  7952557021937969613436800
!  1 |   |  10 | 159051140438759392268622233600
!  2 | 2 |   2 | 238576710658139088402933350400
!  2 |   |   2 | 318102280877518784537244467200
!| 1 |   8 | 

Re: [HACKERS] [PATCH] Improve geometric types

2018-01-12 Thread Kyotaro HORIGUCHI
Hello, sorry for the absence.

(Sorry for the long and probably hard-to-read article..)

At Thu, 4 Jan 2018 14:53:47 +0100, Emre Hasegeli  wrote in 

> Rebased versions are attached.

Thank you for the new patch.

I'm looking at the 0001 patch. This does many things at once but seems
hard to split into step-by-step patches. So I reviewed it from
the viewpoint of how each of the external functions is
changed. (Attachment 1).

I think that this patch is not intending to change any behaviors
of the external functions, but intending to make them mathematically
reasonable. So some behavioral changes are inevitably
required. The problem here is what is the most reasonable (and
acceptable) behavior.

The comments below are just the start of a discussion about
some aspects of the design.  I'm sorry that this discussion might
change the way to go largely, but I'd like to hear what you
think about two topics.


I found seemingly inconsistent handling of NaN.

- The old box_same assumed NaN != NaN, but the new one assumes NaN ==
  NaN. I'm not sure how significant the difference is out of the
  context of sorting, though...

- box_overlap()'s behavior has not been changed. As a result,
  box_same and box_overlap now behave differently concerning
  NaN handling.

- line_construct_pts has been changed so that generates
  {NaN,-1,NaN} for two identical points (old code generated
  {-1,0,0}) but I think the right behavior here is erroring out.
  (as line_in behaves.)


I feel that it'd be better to define how to handle non-numbers
before refactoring.  I prefer either just denying NaN as a part of
the geometric types, or having any input containing NaN just yield a
result filled with NaNs.
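
Either policy could rest on a trivial check like the following (just a
sketch of mine, not something from the patch):

  /* needs <math.h> for isnan(); true if either coordinate is NaN */
  static inline bool
  point_isnan(const Point *p)
  {
      return isnan(p->x) || isnan(p->y);
  }

Callers would then either raise an error or deliberately fill the result
with NaNs.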


The next point is reasonability of the algorithm and consistency
among related functions.

- line_parallel considers the EPSILON (ugaa) at a different scale
  from the old code, but neither seems reasonable.. It might
  not be the time to touch it, but if we change the behavior,
  I'd like to choose a reasonable one. At least the tolerance of
  parallelism should be measured by rotation, not by slope.

I think that one reasonable way to do this is taking the distance
between the far ends of two direction (unit) vectors.
  
Assuming Ax + By + C = 0, the direction vector dv(x, y) for the
line is:

n = sqrt(A^2 + B^2), dv = (B / n, -A / n).

  (and the normal vector nv = (A / n, B / n))

The vector needs to be restricted to the upper (or any two
successive) quadrants so that it is usable for this purpose. So
let's restrict A to be a negative value for now. Something like the
following would be an answer.

  void line_direction_vector(Point *result, LINE *l)
  {
float8 A = l->A;
float8 B = l->B;
float8 n;

n = HYPOT(A, B);

/* keep the result vector within the 1st-2nd quadrants */
if (A > 0)
{
   A = -A;
   B = -B;
}
point_construct(result, B / n, -A / n);
  }

  bool line_parallel(LINE *l1, LINE *l2)
  {
Point d1, d2;

line_direction_vector(&d1, l1);
line_direction_vector(&d2, l2);
return FPzero(point_dt(&d1, &d2));
  }
  
Also perpendicularity is detected by comparing the direction
vector of one line and the normal vector of another in the same
manner.

  void line_normal_vector(Point *result, LINE *l)
  {
float8 A = l->A;
float8 B = l->B;
float8 n;

/* keep the result vector within the 1st-2nd quadrants */
n = HYPOT(A, B);
if (B < 0)
{
   A = -A;
   B = -B;
}
point_construct(result, A / n, B / n);
  }

  bool line_perp(LINE *l1, LINE *l2)
  {
Point d1, d2;

line_direction_vector(&d1, l1);
line_normal_vector(&d2, l2);
return FPzero(point_dt(&d1, &d2));
  }
  
In relation to this, equality of two lines is also a nuisance.
From the definitions, two equal lines are parallel. In
contraposition, two nonparallel lines cannot be equal. This
relationship is represented by the following expression:

 is_equal =: line_parallel(l1, l2) && FPzero(line_distance(l1, l2))

But line_distance returns zero in most cases even if the lines are
considered "parallel" only within the tolerance. This means that
there are almost no two lines that are parallel but not equal (that
is, truly parallel lines)... sigh.

So we might need to calculate the distance in a different (not
purely mathematical/geometrical) way. For example, if we can
assume the geometric objects are placed mainly around the origin,
we could take the distance between the points nearest to the origin
on both lines (in other words, between the feet of the
perpendiculars from the origin onto both lines). Of
course this is not ideal but...
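
Concretely, something like the following is what I have in mind here
(only a rough sketch, in the same spirit as the snippets above; the foot
of the perpendicular from the origin onto Ax + By + C = 0 is
(-AC/(A^2+B^2), -BC/(A^2+B^2))):

  /* foot of the perpendicular from the origin onto the line */
  void line_closest_point_to_origin(Point *result, LINE *l)
  {
    float8 n2 = l->A * l->A + l->B * l->B;

    point_construct(result, -l->A * l->C / n2, -l->B * l->C / n2);
  }

  bool line_eq(LINE *l1, LINE *l2)
  {
    Point p1, p2;

    if (!line_parallel(l1, l2))
      return false;
    line_closest_point_to_origin(&p1, l1);
    line_closest_point_to_origin(&p2, l2);
    return FPzero(point_dt(&p1, &p2));
  }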

The same discussion also affects line_interpt_line or other
similar functions.


At last, just a simple comment.

- point_eq_point() assumes that NaN == NaN. This is an inherited
  behavior from the old float[48]_cmp_internal() but it's not a
  common trait. point_eq_point() needs a comment about the
  special