Re: [HACKERS] Parallel build with MSVC

2016-09-07 Thread Noah Misch
On Mon, Sep 05, 2016 at 02:43:48PM +0900, Michael Paquier wrote:
> On Mon, Sep 5, 2016 at 9:18 AM, Noah Misch  wrote:
> > Every vcbuild and msbuild invocation ought to recognize this variable, so
> please update the two places involving ecpg_regression.proj.  Apart from
> that, the patch looks good.
> 
> Good catch.  I did not notice those when looking over the patches.
> I don't mean to step on Christian's toes, but here are updated
> patches.

Committed.

> Actually, is that worth adding for clean.bat?  I doubt that many people
> would care if MSBFLAGS were not supported there.  The cleanup script is
> not the bottleneck; the build script is.  I added it in the attached
> patch 0001, but I doubt it's worth it, to be honest.

If parallelism were the only consideration, then I would agree.  We don't
know, in general, how each operation will respond to arbitrary flags the user
selects.  I did remove your clean.bat documentation change; documenting
MSBFLAGS in the one place suffices.




Re: [HACKERS] autonomous transactions

2016-09-07 Thread Andres Freund
Hi,

On 2016-08-30 21:50:05 -0400, Peter Eisentraut wrote:
> I would like to propose the attached patch implementing autonomous
> transactions for discussion and review.
> 
> This work was mostly inspired by the discussion about pg_background a
> while back [0].  It seemed that most people liked the idea of having
> something like that, but couldn't perhaps agree on the final interface.
> Most if not all of the preliminary patches in that thread were
> committed, but the user interface portions were then abandoned in favor
> of other work.  (I'm aware that rebased versions of pg_background
> exist.  I have one, too.)

I kind of dislike this approach for a different reason than those
already mentioned in this thread: namely, it's not re-usable for
implementing streaming logical replication of large transactions, i.e.
allowing us to decode & apply uncommitted changes.  The issue there is
that one really needs to have multiple transactions active in the same
connection, which this approach does not allow.

That's not necessarily a fatal objection, but I did want to throw that
out there as a design goal.

- Andres




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Amit Kapila
On Thu, Sep 8, 2016 at 10:02 AM, Mark Kirkwood
 wrote:
>
> Repeating my tests with these new patches applied points to the hang issue
> being solved. I tested several 10 minute runs (any of which was enough to
> elicit the hang previously). I'll do some longer ones, but looks good!
>

Thanks for verifying the issue and doing further testing of the patch.
It is really helpful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-07 Thread Amit Kapila
On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes  wrote:
> On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila  wrote:
>>
>>
>> I have fixed all other issues you have raised.  Updated patch is
>> attached with this mail.
>
>
> I am finding the comments (particularly README) quite hard to follow.  There
> are many references to an "overflow bucket", or similar phrases.  I think
> these should be "overflow pages".  A bucket is a conceptual thing consisting
> of a primary page for that bucket and zero or more overflow pages for the
> same bucket.  There are no overflow buckets, unless you are referring to the
> new bucket to which things are being moved.
>

Hmm.  I think page or block is a database-system concept, whereas
bucket is a general concept used in hashing; the distinction there is
between primary buckets and overflow buckets.  I have checked how they
are referred to in one of the wiki pages [1] (search for "overflow" on
that page).  That said, we shouldn't be inconsistent in how we use the
terms, and I will make the usage consistent based on whatever you or
others think is the better way to refer to the overflow space.

> Was maintaining on-disk compatibility a major concern for this patch?  Would
> you do things differently if that were not a concern?
>

I would not have done much differently from what it is now.  However,
one thing I considered during development was changing the hash index
tuple structure as below to mark index tuples as moved-by-split:

typedef struct
{
    IndexTuple  entry;          /* tuple to insert */
    bool        moved_by_split;
} HashEntryData;

The other alternative was to use the (unused) bit in IndexTupleData->t_info.

I have chosen the latter approach.  One could certainly argue that this
is the last available bit in IndexTuple and that using it for hash
indexes may or may not be the best thing to do.  However, I also think
it is not advisable to break on-disk compatibility when we can use an
existing bit.  In any case, the same question will arise whenever anyone
wants to use that bit for some other purpose.
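
For illustration only (the mask name below is made up, not necessarily
what the patch uses, but 0x2000 is the one t_info bit not already
claimed by INDEX_SIZE_MASK (0x1FFF), INDEX_VAR_MASK (0x4000) and
INDEX_NULL_MASK (0x8000)), using that spare bit might look roughly like
this:

#include "access/itup.h"

#define HASH_MOVED_BY_SPLIT_MASK    0x2000      /* illustrative name */

static inline void
hash_tuple_mark_moved(IndexTuple itup)
{
    itup->t_info |= HASH_MOVED_BY_SPLIT_MASK;
}

static inline bool
hash_tuple_is_moved(IndexTuple itup)
{
    return (itup->t_info & HASH_MOVED_BY_SPLIT_MASK) != 0;
}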

> In particular, I am thinking about the need for every insert to
> exclusive-content-lock the meta page to increment the index-wide tuple
> count.

This is not something this patch changes.  The main purpose of this
patch is to change heavy-weight locking to light-weight locking and to
provide a way to handle incomplete splits, both of which are required to
sensibly write WAL for hash indexes.  Having said that, I agree with
your point that we can improve the insertion logic so that we don't need
to write-lock the metapage on each insert.  I have noticed some other
potential improvements in hash indexes during this work as well: caching
the metapage; reducing lock/unlock calls when retrieving tuples from a
page by making hash index scans work a page at a time, as btree scans
do; improving the kill_prior_tuple mechanism, which is currently quite
naive; and, biggest of all, improving the create-index logic, which
inserts tuple-by-tuple whereas btree operates at the page level and also
bypasses shared buffers.  One of these improvements (caching the meta
page) is already being worked on by my colleague, and the patch [2] for
it is in the CF.  The main point I want to highlight is that, apart from
what this patch does, there are a number of other areas in hash indexes
that need improvement, and I think it is better to do those as separate
enhancements rather than as a single patch.

>  I think that this is going to be a huge bottleneck on update
> intensive workloads (which I don't believe have been performance tested as
> of yet).

I have done some performance testing with this patch, and I found a
significant improvement over what we have now in hash indexes, even for
a read-write workload.  A better comparison would be against btree, but
in any case, even if this proves to be a bottleneck, we should try to
improve it in a separate patch rather than as part of this one.

>  I was wondering if we might not want to change that so that each
> bucket keeps a local count, and sweeps that up to the meta page only when it
> exceeds a threshold.  But this would require the bucket page to have an area
> to hold such a count.  Another idea would to keep not a count of tuples, but
> of buckets with at least one overflow page, and split when there are too
> many of those.

I think both of these ideas could change the point (tuple count) at
which we currently split, which might impact search speed and space
usage.  Yet another alternative would be to make hashm_ntuples 64-bit
and use 64-bit atomics to operate on it, or perhaps protect it with a
separate spinlock.  However, whatever we decide to do here, I think it
is a matter for a separate patch.
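
Purely as an illustration of the atomics alternative (the type and
function names below are made up; hashm_ntuples is not actually an
atomic today):

#include "port/atomics.h"

typedef struct HashTupleCounter
{
    pg_atomic_uint64    ntuples;
} HashTupleCounter;

static void
hash_counter_init(HashTupleCounter *cnt)
{
    pg_atomic_init_u64(&cnt->ntuples, 0);
}

static void
hash_counter_bump(HashTupleCounter *cnt)
{
    /* no exclusive lock on the metapage needed just to count an insert */
    pg_atomic_fetch_add_u64(&cnt->ntuples, 1);
}

static uint64
hash_counter_read(HashTupleCounter *cnt)
{
    return pg_atomic_read_u64(&cnt->ntuples);
}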


Thanks for looking into the patch.

[1] - https://en.wikipedia.org/wiki/Linear_hashing
[2] - https://commitfest.postgresql.org/10/715/


-- 
With Regards,
Amit Kapila.

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Mark Kirkwood

On 07/09/16 21:58, Amit Kapila wrote:


On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes  wrote:

On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila 
wrote:

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes  wrote:


After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] & (1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.


Here is what is happening: when it tries to clear the bitmap bit, it
expects the bit to be set.  The reason it didn't find the bit set could
be that after the new overflow page was added and the bit corresponding
to it was set, you crashed the system, and replay did not set the bit.
Then, while freeing the overflow page, it can hit the Assert as you
mention.  I think the problem is that I am using REGBUF_STANDARD to log
the bitmap page updates, which seems to be causing the issue.  As the
bitmap page doesn't follow the standard page layout, the full-page image
would have omitted the actual contents, and then during replay the bit
would not have been set because the page doesn't need REDO.  I think the
fix here is to use REGBUF_NO_IMAGE, as we use for vm buffers.
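
As a rough illustration only (this is not the patch's actual WAL code,
and the record being inserted here is just a placeholder), the
difference amounts to how the bitmap buffer is registered: with
REGBUF_STANDARD a full-page image drops the "hole" between pd_lower and
pd_upper, which is where the bitmap page keeps its bits, so replay can
lose them; REGBUF_NO_IMAGE (as used for visibility-map buffers) avoids
that.

#include "access/xloginsert.h"

static void
log_bitmap_update(Buffer bitmapbuf)
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterBuffer(0, bitmapbuf, REGBUF_NO_IMAGE); /* not REGBUF_STANDARD */
    recptr = XLogInsert(RM_HASH_ID, 0); /* record info byte is a placeholder */
    PageSetLSN(BufferGetPage(bitmapbuf), recptr);
}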

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.



The test is rather awkward; it might be easier to just have me test it.


Okay, I have fixed this issue as explained above.  Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and a few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that, while adding an
overflow page, the patch didn't maintain the locking order it uses for
other write operations (lock the bucket pages first, then the metapage,
to perform the write).  I have added comments in _hash_addovflpage() to
explain the locking order used in the modified patch.
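
As a sketch of that ordering only (the function below is made up, not
code from the patch): take the bucket page lock first, then the metapage
lock, the same order the other write operations use.

#include "access/hash.h"

static void
bucket_then_meta(Relation rel, BlockNumber bucket_blkno)
{
    Buffer      bucketbuf;
    Buffer      metabuf;

    bucketbuf = _hash_getbuf(rel, bucket_blkno, HASH_WRITE, LH_BUCKET_PAGE);
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);

    /* ... perform the write operation ... */

    _hash_relbuf(rel, metabuf);
    _hash_relbuf(rel, bucketbuf);
}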

During stress testing with pgbench using a master-standby setup, we
found an issue indicating that the WAL replay machinery doesn't expect
completely zeroed pages (see the explanation of RBM_NORMAL mode atop
XLogReadBufferExtended).  Previously we zeroed the overflow page before
freeing it; now I have changed it to just initialize the page, so that
the page is a valid empty page.
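
A minimal sketch of that change (the wrapper below is illustrative;
_hash_pageinit() is the existing hash page-initialization routine):

#include "access/hash.h"

static void
reinit_freed_ovfl_page(Buffer ovflbuf)
{
    Page        page = BufferGetPage(ovflbuf);

    /* previously: MemSet(page, 0, BufferGetPageSize(ovflbuf)); */
    _hash_pageinit(page, BufferGetPageSize(ovflbuf));
    MarkBufferDirty(ovflbuf);
}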

Apart from the above, I have added support for the old snapshot
threshold in the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch; many thanks to
him for his input.




Repeating my tests with these new patches applied points to the hang 
issue being solved. I tested several 10 minute runs (any of which was 
enough to elicit the hang previously). I'll do some longer ones, but 
looks good!


regards

Mark




[HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

2016-09-07 Thread Michael Paquier
Hi all,

While looking at the MSVC scripts for something else, I noticed the following:
if ($options->{xml})
{
    if (!($options->{xslt} && $options->{iconv}))
    {
        die "XML requires both XSLT and ICONV\n";
    }
}
To be honest, I don't understand the reason behind such a restriction,
because libxml2 does not depend on libxslt.  The opposite is true:
libxslt needs libxml2.  Note as well that libxml2 does depend on ICONV,
though.  So I think this condition should be relaxed as follows:
if ($options->{xml} && !$options->{iconv})
{
    die "XML requires ICONV\n";
}
We also need to be sure that when libxslt is specified, libxml2 is
present, so that the build works correctly.

Relaxing that would allow people to compile contrib/xml2 with a
dependency on just libxml2, without libxslt, which is possible on any
*nix system.  As far as I can see, this restriction dates back nine
years, to 2007 and commit 7f58ed1a, so nobody has complained about it so
far :)

Attached is a patch to address both issues.
Comments are welcome.
-- 
Michael


msvc_xml_relax.patch
Description: application/download



Re: [HACKERS] autonomous transactions

2016-09-07 Thread Craig Ringer
On 8 September 2016 at 08:18, Tsunakawa, Takayuki
 wrote:
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Craig Ringer
>> Of course, if we could decrease the startup cost of a bgworker
>
>> For this use in autonomous tx's we could probably pool workers. Or at least
>> lazily terminate them so that the loop cases work better by re-using an
>> existing bgworker.
>
> I’m not sure whether to be nervous about the context switch cost in the use
> cases of autonomous transactions.

That's probably going to be one of the smaller costs. Doing this with
bgworkers won't be cheap, but you need to consider the alternative.
Factoring out all transaction-specific data currently stored in or
pointed to by globals into a transaction state struct that can be
swapped out. Making PgXact and PGPROC capable of representing
multiple/suspended transactions. Lots more. Much of that would have a
performance impact on all day-to-day operations whether or not
autonomous xacts are actually in use.

I haven't looked into it in detail. Peter can probably explain more
and better. I'm just pointing out that I doubt there's any way to do
this without a cost somewhere, and having that cost limited to actual
uses of autonomous xacts would be nice.

(BTW, could you please set your reply style so that your mail client
quotes text and it's possible to see what you wrote vs what is
quoted?)

-- 
-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-07 Thread Peter Geoghegan
On Wed, Sep 7, 2016 at 2:42 PM, Tom Lane  wrote:
> The reason it's called siftup is that that's what Knuth calls it.
> See Algorithm 5.2.3H (Heapsort), pp 146-147 in the first edition of
> Volume 3; tuplesort_heap_siftup corresponds directly to steps H3-H8.

I see that Knuth explains that these steps form the siftup procedure.
What steps does tuplesort_heap_insert correspond to, if any?

Algorithm 5.2.3H is about heapsort, and so the Wikipedia article on heapsort
(not the one on heaps in general, which I referenced first) may be a
useful reference here. It's also freely available. Consider the
Wikipedia pseudocode for siftup [1], from classic heapsort. That
version goes from child to parent each iteration (in the first
iteration, "child" is initialized to "end" -- not root). So, it
*ascends* the heap ("child" is assigned from "parent" for any
subsequent iterations). OTOH, tuplesort_heap_siftup always starts from
the root of tuplesort's top-level heap (or the "hole" at the root
position, if you prefer), and *descends* the heap.
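
To make the two directions concrete, here is a toy max-heap on a plain
int array (illustrative only, not tuplesort's code):

/* Classic "sift up": start at a child (e.g. a newly appended leaf) and ascend. */
static void
sift_up(int *heap, int child)
{
    while (child > 0)
    {
        int     parent = (child - 1) / 2;
        int     tmp;

        if (heap[parent] >= heap[child])
            break;
        tmp = heap[parent];
        heap[parent] = heap[child];
        heap[child] = tmp;
        child = parent;
    }
}

/* "Shift down": start at the root (or the hole left there) and descend. */
static void
shift_down(int *heap, int n)
{
    int     i = 0;

    for (;;)
    {
        int     largest = i;
        int     left = 2 * i + 1;
        int     right = 2 * i + 2;
        int     tmp;

        if (left < n && heap[left] > heap[largest])
            largest = left;
        if (right < n && heap[right] > heap[largest])
            largest = right;
        if (largest == i)
            break;
        tmp = heap[i];
        heap[i] = heap[largest];
        heap[largest] = tmp;
        i = largest;
    }
}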

> I think this patch is misguided, as it unnecessarily moves the code
> away from standard nomenclature.

As I mentioned, the patch is guided by the descriptions of fundamental
operations on heaps from Wikipedia (I didn't think much about heapsort
until now). I'm not really proposing to change things in a way that is
inconsistent with Knuth (regardless of how the term siftup is
understood). I'm proposing to change things in a way that seems
clearer overall, particularly given the way that these various
routines are used in fairly distinct contexts.

The terminology in this area can be confusing. My Sedgewick/Wayne
Algorithms book never uses the term shift or sift when discussing
heaps, even once in passing. They call shift-up "swim" and shift-down
"sink", possibly because they'd like to avoid the baggage that other
terminology carries.

I propose, as a compromise, to not introduce the term "shift down" at
all in this patch, but to still rename tuplesort_heap_siftup to
tuplesort_heap_compact. Even assuming that I'm wrong about siftup
here, I think that that still represents an improvement. Would that be
acceptable to you?

[1] https://en.wikipedia.org/wiki/Heapsort#Pseudocode
-- 
Peter Geoghegan




Re: [HACKERS] autonomous transactions

2016-09-07 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Craig Ringer
> Of course, if we could decrease the startup cost of a bgworker

For this use in autonomous tx's we could probably pool workers. Or at least 
lazily terminate them so that the loop cases work better by re-using an 
existing bgworker.



Though I may be saying something odd, isn’t the bgworker approach going to
increase context switches?  I thought PostgreSQL has made efforts to decrease
context switches, e.g.:

* Each backend writes WAL to disk itself, unlike Oracle, where backends ask
the LGWR process to write REDO to disk.

* Releasing and re-acquiring an LWLock is designed to avoid context switches:
    /*
     * Loop here to try to acquire lock after each time we are signaled by
     * LWLockRelease.
     *
     * NOTE: it might seem better to have LWLockRelease actually grant us the
     * lock, rather than retrying and possibly having to go back to sleep. But
     * in practice that is no good because it means a process swap for every
     * lock acquisition when two or more processes are contending for the same
     * lock.  Since LWLocks are normally used to protect not-very-long
     * sections of computation, a process needs to be able to acquire and
     * release the same lock many times during a single CPU time slice, even
     * in the presence of contention.  The efficiency of being able to do that
     * outweighs the inefficiency of sometimes wasting a process dispatch
     * cycle because the lock is not free when a released waiter finally gets
     * to run.  See pgsql-hackers archives for 29-Dec-01.
     */



I’m not sure whether to be nervous about the context switch cost in the use 
cases of autonomous transactions.



Regards

Takayuki Tsunakawa




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Michael Paquier
On Thu, Sep 8, 2016 at 8:56 AM, Tom Lane  wrote:
> Vik Fearing  writes:
>> On 09/08/2016 01:05 AM, Tom Lane wrote:
>>> I'm pretty much -1 on printing a warning.  There's no ambiguity, and no
>>> real reason for us ever to remove the old spellings.  Standardizing on
>>> "no-" going forward makes sense, but let's not slap people's wrists for
>>> existing usage.  (Or: if it ain't broke, don't break it.)
>
>> One could also argue that 2 out of 53 "no" options omitting the dash is
>> in fact broken, and a real reason to remove them.
>
> I do not buy that.  As a counter argument, consider that removing them
> would make it impossible to write a script that works with both old
> and new versions of PG.  That's a mighty high price to pay for what
> is little more than pedantry.

Perhaps discussing that on another thread would be better, and I was
the one who began this thing... So I'll do it.

Vik's patch just adds --wait and --no-wait long-option aliases to
pg_ctl, and that clearly improves readability for users, so that's a +1
from here.  Let's just use the v1 presented at the beginning of this
thread.  I also agree with the sentiment that standardizing such option
names would be better.
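
For reference, the change is essentially just two new entries in
pg_ctl's long-option table, something along these lines (a sketch, not
the actual diff):

#include "getopt_long.h"

static struct option long_options[] = {
    /* ... existing entries ... */
    {"wait", no_argument, NULL, 'w'},
    {"no-wait", no_argument, NULL, 'W'},
    {NULL, 0, NULL, 0}
};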
-- 
Michael




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Michael Paquier
On Thu, Sep 8, 2016 at 5:41 AM, Alvaro Herrera  wrote:
> Gavin Flower wrote:
>
>> possibly '--nosync' (& any similar) should have a '--no-sync' variation
>> added, with the '--nosync' variation documented as deprecated?
>
> I agree -- I would go as far as just documenting --no-sync only and
> keeping the --nosync one working with minimal (if any) visibility in
> docs.

Keeping it out of the docs entirely, with the alias still working in
the binaries, sounds fine to me if we want to standardize things a bit
more.
-- 
Michael




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Tom Lane
Vik Fearing  writes:
> On 09/08/2016 01:05 AM, Tom Lane wrote:
>> I'm pretty much -1 on printing a warning.  There's no ambiguity, and no
>> real reason for us ever to remove the old spellings.  Standardizing on
>> "no-" going forward makes sense, but let's not slap people's wrists for
>> existing usage.  (Or: if it ain't broke, don't break it.)

> One could also argue that 2 out of 53 "no" options omitting the dash is
> in fact broken, and a real reason to remove them.

I do not buy that.  As a counter argument, consider that removing them
would make it impossible to write a script that works with both old
and new versions of PG.  That's a mighty high price to pay for what
is little more than pedantry.

regards, tom lane




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Vik Fearing
On 09/08/2016 01:05 AM, Tom Lane wrote:
> Vik Fearing  writes:
>> On 09/07/2016 11:39 PM, Gavin Flower wrote:
>>> Possibly generate warnings when the non-hyphenated form is used?
> 
>> I'm not quite sure how I got volunteered to do this work, but it's easy
>> enough so I don't mind.
>> Here is a new patch that emits a warning when --noclean and/or --nosync
>> are used.
> 
> I'm pretty much -1 on printing a warning.  There's no ambiguity, and no
> real reason for us ever to remove the old spellings.  Standardizing on
> "no-" going forward makes sense, but let's not slap people's wrists for
> existing usage.  (Or: if it ain't broke, don't break it.)

One could also argue that 2 out of 53 "no" options omitting the dash is
in fact broken, and a real reason to remove them.

I don't see the warning as "slapping wrists" so much as saying that
we're harmonizing our conventions and their scripts need to be updated
for the better.

That said, I'm not going to fight for this.  My only goal here is to get
--wait and --no-wait added to pg_ctl.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




[HACKERS] High-CPU consumption on information_schema (only) query

2016-09-07 Thread Robins Tharakan
Hi,

An SQL query (with only information_schema-related JOINs), when triggered,
runs at max CPU and never ends (killed after 2 days).
- It runs similarly (very slowly) on a replicated server that acts as a
read-only slave.
- Top shows only postgres hitting max CPU (nothing else); when the query is
killed, CPU drops to near 0%.
- When the DB is restored on a separate test server (with the exact same
postgresql.conf), the same query works fine.
- There is no concurrent usage on the replicated / test server (although
the primary is a production server and has concurrent users).

Questions:
- If this were a postgres bug or a configuration issue, the query on the
restored DB should have been slow too. Is there something very basic I am
missing here?

If someone asks I can provide the SQL + EXPLAIN, but it feels irrelevant
here. I'm not looking for a specific solution, but what else should I be
looking for here?

p.s.: All the postgres servers are running 9.3.10

-
robins


Re: [HACKERS] autonomous transactions

2016-09-07 Thread Craig Ringer
On 8 Sep. 2016 3:47 am, "Robert Haas"  wrote:
>

> Of course, if we could decrease the startup cost of a bgworker

For this use in autonomous tx's we could probably pool workers. Or at least
lazily terminate them so that the loop cases work better by re-using an
existing bgworker.

I'm pretty sure we're going to need a bgworker pool sooner or later anyway.


Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Tom Lane
Vik Fearing  writes:
> On 09/07/2016 11:39 PM, Gavin Flower wrote:
>> Possibly generate warnings when the non-hyphenated form is used?

> I'm not quite sure how I got volunteered to do this work, but it's easy
> enough so I don't mind.
> Here is a new patch that emits a warning when --noclean and/or --nosync
> are used.

I'm pretty much -1 on printing a warning.  There's no ambiguity, and no
real reason for us ever to remove the old spellings.  Standardizing on
"no-" going forward makes sense, but let's not slap people's wrists for
existing usage.  (Or: if it ain't broke, don't break it.)

regards, tom lane




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Vik Fearing
On 09/07/2016 11:39 PM, Gavin Flower wrote:
> On 08/09/16 09:08, Vik Fearing wrote:
>> On 09/07/2016 10:41 PM, Alvaro Herrera wrote:
>>> Gavin Flower wrote:
>>>
 possibly '--nosync' (& any similar) should have a '--no-sync' variation
 added, with the '--nosync' variation documented as deprecated?
>>> I agree -- I would go as far as just documenting --no-sync only and
>>> keeping the --nosync one working with minimal (if any) visibility in
>>> docs.
>> Okay, here's a patch to do that.  I don't think it's the other patch's
>> job to do it.
>>
>> I also changed --noclean to --no-clean, and --no-locale was already
>> correct.
> 
> Suggest a comment along the lines "Where flags of the form --xxx have a
> negated form, then the preferred negated form is --no-xxx - and that any
> existing use of the form --noxxx should be converted to --no-xxx, as the
> non hyphenated form is now deprecated & will be removed in a future
> version of Postgres."

I have verified that these are the only two options anywhere in the
tree that start with "no" rather than "no-", so it should be pretty easy
for future options to conform on their own.  I don't see adding a
comment like this to every long-option definition block as very helpful,
and adding it only to initdb is just weird.  So I don't see the need for
it.

> Possibly generate warnings when the non-hyphenated form is used?

I'm not quite sure how I got volunteered to do this work, but it's easy
enough so I don't mind.

Here is a new patch that emits a warning when --noclean and/or --nosync
are used.
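
Roughly what that looks like in initdb's option-parsing switch (a
sketch only; the option letter, variable name and message wording here
are illustrative, not necessarily what the patch uses):

switch (c)
{
    /* ... existing options ... */
    case 'n':           /* --noclean: deprecated spelling of --no-clean */
        fprintf(stderr, _("%s: WARNING: option --noclean is deprecated, use --no-clean instead\n"),
                progname);
        noclean = true;
        break;
    /* ... */
}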
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


initdb_no_opts_02.patch
Description: invalid/octet-stream



Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Tom Lane
Heikki Linnakangas  writes:
> Unfortunately, sqrt(x) isn't very cheap.

You'd be surprised: sqrt is built-in on most modern hardware.  On my
three-year-old workstation, sqrt(x) seems to take about 2.6ns.  For
comparison, the pack_float version posted in

takes 3.9ns (and draws multiple compiler warnings, too).

regards, tom lane




Re: [HACKERS] (Comment)Bug in CteScanNext

2016-09-07 Thread Jim Nasby

On 9/6/16 12:00 PM, Tom Lane wrote:

On the other hand, if eof_cte is true, then what happened on the last
call is that we tried to fetch forwards, reached EOF on the underlying
query, and returned NULL.  In that case, a backwards fetch *should*
produce the last row in the tuplestore.


Patch attached.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..64248cb 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -53,8 +53,21 @@ CteScanNext(CteScanState *node)
 	 */
 	eof_tuplestore = tuplestore_ateof(tuplestorestate);
 
+	/*
+	 * Before fetching, we need to handle a special case: when reversing
+	 * direction at the end of the tuplestore, the next
+	 * tuplestore_gettupleslot() call will return the last tuple. But since
+	 * that tuple was just seen, we want to move past it. This is necessary
+	 * because the tuplestore get routines always move the current pointer,
+	 * unless they hit the end (or beginning) of the store.
+	 */
 	if (!forward && eof_tuplestore)
 	{
+		/*
+		 * However, once we hit the end of the underlying node any call while
+		 * at the end of the tuplestore will return NULL. In that case, we DO
+		 * want to return the last row in the tuplestore.
+		 */
 		if (!node->leader->eof_cte)
 		{
 			/*
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..197b91b 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -82,16 +82,23 @@ ExecMaterial(MaterialState *node)
 	eof_tuplestore = (tuplestorestate == NULL) ||
 		tuplestore_ateof(tuplestorestate);
 
+	/*
+	 * Before fetching, we need to handle a special case: when reversing
+	 * direction at the end of the tuplestore, the next
+	 * tuplestore_gettupleslot() call will return the last tuple. But since
+	 * that tuple was just seen, we want to move past it. This is necessary
+	 * because the tuplestore get routines always move the current pointer,
+	 * unless they hit the end (or beginning) of the store.
+	 */
 	if (!forward && eof_tuplestore)
 	{
+		/*
+		 * However, once we hit the end of the underlying node any call while
+		 * at the end of the tuplestore will return NULL. In that case, we DO
+		 * want to return the last row in the tuplestore.
+		 */
 		if (!node->eof_underlying)
 		{
-			/*
-			 * When reversing direction at tuplestore EOF, the first
-			 * gettupleslot call will fetch the last-added tuple; but we want
-			 * to return the one before that, if possible. So do an extra
-			 * fetch.
-			 */
 			if (!tuplestore_advance(tuplestorestate, forward))
 				return NULL;	/* the tuplestore must be empty */
 		}



Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-07 Thread Tom Lane
Peter Geoghegan  writes:
> On Fri, Aug 12, 2016 at 4:30 PM, Peter Geoghegan  wrote:
>> Doesn't tuplesort_heap_siftup() actually shift-down?

The reason it's called siftup is that that's what Knuth calls it.
See Algorithm 5.2.3H (Heapsort), pp 146-147 in the first edition of
Volume 3; tuplesort_heap_siftup corresponds directly to steps H3-H8.

> Heikki (CC'd) and I discussed this privately today, and we were in
> agreement that this needs to be fixed, so I wrote a patch, which I
> attach here.

I think this patch is misguided, as it unnecessarily moves the code
away from standard nomenclature.

regards, tom lane




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Gavin Flower

On 08/09/16 09:08, Vik Fearing wrote:

On 09/07/2016 10:41 PM, Alvaro Herrera wrote:

Gavin Flower wrote:


possibly '--nosync' (& any similar) should have a '--no-sync' variation
 added, with the '--nosync' variation documented as deprecated?

I agree -- I would go as far as just documenting --no-sync only and
keeping the --nosync one working with minimal (if any) visibility in
docs.

Okay, here's a patch to do that.  I don't think it's the other patch's
job to do it.

I also changed --noclean to --no-clean, and --no-locale was already correct.


Suggest a comment along the lines "Where flags of the form --xxx have a 
negated form, then the preferred negated form is --no-xxx - and that any 
existing use of the form --noxxx should be converted to --no-xxx, as the 
non hyphenated form is now deprecated & will be removed in a future 
version of Postgres."


Possibly generate warnings when the non-hyphenated form is used?

Cheers,
Gavin





Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-07 Thread Heikki Linnakangas

On 09/07/2016 09:17 AM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas  wrote:

The big picture here is that you can't only USEMEM() for tapes as the
need arises for new tapes as new runs are created. You'll just run a
massive availMem deficit, that you have no way of paying back, because
you can't "liquidate assets to pay off your creditors" (e.g., release
a bit of the memtuples memory). The fact is that memtuples growth
doesn't work that way. The memtuples array never shrinks.



Hmm. But memtuples is empty, just after we have built the initial runs. Why
couldn't we shrink, i.e. free and reallocate, it?


After we've built the initial runs, we do in fact give a FREEMEM()
refund to those tapes that were not used within beginmerge(), as I
mentioned just now (with a high workMem, this is often the great
majority of many thousands of logical tapes -- that's how you get to
wasting 8% of 5GB of maintenance_work_mem).


Peter and I chatted about this over IM. Let me try to summarize the problems,
and my plan:


1. When we start to build the initial runs, we currently reserve memory 
for tape buffers, maxTapes * TAPE_BUFFER_OVERHEAD. But we only actually 
need the buffers for tapes that are really used. We "refund" the buffers 
for the unused tapes after we've built the initial runs, but we're still 
wasting that while building the initial runs. We didn't actually 
allocate it, but we could've used it for other things. Peter's solution 
to this was to put a cap on maxTapes.


2. My observation is that during the build-runs phase, you only actually 
need those tape buffers for the one tape you're currently writing to. 
When you switch to a different tape, you could flush and free the 
buffers for the old tape. So reserving maxTapes * TAPE_BUFFER_OVERHEAD 
is excessive, 1 * TAPE_BUFFER_OVERHEAD would be enough. logtape.c 
doesn't have an interface for doing that today, but it wouldn't be hard 
to add.


3. If we do that, we'll still have to reserve the tape buffers for all 
the tapes that we use during merge. So after we've built the initial 
runs, we'll need to reserve memory for those buffers. That might require 
shrinking memtuples. But that's OK: after building the initial runs, 
memtuples is empty, so we can shrink it.
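
A rough sketch of point 3 using the existing accounting macros from
tuplesort.c (the sizing variable here is made up, and this is not the
actual patch):

FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
pfree(state->memtuples);

state->memtupsize = merge_heap_slots;   /* assumed: one slot per input tape */
state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));

/* and reserve buffer space only for the tapes actually used by the merge */
USEMEM(state, state->activeTapes * TAPE_BUFFER_OVERHEAD);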


- Heikki




Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Heikki Linnakangas

On 09/07/2016 09:20 PM, Andrew Borodin wrote:

Well, arithmetic is too fragile.

This is a version of float packing that uses arithmetic instead of bit hacking:

static float
pack_float(float actualValue, int realm)
{
    double  max, min;

    max = FLT_MAX / (8 >> realm);
    min = FLT_MAX / (16 >> realm);
    if (realm == 0)
        min = 0;
    /* squeeze the actual value between min and max */
    return (min + (actualValue * (max - min) / FLT_MAX));
}

Inserts perform the same as with the bit-hacked pack_float, but selects are 5 times slower.
When we try to pack the value linearly into a range, we lose too many bits.


That's why I suggested scaling it by the new value's volume and/or 
edge-sum. I was hoping that the old and new values are roughly of the 
same magnitude, so that it would work out. I guess not.


But we could probably do something like the above too, if we use 
logarithmic or exponential, rather than linear, packing. Something like:


static float
pack_float(float actualValue, int realm)
{
    double  val;

    val = sqrt(sqrt(actualValue));

    if (realm == 0)
        return val;
    if (realm == 1)
        return val * sqrt(sqrt(FLT_MAX));
    if (realm == 2)
        return val * sqrt(FLT_MAX);
    return val;
}

Unfortunately, sqrt(x) isn't very cheap.

- Heikki





Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-07 Thread Peter Geoghegan
On Fri, Aug 12, 2016 at 4:30 PM, Peter Geoghegan  wrote:
> Doesn't tuplesort_heap_siftup() actually shift-down?
>
> The Wikipedia article on heaps [1] lists "shift-down" (never "sift
> down", FWIW) as a common operation on a heap:
>
> "shift-down: move a node down in the tree, similar to shift-up; used
> to restore heap condition after deletion or replacement."
>
> This seems to be what tuplesort_heap_siftup() actually does (among
> other things; its job is to compact the heap following caller
> returning the root, where caller leaves a "hole" at the root position
> 0).

Heikki (CC'd) and I discussed this privately today, and we were in
agreement that this needs to be fixed, so I wrote a patch, which I
attach here.


-- 
Peter Geoghegan
From b747c088e5ad7fb831e21e6d3a503f594cbc8e12 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Wed, 7 Sep 2016 13:51:38 -0700
Subject: [PATCH] Rename tuplesort_heap_siftup function

tuplesort_heap_siftup is renamed to tuplesort_heap_compact.  The name
tuplesort_heap_siftup did not accurately represent the underlying
behavior.  The function shifts down, rather than sifting (shifting up).

The new name is chosen to describe the function at a slightly higher
level, since the underlying heap operation is just an implementation
detail.
---
 src/backend/utils/sort/tuplesort.c | 28 +---
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index aa8e0e4..02d4b6b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -570,7 +570,7 @@ static void sort_bounded_heap(Tuplesortstate *state);
 static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 	  int tupleindex, bool checkIndex);
-static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
+static void tuplesort_heap_compact(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
@@ -1617,9 +1617,9 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 			}
 			else
 			{
-/* discard top of heap, sift up, insert new tuple */
+/* discard root of heap to compact, insert new tuple */
 free_sort_tuple(state, &state->memtuples[0]);
-tuplesort_heap_siftup(state, false);
+tuplesort_heap_compact(state, false);
 tuplesort_heap_insert(state, tuple, 0, false);
 			}
 			break;
@@ -1987,7 +1987,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
  * more generally.
  */
 *stup = state->memtuples[0];
-tuplesort_heap_siftup(state, false);
+tuplesort_heap_compact(state, false);
 if ((tupIndex = state->mergenext[srcTape]) == 0)
 {
 	/*
@@ -2642,8 +2642,7 @@ mergeonerun(Tuplesortstate *state)
 		/* writetup adjusted total free space, now fix per-tape space */
 		spaceFreed = state->availMem - priorAvail;
 		state->mergeavailmem[srcTape] += spaceFreed;
-		/* compact the heap */
-		tuplesort_heap_siftup(state, false);
+		tuplesort_heap_compact(state, false);
 		if ((tupIndex = state->mergenext[srcTape]) == 0)
 		{
 			/* out of preloaded data on this tape, try to read more */
@@ -3275,13 +3274,12 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 			 * Still holding out for a case favorable to replacement
 			 * selection. Still incrementally spilling using heap.
 			 *
-			 * Dump the heap's frontmost entry, and sift up to remove it from
-			 * the heap.
+			 * Dump the heap's root entry, and remove it from the heap.
 			 */
 			Assert(state->memtupcount > 0);
 			WRITETUP(state, state->tp_tapenum[state->destTape],
 	 &state->memtuples[0]);
-			tuplesort_heap_siftup(state, true);
+			tuplesort_heap_compact(state, true);
 		}
 		else
 		{
@@ -3663,7 +3661,7 @@ make_bounded_heap(Tuplesortstate *state)
 			if (state->memtupcount > state->bound)
 			{
 free_sort_tuple(state, &state->memtuples[0]);
-tuplesort_heap_siftup(state, false);
+tuplesort_heap_compact(state, false);
 			}
 		}
 	}
@@ -3693,8 +3691,8 @@ sort_bounded_heap(Tuplesortstate *state)
 	{
 		SortTuple	stup = state->memtuples[0];
 
-		/* this sifts-up the next-largest entry and decreases memtupcount */
-		tuplesort_heap_siftup(state, false);
+		/* this puts next-largest entry at root, and decreases memtupcount */
+		tuplesort_heap_compact(state, false);
 		state->memtuples[state->memtupcount] = stup;
 	}
 	state->memtupcount = tupcount;
@@ -3784,10 +3782,10 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 
 /*
  * The tuple at state->memtuples[0] has been removed from the heap.
- * Decrement memtupcount, and sift up to maintain the heap invariant.
+ * Decrement memtupcount, and shift down to maintain the heap invariant.
  */
 static void

Re: [HACKERS] \timing interval

2016-09-07 Thread Corey Huinker
>
> ... and it would probably greatly reduce the amount of mailing list
> traffic asking for version if nothing else.


That was the major reason for wanting it.
The second is that if an explain were posted to a forum like stackexchange,
the reader wouldn't have to wonder what version produced the plan.


Re: [HACKERS] \timing interval

2016-09-07 Thread Jim Nasby

On 9/6/16 1:45 PM, Tom Lane wrote:

It's sorta out of my hands now, but what Tom said earlier is that because
> this is client-side code, it wouldn't use existing interval code.
> EXPLAIN *is* server-side, we couldn't use this code, but we could leverage
> existing interval code there to achieve a similar concept.

If we like this specific output format, I'd be inclined to just
copy-and-paste the code from psql.  I seriously doubt that getting type
interval involved in the discussion would lead to a shorter or
better-performing solution.


If we could actually execute user functions as part of EXPLAIN 
generating its output, then it might be a lot less invasive... but I 
don't think that's an option.



> I have another thing I'd like to add to EXPLAIN output : server version
> number output. So maybe we can pick those up in another thread.

Ugh.  There are multiple ways to get that already, and it's not like
space in EXPLAIN's output is not a precious resource.


I don't think adding a line would be that bad, and it would probably 
greatly reduce the amount of mailing list traffic asking for version if 
nothing else. It might also be useful to tools like 
https://explain.depesz.com/. If nothing else it's probably worth adding 
to the non-text output formats.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Vik Fearing
On 09/07/2016 10:41 PM, Alvaro Herrera wrote:
> Gavin Flower wrote:
> 
>> possibly '--nosync' (& any similar) should have a '--no-sync' variation
>> added, with the '--nosync' variation documented as deprecated?
> 
> I agree -- I would go as far as just documenting --no-sync only and
> keeping the --nosync one working with minimal (if any) visibility in
> docs.

Okay, here's a patch to do that.  I don't think it's the other patch's
job to do it.

I also changed --noclean to --no-clean, and --no-locale was already correct.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


pg_ctl_no_opts_01.patch
Description: invalid/octet-stream



Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-09-07 Thread Tom Lane
Dilip Kumar  writes:
> Basically this patch changes the error in three places.

> 1. getBaseTypeAndTypmod: This is called from the exposed function
> domain_in (domain_in -> domain_state_setup -> getBaseTypeAndTypmod).
> This function is also called from many other places that are not
> exposed functions, but I don't think this will have any impact, will
> it?

I really don't like this solution.  Changing this one function, out of the
dozens just like it in lsyscache.c, seems utterly random and arbitrary.

I'm actually not especially convinced that passing domain_in a random
OID needs to not produce an XX000 error.  That isn't a user-facing
function and people shouldn't be calling it by hand at all.  The same
goes for record_in.  But if we do want those cases to avoid this,
I think it's appropriate to fix it in those functions, not decree
that essentially-random other parts of the system should bear the
responsibility.  (Thought experiment: if we changed the order of
operations in domain_in so that getBaseTypeAndTypmod were no longer
the first place to fail, would we undo this change and then change
wherever the failure had moved to?  That's pretty messy.)

In the case of record_in, it seems like it'd be easy enough to use
lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc()
and then throw a user-facing error right there.  Not sure about a
nice solution for domain_in.  We might have to resort to an extra
syscache lookup there, but even if we did, it should happen only
once per query so that's not very awful.
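
A minimal sketch of what that might look like inside record_in (the
specific error code and wording here are just illustrative):

tupdesc = lookup_rowtype_tupdesc_noerror(tupType, tupTypmod, true);
if (tupdesc == NULL)
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("record type with OID %u does not exist", tupType)));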

> 3. CheckFunctionValidatorAccess: This is being called from all
> language validator functions.

This part seems reasonable, since the validator functions are documented
as something users might call, and CheckFunctionValidatorAccess seems
like an apropos place to handle it.
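
For context, the existing PL validators call it roughly like this
(sketch of a hypothetical validator; the change would be for
CheckFunctionValidatorAccess() itself to produce a clean, user-facing
error when handed a nonexistent function OID):

#include "postgres.h"
#include "fmgr.h"

Datum
myplhandler_validator(PG_FUNCTION_ARGS)     /* hypothetical validator */
{
    Oid         funcoid = PG_GETARG_OID(0);

    if (!CheckFunctionValidatorAccess(fcinfo->flinfo->fn_oid, funcoid))
        PG_RETURN_VOID();

    /* ... actual validation of the function body would go here ... */
    PG_RETURN_VOID();
}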

regards, tom lane




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Alvaro Herrera
Gavin Flower wrote:

> possibly '--nosync' (& any similar) should have a '--no-sync' variation
> added, with the '--nosync' variation documented as deprecated?

I agree -- I would go as far as just documenting --no-sync only and
keeping the --nosync one working with minimal (if any) visibility in
docs.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Possible optimization on Function Scan

2016-09-07 Thread Andres Freund
Hi,

On 2016-09-07 15:29:08 -0500, Jim Nasby wrote:
> I was a bit surprised to discover the difference below in calling an SRF as
> part of a target list vs part of the from clause. The from clause generates
> a Function Scan, which (apparently blindly) builds a tuplestore. Is there a
> relatively easy way to either transform this type of query so the SRF is
> back in a target list, or teach Function Scan that it doesn't always need to
> create a tuplestore? It would be nice if we could just not use a tuplestore
> at all (depending on the planner to add a Materialize node if necessary),
> but AIUI functions can directly return a tuplestore, so I guess that's not
> an option...

I've recently implemented ValuePerCall support for SRFs in FROM:
http://archives.postgresql.org/message-id/20160827214829.zo2dfb5jaikii5nw%40alap3.anarazel.de

One mail up, in
https://www.postgresql.org/message-id/20160822214023.aaxz5l4igypowyri%40alap3.anarazel.de
there are before/after performance numbers showing that removing the
materialization fixes the issue.

Andres




[HACKERS] Possible optimization on Function Scan

2016-09-07 Thread Jim Nasby
I was a bit surprised to discover the difference below in calling an SRF 
as part of a target list vs part of the from clause. The from clause 
generates a Function Scan, which (apparently blindly) builds a 
tuplestore. Is there a relatively easy way to either transform this type 
of query so the SRF is back in a target list, or teach Function Scan 
that it doesn't always need to create a tuplestore? It would be nice if 
we could just not use a tuplestore at all (depending on the planner to 
add a Materialize node if necessary), but AIUI functions can directly 
return a tuplestore, so I guess that's not an option...



~@decina/45678# explain (analyze,verbose,buffers) select count(*) from (select 
generate_series(1,)) c;
   QUERY PLAN

 Aggregate  (cost=17.51..17.52 rows=1 width=8) (actual 
time=27085.104..27085.104 rows=1 loops=1)
   Output: count(*)
   ->  Result  (cost=0.00..5.01 rows=1000 width=4) (actual 
time=0.007..14326.945 rows= loops=1)
 Output: generate_series(1, )
 Planning time: 0.125 ms
 Execution time: 27085.153 ms
(6 rows)

Time: 27087.624 ms
~@decina/45678# explain (analyze,verbose,buffers) select count(*) from 
generate_series(1,);
QUERY PLAN
--
 Aggregate  (cost=12.50..12.51 rows=1 width=8) (actual 
time=57968.811..57968.812 rows=1 loops=1)
   Output: count(*)
   Buffers: temp read=170900 written=170899
   ->  Function Scan on pg_catalog.generate_series  (cost=0.00..10.00 rows=1000 
width=0) (actual time=22407.515..44908.001 rows= loops=1)
 Output: generate_series
 Function Call: generate_series(1, )
 Buffers: temp read=170900 written=170899
 Planning time: 0.060 ms
 Execution time: 58054.981 ms
(9 rows)

Time: 58055.929 ms



--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-07 Thread Tom Lane
ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> Here is version 6 of the patch, which just adds RENAME VALUE with no IF
> [NOT] EXISTS, rebased onto current master (particularly the
> transactional ADD VALUE patch).

Pushed with some adjustments.  The only thing that wasn't a matter of
taste is you forgot to update the backend/nodes/ support functions
for struct AlterEnumStmt.

regards, tom lane




Re: [HACKERS] autonomous transactions

2016-09-07 Thread Robert Haas
On Sat, Sep 3, 2016 at 7:09 AM, Simon Riggs  wrote:
> On 2 September 2016 at 09:45, Robert Haas  wrote:
>> On Wed, Aug 31, 2016 at 7:20 AM, Peter Eisentraut
>>  wrote:
>>> I would like to propose the attached patch implementing autonomous
>>> transactions for discussion and review.
>>
>> I'm pretty skeptical of this approach.  Like Greg Stark, Serge Rielau,
>> and Constantin Pan, I had expected that autonomous transactions should
>> be implemented inside of a single backend, without relying on workers.
>
> The problem is that we have limits on the number of concurrent
> transactions, which is currently limited by the number of procs.

True.  I believe this issue has been discussed before, and I think
it's a solvable issue.  I believe that an autonomous transaction could
be made to appear to the rest of the system (outside the affected
backend) as if it were a subtransaction, but then get committed before
the containing transaction gets committed.  This would avoid needing
more PGPROCs (but getting deadlock detection to work is hard).

> So doing autonomous transactions inside a single backend doesn't gain
> you very much, yet it is an enormously invasive patch to do it that
> way, not least because it requires you to rewrite locking and
> deadlocks to make them work correctly when proc is not 1:1 with xid.
> And as Serge points out it introduces other restrictions that we know
> about now, perhaps more as well.

Yes, but I think that if we want to really meet the needs of Oracle
users who are used to being able to slap an autonomous transaction
pragma on any function that they like, we're going to need a solution
that is far lighter-weight than starting up a new backend.  Any
approach based on background workers is going to make the cost of a
function call something like 4ms, which I suspect is going to make it
useless for a pretty significant percentage of the cases where users
want to use autonomous transactions.  For example, if you want to log
every attempt to access an object, this is a phenomenally expensive
way to get there.  Calling a per-row trigger is already pretty
expensive; calling a per-row trigger that has to *launch a process for
every row* is going to be insanely bad.

>> That approach would be much less likely to run afoul of limits on the
>> number of background workers
>
> That would also be an argument against using them for parallelism, yet
> we used background workers there.

I guess that's sort of true, but parallel query intrinsically requires
multiple processes or threads, whereas autonomous transactions only
require that if you pick an implementation that requires that.  Also,
the parallel query facility is designed to only apply to operations
that are already pretty expensive, namely long-running queries, but
there are lots of use cases where an autonomous transaction gets
spawned to do a very small amount of work, and not infrequently in a
loop.  So the startup cost is a significantly more serious problem for
this use case.

Of course, if we could decrease the startup cost of a bgworker, that'd
be good for parallel query, too, because then we could deploy it on
shorter queries.  But the point remains that for parallel query the
planner can always decide to use a non-parallel plan if the query is
cheap enough that the startup cost of a worker will be a problem.
That isn't the case here; if the worker startup cost is a problem,
then you'll just end up with really slow SQL code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Gavin Flower

On 08/09/16 07:31, Robert Haas wrote:

On Sat, Sep 3, 2016 at 7:13 PM, Michael Paquier
 wrote:

On Sun, Sep 4, 2016 at 5:57 AM, Vik Fearing  wrote:

One thing that has been irking me ever since I came to PostgreSQL is the
fact that pg_ctl -w (and -W) don't have longhand equivalents.  I like to
use the long version in scripts and such as extra documentation, and
I've never been able to with these.  What's more, I keep forgetting that
--wait (and --no-wait) aren't a thing.

Trivial patch attached.

Nit: Like --nosync we could use --nowait, without an hyphen.

But is that actually better?  I think that the idea of omitting the
dash here is one of those things that sounds good at first, and then
later you realize that it was actually a dumb idea all along.  If
somebody has an option for --body or --on or --table and has to negate
it by running --nobody or --noon or --notable, some confusion may
result, because in each case you get a word that is not really the
logical inverse of the original option.   Also, if you end up with any
multi-word options, like --save-backup-files, then users wonder why
the opposite, --nosave-backup-files, has a dash between words 2 and 3
and between words 3 and 4, but not between words 1 and 2.  I suggest
we'd do better to standardize on always including a dash in such
cases.


+1

possibly '--nosync' (& any similar) should have a '--no-sync' variation
added, with the '--nosync' variation documented as deprecated?
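
For illustration, a minimal sketch (hypothetical, not pg_ctl's actual option
table) of how both spellings could be accepted via getopt_long(), with the
old spelling kept only as a deprecated alias:

#include <getopt.h>
#include <stdio.h>

static int	do_sync = 1;		/* cleared by either spelling */

static const struct option long_options[] = {
	{"no-sync", no_argument, &do_sync, 0},
	{"nosync", no_argument, &do_sync, 0},	/* deprecated alias */
	{NULL, 0, NULL, 0}
};

int
main(int argc, char **argv)
{
	while (getopt_long(argc, argv, "", long_options, NULL) != -1)
		;						/* flag options are handled via the table */
	printf("do_sync = %d\n", do_sync);
	return 0;
}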



Cheers,
Gavin



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Long options for pg_ctl waiting

2016-09-07 Thread Robert Haas
On Sat, Sep 3, 2016 at 7:13 PM, Michael Paquier
 wrote:
> On Sun, Sep 4, 2016 at 5:57 AM, Vik Fearing  wrote:
>> One thing that has been irking me ever since I came to PostgreSQL is the
>> fact that pg_ctl -w (and -W) don't have longhand equivalents.  I like to
>> use the long version in scripts and such as extra documentation, and
>> I've never been able to with these.  What's more, I keep forgetting that
>> --wait (and --no-wait) aren't a thing.
>>
>> Trivial patch attached.
>
> Nit: Like --nosync we could use --nowait, without an hyphen.

But is that actually better?  I think that the idea of omitting the
dash here is one of those things that sounds good at first, and then
later you realize that it was actually a dumb idea all along.  If
somebody has an option for --body or --on or --table and has to negate
it by running --nobody or --noon or --notable, some confusion may
result, because in each case you get a word that is not really the
logical inverse of the original option.   Also, if you end up with any
multi-word options, like --save-backup-files, then users wonder why
the opposite, --nosave-backup-files, has a dash between words 2 and 3
and between words 3 and 4, but not between words 1 and 2.  I suggest
we'd do better to standardize on always including a dash in such
cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SELECT FOR UPDATE regression in 9.5

2016-09-07 Thread Marko Tiikkaja

On 07/09/16 7:29 PM, Alvaro Herrera wrote:

Marko, does this fix your reported problem too?  Both the assertion and
the overall test case that causes it to fire?


The test case never realized anything was wrong, but the assertion is 
gone.  So yup, problem solved on this end, at least.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-09-07 Thread Peter Geoghegan
On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner  wrote:
> IMV the process is to post a patch to this list to certify that it
> is yours to contribute and free of IP encumbrances that would
> prevent us from using it.  I will wait for that post.

I attach my V3. There are only very minor code changes here, such as
breaking out a separate elevel constant for DEBUG2 traces that show
information of how the B-Tree is accessed (e.g. current level, block
number). Tiny tweaks to about 3 comments were also performed. These
changes are trivial.

I've also removed the tests added, since external sort regression test
coverage is in a much better place now.

Finally, the documentation was updated, to make it closer to the
Github project's documentation. I expanded what was started as the
original amcheck sgml documentation based on the experience of using
amcheck on production databases at Heroku. This isn't a big change,
but it's the biggest in V3. The documentation now emphasizes the
likelihood of amcheck finding software errors, which is all we found.
Jim Gray predicted that fault tolerant systems will tend to have
almost all problems arise from operator error and software faults in
practice, so perhaps I should have predicted that that would be the
general trend.

-- 
Peter Geoghegan
From 03046b49c893c7695bbae365c33e3d7b27a1f9a3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/Makefile |1 +
 contrib/amcheck/Makefile |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.c| 1267 ++
 contrib/amcheck/amcheck.control  |5 +
 doc/src/sgml/amcheck.sgml|  278 +
 doc/src/sgml/contrib.sgml|1 +
 doc/src/sgml/filelist.sgml   |1 +
 8 files changed, 1592 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.c
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 25263c0..4acd50e 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 000..7bb29aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= amcheck.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 000..ebbd6ac
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT;
diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
new file mode 100644
index 000..5c4d02d
--- /dev/null
+++ b/contrib/amcheck/amcheck.c
@@ -0,0 +1,1267 @@
+/*-
+ *
+ * amcheck.c
+ *		Verifies the integrity of access methods based on invariants.
+ *
+ * Provides SQL-callable functions for verifying that various logical
+ * invariants in 

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Andrew Borodin
Well, arithmetic is too fragile.

This version of float packing uses arithmetical packing:
static float
pack_float(float actualValue, int realm)
{
	double	max, min;

	max = FLT_MAX / ( 8 >> realm );
	min = FLT_MAX / ( 16 >> realm );
	if( realm == 0 )
		min = 0;
	/* squeeze the actual value between min and max */
	return ( min + (actualValue * ( max - min ) / FLT_MAX));
}

Inserts are the same as with the bit-hacked pack_float, but selects are 5 times slower.
When we try to pack the value linearly into a range we lose too many bits.
Here is how it happens: with floats, adding small values to big
values causes the small values to be lost.
Applied to Heikki's algorithm this means that nV, nE and dV can all be
in different mantissa ranges. (Please note that dV can be many times
larger than nV and also many times smaller than nV.)

An integer bitshift of a float has no arithmetical meaning. It would be
hard to describe in formulas what carrying a mantissa bit into the
significant field means. But this bitshift preserves the order among
almost all float values (except for 2 controllably lost bits, and some
values become sNaN). The entire idea of the advanced GiST penalty stands
on this.
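
For reference, here is a minimal sketch of the bit-shift packing described
above (illustration only, not necessarily the exact patch code).  It assumes
a non-negative finite input, drops the two low mantissa bits and stores the
realm in the two bits just below the sign bit, so any value from a higher
realm sorts above any value from a lower realm while order within a realm is
preserved:

#include <stdint.h>
#include <string.h>

static float
pack_float_sketch(float actualValue, int realm)
{
	uint32_t	bits;
	float		result;

	memcpy(&bits, &actualValue, sizeof(bits));	/* reinterpret, no arithmetic */
	bits = (bits >> 2) | ((uint32_t) realm << 29);	/* realm in bits 29-30 */
	memcpy(&result, &bits, sizeof(result));
	return result;				/* a few bit patterns come out as NaN */
}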

The other way is to add to the API an optional handler which executes the
whole choose-subtree algorithm inside cube (or another extension).

Best regards, Andrey Borodin, Octonica & Ural Federal University.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-09-07 Thread Jeff Janes
On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila  wrote:

>
> I have fixed all other issues you have raised.  Updated patch is
> attached with this mail.
>

I am finding the comments (particularly README) quite hard to follow.
There are many references to an "overflow bucket", or similar phrases.  I
think these should be "overflow pages".  A bucket is a conceptual thing
consisting of a primary page for that bucket and zero or more overflow
pages for the same bucket.  There are no overflow buckets, unless you are
referring to the new bucket to which things are being moved.

Was maintaining on-disk compatibility a major concern for this patch?
Would you do things differently if that were not a concern?  If we would
benefit from a break in format, I think it would be better to do that now
while hash indexes are still discouraged, rather than in a future release.

In particular, I am thinking about the need for every insert to
exclusive-content-lock the meta page to increment the index-wide tuple
count.  I think that this is going to be a huge bottleneck on update
intensive workloads (which I don't believe have been performance tested as
of yet).  I was wondering if we might not want to change that so that each
bucket keeps a local count, and sweeps that up to the meta page only when
it exceeds a threshold.  But this would require the bucket page to have an
area to hold such a count.  Another idea would be to keep not a count of
tuples, but of buckets with at least one overflow page, and split when
there are too many of those.  I bring it up now because it would be a shame
to ignore it until 10.0 is out the door, and then need to break things in
11.0.
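
A rough sketch of the per-bucket counter idea (hypothetical names and layout,
just to illustrate the locking pattern, not the actual hash index code):

#include <stdint.h>

#define BUCKET_COUNT_FLUSH_THRESHOLD 256	/* arbitrary for illustration */

typedef struct SketchBucketOpaque
{
	uint32_t	local_ntuples;	/* inserts not yet reflected in the meta page */
} SketchBucketOpaque;

/* called once per insert into the bucket */
static void
bucket_note_insert(SketchBucketOpaque *bucket, uint64_t *meta_ntuples)
{
	if (++bucket->local_ntuples >= BUCKET_COUNT_FLUSH_THRESHOLD)
	{
		/* only here would the meta page need an exclusive content lock */
		*meta_ntuples += bucket->local_ntuples;
		bucket->local_ntuples = 0;
	}
}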

Cheers,

Jeff


Re: Install extensions using update scripts (was Re: [HACKERS] Remove superuser() checks from pgstattuple)

2016-09-07 Thread Tom Lane
I wrote:
> Still no SGML doc updates.

And here's a doc addition.

regards, tom lane

diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index df88380..1c8c420 100644
*** a/doc/src/sgml/extend.sgml
--- b/doc/src/sgml/extend.sgml
*** SELECT * FROM pg_extension_update_paths(
*** 885,890 
--- 885,927 
  
 
  
+
+ Installing Extensions using Update Scripts
+ 
+ 
+  An extension that has been around for awhile will probably exist in
+  several versions, for which the author will need to write update scripts.
+  For example, if you have released a foo extension in
+  versions 1.0, 1.1, and 1.2, there
+  should be update scripts foo--1.0--1.1.sql
+  and foo--1.1--1.2.sql.
+  Before PostgreSQL 10, it was necessary to create new
+  script files foo--1.1.sql and foo--1.2.sql
+  containing the same changes, or else the newer versions could not be
+  installed directly, only by installing 1.0 and then updating.
+  Now, CREATE EXTENSION can do that automatically 
+  for example, if only the script
+  files foo--1.0.sql, foo--1.0--1.1.sql,
+  and foo--1.1--1.2.sql are available then a request to
+  install version 1.2 is honored by running those three scripts
+  in sequence.  (As with update requests, if multiple pathways are
+  available then the shortest is preferred.)
+  Arranging an extension's script files in this style can reduce the amount
+  of maintenance effort needed to produce small updates.
+ 
+ 
+ 
+  If you use secondary (version-specific) control files with an extension
+  maintained in this style, keep in mind that each version needs a control
+  file even if it has no stand-alone installation script, as that control
+  file will determine how the implicit update to that version is performed.
+  For example, if foo--1.0.control specifies requires
+  = 'bar' but foo's other control files do not, the
+  extension's dependency on bar will be dropped when updating
+  from 1.0 to another version.
+ 
+
+ 
 
  Extension Example
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-09-07 Thread Alvaro Herrera
> - I can't remember the specific language but they had the collation rule
> that "CH" was treated as a distinct entity between C and D. This gave the
> order C < CG < CI < CZ < CH < D. Then they removed CH as special which gave
> C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL
> BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that
> started with C. After the change it the constraint would block everything
> that started with CI - CZ leaving many rows that no longer qualify in the
> database.

(This was probably Spanish.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-07 Thread Robert Haas
On Tue, Sep 6, 2016 at 11:13 PM, Masahiko Sawada  wrote:
> I understood, thank you.
>
> I've measured the performance benefit of this patch by following steps.
> 1. Create very large table and set all-visible flag to all blocks.
> 2. Measure the execution time of vacuum that skips to scan all heap pages.
>
> * 1TB Table(visibility map size is 32MB)
> HEAD : 11012.274 ms (00:11.012)
> Patched : 6742.481 ms (00:06.742)
>
> * 8TB Table(visibility map size is 64MB)
> HEAD : 69886.605 ms (01:09.887)
> Patched : 35562.131 ms (00:35.562)
>
> * 32TB Table(visibility map size is 258MB)
> HEAD: 265901.096 ms (04:25.901)
> Patched: 131779.082 ms (02:11.779)
>
> Since current HEAD could scan visibility map twice, the execution time
> of Patched is approximately half of HEAD.
> But when the table is several hundred gigabytes, the performance benefit
> would not be large.

Wow, those are surprisingly good results.  Interesting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-09-07 Thread Doug Doole
>
> This isn't a problem for Postgres, or at least wouldn't be right now,
> because we don't have case insensitive collations.


I was wondering if Postgres might be that way. It does avoid the RI
constraint problem, but there are still troubles with range based
predicates. (My previous project wanted case/accent insensitive collations,
so we got to deal with it all.)


> So, we use a strcmp()/memcmp() tie-breaker when strcoll() indicates
> equality, while also making the general notion of text equality actually
> mean binary equality.


We used a similar tie breaker in places. (e.g. Index keys needed to be
identical, not just equal. We also broke ties in sort to make its behaviour
more deterministic.)
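
The general shape of such a comparison, as a minimal sketch (not PostgreSQL's
actual comparator, just the idea of a binary tie-breaker on top of the
collation):

#include <string.h>

static int
cmp_with_tiebreak(const char *a, const char *b)
{
	int			r = strcoll(a, b);	/* assumes the desired locale is active */

	if (r == 0)
		r = strcmp(a, b);		/* only byte-identical strings compare equal */
	return r;
}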

I would like to get case insensitive collations some day, and was
> really hoping that ICU would help. That being said, the need for a
> strcmp() tie-breaker makes that hard. Oh well.
>

Prior to adding ICU to my previous project, it had the assumption that
equal meant identical as well. It turned out to be a lot easier to break
this assumption than I expected, but that code base had religiously used
its own string comparison function for user data - strcmp()/memcmp() was
never called for user data. (I don't know if the same can be said for
Postgres.) We found that very few places needed to be aware of values that
were equal but not identical. (Index and sort were the big two.)

Hopefully Postgres will be the same.

--
Doug Doole


Re: [HACKERS] SELECT FOR UPDATE regression in 9.5

2016-09-07 Thread Alvaro Herrera
Marti Raudsepp wrote:
> Hello list
> 
> While testing an application with PostgreSQL 9.5, we experienced an issue
> involving aborted subtransactions and SELECT FOR UPDATE. In certain
> situations, a locking query doesn't return rows that should be visible and
> already locked by the current transaction.

Okay, so the assertion failure is fixed by the attached patch.  Also,
the division-by-zero that your test case says shouldn't occur doesn't
occur.  But does it solve the larger problem of not returning rows that
should be visible?

Marko, does this fix your reported problem too?  Both the assertion and
the overall test case that causes it to fire?

(The problem fixed by the patch is that we were trying to lock tuples
down the update chain, but one of the tuples in the chain had been
updated by an aborted subtransaction.  Obviously, there is no point in
locking such a tuple because it effectively "doesn't exist" in the first
place.)

Thanks!


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b8f4fe5..9846b9f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5403,6 +5403,16 @@ l4:
 			return HeapTupleMayBeUpdated;
 		}
 
+		/*
+		 * Also check Xmin: if this tuple was created by an aborted
+		 * transaction, we're done.
+		 */
+		if (TransactionIdDidAbort(HeapTupleHeaderGetXmin(mytup.t_data)))
+		{
+			UnlockReleaseBuffer(buf);
+			return HeapTupleMayBeUpdated;
+		}
+
 		old_infomask = mytup.t_data->t_infomask;
 		old_infomask2 = mytup.t_data->t_infomask2;
 		xmax = HeapTupleHeaderGetRawXmax(mytup.t_data);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-09-07 Thread Doug Doole
> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.

Yep. If you want the protection I've described, you can't depend on the OS
copy of ICU. You need to have multiple ICU libraries that are
named/installed such that you can load the specific version you want. It
also means that you can have dependencies on versions of ICU that are no
longer supported. (In my previous project, we were shipping 3 copies of the
ICU library, going back to 2.x. Needless to say, we didn't add support for
every drop of ICU.)
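
As a rough sketch of what "explicitly calling the versioned API" can look
like (hypothetical code; the library and symbol names here are assumptions,
not an assertion about ICU's actual naming scheme): each ICU major version is
dlopen'ed separately and its version-suffixed entry points are resolved by
hand, so an object tagged with, say, %icu36 keeps using the 3.6 entry points.

#include <dlfcn.h>
#include <stddef.h>

/* simplified signature; the real ucol_open takes a UErrorCode pointer */
typedef void *(*ucol_open_fn) (const char *locale, int *status);

static void *
open_versioned_collator(const char *soname,	/* e.g. "libicui18n.so.36" */
						const char *symbol,	/* e.g. a version-suffixed ucol_open */
						const char *locale)
{
	void	   *lib = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
	ucol_open_fn open_fn;
	int			status = 0;

	if (lib == NULL)
		return NULL;
	open_fn = (ucol_open_fn) dlsym(lib, symbol);
	if (open_fn == NULL)
		return NULL;
	return open_fn(locale, &status);	/* caller must check status */
}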

On Wed, Sep 7, 2016 at 5:53 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 9/6/16 1:40 PM, Doug Doole wrote:
> > We carried the ICU version numbers around on our collation and locale
> > IDs (such as fr_FR%icu36) . The database would load multiple versions of
> > the ICU library so that something created with ICU 3.6 would always be
> > processed with ICU 3.6. This avoided the problems of trying to change
> > the rules on the user. (We'd always intended to provide tooling to allow
> > the user to move an existing object up to a newer version of ICU, but we
> > never got around to doing it.) In the code, this meant we were
> > explicitly calling the versioned API so that we could keep the calls
> > straight.
>
> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-07 Thread Claudio Freire
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
> On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs  wrote:
>> On 6 September 2016 at 19:59, Tom Lane  wrote:
>>
>>> The idea of looking to the stats to *guess* about how many tuples are
>>> removable doesn't seem bad at all.  But imagining that that's going to be
>>> exact is folly of the first magnitude.
>>
>> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> so I think we agree that a reasonably accurate, yet imprecise
>> calculation is possible in most cases.
>
> That would all be well and good if it weren't trivial to do what
> Robert suggested. This is just a large unsorted list that we need to
> iterate through. Just allocate chunks of a few megabytes and when
> it's full allocate a new chunk and keep going. There's no need to get
> tricky with estimates and resizing and whatever.

I agree. While the idea of estimating the right size sounds promising
a priori, considering the estimate can go wrong and over or
underallocate quite severely, the risks outweigh the benefits when you
consider the alternative of a dynamic allocation strategy.

Unless the dynamic strategy has a bigger CPU impact than expected, I
believe it's a superior approach.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fun fact about autovacuum and orphan temp tables

2016-09-07 Thread Robert Haas
On Mon, Sep 5, 2016 at 1:14 PM, Bruce Momjian  wrote:
> On Mon, Sep  5, 2016 at 12:48:32PM -0300, Alvaro Herrera wrote:
>> > The least invasive solution would be to have a guc, something like
>> > 'keep_orphan_temp_tables' with boolean value.
>> > Which would determine a autovacuum worker policy toward encountered orphan
>> > temp tables.
>>
>> The stated reason for keeping them around is to ensure you have time to
>> do some forensics research in case there was something useful in the
>> crashing backend.  My feeling is that if the reason they are kept around
>> is not a crash but rather some implementation defect that broke end-time
>> cleanup, then they don't have their purported value and I would rather
>> just remove them.
>>
>> I have certainly faced my fair share of customers with dangling temp
>> tables, and would like to see this changed in some way or another.
>
> I don't think we look at those temp tables frequently enough to justify
> keeping them around for all users.

+1.  I think it would be much better to nuke them more aggressively.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-07 Thread Greg Stark
On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs  wrote:
> On 6 September 2016 at 19:59, Tom Lane  wrote:
>
>> The idea of looking to the stats to *guess* about how many tuples are
>> removable doesn't seem bad at all.  But imagining that that's going to be
>> exact is folly of the first magnitude.
>
> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
> so I think we agree that a reasonably accurate, yet imprecise
> calculation is possible in most cases.

That would all be well and good if it weren't trivial to do what
Robert suggested. This is just a large unsorted list that we need to
iterate through. Just allocate chunks of a few megabytes and when
it's full allocate a new chunk and keep going. There's no need to get
tricky with estimates and resizing and whatever.
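
A minimal sketch of that chunked approach (hypothetical types and sizes, not
the actual vacuum code):

#include <stdlib.h>
#include <stdint.h>

/* stand-in for ItemPointerData, just for this sketch */
typedef struct
{
	uint32_t	blkno;
	uint16_t	offnum;
} TidSketch;

#define TIDS_PER_CHUNK (4 * 1024 * 1024 / sizeof(TidSketch))	/* ~4 MB chunks */

typedef struct TidChunk
{
	struct TidChunk *next;
	size_t		used;
	TidSketch	tids[TIDS_PER_CHUNK];
} TidChunk;

/* append one dead TID, allocating a fresh chunk only when the last one is full */
static TidChunk *
tidlist_append(TidChunk *head, TidSketch tid)
{
	if (head == NULL || head->used == TIDS_PER_CHUNK)
	{
		TidChunk   *chunk = calloc(1, sizeof(TidChunk));

		if (chunk == NULL)
			abort();			/* real code would report out-of-memory properly */
		chunk->next = head;		/* newest chunk kept at the head, for brevity */
		head = chunk;
	}
	head->tids[head->used++] = tid;
	return head;
}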

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index Onlys Scan for expressions

2016-09-07 Thread Ildar Musin

Hi Vladimir,

On 05.09.2016 16:38, Ildar Musin wrote:

Hi Vladimir,

On 03.09.2016 19:31, Vladimir Sitnikov wrote:

Ildar>The reason why this doesn't work is that '~~' operator (which is a
Ildar>synonym for 'like') isn't supported by operator class for btree.
Since
Ildar>the only operators supported by btree are <, <=, =, >=, >, you
can use
Ildar>it with queries like:

Ildar>And in 3rd query 'OFFSET' statement prevents rewriter from
Ildar>transforming the query, so it is possible to use index only scan on
Ildar>subquery and then filter the result of subquery with '~~' operator.

I'm afraid I do not follow you.
Note: query 3 is 100% equivalent of query 2, however query 3 takes 55
times less reads.
It looks like either an optimizer bug, or some missing feature in the
"index only scan" logic.

Here's quote from "query 2" (note % are at both ends):  ... where
type=42) as x where upper_vc like '%ABC%';

Note: I do NOT use "indexed scan" for the like operator. I'm very well
aware
that LIKE patterns with leading % cannot be optimized to a btree range
scan.
What I want is "use the first indexed column as index scan, then use the
second column
for filtering".

As shown in "query 2" vs "query 3", PostgreSQL cannot come up with such
a plan on its own
for some reason.

This is not a theoretical issue, but it is something that I use a lot
with Oracle DB (it just creates a good plan for "query 2").

Vladimir


Thanks, I get it now. The reason why it acts like this is that I used the
match_clause_to_index() function to determine if IOS can be used with
the specified clauses. This function among other things checks whether the
operator matches the index opfamily. Apparently this isn't correct. I
wrote another prototype to test your case and it seems to work. But it's
not ready for the public yet; I'll publish it in 1-2 days.



Here is a new patch version. I modified check_index_only_clauses() so
that it doesn't use match_clause_to_indexcol() anymore. Instead it
handles different types of expressions, including binary operator
expressions, scalar array expressions, row compare expressions (e.g.
(a,b)<(1,2)) and null tests, and tries to match each part of the expression
to the index regardless of the operator. I reproduced your example and was
able to get an index only scan on all queries. Could you please try the
patch and tell me if it works for you?


--
Ildar Musin
i.mu...@postgrespro.ru
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2952bfb..e81f914 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -25,6 +25,7 @@
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -78,6 +79,14 @@ typedef struct
 	int			indexcol;		/* index column we want to match to */
 } ec_member_matches_arg;
 
+/* Context for index only expression checker */
+typedef struct
+{
+	IndexOptInfo   *index;
+	bool			match;	/* TRUE if expression matches
+			 * index */
+} check_index_only_walker_ctx;
+
 
 static void consider_index_join_clauses(PlannerInfo *root, RelOptInfo *rel,
 			IndexOptInfo *index,
@@ -130,7 +139,15 @@ static PathClauseUsage *classify_index_clause_usage(Path *path,
 static Relids get_bitmap_tree_required_outer(Path *bitmapqual);
 static void find_indexpath_quals(Path *bitmapqual, List **quals, List **preds);
 static int	find_list_position(Node *node, List **nodelist);
-static bool check_index_only(RelOptInfo *rel, IndexOptInfo *index);
+static bool check_index_only(PlannerInfo *root, RelOptInfo *rel, IndexOptInfo *index);
+static bool check_index_only_targetlist(PlannerInfo *root,
+			RelOptInfo *rel,
+			IndexOptInfo *index,
+			Bitmapset *index_canreturn_attrs);
+static bool check_index_only_clauses(List *clauses,
+		 IndexOptInfo *index);
+static bool check_index_only_expr_walker(Node *node, check_index_only_walker_ctx *ctx);
+static bool check_index_only_expr(Node *expr, IndexOptInfo *index);
 static double get_loop_count(PlannerInfo *root, Index cur_relid, Relids outer_relids);
 static double adjust_rowcount_for_semijoins(PlannerInfo *root,
 			  Index cur_relid,
@@ -1020,7 +1037,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	 * index data retrieval anyway.
 	 */
 	index_only_scan = (scantype != ST_BITMAPSCAN &&
-	   check_index_only(rel, index));
+	   check_index_only(root, rel, index));
 
 	/*
 	 * 4. Generate an indexscan path if there are relevant restriction clauses
@@ -1786,7 +1803,7 @@ find_list_position(Node *node, List **nodelist)
  *		Determine whether an index-only scan is possible for this index.
  */
 static bool
-check_index_only(RelOptInfo *rel, IndexOptInfo *index)
+check_index_only(PlannerInfo *root, RelOptInfo *rel, IndexOptInfo *index)
 {
 	bool		result;
 	Bitmapset  *attrs_used = NULL;
@@ -1837,8 +1854,10 @@ 

Re: [HACKERS] Suggestions for first contribution?

2016-09-07 Thread Stas Kelvich
> On 05 Sep 2016, at 20:25, Christian Convey  wrote:
> 
> Hi guys,
> 
> Can anyone suggest a project for my first PG contribution?
> 
> My first two ideas didn't pan out:  Yury doesn't seem to need help
> with CMake, and the TODO list's "-Wcast-align" project (now deleted)
> appeared impractical.

There is also a list of projects for google summer of code: 
https://wiki.postgresql.org/wiki/GSoC_2016

Those topics are expected to be doable by a newcomer over several months. The
list is also slightly outdated, but you can always check the current state of
things by searching this mailing list for the topic you are interested in.

Also, Postgres has several internal APIs, like fdw [1] or gist [2], that
allow you to implement something useful without digging too deeply into
Postgres internals.

[1] https://www.postgresql.org/docs/current/static/fdwhandler.html
[2] https://www.postgresql.org/docs/current/static/gist-extensibility.html

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Suggestions for first contribution?

2016-09-07 Thread Yury Zhuravlev

Christian Convey wrote:

Yury doesn't seem to need help
with CMake

Hello.
I am sorry that I am only answering now.
I do need help, but with writing CMake code:
1. Make ecpg tests
2. Make MinGW/Msys support
3. many many ...
All targets and discussion are here:
https://github.com/stalkerg/postgres_cmake/issues


If you can only test CMake on Ubuntu x86_64, that is not necessary.
If I have not fully understood you - sorry.




--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-07 Thread Aleksander Alekseev
> > 8) get_next_element procedure implementation is way too smart (read
> > "complicated"). You could probably just store current list length and
> > generate a random number between 0 and length-1.
> 
> No, the algorithm here is more complicated. It must ensure that there is
> no second attempt to connect to a host for which an unsuccessful
> connection attempt was already made. So, there is list rearrangement.
> 
> The algorithm for picking a random list element in a single pass is quite trivial.

Great! In this case it would be _trivial_ for you to write a comment that
describes how this procedure works, what makes you think that it gives a
good distribution in all possible cases (e.g. if there are more than
0x1 elements in a list - why not), etc. Right? :)

-- 
Best regards,
Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Suggestions for first contribution?

2016-09-07 Thread Aleksander Alekseev
Here is another idea for a contribution - refactoring.

Currently there are a lot of procedures in PostgreSQL code that
definitely don't fit on one screen (i.e. ~50 lines). Also many files are
larger than, say, 1000 lines of code. For instance, the psql_completion
procedure is more than 2000 lines long! I think patches that would make
such code easier to read would have a good chance of being accepted.

In case you are interested I wrote a script that scans PostgreSQL code
and then prints procedures names and corresponding number of lines of
code (see attachment).

-- 
Best regards,
Aleksander Alekseev
#!/usr/bin/env python3

import sys
import os
import re

def process_file(fname):
	nlines = 0
	procname = ""
	with open(fname) as f:
		for line in f:
			nlines += 1
			if re.search("^{", line):
				nlines = 0
			elif re.search("^}", line) and procname and (nlines > 0):
				print("{:08d}:{}".format(nlines-1, procname))
				procname = ""
			else:
				m = re.search("^(\\w+)\\s*\\(", line)
				if m:
					procname = m.group(1)

def recursive_search(path):
	for file in os.listdir(path):
		if (file == ".") or (file == ".."):
			continue

		full_name = os.path.join(path, file)
		if not os.path.islink(full_name):
			if os.path.isdir(full_name):
				recursive_search(full_name)

			if re.search("\\.c$", file):
				process_file(full_name)

if len(sys.argv) < 2:
	print("Usage " + sys.argv[0] + " ")
	sys.exit(1)

start_path = sys.argv[1]
recursive_search(start_path)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow to store select results into variables

2016-09-07 Thread Fabien COELHO


Hello Amit,


Custom script looks like:

\;
select a \into a
   from tab where a = 1;
\set i debug(:a)

I get the following error:

undefined variable "a"
client 0 aborted in state 1; execution of meta-command failed


Good catch!

Indeed, there is a problem with empty commands which are simply ignored by 
libpq/postgres if there are other commands around, so my synchronization 
between results & commands was too naive.


In order to fix this, I made the scanner also count empty commands and 
ignore comments. I guessed that proposing to change libpq/postgres 
behavior was not an option.



Comments on the patch follow:

+
+ 
+  Stores the first fields of the resulting row from the current or
preceding
+  SELECT command into these variables.

Any command returning rows ought to work, no?


Yes. I put "SQL command" instead.



-char   *line;   /* text of command line */
+char   *line;   /* first line for short display */
+char   *lines;  /* full multi-line text of command */

When I looked at this and related hunks (and also the hunks related to
semicolons), it made me wonder whether this patch contains two logical
changes.  Is this a just a refactoring for the \into implementation or
does this provide some new externally visible feature?


There is essentially a refactoring that I did when updating how Command is 
managed because it has to be done in several stages to fit "into" into it 
and to take care of compounds.


However there was a small "new externally visible feature": on -r, instead
of abruptly cutting a multiline command at the end of the first line, it
appended "..." as an ellipsis because it looked nicer.


I've removed this small visible change.


char   *argv[MAX_ARGS]; /* command word list */
+int compound;  /* last compound command (number of \;) */
+char***intos;  /* per-compound command \into variables */



Need an extra space for intos to align with earlier fields.


Ok.

Also I wonder if intonames or intoargs would be a slightly better name 
for the field.


I put "intovars" as they are variable names.


+static bool
+read_response(CState *st, char ** intos[])

Usual style seems to be to use ***intos here.


Ok.


+fprintf(stderr,
+"client %d state %d compound %d: "
+"cannot apply \\into to non SELECT statement\n",
+st->id, st->state, compound);

How about make this error message say something like other messages
related to \into, perhaps something like: "\\into cannot follow non SELECT
statements\n"


As you pointed out above, there may be statements without "SELECT" which
return a row. I wrote "\\into expects a row" instead.



/*
 * Read and discard the query result; note this is not included in
- * the statement latency numbers.
+ * the statement latency numbers (above), thus if reading the
+ * response fails the transaction is counted nevertheless.
 */

Does this comment need to mention that the result is not discarded when
\into is specified?


Hmmm. The result structure is discarded, but the results are copied into
variables before that, so the comment seems ok...



+my_command->intos = pg_malloc0(sizeof(char**) * (compounds+1));

Need space: s/char**/char **/g


Ok.


This comment also applies to a couple of nearby places.


Indeed.


-my_command->line = pg_malloc(nlpos - p + 1);
+my_command->line = pg_malloc(nlpos - p  + 1 + 3);

A comment mentioning what the above means would be helpful.


Ok. I removed the "+ 3" anyway.


+boolsql_command_in_progress = false;

This variable's name could be slightly confusing to readers. If I
understand its purpose correctly, perhaps it could be called
sql_command_continues.


It is possible. I like 'in progress' though. Why is it confusing?

It means that the current command is not finished yet and more is
expected, that is, the final ';' has not been encountered.



+if (index == 0)
+syntax_error(desc, lineno, NULL, NULL,
+ "\\into cannot start a script",
+ NULL, -1);

How about: "\\into cannot be at the beginning of a script" ?


It is possible, but it's quite a bit longer... I'm not a native speaker, so I
do not know whether it would be better.


The attached patch takes into account all your comments but:
 - comment about discarded results...
 - the sql_command_in_progress variable name change
 - the longer message on into at the start of a script

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 285608d..0a474e1 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -809,6 +809,30 @@ pgbench  options  dbname
   
 
   
+   
+
+ \into var1 [var2 ...]
+

Re: [HACKERS] [COMMITTERS] pgsql: Fix VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL

2016-09-07 Thread Tom Lane
Simon Riggs  writes:
> On 7 September 2016 at 13:47, Fujii Masao  wrote:
>> On Tue, Sep 6, 2016 at 11:41 PM, Simon Riggs  wrote:
>>> lazy_truncate_heap() was waiting for
>>> VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL, but in microseconds
>>> not milliseconds as originally intended.

>> Don't we need to back-patch this?

> If we do then a database-wide VACUUM on a busy database will take
> substantially longer than it does now.

On the other hand, it's also more likely to successfully perform desired
truncations.

> That may not be perceived as a "fix" by everybody, so we should not do
> it without an explicit agreement by many.

Agreed, but I vote with Fujii-san for back-patching.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-07 Thread Victor Wagner
On Mon, 5 Sep 2016 14:03:11 +0300
Aleksander Alekseev  wrote:

> Hello, Victor.
> 
> 
> 1) It looks like code is not properly formatted.
> 

Thanks for pointing out the documentation and formatting problems. I'll
fix them.
 
> >  static int 
> >  connectDBStart(PGconn *conn)
> >  {
> > +   struct nodeinfo
> > +   {
> > +   char   *host;
> > +   char   *port;
> > +   };  
> 
> Not sure if all compilers we support can parse this. I suggest to
> declare nodinfo structure outside of procedure, just to be safe.
> 

Block-local structs are part of the C language standard. I don't remember
when they first appeared (maybe in K&R C), but at least since C89 AFAIR
they are explicitly allowed.

And using the most local scope possible is always a good thing.

So, I'll leave this structure function-local for now.


> 
> You should check return values of malloc, calloc and strdup
> procedures, or use safe pg_* equivalents.

Thanks, I'll fix this one.
 
> 6) 
> 
> > +   for (p = addrs; p->ai_next != NULL; p =
> >   p->ai_next);
> > +   p->ai_next = this_node_addrs;  
> 
> Looks a bit scary. I suggest to add an empty scope ( {} ) and a
> comment to make our intention clear here.

Ok, it really would make the code clearer.
 
> 7) Code compiles with following warnings (CLang 3.8):
> 
> > 1 warning generated.
> > fe-connect.c:5239:39: warning: if statement has empty body
> > [-Wempty-body]
> >
> > errorMessage,
> > false, true));
> > 
> >   ^
> > fe-connect.c:5239:39: note: put the semicolon on a separate line to
> > silence this warning
> > 1 warning generated.  

Oops, it looks like a logic error, which should be fixed. Thanks for
testing my patch with a more picky compiler.

> 8) get_next_element procedure implementation is way too smart (read
> "complicated"). You could probably just store current list length and
> generate a random number between 0 and length-1.

No, the algorithm here is more complicated. It must ensure that there is
no second attempt to connect to a host for which an unsuccessful
connection attempt was already made. So, there is list rearrangement.

The algorithm for picking a random list element in a single pass is quite trivial.
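
For illustration, here is a sketch of a single-pass random pick over a linked
list (a hypothetical next pointer is added; this is not necessarily what the
patch's get_next_element() does):

#include <stdlib.h>

struct nodeinfo
{
	char	   *host;
	char	   *port;
	struct nodeinfo *next;		/* hypothetical link, for this sketch only */
};

/*
 * Reservoir sampling with k = 1: after one pass, each element has been
 * chosen with probability 1/n (ignoring the slight bias of rand() % n).
 */
static struct nodeinfo *
pick_random_node(struct nodeinfo *list)
{
	struct nodeinfo *picked = NULL;
	struct nodeinfo *p;
	int			seen = 0;

	for (p = list; p != NULL; p = p->next)
	{
		seen++;
		if (rand() % seen == 0)	/* keep p with probability 1/seen */
			picked = p;
	}
	return picked;
}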




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Bug in two-phase transaction recovery

2016-09-07 Thread Stas Kelvich
Hello.

Some time ago the two-phase state file format was changed to have a variable-size GID,
but several places that read those files were not updated to use the new offsets.
The problem exists in master and 9.6 and can be reproduced with prepared transactions
containing savepoints. For example:

create table t(id int);
begin;
insert into t values (42);
savepoint s1;
insert into t values (43);
prepare transaction 'x';
begin;
insert into t values (142);
savepoint s1;
insert into t values (143);
prepare transaction 'y';

and restart the server. Startup process will hang. Fix attached.

Also, while looking at StandbyRecoverPreparedTransactions() I've noticed that the
buffer for the 2PC file is allocated in TopMemoryContext but never freed. That has
probably been the case for a long time.



gidlen_fixes.diff
Description: Binary data





standby_recover_pfree.diff
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Install extensions using update scripts (was Re: [HACKERS] Remove superuser() checks from pgstattuple)

2016-09-07 Thread Tom Lane
Andres Freund  writes:
> On 2016-09-05 22:24:09 -0400, Tom Lane wrote:
>> Ordinarily I'd be willing to stick this on the queue for the next
>> commitfest, but it seems like we ought to try to get it pushed now
>> so that Stephen can make use of the feature for his superuser changes.
>> Thoughts?

> Seems sensible to me. I can have a look at it one of the next few days
> if you want.

Thanks for offering.  Attached is an updated patch that addresses a
couple of issues I noticed:

1. When I did the previous patch, I was thinking that any parameters
specified in an auxiliary .control file for the base version would
apply to any non-base versions based on it; so I had the
pg_available_extension_versions view just duplicate those parameters.
But actually, most of those parameters are just applied on-the-fly
while processing the update script, so it *does* work for them to be
different for different versions.  The exceptions are schema (which
you can't modify during an update) and comment (while we could replace
the comment during an update, it doesn't seem like a good idea because
we might overwrite a user-written comment if we did).  (I think this
behavior is undocumented BTW :-(.)  So this fixes the view to only
inherit those two parameters from the base version.

2. I noticed that CASCADE was not implemented for required extensions
processed by ApplyExtensionUpdates, which seems like a bad thing if
CREATE processing is going to rely more heavily on that.  This only
matters if different versions have different requires lists, but that
is supposed to be supported, and all the catalog-hacking for it is there.
So I made this work.  It took a fair amount of refactoring in order to
avoid code duplication, but it wasn't really a very big change.

At this point it's awfully tempting to make ALTER EXTENSION UPDATE grow
a CASCADE option to allow automatic installation of new requirements
of the new version, but I didn't do that here.

Still no SGML doc updates.

regards, tom lane

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index df49a78..59f9e21 100644
*** a/src/backend/commands/extension.c
--- b/src/backend/commands/extension.c
*** typedef struct ExtensionVersionInfo
*** 100,113 
  static List *find_update_path(List *evi_list,
   ExtensionVersionInfo *evi_start,
   ExtensionVersionInfo *evi_target,
   bool reinitialize);
  static void get_available_versions_for_extension(ExtensionControlFile *pcontrol,
  	 Tuplestorestate *tupstore,
  	 TupleDesc tupdesc);
  static void ApplyExtensionUpdates(Oid extensionOid,
  	  ExtensionControlFile *pcontrol,
  	  const char *initialVersion,
! 	  List *updateVersions);
  static char *read_whole_file(const char *filename, int *length);
  
  
--- 100,124 
  static List *find_update_path(List *evi_list,
   ExtensionVersionInfo *evi_start,
   ExtensionVersionInfo *evi_target,
+  bool reject_indirect,
   bool reinitialize);
+ static Oid get_required_extension(char *reqExtensionName,
+ 	   char *extensionName,
+ 	   char *origSchemaName,
+ 	   bool cascade,
+ 	   List *parents,
+ 	   bool is_create);
  static void get_available_versions_for_extension(ExtensionControlFile *pcontrol,
  	 Tuplestorestate *tupstore,
  	 TupleDesc tupdesc);
+ static Datum convert_requires_to_datum(List *requires);
  static void ApplyExtensionUpdates(Oid extensionOid,
  	  ExtensionControlFile *pcontrol,
  	  const char *initialVersion,
! 	  List *updateVersions,
! 	  char *origSchemaName,
! 	  bool cascade,
! 	  bool is_create);
  static char *read_whole_file(const char *filename, int *length);
  
  
*** identify_update_path(ExtensionControlFil
*** 1071,1077 
  	evi_target = get_ext_ver_info(newVersion, _list);
  
  	/* Find shortest path */
! 	result = find_update_path(evi_list, evi_start, evi_target, false);
  
  	if (result == NIL)
  		ereport(ERROR,
--- 1082,1088 
  	evi_target = get_ext_ver_info(newVersion, _list);
  
  	/* Find shortest path */
! 	result = find_update_path(evi_list, evi_start, evi_target, false, false);
  
  	if (result == NIL)
  		ereport(ERROR,
*** identify_update_path(ExtensionControlFil
*** 1086,1091 
--- 1097,1105 
   * Apply Dijkstra's algorithm to find the shortest path from evi_start to
   * evi_target.
   *
+  * If reject_indirect is true, ignore paths that go through installable
+  * versions (presumably, caller will consider starting from such versions).
+  *
   * If reinitialize is false, assume the ExtensionVersionInfo list has not
   * been used for this before, and the initialization done by get_ext_ver_info
   * is still good.
*** static List *
*** 1097,1102 
--- ,1117 
  find_update_path(List *evi_list,
   ExtensionVersionInfo *evi_start,
   ExtensionVersionInfo *evi_target,
+  

Re: [HACKERS] ICU integration

2016-09-07 Thread Peter Eisentraut
On 9/6/16 1:40 PM, Doug Doole wrote:
> We carried the ICU version numbers around on our collation and locale
> IDs (such as fr_FR%icu36) . The database would load multiple versions of
> the ICU library so that something created with ICU 3.6 would always be
> processed with ICU 3.6. This avoided the problems of trying to change
> the rules on the user. (We'd always intended to provide tooling to allow
> the user to move an existing object up to a newer version of ICU, but we
> never got around to doing it.) In the code, this meant we were
> explicitly calling the versioned API so that we could keep the calls
> straight.

I understand that in principle, but I don't see operating system
providers shipping a bunch of ICU versions to facilitate that.  They
will usually ship one.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-07 Thread Amit Kapila
On Wed, Sep 7, 2016 at 3:52 PM, Kuntal Ghosh  wrote:
>
> I got two types of inconsistencies as following:
>
> 1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
> page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
> redo, only BTP_DELETED flag is set in buffer page.
>

I see that inconsistency in the code as well.  I think this is harmless,
because after the page is marked as deleted, it is not used for any
purpose other than to recycle it for re-use.  After re-using it, the
caller is always supposed to initialize the flags based on its usage, and
I see that is happening in the code unless I am missing something.

> I assume that we
> should clear all btpo_flags before setting BTP_DELETED in
> _bt_unlink_halfdead_page().
>

Yeah, we can do that for consistency.  If we see any problem in doing
so, then I think we can log the flags and set them during replay.


Note - Please post your replies inline rather than top-posting them.
It breaks the flow of the discussion if you top-post.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL

2016-09-07 Thread Simon Riggs
On 7 September 2016 at 13:47, Fujii Masao  wrote:
> On Tue, Sep 6, 2016 at 11:41 PM, Simon Riggs  wrote:
>> Fix VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL
>>
>> lazy_truncate_heap() was waiting for
>> VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL, but in microseconds
>> not milliseconds as originally intended.
>
> Don't we need to back-patch this?

If we do then a database-wide VACUUM on a busy database will take
substantially longer than it does now.

That may not be perceived as a "fix" by everybody, so we should not do
it without an explicit agreement by many.

Thoughts?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL

2016-09-07 Thread Fujii Masao
On Tue, Sep 6, 2016 at 11:41 PM, Simon Riggs  wrote:
> Fix VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL
>
> lazy_truncate_heap() was waiting for
> VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL, but in microseconds
> not milliseconds as originally intended.

Don't we need to back-patch this?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-07 Thread Simon Riggs
On 6 September 2016 at 19:59, Tom Lane  wrote:

> The idea of looking to the stats to *guess* about how many tuples are
> removable doesn't seem bad at all.  But imagining that that's going to be
> exact is folly of the first magnitude.

Yes.  Bear in mind I had already referred to allowing +10% to be safe,
so I think we agree that a reasonably accurate, yet imprecise
calculation is possible in most cases.

If a recent transaction has committed, we will see both committed dead
rows and stats to show they exist. I'm sure there are corner cases and
race conditions where a major effect (greater than 10%) could occur,
in which case we run the index scan more than once, just as we do now.

The attached patch raises the limits as suggested by Claudio, allowing
for larger memory allocations if possible, yet limits the allocation
for larger tables based on the estimate gained from pg_stats, while
adding 10% for caution.
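
In other words, something along these lines (a sketch of the sizing logic
only, with made-up names; the attached patch is authoritative):

/* size the dead-TID array from the stats estimate plus 10% slack,
 * clamped to what maintenance_work_mem would otherwise allow */
static long
dead_tuple_slots(double est_dead_tuples, long slots_allowed_by_mem)
{
	long		wanted = (long) (est_dead_tuples * 1.10) + 100;	/* +10%, small floor */

	if (wanted > slots_allowed_by_mem)
		wanted = slots_allowed_by_mem;
	return wanted;
}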

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


vacuum_mem_estimate.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-07 Thread Kuntal Ghosh
On Wed, Sep 7, 2016 at 3:52 PM, Kuntal Ghosh  wrote:
> Hello,
>
> As per the earlier discussions, I've attached the updated patch for
> WAL consistency check feature. This is how the patch works:
>

The earlier patch (wal_consistency_v6.patch) was based on the commit
id 67e1e2aaff.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Exclude additional directories in pg_basebackup

2016-09-07 Thread David Steele
On 9/7/16 8:21 AM, David Steele wrote:
> On 9/6/16 10:25 PM, Michael Paquier wrote:
>> I find that ugly. I'd rather use an array with undefined size for the
>> fixed elements finishing by NULL, remove EXCLUDE_DIR_MAX and
>> EXCLUDE_DIR_STAT_TMP and use a small routine to do the work done on
>> _tarWriteHeader...
> 
> My goal was to be able to fully reuse the code that creates the paths,
> but this could also be done by following your suggestion and also moving
> the path code into a function.

I just realized all I did was repeat what you said...

-- 
-David
da...@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Exclude additional directories in pg_basebackup

2016-09-07 Thread David Steele
On 9/6/16 10:25 PM, Michael Paquier wrote:
> On Wed, Sep 7, 2016 at 12:16 AM, David Steele  wrote:
>> Attached is a new patch that adds sgml documentation.  I can expand on each
>> directory individually if you think that's necessary, but thought it was
>> better to lump them into a few categories.
> 
> +be ommitted from the backup as they will be initialized on postmaster
> +startup. If the  is set and is
> +under the database cluster directory then the contents of the directory
> +specified by  can also
> be ommitted.
> 
> s/ommitted/omitted/

Thanks!

> +#define EXCLUDE_DIR_MAX 8
> +#define EXCLUDE_DIR_STAT_TMP 0
> +
> +const char *excludeDirContents[EXCLUDE_DIR_MAX] =
> +{
> +/*
> + * Skip temporary statistics files. The first array position will be
> + * filled with the value of pgstat_stat_directory relative to PGDATA.
> + * PG_STAT_TMP_DIR must be skipped even when stats_temp_directory is set
> + * because PGSS_TEXT_FILE is always created there.
> + */
> +NULL,
> I find that ugly. I'd rather use an array with undefined size for the
> fixed elements finishing by NULL, remove EXCLUDE_DIR_MAX and
> EXCLUDE_DIR_STAT_TMP and use a small routine to do the work done on
> _tarWriteHeader...

My goal was to be able to fully reuse the code that creates the paths,
but this could also be done by following your suggestion and also moving
the path code into a function.

Any opinion on this, Peter?

-- 
-David
da...@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-07 Thread Tomas Vondra


On 09/07/2016 01:13 PM, Amit Kapila wrote:
> On Wed, Sep 7, 2016 at 1:08 AM, Tomas Vondra
>  wrote:
>> On 09/06/2016 04:49 AM, Amit Kapila wrote:
>>> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>>>  wrote:


 On 09/05/2016 06:03 AM, Amit Kapila wrote:
>  So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations
>
> For approach-1, you can use patch [1].  For approach-2, you can use
> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
> with this mail.  For approach-3, you can use
> 0001-Improve-64bit-atomics-support patch[2] and the patch attached
> with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
> work for you then for now we can compare approach-1 and approach-2.
>

 OK, I can compile all three cases - but only with gcc 4.7 or newer. Sadly
 the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
 attempts to update to a newer version were unsuccessful so far.

>>>
>>> So which of the patches are you able to compile on the 4-socket m/c?  I
>>> think it is better to measure the performance on bigger machine.
>>
>> Oh, sorry - I forgot to mention that only the last test (with
>> USE_CONTENT_LOCK commented out) fails to compile, because the functions
>> for atomics were added in gcc-4.7.
>>
> 
> No issues, in that case we can leave the last test for now and do it later.
> 

FWIW I've managed to compile a new GCC on the system (all I had to do
was to actually read the damn manual), so I'm ready to do the test once
I get a bit of time.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] identity columns

2016-09-07 Thread Vitaly Burovoy
Hello,

The first look at the patch:

On 8/30/16, Peter Eisentraut  wrote:
> Here is another attempt to implement identity columns.  This is a
> standard-conforming variant of PostgreSQL's serial columns.
>
> ...
>
> Some comments on the implementation, and where it differs from previous
> patches:
>
> - The new attidentity column stores whether a column is an identity
> column and when it is generated (always/by default).  I kept this
> independent from atthasdef mainly for he benefit of existing (client)
> code querying those catalogs, but I kept confusing myself by this, so
> I'm not sure about that choice.  Basically we need to store four
> distinct states (nothing, default, identity always, identity by default)
> somehow.

I don't have a strong opinion on which way is preferred. I think if the
community is not against it, it can be left as is.

> ...
> - I did not implement the restriction of one identity column per table.
> That didn't seem necessary.

I think it should be mentioned in the docs' "Compatibility" section as a
PostgreSQL extension (similar to "Zero-column Tables").

> ...
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Questions:
1. Does your patch implement the T174 feature? Should a corresponding line
be changed in src/backend/catalog/sql_features.txt?
2. Initializing attidentity in most places is ' ' but makefuncs.c has
"n->identity = 0;". Is it correct?
3. doc/src/sgml/ref/create_table.sgml (5th chunk) has "TODO". Why?
4. There is "ADD GENERATED", but the standard says it should be "SET
GENERATED" (ISO/IEC 9075-2 Subcl.11.20)
5. In ATExecAddIdentity: is it a good idea to check whether
"attTup->attidentity" is the same as the given in "(ADD) SET
GENERATED" and do nothing (except changing sequence's options) in
addition to strict checking for "unset" (" ")?
6. In ATExecDropIdentity: is it a good idea to do nothing if the
column is already not a identity (the same behavior as DROP NOT
NULL/DROP DEFAULT)?
7. Is there any reason to insert CREATE_TABLE_LIKE_IDENTITY before
CREATE_TABLE_LIKE_INDEXES, not at the end?

Why do you change catversion.h? It can lead to conflicts when other
patches that influence it are committed...

I'll have a closer look soon.

-- 
Best regards,
Vitaly Burovoy


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-07 Thread Dagfinn Ilmari Mannsåker
Emre Hasegeli  writes:

>> Bottom line here is that I'd rather commit ALTER TYPE RENAME VALUE with
>> no EXISTS features and then see it accrete those features together with
>> other types of RENAME, when and if there's a will to make that happen.
>
> This sounds like a good conclusion to me.

Works for me. I mainly added the IF [NOT] EXISTS support to be
consistent with ADD VALUE, and because I like idempotency, but I'm not
married to it.

Here is version 6 of the patch, which just adds RENAME VALUE with no IF
[NOT] EXISTS, rebased onto current master (particularly the
transactional ADD VALUE patch).

>From e4ee07d53bbab5ee434a12a2f30a9ac9bb5c89d3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 6 Sep 2016 14:38:06 +0100
Subject: [PATCH 1/2] =?UTF-8?q?Add=20ALTER=20TYPE=20=E2=80=A6=20RENAME=20V?=
 =?UTF-8?q?ALUE=20=E2=80=A6=20TO=20=E2=80=A6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Unlike adding values to an enum, altering an existing value doesn't
affect the OID values or ordering, so can be done in a transaction
regardless of whether the enum was created in the same transaction.
---
 doc/src/sgml/ref/alter_type.sgml   | 18 +++-
 src/backend/catalog/pg_enum.c  | 91 ++
 src/backend/commands/typecmds.c| 12 +++--
 src/backend/parser/gram.y  | 14 ++
 src/include/catalog/pg_enum.h  |  2 +
 src/include/nodes/parsenodes.h |  1 +
 src/test/regress/expected/enum.out | 58 
 src/test/regress/sql/enum.sql  | 27 +++
 8 files changed, 217 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/alter_type.sgml b/doc/src/sgml/ref/alter_type.sgml
index aec73f6..bdc2fa2 100644
--- a/doc/src/sgml/ref/alter_type.sgml
+++ b/doc/src/sgml/ref/alter_type.sgml
@@ -29,6 +29,7 @@ ALTER TYPE name RENAME ATTRIBUTE name RENAME TO new_name
 ALTER TYPE name SET SCHEMA new_schema
 ALTER TYPE name ADD VALUE [ IF NOT EXISTS ] new_enum_value [ { BEFORE | AFTER } existing_enum_value ]
+ALTER TYPE name RENAME VALUE existing_enum_value TO new_enum_value
 
 where action is one of:
 
@@ -123,6 +124,17 @@ ALTER TYPE name ADD VALUE [ IF NOT
 

 
+   
+RENAME VALUE TO
+
+ 
+  This form renames a value in an enum type.
+  The value's place in the enum's ordering is not affected.
+  An error will occur if the old value is not already present or the new value is.
+ 
+
+   
+

 CASCADE
 
@@ -241,7 +253,8 @@ ALTER TYPE name ADD VALUE [ IF NOT
   new_enum_value
   

-The new value to be added to an enum type's list of values.
+The new value to be added to an enum type's list of values,
+or to rename an existing one to.
 Like all enum literals, it needs to be quoted.

   
@@ -252,7 +265,8 @@ ALTER TYPE name ADD VALUE [ IF NOT
   

 The existing enum value that the new value should be added immediately
-before or after in the enum type's sort ordering.
+before or after in the enum type's sort ordering,
+or renamed from.
 Like all enum literals, it needs to be quoted.

   
diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index c66f963..2e48dfe 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -465,6 +465,97 @@ AddEnumLabel(Oid enumTypeOid,
 }
 
 
+/*
+ * RenameEnumLabel
+ *		Rename a label in an enum set.
+ */
+void
+RenameEnumLabel(Oid enumTypeOid,
+const char *oldVal,
+const char *newVal)
+{
+	Relation	pg_enum;
+	HeapTuple	old_tup = NULL;
+	HeapTuple	enum_tup;
+	CatCList   *list;
+	int			nelems;
+	int			nbr_index;
+	Form_pg_enum nbr_en;
+	boolfound_new = false;
+
+	/* check length of new label is ok */
+	if (strlen(newVal) > (NAMEDATALEN - 1))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_NAME),
+ errmsg("invalid enum label \"%s\"", newVal),
+ errdetail("Labels must be %d characters or less.",
+		   NAMEDATALEN - 1)));
+
+	/*
+	 * Acquire a lock on the enum type, which we won't release until commit.
+	 * This ensures that two backends aren't concurrently modifying the same
+	 * enum type.  Without that, we couldn't be sure to get a consistent view
+	 * of the enum members via the syscache.  Note that this does not block
+	 * other backends from inspecting the type; see comments for
+	 * RenumberEnumType.
+	 */
+	LockDatabaseObject(TypeRelationId, enumTypeOid, 0, ExclusiveLock);
+
+	pg_enum = heap_open(EnumRelationId, RowExclusiveLock);
+
+	/* Get the list of existing members of the enum */
+	list = SearchSysCacheList1(ENUMTYPOIDNAME,
+			   ObjectIdGetDatum(enumTypeOid));
+	nelems = list->n_members;
+
+	/* Locate the element to rename and check if the new label is already in
+	 * use.  The unique index on pg_enum would catch this anyway, but we
+	 * prefer 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-07 Thread Amit Kapila
On Wed, Sep 7, 2016 at 1:08 AM, Tomas Vondra
 wrote:
> On 09/06/2016 04:49 AM, Amit Kapila wrote:
>> On Mon, Sep 5, 2016 at 11:34 PM, Tomas Vondra
>>  wrote:
>>>
>>>
>>> On 09/05/2016 06:03 AM, Amit Kapila wrote:
  So, in short we have to compare three
 approaches here.

 1) Group mode to reduce CLOGControlLock contention
 2) Use granular locking model
 3) Use atomic operations

 For approach-1, you can use patch [1].  For approach-2, you can use
 0001-Improve-64bit-atomics-support patch[2] and the patch attached
 with this mail.  For approach-3, you can use
 0001-Improve-64bit-atomics-support patch[2] and the patch attached
 with this mail by commenting USE_CONTENT_LOCK.  If the third doesn't
 work for you then for now we can compare approach-1 and approach-2.

>>>
>>> OK, I can compile all three cases - but only with gcc 4.7 or newer. Sadly
>>> the 4-socket 64-core machine runs Debian Jessie with just gcc 4.6 and my
>>> attempts to update to a newer version were unsuccessful so far.
>>>
>>
>> So which of the patches are you able to compile on the 4-socket m/c?  I
>> think it is better to measure the performance on bigger machine.
>
> Oh, sorry - I forgot to mention that only the last test (with
> USE_CONTENT_LOCK commented out) fails to compile, because the functions
> for atomics were added in gcc-4.7.
>

No issues, in that case we can leave the last test for now and do it later.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-09-07 Thread Victor Wagner
On Wed, 7 Sep 2016 17:09:17 +0900
Michael Paquier  wrote:

> On Sun, Sep 4, 2016 at 11:39 PM, Andreas Karlsson 
> wrote:
> > 1) Serialize the certificates, key, and CRL and write them to the
> > backend_var temp file and then deserialize everything in the
> > backends.
> >
> > Sounds like you would need to write some code for every SSL library
> > to support the serialization and deserialization, which I am not a
> > fan of doing just for one platform since I worry about little used
> > code paths. Additionally this would mean that we write a copy of
> > the private key to potentially another file system than the one
> > where the private key is stored, this sounds like a bad idea from a
> > security point of view.  
> 
> Yeah... This would result in something that is heavily SSL-dependent,
> which would be an additional maintenance pain when trying to support
> future OpenSSL versions.

OpenSSL has a documented API for serializing/deserializing each and every
cryptographic format it supports. And this API is quite unlikely to change
in future OpenSSL versions. Moreover, LibreSSL is compatible with this
API as far as I know.

Really, Apache has done a similar thing for ages - it starts as root, loads
certificates and keys, serializes them in memory, then forks and drops
privileges, and then uses these keys and certificates.

There are two problems with this approach

1. You have to carefully design data structures to store serialized
keys. Apache made a mistake there and didn't allow for future invention
of new public key algorithms.  So, in 2008 I had problems adding
Russian GOST cryptography there.

2. Your keys and certificates might not be stored in the filesystem at
all. They can live in some hardware cryptography module, which doesn't
let private keys out and just provides a handle.
(OpenSSL supports this via loadable engine modules, and Postgres has
allowed the use of engine modules since 8.2, so people can use it with
postgres.)

Really, OpenSSL/LibreSSL provide a useful abstraction, the memory BIO (where
BIO stands for basic input/output), which can be used to store
serialized cryptographic data. And the serializing/deserializing API is
designed to work with BIOs anyway.
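For illustration, a minimal sketch of that round trip for a certificate,
using a memory BIO (error handling omitted; this is not from any patch,
just the standard OpenSSL calls being described):

    #include <openssl/bio.h>
    #include <openssl/pem.h>
    #include <openssl/x509.h>

    /* Serialize a loaded certificate into memory and read it back,
     * e.g. after fork, without touching the filesystem. */
    static X509 *
    roundtrip_certificate(X509 *cert)
    {
        BIO  *mem = BIO_new(BIO_s_mem());
        X509 *copy;

        PEM_write_bio_X509(mem, cert);                    /* serialize */
        copy = PEM_read_bio_X509(mem, NULL, NULL, NULL);  /* deserialize */
        BIO_free(mem);
        return copy;
    }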




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Ashutosh Sharma
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that i have executed for
validating WAL logging in hash index feature.

1) I have mainly run the pgbench test with a read-write workload at a
scale factor of 1000 and various client counts like 16, 64 and 128, for
durations of 30 mins, 1 hr and 24 hrs. I have executed this test
on a highly configured power2 machine with 128 cores and 512GB of RAM. I
ran the test case both with and without the replication setup.

Please note that I have changed the schema of the pgbench tables created
during the initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid)


2) I have also made use of the following tools for analyzing the hash
index tables created on master and standby.

a) pgstattuple : Using this tool, I could verify the tuple-level
statistics, which include the index table's physical length, dead tuple
percentage and other info, on both master and standby. However, I could
see the below error message when using this tool for one of the hash index
tables on both the master and standby server.

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR:  index "pgbench_history_bid" contains unexpected zero page at
block 2482297

To investigate this, I made use of the pageinspect contrib module and
found that the above block contains an uninitialised page. But
this is quite possible in a hash index, as bucket splits happen in
powers of 2 and we may find some uninitialised bucket pages.

b) pg_filedump : Using this tool, I could take a dump of all the hash
index tables on master and standby and compare the dump files to verify
whether there is any difference between the hash index pages on master and
standby.

In short, this is all I did to verify the patch for WAL logging in hash indexes.

I would like to thank Amit and Robert for their valuable inputs during
the hash index testing phase.

I am also planning to verify WAL logging in hash indexes using Kuntal's
patch for the WAL consistency check once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-09-07 Thread Amit Langote

Hi,

On 2016/09/07 17:56, Rajkumar Raghuwanshi wrote:
> Hi,
> 
> I have a query regarding list partitioning,
> 
> For example if I want to store employee data in a table, with "IT" dept
> employee in emp_p1 partition, "HR" dept employee in emp_p2 partition and if
> employee belongs to other than these two, should come in emp_p3 partition.
> 
> In this case not sure how to create partition table. Do we have something
> like we have UNBOUNDED for range partition or oracle have "DEFAULT" for
> list partition.
> 
> create table employee (empid int, dept varchar) partition by list(dept);
> create table emp_p1 partition of employee for values in ('IT');
> create table emp_p2 partition of employee for values in ('HR');
> create table emp_p3 partition of employee for values in (??);

Sorry, no such feature is currently offered.  It might be possible to
offer something like a "default" list partition which accepts values other
than those specified for other existing partitions.  However, that means
if we add a non-default list partition after a default one has been
created, the implementation must make sure that it moves any values from
the default partition that now belong to the newly created partition.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-09-07 Thread Kuntal Ghosh
Hello,

As per the earlier discussions, I've attached the updated patch for
WAL consistency check feature. This is how the patch works:

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.
- I've extended the RmgrTable with a new function pointer
rm_checkConsistency, which is called after rm_redo. (only when WAL
consistency check is enabled for this rmgrID)
- In each rm_checkConsistency, both backup pages and buffer pages are
masked accordingly before any comparison.
- In postgresql.conf, a new guc variable named 'wal_consistency' is
added. Default value of this variable is 'None'. Valid values are
combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
BRIN, Generic and XLOG. It can also be set to 'All' to enable all the
values.
- In the recovery tests (src/test/recovery/t), I've added the wal_consistency
parameter to the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in the
corresponding log file.
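To make the comparison step concrete, here is a minimal sketch of the kind
of check each rm_checkConsistency routine performs (illustrative only;
mask_page() stands in for the rmgr-specific masking functions and is not
the actual code of the patch):

    /* Compare the page produced by redo against the backup image from the
     * WAL record, after masking the fields that may legitimately differ
     * (hint bits, unused space between pd_lower and pd_upper, etc.).
     * Assumes the usual PostgreSQL backend headers. */
    static void
    check_page_consistency(Page redo_page, Page backup_page, BlockNumber blkno)
    {
        char    local1[BLCKSZ];
        char    local2[BLCKSZ];

        memcpy(local1, redo_page, BLCKSZ);
        memcpy(local2, backup_page, BLCKSZ);

        mask_page(local1);      /* hypothetical masking hook */
        mask_page(local2);

        if (memcmp(local1, local2, BLCKSZ) != 0)
            elog(FATAL, "WAL consistency check failed for block %u", blkno);
    }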

Results


I've tested with installcheck and installcheck-world in master-standby
set-up. Followings are the configuration parameters.

Master:
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 4000
hot_standby = on
wal_consistency = 'All'

Standby:
wal_consistency = 'All'

I got two types of inconsistencies as following:

1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
redo, only BTP_DELETED flag is set in buffer page. I assume that we
should clear all btpo_flags before setting BTP_DELETED in
_bt_unlink_halfdead_page().

2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
REVMAP page. This happens only for two cases. I'm not sure what the
reason can be.

I haven't done sufficient tests yet to measure the overhead of this
modification. I'll do that next.


Thanks to Amit Kapila, Dilip Kumar and Robert Haas for their off-line
suggestions.

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

On Thu, Sep 1, 2016 at 11:34 PM, Peter Geoghegan  wrote:
> On Thu, Sep 1, 2016 at 9:23 AM, Robert Haas  wrote:
>> Indeed, it had occurred to me that we might not even want to compile
>> this code into the server unless WAL_DEBUG is defined; after all, how
>> does it help a regular user to detect that the server has a bug?  Bug
>> or no bug, that's the code they've got.  But on further reflection, it
>> seems like it could be useful: if we suspect a bug in the redo code
>> but we can't reproduce it here, we could ask the customer to turn this
>> option on to see whether it produces logging indicating the nature of
>> the problem.  However, because of the likely expensive of enabling the
>> feature, it seems like it would be quite desirable to limit the
>> expense of generating many extra FPWs to the affected rmgr.  For
>> example, if a user has a table with a btree index and a gin index, and
>> we suspect a bug in GIN, it would be nice for the user to be able to
>> enable the feature *only for GIN* rather than paying the cost of
>> enabling it for btree and heap as well.[2]
>
> Yes, that would be rather a large advantage.
>
> I think that there really is no hard distinction between users and
> hackers. Some people will want to run this in production, and it would
> be a lot better if performance was at least not atrocious. If amcheck
> couldn't do the majority of its verification with only an
> AccessShareLock, then users probably just couldn't use it. Heroku
> wouldn't have been able to use it on all production databases. It
> wouldn't have mattered that the verification was no less effective,
> since the bugs it found would simply never have been observed in
> practice.
>
> --
> Peter Geoghegan



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 27ba0a9..4c63ded 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,7 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
-
+#include "storage/bufmask.h"
 
 /*
  * xlog replay routines
@@ -286,3 +286,84 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
 * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+brin_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Amit Kapila
On Wed, Sep 7, 2016 at 3:28 PM, Amit Kapila  wrote:
>
> Okay, I have fixed this issue as explained above.  Apart from that, I
> have fixed another issue reported by Mark Kirkwood upthread and few
> other issues found during internal testing by Ashutosh Sharma.
>

Forgot to mention that you need to apply the latest concurrent hash
index patch [1] before applying this patch.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm%2Bqn0jZX31-obXrJw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical Replication WIP

2016-09-07 Thread Petr Jelinek

On 07/09/16 02:56, Peter Eisentraut wrote:

Review of 0002-Add-SUBSCRIPTION-catalog-and-DDL.patch:

Similar concerns as before about ALTER syntax, e.g., does ALTER
SUBSCRIPTION ... PUBLICATION add to or replace the publication set?



It sets.


For that matter, why is there no way to add?



It didn't seem all that useful; the expectation here is that most
subscriptions will use only a couple of publications.




I think we should allow creating subscriptions initally without
publications.  This could be useful for example to test connections,
or create slots before later on adding publications.  Seeing that
there is support for changing the publications later, this shouldn't
be a problem.



Sure, but they need to be created disabled then.



Larger conceptual issues:

I haven't read the rest of the code yet to understand why
pg_subscription needs to be a shared catalog, but be that as it may, I
would still make it so that subscription names appear local to the
database.  We already have the database OID in the pg_subscription
catalog, so I would make the key (subname, subdatid).  DDL commands
would only operate on subscriptions in the current database (or you
could use "datname"."subname" syntax), and \drs would only show
subscriptions in the current database.  That way the shared catalog is
an implementation detail that can be changed in the future.  I think
it would be very helpful for users if publications and subscriptions
appear to work in a parallel way.  If I have two databases that I want
to replicate between two servers, I might want to have a publication
"mypub" in each database and a subscription "mysub" in each database.
If I learn that the subscriptions can't be named that way, then I have
to go back to rename to the publications, and it'll all be a bit
frustrating.


Okay, that makes sense. pg_subscription is a shared catalog so that we
can have one launcher per cluster instead of one per database. Otherwise
there is no reason why it could not behave like a local catalog.




Some thoughts on pg_dump and such:

Even an INITIALLY DISABLED subscription needs network access to create
the replication slot.  So restoring a dump when the master is not
available will have some difficulties.  And restoring master and slave
at the same time (say disaster recovery) will not necessarily work
well either.  Also, the general idea of doing network access during a
backup restore without obvious prior warning sounds a bit unsafe.

I imagine maybe having three states for subscriptions: DISABLED,
PREPARED, ENABLED (to use existing key words).  DISABLED just exists
in the catalog, PREPARED has the slots set up, ENABLED starts
replicating.  So you can restore a dump with all slots disabled.  And
then it might be good to have a command to "prepare" or "enable" all
subscriptions at once.


Well, the DISABLED keyword is also used in ALTER to stop the subscription
but not remove it; that would no longer map well if we used the
behavior you described. That being said, I agree with the idea of having a
subscription that exists just locally in the catalog; we just need to figure
out better naming.




As I had mentioned privately before, I would perhaps have CREATE
SUBSCRIPTION use the foreign server infrastructure for storing
connection information.



Hmm, yeah, it's an idea. My worry there is that it will make it a bit more
complex to set up, as the user will have to first create a server and a user
mapping before creating a subscription.



We'll have to keep thinking about ways to handle abandonded
replication slots.  I imagine that people will want to create
subscriber instances in fully automated ways.  If that fails every so
often and requires manual cleanup of replication slots on the master
some of the time, that will get messy.  I don't have well-formed ideas
about this, though.



Yes, it's a potential issue; I don't have a good solution for it either. It's
a loosely coupled system, so we can't have 100% control over everything.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-09-07 Thread Rajkumar Raghuwanshi
Hi,

I have a query regarding list partitioning,

For example if I want to store employee data in a table, with "IT" dept
employee in emp_p1 partition, "HR" dept employee in emp_p2 partition and if
employee belongs to other than these two, should come in emp_p3 partition.

In this case not sure how to create partition table. Do we have something
like we have UNBOUNDED for range partition or oracle have "DEFAULT" for
list partition.

create table employee (empid int, dept varchar) partition by list(dept);
create table emp_p1 partition of employee for values in ('IT');
create table emp_p2 partition of employee for values in ('HR');
create table emp_p3 partition of employee for values in (??);

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Tue, Sep 6, 2016 at 6:37 PM, Amit Langote 
wrote:

> On Tue, Sep 6, 2016 at 9:19 PM, Robert Haas  wrote:
> > On Wed, Aug 31, 2016 at 1:05 PM, Amit Langote
> >  wrote:
> >>> However, it seems a lot better to make it a property of the parent
> >>> from a performance point of view.  Suppose there are 1000 partitions.
> >>> Reading one toasted value for pg_class and running stringToNode() on
> >>> it is probably a lot faster than scanning pg_inherits to find all of
> >>> the child partitions and then doing an index scan to find the pg_class
> >>> tuple for each and then decoding all of those tuples and assembling
> >>> them into some data structure.
> >>
> >> Seems worth trying.  One point that bothers me a bit is how do we
> enforce
> >> partition bound condition on individual partition basis.  For example
> when
> >> a row is inserted into a partition directly, we better check that it
> does
> >> not fall outside the bounds and issue an error otherwise.  With current
> >> approach, we just look up a partition's bound from the catalog and gin
> up
> >> a check constraint expression (and cache in relcache) to be enforced in
> >> ExecConstraints().  With the new approach, I guess we would need to look
> >> up the parent's partition descriptor.  Note that the checking in
> >> ExecConstraints() is turned off when routing a tuple from the parent.
> >
> > [ Sorry for the slow response. ]
> >
> > Yeah, that's a problem.  Maybe it's best to associate this data with
> > the childrels after all - or halfway in between, e.g. augment
> > pg_inherits with this information.  After all, the performance problem
> > I was worried about above isn't really much of an issue: each backend
> > will build a relcache entry for the parent just once and then use it
> > for the lifetime of the session unless some invalidation occurs.  So
> > if that takes a small amount of extra time, it's probably not really a
> > big deal.  On the other hand, if we can't build the implicit
> > constraint for the child table without opening the parent, that's
> > probably going to cause us some serious inconvenience.
>
> Agreed.  So I will stick with the existing approach.
>
> Thanks,
> Amit
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-09-07 Thread Michael Paquier
On Sun, Sep 4, 2016 at 11:39 PM, Andreas Karlsson  wrote:
> 1) Serialize the certificates, key, and CRL and write them to the
> backend_var temp file and then deserialize everything in the backends.
>
> Sounds like you would need to write some code for every SSL library to
> support the serialization and deserialization, which I am not a fan of doing
> just for one platform since I worry about little used code paths.
> Additionally this would mean that we write a copy of the private key to
> potentially another file system than the one where the private key is
> stored, this sounds like a bad idea from a security point of view.

Yeah... This would result in something that is heavily SSL-dependent,
which would be an additional maintenance pain when trying to support
future OpenSSL versions.

> 2) Copy all the SSL related files into the data directory at SIGHUP, before
> loading them. While this does not require any serialization of certificates
> it still has the problem of writing private keys to disk.

You expressed enough concern about that upthread, copying private keys
into PGDATA is a security concern.

> 3) Leave my patch as it is now. This means the postmaster will reload
> certificates on SIGHUP while the backends will also load them when spawning.
> This means windows will continue to work the same as before my patch.
>
> Is there any other way to pass the current set of loaded certificates and
> keys from the postmaster to the backends on Windows? I guess you could use a
> pipe, but if so we should probably send all data on this pipe, not just the
> SSL stuff.
>
> I am leaning towards doing (3) but I know I am biased since it is less work
> and I do not care much for Windows.

Seriously... The benefit of this feature is clear for a lot of people.
And an implementation dedicated only to Windows would just result in
a grotty thing anyway. So I'd say that at this point we could just
push for 3) and make life easier for most people using SSL configuration.
The behavior across platforms needs to be properly documented for
sure.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-07 Thread Stas Kelvich
> On 07 Sep 2016, at 03:09, Michael Paquier  wrote:
> 
>>> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
>>> 
>>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
>>> wrote:
 Oh, I was preparing new version of patch, after fresh look on it. 
 Probably, I should
 said that in this topic. I’ve found a bug in sub transaction handling and 
 now working
 on fix.
>>> 
>>> What's the problem actually?
>> 
>> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
>> prepared tx's in memory.
>> Also I want to double-check and add comments to RecoveryInProgress() checks 
>> in FinishGXact.
>> 
>> I’ll post reworked patch in several days.
> 
> Could you use as a base the version I just sent yesterday then? I
> noticed many mistakes in the comments, for example many s/it's/its/
> and did a couple of adjustments around the code, the goto next_file
> was particularly ugly. That will be that much work not do to again
> later.

Yes. Already merged branch with your version.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-09-07 Thread Michael Paquier
On Wed, Sep 7, 2016 at 2:32 AM, Tom Lane  wrote:
> I think probably a better answer is to reject bad paths earlier, eg have
> initdb error out before doing anything if the proposed -D path contains
> CR/LF.

Yes, that's a bug that we had better address. It is not nice to not
clean up PGDATA in this case.

> Given the collection of security bugs we just fixed, and the
> strong likelihood that user-written scripts contain other instances of
> the same type of problem, I do not think we are doing anybody any favors
> if we sweat bullets to support such paths.

Yep. Database and role names are the two patterns defined by the user
at the SQL level that can be reused by command lines, so it still seems to
me that the patch I sent is the correct call.

> And I'm not even convinced
> that we *can* support such paths on Windows; no one was able to find a
> safe shell quoting solution there.

None has been found as far as I know; the problem goes down into the Windows
core, with paths passed to cmd.exe, which cannot handle those two
characters correctly.

> And yeah, the docs probably need work.

Patch 0001 is what I have done for the database and role names. Patch
0002 adds a routine in string_utils.c to check whether a given string is
valid for shell commands. I added a call to that in initdb.c,
which is at least what we'd want to check because it calls
appendShellString directly. Now I am a bit reluctant to spread that
elsewhere... Users would get the message only with initdb if they see
the problem. I have added a note in the initdb page as well, but
that does not sound like enough to me; we may want to add a warning in a
more common place. Does somebody have an idea where we could add that?

Also, in 0001 I have changed the docs to refer to newline and carriage
return characters instead of CR and LF.
-- 
Michael
From 016a8168b39b67a6468ee1a752ee9ec82cf3fecb Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 7 Sep 2016 16:39:57 +0900
Subject: [PATCH 1/2] Forbid newline and carriage return characters in database
 and role names

---
 doc/src/sgml/ref/create_database.sgml |  7 +++
 doc/src/sgml/ref/create_role.sgml |  7 +++
 src/backend/commands/dbcommands.c | 23 +++
 src/backend/commands/user.c   | 21 +
 src/fe_utils/string_utils.c   |  4 +---
 5 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index cf33746..2477ba7 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -260,6 +260,13 @@ CREATE DATABASE name
connection slot remains for the database, it is possible that
both will fail.  Also, the limit is not enforced against superusers.
   
+
+  
+Database names cannot include newline or carriage return characters
+as those could be at the origin of security breaches, particularly on
+Windows where the command shell is unusable with arguments containing
+such characters.
+   
  
 
  
diff --git a/doc/src/sgml/ref/create_role.sgml b/doc/src/sgml/ref/create_role.sgml
index 38cd4c8..9d6926e 100644
--- a/doc/src/sgml/ref/create_role.sgml
+++ b/doc/src/sgml/ref/create_role.sgml
@@ -404,6 +404,13 @@ CREATE ROLE name [ [ WITH ] \password that can be used to safely change the
password later.
   
+
+  
+   Role names cannot include newline or carriage return characters
+   as those could be at the origin of security breaches, particularly on
+   Windows where the command shell is unusable with arguments containing
+   such characters.
+  
  
 
  
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index ef48659..22712ec 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -77,6 +77,7 @@ typedef struct
 } movedb_failure_params;
 
 /* non-export function prototypes */
+static void check_db_name(const char *dbname);
 static void createdb_failure_callback(int code, Datum arg);
 static void movedb(const char *dbname, const char *tblspcname);
 static void movedb_failure_callback(int code, Datum arg);
@@ -469,6 +470,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		/* Note there is no additional permission check in this path */
 	}
 
+	/* do sanity checks on database name */
+	check_db_name(dbname);
+
 	/*
 	 * Check for db name conflict.  This is just to give a more friendly error
 	 * message than "unique index violation".  There's a race condition but
@@ -758,6 +762,22 @@ check_encoding_locale_matches(int encoding, const char *collate, const char *cty
    pg_encoding_to_char(collate_encoding;
 }
 
+/*
+ * Perform sanity checks on the database name.
+ */
+static void
+check_db_name(const char *dbname)
+{
+	if (strchr(dbname, '\n') != NULL)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("database name cannot include newline character")));
+	if 

Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-07 Thread Emre Hasegeli
> Bottom line here is that I'd rather commit ALTER TYPE RENAME VALUE with
> no EXISTS features and then see it accrete those features together with
> other types of RENAME, when and if there's a will to make that happen.

This sounds like a good conclusion to me.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stopping logical replication protocol

2016-09-07 Thread Vladimir Gordiychuk
>Review comments on the 2nd patch, i.e. the 2nd half of your original patch:
>
>* Other places in logical decoding use the CB suffix for callback
>types. This should do the same.
>
>* I'm not too keen on the name is_active  for the callback. We
>discussed the name continue_decoding_cb in our prior conversation.
>
>* Instead of having LogicalContextAlwaysActive, would it be better to
>test if the callback pointer is null and skip it if so?
>
>* Lots of typos in LogicalContextAlwaysActive comments, and wording is unclear
>
>* comment added to arguments of pg_logical_slot_get_changes_guts is
>not in PostgreSQL style - it goes way wide on the line. Probably
>better to put the comment above the call and mention which argument it
>refers to.
>
>* A comment somewhere is needed - probably on the IsStreamingActive()
>callback - to point readers at the fact that WalSndWriteData() calls
>ProcessRepliesIfAny() and that's how IsStreamingActive can get set. I
>had to look at the email discussion history to remind myself how this
>worked when I was re-reading the code.
>
>* I'm not keen on
>
>typedef ReorderBufferIsActive LogicalDecondingContextIsActive;
>
>  I think instead a single callback name that encompasses both should
>be used. ContinueDecodingCB ?


Fixed patch attached.

> * There are no tests. I don't see any really simple way to test this, though.

I would be grateful if you could suggest the best way to do it. I test these
patches with pgjdbc driver tests that also build against the head of postgres.
You can check the test scenario for both patches in PR
https://github.com/pgjdbc/pgjdbc/pull/550

org.postgresql.replication.LogicalReplicationTest#testDuringSendBigTransactionReplicationStreamCloseNotActive
org.postgresql.replication.LogicalReplicationTest#testAfterCloseReplicationStreamDBSlotStatusNotActive


2016-09-06 4:10 GMT+03:00 Craig Ringer :

> On 25 August 2016 at 13:04, Craig Ringer  wrote:
>
> > By the way, I now think that the second part of your patch, to allow
> > interruption during ReorderBufferCommit processing, is also very
> > desirable.
>
> I've updated your patch, rebasing it on top of 10.0 master and
> splitting it into two pieces. I thought you were planning on following
> up with an update to the second part, but maybe that was unclear from
> our prior discussion, so I'm doing this to help get this moving again.
>
> The first patch is what I posted earlier - the part of your patch that
> allows the walsender to exit COPY BOTH mode and return to command mode
> in response to a client-initiated CopyDone when it's in the
> WalSndLoop.  For logical decoding this means that it will notice a
> client CopyDone when it's between transactions or while it's ingesting
> transaction changes. It won't react to CopyDone while it's in
> ReorderBufferCommit and sending a transaction to an output plugin
> though.
>
> The second piece is the other half of your original patch. Currently
> unmodified from what you wrote. It adds a hook in ReorderBufferCommit
> that calls back into the walsender to check for new client input and
> interrupt processing. I haven't yet investigated this in detail.
>
>
> Review comments on the 2nd patch, i.e. the 2nd half of your original patch:
>
> * Other places in logical decoding use the CB suffix for callback
> types. This should do the same.
>
> * I'm not too keen on the name is_active  for the callback. We
> discussed the name continue_decoding_cb in our prior conversation.
>
> * Instead of having LogicalContextAlwaysActive, would it be better to
> test if the callback pointer is null and skip it if so?
>
> * Lots of typos in LogicalContextAlwaysActive comments, and wording is
> unclear
>
> * comment added to arguments of pg_logical_slot_get_changes_guts is
> not in PostgreSQL style - it goes way wide on the line. Probably
> better to put the comment above the call and mention which argument it
> refers to.
>
> * A comment somewhere is needed - probably on the IsStreamingActive()
> callback - to point readers at the fact that WalSndWriteData() calls
> ProcessRepliesIfAny() and that's how IsStreamingActive can get set. I
> had to look at the email discussion history to remind myself how this
> worked when I was re-reading the code.
>
> * I'm not keen on
>
> typedef ReorderBufferIsActive LogicalDecondingContextIsActive;
>
>   I think instead a single callback name that encompasses both should
> be used. ContinueDecodingCB ?
>
> * There are no tests. I don't see any really simple way to test this,
> though.
>
> I suggest that you apply these updated/split patches to a fresh copy
> of master then see if you can update it based on this feedback.
> Something like:
>
> git checkout master
> git pull
> git checkout -b stop-decoding-v10
> git am /path/to/0001-Respect-client-initiated-CopyDone-in-walsender.patch
> git am /path/to/0002-Second-part-of-Vladimir-Gordiychuk-s-patch.patch
>
> then modify the code per the above, and "git commit 

Re: [HACKERS] Supporting SJIS as a database encoding

2016-09-07 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kyotaro
> Thanks, by the way, there's another issue related to SJIS conversion.  MS932
> has several characters that have multiple code points. Converting texts
> in this encoding to and from Unicode causes a round-trip problem. For
> example,
> 
> 8754(ROMAN NUMERICAL I in NEC specials)
>   => U+2160(ROMAN NUMERICAL I)
> => FA4A (ROMAN NUMERICA I in IBM extension)
> 
> My counting said that 398 characters are affected by this kind of replacement.
> In addition to that, "GAIJI" (Private Use Area) is not allowed. Does this meet
> your purpose?

Supporting GAIJI is not a requirement as far as I know.  Thank you for sharing 
information.

# I realize my lack of knowledge about character sets...

Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-09-07 Thread Craig Ringer
On 7 September 2016 at 14:44, Pavel Stehule  wrote:
>>
>> Suggested comment:
>>
>> /*
>>  * This is the parsenode for a column definition in a table-expression
>> like XMLTABLE.
>>  *
>>  * We can't re-use ColumnDef here; the utility command column
>> definition has all the
>>  * wrong attributes for use in table-expressions and just doesn't make
>> sense here.
>>  */
>> typedef struct TableExprColumn
>> {
>> ...
>> };
>>
>> ?
>>
>> Why "RawCol" ? What does it become when it's not "raw" anymore? Is
>> that a reference to ColumnDef's raw_default and cooked_default for
>> untransformed vs transformed parse-trees?
>
>
> My previous reply was wrong - it is used by parser only and holds TypeName
> field. The analogy with ColumnDef raw_default is perfect.

Cool, lets just comment that then.

I'll wait on an updated patch per discussion to date. Hopefully
somebody else with more of a clue than me can offer better review of
the executor/view/caching part you specifically called out as complex.
Otherwise maybe it'll be clearer in a revised version.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting SJIS as a database encoding

2016-09-07 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 6 Sep 2016 03:43:46 +, "Tsunakawa, Takayuki" 
 wrote in 
<0A3221C70F24FB45833433255569204D1F5E66CE@G01JPEXMBYT05>
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kyotaro
> > HORIGUCHI
> Implementing radix tree code, then redefining the format of mapping table
> > to suppot radix tree, then modifying mapping generator script are needed.
> > 
> > If no one oppse to this, I'll do that.
> 
> +100
> Great analysis and your guts.  I very much appreciate your trial!

Thanks, by the way, there's another issue related to SJIS
conversion.  MS932 has several characters that have multiple code
points. Converting texts in this encoding to and from Unicode
causes a round-trip problem. For example,

8754(ROMAN NUMERICAL I in NEC specials)
  => U+2160(ROMAN NUMERICAL I)
=> FA4A (ROMAN NUMERICA I in IBM extension)

My counting said that 398 characters are affected by this kind of
replacement. In addition to that, "GAIJI" (Private Use Area) is
not allowed. Does this meet your purpose?
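For reference, a stand-alone sketch of how such non-round-tripping code
points can be counted from the conversion arrays (the struct shapes mirror
pg_local_to_utf / pg_utf_to_local from pg_wchar.h; the arrays and the lookup
helper here are hypothetical placeholders, not code from any patch):

    #include <stdint.h>

    typedef struct { uint32_t code; uint32_t utf; } local_to_utf;  /* SJIS -> UTF-8 */
    typedef struct { uint32_t utf; uint32_t code; } utf_to_local;  /* UTF-8 -> SJIS */

    /* Linear scan for clarity; the real maps are sorted and use bsearch. */
    static uint32_t
    utf_to_sjis(const utf_to_local *map, int n, uint32_t utf)
    {
        for (int i = 0; i < n; i++)
            if (map[i].utf == utf)
                return map[i].code;
        return 0;
    }

    static int
    count_non_roundtrip(const local_to_utf *lu, int n_lu,
                        const utf_to_local *ul, int n_ul)
    {
        int lost = 0;

        for (int i = 0; i < n_lu; i++)
        {
            /* e.g. 0x8754 -> U+2160 -> 0xFA4A: the original code is lost */
            if (utf_to_sjis(ul, n_ul, lu[i].utf) != lu[i].code)
                lost++;
        }
        return lost;
    }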

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-07 Thread Simon Riggs
On 7 September 2016 at 04:13, Masahiko Sawada  wrote:

> Since current HEAD could scan visibility map twice, the execution time
> of Patched is approximately half of HEAD.

Sounds good.

To ensure we are doing exactly same amount of work as before, did you
see the output of VACUUM VEROBOSE?

Can we produce a test that verifies the result patched/unpatched?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Andrew Borodin
Oh, sorry, I made one more attempt and now I see your algorithm differently.

So you propose to use oE and oV as markers of borders for what I call a Realm.
But there may be very few significant bits in one of these ranges.
The pg_sphere and PostGIS extensions tried to use 1 as a marker, with
a similar formula, and that did not generate a good tree.
Here's how they did it:
https://github.com/postgis/postgis/commit/9569e898078eeac8928891e8019ede2cbf27d06f

I'll try to compose a patch for your formula later. I think we should
just try and see; it is easier than reasoning about digital floating
points.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Heikki Linnakangas

On 09/07/2016 09:42 AM, Andrew Borodin wrote:

2. Your algorithm, besides losing some bits of precision (which is
absolutely acceptable - we have like 31 of them and that’s a lot), relies on
a false assumption. We compare tuples on the page not by the MBR of the
inserted tuple, but by the MBR of the tuple on the page, which is different
for every penalty calculation.


The penalty function has two arguments: the new tuple, and the existing 
tuple on the page. When we're inserting, it gets called for every 
existing tuple on the page, with the same new tuple. And then we compare 
the returned penalties. So for a single insertion, the "new tuple" 
argument stays the same for every call.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Illegal SJIS mapping

2016-09-07 Thread Kyotaro HORIGUCHI
Hi,

I found a useless entry in utf8_to_sjis.map

>  {0xc19c, 0x815f},

which is an over-long (and therefore illegal) UTF-8 sequence that PostgreSQL
deliberately refuses. So it should be removed, and the attached
patch does that. 0x815f (SJIS) is also mapped from 0xefbcbc (U+FF3C
FULLWIDTH REVERSE SOLIDUS), and that is the right mapping.
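(0xc1 0x9c would decode to U+005C if over-long forms were allowed; lead
bytes 0xc0 and 0xc1 can never start a valid UTF-8 sequence.)  A minimal
illustration of the rule that rejects it - not PostgreSQL's actual
validator, just the check in isolation:

    #include <stdbool.h>

    /* A two-byte UTF-8 sequence is valid only for U+0080..U+07FF, so the
     * lead byte must be 0xC2..0xDF; 0xC0/0xC1 are always over-long. */
    static bool
    valid_utf8_2byte(unsigned char b1, unsigned char b2)
    {
        if (b1 < 0xC2 || b1 > 0xDF)
            return false;
        if ((b2 & 0xC0) != 0x80)    /* continuation byte must be 10xxxxxx */
            return false;
        return true;
    }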


By the way, the file comment at the beginning of UCS_to_SJIS.pl
is the following.

# Generate UTF-8 <--> SJIS code conversion tables from
# map files provided by Unicode organization.
# Unfortunately it is prohibited by the organization
# to distribute the map files. So if you try to use this script,
# you have to obtain SHIFTJIS.TXT from
# the organization's ftp site.

The file was found at the following place thanks to google.

ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/

As the URL shows, and as written in the file
Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete, and
the *live* definition *may* be found in the Unicode Character
Database. But I haven't found SJIS-related information there.

If I'm not missing anything, the only available authority would
be JIS X 0208/0213, but what should be implemented seems to be
a maybe-modified MS932, for which I don't know the authority.

Anyway, I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and I got
quite different mapping files from the current ones.

So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
maintained. If no authoritative information is available, the
generating script is no longer usable. If any other authority is
chosen, the script will have to be modified according to whatever the new
source format is.

Any suggestions? Or opinions?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/mb/Unicode/utf8_to_sjis.map b/src/backend/utils/mb/Unicode/utf8_to_sjis.map
index bcb76c9..47f5fdf 100644
--- a/src/backend/utils/mb/Unicode/utf8_to_sjis.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_sjis.map
@@ -1,5 +1,4 @@
-static const pg_utf_to_local ULmapSJIS[ 7398 ] = {
-  {0xc19c, 0x815f},
+static const pg_utf_to_local ULmapSJIS[ 7397 ] = {
   {0xc2a2, 0x8191},
   {0xc2a3, 0x8192},
   {0xc2a5, 0x5c},

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-09-07 Thread Pavel Stehule
>
>
> Suggested comment:
>
> /*
>  * This is the parsenode for a column definition in a table-expression
> like XMLTABLE.
>  *
>  * We can't re-use ColumnDef here; the utility command column
> definition has all the
>  * wrong attributes for use in table-expressions and just doesn't make
> sense here.
>  */
> typedef struct TableExprColumn
> {
> ...
> };
>
> ?
>
> Why "RawCol" ? What does it become when it's not "raw" anymore? Is
> that a reference to ColumnDef's raw_default and cooked_default for
> untransformed vs transformed parse-trees?
>

My previous reply was wrong - it is used by parser only and holds TypeName
field. The analogy with ColumnDef raw_default is perfect.

Regards

Pavel


>
>
>
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] GiST penalty functions [PoC]

2016-09-07 Thread Andrew Borodin
Hi Heikki!
Thank you for reviewing the code, it's always inspiring when a work is
noted (:

>Looking at the code, by "margin", you mean the sum of all "edges", i.e. of
each dimension, of the cube.
Indeed. As far as I remember, this is the terminology of the old R*-tree paper.
Now they use the word "perimeter", which seems closer for
English-speaking folks, so in future patches I'll switch to "perimeter".

>I guess the point of that is to differentiate between cubes that have the
same volume, but a different shape.

Somewhat like this. For an R[something]-tree, volume is the ultimate measure
for comparing figures close to a multidimensional cube. The tree must tend to
form MBRs with equal edges to ensure search optimizations spread across all
dimensions equally.
But in practice volume degrades quickly. When you index points, you quickly
get a page with a "plane": lots of points with one dimension at a fixed
value. And then the R-tree does not work anymore; inserts are somewhat
randomized by GiST, but in practice the tree is a lot worse than randomized,
due to a randomization skew towards early insert candidates.
Perimeter is not as good as volume in the general case, but it does not degrade
until you have truly equal object MBRs.

So this was volume vs. perimeter. In MySQL, for example, a #define chooses
one of these strategies, which does not seem like good design to me. In this
patch I propose using the perimeter when the volume strategy has degraded,
effectively when one dimension's edge is 0 or some dimensions' edges are
close to epsilon.
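As a toy illustration of that degenerate case (the type and function names
below are made up for this example, they are not from the patch or from
contrib/cube):

/* Toy illustration only: once one extent collapses to zero, volume can no
 * longer rank candidate MBRs, but the edge sum (perimeter measure) still
 * can. */
typedef struct Box3
{
    double      dx;
    double      dy;
    double      dz;
} Box3;

static double
box_volume(const Box3 *b)
{
    return b->dx * b->dy * b->dz;
}

static double
box_edge_sum(const Box3 *b)
{
    return b->dx + b->dy + b->dz;
}

/* For points lying in a plane (dz == 0), box_volume() is 0 for every
 * candidate MBR and cannot break ties; box_edge_sum() still distinguishes
 * wide candidates from narrow ones. */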

But there is one more tie to break. When you choose a subtree for insertion,
some candidates have zero volume extension and zero edge extension: they do
not need to be extended at all to accommodate the new tuple. In that case we
choose the smallest one, by volume. If all of them have zero volume, we
choose by perimeter.

All in all, we have 4 different cases here, ordered by priority.

>So, let's try to come up with a scheme that doesn't require IEEE 754
floats.
1. Converting my algorithm to float operations. I actually hadn't thought
about it before, but it seems impossible. Bit shifting makes no sense as a
regular float operation; under no circumstances can exponent bits shift into
the significand field, or vice versa. If we do not move the last bit of the
exponent, all significand bits are garbage, which leaves us with only 6 bits
for the actual value - unacceptable.
2. Your algorithm, apart from losing some bits of precision (which is
absolutely acceptable - we have like 31 of them and that's a lot), relies on
a false assumption. We compare tuples on a page not by the MBR of the
inserted tuple, but by the MBR of the tuple already on the page, which is
different for every penalty calculation.
I'll restate it in your variable names; they are good for reasoning about
the proposed algorithm.

oE: Original tuple's edge sum (perimeter measure). This tuple is on a page.
nE: New tuple's edge sum. This tuple is to be inserted; we calculate the
penalty to choose the subtree with minimum penalty.
dE: Edge increase. Edge sum of union(original|new) minus oE.

oV: Original tuple's volume
nV: New tuple's volume
dV: Volume increase

The ideal algorithm: sort tuples by dV, then by dE, then by oV, then by oE
(all ascending) and pick the top one.
Unfortunately, I do not see a way to do that in the 31 bits of a float (in
GiST we cannot use the sign bit; <= 0 is used as a magic condition).
So what is implemented is the following:
Say we have a number ALOT which is guaranteed to be bigger than dV, dE, oV
and oE.
if (dV > 0)
    penalty = ALOT * 3 + dV;        /* nonzero dV goes last */
else if (dE > 0)
    penalty = ALOT * 2 + dE;        /* then goes dE */
else if (oV > 0)
    penalty = ALOT + oV;
else
    penalty = oE;
/* oE is zero for subtrees containing only equal points.  In this case we
 * pass the split optimization possibility on to the next GiST attribute. */

The volume and perimeter of the new entry do not affect the choice of
subtree directly, whether they are zero or not.
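For context, this is roughly how such a scheme sits inside a GiST penalty
support function. It is a sketch only: the function name, the ALOT value and
the way dV/dE/oV/oE are obtained are assumptions, and the real patch packs
the cases into the float's bit representation, since in a plain 32-bit float
the additive form below would drown the low-order bits of dV:

Datum
my_penalty(PG_FUNCTION_ARGS)
{
    GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
    GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
    float      *result = (float *) PG_GETARG_POINTER(2);
    double      dV = 0, dE = 0, oV = 0, oE = 0; /* from the keys; omitted */
    const double ALOT = 1e30;   /* assumed to exceed any dV, dE, oV, oE */

    (void) origentry;           /* key extraction omitted in this sketch */
    (void) newentry;

    if (dV > 0)
        *result = (float) (ALOT * 3 + dV);  /* volume must grow: worst */
    else if (dE > 0)
        *result = (float) (ALOT * 2 + dE);  /* volume kept, perimeter grows */
    else if (oV > 0)
        *result = (float) (ALOT + oV);      /* fits: prefer smaller volume */
    else
        *result = (float) oE;   /* zero volume: prefer smaller perimeter */

    PG_RETURN_POINTER(result);
}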

>Hmm. That's a pretty large increase in construction time. Have you done
any profiling on where the time goes?
Since I changed only one function, I didn't consider profiling. I asked
Michael to check the disassembly of the function call tree and implemented
his recommendations; patch v2 has all of the build performance back (: And
now it computes aggregates 40% faster.
>continue work within cube
There is one more option: implement the index as an extension using the
CREATE ACCESS METHOD feature of 9.6. It is a real option: we could skip all
the GiST API hacks and implement a real RR*-tree, plus some other
improvements.
But, on the other hand, improving the GiST API could be beneficial to other
extensions.

Again, thanks for your attention. Hopefully we can come up with a solution
that avoids dirty hacks.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM

2016-09-07 Thread Amit Langote
On 2016/09/07 12:29, Corey Huinker wrote:
> On Tue, Sep 6, 2016 at 9:46 PM, Amit Langote wrote:
>> OK.
> Well...maybe not, depending on what Craig and others can do to educate me
> about the TAP tests.

Sure.

>>> Changing table-level options requires superuser privileges, for security
>>> reasons: only a superuser should be able to determine which file is read,
>>> or which program is run. In principle non-superusers could be allowed to
>>> change the other options, but that's not supported at present.
>>
>> Hmm, just a little modification would make it better for me:
>>
>> ... for security reasons.  For example, only superuser should be able to
>> determine which file read or which program is run.
>>
>> Although that could be just me.
> 
> In this case it is unclear whether "determine" means that a non-superuser
> can set the program to be run, or merely that they can find out which
> program the fdw is set to run.

Hmm, it is indeed unclear.

How about:

... for security reasons.  For example, only superuser should be able to
*change* which file is read or which program is run.

I just realized this is not just about a C comment.  There is a line in the
documentation as well that needs an update.  Any conclusion reached here
should be applied there too.

> We may want some more opinions on what is the most clear.

Certainly.

>> But as you said above, this could be deferred to the committer.
>>
> 
> Yeah, and that made for zero storage savings: a char pointer which is never
> assigned a string takes up as much space as a 4-byte-aligned boolean. And
> the result is that "file" really means program, which I found slightly
> awkward.

My only reason for pushing that approach was consistency with other code
implementing a similar feature, although that may not be so important.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-07 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 11:09 PM, Heikki Linnakangas  wrote:
>> The big picture here is that you can't only USEMEM() for tapes as the
>> need arises for new tapes as new runs are created. You'll just run a
>> massive availMem deficit, that you have no way of paying back, because
>> you can't "liquidate assets to pay off your creditors" (e.g., release
>> a bit of the memtuples memory). The fact is that memtuples growth
>> doesn't work that way. The memtuples array never shrinks.
>
>
> Hmm. But memtuples is empty, just after we have built the initial runs. Why
> couldn't we shrink, i.e. free and reallocate, it?

After we've built the initial runs, we do in fact give a FREEMEM()
refund to those tapes that were not used within beginmerge(), as I
mentioned just now (with a high workMem, this is often the great
majority of many thousands of logical tapes -- that's how you get to
wasting 8% of 5GB of maintenance_work_mem).

What's at issue with this 500 tapes cap patch is what happens after
tuples are first dumped (after we decide that this is going to be an
external sort -- where we call tuplesort_merge_order() to get the
number of logical tapes in the tapeset), but before the final merge
happens, where we're already doing the right thing for merging by
giving that refund. I want to stop logical allocation (USEMEM()) of an
enormous number of tapes, to make run generation itself able to use
more memory.

It's surprisingly difficult to do something cleverer than just impose a cap.
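A rough sketch of the cap being discussed (the constant, the helper name and
its placement inside tuplesort.c are assumptions for illustration, not the
committed form):

/*
 * Sketch only.  The point is to stop charging workMem (via USEMEM()) for
 * tapes that run generation will never physically allocate, so that run
 * generation itself can use the memory instead.
 */
#define MAX_USEFUL_TAPES 500        /* assumed cap value */

static int
capped_merge_order(Tuplesortstate *state)
{
    int         maxTapes = tuplesort_merge_order(state->allowedMem) + 1;

    return Min(maxTapes, MAX_USEFUL_TAPES);
}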

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-07 Thread Heikki Linnakangas

On 09/07/2016 09:01 AM, Peter Geoghegan wrote:

> On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan  wrote:
>
>> There isn't much point in that, because those buffers are never
>> physically allocated in the first place when there are thousands. They
>> are, however, entered into the tuplesort.c accounting as if they were,
>> denying tuplesort.c the full benefit of available workMem. It doesn't
>> matter if you USEMEM() or FREEMEM() after we first spill to disk, but
>> before we begin the merge. (We already refund the
>> unused-but-logically-allocated memory from unused tapes at the beginning
>> of the merge (within beginmerge()), so we can't do any better than we
>> already are from that point on -- that makes the batch memtuples
>> growth thing slightly more effective.)
>
> The big picture here is that you can't only USEMEM() for tapes as the
> need arises for new tapes as new runs are created. You'll just run a
> massive availMem deficit, that you have no way of paying back, because
> you can't "liquidate assets to pay off your creditors" (e.g., release
> a bit of the memtuples memory). The fact is that memtuples growth
> doesn't work that way. The memtuples array never shrinks.


Hmm. But memtuples is empty, just after we have built the initial runs. 
Why couldn't we shrink, i.e. free and reallocate, it?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-07 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 10:57 PM, Peter Geoghegan  wrote:
> There isn't much point in that, because those buffers are never
> physically allocated in the first place when there are thousands. They
> are, however, entered into the tuplesort.c accounting as if they were,
> denying tuplesort.c the full benefit of available workMem. It doesn't
> matter if you USEMEM() or FREEMEM() after we first spill to disk, but
> before we begin the merge. (We already refund the
> unused-but-logically-allocated memory from unused tapes at the beginning of
> the merge (within beginmerge()), so we can't do any better than we
> already are from that point on -- that makes the batch memtuples
> growth thing slightly more effective.)

The big picture here is that you can't only USEMEM() for tapes as the
need arises for new tapes as new runs are created. You'll just run a
massive availMem deficit, that you have no way of paying back, because
you can't "liquidate assets to pay off your creditors" (e.g., release
a bit of the memtuples memory). The fact is that memtuples growth
doesn't work that way. The memtuples array never shrinks.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers