Re: [HACKERS] Tracking wait event for latches

2016-09-24 Thread Michael Paquier
On Thu, Sep 22, 2016 at 10:49 PM, Robert Haas  wrote:
> Finally, extensions got their own category in this taxonomy, though I
> wonder if it would be better to instead have
> Activity/ExtensionActivity, Client/ExtensionClient,
> Timeout/ExtensionTimeout, and IPC/ExtensionIPC instead of making it a
> separate toplevel category.

I don't think that it is necessary to go up to that level. If you look
at the latest patch, WaitLatch & friends have been extended with two
arguments: classId and eventId, so extensions could just use
WAIT_ACTIVITY as classId and WE_EXTENSION as eventId to do the
equivalent of ExtensionActivity.
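For illustration, a minimal sketch of what an extension might do with the
patched interface (the classId/eventId arguments and the exact constant
names are from the patch as described here, not a committed API):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"

/* Wait up to 1s on our latch, reported as an Activity/Extension wait. */
static void
extension_wait_a_bit(void)
{
	(void) WaitLatch(MyLatch,
			 WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
			 1000L,			/* timeout in ms */
			 WAIT_ACTIVITY,		/* classId, per the patch */
			 WE_EXTENSION);		/* eventId, per the patch */
	ResetLatch(MyLatch);
}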
-- 
Michael




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-24 Thread Amit Kapila
On Sun, Sep 25, 2016 at 10:30 AM, Amit Kapila  wrote:
> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila  wrote:
>>
>> I think here I am slightly wrong.  For full page writes, it does use
>> RBM_ZERO_AND_LOCK mode to read the page, and for that mode we do not
>> do the page verification check but rather blindly set the page to
>> zero and then overwrite it with the full page image.  So after my fix,
>> you will not see the checksum failure error.  I have a fix ready,
>> but am still doing some more verification.  If everything passes, I
>> will share the patch in a day or so.
>>
>
> The attached patch fixes the problem; now we do perform full page writes
> for bitmap pages.  Apart from that, I have rebased the patch on the
> latest concurrent index patch [1].  I have updated the README as well
> to reflect the WAL-logging information for the different operations.
>
> With the attached patch, all the review comments and issues found so
> far are addressed.
>

I forgot to mention that Ashutosh has tested this patch for a day
using Jeff's tool and he didn't find any problem.  Also, he has found
a way to reproduce the problem easily.  Ashutosh, can you share your
changes to the script with which you reproduced the problem?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-24 Thread Amit Kapila
On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark  wrote:
> On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:
>> But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".
>
> Well there's plenty of halfway solutions for that. We could move hash
> indexes to contrib or even have them in core as experimental_hash or
> unlogged_hash until the day they achieve their potential.
>
> We definitely shouldn't discourage people from working on hash indexes
>

Okay, but to me it appears that naming it experimental_hash or moving
it to contrib could discourage people, or at the very least make them
less motivated.  Thinking along those lines a year or so back might
have been a wise direction, but now that a lot of work has already been
done for hash indexes (patches to make them WAL-enabled, more concurrent
and performant, and a pageinspect module are available) and still more
is in progress, that sounds like a step backward rather than forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] ICU integration

2016-09-24 Thread Thomas Munro
On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan  wrote:
> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
>  wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined).  This needs to
>> use ucol_getSortKey instead when appropriate.  It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> are unspecified if it ran out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.

Nice!  The other advantage of ucol_nextSortKeyPart is that you don't have
to convert the whole string to UChar (UTF16) first, as I think you would
need to with ucol_getSortKey, because the UCharIterator mechanism can read
directly from a UTF8 string.  I see in the documentation that
ucol_nextSortKeyPart and ucol_getSortKey don't have compatible output, and
this caveat may be related to whether sort key compression is used.  I
don't understand what sort of compression is involved but out of curiosity
I asked ICU to spit out some sort keys from ucol_nextSortKeyPart so I could
see their size.  As you say, we can ask it to stop at 4 or 8 bytes which is
very convenient for our purposes, but here I asked for more to get the full
output so I could see where the primary weight part ends.  The primary
weight took one byte for the Latin letters I tried and two for the Japanese
characters I tried (except 一 which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a
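For the record, roughly the calls involved in producing the keys above; a
minimal standalone sketch (assuming ICU's ucol.h/uiter.h, with error
handling mostly elided):

#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/uiter.h>

int main(void)
{
	UErrorCode status = U_ZERO_ERROR;
	UCollator *coll = ucol_open("en_US", &status);
	UCharIterator iter;
	uint32_t state[2] = {0, 0};	/* iteration state, kept across calls */
	uint8_t key[8];			/* ask for just the first 8 bytes */
	int32_t n, i;

	/* iterate directly over UTF-8, no UChar conversion needed */
	uiter_setUTF8(&iter, "abcd", 4);
	n = ucol_nextSortKeyPart(coll, &iter, state, key,
				 (int32_t) sizeof(key), &status);
	for (i = 0; i < n; i++)
		printf("%02x ", key[i]);
	printf("\n");

	ucol_close(coll);
	return U_SUCCESS(status) ? 0 : 1;
}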

I wonder how it manages to deal with fr_CA's reversed secondary weighting
rule which requires you to consider diacritics in reverse order --
apparently abandoned in France but still used in Canada -- using a fixed
size space for state between calls.

-- 
Thomas Munro
http://www.enterprisedb.com


[HACKERS] Development build with uuid-ossp support - macOS

2016-09-24 Thread Enrique M
I am trying to do a macOS build of postgresql (the 9.6 stable branch from
GitHub) with the uuid-ossp contrib by typing "make world", but it fails
due to an openjade error (I installed openjade via Homebrew using this
setup: https://github.com/petere/homebrew-sgml).

Is there a way to build postgresql and install with uuid-ossp without
having to build the documentation? I don't really need the documentation
for my test.

Thank you,

Enrique


Re: [HACKERS] Rebranding OS X as macOS

2016-09-24 Thread Michael Paquier
On Sun, Sep 25, 2016 at 1:47 AM, Tom Lane  wrote:
> Apple has decided to rename Mac OS X to "macOS", and apparently is now
> retroactively referring to old releases that way too.  I propose that
> we should do likewise, ie run around and clean up our various references
> to "Mac OS X", "OS X", or "OSX" to uniformly say "macOS".  I think it's
> considerably clearer to non-Apple people exactly what that is, plus
> there's not so much temptation towards variant spellings.
>
> We also have various uses of the ancient code name "Darwin".  I think
> we should change that to "macOS" in user-facing docs, but I'm less
> inclined to do so in the code.  In particular, renaming the template
> file or trying to change the PORTNAME value seems much too messy,
> so I think we should stick to "darwin" as the port name.

+1 to all that. I am seeing the same trend regarding the naming.
-- 
Michael




Re: [HACKERS] Better tracking of free space during SP-GiST index build

2016-09-24 Thread Tomas Vondra

On 09/22/2016 07:37 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> ... I've tried increasing the cache size to 768
>> entries, with the vast majority of them (~600) allocated to leaf pages.
>> Sadly, this seems to only increase the CREATE INDEX duration a bit,
>> without making the index significantly smaller (still ~120MB).
>
> Yeah, that's in line with my results: not much further gain from a
> larger cache.  Though if you were testing with the same IRRExplorer
> data, it's not surprising that our results would match.  Would be
> good to try some other cases...

Agreed, but I don't have any other data sets at hand. One possibility
would be to generate something randomly (e.g. it's not particularly
difficult to generate random IP addresses), but I'd much rather use some
real-world data sets.

>> One thing I'd change is making the SpGistLUPCache dynamic, i.e.
>> storing the size and lastUsedPagesMap on the meta page. That
>> should allow us to resize the cache and tweak lastUsedPagesMap in
>> the future.
>
> Yeah, probably a good idea.  I had thought of bumping
> SPGIST_MAGIC_NUMBER again if we want to revisit the cache size; but
> keeping it as a separate field won't add noticeable cost, and it
> might save some trouble.

I see you plan to track only the cache size, while I proposed to also
track the map, i.e. the number of pages per category. I think that'd be
useful in case we come up with better values (e.g. more entries for leaf
pages), or even some adaptive scheme.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Hash Indexes

2016-09-24 Thread Tom Lane
Greg Stark  writes:
> On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:
>> But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".

> Well there's plenty of halfway solutions for that. We could move hash
> indexes to contrib or even have them in core as experimental_hash or
> unlogged_hash until the day they achieve their potential.

> We definitely shouldn't discourage people from working on hash indexes
> but we probably shouldn't have released ten years worth of a feature
> marked "please don't use this" that's guaranteed to corrupt your
> database and cause weird problems if you use it in any of a number of
> supported situations (including non-replicated system recovery that
> has been a bedrock feature of Postgres for over a decade).

Obviously that has not been a good situation, but we lack a time
machine to retroactively make it better, so I don't see much point
in fretting over what should have been done in the past.

> Arguably adding a hashed btree opclass and relegating the existing
> code to an experimental state would actually encourage development
> since a) Users would actually be likely to use the hashed btree
> opclass so any work on a real hash opclass would have a real userbase
> ready and waiting for delivery, b) delivering a real hash opclass
> wouldn't involve convincing users to unlearn a million instructions
> warning not to use this feature and c) The fear of breaking existing
> users' use cases and databases would be less, and pg_upgrade would be an
> ignorable problem at least until the day comes for the big cutover of
> the default to the new opclass.

I'm not following your point here.  There is no hash-over-btree AM and
nobody (including Andres) has volunteered to create one.  Meanwhile,
we have a patch in hand to WAL-enable the hash AM.  Why would we do
anything other than apply that patch and stop saying hash is deprecated?

regards, tom lane




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-24 Thread Tomas Vondra

On 09/24/2016 06:06 AM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
>  wrote:
>> ...
>>
>> So I'm using 16GB shared buffers (so with scale 300 everything fits into
>> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
>> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
>> triggered explicitly before each run.
>
> Thanks for the clarification.  Do you think we should try some different
> settings for the *_flush_after parameters, as those can help in reducing
> spikes in writes?

I don't see why those settings would matter. The tests are on unlogged
tables, so there's almost no WAL traffic, and checkpoints (triggered
explicitly before each run) look like this:

checkpoint complete: wrote 17 buffers (0.0%); 0 transaction log file(s)
added, 0 removed, 13 recycled; write=0.062 s, sync=0.006 s, total=0.092
s; sync files=10, longest=0.004 s, average=0.000 s; distance=309223 kB,
estimate=363742 kB

So I don't see how tuning the flushing would change anything, as we're
not doing any writes.

Moreover, the machine has a bunch of SSD drives (16 or 24, I don't
remember at the moment), behind a RAID controller with 2GB of write
cache on it.

>>> Also, I think instead of 5 mins, read-write runs should be run for 15
>>> mins to get consistent data.
>>
>> Where does the inconsistency come from?
>
> That's what I am also curious to know.
>
>> Lack of warmup?
>
> Can't say, but at least we should try to rule out the possibilities.
> I think one way to rule out is to do slightly longer runs for
> Dilip's test cases, and for pgbench we might need to drop and
> re-create the database after each reading.

My point is that it's unlikely to be due to insufficient warmup, because
the inconsistencies appear randomly - generally you get a bunch of slow
runs, one significantly faster one, then slow ones again.

I believe the runs to be sufficiently long. I don't see why recreating
the database would be useful - the whole point is to get the database
and shared buffers into a stable state, and then do measurements on it.

I don't think bloat is a major factor here - I'm collecting some
additional statistics during this run, including pg_database_size, and I
can see the size oscillates between 4.8GB and 5.4GB. That's pretty
negligible, I believe.

>> I'll let the current set of benchmarks complete - it's running on 4.5.5
>> now, I'll do tests on 3.2.80 too.
>>
>> Then we can re-evaluate if longer runs are needed.
>>
>> Considering how uniform the results from the 10 runs are (at least
>> on 4.5.5), I claim this is not an issue.
>
> It is quite possible that it is some kernel regression which might
> be fixed in a later version. We are doing most tests on cthulhu,
> which has a 3.10 kernel, and we generally get consistent results
> there. I am not sure if a later kernel version, say 4.5.5, is a net
> win, because there is a considerable difference (dip) in performance
> in that version, though it produces quite stable results.

Well, the thing is - the 4.5.5 behavior is much nicer in general. I'll
always prefer lower but more consistent performance (in most cases). In
any case, we're stuck with whatever kernel version people are using,
and they're likely to use the newer ones.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Hash Indexes

2016-09-24 Thread Greg Stark
On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:
> But to kick the hash AM as such to the curb is to say
> "sorry, there will never be O(1) index lookups in Postgres".

Well there's plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes
but we probably shouldn't have released ten years worth of a feature
marked "please don't use this" that's guaranteed to corrupt your
database and cause weird problems if you use it in any of a number of
supported situations (including non-replicated system recovery that
has been a bedrock feature of Postgres for over a decade).

Arguably adding a hashed btree opclass and relegating the existing
code to an experimental state would actually encourage development
since a) Users would actually be likely to use the hashed btree
opclass so any work on a real hash opclass would have a real userbase
ready and waiting for delivery, b) delivering a real hash opclass
wouldn't involve convincing users to unlearn a million instructions
warning not to use this feature and c) The fear of breaking existing
users' use cases and databases would be less, and pg_upgrade would be an
ignorable problem at least until the day comes for the big cutover of
the default to the new opclass.

-- 
greg




[HACKERS] Rebranding OS X as macOS

2016-09-24 Thread Tom Lane
Apple has decided to rename Mac OS X to "macOS", and apparently is now
retroactively referring to old releases that way too.  I propose that
we should do likewise, ie run around and clean up our various references
to "Mac OS X", "OS X", or "OSX" to uniformly say "macOS".  I think it's
considerably clearer to non-Apple people exactly what that is, plus
there's not so much temptation towards variant spellings.

We also have various uses of the ancient code name "Darwin".  I think
we should change that to "macOS" in user-facing docs, but I'm less
inclined to do so in the code.  In particular, renaming the template
file or trying to change the PORTNAME value seems much too messy,
so I think we should stick to "darwin" as the port name.

Barring objections, I'll make this happen (in HEAD only) sometime soon.

regards, tom lane




Re: [HACKERS] store narrow values in hash indexes?

2016-09-24 Thread Bruce Momjian
On Sat, Sep 24, 2016 at 10:33:01AM +0530, Amit Kapila wrote:
> On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas  wrote:
> > Currently, hash indexes always store the hash code in the index, but
> > not the actual Datum.  It's recently been noted that this can make a
> > hash index smaller than the corresponding btree index would be if the
> > column is wide.  However, if the index is being built on a fixed-width
> > column with a typlen <= sizeof(Datum), we could store the original
> > value in the hash index rather than the hash code without using any
> > more space.  That would complicate the code, but I bet it would be
> > faster: we wouldn't need to set xs_recheck, we could rule out hash
> > collisions without visiting the heap, and we could support index-only
> > scans in such cases.
> >
> 
> What exactly do you mean by Datum?  Is it for datatypes that fit into 64
> bits, like integer?  I think if we are able to support index-only scans
> for hash indexes for some data types, that will be a huge plus.
> Surely, there is some benefit without index-only scans as well, namely
> that we can avoid the recheck, but I am not sure if that alone can give
> us any big performance boost.  As you say, it might lead to some
> complication in the code, but I think it is worth trying.
> 
> Won't it add some requirements for pg_upgrade as well?

Yes, pg_upgrade will mark the indexes as invalid and supply a script to
reindex them.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] raised checkpoint limit & manual checkpoint

2016-09-24 Thread Fabien COELHO



>> I would suggest that a good complementary feature would be to allow a manual
>> checkpoint to run over a period of time, say something like:
>>
>>   CHECKPOINT OVER '10 hours';
>>
>> That would target completing after this period (whether it succeeds or not
>> is another issue) instead of going as fast as possible, thus avoiding
>> some performance degradation.
>
> Isn't it somewhat overlapping with the existing parameter
> checkpoint_completion_target?

More or less. The difference is that throttled checkpoints are currently
started *automatically* when a certain amount of work has been done or
some time has passed, but you cannot start them manually.

> You can use checkpoint_completion_target to throttle the checkpoints.

Nearly, yes; however it does not give any control over when a throttled
checkpoint is started. I'm arguing that since the configuration allows
delaying checkpoints by up to a day, the ability to control when to
actually start one seems to make sense.

> The option you are suggesting seems more straightforward, but how will
> the user decide how long he wants the checkpoint to take?

In the hypothetical use case I have in mind, the user would happen to know
the load well enough to choose. Say the system supports a load linked to
office hours; you would know that you want it done before the next
morning.


--
Fabien




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-24 Thread Peter Geoghegan
On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas  wrote:
> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas  wrote:
>> It'd be good if you could overlap the final merges in the workers with the
>> merge in the leader. ISTM it would be quite straightforward to replace the
>> final tape of each worker with a shared memory queue, so that the leader
>> could start merging and returning tuples as soon as it gets the first tuple
>> from each worker. Instead of having to wait for all the workers to complete
>> first.
>
> If you do that, make sure to have the leader read multiple tuples at a
> time from each worker whenever possible.  It makes a huge difference
> to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

That requires some kind of mutual exclusion mechanism, like an LWLock.
It's possible that merging everything lazily is actually the faster
approach, given this, and given the likely bottleneck on I/O at this
stage. It's also certainly simpler to not overlap things. This is
something I've read about before [1], with "eager evaluation" sorting
not necessarily coming out ahead IIRC.

[1] 
http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1193&context=csetech
-- 
Peter Geoghegan




Re: [HACKERS] Possibly too stringent Assert() in b-tree code

2016-09-24 Thread Amit Kapila
On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila  wrote:
>>> I think you have a valid point.  It seems we don't need to write WAL
>>> for reuse page (aka don't call _bt_log_reuse_page()), if the page is
>>> new, as the only purpose of that log is to handle conflict based on
>>> transaction id stored in special area which will be anyway zero.
>
>> +1.
>
> This is clearly an oversight in Simon's patch fafa374f2, which introduced
> this code without any consideration for the possibility that the page
> doesn't have a valid special area.  We could prevent the crash by
> doing nothing if PageIsNew, a la
>
> if (_bt_page_recyclable(page))
> {
> /*
>  * If we are generating WAL for Hot Standby then create a
>  * WAL record that will allow us to conflict with queries
>  * running on standby.
>  */
> -   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
> +   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
> +   !PageIsNew(page))
> {
> BTPageOpaque opaque = (BTPageOpaque) 
> PageGetSpecialPointer(page);
>
> _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
> }
>
> /* Okay to use page.  Re-initialize and return it */
>
> but I'm not very clear on whether this is a safe fix, mainly because
> I don't understand what the reuse WAL record really accomplishes.
> Maybe we need to instead generate a reuse record with some special
> transaction ID indicating worst-case assumptions?
>

I think it is basically to ensure that queries on the standby do not
consider the page being recycled as valid, and that if there is any
pending query to which this page is visible, it gets cancelled.  If this
understanding is correct, then I think the fix you are proposing is
correct.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-24 Thread Peter Geoghegan
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas  wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most
complicated part of the patch, which overall is one of the most
ambitious patches I've ever written.

> I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I think that it won't be possible to make a LogicalTapeSet ever use
more than one BufFile without regressing the ability to eagerly reuse
space, which is almost the entire reason for logtape.c existing. The
whole indirect block thing is an idea borrowed from the FS world, of
course, and so logtape.c needs one block-device-like BufFile, with
blocks that can be reclaimed eagerly, but consumed for recycling in
*contiguous* order (which is why they're sorted using qsort() within
ltsGetFreeBlock()). You're going to increase the amount of random I/O
by using more than one BufFile for an entire tapeset, I think.
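To illustrate the recycling discipline: a hedged toy sketch (not
logtape.c's actual bookkeeping, which differs in detail) of keeping freed
blocks sorted so that the lowest-numbered one is always handed out first,
which is what keeps reuse contiguous:

#include <stdlib.h>

static long freeBlocks[1024];	/* freed block numbers, bounds checks elided */
static int	nFreeBlocks = 0;
static long nextBlockNumber = 0;

/* sort descending, so the smallest block number sits at the array's end */
static int
cmp_desc(const void *a, const void *b)
{
	long la = *(const long *) a;
	long lb = *(const long *) b;

	return (la < lb) - (la > lb);
}

static void
release_block(long blkno)
{
	freeBlocks[nFreeBlocks++] = blkno;
}

static long
get_free_block(void)
{
	if (nFreeBlocks > 0)
	{
		qsort(freeBlocks, nFreeBlocks, sizeof(long), cmp_desc);
		return freeBlocks[--nFreeBlocks];	/* lowest, most contiguous */
	}
	return nextBlockNumber++;	/* no free block: extend the file */
}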

This patch you posted
("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just
keeps one BufFile, and only changes the interface to expose the tapes
themselves to tuplesort.c, without actually making tuplesort.c do
anything with that capability. I see what you're getting at, I think,
but I don't see how that accomplishes all that much for parallel
CREATE INDEX. I mean, the special case of having multiple tapesets
from workers (not one "unified" tapeset created from worker temp files
from their tapesets to begin with) now needs special treatment.
Haven't you just moved the complexity around (once your patch is made
to care about parallelism)? Having multiple entire tapesets explicitly
from workers, with their own BufFiles, is not clearly less complicated
than managing ranges from BufFile fd.c files with delineated ranges of
"logical tapeset space". Seems almost equivalent, except that my way
doesn't bother tuplesort.c with any of this.

>> +* As a consequence of only being permitted to write to the leader
>> +* controlled range, parallel sorts that require a final
>> materialized tape
>> +* will use approximately twice the disk space for temp files
>> compared to
>> +* a more or less equivalent serial sort.

> I'm slightly worried about that. Maybe it's OK for a first version, but it'd
> be annoying in a query where a sort is below a merge join, for example, so
> that you can't do the final merge on the fly because mark/restore support is
> needed.

My intuition is that we'll *never* end up using this for merge joins.
I think that I could do better here (why should workers really care at
this point?), but just haven't bothered to.

This parallel sort implementation is something written with CREATE
INDEX and CLUSTER in mind only (maybe one or two other things, too). I
believe that for query execution, partitioning is the future [1]. With
merge joins, partitioning is desirable because it lets you push down
*everything* to workers, not just sorting (e.g., by aligning
partitioning boundaries on each side of each merge join sort in the
worker, and having the worker also "synchronize" each side of the
join, all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for
randomAccess tuplesort.c callers. No real world caller will ever end
up needing to do this. It just seems like a good idea to support
randomAccess when using this new infrastructure, on general principle.
Forcing myself to support that case during initial development
actually resulted in much cleaner, less invasive changes to
tuplesort.c in general.

[1] 
https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] raised checkpoint limit & manual checkpoint

2016-09-24 Thread Amit Kapila
On Sat, Sep 24, 2016 at 12:12 PM, Fabien COELHO  wrote:
>
> Hello,
>
> The checkpoint time limit has just been raised to one day after a discussion
> started by Andres Freund:
>
> https://www.postgresql.org/message-id/20160202001320.GP8743%40awork2.anarazel.de
>
> I would have gone further up, say one week or even one month, but I think
> that this new limit is an improvement over the previous 1 hour maximum.
>
> Now ISTM that there is a possible use case which arises with this new
> setting and is not well addressed by postgresql:
>
> Let us say that an application has periods of high and low usage, say over a
> day, so that I want to avoid a checkpoint from 8 to 20, but I'm okay after
> that. I could raise the size and time limits so that they do not occur
> during these hours and plan to do a manual CHECKPOINT once a day when I see
> fit.
>
> Now the problem I see is that CHECKPOINT means "do a CHECKPOINT right now as
> fast as possible", i.e. there is no throttling whatsoever, which leads to
> heavy IO and may result in a very unresponsive database.
>
> I would suggest that a good complementary feature would be to allow a manual
> checkpoint to run over a period of time, say something like:
>
>   CHECKPOINT OVER '10 hours';
>
> That would target to complete after this period (whether it succeeds or not
> is another issue) instead of going as fast as possible, thus avoiding
> some performance degradation.
>

Isn't it somewhat overlapping with the existing parameter
checkpoint_completion_target?  You can use
checkpoint_completion_target to throttle the checkpoints.  The option
you are suggesting seems more straightforward, but how will the user
decide how long he wants the checkpoint to take?
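For reference, a sketch of the existing knobs (values only illustrative):

# postgresql.conf: spread each automatic checkpoint over ~90% of the
# checkpoint interval instead of writing dirty buffers as fast as possible
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9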

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-24 Thread Amit Kapila
On Wed, Sep 21, 2016 at 6:51 PM, Anastasia Lubennikova
 wrote:
> 20.09.2016 08:21, Amit Kapila:
>
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
>  wrote:
>
> 28.08.2016 09:13, Amit Kapila:
>
>
> The problem seems really tricky, but the answer is simple.
> We store included columns unordered. It was mentioned somewhere in
> this thread.
>

Is there any fundamental problem in storing them in an ordered way?  I
mean to say, you anyway need to store all the column values on the leaf
page, so why can't we find the exact location for the complete key?
Basically, use the truncated key to reach the leaf level and then use
the complete key to find the exact location to store it.  I might be
missing something here, but if we can store them in an ordered fashion,
we can use them even for queries containing ORDER BY (where the ORDER BY
contains included columns).

> Let me give you an example:
>
> create table t (i int, p point);
> create index idx on t (i) including (p);
> "point" data type doesn't have any opclass for btree.
> Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
> We have no idea what is the "correct order" for this attribute.
> So the answer is "it doesn't matter". When searching in index,
> we know that only key attrs are ordered, so only them can be used
> in scankey. Other columns are filtered after retrieving data.
>
> explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
> QUERY PLAN
> ---
>  Index Only Scan using idx on t  (cost=0.14..4.20 rows=1 width=20)
>Index Cond: (i = 0)
>Filter: (p <@ '<(0,0),2>'::circle)
>

I think the reason for the Filter here is that we don't keep included
columns in the scan keys.  Can't we think of having them in the scan
keys, but use only the key columns to reach the leaf level and then use
the complete scan key at the leaf level?

>
> The same approach is used for included columns of any type, even if
> their data types have opclass.
>
> Is this truncation concept of high key needed for correctness of patch
> or is it just to save space in index?   If you need this, then I think
> nbtree/Readme needs to be updated.
>
>
> Now it's done only for space saving. We never check included attributes
> in non-leaf pages, so why store them? Especially if we assume that included
> attributes can be quite long.
> There is already a note in documentation:
>
> +It's the same with other constraints (PRIMARY KEY and EXCLUDE).
> This can
> +also can be used for non-unique indexes as any columns which are
> not required
> +for the searching or ordering of records can be included in the
> +INCLUDING clause, which can slightly reduce the size of
> the index,
> +due to storing included attributes only in leaf index pages.
>

Okay, thanks for clarification.

> What should I add to README (or to documentation),
> to make it more understandable?
>

Maybe add the data representation, i.e. that only leaf pages contain all
the columns, and how the scan works.  I think you can see if you can
extend the "Notes About Data Representation" and/or "Other Things That
Are Handy to Know" sections in the existing README.

> -- I am getting Assertion failure when I use this patch with database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> TRAP: unrecognized TOAST vartag("((bool) 1)", File:
> "src/backend/access/common/h
> eaptuple.c", Line: 532)
> LOG:  server process (PID 1404) was terminated by exception 0x8003
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
> valu
> e.
> LOG:  terminating any other active server processes
>
>
> That is expected behavior, because the catalog versions are not compatible.
> But I wonder why there was no message about that?
> I suppose that's because CATALOG_VERSION_NO was outdated in my
> patch. As far as I know, the committer will change it before the commit.
> Try the new patch with the updated value. It should fail with a message
> about incompatible versions.
>

Yeah, that must be the reason, but let's not change it now; otherwise we
will face conflicts while applying the patch.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Refactor StartupXLOG?

2016-09-24 Thread Michael Paquier
On Sat, Sep 24, 2016 at 11:01 AM, Thomas Munro
 wrote:
> What would the appetite be for that kind of refactoring work,
> considering the increased burden on committers who have to backpatch
> bug fixes? Is it a project goal to reduce the size of large
> complicated functions like StartupXLOG and heap_update?  It seems like
> a good way for new players to learn how they work.

A lot of appetite. The size of xlog.c is out of control, so something
that would be really cool to see is splitting the whole logic of
xlog.c into more independent files; for example, low-level file-only
operations could go into xlogfile.c, backup code paths into
xlogbackup.c, etc. This would make it necessary to expose some of the
shared-memory structures now at the top of xlog.c, like XLogCtl, but I
think that it would be really worth it in the end, and closer to things
like xloginsert.c and xlogarchive.c that began such a move.
-- 
Michael




Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2016-09-24 Thread Peter Geoghegan
On Tue, Sep 20, 2016 at 8:55 AM, Heikki Linnakangas  wrote:
> Thanks for working on this, and sorry for disappearing and dropping this on
> the floor earlier. This doesn't apply anymore, thanks to 9c810a2e. Shouldn't
> be hard to fix.

Thanks for looking at it again.

> I was in support of this earlier in general, even though I had some
> questions on the details. But now that I look at the patch again, I'm not so
> sure anymore.

Honestly, I almost withdrew this patch myself, simply because it has
dragged on so long. It has become ridiculous.

> I don't think this actually makes things clearer. It adds new
> cases that the code needs to deal with. Index AM writers now have to care
> about the difference between a UNIQUE_CHECK_PARTIAL and
> UNIQUE_CHECK_SPECULATIVE. You can have the exact same implementation for
> both, but at the very least the index AM author needs to read the paragraph
> you added to the docs to understand the difference. That adds some cognitive
> load.

I think it gives us the opportunity to discuss the differences, and in
particular to explain why this "speculative insertion" thing exists at
all. Besides, we can imagine an amcanunique implementation in which
the difference might matter. Honestly, since it is highly unlikely
that there ever will be another amcanunique access method, the
cognitive burden to implementers of new amcanunique AMs is not a
concern to me. Rather, the concern with that part of the patch is
educating people about how the whole speculative insertion thing works
in general, and drawing specific attention to it in a few key places
in the code.

Speculative insertion is subtle and complicated enough that I feel
that that should be well described, as I've said many times. Remember
how hard it was for us to come to agreement on the basic requirements
in the first place!

> Likewise, in ExecInsertIndexTuples(), this makes the deferred-index
> case work slightly differently from speculative insertion. It's not a big
> difference, but it again forces you to keep one more scenario in mind, when
> reading the code

This actually does have useful practical consequences, although I
didn't point that out earlier (though I should have). To see what I
mean, consider the complaint in this recent thread, which is based on
an actual user application problem:

https://www.postgresql.org/message-id/flat/1472134358.649524...@f146.i.mail.ru#1472134358.649524...@f146.i.mail.ru

I go on to explain how this patch represents a partial solution to
that [1]. That's what I mean by "useful practical consequences". As I
say in [1], I think we could even get a full solution, if we applied
this patch and *also* made the ordering in which the relcache returns
a list of index OIDs more useful (it should still be based on
something stable, to avoid deadlocks, but more than just OID order.
Instead, relcache.c can sort indexes such that we insert into primary
keys first, then unique indexes, then all other indexes. This also
avoids bloat if there is a unique violation, by getting unique indexes
out of the way first during ExecInsert()).
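A hedged sketch of that ordering idea (the struct and comparator are
illustrative, not relcache.c code):

/* Primary keys first, then other unique indexes, then the rest; OID is
 * the stable tiebreaker, preserving the deadlock-avoidance property of
 * the current pure-OID ordering. */
typedef struct
{
	unsigned int oid;
	int			isprimary;
	int			isunique;
} IndexSortEntry;

static int
index_order_cmp(const void *a, const void *b)
{
	const IndexSortEntry *ia = (const IndexSortEntry *) a;
	const IndexSortEntry *ib = (const IndexSortEntry *) b;

	if (ia->isprimary != ib->isprimary)
		return ib->isprimary - ia->isprimary;
	if (ia->isunique != ib->isunique)
		return ib->isunique - ia->isunique;
	return (ia->oid > ib->oid) - (ia->oid < ib->oid);
}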

> So overall, I think we should just drop this. Maybe a comment somewhere
> would be in order, to point out that ExecInsertIndexTuples() and
> index_insert might perform some unnecessary work, by inserting index tuples
> for a doomed heap tuple, if a speculative insertion fails. But that's all.

Perhaps. I'm curious about what your thoughts are on what I've said
about "useful practical consequences" of finishing insertion earlier
for speculative inserters. If you're still not convinced about this
patch, having considered that as well, then I will drop the patch (or
maybe we just add some comments, as you suggest).

[1] 
https://www.postgresql.org/message-id/CAM3SWZTFTbK_Y%3D7uWJaXYLHnYQ99pV4KFpmjTKbNJR5_%2BQThzA%40mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] ICU integration

2016-09-24 Thread Peter Geoghegan
On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
 wrote:
> A couple of thoughts about abbreviated keys:
>
> #ifndef TRUST_STRXFRM
> if (!collate_c)
> abbreviate = false;
> #endif
>
> I think this macro should affect only strxfrm, and we should trust
> ucol_getSortKey or disable it independently.  ICU's manual says
> reassuring things like "Sort keys are most useful in databases" and
> "Sort keys are generally only useful in databases or other
> circumstances where function calls are extremely expensive".

+1. Abbreviated keys are essential to get competitive performance
while sorting text, and the fact that ICU makes them safe to
reintroduce is a major advantage of adopting ICU. Perhaps we should
consider wrapping strxfrm() instead, though, so that other existing
callers of strxfrm() (I'm thinking of convert_string_datum()) almost
automatically do the right thing. In other words, maybe there should
be a pg_strxfrm() or something, with TRUST_STRXFRM changed to be
something that can dynamically resolve whether or not it's a collation
managed by a trusted collation provider (this could only be resolved
at runtime, which I think is fine).
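To make that concrete, a hedged sketch of such a dispatcher (the struct
and its 'provider' tag are illustrative stand-ins for pg_locale_t, and
the fixed conversion buffer is a simplification):

#include <string.h>
#include <unicode/ucol.h>
#include <unicode/ustring.h>

typedef struct
{
	char		provider;	/* 'c' = trusted libc, 'i' = ICU (illustrative) */
	UCollator  *collator;	/* set when provider == 'i' */
} my_locale;

static size_t
my_strxfrm(char *dest, const char *src, size_t destsize, my_locale *loc)
{
	if (loc->provider == 'i')
	{
		UErrorCode	status = U_ZERO_ERROR;
		UChar		ubuf[1024];
		int32_t		ulen;

		/* ucol_getSortKey takes UTF-16, so convert the UTF-8 input first */
		u_strFromUTF8(ubuf, 1024, &ulen, src, -1, &status);
		if (U_FAILURE(status))
			return 0;
		return (size_t) ucol_getSortKey(loc->collator, ubuf, ulen,
						(uint8_t *) dest, (int32_t) destsize);
	}
	/* trusted strxfrm() path, same contract as before */
	return strxfrm(dest, src, destsize);
}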

> It looks like varstr_abbrev_convert calls strxfrm unconditionally
> (assuming TRUST_STRXFRM is defined).  This needs to
> use ucol_getSortKey instead when appropriate.  It looks like it's a
> bit more helpful than strxfrm about telling you the output buffer size
> it wants, and it doesn't need nul termination, which is nice.
> Unfortunately it is like strxfrm in that the output buffer's contents
> are unspecified if it ran out of space.

One can use the ucol_nextSortKeyPart() interface to just get the first
4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
the output buffer size limitation is probably irrelevant. The ICU
documentation says something about this being useful for Radix sort,
but I suspect it's more often used to generate abbreviated keys.
Abbreviated keys were not my original idea. They're really just a
standard technique.

-- 
Peter Geoghegan




Re: [HACKERS] pgbench - minor fix for meta command only scripts

2016-09-24 Thread Fabien COELHO


Hello Heikki,

> Yeah, it really is quite a mess. I tried to review your patch, and I think
> it's correct, but I couldn't totally convince myself, because of the
> existing messiness of the logic. So I bit the bullet and started
> refactoring.
>
> I came up with the attached. It refactors the logic in doCustom() into a
> state machine. I think this is much clearer, what do you think?

The patch did not apply to master because you committed the sleep fix in
between. I updated the patch so that the fix is included as well.

I think that this is really needed. The code is much clearer and simpler
to understand with the state machine & additional functions. This is a
definite improvement to the code base.

I've done quite some testing with various options (-r, --rate,
--latency-limit, -C...) and got pretty reasonable results.

Although I cannot be absolutely sure that the refactoring does not
introduce any new bug, I'm convinced that it will be much easier to find
them :-)

Attached are some small changes to your version:

I have added the sleep_until fix.

I have fixed a bug introduced in the patch by changing && to || in the
(min_sec > 0 && maxsock != -1) condition, which was inducing errors with
multiple threads & clients...

I have factored out several error messages into "commandFailed", in place
of the "metaCommandFailed", and added the script number as well in the
error messages. All messages are now specific to the failed command.

I have added two states to the machine:

 - CSTATE_CHOOSE_SCRIPT, which simplifies threadRun; there is now one call
   to chooseScript instead of two before.

 - CSTATE_END_COMMAND, which manages is_latencies and proceeding to the
   next command, thus merging the three instances of updating the stats
   that were in the first version.

The latter state means that processing query results is included in the
per-statement latency, which is an improvement because before I was
getting some transaction latencies significantly larger than the apparent
sum of the per-statement latencies, which did not make much sense...

I have added & updated a few comments. There are some places where the
break could be a fall-through instead; not sure how desirable that is,
I'm fine with break.

> Well, the comment right there says "note this is not included in the
> statement latency numbers", so apparently it's intentional. Whether it's a
> good idea or not, I don't know :-). It does seem a bit surprising.

Indeed, it also results in apparently inconsistent numbers, and it
creates a mess for recording the statement latency, because it meant that
in some cases the latency was collected before the actual end of the
command; see the discussion about CSTATE_END_COMMAND above.

> But what seems more bogus to me is that we do that after recording the
> *transaction* stats, if this was the last command. So the PQgetResult() of
> the last command in the transaction is not included in the transaction
> stats, even though the PQgetResult() calls for any previous commands are.
> (Perhaps that's what you meant too?)
>
> I changed that in my patch, it would've been inconvenient to keep that old
> behavior, and it doesn't make any sense to me anyway.

Fine with me.

--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 8b24ad5..502e644 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -235,25 +235,97 @@ typedef struct StatsData
 } StatsData;
 
 /*
- * Connection state
+ * Connection state machine states.
+ */
+typedef enum
+{
+	/*
+	 * The client must first choose a script to execute.  Once chosen, it can
+	 * either be throttled (state CSTATE_START_THROTTLE under --rate) or start
+	 * right away (state CSTATE_START_TX).
+	 */
+	CSTATE_CHOOSE_SCRIPT,
+
+	/*
+	 * In CSTATE_START_THROTTLE state, we calculate when to begin the next
+	 * transaction, and advance to CSTATE_THROTTLE.  CSTATE_THROTTLE state
+	 * sleeps until that moment.  (If throttling is not enabled, doCustom()
+	 * falls directly through from CSTATE_START_THROTTLE to CSTATE_START_TX.)
+	 */
+	CSTATE_START_THROTTLE,
+	CSTATE_THROTTLE,
+
+	/*
+	 * CSTATE_START_TX performs start-of-transaction processing.  Establishes
+	 * a new connection for the transaction, in --connect mode, and records
+	 * the transaction start time.
+	 */
+	CSTATE_START_TX,
+
+	/*
+	 * We loop through these states, to process each command in the
+	 * script:
+	 *
+	 * CSTATE_START_COMMAND starts the execution of a command.  On a SQL
+	 * command, the command is sent to the server, and we move to
+	 * CSTATE_WAIT_RESULT state.  On a \sleep meta-command, the timer is
+	 * set, and we enter the CSTATE_SLEEP state to wait for it to expire.
+	 * Other meta-commands are executed immediately, and we proceed to next
+	 * command.
+	 *
+	 * CSTATE_WAIT_RESULT waits until we get a result set back from the server
+	 * for the current command, and proceeds to CSTATE_END_COMMAND.
+	 *
+	 * CSTATE_SLEEP waits until the end of \sleep 

Re: [HACKERS] Refactor StartupXLOG?

2016-09-24 Thread Thomas Munro
On Sat, Sep 24, 2016 at 8:24 PM, Heikki Linnakangas  wrote:
> On 09/24/2016 05:01 AM, Thomas Munro wrote:
>>
>> What would the appetite be for that kind of refactoring work,
>> considering the increased burden on committers who have to backpatch
>> bug fixes?  Is it a project goal to reduce the size of large
>> complicated functions like StartupXLOG and heap_update?  It seems like
>> a good way for new players to learn how they work.
>
>
> +1. Yes, it does increase the burden of backpatching, but I think it'd still
> be very much worth it.

Cool.

> A couple of little details that caught my eye at a quick read:
>
>> /* Try to find a backup label. */
>> if (read_backup_label(, ,
>>   ))
>> {
>> wasShutdown = ProcessBackupLabel(xlogreader, ,
>> checkPointLoc,
>>
>> );
>>
>> /* set flag to delete it later */
>> haveBackupLabel = true;
>> }
>> else
>> {
>> /* Clean up any orphaned tablespace map files with no
>> backup label. */
>> CleanUpTablespaceMap();
>> ...
>
>
> This is a bit asymmetric: In the true-branch, ProcessBackupLabel() reads the
> tablespace map, and sets InArchiveRecovery and StandbyMode, but in the
> false-branch, StartupXLog() calls CleanupTablespaceMap() and sets those
> variables directly.

Right.  I need to move all or some of the other branch out to its own
function too.

> For functions like BeginRedo, BeginHotStandby, ReplayRedo, etc., I think
> it'd be better to have the "if (InRecovery)" checks in the caller, rather
> than in the functions.

Yeah.  I was thinking that someone might value the preservation of
indentation level, since that might make small localised bug fixes
easier to backport to the monolithic StartupXLOG.  Plain old
otherwise-redundant curly braces would achieve that.  Or maybe it's
better not to worry about preserving that.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Quorum commit for multiple synchronous replication.

2016-09-24 Thread Masahiko Sawada
On Wed, Sep 21, 2016 at 11:22 PM, Robert Haas  wrote:
> On Wed, Sep 21, 2016 at 5:54 AM, Petr Jelinek  wrote:
>>>> Reading again the thread, it seems that my previous post [1] was a bit
>>>> misunderstood. My position is to not introduce any new behavior
>>>> changes in 9.6, so we could just make the FIRST NUM grammar equivalent
>>>> to NUM.
>>>>
>>>> [1]:
>>>> https://www.postgresql.org/message-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4njgmmgiagvcs...@mail.gmail.com
>>>
>>> I misunderstood your intent, then.  But I still stand by what I did
>>> understand, namely that 'k (...)'  should mean 'any k (...)'.  It's much
>>> more natural than having it mean 'first k (...)' and I also think it
>>> will be more frequent in practice.
>>>
>>
>> I think so as well.
>
> Well, I agree, but I think making behavior changes after rc1 is a
> non-starter.  It's better to live with the incompatibility than to
> change the behavior so close to release.  At least, that's my
> position.  Getting the release out on time with a minimal bug count is
> more important to me than a minor incompatibility in the meaning of
> one GUC.
>

As the release team announced, it's better to postpone changing the
syntax of the existing s_s_names.
I still vote for changing the behaviour of the existing syntax 'k (n1,
n2)' to quorum commit.
That is,
1.  'First k (n1, n2, n3)' means that the master server waits for ACKs
from the k standby servers whose names appear earliest in the list.
2.  'Any k (n1, n2, n3)' means that the master server waits for ACKs
from any k of the listed standby servers.
3.  'n1, n2, n3' is the same as #1 with k=1.
4.  '(n1, n2, n3)' is the same as #2 with k=1.
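To illustrate the proposed grammar (standby names are only examples):

synchronous_standby_names = 'First 2 (n1, n2, n3)'  # priority: the two earliest-listed
synchronous_standby_names = 'Any 2 (n1, n2, n3)'    # quorum: any two of the three
synchronous_standby_names = 'n1, n2, n3'            # same as 'First 1 (n1, n2, n3)'
synchronous_standby_names = '(n1, n2, n3)'          # same as 'Any 1 (n1, n2, n3)'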

Attached updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


quorum_commit_v3.patch
Description: Binary data



Re: [HACKERS] Refactor StartupXLOG?

2016-09-24 Thread Heikki Linnakangas

On 09/24/2016 05:01 AM, Thomas Munro wrote:
> What would the appetite be for that kind of refactoring work,
> considering the increased burden on committers who have to backpatch
> bug fixes?  Is it a project goal to reduce the size of large
> complicated functions like StartupXLOG and heap_update?  It seems like
> a good way for new players to learn how they work.

+1. Yes, it does increase the burden of backpatching, but I think it'd
still be very much worth it.

A couple of little details that caught my eye at a quick read:

> /* Try to find a backup label. */
> if (read_backup_label(, ,
>                       ))
> {
>     wasShutdown = ProcessBackupLabel(xlogreader, , checkPointLoc,
>                                      );
>
>     /* set flag to delete it later */
>     haveBackupLabel = true;
> }
> else
> {
>     /* Clean up any orphaned tablespace map files with no backup label. */
>     CleanUpTablespaceMap();
> ...

This is a bit asymmetric: In the true-branch, ProcessBackupLabel() reads
the tablespace map, and sets InArchiveRecovery and StandbyMode, but in
the false-branch, StartupXLog() calls CleanUpTablespaceMap() and sets
those variables directly.

For functions like BeginRedo, BeginHotStandby, ReplayRedo, etc., I think
it'd be better to have the "if (InRecovery)" checks in the caller,
rather than in the functions.


- Heikki




Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-24 Thread Craig Ringer
On 24 Sep. 2016 04:04, "Tom Lane"  wrote:
>
> >
> > It's thus sufficient to apply the patch to install the perl modules to
> > 9.4, 9.5 and 9.6. Nothing else is needed. I've attached backports for
> > 9.4 and 9.5.
>
> Pushed with cosmetic adjustments ---

Thanks.

> Looking back over the thread, I see that you also proposed installing
> isolationtester and pg_isolation_regress for the benefit of extensions.
> I'm very much less excited about that idea.  It'd be substantially more
> dead weight in typical installations, and I'm not sure that it'd be useful
> to common extensions, and I'm not eager to treat isolationtester's API
> and behavior as something we need to hold stable for extension use.

Fine by me. I have no particular problem expecting those to be installed
explicitly by ext devs who want to use them.


[HACKERS] raised checkpoint limit & manual checkpoint

2016-09-24 Thread Fabien COELHO


Hello,

The checkpoint time limit has just been raised to one day after a 
discussion started by Andres Freund:


https://www.postgresql.org/message-id/20160202001320.GP8743%40awork2.anarazel.de

I would have gone further up, say one week or even one month, but I think 
that this new limit is an improvement over the previous 1 hour maximum.


Now ISTM that there is a possible use case which arises with this new 
setting and is not well addressed by postgresql:


Let us say that an application has periods of high and low usage, say over 
a day, so that I want to avoid a checkpoint from 8 to 20, but I'm okay 
after that. I could raise the size and time limits so that they do not 
occur during these hours and plan to do a manual CHECKPOINT once a day 
when I see fit.


Now the problem I see is that CHECKPOINT means "do a CHECKPOINT right now 
as fast as possible", i.e. there is no throttling whatsoever, which leads 
to heavy IO and may result in a very unresponsive database.


I would suggest that a good complementary feature would be to allow a 
manual checkpoint to run over a period of time, say something like:


  CHECKPOINT OVER '10 hours';

That would target to complete after this period (whether it succeeds or 
not is another issue) instead of going as fast as possible, thus avoiding

some performance degradation.

Any thoughts?

--
Fabien.




Re: [HACKERS] patch: function xmltable

2016-09-24 Thread Pavel Stehule
Hi

2016-09-23 10:05 GMT+02:00 Craig Ringer :

> On 22 September 2016 at 02:31, Pavel Stehule 
> wrote:
>
> > another small update - fix XMLPath parser - support multibytes characters
>
> I'm returning for another round of review.
>
The code doesn't handle the 5 XML built-in entities correctly in
text-typed output. It processes &lt; and &gt; but not &amp;, &apos; or
&quot;. See added test. I have not fixed this, but I think it's clearly
broken:
>
>
> + -- XML builtin entities
> + SELECT * FROM xmltable('/x/a' PASSING
> '<
> ent>'
> COLUMNS ent text);
> +   ent
> + ---
> +  '
> +  "
> +  
> +  
> +  
> + (5 rows)
>
> so I've adjusted the docs to claim that they're expanded. The code
> needs fixing to avoid entity-escaping when the output column type is
> not 'xml'.
>
>
fixed


>
>  and  entities in xml-typed output are expanded, not
> preserved. I don't know if this is intended but I suspect it is:
>
> + SELECT * FROM xmltable('/x/a' PASSING
> '<
> ent>'
> COLUMNS ent xml);
> +ent
> + --
> +  '
> +  "
> +  
> +  
> +  
> + (5 rows)
>
>
> For the docs changes relevant to the above search for "The five
> predefined XML entities". Adjust that bit of docs if I guessed wrong
> about the intended behaviour.
>
> The tests don't cover CDATA or PCDATA . I didn't try to add that, but
> they should.
>
>
appended


> Did some docs copy-editing and integrated some examples. Explained how
> nested elements work, that multiple top level elements is an error,
> etc. Explained the time-of-evaluation stuff. Pointed out that you can
> refer to prior output columns in PATH and DEFAULT, since that's weird
> and unusual compared to normal SQL. Documented handling of multiple
> node matches, including the surprising results of somepath/text() on
> xy. Documented handling of nested
> elements. Documented that xmltable works only on XML documents, not
> fragments/forests.
>

I don't understand this sentence: "It is possible for a PATH expression
to reference output columns that appear before it in the column-list, so
paths may be dynamically constructed based on other parts of the XML
document:"


> Regarding evaluation time, it struck me that evaluating path
> expressions once per row means the xpath must be parsed and processed
> once per row. Isn't it desirable to store and re-use the preparsed
> xpath? I don't think this is a major problem, since we can later
> detect stable/immutable expressions including constants, evaluate only
> once in that case, and cache. It's just worth thinking about.
>

Probably it is possible - it is exactly as you wrote - it needs to be
checked for changes. We can try some performance optimizations later -
without compatibility issues. For now, I prefer the simplest code.

A note: the PATH expression is evaluated for every **input** row. The
row path expression is processed at the same moment as the DOM parsing
of the main XML document. So the overhead of PATH expression parsing
should not be dominant.


>
> The docs and tests don't seem to cover XML entities. What's the
> behaviour there? Core XML only defines one entity, but if a schema
> defines more how are they processed? The tests need to cover the
> predefined entities and  at least.
>

I don't understand what you are proposing here. Please, can you send
some examples?

>
> I have no idea whether the current code can fetch a DTD and use any
>  declarations to expand entities, but I'm guessing not? If
> not, external DTDs, and internal DTDs with external entities should be
> documented as unsupported.
>
> It doesn't seem to cope with internal DTDs at all (libxml2 limitation?):
>
> SELECT * FROM xmltable('/' PASSING $XML$ standalone="yes" ?>
>
>   
> ]>
> Hello .
> $XML$ COLUMNS foo text);
>
> + ERROR:  invalid XML content
> + LINE 1: SELECT * FROM xmltable('/' PASSING $XML$ +^
> + DETAIL:  line 2: StartTag: invalid element name
> +  +  ^
> + line 3: StartTag: invalid element name
> +   
> +^
> + line 4: StartTag: invalid element name
> +   
> +^
> + line 6: Entity 'pg' not defined
> + Hello .
> +^
>
>
It is rejected before the XMLTABLE function call:

postgres=# select $XML$
postgres$# 
postgres$#   
postgres$# ]>
postgres$# Hello .
postgres$# $XML$::xml;
ERROR:  invalid XML content
LINE 1: select $XML$
   ^
DETAIL:  line 2: StartTag: invalid element name

   ^
line 4: StartTag: invalid element name
  
   ^
line 6: Entity 'pg' not defined
Hello .
   ^
It is disabled by default in libxml2. I found a function,
xmlSubstituteEntitiesDefault: http://www.xmlsoft.org/entities.html
http://www.xmlsoft.org/html/libxml-parser.html#xmlSubstituteEntitiesDefault

The default behaviour should be common to all of PostgreSQL's
libxml2-based functions - but that is a different topic - maybe a
candidate for the PostgreSQL ToDo? I don't remember any user requests
related to this issue.
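For example, a minimal standalone sketch of both switches (assuming
libxml2; this is not what PostgreSQL's xml.c currently does):

#include <string.h>
#include <libxml/parser.h>

int main(void)
{
	const char *xml =
		"<?xml version=\"1.0\" standalone=\"yes\"?>"
		"<!DOCTYPE foo [<!ENTITY pg \"PostgreSQL\">]>"
		"<foo>Hello &pg;.</foo>";
	xmlDocPtr	doc;

	xmlSubstituteEntitiesDefault(1);	/* global default */
	doc = xmlReadMemory(xml, (int) strlen(xml), NULL, NULL,
			    XML_PARSE_NOENT);	/* or the per-call option */
	if (doc)
		xmlFreeDoc(doc);
	return doc ? 0 : 1;
}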



>
> libxml seems to support documents with