Re: [HACKERS] store narrow values in hash indexes?

2016-09-23 Thread Amit Kapila
On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas  wrote:
> Currently, hash indexes always store the hash code in the index, but
> not the actual Datum.  It's recently been noted that this can make a
> hash index smaller than the corresponding btree index would be if the
> column is wide.  However, if the index is being built on a fixed-width
> column with a typlen <= sizeof(Datum), we could store the original
> value in the hash index rather than the hash code without using any
> more space.  That would complicate the code, but I bet it would be
> faster: we wouldn't need to set xs_recheck, we could rule out hash
> collisions without visiting the heap, and we could support index-only
> scans in such cases.
>

What exactly do you mean by Datum?  Is it for datatypes that fit into
64 bits, like integer?  I think if we are able to support index-only
scans for hash indexes for some data types, that will be a huge plus.
Surely, there is some benefit without index-only scans as well, namely
that we can avoid the recheck, but I am not sure that alone can give
us any big performance boost.  As you say, it might lead to some
complication in the code, but I think it is worth trying.

Won't it add some requirements for pg_upgrade as well?
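To make the "fits into a Datum" question concrete, here is a minimal sketch (invented helper name, not PostgreSQL source) of the check Robert describes: the original value can replace the hash code only for fixed-width, pass-by-value types no wider than a Datum.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* pass-by-value container; 8 bytes on most
                                 * 64-bit platforms */

/*
 * Hypothetical sketch: a column value could be stored in place of the hash
 * code only if its type is fixed-width (typlen > 0), passed by value, and
 * no wider than a Datum -- e.g. int2, int4, oid, date.  Varlena types have
 * typlen == -1 and never qualify.
 */
bool
fits_in_datum(int typlen, bool typbyval)
{
    return typbyval && typlen > 0 && typlen <= (int) sizeof(Datum);
}
```

Under this rule an int4 column qualifies, while text (typlen -1, not by-value) keeps storing only its hash code.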

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
 wrote:
> On 09/23/2016 03:07 PM, Amit Kapila wrote:
>>
>> On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
>>  wrote:
>>>
>>> On 09/23/2016 01:44 AM, Tomas Vondra wrote:


 ...
 The 4.5 kernel clearly changed the results significantly:

>>> ...



 (c) Although it's not visible in the results, 4.5.5 almost perfectly
 eliminated the fluctuations in the results. For example when 3.2.80
 produced these results (10 runs with the same parameters):

 12118 11610 27939 11771 18065
 12152 14375 10983 13614 11077

 we get this on 4.5.5

 37354 37650 37371 37190 37233
 38498 37166 36862 37928 38509

 Notice how much more even the 4.5.5 results are, compared to 3.2.80.

>>>
>>> The more I think about these random spikes in pgbench performance on
>>> 3.2.80,
>>> the more I find them intriguing. Let me show you another example (from
>>> Dilip's workload and group-update patch on 64 clients).
>>>
>>> This is on 3.2.80:
>>>
>>>   44175  34619  51944  38384  49066
>>>   37004  47242  36296  46353  36180
>>>
>>> and on 4.5.5 it looks like this:
>>>
>>>   34400  35559  35436  34890  34626
>>>   35233  35756  34876  35347  35486
>>>
>>> So the 4.5.5 results are much more even, but overall clearly below
>>> 3.2.80.
>>> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
>>> randomly do something right, but what is it and why doesn't it happen on
>>> the
>>> new kernel? And how could we do it every time?
>>>
>>
>> As far as I can see you are using default values of min_wal_size,
>> max_wal_size, checkpoint related params, have you changed default
>> shared_buffer settings, because that can have a bigger impact.
>
>
> Huh? Where do you see me using default values?
>

I was referring to one of your scripts at http://bit.ly/2doY6ID.  I
hadn't noticed that you had changed the default values in
postgresql.conf.

> There are settings.log with a
> dump of pg_settings data, and the modified values are
>
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 3600
> effective_io_concurrency = 32
> log_autovacuum_min_duration = 100
> log_checkpoints = on
> log_line_prefix = %m
> log_timezone = UTC
> maintenance_work_mem = 524288
> max_connections = 300
> max_wal_size = 8192
> min_wal_size = 1024
> shared_buffers = 2097152
> synchronous_commit = on
> work_mem = 524288
>
> (ignoring some irrelevant stuff like locales, timezone etc.).
>
>> Using default values of mentioned parameters can lead to checkpoints in
>> between your runs.
>
>
> So I'm using 16GB shared buffers (so with scale 300 everything fits into
> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
> triggered explicitly before each run.
>

Thanks for the clarification.  Do you think we should try some
different settings for the *_flush_after parameters, as those can help
in reducing spikes in writes?

>> Also, I think instead of 5 mins, read-write runs should be run for 15
>> mins to get consistent data.
>
>
> Where does the inconsistency come from?

That's what I am also curious to know.

> Lack of warmup?

Can't say, but at least we should try to rule out the possibilities.
I think one way to rule them out is to do slightly longer runs for
Dilip's test cases, and for pgbench we might need to drop and
re-create the database after each reading.

> Considering how
> uniform the results from the 10 runs are (at least on 4.5.5), I claim this
> is not an issue.
>

It is quite possible that it is some kernel regression that has been
fixed in a later version.  For instance, we do most tests on cthulhu,
which has a 3.10 kernel, and we generally get consistent results.
I am not sure a later kernel version, say 4.5.5, is a net win, because
there is a considerable dip in performance in that version, even
though it produces quite stable results.
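"Much more even" can be quantified.  A throwaway sketch (not from the thread) that computes the squared coefficient of variation -- variance over mean squared -- of a set of tps samples:

```c
/*
 * Squared coefficient of variation (variance / mean^2) of n samples.
 * Using the squared form avoids needing sqrt()/libm; the ordinary CV is
 * just the square root of the result.
 */
double
relative_variance(const double *xs, int n)
{
    double mean = 0.0, var = 0.0;
    int i;

    for (i = 0; i < n; i++)
        mean += xs[i];
    mean /= n;
    for (i = 0; i < n; i++)
        var += (xs[i] - mean) * (xs[i] - mean);
    var /= n;
    return var / (mean * mean);
}
```

Applied to the ten 3.2.80 samples quoted earlier (12118 ... 11077) this comes out around 0.12, versus roughly 0.0002 for the 4.5.5 samples -- i.e. a coefficient of variation of about 34% versus under 2%.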


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Complete LOCK TABLE ... IN ACCESS|ROW|SHARE

2016-09-23 Thread Thomas Munro
Hi

After LOCK TABLE ... IN ACCESS|ROW|SHARE, psql runs out of tab
completions.  Here's a patch to improve that, for the November
commitfest.

-- 
Thomas Munro
http://www.enterprisedb.com


complete-lock-table.patch
Description: Binary data



Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared m

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 7:41 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> Now, it is quite possible
>> that the error code is set to 0 on success in my Windows environment
>> (Win7) and doesn't work in some other environment.  In any case, if we
>> want to go ahead and don't want to rely on CreateFileMapping(), then
>> the attached patch should suffice.
>
> Yeah, seeing that this is not mentioned in MS' documentation I don't
> think we should rely on it.  The patch looks fine to me, pushed.
>

Thanks.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Refactor StartupXLOG?

2016-09-23 Thread Thomas Munro
Hi hackers

StartupXLOG is 1549 lines of code.  Its unwieldy size came up in a
discussion in an IRC channel where some PG hackers lurk and I decided
to see how it might be chopped up into subroutines with a clear
purpose and reasonable size, as an exercise.

I came up with the following on a first pass, though obviously you
could go a lot further:

  StartupXLOG -- reduced to 651 lines
   -> ReportControlFileState() -- 40 lines to log recovery startup message
   -> ReportRecoveryTarget() -- 32 lines to log recovery target message
   -> ProcessBackupLabel() -- 90 lines to read backup label's checkpoint
   -> CleanUpTablespaceMap() -- 33 lines of file cleanup

   -> BeginRedo() -- 191 lines for redo setup
   -> BeginHotStandby() -- 74 lines for hot standby initialisation
   -> ReplayRedo() -- 260 lines for the main redo loop
   -> FinishRedo() -- 63 lines for redo completion

   -> AssignNewTimeLine() -- 66 lines
   -> CleanUpOldTimelineSegments() -- 74 lines of file cleanup
   -> RecoveryCheckpoint() -- 71 lines

Unfortunately the attached sketch patch (not tested much) is totally
unreadable as a result of moving so many things around.  It would
probably need to be done in a series of small well tested readable
patches moving one thing out at a time.  Perhaps with an extra
context/state object to cut down on wide argument lists.
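The "context/state object" idea can be sketched roughly like this (all names invented for illustration, not from the patch): bundle the variables StartupXLOG currently threads through locals into one struct, so each extracted subroutine takes a single pointer rather than a wide argument list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state carried across the StartupXLOG subroutines. */
typedef struct XLogStartupContext
{
    uint64_t checkpoint_lsn;        /* checkpoint record to start redo from */
    int      recovery_target_tli;   /* timeline being recovered onto */
    bool     backup_label_found;    /* did we find a backup label file? */
    bool     reached_consistency;   /* set by the redo loop */
} XLogStartupContext;

/*
 * Sketch of one extracted subroutine: if a backup label was present, it
 * overrides where redo starts.  Real logic would read the file; here the
 * inputs are passed in to keep the example self-contained.
 */
void
process_backup_label(XLogStartupContext *ctx, bool label_present,
                     uint64_t label_checkpoint_lsn)
{
    ctx->backup_label_found = label_present;
    if (label_present)
        ctx->checkpoint_lsn = label_checkpoint_lsn;
}
```

Each of the proposed subroutines (BeginRedo, ReplayRedo, FinishRedo, ...) would then mutate the shared context instead of a dozen shared locals.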

What would the appetite be for that kind of refactoring work,
considering the increased burden on committers who have to backpatch
bug fixes?  Is it a project goal to reduce the size of large
complicated functions like StartupXLOG and heap_update?  It seems like
a good way for new players to learn how they work.

-- 
Thomas Munro
http://www.enterprisedb.com


refactor-startupxlog-sketch.patch
Description: Binary data



Re: [HACKERS] Logical Replication WIP

2016-09-23 Thread Petr Jelinek
On 21/09/16 15:04, Peter Eisentraut wrote:
> Some partial notes on 0005-Add-logical-replication-workers.patch:
> 
> Document to what extent other relation types are supported (e.g.,
> materialized views as source, view or foreign table or temp table as
> target).  Suggest an updatable view as target if user wants to have
> different table names or write into a different table structure.
> 

I don't think that's a good suggestion; for one, it won't work for UPDATEs,
as we have a completely different path for finding the tuple to update,
which only works on real data, not on a view. I am thinking of even just
allowing table-to-table replication in v1, tbh, but yes, it should be
documented which target relation types are supported.

> 
> subscriptioncmds.c: Perhaps combine logicalrep_worker_find() and
> logicalrep_worker_stop() into one call that also encapsulates the
> required locking.

I was actually thinking of moving the wait loop that waits for the worker
to finish there as well.

> 
> In get_subscription_list(), the memory context pointers don't appear to
> do anything useful, because everything ends up being CurrentMemoryContext.
> 

That's kind of the point of the memory context pointers there, though, as
we start a transaction inside that function.

> pg_stat_get_subscription(NULL) for "all" seems a bit of a weird interface.
> 

I modeled that after pg_stat_get_activity(), which seems to be a similar
type of interface.

> pglogical_apply_main not used, should be removed.
> 

Hah.

> In logicalreprel_open(), the error message "cache lookup failed for
> remote relation %u" could be clarified.  This message could probably
> happen if the protocol did not send a Relation message first.  (The term
> "cache" is perhaps inappropriate for LogicalRepRelMap, because it
> implies that the value can be gotten from elsewhere if it's not in the
> cache.  In this case it's really session state that cannot be recovered
> easily.)
> 

Yeah, I have different code and a different error message for that now.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] asynchronous execution

2016-09-23 Thread Robert Haas
[ Adjusting subject line to reflect the actual topic of discussion better. ]

On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas  wrote:
> On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar  
> wrote:
>> For e.g., in the above plan which you specified, suppose :
>> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
>> is
>> waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
>> 2. The event wait list already has foreign scan on a that is on a different
>> subtree.
>> 3. This foreign scan a happens to be ready, so in
>> ExecAsyncWaitForNode (), ExecDispatchNode(foreign_scan_a) is called,
>> which returns with result_ready.
>> 4. Since it returns result_ready, its parent node is now inserted in the
>> callbacks array, and so its parent (Append) is executed.
>> 5. But, this Append planstate is already in the middle of executing Hash
>> join, and is waiting for HashJoin.
>
> Ah, yeah, something like that could happen.  I've spent much of this
> week working on a new design for this feature which I think will avoid
> this problem.  It doesn't work yet - in fact I can't even really test
> it yet.  But I'll post what I've got by the end of the day today so
> that anyone who is interested can look at it and critique.

Well, I promised to post this, so here it is.  It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface.  Hopefully that
means that anything other than those two nodes will suffer no
performance impact.  Of course, it might have other problems.

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet.  Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.
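A toy illustration of those two points (invented names, not the actual patch): the event-loop state lives in one shared struct -- the analogue of putting it in the EState -- and result delivery goes through a different entry point than the one a synchronous parent uses to pull a result, so re-entering the loop from a nested wait sees the same queues.

```c
#include <stdbool.h>

/* All loop state in one place, instead of in local variables. */
typedef struct AsyncState
{
    int results[8];     /* node ids whose results have arrived */
    int nresults;
} AsyncState;

/* Entry point used by the event loop to report arrival of a result. */
void
async_result_ready(AsyncState *st, int nodeid)
{
    st->results[st->nresults++] = nodeid;
}

/*
 * Separate entry point used by a synchronous caller: drain one arrived
 * result, if any.  Because the queue is in AsyncState rather than on the
 * stack, a nested invocation cannot shadow or lose pending results.
 */
bool
async_take_result(AsyncState *st, int *nodeid)
{
    if (st->nresults == 0)
        return false;
    *nodeid = st->results[--st->nresults];
    return true;
}
```

This is only the shape of the idea; the real patch additionally has to route each arrived result to the correct requestor node.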

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


async-wip-2016-09-23.patch
Description: binary/octet-stream



Re: [HACKERS] tbm_lossify causes "unbalanced" hashtables / bitmaps

2016-09-23 Thread Andres Freund
On 2016-09-23 17:40:13 -0400, Tom Lane wrote:
> My idea of an appropriate fix would be to resume the scan at the same
> point where the last scan stopped, and work circularly around the table
> when necessary.

I've played with that idea, and it does help a great deal. Not that
surprisingly, it's better than starting at a random point (which in turn
is better than starting at one end all the time).


> But I'm not sure there is any really good way to do that
> in the dynahash infrastructure.  Maybe it'd work to keep the iteration
> state around, but I don't remember how well that interacts with other
> insertions/deletions.  What about in your implementation?

It's easy enough to specify a start point. It requires exposing some
things that I don't necessarily want to - that's why I played around with
a random start point - but other than that it's easy to implement.
Internally growing the hashtable would be kind of an issue, invalidating
that point, but a) we're mostly not growing anymore by the time we
lossify, and b) it'd be quite harmless to start at the wrong point.


> There's also the point mentioned in the existing comment, that it'd be
> better to go after pages with more bits set first.  Not sure of an
> inexpensive way to do that (ie, one that doesn't involve multiple
> scans of the hashtable).  But your results suggest that maybe it'd
> be worth making tbm_lossify slower in order to get better results.

It's not easy, I agree.


Greetings,

Andres Freund




Re: [HACKERS] tbm_lossify causes "unbalanced" hashtables / bitmaps

2016-09-23 Thread Tom Lane
Andres Freund  writes:
> Because iteration (both in my implementation and dynahash) is
> essentially in bucket order, the front of the hashtable will be mostly
> empty, whereas later parts of the hashtable will be full (i.e. a lot of
> collisions). The reason for that is that we'll find all parts of the
> hashtable that are uncompressed and "early" in the hashspace, and
> they'll possibly hash to something later in the table.

Hm, yeah, I had supposed that this would hit a random part of the key
space, but you're right that over successive calls it's not good.

My idea of an appropriate fix would be to resume the scan at the same
point where the last scan stopped, and work circularly around the table
when necessary.  But I'm not sure there is any really good way to do that
in the dynahash infrastructure.  Maybe it'd work to keep the iteration
state around, but I don't remember how well that interacts with other
insertions/deletions.  What about in your implementation?

There's also the point mentioned in the existing comment, that it'd be
better to go after pages with more bits set first.  Not sure of an
inexpensive way to do that (ie, one that doesn't involve multiple
scans of the hashtable).  But your results suggest that maybe it'd
be worth making tbm_lossify slower in order to get better results.

regards, tom lane




Re: [HACKERS] tbm_lossify causes "unbalanced" hashtables / bitmaps

2016-09-23 Thread Andres Freund
On 2016-09-23 13:58:43 -0700, Andres Freund wrote:
> static void
> tbm_lossify(TIDBitmap *tbm)
> {
> ...
>   pagetable_start_iterate_random(tbm->pagetable, );

Uh, obviously that's from one of my attempts to address the problem.

Greetings,

Andres Freund




[HACKERS] tbm_lossify causes "unbalanced" hashtables / bitmaps

2016-09-23 Thread Andres Freund
Hi,

Playing around with my hashtable implementation [1] and using it for
bitmapscans/tidbitmap.c I noticed something curious. When using small
work_mem (4MB in my case) for large-ish tables (~5GB), the performance
tanks. That's primarily caused by some issues in the hashtable code
(since fixed), but it made me look more at the relevant tidbitmap
code. Namely tbm_lossify():

static void
tbm_lossify(TIDBitmap *tbm)
{
...
	pagetable_start_iterate_random(tbm->pagetable, );
	while ((page = pagetable_iterate(tbm->pagetable, )) != NULL)
	{
		if (page->ischunk)
			continue;			/* already a chunk header */

		/*
		 * If the page would become a chunk header, we won't save anything by
		 * converting it to lossy, so skip it.
		 */
		if ((page->blockno % PAGES_PER_CHUNK) == 0)
			continue;

		/* This does the dirty work ... */
		tbm_mark_page_lossy(tbm, page->blockno);

		if (tbm->nentries <= tbm->maxentries / 2)
		{
			/* we have done enough */
			break;
		}
	}

tbm_mark_page_lossy() in turn deletes the individual page entry, and
adds one for the chunk (spanning several pages).

The relevant part is that the loop stops when enough page ranges have
been lossified.  This leads to some problems:

Because iteration (both in my implementation and dynahash) is
essentially in bucket order, the front of the hashtable will be mostly
empty, whereas later parts of the hashtable will be full (i.e. a lot of
collisions). The reason for that is that we'll find all parts of the
hashtable that are uncompressed and "early" in the hashspace, and
they'll possibly hash to something later in the table.  As the number of
entries in the hashtable doesn't increase, we'll never grow the
hashtable to decrease the number of collisions.  I've seen that cause a
quite noticeable number of conflicts (which of course are worse in an
open-addressing table than in a separately chained one).

Secondly it'll lead to pretty random parts of the bitmap being
compressed. For performance it'd be good to compress "heavily used"
areas of the bitmap, instead of just whatever happens to be early in the
hash space.


I'm not entirely sure how to resolve that best, but it does seem worth
addressing.  One thing that might be nice is to record the last N
insertions once the table is 90% full or so, and then lossify in a more
targeted manner using those.  E.g. lossify block ranges (currently 256
blocks long) that contain a lot of individual entries or such.
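A rough sketch of that targeting idea (all names invented): remember recent block insertions in a small ring buffer, and when it's time to lossify, count how many of them fall into each PAGES_PER_CHUNK-sized range, so the densest -- "heavily used" -- ranges are compressed first instead of whatever happens to be early in the hash space.

```c
#define PAGES_PER_CHUNK 256
#define RECENT_SLOTS    64

typedef struct
{
    int recent[RECENT_SLOTS];   /* block numbers of recent insertions */
    int next;                   /* ring-buffer write position */
    int count;                  /* number of valid entries */
} RecentBlocks;

/* Record a block insertion, overwriting the oldest entry when full. */
void
remember_block(RecentBlocks *rb, int blockno)
{
    rb->recent[rb->next] = blockno;
    rb->next = (rb->next + 1) % RECENT_SLOTS;
    if (rb->count < RECENT_SLOTS)
        rb->count++;
}

/* Return the chunk (blockno / PAGES_PER_CHUNK) with the most recent hits,
 * or -1 if nothing has been recorded.  O(N^2) over at most 64 entries. */
int
densest_chunk(const RecentBlocks *rb)
{
    int best = -1, best_hits = 0;

    for (int i = 0; i < rb->count; i++)
    {
        int chunk = rb->recent[i] / PAGES_PER_CHUNK;
        int hits = 0;

        for (int j = 0; j < rb->count; j++)
            if (rb->recent[j] / PAGES_PER_CHUNK == chunk)
                hits++;
        if (hits > best_hits)
        {
            best_hits = hits;
            best = chunk;
        }
    }
    return best;
}
```

The real thing would have to cope with chunks that are already lossy, but the selection principle is the same: compress where the insertions are.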

Greetings,

Andres Freund

[1] 
http://archives.postgresql.org/message-id/20160727004333.r3e2k2y6fvk2ntup%40alap3.anarazel.de




Re: [HACKERS] PG 9.6.0 release schedule

2016-09-23 Thread Joe Conway
On 09/23/2016 01:31 PM, Tom Lane wrote:
> Apologies for the late notice on this, but the release team has concluded
> that 9.6 is about as ready as it's going to get.  We are planning to go
> ahead with 9.6.0 release next week --- that is, wrap Monday 26th for
> public announcement on Thursday 29th.
> 
> Thanks to the Release Management Team (Robert, Alvaro, and Noah) for
> driving 9.6 towards an on-time release!

+100


-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





[HACKERS] PG 9.6.0 release schedule

2016-09-23 Thread Tom Lane
Apologies for the late notice on this, but the release team has concluded
that 9.6 is about as ready as it's going to get.  We are planning to go
ahead with 9.6.0 release next week --- that is, wrap Monday 26th for
public announcement on Thursday 29th.

Thanks to the Release Management Team (Robert, Alvaro, and Noah) for
driving 9.6 towards an on-time release!

regards, tom lane




Re: [HACKERS] patch: function xmltable

2016-09-23 Thread Pavel Stehule
Hi

2016-09-23 10:07 GMT+02:00 Craig Ringer :

> > Did some docs copy-editing and integrated some examples.
>
> Whoops, forgot to attach.
>
> Rather than sending a whole new copy of the patch, here's a diff
> against your patched tree of my changes so you can see what I've done
> and apply the parts you want.
>
> Note that I didn't updated the expected files.
>

I applied your patch - there was a small misunderstanding. The PATH is
already evaluated only once per input row. That is not obvious from the
code, because the executor node is started and kept running for all rows. I
changed this in your part of the doc.

   to a simple value before calling the function. PATH
+  expressions are normally evaluated exactly once per input
+  row,

Regards

Pavel


>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] Hash Indexes

2016-09-23 Thread Andres Freund
On 2016-09-23 15:19:14 -0400, Robert Haas wrote:
> On Wed, Sep 21, 2016 at 10:33 PM, Andres Freund  wrote:
> > On 2016-09-21 22:23:27 -0400, Tom Lane wrote:
> >> Andres Freund  writes:
> >> > Sure. But that can be addressed, with a lot less effort than fixing and
> >> > maintaining the hash indexes, by adding the ability to do that
> >> > transparently using btree indexes + a recheck internally.  How that
> >> > compares efficiency-wise is unclear as of now. But I do think it's
> >> > something we should measure before committing the new code.
> >>
> >> TBH, I think we should reject that argument out of hand.  If someone
> >> wants to spend time developing a hash-wrapper-around-btree AM, they're
> >> welcome to do so.  But to kick the hash AM as such to the curb is to say
> >> "sorry, there will never be O(1) index lookups in Postgres".
> >
> > Note that I'm explicitly *not* saying that. I just would like to see
> > actual comparisons being made before investing significant amounts of
> > code and related effort being invested in fixing the current hash table
> > implementation. And I haven't seen a lot of that.  If the result of that
> > comparison is that hash-indexes actually perform very well: Great!
>
> Yeah, I just don't agree with that.  I don't think we have any policy
> that you can't develop A and get it committed unless you try every
> alternative that some other community member thinks might be better in
> the long run first.

Huh. I think we make such arguments *ALL THE TIME*.

Anyway, I don't see much point in continuing to discuss this, I'm
clearly in the minority.




Re: [HACKERS] store narrow values in hash indexes?

2016-09-23 Thread Tom Lane
Robert Haas  writes:
> Another thought is that hash codes are 32 bits, but a Datum is 64 bits
> wide on most current platforms.  So we're wasting 4 bytes per index
> tuple storing nothing.

Datum is not a concept that exists on-disk.  What's stored is the 32-bit
hash value.  You're right that we waste space if the platform's MAXALIGN
is 8, but that's the fault of the alignment requirement not the index
definition.

> If we generated 64-bit hash codes we could
> store as many bits of it as a Datum will hold and reduce hash
> collisions.

I think there is considerable merit in trying to move to 64-bit hash
codes (at least for data types that are more than 4 bytes to begin with),
but that's largely in the hope of reducing hash collisions in very large
indexes, not because we'd avoid wasting alignment pad space.  If we do
support that, I think we would do it regardless of the platform MAXALIGN.
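The alignment-pad point can be shown concretely.  The macro below mirrors PostgreSQL's MAXALIGN-style rounding (assuming MAXIMUM_ALIGNOF is 8, as on most 64-bit platforms): a 4-byte hash code occupies an 8-byte-aligned slot, wasting 4 bytes per index tuple.

```c
#include <stdint.h>

/* Round LEN up to the next multiple of the maximum alignment.  This is
 * the usual (len + align - 1) & ~(align - 1) trick, assuming the
 * alignment is a power of two. */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))
```

So MAXALIGN(4) is 8: the 32-bit hash code pays for 8 bytes of space, which is exactly the headroom a 64-bit hash code (or other payload) could occupy for free.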

regards, tom lane




Re: [HACKERS] 9.6 TAP tests and extensions

2016-09-23 Thread Tom Lane
Craig Ringer  writes:
> On 23 September 2016 at 00:32, Tom Lane  wrote:
>> Certainly there are restrictions, but I'd imagine that every new release
>> will be adding features to the TAP test infrastructure for some time to
>> come.  I think it's silly to claim that 9.6 is the first branch where
>> TAP testing is usable at all.

> Also, despite what I said upthread, there's no need for normal
> PGXS-using extensions to define their $(prove_installcheck)
> replacement. It works as-is. The problems I was having stemmed from
> the fact that I was working with a pretty nonstandard Makefile without
> realising that the changes were going to affect prove. $(prove_check)
> isn't useful from extensions of course since it expects a temp install
> and PGXS doesn't know how to make one, but $(prove_installcheck) does
> the job fine.

> It's thus sufficient to apply the patch to install the perl modules to
> 9.4, 9.5 and 9.6. Nothing else is needed. I've attached backports for
> 9.4 and 9.5.

Pushed with cosmetic adjustments --- mostly, I followed the project
standard that makefiles are named Makefile.  (We use a GNUmakefile
only in directories where it seems likely that non-developers would
invoke make by hand, and there's supposed to be a Makefile beside them
that throws an error whining about how you're not using GNU make.
I see no need for that in src/test/perl.)

Looking back over the thread, I see that you also proposed installing
isolationtester and pg_isolation_regress for the benefit of extensions.
I'm very much less excited about that idea.  It'd be substantially more
dead weight in typical installations, and I'm not sure that it'd be useful
to common extensions, and I'm not eager to treat isolationtester's API
and behavior as something we need to hold stable for extension use.

regards, tom lane




[HACKERS] store narrow values in hash indexes?

2016-09-23 Thread Robert Haas
Currently, hash indexes always store the hash code in the index, but
not the actual Datum.  It's recently been noted that this can make a
hash index smaller than the corresponding btree index would be if the
column is wide.  However, if the index is being built on a fixed-width
column with a typlen <= sizeof(Datum), we could store the original
value in the hash index rather than the hash code without using any
more space.  That would complicate the code, but I bet it would be
faster: we wouldn't need to set xs_recheck, we could rule out hash
collisions without visiting the heap, and we could support index-only
scans in such cases.

Another thought is that hash codes are 32 bits, but a Datum is 64 bits
wide on most current platforms.  So we're wasting 4 bytes per index
tuple storing nothing.  If we generated 64-bit hash codes we could
store as many bits of it as a Datum will hold and reduce hash
collisions.  Alternatively, we could try to stick some other useful
information in those bytes, like an abbreviated abbreviated key.

Not sure if these are good ideas.  They're just ideas.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-09-23 Thread Jim Nasby

On 9/23/16 2:42 AM, Heikki Linnakangas wrote:

How do we handle single-dimensional arrays of composite types at the
moment? At a quick glance, it seems that the composite types are just
treated like strings, when they're in an array. That's probably OK, but
it means that there's nothing special about composite types in
multi-dimensional arrays. In any case, we should mention that in the docs.


That is how they're handled, but I'd really like to change that. I've 
held off because I don't know how to handle the backwards 
incompatibility that would introduce. (I've been wondering if we might 
add a facility to allow specifying default TRANSFORMs that should be 
used for specific data types in specific languages.)


The converse case (a composite with arrays) suffers the same problem 
(array is just treated as a string).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Hash Indexes

2016-09-23 Thread Robert Haas
On Wed, Sep 21, 2016 at 10:33 PM, Andres Freund  wrote:
> On 2016-09-21 22:23:27 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>> > Sure. But that can be addressed, with a lot less effort than fixing and
>> > maintaining the hash indexes, by adding the ability to do that
>> > transparently using btree indexes + a recheck internally.  How that
>> > compares efficiency-wise is unclear as of now. But I do think it's
>> > something we should measure before committing the new code.
>>
>> TBH, I think we should reject that argument out of hand.  If someone
>> wants to spend time developing a hash-wrapper-around-btree AM, they're
>> welcome to do so.  But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".
>
> Note that I'm explicitly *not* saying that. I just would like to see
> actual comparisons being made before investing significant amounts of
> code and related effort being invested in fixing the current hash table
> implementation. And I haven't seen a lot of that.  If the result of that
> comparison is that hash-indexes actually perform very well: Great!

Yeah, I just don't agree with that.  I don't think we have any policy
that you can't develop A and get it committed unless you try every
alternative that some other community member thinks might be better in
the long run first.  If we adopt such a policy, we'll have no
developers and no new features.  Also, in this particular case, I
think there's no evidence that the alternative you are proposing would
actually be better or less work to maintain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Rename max_parallel_degree?

2016-09-23 Thread Robert Haas
On Fri, Sep 23, 2016 at 8:54 AM, Peter Eisentraut
 wrote:
> On 9/20/16 4:07 PM, Robert Haas wrote:
>> No, I'm assuming that the classes would be built-in.  A string tag
>> seems like over-engineering to me, particularly because the postmaster
>> needs to switch on the tag, and we need to be very careful about the
>> degree to which the postmaster trusts the contents of shared memory.
>
> I'm hoping that we can come up with something that extensions can
> participate in, without the core having to know ahead of time what those
> extensions are or how they would be categorized.
>
> My vision is something like
>
> max_processes = 512  # requires restart
>
> process_allowances = 'connection:300 superuser:10 autovacuum:10
> parallel:30 replication:10 someextension:20 someotherextension:20'
> # does not require restart

I don't think it's going to be very practical to allow extensions to
participate in the mechanism because there have to be a finite number
of slots that is known at the time we create the main shared memory
segment.

Also, it's really important that we don't add lots more surface area
for the postmaster to crash and burn.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: Exclude additional directories in pg_basebackup

2016-09-23 Thread David Steele
On 9/12/16 2:50 PM, Peter Eisentraut wrote:
> The comments on excludeDirContents are somewhat imprecise.  For
> example, none of those directories are actually removed or recreated
> on startup, only the contents are.

Fixed.

> The comment for pg_replslot is incorrect.  I think you can copy
> replication slots just fine, but you usually don't want to.

Fixed.

> What is the 2 for in this code?
> 
> if (strcmp(pathbuf + 2, excludeFile[excludeIdx]) == 0)

This should be basepathlen + 1 (i.e., $PGDATA/).  Fixed.
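For illustration only, a minimal sketch of the fix being described, with hypothetical helper and path names (the real logic lives in basebackup.c): skipping len(basepath) + 1 characters compares each entry against the path relative to the data directory, instead of relying on a fixed offset such as the original `+ 2`.

```python
def is_excluded(path, basepath, exclude):
    # Skip "$PGDATA/" -- i.e. len(basepath) + 1 characters -- so the
    # comparison uses the path relative to the data directory rather
    # than a hard-coded offset.
    rel = path[len(basepath) + 1:]
    return rel in exclude

# Illustrative subset of the exclusion list, not the full set.
EXCLUDE_FILES = ["postmaster.pid", "postmaster.opts"]

print(is_excluded("/pgdata/postmaster.pid", "/pgdata", EXCLUDE_FILES))  # True
print(is_excluded("/pgdata/base/1/1234", "/pgdata", EXCLUDE_FILES))     # False
```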

> I would keep the signature of _tarWriteDir() similar to
> _tarWriteHeader().  So don't pass sizeonly in there, or do pass in
> sizeonly but do the same with _tarWriteHeader().

I'm not sure how much more similar they can be, but I do agree that the
sizeonly logic should be wrapped in _tarWriteHeader().  Done.

> The documentation in pg_basebackup.sgml and protocol.sgml should be
> updated.

Done.  I moved a few things to the protocol docs to avoid repetition.

> Add some tests.  At least test that one entry from the directory list
> and one entry from the files list is not contained in the backup
> result.

Done for all files and directories.

> Testing the symlink handling would also be good.

Done using pg_replslot for the test.

> (The
> pg_basebackup tests claim that Windows doesn't support symlinks and
> therefore skip all the symlink tests.  That seems a bit at odds with
> your code handling symlinks on Windows.)

Hopefully Michael's response down-thread answered your question.

-- 
-David
da...@pgmasters.net
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..a8daa07 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1090,6 +1090,22 @@ SELECT pg_stop_backup();

 

+The contents of the pg_dynshmem/, pg_stat_tmp/,
+pg_notify/, pg_serial/,
+pg_snapshots/, and pg_subtrans/ directories can
+be omitted from the backup as they will be initialized on postmaster
+startup. If the  is set and is
+under the database cluster directory then the contents of the directory
+specified by  can also be omitted.
+   
+
+   
+Any file or directory beginning with pgsql_tmp can be
+omitted from the backup.  These files are removed on postmaster start and
+the directories will be recreated as needed.
+   
+
+   
 The backup label
 file includes the label string you gave to pg_start_backup,
 as well as the time at which pg_start_backup was run, and
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 68b0941..d65687f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2059,17 +2059,26 @@ The commands accepted in walsender mode are:
   

 
- postmaster.pid
+ postmaster.pid and postmaster.opts
 


 
- postmaster.opts
+ postgresql.auto.conf.tmp
 


 
- various temporary files created during the operation of the PostgreSQL server
+ backup_label and tablespace_map.  If these
+ files exist they belong to an exclusive backup and are not applicable
+ to the base backup.
+
+   
+   
+
+ Various temporary files and directories created during the operation of
+ the PostgreSQL server, i.e. any file or directory beginning with
+ pgsql_tmp.
 


@@ -2082,7 +2091,11 @@ The commands accepted in walsender mode are:


 
- pg_replslot is copied as an empty directory.
+ pg_replslot, pg_dynshmem,
+ pg_stat_tmp, pg_notify,
+ pg_serial, pg_snapshots, and
+ pg_subtrans are copied as empty directories (even if they
+ are symbolic links).
 


diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 9f1eae1..984ea5b 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -610,10 +610,8 @@ PostgreSQL documentation
   
The backup will include all files in the data directory and tablespaces,
including the configuration files and any additional files placed in the
-   directory by third parties. But only regular files and directories are
-   copied.  Symbolic links (other than those used for tablespaces) and special
-   device files are skipped.  (See  for
-   the precise details.)
+   directory by third parties, with certain exceptions.  (See
+for the complete list of exceptions.)
   
 
   
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index da9b7a6..04909ef 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -30,6 +30,7 @@
 #include "replication/basebackup.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
+#include "storage/dsm_impl.h"
 #include "storage/fd.h"
 #include 

Re: [HACKERS] pg_basebackup creates a corrupt file for pg_stat_tmp and pg_replslot on a backup location

2016-09-23 Thread David Steele
On 9/23/16 2:12 PM, Peter Eisentraut wrote:
> On 4/26/16 5:02 AM, Ashutosh Sharma wrote:
>> Knowing that pg_basebackup always creates an empty directory for
>> pg_stat_tmp and pg_replslot in backup location, even I think it would be
>> better to handle these directories in such a way that pg_basebackup
>> generates an empty directory for pg_replslot and pg_stat_tmp if they are
>> symbolic link.
> 
> I just wanted to update you, I have taken this commit fest entry patch
> to review because I think it will be addressed as part of "Exclude
> additional directories in pg_basebackup", which I'm also reviewing.
> Therefore, I'm not actually planning on discussing this patch further.
> Please correct me if this assessment does not match your expectations.

Just to be clear, and as I noted in [1], I pulled the symlink logic from
this patch into the exclusion patch [2].  I think it makes sense to
only commit [2] as long as Ashutosh gets credit for his contribution.

Thanks,
-- 
-David
da...@pgmasters.net

[1]
https://www.postgresql.org/message-id/raw/ced3b05f-c1d9-c262-ce63-9744ef7e6de8%40pgmasters.net
[2] https://commitfest.postgresql.org/10/721/




Re: [HACKERS] pg_basebackup creates a corrupt file for pg_stat_tmp and pg_replslot on a backup location

2016-09-23 Thread Peter Eisentraut
On 4/26/16 5:02 AM, Ashutosh Sharma wrote:
> Knowing that pg_basebackup always creates an empty directory for
> pg_stat_tmp and pg_replslot in backup location, even I think it would be
> better to handle these directories in such a way that pg_basebackup
> generates an empty directory for pg_replslot and pg_stat_tmp if they are
> symbolic link.

I just wanted to update you, I have taken this commit fest entry patch
to review because I think it will be addressed as part of "Exclude
additional directories in pg_basebackup", which I'm also reviewing.
Therefore, I'm not actually planning on discussing this patch further.
Please correct me if this assessment does not match your expectations.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] sequences and pg_upgrade

2016-09-23 Thread Peter Eisentraut
Here is an updated patch set.  Compared to the initial set, I have
changed pg_dump's sorting priorities so that sequence data is always
after table data.  This would otherwise have introduced a problem
because sortDataAndIndexObjectsBySize() only considers consecutive
DO_TABLE_DATA entries.  Also, I have removed the separate
--sequence-data switch from pg_dump and made it implicit in
--binary-upgrade.  (So the previous patches 0002 and 0003 have been
combined, because it's no longer a separate feature.)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v2-0001-pg_dump-Separate-table-and-sequence-data-object-t.patch
Description: invalid/octet-stream


v2-0002-pg_upgrade-Upgrade-sequence-data-via-pg_dump.patch
Description: invalid/octet-stream



Re: [HACKERS] pg_upgrade vs user created range type extension

2016-09-23 Thread Andrew Dunstan



On 09/23/2016 11:55 AM, Tom Lane wrote:

Andrew Dunstan  writes:

On 09/22/2016 07:33 PM, Tom Lane wrote:

If that diagnosis is correct, we should either change pg_dump to not
dump that function, or change CREATE TYPE AS RANGE to not auto-create
the constructor functions in binary-upgrade mode.  The latter might be
more flexible in the long run.

Yeah, I think your diagnosis is correct. I'm not sure I see the point of
the flexibility given that you can't specify a constructor function for
range types (if that feature had been available I would probably have
used it in this extension).

It turns out that pg_dump already contains logic that's meant to exclude
constructor functions from getting dumped, but it's broken for binary
upgrades from pre-9.6 branches because somebody fat-fingered the AND/OR
nesting in the giant WHERE clauses in getFuncs().  Curiously, it's *not*
broken when dumping from >= 9.6, which explains why I couldn't reproduce
this in HEAD.  It looks like Stephen fixed this while adding the
pg_init_privs logic, while not realizing that he'd left it broken in the
existing cases.

The comment in front of all this is nearly as confused as the code is,
too.  Will fix.




Great, Thanks.

cheers

andrew





Re: [HACKERS] pg_upgrade vs user created range type extension

2016-09-23 Thread Tom Lane
Andrew Dunstan  writes:
> On 09/22/2016 07:33 PM, Tom Lane wrote:
>> If that diagnosis is correct, we should either change pg_dump to not
>> dump that function, or change CREATE TYPE AS RANGE to not auto-create
>> the constructor functions in binary-upgrade mode.  The latter might be
>> more flexible in the long run.

> Yeah, I think your diagnosis is correct. I'm not sure I see the point of 
> the flexibility given that you can't specify a constructor function for 
> range types (if that feature had been available I would probably have 
> used it in this extension).

It turns out that pg_dump already contains logic that's meant to exclude
constructor functions from getting dumped, but it's broken for binary
upgrades from pre-9.6 branches because somebody fat-fingered the AND/OR
nesting in the giant WHERE clauses in getFuncs().  Curiously, it's *not*
broken when dumping from >= 9.6, which explains why I couldn't reproduce
this in HEAD.  It looks like Stephen fixed this while adding the
pg_init_privs logic, while not realizing that he'd left it broken in the
existing cases.

The comment in front of all this is nearly as confused as the code is,
too.  Will fix.

regards, tom lane




Re: [HACKERS] pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

2016-09-23 Thread Michael Paquier
On Fri, Sep 23, 2016 at 10:54 AM, Peter Eisentraut
 wrote:
> This is looking pretty good.  More comments on this patch set:

Thanks for the review.

> 0001:
>
> Keep the file order alphabetical in Mkvcbuild.pm.
>
> Needs nls.mk updates for file move (in initdb and pg_basebackup
> directories).

Fixed.

> 0002:
>
> durable_rename needs close(fd) in non-error path (compare backend code).

Oops, leak.

> Changing from fsync() to fsync_name() in some cases means that
> inaccessible files are now ignored.  I guess this would only happen in
> very obscure circumstances, but it's worth considering if that is OK.

In writeTimeLineHistoryFile() that's fine, the user should have
ownership of it as it has been created by receivelog.c. Similar remark
for .
In mark_file_as_archived, the previous open() call would have just failed.

> You added this comment:
>  * XXX: This means that we might not restart if a crash occurs
> before the
>  * fsync below. We probably should create the file in a temporary path
>  * like the backend does...
> pg_receivexlog uses the .partial suffix for this reason.  Why doesn't
> pg_basebackup do that?

Because it matters for pg_receivexlog, which has retry logic and is
able to resume work on a partial segment. Thinking more about it, this
comment does not make much sense, because for pg_basebackup a crash
would just cause the backup to be incomplete, so I have removed it.

> In open_walfile, could we move the fsync calls before the fstat or
> somewhere around there so we don't have to make the same calls in two
> different branches?

We could refactor a bit the code so as follows:
if (size == 16MB)
{
walfile = f;
}
else if (size == 0)
{
//do padding and lseek
}
else
error();
fsync();
I am not sure if this gains in clarity though; perhaps I have looked
too much at the current code.
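If it helps to see the shape, here is a hedged Python sketch of that refactor (hypothetical names; WAL_SEG_SIZE, the zero-padding, and the error handling are simplifications of what open_walfile() actually does). The point is that both valid branches fall through to a single fsync().

```python
import os
import tempfile

WAL_SEG_SIZE = 16 * 1024 * 1024  # 16MB, the default WAL segment size

def open_walfile(path):
    """Sketch of the refactor discussed above: size-dependent branches
    followed by one shared fsync(), instead of fsyncing per branch."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    size = os.fstat(fd).st_size
    if size == WAL_SEG_SIZE:
        pass                                # complete segment: reuse as-is
    elif size == 0:
        os.write(fd, b"\0" * WAL_SEG_SIZE)  # pad the new segment with zeros
        os.lseek(fd, 0, os.SEEK_SET)        # rewind to the start
    else:
        os.close(fd)
        raise IOError("corrupt partial segment: %s (%d bytes)" % (path, size))
    os.fsync(fd)                            # single fsync for both valid paths
    return fd

path = os.path.join(tempfile.mkdtemp(), "000000010000000000000001")
fd = open_walfile(path)
ok_size = os.fstat(fd).st_size == WAL_SEG_SIZE  # padded to a full segment
ok_pos = os.lseek(fd, 0, os.SEEK_CUR) == 0      # positioned at offset 0
os.close(fd)
print(ok_size, ok_pos)
```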

> 0003:
>
> There was a discussion about renaming the --nosync option in initdb to
> --no-sync, but until that is done, the option in pg_basebackup should
> stay what initdb has right now.

Changed.

> There is a whitespace alignment error in usage().

Not sure I follow here.

> The -N option should be listed after -n.

In the switch()? Fixed.

> Some fsync calls are not covered by a do_sync conditional.  I see some
> in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg().

Right. I looked at the rest and did not notice any extra mistakes.
-- 
Michael
From 79a9c302bc4723d6709bf5f1dc6ca673598e8b4a Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 23 Sep 2016 23:23:05 +0900
Subject: [PATCH 1/4] Relocate fsync routines of initdb into src/common

Those are aimed at being used by other utilities, pg_basebackup mainly,
and to some degree by pg_dump and pg_receivexlog.
---
 src/bin/initdb/initdb.c | 270 ++-
 src/bin/initdb/nls.mk   |   2 +-
 src/bin/pg_basebackup/nls.mk|   2 +-
 src/common/Makefile |   2 +-
 src/common/file_utils.c | 276 
 src/include/common/file_utils.h |  22 
 src/tools/msvc/Mkvcbuild.pm |   2 +-
 7 files changed, 313 insertions(+), 263 deletions(-)
 create mode 100644 src/common/file_utils.c
 create mode 100644 src/include/common/file_utils.h

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3350e13..e52e67d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -61,6 +61,7 @@
 #endif
 
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/username.h"
 #include "mb/pg_wchar.h"
@@ -70,13 +71,6 @@
 #include "fe_utils/string_utils.h"
 
 
-/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
-#if defined(HAVE_SYNC_FILE_RANGE)
-#define PG_FLUSH_DATA_WORKS 1
-#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-#define PG_FLUSH_DATA_WORKS 1
-#endif
-
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
 
@@ -237,13 +231,6 @@ static char **filter_lines_with_token(char **lines, const char *token);
 #endif
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
-static void walkdir(const char *path,
-		void (*action) (const char *fname, bool isdir),
-		bool process_symlinks);
-#ifdef PG_FLUSH_DATA_WORKS
-static void pre_sync_fname(const char *fname, bool isdir);
-#endif
-static void fsync_fname_ext(const char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
@@ -270,7 +257,6 @@ static void load_plpgsql(FILE *cmdfd);
 static void vacuum_db(FILE *cmdfd);
 static void make_template0(FILE *cmdfd);
 static void make_postgres(FILE *cmdfd);
-static void fsync_pgdata(void);
 static void trapsig(int signum);
 static void check_ok(void);
 

Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Tomas Vondra

On 09/23/2016 02:59 PM, Pavan Deolasee wrote:



On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra wrote:

On 09/23/2016 05:10 AM, Amit Kapila wrote:

On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra wrote:

On 09/21/2016 08:04 AM, Amit Kapila wrote:



(c) Although it's not visible in the results, 4.5.5 almost
perfectly
eliminated the fluctuations in the results. For example when
3.2.80 produced
these results (10 runs with the same parameters):

12118 11610 27939 11771 18065
12152 14375 10983 13614 11077

we get this on 4.5.5

37354 37650 37371 37190 37233
38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to
3.2.80.


How long was each run?  Generally, I do a half-hour run to get
stable results.


10 x 5-minute runs for each client count. The full shell script
driving the benchmark is here: http://bit.ly/2doY6ID and in short it
looks like this:

for r in `seq 1 $runs`; do
for c in 1 8 16 32 64 128 192; do
psql -c checkpoint
pgbench -j 8 -c $c ...
done
done



I see a couple of problems with the tests:

1. You're running regular pgbench, which also updates the small
tables. At scale 300 and higher client counts, there is going to be heavy
contention on the pgbench_branches table. Why not test with pgbench
-N?


Sure, I can do a bunch of tests with pgbench -N. Good point.

But notice that I've also done the testing with Dilip's workload, and 
the results are pretty much the same.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Tomas Vondra

On 09/23/2016 03:07 PM, Amit Kapila wrote:

On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
 wrote:

On 09/23/2016 01:44 AM, Tomas Vondra wrote:


...
The 4.5 kernel clearly changed the results significantly:


...



(c) Although it's not visible in the results, 4.5.5 almost perfectly
eliminated the fluctuations in the results. For example when 3.2.80
produced these results (10 runs with the same parameters):

12118 11610 27939 11771 18065
12152 14375 10983 13614 11077

we get this on 4.5.5

37354 37650 37371 37190 37233
38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to 3.2.80.



The more I think about these random spikes in pgbench performance on 3.2.80,
the more I find them intriguing. Let me show you another example (from
Dilip's workload and group-update patch on 64 clients).

This is on 3.2.80:

  44175  34619  51944  38384  49066
  37004  47242  36296  46353  36180

and on 4.5.5 it looks like this:

  34400  35559  35436  34890  34626
  35233  35756  34876  35347  35486

So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
randomly do something right, but what is it and why doesn't it happen on the
new kernel? And how could we do it every time?
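One quick way to quantify "much more even" is the coefficient of variation of the two sets of runs above; a small Python sketch (the run numbers are copied verbatim from the tables):

```python
import statistics

# Throughput (tps) of the ten 3.2.80 and ten 4.5.5 runs quoted above.
runs_3_2_80 = [44175, 34619, 51944, 38384, 49066, 37004, 47242, 36296, 46353, 36180]
runs_4_5_5  = [34400, 35559, 35436, 34890, 34626, 35233, 35756, 34876, 35347, 35486]

def cv(xs):
    # Coefficient of variation: population stddev relative to the mean.
    return statistics.pstdev(xs) / statistics.mean(xs)

print(f"3.2.80: mean={statistics.mean(runs_3_2_80):.0f} tps, cv={cv(runs_3_2_80):.3f}")
print(f"4.5.5:  mean={statistics.mean(runs_4_5_5):.0f} tps, cv={cv(runs_4_5_5):.3f}")
```

The 3.2.80 runs come out with a cv an order of magnitude higher than the 4.5.5 runs, despite the higher mean.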



As far as I can see you are using default values of min_wal_size,
max_wal_size, checkpoint related params, have you changed default
shared_buffer settings, because that can have a bigger impact.


Huh? Where do you see me using default values? There is a settings.log
with a dump of pg_settings data, and the modified values are


checkpoint_completion_target = 0.9
checkpoint_timeout = 3600
effective_io_concurrency = 32
log_autovacuum_min_duration = 100
log_checkpoints = on
log_line_prefix = %m
log_timezone = UTC
maintenance_work_mem = 524288
max_connections = 300
max_wal_size = 8192
min_wal_size = 1024
shared_buffers = 2097152
synchronous_commit = on
work_mem = 524288

(ignoring some irrelevant stuff like locales, timezone etc.).


Using default values of the mentioned parameters can lead to checkpoints in
between your runs.


So I'm using 16GB shared buffers (so with scale 300 everything fits into 
shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint 
timeout 1h etc. So no, there are no checkpoints during the 5-minute 
runs, only those triggered explicitly before each run.



Also, I think instead of 5 mins, read-write runs should be run for 15
mins to get consistent data.


Where does the inconsistency come from? Lack of warmup? Considering how 
uniform the results from the 10 runs are (at least on 4.5.5), I claim 
this is not an issue.



For Dilip's workload where he is using only Select ... For Update, I
think it is okay, but otherwise you need to drop and re-create the
database between each run, otherwise data bloat could impact the
readings.


And why should it affect 3.2.80 and 4.5.5 differently?



I think in general, the impact should be the same for both kernels
because you are using the same parameters, but I think if you use
appropriate parameters, then you can get consistent results for
3.2.80. I have also seen variation in read-write tests, but the
variation you are showing is really a matter of concern, because it
will be difficult to rely on final data.



Both kernels use exactly the same parameters (fairly tuned, IMHO).


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Remove redundant if clause in standbydesc.c

2016-09-23 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Sep 23, 2016 at 6:54 PM, Aleksander Alekseev
>  wrote:
>>> Very simple small patch - see attachment.
>>> 
>>>  /* not expected, but print something anyway */
>>>  else if (msg->id == SHAREDINVALRELMAP_ID)
>>> -appendStringInfoString(buf, " relmap");
>>> -else if (msg->id == SHAREDINVALRELMAP_ID)
>>>  appendStringInfo(buf, " relmap db %u", msg->rm.dbId);

> Not really but my guess is that the two conditions have been left for
> this purpose.

I think it's a pretty obvious copy-and-pasteo.

regards, tom lane




Re: [HACKERS] Phrase search distance syntax

2016-09-23 Thread Tom Lane
Teodor Sigaev  writes:
>> Why does the phrase distance operator assume <1> means adjacent words,
>> and not <0>.  (FYI, <-> is the same as <1>.)

> Because
> 1 it is the result of subtracting word positions
> 2 <0> could be used as a special case, like a word with two infinitives:

This is actually documented, in 12.1.2:

A special case that's sometimes useful is that <0> can be used to
require that two patterns match the same word.

regards, tom lane




Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segm

2016-09-23 Thread Tom Lane
Amit Kapila  writes:
> Attached patch tightens the error handling.  However, on debugging, I
> found that CreateFileMapping() always set error code to 0 on success.

Ah, interesting.

> Now, it is quite possible
> that the error code is set to 0 on success in my Windows environment
> (Win7) and doesn't work in some other environment.  In any case, if we
> want to go ahead and don't want to rely on CreateFileMapping(), then
> the attached patch should suffice.

Yeah, seeing that this is not mentioned in MS' documentation I don't
think we should rely on it.  The patch looks fine to me, pushed.

regards, tom lane




Re: [HACKERS] Phrase search distance syntax

2016-09-23 Thread Bruce Momjian
On Fri, Sep 23, 2016 at 05:07:26PM +0300, Teodor Sigaev wrote:
> >Sorry to be asking another phrase search syntax question, and so close
> >to final release, but ...
> Really close...
> >
> >Why does the phrase distance operator assume <1> means adjacent words,
> >and not <0>.  (FYI, <-> is the same as <1>.)
> Because
> 1 it is the result of subtracting word positions
> 2 <0> could be used as a special case, like a word with two infinitives:
> # create text search dictionary xx (template = 'ispell',
> DictFile='ispell_sample', AffFile='ispell_sample');
> # alter text search configuration english ALTER MAPPING FOR asciiword WITH
> xx, english_stem;
> 
> # select to_tsvector('english', 'bookings');
>  to_tsvector
> --
>  'book':1 'booking':1
> 
> # select to_tsvector('english', 'bookings') @@ 'book <0> booking';
>  ?column?
> --
>  t
> (1 row)

OK, thanks.  I just found it unusual.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Phrase search distance syntax

2016-09-23 Thread Teodor Sigaev

Sorry to be asking another phrase search syntax question, and so close
to final release, but ...

Really close...


Why does the phrase distance operator assume <1> means adjacent words,
and not <0>.  (FYI, <-> is the same as <1>.)

Because
1 it is the result of subtracting word positions
2 <0> could be used as a special case, like a word with two infinitives:
# create text search dictionary xx (template = 'ispell', 
DictFile='ispell_sample', AffFile='ispell_sample');
# alter text search configuration english ALTER MAPPING FOR asciiword WITH xx, 
english_stem;


# select to_tsvector('english', 'bookings');
 to_tsvector
--
 'book':1 'booking':1

# select to_tsvector('english', 'bookings') @@ 'book <0> booking';
 ?column?
--
 t
(1 row)



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: [HACKERS] [PATCH] Remove redundant if clause in standbydesc.c

2016-09-23 Thread Michael Paquier
On Fri, Sep 23, 2016 at 6:54 PM, Aleksander Alekseev
 wrote:
>> > Very simple small patch - see attachment.
>>
>>  /* not expected, but print something anyway */
>>  else if (msg->id == SHAREDINVALRELMAP_ID)
>> -appendStringInfoString(buf, " relmap");
>> -else if (msg->id == SHAREDINVALRELMAP_ID)
>>  appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
>>
>> Looking at inval.c, dbId can be InvalidOid.
>
> Frankly I'm not very familiar with this part of the code. InvalidOid is just
> zero. Does it create some problem in this case?

Not really but my guess is that the two conditions have been left for
this purpose.
-- 
Michael




Re: [HACKERS] Tracking wait event for latches

2016-09-23 Thread Michael Paquier
On Fri, Sep 23, 2016 at 10:32 AM, Robert Haas  wrote:
> On Thu, Sep 22, 2016 at 7:10 PM, Thomas Munro
>  wrote:
>> I was thinking about suggesting a category "Replication" to cover the
>> waits for client IO relating to replication, as opposed to client IO
>> waits relating to regular user connections.  Then you could put sync
>> rep into that category instead of IPC, even though technically it is
>> waiting for IPC from walsender process(es), on the basis that it's
>> more newsworthy to a DBA that it's really waiting for a remote replica
>> to respond.  But it's probably pretty clear what's going on from the
>> wait point names, so maybe it's not worth a category.  Thoughts?
>
> I thought about a replication category but either it will only have
> SyncRep in it, which is odd, or it will pull in other things that
> otherwise fit nicely into the Activity category, and then that
> boundaries of all the categories become mushy: is the subsystem that
> causes the wait that we are trying to document, or the kind of thing
> for which we are waiting?

Using category IPC for SyncRep looks fine to me.

>> I do suspect that the set of wait points will grow quite a bit as we
>> develop more parallel stuff though.  For example, I have been working
>> on a patch that adds several more wait points, indirectly via
>> condition variables (using your patch).  Actually in my case it's
>> BarrierWait -> ConditionVariableWait -> WaitEventSetWait.  I propose
>> that these higher level wait primitives should support passing a wait
>> point identifier through to WaitEventSetWait.
>
> +1.

I also suspect that inclusion of pgstat.h will become more and more
common, so as to allow more code paths to access a given WaitClass.

After digesting all the comments given, I have produced the patch
attached that uses the following categories:
+const char *const WaitEventNames[] = {
+   /* activity */
+   "ArchiverMain",
+   "AutoVacuumMain",
+   "BgWriterHibernate",
+   "BgWriterMain",
+   "CheckpointerMain",
+   "PgStatMain",
+   "RecoveryWalAll",
+   "RecoveryWalStream",
+   "SysLoggerMain",
+   "WalReceiverMain",
+   "WalSenderMain",
+   "WalWriterMain",
+   /* client */
+   "SecureRead",
+   "SecureWrite",
+   "SSLOpenServer",
+   "WalReceiverWaitStart",
+   "WalSenderWaitForWAL",
+   "WalSenderWriteData",
+   /* Extension */
+   "Extension",
+   /* IPC */
+   "BgWorkerShutdown",
+   "BgWorkerStartup",
+   "ExecuteGather",
+   "MessageQueueInternal",
+   "MessageQueuePutMessage",
+   "MessageQueueReceive",
+   "MessageQueueSend",
+   "ParallelFinish",
+   "ProcSignal",
+   "ProcSleep",
+   "SyncRep",
+   /* timeout */
+   "BaseBackupThrottle",
+   "PgSleep",
+   "RecoveryApplyDelay",
+};
I have moved WalSenderMain as it tracks a main loop, so it was strange
to not have it in Activity. I also kept SecureRead and SecureWrite
because this is referring to the functions of the same name. For the
rest, I got convinced by what has been discussed and it makes things
clearer. My head is no longer spinning when I try to keep the list of
names and its associated enum in sync, which is good. I have rewritten
the docs to follow that as well.
-- 
Michael
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8ca1c1c..9222b73 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -17,6 +17,7 @@
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
@@ -496,7 +497,9 @@ pgfdw_get_result(PGconn *conn, const char *query)
 			wc = WaitLatchOrSocket(MyLatch,
    WL_LATCH_SET | WL_SOCKET_READABLE,
    PQsocket(conn),
-   -1L);
+   -1L,
+   WAIT_EXTENSION,
+   WE_EXTENSION);
 			ResetLatch(MyLatch);
 
 			CHECK_FOR_INTERRUPTS();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0776428..6f7eef9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -679,6 +679,44 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   buffer in question.
  
 
+
+ 
+  Activity: The server process is waiting for some
+  activity to happen on a socket.  This is mainly used by system
+  processes in their main processing loop.  wait_event will identify
+  the type of activity waited for.
+ 
+
+
+ 
+  Extension: The server process is waiting for activity
+  in a plugin or an extension.  This category is useful for plugin
+  and module developers to track custom waiting points.
+ 
+
+
+ 
+  Client: The server process is waiting for some activity
+  on a socket from a client process, and that the server expects
+  

Re: [HACKERS] asynchronous and vectorized execution

2016-09-23 Thread Robert Haas
On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar  wrote:
> For example, in the above plan which you specified, suppose:
> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and
> so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
> 2. The event wait list already has foreign scan on a, which is on a
> different subtree.
> 3. This foreign scan a happens to be ready, so in
> ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
> which returns with result_ready.
> 4. Since it returns result_ready, its parent node is now inserted in the
> callbacks array, and so its parent (Append) is executed.
> 5. But this Append planstate is already in the middle of executing Hash
> Join, and is waiting for HashJoin.

Ah, yeah, something like that could happen.  I've spent much of this
week working on a new design for this feature which I think will avoid
this problem.  It doesn't work yet - in fact I can't even really test
it yet.  But I'll post what I've got by the end of the day today so
that anyone who is interested can look at it and critique.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:50 AM, Robert Haas  wrote:
> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
>  wrote:
>> I don't dare to suggest rejecting the patch, but I don't see how we could
>> commit any of the patches at this point. So perhaps "returned with feedback"
>> and resubmitting in the next CF (along with analysis of improved workloads)
>> would be appropriate.
>
> I think it would be useful to have some kind of theoretical analysis
> of how much time we're spending waiting for various locks.  So, for
> example, suppose we one run of these tests with various client counts
> - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
> wait_event from pg_stat_activity" once per second throughout the test.
> Then we see how many times we get each wait event, including NULL (no
> wait event).  Now, from this, we can compute the approximate
> percentage of time we're spending waiting on CLogControlLock and every
> other lock, too, as well as the percentage of time we're not waiting
> for lock.  That, it seems to me, would give us a pretty clear idea
> what the maximum benefit we could hope for from reducing contention on
> any given lock might be.
>
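The sampling arithmetic described above is straightforward to sketch. The event names and counts below are invented purely for illustration; the only point is how once-per-second samples of `SELECT wait_event FROM pg_stat_activity` become approximate time percentages:

```python
from collections import Counter

def wait_event_profile(samples):
    """Turn once-per-second wait_event samples (None = not waiting)
    into the approximate percentage of time spent in each state."""
    counts = Counter(samples)
    total = len(samples)
    return {event: 100.0 * n / total for event, n in counts.items()}

# Invented samples, as if collected once per second during a run:
samples = (["ClogControlLock"] * 28 +
           ["WALWriteLock"] * 12 +
           [None] * 60)

profile = wait_event_profile(samples)
for event, pct in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(event, f"{pct:.1f}%")
```

On real data this directly bounds the possible benefit: a lock that accounts for 28% of samples can yield at most a ~28% improvement if its contention were eliminated entirely.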

As mentioned earlier, such an activity makes sense. However, rereading
this thread today, I noticed that Dilip has already posted some
analysis of lock contention upthread [1].  It is clear that the patch
has reduced LWLock contention from ~28% to ~4% (where the major
contributor was TransactionIdSetPageStatus, which has dropped from ~53%
to ~3%).  Isn't that in line with what you are looking for?


[1] - 
https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Phrase search distance syntax

2016-09-23 Thread Bruce Momjian
Sorry to be asking another phrase search syntax question, and so close
to final release, but ...

Why does the phrase distance operator assume <1> means adjacent words,
and not <0>?  (FYI, <-> is the same as <1>.)

For example:

  select to_tsvector('park a a house') @@ to_tsquery('park <3> house');

seems like it would be more logical as <2>, meaning two lexemes between
the words.
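The rule in question can be sketched with a toy matcher: `a <N> b` matches when the lexeme positions differ by exactly N, which is why <1> reads as "adjacent". This is only an illustration of the rule, not the tsquery implementation:

```python
def phrase_match(words, a, b, n):
    """True if some occurrence of b is exactly n positions after an
    occurrence of a, mirroring the 'a <n> b' rule (so <1> = adjacent)."""
    positions = {}
    for i, w in enumerate(words, start=1):
        positions.setdefault(w, []).append(i)
    return any(j - i == n
               for i in positions.get(a, [])
               for j in positions.get(b, []))

doc = "park a a house".split()
print(phrase_match(doc, "park", "house", 3))  # True: park=1, house=4
print(phrase_match(doc, "park", "house", 2))  # False
```

With 'park a a house', the positions are park=1 and house=4, so <3> matches even though only two lexemes sit between them; that is exactly the asymmetry being questioned.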

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:29 PM, Pavan Deolasee
 wrote:
> On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra 
> wrote:
>>
>> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>>>
>>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>>>  wrote:

 On 09/21/2016 08:04 AM, Amit Kapila wrote:
>
>

 (c) Although it's not visible in the results, 4.5.5 almost perfectly
 eliminated the fluctuations in the results. For example when 3.2.80
 produced
 this results (10 runs with the same parameters):

 12118 11610 27939 11771 18065
 12152 14375 10983 13614 11077

 we get this on 4.5.5

 37354 37650 37371 37190 37233
 38498 37166 36862 37928 38509

 Notice how much more even the 4.5.5 results are, compared to 3.2.80.

>>>
>>> how long each run was?  Generally, I do half-hour run to get stable
>>> results.
>>>
>>
>> 10 x 5-minute runs for each client count. The full shell script driving
>> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
>> this:
>>
>> for r in `seq 1 $runs`; do
>> for c in 1 8 16 32 64 128 192; do
>> psql -c checkpoint
>> pgbench -j 8 -c $c ...
>> done
>> done
>
>
>
> I see couple of problems with the tests:
>
> 1. You're running regular pgbench, which also updates the small tables. At
> scale 300 and higher clients, there is going to heavy contention on the
> pgbench_branches table. Why not test with pgbench -N? As far as this patch
> is concerned, we are only interested in seeing contention on
> ClogControlLock. In fact, how about a test which only consumes an XID, but
> does not do any write activity at all? Complete artificial workload, but
> good enough to tell us if and how much the patch helps in the best case. We
> can probably do that with a simple txid_current() call, right?
>

Right, that is why in the initial tests done by Dilip, he used
Select .. for Update.  I think using txid_current will generate a lot of
contention on XidGenLock, which will mask the contention around
CLOGControlLock; in fact, we have tried that.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pg_ctl promote wait

2016-09-23 Thread Peter Eisentraut
On 9/22/16 11:16 AM, Masahiko Sawada wrote:
> Commit e7010ce seems to forget to change help message of pg_ctl.
> Attached patch.

Fixed, thanks.  I also added the -t option and reformatted a bit.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
 wrote:
> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>
>> ...
>> The 4.5 kernel clearly changed the results significantly:
>>
> ...
>>
>>
>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>> eliminated the fluctuations in the results. For example when 3.2.80
>> produced this results (10 runs with the same parameters):
>>
>> 12118 11610 27939 11771 18065
>> 12152 14375 10983 13614 11077
>>
>> we get this on 4.5.5
>>
>> 37354 37650 37371 37190 37233
>> 38498 37166 36862 37928 38509
>>
>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>
>
> The more I think about these random spikes in pgbench performance on 3.2.80,
> the more I find them intriguing. Let me show you another example (from
> Dilip's workload and group-update patch on 64 clients).
>
> This is on 3.2.80:
>
>   44175  34619  51944  38384  49066
>   37004  47242  36296  46353  36180
>
> and on 4.5.5 it looks like this:
>
>   34400  35559  35436  34890  34626
>   35233  35756  34876  35347  35486
>
> So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
> randomly do something right, but what is it and why doesn't it happen on the
> new kernel? And how could we do it every time?
>

As far as I can see, you are using default values of min_wal_size,
max_wal_size, and the checkpoint-related params; have you changed the
default shared_buffers setting?  That can have a bigger impact.  Using
default values of the mentioned parameters can lead to checkpoints in
between your runs.  Also, I think that instead of 5 minutes, read-write
runs should be run for 15 minutes to get consistent data.  For Dilip's
workload, where he is using only Select ... For Update, I think it is
okay, but otherwise you need to drop and re-create the database between
each run; otherwise data bloat could impact the readings.

I think in general the impact should be the same for both kernels
because you are using the same parameters, but if you use appropriate
parameters, then you can get consistent results for 3.2.80.  I have
also seen variation in read-write tests, but the variation you are
showing is really a matter of concern, because it will be difficult to
rely on the final data.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Pavan Deolasee
On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra 
wrote:

> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>
>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>>  wrote:
>>
>>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>


>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>> eliminated the fluctuations in the results. For example when 3.2.80
>>> produced
>>> this results (10 runs with the same parameters):
>>>
>>> 12118 11610 27939 11771 18065
>>> 12152 14375 10983 13614 11077
>>>
>>> we get this on 4.5.5
>>>
>>> 37354 37650 37371 37190 37233
>>> 38498 37166 36862 37928 38509
>>>
>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>
>>>
>> how long each run was?  Generally, I do half-hour run to get stable
>> results.
>>
>>
> 10 x 5-minute runs for each client count. The full shell script driving
> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
> this:
>
> for r in `seq 1 $runs`; do
> for c in 1 8 16 32 64 128 192; do
> psql -c checkpoint
> pgbench -j 8 -c $c ...
> done
> done



I see a couple of problems with the tests:

1. You're running regular pgbench, which also updates the small tables. At
scale 300 and higher client counts, there is going to be heavy contention
on the pgbench_branches table. Why not test with pgbench -N? As far as
this patch is concerned, we are only interested in seeing contention on
ClogControlLock. In fact, how about a test which only consumes an XID, but
does not do any write activity at all? A completely artificial workload,
but good enough to tell us if and how much the patch helps in the best
case. We can probably do that with a simple txid_current() call, right?

2. Each subsequent pgbench run will bloat the tables. Now that may not be
such a big deal given that you're checkpointing between each run. But it
still makes results somewhat hard to compare. If a vacuum kicks in, that
may have some impact too. Given the scale factor you're testing, why not
just start fresh every time?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Rename max_parallel_degree?

2016-09-23 Thread Peter Eisentraut
On 9/20/16 4:07 PM, Robert Haas wrote:
> No, I'm assuming that the classes would be built-in.  A string tag
> seems like over-engineering to me, particularly because the postmaster
> needs to switch on the tag, and we need to be very careful about the
> degree to which the postmaster trusts the contents of shared memory.

I'm hoping that we can come up with something that extensions can
participate in, without the core having to know ahead of time what those
extensions are or how they would be categorized.

My vision is something like

max_processes = 512  # requires restart

process_allowances = 'connection:300 superuser:10 autovacuum:10
parallel:30 replication:10 someextension:20 someotherextension:20'
# does not require restart

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pageinspect: Hash index support

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:11 PM, Peter Eisentraut
 wrote:
> On 9/23/16 1:56 AM, Amit Kapila wrote:
>> which comment are you referring here?  hashm_mapp contains block
>> numbers of bitmap pages.
>
> The comment I'm referring to says
>
> The blknos of these bitmap pages are kept in bitmaps[]; nmaps is the
> number of currently existing bitmaps.
>
> But there is no "bitmaps" field anywhere.
>

Okay.  You are right, it should be hashm_mapp.

>> In the above code, it appears that you are trying to calculate
>> max_avail space for all pages in same way.  Don't you need to
>> calculate it differently for bitmap page or meta page as they don't
>> share the same format as other type of pages?
>
> Is this even useful for hash indexes?
>

I think so.  It will be helpful for bucket and overflow pages.  They
store the index tuples similarly to btree.  Is there a reason why you
think it won't be useful?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.85140

2016-09-23 Thread Tom Lane
Amit Kapila  writes:
> On Fri, Sep 23, 2016 at 1:21 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> So, we could have dsm_postmaster_startup() seed the random number
>>> generator itself, and then let PostmasterRandom() override the seed
>>> later.  Like maybe:
>> 
>> Works for me, at least as a temporary solution.

> Isn't it better if we use the same technique in dsm_create() as well
> which uses random() for handle?

dsm_create() is executed in backends, not the postmaster, and they
already have their own random seeds (cf BackendRun).  Adding more
srandom calls at random places will *not* make things better.

However, it's certainly true that dsm_postmaster_startup() might not
be the only place in postmaster startup that wants to use random().
What seems like the best thing after sleeping on it is to put
"srandom(time(NULL))" somewhere early in PostmasterMain, so that one
such call suffices for all uses during postmaster startup.
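The underlying hazard is easy to sketch: generators that start from the same state produce the same "random" values, so code that never reseeds after inheriting state will pick colliding identifiers. This is a toy Python illustration of the principle, not the postmaster's actual code:

```python
import random

def dsm_handle(rng):
    # Stand-in for deriving a shared-memory segment name from random().
    return rng.randrange(1, 2**31)

# Two generators that start from the same seed -- like code paths that
# never reseed after inheriting RNG state -- pick exactly the same
# "random" handle, which is the collision being avoided:
a = random.Random(42)
b = random.Random(42)
collision = dsm_handle(a) == dsm_handle(b)
print(collision)  # True: identical state yields identical handles
```

Seeding once, early (the `srandom(time(NULL))` early in PostmasterMain suggested above), gives every later random() consumer in startup a single, properly initialized stream.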

regards, tom lane




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Tomas Vondra

On 09/23/2016 01:44 AM, Tomas Vondra wrote:

...
The 4.5 kernel clearly changed the results significantly:


...
>

(c) Although it's not visible in the results, 4.5.5 almost perfectly
eliminated the fluctuations in the results. For example, when 3.2.80
produced these results (10 runs with the same parameters):

12118 11610 27939 11771 18065
12152 14375 10983 13614 11077

we get this on 4.5.5

37354 37650 37371 37190 37233
38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to 3.2.80.



The more I think about these random spikes in pgbench performance on 
3.2.80, the more I find them intriguing. Let me show you another example 
(from Dilip's workload and group-update patch on 64 clients).


This is on 3.2.80:

  44175  34619  51944  38384  49066
  37004  47242  36296  46353  36180

and on 4.5.5 it looks like this:

  34400  35559  35436  34890  34626
  35233  35756  34876  35347  35486

So the 4.5.5 results are much more even, but overall clearly below 
3.2.80. How does 3.2.80 manage to do ~50k tps in some of the runs? 
Clearly we randomly do something right, but what is it and why doesn't 
it happen on the new kernel? And how could we do it every time?
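The unevenness can be quantified with a coefficient of variation over the numbers quoted above; on these samples the 3.2.80 runs come out roughly an order of magnitude noisier. A throwaway sketch, not part of the benchmark itself:

```python
from statistics import mean, pstdev

tps_3_2_80 = [44175, 34619, 51944, 38384, 49066,
              37004, 47242, 36296, 46353, 36180]
tps_4_5_5  = [34400, 35559, 35436, 34890, 34626,
              35233, 35756, 34876, 35347, 35486]

def cv(samples):
    # Coefficient of variation: stddev as a fraction of the mean.
    return pstdev(samples) / mean(samples)

print(f"3.2.80: mean={mean(tps_3_2_80):.0f} cv={cv(tps_3_2_80):.3f}")
print(f"4.5.5:  mean={mean(tps_4_5_5):.0f} cv={cv(tps_4_5_5):.3f}")
```

This makes the trade-off explicit: the old kernel has the higher mean but a far higher coefficient of variation, so individual runs are much less trustworthy.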


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] asynchronous and vectorized execution

2016-09-23 Thread Amit Khandekar
On 13 September 2016 at 20:20, Robert Haas  wrote:

> On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
>  wrote:
> > [ new patches ]
>
> +/*
> + * We assume that few nodes are async-aware and async-unaware
> + * nodes cannot be revserse-dispatched from lower nodes that
> is
> + * async-aware. Firing of an async node that is not a
> descendant
> + * of the planstate will cause such reverse-diaptching to
> + * async-aware nodes, which is unexpected behavior for them.
> + *
> + * For instance, consider an async-unaware Hashjoin(OUTER,
> INNER)
> + * where the OUTER is running asynchronously but the Hashjoin
> is
> + * waiting on the async INNER during inner-hash creation. If
> the
> + * OUTER fires for the case, since anyone is waiting on it,
> + * ExecAsyncWaitForNode finally dispatches to the Hashjoin
> which
> + * is now in the middle of thing its work.
> + */
> +if (!IsParent(planstate, node))
> +continue;
>
> I'm not entirely sure that I understand this comment, but I don't
> think it's going in the right direction.  Let's start with the example
> in the second paragraph. If the hash join is async-unaware, then it
> isn't possible for the hash join to be both running the outer side of
> the join asynchronously and at the same time waiting on the inner
> side.  Once it tries to pull the first tuple from the outer side, it's
> waiting for that to finish and can't do anything else.  So, the inner
> side can't possibly get touched in any way until the outer side
> finishes.  For anything else to happen, the hash join would have to be
> async-aware.  Even if we did that, I don't think it would be right to
> kick off both sides of the hash join at the same time.  Right now, if
> the outer side turns out to be empty, we never need to build the hash
> table, and that's good.
>

I feel the !IsParent() condition is actually there to prevent the
infinite wait caused by a re-entrant issue in ExecAsyncWaitForNode()
that Kyotaro mentioned earlier. But yes, the comments don't explain
exactly how the hash join can cause the re-entrant issue.

But I attempted to come up with some testcase which might reproduce the
infinite wait in ExecAsyncWaitForNode() after removing the !IsParent()
check, so that the other subtree nodes are also included, and I couldn't
reproduce it. Kyotaro, is it possible for you to give a testcase that
consistently hangs if we revert the !IsParent() check?

I was also thinking about another possibility where the same plan state
node is re-entered, as explained below.

>
> I don't think it's a good idea to wait for only nodes that are in the
> current subtree.  For example, consider a plan like this:
>
> Append
> -> Foreign Scan on a
> -> Hash Join
>   -> Foreign Scan on b
>   -> Hash
> -> Seq Scan on x
>
> Suppose Append and Foreign Scan are parallel-aware but the other nodes
> are not.  Append kicks off the Foreign Scan on a and then waits for
> the hash join to produce a tuple; the hash join kicks off the Foreign
> Scan on b and waits for it to return a tuple.  If, while we're waiting
> for the foreign scan on b, the foreign scan on a needs some attention
> - either to produce tuples, or maybe just to call PQconsumeInput() so
> that more data can be sent from the other side, I think we need to be
> able to do that.  There's no real problem here; even if the Append
> becomes result-ready before the hash join returns, that is fine.


Yes, I agree: we should be able to do this. Since we have all the
waiting events in a common estate, there's no harm in starting to
execute nodes of another sub-tree if we get an event from there.

But I am thinking about what would happen when this node from the other
sub-tree returns result_ready, and then its parents are called, and then
the result gets bubbled up to the node which had already caused us to
call ExecAsyncWaitForNode() in the first place.

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan b, and
so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a, which is on a
different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted in the
callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of executing Hash
Join, and is waiting for HashJoin.

Is it safe to execute the same plan state when it is already inside its
execution? In other words, is the plan state re-entrant? I suspect the
new execution may even corrupt the structures with which it was 

Re: [HACKERS] pageinspect: Hash index support

2016-09-23 Thread Peter Eisentraut
On 9/23/16 1:56 AM, Amit Kapila wrote:
> On Fri, Sep 23, 2016 at 9:40 AM, Peter Eisentraut
>> - hash_metap result fields spares and mapp should be arrays of integer.
>>
> 
> how would you like to see both those arrays in tuple, right now, I
> think Jesper's code is showing it as string.

I'm not sure what you are asking here.

>> (Incidentally, the comment in hash.h talks about bitmaps[] but I think
>> it means hashm_mapp[].)
>>
> 
> which comment are you referring here?  hashm_mapp contains block
> numbers of bitmap pages.

The comment I'm referring to says

The blknos of these bitmap pages are kept in bitmaps[]; nmaps is the
number of currently existing bitmaps.

But there is no "bitmaps" field anywhere.

> In the above code, it appears that you are trying to calculate
> max_avail space for all pages in same way.  Don't you need to
> calculate it differently for bitmap page or meta page as they don't
> share the same format as other type of pages?

Is this even useful for hash indexes?  This idea came from btrees.
Neither the gin nor the brin code has this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Tomas Vondra

On 09/23/2016 05:10 AM, Amit Kapila wrote:

On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
 wrote:

On 09/21/2016 08:04 AM, Amit Kapila wrote:




(c) Although it's not visible in the results, 4.5.5 almost perfectly
eliminated the fluctuations in the results. For example, when 3.2.80 produced
these results (10 runs with the same parameters):

12118 11610 27939 11771 18065
12152 14375 10983 13614 11077

we get this on 4.5.5

37354 37650 37371 37190 37233
38498 37166 36862 37928 38509

Notice how much more even the 4.5.5 results are, compared to 3.2.80.



How long was each run?  Generally, I do a half-hour run to get stable results.



10 x 5-minute runs for each client count. The full shell script driving 
the benchmark is here: http://bit.ly/2doY6ID and in short it looks like 
this:


for r in `seq 1 $runs`; do
for c in 1 8 16 32 64 128 192; do
psql -c checkpoint
pgbench -j 8 -c $c ...
done
done



It of course begs the question what kernel version is running on
the machine used by Dilip (i.e. cthulhu)? Although it's a Power
machine, so I'm not sure how much the kernel matters on it.



cthulhu is an x86 machine and the kernel version is 3.10. Seeing the
above results, I think the kernel version does matter, but does that
mean we should ignore the benefits we are seeing on a somewhat older
kernel version? I think the right answer here is to do some experiments
which can show the actual contention, as suggested by Robert and you.



Yes, I think it'd be useful to test a new kernel version. Perhaps try 
4.5.x so that we can compare it to my results. Maybe even try using my 
shell script




There are results for 64, 128 and 192 clients. Why should we care
about numbers in between? How likely (and useful) would it be to
get improvement with 96 clients, but no improvement for 64 or 128
clients?

>>


The only point was to see where we started seeing improvement: saying
that the TPS has improved from a client count >= 72 is different from
saying that it has improved from >= 128.



I find the exact client count rather uninteresting - it's going to be 
quite dependent on hardware, workload etc.


>>

I don't dare to suggest rejecting the patch, but I don't see how
we could commit any of the patches at this point. So perhaps
"returned with feedback" and resubmitting in the next CF (along
with analysis of improved workloads) would be appropriate.



Agreed with your conclusion, and changed the status of the patch in the
CF accordingly.



+1


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-23 Thread Amit Kapila
On Thu, Sep 22, 2016 at 10:16 AM, Amit Kapila  wrote:
> On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes  wrote:
>>
>>
>> Correct.  But any torn page write must be covered by the restoration of a
>> full page image during replay, shouldn't it?  And that restoration should
>> happen blindly, without first reading in the old page and verifying the
>> checksum.
>>
>
> Probably, but I think this is not currently the way it is handled and
> I don't want to change it.  AFAIU, what happens now is that we first
> read the old page (and we do page verification while reading old page
> and display *warning* if checksum doesn't match) and then restore the
> image.  The relevant code path is
> XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().
>
> Now, here the point is that why instead of WARNING we are seeing FATAL
> for the bitmap page of hash index.  The reason as explained in my
> previous e-mail is that for bitmap page we are not logging full page
> image and we should fix that as explained there.  Once the full page
> image is logged, then first time it reads the torn page, it will use
> flag RBM_ZERO_AND_LOCK which will make the FATAL error you are seeing
> to WARNING.
>

I think here I was slightly wrong.  For full page writes, it does use
RBM_ZERO_AND_LOCK mode to read the page, and for that mode we do not do
the page verification check but rather blindly set the page to zero and
then overwrite it with the full page image.  So after my fix, you will
not see the checksum failure error.  I have a fix ready, but am still
doing some more verification.  If everything passes, I will share the
patch in a day or so.
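The torn-page scenario can be sketched with a toy page format (invented here; PostgreSQL's real 8 kB pages and checksum algorithm differ): reading a torn page trips the checksum, but restoring a full page image blindly never reads the torn contents at all, so no verification failure can be raised.

```python
PAGE = 64  # toy page size, far smaller than PostgreSQL's 8 kB

def checksum(body: bytes) -> int:
    # Toy checksum (a byte sum), standing in for the real page checksum.
    return sum(body) % 65536

def make_page(payload: bytes) -> bytes:
    body = payload.ljust(PAGE - 2, b"\x00")
    return checksum(body).to_bytes(2, "big") + body

def verify(page: bytes) -> bool:
    return int.from_bytes(page[:2], "big") == checksum(page[2:])

old = make_page(b"v1" * 20)
new = make_page(b"v2" * 20)

# A torn write: the first half of the new page hit disk, the second
# half still holds the old contents.  Reading it fails verification.
torn = new[:32] + old[32:]
assert not verify(torn)

# Replay restores the full page image blindly (RBM_ZERO_AND_LOCK): the
# buffer is zeroed and overwritten without verifying the torn version.
restored = new
assert verify(restored)
print("torn page repaired by full-page image")
```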

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] MSVC pl-perl error message is not verbose enough

2016-09-23 Thread Heikki Linnakangas

On 08/02/2016 08:18 PM, John Harvey wrote:

On Mon, Aug 1, 2016 at 9:39 PM, Michael Paquier 
wrote:


On Tue, Aug 2, 2016 at 2:08 AM, Robert Haas  wrote:

Did you add this to the next CommitFest?


I have added it here:
https://commitfest.postgresql.org/10/691/
John, it would be good if you could get a community account and add
your name to this patch as its author. I could not find you.


Done.  If there's anything else that I should do, please let me know.  I'd
be glad to help out.


Committed, thanks!

- Heikki





[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment

2016-09-23 Thread Amit Kapila
On Thu, Sep 22, 2016 at 10:40 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Tue, Sep 20, 2016 at 10:33 PM, Tom Lane  wrote:
>>> ISTM both the previous coding and this version can fail for no good
>>> reason, that is what if GetLastError() happens to return one of these
>>> error codes as a result of some unrelated failure from before this
>>> subroutine is entered?  That is, wouldn't it be a good idea to
>>> do SetLastError(0) before calling CreateFileMapping?
>
>> Yes, that seems like a good idea.  Do you need a patch with some
>> testing on windows environment?
>
> Please; I can't test it.
>

Attached patch tightens the error handling.  However, on debugging, I
found that CreateFileMapping() always sets the error code to 0 on
success.  Basically, before calling CreateFileMapping(), I set the
error code to 10 (SetLastError(10)), and after CreateFileMapping() it
sets the error code to 0 on success and an appropriate error code on
failure.  I also verified that the error code was still 10 by calling
GetLastError() just before CreateFileMapping().  Now, it is quite
possible that the error code is set to 0 on success in my Windows
environment (Win7) but not in some other environment.  In any case, if
we want to go ahead and don't want to rely on CreateFileMapping(), the
attached patch should suffice.
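For illustration, here is a toy Python model of the pattern under
discussion (the Win32 thread-local error slot is emulated with a
module-level variable; the function names mirror SetLastError(),
GetLastError(), and CreateFileMapping(), but this is not Windows code).
It shows why clearing the error code immediately before the call matters
when a successful call is not guaranteed to reset a stale code:

```python
# Toy model of the Win32 last-error slot; names are mimicked, not real.
_last_error = 0


def set_last_error(code):
    global _last_error
    _last_error = code


def get_last_error():
    return _last_error


def create_mapping(ok):
    # Like CreateFileMapping: sets an error code on failure, but (for
    # this illustration) a successful call leaves any stale code alone.
    if not ok:
        set_last_error(183)  # 183 is the real ERROR_ALREADY_EXISTS value
        return None
    return object()


# A stale code left over from some earlier, unrelated failure:
set_last_error(5)

# Without clearing first, success still "looks" like error 5:
h = create_mapping(True)
stale = get_last_error()

# With SetLastError(0) immediately before the call, a leftover code
# can no longer be mistaken for a failure of this call:
set_last_error(0)
h2 = create_mapping(True)
clean = get_last_error()
```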


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


clear_errorcode_dsm-v1.patch
Description: Binary data



Re: [HACKERS] [PATCH] Remove redundant if clause in standbydesc.c

2016-09-23 Thread Aleksander Alekseev
> > Very simple small patch - see attachment.
> 
>  /* not expected, but print something anyway */
>  else if (msg->id == SHAREDINVALRELMAP_ID)
> -appendStringInfoString(buf, " relmap");
> -else if (msg->id == SHAREDINVALRELMAP_ID)
>  appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
> 
> Looking at inval.c, dbId can be InvalidOid.

Frankly, I'm not very familiar with this part of the code. InvalidOid is just
zero. Does it create some problem in this case?

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] [PATCH] Remove redundant if clause in standbydesc.c

2016-09-23 Thread Michael Paquier
On Fri, Sep 23, 2016 at 6:08 PM, Aleksander Alekseev
 wrote:
> Very simple small patch - see attachment.

 /* not expected, but print something anyway */
 else if (msg->id == SHAREDINVALRELMAP_ID)
-appendStringInfoString(buf, " relmap");
-else if (msg->id == SHAREDINVALRELMAP_ID)
 appendStringInfo(buf, " relmap db %u", msg->rm.dbId);

Looking at inval.c, dbId can be InvalidOid.
-- 
Michael




[HACKERS] [PATCH] Remove redundant if clause in standbydesc.c

2016-09-23 Thread Aleksander Alekseev
Hello.

Very simple small patch - see attachment.

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 13797a3..77983e5 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -122,8 +122,6 @@ standby_desc_invalidations(StringInfo buf,
 			appendStringInfoString(buf, " smgr");
 		/* not expected, but print something anyway */
 		else if (msg->id == SHAREDINVALRELMAP_ID)
-			appendStringInfoString(buf, " relmap");
-		else if (msg->id == SHAREDINVALRELMAP_ID)
 			appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
 		else if (msg->id == SHAREDINVALSNAPSHOT_ID)
 			appendStringInfo(buf, " snapshot %u", msg->sn.relId);
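The effect of the removed lines can be seen in a toy Python mirror of
the chain (the id value here is purely illustrative): because both
conditions test the same id, the second branch is unreachable, so the
" relmap db %u" message was never printed.

```python
SHAREDINVALRELMAP_ID = -4  # illustrative value, not taken from the source


def describe(msg_id, db_id):
    # Mirrors the redundant chain removed by the patch: the first
    # branch always wins, making the second one dead code.
    if msg_id == SHAREDINVALRELMAP_ID:
        return " relmap"
    elif msg_id == SHAREDINVALRELMAP_ID:
        return " relmap db %u" % db_id  # unreachable
    return ""


out = describe(SHAREDINVALRELMAP_ID, 7)
```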



Re: [HACKERS] patch: function xmltable

2016-09-23 Thread Pavel Stehule
2016-09-23 10:05 GMT+02:00 Craig Ringer :

> On 22 September 2016 at 02:31, Pavel Stehule 
> wrote:
>
> > another small update - fix XMLPath parser - support multibytes characters
>
> I'm returning for another round of review.
>
> The code doesn't handle the 5 XML built-in entities correctly in
> text-typed output. It processes some of them but not the others. See
> added test. I have not fixed this, but I think it's clearly
> broken:
>
>
> + -- XML builtin entities
> + SELECT * FROM xmltable('/x/a' PASSING
> '<
> ent>'
> COLUMNS ent text);
> +   ent
> + ---
> +  '
> +  "
> +  
> +  
> +  
> + (5 rows)
>
> so I've adjusted the docs to claim that they're expanded. The code
> needs fixing to avoid entity-escaping when the output column type is
> not 'xml'.
>
>
> Some entities in xml-typed output are expanded, not
> preserved. I don't know if this is intended but I suspect it is:
>
> + SELECT * FROM xmltable('/x/a' PASSING
> '<
> ent>'
> COLUMNS ent xml);
> +ent
> + --
> +  '
> +  "
> +  
> +  
> +  
> + (5 rows)
>
>
> For the docs changes relevant to the above search for "The five
> predefined XML entities". Adjust that bit of docs if I guessed wrong
> about the intended behaviour.
>
> The tests don't cover CDATA or PCDATA . I didn't try to add that, but
> they should.
>
>
> Did some docs copy-editing and integrated some examples. Explained how
> nested elements work, that multiple top level elements is an error,
> etc. Explained the time-of-evaluation stuff. Pointed out that you can
> refer to prior output columns in PATH and DEFAULT, since that's weird
> and unusual compared to normal SQL. Documented handling of multiple
> node matches, including the surprising results of somepath/text() on
> xy. Documented handling of nested
> elements. Documented that xmltable works only on XML documents, not
> fragments/forests.
>
> Regarding evaluation time, it struck me that evaluating path
> expressions once per row means the xpath must be parsed and processed
> once per row. Isn't it desirable to store and re-use the preparsed
> xpath? I don't think this is a major problem, since we can later
> detect stable/immutable expressions including constants, evaluate only
> once in that case, and cache. It's just worth thinking about.
>
> The docs and tests don't seem to cover XML entities. What's the
> behaviour there? Core XML only defines one entity, but if a schema
> defines more how are they processed? The tests need to cover the
> predefined entities and  at least.
>
> I have no idea whether the current code can fetch a DTD and use any
>  declarations to expand entities, but I'm guessing not? If
> not, external DTDs, and internal DTDs with external entities should be
> documented as unsupported.
>
> It doesn't seem to cope with internal DTDs at all (libxml2 limitation?):
>
> SELECT * FROM xmltable('/' PASSING $XML$ standalone="yes" ?>
>
>   
> ]>
> Hello .
> $XML$ COLUMNS foo text);
>
> + ERROR:  invalid XML content
> + LINE 1: SELECT * FROM xmltable('/' PASSING $XML$ +^
> + DETAIL:  line 2: StartTag: invalid element name
> +  +  ^
> + line 3: StartTag: invalid element name
> +   
> +^
> + line 4: StartTag: invalid element name
> +   
> +^
> + line 6: Entity 'pg' not defined
> + Hello .
> +^
>
>
> libxml seems to support documents with internal DTDs:
>
> $ xmllint --valid /tmp/x
> 
>  
> 
> ]>
> Hello .
>
>
> so presumably the issue lies in the xpath stuff? Note that it's not
> even ignoring the DTD and choking on the undefined entity, it's
> choking on the DTD itself.
>
>
> OK, code comments:
>
>
> In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP
> instead of a PG_TRY() / PG_CATCH() block?
>
>
> I think the new way you handle the type stuff is much, much better,
> and with comments to explain too. Thanks very much.
>
>
> There's an oversight in tableexpr vs xmltable separation here:
>
> +case T_TableExpr:
> +*name = "xmltable";
> +return 2;
>
> presumably you need to look at the node and decide what kind of table
> expression it is or just use a generic "tableexpr".
>
> Same problem here:
>
> +case T_TableExpr:
> +{
> +TableExpr  *te = (TableExpr *) node;
> +
> +/* c_expr shoud be closed in brackets */
> +appendStringInfoString(buf, "XMLTABLE(");
>
>
This is correct, but not well commented - look at the XMLEXPR node. TableExpr
is a holder, but it is invisible to the user. The user runs the XMLTABLE
function and should see XMLTABLE. It will be cleaner when we support the
JSON_TABLE function.


>
>
> I don't have the libxml knowledge or remaining brain to usefully
> evaluate the xpath and xml specifics in xpath.c today. It does strike
> me that the new xpath parser should probably live in its own file,
> though.
>

I'll try to move it to a separate file


>
> I 

Re: [HACKERS] patch: function xmltable

2016-09-23 Thread Craig Ringer
On 22 September 2016 at 02:31, Pavel Stehule  wrote:

> another small update - fix XMLPath parser - support multibytes characters

I'm returning for another round of review.

The code doesn't handle the 5 XML built-in entities correctly in
text-typed output. It processes some of them but not the others. See
added test. I have not fixed this, but I think it's clearly
broken:


+ -- XML builtin entities
+ SELECT * FROM xmltable('/x/a' PASSING
''
COLUMNS ent text);
+   ent
+ ---
+  '
+  "
+  
+  
+  
+ (5 rows)

so I've adjusted the docs to claim that they're expanded. The code
needs fixing to avoid entity-escaping when the output column type is
not 'xml'.


Some entities in xml-typed output are expanded, not
preserved. I don't know if this is intended but I suspect it is:

+ SELECT * FROM xmltable('/x/a' PASSING
''
COLUMNS ent xml);
+ent
+ --
+  '
+  "
+  
+  
+  
+ (5 rows)


For the docs changes relevant to the above search for "The five
predefined XML entities". Adjust that bit of docs if I guessed wrong
about the intended behaviour.

The tests don't cover CDATA or PCDATA . I didn't try to add that, but
they should.


Did some docs copy-editing and integrated some examples. Explained how
nested elements work, that multiple top level elements is an error,
etc. Explained the time-of-evaluation stuff. Pointed out that you can
refer to prior output columns in PATH and DEFAULT, since that's weird
and unusual compared to normal SQL. Documented handling of multiple
node matches, including the surprising results of somepath/text() on
xy. Documented handling of nested
elements. Documented that xmltable works only on XML documents, not
fragments/forests.

Regarding evaluation time, it struck me that evaluating path
expressions once per row means the xpath must be parsed and processed
once per row. Isn't it desirable to store and re-use the preparsed
xpath? I don't think this is a major problem, since we can later
detect stable/immutable expressions including constants, evaluate only
once in that case, and cache. It's just worth thinking about.
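A hedged sketch of that caching idea (generic Python memoization, with
invented names; not the actual executor code): parse the path string
once, then reuse the compiled form for every row.

```python
from functools import lru_cache

parse_count = 0


@lru_cache(maxsize=None)
def compile_path(path):
    # Stand-in for parsing/compiling an XPath string; today that work
    # happens once per output row.
    global parse_count
    parse_count += 1
    return path.split("/")  # toy "compiled" form


rows = [{"a": 1}, {"a": 2}, {"a": 3}]
for row in rows:
    steps = compile_path("/x/a")  # same path evaluated for every row
```

With the cache, three row evaluations trigger a single parse; the same
shape would apply once stable/immutable path expressions can be
detected.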

The docs and tests don't seem to cover XML entities. What's the
behaviour there? Core XML only defines one entity, but if a schema
defines more how are they processed? The tests need to cover the
predefined entities and  at least.

I have no idea whether the current code can fetch a DTD and use any
 declarations to expand entities, but I'm guessing not? If
not, external DTDs, and internal DTDs with external entities should be
documented as unsupported.

It doesn't seem to cope with internal DTDs at all (libxml2 limitation?):

SELECT * FROM xmltable('/' PASSING $XML$

  
]>
Hello .
$XML$ COLUMNS foo text);

+ ERROR:  invalid XML content
+ LINE 1: SELECT * FROM xmltable('/' PASSING $XML$
+^
+ line 4: StartTag: invalid element name
+   
+^
+ line 6: Entity 'pg' not defined
+ Hello .
+^


libxml seems to support documents with internal DTDs:

$ xmllint --valid /tmp/x



]>
Hello .


so presumably the issue lies in the xpath stuff? Note that it's not
even ignoring the DTD and choking on the undefined entity, it's
choking on the DTD itself.


OK, code comments:


In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP
instead of a PG_TRY() / PG_CATCH() block?


I think the new way you handle the type stuff is much, much better,
and with comments to explain too. Thanks very much.


There's an oversight in tableexpr vs xmltable separation here:

+case T_TableExpr:
+*name = "xmltable";
+return 2;

presumably you need to look at the node and decide what kind of table
expression it is or just use a generic "tableexpr".

Same problem here:

+case T_TableExpr:
+{
+TableExpr  *te = (TableExpr *) node;
+
+/* c_expr shoud be closed in brackets */
+appendStringInfoString(buf, "XMLTABLE(");



I don't have the libxml knowledge or remaining brain to usefully
evaluate the xpath and xml specifics in xpath.c today. It does strike
me that the new xpath parser should probably live in its own file,
though.

I think this is all a big improvement. Barring the notes above and my
lack of review of the guts of the xml.c parts of it, I'm pretty happy
with what I see now.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PL/Python adding support for multi-dimensional arrays

2016-09-23 Thread Heikki Linnakangas

On 09/22/2016 10:28 AM, Pavel Stehule wrote:

Now, the tests are enough - so I'll mark this patch as ready for committers.

I had to fix the tests - there was a lot of stray whitespace, and the result for
python3 was missing


Thanks Pavel!

This crashes with arrays with non-default lower bounds:

postgres=# SELECT * FROM test_type_conversion_array_int4('[2:4]={1,2,3}');
INFO:  ([1, 2, ], )
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.


I'd like to see some updates to the docs for this. The manual doesn't 
currently say anything about multi-dimensional arrays in pl/python, but 
it should've mentioned that they're not supported. Now that it is 
supported, should mention that, and explain briefly that a 
multi-dimensional array is mapped to a python list of lists.


It seems we don't have any mention in the docs about arrays with 
non-default lower-bounds ATM. That's not this patch's fault, but it 
would be good to point out that the lower bounds are discarded when an 
array is passed to python.


I find the loop in PLyList_FromArray() quite difficult to understand. 
Are the comments there mixing up the "inner" and "outer" dimensions? I 
wonder if that would be easier to read, if it was written in a 
recursive-style, rather than iterative with stacks for the dimensions.
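As a rough illustration of the recursive shape (pure Python with
invented names, not the pl/python C code): given the flattened element
list plus the dimension sizes, nesting falls out of a single recursive
split. Note that lower bounds simply have no representation here, so a
'[2:4]={1,2,3}' array comes out the same as '{1,2,3}'.

```python
def to_nested(flat, dims):
    # Convert a flattened array plus dimension sizes into a Python
    # list of lists, recursing one dimension at a time.
    if len(dims) == 1:
        return list(flat)
    inner = 1
    for d in dims[1:]:
        inner *= d  # elements per slice of the outermost dimension
    return [to_nested(flat[i * inner:(i + 1) * inner], dims[1:])
            for i in range(dims[0])]


nested = to_nested([1, 2, 3, 4, 5, 6], [2, 3])
```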


On 08/03/2016 02:49 PM, Alexey Grishchenko wrote:

This patch does not support multi-dimensional arrays of composite types, as
composite types in Python might be represented as iterators and there is no
obvious way to find out when the nested array stops and composite type
structure starts. For example, if we have a composite type of (int, text),
we can try to return "[ [ [1,'a'], [2,'b'] ], [ [3,'c'], [4,'d'] ] ]", and
it is hard to find out that the first two lists are lists, and the third
one represents structure. Things are getting even more complex when you
have arrays as members of composite type. This is why I think this
limitation is reasonable.


How do we handle single-dimensional arrays of composite types at the 
moment? At a quick glance, it seems that the composite types are just 
treated like strings, when they're in an array. That's probably OK, but 
it means that there's nothing special about composite types in 
multi-dimensional arrays. In any case, we should mention that in the docs.


- Heikki




Re: [HACKERS] ICU integration

2016-09-23 Thread Thomas Munro
On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut
 wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

This is very interesting work, and it's great to see some development
in this area.  I've been peripherally involved in more than one
collation-change-broke-my-data incident over the years.  I took the
patch for a quick spin today.  Here are a couple of initial
observations.

This patch adds pkg-config support to our configure script, in order
to produce the build options for ICU.  That's cool, and I'm a fan of
pkg-config, but it's an extra dependency that I just wanted to
highlight.  For example, MacOSX appears to ship with ICU but has no
pkg-config or ICU .pc files out of the box (erm, I can't even find the
headers, so maybe that copy of ICU is useless to everyone except
Apple).  The MacPorts ICU port ships with .pc files, so that's easy to
deal with, and I don't expect it to be difficult to get a working
pkg-config and meta-data installed alongside ICU on any platform that
doesn't already ship with them.  It may well be useful for configuring
other packages.  (There is also other cool stuff that would become
possible once ICU is optionally around, off topic.)

There is something missing from the configure script however: the
output of pkg-config is not making it into CFLAGS or LDFLAGS, so
building fails on FreeBSD and MacOSX where, for example, the ICU
headers don't live in the default search path.  I tried very briefly
to work out what was going wrong, but autoconfuse defeated me.

You call the built-in strcoll/strxfrm collation provider "posix", and
although POSIX does define those functions, it only does so because it
inherits them from ISO C90.  As far as I can tell Windows provides
those functions because it conforms to the C spec, not the POSIX spec,
though I suppose you could argue that in that respect it DOES conform
to the POSIX spec...  Also, we already have a collation called
"POSIX".  Maybe we should avoid confusing people with a "posix"
provider and a "POSIX" collation?  But then I'm not sure what else to
call it, but perhaps "system" as Petr Jelinek suggested[1], or "libc".

postgres=# select * from pg_collation where collname like 'en_NZ%';
┌──────────────────┬───────────────┬───────────┬──────────────┬──────────────┬──────────────────┬──────────────────┬─────────────┐
│     collname     │ collnamespace │ collowner │ collprovider │ collencoding │   collcollate    │    collctype     │ collversion │
├──────────────────┼───────────────┼───────────┼──────────────┼──────────────┼──────────────────┼──────────────────┼─────────────┤
│ en_NZ            │            11 │        10 │ p            │            6 │ en_NZ            │ en_NZ            │           0 │
│ en_NZ            │            11 │        10 │ p            │            8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │           0 │
│ en_NZ            │            11 │        10 │ p            │           16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │           0 │
│ en_NZ.ISO8859-1  │            11 │        10 │ p            │            8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │           0 │
│ en_NZ.ISO8859-15 │            11 │        10 │ p            │           16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │           0 │
│ en_NZ.UTF-8      │            11 │        10 │ p            │            6 │ en_NZ.UTF-8      │ en_NZ.UTF-8      │           0 │
│ en_NZ%icu        │            11 │        10 │ i            │           -1 │ en_NZ            │ en_NZ            │ -1724383232 │
└──────────────────┴───────────────┴───────────┴──────────────┴──────────────┴──────────────────┴──────────────────┴─────────────┘
(7 rows)

I notice that you use encoding -1, meaning "all".  I haven't fully
grokked what that really means but it appears that we won't be able to
create any new such collations after initdb using CREATE COLLATION, if
for example you update your ICU installation and now have a new
collation available.  When I tried dropping some and recreating them
they finished up with collencoding = 6.  Should the initdb-created
rows use 6 too?

+ ucol_getVersion(collator, versioninfo);
+ numversion = ntohl(*((uint32 *) versioninfo));
+
+ if (numversion != collform->collversion)
+ ereport(WARNING,
+ (errmsg("ICU collator version mismatch"),
+ errdetail("The database was created using version 0x%08X, the
library provides version 0x%08X.",
+   (uint32) collform->collversion, (uint32) numversion),
+ errhint("Rebuild affected indexes, or build PostgreSQL with the
right version of ICU.")));

I played around with bumping version numbers artificially.  That gives
each session that accesses the collation one warning:

postgres=# select * from foo order by id;
WARNING:  ICU collator version mismatch
DETAIL:  The database was created using version 0x99380001, the
library provides version 0x9938.
HINT:  Rebuild affected indexes, or build PostgreSQL with the right
version of ICU.
┌┐
│ id │
├┤
└┘
(0 rows)
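A sketch of the check the quoted snippet performs (Python with invented
helper names; the C code's ntohl() on the 4-byte UVersionInfo is
equivalent to a big-endian read of those bytes): record the collator
version when the collation is created, and warn when the library now
reports a different one.

```python
import struct


def version_to_uint32(versioninfo):
    # Equivalent of ntohl(*(uint32 *) versioninfo): interpret the
    # 4 version bytes in network (big-endian) order.
    return struct.unpack(">I", versioninfo)[0]


def check_collation(stored, library_versioninfo):
    # Returns a warning string on mismatch, None when versions agree.
    current = version_to_uint32(library_versioninfo)
    if current != stored:
        return ("WARNING: ICU collator version mismatch: "
                "created with 0x%08X, library provides 0x%08X"
                % (stored, current))
    return None


stored = version_to_uint32(bytes([0x99, 0x38, 0x00, 0x01]))
ok = check_collation(stored, bytes([0x99, 0x38, 0x00, 0x01]))
warn = check_collation(stored, bytes([0x99, 0x38, 0x00, 0x02]))
```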

postgres=# select * from foo 

Re: [HACKERS] Showing parallel status in \df+

2016-09-23 Thread Pavel Stehule
2016-09-23 7:22 GMT+02:00 Rushabh Lathia :

>
>
> On Thu, Sep 22, 2016 at 10:04 PM, Tom Lane  wrote:
>
>> Rushabh Lathia  writes:
>> > I agree with the argument in this thread, having "Source code" as part
>> > of \df+ is bit annoying, specifically when output involve some really
>> > big PL language functions. Having is separate does make \df+ output more
>> > readable. So I would vote for \df++ rather then adding the source code
>> > as part of footer for \df+.
>>
>> If it's unreadable in \df+, how would \df++ make that any better?
>>
>>
> Even though source code as part of \df+ is a bit annoying (specifically for
> PL functions),
> I noticed the argument in this thread that it's useful information for
> some.  So \df++
> is just an alternate option for those who want the source code.
>

++ is a little bit obscure. So better to remove the src everywhere.

Regards

Pavel


>
>
>
>> regards, tom lane
>>
>
>
>
> --
> Rushabh Lathia
>


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 1:21 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Sep 20, 2016 at 12:53 PM, Tom Lane  wrote:
>>> I'd be the first to agree that this point is inadequately documented
>>> in the code, but PostmasterRandom should be reserved for its existing
>>> security-related uses, not exposed to the world for (ahem) random other
>>> uses.
>
>> So, we could have dsm_postmaster_startup() seed the random number
>> generator itself, and then let PostmasterRandom() override the seed
>> later.  Like maybe:
>
> Works for me, at least as a temporary solution.  The disturbing thing
> here is that this still only does what we want if dsm_postmaster_startup
> happens before the first PostmasterRandom call --- which is OK today but
> seems pretty fragile.  Still, redesigning PostmasterRandom's seeding
> technique is not something to do right before 9.6 release.  Let's revert
> the prior patch and do it as you have here:
>
>> struct timeval tv;
>> gettimeofday(&tv, NULL);
>> srandom(tv.tv_sec);
>> ...
>> dsm_control_handle = random();
>
> for the time being.
>

Isn't it better if we use the same technique in dsm_create() as well,
which uses random() for the handle?
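As a toy illustration of why seconds-granularity seeding is fragile
(Python stand-ins for srandom()/random(); not the backend code): two
processes that seed from the same tv_sec value derive exactly the same
"random" handle sequence.

```python
import random


def make_handle(tv_sec):
    # Sketch of the quoted approach: srandom(tv.tv_sec) then random().
    rng = random.Random(tv_sec)  # stand-in for srandom(tv.tv_sec)
    return rng.getrandbits(31)   # stand-in for random()


# Two seedings within the same second yield identical handles, which is
# why relying on the seed's placement/ordering deserves care:
h1 = make_handle(1474600000)
h2 = make_handle(1474600000)
```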

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] File system operations.

2016-09-23 Thread Craig Ringer
On 22 September 2016 at 20:02, Yury Zhuravlev
 wrote:
> On четверг, 15 сентября 2016 г. 19:09:16 MSK, Tom Lane wrote:
>>
>> Robert Haas  writes:
>>>
>>> On Thu, Sep 15, 2016 at 11:01 AM, Anastasia Lubennikova
>>>  wrote:

 What do you think about moving stuff from pg_upgrade/file.c to
 storage/file/
 to allow reuse of this code? I think it'll be really helpful for
 developers
 of contrib modules
 like backup managers.
>>
>>
>>> Well, storage/file is backend and pg_upgrade is frontend.  If you want
>>> to reuse the same code for both it's got to go someplace else.
>>
>>
>> Also, to the extent that it's important to use those wrappers rather
>> than libc directly, it's because the wrappers are tied into backend
>> resource management and error handling conventions.  I don't see how
>> you convert that into code that also works in a frontend program,
>> or what the point would be exactly.
>
>
> Maybe we should move towards a framework ecosystem. I mean, developers of third
> party tools should, as much as possible, use the same API as the backend.

Well, there's a very gradual shift in that direction with libpgcommon
etc. But it's minimalist.

Most of what's done in the backend makes no sense in frontend code.

It'd make much more sense to adopt an existing portable runtime (nspr,
apr, whatever) than write our own if we wanted to go full framework.
But I don't think we're likely to. Much more likely to cause pain than
prevent it, esp since we're multiprocessing and shmem based.

There are a few areas we could use abstractions for though, like
fork()/exec() vs CreateProcessEx(...).


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

