Re: [HACKERS] Parallel bitmap heap scan

2016-11-22 Thread Dilip Kumar
On Wed, Nov 23, 2016 at 7:24 AM, Robert Haas  wrote:
> So, I had a brief look at this tonight.  This is not a full review,
> but just some things I noticed:

Thanks for the review..

>
> + *Update snpashot info in heap scan descriptor.
>
> Typo.  Also, why should we have a function for this at all?  And if we
> do have a function for this, why should it have "bm" in the name when
> it's stored in heapam.c?
We are updating the snapshot stored in the HeapScanDesc; that's why I used
a function and kept it in heapam.c, like the other function
heap_beginscan_bm.
But I think we can change its name; "bm" is not required here, since this
function doesn't do anything specific to bitmap scans. I will change this.

>
> + *[PARALLEL BITMAP HEAP SCAN ALGORITHM]
> + *
> + *#1. Shared TIDBitmap creation and initialization
> + *a) First worker to see the state as parallel bitmap info as
> + *PBM_INITIAL become leader and set the state to PBM_INPROGRESS
> + *All other workers see the state as PBM_INPROGRESS will wait for
> + *leader to complete the TIDBitmap.
> + *
> + *Leader Worker Processing:
> + *(Leader is responsible for creating shared TIDBitmap and create
> + *shared page and chunk array from TIDBitmap.)
> + *1) Create TIDBitmap using DHT.
> + *2) Begin Iterate: convert hash table into shared page and chunk
> + *array.
> + *3) Restore local TIDBitmap variable information into
> + *ParallelBitmapInfo so that other worker can see those.
> + *4) set state to PBM_FINISHED.
> + *5) Wake up other workers.
> + *
> + *Other Worker Processing:
> + *1) Wait until leader create shared TIDBitmap and shared page
> + *and chunk array.
> + *2) Attach to shared page table, copy TIDBitmap from
> + *ParallelBitmapInfo to local TIDBitmap, we copy this to local
> + *TIDBitmap so that next level processing can read information
> + *same as in non parallel case and we can avoid extra changes
> + *in code.
> + *
> + *# At this level TIDBitmap is ready and all workers are awake #
> + *
> + *#2. Bitmap processing (Iterate and process the pages).
> + *. In this phase each worker will iterate over page and
> chunk array and
> + *select heap pages one by one. If prefetch is enable then there will
> + *be two iterator.
> + *. Since multiple worker are iterating over same page and chunk 
> array
> + *we need to have a shared iterator, so we grab a spin lock and 
> iterate
> + *within a lock.
>
> The formatting of this comment is completely haphazard.

I will fix this..

"Leader
> worker" is not a term that has any meaning.  A given backend involved
> in a parallel query is either a leader or a worker, not both.

I agree this is confusing, but we can't simply call it the leader,
because IMHO the leader is the backend that actually spawns all the
workers and gathers the results. By "leader worker" I meant the worker
that takes responsibility for building the TIDBitmap, which can be any
worker; that's why I named it "leader worker". Let me think about what
else we can call it.
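
To make the election concrete, here is a minimal C sketch of what pbms_is_leader()
presumably amounts to: the first participant that finds the shared state at
PBM_INITIAL flips it to PBM_INPROGRESS and builds the bitmap, and everyone else
waits for PBM_FINISHED. Only the PBM_* state names come from the patch; the struct
layout and the use of the recently committed condition variables are my assumptions
for illustration, not the patch's actual code.

#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

typedef enum PBMState
{
    PBM_INITIAL,
    PBM_INPROGRESS,
    PBM_FINISHED
} PBMState;

typedef struct SharedBitmapState
{
    slock_t     mutex;          /* protects "state" */
    PBMState    state;
    ConditionVariable cv;       /* builder broadcasts after PBM_FINISHED */
} SharedBitmapState;

/*
 * Returns true in exactly one participant, which must then build the shared
 * TIDBitmap, set state to PBM_FINISHED, and broadcast on cv.
 */
static bool
become_bitmap_builder(SharedBitmapState *shared)
{
    bool        elected = false;

    SpinLockAcquire(&shared->mutex);
    if (shared->state == PBM_INITIAL)
    {
        shared->state = PBM_INPROGRESS;
        elected = true;
    }
    SpinLockRelease(&shared->mutex);

    if (!elected)
    {
        /* Everyone else sleeps until the builder publishes the bitmap. */
        ConditionVariablePrepareToSleep(&shared->cv);
        for (;;)
        {
            PBMState    state;

            SpinLockAcquire(&shared->mutex);
            state = shared->state;
            SpinLockRelease(&shared->mutex);
            if (state == PBM_FINISHED)
                break;
            ConditionVariableSleep(&shared->cv, 0);   /* wait event omitted */
        }
        ConditionVariableCancelSleep();
    }
    return elected;
}

Whatever the chosen backend ends up being called, the property that matters is
that exactly one participant gets "true" here.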

>
> +/* reset parallel bitmap scan info, if present */
> +if (node->parallel_bitmap)
> +{
> +ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
> +
> +pbminfo->state = PBM_INITIAL;
> +pbminfo->tbmiterator.schunkbit = 0;
> +pbminfo->tbmiterator.spageptr = 0;
> +pbminfo->tbmiterator.schunkptr = 0;
> +pbminfo->prefetch_iterator.schunkbit = 0;
> +pbminfo->prefetch_iterator.spageptr = 0;
> +pbminfo->prefetch_iterator.schunkptr = 0;
> +pbminfo->prefetch_pages = 0;
> +pbminfo->prefetch_target = -1;
> +}
>
> This is obviously not going to work in the face of concurrent
> activity.  When we did Parallel Seq Scan, we didn't make any changes
> to the rescan code at all.  I think we are assuming that there's no
> way to cause a rescan of the driving table of a parallel query; if
> that's wrong, we'll need some fix, but this isn't it.  I would just
> leave this out.

In the heap_rescan function we have a similar change:

if (scan->rs_parallel != NULL)
{
  parallel_scan = scan->rs_parallel;
  SpinLockAcquire(&parallel_scan->phs_mutex);
  parallel_scan->phs_cblock = parallel_scan->phs_startblock;
  SpinLockRelease(&parallel_scan->phs_mutex);
}

And this is not for the driving table; it's required for the case when a
Gather is the inner node of a join.
Consider the example below. I know it's a bad plan, but we can produce this plan :)

Outer Node                     Inner Node
SeqScan(t1)    NLJ    (Gather -> Parallel SeqScan(t2))

The reason for doing the same is that during ExecReScanGather we don't
create a new DSM; instead we just reinitialize the existing DSM
(ExecParallelReinitialize).
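
Purely as an illustration of the heap_rescan pattern I am referring to, here is a
hedged sketch of the same reset done under a mutex (the mutex field is my
assumption; the other fields follow the patch). This is not a claim that it
resolves the concurrency question raised above.

    if (node->parallel_bitmap)
    {
        ParallelBitmapInfo *pbminfo = node->parallel_bitmap;

        SpinLockAcquire(&pbminfo->mutex);   /* assumed lock guarding the shared iterators */
        pbminfo->state = PBM_INITIAL;
        pbminfo->tbmiterator.schunkbit = 0;
        pbminfo->tbmiterator.spageptr = 0;
        pbminfo->tbmiterator.schunkptr = 0;
        pbminfo->prefetch_iterator.schunkbit = 0;
        pbminfo->prefetch_iterator.spageptr = 0;
        pbminfo->prefetch_iterator.schunkptr = 0;
        pbminfo->prefetch_pages = 0;
        pbminfo->prefetch_target = -1;
        SpinLockRelease(&pbminfo->mutex);
    }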

>
> +static bool
> +pbms_is_leader(ParallelBitmapInfo *pbminfo)
>
> I think 

Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-11-22 Thread Pavel Stehule
Hi

2016-11-15 12:26 GMT+01:00 Kyotaro HORIGUCHI :

> Hello, I rebased this patch on the current master.
>
> At Mon, 31 Oct 2016 10:15:48 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI  wrote in <
> 20161031.101548.162143279.horiguchi.kyot...@lab.ntt.co.jp>
> > Anyway, I fixed space issues and addressed the
> > COMPLETE_WITH_QUERY()'s almost unused parameter problem. And
> > tried to write a README files.
>
> tab-complete.c has gotten some improvements since then.
>
> 577f0bdd2b8904cbdfde6c98f4bda6fd93a05ffc psql: Tab completion for
> renaming enum values.
> 927d7bb6b120a2ca09a164898f887eb850b7a329 Improve tab completion for
> CREATE TRIGGER.
> 1d15d0db50a5f39ab69c1fe60f2d5dcc7e2ddb9c psql: Tab-complete LOCK [TABLE]
> ... IN {ACCESS|ROW|SHARE}.
>
> The attached patchset is rebased on the master including these
> patches.
>

I checked patches 0001, 0002, and 0003.

There are no problems with patching or compiling, and everything works as
expected.

These patches can be committed separately; they are long, but the code is
almost mechanical.

The README is perfect.

Regards

Pavel

>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: [HACKERS] patch: function xmltable

2016-11-22 Thread Pavel Stehule
2016-11-22 21:47 GMT+01:00 Alvaro Herrera :

> I found the whole TableExprGetTupleDesc() function a bit odd in
> nodeFuncs.c, so I renamed it to ExecTypeFromTableExpr() and moved it to
> execTuples.c -- but only because that's where ExecTypeFromTL and others
> already live.  I would have liked to move it to tupdesc.c instead, but
> it requires knowledge of executor nodes, which is probably the reason
> that ExecTypeFromTL is in execTuples.  I think we'd eat that bit of
> ugliness only because we're not the first.  But anyway I quickly ran
> into another problem.
>
> I noticed that ExecTypeFromTableExpr is being called from the transform
> phase, which is much earlier than the executor.  I noticed because of
> the warning that the above movement added to nodeFuncs.c,
> src/backend/nodes/nodeFuncs.c:509:5: warning: implicit declaration of
> function 'ExecTypeFromTableExpr' [-Wimplicit-function-declaration]
>

The tuple descriptor should not be serialized.

When xmltable is called directly, a live tuple descriptor is used, created
at transform time. The other situation is when xmltable is used from a
view, where transform time is skipped.

Originally I serialized the generated type, but I had problems with record
types: the current infrastructure expects to serialize only real types.

My solution is to recheck the tuple descriptor at executor time. It is a
small overhead, once per query, but it allows using xmltable from views
without the necessity of specifying the returned columns explicitly.


>
> so I thought, hm, is it okay to have parse analysis run an executor
> function?  (I suppose this is the reason you put it in nodeFuncs in the
> first place).  For fun, I tried this query under GDB, with a breakpoint
> on exprTypmod():
>
> SELECT X.*
> FROM emp,
> XMLTABLE ('//depts/dept/employee' passing doc
>COLUMNS
>empIDINTEGER PATH '@id',
>firstnameint PATH 'name/first',
>lastname VARCHAR(25) PATH 'name/last') AS X;
>
> and sure enough, the type is resolved during parse analysis:
>
> Breakpoint 1, exprTypmod (expr=expr@entry=0x1d23ad8)
> at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
> 283 if (!expr)
> (gdb) print *expr
> $2 = {type = T_TableExpr}
> (gdb) bt
> #0  exprTypmod (expr=expr@entry=0x1d23ad8)
> at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
> #1  0x0080c500 in get_expr_result_type (expr=0x1d23ad8,
> resultTypeId=0x7ffd482bfdb4, resultTupleDesc=0x7ffd482bfdb8)
> at /pgsql/source/master/src/backend/utils/fmgr/funcapi.c:247
> #2  0x0056de1b in expandRTE (rte=rte@entry=0x1d6b800, rtindex=2,
> sublevels_up=0, location=location@entry=7,
> include_dropped=include_dropped@entry=0 '\000',
> colnames=colnames@entry=0x7ffd482bfe10, colvars=0x7ffd482bfe18)
> at /pgsql/source/master/src/backend/parser/parse_relation.c:2052
> #3  0x0056e131 in expandRelAttrs (pstate=pstate@entry=0x1d238a8,
> rte=rte@entry=0x1d6b800, rtindex=,
> sublevels_up=, location=location@entry=7)
> at /pgsql/source/master/src/backend/parser/parse_relation.c:2435
> #4  0x0056fa64 in ExpandSingleTable (pstate=pstate@entry=
> 0x1d238a8,
> rte=rte@entry=0x1d6b800, location=7,
> make_target_entry=make_target_entry@entry=1 '\001')
> at /pgsql/source/master/src/backend/parser/parse_target.c:1266
> #5  0x0057135b in ExpandColumnRefStar (pstate=pstate@entry=
> 0x1d238a8,
> cref=0x1d22720, make_target_entry=make_target_entry@entry=1 '\001')
> at /pgsql/source/master/src/backend/parser/parse_target.c:1158
> #6  0x005716f9 in transformTargetList (pstate=0x1d238a8,
> targetlist=, exprKind=EXPR_KIND_SELECT_TARGET)
>
> This seems fine I guess, and it seems to say that we ought to move the
> code that generates the tupdesc back to parse analysis rather than the
> executor.  Okay, fine.  But let's find a better place than nodeFuncs.
>
> But if I move the XMLTABLE() call to the target list instead, the type
> is resolved at planner time:
>
> SELECT
> XMLTABLE ('/dept/employee' passing $$
> 
> 
> Mary
> Jones
> 
> 415
> 905-403-6112
> 647-504-4546
> 64000
> 
> $$
>COLUMNS
>empIDINTEGER PATH '@id',
>firstnamevarchar(4) PATH 'name/first',
>lastname VARCHAR(25) PATH 'name/last') AS X;
>
> Breakpoint 1, exprTypmod (expr=expr@entry=0x1d6bed8)
> at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
> 283 if (!expr)
> (gdb) bt
> #0  exprTypmod (expr=expr@entry=0x1d6bed8)
> at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
> #1  0x00654058 in set_pathtarget_cost_width (root=0x1d6bc68,
> target=0x1d6c728)
> at /pgsql/source/master/src/backend/optimizer/path/costsize.c:4729
> #2  0x0066c197 in grouping_planner (root=0x1d6bc68,
> inheritance_update=40 '(', inheritance_update@entry=0 '\000',
>   

Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 8:45 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> The best thing by far about an alternative design like this is that it
>> performs *consistently*.
>
> Really?  I think it just moves the issues somewhere else.

Definitely, yes.

* HOT is great and all, but HOT-safety is a pretty onerous thing to
ask users to worry about. They don't know anything about it. Maybe
WARM eventually helps with this -- I don't know about that.

* It's bad that the effectiveness of index-only scans is strongly
influenced by the visibility map. And, autovacuum doesn't care about
index-only scans (nor should it, I suppose).

* The high watermark size of a table is much higher for an
update-mostly workload. When bloat is localized to certain values in
indexes, it's a lot worse.

* Our behavior with many duplicates in secondary indexes is pretty bad
in general, I suspect. I think we do badly at amortizing inserts into
indexes from updates (we dirty more leaf pages than we really need
to). I think we could do better at making LP_DEAD style IndexTuple
recycling more effective.

* Most obviously, autovacuum can create significant load at
inconvenient times. Paying that cost in a piecemeal fashion has a
certain appeal.

For what it's worth, I am not planning to work on this.

-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Pavan Deolasee
On Wed, Nov 23, 2016 at 10:07 AM, Peter Geoghegan  wrote:

> On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane  wrote:
> > Ultimately, I doubt that update-in-place buys much that we don't already
> > have with HOT updates (which postdate this old conversation, btw).
> > If you want MVCC semantics, you need to hold both versions of the tuple
> > *somewhere*.  Oracle's solution is different from ours, but I'm not
> > exactly convinced that it's better.  It makes some aspects better and
> > others worse.
>
> I agree that HOT is roughly as effective, at least when you get mostly
> HOT updates.
>
>
I agree. We'd tried update-in-place before writing HOT and it turned out to
be complex and error-prone [1]. I am not suggesting that we can't do a
better job this time, but as Tom said, it's going to be a lot of work. If
WARM makes any progress, it will also help address some of the issues
around index bloat when HOT updates cannot be done.

One key issue that I see is that unrelated long-running transactions end up
holding back OldestXmin advancement, and that leads to significant bloat in
update-heavy tables. If we could somehow address that situation, it may
still go a long way toward containing bloat. Of course, there may still be
cases where bloat will be unavoidable, but it seems far less work to me
than a complete overhaul of the MVCC machinery.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/1163092396.3634.461.ca...@silverbirch.site
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Mark Kirkwood

On 23/11/16 16:31, Tom Lane wrote:

> Robert Haas  writes:
>> [ Let's invent Oracle-style UNDO logs ]
>
> I dunno.  I remember being told years ago, by an ex-Oracle engineer,
> that he thought our approach was better.  I don't recall all the details
> of the conversation but I think his key point was basically this:
>
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.


Also ROLLBACK becomes vastly more expensive than COMMIT (I can recall 
many years ago when I used to be an Oracle DBA reading whole chapters of 
novels waiting for failed batch jobs to roll back).


However, I'd like to add that I agree this is worth looking at, as
ideally it would be great to be able to choose whether to have no-UNDO
or UNDO on a table-by-table basis...


regards

Mark





Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Tom Lane
Peter Geoghegan  writes:
> The best thing by far about an alternative design like this is that it
> performs *consistently*.

Really?  I think it just moves the issues somewhere else.

regards, tom lane




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane  wrote:
> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.

I agree that HOT is roughly as effective, at least when you get mostly
HOT updates.

I think that there is an under-appreciated issue with index bloat and
VACUUM. Bloat in the sense of a B-Tree becoming unbalanced. I think
that short term bursts of non-HOT updates bloat the *keyspace* of
B-Trees, making them unbalanced. A short term burst (possibly
associated with one long running transaction that prevents pruning)
can cause long term damage to the key space.

The best thing by far about an alternative design like this is that it
performs *consistently*. I believe that we will lose on real-world
cases that play to the strengths of the current design. That doesn't
mean that it isn't worth it, but it does make it exceptionally
difficult.

-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Amit Kapila
On Wed, Nov 23, 2016 at 9:48 AM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>>> Oracle spends a lot of time on this, and it's really cache-inefficient
>>> because the data is spread all over.  This was what this guy felt in
>>> circa 2001; I'd have to think that the cache unfriendliness problem is
>>> much worse for modern hardware.
>
>> I imagine that temporal locality helps a lot. Most snapshots will be
>> interested in the same row version.
>
> Yeah, but the problem is that all those transactions have to scavenge
> through a big chunk of UNDO log to find out what they need to know.
>

I think that depends on how we design reading the old data. If every
transaction has to go through the UNDO log, then it can impact
performance; however, if we design it such that a transaction can take
help from a previous transaction that has already gone through the undo
log, then it will be much better. For that to happen, I think we need
some kind of page-level MVCC, and that should be possible in the design
proposed above, where the latest transaction information is present at
the page level.

> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.
>

I think not only Oracle but also some other database systems like
SQL Server (and IIRC MySQL) do a better job in terms of bloat
management (by keeping old versions separate from the main heap), which
is why in many cases they are preferred over PostgreSQL.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 7:01 PM, Robert Haas  wrote:
> This basic DO-UNDO-REDO protocol has been well-understood for
> decades.

FWIW, while this is basically true, the idea of repurposing UNDO to be
usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
nothing about MVCC.


-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Amit Kapila
On Wed, Nov 23, 2016 at 9:32 AM, Peter Geoghegan  wrote:
> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>>> - Reading a page that has been recently modified gets significantly
>>> more expensive; it is necessary to read the associated UNDO entries
>>> and do a bunch of calculation that is significantly more complex than
>>> what is required today.
>
> Someone told me that there is something called an interested
> transaction list stored in the page header, and from that I infer that
> it isn't *really* true. I think that unless you're interested in an
> affected row, rather than just some row that happens to be on the same
> page, you don't really have to worry about it.
>

Yeah, so basically if there is an effect of any transaction that is
not visible to the snapshot of the transaction reading the page, you
need to do something to read the old row(s) present on that page.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>> Oracle spends a lot of time on this, and it's really cache-inefficient
>> because the data is spread all over.  This was what this guy felt in
>> circa 2001; I'd have to think that the cache unfriendliness problem is
>> much worse for modern hardware.

> I imagine that temporal locality helps a lot. Most snapshots will be
> interested in the same row version.

Yeah, but the problem is that all those transactions have to scavenge
through a big chunk of UNDO log to find out what they need to know.

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*.  Oracle's solution is different from ours, but I'm not
exactly convinced that it's better.  It makes some aspects better and
others worse.

regards, tom lane




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.

Someone told me that there is something called an interested
transaction list stored in the page header, and from that I infer that
it isn't *really* true. I think that unless you're interested in an
affected row, rather than just some row that happens to be on the same
page, you don't really have to worry about it.

> Oracle spends a lot of time on this, and it's really cache-inefficient
> because the data is spread all over.  This was what this guy felt in
> circa 2001; I'd have to think that the cache unfriendliness problem is
> much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 10:41 PM, Peter Geoghegan  wrote:
> On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas  wrote:
>>> Heikki's been fooling with some ideas that I think have more promise.
>>> I wish he'd get to the point of presenting them publicly rather than
>>> just over beers at conferences.
>>
>> That would be good, too!
>
> I was told that TED was kind of formally proposed in the "MultiXact
> hindsight design" thread.

I read that, but:

1. Nothing's really come of that, AFAICS, and

2. It doesn't allow for in-place update, and

3. It doesn't eliminate the need for freezing.

Implementing TED would probably have been better than doing fkeylocks
the way we did, but now that we've mostly stabilized the multixact
work that was actually done, I have a hard time imagining that we
really want to revamp the whole thing without fixing any of the more
fundamental problems with our on-disk format.  In my mind, the bloat
created by storing both the old and new versions of an updated tuple
in the heap, and the need to rewrite pages multiple times to set hint
bits, set all-visible, and ultimately freeze are the three-ton
gorillas in the corner.  I am not arguing that what I've outlined here
is the only way to fix those problems or even the best way, just that
it isn't really worth the trouble of doing major surgery if we aren't
going to make a dent in those problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas  wrote:
>> Heikki's been fooling with some ideas that I think have more promise.
>> I wish he'd get to the point of presenting them publicly rather than
>> just over beers at conferences.
>
> That would be good, too!

I was told that TED was kind of formally proposed in the "MultiXact
hindsight design" thread.


-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 10:31 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> [ Let's invent Oracle-style UNDO logs ]
>
> I dunno.  I remember being told years ago, by an ex-Oracle engineer,
> that he thought our approach was better.  I don't recall all the details
> of the conversation but I think his key point was basically this:
>
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.
>
> Oracle spends a lot of time on this, and it's really cache-inefficient
> because the data is spread all over.  This was what this guy felt in
> circa 2001; I'd have to think that the cache unfriendliness problem is
> much worse for modern hardware.

Yes, that's the main thing I'm worried about with this design.  Now,
bloat is pretty cache-inefficient too.  If our table is 2x the size of
Oracle's table, they can spend more cycles per page and still do
fairly well.  If that 2x increment pushes us out of RAM, they can
spend basically all the CPU time they want and still win handily.  But
that isn't to say that this concern isn't justified.

> Which is not to say that it's not worth experimenting with.  But what you
> describe is going to be a huge amount of work even to get to the point
> where we could try to measure the actual costs and benefits :-(

Yeah.

> Heikki's been fooling with some ideas that I think have more promise.
> I wish he'd get to the point of presenting them publicly rather than
> just over beers at conferences.

That would be good, too!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Tom Lane
Robert Haas  writes:
> [ Let's invent Oracle-style UNDO logs ]

I dunno.  I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better.  I don't recall all the details
of the conversation but I think his key point was basically this:

> - Reading a page that has been recently modified gets significantly
> more expensive; it is necessary to read the associated UNDO entries
> and do a bunch of calculation that is significantly more complex than
> what is required today.

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over.  This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

Which is not to say that it's not worth experimenting with.  But what you
describe is going to be a huge amount of work even to get to the point
where we could try to measure the actual costs and benefits :-(

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.

regards, tom lane




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-22 Thread Peter Geoghegan
On Mon, Nov 21, 2016 at 5:15 PM, Claudio Freire  wrote:
>> There are a couple
>> of tricky issues with that that you'd have to look out for, like
>> making sure that the high key continues to hold a real TID, which at a
>> glance looks like something that just happens already. I'm not sure
>> that we preserve the item pointer TID in all cases, since the current
>> assumption is that a high key's TID is just filler -- maybe we can
>> lose that at some point.
>
> I thought long about that, and inner pages don't need a real TID in
> their keys, as those keys only specify key space cutoff points and not
> real tids, and high tids in leaf pages are always real.

Not if there are duplicates in the inner pages. Then, you have to add
in the TID, and separate the key space, to guide insertions to the
right leaf page directly (particularly for non-HOT UPDATEs). That's
what I'm mostly interested in investigating, here.

-- 
Peter Geoghegan




[HACKERS] Alternative MATERIALIZED VIEW design and implementation with history table and other features

2016-11-22 Thread Nico Williams

I love MATERIALIZED VIEWs.

But we needed a method for recording deltas from REFRESHes, and that's
not supported.  So I coded up my own version of materialized views, in
PlPgSQL, that does provide a history feature.

Besides a history feature, this includes the ability to record changes
made to a materialized view's materialization table, which means I can
have triggers that update the materialized view.

We use this for updating a view whose query is a bit slow.  Some
triggers are also slow (well, they're fast, but used in transactions
that might potentially fire these triggers many times), in which
case I mark a "view" as needing a refresh.  Other triggers are fast and
directly update the "view".

https://github.com/twosigma/postgresql-contrib
https://github.com/twosigma/postgresql-contrib/blob/master/pseudo_mat_views.sql
https://raw.githubusercontent.com/twosigma/postgresql-contrib/master/pseudo_mat_views.sql

I'd be willing to do some of the work of integrating this more closely
with PG, but I may need some pointers (but hopefully not much hand-
holding).  Ideally we could have CREATE MATERIALIZED VIEW syntax like
this:

CREATE MATERIALIZED VIEW schema_name.view_name
   [ ( column_name [, ...] ) ]
   [ WITH ( storage_parameter [= value] [, ... ] ) ]
   [ TABLESPACE tablespace_name ]
AS query
WITH [ [ UNLOGGED ] HISTORY TABLE [ schema_name.view_name_history ], ]
 [ PRIMARY KEY ( column_name [, ...] ), ]
 [ [ NO ] DATA ];

Of particular interest may be the fact that the FULL OUTER JOIN that PG
does for REFRESH CONCURRENTLY, and which I copied here, doesn't deal
well with views that have NULLs in any columns used in the join.  It
would be nice to have an equijoin that uses IS NOT DISTINCT FROM rather
than just =, and then refreshing could use such a join in order to deal
properly with NULLs.

Any help with integration, or comments, even flames, are welcomed, but
keep in mind that this is my first foray into making a contribution to
PG, so please do be kind.  Pointers to C and SQL style guides and
standards for in-tree code would be particularly helpful.  Thanks!

Nico
-- 




[HACKERS] UNDO and in-place update

2016-11-22 Thread Robert Haas
PostgreSQL tables and indexes not infrequently become bloated, and
this bloat is difficult to reverse once it has occurred.  When a row
is updated, a new row version is created and the space occupied by the
old version can be eventually be recycled.  Since there is always a
short delay and sometimes a very long delay (hours) between the time
when the new row version is created and the time the old row version
can be removed, tables grow to accommodate storing both the old and
new row versions.  When the old versions are reclaimed, there is no
easy way to shrink the table, because the free space is spread through
the table's data blocks rather than concentrated at the end of the
file.  Moreover, since the space used by old row versions is only
reclaimed lazily -- either by HOT pruning or by VACUUM -- the table
may be further extended even when space for the new row version could
have been made available by more timely cleanup.  Indexes have similar
issues: even though in theory there is more latitude to consolidate
free space, we don't always do so, and lazy reclamation of row
versions may cause relation extension.  One factor contributing to
index bloat is that any non-HOT update to a heap tuple requires an
update to every index even if the indexed columns are not updated.

How can we get ahead of this problem?  One approach is to make it
easier to reorganize the table; for example, a variant of CLUSTER that
doesn't block concurrent DML would be useful.  However, that's really
only a stopgap, because it only makes it easier to salvage a table
where internal free space has become badly fragmented. What we want to
do is prevent free space from becoming fragmented in the first place.
Most other relational database systems do this by placing old row
versions in a separate data structure from the table itself where they
can easily be removed when no longer needed.  A row version becomes
"old" when it is updated or deleted; the "new" version enters the main
heap while the "old" version moves elsewhere.

To implement this in PostgreSQL, I believe we would need support for
UNDO.  Most database systems have both REDO -- what PostgreSQL
typically calls the write-ahead log (WAL) -- and UNDO.  REDO is
responsible for restoring physical consistency after a database crash,
while UNDO is responsible for reversing the effects of aborted
transactions.  When a transaction performs an operation (DO), it also
writes it to the write-ahead log (REDO) and records the information
needed to reverse it (UNDO).  If the transaction aborts, UNDO is used
to reverse all changes made by the transaction.  Both the actions
performed (DO) and the UNDO created for those actions are protected by
REDO.  Thus, UNDO must be written to disk at checkpoint time; UNDO
since the most recent checkpoint prior to a crash is regenerated by
WAL replay (REDO).  So, after a crash, we first replay WAL (REDO) and
then reverse out any transactions implicitly aborted by the crash
(UNDO). This basic DO-UNDO-REDO protocol has been well-understood for
decades.  It would likely be worthwhile to implement DO-UNDO-REDO in
PostgreSQL independently of any design for reducing bloat, because it
would provide a systematic framework for handling cleanup actions that
must be performed at the end of recovery. For example, if a
transaction creates a table and, while that transaction is still in
progress, there is an operating system or PostgreSQL crash, the
storage used by the table is permanently leaked.  This could be fixed
by having the operation that creates the table also emit UNDO which
would unlink the backing storage in the event of an abort.

We could have a single shared undo log to which all transactions
write, but the insertion point would almost certainly become a point
of contention. Moreover, it's not very desirable to interleave UNDO
from different transactions, because it makes cleanup difficult.
Suppose one transaction aborts while another continues to run; if
their UNDO records are interleaved, it will be difficult to recover
any space if one set of UNDO records can be discarded but the other
must be kept.  Instead, it seems best to have a separate UNDO log for
each transaction.  Each of these UNDO logs individually will be
written sequentially and will consist of a series of variable-length
records.  When a particular UNDO log is no longer needed, we can
remove the whole thing at once.  Unless the UNDO log spilled to disk
because it was large or because of a checkpoint during the
transaction, this is just a matter of deallocating or recycling
whatever shared memory we're using to store it; otherwise, we'll need
to remove or recycle the file, or the portion of a file, that was used
to persist it on disk.
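
As a purely illustrative aid, here is a toy C sketch of the shape such a
per-transaction undo log could take: records are appended sequentially, each
record describes how to reverse one action, and the whole log is discarded as a
unit when it is no longer needed. Every name and field below is hypothetical;
none of this is code from an actual proposal.

#include <stdint.h>
#include <stddef.h>

/* Kinds of actions whose effects an undo record knows how to reverse. */
typedef enum UndoRecType
{
    UNDO_INSERT,            /* remove the newly inserted row version */
    UNDO_UPDATE,            /* put back the prior row version */
    UNDO_DELETE,            /* resurrect the deleted row version */
    UNDO_CREATE_STORAGE     /* unlink relation storage created by this xact */
} UndoRecType;

/* One variable-length record; payload (e.g. old tuple image) follows it. */
typedef struct UndoRecord
{
    UndoRecType type;
    uint32_t    relation_id;    /* which relation was touched */
    uint32_t    block;          /* which block within it */
    uint16_t    offset;         /* which item on that block */
    uint16_t    payload_len;    /* bytes of payload following the header */
} UndoRecord;

/*
 * One log per transaction: records are written sequentially at insert_off,
 * and the whole buffer is recycled at once rather than record by record.
 */
typedef struct UndoLog
{
    uint32_t    xid;            /* owning transaction */
    size_t      insert_off;     /* next append position within data */
    size_t      capacity;       /* size of data; may spill to a file */
    char       *data;           /* in-memory buffer, or a mapped file */
} UndoLog;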

Once we have UNDO, we could create a new type of heap where we emit
UNDO for each INSERT, UPDATE, or DELETE.  For an INSERT, we will
insert the new row and emit UNDO which will remove it.  For a
DELETE, we will immediately remove the old row version but 

Re: [HACKERS] Parallel bitmap heap scan

2016-11-22 Thread Robert Haas
On Wed, Oct 19, 2016 at 11:53 AM, Dilip Kumar  wrote:
> I found one defect in v2 patch, that I induced during last rebasing.
> That is fixed in v3.

So, I had a brief look at this tonight.  This is not a full review,
but just some things I noticed:

+ *Update snpashot info in heap scan descriptor.

Typo.  Also, why should we have a function for this at all?  And if we
do have a function for this, why should it have "bm" in the name when
it's stored in heapam.c?

+ *[PARALLEL BITMAP HEAP SCAN ALGORITHM]
+ *
+ *#1. Shared TIDBitmap creation and initialization
+ *a) First worker to see the state as parallel bitmap info as
+ *PBM_INITIAL become leader and set the state to PBM_INPROGRESS
+ *All other workers see the state as PBM_INPROGRESS will wait for
+ *leader to complete the TIDBitmap.
+ *
+ *Leader Worker Processing:
+ *(Leader is responsible for creating shared TIDBitmap and create
+ *shared page and chunk array from TIDBitmap.)
+ *1) Create TIDBitmap using DHT.
+ *2) Begin Iterate: convert hash table into shared page and chunk
+ *array.
+ *3) Restore local TIDBitmap variable information into
+ *ParallelBitmapInfo so that other worker can see those.
+ *4) set state to PBM_FINISHED.
+ *5) Wake up other workers.
+ *
+ *Other Worker Processing:
+ *1) Wait until leader create shared TIDBitmap and shared page
+ *and chunk array.
+ *2) Attach to shared page table, copy TIDBitmap from
+ *ParallelBitmapInfo to local TIDBitmap, we copy this to local
+ *TIDBitmap so that next level processing can read information
+ *same as in non parallel case and we can avoid extra changes
+ *in code.
+ *
+ *# At this level TIDBitmap is ready and all workers are awake #
+ *
+ *#2. Bitmap processing (Iterate and process the pages).
+ *. In this phase each worker will iterate over page and
chunk array and
+ *select heap pages one by one. If prefetch is enable then there will
+ *be two iterator.
+ *. Since multiple worker are iterating over same page and chunk array
+ *we need to have a shared iterator, so we grab a spin lock and iterate
+ *within a lock.

The formatting of this comment is completely haphazard.  "Leader
worker" is not a term that has any meaning.  A given backend involved
in a parallel query is either a leader or a worker, not both.

+/* reset parallel bitmap scan info, if present */
+if (node->parallel_bitmap)
+{
+ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
+
+pbminfo->state = PBM_INITIAL;
+pbminfo->tbmiterator.schunkbit = 0;
+pbminfo->tbmiterator.spageptr = 0;
+pbminfo->tbmiterator.schunkptr = 0;
+pbminfo->prefetch_iterator.schunkbit = 0;
+pbminfo->prefetch_iterator.spageptr = 0;
+pbminfo->prefetch_iterator.schunkptr = 0;
+pbminfo->prefetch_pages = 0;
+pbminfo->prefetch_target = -1;
+}

This is obviously not going to work in the face of concurrent
activity.  When we did Parallel Seq Scan, we didn't make any changes
to the rescan code at all.  I think we are assuming that there's no
way to cause a rescan of the driving table of a parallel query; if
that's wrong, we'll need some fix, but this isn't it.  I would just
leave this out.

+static bool
+pbms_is_leader(ParallelBitmapInfo *pbminfo)

I think you should see if you can use Thomas Munro's barrier stuff for
this instead.

+SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
+
+shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
+
+node->parallel_bitmap = pbminfo;
+snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
+
+heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);

This doesn't make any sense.  You serialize the snapshot from the
estate, then restore it, then shove it into the scan descriptor. But
presumably that's already the snapshot the scan descriptor is using.
The workers need to do this, perhaps, but not the leader!

+dht_parameters params = {0};

Not PostgreSQL style.

+#define TBM_IS_SHARED(tbm) (tbm)->shared

Seems pointless.

+boolshared;/* need to build shared tbm if set*/

Style.

+params.tranche_id = LWLockNewTrancheId();

You absolutely, positively cannot burn through tranche IDs like this.

+if (tbm->shared_pagetable)
+dht_detach(tbm->shared_pagetable);

Hmm, would we leak references if we errored out?

@@ -47,7 +47,6 @@ typedef enum

 static List *translate_sub_tlist(List *tlist, int relid);

-
 /*
  *MISC. PATH UTILITIES
  */

Useless whitespace 

Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support

2016-11-22 Thread Tom Lane
Haribabu Kommi  writes:
> On Wed, Nov 23, 2016 at 1:42 AM, Tom Lane  wrote:
>> The precedent of int4/int8/float4/float8 is that SQL data types should
>> be named after their length in bytes.  So I'd be inclined to call this
>> "macaddr8" not "macaddr64".  That would suggest taking the simple
>> approach of always storing values in the 8-byte format, rather than
>> dealing with the complexities of having two formats internally, two
>> display formats, etc.

> Do you prefer automatic conversion from 6-byte to 8-byte MAC addresses?
> That way the new datatype takes extra space. Is that acceptable for the
> new datatype?

Well, I think the space savings would be pretty illusory.  If we use a
varlena datatype, then old-style MAC addresses would take 7 bytes and
new-style would take 9.  That's not much of an improvement over always
taking 8 bytes.  What's worse, if the next field has an alignment
requirement more than char, is that it's really 8 bytes and 12 bytes (or
16!), making this strictly worse than a fixed-length-8-bytes approach.

As against that, if we use a varlena type then we'd have some protection
against the day when the MAC people realize they were still being
short-sighted and go up to 10 or 12 or 16 bytes.  But even if that happens
while Postgres is still in use, I'm not sure that we'd choose to make use
of the varlena aspect rather than invent a third datatype to go with that
new version of the standard.  Per the discussion in this thread, varlena
storage in itself doesn't do very much for the client-side compatibility
issues.  Making a new datatype with a new, well-defined I/O behavior
ensures that applications don't get blindsided by a new behavior they're
not ready for.

In short, I'm leaning to having just a fixed-length-8-byte implementation.
This may seem like learning nothing from our last go-round, but the
advantages of varlena are very far in the hypothetical future, and the
disadvantages are immediate.

Also, if we define macaddr as "always 6 bytes" and macaddr8 as "always 8
bytes", then there's a very simple way for users to widen an old-style
address to 8 bytes or convert it back to the 6-byte format: just cast
to the other datatype.  If the new macaddr type can store both widths
then you need to invent at least one additional function to provide
those behaviors.

regards, tom lane




Re: [HACKERS] WIP: Barriers

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 4:42 PM, Thomas Munro
 wrote:
> On Tue, Nov 1, 2016 at 5:03 PM, Thomas Munro
>> Here's a new version which is rebased and adds support for passing
>> wait_event through to pg_stat_activity.
>
> Here's a version updated for the new condition variables interface
> which has just been committed as e8ac886c.  Some comments improved.

The code here looks OK.  A few thoughts:

- I'm a little unsure whether it's a good idea to remove the existing
barrier.h and immediately add a new barrier.h that does something
totally different.  It's tempting to try to find another name for
these things just to avoid that.  Then again, I bet there's not much
non-core code that is relying on the existing barrier.h, so maybe it's
OK.  On the third hand, the risk of confusing developers may be
non-nil even if we don't confuse any code.  I think I'll go commit
remove-useless-barrier-v4.patch now and see if anyone shows up to
complain...

- Should parallel bitmap heap scan be using this to decide who runs
the subplan?  It looks like it would work for what is needed in that
case, and it looks like it would be simpler (and probably more
correct) than what's there now.  PBMState could go away completely, I
think.
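
For what it's worth, a tiny sketch of the pattern being suggested, assuming
barrier semantics along the lines of the patch (a phase-advancing wait that
returns true in exactly one participant), and assuming all participants attach
before the first wait. The field and helper names are made up for illustration;
this is not the patch's API as such.

    Barrier    *build_barrier = &shared->build_barrier;    /* assumed field */

    BarrierAttach(build_barrier);
    if (BarrierArriveAndWait(build_barrier, 0))
    {
        /* Exactly one participant is chosen here and builds the bitmap. */
        build_shared_tidbitmap(node);       /* hypothetical helper */
    }
    /* Wait again so nobody starts scanning before the bitmap is published. */
    BarrierArriveAndWait(build_barrier, 0);
    BarrierDetach(build_barrier);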

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] btree_gin and btree_gist for enums

2016-11-22 Thread Andrew Dunstan
I won't have time to fix this before the end of the Commitfest


Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support

2016-11-22 Thread Haribabu Kommi
On Wed, Nov 23, 2016 at 1:42 AM, Tom Lane  wrote:

> Haribabu Kommi  writes:
> > Any suggestions for the name to be used for the new datatype the can
> > work for both 48 and 64 bit MAC addresses?
>
> The precedent of int4/int8/float4/float8 is that SQL data types should
> be named after their length in bytes.  So I'd be inclined to call this
> "macaddr8" not "macaddr64".  That would suggest taking the simple
> approach of always storing values in the 8-byte format, rather than
> dealing with the complexities of having two formats internally, two
> display formats, etc.
>
> > It is possible to represent a 48 bit MAC address as 64 bit MAC address
> > by adding reserved bytes in the middle as follows.
> > 01-01-01-01-01-01::macaddr => 01-01-01-FF-FE-01-01-01::newmacaddr
>
> Check.  So we could accept 6-byte addresses on input and perform
> that conversion automatically.
>

Do you prefer automatic conversion from 6-byte to 8-byte MAC addresses?
That way the new datatype takes extra space. Is that acceptable for the
new datatype?



> > While comparing a 48 bit MAC address with 64 bit MAC address, Ignore
> > the two bytes if the contents in those bytes are reserved bytes.
>
> Um ... I don't follow.  Surely these must compare different:
>
> 01-01-01-FF-FE-01-01-01
> 01-01-01-FF-0E-01-01-01
>

Yes, that's correct. Both the above MAC addresses are different.

Sorry for not providing more details.

The new macaddr8 datatype can accept both 6-byte and 8-byte MAC addresses.
Let's assume the data in the table for the macaddr8 column is as follows:

row1 = 01-01-01-02-02-02
row2 = 01-01-01-FF-FE-02-02-02

The MAC address is the same; it is just represented in two forms, one
6-byte and the other 8-byte.

What I am suggesting is that we can treat both rows' data as the same,
because it is only a difference of representation.

To do that, when comparing two MAC addresses of 6-byte and 8-byte length,
some special handling is added to check for the reserved bytes in the
8-byte MAC address, and the comparison is carried out based on that.
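
To make the intended mapping concrete, here is a small standalone C sketch of
the FF-FE expansion and of a comparison that treats a 6-byte address and its
8-byte expansion as the same address. It only illustrates the rule described
above; it is not code from the patch.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Expand a 6-byte MAC address to 8 bytes by inserting FF-FE in the middle. */
static void
mac6_to_mac8(const uint8_t in[6], uint8_t out[8])
{
    memcpy(out, in, 3);         /* first three bytes (OUI) */
    out[3] = 0xFF;              /* reserved filler bytes */
    out[4] = 0xFE;
    memcpy(out + 5, in + 3, 3); /* last three bytes */
}

/* Compare two stored values of length 6 or 8 bytes under that rule. */
static bool
mac_addresses_equal(const uint8_t *a, int alen, const uint8_t *b, int blen)
{
    uint8_t     ca[8];
    uint8_t     cb[8];

    if (alen == 6)
        mac6_to_mac8(a, ca);
    else
        memcpy(ca, a, 8);

    if (blen == 6)
        mac6_to_mac8(b, cb);
    else
        memcpy(cb, b, 8);

    return memcmp(ca, cb, 8) == 0;
}

Under this rule row1 and row2 above compare equal, while Tom's earlier pair
(FF-FE versus FF-0E in the middle) still compares unequal.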

> The new datatype can store directly whatever is the input is, like 48 bit
> > or 64 bit. The same data is sent over the wire, whether the reserved
> > bytes are present or not?
>
> I'd just send all 8 bytes.  What the client wants to do with values
> that could be mapped back into the 6-byte space is its business.
>

As the new datatype can store both 6-byte and 8-byte MAC addresses as-is,
when sending this data to the client it sends 6 or 8 bytes, based on the
data that is stored.

If we go with always storing 8 bytes, then sending all 8 bytes is fine.

Regards,
Hari Babu
Fujitsu Australia


[HACKERS] Re: [COMMITTERS] pgsql: Doc: improve documentation about composite-value usage.

2016-11-22 Thread David G. Johnston
On Tue, Nov 22, 2016 at 3:56 PM, Tom Lane  wrote:

> Doc: improve documentation about composite-value usage.
>
> Create a section specifically for the syntactic rules around whole-row
> variable usage, such as expansion of "foo.*".  This was previously
> documented only haphazardly, with some critical info buried in
> unexpected places like xfunc-sql-composite-functions.  Per repeated
> questions in different mailing lists.
>

Tom,

I found it notable that you chose to introduce the OFFSET 0 hack instead
of writing a LATERAL query in the "optimization failure" example.

SELECT (m).* FROM (SELECT myfunc(x) AS m FROM some_table OFFSET 0) ss;

instead of

SELECT (m).* FROM some_table, LATERAL myfunc(some_table.x) m

Skipping or having a different example in 9.2 seems worth it in order to
introduce the now-preferred way of writing such queries.

Or maybe in addition, so that some familiarity with the hack is gained
should the reader encounter it in the wild.

David J.


Re: [HACKERS] commitfest 2016-11 status summary

2016-11-22 Thread Haribabu Kommi
Hi All,


The commitfest status summary after three weeks of progress:

Needs review: 59
Waiting on author: 28
Ready for Committer: 16
Committed: 35
Moved to next CF: 0
Rejected: 5
Returned with feedback: 4
TOTAL: 147


Overall progress of completion - 29%
week-1 progress of completion - 9%
week-2 progress of completion - 3%
week-3 progress of completion - 3%

In the last two weeks the commitfest progress has been slow compared to the
first week.

There are many patches in the "ready for committer" state; committers, please
take a look at those patches and provide your updates.

I sent reminder mails to the reviewers who haven't responded on the
corresponding threads. Please share your reviews; this will help the
commitfest run more smoothly.

There are still some patches with no reviewer assigned. Please consider
reviewing some of them. A review can cover code, testing, etc., whichever is
comfortable for you. We need your help.

I updated the status of some patches to "waiting on author" to reflect their
actual state, based on my understanding of recent communications on those
threads. If anyone feels a status is wrong, please update it accordingly.


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] delta relations in AFTER triggers

2016-11-22 Thread Jim Nasby

On 11/21/16 3:49 AM, Craig Ringer wrote:

> After going through that experience, I now agree with Kevin: an
> interface where a new SPI interface lets PLs push a named tuplestore
> into the SPI connection to make it available to SQL seems like the
> simplest and tidiest way.
>
> That also offers a handy step on the path toward table-valued
> variables and pipelined functions, both of which would be _really_
> nice for PL/PgSQL users.


FWIW, I expect at some point we'd like the ability to index tuplestores 
as well.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-11-22 Thread Michael Paquier
On Wed, Nov 23, 2016 at 5:48 AM, Michael Banck
 wrote:
> On Fri, Nov 11, 2016 at 07:42:05PM +0100, Andreas Karlsson wrote:
>> On 11/11/2016 07:40 PM, Andreas Karlsson wrote:
>> >Here is a new version of the patch with the only differences;
>> >
>> >1) The SSL tests have been changed to use reload rather than restart
>> >
>> >2) Rebased on master
>>
>> And here with a fix to a comment.
>
> Michael, as you brought up the issues with the SSL tests, do you plan to
> review the latest version (and add yourself as reviewer)?

Yes. I need two days.
-- 
Michael




Re: [HACKERS] WIP: Barriers

2016-11-22 Thread Thomas Munro
On Tue, Nov 1, 2016 at 5:03 PM, Thomas Munro
 wrote:
> On Thu, Aug 18, 2016 at 1:55 PM, Thomas Munro
>  wrote:
>> On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas  wrote:
>>> If we're going to remove barrier.h, I think that should be a separate
>>> commit from creating a new barrier.h.
>>
>> OK.  Here's a patch to remove the old header, and the v2 barrier patch
>> which adds phases and attach/detach.  As before, it depends on
>> condition-variable-v2.patch.
>
> Here's a new version which is rebased and adds support for passing
> wait_event through to pg_stat_activity.

Here's a version updated for the new condition variables interface
which has just been committed as e8ac886c.  Some comments improved.

-- 
Thomas Munro
http://www.enterprisedb.com


remove-useless-barrier-header-v4.patch
Description: Binary data


barrier-v4.patch
Description: Binary data



Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Alvaro Herrera
Andres Freund wrote:
> On 2016-11-22 16:15:58 -0500, Tom Lane wrote:

> > Maybe a workable compromise would be to leave the file present, and have
> > the stats collector re-write it every (say) five minutes.  Then I'd be
> > okay with having an immediate shutdown skip writing the file; you'd be
> > losing up to five minutes' worth of activity, but not going completely
> > nuts.  So the stats collector's normal activities would include writing
> > the temp file on-demand and the permanent file on a timed cycle.
> >
> > The other components of the fix (deleting on PITR rewind or stats
> > collector crash) would remain the same.

+1

> It could even make sense to WAL log the contents of the stats file at
> checkpoint (or similar) time. Then it'd be sane after crash recovery,
> including starting from an old base backup.  Losing the information
> about what to vacuum / analyze after a failover is quite problematic...

An idea worth considering.  This would also mean the file is valid in a
standby -- the lack of the file is currently a problem if you promote
the standby, as I recall.  But the file is perhaps too large to write to
WAL on every checkpoint.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Andres Freund
On 2016-11-22 16:15:58 -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Well, the problem is that the stats data is not on disk while the system
> > is in operation, as far as I recall -- it's only in the collector's
> > local memory.  On shutdown we tell it to write it down to a file, and on
> > startup we tell it to read it from the file and then delete it.  I think
> > the rationale for this is to avoid leaving a file with stale data on
> > disk while the system is running.
>
> Maybe a workable compromise would be to leave the file present, and have
> the stats collector re-write it every (say) five minutes.  Then I'd be
> okay with having an immediate shutdown skip writing the file; you'd be
> losing up to five minutes' worth of activity, but not going completely
> nuts.  So the stats collector's normal activities would include writing
> the temp file on-demand and the permanent file on a timed cycle.
>
> The other components of the fix (deleting on PITR rewind or stats
> collector crash) would remain the same.

It could even make sense to WAL log the contents of the stats file at
checkpoint (or similar) time. Then it'd be sane after crash recovery,
including starting from an old base backup.  Losing the information
about what to vacuum / analyze after a failover is quite problematic...

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 4:15 PM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Well, the problem is that the stats data is not on disk while the system
>> is in operation, as far as I recall -- it's only in the collector's
>> local memory.  On shutdown we tell it to write it down to a file, and on
>> startup we tell it to read it from the file and then delete it.  I think
>> the rationale for this is to avoid leaving a file with stale data on
>> disk while the system is running.
>
> Maybe a workable compromise would be to leave the file present, and have
> the stats collector re-write it every (say) five minutes.  Then I'd be
> okay with having an immediate shutdown skip writing the file; you'd be
> losing up to five minutes' worth of activity, but not going completely
> nuts.  So the stats collector's normal activities would include writing
> the temp file on-demand and the permanent file on a timed cycle.
>
> The other components of the fix (deleting on PITR rewind or stats
> collector crash) would remain the same.

Sounds great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Tom Lane
Alvaro Herrera  writes:
> Well, the problem is that the stats data is not on disk while the system
> is in operation, as far as I recall -- it's only in the collector's
> local memory.  On shutdown we tell it to write it down to a file, and on
> startup we tell it to read it from the file and then delete it.  I think
> the rationale for this is to avoid leaving a file with stale data on
> disk while the system is running.

Maybe a workable compromise would be to leave the file present, and have
the stats collector re-write it every (say) five minutes.  Then I'd be
okay with having an immediate shutdown skip writing the file; you'd be
losing up to five minutes' worth of activity, but not going completely
nuts.  So the stats collector's normal activities would include writing
the temp file on-demand and the permanent file on a timed cycle.

The other components of the fix (deleting on PITR rewind or stats
collector crash) would remain the same.
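A minimal sketch of that scheduling idea, in plain C rather than actual
pgstat code (the file paths, the interval, and the stubbed writer are all
placeholders):

#include <stdbool.h>
#include <time.h>

#define PERMANENT_REWRITE_INTERVAL (5 * 60)     /* seconds; placeholder */

/* placeholder: real code would serialize the in-memory counters here */
static void
write_stats_file(const char *path)
{
    (void) path;
}

static time_t last_permanent_write = 0;

/*
 * Called from the collector's main loop.  The temp file is written only
 * when a backend has asked for fresh stats; the permanent file is
 * refreshed at most once per interval, so an immediate shutdown loses at
 * most that much history.
 */
static void
maybe_write_stats(bool temp_file_requested)
{
    time_t now = time(NULL);

    if (temp_file_requested)
        write_stats_file("pg_stat_tmp/global.stat");    /* placeholder path */

    if (now - last_permanent_write >= PERMANENT_REWRITE_INTERVAL)
    {
        write_stats_file("pg_stat/global.stat");        /* placeholder path */
        last_permanent_write = now;
    }
}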

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dblink get_connect_string() passes FDW option "updatable" to the connect string, connection fails.

2016-11-22 Thread Corey Huinker
On Tue, Nov 22, 2016 at 3:29 PM, Tom Lane  wrote:

> Corey Huinker  writes:
> > In 9.4, I encountered a complaint about flex 2.6.0. After a little
> research
> > it seems that a fix for that made it into versions 9.3+, but not 9.4.
>
> Er ... what?  See 1ba874505.
>
> regards, tom lane
>

Maybe git fetch didn't pick up that change.  Odd that it did pick it
up on all the other REL_* branches.  Feel free to disregard that file.

I re-ran git pulls from my committed local branches to the corresponding
REL9_x_STABLE branch, and encountered no conflicts, so the 0001.dblink.*
patches should still be good.


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 3:52 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> I agree.  However, in many cases, the major cost of a fast shutdown is
>> getting the dirty data already in the operating system buffers down to
>> disk, not in writing out shared_buffers itself.  The latter is
>> probably a single-digit number of gigabytes, or maybe double-digit.
>> The former might be a lot more, and the write of the pgstat file may
>> back up behind it.  I've seen cases where an 8kB buffered write from
>> Postgres takes tens of seconds to complete because the OS buffer cache
>> is already saturated with dirty data, and the stats files could easily
>> be a lot more than that.
>
> I think this is mostly FUD, because we don't fsync the stats files.  Maybe
> we should, but we don't today.  So even if we have managed to get the
> system into a state where physical writes are heavily backlogged, that's
> not a reason to assume that the stats collector will be unable to do its
> thing promptly.  All it has to do is push a relatively small amount of
> data into kernel buffers.

I don't believe that's automatically fast, if we're bumping up against
dirty_ratio.  However, suppose you're right.  Then what prompted the
original complaint?  The OP said "The problem here is that postmaster
took as long as 15 seconds to terminate after it had detected a
crashed backend."  It clearly WASN'T an indefinite hang as might have
occurred with the malloc-lock problem for which we implemented the
SIGKILL stuff.  So something during shutdown took a long time, but not
forever.  There's no convincing evidence I've seen that it has to have
been this particular thing, but I find it plausible, because normal
backends bail out without doing much of anything, and here we have a
process that is trying to continue doing work after having received
SIGQUIT.  If not this, then what?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Tom Lane
Andres Freund  writes:
> But I'm a bit confused too - does this make any sort of difference?
> Because the startup path for crash recovery is like this:
>   pgstat_reset_all();
> so it seems quite inconsequential whether we write out pgstat, because
> we're going to nuke it either way after an immediate shutdown?

The discussion is exactly about whether we shouldn't get rid of that,
rather than doing what's proposed in this patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-22 Thread Robert Haas
On Sun, Nov 13, 2016 at 9:02 PM, Tsunakawa, Takayuki
 wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
>> Great, committed.  There's still potentially more work to be done here,
>> because my patch omits some features that were present in Victor's original
>> submission, like setting the failover timeout, optionally randomizing the
>> order of the hosts, and distinguishing between master and standby servers;
>> Victor, or anyone, please feel free to submit separate patches for those
>> things.
>
> The attached patch fixes some bugs and makes a clarification in the docs.
> Could you check and test the authentication stuff, as I don't have an
> environment at hand?

I don't have an environment at hand, either, but I think your fixes
are correct, so I have committed them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Andres Freund
Hi,

On 2016-11-22 15:49:27 -0500, Robert Haas wrote:
> I think you are almost right.  When the server is running, there are
> files in pg_stat_tmp but not pg_stat; when it is shut down, there are
> files in pg_stat but not pg_stat_tmp.  Of course the data can never be
> ONLY in the collector's backend-local memory because then nobody else
> could read it.

> I don't actually really understand the reason for this distinction.

pg_stat_tmp commonly is placed on tmpfs/a ramdisk.


But I'm a bit confused too - does this make any sort of difference?
Because the startup path for crash recovery is like this:

void
StartupXLOG(void)
{
...
/* REDO */
if (InRecovery)
{
...
/*
 * Reset pgstat data, because it may be invalid after recovery.
 */
pgstat_reset_all();

so it seems quite inconsequential whether we write out pgstat, because
we're going to nuke it either way after an immediate shutdown?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Tom Lane
Robert Haas  writes:
> I agree.  However, in many cases, the major cost of a fast shutdown is
> getting the dirty data already in the operating system buffers down to
> disk, not in writing out shared_buffers itself.  The latter is
> probably a single-digit number of gigabytes, or maybe double-digit.
> The former might be a lot more, and the write of the pgstat file may
> back up behind it.  I've seen cases where an 8kB buffered write from
> Postgres takes tens of seconds to complete because the OS buffer cache
> is already saturated with dirty data, and the stats files could easily
> be a lot more than that.

I think this is mostly FUD, because we don't fsync the stats files.  Maybe
we should, but we don't today.  So even if we have managed to get the
system into a state where physical writes are heavily backlogged, that's
not a reason to assume that the stats collector will be unable to do its
thing promptly.  All it has to do is push a relatively small amount of
data into kernel buffers.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-11-22 Thread Michael Banck
Hi,

On Fri, Nov 11, 2016 at 07:42:05PM +0100, Andreas Karlsson wrote:
> On 11/11/2016 07:40 PM, Andreas Karlsson wrote:
> >Here is a new version of the patch with the only differences;
> >
> >1) The SSL tests have been changed to use reload rather than restart
> >
> >2) Rebased on master
> 
> And here with a fix to a comment.

Michael, as you brought up the issues with the SSL tests, do you plan to
review the latest version (and add yourself as reviewer)?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 3:34 PM, Alvaro Herrera
 wrote:
>> OK, that's possible, but I'm not sure.  I think there are two separate
>> issues here.  One is whether we should nuke the stats file on
>> recovery, and the other is whether we should force a final write of
>> the stats file before agreeing to an immediate shutdown.  It seems to
>> me that the first one affects whether all of the counters go to zero,
>> and the second affects whether we lose a small amount of data from
>> just prior to the shutdown.  Right now, we are doing the first, so the
>> second is a waste.  If we decide to start doing the first, we can
>> independently decide whether to also do the second.
>
> Well, the problem is that the stats data is not on disk while the system
> is in operation, as far as I recall -- it's only in the collector's
> local memory.  On shutdown we tell it to write it down to a file, and on
> startup we tell it to read it from the file and then delete it.  I think
> the rationale for this is to avoid leaving a file with stale data on
> disk while the system is running.

/me tests.

I think you are almost right.  When the server is running, there are
files in pg_stat_tmp but not pg_stat; when it is shut down, there are
files in pg_stat but not pg_stat_tmp.  Of course the data can never be
ONLY in the collector's backend-local memory because then nobody else
could read it.

I don't actually really understand the reason for this distinction.
If it's important not to lose the data, then the current system is the
worst of all possible worlds, and it would be better to have only one
file and atomically rename() a new one over top of it from time to
time.  If we did that and also committed the proposed patch, it would
be only slightly worse than if we did only that.  Wouldn't it?
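A minimal sketch of the write-then-rename() approach in plain POSIX C (the
paths are placeholders, and this is not pgstat code).  Since rename() is
atomic on POSIX filesystems, readers see either the old or the new complete
file, never a torn one:

#include <stdio.h>

/*
 * Write the stats snapshot to a temporary file, then atomically replace
 * the permanent file.  Whether to fsync before the rename is a separate
 * question -- as noted upthread, the stats files are not fsync'd today.
 */
static int
replace_stats_file(const char *contents)
{
    const char *tmp_path = "pg_stat/global.stat.tmp";   /* placeholder */
    const char *final_path = "pg_stat/global.stat";     /* placeholder */
    FILE       *fp = fopen(tmp_path, "w");

    if (fp == NULL)
        return -1;
    if (fputs(contents, fp) == EOF)
    {
        fclose(fp);
        remove(tmp_path);
        return -1;
    }
    if (fclose(fp) != 0)
    {
        remove(tmp_path);
        return -1;
    }
    /* atomic replacement: old or new file, never a partial one */
    if (rename(tmp_path, final_path) != 0)
    {
        remove(tmp_path);
        return -1;
    }
    return 0;
}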

>> > Those writes are slow because of the concurrent activity.  If all
>> > backends just throw their hands in the air, no more writes come from
>> > them, so the OS is going to finish the writes pretty quickly (or at
>> > least empty enough of the caches so that the pgstat data fits); so
>> > neither (1) nor (3) should be terribly serious.  I agree that (2) is a
>> > problem, but it's not a problem for everyone.
>>
>> If the operating system buffer cache doesn't contain much dirty data,
>> then I agree.  But if there is a large backlog of dirty data there, then
>> it might be quite slow.
>
> That's true, but if the system isn't crashing, then writing a bunch of
> pages would make room for the pgstat data to be written to the OS, which
> is enough (we request only a write, not a flush, as I recall).  So we
> don't need to wait for a very long period.

I'm not sure what you are saying here.  Of course, if the OS writes
pages to disk then there will be room in the buffer cache for more
dirty pages.  The issue is whether this will unduly delay shutdown.

>> > A fast shutdown is not all that fast -- it needs to write the whole
>> > contents of shared buffers down to disk, which may be enormous.
>> > Millions of times bigger than pgstat data.  So a fast shutdown is
>> > actually very slow in a large machine.  An immediate shutdown, even if
>> > it writes pgstat data, is still going to be much smaller in terms of
>> > what is written.
>>
>> I agree.  However, in many cases, the major cost of a fast shutdown is
>> getting the dirty data already in the operating system buffers down to
>> disk, not in writing out shared_buffers itself.  The latter is
>> probably a single-digit number of gigabytes, or maybe double-digit.
>> The former might be a lot more, and the write of the pgstat file may
>> back up behind it.  I've seen cases where an 8kB buffered write from
>> Postgres takes tens of seconds to complete because the OS buffer cache
>> is already saturated with dirty data, and the stats files could easily
>> be a lot more than that.
>
> In the default config, background flushing is invoked when memory is 10%
> dirty (dirty_background_ratio); foreground flushing is forced when
> memory is 40% dirty (dirty_ratio).  That means the pgstat process can
> dirty 30% additional memory before being forced to perform flushing.

There's absolutely no guarantee that we aren't already hard up against
that 40% threshold - or whatever it is.  Backends write data
all the time, and flushes only happen at checkpoint time.  Indeed, I'd
argue that if somebody is performing an immediate shutdown, the most
likely reason is that a fast shutdown is too slow.  And the most
likely reason for that is that the operating system buffer cache is
full of dirty data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: function xmltable

2016-11-22 Thread Alvaro Herrera
I found the whole TableExprGetTupleDesc() function a bit odd in
nodeFuncs.c, so I renamed it to ExecTypeFromTableExpr() and moved it to
execTuples.c -- but only because that's where ExecTypeFromTL and others
already live.  I would have liked to move it to tupdesc.c instead, but
it requires knowledge of executor nodes, which is probably the reason
that ExecTypeFromTL is in execTuples.  I think we'd eat that bit of
ugliness only because we're not the first.  But anyway I quickly ran
into another problem.

I noticed that ExecTypeFromTableExpr is being called from the transform
phase, which is much earlier than the executor.  I noticed because of
the warning that the above movement added to nodeFuncs.c,
src/backend/nodes/nodeFuncs.c:509:5: warning: implicit declaration of function 
'ExecTypeFromTableExpr' [-Wimplicit-function-declaration]

so I thought, hm, is it okay to have parse analysis run an executor
function?  (I suppose this is the reason you put it in nodeFuncs in the
first place).  For fun, I tried this query under GDB, with a breakpoint
on exprTypmod():

SELECT X.* 
FROM emp, 
XMLTABLE ('//depts/dept/employee' passing doc 
   COLUMNS 
   empID     INTEGER PATH '@id',
   firstname int PATH 'name/first',
   lastname  VARCHAR(25) PATH 'name/last') AS X;

and sure enough, the type is resolved during parse analysis:

Breakpoint 1, exprTypmod (expr=expr@entry=0x1d23ad8)
at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
283 if (!expr)
(gdb) print *expr
$2 = {type = T_TableExpr}
(gdb) bt
#0  exprTypmod (expr=expr@entry=0x1d23ad8)
at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
#1  0x0080c500 in get_expr_result_type (expr=0x1d23ad8, 
resultTypeId=0x7ffd482bfdb4, resultTupleDesc=0x7ffd482bfdb8)
at /pgsql/source/master/src/backend/utils/fmgr/funcapi.c:247
#2  0x0056de1b in expandRTE (rte=rte@entry=0x1d6b800, rtindex=2, 
sublevels_up=0, location=location@entry=7, 
include_dropped=include_dropped@entry=0 '\000', 
colnames=colnames@entry=0x7ffd482bfe10, colvars=0x7ffd482bfe18)
at /pgsql/source/master/src/backend/parser/parse_relation.c:2052
#3  0x0056e131 in expandRelAttrs (pstate=pstate@entry=0x1d238a8, 
rte=rte@entry=0x1d6b800, rtindex=, 
sublevels_up=, location=location@entry=7)
at /pgsql/source/master/src/backend/parser/parse_relation.c:2435
#4  0x0056fa64 in ExpandSingleTable (pstate=pstate@entry=0x1d238a8, 
rte=rte@entry=0x1d6b800, location=7, 
make_target_entry=make_target_entry@entry=1 '\001')
at /pgsql/source/master/src/backend/parser/parse_target.c:1266
#5  0x0057135b in ExpandColumnRefStar (pstate=pstate@entry=0x1d238a8, 
cref=0x1d22720, make_target_entry=make_target_entry@entry=1 '\001')
at /pgsql/source/master/src/backend/parser/parse_target.c:1158
#6  0x005716f9 in transformTargetList (pstate=0x1d238a8, 
targetlist=, exprKind=EXPR_KIND_SELECT_TARGET)

This seems fine I guess, and it seems to say that we ought to move the
code that generates the tupdesc back to parse analysis rather than the
executor.  Okay, fine.  But let's find a better place for it than nodeFuncs.

But if I move the XMLTABLE() call to the target list instead, the type
is resolved at planner time:

SELECT  
XMLTABLE ('/dept/employee' passing $$ 


Mary
Jones

415
905-403-6112
647-504-4546
64000

$$
   COLUMNS 
   empID     INTEGER PATH '@id',
   firstname varchar(4) PATH 'name/first',
   lastname  VARCHAR(25) PATH 'name/last') AS X;

Breakpoint 1, exprTypmod (expr=expr@entry=0x1d6bed8)
at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
283 if (!expr)
(gdb) bt
#0  exprTypmod (expr=expr@entry=0x1d6bed8)
at /pgsql/source/master/src/backend/nodes/nodeFuncs.c:283
#1  0x00654058 in set_pathtarget_cost_width (root=0x1d6bc68, 
target=0x1d6c728)
at /pgsql/source/master/src/backend/optimizer/path/costsize.c:4729
#2  0x0066c197 in grouping_planner (root=0x1d6bc68, 
inheritance_update=40 '(', inheritance_update@entry=0 '\000', 
tuple_fraction=0.01, tuple_fraction@entry=0)
at /pgsql/source/master/src/backend/optimizer/plan/planner.c:1745
#3  0x0066ef64 in subquery_planner (glob=glob@entry=0x1d6bbd0, 
parse=parse@entry=0x1d23818, parent_root=parent_root@entry=0x0, 
hasRecursion=hasRecursion@entry=0 '\000', 
tuple_fraction=tuple_fraction@entry=0)
at /pgsql/source/master/src/backend/optimizer/plan/planner.c:795
#4  0x0066fe5e in standard_planner (parse=0x1d23818, 
cursorOptions=256, boundParams=)
at /pgsql/source/master/src/backend/optimizer/plan/planner.c:307

This is surprising, but I'm not sure it's wrong.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list 

Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera
>  wrote:
> >> > Yes, I am, and I disagree with you.  The current decision on this point
> >> > was made ages ago, before autovacuum even existed let alone relied on
> >> > the stats for proper functioning.  The tradeoff you're saying you're
> >> > okay with is "we'll shut down a few seconds faster, but you're going
> >> > to have table bloat problems later because autovacuum won't know it
> >> > needs to do anything".  I wonder how many of the complaints we get
> >> > about table bloat are a consequence of people not realizing that
> >> > "pg_ctl stop -m immediate" is going to cost them.
> >>
> >> That would be useful information to have, but I bet the answer is "not
> >> that many".  Most people don't shut down their database very often;
> >> they're looking for continuous uptime.  It looks to me like autovacuum
> >> activity causes the statistics files to get refreshed at least once
> >> per autovacuum_naptime, which defaults to once a minute, so on the
> >> average we're talking about the loss of perhaps 30 seconds worth of
> >> statistics.
> >
> > I think you're misunderstanding how this works.  Losing that file
> > doesn't lose just the final 30 seconds worth of data -- it loses
> > *everything*, and every counter goes back to zero.  So it's not a few
> > parts-per-million, it loses however many millions there were.
> 
> OK, that's possible, but I'm not sure.  I think there are two separate
> issues here.  One is whether we should nuke the stats file on
> recovery, and the other is whether we should force a final write of
> the stats file before agreeing to an immediate shutdown.  It seems to
> me that the first one affects whether all of the counters go to zero,
> and the second affects whether we lose a small amount of data from
> just prior to the shutdown.  Right now, we are doing the first, so the
> second is a waste.  If we decide to start doing the first, we can
> independently decide whether to also do the second.

Well, the problem is that the stats data is not on disk while the system
is in operation, as far as I recall -- it's only in the collector's
local memory.  On shutdown we tell it to write it down to a file, and on
startup we tell it to read it from the file and then delete it.  I think
the rationale for this is to avoid leaving a file with stale data on
disk while the system is running.

> > Those writes are slow because of the concurrent activity.  If all
> > backends just throw their hands in the air, no more writes come from
> > them, so the OS is going to finish the writes pretty quickly (or at
> > least empty enough of the caches so that the pgstat data fits); so
> > neither (1) nor (3) should be terribly serious.  I agree that (2) is a
> > problem, but it's not a problem for everyone.
> 
> If the operating system buffer cache doesn't contain much dirty data,
> then I agree.  But if there is a large backlog of dirty data there, then
> it might be quite slow.

That's true, but if the system isn't crashing, then writing a bunch of
pages would make room for the pgstat data to be written to the OS, which
is enough (we request only a write, not a flush, as I recall).  So we
don't need to wait for a very long period.

> > A fast shutdown is not all that fast -- it needs to write the whole
> > contents of shared buffers down to disk, which may be enormous.
> > Millions of times bigger than pgstat data.  So a fast shutdown is
> > actually very slow in a large machine.  An immediate shutdown, even if
> > it writes pgstat data, is still going to be much smaller in terms of
> > what is written.
> 
> I agree.  However, in many cases, the major cost of a fast shutdown is
> getting the dirty data already in the operating system buffers down to
> disk, not in writing out shared_buffers itself.  The latter is
> probably a single-digit number of gigabytes, or maybe double-digit.
> The former might be a lot more, and the write of the pgstat file may
> back up behind it.  I've seen cases where an 8kB buffered write from
> Postgres takes tens of seconds to complete because the OS buffer cache
> is already saturated with dirty data, and the stats files could easily
> be a lot more than that.

In the default config, background flushing is invoked when memory is 10%
dirty (dirty_background_ratio); foreground flushing is forced when
memory is 40% dirty (dirty_ratio).  That means the pgstat process can
dirty 30% additional memory before being forced to perform flushing.
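For anyone who wants to check those thresholds on a particular machine, a
small Linux-only sketch (the /proc paths are the standard Linux ones; other
systems need a different mechanism):

#include <stdio.h>

/* Read a single integer value from a /proc file, or -1 on failure. */
static long
read_proc_value(const char *path)
{
    FILE *fp = fopen(path, "r");
    long  value = -1;

    if (fp != NULL)
    {
        if (fscanf(fp, "%ld", &value) != 1)
            value = -1;
        fclose(fp);
    }
    return value;
}

int
main(void)
{
    printf("dirty_background_ratio = %ld%%\n",
           read_proc_value("/proc/sys/vm/dirty_background_ratio"));
    printf("dirty_ratio = %ld%%\n",
           read_proc_value("/proc/sys/vm/dirty_ratio"));
    return 0;
}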

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dblink get_connect_string() passes FDW option "updatable" to the connect string, connection fails.

2016-11-22 Thread Tom Lane
Corey Huinker  writes:
> In 9.4, I encountered a complaint about flex 2.6.0. After a little research
> it seems that a fix for that made it into versions 9.3+, but not 9.4.

Er ... what?  See 1ba874505.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres 9.3 postgres_fdw ::LOG: could not receive data from client: Connection reset by peer

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 5:05 AM, Vladimir Svedov  wrote:
> Hi,
> Sorry - tried to reproduce on other machine and gather all statements. And
> failed
> Installed 9.3 (which has those symptoms) and still can't reproduce.
> Must be platform specific, not version

Probably the foreign server isn't configured properly, and points to a
host/port that resets the connection when you attempt to connect to it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding on standby

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer  wrote:
> On 22 November 2016 at 10:20, Craig Ringer  wrote:
>> I'm currently looking at making detection of replay conflict with a
>> slot work by separating the current catalog_xmin into two effective
>> parts - the catalog_xmin currently needed by any known slots
>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>> actually valid catalog_xmin where we know we haven't removed anything
>> yet.
>
> OK, more detailed plan.
>
> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
> are already held down by ProcArray's catalog_xmin. But that doesn't
> mean we haven't removed newer tuples from specific relations and
> logged that in xl_heap_clean, etc, including catalogs or user
> catalogs; it only means the clog still exists for those XIDs.

Really?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 4:15 AM, Amit Langote
 wrote:
>> The easiest thing to do might be to just enforce that all of the
>> partition key columns have to be not-null when the range-partitioned
>> table is defined, and reject any attempt to DROP NOT NULL on them
>> later.  That's probably better than shoehorning it into the table
>> constraint.
>
> Agreed that forcing range partitioning columns to be NOT NULL during table
> creation would be a better approach.  But then we would have to reject
> using expressions in the range partition key, right?

Why?

> As a result, one change became necessary: to how we flag individual range
> bound datum as infinite or not.  Previously, it was a regular Boolean
> value (either infinite or not) and to distinguish +infinity from
> -infinity, we looked at whether the bound is lower or upper (the lower
> flag).  Now, instead, the variable holding the status of individual range
> bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
> RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
> a bool.

You better not be using a bool to represent a ternary value!  Use an
enum for that -- or if in the system catalogs, a char.

> Upon encountering an infinite range bound datum, whether it's
> negative or positive infinity derives the comparison result.  Consider the
> following example:
>
> partition p1 from (1, unbounded) to (1, 1);
> partition p2 from (1, 1) to (1, 10);
> partition p3 from (1, 10) to (1, unbounded);
> partition p4 from (2, unbounded) to (2, 1);
> ... so on
>
> In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
> (1, +inf), so that tuple (1, 15) is assigned to the proper partition.

Right.
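To make that concrete, a sketch of the explicit enum plus the short-circuit
comparison it enables; the names below are illustrative only, not taken
from the actual patch:

/* Explicit three-way classification instead of overloading a bool. */
typedef enum RangeDatumContent
{
    RANGE_DATUM_FINITE = 0,
    RANGE_DATUM_NEG_INF,
    RANGE_DATUM_POS_INF
} RangeDatumContent;

/*
 * Compare one bound datum against a tuple's value: an infinite bound
 * decides the comparison by itself, so (1, -inf) < (1, 15) < (1, +inf)
 * falls out without ever comparing the second column's datums.
 */
static int
compare_bound_datum(RangeDatumContent content, int cmp_if_finite)
{
    if (content == RANGE_DATUM_NEG_INF)
        return -1;          /* bound sorts before every finite value */
    if (content == RANGE_DATUM_POS_INF)
        return 1;           /* bound sorts after every finite value */
    return cmp_if_finite;   /* fall back to the real datum comparison */
}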

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dblink get_connect_string() passes FDW option "updatable" to the connect string, connection fails.

2016-11-22 Thread Corey Huinker
>
>
>
>> It looks like this might be fairly easy to fix by having
>> get_connect_string() use is_valid_dblink_option() to check each
>> option name, and silently ignore options that are inappropriate.
>>
>
> From what I can tell, it is very straightforward; the context OIDs are set
> up just a few lines above where the new is_valid_dblink_option() calls
> would be.
>
> I'm happy to write the patch, for both v10 and any back-patches we feel
> are necessary. However, I suspect with a patch this trivial that reviewing
> another person's patch might be more work for a committer than just doing
> it themselves. If that's not the case, let me know and I'll get started.
>

Joe indicated that he wouldn't be able to get to the patch until this
weekend at the earliest, so I went ahead and made the patches on my own.

Nothing unusual to report for master, 9.6, 9.5, or 9.3. The patch is
basically the same for all of them and I was able to re-run the test script
at the beginning of the thread to ensure that the fix worked.

In 9.4, I encountered a complaint about flex 2.6.0. After a little research
it seems that a fix for that made it into versions 9.3+, but not 9.4. That
mini-patch is attached as well (0001.configure.94.diff). The dblink patch
for 9.4 was basically the same as the others.

The issue (no validation of connection string elements pulled from an FDW)
exists in 9.2; however, the only possible source of such options I know of
(postgres_fdw) does not exist there.  So I doubt we need to patch 9.2, but
it's trivial to do so if we want to.
diff --git a/configure b/configure
index 5204869..7da2dac 100755
--- a/configure
+++ b/configure
@@ -7166,7 +7166,7 @@ else
 echo '%%'  > conftest.l
 if $pgac_candidate -t conftest.l 2>/dev/null | grep FLEX_SCANNER >/dev/null 2>&1; then
   pgac_flex_version=`$pgac_candidate --version 2>/dev/null`
-  if echo "$pgac_flex_version" | sed 's/[.a-z]/ /g' | $AWK '{ if ($1 = 2 && $2 = 5 && $3 >= 31) exit 0; else exit 1;}'
+  if echo "$pgac_flex_version" | sed 's/[.a-z]/ /g' | $AWK '{ if ($1 == 2 && ($2 > 5 || ($2 == 5 && $3 >= 31))) exit 0; else exit 1;}'
   then
 pgac_cv_path_flex=$pgac_candidate
 break 2
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index af062fa..491dec8 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -40,6 +40,7 @@
 #include "access/reloptions.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
+#include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_type.h"
 #include "catalog/pg_user_mapping.h"
@@ -2731,6 +2732,24 @@ get_connect_string(const char *servername)
ForeignDataWrapper *fdw;
AclResult   aclresult;
char   *srvname;
+   static const PQconninfoOption *options = NULL;
+
+   /*
+* Get list of valid libpq options.
+*
+* To avoid unnecessary work, we get the list once and use it throughout
+* the lifetime of this backend process.  We don't need to care about
+* memory context issues, because PQconndefaults allocates with malloc.
+*/
+   if (!options)
+   {
+   options = PQconndefaults();
+   if (!options)   /* assume reason for failure is OOM */
+   ereport(ERROR,
+   (errcode(ERRCODE_FDW_OUT_OF_MEMORY),
+errmsg("out of memory"),
+errdetail("could not get libpq's default connection options")));
+   }
 
/* first gather the server connstr options */
srvname = pstrdup(servername);
@@ -2755,16 +2774,18 @@ get_connect_string(const char *servername)
{
DefElem*def = lfirst(cell);
 
-   appendStringInfo(buf, "%s='%s' ", def->defname,
-escape_param_str(strVal(def->arg)));
+   if ( is_valid_dblink_option(options, def->defname, ForeignDataWrapperRelationId ) )
+   appendStringInfo(buf, "%s='%s' ", def->defname,
+escape_param_str(strVal(def->arg)));
}
 
foreach(cell, foreign_server->options)
{
DefElem*def = lfirst(cell);
 
-   appendStringInfo(buf, "%s='%s' ", def->defname,
-escape_param_str(strVal(def->arg)));
+   if ( is_valid_dblink_option(options, def->defname, ForeignServerRelationId ) )
+   appendStringInfo(buf, "%s='%s' ", def->defname,
+escape_param_str(strVal(def->arg)));
}
 
foreach(cell, 

Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera
 wrote:
>> > Yes, I am, and I disagree with you.  The current decision on this point
>> > was made ages ago, before autovacuum even existed let alone relied on
>> > the stats for proper functioning.  The tradeoff you're saying you're
>> > okay with is "we'll shut down a few seconds faster, but you're going
>> > to have table bloat problems later because autovacuum won't know it
>> > needs to do anything".  I wonder how many of the complaints we get
>> > about table bloat are a consequence of people not realizing that
>> > "pg_ctl stop -m immediate" is going to cost them.
>>
>> That would be useful information to have, but I bet the answer is "not
>> that many".  Most people don't shut down their database very often;
>> they're looking for continuous uptime.  It looks to me like autovacuum
>> activity causes the statistics files to get refreshed at least once
>> per autovacuum_naptime, which defaults to once a minute, so on the
>> average we're talking about the loss of perhaps 30 seconds worth of
>> statistics.
>
> I think you're misunderstanding how this works.  Losing that file
> doesn't lose just the final 30 seconds worth of data -- it loses
> *everything*, and every counter goes back to zero.  So it's not a few
> parts-per-million, it loses however many millions there were.

OK, that's possible, but I'm not sure.  I think there are two separate
issues here.  One is whether we should nuke the stats file on
recovery, and the other is whether we should force a final write of
the stats file before agreeing to an immediate shutdown.  It seems to
me that the first one affects whether all of the counters go to zero,
and the second affects whether we lose a small amount of data from
just prior to the shutdown.  Right now, we are doing the first, so the
second is a waste.  If we decide to start doing the first, we can
independently decide whether to also do the second.

>> I also think that you're wildly overestimating the likelihood that
>> writing the stats file will be fast, because (1) anything that
>> involves writing to the disk can be very slow, either because there's
>> a lot of other write activity or because the disk is going bad, the
>> latter being actually a pretty common cause of emergency database
>> shutdowns, (2) the stats files can be quite large if the database
>> system contains hundreds of thousands or even millions of objects,
>> which is not all that infrequent, and (3) pgstat wait timeouts are
>> pretty common, which would not be the case if writing the file was
>> invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741).
>
> Those writes are slow because of the concurrent activity.  If all
> backends just throw their hands in the air, no more writes come from
> them, so the OS is going to finish the writes pretty quickly (or at
> least empty enough of the caches so that the pgstat data fits); so
> neither (1) nor (3) should be terribly serious.  I agree that (2) is a
> problem, but it's not a problem for everyone.

If the operating system buffer cache doesn't contain much dirty data,
then I agree.  But if there is a large backlog of dirty data there, then
it might be quite slow.

>> >> ...  Yeah, it's not good, but neither are the things that prompt
>> >> people to perform an immediate shutdown in the first place.
>> >
>> > Really?  I think many users think an immediate shutdown is just fine.
>>
>> Why would anybody ever perform an immediate shutdown rather than a
>> fast shutdown if a fast shutdown were fast?
>
> A fast shutdown is not all that fast -- it needs to write the whole
> contents of shared buffers down to disk, which may be enormous.
> Millions of times bigger than pgstat data.  So a fast shutdown is
> actually very slow in a large machine.  An immediate shutdown, even if
> it writes pgstat data, is still going to be much smaller in terms of
> what is written.

I agree.  However, in many cases, the major cost of a fast shutdown is
getting the dirty data already in the operating system buffers down to
disk, not in writing out shared_buffers itself.  The latter is
probably a single-digit number of gigabytes, or maybe double-digit.
The former might be a lot more, and the write of the pgstat file may
back up behind it.  I've seen cases where an 8kB buffered write from
Postgres takes tens of seconds to complete because the OS buffer cache
is already saturated with dirty data, and the stats files could easily
be a lot more than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] condition variables

2016-11-22 Thread Robert Haas
On Mon, Nov 21, 2016 at 11:48 PM, Thomas Munro
 wrote:
> Here's a version that works that way, though it allows you to call
> ConditionVariablePrepareToSleep *optionally* before you enter your
> loop, in case you expect to have to wait and would rather avoid the
> extra loop.  Maybe there isn't much point in exposing that though,
> since your condition test should be fast and waiting is the slow path,
> but we don't really really know what your condition test is.  I
> thought about that because my use case (barrier.c) does in fact expect
> to hit the wait case more often than not.  If that seems pointless
> then perhaps ConditionVariablePrepareToSleep should become static and
> implicit.  This version does attempt to suppress spurious returns, a
> bit, using proclist_contains.  No more cvSleeping.

This version looks good to me and I have committed it after doing a
bit more work on the comments.
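For readers following along, the intended calling pattern looks roughly
like the sketch below; the condition test and the wait_event value are
placeholders, and ConditionVariablePrepareToSleep() is the optional call
discussed above:

#include "postgres.h"
#include "storage/condition_variable.h"

/* placeholder for whatever shared-state test the caller needs */
static bool
my_condition_is_satisfied(void)
{
    return true;
}

static void
wait_for_condition(ConditionVariable *cv)
{
    /* Optional: register early if we expect to sleep anyway. */
    ConditionVariablePrepareToSleep(cv);

    while (!my_condition_is_satisfied())
        ConditionVariableSleep(cv, 0 /* placeholder wait_event_info */);

    /* Always undo the preparation, even if we never actually slept. */
    ConditionVariableCancelSleep();
}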

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] amcheck (B-Tree integrity checking tool)

2016-11-22 Thread Peter Geoghegan
On Sun, Nov 20, 2016 at 3:42 PM, Robert Haas  wrote:
> OK.  If it's not reasonable to continue checking after an ERROR, then
> I think ERROR is the way to go.  If somebody really doesn't like that
> lack of flexibility (in either direction), they can propose a change
> later for separate consideration.  That limitation is not, in my view,
> a sufficient reason to hold up the patch on the table.

I attach V4 of amcheck. I decided to leave things as they are, as far
as forcing a different elevel goes (you still need to modify a macro).
I didn't change the elevel macro from CONCERN in a couple of cases, as
Andres suggested, because in fact that's generally the same as DEBUG1,
not WARNING (I agreed with Andres that it was WARNING before, but it
isn't unless you #define NOT_USED).

Andres said: "Theoretically we should recheck that the oids still
match, theoretically the locking here allows the index->table mapping
to change". I don't know how to act on this feedback, since comparable
index + heap locking code should have the same problem, but doesn't
consider it. How is this any different to reindex_index()?

Other than that, all feedback items were worked through. I made the
functions PARALLEL SAFE, too, since I noticed that that wasn't the
case in passing.

-- 
Peter Geoghegan
From 21c843ae0193ed17b2e5234d67f2e73f7015b3cd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accepts a single argument (a regclass reference to an nbtree index
relation) and returns void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/Makefile |1 +
 contrib/amcheck/Makefile |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.control  |5 +
 contrib/amcheck/verify_nbtree.c  | 1227 ++
 doc/src/sgml/amcheck.sgml|  275 +
 doc/src/sgml/contrib.sgml|1 +
 doc/src/sgml/filelist.sgml   |1 +
 8 files changed, 1549 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 contrib/amcheck/verify_nbtree.c
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 25263c0..4acd50e 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 000..f14c7aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= verify_nbtree.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 000..b415f0e
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 000..180f150
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/contrib/amcheck/verify_nbtree.c 

Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Nov 22, 2016 at 12:54 PM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> But that's not what is at issue here.  The issue is whether, when
> >> asked to exit immediately, all processes should exit immediately, or
> >> whether it would be better for all processes except one to exit
> >> immediately and the last one exit non-immediately.  In other words,
> >> when the user asks for an immediate shutdown, do they really mean it,
> >> or are they OK taking time to do some other stuff first?
> >
> > Peter already gave the response to that, which is that users do not
> > expect an immediate shutdown to have permanent harmful effects.
> > It doesn't have such effects so far as the SQL data goes; why is it
> > okay to blow away statistical data?
> 
> You're doing that anyway.  The backends aren't going to send any
> accumulated but unsent statistics to the stats collector before
> exiting; they're just going to exit.

Sure.  That loses a few counts, but as you argue below, it's just "a few
parts per million".

> > Yes, I am, and I disagree with you.  The current decision on this point
> > was made ages ago, before autovacuum even existed let alone relied on
> > the stats for proper functioning.  The tradeoff you're saying you're
> > okay with is "we'll shut down a few seconds faster, but you're going
> > to have table bloat problems later because autovacuum won't know it
> > needs to do anything".  I wonder how many of the complaints we get
> > about table bloat are a consequence of people not realizing that
> > "pg_ctl stop -m immediate" is going to cost them.
> 
> That would be useful information to have, but I bet the answer is "not
> that many".  Most people don't shut down their database very often;
> they're looking for continuous uptime.  It looks to me like autovacuum
> activity causes the statistics files to get refreshed at least once
> per autovacuum_naptime, which defaults to once a minute, so on the
> average we're talking about the loss of perhaps 30 seconds worth of
> statistics.

I think you're misunderstanding how this works.  Losing that file
doesn't lose just the final 30 seconds worth of data -- it loses
*everything*, and every counter goes back to zero.  So it's not a few
parts-per-million, it loses however many millions there were.

> I also think that you're wildly overestimating the likelihood that
> writing the stats file will be fast, because (1) anything that
> involves writing to the disk can be very slow, either because there's
> a lot of other write activity or because the disk is going bad, the
> latter being actually a pretty common cause of emergency database
> shutdowns, (2) the stats files can be quite large if the database
> system contains hundreds of thousands or even millions of objects,
> which is not all that infrequent, and (3) pgstat wait timeouts are
> pretty common, which would not be the case if writing the file was
> invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741).

Those writes are slow because of the concurrent activity.  If all
backends just throw their hands in the air, no more writes come from
them, so the OS is going to finish the writes pretty quickly (or at
least empty enough of the caches so that the pgstat data fits); so
neither (1) nor (3) should be terribly serious.  I agree that (2) is a
problem, but it's not a problem for everyone.

> >> ...  Yeah, it's not good, but neither are the things that prompt
> >> people to perform an immediate shutdown in the first place.
> >
> > Really?  I think many users think an immediate shutdown is just fine.
> 
> Why would anybody ever perform an immediate shutdown rather than a
> fast shutdown if a fast shutdown were fast?

A fast shutdown is not all that fast -- it needs to write the whole
contents of shared buffers down to disk, which may be enormous.
Millions of times bigger than pgstat data.  So a fast shutdown is
actually very slow in a large machine.  An immediate shutdown, even if
it writes pgstat data, is still going to be much smaller in terms of
what is written.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 12:54 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> But that's not what is at issue here.  The issue is whether, when
>> asked to exit immediately, all processes should exit immediately, or
>> whether it would be better for all processes except one to exit
>> immediately and the last one exit non-immediately.  In other words,
>> when the user asks for an immediate shutdown, do they really mean it,
>> or are they OK taking time to do some other stuff first?
>
> Peter already gave the response to that, which is that users do not
> expect an immediate shutdown to have permanent harmful effects.
> It doesn't have such effects so far as the SQL data goes; why is it
> okay to blow away statistical data?

You're doing that anyway.  The backends aren't going to send any
accumulated but unsent statistics to the stats collector before
exiting; they're just going to exit.

>> You're arguing that preserving the stats file is more important than
>> timely shutdown, but I don't buy it.
>
> Yes, I am, and I disagree with you.  The current decision on this point
> was made ages ago, before autovacuum even existed let alone relied on
> the stats for proper functioning.  The tradeoff you're saying you're
> okay with is "we'll shut down a few seconds faster, but you're going
> to have table bloat problems later because autovacuum won't know it
> needs to do anything".  I wonder how many of the complaints we get
> about table bloat are a consequence of people not realizing that
> "pg_ctl stop -m immediate" is going to cost them.

That would be useful information to have, but I bet the answer is "not
that many".  Most people don't shut down their database very often;
they're looking for continuous uptime.  It looks to me like autovacuum
activity causes the statistics files to get refreshed at least once
per autovacuum_naptime, which defaults to once a minute, so on the
average we're talking about the loss of perhaps 30 seconds worth of
statistics.  What percentage of database activity occurs within 30
second of an immediate shutdown?  For a hypothetical installation
where the DBA does an immediate shutdown at four randomly-chosen times
each day, the answer is "less than 0.2%".  In real installations, the
time between immediate shutdown is likely to be weeks or months, and
so the answer is perhaps two or three orders of magnitude less than
that.  The loss of a few (or even a few dozen) parts-per-million of
statistics data shouldn't make any material difference.

To make this add up to a significant loss, you have to suppose that
there was a particularly large transaction in flight just before the
crash.  But it can't have been in flight at the moment of the crash,
because then the statistics message would not have been sent.  So
it has to be the case that some really big transaction ended and then
just after that the DBA did an immediate shutdown.  I'm not arguing
that it couldn't happen, but it's a pretty narrow target to be aiming
at.

I also think that you're wildly overestimating the likelihood that
writing the stats file will be fast, because (1) anything that
involves writing to the disk can be very slow, either because there's
a lot of other write activity or because the disk is going bad, the
latter being actually a pretty common cause of emergency database
shutdowns, (2) the stats files can be quite large if the database
system contains hundreds of thousands or even millions of objects,
which is not all that infrequent, and (3) pgstat wait timeouts are
pretty common, which would not be the case if writing the file was
invariably fast (c.f. 75b48e1fff8a4dedd3ddd7b76f6360b5cc9bb741).

>> ...  Yeah, it's not good, but neither are the things that prompt
>> people to perform an immediate shutdown in the first place.
>
> Really?  I think many users think an immediate shutdown is just fine.

Why would anybody ever perform an immediate shutdown rather than a
fast shutdown if a fast shutdown were fast?  Now you've incurred the
overhead of a crash recovery for no gain.  In my experience, people
perform immediate shutdowns mostly when they try something else and it
doesn't work.  (Other reasons: They are Oracle users who think that
immediate will do what it does on Oracle, namely what we call a fast
shutdown; or they don't want to wait for the shutdown checkpoint.)

>> The patch that
>> you claim moots this one was inspired by immediate shutdowns taking
>> too long, and that patch enjoyed pretty broad support IIRC.
>
> I think that's historical revisionism.  The commit message for 82233ce7e
> says "(This might happen, for example, if a backend gets tangled trying
> to malloc() due to gettext(), as in an example illustrated by MauMau.)".
> There is absolutely nothing in it about people being dissatisfied with
> the shutdown timing in normal cases; rather, it was done to prevent
> cases where shutdown failed to happen at all.

OK, you're probably right about 

Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Tom Lane
Robert Haas  writes:
> But that's not what is at issue here.  The issue is whether, when
> asked to exit immediately, all processes should exit immediately, or
> whether it would be better for all processes except one to exit
> immediately and the last one exit non-immediately.  In other words,
> when the user asks for an immediate shutdown, do they really mean it,
> or are they OK taking time to do some other stuff first?

Peter already gave the response to that, which is that users do not
expect an immediate shutdown to have permanent harmful effects.
It doesn't have such effects so far as the SQL data goes; why is it
okay to blow away statistical data?

> You're arguing that preserving the stats file is more important than
> timely shutdown, but I don't buy it.

Yes, I am, and I disagree with you.  The current decision on this point
was made ages ago, before autovacuum even existed let alone relied on
the stats for proper functioning.  The tradeoff you're saying you're
okay with is "we'll shut down a few seconds faster, but you're going
to have table bloat problems later because autovacuum won't know it
needs to do anything".  I wonder how many of the complaints we get
about table bloat are a consequence of people not realizing that
"pg_ctl stop -m immediate" is going to cost them.

> ...  Yeah, it's not good, but neither are the things that prompt
> people to perform an immediate shutdown in the first place.

Really?  I think many users think an immediate shutdown is just fine.

> The patch that
> you claim moots this one was inspired by immediate shutdowns taking
> too long, and that patch enjoyed pretty broad support IIRC.

I think that's historical revisionism.  The commit message for 82233ce7e
says "(This might happen, for example, if a backend gets tangled trying
to malloc() due to gettext(), as in an example illustrated by MauMau.)".
There is absolutely nothing in it about people being dissatisfied with
the shutdown timing in normal cases; rather, it was done to prevent
cases where shutdown failed to happen at all.

regards, tom lane




Re: [HACKERS] Logical Replication WIP

2016-11-22 Thread Erik Rijkers

On 2016-11-20 19:02, Petr Jelinek wrote:


0001-Add-support-for-TE...cation-slots-v8.patch.gz (~8 KB)
0002-Refactor-libpqwalreceiver-v8.patch.gz (~9 KB)
0003-Add-PUBLICATION-catalogs-and-DDL-v8.patch.gz (~30 KB)
0004-Add-SUBSCRIPTION-catalog-and-DDL-v8.patch.gz (~27 KB)
0005-Define-logical-rep...output-plugi-v8.patch.gz (~13 KB)
0006-Add-logical-replication-workers-v8.patch.gz (~43 KB)
0007-Add-separate-synch...for-logical--v8.patch.gz (~2 KB)


Apply, make, make check, install OK.


A crash of the subscriber can be forced by running  vacuum <published table>
on the publisher.



- publisher
create table if not exists testt( id integer primary key, c text );
create publication pub1 for table testt;

- subscriber
create table if not exists testt( id integer primary key, c text );
create subscription sub1 connection 'dbname=testdb port=6444' 
publication pub1 with (disabled);

alter  subscription sub1 enable;

- publisher
vacuum testt;

Now a data change on the published table (perhaps also a select on the
subscriber-side data) leads to:



- subscriber log:
TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "mcxt.c", Line: 
1001)
2016-11-22 18:13:13.983 CET 10177 LOG:  worker process: ??)? (PID 10334) 
was terminated by signal 6: Aborted
2016-11-22 18:13:13.983 CET 10177 LOG:  terminating any other active 
server processes
2016-11-22 18:13:13.983 CET 10338 WARNING:  terminating connection 
because of crash of another server process
2016-11-22 18:13:13.983 CET 10338 DETAIL:  The postmaster has commanded 
this server process to roll back the current transaction and exit, 
because another server process exited abnormally and possibly corrupted 
shared memory.

[...]




Erik Rijkers




Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support

2016-11-22 Thread Jon Nelson
On Tue, Nov 22, 2016 at 8:42 AM, Tom Lane  wrote:

> Haribabu Kommi  writes:
> > Any suggestions for the name to be used for the new datatype the can
> > work for both 48 and 64 bit MAC addresses?
>
> The precedent of int4/int8/float4/float8 is that SQL data types should
> be named after their length in bytes.  So I'd be inclined to call this
> "macaddr8" not "macaddr64".  That would suggest taking the simple
> approach of always storing values in the 8-byte format, rather than
> dealing with the complexities of having two formats internally, two
> display formats, etc.
>

Be that as it may, but almost everybody else (outside the db world?) uses
bits.  The C types, for example, are expressed in bits (int8_t, int64_t,
etc...).

> > While comparing a 48 bit MAC address with 64 bit MAC address, Ignore
> > the two bytes if the contents in those bytes are reserved bytes.
>
> Um ... I don't follow.  Surely these must compare different:
>
> 01-01-01-FF-FE-01-01-01
> 01-01-01-FF-0E-01-01-01
>

What's more, it now requires 2 comparisons and some logic versus the
possibility of a single memcmp.
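
For illustration, the difference looks roughly like this at the C level.
This is a sketch only: the 8-byte array representation and the helper
names below are assumptions for illustration, not code from any posted
patch.

#include <string.h>

/* Single comparison when every value is stored in the 8-byte format. */
int
macaddr8_cmp_simple(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, 8);
}

/*
 * What "ignore the reserved bytes" would require instead: compare around
 * the FF-FE filler and skip bytes 3-4 only when both values carry it.
 */
int
macaddr8_cmp_ignoring_filler(const unsigned char *a, const unsigned char *b)
{
    int r = memcmp(a, b, 3);                /* first three bytes */

    if (r != 0)
        return r;
    if (!(a[3] == 0xFF && a[4] == 0xFE && b[3] == 0xFF && b[4] == 0xFE))
    {
        r = memcmp(a + 3, b + 3, 2);        /* filler bytes actually differ */
        if (r != 0)
            return r;
    }
    return memcmp(a + 5, b + 5, 3);         /* last three bytes */
}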

-- 
Jon


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 10:59 AM, Tom Lane  wrote:
> It's already the case that the pgstats code writes the stats data under a
> temporary file name and then renames it into place atomically.  So the
> prospects for corrupt data are not large, and I do not think that the
> existing removal behavior was intended to prevent that.  Rather, the
> concern was that if you do a point-in-time recovery to someplace much
> earlier on the WAL timeline, the stats file will be out of sync with
> what's now in your database.  That's a valid point, but deleting the
> stats file during *any* recovery seems like an overreaction.

But that's not what is at issue here.  The issue is whether, when
asked to exit immediately, all processes should exit immediately, or
whether it would be better for all processes except one to exit
immediately and the last one exit non-immediately.  In other words,
when the user asks for an immediate shutdown, do they really mean it,
or are they OK taking time to do some other stuff first?

I think that the charter of an immediate shutdown is to perform a
simulated crash.  We do our best to let clients know that we are going
down and then we go down.   We don't checkpoint, we don't do any other
operation that might take a lengthy period of time, we just SHUT DOWN.
You're arguing that preserving the stats file is more important than
timely shutdown, but I don't buy it.  We've never tried to do that in
the past and we've got no evidence that it's causing a problem for
anyone.  Yeah, it's not good, but neither are the things that prompt
people to perform an immediate shutdown in the first place.

On the other hand, there's plenty of evidence that slow shutdowns are
a big problem for users.  People use fast shutdown because smart
shutdown was too slow (never-ending), and we changed the default for
that reason.  I've repeatedly seen users who resorted to an immediate
shut down because a fast shutdown wasn't fast enough.  The patch that
you claim moots this one was inspired by immediate shutdowns taking
too long, and that patch enjoyed pretty broad support IIRC.  I think
it's fairly perverse to receive yet another complaint about shutdown
being slow and, instead of fixing it, go try to make sure that
slowness is baked in as a policy decision.

Immediate, adj. occurring or done at once; instant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Tom Lane
"Tsunakawa, Takayuki"  writes:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
>> The point I was trying to make is that I think the forced-removal behavior
>> is not desirable, and therefore committing a patch that makes it be graven
>> in stone is not desirable either.

> I totally agree that we should pursue the direction for escaping from the 
> complete loss of stats files.  Personally, I would like to combine that with 
> the idea of persistent performance diagnosis information for long-term 
> analysis (IIRC, someone proposed it.)  However, I don't think my patch will 
> make everyone forget about the problem of stats file loss during recovery.  
> The problem exists with or without my patch, and my patch doesn't have the 
> power to dilute the importance of the problem.  If you are worried about
> memory, we can add an entry for the problem in TODO list that Bruce-san is 
> maintaining.

> Or, maybe we can just stop removing the stats files during recovery by 
> keeping the files of previous generation and using it as the current one.  I 
> haven't seen how fresh the previous generation is (500ms ago?).  A bit older 
> might be better than nothing.

Freshness isn't the issue.  The stats file isn't there at all, in the
permanent stats directory, unless the collector takes the time to write
it before exiting.  Without that, we have unrecoverable loss of the stats
data.  Now, that isn't as bad as loss of the SQL data content, but it's
not good either.

It's already the case that the pgstats code writes the stats data under a
temporary file name and then renames it into place atomically.  So the
prospects for corrupt data are not large, and I do not think that the
existing removal behavior was intended to prevent that.  Rather, the
concern was that if you do a point-in-time recovery to someplace much
earlier on the WAL timeline, the stats file will be out of sync with
what's now in your database.  That's a valid point, but deleting the
stats file during *any* recovery seems like an overreaction.

The simplest solution I can think of is to delete the stats file when
doing a PITR operation, but not during simple crash recovery.  I've
not looked to see how hard it would be to do that, but it seems like
it should be a fairly minor logic tweak.  Maybe decide to do the removal
at the point where we intentionally stop following WAL someplace earlier
than its end.
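
For what it's worth, the decision rule being described could be factored
as something like the following.  This is only a sketch of the idea; the
parameter names mirror, but are not, the backend's recovery-state
variables.

#include <stdbool.h>

/*
 * Sketch only: "remove the stats file for a point-in-time recovery that
 * intentionally stops short of the end of WAL, but not for plain crash
 * recovery".
 */
bool
should_remove_stats_file(bool in_recovery,
                         bool archive_recovery_requested,
                         bool recovery_target_set)
{
    if (!in_recovery)
        return false;           /* normal startup: keep the stats file */

    /*
     * Plain crash recovery replays WAL to its end, so the stats file is
     * still consistent with the data; only an archive recovery with an
     * explicit earlier target leaves it out of sync.
     */
    return archive_recovery_requested && recovery_target_set;
}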

Another angle we might take, independently of that, is to delete the
stats file if the stats collector process itself crashes.  This would
provide a recovery avenue if somehow we did have a stats file that
was corrupt enough to crash the collector.  And it would not matter
for post-startup crashes of the stats collector, because the file
would not be there anyway.

regards, tom lane




Re: [HACKERS] Push down more full joins in postgres_fdw

2016-11-22 Thread Ashutosh Bapat
>
>
>> There is already a function to build targetlist for a given relation
>> build_tlist_to_deparse(), which does the same thing as this code for a
>> join or
>> base relation and when there are no local conditions. Why don't we use
>> that
>> function instead of duplicating that logic? If tomorrow, the logic to
>> build the
>> targetlist changes, we will need to duplicate those changes. We should
>> avoid
>> that.
>> +/* Build a tlist from the subquery. */
>> +tlist = add_to_flat_tlist(tlist, foreignrel->reltarget->exprs);
>
>
> Will modify the patch to use build_tlist_to_deparse.  Actually, the early
> versions of the patch used that function, but it looks like I changed not to
> use that function, as I misunderstood about your comments on this part at
> some point.  Sorry for that.
>
>> The comment below says "get the aliases", but what the function really
>> returns
>> is the identifiers used for creating aliases. Please correct the comment.
>> +/*
>> + * Get the relation and column alias for a given Var node, which belongs
>> to
>> + * input foreignrel. They are returned in *tabno and *colno respectively.
>> + */
>
>
> Actually, this was rewritten into the above by *you*.  The original comment
> I added was:
>
> + /*
> +  * Get info about the subselect alias to given expression.
> +  *
> +  * The subselect table and column numbers are returned to *tabno and
> *colno,
> +  * respectively.
> +  */
>
> I'd like to change the comment into something like the original one.

Sorry. I think the current version is better than the previous one; the
term "subselect alias" is confusing in the previous version. In the
current version, "Get the relation and column alias for a given Var
node," we just need to add the word "identifiers", as in "Get the
relation and column identifiers for a given Var node".

>
>> We discussed that we have to deparse and search from the same targetlist.
>> And
>> that the targetlist should be saved in fpinfo, the first time it gets
>> created.
>> But the patch seems to be searching in foreignrel->reltarget->exprs and
>> deparsing from the tlist returned by add_to_flat_tlist(tlist,
>> foreignrel->reltarget->exprs).
>> +foreach(lc, foreignrel->reltarget->exprs)
>> +{
>> +if (equal(lfirst(lc), (Node *) node))
>> +{
>> +*colno = i;
>> +return;
>> +}
>> +i++;
>> +}
>> I guess, the reason why you are doing it this way, is SELECT clause for
>> the
>> outermost query gets deparsed before FROM clause. For later we call
>> deparseRangeTblRef(), which builds the tlist. So, while deparsing SELECT
>> clause, we do not have tlist to build from.
>
>
> That's right.
>
>> In that case, I guess, we have to
>> build the tlist in get_subselect_alias_id() if it's not available and
>> stick it
>> in fpinfo. Subsequent calls to get_subselect_alias_id() should find tlist
>> there. Then in deparseRangeTblRef() assert that there's a tlist in fpinfo
>> and use it to build the SELECT clause of subquery. That way, we don't
>> build
>> tlist unless it's needed and also use the same tlist for all searches.
>> Please
>> use tlist_member() to search into the tlist.
>
>
> I don't think that's a good idea because that would make the code
> unnecessarily complicated and inefficient.  I think the direct search into
> the foreignrel->reltarget->exprs shown above would be better because that's
> more simple and efficient than what you proposed.  Note that since (1) the
> foreignrel->reltarget->exprs doesn't contain any PHVs and (2) the
> local_conds is empty, the tlist created for the subquery by
> build_tlist_to_deparse is guaranteed to have one-to-one correspondence with
> the foreignrel->reltarget->exprs, so the direct search works well.

If we deparse from and search into different objects, that's not very
maintainable code. Changes to any one of them require changes to the
other, thus creating avenues for bugs. Even if slightly expensive, we
should search into and deparse from the same targetlist. I think I
have explained this before.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support

2016-11-22 Thread Tom Lane
Haribabu Kommi  writes:
> Any suggestions for the name to be used for the new datatype the can
> work for both 48 and 64 bit MAC addresses?

The precedent of int4/int8/float4/float8 is that SQL data types should
be named after their length in bytes.  So I'd be inclined to call this
"macaddr8" not "macaddr64".  That would suggest taking the simple
approach of always storing values in the 8-byte format, rather than
dealing with the complexities of having two formats internally, two
display formats, etc.

> It is possible to represent a 48 bit MAC address as 64 bit MAC address
> by adding reserved bytes in the middle as follows.
> 01-01-01-01-01-01::macaddr => 01-01-01-FF-FE-01-01-01::newmacaddr

Check.  So we could accept 6-byte addresses on input and perform
that conversion automatically.
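
Concretely, the 6-byte-to-8-byte mapping shown above (01-01-01-01-01-01
becoming 01-01-01-FF-FE-01-01-01) amounts to something like this; a
sketch assuming a plain 8-byte array as the internal representation, not
code from the patch:

/* Expand an EUI-48 address into EUI-64 storage by inserting FF-FE. */
void
macaddr48_to_64(const unsigned char in[6], unsigned char out[8])
{
    out[0] = in[0];
    out[1] = in[1];
    out[2] = in[2];
    out[3] = 0xFF;      /* reserved filler byte */
    out[4] = 0xFE;      /* reserved filler byte */
    out[5] = in[3];
    out[6] = in[4];
    out[7] = in[5];
}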

> While comparing a 48 bit MAC address with 64 bit MAC address, Ignore
> the two bytes if the contents in those bytes are reserved bytes.

Um ... I don't follow.  Surely these must compare different:

01-01-01-FF-FE-01-01-01
01-01-01-FF-0E-01-01-01

> The new datatype can store directly whatever is the input is, like 48 bit
> or 64 bit. The same data is sent over the wire, whether the reserved
> bytes are present or not?

I'd just send all 8 bytes.  What the client wants to do with values
that could be mapped back into the 6-byte space is its business.

In short, let's just make this work similarly to integers of different
widths, rather than trying to sprinkle extra pixie dust on it.

regards, tom lane




Re: [HACKERS] Logical Replication WIP

2016-11-22 Thread Erik Rijkers

and the attachment...

On 2016-11-22 14:55, Erik Rijkers wrote:

On 2016-11-20 19:06, Petr Jelinek wrote:

0004-Add-SUBSCRIPTION-catalog-and-DDL-v8.patch.gz


This patch contains 2 tabs which break the html build when using 'make 
oldhtml':


$ ( cd
/var/data1/pg_stuff/pg_sandbox/pgsql.logical_replication/doc/src/sgml;
time make oldhtml )
make check-tabs
make[1]: Entering directory
`/var/data1/pg_stuff/pg_sandbox/pgsql.logical_replication/doc/src/sgml'
./ref/create_subscription.sgml:WITH (DISABLED);
Tabs appear in SGML/XML files
make[1]: *** [check-tabs] Error 1
make[1]: Leaving directory
`/var/data1/pg_stuff/pg_sandbox/pgsql.logical_replication/doc/src/sgml'
make: *** [oldhtml-stamp] Error 2

Very minor change, but it fixes that build.

Thanks,

Erik Rijkers


--- doc/src/sgml/ref/create_subscription.sgml.orig	2016-11-22 14:21:05.0 +0100
+++ doc/src/sgml/ref/create_subscription.sgml	2016-11-22 14:21:27.0 +0100
@@ -151,7 +151,7 @@
 CREATE SUBSCRIPTION mysub
  CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb password=foopass'
 PUBLICATION insert_only
-		   WITH (DISABLED);
+   WITH (DISABLED);
 
   
 



Re: [HACKERS] Logical Replication WIP

2016-11-22 Thread Erik Rijkers

On 2016-11-20 19:06, Petr Jelinek wrote:

0004-Add-SUBSCRIPTION-catalog-and-DDL-v8.patch.gz


This patch contains 2 tabs which break the html build when using 'make 
oldhtml':


$ ( cd 
/var/data1/pg_stuff/pg_sandbox/pgsql.logical_replication/doc/src/sgml; 
time make oldhtml )

make check-tabs
make[1]: Entering directory 
`/var/data1/pg_stuff/pg_sandbox/pgsql.logical_replication/doc/src/sgml'

./ref/create_subscription.sgml:WITH (DISABLED);
Tabs appear in SGML/XML files
make[1]: *** [check-tabs] Error 1
make[1]: Leaving directory 
`/var/data1/pg_stuff/pg_sandbox/pgsql.logical_replication/doc/src/sgml'

make: *** [oldhtml-stamp] Error 2

Very minor change, but it fixes that build.

Thanks,

Erik Rijkers




Re: [HACKERS] Improvements in psql hooks for variables

2016-11-22 Thread Daniel Verite
Stephen Frost wrote:

> That certainly doesn't feel right.  I'm thinking that if we're going to
> throw an error back to the user about a value being invalid then we
> shouldn't change the current value.
> 
> My initial thought was that perhaps we should pass the current value to
> ParseVariableBool() and let it return whatever the "right" answer is,
> however, your use of ParseVariableBool() for enums that happen to accept
> on/off means that won't really work.

That's not needed once ParseVariableBool() informs the caller about
the validity of the boolean expression, which is what the patch already
does.

For instance I just implemented it for \timing and the diff consists of just
that:

if (opt)
-   pset.timing = ParseVariableBool(opt, "\\timing", NULL);
+   {
+   bool    newval = ParseVariableBool(opt, "\\timing", &success);
+   if (success)
+   pset.timing = newval;
+   }
else
    pset.timing = !pset.timing;

That makes \timing foobar be rejected as a bad command, with a proper
error message and no change of state, which is just what we want.

> Perhaps the right answer is to flip this around a bit and treat booleans
> as a special case of enums and have a generic solution for enums.
> Consider something like:
> 
> ParseVariableEnum(valid_enums, str_value, name, curr_value);
> 
> 'valid_enums' would be a simple list of what the valid values are for a
> given variable and their corresponding value, str_value the string the
> user typed, name the name of the variable, and curr_value the current
> value of the variable.

Firstly I'd like to insist that such a refactoring is not necessary
for this patch and I feel like it would be out of place in it.

That being said, if we wanted this, I think it would be successful
only if we'd first change our internal variables pset.* from a struct 
of different types to a list of variables from some kind of common
abstract type and an abstraction layer to access them.
That would be an order of magnitude more sophisticated than what we
have.

Otherwise as I tried to explain in [1], I don't see how we could write
a ParseVariableEnum() that would return different types
and take variable inputs.
Or if we say that ParseVariableEnum should not return the value
but affect the variable directly, that would require refactoring
all call sites, and what's the benefit that would justify
such large changes?
Plus we have two different non-mergeable concepts of variables
that need this parser:
psql variables from VariableSpace stored as strings,
and C variables directly instantiated as native types.


Also, the argument that bools are just another type of enum is
legitimate in theory, but since psql accepts any left-anchored match of
true/false/on/off/0/1, the enumeration of values is in fact (the
matching rule is sketched just after this list):

0
1
t
tr
tru
true
f
fa
fal
fals
false
on
of
off
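
To make that concrete, the matching rule behind the list is essentially a
case-insensitive prefix test; a standalone sketch in plain C, not the
actual ParseVariableBool code:

#include <ctype.h>
#include <string.h>

/*
 * Left-anchored, case-insensitive match: value must be a prefix of
 * keyword and at least min_len characters long (so the bare "o" stays
 * ambiguous between on/off, matching the list above).
 */
int
prefix_matches(const char *value, const char *keyword, size_t min_len)
{
    size_t  len = strlen(value);
    size_t  i;

    if (len < min_len || len > strlen(keyword))
        return 0;
    for (i = 0; i < len; i++)
    {
        if (tolower((unsigned char) value[i]) !=
            tolower((unsigned char) keyword[i]))
            return 0;
    }
    return 1;
}

/* "1" and "0" are exact matches; the word forms allow unambiguous prefixes. */
int
parses_as_true(const char *value)
{
    return strcmp(value, "1") == 0 ||
           prefix_matches(value, "true", 1) ||
           prefix_matches(value, "on", 2);
}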

I don't see that it would help if the code treated the above like just a
vanilla list of values, comparable to the other qualifiers like "auto",
"expanded", "vertical", and so on, notwithstanding the fact
that they don't share the same types.

I think that the current code with ParseVariableBool() that only
handles booleans is better in terms of separation of concerns
and readability.

[1]
https://www.postgresql.org/message-id/fc879967-da93-43b6-aa5a-92f2d825e786@mm

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: [HACKERS] Postgres abort found in 9.3.11

2016-11-22 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

The setup uses a hot-standby architecture, and the issue is seen during a
normal run with a normal load of 50% insert and 50% delete operations.
During startup of the standby node, we copy the data directory from the active
postgres using pg_basebackup.

Meanwhile we are trying to create a test bed for people to try.

Regards,
Sandhya

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, November 22, 2016 1:47 AM
To: K S, Sandhya (Nokia - IN/Bangalore) 
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 

Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)"  writes:
> As suggested by you, we upgraded the postgres to version 9.3.14. Also we 
> removed all the patches we had applied before. But the issue is still 
> observed in the latest version as well.

Can you make a test case for other people to try?

regards, tom lane




Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-11-22 Thread Michael Paquier
On Tue, Nov 22, 2016 at 5:55 PM, Ideriha, Takeshi
 wrote:
> Here's a summary of what I tested in RHEL 7.0; details follow.

Thanks for the review.

> [Summary]
> 1. apply patch and make world
>   -> failed because  was mistakenly coded .
>
> 2.correct this mistake and make check-world
>   -> got 1 failed test: "'pg_dumpall with \n\r in database name'"
>  because test script cannot createdb "foo\n\rbar"

The attached version addresses those problems. I have replaced the
test in src/bin/pg_dump/t/ by tests in src/bin/scripts/t/ to check if
the role name and database name with CR or LF fail to be created. I
have as well added a test for initdb when the data directory has an
incorrect character in 0002.
-- 
Michael


0002-Ensure-clean-up-of-data-directory-even-with-restrict.patch
Description: binary/octet-stream


0001-Forbid-newline-and-carriage-return-characters-in-dat.patch
Description: binary/octet-stream



Re: [HACKERS] Declarative partitioning - another take

2016-11-22 Thread Rushabh Lathia
Hi Amit,

I was just reading through your patches and here are some quick review
comments
for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch.

Review comments for  0001-Catalog-and-DDL-for-partitioned-tables-17.patch:

1)
@@ -1102,9 +1104,10 @@ heap_create_with_catalog(const char *relname,
 {
 /* Use binary-upgrade override for pg_class.oid/relfilenode? */
 if (IsBinaryUpgrade &&
-(relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE ||
- relkind == RELKIND_VIEW || relkind == RELKIND_MATVIEW ||
- relkind == RELKIND_COMPOSITE_TYPE || relkind ==
RELKIND_FOREIGN_TABLE))
+(relkind == RELKIND_RELATION || relkind ==
RELKIND_PARTITIONED_TABLE ||
+ relkind == RELKIND_SEQUENCE || relkind == RELKIND_VIEW ||
+ relkind == RELKIND_MATVIEW || relkind ==
RELKIND_COMPOSITE_TYPE ||
+ relkind == RELKIND_FOREIGN_TABLE))

You should add RELKIND_PARTITIONED_TABLE at the end of the if condition; that
will keep the diff minimal. While reading through the patch I noticed an
inconsistency - in some places it's been added at the end and in a few places
at the start. I think you can add it at the end of the condition and be
consistent in each place.

2)

+/*
+ * We need to transform the raw parsetrees corresponding to
partition
+ * expressions into executable expression trees.  Like column
defaults
+ * and CHECK constraints, we could not have done the transformation
+ * earlier.
+ */


Additional space before "Like column defaults".

3)
-    char        relkind;
+    char        relkind,
+                expected_relkind;

A newly added variable should be defined separately with its type, something
like:

     char        relkind;
+    char        expected_relkind;

4)

a)
+/* Prevent partitioned tables from becoming inheritance parents */
+if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot inherit from partitioned table \"%s\"",
+ parent->relname)));
+

need alignment for last line.

b)
+atttuple = SearchSysCacheAttName(RelationGetRelid(rel),
pelem->name);
+if (!HeapTupleIsValid(atttuple))
+ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_COLUMN),
+ errmsg("column \"%s\" named in partition key does
not exist",
+ pelem->name)));
+attform = (Form_pg_attribute) GETSTRUCT(atttuple);
+
+if (attform->attnum <= 0)
+ereport(ERROR,
+(errcode(ERRCODE_UNDEFINED_COLUMN),
+ errmsg("cannot use system column \"%s\" in
partition key",
+ pelem->name)));

need alignment for last line of ereport

c)
+/* Disallow ROW triggers on partitioned tables */
+if (stmt->row && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+errmsg("\"%s\" is a partitioned table",
+RelationGetRelationName(rel)),
+  errdetail("Partitioned tables cannot have ROW triggers.")));

need alignment

5)

@@ -2512,6 +2579,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt
*stmt,
 cxt.blist = NIL;
 cxt.alist = NIL;
 cxt.pkey = NULL;
+cxt.ispartitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;

I think adding brackets will make the code look clearer.

+cxt.ispartitioned = (rel->rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE);

6)

+ * RelationBuildPartitionKey
+ *Build and attach to relcache partition key data of relation
+ *
+ * Partitioning key data is stored in CacheMemoryContext to ensure it
survives
+ * as long as the relcache.  To avoid leaking memory in that context in
case
+ * of an error partway through this function, we build the structure in the
+ * working context (which must be short-lived) and copy the completed
+ * structure into the cache memory.

extra space before "To avoid leaking memory"

7)
+/* variable-length fields start here, but we allow direct access to
partattrs */
+    int2vector    partattrs;    /* attribute numbers of columns in the

Why partattrs allows direct access is not really clear from the comments.

I will continue reading the rest of the patch and testing functionality, and
will share comments as I have them.

Thanks,

On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote  wrote:

>
> Updated patches attached.  I merged what used to be 0006 and 0007 into one.
>
> On 2016/11/19 2:23, Robert Haas wrote:
> > On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
> >  wrote:
> >> Oh but wait, that means I can insert rows with NULLs in the range
> >> partition key if I choose to 

Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw

2016-11-22 Thread Etsuro Fujita

Hi Rushabh,

On 2016/11/22 19:05, Rushabh Lathia wrote:

I started reviewing the patch and here are few initial review points and
questions for you.


Thanks for the review!


1)
-static void deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+static void deparseExplicitTargetList(bool is_returning,
+  List *tlist,
+  List **retrieved_attrs,
   deparse_expr_cxt *context);

Any particular reason for inserting the new argument as the 1st argument? In
general we add new flags at the end, or before the out param of the existing
function.


No, I don't.  I think the *context should be the last argument as in 
other functions in deparse.c.  How about inserting that before the out 
param **retrieved_attrs, like this?


static void
deparseExplicitTargetList(List *tlist,
  bool is_returning,
  List **retrieved_attrs,
  deparse_expr_cxt *context);


2)
-static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
-RelOptInfo *joinrel, bool use_alias, List
**params_list);
+static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
RelOptInfo *foreignrel,
+  bool use_alias, Index target_rel, List
**params_list);



Going beyond 80 character length


Will fix.


3)
 static void
-deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+deparseExplicitTargetList(bool is_returning,
+  List *tlist,
+  List **retrieved_attrs,
   deparse_expr_cxt *context)

This looks a bit weird to me - the deparseExplicitTargetList() function name
and comments say something different from what it is trying to do when the
is_returning flag is passed. Either the function comment needs to be changed,
or maybe a separate function to deparse the returning list for a join
relation?

Is it possible to teach deparseReturningList() to deparse the returning
list for the joinrel?


I don't think it's possible to do that because deparseReturningList uses 
deparseTargetList internally, which only allows us to emit a target list 
for a baserel.  For example, deparseReturningList can't handle a 
returning list that contains join outputs like this:


postgres=# explain verbose delete from ft1 using ft2 where ft1.a = ft2.a 
returning ft1.*, ft2.*;

   QUERY PLAN

 Delete on public.ft1  (cost=100.00..102.06 rows=1 width=46)
   Output: ft1.a, ft1.b, ft2.a, ft2.b
   ->  Foreign Delete  (cost=100.00..102.06 rows=1 width=46)
 Remote SQL: DELETE FROM public.t1 r1 USING public.t2 r2 WHERE 
((r1.a = r2.a)) RETURNING r1.a, r1.b, r2.a, r2.b

(4 rows)

I'd like to change the comment for deparseExplicitTargetList.


4)

+if (foreignrel->reloptkind == RELOPT_JOINREL)
+{
+List   *rlist = NIL;
+
+/* Create a RETURNING list */
+rlist = make_explicit_returning_list(rtindex, rel,
+ returningList,
+ fdw_scan_tlist);
+
+deparseExplicitReturningList(rlist, retrieved_attrs, );
+}
+else
+deparseReturningList(buf, root, rtindex, rel, false,
+ returningList, retrieved_attrs);

The code will be more centralized if we add the logic of extracting the
returning list for a JOINREL inside deparseReturningList() only, wouldn't it?


You are right.  Will do.


5) make_explicit_returning_list() pulls the var list from the returningList,
builds the target entries for the returning list, and at the end rewrites the
fdw_scan_tlist.

AFAIK, in the case of DML which is going to be pushed down to the remote
server, ideally fdw_scan_tlist should be the same as the returning list, as
the final output of the query will be the RETURNING list only. Isn't that
true?


That would be true.  But the fdw_scan_tlist passed from the core would 
contain junk columns inserted by the rewriter and planner work, such as 
CTID for the target table and whole-row Vars for other tables specified 
in the FROM clause of an UPDATE or the USING clause of a DELETE.  So, I 
created the patch so that the pushed-down UPDATE/DELETE retrieves only 
the data needed for the RETURNING calculation from the remote server, 
not the whole fdw_scan_tlist data.



If yes, then why can't we directly replace the fdw_scan_tlist with the
returning list, rather than calling make_explicit_returning_list()?


I think we could do that if we modified the core (maybe the executor 
part).  I'm not sure that's a good idea, though.



Also, I think rewriting the fdw_scan_tlist should happen in postgres_fdw.c
rather than in deparse.c.


Will try.


6) Test coverage for the returning list is missing.


Will add.

Best regards,
Etsuro Fujita





Re: [HACKERS] condition variables

2016-11-22 Thread Kyotaro HORIGUCHI
At Tue, 22 Nov 2016 17:48:07 +1300, Thomas Munro 
 wrote in 

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-22 Thread Mithun Cy
An updated patch with some fixes for bugs reported earlier,

A. failover_to_new_master_v4.patch
Default value "any" is added to target_server_type parameter during its
definition.

B. libpq-failover-smallbugs_02.patch
Fixed the issue raised by [PATCH] pgpassfile connection option.
Host and port names are now added to the pgpass file error message whenever
authentication fails, and the same has been added to the file
libpq-failover-smallbugs.patch (bugs reported and fixed by Tsunakawa). I
reviewed the original bugs reported and the fixes; they seem okay to me. I
could not test GSS and SSPI authentication, as it appears they need Windows.


failover_to_new_master_v4.patch
Description: Binary data


libpq-failover-smallbugs_02.patch
Description: Binary data



Re: [HACKERS] Danger of automatic connection reset in psql

2016-11-22 Thread Pavel Stehule
2016-11-22 13:02 GMT+01:00 Oleksandr Shulgin :

> On Tue, Nov 22, 2016 at 5:28 AM, Pavel Stehule 
> wrote:
>
>>
>> 2016-11-22 3:46 GMT+01:00 Robert Haas :
>>
>>> On Mon, Nov 21, 2016 at 4:55 AM, Oleksandr Shulgin
>>>  wrote:
>>> > On Tue, Nov 15, 2016 at 4:10 PM, Jim Nasby 
>>> wrote:
>>> >>
>>> >> On 11/14/16 5:41 AM, Oleksandr Shulgin wrote:
>>> >>>
>>> >>> Automatic connection reset is a nice feature for server development,
>>> >>> IMO.  Is it really useful for anything else is a good question.
>>> >>
>>> >>
>>> >> I use it all the time for application development; my rebuild script
>>> will
>>> >> forcibly kick everyone out to re-create the database. I put that in
>>> because
>>> >> I invariably end up with a random psql sitting somewhere that I don't
>>> want
>>> >> to track down.
>>> >>
>>> >> What currently stinks though is if the connection is dead and the next
>>> >> command I run is a \i, psql just dies instead of re-connecting. It'd
>>> be nice
>>> >> if before reading the script it checked connection status and
>>> attempted a
>>> >> reconnect.
>>> >>
>>> >>> At least an option to control that behavior seems like a good idea,
>>> >>> maybe even set it to 'no reconnect' by default, so that people who
>>> >>> really use it can make conscious choice about enabling it in their
>>> >>> .psqlrc or elsewhere.
>>> >>
>>> >>
>>> >> +1, I don't think it needs to be the default.
>>> >
>>> >
>>> > So if we go in this direction, should the option be specified from
>>> command
>>> > line or available via psqlrc (or both?)  I think both make sense.
>>> >
>>> > What should be the option and control variable names?  Something like:
>>> > --reconnect and RECONNECT?  Should we allow reconnect in
>>> non-interactive
>>> > mode?  I have no use case for that, but it might be different for
>>> others.
>>> > If non-interactive is not supported then it could be a simple boolean
>>> > variable, otherwise we might want something like a tri-state: on / off
>>> /
>>> > interactive (the last one being the default).
>>>
>>> I think it should just be another psql special variable, like
>>> AUTOCOMMIT or VERBOSITY.  If the user wants to set it on the command
>>> line, they can just use -v.  We don't need a separate, dedicated
>>> option for this, I think.
>>>
>>
> That makes sense to me.
>
> In this case depends on a default. For almost all scripts the sensible
>> value is "without reconnect". It be unfriendly to use this setting via -v
>> variable.
>>
>
> Well, if you're running a script it should not be affected as long as
> default value for this new variable is "interactive" or "off" (and you
> didn't override it in psqlrc).  If you really want to get a "reconnect even
> from the script" type of behavior, then you'll have to use -v or set the
> variable from inside the script itself to "on".
>

ok

Pavel

>
> --
> Alex
>
>


Re: [HACKERS] Danger of automatic connection reset in psql

2016-11-22 Thread Oleksandr Shulgin
On Tue, Nov 22, 2016 at 5:28 AM, Pavel Stehule 
wrote:

>
> 2016-11-22 3:46 GMT+01:00 Robert Haas :
>
>> On Mon, Nov 21, 2016 at 4:55 AM, Oleksandr Shulgin
>>  wrote:
>> > On Tue, Nov 15, 2016 at 4:10 PM, Jim Nasby 
>> wrote:
>> >>
>> >> On 11/14/16 5:41 AM, Oleksandr Shulgin wrote:
>> >>>
>> >>> Automatic connection reset is a nice feature for server development,
>> >>> IMO.  Is it really useful for anything else is a good question.
>> >>
>> >>
>> >> I use it all the time for application development; my rebuild script
>> will
>> >> forcibly kick everyone out to re-create the database. I put that in
>> because
>> >> I invariably end up with a random psql sitting somewhere that I don't
>> want
>> >> to track down.
>> >>
>> >> What currently stinks though is if the connection is dead and the next
>> >> command I run is a \i, psql just dies instead of re-connecting. It'd
>> be nice
>> >> if before reading the script it checked connection status and
>> attempted a
>> >> reconnect.
>> >>
>> >>> At least an option to control that behavior seems like a good idea,
>> >>> maybe even set it to 'no reconnect' by default, so that people who
>> >>> really use it can make conscious choice about enabling it in their
>> >>> .psqlrc or elsewhere.
>> >>
>> >>
>> >> +1, I don't think it needs to be the default.
>> >
>> >
>> > So if we go in this direction, should the option be specified from
>> command
>> > line or available via psqlrc (or both?)  I think both make sense.
>> >
>> > What should be the option and control variable names?  Something like:
>> > --reconnect and RECONNECT?  Should we allow reconnect in non-interactive
>> > mode?  I have no use case for that, but it might be different for
>> others.
>> > If non-interactive is not supported then it could be a simple boolean
>> > variable, otherwise we might want something like a tri-state: on / off /
>> > interactive (the last one being the default).
>>
>> I think it should just be another psql special variable, like
>> AUTOCOMMIT or VERBOSITY.  If the user wants to set it on the command
>> line, they can just use -v.  We don't need a separate, dedicated
>> option for this, I think.
>>
>
That makes sense to me.

In this case depends on a default. For almost all scripts the sensible
> value is "without reconnect". It be unfriendly to use this setting via -v
> variable.
>

Well, if you're running a script it should not be affected as long as
default value for this new variable is "interactive" or "off" (and you
didn't override it in psqlrc).  If you really want to get a "reconnect even
from the script" type of behavior, then you'll have to use -v or set the
variable from inside the script itself to "on".

--
Alex


Re: [HACKERS] sequence data type

2016-11-22 Thread Haribabu Kommi
Hi Vik and Vinayak,

This is a gentle reminder.

you both are assigned as reviewer's to the current patch in the 11-2016
commitfest.
But you haven't shared your review yet. Please share your review about
the patch. This will help us in smoother operation of commitfest.

Please Ignore if you already shared your review.


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] identity columns

2016-11-22 Thread Haribabu Kommi
Hi Vitaly,


This is a gentle reminder.

you assigned as reviewer to the current patch in the 11-2016 commitfest.
But you haven't shared your review yet on the new patch in this commitfest.
Please share your review about the patch. This will help us in smoother
operation of commitfest.

Please Ignore if you already shared your review.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] raw output from copy

2016-11-22 Thread Haribabu Kommi
 Hi,

This is a gentle reminder.

you assigned as reviewer to the current patch in the 11-2016 commitfest.
But you haven't shared your review yet. Please share your review about
the patch. This will help us in smoother operation of commitfest.

Please Ignore if you already shared your review.


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-11-22 Thread Haribabu Kommi
Hi Craig,

This is a gentle reminder.

you assigned as reviewer to the current patch in the 11-2016 commitfest.
But you haven't shared your review yet. Please share your review about
the patch. This will help us in smoother operation of commitfest.

Please Ignore if you already shared your review.


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Push down more full joins in postgres_fdw

2016-11-22 Thread Etsuro Fujita

On 2016/11/22 18:28, Ashutosh Bapat wrote:

The comments should explain why is the assertion true.
+/* Shouldn't be NIL */
+Assert(tlist != NIL);
+/* Should be same length */
+Assert(list_length(tlist) ==
list_length(foreignrel->reltarget->exprs));


Will revise.


OK, I'd like to propose referencing to foreignrel->reltarget->exprs directly
in deparseRangeTblRef and get_subselect_alias_id, then, which is the same as
what I proposed in the first version of the patch, but I'd also like to
change deparseRangeTblRef to use add_to_flat_tlist, not
make_tlist_from_pathtarget, to create a tlist of the subquery, as you
proposed.



There is already a function to build targetlist for a given relation
build_tlist_to_deparse(), which does the same thing as this code for a join or
base relation and when there are no local conditions. Why don't we use that
function instead of duplicating that logic? If tomorrow, the logic to build the
targetlist changes, we will need to duplicate those changes. We should avoid
that.
+/* Build a tlist from the subquery. */
+tlist = add_to_flat_tlist(tlist, foreignrel->reltarget->exprs);


Will modify the patch to use build_tlist_to_deparse.  Actually, the 
early versions of the patch used that function, but it looks like I 
changed not to use that function, as I misunderstood about your comments 
on this part at some point.  Sorry for that.



The comment below says "get the aliases", but what the function really returns
is the identifiers used for creating aliases. Please correct the comment.
+/*
+ * Get the relation and column alias for a given Var node, which belongs to
+ * input foreignrel. They are returned in *tabno and *colno respectively.
+ */


Actually, this was rewritten into the above by *you*.  The original 
comment I added was:


+ /*
+  * Get info about the subselect alias to given expression.
+  *
+  * The subselect table and column numbers are returned to *tabno and 
*colno,

+  * respectively.
+  */

I'd like to change the comment into something like the original one.


We discussed that we have to deparse and search from the same targetlist. And
that the targetlist should be saved in fpinfo, the first time it gets created.
But the patch seems to be searching in foreignrel->reltarget->exprs and
deparsing from the tlist returned by add_to_flat_tlist(tlist,
foreignrel->reltarget->exprs).
+foreach(lc, foreignrel->reltarget->exprs)
+{
+if (equal(lfirst(lc), (Node *) node))
+{
+*colno = i;
+return;
+}
+i++;
+}
I guess, the reason why you are doing it this way, is SELECT clause for the
outermost query gets deparsed before FROM clause. For later we call
deparseRangeTblRef(), which builds the tlist. So, while deparsing SELECT
clause, we do not have tlist to build from.


That's right.


In that case, I guess, we have to
build the tlist in get_subselect_alias_id() if it's not available and stick it
in fpinfo. Subsequent calls to get_subselect_alias_id() should find tlist
there. Then in deparseRangeTblRef() assert that there's a tlist in fpinfo
and use it to build the SELECT clause of subquery. That way, we don't build
tlist unless it's needed and also use the same tlist for all searches. Please
use tlist_member() to search into the tlist.


I don't think that's a good idea because that would make the code 
unnecessarily complicated and inefficient.  I think the direct search 
into the foreignrel->reltarget->exprs shown above would be better 
because that's more simple and efficient than what you proposed.  Note 
that since (1) the foreignrel->reltarget->exprs doesn't contain any PHVs 
and (2) the local_conds is empty, the tlist created for the subquery by 
build_tlist_to_deparse is guaranteed to have one-to-one correspondence 
with the foreignrel->reltarget->exprs, so the direct search works well.



The name get_subselect_alias_id() seems to suggest that the function returns
identifier for subselect alias, which isn't correct. It returns the alias
identifiers for deparsing a Var node. But I guess, we have left this to the
committer's judgement.


Fine with me.

Best regards,
Etsuro Fujita






Re: [HACKERS] patch proposal

2016-11-22 Thread Haribabu Kommi
Hi Abhijit,

This is a gentle reminder.

you assigned as reviewer to the current patch in the 11-2016 commitfest.
But you haven't shared your review yet. Please share your review about
the patch. This will help us in smoother operation of commitfest.

Please Ignore if you already shared your review.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] GiST support for UUIDs

2016-11-22 Thread Haribabu Kommi
On Wed, Nov 2, 2016 at 3:49 AM, Adam Brusselback 
wrote:

> So I apologize in advance if I didn't follow the processes exactly, I was
> going to attempt to review this to move it along, but ran into issues
> applying the patch cleanly to master.  I fixed the issues I was having
> applying it, and created a new patch (attached).
>
> Managed to test it out after I got it applied, and it is working as
> expected for exclusion constraints, as well as normal indexes.
>
> This test was performed on Windows 10 (64 bit), and Postgres was compiled
> using MSYS2
>

Thanks for the review. The proposed patch still applies cleanly. If you
don't have any further comments on the patch, please change the patch status
to "ready for committer" to bring it to a committer's attention. This will
help us in the smoother operation of the commitfest.


Regards,
Hari Babu
Fujitsu Australia


[HACKERS] [PATCH] ALTER DEFAULT PRIVILEGES with GRANT/REVOKE ON SCHEMAS

2016-11-22 Thread Matheus de Oliveira
Hi all,

I noticed that we have no option to set default privileges for newly
created schemas, other than calling GRANT explicitly. At work I use the
ALTER DEFAULT PRIVILEGES (ADP) command extensively, as the developers are
permitted to manage DDL on the databases, and all works fine except when
a new schema is created. So, I'd like to propose this very simple patch
(attached) that adds the capability of using SCHEMAS, adding the following
syntax to ADP:

ALTER DEFAULT PRIVILEGES
[ FOR { ROLE | USER } target_role [, ...] ]
abbreviated_grant_or_revoke

where abbreviated_grant_or_revoke is one of:

GRANT { USAGE | CREATE | ALL [ PRIVILEGES ] }
ON SCHEMAS
TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ]

REVOKE [ GRANT OPTION FOR ]
{ USAGE | CREATE | ALL [ PRIVILEGES ] }
ON SCHEMAS
FROM { [ GROUP ] role_name | PUBLIC } [, ...]
[ CASCADE | RESTRICT ]

The patch itself is really straightforward (I'm new to sending patches, so
I've chosen a simple one), and there is only one thing that concerns me
(namely, whether I did it right). The difference in syntax between SCHEMAS
and the other objects is that the IN SCHEMA option makes no sense here (as
we don't have nested schemas), and to handle that I simply added the error
"cannot use IN SCHEMA clause when using GRANT/REVOKE ON SCHEMAS".

Does that look good to you?

Also, should I add translations for that error message in other languages
(I can do that without the help of tools for pt_BR), or is that handled
later in the release process?

Other than that, I added a few regression tests (similar to others used for
ADP), and patched the documentation (my English is not that good, so I'm
open to suggestions). Anything else I forgot?

While at this, I'd like to ask whether you are interested in having all the
other object types we have in GRANT/REVOKE supported in ADP (I myself see
little use for that at work, but the symmetry between those commands seems
like a good idea). If you agree, I can take some time to do the others (it
looks very simple to do). I just wonder whether that should be done as one
patch for each, or a single patch for all of them (perhaps sending the
sequence of patches in order, as one will certainly conflict with another if
done separately).

Best regards,
-- 
Matheus de Oliveira
diff --git a/doc/src/sgml/ref/alter_default_privileges.sgml b/doc/src/sgml/ref/alter_default_privileges.sgml
index 04064d3..7745792 100644
*** a/doc/src/sgml/ref/alter_default_privileges.sgml
--- b/doc/src/sgml/ref/alter_default_privileges.sgml
***
*** 46,51  GRANT { USAGE | ALL [ PRIVILEGES ] }
--- 46,55 
  ON TYPES
  TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ]
  
+ GRANT { USAGE | CREATE | ALL [ PRIVILEGES ] }
+ ON SCHEMAS
+ TO { [ GROUP ] role_name | PUBLIC } [, ...] [ WITH GRANT OPTION ]
+ 
  REVOKE [ GRANT OPTION FOR ]
  { { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES | TRIGGER }
  [, ...] | ALL [ PRIVILEGES ] }
***
*** 71,76  REVOKE [ GRANT OPTION FOR ]
--- 75,86 
  ON TYPES
  FROM { [ GROUP ] role_name | PUBLIC } [, ...]
  [ CASCADE | RESTRICT ]
+ 
+ REVOKE [ GRANT OPTION FOR ]
+ { USAGE | CREATE | ALL [ PRIVILEGES ] }
+ ON SCHEMAS
+ FROM { [ GROUP ] role_name | PUBLIC } [, ...]
+ [ CASCADE | RESTRICT ]
  
   
  
***
*** 125,130  REVOKE [ GRANT OPTION FOR ]
--- 135,142 
are altered for objects later created in that schema.
If IN SCHEMA is omitted, the global default privileges
are altered.
+   IN SCHEMA is not allowed when using ON SCHEMAS
+   as schemas can't be nested.
   
  
 
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index c0df671..e1256f1 100644
*** a/src/backend/catalog/aclchk.c
--- b/src/backend/catalog/aclchk.c
***
*** 948,953  ExecAlterDefaultPrivilegesStmt(ParseState *pstate, AlterDefaultPrivilegesStmt *s
--- 948,957 
  			all_privileges = ACL_ALL_RIGHTS_TYPE;
  			errormsg = gettext_noop("invalid privilege type %s for type");
  			break;
+ 		case ACL_OBJECT_NAMESPACE:
+ 			all_privileges = ACL_ALL_RIGHTS_NAMESPACE;
+ 			errormsg = gettext_noop("invalid privilege type %s for schema");
+ 			break;
  		default:
  			elog(ERROR, "unrecognized GrantStmt.objtype: %d",
   (int) action->objtype);
***
*** 1135,1140  SetDefaultACL(InternalDefaultACL *iacls)
--- 1139,1154 
  this_privileges = ACL_ALL_RIGHTS_TYPE;
  			break;
  
+ 		case ACL_OBJECT_NAMESPACE:
+ 			if (OidIsValid(iacls->nspid))
+ ereport(ERROR,
+ 		(errcode(ERRCODE_INVALID_GRANT_OPERATION),
+ 		 errmsg("cannot use IN SCHEMA clause when using GRANT/REVOKE ON SCHEMAS")));
+ 			objtype = DEFACLOBJ_NAMESPACE;
+ 			if (iacls->all_privs && this_privileges == ACL_NO_RIGHTS)
+ this_privileges = ACL_ALL_RIGHTS_NAMESPACE;
+ 			break;
+ 
  		default:
  			elog(ERROR, "unrecognized objtype: %d",
   (int) 

Re: [HACKERS] pg_hba_file_settings view patch

2016-11-22 Thread Ashutosh Bapat
On Fri, Nov 18, 2016 at 12:23 PM, Haribabu Kommi
 wrote:
>
>
> On Thu, Nov 17, 2016 at 10:13 PM, Ashutosh Bapat
>  wrote:
>>
>> On Wed, Nov 16, 2016 at 4:40 PM, Ashutosh Bapat
>>  wrote:
>> > make check run with this patch shows server crashes. regression.out
>> > attached. I have run make check after a clean build, tried building it
>> > after running configure, but the problem is always reproducible. Do
>> > you see this problem?
>
>
> Thanks for reviewing the patch.
>
> No. I am not able to reproduce this problem.
> make check works fine in my system.

It could be because of some un-initialised variable, which is
initialized appropriately by default on your machine but not on my
machine. I first applied your pg_hba_rules... patch and ran the regression
tests; it didn't crash. Then I applied the patch for discard_hba... and it
started crashing. Does that give you any clue? Here's the regression.out
file for make installcheck. Here is error log snippet that shows a
SIGSEGV there.
2016-11-22 15:47:11.939 IST [86206] LOG:  worker process: parallel
worker for PID 86779 (PID 86780) was terminated by signal 11:
Segmentation fault
2016-11-22 15:47:11.939 IST [86206] LOG:  terminating any other active
server processes
2016-11-22 15:47:11.939 IST [86779] WARNING:  terminating connection
because of crash of another server process
2016-11-22 15:47:11.939 IST [86779] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.

Applying those patches in any order doesn't matter.

>
> From the regression.out file, the crash occurred in select_parallel.out,
> I don't think this patch has any effect on that test.

The changes in postinit.c may have that impact. Just a guess though. I
haven't debugged the crash myself.


>>
>> I looked at the patch in some more details and here are some more comments
>> 1. In catalog.sgml, the sentence "If the configuration file contains any
>> errors
>> ..." looks redundant, as description of "error" field says so. Removed it
>> in
>> the attached patch. In that example, you might want to provide pg_hba.conf
>> contents to help understand the view output.
>
>
> updated details, but not exactly what you said. check it once.
>



>
>>
>> in parse_hba_line()
>> -ereport(LOG,
>> +ereport(level,
>>  (errcode(ERRCODE_CONFIG_FILE_ERROR),
>>   errmsg("invalid connection type \"%s\"",
>>  token->string),
>>   errcontext("line %d of configuration file \"%s\"",
>>  line_num, HbaFileName)));
>> +*err_msg = pstrdup(_("invalid connection type"));
>>
>> Is the difference between error reported to error log and that in the view
>> intentional? That brings to another question. Everywhere, in similar code,
>> the
>> patch adds a line *err_msg = pstrdup() or psprinf() and copies the
>> arguements
>> to errmsg(). Someone modifying the error message has to duplicate the
>> changes.
>> Since the code is just few lines away, it may not be hard to duplicate the
>> changes, but still that's a maintenance burder. Isn't there a way to
>> compute
>> the message once and use it twice? show_all_file_settings() used for
>> pg_file_settings also has similar problem, so may be it's an accepted
>> practice.
>> There are multiple instances of such a difference, but may be the invalid
>> value
>> can be found out from the value of the referenced field (which will be
>> part of
>> the view). So, may be it's ok. But that not true with the difference
>> below.
>> gai_strerror() may not be obvious from the referenced field.
>> -ereport(LOG,
>> +ereport(level,
>>  (errcode(ERRCODE_CONFIG_FILE_ERROR),
>>   errmsg("invalid IP address \"%s\": %s",
>>  str, gai_strerror(ret)),
>>   errcontext("line %d of configuration file
>> \"%s\"",
>>  line_num, HbaFileName)));
>>  if (gai_result)
>>  pg_freeaddrinfo_all(hints.ai_family, gai_result);
>> +*err_msg = pstrdup(_("invalid IP address"));
>
>
> Reused the error string once, as in this patch it changes in many places
> compared to pg_file_settings, so I tend to reuse it.

Thanks. The new change might affect the way we translate the messages into
other languages, though; I am not sure, so I will leave it for someone with
more knowledge to review.
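
For reference, the "compute the message once, use it for both outputs"
shape being discussed would be roughly the fragment below. It reuses the
names from the quoted hunk (level, token, line_num, HbaFileName, err_msg);
whether psprintf()/errmsg_internal() are the right tools here, and how
that interacts with translation, is exactly the open question above, so
treat this as a sketch rather than a proposal.

/* Build the (translated) message once ... */
char   *msg = psprintf(_("invalid connection type \"%s\""), token->string);

/* ... use it for the server log ... */
ereport(level,
        (errcode(ERRCODE_CONFIG_FILE_ERROR),
         errmsg_internal("%s", msg),    /* already translated above */
         errcontext("line %d of configuration file \"%s\"",
                    line_num, HbaFileName)));

/* ... and for the error column of the view. */
*err_msg = msg;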


>
>>
>> 5. May be you want to rename "type" attribute to "connection_type" to be
>> explicit.
>
>
> "type" is the keyword mentioned in pg_hba.conf, and I feel it is better if
> this view stays in sync with that. If others feel the same, I can change it.

Ok.


>
>>
>> 7. Also, each of the 

Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw

2016-11-22 Thread Rushabh Lathia
I started reviewing the patch; here are a few initial review points and
questions for you.

1)
-static void deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+static void deparseExplicitTargetList(bool is_returning,
+  List *tlist,
+  List **retrieved_attrs,
   deparse_expr_cxt *context);

Any particular reason for inserting the new argument as the first argument?
In general we add new flags at the end, or before the output parameters of
the existing function.

2)
-static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
-RelOptInfo *joinrel, bool use_alias, List
**params_list);
+static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
RelOptInfo *foreignrel,
+  bool use_alias, Index target_rel, List
**params_list);


These lines exceed the 80-character limit.

3)
 static void
-deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
+deparseExplicitTargetList(bool is_returning,
+  List *tlist,
+  List **retrieved_attrs,
   deparse_expr_cxt *context)

This looks a bit weird to me: the name and comment of
deparseExplicitTargetList() say something different from what the function
actually does when the is_returning flag is passed. Either the function
comment needs to change, or maybe a separate function should deparse the
returning list for a join relation?

Is it possible to teach deparseReturningList() to deparse the returning
list for the joinrel?

4)

+if (foreignrel->reloptkind == RELOPT_JOINREL)
+{
+List   *rlist = NIL;
+
+/* Create a RETURNING list */
+rlist = make_explicit_returning_list(rtindex, rel,
+ returningList,
+ fdw_scan_tlist);
+
+deparseExplicitReturningList(rlist, retrieved_attrs, );
+}
+else
+deparseReturningList(buf, root, rtindex, rel, false,
+ returningList, retrieved_attrs);

The code would be more centralized if we moved the logic for extracting the
returning list for a JOINREL inside deparseReturningList() itself, wouldn't
it? A rough sketch of what I mean follows.
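
Something along these lines is what I have in mind (all names and signatures
follow the patch under review, none of this is committed code; it is only
meant to show the shape of the change):

    static void
    deparseReturningList(StringInfo buf, PlannerInfo *root, Index rtindex,
                         Relation rel, bool trig_after_row,
                         List *returningList, List **retrieved_attrs,
                         RelOptInfo *foreignrel, List *fdw_scan_tlist,
                         deparse_expr_cxt *context)
    {
        if (foreignrel && foreignrel->reloptkind == RELOPT_JOINREL)
        {
            /* Join case: build an explicit RETURNING list and deparse it. */
            List   *rlist = make_explicit_returning_list(rtindex, rel,
                                                         returningList,
                                                         fdw_scan_tlist);

            deparseExplicitReturningList(rlist, retrieved_attrs, context);
        }
        else
        {
            /* Existing single-table RETURNING deparsing stays here unchanged. */
        }
    }

That way the caller needs only a single call, regardless of reloptkind.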

5) make_explicit_returning_list() pulls the var list from the returningList,
builds the target entries for the returning list, and at the end rewrites
fdw_scan_tlist.

AFAIK, for DML that is going to be pushed down to the remote server,
fdw_scan_tlist should ideally be the same as the returning list, since the
final output of the query will be the RETURNING list only. Isn't that true?

If yes, then why can't we directly replace fdw_scan_tlist with the returning
list, rather than going through make_explicit_returning_list()?

Also, I think rewriting fdw_scan_tlist should happen in postgres_fdw.c
rather than in deparse.c.

6) Test coverage for the returning list is missing.



On Fri, Nov 18, 2016 at 8:56 AM, Etsuro Fujita 
wrote:

> On 2016/11/16 16:38, Etsuro Fujita wrote:
>
>> On 2016/11/16 13:10, Ashutosh Bapat wrote:
>>
>>> I don't see any reason why DML/UPDATE pushdown should depend upon
>>> subquery deparsing or least PHV patch. Combined together they can help
>>> in more cases, but without those patches, we will be able to push-down
>>> more stuff. Probably, we should just restrict push-down only for the
>>> cases when above patches are not needed. That makes reviews easy. Once
>>> those patches get committed, we may add more functionality depending
>>> upon the status of this patch. Does that make sense?
>>>
>>
>> OK, I'll extract from the patch the minimal part that wouldn't depend on
>> the two patches.
>>
>
> Here is a patch for that.  Todo items are: (1) add more comments and (2)
> add more regression tests.  I'll do that in the next version.
>
> Best regards,
> Etsuro Fujita
>



-- 
Rushabh Lathia


Re: [HACKERS] postgres 9.3 postgres_fdw ::LOG: could not receive data from client: Connection reset by peer

2016-11-22 Thread Vladimir Svedov
Hi,
Sorry - I tried to reproduce this on another machine and gather all the
statements, and failed.
I installed 9.3 (which shows those symptoms) and still can't reproduce it.
It must be platform-specific, not version-specific.

2016-11-21 21:58 GMT+00:00 Kevin Grittner :

> On Mon, Nov 21, 2016 at 8:32 AM, Vladimir Svedov 
> wrote:
>
> > I have this question. Looked for a help on http://dba.stackexchange.com/
> > No success.
>
> A link to the actual question would be appreciated.
>
> > "FOREIGN_TABLE" created with postgres_fdw. LOCAL_TABLE is just a local
> table...
> >
> > Symptoms:
> >
> > I run in psql query SELECT * from FOREIGN_TABLE. No log generated
> > I run in bash psql -c "SELECT * from LOCAL_TABLE". No log generated
> > I run in bash psql -c "SELECT * from FOREIGN_TABLE". ::LOG: could not
> receive data from client: Connection reset by peer generated in postgres log
>
> Please provide more information, and preferably a self-contained
> test case (one that anyone can run on an empty database to see the
> problem).
>
> https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-11-22 Thread Ashutosh Bapat
On Tue, Nov 22, 2016 at 12:26 PM, Tsunakawa, Takayuki
 wrote:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
>> The point I was trying to make is that I think the forced-removal behavior
>> is not desirable, and therefore committing a patch that makes it be graven
>> in stone is not desirable either.
>
> I totally agree that we should pursue the direction for escaping from the 
> complete loss of stats files.  Personally, I would like to combine that with 
> the idea of persistent performance diagnosis information for long-term 
> analysis (IIRC, someone proposed it.)  However, I don't think my patch will 
> make everyone forget about the problem of stats file loss during recovery.  
> The problem exists with or without my patch, and my patch doesn't have the 
> power to dilute the importance of the problem.  If you are worried about
> memory, we can add an entry for the problem in the TODO list that Bruce-san
> is maintaining.

I don't think Tom is worried about forgetting this. I think he is
worried that if we make the changes suggested in the 01... patch, we will
make it more difficult to change the stats file behaviour later. Right now
we are throwing away statistics files at the time of crash recovery.
If we want to change that tomorrow, we will have to fix the existing problem
with half-written/stale stats files and then change the code not to zap
those files at the time of crash recovery. If we go ahead with this patch,
we will have more instances of creating half-written/stale stats files,
which will need to be fixed.

>
> The 9.4 change may be sufficient.  But I don't think I can confidently
> explain the logic to a really demanding customer.  I can't answer the
> question "Why does PostgreSQL write files that will be deleted, even during
> 'immediate' shutdown?  Why does PostgreSQL spend 5 seconds doing nothing?"
>
> Other children do nothing and exit immediately.  I believe they are behaving 
> correctly.

I agree with this, though.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Re: Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() from walsender?

2016-11-22 Thread Kyotaro HORIGUCHI
No, that's not right.

At Tue, 22 Nov 2016 17:45:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20161122.174534.266086549.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> At Tue, 22 Nov 2016 12:35:56 +0800, Craig Ringer  
> wrote in 
> > On 22 November 2016 at 11:35, Kyotaro HORIGUCHI
> >  wrote:
> > > Hello,
> > >
> > > At Mon, 21 Nov 2016 21:39:07 -0500, Robert Haas  
> > > wrote in 
> > > 
> > >> On Mon, Nov 21, 2016 at 3:21 AM, Craig Ringer  
> > >> wrote:
> > >> > I'm still interested in hearing comments from experienced folks about
> > >> > whether it's sane to do this at all, rather than have similar
> > >> > duplicate signal handling for the walsender.
> > >>
> > >> Well, I mean, if it's reasonable to share code in a given situation,
> > >> that is generally better than NOT sharing code...
> > >
> > > Walsender handles SIGUSR1 in a completely different way from normal
> > > backends. procsignal_sigusr1_handler is designed to work as
> > > the peer of SendProcSignal (not ProcSendSignal). The walsender
> > > just uses a latch to be woken up; it has nothing to do with
> > > SendProcSignal.
> > 
> > Indeed, at the moment it doesn't. I'm proposing to change that, since
> > walsenders doing logical decoding on a standby need to be notified of
> > conflicts with recovery, many of which are the same as for normal user
> > backends and bgworkers.
> > 
> > The main exception is snapshot conflicts, where logical decoding
> > walsenders don't care about conflicts with specific tables, they want
> > to be terminated if the effective catalog_xmin on the upstream
> > increases past their needed catalog_xmin. They don't care about
> > non-catalog (or non-heap-catalog) tables. So I expect we'd just want
> > to ignore PROCSIG_RECOVERY_CONFLICT_SNAPSHOT when running a walsender
> > and handle conflict with catalog_xmin increases separately.
> 
> Thank you for the explanation. It seems reasonable for the
> walsender to have the same mechanism as
> procsignal_sigusr1_handler. Whether to share a handler or have another
> one is a matter of design. (But sharing it might not be better if
> the walsender handles only two reasons.)
...
> > > If you need to expand the function of SIGUSR1 of walsender, more
> > > thought would be needed.
> > 
> > Yeah, I definitely don't think it's as simple as just using
> > procsignal_sigusr1_handler as-is. I expect we'd likely create a new
> > global IsWalSender and ignore some RecoveryConflictInterrupt cases
> > when in a walsender, at least PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, and
> > probably add a new case for catalog_xmin conflicts that's only acted
> > on when IsWalSender.
> 
> The global is unnecessary if the walsender has a different handler
> from normal backends. If there are one or a few additional
> signal reasons, sharing SendProcSignal and having a dedicated
> handler might be better.

If no behavior is shared between normal backends and the walsender, that
would be a good reason not to share the handler function. What you intend
to do seems to fall into that case.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Push down more full joins in postgres_fdw

2016-11-22 Thread Ashutosh Bapat
The comments should explain why the assertion is true.
+/* Shouldn't be NIL */
+Assert(tlist != NIL);
+/* Should be same length */
+Assert(list_length(tlist) ==
list_length(foreignrel->reltarget->exprs));


>
> OK, I'd like to propose referencing to foreignrel->reltarget->exprs directly
> in deparseRangeTblRef and get_subselect_alias_id, then, which is the same as
> what I proposed in the first version of the patch, but I'd also like to
> change deparseRangeTblRef to use add_to_flat_tlist, not
> make_tlist_from_pathtarget, to create a tlist of the subquery, as you
> proposed.

There is already a function to build the targetlist for a given relation,
build_tlist_to_deparse(), which does the same thing as this code for a join
or base relation when there are no local conditions. Why don't we use that
function instead of duplicating the logic? If the logic to build the
targetlist changes tomorrow, we will need to duplicate those changes. We
should avoid that.
+/* Build a tlist from the subquery. */
+tlist = add_to_flat_tlist(tlist, foreignrel->reltarget->exprs);

The comment below says "get the aliases", but what the function really returns
is the identifiers used for creating aliases. Please correct the comment.
+/*
+ * Get the relation and column alias for a given Var node, which belongs to
+ * input foreignrel. They are returned in *tabno and *colno respectively.
+ */

We discussed that we have to deparse and search from the same targetlist,
and that the targetlist should be saved in fpinfo the first time it gets
created. But the patch seems to search foreignrel->reltarget->exprs while
deparsing from the tlist returned by add_to_flat_tlist(tlist,
foreignrel->reltarget->exprs).
+foreach(lc, foreignrel->reltarget->exprs)
+{
+if (equal(lfirst(lc), (Node *) node))
+{
+*colno = i;
+return;
+}
+i++;
+}
I guess the reason you are doing it this way is that the SELECT clause of
the outermost query gets deparsed before its FROM clause; it is only for the
latter that we call deparseRangeTblRef(), which builds the tlist. So, while
deparsing the SELECT clause, we do not yet have a tlist to search. In that
case, I guess, we have to build the tlist in get_subselect_alias_id() if it
is not available yet and stick it in fpinfo; subsequent calls to
get_subselect_alias_id() should find the tlist there. Then
deparseRangeTblRef() can assert that there is a tlist in fpinfo and use it
to build the SELECT clause of the subquery. That way, we do not build the
tlist unless it is needed, and we use the same tlist for all searches.
Please use tlist_member() to search the tlist; a sketch of this flow is
given below.
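
Roughly, something like this (build_tlist_to_deparse() and tlist_member()
exist today; the "subquery_tlist" field of PgFdwRelationInfo is a
hypothetical addition, and get_subselect_alias_id() comes from the patch
under review, so treat this only as a sketch of the flow, not working code):

    static List *
    get_subquery_tlist(RelOptInfo *foreignrel)
    {
        PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) foreignrel->fdw_private;

        /* Build the tlist once, on first use, and cache it for later callers. */
        if (fpinfo->subquery_tlist == NIL)
            fpinfo->subquery_tlist = build_tlist_to_deparse(foreignrel);

        return fpinfo->subquery_tlist;
    }

    static void
    get_subselect_alias_id(Var *node, RelOptInfo *foreignrel,
                           int *tabno, int *colno)
    {
        TargetEntry *tle = tlist_member((Node *) node,
                                        get_subquery_tlist(foreignrel));

        Assert(tle != NULL);    /* the Var must come from this relation */
        *colno = tle->resno;
        /* tabno derivation omitted from this sketch */
    }

deparseRangeTblRef() would then assert that fpinfo->subquery_tlist is not
NIL and deparse the subquery's SELECT list from that same list.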

The name get_subselect_alias_id() seems to suggest that the function returns
an identifier for the subselect alias, which isn't correct; it returns the
alias identifiers used when deparsing a Var node. But I guess we have left
this to the committer's judgement.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-11-22 Thread Kyotaro HORIGUCHI
I almost forgot this.

At Mon, 21 Nov 2016 15:44:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20161121.154408.47398334.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> At Mon, 21 Nov 2016 14:41:27 +0900, Michael Paquier 
>  wrote in 
> 
> > On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
> >  wrote:
> > > So, all my original concern were cleared.
> > 
> > Cool. Perhaps this could be marked as ready for committer then?
> 
> ^^;
> 
> > > The last one is
> > > resetting by a checkpointer restart.. I'd like to remove that if
> > > Andres agrees.
> > 
> > Could you clarify this point? v18 makes sure that the last segment
> > switch stays in shared memory so as we could still skip the activity
> > of archive_timeout correctly.
> 
> I don't doubt that it works. (I am not commenting on the comment.) My
> concern is complexity: I don't think we should add code to guard against an
> almost harmless behavior caused by something that rarely happens.  But if
> you and others on this thread don't mind the complexity, it's not worth
> pressing the point further.
> 
> So, after a day waiting, I'll mark this as ready for committer
> again.

I have marked this as ready for committer again.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Measuring replay lag

2016-11-22 Thread Thomas Munro
On Tue, Nov 8, 2016 at 2:35 PM, Masahiko Sawada  wrote:
> replay_lag_sample_interval is 1s by default but I got 1000s by SHOW command.
> postgres(1:36789)=# show replay_lag_sample_interval ;
>  replay_lag_sample_interval
> 
>  1000s
> (1 row)

Oops, fixed.

>> 1.  The lag is measured in time, not LSN difference.
>> 2.  The lag time is computed using two observations of a single
>> server's clock, so there is no clock skew.
>> 3.  The lag is updated even between commits (during large data loads etc).
>
> I agree with this approach.

Thanks for the feedback.
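
To restate point 2 concretely: the sender keeps (LSN, timestamp) samples
taken from its own clock, and when the standby's reply shows that replay has
passed a sampled LSN, the lag is the difference between two readings of that
same clock, so clock skew between servers never enters the calculation. A
rough sketch of the bookkeeping (illustrative only; the names are not from
the patch):

    /* One sample remembered on the sending server. */
    typedef struct LagSample
    {
        XLogRecPtr  lsn;            /* WAL position when the sample was taken */
        TimestampTz sampled_at;     /* sender's clock at that moment */
    } LagSample;

    /*
     * Called on the sending server when a standby reply reports its replay
     * position; both timestamps come from the sender's clock.
     */
    static int64
    replay_lag_usecs(const LagSample *sample, XLogRecPtr standby_replay_lsn)
    {
        if (standby_replay_lsn < sample->lsn)
            return -1;              /* this sample has not been replayed yet */

        return GetCurrentTimestamp() - sample->sampled_at;
    }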

> I think that you need to change sendFeedback() in pg_recvlogical.c and
> receivexlog.c as well.

Right, fixed.

Thanks very much for testing!  New version attached.  I will add this
to the next CF.

-- 
Thomas Munro
http://www.enterprisedb.com


replay-lag-v13.patch
Description: Binary data



Re: [HACKERS] Forbid use of LF and CR characters in database and role names

2016-11-22 Thread Ideriha, Takeshi
Hi,
Here's a summary of what I tested on RHEL 7.0; details follow.

[Summary]
1. Apply the patch and make world
  -> failed because an end tag (</para>) was mistakenly coded as a start tag
     (<para>).

2. Correct this mistake and make check-world
  -> got 1 failed test, 'pg_dumpall with \n\r in database name',
     because the test script cannot createdb "foo\n\rbar"


Details follow:
[1. Apply the patch and make world]

I tried to make world after applying the patch, and there seems to be one
small mistake in create_role.sgml.
===

openjade:ref/create_role.sgml:413:7:E: document type does not allow element 
"PARA" here; missing one of "FOOTNOTE", "ITEMIZEDLIST", "ORDEREDLIST", 
"VARIABLELIST", "CAUTION", "IMPORTANT", "NOTE", "TIP", "WARNING", "BLOCKQUOTE", 
"INFORMALEXAMPLE" start-tag
openjade:ref/create_role.sgml:414:11:E: end tag for "PARA" omitted, but OMITTAG 
NO was specified
openjade:ref/create_role.sgml:413:2: start tag was here
openjade:ref/create_role.sgml:414:11:E: end tag for "PARA" omitted, but OMITTAG 
NO was specified
openjade:ref/create_role.sgml:408:2: start tag was here
make[3]: *** [HTML.index] Error 1
make[3]: *** Deleting file `HTML.index'

===

[2. Correct this mistake and make check-world]
After fixing this mistake locally by changing the start tag to an end tag,
PostgreSQL and the documentation built successfully.
The "make check" tests also passed (but not "make check-world").


diff --git a/doc/src/sgml/ref/create_role.sgml 
b/doc/src/sgml/ref/create_role.sgml
index 9d6926e..ff4b6f6 100644
--- a/doc/src/sgml/ref/create_role.sgml
+++ b/doc/src/sgml/ref/create_role.sgml
@@ -410,7 +410,7 @@ CREATE ROLE name [ [ WITH ] 
+  
  
 
  
=

This failure actually seems to be caused by line 65 of t/010_dump_connstr.pl:
"$node->run_log(['createdb', "foo\n\rbar"]);".

This means your patch works as intended.

Excerpt from log follows. 
===
#   Failed test 'pg_dumpall with \n\r in database name'
#   at 
/home/ideriha/postgres-master/src/bin/pg_dump/../../../src/test/perl/TestLib.pm 
line 230.
# Looks like you failed 1 test of 14.
t/010_dump_connstr.pl .. 

not ok 6 - pg_dumpall with \n\r in database name

Failed 1/14 subtests 

Test Summary Report
---
t/010_dump_connstr.pl (Wstat: 256 Tests: 14 Failed: 1)
  Failed test:  6
  Non-zero exit status: 1
Files=3, Tests=1822, 58 wallclock secs ( 0.20 usr  0.01 sys + 12.52 cusr  1.94 
csys = 14.67 CPU)
Result: FAIL
=


Regards,
Ideriha, Takeshi
Fujitsu



Re: [HACKERS] pg_recvlogical --endpos

2016-11-22 Thread Okano, Naoki
On Monday, November 21, 2016 1:08 PM Craig Ringer wrote:
> I've updated the patch for this. It's already posted on the logical
> decoding timeline following thread, so I'll avoid repeating it here.
> 
> https://www.postgresql.org/message-id/CAMsr%2BYGd5dv3zPNch6BU4UXX49NJDC9m3-Y%3DV5q%3DTNcE9QgSaQ%40mail.gmail.com
I checked the latest patch.
I think the error message shown below is incorrect.

> + if (endpos != InvalidXLogRecPtr && !do_start_slot)
> + {
> + fprintf(stderr,
> + _("%s: cannot use --create-slot or --drop-slot 
> together with --endpos\n"),
The condition '!do_start_slot' is not reflected in the error message.
The patch should allow --endpos to work with --create-slot.

Also, the documentation explains it as follows:
> +specified LSN.  If specified when not in --start
> +mode, an error is raised.
So it would be better to output an error message that matches the
documentation when changing the message; a sketch follows below.
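
For illustration, a minimal sketch of how the check could read if the
message is made to match the documented --start-only behavior (the wording
is mine, and endpos itself comes from the patch under review):

    if (endpos != InvalidXLogRecPtr && !do_start_slot)
    {
        fprintf(stderr,
                _("%s: --endpos may only be specified with --start\n"),
                progname);
        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                progname);
        exit(1);
    }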

Regards,
Okano Naoki
Fujitsu



Re: [HACKERS] Proposal : For Auto-Prewarm.

2016-11-22 Thread Ashutosh Bapat
> You are assigned as a reviewer for the current patch in the 11-2016
> commitfest, but you haven't shared your review yet. Please share your views
> about the patch. This will help the commitfest run more smoothly.
>

Thanks for the reminder.

Mithun has not provided a patch addressing the comments upthread. I am
waiting for his response to those comments.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Re: Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() from walsender?

2016-11-22 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 22 Nov 2016 12:35:56 +0800, Craig Ringer  wrote 
in 
> On 22 November 2016 at 11:35, Kyotaro HORIGUCHI
>  wrote:
> > Hello,
> >
> > At Mon, 21 Nov 2016 21:39:07 -0500, Robert Haas  
> > wrote in 
> > 
> >> On Mon, Nov 21, 2016 at 3:21 AM, Craig Ringer  
> >> wrote:
> >> > I'm still interested in hearing comments from experienced folks about
> >> > whether it's sane to do this at all, rather than have similar
> >> > duplicate signal handling for the walsender.
> >>
> >> Well, I mean, if it's reasonable to share code in a given situation,
> >> that is generally better than NOT sharing code...
> >
> > Walsender handles SIGUSR1 in a completely different way from normal
> > backends. procsignal_sigusr1_handler is designed to work as
> > the peer of SendProcSignal (not ProcSendSignal). The walsender
> > just uses a latch to be woken up; it has nothing to do with
> > SendProcSignal.
> 
> Indeed, at the moment it doesn't. I'm proposing to change that, since
> walsenders doing logical decoding on a standby need to be notified of
> conflicts with recovery, many of which are the same as for normal user
> backends and bgworkers.
> 
> The main exception is snapshot conflicts, where logical decoding
> walsenders don't care about conflicts with specific tables, they want
> to be terminated if the effective catalog_xmin on the upstream
> increases past their needed catalog_xmin. They don't care about
> non-catalog (or non-heap-catalog) tables. So I expect we'd just want
> to ignore PROCSIG_RECOVERY_CONFLICT_SNAPSHOT when running a walsender
> and handle conflict with catalog_xmin increases separately.

Thank you for the explanation. It seems reasonable for the
walsender to have the same mechanism as
procsignal_sigusr1_handler. Whether to share a handler or have another
one is a matter of design. (But sharing it might not be better if
the walsender handles only two reasons.)

> > IMHO, the walsender cannot simply share the backends' handler,
> > for the practical reason that pss_signalFlags can cause
> > harm.
> 
> I'm not sure I see the problem. The ProcSignalReason argument to
> RecoveryConflictInterrupt states that:
> 
>  * Also, because of race conditions, it's important that all the signals be
>  * defined so that no harm is done if a process mistakenly receives one.

It won't cause an actual problem, but pss_signalFlags is not managed in the
*current* walsender. I meant that taking any action based on an unmanaged
variable seems like a design flaw. But anyway, there is no problem if it is
redesigned to be used in the walsender.

> > If you need to expand the function of SIGUSR1 of walsender, more
> > thought would be needed.
> 
> Yeah, I definitely don't think it's as simple as just using
> procsignal_sigusr1_handler as-is. I expect we'd likely create a new
> global IsWalSender and ignore some RecoveryConflictInterrupt cases
> when in a walsender, at least PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, and
> probably add a new case for catalog_xmin conflicts that's only acted
> on when IsWalSender.

The global is unnecessary if the walsender has a different handler
from normal backends. If there are one or a few additional
signal reasons, sharing SendProcSignal and having a dedicated
handler might be better.
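
For instance, a dedicated handler could keep the current latch wakeup and
only look at the reasons a walsender cares about.  A rough sketch
(illustrative only: it assumes CheckProcSignal(), which is currently static
in procsignal.c, were exported, and the pending flag and its consumer in the
walsender main loop are hypothetical):

    static volatile sig_atomic_t WalSndConflictPending = false;

    static void
    WalSndProcSignalHandler(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        /* Only the reasons relevant to a walsender are examined. */
        if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
            WalSndConflictPending = true;   /* acted on in the main loop */

        /* Preserve the current behavior: wake the walsender's latch. */
        SetLatch(MyLatch);

        errno = save_errno;
    }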

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



