date:20160908

On 7 September 2016 at 15:44, Vladimir Gordiychuk  wrote:

> Fixed patch in attach.

It'd helpful if you summarize the changes made when posting revisions.

>> * There are no tests. I don't see any really simple way to test this,
>> though.
>
> I will be grateful if you specify the best way how to do it. I tests this
> patches by pgjdbc driver tests that also build on head of postgres. You
can
> check test scenario for both patches in
> PRhttps://github.com/pgjdbc/pgjdbc/pull/550
>
>
org.postgresql.replication.LogicalReplicationTest#testDuringSendBigTransactionReplicationStreamCloseNotActive
>
org.postgresql.replication.LogicalReplicationTest#testAfterCloseReplicationStreamDBSlotStatusNotActive

Yeah, testing it isn't trivial in postgres core, since of course none of
the tools know to send a client-initiated CopyDone.

My suggestion is to teach pg_recvlogical to send CopyDone on the first
SIGINT (control-C) and wait until the server ends the data-stream and
returns to command mode. A second control-C should unilaterally close the
connection and it should close after a timeout too. This will be backward
compatible with 9.4/5/6 (just with a delay on exit by sigint).

Then in a TAP test in src/test/recovery set up a logical decoding session
with test_decoding and pg_recvlogical. Do a test xact then sigint
pg_recvlogical and verify that it exits. To make sure it exited by mutual
agreement with server, probably run pg_recvlogival at a higher debug level
and text search for a message you print from pg_recvlogical when it gets
server CopyDone in the response to client CopyDone. I don't think a
different pg_recvlogical numeric exit code could be justified for this.

It sounds like more work than I think it would actually be.

-- 
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [COMMITTERS] pgsql: Fix VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL

2016-09-08 Thread Simon Riggs

On 7 September 2016 at 14:58, Tom Lane  wrote:

>> That may not be perceived as a "fix" by everybody, so we should not do
>> it without an explicit agreement by many.
>
> Agreed, but I vote with Fujii-san for back-patching.

No problem with backpatching, just wanted some +1s before I did it.

Will do that tomorrow.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in two-phase transaction recovery

2016-09-08 Thread Simon Riggs

On 8 September 2016 at 07:43, Michael Paquier  wrote:
> On Wed, Sep 7, 2016 at 10:48 PM, Stas Kelvich  
> wrote:
>> Some time ago two-phase state file format was changed to have variable size 
>> GID,
>> but several places that read that files were not updated to use new offsets. 
>> Problem
>> exists in master and 9.6 and can be reproduced on prepared transactions with
>> savepoints.
>
> Oops and meh. This meritates an open item, and has better be fixed by
> 9.6.0. I am glad you noticed that honestly. And we had better take
> care of this issue as soon as possible.

Looking now.

>> Also while looking at StandbyRecoverPreparedTransactions() i’ve noticed that 
>> buffer
>> for 2pc file is allocated in TopMemoryContext but never freed. That probably 
>> exists
>> for a long time.
>
> @@ -1886,6 +1886,8 @@ StandbyRecoverPreparedTransactions(bool overwriteOK)
> Assert(TransactionIdFollows(subxid, xid));
> SubTransSetParent(xid, subxid, overwriteOK);
> }
> +
> +   pfree(buf);
> }
> This one is a good catch. I have checked also the other callers of
> ReadTwoPhaseFile but I am not seeing any other leak. That's a leak,
> not critical though so applying it only on HEAD would be enough IMO.

Far from critical, but backpatched to 9.6 because it isn't just
executed once at startup, it is executed every shutdown checkpoint.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-08 Thread Emre Hasegeli

> Given that you are now familiar with the internals of how enums are
> implemented would it be possible to continue the work and add the "ALTER
> TYPE  DROP VALUE " command?

I was thinking to try developing it, but I would be more than happy to
help by testing and reviewing, if someone else would do.  I don't
think I have enough experience to think of all details of this
feature.

The main problem that has been discussed before was the indexes.  One
way is to tackle with it is to reindex all the tables after the
operation.  Currently we are doing it when the datatype of indexed
columns change.  So it should be possible, but very expensive.

Another way, Thomas Munro suggested when we were talking about it,
would be to add another column to mark the deleted rows to the catalog
table.  We can check for this column before allowing the value to be
used.  Indexes can continue using the deleted values.  We can also
re-use those entries when someone wants to add new value to the
matching place.  It should be safe as long as we don't update the sort
order.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

On Wed, Sep 7, 2016 at 2:39 AM, Robert Haas  wrote:
> On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire  
> wrote:
>>> The problem with this is that we allocate the entire amount of
>>> maintenance_work_mem even when the number of actual dead tuples turns
>>> out to be very small.  That's not so bad if the amount of memory we're
>>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
>>> to remove the 1 GB limit, because somebody might have
>>> maintenance_work_mem set to tens or hundreds of gigabytes to speed
>>> index creation, and allocating that much space for a VACUUM that
>>> encounters 1 dead tuple does not seem like a good plan.
>>>
>>> What I think we need to do is make some provision to initially
>>> allocate only a small amount of memory and then grow the allocation
>>> later if needed.  For example, instead of having
>>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>>> ItemPointer * and allocate the array progressively in segments.  I'd
>>> actually argue that the segment size should be substantially smaller
>>> than 1 GB, like say 64MB; there are still some people running systems
>>> which are small enough that allocating 1 GB when we may need only 6
>>> bytes can drive the system into OOM.
>>
>> This would however incur the cost of having to copy the whole GB-sized
>> chunk every time it's expanded. It woudln't be cheap.
>
> No, I don't want to end up copying the whole array; that's what I
> meant by allocating it progressively in segments.  Something like what
> you go on to propose.
>
>> I've monitored the vacuum as it runs and the OS doesn't map the whole
>> block unless it's touched, which it isn't until dead tuples are found.
>> Surely, if overcommit is disabled (as it should), it could exhaust the
>> virtual address space if set very high, but it wouldn't really use the
>> memory unless it's needed, it would merely reserve it.
>
> Yeah, but I've seen actual breakage from exactly this issue on
> customer systems even with the 1GB limit, and when we start allowing
> 100GB it's going to get a whole lot worse.
>
>> To fix that, rather than repalloc the whole thing, dead_tuples would
>> have to be an ItemPointer** of sorted chunks. That'd be a
>> significantly more complex patch, but at least it wouldn't incur the
>> memcpy.
>
> Right, this is what I had in mind.  I don't think this is actually
> very complicated, because the way we use this array is really simple.
> We basically just keep appending to the array until we run out of
> space, and that's not very hard to implement with an array-of-arrays.
> The chunks are, in some sense, sorted, as you say, but you don't need
> to do qsort() or anything like that.  You're just replacing a single
> flat array with a data structure that can be grown incrementally in
> fixed-size chunks.
>

If we replaced dead_tuples with an array-of-array, isn't there
negative performance impact for lazy_tid_reap()?
As chunk is added, that performance would be decrease.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] autonomous transactions

On 8 Sep. 2016 1:38 pm, "Andres Freund"  wrote:
>
> I kind of dislike this approach for a different reason than already
> mentioned in this thread: Namely it's not re-usable for implementing
> streaming logical replication of large transaction, i.e. allow to decode
> & apply un-committed changes.  The issue there is that one really needs
> to have multiple transactions active in the same connection, which this
> approach does not allow.

I've been thinking about this recently with an eye to handling the majority
of transactions on a typical system that neither perform DDL nor are
concurrent with it.

The following might be fairly nonsensical if I've misunderstood the problem
as I've not had a chance to more than glance at it, but:

I wonder if some kind of speculative concurrent decoding+output is possible
without being able to have multiple xacts on a backend. We could use a
shared xact and snapshot for all concurrent xacts. If any of them do
anything that requires invalidations etc we dump the speculatively
processed ones and fall back to reorder buffering and serial output at
commit time until we pass the xact that caused an invalidation and don't
have more pending. Add a callback to the output plugin to tell it that
speculative decoding of an xact has been aborted.

Or maybe even just put that speculstive decoding on hold and pick up where
we left off partway through the reorder buffer when we get around to
processing it serially. That way we don't have to discard work already
done. More usefully we could avoid having to enqueue stuff into the reorder
buffer just in case we have to switch to fallback serial decoding.

Failing that:

Since logical decoding only needs read only xacts that roll back, it only
needs multiple backend private virtualxacts. They don't need
SERIALIZABLE/SSI which I think (?) means other backends don't need to be
able to wait on them. That seems simpler than what autonomous xacts would
need since there we need them exposed in PgXact, waitable-on for txid
locks, etc. Right? In the same way that historical snapshots are cut-down
maybe logical decoding xacts can be too.

I suspect that waiting for full in-process multiple xact support to do
interleaved decoding/replay will mean a long wait indeed.

Re: [HACKERS] Quorum commit for multiple synchronous replication.

On Wed, Sep 7, 2016 at 4:03 AM, Josh Berkus  wrote:
> On 08/29/2016 06:52 AM, Fujii Masao wrote:
>> Also I like the following Simon's idea.
>>
>> https://www.postgresql.org/message-id/canp8+jlhfbvv_pw6grasnupw+bdk5dctu7gwpnap-+-zwvk...@mail.gmail.com
>> ---
>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>> responses from k out of N standbys. “any k” would be faster, so is
>> desirable for performance and resilience
>
> What are we going to do for backwards compatibility, here?
>
> So, here's the dilemma:
>
> If we want to keep backwards compatibility with 9.6, then:
>
> "k (n1, n2, n3)" == "first k (n1, n2, n3)"
>
> However, "first k" is not what most users will want, most of the time;
> users of version 13, years from now, will be getting constantly confused
> by "first k" behavior when they wanted quorum.  So the sensible default
> would be:
>
> "k (n1, n2, n3)" == "any k (n1, n2, n3)"
>

+1.

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility but most users would think "k(n1, n2, n3)" as quorum
after introduced quorum.
I wish we can change the s_s_names syntax of 9.6 to "first k(n1, n2,
n3)" style before 9.6 releasing if we got consensus.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Quorum commit for multiple synchronous replication.

On Wed, Sep 7, 2016 at 12:47 AM, Masahiko Sawada  wrote:
> On Tue, Sep 6, 2016 at 11:08 PM, Simon Riggs  wrote:
>> On 29 August 2016 at 14:52, Fujii Masao  wrote:
>>> On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek  wrote:
 On 04/08/16 06:40, Masahiko Sawada wrote:
>
> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>  wrote:
>>
>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada 
>> wrote:
>>>
>>> I was thinking that the syntax for quorum method would use '[ ... ]'
>>> but it will be confused with '( ... )' priority method used.
>>> 001 patch adds 'Any N ( ... )' style syntax but I know that we still
>>> might need to discuss about better syntax, discussion is very welcome.
>>> Attached draft patch, please give me feedback.
>>
>>
>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>> for the addition of a keyword in front of the integer defining for how
>> many nodes server need to wait for.
>
>
> Thank you for reply.
> "{}" or "[]" are not bad but because these are not intuitive, I
> thought that it will be hard for uses to use different method for each
> purpose.
>

 I think the "any" keyword is more explicit and understandable, also closer
 to SQL. So I would be in favor of doing that.
>>>
>>> +1
>>>
>>> Also I like the following Simon's idea.
>>>
>>> https://www.postgresql.org/message-id/canp8+jlhfbvv_pw6grasnupw+bdk5dctu7gwpnap-+-zwvk...@mail.gmail.com
>>> ---
>>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>>> responses from k out of N standbys. “any k” would be faster, so is
>>> desirable for performance and resilience
>>> ---
>>
>> +1
>>
>> "synchronous_method" -> "synchronization_method"
>
> Thanks, will fix.
>
>> I'm concerned about the performance of this code. Can we work out a
>> way of measuring it, so we can judge how successful we are at
>> releasing waiters quickly? Thanks
>
> I will measure the performance effect of this code.
> I'm expecting that performances are,
> 'first 1 (n1, n2)'  > 'any 1(n1, n2)' > 'first 2(n1, n2)'
> 'first 1 (n1, n2)' will be highest throughput.
>

Sorry, that's wrong.
'any 1(n1, n2)' will be highest throughput or same as 'first 1(n1, n2)'.

>> For 9.6 we implemented something that allows the DBA to define how
>> slow programs are. Previously, since 9.1 this was something specified
>> on the application side. I would like to put it back that way, so we
>> end up with a parameter on client e.g. commit_quorum = k. Forget the
>> exact parameters/user API for now, but I'd like to allow the code to
>> work with user defined settings. Thanks.
>
> I see. The parameter on client should effect for priority method as well.
> And similar to synchronous_commit, the client can specify the how much
> standbys the master waits to commit for according to synchronization
> method, even if s_s_names is defined.
>
> Regards,
>
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in two-phase transaction recovery

2016-09-08 Thread Pavan Deolasee

On Thu, Sep 8, 2016 at 12:13 PM, Michael Paquier 
wrote:

> On Wed, Sep 7, 2016 at 10:48 PM, Stas Kelvich 
> wrote:
> > Some time ago two-phase state file format was changed to have variable
> size GID,
> > but several places that read that files were not updated to use new
> offsets. Problem
> > exists in master and 9.6 and can be reproduced on prepared transactions
> with
> > savepoints.
>
> Oops and meh. This meritates an open item, and has better be fixed by
> 9.6.0. I am glad you noticed that honestly. And we had better take
> care of this issue as soon as possible.
>
>
Indeed, it's a bug. Thanks Stas for tracking it down and Michael for the
review and checking other places. I also looked at the patch and it seems
fine to me. FWIW I looked at all other places where TwoPhaseFileHeader is
referred and they look safe to me.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2016-09-08 Thread Dilip Kumar

On Thu, Sep 8, 2016 at 2:23 AM, Tom Lane  wrote:
> I really don't like this solution.  Changing this one function, out of the
> dozens just like it in lsyscache.c, seems utterly random and arbitrary.
>
> I'm actually not especially convinced that passing domain_in a random
> OID needs to not produce an XX000 error.  That isn't a user-facing
> function and people shouldn't be calling it by hand at all.  The same
> goes for record_in.  But if we do want those cases to avoid this,
> I think it's appropriate to fix it in those functions, not decree
> that essentially-random other parts of the system should bear the
> responsibility.  (Thought experiment: if we changed the order of
> operations in domain_in so that getBaseTypeAndTypmod were no longer
> the first place to fail, would we undo this change and then change
> wherever the failure had moved to?  That's pretty messy.)
>
> In the case of record_in, it seems like it'd be easy enough to use
> lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc()
> and then throw a user-facing error right there.

Actually when we are passing invalid type to "record_in", error is
thrown in "lookup_type_cache" function, and this function is not
taking "no_error" input (it's only limited to
lookup_rowtype_tupdesc_internal).

One option is to pass "no_error" parameter to "lookup_type_cache"
function also, and throw error only in record_in function.
But problem is, "lookup_type_cache" has lot of references. And also
will it be good idea to throw one generic error instead of multiple
specific errors. ?

Another option is to do same for "record_in" also what you have
suggested for domain_in (an extra syscache lookup). ?

>Not sure about a
> nice solution for domain_in.  We might have to resort to an extra
> syscache lookup there, but even if we did, it should happen only
> once per query so that's not very awful.
>
>> 3. CheckFunctionValidatorAccess: This is being called from all
>> language validator functions.
>
> This part seems reasonable, since the validator functions are documented
> as something users might call, and CheckFunctionValidatorAccess seems
> like an apropos place to handle it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Heikki Linnakangas


On 09/08/2016 03:36 AM, Peter Geoghegan wrote:

On Wed, Sep 7, 2016 at 2:42 PM, Tom Lane  wrote:

The reason it's called siftup is that that's what Knuth calls it.
See Algorithm 5.2.3H (Heapsort), pp 146-147 in the first edition of
Volume 3; tuplesort_heap_siftup corresponds directly to steps H3-H8.


I see that Knuth explains that these steps form the siftup procedure.
What steps does tuplesort_heap_insert correspond to, if any?


Hmm. The loop in tuplesort_heap_siftup() indeed looks the same as 
Knuth's steps H3-H8. But the start is different. Knuth's algorithm 
begins with the situation that the top node of the sub-heap is in wrong 
place, and it pushes it downwards, until the whole sub-heap satisfies 
the heap condition again. But tuplesort_heap_siftup() begins by moving 
the last value in the heap to the top, and then it does the H3-H8 steps 
to move it to the right place.


Using Wikipedia's terms, tuplesort_heap_siftup() function as whole is 
the Extract operation. And the loop inside it corresponds to the 
Max-Heapify operation, which is the same as Knuth's "siftup".


I still think tuplesort_heap_siftup is a confusing name, although I'm 
not sure that Peter's "compact" is much better. I suggest that we rename 
it to tuplesort_heap_delete_top(). In comments within the function, 
explain that the *loop* corresponds to the "siftup" in Knuth's book.


Interestingly, as Knuth explains the siftup algorithm, it performs a 
"replace the top" operation, rather than "remove the top". But we use it 
to remove the top, by moving the last element to the top and running the 
algorithm after that. Peter's other patch, to change the way we 
currently replace the top node, to do it in one pass rather than as 
delete+insert, is actually Knuth's siftup algorithm.


Peter, looking at your "displace" patch in this light, I think 
tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm 
calling it now), should share a common subroutine. Displace replaces the 
top node with the new node passed as argument, and then calls the 
subroutine to push it down to the right place. Delete_top replaces the 
top node with the last value in the heap, and then calls the subroutine. 
Or perhaps delete_top should remove the last value from the heap, and 
then call displace() with it, to re-insert it.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Supporting SJIS as a database encoding

2016-09-08 Thread Tsunakawa, Takayuki

From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kyotaro
> HORIGUCHI
> 
> $ time psql postgres -c 'select t.a from t, generate_series(0, )' >
> /dev/null
> 
> real  0m22.696s
> user  0m16.991s
> sys   0m0.182s>
> 
> Using binsearch the result for the same operation was
> real  0m35.296s
> user  0m17.166s
> sys   0m0.216s
> 
> Returning in UTF-8 bloats the result string by about 1.5 times so it doesn't
> seem to make sense comparing with it. But it takes real = 47.35s.

Cool, 36% speedup!  Does this difference vary depending on the actual 
characters used, e.g. the speedup would be greater if most of the characters 
are ASCII?

Regards
Takayuki Tsunakawa




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL consistency check facility

On Wed, Sep 7, 2016 at 7:22 PM, Kuntal Ghosh  wrote:
> Hello,

Could you avoid top-posting please? More reference here:
http://www.idallen.com/topposting.html

> - If WAL consistency check is enabled for a rmgrID, we always include
> the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

> - I've extended the RmgrTable with a new function pointer
> rm_checkConsistency, which is called after rm_redo. (only when WAL
> consistency check is enabled for this rmgrID)
> - In each rm_checkConsistency, both backup pages and buffer pages are
> masked accordingly before any comparison.

This leads to heavy code duplication...

> - In postgresql.conf, a new guc variable named 'wal_consistency' is
> added. Default value of this variable is 'None'. Valid values are
> combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
> BRIN, Generic and XLOG. It can also be set to 'All' to enable all the
> values.

Lower-case is the usual policy for parameter values for GUC parameters.

> - In recovery tests (src/test/recovery/t), I've added wal_consistency
> parameter in the existing scripts. This feature doesn't change the
> expected output. If there is any inconsistency, it can be verified in
> corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

> Results
> 
>
> I've tested with installcheck and installcheck-world in master-standby
> set-up. Followings are the configuration parameters.

So you tested as well the recovery tests, right?

> I got two types of inconsistencies as following:
>
> 1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
> page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
> redo, only BTP_DELETED flag is set in buffer page. I assume that we
> should clear all btpo_flags before setting BTP_DELETED in
> _bt_unlink_halfdead_page().

The page is deleted, it does not matter, so you could just mask all
the flags for a deleted page...
[...]
+   /*
+* Mask everything on a DELETED page.
+*/
+   if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags
& BTP_DELETED)
And that's what is happening.

> 2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
> REVMAP page. This happens only for two cases. I'm not sure what the
> reason can be.

Hm? This smells like a block reference bug. What are the cases you are
referring to?

> I haven't done sufficient tests yet to measure the overhead of this
> modification. I'll do that next.

I did a first pass on your patch, and I think that things could be
really reduced. There is much code duplication, but see below for the
details..

 #include "access/xlogutils.h"
-
+#include "storage/bufmask.h"
I know that I am a noisy one on the matter, but please double-check
for such useless noise in your patch. And there is not only one.

+   newwalconsistency = (bool *) guc_malloc(ERROR,(RM_MAX_ID + 1)*sizeof(bool));
This spacing is not project-style. You may want to go through that:
https://www.postgresql.org/docs/devel/static/source.html

+$node_master->append_conf(
+   'postgresql.conf', qq(
+wal_consistency = 'All'
+));
Instead of duplicating that 7 times, you could just do it once in the
init() method of PostgresNode.pm. This really has meaning if enabled
by default.

+   /*
+* Followings are the rmids which can have backup blocks.
+* We'll enable this feature only for these rmids.
+*/
+   newwalconsistency[RM_HEAP2_ID] = true;
+   newwalconsistency[RM_HEAP_ID] = true;
+   newwalconsistency[RM_BTREE_ID] = true;
+   newwalconsistency[RM_HASH_ID] = true;
+   newwalconsistency[RM_GIN_ID] = true;
+   newwalconsistency[RM_GIST_ID] = true;
+   newwalconsistency[RM_SEQ_ID] = true;
+   newwalconsistency[RM_SPGIST_ID] = true;
+   newwalconsistency[RM_BRIN_ID] = true;
+   newwalconsistency[RM_GENERIC_ID] = true;
+   newwalconsistency[RM_XLOG_ID] = true;
Here you can just use MemSet with RM_MAX_ID and simplify this code maintenance.

+   for(i = 0; i < RM_MAX_ID + 1 ; i++)
+   newwalconsistency[i] = false;
+   break;
Same here you can just use MemSet.

+   else if (pg_strcasecmp(tok, "NONE") == 0)
[...]
+   else if (pg_strcasecmp(tok, "ALL") == 0)
It seems to me that using NONE or ALL with any other keywords should
not be allowed.

+   if (pg_strcasecmp(tok, "Heap2") == 0)
+   {
+   newwalconsistency[RM_HEAP2_ID]

Re: [HACKERS] Parallel build with MSVC

2016-09-08 Thread Christian Ullrich


* Noah Misch wrote:


Committed.


Much apologizings for coming in late again, but I just realized it would 
be better if the user-controlled flags came after all predefined options 
the user might want to override. Right now that is only /verbosity in 
both build and clean operations.


Patch attached, also reordering the ecpg-clean command line in clean.bat 
to match the others that have the project file first.


--
Christian


>From 09a5f3945e2b87b56ca3525a56db970e9ecf8ffd Mon Sep 17 00:00:00 2001
From: Christian Ullrich 
Date: Thu, 8 Sep 2016 08:34:45 +0200
Subject: [PATCH] Let MSBFLAGS be used to override predefined options.

---
 src/tools/msvc/build.pl  | 4 ++--
 src/tools/msvc/clean.bat | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/tools/msvc/build.pl b/src/tools/msvc/build.pl
index 5273977..a5469cd 100644
--- a/src/tools/msvc/build.pl
+++ b/src/tools/msvc/build.pl
@@ -54,7 +54,7 @@ elsif (uc($ARGV[0]) ne "RELEASE")
 if ($buildwhat and $vcver >= 10.00)
 {
system(
-   "msbuild $buildwhat.vcxproj $msbflags /verbosity:normal 
/p:Configuration=$bconf"
+   "msbuild $buildwhat.vcxproj /verbosity:normal $msbflags 
/p:Configuration=$bconf"
);
 }
 elsif ($buildwhat)
@@ -63,7 +63,7 @@ elsif ($buildwhat)
 }
 else
 {
-   system("msbuild pgsql.sln $msbflags /verbosity:normal 
/p:Configuration=$bconf");
+   system("msbuild pgsql.sln /verbosity:normal $msbflags 
/p:Configuration=$bconf");
 }
 
 # report status
diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat
index e21e37f..b972ddf 100755
--- a/src/tools/msvc/clean.bat
+++ b/src/tools/msvc/clean.bat
@@ -107,6 +107,6 @@ REM for /r %%f in (*.sql) do if exist %%f.in del %%f
 cd %D%
 
 REM Clean up ecpg regression test files
-msbuild %MSBFLAGS% /NoLogo ecpg_regression.proj /t:clean /v:q
+msbuild ecpg_regression.proj /NoLogo /v:q %MSBFLAGS% /t:clean 
 
 goto :eof
-- 
2.10.0.windows.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-08 Thread Heikki Linnakangas


On 09/08/2016 09:39 AM, Михаил Бахтерев wrote:

Excuse me for intervention.

It depends. For instance, i run PostgreSQL on the modern MIPS CPU, which
does not have sqrt support.

But you are right, it is supported in most cases. And if execution speed
of this very fuction is of concern, sqrtf(x) should be used instead of
sqrt(x).

Despite this, Andrew's solution gives more accurate representation of
values. And as far as i understand, this improves overall performance by
decreasing the overall amount of instructions, which must be executed.


BTW, I would be OK with the bit-twiddling hack, if we had an autoconf 
check for IEEE 754 floats, and a graceful fallback for other systems. 
The fallback could be simply the current penalty function. You wouldn't 
get the benefit from the better penalty function on non-IEEE systems, 
then, but it would still be correct.



It is possible to speed up Andrew's implementation and get rid of
warnings by using bit-masks and unions. Something like:

  union {
float f;
struct {
  unsigned int mantissa:23, exponent:8, sign:1;
} bits;
  }

I am sorry, i have no time to check this. But it is common wisdom to
avoid pointer-based memory accesses in high-performance code, as they
create a lot of false write-to-read dependencies.


The compiler should be smart enough to generate the same instructions 
either way. A union might be more readable, though. (We don't need to 
extract the mantissa, exponent and sign, so a union of float and int32 
would do.)


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Sample configuration files

2016-09-08 Thread Vik Fearing

On 08/29/2016 03:34 AM, Vik Fearing wrote:
> We have sample configuration files for postgresql.conf and
> recovery.conf, but we do not have them for contrib modules.  This patch
> attempts to add them.
> 
> Although the patch puts the sample configuration files in the tree, it
> doesn't try to install them.  That's partly because I think it would
> need an extension version bump and that doesn't seem worth it.
> 
> Also, while writing this patch, it crossed my mind that the plpgsql
> variables should probably be in the main postgresql.conf file because we
> install it by default.  I will happily write that patch if desired,
> independently of this patch.
> 
> Current as of 26fa446, added to next commitfest.

I noticed that this patch has been marked Waiting on Author with no
comment.  Peter, what more should I be doing right now while waiting for
Martín's review?
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in two-phase transaction recovery

On Wed, Sep 7, 2016 at 10:48 PM, Stas Kelvich  wrote:
> Some time ago two-phase state file format was changed to have variable size 
> GID,
> but several places that read that files were not updated to use new offsets. 
> Problem
> exists in master and 9.6 and can be reproduced on prepared transactions with
> savepoints.

Oops and meh. This meritates an open item, and has better be fixed by
9.6.0. I am glad you noticed that honestly. And we had better take
care of this issue as soon as possible.

> For example:
>
> create table t(id int);
> begin;
> insert into t values (42);
> savepoint s1;
> insert into t values (43);
> prepare transaction 'x';
> begin;
> insert into t values (142);
> savepoint s1;
> insert into t values (143);
> prepare transaction 'y’;
>
> and restart the server. Startup process will hang. Fix attached.

In crash recovery this is very likely to fail with an assertion if
those are enabled:
TRAP: FailedAssertion("!(TransactionIdFollows(subxid, xid))", File:
"twophase.c", Line: 1767)
And the culprit is e0694cf9, which makes this open item owned by Simon.

I also had a look at the patch you are proposing, and I think that's a
correct fix. I also looked at the other code paths scanning the 2PC
state file and did not notice extra problems. The good news is that
the 2PC file generation is not impacted, only its scan at recovery is,
so the impact of the bug is limited for existing 9.6 deployments where
both 2PC and subtransactions are involved.

> Also while looking at StandbyRecoverPreparedTransactions() i’ve noticed that 
> buffer
> for 2pc file is allocated in TopMemoryContext but never freed. That probably 
> exists
> for a long time.

@@ -1886,6 +1886,8 @@ StandbyRecoverPreparedTransactions(bool overwriteOK)
Assert(TransactionIdFollows(subxid, xid));
SubTransSetParent(xid, subxid, overwriteOK);
}
+
+   pfree(buf);
}
This one is a good catch. I have checked also the other callers of
ReadTwoPhaseFile but I am not seeing any other leak. That's a leak,
not critical though so applying it only on HEAD would be enough IMO.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel build with MSVC

On Thu, Sep 8, 2016 at 3:54 PM, Christian Ullrich  wrote:
> Much apologizings for coming in late again, but I just realized it would be
> better if the user-controlled flags came after all predefined options the
> user might want to override. Right now that is only /verbosity in both build
> and clean operations.
>
> Patch attached, also reordering the ecpg-clean command line in clean.bat to
> match the others that have the project file first.

-"msbuild $buildwhat.vcxproj $msbflags /verbosity:normal
/p:Configuration=$bconf"
+"msbuild $buildwhat.vcxproj /verbosity:normal $msbflags
/p:Configuration=$bconf"

Why not... If people are willing to switch to /verbosity:detailed and
double the amount of build time...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Optimization for lazy_scan_heap

On Wed, Sep 7, 2016 at 4:11 PM, Simon Riggs  wrote:
> On 7 September 2016 at 04:13, Masahiko Sawada  wrote:
>
>> Since current HEAD could scan visibility map twice, the execution time
>> of Patched is approximately half of HEAD.
>
> Sounds good.
>
> To ensure we are doing exactly same amount of work as before, did you
> see the output of VACUUM VEROBOSE?

Sorry, the previous test result I posted was something wrong.
I rerun the performance test and results are,

* 1TB Table(visibility map size is 32MB)
HEAD : 4853.250 ms (00:04.853)
Patched : 3805.789 ms (00:03.806)

* 8TB Table(visibility map size is 257MB)
HEAD : 37853.891 ms (00:37.854)
Patched : 30153.943 ms (00:30.154)

* 32TB Table(visibility map size is 1GB)
HEAD: 151908.344 ms (02:31.908)
Patched: 120560.037 ms (02:00.560)

Since visibility map page can be cached onto shared buffer or OS cache
by first scanning, the benefit of this patch seems not to be large.

Here are outputs of VACUUM VERBOSE for 32TB table.

* HEAD
INFO:  vacuuming "public.vm_skip_test"
INFO:  "vm_skip_test": found 0 removable, 0 nonremovable row versions
in 0 out of 4294967294 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
Skipped 4294967294 all-frozen pages according to visibility map.
0 pages are entirely empty.
CPU 1.06s/148.11u sec elapsed 149.20 sec.
VACUUM
Time: 151908.344 ms (02:31.908)

* Patched
INFO:  vacuuming "public.vm_skip_test"
INFO:  "vm_skip_test": found 0 removable, 0 nonremovable row versions
in 0 out of 4294967294 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
Skipped 4294967294 all-frozen pages according to visibility map.
0 pages are entirely empty.
CPU 0.65s/117.15u sec elapsed 117.81 sec.
VACUUM
Time: 120560.037 ms (02:00.560)

Current manual vacuum doesn't output how may all_frozen pages we
skipped according to visibility map.
That's why I attached 0001 patch which makes the manual vacuum emit
such information.

>
> Can we produce a test that verifies the result patched/unpatched?
>

Attached test shell script but because I don't have such a large disk,
I've measured performance benefit using by something like unofficial
way.

To make a situation where table is extremly large and make
corresponding visibility map, I applied 0002 patch and made a fake
visibility map.
Attached 0002 patch adds GUC parameter cheat_vacuum_table_size which
artificially defines table size being vacuumed .
For example, If we do,
  SET cheat_vacuum_table_size = 4;
  VACUUM vm_test;
then in lazy_scan_heap, vm_test table is processed as an
8TB(MaxBlockNumber / 4) table.

Attached test shell script makes fake visibility map files and
executes the performance tests for 1TB, 8TB and 32TB table.
Please confirm it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From bbda8e4144101a41804865d88592a15bf0150340 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Thu, 8 Sep 2016 11:24:21 -0700
Subject: [PATCH 1/2] lazy_scan_heap outputs how many all_frozen pages are
 skipped.

---
 src/backend/commands/vacuumlazy.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..20d8334 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1331,6 +1331,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	"Skipped %u pages due to buffer pins.\n",
 	vacrelstats->pinskipped_pages),
 	 vacrelstats->pinskipped_pages);
+	appendStringInfo(, ngettext("Skipped %u all-frozen page according to visibility map.\n",
+	"Skipped %u all-frozen pages according to visibility map.\n",
+	vacrelstats->frozenskipped_pages),
+	 vacrelstats->frozenskipped_pages);
 	appendStringInfo(, ngettext("%u page is entirely empty.\n",
 	"%u pages are entirely empty.\n",
 	empty_pages),
-- 
2.8.1

From 01e493a44ba6c5bb9b59a1016bfce4b9bcf75e95 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Thu, 8 Sep 2016 11:36:28 -0700
Subject: [PATCH 2/2] Add cheat_vacuum_table_size.

---
 contrib/pg_visibility/pg_visibility.c |  6 +-
 src/backend/commands/vacuumlazy.c |  9 +++--
 src/backend/utils/misc/guc.c  | 10 ++
 src/include/commands/vacuum.h |  1 +
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 7034066..37ace99 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -12,6 +12,7 @@
 #include "access/visibilitymap.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage_xlog.h"
+#include "commands/vacuum.h"
 #include "funcapi.h"
 #include

Re: [HACKERS] Supporting SJIS as a database encoding

2016-09-08 Thread Kyotaro HORIGUCHI

Hello,

At Wed, 07 Sep 2016 16:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160907.161304.112519789.horiguchi.kyot...@lab.ntt.co.jp>
> > Implementing radix tree code, then redefining the format of mapping table
> > > to suppot radix tree, then modifying mapping generator script are needed.
> > > 
> > > If no one oppse to this, I'll do that.

So, I did that as a PoC. The radix tree takes a little less than
100k bytes (far smaller than expected:) and it is defnitely
faster than binsearch.


The attached patch does the following things.

- Defines a struct for static radix tree
  (utf_radix_tree). Currently it supports up to 3-byte encodings.

- Adds a map generator script UCS_to_SJIS_radix.pl, which
  generates utf8_to_sjis_radix.map from utf8_to_sjis.map.

- Adds a new conversion function utf8_to_sjis_radix.

- Modifies UtfToLocal so as to allow map to be NULL.

- Modifies utf8_to_sjis to use the new conversion function
  instead of ULmapSJIS.


The followings are to be done.

- utf8_to_sjis_radix could be more generic.

- SJIS->UTF8 is not implemented but it would be easily done since
  there's no difference in using the radix tree mechanism.
  (but the output character is currently assumed to be 2-byte long)

- It doesn't support 4-byte codes so this is not applicable to
  sjis_2004. Extending the radix tree to support 4-byte wouldn't
  be hard.


The following is the result of a simple test.

=# create table t (a text); alter table t alter column a storage plain;
=# insert into t values ('... 7130 cahracters containing (I believe) all 
characters in SJIS encoding');
=# insert into t values ('... 7130 cahracters containing (I believe) all 
characters in SJIS encoding');

# Doing that twice is just my mistake.

$ export PGCLIENTENCODING=SJIS

$ time psql postgres -c '
$ psql -c '\encoding' postgres
SJIS


$ time psql postgres -c 'select t.a from t, generate_series(0, )' > 
/dev/null

real0m22.696s
user0m16.991s
sys 0m0.182s>

Using binsearch the result for the same operation was 
real0m35.296s
user0m17.166s
sys 0m0.216s

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


0001-Use-radix-tree-to-encoding-characters-of-Shift-JIS.patch.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Suggestions for first contribution?

2016-09-08 Thread Noah Misch

On Wed, Sep 07, 2016 at 05:14:44PM +0300, Aleksander Alekseev wrote:
> Here is another idea for a contribution - refactoring.

Refactoring is not a good early contribution.  Refactoring is more likely to
succeed once you internalize community code practices through much study of
functional patches.  However, even committer-initiated refactoring has a high
failure rate.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH] Alter or rename enum value

2016-09-08 Thread Matthias Kurz

On 7 September 2016 at 22:14, Tom Lane  wrote:

> ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> > Here is version 6 of the patch, which just adds RENAME VALUE with no IF
> > [NOT] EXISTS, rebased onto current master (particularly the
> > transactional ADD VALUE patch).
>
> Pushed with some adjustments.  The only thing that wasn't a matter of
> taste is you forgot to update the backend/nodes/ support functions
> for struct AlterEnumStmt.
>
> regards, tom lane
>

Thank you all for committing this feature!

Given that you are now familiar with the internals of how enums are
implemented would it be possible to continue the work and add the "ALTER
TYPE  DROP VALUE " command?

Thank you!
Regards, Matthias

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-08 Thread Andrew Borodin

>BTW, would someone explain to me why using a float here will not fail 
>catastrophically for inputs outside the range of float?

Indeed, it will. And that's one of the reason I'm considering
advancing GiST API instead of advancing extensions.

It won't crash, but will produce terrible index, not suitable of
performing efficient queries.
GiST treats penalty this way:
/* disallow negative or NaN penalty */
if (isnan(penalty) || penalty < 0.0)
penalty = 0.0;

Any NaN is a "perfect fit".

Calculation of real (say float) penalty and choice of subtree with
smallest penalty is an idea from 92. Modern spatial index requires
more than just a float to compare subtrees.

Best regards, Andrey Borodin, Octonica & Ural Federal University.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

Michael Paquier  writes:
> But I don't understand the reason behind such a restriction to be
> honest because libxml2 does not depend on libxslt. The contrary is
> true: libxslt needs libxml2.

Right.

> Note as well that libxml2 does depend on ICONV though.

Hm, is that true either?  I don't see any sign of such a dependency
on Linux:

$ ldd /usr/lib64/libxml2.so.2.7.6
linux-vdso.so.1 =>  (0x7fff98f6b000)
libdl.so.2 => /lib64/libdl.so.2 (0x00377a60)
libz.so.1 => /lib64/libz.so.1 (0x00377ae0)
libm.so.6 => /lib64/libm.so.6 (0x003779e0)
libc.so.6 => /lib64/libc.so.6 (0x003779a0)
/lib64/ld-linux-x86-64.so.2 (0x00377960)


regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Suggestions for first contribution?

2016-09-08 Thread Christian Convey

On Wed, Sep 7, 2016 at 10:44 AM, Stas Kelvich  wrote:
> There is also a list of projects for google summer of code: 
> https://wiki.postgresql.org/wiki/GSoC_2016
>
> That topics expected to be doable by a newcomer during several months. It is 
> also slightly
> outdated, but you always can check current state of things by searching this 
> mail list on interesting topic.
>
> Also postgres have several internal API’s like fdw[1] or gist[2] that allows 
> you to implements something
> useful without digging too much into a postgres internals.
>
> [1] https://www.postgresql.org/docs/current/static/fdwhandler.html
> [2] https://www.postgresql.org/docs/current/static/gist-extensibility.html

Thanks Stas, I'll have a look at those.  I appreciate the links.

- Christian


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] \timing interval

2016-09-08 Thread Pavel Stehule

2016-09-08 13:10 GMT+02:00 Craig Ringer :

> On 4 Sep. 2016 3:36 am, "Tom Lane"  wrote:
> >
>
> > After further thought I concluded that not providing any labeling of
> > days is a bad idea.
>
> Yeah. I think labeling days is definitely good. I'm glad you changed that.
>
> Personally I'd like to trim milliseconds when dealing with minute+ long
> runs and seconds from hour+ runs too, since it's all there in the ms output
> and the units output is for human readability. I see the value of retaining
> full precision too, though, and don't feel strongly about it.
>

It should be hard to read without units. I know so now it is maybe late,
but the current output is not too readable for me. I miss units - 10s,
1m20.5s, 1h 30m 5s

Regards

Pavel

Re: [HACKERS] GiST penalty functions [PoC]

2016-09-08 Thread Михаил Бахтерев

Excuse me for intervention.

It depends. For instance, i run PostgreSQL on the modern MIPS CPU, which
does not have sqrt support.

But you are right, it is supported in most cases. And if execution speed
of this very fuction is of concern, sqrtf(x) should be used instead of
sqrt(x).

Despite this, Andrew's solution gives more accurate representation of
values. And as far as i understand, this improves overall performance by
decreasing the overall amount of instructions, which must be executed.

It is possible to speed up Andrew's implementation and get rid of
warnings by using bit-masks and unions. Something like:

  union {
float f;
struct {
  unsigned int mantissa:23, exponent:8, sign:1;
} bits;
  }

I am sorry, i have no time to check this. But it is common wisdom to
avoid pointer-based memory accesses in high-performance code, as they
create a lot of false write-to-read dependencies.

- Mikhail, respectfully

On Wed, Sep 07, 2016 at 05:58:42PM -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > Unfortunately, sqrt(x) isn't very cheap.
> 
> You'd be surprised: sqrt is built-in on most modern hardware.  On my
> three-year-old workstation, sqrt(x) seems to take about 2.6ns.  For
> comparison, the pack_float version posted in
> 
> takes 3.9ns (and draws multiple compiler warnings, too).
> 
>   regards, tom lane

signature.asc
Description: PGP signature

Re: [HACKERS] High-CPU consumption on information_schema (only) query

On 8 Sep. 2016 7:38 am, "Robins Tharakan"  wrote:
>
> Hi,
>
> An SQL (with only information_schema related JOINS) when triggered, runs
with max CPU (and never ends - killed after 2 days).
> - It runs similarly (very slow) on a replicated server that acts as a
read-only slave.
> - Top shows only postgres as hitting max CPU (nothing else). When query
killed, CPU near 0%.
> - When the DB is restored on a separate test server (with the exact
postgresql.conf) the same query works fine.
> - There is no concurrent usage on the replicated / test server (although
the primary is a Production server and has concurrent users).
>
> Questions:
> - If this was a postgres bug or a configuration issue, query on the
restored DB should have been slow too. Is there something very basic I am
missing here?
>
> If someone asks for I could provide SQL + EXPLAIN, but it feels
irrelevant here. I amn't looking for a specific solution but what else
should I be looking for here?

Get a series of stack traces.

Perf with stack output would be good too.

You need debug info for both.

Re: [HACKERS] Suggestions for first contribution?

2016-09-08 Thread Christian Convey

On Wed, Sep 7, 2016 at 10:14 AM, Aleksander Alekseev
 wrote:
> Here is another idea for a contribution - refactoring.
>
> Currently there are a lot of procedures in PostgreSQL code that
> definitely don't fit on one screen (i.e. ~50 lines). Also many files are
> larger than say 1000 lines of code. For instance, psql_completion
> procedure is more than 2000 lines long! I think patches that would make
> such code easier to read would have a good chance to be accepted.

Thanks for the suggestion!  I can see how refactoring could make a lot
of sense for some functions.  That's probably a task I'd want to take
on after I had more experience with PG's code base.  I think being
familiar with most/all of PG's code would make that task go more
smoothly.

- Christian

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Push down more full joins in postgres_fdw

2016-09-08 Thread Etsuro Fujita


On 2016/09/06 22:07, Ashutosh Bapat wrote:

On Fri, Sep 2, 2016 at 3:55 PM, Etsuro Fujita
> wrote:



On 2016/08/22 15:49, Ashutosh Bapat wrote:
1. deparsePlaceHolderVar looks odd - each of the deparse*
function is
named as deparse + . PlaceHolderVar is not a parser node, so no string is
going to be
parsed as a PlaceHolderVar. May be you want to name it as
deparseExerFromPlaceholderVar or something like that.



The name "deparseExerFromPlaceholderVar" seems long to me.  How
about "deparsePlaceHolderExpr"?



There isn't any node with name PlaceHolderExpr.


I'll rename it to "deparseExerInPlaceholderVar", then.


3. I think registerAlias stuff should happen really at the time of
creating paths and should be stored in fpinfo. Without that it
will be
computed every time we deparse the query. This means every time
we try
to EXPLAIN the query at various levels in the join tree and once
for the
query to be sent to the foreign server.



Hmm.  I think the overhead in calculating aliases would be
negligible compared to the overhead in explaining each remote query
for costing or sending the remote query for execution.  So, I
created aliases in the same way as remote params created and stored
into params_list in deparse_expr_cxt.  I'm not sure it's worth
complicating the code.



We defer remote parameter creation till deparse time since the the
parameter numbers are dependent upon the sequence in which we deparse
the query. Creating them at the time of path creation and storing them
in fpinfo is not possible because a. those present in the joining
relations will conflict with each other and need some kind of
serialization at the time of deparsing b. those defer for differently
parameterized paths or paths with different pathkeys. We don't defer it
because it's very light on performance.

That's not true with the alias information. As long as we detect which
relations need subqueries, their RTIs are enough to create unique
aliases e.g. if a relation involves RTIs 1, 2, 3 corresponding subquery
can have alias r123 without conflicting with any other alias.


Hmm.  But another concern about the approach of generating an subselect 
alias for a path, if needed, at the path creation time would be that the 
path might be rejected by add_path, which would result in totally 
wasting cycles for generating the subselect alias for the path.



However minimum overhead it might have, we will deparse the query every
time we create a path involving that kind of relation i.e. for different
pathkeys, different parameterization and different joins in which that
relation participates in. This number can be significant. We want to
avoid this overhead if we can.


Exactly.  As I said above, I think the overhead would be negligible 
compared to the overhead in explaining each remote query for costing or 
the overhead in sending the final remote query for execution, though.



5. The blocks related to inner and outer relations in
deparseFromExprForRel() look same. We should probably separate
that code
out into a function and call it at two places.



Done.



I see you have created function deparseOperandRelation() for the
same. I guess, this should be renamed as deparseRangeTblRef() since it
will create RangeTblRef node when parsed back.


OK, if there no opinions of others, I'll rename it to the name proposed 
by you, "deparseRangeTblRef".



6.
! if (is_placeholder)
! errcontext("placeholder expression at position %d in
select list",
!errpos->cur_attno);
A user wouldn't know what a placeholder expression is, there is
no such
term in the documentation. We have to device a better way to
provide an
error context for an expression in general.



Though I proposed that, I don't think that it's that important to
let users know that the expression is from a PHV.  How about just
saying "expression", not "placeholder expression", so that we have
the message "expression at position %d in select list" in the context?



Hmm, that's better than placeholder expression, but not as explanatory
as it should be since we won't be printing the "select list" in the
error context and user won't have a clue as to what error context is
about.


I don't think so.  Consider an example of the conversion error message, 
which is from the regression test:


SELECT  ft1.c1,  ft2.c2, ft1 FROM ft1, ft2 WHERE ft1.c1 = ft2.c1 AND 
ft1.c1 = 1;

ERROR:  invalid input syntax for integer: "foo"
CONTEXT:  whole-row reference to foreign table "ft1"

As shown in the example, the error message is displayed under a remote 
query for execution.  So, ISTM it's reasonable to print something like

Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw

2016-09-08 Thread Etsuro Fujita


On 2016/09/07 13:21, Ashutosh Bapat wrote:

* with the patch:
postgres=# explain verbose delete from ft1 using ft2 where ft1.a =
ft2.a;
 QUERY PLAN

-
 Delete on public.ft1  (cost=100.00..102.04 rows=1 width=38)
   ->  Foreign Delete  (cost=100.00..102.04 rows=1 width=38)
 Remote SQL: DELETE FROM public.t1 r1 USING (SELECT ROW(a,
b), a FROM public.t2) ss1(c1, c2) WHERE ((r1.a = ss1.c2))
(3 rows)



The underlying scan on t2 requires ROW(a,b) for locking the row for
update/share. But clearly it's not required if the full query is being
pushed down.


Exactly.


Is there a way we can detect that ROW(a,b) is useless
column (not used anywhere in the other parts of the query like
RETURNING, DELETE clause etc.) and eliminate it?


I don't have a clear solution for that yet, but I'll try to remove that 
in the next version.



Similarly for a, it's
part of the targetlist of the underlying scan so that the WHERE clause
can be applied on it. But it's not needed if we are pushing down the
query. If we eliminate the targetlist of the query, we could construct a
remote query without having subquery in it, making it more readable.


Will try to do so also.

Thanks for the comments!

Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] \timing interval

On 4 Sep. 2016 3:36 am, "Tom Lane"  wrote:
>

> After further thought I concluded that not providing any labeling of
> days is a bad idea.

Yeah. I think labeling days is definitely good. I'm glad you changed that.

Personally I'd like to trim milliseconds when dealing with minute+ long
runs and seconds from hour+ runs too, since it's all there in the ms output
and the units output is for human readability. I see the value of retaining
full precision too, though, and don't feel strongly about it.

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-08 Thread Victor Wagner

On Tue, 6 Sep 2016 07:58:28 +0530
Mithun Cy  wrote:

> On Mon, Sep 5, 2016 at 4:33 PM, Aleksander Alekseev <
> a.aleks...@postgrespro.ru> wrote:
> >After a brief examination I've found following ways to improve the
> >patch.  
> Adding to above comments.
> 
I'm sending new version of patch.

I've replaced readonly option with target_server_type (with JDBC
compatibility alias targetServerType), 

use logic of setting defaults based on number of hosts in the connect
string instead of complicated condition in the state machine,

added checks for return values of memory allocation function.

Also, I've solved pg_usleep problem by linking pgsleep.c into libpq
along with some other libpgport objects which are already linked there.

Thus client applications don't need to link with libpgport and libpq
shared library is self-containted.
 diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 4e34f00..f0df877 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -792,7 +792,7 @@ host=localhost port=5432 dbname=mydb connect_timeout=10

The general form for a connection URI is:
 
-postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1...]
+postgresql://[user[:password]@][netloc][:port][,netloc[:port]...][/dbname][?param1=value1...]
 

 
@@ -809,6 +809,7 @@ postgresql://localhost/mydb
 postgresql://user@localhost
 postgresql://user:secret@localhost
 postgresql://other@localhost/otherdb?connect_timeout=10application_name=myapp
+postgresql://node1,node2:5433,node3:4432,node4/mydb?hostorder=randomtarget_server_type=any
 
 Components of the hierarchical part of the URI can also
 be given as parameters.  For example:
@@ -831,7 +832,9 @@ postgresql:///mydb?host=localhostport=5433

 For improved compatibility with JDBC connection URIs,
 instances of parameter ssl=true are translated into
-sslmode=require.
+sslmode=require and
+loadBalanceHosts=true  into
+hostorder=random.

 

@@ -841,6 +844,10 @@ postgresql:///mydb?host=localhostport=5433
 postgresql://[2001:db8::1234]/database
 

+   
+   There can be serveral host specifications, optionally accompanied
+   with port, separated by comma.
+   
 

 The host component is interpreted as described for the parameter PostgreSQL was built). On machines without
 Unix-domain sockets, the default is to connect to localhost.

+   
+   There can be more than one host parameter in
+   the connect string. In this case these hosts would be considered
+   alternate entries into same database and if connect to first one
+   fails, library would try to connect second etc. This can be used
+   for high availability cluster or for load balancing. See
+parameter.
+   
+   
+   Network host name can be accompanied with port number, separated by
+   colon. If so, this port number is used only when connected to
+   this host. If there is no port number, port specified in the
+parameter would be used.
+   
   
  
 
@@ -942,8 +963,44 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname


   
-
+
+  
+  hostorder
+  
+  
+  Specifies how to choose host from list of alternate hosts,
+  specified in the  parameter.
+  
+  
+  If value of this argument is sequential (the
+  default) library connects to the hosts in order they specified,
+  and tries to connect second one only if connection to the first
+  fails.
+  
+  
+  If value is random host to connect is randomly
+  picked from the list. It allows to balance load between several
+  cluster nodes. However, currently PostgreSQL doesn't support
+  multimaster clusters. So, without use of third-party products,
+  only read-only connections can take advantage from the
+  load-balancing. See 
+  
+  
+  
+  
+  target_server_type
+  
+  
+  If this parameter is master upon successful connection
+  library checks if host is in recovery state, and if it is so,
+  tries next host in the connect string. If this parameter is
+  any, connection to standby nodes are considered
+  successful.
+  
+  
+  
   
+
port


@@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
  
   connect_timeout
   
@@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
+ 
+ failover_timeout
+ 
+ 
+ Maximum time to cyclically retry all the hosts in connect string.
+ (as decimal integer number of seconds). If not specified, then
+ hosts are tried just once.
+ 
+ 
+ If we have replicating cluster, and master node fails, it might
+ take some time to promote one of standby nodes to the new master.
+ So clients which notice that connect to the

Re: [HACKERS] GiST penalty functions [PoC]

Heikki Linnakangas  writes:
> BTW, I would be OK with the bit-twiddling hack, if we had an autoconf 
> check for IEEE 754 floats, and a graceful fallback for other systems. 

I would still object to the version submitted, on the grounds of the
compiler warnings it causes.  Possibly those could be avoided with
a union, though; Float4GetDatum doesn't produce such warnings.

BTW, would someone explain to me why using a float here will not
fail catastrophically for inputs outside the range of float?
Cubes are based on doubles, not floats.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Suggestions for first contribution?

2016-09-08 Thread Christian Convey

On Wed, Sep 7, 2016 at 10:34 AM, Yury Zhuravlev
 wrote:
> Christian Convey wrote:
>>
>> Yury doesn't seem to need help
>> with CMake
>
> Hello.
> I am sorry that the only answer is now.
> I need help but with write CMake code:
> 1. Make ecpg tests
> 2. Make MinGW/Msys support
> 3. many many ...
> all targets and discussion here:
> https://github.com/stalkerg/postgres_cmake/issues
>
> If you can only test CMake for Ubuntu x86_64 that it is not necessary.
> If I not fully understand you - sorry.

Hi Yury, no problem.  It sounds like most of the platform testing you
want requires hardware I don't have, unfortunately.

- Christian


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

2016-09-08 Thread Magnus Hagander

On Thu, Sep 8, 2016 at 2:27 PM, Tom Lane  wrote:

> Michael Paquier  writes:
> > But I don't understand the reason behind such a restriction to be
> > honest because libxml2 does not depend on libxslt. The contrary is
> > true: libxslt needs libxml2.
>
> Right.
>

Pretty sure this goes back to *our* XML support requiring both. As in you
couldn't turn on/off xslt independently, so the "xml requires xslt" comment
is that *our* xml support required both.

This may definitely not be true anymore, and that check has just not been
updated.

Also this was 10 years ago, so I'm of course not 100% sure, but I think it
was something like that...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: [HACKERS] Declarative partitioning - another take

2016-09-08 Thread Rajkumar Raghuwanshi

On Wed, Sep 7, 2016 at 3:58 PM, Amit Langote 
wrote:

>
> Hi,
>
> On 2016/09/07 17:56, Rajkumar Raghuwanshi wrote:
> > Hi,
> >
> > I have a query regarding list partitioning,
> >
> > For example if I want to store employee data in a table, with "IT" dept
> > employee in emp_p1 partition, "HR" dept employee in emp_p2 partition and
> if
> > employee belongs to other than these two, should come in emp_p3
> partition.
> >
> > In this case not sure how to create partition table. Do we have something
> > like we have UNBOUNDED for range partition or oracle have "DEFAULT" for
> > list partition.
> >
> > create table employee (empid int, dept varchar) partition by list(dept);
> > create table emp_p1 partition of employee for values in ('IT');
> > create table emp_p2 partition of employee for values in ('HR');
> > create table emp_p3 partition of employee for values in (??);
>
> Sorry, no such feature is currently offered.  It might be possible to
> offer something like a "default" list partition which accepts values other
> than those specified for other existing partitions.  However, that means
> if we add a non-default list partition after a default one has been
> created, the implementation must make sure that it moves any values from
> the default partition that now belong to the newly created partition.
>
> Thanks,
> Amit
>

Thanks for clarifying, But I could see same problem of moving data when
adding a new non-default partition with unbounded range partition.

For example give here, Initially I have create a partition table test with
test_p3 as unbounded end,
Later tried to change test_p3 to contain 7-9 values only, and adding a new
partition test_p4 contain 10-unbound.

--create partition table and some leafs
CREATE TABLE test (a int, b int) PARTITION BY RANGE(a);
CREATE TABLE test_p1 PARTITION OF test FOR VALUES START (1) END (4);
CREATE TABLE test_p2 PARTITION OF test FOR VALUES START (4) END (7);
CREATE TABLE test_p3 PARTITION OF test FOR VALUES START (7) END UNBOUNDED;

--insert some data
INSERT INTO test SELECT i, i*10 FROM generate_series(1,3) i;
INSERT INTO test SELECT i, i*10 FROM generate_series(4,6) i;
INSERT INTO test SELECT i, i*10 FROM generate_series(7,13) i;

--directly not able to attach test_p4 because of overlap error, hence
detached test_p3 and than attaching test_p4
SELECT tableoid::regclass,* FROM test;
 tableoid | a  |  b
--++-
 test_p1  |  1 |  10
 test_p1  |  2 |  20
 test_p1  |  3 |  30
 test_p2  |  4 |  40
 test_p2  |  5 |  50
 test_p2  |  6 |  60
 test_p3  |  7 |  70
 test_p3  |  8 |  80
 test_p3  |  9 |  90
 test_p3  | 10 | 100
 test_p3  | 11 | 110
 test_p3  | 12 | 120
 test_p3  | 13 | 130
(13 rows)

ALTER TABLE test DETACH PARTITION test_p3;
CREATE TABLE test_p4 (like test);
ALTER TABLE test ATTACH PARTITION test_p4 FOR VALUES start (10) end
UNBOUNDED;

--now can not attach test_p3 because of overlap with test_p4, causing data
loss from main test table.
ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (10);
ERROR:  source table contains a row violating partition bound specification
ALTER TABLE test ATTACH PARTITION test_p3 FOR VALUES start (7) end (13);
ERROR:  partition "test_p3" would overlap partition "test_p4"

SELECT tableoid::regclass,* FROM test;
 tableoid | a | b
--+---+
 test_p1  | 1 | 10
 test_p1  | 2 | 20
 test_p1  | 3 | 30
 test_p2  | 4 | 40
 test_p2  | 5 | 50
 test_p2  | 6 | 60
(6 rows)

Re: [HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

On Thu, Sep 8, 2016 at 9:27 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> But I don't understand the reason behind such a restriction to be
>> honest because libxml2 does not depend on libxslt. The contrary is
>> true: libxslt needs libxml2.
>
> Right.
>
>> Note as well that libxml2 does depend on ICONV though.
>
> Hm, is that true either?  I don't see any sign of such a dependency
> on Linux:
>
> $ ldd /usr/lib64/libxml2.so.2.7.6
> linux-vdso.so.1 =>  (0x7fff98f6b000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00377a60)
> libz.so.1 => /lib64/libz.so.1 (0x00377ae0)
> libm.so.6 => /lib64/libm.so.6 (0x003779e0)
> libc.so.6 => /lib64/libc.so.6 (0x003779a0)
> /lib64/ld-linux-x86-64.so.2 (0x00377960)

Peaking into the libxml2 code there is a configure switch
--with-iconv, so that's an optional dependency. And the same exists
for Windows in their win32/stuff. So I am mistaken and this could be
just removed.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Sample configuration files

Vik Fearing  writes:
> I noticed that this patch has been marked Waiting on Author with no
> comment.  Peter, what more should I be doing right now while waiting for
> MartÃn's review?

FWIW, I agree with the upthread misgivings about whether this is actually
a useful effort.  Even if we installed the sample config files somewhere
(something there is not consensus for AFAICT), they would not actually
*do* anything useful as standalone files.  I suppose you are imagining
that people would either manually concatenate them onto postgresql.conf
or insert an include directive for them into postgresql.conf, but neither
of those things sound pleasant or maintainable.

Moreover, it's not clear why anyone would do that at all in the age of
ALTER SYSTEM SET.

I suggest that it'd be more fruitful to view this as a documentation
effort; that is, in each contrib module's SGML documentation file provide
a standardized section listing all its parameters and their default
settings.  That would be something that could be copied-and-pasted from
into either an editor window on postgresql.conf for the old guard, or
an ALTER SYSTEM SET command for the new.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

2016-09-08 Thread Magnus Hagander

On Thu, Sep 8, 2016 at 2:53 PM, Michael Paquier 
wrote:

> On Thu, Sep 8, 2016 at 9:39 PM, Magnus Hagander 
> wrote:
> > Pretty sure this goes back to *our* XML support requiring both. As in you
> > couldn't turn on/off xslt independently, so the "xml requires xslt"
> comment
> > is that *our* xml support required both.
> >
> > This may definitely not be true anymore, and that check has just not been
> > updated.
> >
> > Also this was 10 years ago, so I'm of course not 100% sure, but I think
> it
> > was something like that...
>
> Was it for contrib/xml2? For example was it because this module could
> not be compiled with just libxml2?
>

Yes, that's what I'm referring to.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: [HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

Michael Paquier  writes:
> On Thu, Sep 8, 2016 at 9:39 PM, Magnus Hagander  wrote:
>> Pretty sure this goes back to *our* XML support requiring both. As in you
>> couldn't turn on/off xslt independently, so the "xml requires xslt" comment
>> is that *our* xml support required both.
>> 
>> This may definitely not be true anymore, and that check has just not been
>> updated.
>> 
>> Also this was 10 years ago, so I'm of course not 100% sure, but I think it
>> was something like that...

> Was it for contrib/xml2? For example was it because this module could
> not be compiled with just libxml2?

The core code has never used xslt at all.  Some quick digging in the git
history suggests that contrib/xml2 wasn't very clean about this before
2008:

commit eb915caf92a6805740e949c3233ee32bc9676484
Author: Tom Lane 
Date:   Thu May 8 16:49:37 2008 +

Fix contrib/xml2 makefile to not override CFLAGS, and in passing make it
auto-configure properly for libxslt present or not.

or even 2010, depending on how large a value of "work" you want:

commit d6a6f8c6be4b6d6a9e90e92d91a83225bfe8d29d
Author: Tom Lane 
Date:   Mon Mar 1 18:07:59 2010 +

Fix contrib/xml2 so regression test still works when it's built without 
libxslt.

This involves modifying the module to have a stable ABI, that is, the
xslt_process() function still exists even without libxslt.  It throws a
runtime error if called, but doesn't prevent executing the CREATE FUNCTION
call.  This is a good thing anyway to simplify cross-version upgrades.

Both of those fixes postdate our native-Windows port, though I'm not sure
of the origin of the specific test you're wondering about.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Useless dependency assumption libxml2 -> libxslt in MSVC scripts

On Thu, Sep 8, 2016 at 9:39 PM, Magnus Hagander  wrote:
> Pretty sure this goes back to *our* XML support requiring both. As in you
> couldn't turn on/off xslt independently, so the "xml requires xslt" comment
> is that *our* xml support required both.
>
> This may definitely not be true anymore, and that check has just not been
> updated.
>
> Also this was 10 years ago, so I'm of course not 100% sure, but I think it
> was something like that...

Was it for contrib/xml2? For example was it because this module could
not be compiled with just libxml2?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

Dilip Kumar  writes:
> On Thu, Sep 8, 2016 at 2:23 AM, Tom Lane  wrote:
>> In the case of record_in, it seems like it'd be easy enough to use
>> lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc()
>> and then throw a user-facing error right there.

> Actually when we are passing invalid type to "record_in", error is
> thrown in "lookup_type_cache" function, and this function is not
> taking "no_error" input (it's only limited to
> lookup_rowtype_tupdesc_internal).

Ah, should have dug a bit deeper.

> One option is to pass "no_error" parameter to "lookup_type_cache"
> function also, and throw error only in record_in function.
> But problem is, "lookup_type_cache" has lot of references. And also
> will it be good idea to throw one generic error instead of multiple
> specific errors. ?

We could make it work like that without breaking the ABI if we were
to add a NOERROR bit to the allowed "flags".  However, after looking
around a bit I'm no longer convinced what I said above is a good idea.
In particular, if we put the responsibility of reporting the error on
callers then we'll lose the ability to distinguish "no such pg_type OID"
from "type exists but it's only a shell".  So I'm now thinking it's okay
to promote lookup_type_cache's elog to a full ereport, especially since
it's already using ereport for some of the related error cases such as
the shell-type failure.

That still leaves us with what to do for domain_in.  A really simple
fix would be to move the InitDomainConstraintRef call before the
getBaseTypeAndTypmod call, because the former fetches the typcache
entry and would then benefit from lookup_type_cache's ereport.
But that seems a bit magic.

I'm tempted to propose that domain_state_setup start with

typentry = lookup_type_cache(domainType, TYPECACHE_DOMAIN_INFO);
if (typentry->typtype != TYPTYPE_DOMAIN)
ereport(ERROR,
(errcode(ERRCODE_DATATYPE_MISMATCH),
 errmsg("type %s is not a domain",
format_type_be(domainType;

removing the need to verify that after getBaseTypeAndTypmod.

The cache loading is basically free because InitDomainConstraintRef
would do it anyway; so the extra cost here is only one dynahash search.
You could imagine buying back those cycles by teaching the typcache
to be able to cache the result of getBaseTypeAndTypmod, but I'm doubtful
that we really care.  This whole setup sequence only happens once per
query anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-08 Thread Rahila Syed

Thank you for comments.

>But there's a function in startup.c which might be the ideal location
>or the check, as it looks like the one and only place where the
>autocommit flag actually changes:

>static void
>autocommit_hook(const char *newval)

Thank you for pointing out this. This indeed is the only place where
autocommit setting changes in psql. However,
including the check here will require returning the status of command from
this hook. i.e if we throw error inside this
hook we will need to return false indicating the value has not changed.
This will change the existing definition of the hook
which returns void. This will also lead to changes in other implementations
of this hook apart from autocommit_hook.

>There are other values than ON: true/yes and theirs abbreviations,
>the value 1, and anything that doesn't resolve to OFF is taken as ON.
>ParseVariableBool() in commands.c already does the job of converting
>these to bool, the new code should probably just call that function
>instead of parsing the value itself.

I have included this in the attached patch.

Also I have included prior review comments by Rushabh Lathia and Amit
Langote in the attached patch

Regarding suggestion to move the code inside SetVariable(), I think it is
not a good idea because
SetVariable() and variable.c file in which it is present is about handling
of psql variables, checks for
correct variable names, values etc. This check is about correctness of the
command so it is better placed
in exec_command().

Thank you,
Rahila Syed

On Sat, Sep 3, 2016 at 4:39 PM, Daniel Verite 
wrote:

> Rushabh Lathia wrote:
>
> > It might happen that SetVariable() can be called from other places in
> > future,
> > so its good to have all the set variable related checks inside
> > SetVariable().
>
> There's already another way to skip the \set check:
>   select 'on' as "AUTOCOMMIT" \gset
>
> But there's a function in startup.c which might be the ideal location
> for the check, as it looks like the one and only place where the
> autocommit flag actually changes:
>
> static void
> autocommit_hook(const char *newval)
> {
>  pset.autocommit = ParseVariableBool(newval, "AUTOCOMMIT");
> }
>
>
>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>

psql_error_on_autocommit_v2.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-08 Thread Aleksander Alekseev

Hello, Victor.

> I'm sending new version of patch.
> 
> I've replaced readonly option with target_server_type (with JDBC
> compatibility alias targetServerType), 
> 
> use logic of setting defaults based on number of hosts in the connect
> string instead of complicated condition in the state machine,
> 
> added checks for return values of memory allocation function.
> 
> Also, I've solved pg_usleep problem by linking pgsleep.c into libpq
> along with some other libpgport objects which are already linked there.
> 
> Thus client applications don't need to link with libpgport and libpq
> shared library is self-containted.

This patch doesn't apply to master branch:

```
$ git apply ~/temp/libpq-failover-8.patch
/home/eax/temp/libpq-failover-8.patch:184: trailing whitespace.
check: 
/home/eax/temp/libpq-failover-8.patch:252: trailing whitespace.
/* 
/home/eax/temp/libpq-failover-8.patch:253: trailing whitespace.
 *  Validate target_server_mode option 
/home/eax/temp/libpq-failover-8.patch:254: trailing whitespace.
 */ 
/home/eax/temp/libpq-failover-8.patch:306: trailing whitespace.
appendPQExpBuffer(>errorMessage, 
error: src/interfaces/libpq/t/001-multihost.pl: already exists in
working directory

$ git diff
```

-- 
Best regards,
Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logical Replication WIP

2016-09-08 Thread Petr Jelinek


Hi,

On 07/09/16 14:10, Erik Rijkers wrote:

On 2016-08-31 22:51, Petr Jelinek wrote:


and one more version with bug fixes, improved code docs and couple



I am not able to get the replication to work. Would you (or anyone) be
so kind to point out what I am doing wrong?

Patches applied, compiled, make-checked, installed OK.

I have 2 copies compiled and installed,  logical_replication and
logical_replication2, to be publisher and subscriber, ports 6972 and
6973 respectively.


Logfile subscriber-side:
[...]
2016-09-07 13:47:44.441 CEST 21997 LOG:  MultiXact member wraparound
protections are now enabled
2016-09-07 13:47:44.528 CEST 21986 LOG:  database system is ready to
accept connections
2016-09-07 13:47:44.529 CEST 22002 LOG:  logical replication launcher
started
2016-09-07 13:52:11.319 CEST 22143 LOG:  logical replication apply for
subscription sub1 started
2016-09-07 13:53:47.010 CEST 22143 ERROR:  could not open relation with
OID 0
2016-09-07 13:53:47.012 CEST 21986 LOG:  worker process: logical
replication worker 24048 (PID 22143) exited with exit code 1
2016-09-07 13:53:47.018 CEST 22184 LOG:  logical replication apply for
subscription sub1 started
2016-09-07 13:53:47.028 CEST 22184 ERROR:  could not open relation with
OID 0
2016-09-07 13:53:47.030 CEST 21986 LOG:  worker process: logical
replication worker 24048 (PID 22184) exited with exit code 1
2016-09-07 13:53:52.041 CEST 22187 LOG:  logical replication apply for
subscription sub1 started
2016-09-07 13:53:52.045 CEST 22187 ERROR:  could not open relation with
OID 0
2016-09-07 13:53:52.046 CEST 21986 LOG:  worker process: logical
replication worker 24048 (PID 22187) exited with exit code 1
(repeat every few seconds)




It means the tables don't exist on subscriber. I added check and proper 
error message in my local dev branch, it will be part of the next update.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee

On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire 
wrote:

> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs 
> wrote:
> >> On 6 September 2016 at 19:59, Tom Lane  wrote:
> >>
> >>> The idea of looking to the stats to *guess* about how many tuples are
> >>> removable doesn't seem bad at all.  But imagining that that's going to
> be
> >>> exact is folly of the first magnitude.
> >>
> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
> >> so I think we agree that a reasonably accurate, yet imprecise
> >> calculation is possible in most cases.
> >
> > That would all be well and good if it weren't trivial to do what
> > Robert suggested. This is just a large unsorted list that we need to
> > iterate throught. Just allocate chunks of a few megabytes and when
> > it's full allocate a new chunk and keep going. There's no need to get
> > tricky with estimates and resizing and whatever.
>
> I agree. While the idea of estimating the right size sounds promising
> a priori, considering the estimate can go wrong and over or
> underallocate quite severely, the risks outweigh the benefits when you
> consider the alternative of a dynamic allocation strategy.
>
> Unless the dynamic strategy has a bigger CPU impact than expected, I
> believe it's a superior approach.
>
>
How about a completely different representation for the TID array? Now this
is probably not something new, but I couldn't find if the exact same idea
was discussed before. I also think it's somewhat orthogonal to what we are
trying to do here, and will probably be a bigger change. But I thought I'll
mention since we are at the topic.

What I propose is to use a simple bitmap to represent the tuples. If a
tuple at  is dead then the corresponding bit in the bitmap
is set. So clearly the searches through dead tuples are O(1) operations,
important for very large tables and large arrays.

Challenge really is that a heap page can theoretically have MaxOffsetNumber
of line pointers (or to be precise maximum possible offset number). For a
8K block, that comes be about 2048. Having so many bits per page is neither
practical nor optimal. But in practice the largest offset on a heap page
should not be significantly greater than MaxHeapTuplesPerPage, which is a
more reasonable value of 291 on my machine. Again, that's with zero sized
tuple and for real life large tables, with much wider tuples, the number
may go down even further.

So we cap the offsets represented in the bitmap to some realistic value,
computed by looking at page density and then multiplying it by a small
factor (not more than two) to take into account LP_DEAD and LP_REDIRECT
line pointers. That should practically represent majority of the dead
tuples in the table, but we then keep an overflow area to record tuples
beyond the limit set for per page. The search routine will do a direct
lookup for offsets less than the limit and search in the sorted overflow
area for offsets beyond the limit.

For example, for a table with 60 bytes wide tuple (including 24 byte
header), each page can approximately have 8192/60 = 136 tuples. Say we
provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
bitmap. First 272 offsets in every page are represented in the bitmap and
anything greater than are in overflow region. On the other hand, the
current representation will need about 16 bytes per page assuming 2% dead
tuples, 40 bytes per page assuming 5% dead tuples and 80 bytes assuming 10%
dead tuples. So bitmap will take more space for small tuples or when vacuum
is run very aggressively, both seems unlikely for very large tables. Of
course the calculation does not take into account the space needed by the
overflow area, but I expect that too be small.

I guess we can make a choice between two representations at the start
looking at various table stats. We can also be smart and change from bitmap
to traditional representation as we scan the table and see many more tuples
in the overflow region than we provisioned for. There will be some
challenges in converting representation mid-way, especially in terms of
memory allocation, but I think those can be sorted out if we think that the
idea has merit.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Truncating/vacuuming relations on full tablespaces

2016-09-08 Thread Haribabu Kommi

On Mon, Jun 20, 2016 at 9:28 PM, Asif Naeem  wrote:

> Thank you for useful suggestions. PFA patch, I have tried to cover all the
> points mentioned.
>
>>
>>
Patch needs rebase, it is failing to apply on latest master.
And also there are some pending comment fix from Robert.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-08 Thread Jim Nasby


On 9/8/16 3:03 AM, Masahiko Sawada wrote:

Current manual vacuum doesn't output how may all_frozen pages we
skipped according to visibility map.
That's why I attached 0001 patch which makes the manual vacuum emit
such information.


I think we should add that, and info about all-frozen skips, regardless 
of the rest of these changes. Can you submit that separately to the 
commitfest? (Or I can if you want.)

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-09-08 Thread Aleksander Alekseev

> Hello, Victor.
> 
> > I'm sending new version of patch.
> > 
> > I've replaced readonly option with target_server_type (with JDBC
> > compatibility alias targetServerType), 
> > 
> > use logic of setting defaults based on number of hosts in the connect
> > string instead of complicated condition in the state machine,
> > 
> > added checks for return values of memory allocation function.
> > 
> > Also, I've solved pg_usleep problem by linking pgsleep.c into libpq
> > along with some other libpgport objects which are already linked there.
> > 
> > Thus client applications don't need to link with libpgport and libpq
> > shared library is self-containted.
> 
> This patch doesn't apply to master branch:
> 
> ```
> $ git apply ~/temp/libpq-failover-8.patch
> /home/eax/temp/libpq-failover-8.patch:184: trailing whitespace.
> check: 
> /home/eax/temp/libpq-failover-8.patch:252: trailing whitespace.
>   /* 
> /home/eax/temp/libpq-failover-8.patch:253: trailing whitespace.
>*  Validate target_server_mode option 
> /home/eax/temp/libpq-failover-8.patch:254: trailing whitespace.
>*/ 
> /home/eax/temp/libpq-failover-8.patch:306: trailing whitespace.
>   appendPQExpBuffer(>errorMessage, 
> error: src/interfaces/libpq/t/001-multihost.pl: already exists in
> working directory
> 
> $ git diff
> ```

Oops. Sorry, my mistake. I forgot to check for untracked files.
Everything is fine.

-- 
Best regards,
Aleksander Alekseev


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Jim Nasby


On 9/8/16 3:48 AM, Masahiko Sawada wrote:

If we replaced dead_tuples with an array-of-array, isn't there
negative performance impact for lazy_tid_reap()?
As chunk is added, that performance would be decrease.


Yes, it certainly would, as you'd have to do 2 binary searches. I'm not 
sure how much that matters though; presumably the index scans are 
normally IO-bound?


Another option would be to use the size estimation ideas others have 
mentioned to create one array. If the estimates prove to be wrong you 
could then create a single additional segment; by that point you should 
have a better idea of how far off the original estimate was. That means 
the added search cost would only be a compare and a second pointer redirect.


Something else that occurred to me... AFAIK the only reason we don't 
support syncscan with VACUUM is because it would require sorting the TID 
list. If we just added a second TID list we would be able to support 
syncscan, swapping over to the 'low' list when we hit the end of the 
relation.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] to_date_valid()

On 8/15/16 7:33 AM, Andreas 'ads' Scherbaum wrote:
> postgres=# SELECT to_date('2011 12  18', ' MM   DD');
>to_date
> 
>   2011-12-08
> (1 row)
> 
> 
> That is from the regression tests, and obviously handles the date 
> transformation wrong. My attempt catches this, because I compare the 
> date with the input date, and do not rely on a valid date only.

It's debatable what is correct here.

Using to_number, the behavior appears to be that a space in the pattern
ignores one character.  For example:

test=# select to_number('123 456', '999 999');
 to_number
---
123456

test=# select to_number('123 456', '999  999');
 to_number
---
 12356

Considering that, the above to_date result is not incorrect.

So just squashing the spaces and converting the value back is not a
correct approach to detecting overflow.

I think using ValidateDate() was the right idea.  That is what we use
for checking date validity everywhere else.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Optimization for lazy_scan_heap

2016-09-08 Thread Robert Haas

On Thu, Sep 8, 2016 at 4:03 AM, Masahiko Sawada  wrote:
> On Wed, Sep 7, 2016 at 4:11 PM, Simon Riggs  wrote:
>> On 7 September 2016 at 04:13, Masahiko Sawada  wrote:
>>
>>> Since current HEAD could scan visibility map twice, the execution time
>>> of Patched is approximately half of HEAD.
>>
>> Sounds good.
>>
>> To ensure we are doing exactly same amount of work as before, did you
>> see the output of VACUUM VEROBOSE?
>
> Sorry, the previous test result I posted was something wrong.
> I rerun the performance test and results are,
>
> * 1TB Table(visibility map size is 32MB)
> HEAD : 4853.250 ms (00:04.853)
> Patched : 3805.789 ms (00:03.806)
>
> * 8TB Table(visibility map size is 257MB)
> HEAD : 37853.891 ms (00:37.854)
> Patched : 30153.943 ms (00:30.154)
>
> * 32TB Table(visibility map size is 1GB)
> HEAD: 151908.344 ms (02:31.908)
> Patched: 120560.037 ms (02:00.560)
>
> Since visibility map page can be cached onto shared buffer or OS cache
> by first scanning, the benefit of this patch seems not to be large.

Yeah, that's not nearly as good as the first set of results.  Maybe
it's still worth doing, but if you need to have a 32TB table to save
as much as 30 seconds, we don't have much of a problem here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

Peter Eisentraut  writes:
> On 9/6/16 1:40 PM, Doug Doole wrote:
>> We carried the ICU version numbers around on our collation and locale
>> IDs (such as fr_FR%icu36) . The database would load multiple versions of
>> the ICU library so that something created with ICU 3.6 would always be
>> processed with ICU 3.6. This avoided the problems of trying to change
>> the rules on the user. (We'd always intended to provide tooling to allow
>> the user to move an existing object up to a newer version of ICU, but we
>> never got around to doing it.) In the code, this meant we were
>> explicitly calling the versioned API so that we could keep the calls
>> straight.

> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.

I agree with that estimate, and I would further venture that even if we
wanted to bundle ICU into our tarballs, distributors would rip it out
again on security grounds.  I am dead certain Red Hat would do so; less
sure that other vendors have similar policies, but it seems likely.
They don't want to have to fix security bugs in more than one place.

This is a problem, if ICU won't guarantee cross-version compatibility,
because it destroys the argument that moving to ICU would offer us
collation behavior stability.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
 wrote:
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.

I thought about something like this, but it could be extremely
inefficient for mostly frozen tables, since the bitmap cannot account
for frozen pages without losing the O(1) lookup characteristic


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Optimization for lazy_scan_heap

On Thu, Sep 8, 2016 at 11:27 PM, Jim Nasby  wrote:
> On 9/8/16 3:03 AM, Masahiko Sawada wrote:
>>
>> Current manual vacuum doesn't output how may all_frozen pages we
>> skipped according to visibility map.
>> That's why I attached 0001 patch which makes the manual vacuum emit
>> such information.
>
>
> I think we should add that, and info about all-frozen skips, regardless of
> the rest of these changes. Can you submit that separately to the commitfest?
> (Or I can if you want.)

Yeah, autovacuum emits such information but manual vacuum doesn't.
I submitted that patch before[1] but patch was not committed, because
verbose output is already complicated.
But I will submit it again and start to discussion about that.

[1] 
https://www.postgresql.org/message-id/CAD21AoCQGTKDxQ6YfrJ0%2B%3Dev6A3i3pt2ULdWxCtwPLKR2E77jg%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CF app and patch series

Craig Ringer wrote:
> Hi all
> 
> Now that it's becoming more common to post patch series, not just
> standalone patches, it might be worth looking at how the CF app can
> help manage them.
> 
> Any ideas?

I agree that we don't consider this case at all currently and that it's
causing some pain.  

MIME parsing is mind-boggling.  Trying to figure out *one* patch file
from an email is already pretty difficult.  Trying to figure out more
than one might be nightmarish.  Maybe Magnus will contradict me and say
it's trivial to do -- that'd be great.

I don't have any great ideas for how to support this; I'd say it would
be something like the CF entry has sub-entries one for each patch in the
series, that can be closed independently.  This probably involves
surgery to the CF database and app which I'm not volunteering to write,
however.  Django-enabled contributors speak up now ...

I think for the time being we should keep the single entry as "needs
review" or whatever open state until there is nothing left in
committable state, then close as "committed"; the parts that were not
committed can be resent as a new series to a later CF.  In the
annotations section, add a link to the latest version each time it's
posted, along with some indication of how many parts are left[1].  Perhaps
if a part is committed and there's no new version submitted soon, also
add an annotation to indicate that.

[1] Do not, as I've seen some do in the past, delete the old annotations
while adding new ones.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan  wrote:
>> In the meanwhile, I'll go and do some perf testing.
>>
>> Assuming the speedup is realized during testing, LGTM.
>
> Thanks. I suggest spending at least as much time on unsympathetic
> cases (e.g., only 2 or 3 tapes must be merged). At the same time, I
> suggest focusing on a type that has relatively expensive comparisons,
> such as collated text, to make differences clearer.

The tests are still running (the benchmark script I came up with runs
for a lot longer than I anticipated, about 2 days), but preliminar
results are very promising, I can see a clear and consistent speedup.
We'll have to wait for the complete results to see if there's any
significant regression, though. I'll post the full results when I have
them, but until now it all looks like this:

setup:

create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
insert into lotsofitext select cast(random() * 10.0 as text)
|| 'blablablawblabla', cast(random() * 10.0 as text) ||
'blablablawjjjblabla', cast(random() * 10.0 as text) ||
'blablabl
awwwabla', random() * 10.0, random() * 1.0 from
generate_series(1, 1000);

timed:

select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;

Unpatched Time: 100351.251 ms
Patched Time: 75180.787 ms

That's like a 25% speedup on random input. As we say over here, rather
badly translated, not a turkey's boogers (meaning "nice!")


On Tue, Sep 6, 2016 at 9:50 PM, Claudio Freire  wrote:
> On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan  wrote:
>> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire  
>> wrote:
>>> I noticed, but here n = state->memtupcount
>>>
>>> +   Assert(memtuples[0].tupindex == newtup->tupindex);
>>> +
>>> +   CHECK_FOR_INTERRUPTS();
>>> +
>>> +   n = state->memtupcount; /* n is heap's size,
>>> including old root */
>>> +   imin = 0;   /*
>>> start with caller's "hole" in root */
>>> +   i = imin;
>>
>> I'm fine with using "n" in the later assertion you mentioned, if
>> that's clearer to you. memtupcount is broken out as "n" simply because
>> that's less verbose, in a place where that makes things far clearer.
>>
>>> In fact, the assert on the patch would allow writing memtuples outside
>>> the heap, as in calling tuplesort_heap_root_displace if
>>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>>> == memtuples[imin] would be outside the heap).
>>
>> You have to have a valid heap (i.e. there must be at least one
>> element) to call tuplesort_heap_root_displace(), and it doesn't
>> directly compact the heap, so it must remain valid on return. The
>> assertion exists to make sure that everything is okay with a
>> one-element heap, a case which is quite possible.
>
> More than using "n" or "memtupcount" what I'm saying is to assert that
> memtuples[imin] is inside the heap, which would catch the same errors
> the original assert would, and more.
>
> Assert(imin < state->memtupcount)
>
> If you prefer.
>
> The original asserts allows any value of imin for memtupcount>1, and
> that's my main concern. It shouldn't.

So, for the assertions to properly avoid clobbering/reading out of
bounds memory, you need both the above assert:

 +*/
 +   memtuples[i] = memtuples[imin];
 +   i = imin;
 +   }
 +
>+   Assert(imin < state->memtupcount);
 +   memtuples[imin] = *newtup;
 +}

And another one at the beginning, asserting:

 +   SortTuple  *memtuples = state->memtuples;
 +   int n,
 +   imin,
 +   i;
 +
>+   Assert(state->memtupcount > 0 && memtuples[0].tupindex == 
>newtup->tupindex);
 +
 +   CHECK_FOR_INTERRUPTS();

It's worth making that change, IMHO, unless I'm missing something.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH v2] Add overflow checks to money type input function

I have updated the patch with additional tests and comments per your
review.  Final(?) version attached.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From ee34d7d64a4b10c9f7fbe8c905a56cea1584c8c9 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 8 Sep 2016 12:00:00 -0400
Subject: [PATCH v3] Add overflow checks to money type input function

The money type input function did not have any overflow checks at all.
There were some regression tests that purported to check for overflow,
but they actually checked for the overflow behavior of the int8 type
before casting to money.  Remove those unnecessary checks and add some
that actually check the money input function.

Reviewed-by: Fabien COELHO 
---
 src/backend/utils/adt/cash.c| 53 ++--
 src/test/regress/expected/money.out | 98 ++---
 src/test/regress/sql/money.sql  | 30 ++--
 3 files changed, 165 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/cash.c b/src/backend/utils/adt/cash.c
index b336185..a146b0a 100644
--- a/src/backend/utils/adt/cash.c
+++ b/src/backend/utils/adt/cash.c
@@ -189,13 +189,30 @@ cash_in(PG_FUNCTION_ARGS)
 	printf("cashin- string is '%s'\n", s);
 #endif
 
+	/*
+	 * We accumulate the absolute amount in "value" and then apply the sign at
+	 * the end.  (The sign can appear before or after the digits, so it would
+	 * be more complicated to do otherwise.)  Because of the larger range of
+	 * negative signed integers, we build "value" in the negative and then
+	 * flip the sign at the end, catching most-negative-number overflow if
+	 * necessary.
+	 */
+
 	for (; *s; s++)
 	{
 		/* we look for digits as long as we have found less */
 		/* than the required number of decimal places */
 		if (isdigit((unsigned char) *s) && (!seen_dot || dec < fpoint))
 		{
-			value = (value * 10) + (*s - '0');
+			Cash newvalue = (value * 10) - (*s - '0');
+
+			if (newvalue / 10 != value)
+ereport(ERROR,
+		(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+		 errmsg("value \"%s\" is out of range for type money",
+str)));
+
+			value = newvalue;
 
 			if (seen_dot)
 dec++;
@@ -214,11 +231,27 @@ cash_in(PG_FUNCTION_ARGS)
 
 	/* round off if there's another digit */
 	if (isdigit((unsigned char) *s) && *s >= '5')
-		value++;
+		value--;  /* remember we build the value in the negative */
+
+	if (value > 0)
+		ereport(ERROR,
+(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("value \"%s\" is out of range for type money",
+		str)));
 
 	/* adjust for less than required decimal places */
 	for (; dec < fpoint; dec++)
-		value *= 10;
+	{
+		Cash newvalue = value * 10;
+
+		if (newvalue / 10 != value)
+			ereport(ERROR,
+	(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+	 errmsg("value \"%s\" is out of range for type money",
+			str)));
+
+		value = newvalue;
+	}
 
 	/*
 	 * should only be trailing digits followed by whitespace, right paren,
@@ -247,7 +280,19 @@ cash_in(PG_FUNCTION_ARGS)
 			str)));
 	}
 
-	result = value * sgn;
+	/* If the value is supposed to be positive, flip the sign, but check for
+	 * the most negative number. */
+	if (sgn > 0)
+	{
+		result = -value;
+		if (result < 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+	 errmsg("value \"%s\" is out of range for type money",
+			str)));
+	}
+	else
+		result = value;
 
 #ifdef CASHDEBUG
 	printf("cashin- result is " INT64_FORMAT "\n", result);
diff --git a/src/test/regress/expected/money.out b/src/test/regress/expected/money.out
index 538235c..5695f87 100644
--- a/src/test/regress/expected/money.out
+++ b/src/test/regress/expected/money.out
@@ -185,6 +185,96 @@ SELECT * FROM money_data;
  $123.46
 (1 row)
 
+-- input checks
+SELECT '1234567890'::money;
+   money   
+---
+ $1,234,567,890.00
+(1 row)
+
+SELECT '12345678901234567'::money;
+   money
+
+ $12,345,678,901,234,567.00
+(1 row)
+
+SELECT '123456789012345678'::money;
+ERROR:  value "123456789012345678" is out of range for type money
+LINE 1: SELECT '123456789012345678'::money;
+   ^
+SELECT '9223372036854775807'::money;
+ERROR:  value "9223372036854775807" is out of range for type money
+LINE 1: SELECT '9223372036854775807'::money;
+   ^
+SELECT '-12345'::money;
+money
+-
+ -$12,345.00
+(1 row)
+
+SELECT '-1234567890'::money;
+   money
+
+ -$1,234,567,890.00
+(1 row)
+
+SELECT '-12345678901234567'::money;
+money
+-
+ -$12,345,678,901,234,567.00
+(1 row)
+
+SELECT '-123456789012345678'::money;
+ERROR:  value "-123456789012345678" is out of range for type money
+LINE 1: SELECT '-123456789012345678'::money;
+   ^
+SELECT

Re: [HACKERS] Forbid use of LF and CR characters in database and role names

On 9/6/16 1:42 PM, Robert Haas wrote:
> If we were talking about pathnames containing spaces, I would agree,
> but I've never heard of a legitimate pathname containing CR or LF.  I
> can't see us losing much by refusing to allow such pathnames, except
> for security holes.

The flip side of that is that if we're doing a half-way job of only
prohibiting these characters in 67% of cases, then a new generation of
tools will be written on top of that with the assumption that these
characters cannot appear.  But then those tools will be easy to break or
exploit because it's possible to sneak stuff in in creative ways.  So
we're on the road to having an endless stream of "I can sneak in a CR/LF
character in here" bugs.

The current setup is more robust:  We are prohibiting these characters
in specific locations where we know we can't handle them.  But we don't
give any guarantees about anything else.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] feature request: explain "with details" option

2016-09-08 Thread Roger Pack

My apologies if this was already requested before...

I think it would be fantastic if postgres had an "explain the explain" option:
Today's explain tells us what loops and scans were used, and relative
costs, etc.  It doesn't seem to tell *why* the planner elected to use
what it did.

For instance, in the case of a corrupted index, it doesn't say why
it's not using that index, it just doesn't use it, causing some
confusion to end users.  At least causing confusion to me.

Or in the case of where it iterates over an entire table (seq. scan)
instead of using an index because the index range specified "is most
of the table" (thus not helpful to use the index)...The choice is
appropriate.  The reasoning why is not explicitly mentioned.  Again
causing possibility for some delay as you try to "decipher the mind"
of the planner.  Sometimes tables (ex: tables after having been first
propagated) need an "analyze" run on them, but it's not clear from an
"explain" output that the analyze statistics are faulty.  Not even a
hint.

So this is a feature request for an "EXPLAIN DETAILS" option or
something, basically like today's explain but with more "rationale"
included.  This could be immensely useful to many Postgres users.

I'd even be willing to chip in a couple hundred bucks if it would help
grease the wheels for somebody taking up the challenge if that helps
at all :)

Thank you for your consideration in this regard.
-roger-


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

Peter Geoghegan  writes:
> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
>> I still think tuplesort_heap_siftup is a confusing name, although I'm not
>> sure that Peter's "compact" is much better. I suggest that we rename it to
>> tuplesort_heap_delete_top(). In comments within the function, explain that
>> the *loop* corresponds to the "siftup" in Knuth's book.

> I'm also fine with that name.

I can live with it too.

>> Interestingly, as Knuth explains the siftup algorithm, it performs a
>> "replace the top" operation, rather than "remove the top". But we use it to
>> remove the top, by moving the last element to the top and running the
>> algorithm after that. Peter's other patch, to change the way we currently
>> replace the top node, to do it in one pass rather than as delete+insert, is
>> actually Knuth's siftup algorithm.

> Knuth must have a strange sense of humor.

I'm too lazy to get the book back down off the shelf to check, but I think
that Knuth uses sift-up to describe two different algorithms that have
essentially the same inner loop.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Default make target in test modules

Tom Lane wrote:
> Alvaro Herrera  writes:

> > I suppose this is a very easy mistake to make -- but also fortunately an
> > easy one to correct.  Do you want me to fix the affected modules?
> 
> I was going to do it, but if you want to, feel free.

Done.

> (BTW, I notice that test_pg_dump's Makefile misquotes its own path name,
> probably copy-paste sloppiness.)

Uh, I hadn't noticed.  Fixed that one too.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee
 wrote:
>
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire 
> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs 
>> > wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane  wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all.  But imagining that that's going to
>> >>> be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate throught. Just allocate chunks of a few megabytes and when
>> > it's full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, considering the estimate can go wrong and over or
>> underallocate quite severely, the risks outweigh the benefits when you
>> consider the alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
>
> How about a completely different representation for the TID array? Now this
> is probably not something new, but I couldn't find if the exact same idea
> was discussed before. I also think it's somewhat orthogonal to what we are
> trying to do here, and will probably be a bigger change. But I thought I'll
> mention since we are at the topic.
>
> What I propose is to use a simple bitmap to represent the tuples. If a tuple
> at  is dead then the corresponding bit in the bitmap is set.
> So clearly the searches through dead tuples are O(1) operations, important
> for very large tables and large arrays.
>
> Challenge really is that a heap page can theoretically have MaxOffsetNumber
> of line pointers (or to be precise maximum possible offset number). For a 8K
> block, that comes be about 2048. Having so many bits per page is neither
> practical nor optimal. But in practice the largest offset on a heap page
> should not be significantly greater than MaxHeapTuplesPerPage, which is a
> more reasonable value of 291 on my machine. Again, that's with zero sized
> tuple and for real life large tables, with much wider tuples, the number may
> go down even further.
>
> So we cap the offsets represented in the bitmap to some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should practically represent majority of the dead tuples in
> the table, but we then keep an overflow area to record tuples beyond the
> limit set for per page. The search routine will do a direct lookup for
> offsets less than the limit and search in the sorted overflow area for
> offsets beyond the limit.
>
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.
>
> I guess we can make a choice between two representations at the start
> looking at various table stats. We can also be smart and change from bitmap
> to traditional representation as we scan the table and see many more tuples
> in the overflow region than we provisioned for. There will be some
> challenges in converting representation mid-way, especially in terms of
> memory allocation, but I think those can be sorted out if we think that the
> idea has merit.
>

Making the vacuum possible to choose between two data representations
sounds good.
I implemented the patch that changes dead tuple representation to bitmap before.
I will measure the performance of bitmap representation again and post them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make

Re: [HACKERS] Default make target in test modules

2016-09-08 Thread Stephen Frost

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I happened to notice that if you type "make" in
> src/test/modules/test_pg_dump, you will get a "make check" action
> not "make all".  I hope this is just somebody being thoughtless
> about Makefile ordering and not an intentional override of the
> default make target.  If the latter, I beg to differ about it
> being a good idea.

No, it wasn't intentional.

> > Done.
> 
> Thanks.

Thanks to you both!

Stephen


signature.asc
Description: Digital signature

Re: [HACKERS] Add support for restrictive RLS policies

2016-09-08 Thread Stephen Frost

Greetings!

* Stephen Frost (sfr...@snowman.net) wrote:
> Based on Robert's suggestion and using Thom's verbiage, I've tested this
> out:
> 
> CREATE POLICY pol ON tab AS [PERMISSIVE|RESTRICTIVE] ...
> 
> and it appears to work fine with the grammar, etc.
> 
> Unless there's other thoughts on this, I'll update the patch to reflect
> this grammar in a couple days.

Updated patch attached which uses the above approach, includes
some initial documentation, and has fixes for the tab completion, 

I'm planning to add more documentation.  Otherwise, testing and code
review would certainly be appreciated.

Thanks!

Stpehen
From 6fad316de3cc50f4cc2c3bbe3c6fac91afc881e6 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 1 Sep 2016 02:11:30 -0400
Subject: [PATCH] Add support for restrictive RLS policies

We have had support for restrictive RLS policies since 9.5, but they
were only available through extensions which use the appropriate hooks.
This adds support into the grammer, catalog, psql and pg_dump for
restrictive RLS policies, thus reducing the cases where an extension is
necessary.
---
 doc/src/sgml/ref/create_policy.sgml   |  16 +++
 src/backend/commands/policy.c |   9 ++
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/parser/gram.y |  32 +++--
 src/backend/rewrite/rowsecurity.c |   7 +-
 src/bin/pg_dump/pg_dump.c |  38 --
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl  |  39 +-
 src/bin/psql/describe.c   | 109 
 src/bin/psql/tab-complete.c   |  29 -
 src/include/catalog/pg_policy.h   |  16 ++-
 src/include/nodes/parsenodes.h|   1 +
 src/include/parser/kwlist.h   |   2 +
 src/include/rewrite/rowsecurity.h |   1 +
 src/test/regress/expected/rowsecurity.out | 209 ++
 src/test/regress/sql/rowsecurity.sql  |  24 +++-
 17 files changed, 417 insertions(+), 118 deletions(-)

diff --git a/doc/src/sgml/ref/create_policy.sgml b/doc/src/sgml/ref/create_policy.sgml
index 89d2787..d930052 100644
--- a/doc/src/sgml/ref/create_policy.sgml
+++ b/doc/src/sgml/ref/create_policy.sgml
@@ -22,6 +22,7 @@ PostgreSQL documentation
  
 
 CREATE POLICY name ON table_name
+[ AS ( PERMISSIVE | RESTRICTIVE ) ]
 [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ]
 [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ]
 [ USING ( using_expression ) ]
@@ -120,6 +121,21 @@ CREATE POLICY name ON 
 

+permissive
+
+ 
+  If the policy is a "permissive" or "restrictive" policy.  Permissive
+  policies are the default and always add visibillity- they ar "OR"d
+  together to allow the user access to all rows through any of the
+  permissive policies they have access to.  Alternativly, a policy can
+  instead by "restrictive", meaning that the policy will be "AND"d
+  with other restrictive policies and with the result of all of the
+  permissive policies on the table.
+ 
+
+   
+
+   
 command
 
  
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index d694cf8..70e22c1 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -235,6 +235,7 @@ RelationBuildRowSecurity(Relation relation)
 		{
 			Datum		value_datum;
 			char		cmd_value;
+			bool		permissive_value;
 			Datum		roles_datum;
 			char	   *qual_value;
 			Expr	   *qual_expr;
@@ -257,6 +258,12 @@ RelationBuildRowSecurity(Relation relation)
 			Assert(!isnull);
 			cmd_value = DatumGetChar(value_datum);
 
+			/* Get policy permissive or restrictive */
+			value_datum = heap_getattr(tuple, Anum_pg_policy_polpermissive,
+	   RelationGetDescr(catalog), );
+			Assert(!isnull);
+			permissive_value = DatumGetBool(value_datum);
+
 			/* Get policy name */
 			value_datum = heap_getattr(tuple, Anum_pg_policy_polname,
 	   RelationGetDescr(catalog), );
@@ -298,6 +305,7 @@ RelationBuildRowSecurity(Relation relation)
 			policy = palloc0(sizeof(RowSecurityPolicy));
 			policy->policy_name = pstrdup(policy_name_value);
 			policy->polcmd = cmd_value;
+			policy->permissive = permissive_value;
 			policy->roles = DatumGetArrayTypePCopy(roles_datum);
 			policy->qual = copyObject(qual_expr);
 			policy->with_check_qual = copyObject(with_check_qual);
@@ -796,6 +804,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
 	values[Anum_pg_policy_polname - 1] = DirectFunctionCall1(namein,
 		 CStringGetDatum(stmt->policy_name));
 	values[Anum_pg_policy_polcmd - 1] = CharGetDatum(polcmd);
+	values[Anum_pg_policy_polpermissive - 1] = BoolGetDatum(stmt->permissive);
 	values[Anum_pg_policy_polroles - 1] = PointerGetDatum(role_ids);
 
 	/* Add qual if present. */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index

[HACKERS] Default make target in test modules

I happened to notice that if you type "make" in
src/test/modules/test_pg_dump, you will get a "make check" action
not "make all".  I hope this is just somebody being thoughtless
about Makefile ordering and not an intentional override of the
default make target.  If the latter, I beg to differ about it
being a good idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire  wrote:
> setup:
>
> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
> insert into lotsofitext select cast(random() * 10.0 as text)
> || 'blablablawblabla', cast(random() * 10.0 as text) ||
> 'blablablawjjjblabla', cast(random() * 10.0 as text) ||
> 'blablabl
> awwwabla', random() * 10.0, random() * 1.0 from
> generate_series(1, 1000);
>
> timed:
>
> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>
> Unpatched Time: 100351.251 ms
> Patched Time: 75180.787 ms
>
> That's like a 25% speedup on random input. As we say over here, rather
> badly translated, not a turkey's boogers (meaning "nice!")

Cool! What work_mem setting were you using here?

>> More than using "n" or "memtupcount" what I'm saying is to assert that
>> memtuples[imin] is inside the heap, which would catch the same errors
>> the original assert would, and more.
>>
>> Assert(imin < state->memtupcount)
>>
>> If you prefer.
>>
>> The original asserts allows any value of imin for memtupcount>1, and
>> that's my main concern. It shouldn't.
>
> So, for the assertions to properly avoid clobbering/reading out of
> bounds memory, you need both the above assert:
>
>  +*/
>  +   memtuples[i] = memtuples[imin];
>  +   i = imin;
>  +   }
>  +
>>+   Assert(imin < state->memtupcount);
>  +   memtuples[imin] = *newtup;
>  +}
>
> And another one at the beginning, asserting:
>
>  +   SortTuple  *memtuples = state->memtuples;
>  +   int n,
>  +   imin,
>  +   i;
>  +
>>+   Assert(state->memtupcount > 0 && memtuples[0].tupindex == 
>>newtup->tupindex);
>  +
>  +   CHECK_FOR_INTERRUPTS();
>
> It's worth making that change, IMHO, unless I'm missing something.

You're supposed to just not call it with an empty heap, so the
assertions trust that much. I'll look into that.

Currently, producing a new revision of this entire patchset. Improving
the cost model (used when the parallel_workers storage parameter is
not specified within CREATE INDEX) is taking a bit of time, but hope
to have it out in the next couple of days.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

On 9/6/16 1:08 PM, Tom Lane wrote:
>> As just mentioned elsewhere, this accidentally introduces a failure if
>> > the PostgreSQL installation path contains LF/CR, because of the use of
>> > appendShellString().
> I think that's intentional, not accidental.  What actual use case is
> there for allowing such paths?

There probably isn't one.  But we ought to be introducing this change in
a more intentional and consistent way.

For example, pg_basebackup has no such restriction.  So using
pg_basebackup, then promote, then pg_upgrade will (probably) fail now
for some paths.

More generally, I'm concerned that appendShellString() looks pretty
attractive for future use.  It's not inconceivable that someone will
want to use it for say calling pg_dump from pg_dumpall or pg_upgrade at
some point, and then maybe we'll accidentally disallow LF/CR in
tablespace names, say.

Also, if we're concerned about the potential for confusion that these
characters can cause, maybe we should be disallowing more control
characters in similar places.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee

On Thu, Sep 8, 2016 at 8:42 PM, Claudio Freire 
wrote:

> On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
>  wrote:
> > For example, for a table with 60 bytes wide tuple (including 24 byte
> > header), each page can approximately have 8192/60 = 136 tuples. Say we
> > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> > bitmap. First 272 offsets in every page are represented in the bitmap and
> > anything greater than are in overflow region. On the other hand, the
> current
> > representation will need about 16 bytes per page assuming 2% dead
> tuples, 40
> > bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> > tuples. So bitmap will take more space for small tuples or when vacuum is
> > run very aggressively, both seems unlikely for very large tables. Of
> course
> > the calculation does not take into account the space needed by the
> overflow
> > area, but I expect that too be small.
>
> I thought about something like this, but it could be extremely
> inefficient for mostly frozen tables, since the bitmap cannot account
> for frozen pages without losing the O(1) lookup characteristic
>

Well, that's correct. But I thought the whole point is when there are large
number of dead tuples which requires large memory. If my math was correct
as explained above, then even at 5% dead tuples, bitmap representation will
consume approximate same memory but provide O(1) search time.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

On Thu, Sep 8, 2016 at 10:18 AM, Claudio Freire  wrote:
> That particular case I believe is using work_mem=4MB, easy strings, 1.5GB 
> table.

Cool. I wonder where this leaves Heikki's draft patch, that completely
removes batch memory, etc.



-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Remove superuser() checks from pgstattuple

On 9/4/16 7:36 PM, Stephen Frost wrote:
> Perhaps if the versioned install script was actually a non-versioned
> install script in the source tree, and the Makefile simply installed it
> into the correct place, then we could eliminate that part.  (All very
> hand-wavy, I've not looked at what it'd take to do.)

A number of external extensions actually do it that way, but it also has
its problems.  For example, if you don't preserve the old base install
scripts, you can't actually test whether your upgrade scripts work in
the sense that they upgrade you to the same place.

This is now being obsoleted by the later idea of allowing base installs
from a chain of upgrade scripts.  But if your upgrade scripts contain
ALTER TABLE commands, you will probably still want to write base install
scripts that do a fresh CREATE TABLE instead.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] DISABLE_PAGE_SKIPPING option for vacuumdb

Hi,

Attached patch add --disable-page-skipping option to vacuumed command for 9.6.
If by any chance the freeze map causes the serious data corruption bug
then the user will want to run pg_check_visible() and vacuum with
DISABLE_PAGE_SKIPPING option to the multiple tables having problem.
In this case, I think that this option for vacuumdb would be convenient.

Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


disable_page_skipping_option_for_vacuumdb_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Aggregate Push Down - Performing aggregation on foreign server

2016-09-08 Thread Ashutosh Bapat

> While checking for shippability, we build the target list which is passed
> to
> the foreign server as fdw_scan_tlist. The target list contains
> a. All the GROUP BY expressions
> b. Shippable entries from the target list of upper relation
> c. Var and Aggref nodes from non-shippable entries from the target list of
> upper relation
>

The code in the patch doesn't seem to add Var nodes explicitly. It assumes
that
the Var nodes will be part of GROUP BY clause. The code is correct, I think.


> d. Var and Aggref nodes from non-shippable HAVING conditions.
>

This needs to be changed as per the comments below.

>
> Costing:
>
> If use_foreign_estimate is true for input relation, like JOIN case, we use
> EXPLAIN output to get the cost of query with aggregation/grouping on the
> foreign server. If not we calculate the costs locally. Similar to core, we
> use
> get_agg_clause_costs() to get costs for aggregation and then using logic
> similar to cost_agg() we calculate startup and total cost. Since we have no
> idea which aggregation strategy will be used at foreign side, we add all
> startup cost (startup cost of input relation, aggregates etc.) into startup
> cost for the grouping path and similarly for total cost.
>

This looks OK to me.


>
> Deparsing the query:
>
> Target list created while checking for shippability is deparsed using
> deparseExplicitTargetList(). sortgroupref are adjusted according to this
> target list. Most of the logic to deparse an Aggref is inspired from
> get_agg_expr(). For an upper relation, FROM and WHERE clauses come from the
> underlying scan relation and thus for simplicity, FROM clause deparsing
> logic
> is moved from deparseSelectSql() to a new function deparseFromClause(). The
> same function adds WHERE clause to the remote SQL.
>

Ok.


>
>
> Area of future work:
>
> 1. Adding path with path-keys to push ORDER BY clause along with
> aggregation/
> grouping.  Should be supported as a separate patch.
>
> 2. Grouping Sets/Rollup/Cube is not supported in current version. I have
> left
> this aside to keep patch smaller. If required I can add that support in the
> next version of the patch.
>

I am fine with these limitations for first cut of this feature.

I think we should try to measure performance gain because of aggregate
pushdown. The EXPLAIN
doesn't show actual improvement in the execution times.

Here are the comments on the patch.

Patch compiles without errors/warnings - Yes
make check passes - Yes.
make check in contrib/postgres_fdw passes - Yes

Attached patch (based on your patch) has some typos corrected, some comments
rephrased. It also has some code changes, as explained in various comments
below. Please see if those look good.

1. Usually for any deparseSomething function, Something is the type of node
produced by the parser when the string output by that function is parsed.
deparseColumnRef, for example, produces a string, which when parsed
produces a
columnRef node. There is are nodes of type FromClause and AggOrderBy. I
guess,
you meant FromExpr instead of FromClause. deparseAggOrderBy() may be
renamed as
appendOrderByFromList() or something similar. It may be utilized for window
functions if required.

2. Can we infer the value of foragg flag from the RelOptInfo passed to
is_foreign_expr()? For any upper relation, it is ok to have aggregate in
there. For any other relation aggregate is not expected.

3. In function foreign_grouping_ok(), should we use classifyConditions()?
The
function is written and used for base relations. There's nothing in that
function, which prohibits it being used for other relations. I feel that
foreign_join_ok() should have used the same function to classify the other
clauses.

4. May be the individual criteria in the comment block
+/*
+ * Aggregate is safe to pushdown if
+ * 1. It is a built-in aggregate
+ * 2. All its arguments are safe to push-down
+ * 3. The functions involved are immutable.
+ * 4. Other expressions involved like aggorder,
aggdistinct are
+ *safe to be pushed down.
+ */
should be associated with the code blocks which implement those. As the
criteria change it's difficult to maintain the numbered list in sync with
the
code.

5. The comment
+/* Aggregates other than simple one are non-pushable. */
should read /* Only non-split aggregates are pushable. */ as AGGSPLIT_SIMPLE
means a complete, non-split aggregation step.

6. An aggregate function has transition, combination and finalization
function
associated with it. Should we check whether all of the functions are
shippable?
But probably it suffices to check whether aggregate function as whole is
shippable or not using is_shippable() since it's the whole aggregate we are
interested in and not the intermediate results. Probably, we should add a
comment explaining why it's sufficient to check the aggregate function

Re: [HACKERS] Default make target in test modules

Alvaro Herrera  writes:
> Tom Lane wrote:
>> Alvaro Herrera  writes:
>>> I suppose this is a very easy mistake to make -- but also fortunately an
>>> easy one to correct.  Do you want me to fix the affected modules?

>> I was going to do it, but if you want to, feel free.

> Done.

Thanks.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Cache Hash Index meta page.

2016-09-08 Thread Jesper Pedersen

On 09/05/2016 02:50 PM, Mithun Cy wrote:

On Sep 2, 2016 7:38 PM, "Jesper Pedersen"
wrote:

Could you provide a rebased patch based on Amit's v5 ?

Please find the the patch, based on Amit's V5.

I have fixed following things

1. now in "_hash_first" we check if (opaque->hasho_prevblkno ==
InvalidBlockNumber) to see if bucket is from older version hashindex and
has been upgraded. Since as of now InvalidBlockNumber is one value greater
than maximum value the variable "metap->hashm_maxbucket" can be set (see
_hash_expandtable). We can distinguish it from rest. I tested the upgrade
issue reported by amit. It is fixed now.

2. One case which buckets hasho_prevblkno is used is where we do backward
scan. So now before testing for previous block number I test whether
current page is bucket page if so we end the bucket scan (see changes in
_hash_readprev). On other places where hasho_prevblkno is used it is not
for bucket page, so I have not put any extra check to verify if is a bucket
page.

I think that the

+ pageopaque->hasho_prevblkno = metap->hashm_maxbucket;

trick should be documented in the README, as hashm_maxbucket is defined
as uint32 where as hasho_prevblkno is a BlockNumber.

(All bucket variables should probably use the Bucket definition instead
of uint32).

For the archives, this patch conflicts with the WAL patch [1].

[1]
https://www.postgresql.org/message-id/CAA4eK1JS%2BSiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-%3D0w%40mail.gmail.com

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-09-08 Thread Daniel Verite

Rahila Syed wrote:

> However, including the check here will require returning the status
> of command from this hook. i.e if we throw error inside this
> hook we will need to return false indicating the value has not changed.

Looking at the other variables hooks, they already emit errors and can
deny the effect of a change corresponding to a new value, without
informing the caller. Why would autocommit be different?

For example echo_hook does this:

/* ...if the value is in (queries,errors,all,none) then
   assign pset.echo accordingly ... */
else
{
  psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n",
   newval, "ECHO", "none");
  pset.echo = PSQL_ECHO_NONE;
}

If the user issues \set ECHO FOOBAR, then it produces the above error
message and makes the same internal change as if \set ECHO none
had been issued.

But, the value of the variable as seen by the user is still FOOBAR:

\set
[...other variables...]
ECHO = 'FOOBAR'

The proposed patch doesn't follow that behavior, as it denies
a new value for AUTOCOMMIT. You might argue that it's better,
but on the other side, there are two reasons against it:

- it's not consistent with the other uses of \set that affect psql,
such as ECHO, ECHO_HIDDEN,  ON_ERROR_ROLLBACK,
 COMP_KEYWORD_CASE... and even AUTOCOMMIT as
 \set AUTOCOMMIT FOOBAR is not denied, just reinterpreted.

- it doesn't address the case of another method than \set being used
 to set the variable, as in : SELECT 'on' as "AUTOCOMMIT" \gset
 whereas the hook would work in that case.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
> I still think tuplesort_heap_siftup is a confusing name, although I'm not
> sure that Peter's "compact" is much better. I suggest that we rename it to
> tuplesort_heap_delete_top(). In comments within the function, explain that
> the *loop* corresponds to the "siftup" in Knuth's book.

I'm also fine with that name.

> Interestingly, as Knuth explains the siftup algorithm, it performs a
> "replace the top" operation, rather than "remove the top". But we use it to
> remove the top, by moving the last element to the top and running the
> algorithm after that. Peter's other patch, to change the way we currently
> replace the top node, to do it in one pass rather than as delete+insert, is
> actually Knuth's siftup algorithm.

Knuth must have a strange sense of humor.

> Peter, looking at your "displace" patch in this light, I think
> tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm
> calling it now), should share a common subroutine. Displace replaces the top
> node with the new node passed as argument, and then calls the subroutine to
> push it down to the right place. Delete_top replaces the top node with the
> last value in the heap, and then calls the subroutine. Or perhaps delete_top
> should remove the last value from the heap, and then call displace() with
> it, to re-insert it.

I can see the value in that, but I'd still rather not. The code reuse
win isn't all that large, and having a separate
tuplesort_heap_root_displace() routine allows us to explain what's
going on for merging, to assert what we're able to assert with tape
numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex
cruft that the existing routines need (if only barely) to support
replacement selection. I would be surprised if with a very tight inner
loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it
increases pipeline stalls).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan  wrote:
> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire  wrote:
>> setup:
>>
>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
>> insert into lotsofitext select cast(random() * 10.0 as text)
>> || 'blablablawblabla', cast(random() * 10.0 as text) ||
>> 'blablablawjjjblabla', cast(random() * 10.0 as text) ||
>> 'blablabl
>> awwwabla', random() * 10.0, random() * 1.0 from
>> generate_series(1, 1000);
>>
>> timed:
>>
>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>>
>> Unpatched Time: 100351.251 ms
>> Patched Time: 75180.787 ms
>>
>> That's like a 25% speedup on random input. As we say over here, rather
>> badly translated, not a turkey's boogers (meaning "nice!")
>
> Cool! What work_mem setting were you using here?

The script iterates over a few variations of string patterns (easy
comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB,
4GB), and table sizes (~350M, ~650M, ~1.5G).

That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

On 9/8/16 11:16 AM, Tom Lane wrote:
> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

It would offer a significant upgrade over the current situation.

First, it offers stability inside the same version.  Whereas glibc might
change a collation in a minor upgrade, ICU won't do that.  And the
postgres binary is bound to a major version of ICU by the soname (which
changes with every major release).  So this would avoid the situation
that a simple OS update could break collations.

Second, it offers a way to detect that something has changed.  With
glibc, you don't know anything unless you read the source diffs.  With
ICU, you can compare the collation version before and after and at least
tell the user that they need to refresh indexes or whatever.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Default make target in test modules

Tom Lane wrote:
> I happened to notice that if you type "make" in
> src/test/modules/test_pg_dump, you will get a "make check" action
> not "make all".  I hope this is just somebody being thoughtless
> about Makefile ordering and not an intentional override of the
> default make target.  If the latter, I beg to differ about it
> being a good idea.

Strange.  Don't all these makefiles depend on the pgxs stuff emitting
something sane, which would have "all" as the first one? ...  Ooh, I see
that test_pg_dump adds "check" before including pgxs, which is what
AFAICS causes the problem.

I suppose this is a very easy mistake to make -- but also fortunately an
easy one to correct.  Do you want me to fix the affected modules?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

Peter Eisentraut  writes:
> More generally, I'm concerned that appendShellString() looks pretty
> attractive for future use.  It's not inconceivable that someone will
> want to use it for say calling pg_dump from pg_dumpall or pg_upgrade at
> some point, and then maybe we'll accidentally disallow LF/CR in
> tablespace names, say.

That's fair, but I'm not sure how you propose to avoid it, given that
we don't have a method for safely quoting those characters on Windows.

I would certainly far rather find a way to do that and then remove the
prohibition.  But lacking such a solution, we're going to be forced into
messy compromises.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Default make target in test modules

Alvaro Herrera  writes:
> Tom Lane wrote:
>> I happened to notice that if you type "make" in
>> src/test/modules/test_pg_dump, you will get a "make check" action
>> not "make all".

> Strange.  Don't all these makefiles depend on the pgxs stuff emitting
> something sane, which would have "all" as the first one? ...  Ooh, I see
> that test_pg_dump adds "check" before including pgxs, which is what
> AFAICS causes the problem.

Right.

> I suppose this is a very easy mistake to make -- but also fortunately an
> easy one to correct.  Do you want me to fix the affected modules?

I was going to do it, but if you want to, feel free.

(BTW, I notice that test_pg_dump's Makefile misquotes its own path name,
probably copy-paste sloppiness.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_dump with tables created in schemas created by extensions

Michael Paquier  writes:
> On Tue, Aug 30, 2016 at 5:43 AM, MartÃn MarquÃ©s  
> wrote:
>> This is v4 of the patch, which is actually a cleaner version from the
>> v2 one Michael sent.

> Let's do as you suggest then, and just focus on the schema issue. I
> just had an extra look at the patch and it looks fine to me. So the
> patch is now switched as ready for committer.

Pushed with cosmetic adjustments (mainly, I thought the comment needed
to be much more explicit).

I'm not sure whether we want to try to fix this in older branches.
It would be a significantly larger patch, and maybe more of a behavior
change than we want to make in stable branches.  So I'm inclined to call
it done at this point.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] logical replication WIP: FailedAssertion("!(list_length(stmt->tables) > 0)", File: "publicationcmds.c", Line: 350)

2016-09-08 Thread Erik Rijkers


Hi,

I found these steps to a reliable crash (I know the patch is WIP but I 
assume you want to hear about these).


(Running a single instance suffices)


psql (10devel_logical_replication_20160901_1237_6f7c0ea32f80)
Type "help" for help.

(PID 9809) [testdb]  # CREATE PUBLICATION pub_all;
CREATE PUBLICATION

(PID 9809) [testdb]  # ALTER PUBLICATION pub_all FOR ALL TABLES;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 40851.246 ms
(PID ) [] ? >

Logfile:
[...]
2016-09-08 19:24:43.157 CEST 9765 LOG:  MultiXact member wraparound 
protections are now enabled
2016-09-08 19:24:43.160 CEST 9770 LOG:  logical replication launcher 
started
2016-09-08 19:24:43.160 CEST 9764 LOG:  database system is ready to 
accept connections
TRAP: FailedAssertion("!(list_length(stmt->tables) > 0)", File: 
"publicationcmds.c", Line: 350)
2016-09-08 19:25:45.673 CEST 9764 LOG:  server process (PID 9809) was 
terminated by signal 6: Aborted
2016-09-08 19:25:45.673 CEST 9764 DETAIL:  Failed process was running: 
ALTER PUBLICATION pub_all FOR ALL TABLES;
2016-09-08 19:25:45.674 CEST 9764 LOG:  terminating any other active 
server processes
2016-09-08 19:25:45.676 CEST 9764 LOG:  all server processes terminated; 
reinitializing
2016-09-08 19:25:47.463 CEST 9819 LOG:  database system was interrupted; 
last known up at 2016-09-08 19:24:43 CEST
2016-09-08 19:25:47.465 CEST 9820 FATAL:  the database system is in 
recovery mode
2016-09-08 19:25:47.493 CEST 9819 LOG:  database system was not properly 
shut down; automatic recovery in progress

2016-09-08 19:25:47.494 CEST 9819 LOG:  redo starts at 0/A6D8C070
2016-09-08 19:25:47.494 CEST 9819 LOG:  invalid record length at 
0/A6D8D498: wanted 24, got 0

2016-09-08 19:25:47.494 CEST 9819 LOG:  redo done at 0/A6D8D460
2016-09-08 19:25:47.494 CEST 9819 LOG:  last completed transaction was 
at log time 2016-09-08 19:25:00.774988+02
2016-09-08 19:25:47.511 CEST 9819 LOG:  MultiXact member wraparound 
protections are now enabled
2016-09-08 19:25:47.514 CEST 9826 LOG:  logical replication launcher 
started
2016-09-08 19:25:47.515 CEST 9764 LOG:  database system is ready to 
accept connections



Thanks,

Erik Rijkers


(BTW, the issue I reported a few days ago was indeed avoided when I 
created a receiving table subscriber-side, thanks)




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Add support for restrictive RLS policies

Stephen Frost  writes:
> * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
>> Can't you keep those words as Sconst or something (DefElems?) until the
>> execution phase, so that they don't need to be keywords at all?

> Seems like we could do that, though I'm not convinced that it really
> gains us all that much.  These are only unreserved keywords, of course,
> so they don't impact users the way reserved keywords (of any kind) can.
> While there may be some places where we use a string to represent a set
> of defined options, I don't believe that's typical

-1 for having to write them as string literals; but I think what Alvaro
really means is to arrange for the words to just be identifiers in the
grammar, which you strcmp against at execution.  See for example
reloption_list.  (Whether you use DefElem as the internal representation
is a minor detail, though it might help for making the parsetree
copyObject-friendly.)

vacuum_option_elem shows another way to avoid making a word into a
keyword, although to me that one is more of an antipattern; it'd be better
to leave the strcmp to execution, since there's so much other code that
does things that way.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

Tom Lane wrote:
> Hey Alvaro, can you comment on this bit in the proposed patch?
> 
> + for (i = FirstOffsetNumber; i <= itemcount; i++)
> + {
> + ItemId  ii = PageGetItemId(phdr, i);
> +
> + /* Normally here was following assertion
> +  * Assert(ItemIdHasStorage(ii));
> +  * This assertion was introduced in PageIndexTupleDelete
> +  * But if this function will be used from BRIN index
> +  * this assertion will fail. Thus, here we do not check 
> that
> +  * item has the storage.
> +  */
> + if (ItemIdGetOffset(ii) <= offset)
> + ii->lp_off += size_diff;
> + }
> + }
> 
> That comment seems utterly wrong to me, because both PageIndexTupleDelete
> and PageIndexMultiDelete certainly contain assertions that every item on
> the page has storage.  Are you expecting that any BRIN index items
> wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
> have storage sensible?

It is possible in BRIN to have empty intermediate tuples; for example it
is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.

This is a tricky case to reproduce, but I think this should do it:
consider a table with pages_per_range=1.  Page 1 is summarized by index
tuple 1, page 2 is summarized by itup 2, page 3 is summarized by index
tuple 3.  On heap page 2 a new tuple is inserted and the summary data is
now much larger, such that the summary tuple no longer fits in the index
page, so it is moved to a new index page.  Then index tuple 2 in the
first index page becomes unused.  Revmap still points to lp 3, so it
can't be moved to lp 2.  The lp_offs can be adjusted, as long as the lp
itself is not moved from its original position.

Now if this loop is concerned only with live lps and does not move lps,
then it should be fine to add the assertion.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

On Thu, Sep 8, 2016 at 2:09 PM, Tom Lane  wrote:
> Well, my vote is that it ain't broke and we shouldn't fix it.

To take a step back, what prompted this whole discussion is the patch
that I wrote that shifts down, replacing calls to
tuplesort_heap_siftup() and tuplesort_heap_insert with one new call to
a function I've called tuplesort_heap_root_displace() (today, Claudio
reports that it makes some of his test queries go 25% faster). This
new function shifts down. It's not clear what I'm supposed to say
about that, given the current understanding. So, in a sense, it's
blocking on this.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Lucas

Tom,

Yes, it is what I mean. Is what pg_dump uses to get things synchronized. It
seems to me a clear marker that the same task is using more than one
connection to accomplish the one job.

Em 08/09/2016 6:34 PM, "Tom Lane"  escreveu:

> Lucas  writes:
> > The queue jumping logic can not use the distributed transaction id?
>
> If we had such a thing as a distributed transaction id, maybe the
> answer could be yes.  We don't.
>
> I did wonder whether using a shared snapshot might be a workable proxy
> for that, but haven't pursued it.
>
> regards, tom lane
>

Re: [HACKERS] mdtruncate leaking fd.c handles?

Andres Freund  writes:
> Am I missing something or is md.c:mdtruncate() leaking open files?

Yeah, I think you're right.

> This only really matters for VACUUM truncate and ON COMMIT TRUNCATE temp
> table truncation afaics.

Also, you'd need a table > 1GB to leak anything at all, which probably
helps explain why it hasn't been noticed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] mdtruncate leaking fd.c handles?

2016-09-08 Thread Andres Freund

On 2016-09-08 18:39:56 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Am I missing something or is md.c:mdtruncate() leaking open files?
> 
> Yeah, I think you're right.
> 
> > This only really matters for VACUUM truncate and ON COMMIT TRUNCATE temp
> > table truncation afaics.
> 
> Also, you'd need a table > 1GB to leak anything at all, which probably
> helps explain why it hasn't been noticed.

Heh, this bug is of some older vintage... Appears to originate in
1a5c450f3024ac57cd6079186c71b3baf39e84eb - before that we did a
FileUnlink(), which includes a FileClose().

Will fix.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Preventing deadlock on parallel backup

2016-09-08 Thread Lucas

People,

I made a small modification in pg_dump to prevent parallel backup failures
due to exclusive lock requests made by other tasks.

The modification I made take shared locks for each parallel backup worker
at the very beginning of the job. That way, any other job that attempts to
acquire exclusive locks will wait for the backup to finish.

In my case, each server was taking a day to complete the backup, now with
parallel backup one is taking 3 hours and the others less than a hour.

The code below is not very elegant, but it works for me. My whishlist for
the backup is:

1) replace plpgsql by c code reading the backup toc and assembling the lock
commands.
2) create an timeout to the locks.
3) broadcast the end of copy to every worker in order to release the locks
as early as possible;
4) create a monitor thread that prioritize an copy job based on a exclusive
lock acquired;
5) grant the lock for other connection of the same distributed transaction
if it is held by any connection of the same distributed transaction. There
is some sideefect I can't see on that?

1 to 4 are within my capabilities and I may do it in the future. 4 is to
advanced for me and I do not dare to mess with something so fundamental
rights now.

Anyone else is working on that?

On, Parallel.c, void RunWorker(...), add:

PQExpBuffer query;
PGresult   *res;

query = createPQExpBuffer();
resetPQExpBuffer(query);
appendPQExpBuffer(query,
"do language 'plpgsql' $$"
" declare "
"x record;"
" begin"
"for x in select * from pg_tables where schemaname not in
('pg_catalog','information_schema') loop"
"raise info 'lock table %.%', x.schemaname, x.tablename;"
"execute 'LOCK TABLE
'||quote_ident(x.schemaname)||'.'||quote_ident(x.tablename)||' IN ACCESS
SHARE MODE NOWAIT';"
"end loop;"
"end"
"$$" );

res = PQexec(AH->connection, query->data);

if (!res || PQresultStatus(res) != PGRES_COMMAND_OK)
exit_horribly(modulename,"Could not lock the tables to begin the
work\n\n");
PQclear(res);
destroyPQExpBuffer(query);

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Jeff Janes

On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma 
wrote:

> > Thanks to Ashutosh Sharma for doing the testing of the patch and
> > helping me in analyzing some of the above issues.
>
> Hi All,
>
> I would like to summarize the test-cases that i have executed for
> validating WAL logging in hash index feature.
>
> 1) I have mainly ran the pgbench test with read-write workload at the
> scale factor of 1000 and various client counts like 16, 64 and 128 for
> time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
> on highly configured power2 machine with 128 cores and 512GB of RAM. I
> ran the test-case both with and without the replication setup.
>
> Please note that i have changed the schema of pgbench tables created
> during initialisation phase.
>
> The new schema of pgbench tables looks as shown below on both master
> and standby:
>
> postgres=# \d pgbench_accounts
>Table "public.pgbench_accounts"
>   Column  | Type  | Modifiers
> --+---+---
>  aid  | integer   | not null
>  bid  | integer   |
>  abalance | integer   |
>  filler   | character(84) |
> Indexes:
> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
> "pgbench_accounts_bid" hash (bid)
>
> postgres=# \d pgbench_history
>   Table "public.pgbench_history"
>  Column |Type | Modifiers
> +-+---
>  tid| integer |
>  bid| integer |
>  aid| integer |
>  delta  | integer |
>  mtime  | timestamp without time zone |
>  filler | character(22)   |
> Indexes:
> "pgbench_history_bid" hash (bid)
>
>
Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will never
use hash indexes for searching, so it limits the amount of test coverage
you will get.  While searching shouldn't generate novel types of WAL
records (that I know of), it will generate locking and timing issues that
might uncover bugs (if there are any left to uncover, of course).

I would drop the primary key on pgbench_accounts and replace it with a hash
index and test it that way (except I don't have a 128 core machine at my
disposal, so really I am suggesting that you do this...)

The lack of primary key and the non-uniqueness of the hash index should not
be an operational problem, because the built in pgbench runs never attempt
to violate the constraints anyway.

In fact, I'd replace all of the indexes on the rest of the pgbench tables
with hash indexes, too, just for additional testing.

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals.  I think that none of pgbench's built in
tests are likely to give the bucket splitting and squeezing code very much
exercise.

Is there a way to gather statistics on how many of each type of WAL record
are actually getting sent over the replication link?  The only way I can
think of is to turn on wal archving as well as replication, then using
pg_xlogdump to gather the stats.

I've run my original test for a while now and have not seen any problems.
But I realized I forgot to compile with enable-casserts, to I will have to
redo it to make sure the assertion failures have been fixed.  In my
original testing I did very rarely get a deadlock (or some kind of hang),
and I haven't seen that again so far.  It was probably the same source as
the one Mark observed, and so the same fix.

Cheers,

Jeff

Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

On Thu, Sep 8, 2016 at 12:46 PM, Tom Lane  wrote:
>
>  /*
>   * The tuple at state->memtuples[0] has been removed from the heap.
> - * Decrement memtupcount, and sift up to maintain the heap invariant.
> + * Decrement memtupcount, and shift down to maintain the heap invariant.
>   */
>  static void
> -tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
> +tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
>
> I don't find this clearer at all, because
>
> (1) As noted in the comment, the heap top has *already* been removed,
> we're just trying to re-establish the heap invariant.  So maybe
> "delete_top" isn't the best name after all.

This is why I suggested tuplesort_heap_compact(). When merging, there
is a comment associated with a call to the function, "compact the
heap". I based my name off of that, thinking that that was really no
different to any other caller.

> (2) "shift down" seems a bit weird, since we're moving data closer to
> the heap top, not further away from it; why isn't that direction "up"?

Well, the fact that the routine does that makes it a higher level
thing than something like a sift or a shift. It might just as easily
be some other tuple (e.g., from the caller), as far as "shifting down"
goes. (I wrote a patch where for merging, it *is* from the caller, and
the heap stays the same size, which Heikki mentioned on this thread.)

> (3) "shift" makes it sound like it ought to be a simple memmove
> kind of operation, which it surely is not.

Then, how about the Sedgewick terminology, "sink"? I'm really not
attached to that aspect. I do think that "shift down" is unambiguous,
but I'd just as soon use "sink", or even explicitly spell it out.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]

Hey Alvaro, can you comment on this bit in the proposed patch?

+   for (i = FirstOffsetNumber; i <= itemcount; i++)
+   {
+   ItemId  ii = PageGetItemId(phdr, i);
+
+   /* Normally here was following assertion
+* Assert(ItemIdHasStorage(ii));
+* This assertion was introduced in PageIndexTupleDelete
+* But if this function will be used from BRIN index
+* this assertion will fail. Thus, here we do not check 
that
+* item has the storage.
+*/
+   if (ItemIdGetOffset(ii) <= offset)
+   ii->lp_off += size_diff;
+   }
+   }

That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage.  Are you expecting that any BRIN index items
wouldn't?  If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CVE-2016-1238 fix breaks (at least) pg_rewind tests

2016-09-08 Thread Andres Freund

On 2016-09-08 18:13:06 -0300, Alvaro Herrera wrote:
> I suppose -I$(srcdir) should be fine.  (Why the quotes?)

Because quoting correctly seems like a good thing to do? Most people
won't have whitespace in there, but it doesn't seem impossible?


> > check-world appears to mostly run (still doing so, but it's mostly
> > through everything relevant).

Passed successfully since.


> > I can't vouch for the windows stuff, and
> > the invocations indeed look vulnerable. I'm not sure if hte fix actually
> > matters on windows, given . is the default for pretty much everything
> > there.
> 
> Well, maybe it doesn't matter now but as I understand the fix is going
> to enter the next stable upstream perl, so it'll fail eventually.  It'd
> be saner to just fix the thing completely so that we can forget about
> it.

Yea, it'd need input from somebody on windows. Michael? What happens if
you put a line remove . from INC (like upthread) in the msvc stuff?


Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

On Thu, Sep 8, 2016 at 2:19 PM, Peter Geoghegan  wrote:
>> Peter, looking at your "displace" patch in this light, I think
>> tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm
>> calling it now), should share a common subroutine. Displace replaces the top
>> node with the new node passed as argument, and then calls the subroutine to
>> push it down to the right place. Delete_top replaces the top node with the
>> last value in the heap, and then calls the subroutine. Or perhaps delete_top
>> should remove the last value from the heap, and then call displace() with
>> it, to re-insert it.
>
> I can see the value in that, but I'd still rather not. The code reuse
> win isn't all that large, and having a separate
> tuplesort_heap_root_displace() routine allows us to explain what's
> going on for merging, to assert what we're able to assert with tape
> numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex
> cruft that the existing routines need (if only barely) to support
> replacement selection. I would be surprised if with a very tight inner
> loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it
> increases pipeline stalls).

BTW, regarding this, since I read in some other thread that it was ok
to use static inline functions, I believe the compiler is smart enough
to optimize out the constant true/false in checkIndex for inlined
calls, so that may be viable (on modern compilers).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Let file_fdw access COPY FROM PROGRAM