Re: [HACKERS] WIP: About CMake v2

2016-11-09 Thread Yury Zhuravlev

Tom Lane wrote:

So this is really not open for negotiation.  As Peter said upthread,
what we are looking for in a CMake reimplementation is that it behaves
exactly like the Autoconf version does.  To the extent that you are unable
or unwilling to duplicate that behavior, you increase the odds that
we'll reject this work.
Who is asking about negotiation? I just wanted an explanation for clear
understanding, nothing more.
Now I know the reasons. Thanks.


regards
--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow backslash continuations in \set expressions

2016-11-09 Thread Fabien COELHO



+1.  My vote is for backslash continuations.


I'm fine with that!

--
Fabien.




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Amit Kapila
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas  wrote:
> On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila  wrote:
>> [ new patches ]
>
> Attached is yet another incremental patch with some suggested changes.
>
> + * This expects that the caller has acquired a cleanup lock on the target
> + * bucket (primary page of a bucket) and it is reponsibility of caller to
> + * release that lock.
>
> This is confusing, because it makes it sound like we retain the lock
> through the entire execution of the function, which isn't always true.
> I would say that caller must acquire a cleanup lock on the target
> primary bucket page before calling this function, and that on return
> that page will again be write-locked.  However, the lock might be
> temporarily released in the meantime, while visiting overflow pages.
> (Attached patch has a suggested rewrite.)
>

+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will with a write lock again held on the primary
+ * bucket page.  The lock won't necessarily be held continuously, though,
+ * because we'll release it when visiting overflow pages.

Looks like a typo in the above comment: /will with a write lock/will
return with a write lock/


> + * During scan of overflow pages, first we need to lock the next bucket and
> + * then release the lock on current bucket.  This ensures that any concurrent
> + * scan started after we start cleaning the bucket will always be behind the
> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
> + * required for sanctity of scan.
>
> This comment says that it's bad if other scans can pass our cleanup
> scan, but it doesn't explain why.  I think it's because we don't have
> page-at-a-time mode yet,
>

Right.

> and cleanup might decrease the TIDs for
> existing index entries.
>

I think the reason is that cleanup might move tuples around, during
which it might move a previously returned TID to a position earlier than
its current one.  This is a problem because the scan restarts from the
previously returned offset and tries to find the previously returned
tuple's TID.  This has been mentioned in the README as below:

+ It is must to
+keep scans behind cleanup, else vacuum could remove tuples that are required
+to complete the scan as the scan that returns multiple tuples from the same
+bucket page always restart the scan from the previous offset number from which
+it has returned last tuple.

We might want to slightly improve the README so that the reason is
clearer, and then mention in the comments to refer to the README. But I
am open either way; let me know which way you prefer.

>
> +if (delay)
> +vacuum_delay_point();
>
> You don't really need "delay".  If we're not in a cost-accounted
> VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
> which should be safe (and a good idea) regardless.  (Fixed in
> attached.)
>

Okay, that makes sense.

> +if (callback && callback(htup, callback_state))
> +{
> +/* mark the item for deletion */
> +deletable[ndeletable++] = offno;
> +if (tuples_removed)
> +*tuples_removed += 1;
> +}
> +else if (bucket_has_garbage)
> +{
> +/* delete the tuples that are moved by split. */
> +bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
> +                              maxbucket,
> +                              highmask,
> +                              lowmask);
> +/* mark the item for deletion */
> +if (bucket != cur_bucket)
> +{
> +/*
> + * We expect tuples to either belong to current bucket or
> + * new_bucket.  This is ensured because we don't allow
> + * further splits from bucket that contains garbage. See
> + * comments in _hash_expandtable.
> + */
> +Assert(bucket == new_bucket);
> +deletable[ndeletable++] = offno;
> +}
> +else if (num_index_tuples)
> +*num_index_tuples += 1;
> +}
> +else if (num_index_tuples)
> +*num_index_tuples += 1;
> +}
>
> OK, a couple things here.  First, it seems like we could also delete
> any tuples where ItemIdIsDead, and that seems worth doing.

I think we can't do that, because here we want to strictly rely on the
callback function for vacuum, similar to btree. The reason is explained
in the comment below from the function btvacuumpage().

/*
* During Hot Standby we currently assume that
* XLOG_BTREE_VACUUM records do not produce conflicts. That is
* only true as long as the callback function depends only
* upon whether the index 

Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?

2016-11-09 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Nov 9, 2016 at 1:44 PM, Tom Lane  wrote:
>> I don't think we need "named constants", especially not
>> manually-maintained ones.  The thing that would help in pg_proc.h is for
>> numeric type OIDs to be replaced by type names.  We talked awhile back
>> about introducing some sort of preprocessing step that would allow doing
>> that --- ie, it would look into some precursor file for pg_type.h and
>> extract the appropriate OID automatically.  I'm too tired to go find the
>> thread right now, but it was mostly about building the long-DATA-lines
>> representation from something easier to edit.

> You mean that I guess:
> https://www.postgresql.org/message-id/4d191a530911041228v621286a7q6a98d9ab8a2ed...@mail.gmail.com

Hmm, that's from 2009.  I thought I remembered something much more recent,
like last year or so.

regards, tom lane




Re: [HACKERS] WAL consistency check facility

2016-11-09 Thread Kuntal Ghosh
On Fri, Nov 4, 2016 at 1:32 PM, Michael Paquier
 wrote:
> Thank you for the new patch. This will be hopefully the last round of
> reviews, we are getting close to something that has an acceptable
> shape.
Thanks a lot for reviewing the patch. Based on your review, I've attached
the updated patch along with a few comments.

> In DecodeXLogRecord@xlogreader.c, please add a boolean flag "apply"
> and then please could you do some error checks on it. Only one is
> needed: if "apply" is true and has_image is false, xlogreader.c should
> complain about an inconsistency!
Added a flag named apply_image in DecodedBkpBlock and used it to
check whether applying the image is required or not.

> I would still vote for the removal of blkno in the list of arguments of the
> masking functions. This is used just for speculative inserts, where we
> could just enforce the page number to 0 because this does not matter,
> as Peter has mentioned upthread.
It just doesn't feel right to me to enforce the number manually when
I can use the blkno without any harm.

> I haven't performed any tests with the patch, and that's all I have
> regarding the code. With that done we should be in good shape
> code-speaking I think...
I've done a fair amount of testing, which includes regression tests
and manual creation of inconsistencies in the page. I've also found the
reason behind the inconsistency in the brin revmap page.
A brin revmap page doesn't have the standard page layout, and it doesn't
update pd_upper and pd_lower in its operations either. But during WAL
insertion it uses REGBUF_STANDARD to register a reference in the WAL
record. Hence, when we restore the image before the consistency check,
RestoreBlockImage fills the space between pd_lower and pd_upper (the page
hole) with zeros. I've posted this as a separate thread.
https://www.postgresql.org/message-id/flat/CAGz5QCJ%3D00UQjScSEFbV%3D0qO5ShTZB9WWz_Fm7%2BWd83zPs9Geg%40mail.gmail.com#CAGz5QCJ=00UQjScSEFbV=0qo5shtzb9wwz_fm7+wd83zps9...@mail.gmail.com
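As a rough sketch of the mismatch (a toy byte layout, not the actual PostgreSQL page format): restoring a standard-layout full-page image zeroes the "hole" between pd_lower and pd_upper, so a page that actually keeps data there, like the revmap, no longer matches its restored image.

```python
# Hypothetical sketch of why a non-standard page trips the consistency
# check: restoring a REGBUF_STANDARD image zeroes the "hole" between
# pd_lower and pd_upper, while the original page still has data there.
def restore_with_hole(page, pd_lower, pd_upper):
    # Zero the page hole, as RestoreBlockImage effectively does for
    # standard-layout full-page images.
    return page[:pd_lower] + b"\x00" * (pd_upper - pd_lower) + page[pd_upper:]

page = b"HDR" + b"DATA" + b"TAIL"          # toy 11-byte "page"
pd_lower, pd_upper = 3, 7                  # claims bytes 3..6 are free space

# A page that really kept its hole empty round-trips unchanged...
empty_hole = b"HDR" + b"\x00" * 4 + b"TAIL"
assert restore_with_hole(empty_hole, pd_lower, pd_upper) == empty_hole

# ...but a revmap-style page that stores data in the "hole" does not.
assert restore_with_hole(page, pd_lower, pd_upper) != page
```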

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


walconsistency_v13.patch
Description: application/download



[HACKERS] Incorrect XLogRegisterBuffer flag for revmapbuf in brin

2016-11-09 Thread Kuntal Ghosh
Hi all,

In brin_doupdate (line 290), REGBUF_STANDARD is used to register the
revmap buffer reference in the WAL record. But a revmap buffer page
doesn't have the standard page layout, and it doesn't update pd_upper
and pd_lower either.

Either we should register revmapbuf as following:
XLogRegisterBuffer(1, revmapbuf, 0);
Or, we can update pd_lower and pd_upper in brin_page_init() as follows:
if (BRIN_IS_REVMAP_PAGE(page))
p->pd_lower = p->pd_upper;

To fix this, I've attached a small patch which follows the first approach.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


revmapbuf_xlogregister_flag_v1.patch
Description: application/download



Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-11-09 Thread Amit Kapila
On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes  wrote:
> On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila 
> wrote:
>>
>> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila 
>> wrote:
>> >
>> > I think here I am slightly wrong.  For the full page writes, it does use
>> > RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
>> > doing page verification check and rather blindly setting the page to
>> > zero and then overwrites it with full page image.  So after my fix,
>> > you will not see the error of checksum failure.  I have a fix ready,
>> > but still doing some more verification.  If everything passes, I will
>> > share the patch in a day or so.
>> >
>>
>> Attached patch fixes the problem, now we do perform full page writes
>> for bitmap pages.  Apart from that, I have rebased the patch based on
>> latest concurrent index patch [1].  I have updated the README as well
>> to reflect the WAL logging related information for different
>> operations.
>>
>> With attached patch, all the review comments or issues found till now
>> are addressed.
>
>
> This needs to be updated to apply over concurrent_hash_index_v10.patch.
>
> Unless we want to wait until that work is committed before doing more review
> and testing on this.
>

The concurrent hash index patch is still being changed, and some of the
changes need corresponding changes in this patch as well.  So I think
after it gets somewhat stabilized, I will update this patch as well.  I
am not sure it is a good idea to update it with every version of the
hash index patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] IPv6 link-local addresses and init data type

2016-11-09 Thread Peter Eisentraut
On 11/7/16 1:13 AM, Haribabu Kommi wrote:
> Yes, I agree that the default zone is the main use case of the original thread.
> Per RFC 4007, the default zone is used for global addresses, so this may
> be the main use case with a zone id. How about just ignoring it for now,
> storing the actual IP address with the attached patch, and handling the
> rest of the actual zone id support later once it gets properly
> standardized?

Well, according to the RFC, the default zone is 0 "typically", which is
a very weak requirement.  So just ignoring it is probably also not right.

So far we have only heard one use case for any of this, which is someone
wanting to store ::1%0, which is not even a valid address according to
that same RFC.  So this is all on very weak ground.

I think we should just forget about this.  It's all a bit too dubious.

The only thing that might be useful and not questionable is a function
that takes text input and produces a zone-free address and the zone ID
as separate return values.  (Or perhaps two functions, one for each
component.)
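A minimal sketch of such a splitting helper (Python purely for illustration; the function name and return convention are hypothetical, not a proposed catalog function):

```python
# Hypothetical sketch: split text like 'fe80::1%eth0' into a zone-free
# address and the zone ID (RFC 4007 uses '%' as the zone delimiter).
def split_zone(addr_text):
    addr, sep, zone = addr_text.partition("%")
    return addr, (zone if sep else None)

assert split_zone("fe80::1%eth0") == ("fe80::1", "eth0")
assert split_zone("::1") == ("::1", None)  # no zone component
```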

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
> + * This function expects that the caller has acquired a cleanup lock on the
> + * primary bucket page, and will with a write lock again held on the primary
> + * bucket page.  The lock won't necessarily be held continuously, though,
> + * because we'll release it when visiting overflow pages.
>
> Looks like typo in above comment.   /will with a write lock/will
> return with a write lock

Oh, yes.  Thanks.

>> + * During scan of overflow pages, first we need to lock the next bucket and
>> + * then release the lock on current bucket.  This ensures that any 
>> concurrent
>> + * scan started after we start cleaning the bucket will always be behind the
>> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
>> + * required for sanctity of scan.
>>
>> This comment says that it's bad if other scans can pass our cleanup
>> scan, but it doesn't explain why.  I think it's because we don't have
>> page-at-a-time mode yet,
>>
>
> Right.
>
>> and cleanup might decrease the TIDs for
>> existing index entries.
>>
>
> I think the reason is that cleanup might move tuples around, during
> which it might move a previously returned TID to a position earlier than
> its current one.  This is a problem because the scan restarts from the
> previously returned offset and tries to find the previously returned
> tuple's TID.  This has been mentioned in the README as below:
>
> + It is must to
> +keep scans behind cleanup, else vacuum could remove tuples that are required
> +to complete the scan as the scan that returns multiple tuples from the same
> +bucket page always restart the scan from the previous offset number from 
> which
> +it has returned last tuple.
>
> We might want to slightly improve the README so that the reason is
> more clear and then mention in comments to refer README, but I am open
> either way, let me know which way you prefer?

I think we can give a brief explanation right in the code comment.  I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around".  But I think that moving the tuples to different locations is
not the problem.  I think the problem is that a tuple might be
assigned a lower spot in the item pointer array - i.e. the TID
decreases.

>> OK, a couple things here.  First, it seems like we could also delete
>> any tuples where ItemIdIsDead, and that seems worth doing.
>
> I think we can't do that because here we want to strictly rely on
> callback function for vacuum similar to btree. The reason is explained
> as below comment in function btvacuumpage().

OK, I see.  It would probably be good to comment this, then, so that
someone later doesn't get confused as I did.

> This looks okay to me. So if you agree with my reasoning for not
> including the first part, then I can take that out and keep this part in
> the next patch.

Cool.

>>  I think that might be
>> clearer.  When LH_BEING_POPULATED is set, the bucket is being filled -
>> that is, populated - from the old bucket.
>
> How about LH_BUCKET_BEING_POPULATED, or maybe LH_BP_BEING_SPLIT, where BP
> indicates Bucket Page?

LH_BUCKET_BEING_POPULATED seems good to me.

>>  And maybe
>> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
>>
>
> How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP?
> I am slightly inclined to keep the Bucket word, but let me know if you
> think it makes the name too long.

LH_BUCKET_NEEDS_SPLIT_CLEANUP seems good to me.

>>  How?  Can we just use an
>> if-then instead of a for-loop?
>
> I could see below two possibilities:
> First way -
>
> retry:
> mask = lowmask + 1;
> new_bucket = old_bucket | mask;
> if (new_bucket > maxbucket)
> {
> lowmask = lowmask >> 1;
> goto retry;
> }
>
> Second way -
> new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask);
> if (new_bucket > maxbucket)
> {
> lowmask = lowmask >> 1;
> new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
> }
>
> #define CALC_NEW_BUCKET(old_bucket, lowmask) \
> ((old_bucket) | ((lowmask) + 1))
>
> Do you have something else in mind?

Second one would be my preference.
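For illustration, the second approach boils down to the following computation (sketched in Python; the real code would be C inside the hash AM, and these helper names are made up):

```python
# Illustrative sketch of the second approach: compute the "new" bucket
# produced by splitting old_bucket, retrying with a smaller lowmask if
# the first guess exceeds the highest bucket that currently exists.
def calc_new_bucket(old_bucket, lowmask):
    return old_bucket | (lowmask + 1)

def new_bucket_for(old_bucket, lowmask, maxbucket):
    new_bucket = calc_new_bucket(old_bucket, lowmask)
    if new_bucket > maxbucket:
        lowmask >>= 1
        new_bucket = calc_new_bucket(old_bucket, lowmask)
    return new_bucket

# e.g. with lowmask = 3 (buckets 0..3 in the previous round) and
# maxbucket = 5, bucket 1 splits into bucket 5:
assert new_bucket_for(1, 3, 5) == 5
```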

>> I still don't like the names of these functions very much.  If you
>> said "get X from Y", it would be clear that you put in Y and you get
>> out X.  If you say "X 2 Y", it would be clear that you put in X and
>> you get out Y.  As it is, it's not very clear which is the input and
>> which is the output.
>
> Whatever comes earlier is the input and the later one is the output, as
> in the existing function _hash_get_indextuple_hashkey().  However,
> feel free to suggest better names here.  How about
> _hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket()
> or simply _hash_get_newblock()?

The problem with _hash_get_newblock() is that it sounds like you are
getting a new block in the relation, not the new bucket (or
corresponding block) for some old bucket.  The name isn't specific
enough to know what "new" means.

In general, I think "new" and 

Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?

2016-11-09 Thread Tom Lane
Amit Langote  writes:
> On Wed, Nov 9, 2016 at 11:47 PM, Tom Lane  wrote:
>> Hmm, that's from 2009.  I thought I remembered something much more recent,
>> like last year or so.

> This perhaps:

> * Re: Bootstrap DATA is a pita *
> https://www.postgresql.org/message-id/flat/CAOjayEfKBL-_Q9m3Jsv6V-mK1q8h%3Dca5Hm0fecXGxZUhPDN9BA%40mail.gmail.com

Yeah, that's the thread I remembered.  I think the basic conclusion was
that we needed a Perl script that would suck up a bunch of data from some
representation that's more edit-friendly than the DATA lines, expand
symbolic representations (regprocedure etc) into numeric OIDs, and write
out the .bki script from that.  I thought some people had volunteered to
work on that, but we've seen no results ...

regards, tom lane




[HACKERS] Performance degradation in Bitmapscan (commit 75ae538bc3168bf44475240d4e0487ee2f3bb376)

2016-11-09 Thread Dilip Kumar
While testing bitmap performance, I have observed that some of the
TPCH queries take a huge amount of time in the BitmapIndexScan node when
there are lossy pages.

I suspected commit 75ae538bc3168bf44475240d4e0487ee2f3bb376, because
prior to it they used to take much less time. So I tested by reverting
the suspected commit, and the problem is solved.

Here is explained analyze result for TPCH query 6 (scale factor 10)

work_mem=10M
shared_buffers=20GB
machine under test: POWER, 4 socket machine

->On Head:

postgres=# \i 6.explain.sql

                                   QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=1585507.74..1585507.75 rows=1 width=32) (actual
time=21626.467..21626.467 rows=1 loops=1)
   ->  Aggregate  (cost=1585507.74..1585507.75 rows=1 width=32)
(actual time=21626.466..21626.466 rows=1 loops=1)
 ->  Bitmap Heap Scan on lineitem  (cost=299632.60..1579529.48
rows=1195652 width=12) (actual time=9204.770..20910.089 rows=1190658
loops=1)
   Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND
(l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone) AND
(l_discount >= 0.07
) AND (l_discount <= 0.09) AND (l_quantity < '25'::numeric))
   Rows Removed by Index Recheck: 27584798
   Heap Blocks: exact=101349 lossy=580141
   ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..299333.68 rows=1195652 width=0) (actual
time=9185.490..9185.490 rows=1190658 loops=
1)
 Index Cond: ((l_shipdate >= '1995-01-01'::date)
AND (l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone)
AND (l_discount >=
0.07) AND (l_discount <= 0.09) AND (l_quantity < '25'::numeric))
 Planning time: 0.675 ms
 Execution time: 21626.838 ms
(10 rows)


->After reverting Commit: 75ae538bc3168bf44475240d4e0487ee2f3bb376
postgres=# \i 6.explain.sql

                                   QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=1585507.74..1585507.75 rows=1 width=32) (actual
time=12807.293..12807.294 rows=1 loops=1)
   ->  Aggregate  (cost=1585507.74..1585507.75 rows=1 width=32)
(actual time=12807.291..12807.291 rows=1 loops=1)
 ->  Bitmap Heap Scan on lineitem  (cost=299632.60..1579529.48
rows=1195652 width=12) (actual time=1632.351..12131.552 rows=1190658
loops=1)
   Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND
(l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone) AND
(l_discount >= 0.07
) AND (l_discount <= 0.09) AND (l_quantity < '25'::numeric))
   Rows Removed by Index Recheck: 28401743
   Heap Blocks: exact=84860 lossy=596630
   ->  Bitmap Index Scan on idx_lineitem_shipdate
(cost=0.00..299333.68 rows=1195652 width=0) (actual
time=1613.166..1613.166 rows=1190658 loops=
1)
 Index Cond: ((l_shipdate >= '1995-01-01'::date)
AND (l_shipdate < '1996-01-01 00:00:00'::timestamp without time zone)
AND (l_discount >=
0.07) AND (l_discount <= 0.09) AND (l_quantity < '25'::numeric))
 Planning time: 0.173 ms
 Execution time: 12807.380 ms
(10 rows)


From the above explain analyze results we can see that with commit
75ae538bc3168bf44475240d4e0487ee2f3bb376 the Bitmap Index Scan node is
way slower than without this commit.

Perf result:
Head
+   13.12%  0.01%  postgres  postgres  [.] tbm_lossify
+   13.10% 13.10%  postgres  postgres  [.] tbm_mark_page_lossy
+   12.84% 12.82%  postgres  postgres  [.] slot_deform_tuple
+    6.94%  0.00%  postgres  postgres  [.] _bt_next
+    6.94%  0.02%  postgres  postgres  [.] _bt_steppage
+    6.55%  0.05%  postgres  postgres  [.] _bt_readpage
+    6.41%  1.00%  postgres  postgres  [.] _bt_checkkeys

After Reverting 75ae538bc3168bf44475240d4e0487ee2f3bb376:
+    0.71%  0.71%  postgres  postgres           [.] cmp_var_common
+    0.62%  0.02%  postgres  postgres           [.] tbm_lossify
+    0.62%  0.62%  postgres  postgres           [.] AllocSetReset
+    0.60%  0.11%  postgres  [kernel.kallsyms]  [k] sys_read
+    0.59%  0.10%  postgres  postgres           [.] advance_transition_function

I think that in the new hash implementation, deleting from the pagetable
has a severe performance issue.

Note: If I set work_mem=100MB (no lossy pages), then performance is fine.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-09 Thread Mithun Cy
On Thu, Nov 3, 2016 at 7:16 PM, Robert Haas  wrote:
> Great, committed.  There's still potentially more work to be done
> here, because my patch omits some features that were present in
> Victor's original submission, like setting the failover timeout,
> optionally randomizing the order of the hosts, and distinguishing
> between master and standby servers; Victor, or anyone, please feel
> free to submit separate patches for those things.

Among the remaining things, I have worked on the failover-to-new-master
idea. The patch below implements it. It is taken from Victor's patch but
rewritten by me with some cleanup. As in Victor's patch, we have a new
connection parameter "target_server_type", which can take 2 values: 1. "any"
2. "master", with DEFAULT "any". If it has the value "any", we can connect
to any of the host servers (both master (primary) and slave (standby)). If
the value is "master", then we try to connect to the master (primary) only.
NOTE: The parameter name is inspired by and taken from the PostgreSQL JDBC
Driver.

The main difference between Victor's patch and this new one is the default
value of the target_server_type parameter. In Victor's patch, if the number
of hosts in the connection string is 1, the default value is "any" (this was
done to make sure old psql can still connect to a standby). If it is greater
than 1, the default value is "master". To me this appeared slightly
inconsistent; having the default value "any" for any number of hosts seemed
more appropriate, and it is also backward compatible. And if the user wants
failover to a master, he should ask for it by setting
target_server_type=master in the connection string.

This patch does not include the load-balancing feature of trying to
connect to hosts in random order.

This patch also does not have the failover timeout feature. Victor's
implementation of it is not strict: one walk of all of the hosts is allowed
even if the timer might have expired, and only on the second walk do we
check for the timeout. I thought the application could manage the failover
timeout instead of libpq doing it, so I avoided it in this patch. If others
find that I am wrong to think that way, I will add it.
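For illustration, a connection string using the proposed parameter might look like this (host names are hypothetical, and target_server_type is the parameter proposed by this patch, not an existing libpq option):

```
psql "host=node1,node2 port=5432 dbname=test target_server_type=master"
```

With target_server_type=any (the proposed default), the same string would accept the first reachable server, whether primary or standby.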

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


failover_to_new_master-v1.patch
Description: Binary data



Re: [HACKERS] Do we need use more meaningful variables to replace 0 in catalog head files?

2016-11-09 Thread Amit Langote
On Wed, Nov 9, 2016 at 11:47 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Wed, Nov 9, 2016 at 1:44 PM, Tom Lane  wrote:
>>> I don't think we need "named constants", especially not
>>> manually-maintained ones.  The thing that would help in pg_proc.h is for
>>> numeric type OIDs to be replaced by type names.  We talked awhile back
>>> about introducing some sort of preprocessing step that would allow doing
>>> that --- ie, it would look into some precursor file for pg_type.h and
>>> extract the appropriate OID automatically.  I'm too tired to go find the
>>> thread right now, but it was mostly about building the long-DATA-lines
>>> representation from something easier to edit.
>
>> You mean that I guess:
>> https://www.postgresql.org/message-id/4d191a530911041228v621286a7q6a98d9ab8a2ed...@mail.gmail.com
>
> Hmm, that's from 2009.  I thought I remembered something much more recent,
> like last year or so.

This perhaps:

* Re: Bootstrap DATA is a pita *
https://www.postgresql.org/message-id/flat/CAOjayEfKBL-_Q9m3Jsv6V-mK1q8h%3Dca5Hm0fecXGxZUhPDN9BA%40mail.gmail.com

Thanks,
Amit




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Amit Kapila
On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
>
> I think we can give a brief explanation right in the code comment.  I
> referred to "decreasing the TIDs"; you are referring to "moving tuples
> around".  But I think that moving the tuples to different locations is
> not the problem.  I think the problem is that a tuple might be
> assigned a lower spot in the item pointer array
>

I think we both understand the problem, and it is just a matter of using
different words.  I will go with your suggestion and will try to
slightly adjust the README as well so that both places use the same
terminology.


>>> +/*
>>> + * Acquiring cleanup lock to clear the split-in-progress flag ensures 
>>> that
>>> + * there is no pending scan that has seen the flag after it is cleared.
>>> + */
>>> +_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
>>> +opage = BufferGetPage(bucket_obuf);
>>> +oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>>>
>>> I don't understand the comment, because the code *isn't* acquiring a
>>> cleanup lock.
>>
>> Oops, this is a remnant from one of the design approaches to clear these
>> flags, which was later discarded due to issues. I will change this to
>> indicate an exclusive lock.
>
> Of course, an exclusive lock doesn't guarantee anything like that...
>

Right, but we don't need that guarantee (that there is no pending scan
that has seen the flag after it is cleared) in order to clear the flags.
It was written in one of the previous patches, where I was exploring the
idea of using a cleanup lock to clear the flags and then not using the
same lock during vacuum.  However, there were some problems in that
design and I changed the code, but forgot to update the comment.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Copying Permissions

2016-11-09 Thread Corey Huinker
>
> > SET relacl = ( SELECT relacl FROM pg_class
> >                WHERE oid = 'foo'::regclass)
> > WHERE oid = 'dummy'::regclass;
>
> Yikes, let's not suggest folks go updating catalog tables, ever, please.
> :)
>

Glad you find that as icky as I do.


> Should we have a way for users to define an ACL ala the ALTER DEFAULT
> PERMISSIONS approach where the ACL is not attached to any particular
> object and is, instead, something which can be assigned to a table.
> Further, if we support that, what happens if that is changed later?  I
> could see use-cases for both having that change impact all objects and
> for not having that happen.
>

I think allowing users to receive and send serialized relacl values (which
is what I *think* you're asking about here) is only slightly less icky, and
presents a backward compatibility issue. Those issues go away if the ACL is
contained in an existing object, or exists only for the life of a
statement. In which case I think you're suggesting something like this:


BEGIN;
 GATHER TEMPORARY DEFAULT PRIVILEGES FROM view_name;
 DROP VIEW view_name;
 CREATE VIEW view_name AS ...;
COMMIT;


Which would solve the problem, provided I don't want to drop dependent
objects with different permissions. Once I have to do DROP a; DROP b;
CREATE b; CREATE a; and the permissions of a and b don't match, I'm sunk.


> Second, as always, what's the syntax going to actually be?  I don't
> think GRANT SAME PERMISSIONS is going to work out too well in the
> parser, and it seems a bit grotty to me anyway.  I do think this should
> be associated with GRANT rather than ALTER TABLE- GRANT is what we use
> for managing privileges on an object.
>

So GRANT / REVOKE are a bit weird in this case, because they operate on an
object as it pertains to one or more roles. Here we are adding in a
reference to another like-typed object, and the roles aren't even mentioned.

Moreover, the operation itself would potentially do both GRANTing and
REVOKEing, depending on what the target object's permissions were relative
to the source object's. So there are situations where an object could end
up with fewer permissions after a GRANT than it had before.

Or we could instead decide that the GRANT only adds permissions, never
revokes, and if the user wants an exact copy then it's up to them to first
revoke all privs on the new object before the GRANT. Either way, the syntax
might be:

BEGIN;
 CREATE TEMPORARY VIEW dummy AS SELECT 1 AS dummy_col;
 GRANT ALL PRIVILEGES ON VIEW dummy FROM my_view;
 DROP VIEW my_view;
 CREATE VIEW my_view ...;
 REVOKE ALL PRIVILEGES on my_view FROM public ; /* repeat for every other
role you can think of ... ick */
 GRANT ALL PRIVILEGES ON VIEW my_view FROM dummy;
COMMIT;

That's still clumsy, but at least we've avoided having a user touch
pg_class.relacl.

So after all that wrangling, I got around to where Tom got rather quickly:
ALTER TABLE x COPY PERMISSIONS FROM y;

If we're worried about the ALTER-person's authority to GRANT things already
granted to table y, then I suppose the best thing to do would be this:

1. Strip all permissions from x (empty relacl), so the ALTER-ing person
pretty much has to be the owner.
2. Iterate over the permissions in the relacl of y, and attempt to grant
them (as the ALTER-person) one by one, issuing NOTICE or WARNING whenever a
grant fails.
3. The operation is judged to have succeeded if at least one permission is
granted, or NO grants failed (i.e. there was nothing to grant).

I can see obvious problems with copying grants from one user to another on
an existing object not of the user's creation, but in this case the
ALTER-ing person already has ownership (or close) of the object, they're
not compromising any previously existing object. Still, I'm sure somebody
could dream up a priv escalation somehow, hence my
iterate-and-spaghetti-test the individual grants approach.
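As a rough illustration of that iterate-and-test idea, something along these
lines can even be sketched in userspace today with the catalog's
aclexplode() function (src and dst are hypothetical table names; a real
implementation would live in the backend, and role-name quoting is glossed
over):

```sql
-- Purely illustrative sketch: copy src's grants onto dst one by one,
-- downgrading any grant the current user cannot make to a WARNING.
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT CASE WHEN a.grantee = 0 THEN 'PUBLIC'
                    ELSE a.grantee::regrole::text END AS who,
               a.privilege_type
        FROM pg_class c, aclexplode(c.relacl) AS a
        WHERE c.oid = 'src'::regclass
    LOOP
        BEGIN
            EXECUTE format('GRANT %s ON dst TO %s',
                           r.privilege_type, r.who);
        EXCEPTION WHEN insufficient_privilege THEN
            RAISE WARNING 'could not grant % to %',
                          r.privilege_type, r.who;
        END;
    END LOOP;
END $$;
```

Each GRANT is attempted independently, so a single failure degrades to a
WARNING rather than aborting the whole copy, which is the behavior proposed
in step 2 above.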


On Wed, Nov 9, 2016 at 1:35 PM, Tom Lane  wrote:

> Robert Haas  writes:
> > On Tue, Nov 8, 2016 at 9:48 AM, Stephen Frost 
> wrote:
> >> Second, as always, what's the syntax going to actually be?  I don't
> >> think GRANT SAME PERMISSIONS is going to work out too well in the
> >> parser, and it seems a bit grotty to me anyway.  I do think this should
> >> be associated with GRANT rather than ALTER TABLE- GRANT is what we use
> >> for managing privileges on an object.
>
> > One thing to think about is that GRANT sort of implies adding
> > privileges, but this operation would both add and remove privileges as
> > necessary.
>
> Other things to think about:
>
> 1. If you can GRANT x, that generally implies that you can REVOKE x.
> What would REVOKE SAME PERMISSIONS mean?
>
> 2. The GRANT/REVOKE syntax is largely governed by the SQL standard.
> We risk getting boxed in by picking something that will conflict
> with future spec extensions in this area.
>
> On the whole, I suspect some sort of "ALTER TABLE x COPY PERMISSIONS
> FROM y" syntax would be better.

Re: [HACKERS] WIP: About CMake v2

2016-11-09 Thread Yury Zhuravlev

Craig Ringer wrote:

So personally I think it'd be fine if a cmake build defaulted to
finding and using what it could, but offered a --minimal mode or
whatever that gets us just core postgres + whatever we enable
explicitly. But our current behaviour is OK too.


To me it's the best way. But I'm not sure whether Tom Lane will accept this.


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is user_catalog_table sensible for matviews?

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 12:17 PM, Tom Lane  wrote:
> The system will let you set the "user_catalog_table" reloption to "true"
> on a materialized view.  Is this sensible, or is it a bug caused by the
> fact that reloptions.c fails to distinguish matviews from heaps at all?
>
> If it is sensible, then I broke it in e3e66d8a9 ...

I can understand what that combination of options would mean from a
semantic point of view, so I don't think it's insensible.  However, it
doesn't seem like an important combination to support, and I suspect
that the fact that we did was accidental.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IPv6 link-local addresses and init data type

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 9:02 AM, Peter Eisentraut
 wrote:
> On 11/7/16 1:13 AM, Haribabu Kommi wrote:
>> Yes, I agree that default zone is the main use case of the original thread.
>> From the RFC 4007, the default zone is used for the global addresses,
>> This may be the main use case with zone id. How about currently just
>> ignoring it and store the actual IP address with the attached patch and
>> handle the rest of the actual zone id support later once it gets
>> properly standardized?
>
> Well, according to the RFC, the default zone is 0 "typically", which is
> a very weak requirement.  So just ignoring it is probably also not right.
>
> So far we have only heard one use case for any of this, which is someone
> wanting to store ::1%0, which is not even a valid address according to
> that same RFC.  So this is all on very weak ground.
>
> I think we should just forget about this.  It's all a bit too dubious.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-11-09 Thread Tom Lane
Etsuro Fujita  writes:
> [ foreign_plan_cache_inval_v6.patch ]

I looked through this a bit.  A point that doesn't seem to have been
discussed anywhere in the thread is why foreign tables should deserve
exact tracking of dependencies, when we have not bothered with that
for most other object types.  Schemas, for example, we're happy to just
zap the whole plan cache for.  Attempting to do exact tracking is
somewhat expensive and bug-prone --- the length of this thread speaks
to the development cost and bug hazard.  Meanwhile, nobody has questioned
the cost of attaching more PlanInvalItems to a plan (and the cost of the
extra catalog lookups this patch does to create them).  For most plans,
that's nothing but overhead because no invalidation will actually occur
in the life of the plan.
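(For context, the hazard being tracked here looks roughly like the following
session; the foreign table and option names are illustrative:)

```sql
-- Without any invalidation, a cached plan can keep using stale FDW options.
PREPARE q AS SELECT * FROM remote_tab;            -- plan is cached
EXECUTE q;
ALTER FOREIGN TABLE remote_tab
    OPTIONS (SET table_name 'other_remote_tab');  -- table-level option change
EXECUTE q;  -- an uninvalidated plan may still query the old remote table
```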

I think there's a very good argument that we should just treat any inval
in pg_foreign_data_wrapper as a reason to blow away the whole plan cache.
People aren't going to alter FDW-level options often enough to make it
worth tracking things more finely.  Personally I'd put pg_foreign_server
changes in the same boat, though maybe those are changed slightly more
often.  If it's worth doing anything more than that, it would be to note
which plans mention foreign tables at all, and not invalidate those that
don't.  We could do that with, say, a hasForeignTables bool in
PlannedStmt.

That leaves the question of whether it's worth detecting table-level
option changes this way, or whether we should just handle those by forcing
a relcache inval in ATExecGenericOptions, as in Amit's original patch in
<5702298d.4090...@lab.ntt.co.jp>.  I kind of like that approach; that
patch was short and sweet, and it put the cost on the unusual path (ALTER
TABLE) not the common path (every time we create a query plan).

That leaves not much of this patch :-( though maybe we could salvage some
of the test cases.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 12:00 PM, Robert Haas  wrote:
> In this latest patch set:
>
> src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
> +*rdatum;
>
> With all patches applied, "make check" fails with a bunch of diffs
> that look like this:
>
>   Check constraints:
> - "pt1chk2" CHECK (c2 <> ''::text)
>   "pt1chk3" CHECK (c2 <> ''::text)

And the pg_upgrade test also fails:

Done
+ pg_dumpall -f /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
+ pg_ctl -m fast stop
waiting for server to shut down done
server stopped
+ set +x

Files /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql and
/Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql differ
dumps were not identical
make[2]: *** [check] Error 1
make[1]: *** [check-pg_upgrade-recurse] Error 2
make: *** [check-world-src/bin-recurse] Error 2
[rhaas pgsql]$ diff
/Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql
/Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
6403d6402
< c text
8736,8737c8735
< CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double
precision)),
< CONSTRAINT bmerged CHECK (((b)::double precision > (1)::double precision))
---
> CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double 
> precision))

For future revisions, please make sure "make check-world" passes before posting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Copying Permissions

2016-11-09 Thread Robert Haas
On Tue, Nov 8, 2016 at 9:48 AM, Stephen Frost  wrote:
> Second, as always, what's the syntax going to actually be?  I don't
> think GRANT SAME PERMISSIONS is going to work out too well in the
> parser, and it seems a bit grotty to me anyway.  I do think this should
> be associated with GRANT rather than ALTER TABLE- GRANT is what we use
> for managing privileges on an object.

One thing to think about is that GRANT sort of implies adding
privileges, but this operation would both add and remove privileges as
necessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Copying Permissions

2016-11-09 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 8, 2016 at 9:48 AM, Stephen Frost  wrote:
>> Second, as always, what's the syntax going to actually be?  I don't
>> think GRANT SAME PERMISSIONS is going to work out too well in the
>> parser, and it seems a bit grotty to me anyway.  I do think this should
>> be associated with GRANT rather than ALTER TABLE- GRANT is what we use
>> for managing privileges on an object.

> One thing to think about is that GRANT sort of implies adding
> privileges, but this operation would both add and remove privileges as
> necessary.

Other things to think about:

1. If you can GRANT x, that generally implies that you can REVOKE x.
What would REVOKE SAME PERMISSIONS mean?

2. The GRANT/REVOKE syntax is largely governed by the SQL standard.
We risk getting boxed in by picking something that will conflict
with future spec extensions in this area.

On the whole, I suspect some sort of "ALTER TABLE x COPY PERMISSIONS
FROM y" syntax would be better.

BTW, please specify what the grantor of the resulting permissions
would be.  I rather doubt that it should involve blindly copying
the source ACL if the user doing the COPY is not the original
grantor --- that feels way too much like a security problem
waiting to happen.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables cleanup

2016-11-09 Thread Konstantin Knizhnik

Hi, hackers

I wonder if such behavior can be considered as a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10



Now simulate server crash using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR:  could not open file 
"pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory


knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/


So the whole relation directory is removed!
It happens only for the first table created in the tablespace.
If you create the table in the Postgres data directory everything is ok: the
first segment of the relation is truncated but not deleted.
Also, if you create one more unlogged table in the tablespace it is truncated
correctly:


postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 
fs/PG_10_201611041/12289/32768_init



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-11-09 Thread Robert Haas
In this latest patch set:

src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
+*rdatum;

With all patches applied, "make check" fails with a bunch of diffs
that look like this:

  Check constraints:
- "pt1chk2" CHECK (c2 <> ''::text)
  "pt1chk3" CHECK (c2 <> ''::text)


On Wed, Nov 9, 2016 at 6:14 AM, Amit Langote
 wrote:
> Actually the sentence there is: The parenthesized list of columns or
> expressions forms the partitioning key for the
> table

OK.

>> +/*
>> + * Strip any top-level COLLATE clause.  This ensures that we 
>> treat
>> + * "x COLLATE y" and "(x COLLATE y)" alike.
>> + */
>>
>> But you don't, right?  Unless I am confused, which is possible, the
>> latter COLLATE will be ignored, while the former one will set the
>> collation to be used in the context of partitioning comparisons.
>
> The code immediately following the comment does in fact strip the
> top-level COLLATE clause.
>
> while (IsA(expr, CollateExpr))
> expr = (Node *) ((CollateExpr *) expr)->arg;
>
> So that the following two specifications are equivalent which is the intent:
>
> create table p (a text) partition by range (a collate "en_US");
> vs.
> create table p (a text) partition by range ((a collate "en_US"));

I see.  You're right.

>> Re-reviewing 0002:
>>
>> +   if (fout->remoteVersion >= 10)
>> +   {
>> +   PQExpBuffer acl_subquery = createPQExpBuffer();
>> +   PQExpBuffer racl_subquery = createPQExpBuffer();
>> +   PQExpBuffer initacl_subquery = createPQExpBuffer();
>> +   PQExpBuffer initracl_subquery = createPQExpBuffer();
>> +
>> +   PQExpBuffer attacl_subquery = createPQExpBuffer();
>> +   PQExpBuffer attracl_subquery = createPQExpBuffer();
>> +   PQExpBuffer attinitacl_subquery = createPQExpBuffer();
>> +   PQExpBuffer attinitracl_subquery = createPQExpBuffer();
>>
>> It seems unnecessary to repeat all of this.  The only differences
>> between the 1 version and the 9600 version are:
>>
>> 60,61c60
>> <   "AS changed_acl, "
>> <   "CASE WHEN c.relkind = 'P' THEN
>> pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
>> ---
>>>   "AS changed_acl "
>> 73c72
>> <"WHERE c.relkind in ('%c', '%c', '%c', '%c',
>> '%c', '%c', '%c') "
>> ---
>>>"WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') 
>>> "
>> 87,88c86
>> <   RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
>> <   RELKIND_PARTITIONED_TABLE);
>> ---
>>>   RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE);
>>
>> But none of that is really a problem.  Sure, the 'P' case will never
>> arise in 9.6, but so what?  I'd really like to not keep duplicating
>> these increasingly-complex hunks of code if we can find some way to
>> avoid that.
>
> We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that
> getTables() sends to pre-10 servers, right?  But I suppose we need not
> call it in that particular SQL query in the first place.

Oh, yeah, that's a problem; the query will error out against older
servers.  You could do something like:

char *partkeydef;
if (version <= 90600)
partkeydef = "NULL";
else
partkeydef = "CASE WHEN c.relkind = 'P' THEN
pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END";

...and the use %s to interpolate that into the query string.

> How about we do it in the following manner in getSchemaData():
>
> if (g_verbose)
> write_msg(NULL, "reading partition key information for interesting
> tables\n");
> getTablePartitionKeyInfo(fout, tblinfo, numTables);
>
> I have implemented the same.

That might be OK too; I'll have to read it through carefully.

>> Re-reviewing 0003:
>>
>> + 
>> +  If this table is a partition, one cannot perform DROP
>> NOT NULL
>> +  on a column if it is marked not null in the parent table.
>> + 
>>
>> This would, presumably, also be true for inheritance.  I think we
>> could just leave this out.
>
> Actually, it isn't true for the regular inheritance situation.
>
> create table parent (a int not null);
> create table child () inherits (parent);
> alter table child alter a drop not null;  -- this works (bug?)

Hrm, OK.  That doesn't satisfy MY idea of the principle of least
astonishment, but hey...

> But in the meantime, we can proceed with enforcing inheritance on NOT NULL
> constraint for *partitions*, because they only ever have one parent and
> hence do not need elaborate coninhcount based scheme, I think.  Thoughts?

I agree.

>> +  correspond to the partitioning method and partitioning key of the 
>> target
>>
>> I think that in most places were are referring to the "partitioning
>> method" (with ing) but the "partition 

Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Robert Haas
On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai  wrote:
> As an example, I enhanced postgres_fdw to understand the ps_numTuples
> if it is set. If and when remote ORDER BY is pushed down, the latest
> code tries to sort the entire remote table because it does not know
> how many rows will be returned. Thus, it took a longer execution time.
> On the other hand, the patched one runs the remote query with LIMIT
> clause according to the ps_numTuples; which is informed by the Limit
> node on top of the ForeignScan node.

So there are two cases here.  If the user says LIMIT 12, we could in
theory know that at planner time and optimize accordingly.  If the
user says LIMIT twelve(), however, we will need to wait until
execution time unless twelve() happens to be capable of being
simplified to a constant by the planner.

Therefore, it's possible to imagine having two mechanisms here. In the
simple case where the LIMIT and OFFSET values are constants, we could
implement a system to get hold of that information during planning and
use it for whatever we like.   In addition, we can have an
execution-time system that optimizes based on values available at
execution (regardless of whether those values were also available
during planning).  Those are, basically, two separate things, and this
patch has enough to do just focusing on one of them.
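To illustrate the two cases (ft is a hypothetical foreign table):

```sql
-- A helper whose value is only known at execution time:
CREATE FUNCTION twelve() RETURNS int
    AS 'SELECT 12' LANGUAGE sql VOLATILE;

SELECT * FROM ft ORDER BY a LIMIT 12;        -- bound known at plan time
SELECT * FROM ft ORDER BY a LIMIT twelve();  -- bound known only at execution
```

(If twelve() were IMMUTABLE, the planner could constant-fold it, which is the
"capable of being simplified to a constant" case mentioned above.)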

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect overflow check condition for WAL segment size

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 12:54 PM, Markus Nullmeier
 wrote:
> On 11/08/16 18:12, Robert Haas wrote:
>> On Tue, Nov 8, 2016 at 1:01 AM, Michael Paquier  
>> wrote:
>>> On Tue, Nov 8, 2016 at 2:33 PM, Kuntal Ghosh  
>>> wrote:
 I've attached a patch to fix this.
>>> Good catch. Interesting copy-pasto from 88e9823.
>> Committed.
>
> Hmm, somehow this fix (60379f66c8 for master) does not seem to appear
> in the 9.5 and 9.6 branches, yet the latter both include commit 88e9823.

It didn't seem important to back-patch it, so I didn't.  It also
occurred to me that there was a small chance of breaking the build for
somebody who is skating by today, which would annoy that person
without being likely to benefit anyone else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
 wrote:
> Now simulate server crash using "pkill -9 postgres".
>
> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
> logfile start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
> knizhnik@knizhnik:~/dtm-data$ psql postgres
> psql (10devel)
> Type "help" for help.
>
> postgres=# select * from foo;
> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
> No such file or directory
>
> knizhnik@knizhnik:~/dtm-data$ ls fs
> PG_10_201611041
> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>
> So the whole relation directory is removed!
> It happens only for the first table created in the tablespace.
> If you create the table in the Postgres data directory everything is ok: the
> first segment of the relation is truncated but not deleted.

Whoa.  There should be an _init fork that doesn't get removed, and
without removing the _init fork you shouldn't be able to remove the
directory that contains it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch proposal

2016-11-09 Thread Jaime Casanova
On 16 August 2016 at 09:06, Stephen Frost  wrote:
> Greetings,
>
> * Venkata B Nagothi (nag1...@gmail.com) wrote:
>> The above said parameters can be configured to pause, shutdown or prevent
>> promotion only after reaching the recovery target point.
>> To clarify, I am referring to a scenario where recovery target point is not
>> reached at all ( i mean, half-complete or in-complete recovery) and there
>> are lots of WALs still pending to be replayed - in this situation,
>
> PG doesn't know that there are still WALs to be replayed.
>
>> PostgreSQL just completes the archive recovery until the end of the last
>> available WAL (WAL file "0001001E" in my case) and
>> starts-up the cluster by generating an error message (saying
>> "0001001F" not found).
>
> That's not a PG error, that's an error from cp.  From PG's perspective,
> your restore command has said that all of the WAL has been replayed.
>
> If that's not what you want then change your restore command to return
> an exit code > 125, which tells PG that it's unable to restore that WAL
> segment.
>

Ah! Is this documented somewhere?
If we expect people to use correct exit codes, we should document them, no?
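For reference, a restore_command wrapper that plays by those rules might look
like this (paths and names are illustrative; the exit-code convention follows
Stephen's description above, where a plain nonzero exit means "segment not in
the archive" and an exit code above 125 makes recovery abort instead of
declaring end-of-WAL):

```shell
# Sketch of a restore_command helper.  restore_segment <f> <p> mirrors the
# %f and %p placeholders; the archive location is assumed, not prescribed.
restore_segment() {
    archive="${WAL_ARCHIVE:-/mnt/server/archivedir}"
    segment="$1"    # %f: file name requested by the server
    target="$2"     # %p: path the server wants it copied to

    if [ -f "$archive/$segment" ]; then
        cp "$archive/$segment" "$target" || return 126  # copy failed: fatal
        return 0
    fi
    return 1        # segment genuinely absent: recovery ends here
}
# recovery.conf would then point restore_command at a script calling this,
# e.g. restore_command = '/path/to/restore_wal.sh %f "%p"'
```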

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IPv6 link-local addresses and init data type

2016-11-09 Thread Tom Lane
Robert Haas  writes:
> On Wed, Nov 9, 2016 at 9:02 AM, Peter Eisentraut
>  wrote:
>> On 11/7/16 1:13 AM, Haribabu Kommi wrote:
>>> Yes, I agree that default zone is the main use case of the original thread.
>>> From the RFC 4007, the default zone is used for the global addresses,
>>> This may be the main use case with zone id. How about currently just
>>> ignoring it and store the actual IP address with the attached patch and
>>> handle the rest of the actual zone id support later once it gets
>>> properly standardized?

>> Well, according to the RFC, the default zone is 0 "typically", which is
>> a very weak requirement.  So just ignoring it is probably also not right.
>> So far we have only heard one use case for any of this, which is someone
>> wanting to store ::1%0, which is not even a valid address according to
>> that same RFC.  So this is all on very weak ground.
>> I think we should just forget about this.  It's all a bit too dubious.

> +1.

Agreed, let's wait until more standardization emerges.  Anything we do now
risks painting ourselves into a corner, and there is not so much demand
for a feature in this area that we need to do something about it Right Now.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Robert Haas
On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
 wrote:
> 1. ps_numTuples is declared as long, however offset and count members in
> LimitState struct and bound member in SortState struct is int64.  However
> long on 32 bit machine may be 32 bits and thus I think tuples_needed which
> is long may have overflow hazards as it may store int64 + int64.  I think
> ps_numTuples should be int64.

I suggested long originally because that's what ExecutorRun() was
using at the time.  It seems that it got changed to uint64 in
23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
probably use uint64.

> 2. Robert suggested following in the previous discussion:
> "For example, suppose we add a new PlanState member "long
> numTuples" where 0 means that the number of tuples that will be needed
> is unknown (so that most node types need not initialize it), a
> positive value is an upper bound on the number of tuples that will be
> fetched, and -1 means that it is known for certain that we will need
> all of the tuples."
>
> We should have 0 for the default case so that we don't need to initialize it
> at most of the places.  But I see many such changes in the patch.  I think
> this is not possible here since 0 can be a legal user provided value which
> cannot be set as a default (default is all rows).
>
> However do you think, can we avoid that? Is there any other way so that we
> don't need every node having ps_numTuples to be set explicitly?

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in comparison of empty jsonb arrays to scalars

2016-11-09 Thread Robert Haas
On Tue, Nov 8, 2016 at 9:49 PM, Michael Paquier
 wrote:
> On Wed, Nov 9, 2016 at 8:31 AM, Nikita Glukhov  
> wrote:
>> Hi hackers.
>>
>> While working on jsonb statistics, I found the following bug:
>> an empty jsonb array is considered to be lesser than any scalar,
>> but it is expected that objects > arrays > scalars.
>
> Sources?

How about "our documentation"?

https://www.postgresql.org/docs/current/static/datatype-json.html

Look at the last page.
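Concretely, that page specifies the jsonb ordering as Object > Array >
Boolean > Number > String > Null, so both of these comparisons should come
out false (hypothetical session; per the report, on an affected build the
empty-array case incorrectly comes out true):

```sql
-- Any array should sort above any scalar, empty or not.
SELECT '[]'::jsonb  < '1'::jsonb AS empty_array_below_scalar,
       '[0]'::jsonb < '1'::jsonb AS nonempty_array_below_scalar;
```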

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Indexes

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila  wrote:
> On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas  wrote:
>> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
>> I think we can give a brief explanation right in the code comment.  I
>> referred to "decreasing the TIDs"; you are referring to "moving tuples
>> around".  But I think that moving the tuples to different locations is
>> not the problem.  I think the problem is that a tuple might be
>> assigned a lower spot in the item pointer array
>
> I think we both understand the problem and it is just matter of using
> different words.  I will go with your suggestion and will try to
> slightly adjust the README as well so that both places use same
> terminology.

Yes, I think we're on the same page.

> Right, but we don't need that guarantee (there is no pending scan that
> has seen the flag after it is cleared) to clear the flags.  It was
> written in one of the previous patches where I was exploring the idea
> of using cleanup lock to clear the flags and then don't use the same
> during vacuum.  However, there were some problems in that design and I
> have changed the code, but forgot to update the comment.

OK, got it, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect overflow check condition for WAL segment size

2016-11-09 Thread Markus Nullmeier
On 11/08/16 18:12, Robert Haas wrote:
> On Tue, Nov 8, 2016 at 1:01 AM, Michael Paquier  
> wrote:
>> On Tue, Nov 8, 2016 at 2:33 PM, Kuntal Ghosh  
>> wrote:
>>> I've attached a patch to fix this.
>> Good catch. Interesting copy-pasto from 88e9823.
> Committed.

Hmm, somehow this fix (60379f66c8 for master) does not seem to appear
in the 9.5 and 9.6 branches, yet the latter both include commit 88e9823.


-- 
Markus Nullmeier                        http://www.g-vo.org
German Astrophysical Virtual Observatory (GAVO)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Danger of automatic connection reset in psql

2016-11-09 Thread Robert Haas
On Tue, Nov 8, 2016 at 9:53 AM, Oleksandr Shulgin
 wrote:
> On Mon, Nov 7, 2016 at 9:31 PM, Jim Nasby  wrote:
>>
>> On 11/4/16 4:04 AM, Oleksandr Shulgin wrote:
>>>
>>> The psql process even exits with an error code 2, which might be not
>>> that expected.  We could stop reading the file and reset connection
>>> afterwards, but this is probably not that easy to achieve (think of
>>> nested \i calls).
>>
>>
>> Well, if you stop reading from the file then I don't think more \i's will
>> matter, no? I'd certainly like to see improvement here, because the
>> difference in behavior with \i is annoying.
>
>
> What I mean is you need a longjump out of the innermost \i back to the
> toplevel interactive prompt.  This might not be a problem since this is what
> already happens upon receiving SIGINT, I believe.
>
>> On the bigger question of how to better protect all these cases (cut &
>> paste, etc), I think the only robust way to do that is for psql to track
>> intended transaction status itself. With the amount of parsing it's already
>> doing, maybe that wouldn't be that hard to add. It looks like this might
>> require extra libpq calls to constantly check in on server status; I'm a bit
>> surprised that result objects don't include that info.
>
>
> This doesn't have to be solely about transaction status, though for
> something immediately destructive such as DELETE or UPDATE one should expect
> a transaction guard.  But for example, pasting something like the following
> two lines
>
> SET search_path = 'fancy_new_schema', 'old_boring_schema', public;
> SELECT * FROM get_item_ids_to_delete(...);
>
> can give slightly different results depending on whether the first statement
> had it effect or not.  And since psql is trying to be very helpful here by
> resetting the connection, it makes it all too easy to overlook the problem.
>
> What do you think about trying to read everything we can from the terminal
> using non-blocking IO and only if that gives EWOULDBLOCK, starting the
> interactive prompt?

I think that's unlikely to be completely reliable.

How about, instead of all this, adding an option to psql to suppress
the automatic reconnect behavior?  When enabled, psql just exits
instead of trying to reconnect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Is user_catalog_table sensible for matviews?

2016-11-09 Thread Tom Lane
The system will let you set the "user_catalog_table" reloption to "true"
on a materialized view.  Is this sensible, or is it a bug caused by the
fact that reloptions.c fails to distinguish matviews from heaps at all?

If it is sensible, then I broke it in e3e66d8a9 ...

regards, tom lane




Re: [HACKERS] Specifying the log file name of pgbench -l option

2016-11-09 Thread Robert Haas
On Tue, Nov 8, 2016 at 4:04 AM, Fabien COELHO  wrote:
>
> [ ... v4 ]
>
>> I checked. It works as expected. I have no more comments.
>> If its okay with Fabien, we can mark it ready for committer.
>
> Patch applies, compiles (including the documentation), make check ok and
> feature works for me. Code could be a little simpler, but it is okay. I've
> switched the status to "ready for committers".

Committed with some wordsmithing and cosmetic tweaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Copying Permissions

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 2:54 PM, Corey Huinker  wrote:
> 3. The operation is judged to have succeeded if at least one permission is
> granted, or NO grants failed (i.e. there was nothing to grant).

Allow me to be skeptical.  If a user types INSERT INTO blah VALUES
(...), (...), (...) should we change the system to report success if
at least 1 of the 3 rows got successfully inserted?  I bet that
wouldn't go over well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bug in comparison of empty jsonb arrays to scalars

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 3:27 AM, Robert Haas  wrote:
> On Tue, Nov 8, 2016 at 9:49 PM, Michael Paquier
>  wrote:
>> On Wed, Nov 9, 2016 at 8:31 AM, Nikita Glukhov  
>> wrote:
>>> Hi hackers.
>>>
>>> While working on jsonb statistics, I found the following bug:
>>> an empty jsonb array is considered to be lesser than any scalar,
>>> but it is expected that objects > arrays > scalars.
>>
>> Sources?
>
> How about "our documentation"?
>
> https://www.postgresql.org/docs/current/static/datatype-json.html

Indeed, I missed that. So that's broken...
-- 
Michael




[HACKERS] Referential integrity on large objects

2016-11-09 Thread Marc Balmer
I am looking for ways to ensure referential integrity on large objects. 
Something like having a column myoid in a table that holds the oid of a large 
object, and which throws an error when the referenced large object is 
unlinked. For example, "myoid references pg_largeobject(loid)", which does not 
work right now, as foreign keys to system catalogs are not supported. 

Do you have any idea if this could be accomplished? Would it be complicated? 
Could it be realised as a sponsored development? If so, what would be the 
estimated effort/cost? 

I'd be happy to get any hints at this. 

- mb



Re: [HACKERS] Bug in comparison of empty jsonb arrays to scalars

2016-11-09 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Nov 10, 2016 at 3:27 AM, Robert Haas  wrote:
>> https://www.postgresql.org/docs/current/static/datatype-json.html

> Indeed, I missed that. So that's broken...

Given that nobody actually cares what that sort order is, I think that
having to jump through hoops in pg_upgrade in order to fix it is not a
great tradeoff.  I suggest changing the documentation to match the code.

regards, tom lane




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 12:11 PM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila  wrote:
>> On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas  wrote:
>>> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
>>> I think we can give a brief explanation right in the code comment.  I
>>> referred to "decreasing the TIDs"; you are referring to "moving tuples
>>> around".  But I think that moving the tuples to different locations is
>>> not the problem.  I think the problem is that a tuple might be
>>> assigned a lower spot in the item pointer array
>>
>> I think we both understand the problem and it is just matter of using
>> different words.  I will go with your suggestion and will try to
>> slightly adjust the README as well so that both places use same
>> terminology.
>
> Yes, I think we're on the same page.

Some more review:

The API contract of _hash_finish_split seems a bit unfortunate.   The
caller is supposed to have obtained a cleanup lock on both the old and
new buffers, but the first thing it does is walk the entire new bucket
chain, completely ignoring the old one.  That means holding a cleanup
lock on the old buffer across an unbounded number of I/O operations --
which also means that you can't interrupt the query by pressing ^C,
because an LWLock (on the old buffer) is held.  Moreover, the
requirement to hold a lock on the new buffer isn't convenient for
either caller; they both have to go do it, so why not move it into the
function?  Perhaps the function should be changed so that it
guarantees that a pin is held on the primary page of the existing
bucket, but no locks are held.

Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
should it instead be trying to get the lock conditionally and
returning immediately if it doesn't get the lock?  Seems like a good
idea...

 * We're at the end of the old bucket chain, so we're done partitioning
 * the tuples.  Mark the old and new buckets to indicate split is
 * finished.
 *
 * To avoid deadlocks due to locking order of buckets, first lock the old
 * bucket and then the new bucket.

These comments have drifted too far from the code to which they refer.
The first part is basically making the same point as the
slightly-later comment /* indicate that split is finished */.

The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
coming to seem like a horrible idea to me.  That's not your fault - it
was like this before - but maybe in a followup patch we should
consider ripping all of that out and just calling MarkBufferDirty(),
ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
UnlockReleaseBuffer() as appropriate.  As far as I can see, the
current style is just obfuscating the code.

itupsize = new_itup->t_info & INDEX_SIZE_MASK;
new_itup->t_info &= ~INDEX_SIZE_MASK;
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
new_itup->t_info |= itupsize;

If I'm not mistaken, you could omit the first, second, and fourth
lines here and keep only the third one, and it would do exactly the
same thing.  The first line saves the bits in INDEX_SIZE_MASK.  The
second line clears the bits in INDEX_SIZE_MASK.   The fourth line
re-sets the bits that were originally saved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: psql \setfileref

2016-11-09 Thread Tom Lane
Pavel Stehule  writes:
> [ psql-setfileref-2016-10-11.patch ]

I haven't been paying any attention to this thread, but I got around to
looking at it finally.  I follow the idea of wanting to be able to shove
the contents of a file into a query literal, but there isn't much else
that I like about this patch.  In particular:

* I really dislike the notion of keying the behavior to a special type of
psql variable.  psql variables are untyped at the moment, and if we were
going to introduce typing, this wouldn't be what I'd want to use it for.
I really don't want to introduce typing and then invent one-off,
unextensible syntax like '^' prefixes to denote what type a variable is.

Aside from being conceptually a mess, I don't even find it particularly
convenient.  In the shell, if you want to source from a file, you write
"

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2016 at 4:54 PM, Peter Geoghegan  wrote:
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.

Also, logtape.c fragmentation made multiple merge pass cases
experience increased random I/O in a way that was only an accident of
our implementation. We've fixed that now, but that problem must have
added further cost that df700e6b40195d28dc764e0c694ac8cef90d4638
*masked* when it was commited in 2006. (I do think that the problem
with the merge heap maintenance fixed recently in
24598337c8d214ba8dcf354130b72c49636bba69 was the biggest problem that
the 2006 work masked, though).

-- 
Peter Geoghegan




Re: [HACKERS] C based plugins, clocks, locks, and configuration variables

2016-11-09 Thread Clifford Hammerschmidt
On Tue, Nov 8, 2016 at 2:58 PM, Craig Ringer 
wrote:

> 2ndQuadrant/bdr


That is similar. I'm not clear on the usage of OID for sequence (`
DirectFunctionCall1(nextval_oid, seqoid)`) ... does that imply a lock
around a sequence generation? also different is that your sequence doesn't
reset on the time basis, it ascends and wraps independently of the time.

(also, you appear to modulo against the max (2^n-1), not the cardinality
(2^n), ... should that be an & ... i.e. take SEQUENCE_BITS of 1 ->
MAX_SEQ_ID of ((1<<1)-1) = 1 -> (seq % 1) = {0} ... not {0,1} as expected;
(seq & 1) = {0,1} as expected)

We tried 64-bit values for ids (based on twitter's snowflake), but found
that time-replay would cause collisions. We had a server have its time
corrected, going backwards, by an admin; leading to duplicate ids being
generated, leading to a fun day of debugging and a hard lesson about our
assumption that time always increases over time. Using node_id doesn't
protect against this, since it is the same node creating the colliding ids
as the original ids. By extending the ids to include a significant amount
of randomness, and requiring a restart of the db for the time value to move
backwards (by latching onto the last seen time), we narrow the window for
collisions to close enough to zero that winning the lottery is far more
likely (http://preshing.com/20110504/hash-collision-probabilities/ has the
exact math). We also increase the time scale for id wrap around to long
past the likely life expectancy of the software we're building today.

-- 
Clifford Hammerschmidt, P.Eng.


[HACKERS] Re: [COMMITTERS] pgsql: pgbench: Allow the transaction log file prefix to be changed.

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 4:51 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> pgbench: Allow the transaction log file prefix to be changed.
>
> Perhaps the "logpath" buffer that the filename is constructed in
> needs to be made bigger.  64 bytes was obviously enough with the
> old pattern, but it's not with the new.

Oops, yes, that seems like a good idea.  How about 64 -> MAXPGPATH?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] C based plugins, clocks, locks, and configuration variables

2016-11-09 Thread Craig Ringer
On 10 November 2016 at 07:18, Clifford Hammerschmidt
 wrote:
>
> On Tue, Nov 8, 2016 at 2:58 PM, Craig Ringer 
> wrote:
>>
>> 2ndQuadrant/bdr
>
>
> That is similar. I'm not clear on the usage of OID for sequence
> (`DirectFunctionCall1(nextval_oid, seqoid)`) ... does that imply a lock
> around a sequence generation?

No.

> also different is that your sequence doesn't
> reset on the time basis, it ascends and wraps independently of the time.

Right.

> (also, you appear to modulo against the max (2^n-1), not the cardinality
> (2^n), ... should that be an & ... i.e. take SEQUENCE_BITS of 1 ->
> MAX_SEQ_ID of ((1<<1)-1) = 1 -> (seq % 1) = {0} ... not {0,1} as expected;
> (seq & 1) = {0,1} as expected)

Hm. I think you're right there.

> We tried 64-bit values for ids (based on twitter's snowflake), but found
> that time-replay would cause collisions. We had a server have its time
> corrected, going backwards, by an admin; leading to duplicate ids being
> generated, leading to a fun day of debugging and a hard lesson about our
> assumption that time always increases over time.

That's a good point, but it's just going to have to be a documented
limitation. BDR expects you to use NTP and slew time when needed
anyway.

> Using node_id doesn't
> protect against this, since it is the same node creating the colliding ids
> as the original ids. By extending the ids to include a significant amount of
> randomness, and requiring a restart of the db for the time value to move
> backwards (by latching onto the last seen time), we narrow the window for
> collisions to close enough to zero that winning the lottery is far more
> likely (http://preshing.com/20110504/hash-collision-probabilities/ has the
> exact math). We also increase the time scale for id wrap around to long past
> the likely life expectancy of the software we're building today.

It's a good idea. I like what you're doing. I've run into too many
sites that can't or won't use 128-bit generated values though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] C based plugins, clocks, locks, and configuration variables

2016-11-09 Thread Craig Ringer
On 10 November 2016 at 07:18, Clifford Hammerschmidt
 wrote:
>
> On Tue, Nov 8, 2016 at 2:58 PM, Craig Ringer 
> wrote:
>>
>> 2ndQuadrant/bdr
>
>
> That is similar. I'm not clear on the usage of OID for sequence
> (`DirectFunctionCall1(nextval_oid, seqoid)`) ... does that imply a lock
> around a sequence generation? also different is that your sequence doesn't
> reset on the time basis, it ascends and wraps independently of the time.

Meant to explain more here.

Most of the system identifies sequence relations by oid. All this does
is call nextval. By accepting and passing oid we reduce the number of
syscache/relcache lookups and memory allocations required to call
nextval vs calling it by name. That's about all, really.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke; Etsuro Fujita; Andres Freund
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai  wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in
> theory know that at planner time and optimize accordingly.  If the
> user says LIMIT twelve(), however, we will need to wait until
> execution time unless twelve() happens to be capable of being
> simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at
> execution (regardless of whether those values were also available
> during planning).  Those are, basically, two separate things, and this
> patch has enough to do just focusing on one of them.
>
OK, we need a private value in postgres_fdw to indicate whether LIMIT
and OFFSET were supplied at the planner stage. If they were, they have
to be checked against the ps_numTuples value supplied at the executor
stage.

I'll revise the patch.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
>  wrote:
> > 1. ps_numTuples is declared as long, however offset and count members in
> > LimitState struct and bound member in SortState struct is int64.  However
> > long on 32 bit machine may be 32 bits and thus I think tuples_needed which
> > is long may have overflow hazards as it may store int64 + int64.  I think
> > ps_numTuples should be int64.
> 
> I suggested long originally because that's what ExecutorRun() was
> using at the time.  It seems that it got changed to uint64 in
> 23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
> probably use uint64.
> 
> > 2. Robert suggested following in the previous discussion:
> > "For example, suppose we add a new PlanState member "long
> > numTuples" where 0 means that the number of tuples that will be needed
> > is unknown (so that most node types need not initialize it), a
> > positive value is an upper bound on the number of tuples that will be
> > fetched, and -1 means that it is known for certain that we will need
> > all of the tuples."
> >
> > We should have 0 for the default case so that we don't need to initialize it
> > at most of the places.  But I see many such changes in the patch.  I think
> > this is not possible here since 0 can be a legal user provided value which
> > cannot be set as a default (default is all rows).
> >
> > However do you think, can we avoid that? Is there any other way so that we
> > don't need every node having ps_numTuples to be set explicitly?
> 
> +1.
>
I thought we had to distinguish the case where LIMIT 0 is supplied.
However, in that case ExecLimit() never goes down to the child nodes,
so its ps_numTuples will not be referenced anywhere.

OK, I'll use uint64 for ps_numTuples, with 0 as the default value,
meaning that no specific number of rows was given.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Fwd: Re: [CORE] temporal tables (SQL2011)

2016-11-09 Thread Craig Ringer
On 8 Nov. 2016 15:11, "Craig Ringer"  wrote:
>
>
>
> On 7 November 2016 at 05:08, Stefan Scheid  wrote:
>>
>> Hi all,
>>
>> are there plans to introduce temporal tables?
>
> I don't know of anybody working on them, but someone else may. Try
searching the list archives.

I should've mentioned that one of the reasons it doesn't seem to be that
high on many people's priority lists is that it's fairly easy to implement
with triggers and updatable views. The performance cost is higher than
I'd expect from the same thing done as a built-in feature, but it
works well enough.

Many ORMs and application frameworks also offer similar capabilities at the
application level.

So I think temporal tables are one of those nice-to-haves that so far
people just find other ways of doing.


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Robert Haas
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan  wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not
actually related to the subject of this thread; it's an independent
change that you think is a good idea.  I reviewed the previous
discussion on this topic upthread, between you and Heikki, which seems
to me to contain more heat than light.  At least in my opinion, the
question is not whether a limit on the number of tapes is the best
possible system, but rather whether it's better than the status quo.
It's silly to refuse to make a simple change on the grounds that some
much more complex change might be better, because if somebody writes
that patch and it is better we can always revert 0001 then.  If 0001
involved hundreds of lines of invasive code changes, that argument
wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of
evidence that 0001 is a good idea which has been presented in this
forum is pretty near zero.  You've argued for it on theoretical
grounds several times, but theoretical arguments are not a substitute
for test results.  Therefore, I decided that the best thing to do was
test it myself.  I wrote a little patch to add a GUC for
max_sort_tapes, which actually turns out not to work as I thought:
setting max_sort_tapes = 501 seems to limit the highest tape number to
501 rather than the number of tapes to 501, so there's a sort of
off-by-one error.  But that doesn't really matter.  The patch is
attached here for the convenience of anyone else who may want to
fiddle with this.

Next, I tried to set things up so that I'd get a large enough number
of tapes for the cap to matter.  To do that, I initialized with
"pgbench -i --unlogged-tables -s 2" so that I had 2 billion
tuples.  Then I used this SQL query: "select sum(w+abalance) from
(select (aid::numeric * 7123000217)%10 w, * from
pgbench_accounts order by 1) x".  The point of the math is to perturb
the ordering of the tuples so that they actually need to be sorted
instead of just passed through unchanged. The use of abalance in the
outer sum prevents an index-only-scan from being used, which makes the
sort wider; perhaps I should have tried to make it wider still, but
this is what I did.  I wanted to have more than 501 tapes because,
obviously, a concern with a change like this is that things might get
slower in the case where it forces a polyphase merge rather than a
single merge pass. And, of course, I set trace_sort = on.

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG:  begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG:  switching to external sort with 937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG:  finished writing run 616 to tape 615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG:  using 246719 KB of memory for read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG:  performsort done (except 616-way final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41 s
2016-11-09 18:33:30 UTC [44026] LOG:  external sort ended, 6255145 disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed: 10538.06 s

And according to psql: Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again.  This time:

2016-11-09 19:05:22 UTC [44026] LOG:  begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG:  switching to external sort with 502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG:  finished writing run 577 to tape 75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG:  using 249594 KB of memory for read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG:  finished 77-way merge step: CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG:  performsort done (except 501-way final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 21:38:12 UTC [44026] LOG:  external sort ended, 6255484 disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed: 9170.62 s

And this one, according to psql: Time: 9170629.597 ms (02:32:50.630)

That looks very good.  On a test that runs for almost 3 hours, we
saved more than 20 minutes.  The overall runtime improvement is 23% in
a case where we would not expect this patch to do particularly well;
after all, without limiting the number of runs, we are able to
complete the sort with a single merge pass, whereas when we reduce the
number of runs, we now require a polyphase merge.  Nevertheless, we
come out way ahead, because the final merge pass gets way faster,
presumably because there are fewer tapes involved.  The 

Re: [HACKERS] Is user_catalog_table sensible for matviews?

2016-11-09 Thread Craig Ringer
On 10 November 2016 at 01:55, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 12:17 PM, Tom Lane  wrote:
>> The system will let you set the "user_catalog_table" reloption to "true"
>> on a materialized view.  Is this sensible, or is it a bug caused by the
>> fact that reloptions.c fails to distinguish matviews from heaps at all?
>>
>> If it is sensible, then I broke it in e3e66d8a9 ...
>
> I can understand what that combination of opens would mean from a
> semantic point of view, so I don't think it's insensible.  However, it
> doesn't seem like an important combination to support, and I suspect
> that the fact that we did was accidental.

I think it'll work sanely, but I don't see why it's worth having. User
catalogs are for data you'll want to see consistently during logical
decoding. I don't see why anyone's going to need a matview at that
point. Since it's also untested, I suggest disallowing user catalog
matviews.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Add support for SRF and returning composites to pl/tcl

2016-11-09 Thread Jim Nasby

On 11/8/16 8:33 AM, Tom Lane wrote:

As things stand in HEAD, the behavior is about the same, but the error
messages are not --- in one case they mention triggers and of course the
other doesn't.  There are a couple of other minor things in the way of
unifying the two hunks of code, so I concluded it probably wasn't worth
the trouble.  But feel free to take another look if it bugs you.


I had to add a bit of cruft to pltcl_build_tuple_result but it's not 
that bad. tg_tupdesc could potentially be eliminated, but I don't know 
if it's really worth it.


Note that this does change some of the trigger error messages, but I 
don't think that's really an issue?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c
index b0d9e41..25d959e 100644
--- a/src/pl/tcl/pltcl.c
+++ b/src/pl/tcl/pltcl.c
@@ -301,7 +301,8 @@ static void pltcl_set_tuple_values(Tcl_Interp *interp, 
 const char *arrayname,
 static Tcl_Obj *pltcl_build_tuple_argument(HeapTuple tuple, TupleDesc tupdesc);
 static HeapTuple pltcl_build_tuple_result(Tcl_Interp *interp,
 					  Tcl_Obj **kvObjv, int kvObjc,
-					  pltcl_call_state *call_state);
+					  pltcl_call_state *call_state,
+					  TupleDesc tg_tupdesc);
 static void pltcl_init_tuple_store(pltcl_call_state *call_state);
 
 
@@ -966,7 +967,7 @@ pltcl_func_handler(PG_FUNCTION_ARGS, pltcl_call_state *call_state,
 			throw_tcl_error(interp, prodesc->user_proname);
 
 		tup = pltcl_build_tuple_result(interp, resultObjv, resultObjc,
-									   call_state);
+									   call_state, NULL);
 		retval = HeapTupleGetDatum(tup);
 	}
 	else
@@ -1000,8 +1001,6 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, pltcl_call_state *call_state,
 	const char *result;
 	int			result_Objc;
 	Tcl_Obj   **result_Objv;
-	Datum	   *values;
-	bool	   *nulls;
 
 	/* Connect to SPI manager */
 	if (SPI_connect() != SPI_OK_CONNECT)
@@ -1219,70 +1218,9 @@ pltcl_trigger_handler(PG_FUNCTION_ARGS, pltcl_call_state *call_state,
 				 errmsg("could not split return value from trigger: %s",
 						utf_u2e(Tcl_GetStringResult(interp)))));
 
-	if (result_Objc % 2 != 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_E_R_I_E_TRIGGER_PROTOCOL_VIOLATED),
-				 errmsg("trigger's return list must have even number of elements")));
-
-	values = (Datum *) palloc0(tupdesc->natts * sizeof(Datum));
-	nulls = (bool *) palloc(tupdesc->natts * sizeof(bool));
-	memset(nulls, true, tupdesc->natts * sizeof(bool));
-
-	for (i = 0; i < result_Objc; i += 2)
-	{
-		char	   *ret_name = utf_u2e(Tcl_GetString(result_Objv[i]));
-		char	   *ret_value = utf_u2e(Tcl_GetString(result_Objv[i + 1]));
-		int			attnum;
-		Oid			typinput;
-		Oid			typioparam;
-		FmgrInfo	finfo;
-
-		/************************************************************
-		 * Get the attribute number
-		 *
-		 * We silently ignore ".tupno", if it's present but doesn't match
-		 * any actual output column.  This allows direct use of a row
-		 * returned by pltcl_set_tuple_values().
-		 ************************************************************/
-		attnum = SPI_fnumber(tupdesc, ret_name);
-		if (attnum == SPI_ERROR_NOATTRIBUTE)
-		{
-			if (strcmp(ret_name, ".tupno") == 0)
-				continue;
-			ereport(ERROR,
-					(errcode(ERRCODE_UNDEFINED_COLUMN),
-					 errmsg("unrecognized attribute \"%s\"",
-							ret_name)));
-		}
-		if (attnum <= 0)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot set system attribute \"%s\"",
-							ret_name)));
-
-		/************************************************************
-		 * Lookup the attribute type's

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas  wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not
> actually related to the subject of this thread; it's an independent
> change that you think is a good idea.  I reviewed the previous
> discussion on this topic upthread, between you and Heikki, which seems
> to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable
with the quasi-arbitrary choice of constant, rather than disagreeing
with the general idea of a cap. Or maybe he thought I didn't go far
enough, and would have preferred removing polyphase merge completely. I
think that removing polyphase merge would be an orthogonal change to
this, though.

> Now, on the other hand, as far as I can see, the actual amount of
> evidence that 0001 is a good idea which has been presented in this
> forum is pretty near zero.  You've argued for it on theoretical
> grounds several times, but theoretical arguments are not a substitute
> for test results.

See the illustration in TAOCP, vol III, page 273 in the second edition
-- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
that it's actually a real-world benchmark.

I guess I felt that no one ever argued that as many tapes as possible
was sound on any grounds, even theoretical, and so didn't feel
obligated to test it until asked to do so. I think that the reason
that a cap like this didn't go in around the time that the growth
logic went in (2006) was because nobody followed up on it. If you look
at the archives, there is plenty of discussion of a cap like this at
the time.

> That looks very good.  On a test that runs for almost 3 hours, we
> saved more than 20 minutes.  The overall runtime improvement is 23% in
> a case where we would not expect this patch to do particularly well;
> after all, without limiting the number of runs, we are able to
> complete the sort with a single merge pass, whereas when we reduce the
> number of runs, we now require a polyphase merge.  Nevertheless, we
> come out way ahead, because the final merge pass gets way faster,
> presumably because there are fewer tapes involved.  The first test
> does a 616-way final merge and takes 6184.34 seconds to do it.  The
> second test does a 501-way final merge and takes 4519.31 seconds to
> do.  This increased final merge speed accounts for practically all of
> the speedup, and the reason it's faster pretty much has to be that
> it's merging fewer tapes.

It's CPU cache efficiency -- has to be.

> That, in turn, happens for two reasons.  First, because limiting the
> number of tapes increases slightly the memory available for storing
> the tuples belonging to each run, we end up with fewer runs in the
> first place.  The number of runs drops from from 616 to 577, about a
> 7% reduction.  Second, because we have more runs than tapes in the
> second case, it does a 77-way merge prior to the final merge.  Because
> of that 77-way merge, the time at which the second run starts
> producing tuples is slightly later.  Instead of producing the first
> tuple at 70:47.71, we have to wait until 75:72.22.  That's a small
> disadvantage in this case, because it's hypothetically possible that a
> query like this could have a LIMIT and we'd end up worse off overall.
> However, that's pretty unlikely, for three reasons.  Number one, LIMIT
> isn't likely to be used on queries of this type in the first place.
> Number two, if it were used, we'd probably end up with a bounded sort
> plan which would be way faster anyway.  Number three, if somehow we
> still sorted the data set we'd still win in this case if the limit
> were more than about 20% of the total number of tuples.  The much
> faster run time to produce the whole data set is a small price to pay
> for possibly needing to wait a little longer for the first tuple.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one
> thing, which is that I think the comments are way off-base. Peter
> writes: $When allowedMem is significantly lower than what is required
> for an internal sort, it is unlikely that there are benefits to
> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
> I'm pretty sure that's totally wrong, first of all because commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
> precisely the thing which this comment says we shouldn't

It's more complicated than that. As I said, I think that Knuth
basically had it right with his sweet spot of 7. I think that commit
df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
because a one-pass merge avoided certain overheads not inherent to
polyphase merge, like all that memory accounting stuff, extra palloc()
traffic, etc. The expanded use of per tape buffering we have even in
multi-pass cases likely makes that much less true for us these days.

The reason I haven't actually gone right back down to 7 with this cap
is that it's 

Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-11-09 Thread Andreas Karlsson

On 11/09/2016 06:54 AM, Michael Paquier wrote:

It seems to me that this patch is missing something... To begin with,
src/test/ssl/ServerSetup.pm should be patched so as the new SSL
configuration is reloaded after pg_ctl reload, and not after an
instance restart. That's straight-forward:
--- a/src/test/ssl/ServerSetup.pm
+++ b/src/test/ssl/ServerSetup.pm
@@ -96,7 +96,7 @@ sub configure_test_server_for_ssl
close HBA;
 }

-# Change the configuration to use given server cert file, and restart
+# Change the configuration to use given server cert file, and reload
 # the server so that the configuration takes effect.
 sub switch_server_cert
 {
@@ -115,6 +115,6 @@ sub switch_server_cert
print SSLCONF "ssl_crl_file='root+client.crl'\n";
close SSLCONF;

-   # Stop and restart server to reload the new config.
-   $node->restart;
+   # Reload the new configuration set.
+   $node->reload;
 }

Once I did that, half of the tests are failing. And I would have
expected all of them to work properly.


Those tests fail because listen_addresses cannot be changed on reload, 
so none of the test cases can even connect to the database. When I 
hacked ServerSetup.pm to set the correct listen_addresses before 
starting, all tests pass.


It is a bit annoying that postgres will refuse to start if pg_hba.conf 
contains hostssl. Maybe this is something we should also fix in this 
patch: now that SSL can be enabled after startup, it becomes more 
useful not to bail on hostssl. What do you think?


I will look into writing a cleaner patch for ServerSetup.pm some time 
later this week.


Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow backslash continuations in \set expressions

2016-11-09 Thread Rafia Sabih
On Wed, Nov 9, 2016 at 3:28 PM, Fabien COELHO  wrote:

>
> +1.  My vote is for backslash continuations.
>>
>
> I'm fine with that!
>
> --
> Fabien.
>

Looks good to me also.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


Re: [HACKERS] macaddr 64 bit (EUI-64) datatype support

2016-11-09 Thread Haribabu Kommi
On Fri, Nov 4, 2016 at 11:09 AM, Haribabu Kommi 
wrote:

>
>
> On Tue, Oct 25, 2016 at 11:40 PM, Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>
>> On 10/25/16 1:38 AM, Haribabu Kommi wrote:
>> > Here I attached the first version of patch that supports both EUI-48 and
>> > EUI-64 type
>> > Mac addresses with a single datatype called macaddr. This is an variable
>> > length
>> > datatype similar like inet. It can store both 6 and 8 byte addresses.
>> > Variable length
>> > type is used because in case in future, if MAC address gets enhanced,
>> > still this type
>> > can support without breaking DISK compatibility.
>>
>> Since the world knows the 6-byte variant as MAC-48, shouldn't it be
>> renamed to macaddr48 or even mac48?
>
>
> Yes. Before doing this change, it is better to confirm the approach and
> then do all the changes.
>
> 1. Does everyone agrees that renaming the existing datatype without
> changing the OID?
>
> 2. The old macaddress datatype rename to mac48 macaddr48
> or macaddr6 or mac6.
>
> 3. Add the new datatype with the name that supports both 48 bit
> and 64 bit MAC address.
>
> 4. The new datatype is of variable length datatype similar like INET,
> so it can handle any future changes.
>

I didn't hear any objections to the approach of supporting MACADDR
with 8-byte storage. I will go with the proposed design, with "mac48"
as the datatype name for the old macaddr.


Regards,
Hari Babu
Fujitsu Australia

On Fri, Nov 4, 2016 at 11:09 AM, Haribabu Kommi 
wrote:

>
>
> On Tue, Oct 25, 2016 at 11:40 PM, Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>
>> On 10/25/16 1:38 AM, Haribabu Kommi wrote:
>> > Here I attached the first version of patch that supports both EUI-48 and
>> > EUI-64 type
>> > Mac addresses with a single datatype called macaddr. This is an variable
>> > length
>> > datatype similar like inet. It can store both 6 and 8 byte addresses.
>> > Variable length
>> > type is used because in case in future, if MAC address gets enhanced,
>> > still this type
>> > can support without breaking DISK compatibility.
>>
>> Since the world knows the 6-byte variant as MAC-48, shouldn't it be
>> renamed to macaddr48 or even mac48?
>
>
> Yes. Before doing this change, it is better to confirm the approach and
> then do all the changes.
>
> 1. Does everyone agrees that renaming the existing datatype without
> changing the OID?
>
> 2. The old macaddress datatype rename to mac48 macaddr48
> or macaddr6 or mac6.
>
> 3. Add the new datatype with the name that supports both 48 bit
> and 64 bit MAC address.
>
> 4. The new datatype is of variable length datatype similar like INET,
> so it can handle any future changes.
>
>
>
>> > Currently the patch lacks of documentation. Comments?
>>
>> For patches like this, it would be good if you included a mock commit
>> message so that someone who comes in late knows what's going on.
>
>
> Thanks, I will do it from now onward.
>
>
> Regards,
> Hari Babu
> Fujitsu Australia
>



-- 
Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Bug in comparison of empty jsonb arrays to scalars

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 7:37 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Thu, Nov 10, 2016 at 3:27 AM, Robert Haas  wrote:
>>> https://www.postgresql.org/docs/current/static/datatype-json.html
>
>> Indeed, I missed that. So that's broken...
>
> Given that nobody actually cares what that sort order is, I think that
> having to jump through hoops in pg_upgrade in order to fix it is not a
> great tradeoff.  I suggest changing the documentation to match the code.

Yes, definitely.
=# create table json_data (a jsonb);
CREATE TABLE
=# INSERT INTO json_data values ('{}'::jsonb), ('[]'::jsonb),
('null'::jsonb), ('true'::jsonb), ('1'::jsonb), ('""'::jsonb);
INSERT 0 6
=# SELECT * FROM json_data ORDER BY 1 DESC;
  a
--
 {}
 true
 1
 ""
 null
 []
(6 rows)
So that's object > boolean > number > string > NULL > array.

And attached is a patch.
-- 
Michael
diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 3cf78d6..b2688ff 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -542,7 +542,7 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu
     The btree ordering for jsonb datums is seldom
     of great interest, but for completeness it is:
 
-Object > Array > Boolean > Number > String > Null
+Object > Boolean > Number > String > Null > Array
 
 Object with n pairs > object with n - 1 pairs
 



Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
>  wrote:
>> Now simulate server crash using using "pkill -9 postgres".
>>
>> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
>> logfile start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> knizhnik@knizhnik:~/dtm-data$ psql postgres
>> psql (10devel)
>> Type "help" for help.
>>
>> postgres=# select * from foo;
>> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
>> No such file or directory
>>
>> knizhnik@knizhnik:~/dtm-data$ ls fs
>> PG_10_201611041
>> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>>
>> So all relation directory is removed!
>> It happens only for first table created in tablespace.
>> If you create table in Postgres data directory everything is ok: first
>> segment of relation is truncated but not deleted.
>
> Whoa.  There should be an _init fork that doesn't get removed, and
> without removing the _init fork you shouldn't be able to remove the
> directory that contains it.

Hm... I cannot reproduce what you see on Linux or macOS. Perhaps you
have a local standby that also points to this tablespace?
-- 
Michael




Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread konstantin knizhnik

On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:

> 
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository.
Please note that you should create a new database and tablespace to reproduce 
this issue.
So actually the whole sequence is

mkdir fs
initdb -D pgsql 
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;


> -- 
> Michael





Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
 wrote:
> No, it is latest sources from Postgres repository.
> Please notice that you should create new database and tablespace to reproduce 
> this issue.
> So actually the whole sequence is
>
> mkdir fs
> initdb -D pgsql
> pg_ctl -D pgsql -l logfile start
> psql postgres
> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
> # set default_tablespace=fs;
> # create unlogged table foo(x integer);
> # insert into foo values(generate_series(1,10));
> # ^D
> pkill -9 postgres
> pg_ctl -D pgsql -l logfile start
> # select * from foo;

OK, I understood what I was missing. This can be reproduced with
wal_level = minimal. When using hot_standby, things work properly.
-- 
Michael




Re: [HACKERS] Patch: Implement failover on libpq connect level.

2016-11-09 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Mithun Cy
> Among the remaining things I have worked on failover to new master idea.
> Below patch implement that idea. This is taken from Victors patch but
> rewritten by me to do some cleanup. As in Victor's patch we have a new
> connection parameter "target_server_type", It can take 2 values 1. "any"
> 2. "master" with DEFAULT as "any". If it's has the value "any" we can connect
> to any of the host server (both master(primary) and slave(standby)). If
> the value is "master" then we try to connect to master(primary) only.
> NOTE: Parameter name is inspired and taken from PostgreSql JDBC Driver
>  .

I'm interested in reviewing this patch (but I haven't read it yet; I'm reading 
Robert's patch now).  Are you planning a new CommitFest entry?

Why don't you add "standby" and "prefer_standby" as target_server_type 
values?  Are you thinking that those values would be useful with a load 
balancing feature?


> The main difference between Victor's and this new patch is Default value
> of parameter target_server_type. In Victor's patch if number of host in
> connection string is 1 then default value is "any" (This was done to make
> sure old psql connect to standby as it is now). If it is greater than 1
> then default value is set as "master". For me this appeared slightly
> inconsistent having default value as "any" for any number of connection
> appeared more appropriate which is also backward compatible. And, if user
> want failover to master he should ask for it by setting
> target_server_type=master in connection string.

That's sensible, I agree.


Regards
Takayuki Tsunakawa




Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-11-09 Thread Etsuro Fujita

On 2016/11/10 5:17, Tom Lane wrote:

Etsuro Fujita  writes:

[ foreign_plan_cache_inval_v6.patch ]



I looked through this a bit.


Thank you for working on this!


A point that doesn't seem to have been
discussed anywhere in the thread is why foreign tables should deserve
exact tracking of dependencies, when we have not bothered with that
for most other object types.  Schemas, for example, we're happy to just
zap the whole plan cache for.  Attempting to do exact tracking is
somewhat expensive and bug-prone --- the length of this thread speaks
to the development cost and bug hazard.  Meanwhile, nobody has questioned
the cost of attaching more PlanInvalItems to a plan (and the cost of the
extra catalog lookups this patch does to create them).  For most plans,
that's nothing but overhead because no invalidation will actually occur
in the life of the plan.

I think there's a very good argument that we should just treat any inval
in pg_foreign_data_wrapper as a reason to blow away the whole plan cache.
People aren't going to alter FDW-level options often enough to make it
worth tracking things more finely.  Personally I'd put pg_foreign_server
changes in the same boat, though maybe those are changed slightly more
often.  If it's worth doing anything more than that, it would be to note
which plans mention foreign tables at all, and not invalidate those that
don't.  We could do that with, say, a hasForeignTables bool in
PlannedStmt.


I agree on that point.


That leaves the question of whether it's worth detecting table-level
option changes this way, or whether we should just handle those by forcing
a relcache inval in ATExecGenericOptions, as in Amit's original patch in
<5702298d.4090...@lab.ntt.co.jp>.  I kind of like that approach; that
patch was short and sweet, and it put the cost on the unusual path (ALTER
TABLE) not the common path (every time we create a query plan).


I think it'd be better to detect table-level option changes, because that 
could allow us to do more; we could avoid invalidating cached plans that 
need not be invalidated, by tracking the plan dependencies more 
exactly, ie, avoiding collection of the dependencies of foreign tables in a 
plan that are in dead subqueries or excluded by constraint exclusion. 
The proposed patch doesn't do that, though.  I think updates on 
pg_foreign_table would be more frequent, so it might be worth 
complicating the code.  But the relcache-inval approach couldn't do 
that, IIUC.


Best regards,
Etsuro Fujita






Re: [HACKERS] Bug in comparison of empty jsonb arrays to scalars

2016-11-09 Thread Ali Akbar
2016-11-10 13:54 GMT+07:00 Michael Paquier :

> On Thu, Nov 10, 2016 at 7:37 AM, Tom Lane  wrote:
> > Given that nobody actually cares what that sort order is, I think that
> > having to jump through hoops in pg_upgrade in order to fix it is not a
> > great tradeoff.  I suggest changing the documentation to match the code.
>

Don't you think that in this case we should match the sort order of JavaScript?


> Yes, definitely.
> =# create table json_data (a jsonb);
> CREATE TABLE
> =# INSERT INTO json_data values ('{}'::jsonb), ('[]'::jsonb),
> ('null'::jsonb), ('true'::jsonb), ('1'::jsonb), ('""'::jsonb);
> INSERT 0 6
> =# SELECT * FROM json_data ORDER BY 1 DESC;
>   a
> --
>  {}
>  true
>  1
>  ""
>  null
>  []
> (6 rows)
> So that's object > boolean > integer > string > NULL > array.
>

> a = [{}, [], null, true, 1, '""']
[ {}, [], null, true, 1, '""' ]
> a.sort()
[ [], '""', 1, {}, null, true ]
> a.reverse()
[ true, null, {}, 1, '""', [] ]

So in this case it's boolean > NULL > Object > integer > string > array
(tried in Chromium 53, Firefox 49 and Node v6.9.1)

When I tried to search the ECMA standard for this behavior, I found
this: http://blog.rodneyrehm.de/archives/14-Sorting-Were-Doing-It-Wrong.html.
There are problems with automatic conversion in JavaScript, like this:

> a = [{}, [], null, true, 1, 'someotherstring']
[ {}, [], null, true, 1, 'someotherstring' ]
> a.sort().reverse()
[ true, 'someotherstring', null, {}, 1, [] ]


versus this:

> a = [{}, [], null, true, 1, 'SomeOtherString']
[ {}, [], null, true, 1, 'SomeOtherString' ]
> a.sort().reverse()
[ true, null, {}, 'SomeOtherString', 1, [] ]


and this:

> a = [{}, [], null, true, 1, '2']
[ {}, [], null, true, 1, '2' ]
> a.sort().reverse()
[ true, null, {}, '2', 1, [] ]


So we can't replicate the JavaScript sort order without emulating those quirks.

Regards,
Ali Akbar


Re: [HACKERS] Re: [COMMITTERS] pgsql: pgbench: Allow the transaction log file prefix to be changed.

2016-11-09 Thread Tom Lane
Robert Haas  writes:
> On Wed, Nov 9, 2016 at 4:51 PM, Tom Lane  wrote:
>> Perhaps the "logpath" buffer that the filename is constructed in
>> needs to be made bigger.  64 bytes was obviously enough with the
>> old pattern, but it's not with the new.

> Oops, yes, that seems like a good idea.  How about 64 -> MAXPGPATH?

If we want to stick with the fixed-size-buffer-on-stack approach,
that would be the thing to use.  psprintf is another possibility,
though that would add a malloc/free cycle.

regards, tom lane




Re: [HACKERS] Re: [COMMITTERS] pgsql: pgbench: Allow the transaction log file prefix to be changed.

2016-11-09 Thread Michael Paquier
On Thu, Nov 10, 2016 at 12:08 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, Nov 9, 2016 at 4:51 PM, Tom Lane  wrote:
>>> Perhaps the "logpath" buffer that the filename is constructed in
>>> needs to be made bigger.  64 bytes was obviously enough with the
>>> old pattern, but it's not with the new.
>
>> Oops, yes, that seems like a good idea.  How about 64 -> MAXPGPATH?
>
> If we want to stick with the fixed-size-buffer-on-stack approach,
> that would be the thing to use.  psprintf is another possibility,
> though that would add a malloc/free cycle.

MAXPGPATH is used quite a lot in the binaries of src/bin/, so just using
that seems fine to me.  My 2c.
-- 
Michael




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2016 at 6:57 PM, Robert Haas  wrote:
> I guess that's possible, but the problem with polyphase merge is that
> the increased I/O becomes a pretty significant cost in a hurry.

Not if you have a huge RAID array. :-)

Obviously I'm not seriously suggesting that we revise the cap from 500
to 7. We're only concerned about the constant factors here. There is a
clearly a need to make some simplifying assumptions. I think that you
understand this very well, though.

> Maybe another way of putting this is that, while there's clearly a
> benefit to having some kind of a cap, it's appropriate to pick a large
> value, such as 500.  Having no cap at all risks creating many extra
> tapes that just waste memory, and also risks an unduly
> cache-inefficient final merge.  Reigning that in makes sense.
> However, we can't reign it in too far or we'll create slow polyphase
> merges in case that are reasonably likely to occur in real life.

I completely agree with your analysis.

-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Amit Kapila
On Thu, Nov 10, 2016 at 2:57 AM, Robert Haas  wrote:
>
> Some more review:
>
> The API contract of _hash_finish_split seems a bit unfortunate.   The
> caller is supposed to have obtained a cleanup lock on both the old and
> new buffers, but the first thing it does is walk the entire new bucket
> chain, completely ignoring the old one.  That means holding a cleanup
> lock on the old buffer across an unbounded number of I/O operations --
> which also means that you can't interrupt the query by pressing ^C,
> because an LWLock (on the old buffer) is held.
>

I see the problem you are talking about, but it was done to ensure the
locking order: old bucket first and then new bucket; otherwise there
could be a deadlock risk.  However, I think we can avoid holding the
cleanup lock on the old bucket while we scan the new bucket to form a
hash table of TIDs.

>  Moreover, the
> requirement to hold a lock on the new buffer isn't convenient for
> either caller; they both have to go do it, so why not move it into the
> function?
>

Yeah, we can move the locking of the new bucket entirely into the new function.

>  Perhaps the function should be changed so that it
> guarantees that a pin is held on the primary page of the existing
> bucket, but no locks are held.
>

Okay, so we can change the locking order as follows:
a. ensure a cleanup lock on the old bucket and check if the (old)
bucket has a pending split.
b. if there is a pending split, release the lock on the old bucket,
but not the pin.

below steps will be performed by _hash_finish_split():

c. acquire the read content lock on the new bucket and form the hash
table of TIDs; in the process of forming the hash table, we need to
traverse the whole bucket chain.  While traversing the bucket chain,
release the lock on the previous bucket (both lock and pin if not a
primary bucket page).
d. After the hash table is formed, acquire the cleanup lock on the old
and new buckets conditionally; if we are not able to get the cleanup
lock on either, then just return from there.
e. Perform the split operation.
f. release the lock and pin on the new bucket
g. release the lock on the old bucket.  We don't want to release the
pin on the old bucket as the caller has ensured it before passing it to
_hash_finish_split(), so releasing the pin should be the responsibility
of the caller.

Now, both the callers need to ensure that they restart the operation
from the beginning, as after we release the lock on the old bucket,
somebody might have split the bucket.

Does the above change in locking strategy sound okay?

> Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
> should it instead be trying to get the lock conditionally and
> returning immediately if it doesn't get the lock?  Seems like a good
> idea...
>

Yeah, we can take a cleanup lock conditionally, but it would waste the
effort of forming the hash table if we don't get the cleanup lock
immediately.  Considering incomplete splits to be a rare operation,
maybe the wasted effort is okay, but I am not sure.  Don't you
think we should avoid that effort?

>  * We're at the end of the old bucket chain, so we're done partitioning
>  * the tuples.  Mark the old and new buckets to indicate split is
>  * finished.
>  *
>  * To avoid deadlocks due to locking order of buckets, first lock the old
>  * bucket and then the new bucket.
>
> These comments have drifted too far from the code to which they refer.
> The first part is basically making the same point as the
> slightly-later comment /* indicate that split is finished */.
>

I think we can remove the second comment /* indicate that split is
finished */.  Apart from that, I think the comment you have quoted is
in line with the current code.  At that point, we have finished
partitioning the tuples, so I don't understand what makes you think it
has drifted from the code.  Is it because of the second part of the
comment (To avoid deadlocks ...)?  If so, I think we can move it a few
lines down to where we actually perform the locking on the old and new
buckets.

> The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
> coming to seem like a horrible idea to me.  That's not your fault - it
> was like this before - but maybe in a followup patch we should
> consider ripping all of that out and just calling MarkBufferDirty(),
> ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
> UnlockReleaseBuffer() as appropriate.  As far as I can see, the
> current style is just obfuscating the code.
>

Okay, we can do some study and try to change it in the way you are
suggesting.  It seems this has been partially derived from the btree
code, which has the function _bt_relbuf().  I am sure that we won't
need _hash_wrtbuf after the WAL patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan  wrote:
>> Now, on the other hand, as far as I can see, the actual amount of
>> evidence that 0001 is a good idea which has been presented in this
>> forum is pretty near zero.  You've argued for it on theoretical
>> grounds several times, but theoretical arguments are not a substitute
>> for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition
> -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
> that it's actually a real-world benchmark.

I don't have that publication, and I'm guessing that's not based on
PostgreSQL's implementation.  There's no substitute for tests using
the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one
>> thing, which is that I think the comments are way off-base. Peter
>> writes: $When allowedMem is significantly lower than what is required
>> for an internal sort, it is unlikely that there are benefits to
>> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
>> I'm pretty sure that's totally wrong, first of all because commit
>> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
>> precisely the thing which this comment says we shouldn't
>
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap
> is that it's possible that the added I/O costs outweigh the CPU costs
> in extreme cases, even though I think that polyphase merge doesn't
> have all that much to do with I/O costs, even with its 1970s
> perspective. Knuth doesn't say much about I/O costs -- it's more about
> using an extremely small amount of memory effectively (minimizing CPU
> costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
> (benefitting from a few dozen or hundred instead) seems more possible
> with the improved merge heap maintenance logic added recently, where
> there could be perhaps hundreds of runs merged with very low CPU cost
> in the event of presorted input (or, input that is inversely
> logically/physically correlated). That would be true because we'd only
> examine the top of the heap throughout, and so I/O costs may matter much
> more.
>
> Depending on the exact details, I bet you could see a benefit with
> only 7 tapes due to CPU cache efficiency in a case like the one you
> describe. Perhaps when sorting integers, but not when sorting collated
> text. There are many competing considerations, which I've tried my
> best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that
the increased I/O becomes a pretty significant cost in a hurry.
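The scaling behind that cost can be sketched with a toy balanced-merge model (this is not PostgreSQL's actual polyphase schedule, just an illustration): with merge order F, each pass reduces the run count by a factor of F, and every pass rereads and rewrites the entire data set, so total I/O grows with the pass count. Using the 544 initial runs from these tests:

```python
import math

def merge_passes(initial_runs: int, merge_order: int) -> int:
    """Passes needed to reduce initial_runs to a single run when each
    pass merges up to merge_order runs at a time (balanced merge)."""
    if initial_runs <= 1:
        return 0
    return math.ceil(math.log(initial_runs, merge_order))

# Every extra pass is a full read+write of the data set, so cutting the
# merge order from 500 to 7 roughly doubles the merge-phase I/O here.
for order in (7, 100, 500):
    print(order, "tapes ->", merge_passes(544, order), "passes")
```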
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG:  switching to external sort with
101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG:  finished writing run 544 to tape
49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG:  finished 51-way merge step: CPU:
user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG:  performsort done (except 100-way
final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58
s
2016-11-10 01:45:40 UTC [48551] LOG:  external sort ended, 6255949
disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed:
9771.49 s

This is already worse than max_sort_tapes = 501, though the total
runtime is still better than no cap (the time-to-first-tuple is way
worse, though).   I'm going to try max_sort_tapes = 10 next, but I
think the basic pattern is already fairly clear.  As you reduce the
cap on the number of tapes, (a) the time to build the initial runs
doesn't change very much, (b) the time to perform the final merge
decreases significantly, and (c) the time 

Re: [HACKERS] Declarative partitioning - another take

2016-11-09 Thread Amit Langote
On 2016/11/10 2:00, Robert Haas wrote:
> In this latest patch set:
> 
> src/backend/parser/parse_utilcmd.c:3194: indent with spaces.
> +*rdatum;

This one I will fix.

> 
> With all patches applied, "make check" fails with a bunch of diffs
> that look like this:
> 
>   Check constraints:
> - "pt1chk2" CHECK (c2 <> ''::text)
>   "pt1chk3" CHECK (c2 <> ''::text)

Hm, I can't seem to reproduce this one.  Is it perhaps possible that you
applied the patches on top of some other WIP patches or something?

> And the pg_upgrade test also fails:
>
> Done
> + pg_dumpall -f /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
> + pg_ctl -m fast stop
> waiting for server to shut down done
> server stopped
> + set +x
>
> Files /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql and
> /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql differ
> dumps were not identical
> make[2]: *** [check] Error 1
> make[1]: *** [check-pg_upgrade-recurse] Error 2
> make: *** [check-world-src/bin-recurse] Error 2
> [rhaas pgsql]$ diff
> /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump1.sql
> /Users/rhaas/pgsql/src/bin/pg_upgrade/tmp_check/dump2.sql
> 6403d6402
> < c text
> 8736,8737c8735
> < CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double
> precision)),
> < CONSTRAINT bmerged CHECK (((b)::double precision > (1)::double
> precision))
> ---
> > CONSTRAINT blocal CHECK (((b)::double precision < (1000)::double
> precision))

This one too I can't seem to reproduce.

> For future revisions, please make sure "make check-world" passes before
posting.

OK, I will make sure.  FWIW, make check-world passes here after applying
the patches posted yesterday.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-11-09 Thread Peter Eisentraut
On 11/8/16 6:43 PM, Andreas Karlsson wrote:
> - A worry is that it might get a bit confusing to have both the future 
> catalog pg_sequence and the view pg_sequences.

We already have this in other cases: pg_index/pg_indexes,
pg_user_mapping/pg_user_mappings.  It's an established naming system by now.

> - I think it would be useful to include is_cycled in the view.

It's there under the name "cycle".

> - When creating a temporary sequence and then running "SELECT * FROM
> pg_sequences" in another session I get the following error.
> 
> ERROR:  cannot access temporary tables of other sessions

Fixed that by adding pg_is_other_temp_schema() to the view definition.
We use that in the information schema but not in the system views so
far.  That might be worth looking into.

> - Shouldn't last_value be NULL directly after we have created the 
> sequence but nobody has called nextval() yet?
> 
> - I noticed that last_value includes the cached values, but that also 
> seems to me like the correct thing to do.

The documentation now emphasizes that this is the value stored on disk.
This matches what Oracle does.

> - I do not like the name of the new function, lastval(regclass). I think 
> like you suggested it would be better with something more verbose. 
> sequence_lastval()? sequence_last_value()?

changed

> - There is an XXX comment still in the code. It is about the name of the 
> lastval1() function.

fixed

> - The documentation does not mention the last_value column.

fixed

> - The extra empty line after "" does not fit with the formatting 
> of the rest of the SGML file.

fixed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 189910cc0f470f11ec1f76073acb7b91258c76fd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 9 Nov 2016 12:00:00 -0500
Subject: [PATCH v2] Add pg_sequences view

Like pg_tables, pg_views, and others, this view contains information
about sequences in a way that is independent of the system catalog
layout but more comprehensive than the information schema.
---
 doc/src/sgml/catalogs.sgml   | 97 
 src/backend/catalog/system_views.sql | 17 ++
 src/backend/commands/sequence.c  | 44 ++-
 src/include/catalog/pg_proc.h|  4 +-
 src/include/commands/sequence.h  |  1 +
 src/test/regress/expected/rules.out  | 14 +
 src/test/regress/expected/sequence.out   | 16 ++
 src/test/regress/expected/sequence_1.out | 16 ++
 src/test/regress/sql/sequence.sql|  7 +++
 9 files changed, 212 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index bac169a..fcc9038 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7395,6 +7395,11 @@ System Views
  
 
  
+  pg_sequences
+  sequences
+ 
+
+ 
   pg_settings
   parameter settings
  
@@ -9135,6 +9140,98 @@ pg_seclabels Columns
   
  
 
+ 
+  pg_sequences
+
+  
+   pg_sequences
+  
+
+  
+   The view pg_sequences provides access to
+   useful information about each sequence in the database.
+  
+
+  
+   pg_sequences Columns
+
+   
+
+ 
+  Name
+  Type
+  References
+  Description
+ 
+
+
+ 
+  schemaname
+  name
+  pg_namespace.nspname
+  Name of schema containing sequence
+ 
+ 
+  sequencename
+  name
+  pg_class.relname
+  Name of sequence
+ 
+ 
+  sequenceowner
+  name
+  pg_authid.rolname
+  Name of sequence's owner
+ 
+ 
+  start_value
+  bigint
+  
+  Start value of the sequence
+ 
+ 
+  min_value
+  bigint
+  
+  Minimum value of the sequence
+ 
+ 
+  max_value
+  bigint
+  
+  Maximum value of the sequence
+ 
+ 
+  increment_by
+  bigint
+  
+  Increment value of the sequence
+ 
+ 
+  cycle
+  boolean
+  
+  Whether the sequence cycles
+ 
+ 
+  cache_size
+  bigint
+  
+  Cache size of the sequence
+ 
+ 
+  last_value
+  bigint
+  
+  The last sequence value written to disk.  If caching is used,
+   this value can be greater than the last value handed out from the
+   sequence.
+ 
+
+   
+  
+ 
+
  
   pg_settings
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ada2142..e011af1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -158,6 +158,23 @@ CREATE VIEW pg_indexes AS
  LEFT JOIN pg_tablespace T ON (T.oid = I.reltablespace)
 WHERE C.relkind IN ('r', 'm') AND I.relkind = 'i';
 
+CREATE OR REPLACE VIEW pg_sequences AS
+SELECT
+N.nspname AS schemaname,
+C.relname AS sequencename,
+

Re: [HACKERS] [bug fix] Cascading standby cannot catch up and get stuck emitting the same message repeatedly

2016-11-09 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> OK.  I agree that's a problem.  However, your patch adds zero new comment
> text while removing some existing comments, so I can't easily tell how it
> solves that problem or whether it does so correctly.  Even if I were smart
> enough to figure it out, I wouldn't want to rely on the next person also
> being that smart.  This is obviously a subtle problem in tricky code, so
> a clear explanation of the fix seems like a very good idea.

The comment describes what the code is trying to achieve.  Actually, I just 
imitated the code and comment of later major releases.  The only difference 
between later releases and my patch (for 9.2) is whether the state is stored in 
XLogReaderStruct or as global variables.  Below is the comment from 9.6, where 
the second paragraph describes what the two nested if conditions mean.  The 
removed comment lines are what became irrelevant, which is also not present in 
later major releases.

/*
 * Since child timelines are always assigned a TLI greater than their
 * immediate parent's TLI, we should never see TLI go backwards across
 * successive pages of a consistent WAL sequence.
 *
 * Sometimes we re-read a segment that's already been (partially) read. 
So
 * we only verify TLIs for pages that are later than the last remembered
 * LSN.
 */

Regards
Takayuki Tsunakawa




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: pgbench: Allow the transaction log file prefix to be changed.

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 10:08 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, Nov 9, 2016 at 4:51 PM, Tom Lane  wrote:
>>> Perhaps the "logpath" buffer that the filename is constructed in
>>> needs to be made bigger.  64 bytes was obviously enough with the
>>> old pattern, but it's not with the new.
>
>> Oops, yes, that seems like a good idea.  How about 64 -> MAXPGPATH?
>
> If we want to stick with the fixed-size-buffer-on-stack approach,
> that would be the thing to use.  psprintf is another possibility,
> though that would add a malloc/free cycle.

I don't think the performance cost of a malloc/free cycle would be
noticeable, but I don't see much point in it, either.  It's likely
that, if you hadn't noticed this by inspection, we could have gone a
few years before anyone ran afoul of the 64-character limit.  Now,
MAXPGPATH is 1024, and I do not know too many people who have a real
need for pathnames over 1024 characters.  I think we may as well just
keep it simple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-11-09 Thread Michael Paquier
On Wed, Nov 9, 2016 at 11:32 PM, Kuntal Ghosh
 wrote:
> On Fri, Nov 4, 2016 at 1:32 PM, Michael Paquier
>  wrote:
>> Thank you for the new patch. This will be hopefully the last round of
>> reviews, we are getting close to something that has an acceptable
>> shape.
>
> Thanks a lot for reviewing the patch. Based on your review, I've attached the
> updated patch along with few comments.

Thanks for the new version. pg_xlogdump is really helpful now for debugging.

>> I haven't performed any tests with the patch, and that's all I have
>> regarding the code. With that done we should be in good shape
>> code-speaking I think...
>
> I've done a fair amount of testing which includes regression tests
> and manual creation of inconsistencies in the page. I've also found the
> reason behind the inconsistency in the BRIN revmap page.
> A BRIN revmap page doesn't have a standard page layout, and it doesn't update
> pd_upper and pd_lower in its operations either. But during WAL
> insertions, it uses
> REGBUF_STANDARD to register a reference in the WAL record. Hence, when we
> restore the image before the consistency check, RestoreBlockImage fills the
> space between pd_upper and pd_lower (the page hole) with zeros. I've posted this as a
> separate thread.
> https://www.postgresql.org/message-id/flat/CAGz5QCJ%3D00UQjScSEFbV%3D0qO5ShTZB9WWz_Fm7%2BWd83zPs9Geg%40mail.gmail.com#CAGz5QCJ=00UQjScSEFbV=0qo5shtzb9wwz_fm7+wd83zps9...@mail.gmail.com

Nice to have spotted the inconsistency. This tool is going to be useful..
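The effect described above can be modeled with a toy page (a hypothetical byte layout, not the real PageHeaderData): comparing a restored full-page image against the current page is only valid after masking the "hole" between pd_lower and pd_upper on both sides, and it goes wrong for pages, like BRIN revmap pages, whose header bounds don't describe a real hole.

```python
def mask_page_hole(page: bytes, pd_lower: int, pd_upper: int) -> bytes:
    """Zero the hole between pd_lower and pd_upper before comparing."""
    masked = bytearray(page)
    masked[pd_lower:pd_upper] = b"\x00" * (pd_upper - pd_lower)
    return bytes(masked)

current = b"HDR" + b"junk" + b"ITEMS"       # hole still holds stale bytes
restored = b"HDR" + b"\x00" * 4 + b"ITEMS"  # restore zeroes the hole

print(current == restored)                  # raw compare: spurious mismatch
print(mask_page_hole(current, 3, 7) ==
      mask_page_hole(restored, 3, 7))       # masked compare: consistent
```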

I have spent a couple of hours playing with the patch, and worked on
it a bit more with a couple of minor changes:
- In gindesc.c, the if blocks are more readable with brackets.
- Addition of a comment when info is set, to mention that this is done
at the beginning of the routine so as callers of XLogInsert() can pass
the flag for consistency checks.
- apply_image should be reset in ResetDecoder().
- The BRIN problem is here:
2016-11-10 12:24:10 JST [65776]: [23-1] db=,user=,app=,client= FATAL:
Inconsistent page found, rel 1663/16385/30625, forknum 0, blkno 1
2016-11-10 12:24:10 JST [65776]: [24-1] db=,user=,app=,client=
CONTEXT:  xlog redo at 0/9BD31148 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1
2016-11-10 12:24:10 JST [65776]: [25-1] db=,user=,app=,client=
WARNING:  buffer refcount leak: [4540] (rel=base/16385/30625,
blockNum=1, flags=0x9380, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)
Now the buffer refcount leak is not normal! The safest thing to do
here is to release the buffer once a copy of it has been taken, and
the leak goes away when calling FATAL to report the inconsistency.
- When checking for XLR_CHECK_CONSISTENCY, better to add a != 0 to get
a boolean out of it.
- guc_malloc() should be done as late as possible; this simplifies the
code and prevents the memory leak that your patch currently causes
when there is an error. (I have finally put my finger on what was
itching me here).
- In btree_mask, the lookup of BTP_DELETED can be drastically simplified.
- I wondered also about putting assign_wal_consistency and
check_wal_consistency in a separate file, say xlogparams.c, concluding
that the current code does nothing bad either, even though it adds rmgr.h
to the list of headers in guc.c.
- Some comment blocks are longer than 72~80 characters.
- Typos!

With the patch for BRIN applied, I am able to get installcheck-world
working with wal_consistency = all and a standby doing the consistency
checks behind. Adding wal_consistency = all in PostgresNode.pm, the
recovery tests are passing. This patch is switched as "Ready for
Committer". Thanks for completing this effort begun 3 years ago!
-- 
Michael
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..57660d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2476,6 +2476,35 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_consistency (string)
+  
+   wal_consistency configuration parameter
+  
+  
+  
+   
+This parameter is used to check the consistency of WAL records, i.e.,
+whether the WAL records are inserted and applied correctly. When
+wal_consistency is enabled for a WAL record, it
+stores a full-page image along with the record. When a full-page image
+arrives during redo, it compares against the current page to check 
whether
+both are consistent.
+   
+
+   
+By default, this setting does not contain any value. To check
+all records written to the write-ahead log, set this parameter to
+all. To check only some records, specify a
+comma-separated list of resource managers. The resource managers
+which are currently supported are heap2, heap,
+btree, gin, gist,
+spgist, sequence and brin. Only
+superusers can change this setting.

Re: [HACKERS] Incorrect XLogRegisterBuffer flag for revmapbuf in brin

2016-11-09 Thread Michael Paquier
On Wed, Nov 9, 2016 at 9:33 PM, Kuntal Ghosh  wrote:
> In brin_doupdate(line 290), REGBUF_STANDARD is used to register
> revmap buffer reference in WAL record. But, revmap buffer page doesn't
> have a standard page layout and it doesn't update pd_upper and
> pd_lower as well.
>
> Either we should register revmapbuf as following:
> XLogRegisterBuffer(1, revmapbuf, 0);

As this is not a standard buffer, let's do it this way. This issue has
been captured by the WAL consistency patch.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-11-09 Thread Andreas Karlsson

On 11/10/2016 05:29 AM, Peter Eisentraut wrote:

On 11/8/16 6:43 PM, Andreas Karlsson wrote:

- A worry is that it might get a bit confusing to have both the future
catalog pg_sequence and the view pg_sequences.


We already have this in other cases: pg_index/pg_indexes,
pg_user_mapping/pg_user_mappings.  It's an established naming system by now.


Then I am fine with it.


- I think it would be useful to include is_cycled in the view.


It's there under the name "cycle".


You are right, my bad, I managed to confuse myself.


- Shouldn't last_value be NULL directly after we have created the
sequence but nobody has called nextval() yet?

- I noticed that last_value includes the cached values, but that also
seems to me like the correct thing to do.


The documentation now emphasizes that this is the value stored on disk.
This matches what Oracle does.


We also store is_called on disk; I think that if is_called is false then
last_value should be NULL. Either that or we should add is_called.


I will give the patch another review some time this week when I can find
the time.


Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_hba_file_settings view patch

2016-11-09 Thread Haribabu Kommi
On Mon, Nov 7, 2016 at 3:36 PM, Michael Paquier 
wrote:

> On Mon, Nov 7, 2016 at 12:36 PM, Haribabu Kommi
>  wrote:
> > The added regression test fails for the cases where the server is loaded
> > with different pg_hba.conf rules during installcheck verification. Updated
> > patch is attached with those tests removed.
>
> That's not a full review as I just glanced at this patch a couple of
> seconds...
>
>  #include "utils/guc.h"
> +#include "utils/jsonb.h"
>  #include "utils/lsyscache.h"
> You don't need to include this header when using arrays.
>

Thanks for the review. Fixed in the updated patch;
additional error messages are also added.


> Implementing a test case is possible as well using the TAP
> infrastructure. You may want to look at it and help folks testing the
> patch more easily with a set of configurations in pg_hba.conf that
> cover a maximum of code paths in your patch.
>

Added a TAP test under the src/test folder to cover as much of the changed
code as possible.

Regards,
Hari Babu
Fujitsu Australia


pg_hba_rules_tap_tests_1.patch
Description: Binary data


pg_hba_rules_4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-11-09 Thread Michael Paquier
On Wed, Nov 9, 2016 at 7:46 PM, Andreas Karlsson  wrote:
> Those tests fail due to that listen_addresses cannot be changed on reload so
> none of the test cases can even connect to the database. When I hacked
> ServerSetup.pm to set the correct listen_address before starting all tests
> pass.

Hm... listen_addresses remains constant at 127.0.0.1, and setting
listen_addresses = '*' does not work either. Perhaps I am missing
something?

> It is a bit annoying that if pg_hba.conf contains hostssl then postgres will
> refuse to start. Maybe this is something we should also fix in this patch
> since, now that we can enable SSL after starting, it becomes more useful
> not to bail on hostssl. What do you think?

I forgot that... There is the same problem today when updating
postgresql.conf and restarting the server if there is an hostssl
entry. Do you have in mind to relax things? It seems to be that the
safest bet is to not reload parameters if ssl is switched from on to
off and if pg_hba.conf has a hostssl entry, right? That complicates
the code though.

> I will look into writing a cleaner patch for ServerSetup.pm some time later
> this week.

Thanks. Making the restart/reload OS-dependent will be necessary.
src/test/ssl can run on Windows.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Radix tree for character conversion

2016-11-09 Thread Kyotaro HORIGUCHI
Hello, thank you for polishing this.

At Wed, 9 Nov 2016 02:19:01 +0100, Daniel Gustafsson  wrote in 
<80f34f25-bf6d-4bcd-9c38-42ed10d3f...@yesql.se>
> > On 08 Nov 2016, at 17:37, Peter Eisentraut 
> >  wrote:
> > 
> > On 10/31/16 12:11 PM, Daniel Gustafsson wrote:
> >> I took a small stab at doing some cleaning of the Perl scripts, mainly 
> >> around
> >> using the more modern (well, modern as in +15 years old) form for open(..),
> >> avoiding global filehandles for passing scalar references and enforcing use
> >> strict.  Some smaller typos and fixes were also included.  It seems my 
> >> Perl has
> >> become a bit rusty so I hope the changes make sense.  The produced files 
> >> are
> >> identical with these patches applied, they are merely doing cleaning as 
> >> opposed
> >> to bugfixing.
> >> 
> >> The attached patches are against the 0001-0006 patches from Heikki and you 
> >> in
> >> this series of emails, the separation is intended to make them easier to 
> >> read.
> > 
> > Cool.  See also here:
> > https://www.postgresql.org/message-id/55E52225.4040305%40gmx.net

> Nice, not having hacked much Perl in quite a while I had all but forgotten
> about perlcritic.

I tried it on CentOS 7. Installation failed, saying that
Module::Build is too old. It was yum-installed, so I removed it and
installed it with CPAN. That failed again with many 'Could not create
MYMETA files' errors. Then I tried to install CPAN::Meta, and it failed
saying that CPAN::Meta::YAML is too *new*. That sucks.

So your patch is greatly helpful. Thank you.

| -my @mapnames = map { s/\.map//; $_ } values %plainmaps;
| +my @mapnames = map { my $m = $_; $m =~ s/\.map//; $m } values %plainmaps;

It surprised me to know that perlcritic does such things.

> Running it on the current version of the patchset yields mostly warnings on
> string values used in the require “convutils.pm” statement.  There were 
> however
> two more interesting reports: one more open() call not using the three
> parameter form and an instance of map which alters the input value. 

Sorry for overlooking it.

> The latter
> is not causing an issue since we don’t use the input list past the map but
> fixing it seems like good form.

Agreed.

> Attached is a patch that addresses the perlcritic reports (running without any
> special options).

Thanks. The attached patch contains the patch by perlcritic.

0001,2,3 are Heikki's patch that are not modified since it is
first proposed. It's a bit too big so I don't attach them to this
mail (again).

https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd...@iki.fi

0004 is radix-tree stuff, applies on top of the three patches
above.

There's a hidden fifth patch which of 20MB in size. But it is
generated by running make in the Unicode directory.

[$(TOP)]$ ./configure ...
[$(TOP)]$ make
[Unicode]$ make
[Unicode]$ make distclean
[Unicode]$ git add .
[Unicode]$ commit 
=== COMMIT MESSAGE
Replace map files with radix tree files.

These encodings no longer use the former map files; they use the new radix
tree files instead. All existing authority files in this directory are removed.
===
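As a rough sketch of what the generated radix maps do, and of the round-trip check that "make mapcheck" performs against the flat .map files (a simplified Python model; the real output is compact generated C tables, and the entries below are made up for illustration):

```python
def build_radix(flat_map):
    """Two-level radix tree: outer key = high byte, inner key = low byte."""
    tree = {}
    for code, target in flat_map.items():
        tree.setdefault(code >> 8, {})[code & 0xFF] = target
    return tree

def radix_lookup(tree, code):
    """Return the mapped codepoint, or None if the code is unmapped."""
    return tree.get(code >> 8, {}).get(code & 0xFF)

# Hypothetical two-byte-code entries, not taken from any real encoding map.
flat = {0x8140: 0x3000, 0x8141: 0x3001, 0xA1A1: 0xFF01}
tree = build_radix(flat)

# The checker's core idea: every flat-map entry must be reachable
# through the tree, and unmapped codes must miss.
assert all(radix_lookup(tree, c) == t for c, t in flat.items())
assert radix_lookup(tree, 0x9999) is None
```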

regards,
>From e6718001cbe3f1937e4f052c66bff68f7217c43c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 27 Oct 2016 14:19:47 +0900
Subject: [PATCH 4/5] Make map generators to generate radix tree files.

This introduces radix tree conversion for map-referring (non-arithmetic)
code conversions. The UCS_to*.pl files generate _radix.map files in addition
to the .map files currently generated. These _radix.map files are
referenced by the conversion procs instead of the .map files.

Radix trees are not easily verified, so a checker is provided. "make
mapcheck" builds and runs it. It verifies radix maps against
corresponding .map files.

Now 'make distclean' removes authority files and unused .map files
since they should not be contained in the source archive. On the other
hand 'make maintainer-clean' leaves them and removes all map
files. This seems somewhat strange but it comes from the special
characteristics of this directory.

All Perl scripts have been converted to modern style: they use strict,
use the modern style of file handles, and accept reindentation by
pgperltidy. The Perl scripts in this commit have had these changes applied.

The first time "make all" runs, some UCS_to*.pl scripts run multiple
times because of a limitation in gmake.
---
 src/backend/utils/mb/Unicode/Makefile  |   78 +-
 src/backend/utils/mb/Unicode/UCS_to_BIG5.pl|   33 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl  |   36 +-
 .../utils/mb/Unicode/UCS_to_EUC_JIS_2004.pl|   69 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl  |   42 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl  |   12 +-
 src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl  |   23 +-
 src/backend/utils/mb/Unicode/UCS_to_GB18030.pl |   32 +-
 src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl   |   10 +-