Re: [HACKERS] increasing the default WAL segment size

2017-07-05 Thread Beena Emerson
Hello,



On Tue, Mar 28, 2017 at 1:06 AM, Beena Emerson  wrote:
> Hello,
>
> On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut
>  wrote:
>>
>> At this point, I suggest splitting this patch up into several
>> potentially less controversial pieces.
>>
>> One big piece is that we currently don't support segment sizes larger
>> than 64 GB, for various internal arithmetic reasons.  Your patch appears
>> to address that.  So I suggest isolating that.  Assuming it works
>> correctly, I think there would be no great concern about it.
>>
>> The next piece would be making the various tools aware of varying
>> segment sizes without having to rely on a built-in value.
>>
>> The third piece would then be the rest that allows you to set the size
>> at initdb
>>
>> If we take these in order, we would make it easier to test various sizes
>> and see if there are any more unforeseen issues when changing sizes.  It
>> would also make it easier to do performance testing so we can address
>> the original question of what the default size should be.
>
>
> PFA the patches divided into 3 parts:
>
> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
> the internal representation of max_wal_size and min_wal_size to MB.

Already committed.

>
> 03-modify-tools.patch - Makes XLogSegSize a variable, currently set to
> XLOG_SEG_SIZE, and modifies the tools to fetch the size instead of using
> the built-in value.
>

The updated 03-modify-tools_v2.patch has the following changes:
 - Rebased over current HEAD
 - Improved comments
 - Added error messages where applicable.
 - Replaced XLOG_SEG_SIZE in the tools and xlog_internal.h with
XLogSegSize. XLOG_SEG_SIZE is the wal_segment_size the binaries were
compiled with, while XLogSegSize is the wal_segment_size of the target
server on which the tool is run. When the binaries used and the target
server are compiled with different wal_segment_size values, the
calculations would be wrong and the tool would crash. To avoid that, all
calculations in the tools should use XLogSegSize.
 - pg_waldump: fuzzy_open_file is split into two functions,
open_file_in_directory and identify_target_directory, so that the code can
be reused when determining the XLogSegSize from the WAL file header (a
rough sketch of this follows below).
 - The IsValidXLogSegSize macro is moved from patch 04 to this one so that
it can be used to validate the size in all the tools.
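
The header-based size detection mentioned above could look roughly like
this (a minimal sketch, assuming the usual frontend includes; not the
patch itself -- it simply reads xlp_seg_size from the long page header at
the start of a segment file):

#include "postgres_fe.h"
#include "access/xlog_internal.h"

/*
 * Read the WAL segment size recorded in a WAL file's long page header,
 * instead of trusting the compiled-in XLOG_SEG_SIZE.
 */
static bool
read_wal_segment_size(const char *path, uint32 *seg_size)
{
    XLogLongPageHeaderData longhdr;
    FILE       *fp = fopen(path, "rb");

    if (fp == NULL)
        return false;

    /* the first page of every segment carries the long header */
    if (fread(&longhdr, 1, sizeof(longhdr), fp) != sizeof(longhdr))
    {
        fclose(fp);
        return false;
    }
    fclose(fp);

    *seg_size = longhdr.xlp_seg_size;
    return true;
}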


> 04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and
> makes related changes. Updates pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE
> instead of XLOG_SEG_SIZE.

The 04-initdb-walsegsize_v2.patch has the following improvements:
- Rebased over new 03 patch
- Pass the wal-segsize initdb option as a command-line option rather
than in an environment variable.
- Since the new function check_wal_size had only two checks and was
used once, moved the code into ReadControlFile, where it is used, and
removed the function.
- Improved comments and added validations where required.
- Use DEFAULT_XLOG_SEG_SIZE to set min_wal_size and max_wal_size,
instead of the value 16.
- Use XLogSegMaxSize and XLogSegMinSize to calculate the range of the
GUC wal_segment_size, instead of 16 - INT_MAX (a small sketch of the
check follows below).
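
The range check referenced in the last point boils down to something like
the following (macro names as in the patches; the limit values here are
placeholders, not necessarily what the patch uses):

/* placeholder limits; the patch derives the real ones elsewhere */
#define XLogSegMinSize  (1024 * 1024)           /* 1 MB, assumed */
#define XLogSegMaxSize  (1024 * 1024 * 1024)    /* 1 GB, assumed */

/* what an IsValidXLogSegSize-style check amounts to */
static bool
wal_segment_size_is_valid(uint32 size)
{
    /* must be a power of two within the allowed range */
    return (size & (size - 1)) == 0 &&
        size >= XLogSegMinSize &&
        size <= XLogSegMaxSize;
}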


>
>>
>> One concern I have is that your patch does not contain any tests.  There
>> should probably be lots of tests.
>
>
> 05-initdb_tests.patch adds TAP tests to initialize a cluster with different
> wal_segment_size values and then check the config values. What other tests do
> you have in mind? Checking the various tools?
>
>




-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


04-initdb-walsegsize_v2.patch
Description: Binary data


03-modify-tools_v2.patch
Description: Binary data


01-add-XLogSegmentOffset-macro_rebase.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Amit Kapila
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy  wrote:
> On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila  wrote:
>> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy  
>> wrote:
>>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy  
>>> wrote:
 On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:
>
> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
> detect an autoprewarm process already running.  I'd want this to
> return NULL or an error if called for a 2nd time.

 We log instead of erroring out because we check only after launching the
 worker, and inside the worker. One solution could be that, similar to
 autoprewarm_dump_now(), autoprewarm_start_worker() could init shared
 memory and check whether we can launch the worker in the backend itself. I
 will try to fix the same.
>>>
>>> I have fixed it now as follows
>>>
>>> +Datum
>>> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
>>> +{
>>> +   pid_t   pid;
>>> +
>>> +   init_apw_shmem();
>>> +   pid = apw_state->bgworker_pid;
>>> +   if (pid != InvalidPid)
>>> +   ereport(ERROR,
>>> +   (errmsg("autoprewarm worker is already running under PID 
>>> %d",
>>> +   pid)));
>>> +
>>> +   autoprewarm_dump_launcher();
>>> +   PG_RETURN_VOID();
>>> +}
>>>
>>> In the backend itself, we check whether an autoprewarm worker is already
>>> running, and only then start the worker. There is a possibility that, if
>>> this function is executed concurrently when there is no worker already
>>> running (which I think is not a normal usage), both calls will say they
>>> have successfully launched the worker even though only one could have
>>> actually done so (the other will log and silently die).
>>
>> Why can't we close this remaining race condition?  Basically, if we
>> just perform all of the actions in this function under the lock and
>> autoprewarm_dump_launcher waits till the autoprewarm worker has
>> initialized the bgworker_pid, then there won't be any remaining race
>> condition.  I think if we don't close this race condition, it will be
>> unpredictable whether the user will get the error or there will be
>> only a server log for the same.
>
> Yes, I can make autoprewarm_dump_launcher wait until the launched
> bgworker sets its pid, but this requires one more synchronization
> variable between the launcher and the worker. More than that, I see that
> ShmemInitStruct() and LWLockAcquire() can throw an ERROR (restarting the
> worker), and they need to be called before setting the pid. So I thought it
> wouldn't be harmful to let two concurrent calls launch workers and just
> log the failures. Please let me know if I need to rethink this.
>

I don't know whether you need to rethink it, but as presented in the
patch, the spec of the API seems unclear to me.  As this is a function
exposed to the user, I think its behavior should be well defined.  If you
don't think changing the code has many advantages, then at the very least
update the docs to indicate the expected behavior of the API.  Also, I
think it is better to add a few comments in the code about the
unpredictable behavior in case of a race condition and the reason for it.
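
For reference, the "launcher waits for the worker" approach discussed above
could lean on the stock bgworker API; a rough sketch with assumed names
(this is not the patch, and by itself it does not fully close the race --
the worker still has to publish its PID in shared memory under
apw_state->lock):

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

static void
start_autoprewarm_worker_and_wait(void)
{
    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;
    pid_t       pid;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");        /* assumed */
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "autoprewarm_main"); /* assumed */
    snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm worker");
    worker.bgw_notify_pid = MyProcPid;  /* needed for the wait below */

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR,
                (errmsg("could not register autoprewarm background worker")));

    /* block until the postmaster has actually started the worker */
    if (WaitForBackgroundWorkerStartup(handle, &pid) != BGWH_STARTED)
        ereport(ERROR,
                (errmsg("could not start autoprewarm background worker")));

    ereport(LOG,
            (errmsg("autoprewarm worker started with PID %d", (int) pid)));
}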


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Amit Kapila
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy  wrote:
> On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila  wrote:
>>
>> Few comments on the latest patch:
>>
>> 1.
>> + LWLockRelease(&apw_state->lock);
>> + if (!is_bgworker)
>> + ereport(ERROR,
>> + (errmsg("could not perform block dump because dump file is being
>> used by PID %d",
>> + apw_state->pid_using_dumpfile)));
>> + ereport(LOG,
>> + (errmsg("skipping block dump because it is already being performed by PID 
>> %d",
>> + apw_state->pid_using_dumpfile)));
>>
>>
>> The above code looks confusing as both the messages are saying the
>> same thing in different words.  I think you keep one message (probably
>> the first one) and decide error level based on if this is invoked for
>> bgworker.  Also, you can move LWLockRelease after error message,
>> because if there is any error, then it will automatically release all
>> lwlocks.
>
> ERROR is used for autoprewarm_dump_now, which is called from the backend.
> LOG is used for the bgworker.
> The wordings are chosen to match whether failing to dump is acceptable in
> the given context or not. For the bgworker it is acceptable: we are not
> particular about the start time of a dump, only about the interval
> between dumps, so someone else already doing it is fine. But a caller of
> autoprewarm_dump_now might be particular about the start time of the
> dump, so we throw an error, making them retry.
>
> The wordings were suggested by Robert (below snapshot) in one of his
> previous comments, and I also agree with them. If you think I should
> reconsider this and I am missing something, I am open to suggestions.
>

Not an issue, if you and Robert think having two different messages is
better, then let's leave it.  One improvement we could do here is to
initialize a boolean global variable for AutoPrewarmWorker, then use
it wherever required.


>> 3.
>> +dump_now(bool is_bgworker)
>> {
>> ..
>> + if (buf_state & BM_TAG_VALID)
>> + {
>> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
>> + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
>> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
>> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
>> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
>> + ++num_blocks;
>> + }
>> ..
>> }
>>I think there is no use of writing Unlogged buffers unless the dump is
>>for the shutdown.  You might want to use BufferIsPermanent to detect the same.
>
> -- I do not think that is true; pages of unlogged tables are also
> read into buffers for read-only purposes. So if we miss dumping them
> when we shut down, then the previous dump should be used.
>

I am not able to understand what you want to say. Unlogged tables
should be empty in case of crash recovery.  Also, we never flush
unlogged buffers except at a shutdown checkpoint; see BufferAlloc and
in particular the comment below:

* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint.  Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
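
Concretely, the rule being suggested can be expressed with the buf_state
flags that dump_now() already inspects; a minimal sketch only
("include_unlogged" is a hypothetical parameter for the shutdown dump, not
something from the posted patch):

#include "storage/buf_internals.h"

/*
 * Dump a buffer only if its tag is valid and it is either permanent or we
 * are producing the shutdown dump.
 */
static inline bool
apw_should_dump_buffer(uint32 buf_state, bool include_unlogged)
{
    if ((buf_state & BM_TAG_VALID) == 0)
        return false;
    return (buf_state & BM_PERMANENT) != 0 || include_unlogged;
}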


>> 4.
>> +static uint32
>> +dump_now(bool is_bgworker)
>> {
>> ..
>> + for (num_blocks = 0, i = 0; i < NBuffers; i++)
>> + {
>> + uint32 buf_state;
>> +
>> + if (!is_bgworker)
>> + CHECK_FOR_INTERRUPTS();
>> ..
>> }
>>
>> Why is the check for interrupts done only in non-bgworker cases?
>
> -- autoprewarm_dump_now is called directly from the backend. In that
> case, we have to handle the signals registered for the backend in dump_now().
> For the bgworker, dump_block_info_periodically, the caller of dump_now(),
> handles SIGTERM and SIGUSR1, which are the ones we are interested in.
>

Okay, but what about the signal handler for SIGUSR1
(procsignal_sigusr1_handler)?  Have you verified that it will never
set the InterruptPending flag?

>
>> 6.
>> +dump_now(bool is_bgworker)
>> {
>> ..
>> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
>> + apw_state->pid_using_dumpfile = InvalidPid;
>> ..
>> }
>>
>> How will pid_using_dumpfile be set to InvalidPid in the case of error
>> for non-bgworker cases?
>
> -- I have a PG_TRY()/PG_CATCH() in autoprewarm_dump_now; I think that is okay.
>

Okay, then that will work.

>>
>> 7.
>> +dump_now(bool is_bgworker)
>> {
>> ..
>> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
>>
>> ..
>> }
>>
>> How will transient_dump_file_path be unlinked in the case of error in
>> durable_rename?  I think you need to use PG_TRY..PG_CATCH to ensure
>> same?
>
> -- If durable_rename fails, then the basic functionality of
> autoprewarm is failing, so I want it to be an ERROR. I do not want to
> remove the temp file, as we always truncate it before reusing it. So
> I think there is no need to catch every ERROR in dump_now() just to
> remove the temp file.
>

I am not getting your argument here.  Do you mean to say that if
writing to a transient file fails then we should remove the
transient 

Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-05 Thread Amit Kapila
On Thu, Jul 6, 2017 at 2:40 AM, AP  wrote:
> On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote:
>> >> >  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
>> >> > ---------+--------------+----------------+--------------+--------------+------------+------------+------------------
>> >> >        3 |     10485760 |        2131192 |           66 |            0 | 2975444240 |          0 | 1065.19942179026
>> >> > (1 row)
> ...
>> >> > And I do appear to have an odd percentage of free space. :)
>>
>> Are you worried about "unused_pages"? If so, then this is not a major
>
> Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :)
> Shouldn't that number be < 100?
>

Yes, there seems to be some gotcha in free percent calculation.  Is it
possible for you to debug or in some way share the test?

>> reason to worry, because these are probably freed overflow pages which
>> can be used in future.  In the hash index, when we free the overflow
>> pages, they are not returned back to OS, rather they are tracked in
>> the index as unused pages which will get used when required in future.
>
>> >> It looks like Vacuum hasn't been triggered.
>>
>> Vacuum won't be triggered on insert load.  I think that is one of the
>> reasons why in your initial copy, you might have got the error.  We
>> had some discussion in the past to trigger Vacuum on insert heavy
>> workloads [1], but the patch still didn't get committed.  I think if
>> that patch or some other form of that patch gets committed, it will
>> help the workload what you are trying here.
>
> Well, if this is the cause of my little issue, it might be nice. ATM
> my import script bombs out on errors (that I've duplicated! :) It took
> 11 hours but it bombed) and it sounds like I'll need to do a manual
> VACUUM before it can be run again.
>

Yeah, I think after manual vacuum you should be able to proceed.

> The stats you were looking for before are:
>
> # select * from  pgstathashindex('link_datum_id_idx');
>  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | 
> live_items | dead_items |   free_percent
> -+--++--+--+++--
>3 |  8559258 |4194176 |  128 |  1926502 | 
> 3591812743 |  0 | 942.873199357466
> (1 row)
>
> # select 4194176.0/128/8;
>?column?
> ---
>  4095.8750
> (1 row)
>

From the above stats, it is clear that you have hit the maximum number of
overflow pages we can support today.  Now, one can argue that we should
increase the limit of overflow pages in hash indexes, which we can do, but
I think you could again hit such a problem after some more time.  So at
this stage, there are two possibilities for you: (a) run a manual VACUUM
in between, or (b) create the index after the bulk load.  In general, (b)
is the better way to do bulk loading.  Note that here again the
free_percent seems to be wrong.
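
To put a number on "maximum number of overflow pages", a back-of-envelope
calculation based on the stats above (assuming 8 kB pages, a 4 kB bitmap
per bitmap page, and the 128 bitmap pages in the report being the current
ceiling):

#include <stdio.h>

int
main(void)
{
    const long  bits_per_bitmap_page = 4096 * 8;    /* 32768 overflow pages tracked per bitmap page */
    const long  max_bitmap_pages = 128;             /* as reported above */

    /* ~4194304; the reported overflow_pages of 4194176 is right at this wall */
    printf("max overflow pages ~= %ld\n",
           bits_per_bitmap_page * max_bitmap_pages);
    return 0;
}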


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-05 Thread Noah Misch
On Mon, Jul 03, 2017 at 09:12:01AM +0530, Amit Kapila wrote:
> While discussing the behavior of hash indexes with Bruce in the nearby
> thread [1], it has been noticed that hash index on unlogged tables
> doesn't behave as expected.  Prior to 10, it has the different set of
> problems (mainly because hash indexes are not WAL-logged) which were
> discussed on that thread [1], however when I checked, it doesn't work
> even for 10.  Below are steps to reproduce the problem.
> 
> 1. Setup master and standby
> 2. On the master, create unlogged table and hash index.
>2A.  Create unlogged table t1(c1 int);
>2B.  Create hash index idx_t1_hash on t1 using hash(c1);
> 3. On Standby, try selecting data,
>select * from t1;
>ERROR:  cannot access temporary or unlogged relations during recovery
> ---Till here everything works as expected
> 4. Promote standby to master (I have just stopped the standby and
> master and removed recovery.conf file from the standby database
> location) and try starting the new master, it
> gives below error and doesn't get started.
> FATAL:  could not create file "base/12700/16387": File exists
> 
> The basic issue was that the WAL logging for the Create Index operation
> was oblivious to the fact that for unlogged tables only INIT forks need
> to be logged.  Another point we need to consider is that while
> replaying the WAL for the create index operation, we need to flush the
> buffer if it is for the init fork.  This needs to be done only for pages
> that can be part of the init fork file (like the metapage and bitmap page).
> The attached patch fixes the issue.
> 
> Another approach to fix the issue could be to always log complete
> pages for the create index operation on unlogged tables (in
> hashbuildempty).  However, the logic for the initial hash index pages
> requires us to perform certain other actions (like updating the metapage
> after the creation of the bitmap page), which makes it difficult to
> follow that approach.
> 
> I think this should be considered a PostgreSQL 10 open item.
> 
> 
> [1] - https://www.postgresql.org/message-id/20170630005634.GA4448%40momjian.us

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com




Re: [HACKERS] Multi column range partition table

2017-07-05 Thread Amit Langote
Hi Dean,

On 2017/07/05 23:18, Dean Rasheed wrote:
> On 5 July 2017 at 10:43, Amit Langote  wrote:
>> In retrospect, that sounds like something that was implemented in the
>> earlier versions of the patch, whereby there was no ability to specify
>> UNBOUNDED on a per-column basis.  So the syntax was:
>>
>> FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED }
>>
> Yes, that's where I ended up too.

I see.

>> But, it was pointed out to me [1] that that doesn't address the use case,
>> for example, where part1 goes up to (10, 10) and part2 goes from (10, 10)
>> up to (10, unbounded).
>>
>> The new design will limit the usage of unbounded range partitions at the
>> tail ends.
> 
> True, but I don't think that's really a problem. When the first column
> is a discrete type, an upper bound of (10, unbounded) can be rewritten
> as (11) in the new design. When it's a continuous type, e.g. floating
> point, it can no longer be represented, because (10.0, unbounded)
> really means (col1 <= 10.0). But we've already decided not to support
> anything other than inclusive lower bounds and exclusive upper bounds,
> so allowing this upper bound goes against that design choice.

Yes.

>>> Of course, it's pretty late in the day to be proposing this kind of
>>> redesign, but I fear that if we don't tackle it now, it will just be
>>> harder to deal with in the future.
>>>
>>> Actually, a quick, simple hacky implementation might be to just fill
>>> in any omitted values in a partition bound with negative infinity
>>> internally, and when printing a bound, omit any values after an
>>> infinite value. But really, I think we'd want to tidy up the
>>> implementation, and I think a number of things would actually get much
>>> simpler. For example, get_qual_for_range() could simply stop when it
>>> reached the end of the list of values for the bound, and it wouldn't
>>> need to worry about an unbounded value following a bounded one.
>>>
>>> Thoughts?
>>
>> I cooked up a patch for the "hacky" implementation for now, just as you
>> described in the above paragraph.  Will you be willing to give it a look?
>> I will also think about the non-hacky way of implementing this.
> 
> OK, I'll take a look.

Thanks a lot for your time.

> Meanwhile, I already had a go at the "non-hacky" implementation (WIP
> patch attached). The more I worked on it, the simpler things got,
> which I think is a good sign.

It definitely looks good to me.  I was thinking of more or less the same
approach, but couldn't have done as clean a job as you've done with your
patch.

> Part-way through, I realised that the PartitionRangeDatum Node type is
> no longer needed, because each bound value is now necessarily finite,
> so the lowerdatums and upperdatums lists in a PartitionBoundSpec can
> now be made into lists of Const nodes, making them match the
> listdatums field used for LIST partitioning, and then a whole lot of
> related code gets simplified.

Yeah, seems that way.

> It needed a little bit more code in partition.c to track individual
> bound sizes, but there were a number of other places that could be
> simplified, so overall this represents a reduction in the code size
> and complexity.

Sounds reasonable.

> It's not complete (e.g., no doc updates yet), but it passes all the
> tests, and so far seems to work as I would expect.

Thanks a lot for working on it.

Regards,
Amit





Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2017-07-05 Thread Mark Rofail
To make the queries fired by the RI triggers able to use a GIN index, we
need ‒ as Tom Lane has previously suggested [1] ‒ to replace the query

SELECT 1 FROM ONLY fktable x WHERE $1 = ANY (fkcol) FOR SHARE OF x;

with

SELECT 1 FROM ONLY fktable x WHERE ARRAY[$1] <@ fkcol FOR SHARE OF x;

but since we have @<(anyarray, anyelement) it can be improved to

SELECT 1 FROM ONLY fktable x WHERE $1 @> fkcol FOR SHARE OF x;

and the piece of code responsible for all of this is ri_GenerateQual in
ri_triggers.c.

How to accomplish that is the next step. I don't know if we should hardcode
the "@>" symbol or if we should just index the fk table, in which case
ri_GenerateQual would be able to find the operator on its own.

*What I plan to do:*

   - study how to index the fk table upon its creation. I suspect this can
   be done in tablecmds.c

*Questions:*

   - how can you programmatically create an index on a table from C?

[1] https://www.postgresql.org/message-id/28389.1351094795%40sss.pgh.pa.us

Best Regards,
Mark Rofail
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 3a25ba52f3..0045f64c9e 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2650,7 +2650,7 @@ quoteRelationName(char *buffer, Relation rel)
  * ri_GenerateQual --- generate a WHERE clause equating two variables
  *
  * The idea is to append " sep leftop op rightop" to buf, or if fkreftype is
- * FKCONSTR_REF_EACH_ELEMENT, append " sep leftop op ANY(rightop)" to buf.
+ * FKCONSTR_REF_EACH_ELEMENT, append " sep leftop <@ rightop" to buf.
  *
  * The complexity comes from needing to be sure that the parser will select
  * the desired operator.  We always name the operator using
@@ -2697,17 +2697,10 @@ ri_GenerateQual(StringInfo buf,
 	appendStringInfo(buf, " %s %s", sep, leftop);
 	if (leftoptype != operform->oprleft)
 		ri_add_cast_to(buf, operform->oprleft);
- 
- 	appendStringInfo(buf, " OPERATOR(%s.%s) ",
- 	 quote_identifier(nspname), oprname);
- 
- 	if (fkreftype == FKCONSTR_REF_EACH_ELEMENT)
- 		appendStringInfoString(buf, "ANY (");
+ 	appendStringInfo(buf, " @> "); 
  	appendStringInfoString(buf, rightop);
  	if (rightoptype != oprright)
  		ri_add_cast_to(buf, oprright);
- 	if (fkreftype == FKCONSTR_REF_EACH_ELEMENT)
- 		appendStringInfoChar(buf, ')');
 
 	ReleaseSysCache(opertup);
 }



Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-05 Thread Tatsuo Ishii
Michael,

> Couldn't you cache one single SASL exchange status for each
> connection, meaning one PGconn saved for each? As the challenge sent
> by the server and the response generated by the client are different
> by design, I am afraid you would need to do that anyway in this
> context (Isn't PG-pool using already the weaknesses of MD5 to make
> things easier?). As the server decides first which authentication type
> should happen before beginning the real message exchange, that should
> not be difficult. It seems to me that you would need something more
> modular than you have now if you want for example to handle
> automatically connections to multiple servers that have different
> password hashes stored for the same user. The latter may be an edge
> case with pgpool though.

Thank you for the quick response. I will study your suggestion along
with the SCRAM code in PostgreSQL whether it could be possible in
Pgpool-II.

Regarding your question on md5 auth handling in Pgpool-II, please look
into:

https://pgpool.net/mediawiki/index.php/FAQ#How_does_pgpool-II_handle_md5_authentication.3F

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-05 Thread Michael Paquier
On Thu, Jul 6, 2017 at 10:03 AM, Tatsuo Ishii  wrote:
> For Pgpool-II, things would go as follows:
>
> 1) clients sends user name to Pgpool-II.
> 2) Pgpool-II forwards it to PostgreSQL servers.
> 3) Each PostgreSQL server sends their own salt to Pgpool-II.
> 4) Pgpool-II is confused because there are multiple salts and each has
>different values. The client only accepts single salt obviously.

Couldn't you cache one single SASL exchange status for each
connection, meaning one PGconn saved for each? As the challenge sent
by the server and the response generated by the client are different
by design, I am afraid you would need to do that anyway in this
context (Isn't PG-pool using already the weaknesses of MD5 to make
things easier?). As the server decides first which authentication type
should happen before beginning the real message exchange, that should
not be difficult. It seems to me that you would need something more
modular than you have now if you want for example to handle
automatically connections to multiple servers that have different
password hashes stored for the same user. The latter may be an edge
case with pgpool though.
-- 
Michael




Re: [HACKERS] Suspicious place in heap_prepare_freeze_tuple()

2017-07-05 Thread Alvaro Herrera
Masahiko Sawada wrote:
> On Thu, Jul 6, 2017 at 1:36 AM, Alvaro Herrera  
> wrote:
> > Teodor Sigaev wrote:
> >
> >> Playing around freezing tuple I found suspicious piece of code:
> >>
> >> heap_prepare_freeze_tuple():
> >> ...
> >> frz->t_infomask = tuple->t_infomask;
> >> ...
> >> frz->t_infomask &= ~HEAP_XMAX_BITS;
> >> frz->xmax = newxmax;
> >> if (flags & FRM_MARK_COMMITTED)
> >> frz->t_infomask &= HEAP_XMAX_COMMITTED;
> >>
> >> Seems, in last line it should be a bitwise OR instead of AND. Now this line
> >> cleans all bits in t_infomask which later will be copied directly in tuple.
> >
> > I think you're right.
> 
> I also think that's right. Should we back-patch it down to 9.3?

Of course.  I think this could cause data corruption.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] SCRAM auth and Pgpool-II

2017-07-05 Thread Tatsuo Ishii
Hi PostgreSQL hackers,

I would like to hear ideas how Pgpool-II can deal with SCRAM auth
which will be in PostgreSQL 10.

For those who are not familiar with Pgpool-II[1], it is an external
OSS project to provide some additional features to PostgreSQL,
including load balancing and automatic failover. Pgpool-II works as a
proxy between PostgreSQL client and PostgreSQL server(s).

When a client wants to connect to PostgreSQL and SCRAM auth is
enabled, it sends the user name to the server. The server then sends
information, including a salt, back to the client. The client computes a
"ClientProof" using the salt and other information, and sends it to
the server [2].

For Pgpool-II, things would go as follows:

1) clients sends user name to Pgpool-II.
2) Pgpool-II forwards it to PostgreSQL servers.
3) Each PostgreSQL server sends their own salt to Pgpool-II.
4) Pgpool-II is confused because there are multiple salts and each has
   a different value. The client obviously only accepts a single salt.

So my question is: is there any solution or workaround for problem
#4?  Someone at PGCon 2017 suggested that the problem could be avoided
if the auth method between the client and Pgpool-II is "trust" (which
means no auth). But this does not seem like the best solution to me
because it would weaken security.

I suspect this problem may not be specific to Pgpool-II. Any middleware
which handles multiple PostgreSQL servers could have a similar
problem.

Any suggestion would be appreciated. Thanks in advance.

[1] https://pgpool.net
[2] https://tools.ietf.org/html/rfc5802#section-3
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Suspicious place in heap_prepare_freeze_tuple()

2017-07-05 Thread Masahiko Sawada
On Thu, Jul 6, 2017 at 1:36 AM, Alvaro Herrera  wrote:
> Teodor Sigaev wrote:
>
>> Playing around freezing tuple I found suspicious piece of code:
>>
>> heap_prepare_freeze_tuple():
>> ...
>> frz->t_infomask = tuple->t_infomask;
>> ...
>> frz->t_infomask &= ~HEAP_XMAX_BITS;
>> frz->xmax = newxmax;
>> if (flags & FRM_MARK_COMMITTED)
>> frz->t_infomask &= HEAP_XMAX_COMMITTED;
>>
>> Seems, in last line it should be a bitwise OR instead of AND. Now this line
>> cleans all bits in t_infomask which later will be copied directly in tuple.
>
> I think you're right.
>

I also think that's right. Should we back-patch it down to 9.3?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Re: pg_ctl wait exit code (was Re: [COMMITTERS] pgsql: Additional tests for subtransactions in recovery)

2017-07-05 Thread Michael Paquier
On Thu, Jul 6, 2017 at 2:41 AM, Peter Eisentraut
 wrote:
> On 7/2/17 20:28, Michael Paquier wrote:
>>> I was going to hold this back for PG11, but since we're now doing some
>>> other tweaks in pg_ctl, it might be useful to add this too.  Thoughts?
>>
>> The use of 0 as exit code for the new promote -w if timeout is reached
>> looks like an open item to me. Cleaning up the pool queries after
>> promotion would be nice to see as well.
>
> committed

Thanks for finishing the cleanup.
-- 
Michael




Re: [HACKERS] Re: Error-like LOG when connecting with SSL for password authentication

2017-07-05 Thread Michael Paquier
On Thu, Jul 6, 2017 at 12:45 AM, Ryan Murphy  wrote:
> I tried to apply your patch to HEAD and it failed.
> But I think that's because some version of it has already been committed, 
> correct?  I see some of your ECONNRESET and "SSL connection has been closed 
> unexpectedly" code already in HEAD.

Thanks Ryan for the reminder. Heikki has pushed a minimum patch set as
b93827c7. So I have updated the CF app accordingly.
-- 
Michael




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-05 Thread Tom Lane
I wrote:
> (Pokes at it some more...) Oh, interesting: it behaves that way except
> when p is exactly the lowest histogram entry.

Here's a revised version that addresses that point and cleans up some
other minor details about treatment of cases near the histogram endpoints.
I'm still pretty unclear on exactly why it works (see XXX comments in
patch), but testing shows that it does work very well indeed given a flat
data distribution.  On less-flat distributions, you get some estimation
errors, but that's neither new nor surprising.

regards, tom lane

diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 333a36c..e9ff110 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -792,8 +792,7 @@ CREATE OPERATOR < (
It is important to specify the correct commutator and negator operators,
as well as suitable restriction and join selectivity
functions, otherwise the optimizer will be unable to make effective
-   use of the index.  Note that the less-than, equal, and
-   greater-than cases should use different selectivity functions.
+   use of the index.
   
 
   
diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml
index 8568e21..d484d80 100644
--- a/doc/src/sgml/xoper.sgml
+++ b/doc/src/sgml/xoper.sgml
@@ -242,20 +242,11 @@ column OP constant
 
  eqsel for =
  neqsel for <>
- scalarltsel for < or <=
- scalargtsel for > or >=
-   
-It might seem a little odd that these are the categories, but they
-make sense if you think about it.  = will typically accept only
-a small fraction of the rows in a table; <> will typically reject
-only a small fraction.  < will accept a fraction that depends on
-where the given constant falls in the range of values for that table
-column (which, it just so happens, is information collected by
-ANALYZE and made available to the selectivity estimator).
-<= will accept a slightly larger fraction than < for the same
-comparison constant, but they're close enough to not be worth
-distinguishing, especially since we're not likely to do better than a
-rough guess anyhow.  Similar remarks apply to > and >=.
+ scalarltsel for <
+ scalarlesel for <=
+ scalargtsel for >
+ scalargesel for >=
+

 

@@ -267,10 +258,12 @@ column OP constant

 

-You can use scalarltsel and scalargtsel for comparisons on data types that
-have some sensible means of being converted into numeric scalars for
-range comparisons.  If possible, add the data type to those understood
-by the function convert_to_scalar() in src/backend/utils/adt/selfuncs.c.
+You can use scalarltsel, scalarlesel,
+scalargtsel and scalargesel for comparisons on
+data types that have some sensible means of being converted into numeric
+scalars for range comparisons.  If possible, add the data type to those
+understood by the function convert_to_scalar() in
+src/backend/utils/adt/selfuncs.c.
 (Eventually, this function should be replaced by per-data-type functions
 identified through a column of the pg_type system catalog; but that hasn't happened
 yet.)  If you do not do this, things will still work, but the optimizer's
@@ -310,8 +303,10 @@ table1.column1 OP table2.column2
  
   eqjoinsel for =
   neqjoinsel for <>
-  scalarltjoinsel for < or <=
-  scalargtjoinsel for > or >=
+  scalarltjoinsel for <
+  scalarlejoinsel for <=
+  scalargtjoinsel for >
+  scalargejoinsel for >=
   areajoinsel for 2D area-based comparisons
   positionjoinsel for 2D position-based comparisons
   contjoinsel for 2D containment-based comparisons
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 9d34025..b4cbc34 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -71,7 +71,7 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
  *
  * We also recognize "range queries", such as "x > 34 AND x < 42".  Clauses
  * are recognized as possible range query components if they are restriction
- * opclauses whose operators have scalarltsel() or scalargtsel() as their
+ * opclauses whose operators have scalarltsel or a related function as their
  * restriction selectivity estimator.  We pair up clauses of this form that
  * refer to the same variable.  An unpairable clause of this kind is simply
  * multiplied into the selectivity product in the normal way.  But when we
@@ -92,8 +92,8 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
  * A free side-effect is that we can recognize redundant inequalities such
  * as "x < 4 AND x < 5"; only the tighter constraint will be counted.
  *
- * Of course this is all very dependent on the behavior of
- * scalarltsel/scalargtsel; perhaps some day we can generalize the approach.
+ * Of course this is all very dependent on the behavior 

Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-05 Thread AP
On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote:
> >> >  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
> >> > ---------+--------------+----------------+--------------+--------------+------------+------------+------------------
> >> >        3 |     10485760 |        2131192 |           66 |            0 | 2975444240 |          0 | 1065.19942179026
> >> > (1 row)
...
> >> > And I do appear to have an odd percentage of free space. :)
> 
> Are you worried about "unused_pages"? If so, then this is not a major

Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :)
Shouldn't that number be < 100?

> reason to worry, because these are probably freed overflow pages which
> can be used in future.  In the hash index, when we free the overflow
> pages, they are not returned back to OS, rather they are tracked in
> the index as unused pages which will get used when required in future.

> >> It looks like Vacuum hasn't been triggered.
> 
> Vacuum won't be triggered on insert load.  I think that is one of the
> reasons why in your initial copy, you might have got the error.  We
> had some discussion in the past to trigger Vacuum on insert heavy
> workloads [1], but the patch still didn't get committed.  I think if
> that patch or some other form of that patch gets committed, it will
> help the workload what you are trying here.

Well, if this is the cause of my little issue, it might be nice. ATM
my import script bombs out on errors (that I've duplicated! :) It took
11 hours but it bombed) and it sounds like I'll need to do a manual
VACUUM before it can be run again.

The stats you were looking for before are:

# select * from  pgstathashindex('link_datum_id_idx');
 version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
---------+--------------+----------------+--------------+--------------+------------+------------+------------------
       3 |      8559258 |        4194176 |          128 |      1926502 | 3591812743 |          0 | 942.873199357466
(1 row)

# select 4194176.0/128/8;
   ?column?
---
 4095.8750
(1 row)

If you need more info or whatnot, shout. I've a problematic index to
play with now.

> [1] - 
> https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30cfb%402ndquadrant.fr

AP




Re: [HACKERS] Re: pg_ctl wait exit code (was Re: [COMMITTERS] pgsql: Additional tests for subtransactions in recovery)

2017-07-05 Thread Peter Eisentraut
On 7/2/17 20:28, Michael Paquier wrote:
>> I was going to hold this back for PG11, but since we're now doing some
>> other tweaks in pg_ctl, it might be useful to add this too.  Thoughts?
> 
> The use of 0 as exit code for the new promote -w if timeout is reached
> looks like an open item to me. Cleaning up the pool queries after
> promotion would be nice to see as well.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] More race conditions in logical replication

2017-07-05 Thread Alvaro Herrera
Noah Misch wrote:

> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.

I volunteer to own this item.  My next update is going to be on or
before Friday 7th at 19:00 Chilean time, though I don't think I can have
this ready before then.  I expect a fix for this to miss next week's
beta release, but I'll try to get it done sooner.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Multi column range partition table

2017-07-05 Thread Dean Rasheed
On 5 July 2017 at 10:43, Amit Langote  wrote:
> In retrospect, that sounds like something that was implemented in the
> earlier versions of the patch, whereby there was no ability to specify
> UNBOUNDED on a per-column basis.  So the syntax was:
>
> FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED }
>
> But, it was pointed out to me [1] that that doesn't address the use case,
> for example, where part1 goes up to (10, 10) and part2 goes from (10, 10)
> up to (10, unbounded).
>

[Reading that other thread]

It's a reasonable point that our syntax is quite different from
Oracle's, and doing this takes it even further away, and removes
support for things that they do support.

For the record, Oracle allows things like the following:

DROP TABLE t1;
CREATE TABLE t1 (a NUMBER, b NUMBER, c NUMBER)
PARTITION BY RANGE (a,b,c)
  (PARTITION t1p1 VALUES LESS THAN (1,2,3),
   PARTITION t1p2 VALUES LESS THAN (2,3,4),
   PARTITION t1p3 VALUES LESS THAN (3,MAXVALUE,5),
   PARTITION t1p4 VALUES LESS THAN (4,MAXVALUE,6)
  );

INSERT INTO t1 VALUES(1,2,3);
INSERT INTO t1 VALUES(2,3,4);
INSERT INTO t1 VALUES(3,4,5);
INSERT INTO t1 VALUES(3.01,4,5);
INSERT INTO t1 VALUES(4,5,10);

COLUMN subobject_name FORMAT a20;
SELECT a, b, c, subobject_name
  FROM t1, user_objects o
 WHERE o.data_object_id = dbms_rowid.rowid_object(t1.ROWID)
 ORDER BY a,b,c;

 A  B  C SUBOBJECT_NAME
-- -- -- 
 1  2  3 T1P2
 2  3  4 T1P3
 3  4  5 T1P3
  3.01  4  5 T1P4
 4  5 10 T1P4


So they use MAXVALUE instead of UNBOUNDED for an upper bound, which is
more explicit. They don't have an equivalent MINVALUE, but it's
arguably not necessary, since the first partition's lower bound is
implicitly unbounded.

With this syntax they don't need to worry about gaps or overlaps
between partitions, which is nice, but arguably less flexible.

They're also more lax about allowing finite values after MAXVALUE, and
they document the fact that any value after a MAXVALUE is ignored.

I don't think their scheme provides any way to define a partition of
the above table that would hold all rows for which a < some value.

So if we were to go for maximum flexibility and compatibility with
Oracle, then perhaps what we would do is more like the original idea
of UNBOUNDED ABOVE/BELOW, except call them MINVALUE and MAXVALUE,
which conveniently are already unreserved keywords, as well as being
much shorter. Plus, we would also relax the constraint about having
finite values after MINVALUE/MAXVALUE.

I think I'll go play around with that idea to see what it looks like
in practice. Your previous patch already does much of that, and is far
less invasive.

Regards,
Dean




Re: [HACKERS] Suspicious place in heap_prepare_freeze_tuple()

2017-07-05 Thread Alvaro Herrera
Teodor Sigaev wrote:

> Playing around freezing tuple I found suspicious piece of code:
> 
> heap_prepare_freeze_tuple():
> ...
> frz->t_infomask = tuple->t_infomask;
> ...
> frz->t_infomask &= ~HEAP_XMAX_BITS;
> frz->xmax = newxmax;
> if (flags & FRM_MARK_COMMITTED)
> frz->t_infomask &= HEAP_XMAX_COMMITTED;
> 
> Seems, in last line it should be a bitwise OR instead of AND. Now this line
> cleans all bits in t_infomask which later will be copied directly in tuple.

I think you're right.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Suspicious place in heap_prepare_freeze_tuple()

2017-07-05 Thread Teodor Sigaev

Hi!

Playing around with tuple freezing, I found a suspicious piece of code:

heap_prepare_freeze_tuple():
...
frz->t_infomask = tuple->t_infomask;
...
frz->t_infomask &= ~HEAP_XMAX_BITS;
frz->xmax = newxmax;
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask &= HEAP_XMAX_COMMITTED;

It seems that in the last line it should be a bitwise OR instead of AND. As it
stands, this line clears all bits in t_infomask, which is later copied directly
into the tuple.
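
A tiny standalone illustration of the difference (HEAP_XMAX_COMMITTED is
0x0400 in htup_details.h; the starting infomask value is made up):

#include <stdio.h>
#include <stdint.h>

#define HEAP_XMAX_COMMITTED 0x0400

int
main(void)
{
    uint16_t    infomask = 0x0103;  /* some pre-existing flag bits */

    /* "&=" keeps only the HEAP_XMAX_COMMITTED bit, wiping everything else: 0x0000 here */
    printf("with &=: 0x%04x\n", (unsigned int) (infomask & HEAP_XMAX_COMMITTED));

    /* "|=" adds the flag and preserves the rest: 0x0503 */
    printf("with |=: 0x%04x\n", (unsigned int) (infomask | HEAP_XMAX_COMMITTED));
    return 0;
}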

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8deb344d09..ec227bac80 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6639,7 +6639,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			frz->t_infomask &= ~HEAP_XMAX_BITS;
 			frz->xmax = newxmax;
 			if (flags & FRM_MARK_COMMITTED)
-frz->t_infomask &= HEAP_XMAX_COMMITTED;
+frz->t_infomask |= HEAP_XMAX_COMMITTED;
 			changed = true;
 			totally_frozen = false;
 		}



Re: [HACKERS] Extra Vietnamese unaccent rules

2017-07-05 Thread Dang Minh Huong

On 2017/07/05 15:28, Michael Paquier wrote:

I have finally been able to look at this patch.


Thanks for reviewing and the new version of the patch.

(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).

  def get_plain_letter(codepoint, table):
  """Return the base codepoint without marks."""
  if is_letter_with_marks(codepoint, table):
-return table[codepoint.combining_ids[0]]
+if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+# Recursive to find the plain letter
+return get_plain_letter(table[codepoint.combining_ids[0]],table)
+elif is_plain_letter(table[codepoint.combining_ids[0]]):
+return table[codepoint.combining_ids[0]]
+else:
+return None
  elif is_plain_letter(codepoint):
  return codepoint
  else:
-raise "mu"
+return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None anyway, but that would make future debugging harder.

  def is_letter_with_marks(codepoint, table):
-"""Returns true for plain letters combined with one or more marks."""
+"""Returns true for letters combined with one or more marks."""
  # See 
http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
  return len(codepoint.combining_ids) > 1 and \
-   is_plain_letter(table[codepoint.combining_ids[0]]) and \
+   (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
 all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes it harder. I
think that this should be refactored into multiple conditions.

  if is_letter_with_marks(codepoint, table):
-charactersSet.add((codepoint.id,
+if get_plain_letter(codepoint, table) <> None:
+charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?


Thanks, so pretty. The patch is fine to me.

---
Thanks and best regards,
Dang Minh Huong






[HACKERS] Re: Error-like LOG when connecting with SSL for password authentication

2017-07-05 Thread Ryan Murphy
Hi Michael,

I tried to apply your patch to HEAD and it failed.  But I think that's because 
some version of it has already been committed, correct?  I see some of your 
ECONNRESET and "SSL connection has been closed unexpectedly" code already in 
HEAD.

Best,
Ryan

The new status of this patch is: Waiting on Author



Re: [HACKERS] outfuncs.c utility statement support

2017-07-05 Thread Peter Eisentraut
On 6/21/17 23:40, Tom Lane wrote:
> Peter Eisentraut  writes:
>> On 6/18/17 10:14, Tom Lane wrote:
>>> Doesn't cope with backslash-quoted characters.  If we're going to bother
>>> to do anything here, I think we ought to make it reversible for all
>>> possible characters.
> 
>> Makes sense.  Updated patch attached.
> 
> That looks sane to me, though I've still not actually tested any
> interesting cases.

I have committed this.  I had to build a bunch more stuff around it to
be able to test it, which I will tidy up and submit at some later point.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Multi column range partition table

2017-07-05 Thread Dean Rasheed
On 5 July 2017 at 10:43, Amit Langote  wrote:
>> So the more I think about this, the more I think that a cleaner design
>> would be as follows:
>>
>> 1). Don't allow UNBOUNDED, except in the first column, where it can
>> keep it's current meaning.
>>
>> 2). Allow the partition bounds to have fewer columns than the
>> partition definition, and have that mean the same as it would have
>> meant if you were partitioning by that many columns. So, for
>> example, if you were partitioning by (col1,col2), you'd be allowed
>> to define a partition like so:
>>
>>   FROM (x) TO (y)
>>
>> and it would mean
>>
>>   x <= col1 < y
>>
>> Or you'd be able to define a partition like
>>
>>   FROM (x1,x2) TO (y)
>>
>> which would mean
>>
>>   ((col1 > x1) OR (col1 = x1 AND col2 >= x2)) AND col1 < y
>>
>> 3). Don't allow any value after UNBOUNDED (i.e., only specify
>> UNBOUNDED once in a partition bound).
>
> I assume we don't need the ability to specify ABOVE/BELOW in this design.
>

Yes that's right.


> In retrospect, that sounds like something that was implemented in the
> earlier versions of the patch, whereby there was no ability to specify
> UNBOUNDED on a per-column basis.  So the syntax was:
>
> FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED }
>

Yes, that's where I ended up too.


> But, it was pointed out to me [1] that that doesn't address the use case,
> for example, where part1 goes up to (10, 10) and part2 goes from (10, 10)
> up to (10, unbounded).
>
> The new design will limit the usage of unbounded range partitions at the
> tail ends.
>

True, but I don't think that's really a problem. When the first column
is a discrete type, an upper bound of (10, unbounded) can be rewritten
as (11) in the new design. When it's a continuous type, e.g. floating
point, it can no longer be represented, because (10.0, unbounded)
really means (col1 <= 10.0). But we've already decided not to support
anything other than inclusive lower bounds and exclusive upper bounds,
so allowing this upper bound goes against that design choice.


>> Of course, it's pretty late in the day to be proposing this kind of
>> redesign, but I fear that if we don't tackle it now, it will just be
>> harder to deal with in the future.
>>
>> Actually, a quick, simple hacky implementation might be to just fill
>> in any omitted values in a partition bound with negative infinity
>> internally, and when printing a bound, omit any values after an
>> infinite value. But really, I think we'd want to tidy up the
>> implementation, and I think a number of things would actually get much
>> simpler. For example, get_qual_for_range() could simply stop when it
>> reached the end of the list of values for the bound, and it wouldn't
>> need to worry about an unbounded value following a bounded one.
>>
>> Thoughts?
>
> I cooked up a patch for the "hacky" implementation for now, just as you
> described in the above paragraph.  Will you be willing to give it a look?
> I will also think about the non-hacky way of implementing this.
>

OK, I'll take a look.

Meanwhile, I already had a go at the "non-hacky" implementation (WIP
patch attached). The more I worked on it, the simpler things got,
which I think is a good sign.

Part-way through, I realised that the PartitionRangeDatum Node type is
no longer needed, because each bound value is now necessarily finite,
so the lowerdatums and upperdatums lists in a PartitionBoundSpec can
now be made into lists of Const nodes, making them match the
listdatums field used for LIST partitioning, and then a whole lot of
related code gets simplified.

It needed a little bit more code in partition.c to track individual
bound sizes, but there were a number of other places that could be
simplified, so overall this represents a reduction in the code size
and complexity.

It's not complete (e.g., no doc updates yet), but it passes all the
tests, and so far seems to work as I would expect.

Regards,
Dean
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
new file mode 100644
index 7da2058..aade9f5
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -66,23 +66,26 @@
  * is an upper bound.
  */
 
-/* Ternary value to represent what's contained in a range bound datum */
-typedef enum RangeDatumContent
+/* Type of an individual bound of a range partition */
+typedef enum RangeBoundKind
 {
-	RANGE_DATUM_FINITE = 0,		/* actual datum stored elsewhere */
-	RANGE_DATUM_NEG_INF,		/* negative infinity */
-	RANGE_DATUM_POS_INF			/* positive infinity */
-} RangeDatumContent;
+	RANGE_BOUND_FINITE = 0,		/* actual bound stored in the datums array */
+	RANGE_BOUND_NEG_INF,		/* negative infinity; NULL datums array */
+	RANGE_BOUND_POS_INF			/* positive infinity; NULL datums array */
+} RangeBoundKind;
 
 typedef struct PartitionBoundInfoData
 {
 	char		strategy;		/* list or range bounds? */
 	int			ndatums;		/* Length of the datums following a

Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-07-05 Thread Mark Dilger

> On Jul 5, 2017, at 5:30 AM, Amit Kapila  wrote:
> 
> On Wed, Jul 5, 2017 at 9:44 AM, Mark Dilger  wrote:
>> 
>>> On Jul 3, 2017, at 10:25 PM, Amit Kapila  wrote:
>>> 
>>> On Mon, Jul 3, 2017 at 8:57 PM, Simon Riggs  wrote:
 On 30 June 2017 at 05:14, Amit Kapila  wrote:
 
> This is explained in section 15.2 [1], refer below para:
> "The query might be suspended during execution. In any situation in
> which the system thinks that partial or incremental execution might
> occur, no parallel plan is generated. For example, a cursor created
> using DECLARE CURSOR will never use a parallel plan. Similarly, a
> PL/pgSQL loop of the form FOR x IN query LOOP .. END LOOP will never
> use a parallel plan, because the parallel query system is unable to
> verify that the code in the loop is safe to execute while parallel
> query is active."
 
 Can you explain "unable to verify that the code in the loop is safe to
 execute while parallel query is active". Surely we aren't pushing code
 in the loop into the actual query, so the safety of command in the FOR
 loop has nothing to do with the parallel safety of the query.
 
 Please give an example of something that would be unsafe? Is that
 documented anywhere, README etc?
 
 FOR x IN query LOOP .. END LOOP
 seems like a case that would be just fine, since we're going to loop
 thru every row or break early.
 
>>> 
>>> It is not fine because we don't have partial execution support.  In the
>>> above case, if the loop breaks, we can't break parallel query
>>> execution.  Now, I don't think it would be impossible to support that,
>>> but as of now, the parallel subsystem doesn't have such support.
>> 
>> I can understand this, but wonder if I could use something like
>> 
>> FOR I TOTALLY PROMISE TO USE ALL ROWS rec IN EXECUTE sql LOOP
>> ...
>> END LOOP;
>> 
>> if I hacked the grammar up a bit.  Would the problem go away, or would
>> I still have problems when exceptions beyond my control get thrown inside
>> the loop?
>> 
> 
> I don't think it is just a matter of hacking the grammar: internally we
> use a cursor fetch to retrieve the rows, and we pass a fixed number of
> rows to fetch, which again prevents a parallel query from being
> invoked.

Is the risk that a RETURN will be executed within the loop block the only
risk that is causing parallel query to be disabled in these plpgsql loops?
If so, it seems that a plpgsql language syntax which created a loop that
disabled RETURN from within the loop body would solve the problem.  I
do not propose any such language extension.  It was just a thought
experiment.
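
To make that concrete, here is a minimal, purely hypothetical sketch of
the shape of loop in question (the table and columns are invented); the
point is only that the EXIT (or a RETURN) abandons the driving query
after a partial scan, which is exactly the partial execution the
parallel machinery cannot cope with today:

DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN SELECT id, payload FROM some_big_table LOOP
        IF rec.payload IS NULL THEN
            EXIT;   -- query abandoned part-way through
        END IF;
    END LOOP;
END;
$$;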

Is there a conditional somewhere in the source that says, in effect, "if this
is a plpgsql loop, then disable parallel query"?  If so, and if I removed that
check such that parallel query would run in plpgsql loops, would I only
get breakage when I returned out of a loop?  Or would there be other
situations where that would break?

Thanks,

mark



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-05 Thread AP
On Tue, Jul 04, 2017 at 08:23:20PM -0700, Jeff Janes wrote:
> On Tue, Jul 4, 2017 at 3:57 AM, AP  wrote:
> > The data being indexed is BYTEA, (quasi)random and 64 bytes in size.
> > The table has over 2 billion entries. The data is not unique. There's
> > an average of 10 duplicates for every unique value.
> 
> What is the number of duplicates for the most common value?

Damn. I was going to collect this info while doing a fresh upload, but
it fell through the cracks of my mind. It'll probably take at least
half a day to collect (a simple count(*) on the table takes 1.5-1.75
hours parallelised across 11 processes), so I'll probably have this in
around 24 hours if all goes well (and I don't stuff up the SQL :) ).
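
For reference, the query I have in mind is along these lines (the
column name is a guess based on the index name that appears later in
this thread):

SELECT datum_id, count(*) AS dup_count
  FROM link
 GROUP BY datum_id
 ORDER BY dup_count DESC
 LIMIT 1;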

AP.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-05 Thread AP
On Wed, Jul 05, 2017 at 03:33:45PM +1000, AP wrote:
> > Do you have any deletes?  How have you verified whether autovacuum has
> 
> No DELETEs. Just the initial COPY, then SELECTs, then a DB rename to get it
> out of the way of other testing, then the REINDEX.
> 
> > been triggered or not?
> 
> I just checked pg_stat_user_tables (which I hope is the right place for
> this info :)
> 
> relid               | 129311803
> schemaname          | public
> relname             | link
> seq_scan            | 70
> seq_tup_read        | 15085880072
> idx_scan            | 5779
> idx_tup_fetch       | 465623
> n_tup_ins           | 2975444240
> n_tup_upd           | 0
> n_tup_del           | 0
> n_tup_hot_upd       | 0
> n_live_tup          | 928658178
> n_dead_tup          | 0
> n_mod_since_analyze | 0
> last_vacuum         |
> last_autovacuum     |
> last_analyze        |
> last_autoanalyze    | 2017-06-28 10:43:51.273241+10
> vacuum_count        | 0
> autovacuum_count    | 0
> analyze_count       | 0
> autoanalyze_count   | 2
> 
> So it appears not.

Actually, after a bit more late-arvo thought, I take it this would be the
case as the table has not changed since creation. Thus no need to autovac.

I've newer timestamps on the live db (whose data was uploaded more recently)
for its tables so I think autovac is functioning.

AP


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-05 Thread AP
On Wed, Jul 05, 2017 at 10:29:09AM +0530, Amit Kapila wrote:
> >> bitmappages.  Can you try to use pgstattuple extension and get us the
> >> results of Select * from pgstathashindex('index_name');?  If the
> >> number of bitmappages is 128 and total overflow pages are 128 * 4096,
> >> then that would mean that all the pages are used.  Then maybe we can
> >
> > Hmm. Unless I misunderstood that'd mean that overflow_pages/4096 should
> > result in a number <= 128 at the moment, right?
> 
> No, sorry, I think my calculation above has something missing.  It
> should be 128 * 4096 * 8.  How we can compute this number is
> no_bitmap_pages * no_bits_used_to_represent_overflow_pages.

AHA! Ok. Then that appears to match. I get 65.041.
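
In SQL terms, the check works out to roughly the following (a sketch
only, using the 4096 * 8 bits of overflow-page tracking per bitmap page
assumed above):

SELECT bitmap_pages,
       overflow_pages::numeric / (4096 * 8) AS bitmap_pages_needed
  FROM pgstathashindex('link_datum_id_hash_idx');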

> >If so then something is
> > amiss:
> >
> > # select * from  pgstathashindex('link_datum_id_hash_idx');
> >  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
> > ---------+--------------+----------------+--------------+--------------+------------+------------+------------------
> >        3 |     10485760 |        2131192 |           66 |            0 | 2975444240 |          0 | 1065.19942179026
> > (1 row)
> >
> > oldmdstash=# select 2131192/4096;
> >  ?column?
> > --
> >   520
> > (1 row)
> 
> You need to divide 520 by 8 to get the bitmap page.  Is this the index
> in which you get the error or is this the one on which you have done
> REINDEX?

Post REINDEX.

> > And I do appear to have an odd percentage of free space. :)
> >
> 
> It looks like Vacuum hasn't been triggered.

:(

> > This index was created yesterday so it has been around for maybe 18 hours.
> > Autovac is likely to have hit it by now.
> 
> Do you have any deletes?  How have you verified whether autovacuum has

No DELETEs. Just the initial COPY, then SELECTs, then a DB rename to get it
out of the way of other testing, then the REINDEX.

> been triggered or not?

I just checked pg_stat_user_tables (which I hope is the right place for
this info :)

relid               | 129311803
schemaname          | public
relname             | link
seq_scan            | 70
seq_tup_read        | 15085880072
idx_scan            | 5779
idx_tup_fetch       | 465623
n_tup_ins           | 2975444240
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 928658178
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        |
last_autoanalyze    | 2017-06-28 10:43:51.273241+10
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 0
autoanalyze_count   | 2

So it appears not.

# show autovacuum;
 autovacuum 

 on
(1 row)

All autovacuum parameters are as per default. The autovacuum launcher process
exists.
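
For completeness, they can all be listed in one go (just a convenience
query against pg_settings):

SELECT name, setting, source
  FROM pg_settings
 WHERE name LIKE 'autovacuum%';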

:(

AP


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Mithun Cy
On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila  wrote:
>
> Few comments on the latest patch:
>
> 1.
> + LWLockRelease(&apw_state->lock);
> + if (!is_bgworker)
> + ereport(ERROR,
> + (errmsg("could not perform block dump because dump file is being
> used by PID %d",
> + apw_state->pid_using_dumpfile)));
> + ereport(LOG,
> + (errmsg("skipping block dump because it is already being performed by PID 
> %d",
> + apw_state->pid_using_dumpfile)));
>
>
> The above code looks confusing as both the messages are saying the
> same thing in different words.  I think you keep one message (probably
> the first one) and decide error level based on if this is invoked for
> bgworker.  Also, you can move LWLockRelease after error message,
> because if there is any error, then it will automatically release all
> lwlocks.

ERROR is used for autoprewarm_dump_now, which is called from a backend;
LOG is used for the bgworker.  The wordings differ to match whether
failing to dump is acceptable in that context.  For the bgworker it is
acceptable: we only care about the interval between dumps, not their
exact start time, so it is fine if somebody else is already dumping.
But a caller of autoprewarm_dump_now might well care about the start
time of the dump, so we throw an error and let him retry.

The wordings were suggested by Robert (snapshot below) in one of his
previous comments, and I also agree with them.  If you think I should
reconsider this and I am missing something, I am open to suggestions.

On Wed, May 31, 2017 at 10:18 PM, Robert Haas  wrote:
+If we go to perform
+an immediate dump process and finds a non-zero value already just does
+ereport(ERROR, ...), including the PID of the other process in the
+message (e.g. "unable to perform block dump because dump file is being
+used by PID %d").  In a background worker, if we go to dump and find
+the file in use, log a message (e.g. "skipping block dump because it
+is already being performed by PID %d", "skipping prewarm because block
+dump file is being rewritten by PID %d").

Thanks, I have moved the LWLockRelease to after the ERROR call.

> 2.
> +autoprewarm_dump_now(PG_FUNCTION_ARGS)
> +{
> + uint32 num_blocks = 0;
> +
> ..
> + PG_RETURN_INT64(num_blocks);
> ..
> }
>
> Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32?

The return type of autoprewarm_dump_now() is pg_catalog.int8 so that it
can accommodate a uint32, which is why I used PG_RETURN_INT64.  I think
PG_RETURN_UINT32 can be used as well, so I have replaced it now.

> 3.
> +dump_now(bool is_bgworker)
> {
> ..
> + if (buf_state & BM_TAG_VALID)
> + {
> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
> + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
> + ++num_blocks;
> + }
> ..
> }
>I think there is no use of writing Unlogged buffers unless the dump is
>for the shutdown.  You might want to use BufferIsPermanent to detect the same.

-- I do not think that is true: pages of unlogged tables are also
read into buffers for read-only purposes.  So if we miss dumping them
at shutdown, the previous dump should be used.

> 4.
> +static uint32
> +dump_now(bool is_bgworker)
> {
> ..
> + for (num_blocks = 0, i = 0; i < NBuffers; i++)
> + {
> + uint32 buf_state;
> +
> + if (!is_bgworker)
> + CHECK_FOR_INTERRUPTS();
> ..
> }
>
> Why is checking for interrupts done only for non-bgworker cases?

-- autoprewarm_dump_now is called directly from a backend.  In that
case, dump_now() has to handle the signals registered for the backend.
For the bgworker, dump_block_info_periodically(), the caller of
dump_now(), handles SIGTERM and SIGUSR1, which are the ones we are
interested in.

> 5.
> + * Each entry of BlockRecordInfo consists of database, tablespace, filenode,
> + * forknum, blocknum. Note that this is in the text form so that the dump
> + * information is readable and can be edited, if required.
> + */
>
> In the above comment, you are saying that the dump file is in text
> form whereas in the code you are using binary form.  I think code
> should match comments. Is there a reason of preferring binary over
> text or vice versa?

-- Previously I used write() on a file descriptor.  Sorry, I should
have changed the open mode to text mode when I moved the code to
use AllocateFile.  Fixed now.

> 6.
> +dump_now(bool is_bgworker)
> {
> ..
> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
> + apw_state->pid_using_dumpfile = InvalidPid;
> ..
> }
>
> How will pid_using_dumpfile be set to InvalidPid in the case of error
> for non-bgworker cases?

-- I have a try()-catch() block in autoprewarm_dump_now, so I think that is okay.

>
> 7.
> +dump_now(bool is_bgworker)
> {
> ..
> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
>
> ..
> }
>
> How will transient_dump_file_path be unlinked in the 

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Mithun Cy
On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila  wrote:
> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy  wrote:
>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy  
>> wrote:
>>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:

 Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
 detect an autoprewarm process already running.  I'd want this to
 return NULL or an error if called for a 2nd time.
>>>
>>> We log instead of raising an error because the check only happens
>>> after launching the worker, inside the worker itself. One solution
>>> could be, similar to autoprewarm_dump_now(), to have
>>> autoprewarm_start_worker() init shared memory and check in the
>>> backend itself whether we can launch the worker. I will try to fix it
>>> that way.
>>
>> I have fixed it now as follows
>>
>> +Datum
>> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
>> +{
>> +   pid_t   pid;
>> +
>> +   init_apw_shmem();
>> +   pid = apw_state->bgworker_pid;
>> +   if (pid != InvalidPid)
>> +   ereport(ERROR,
>> +   (errmsg("autoprewarm worker is already running under PID %d",
>> +   pid)));
>> +
>> +   autoprewarm_dump_launcher();
>> +   PG_RETURN_VOID();
>> +}
>>
>> In the backend itself, we check whether an autoprewarm worker is
>> already running and only then start one. There is a possibility that,
>> if this function is executed concurrently when there is no worker
>> already running (which I think is not normal usage), both calls will
>> report that they have successfully launched the worker even though
>> only one could have actually done so (the other will log and silently
>> die).
>
> Why can't we close this remaining race condition?  Basically, if we
> just perform all of the actions in this function under the lock and
> autoprewarm_dump_launcher waits till the autoprewarm worker has
> initialized the bgworker_pid, then there won't be any remaining race
> condition.  I think if we don't close this race condition, it will be
> unpredictable whether the user will get the error or there will be
> only a server log for the same.

Yes, I can make autoprewarm_dump_launcher wait until the launched
bgworker has set its pid, but this requires one more synchronization
variable between the launcher and the worker. More than that, I see
that ShmemInitStruct() and LWLockAcquire(), which need to be called
before setting the pid, can throw an ERROR (which restarts the worker).
So I thought it would not be harmful to let two concurrent calls launch
workers and just log the failures. Please let me know if I need to
rethink this.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Mithun Cy
On Mon, Jul 3, 2017 at 3:55 PM, Amit Kapila  wrote:
> On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas  wrote:
>> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy  
>> wrote:
>>
>> * Instead of creating our own buffering system via buffer_file_write()
>> and buffer_file_flush(), why not just use the facilities provided by
>> the operating system?  fopen() et. al. provide buffering, and we have
>> AllocateFile() to provide a FILE *; it's just like
>> OpenTransientFile(), which you are using, but you'll get the buffering
>> stuff for free.  Maybe there's some reason why this won't work out
>> nicely, but off-hand it seems like it might.  It looks like you are
>> already using AllocateFile() to read the dump, so using it to write
>> the dump as well seems like it would be logical.
>>
>
> One thing that is worth considering is AllocateFile is recommended to
> be used for short operations.  Refer text atop AllocateFile().  If the
> number of blocks to be dumped is large, then the file can remain open
> for the significant period of time.

-- Agreed. I think I need a suggestion on this: we will hold on to one
fd, but I am not sure how long a file has to stay open to qualify as a
case against using AllocateFile().



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition : Append node over a single SeqScan

2017-07-05 Thread Ashutosh Sharma
Hi,

On Wed, Jul 5, 2017 at 3:58 PM, Ashutosh Bapat
 wrote:
> On Wed, Jul 5, 2017 at 3:48 PM, Ashutosh Sharma  wrote:
>> Hi All,
>>
>> Today, while exploring range table partitioning a bit, I noticed that
>> even if a scan is performed on a single partition, the plan has an
>> Append node in it. Isn't that a bug?
>
> No. See following comment from create_append_plan()
> 1045 /*
> 1046  * XXX ideally, if there's just one child, we'd not bother to generate an
> 1047  * Append node but just return the single child.  At the moment this does
> 1048  * not work because the varno of the child scan plan won't match the
> 1049  * parent-rel Vars it'll be asked to emit.
> 1050  */
>

Thanks for the information.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-05 Thread Craig Ringer
On 3 Jul. 2017 23:01, "K S, Sandhya (Nokia - IN/Bangalore)" <
sandhya@nokia.com> wrote:

Hi Craig,

Thanks for the response.

The scenario tried here is restarting the system multiple times. The sh-QUIT
core is generated when Postgres invokes the shell to exit, and may not be due
to kernel or file system issues. I will try to reproduce the issue with
the dmesg output being printed.

However, is there any instance in Postgres where 'sh -c exit 1' would be
invoked?


Most likely it's used directly or indirectly by an archive_command or
restore_command you have configured.


Re: [HACKERS] [POC] hash partitioning

2017-07-05 Thread amul sul
On Wed, Jul 5, 2017 at 4:50 PM, Dilip Kumar  wrote:
> On Mon, Jul 3, 2017 at 4:39 PM, amul sul  wrote:
>> Thanks for catching this; fixed in the attached version.
>
> Few comments on the latest version.
>

Thanks for your review; please find my comments inline:

> 0001 looks fine, for 0002 I have some comments.
>
> 1.
> + hbounds = (PartitionHashBound * *) palloc(nparts *
> +  sizeof(PartitionHashBound *));
>
> /s/(PartitionHashBound * *)/(PartitionHashBound **)/g
>

Fixed in the attached version.

> 2.
> RelationBuildPartitionDesc
> {
>  
>
>
> * catalog scan that retrieved them, whereas that in the latter is
> * defined by canonicalized representation of the list values or the
> * range bounds.
> */
> for (i = 0; i < nparts; i++)
> result->oids[mapping[i]] = oids[i];
>
> Should this comment mention hash as well?
>

Instead, I have generalised this comment in the attached patch

> 3.
>
> if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
> return false;
>
> if (b1->ndatums != b2->ndatums)
> return false;
>
> If ndatums itself is different then there is no need to access the datum
> memory, so better to check ndatums first.
>

You are correct; we are already doing this in
partition_bounds_equal().  This was redundant code, removed in the
attached version.

> 4.
> + * next larger modulus.  For example, if you have a bunch
> + * of partitions that all have modulus 5, you can add a
> + * new new partition with modulus 10 or a new partition
>
> Typo, "new new partition"  -> "new partition"
>

Fixed in the attached version.

Regards,
Amul


0001-Cleanup_v6.patch
Description: Binary data


0002-hash-partitioning_another_design-v15.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-07-05 Thread Amit Kapila
On Wed, Jul 5, 2017 at 9:44 AM, Mark Dilger  wrote:
>
>> On Jul 3, 2017, at 10:25 PM, Amit Kapila  wrote:
>>
>> On Mon, Jul 3, 2017 at 8:57 PM, Simon Riggs  wrote:
>>> On 30 June 2017 at 05:14, Amit Kapila  wrote:
>>>
 This is explained in section 15.2 [1], refer below para:
 "The query might be suspended during execution. In any situation in
 which the system thinks that partial or incremental execution might
 occur, no parallel plan is generated. For example, a cursor created
 using DECLARE CURSOR will never use a parallel plan. Similarly, a
 PL/pgSQL loop of the form FOR x IN query LOOP .. END LOOP will never
 use a parallel plan, because the parallel query system is unable to
 verify that the code in the loop is safe to execute while parallel
 query is active."
>>>
>>> Can you explain "unable to verify that the code in the loop is safe to
>>> execute while parallel query is active". Surely we aren't pushing code
>>> in the loop into the actual query, so the safety of command in the FOR
>>> loop has nothing to do with the parallel safety of the query.
>>>
>>> Please give an example of something that would be unsafe? Is that
>>> documented anywhere, README etc?
>>>
>>> FOR x IN query LOOP .. END LOOP
>>> seems like a case that would be just fine, since we're going to loop
>>> thru every row or break early.
>>>
>>
>> It is not fine because we don't have partial execution support.  In the
>> above case, if the loop breaks, we can't break parallel query
>> execution.  Now, I don't think it would be impossible to support that,
>> but as of now, the parallel subsystem doesn't have such support.
>
> I can understand this, but wonder if I could use something like
>
> FOR I TOTALLY PROMISE TO USE ALL ROWS rec IN EXECUTE sql LOOP
> ...
> END LOOP;
>
> if I hacked the grammar up a bit.  Would the problem go away, or would
> I still have problems when exceptions beyond my control get thrown inside
> the loop?
>

I don't think it is just a matter of hacking the grammar: internally we
use a cursor fetch to retrieve the rows, and we pass a fixed number of
rows to fetch, which again prevents a parallel query from being
invoked.

>  And if exception handling is a problem in the loop, are exceptions
> somehow not a problem in other parallel queries?
>

I don't see exceptions as a blocking factor for choosing a parallel
query inside PL.  However, if you have something in mind, feel free to
share an example.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-05 Thread Amit Kapila
On Wed, Jul 5, 2017 at 11:03 AM, AP  wrote:
> On Wed, Jul 05, 2017 at 10:29:09AM +0530, Amit Kapila wrote:
>> >> bitmappages.  Can you try to use pgstattuple extension and get us the
>> >> results of Select * from pgstathashindex('index_name');?  If the
>> >> number of bitmappages is 128 and total overflow pages are 128 * 4096,
>> >> then that would mean that all the pages are used.  Then maybe we can
>> >
>> > Hmm. Unless I misunderstood that'd mean that overflow_pages/4096 should
>> > result in a number <= 128 at the moment, right?
>>
>> No, sorry, I think my calculation above has something missing.  It
>> should be 128 * 4096 * 8.  How we can compute this number is
>> no_bitmap_pages * no_bits_used_to_represent_overflow_pages.
>
> AHA! Ok. Then that appears to match. I get 65.041.
>
>> >If so then something is
>> > amiss:
>> >
>> > # select * from  pgstathashindex('link_datum_id_hash_idx');
>> >  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
>> > ---------+--------------+----------------+--------------+--------------+------------+------------+------------------
>> >        3 |     10485760 |        2131192 |           66 |            0 | 2975444240 |          0 | 1065.19942179026
>> > (1 row)
>> >
>> > oldmdstash=# select 2131192/4096;
>> >  ?column?
>> > --
>> >   520
>> > (1 row)
>>
>> You need to divide 520 by 8 to get the bitmap page.  Is this the index
>> in which you get the error or is this the one on which you have done
>> REINDEX?
>
> Post REINDEX.
>
>> > And I do appear to have an odd percentage of free space. :)

Are you worried about "unused_pages"? If so, then this is not a major
cause for concern, because these are probably freed overflow pages which
can be used in the future.  In a hash index, when we free overflow
pages, they are not returned to the OS; rather, they are tracked in
the index as unused pages which will be reused when required.

>> >
>>
>> It looks like Vacuum hasn't been triggered.
>>

Vacuum won't be triggered by an insert-only load.  I think that is one
of the reasons why you might have got the error during your initial
copy.  We had some discussion in the past about triggering vacuum on
insert-heavy workloads [1], but that patch still hasn't been committed.
I think if that patch, or some other form of it, gets committed, it
will help the kind of workload you are trying here.


[1] - 
https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30cfb%402ndquadrant.fr

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Append implementation

2017-07-05 Thread Amit Khandekar
On 30 June 2017 at 15:10, Rafia Sabih  wrote:
>
> On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar 
> wrote:
>>
>> Attached is an updated patch v13 that has some comments changed as per
>> your review, and also rebased on latest master.
>
>
> This does not apply on the latest head, i.e. commit
> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7; it looks like it needs rebasing.

Thanks for notifying. Attached is the rebased version of the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-07-05 Thread Dilip Kumar
On Mon, Jul 3, 2017 at 4:39 PM, amul sul  wrote:
> Thanks for catching this; fixed in the attached version.

Few comments on the latest version.

0001 looks fine, for 0002 I have some comments.

1.
+ hbounds = (PartitionHashBound * *) palloc(nparts *
+  sizeof(PartitionHashBound *));

/s/(PartitionHashBound * *)/(PartitionHashBound **)/g

2.
RelationBuildPartitionDesc
{
 


* catalog scan that retrieved them, whereas that in the latter is
* defined by canonicalized representation of the list values or the
* range bounds.
*/
for (i = 0; i < nparts; i++)
result->oids[mapping[i]] = oids[i];

Should this comment mention hash as well?

3.

if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
return false;

if (b1->ndatums != b2->ndatums)
return false;

If ndatums itself is different then there is no need to access the datum
memory, so better to check ndatums first.

4.
+ * next larger modulus.  For example, if you have a bunch
+ * of partitions that all have modulus 5, you can add a
+ * new new partition with modulus 10 or a new partition

Typo, "new new partition"  -> "new partition"


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition : Append node over a single SeqScan

2017-07-05 Thread Ashutosh Bapat
On Wed, Jul 5, 2017 at 3:48 PM, Ashutosh Sharma  wrote:
> Hi All,
>
> Today, while exploring range table partitioning a bit, I noticed that
> even if a scan is performed on a single partition, the plan has an
> Append node in it. Isn't that a bug?

No. See following comment from create_append_plan()
1045 /*
1046  * XXX ideally, if there's just one child, we'd not bother to generate an
1047  * Append node but just return the single child.  At the moment this does
1048  * not work because the varno of the child scan plan won't match the
1049  * parent-rel Vars it'll be asked to emit.
1050  */

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Partition : Append node over a single SeqScan

2017-07-05 Thread Ashutosh Sharma
Hi All,

Today, while exploring range table partitioning a bit, I noticed that
even if a scan is performed on a single partition, the plan has an
Append node in it. Isn't that a bug?

As per the usage of the Append node, it should only be seen in the
query plan when a scan is performed on multiple tables.

Following are the steps to reproduce the problem.

CREATE TABLE part_tab (c1 int, c2 int) PARTITION BY RANGE (c1);

CREATE TABLE part1 PARTITION OF part_tab FOR VALUES FROM (0) TO (100);

CREATE TABLE part2 PARTITION OF part_tab FOR VALUES FROM (100) TO (200);

postgres=# explain analyze select * from part_tab where c1 > 100;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8) (actual time=0.009..0.009 rows=0 loops=1)
   ->  Seq Scan on part2  (cost=0.00..38.25 rows=753 width=8) (actual time=0.009..0.009 rows=0 loops=1)
         Filter: (c1 > 100)
 Planning time: 166698.973 ms
 Execution time: 0.043 ms
(5 rows)

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi column range partition table

2017-07-05 Thread Amit Langote
Hi Dean,

On 2017/07/04 16:49, Dean Rasheed wrote:
> On 3 July 2017 at 10:32, Amit Langote  wrote:
>> On 2017/07/03 17:36, Dean Rasheed wrote:
>>> The bigger question is do we want this for PG10? If so, time is
>>> getting tight. My feeling is that we do, because otherwise we'd be
>>> changing the syntax in PG11 of a feature only just released in PG10,
>>> and I think the current syntax is flawed, so it would be better not to
>>> have it in any public release. I'd feel better hearing from the
>>> original committer though.
>>
>> The way I have extended the syntax in the posted patch, ABOVE/BELOW (or
>> whatever we decide instead) are optional.  UNBOUNDED without the
>> ABOVE/BELOW specifications implicitly means UNBOUNDED ABOVE if in FROM and
>> vice versa, which seems to me like sensible default behavior and what's
>> already present in PG 10.
>>
>> Do you think ABOVE/BELOW shouldn't really be optional?
>>
> 
> Hmm, I'm not so sure about that.
> 
> The more I think about this, the more I think that the current design
> is broken, and that introducing UNBOUNDED ABOVE/BELOW is just a
> sticking plaster to cover that up. Yes, it allows nicer multi-column
> ranges to be defined, as demonstrated upthread. But, it also allows
> some pretty counterintuitive things like making the lower bound
> exclusive and the upper bound inclusive.

Yes, I kind of got that impression from the example, but wasn't able to
reach the same conclusion as yours that it stems from the underlying
design issues; I thought we'd just have to document them as caveats, but
that doesn't really sound nice.  Thanks for pointing that out.

> I think that's actually the real problem with the current design. If I
> have a single-column partition like
> 
>   (col) FROM (x) TO (y)
> 
> it's pretty clear that's a simple range, inclusive at the lower end
> and exclusive at the upper end:
> 
>   (x) <= (col) < (y)
> 
> If I now make that a 2-column partition, but leave the second column
> unbounded:
> 
>   (col1,col2) FROM (x,UNBOUNDED) TO (y,UNBOUNDED)
> 
> my initial expectation would have been for that to mean the same
> thing, i.e.,
> 
>   (x) <= (col1) < (y)
> 
> but that only happens if "UNBOUNDED" means negative infinity in both
> places. That then starts to give the sort of desirable properties
> you'd expect, like using the same expression for the lower bound of
> one partition as the upper bound of another makes the two partitions
> contiguous.
> 
> But of course, that's not exactly a pretty design either, because then
> you'd be saying that UNBOUNDED means positive infinity if it's the
> upper bound of the first column, and negative infinity if it's the
> lower bound of the first column or either bound of any other column.

Initially, I didn't understand the part where you said FROM (x, UNBOUNDED)
TO (y, UNBOUNDED) would mean the same thing as (x) <= (col1) < (y),
because the row-comparison logic underlying multi-column range partition
key comparisons appears to me to contradict that.  But maybe it's
thinking about implementation details like this that's clouding my
judgement about the correctness or the intuitiveness of the current design.
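
To make that concrete, here is a small, purely illustrative example
using the syntax as it currently stands (table and values invented):

CREATE TABLE r (col1 int, col2 int) PARTITION BY RANGE (col1, col2);

-- One might read this as 10 <= col1 < 20, but with UNBOUNDED meaning
-- negative infinity in FROM and positive infinity in TO, rows with
-- col1 = 20 (any col2) also fall into this partition.
CREATE TABLE r1 PARTITION OF r
    FOR VALUES FROM (10, UNBOUNDED) TO (20, UNBOUNDED);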

> Another aspect of the current design I don't like is that you have to
> keep repeating UNBOUNDED [ABOVE/BELOW], for each of the rest of the
> columns in the bound, and anything else is an error. That's a pretty
> verbose way of saying "the rest of the columns are unbounded".
>
> So the more I think about this, the more I think that a cleaner design
> would be as follows:
> 
> 1). Don't allow UNBOUNDED, except in the first column, where it can
> keep it's current meaning.
> 
> 2). Allow the partition bounds to have fewer columns than the
> partition definition, and have that mean the same as it would have
> meant if you were partitioning by that many columns. So, for
> example, if you were partitioning by (col1,col2), you'd be allowed
> to define a partition like so:
> 
>   FROM (x) TO (y)
> 
> and it would mean
> 
>   x <= col1 < y
> 
> Or you'd be able to define a partition like
> 
>   FROM (x1,x2) TO (y)
> 
> which would mean
> 
>   (col1 > x1) OR (col1 = x1 AND col2 >= x2) AND col1 < y
> 
> 3). Don't allow any value after UNBOUNDED (i.e., only specify
> UNBOUNDED once in a partition bound).

I assume we don't need the ability to specify ABOVE/BELOW in this design.

In retrospect, that sounds like something that was implemented in the
earlier versions of the patch, whereby there was no ability to specify
UNBOUNDED on a per-column basis.  So the syntax was:

FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED }

But, it was pointed out to me [1] that that doesn't address the use case,
for example, where part1 goes up to (10, 10) and part2 goes from (10, 10)
up to (10, unbounded).

The new design will limit the use of unbounded range partitions to the
tail ends.

> This design has a few neat properties:
> 
> - Lower bounds ar

Re: [HACKERS] UPDATE of partition key

2017-07-05 Thread Amit Khandekar
On 4 July 2017 at 15:23, Amit Khandekar  wrote:
> On 4 July 2017 at 14:48, Amit Khandekar  wrote:
>> On 4 July 2017 at 14:38, Amit Langote  wrote:
>>> On 2017/07/04 17:25, Etsuro Fujita wrote:
 On 2017/07/03 18:54, Amit Langote wrote:
> On 2017/07/02 20:10, Robert Haas wrote:
>> That
>> seems pretty easy to do - just have expand_inherited_rtentry() notice
>> that it's got a partitioned table and call
>> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to
>> produce the list of OIDs.
 Seems like a good idea.

> Interesting idea.
>
> If we are going to do this, I think we may need to modify
> RelationGetPartitionDispatchInfo() a bit or invent an alternative that
> does not do as much work.  Currently, it assumes that it's only ever
> called by ExecSetupPartitionTupleRouting() and hence also generates
> PartitionDispatchInfo objects for partitioned child tables.  We don't need
> that if called from within the planner.
>
> Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled
> with its usage within the executor, because there is this comment:
>
>  /*
>   * We keep the partitioned ones open until we're done using the
>   * information being collected here (for example, see
>   * ExecEndModifyTable).
>   */

 Yeah, we need some refactoring work.  Is anyone working on that?
>>>
>>> I would like to take a shot at that if someone else hasn't already cooked
>>> up a patch.  Working on making RelationGetPartitionDispatchInfo() a
>>> routine callable from both within the planner and the executor should be a
>>> worthwhile effort.
>>
>> What I am currently working on is to see if we can call
>> find_all_inheritors() or find_inheritance_children() instead of
>> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS().
>> Possibly we don't have to refactor it completely.
>> find_inheritance_children() needs to return the oids in canonical
>> order. So in find_inheritance_children () need to re-use part of
>> RelationBuildPartitionDesc() where it generates those oids in that
>> order. I am checking this part, and am going to come up with an
>> approach based on findings.
>
> The other approach is to make canonical ordering only in
> find_all_inheritors() by replacing call to find_inheritance_children()
> with the refactored/modified RelationGetPartitionDispatchInfo(). But
> that would mean that the callers of find_inheritance_children() would
> have one ordering, while the callers of find_all_inheritors() would
> have a different ordering; that brings up chances of deadlocks. That's
> why I think, we need to think about modifying the common function
> find_inheritance_children(), so that we would be consistent with the
> ordering. And then use find_inheritance_children() or
> find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes,
> there would be some modifications to
> RelationGetPartitionDispatchInfo().
>
>>
>> Also, we need to investigate whether *always* sorting the oids in
>> canonical order is going to be much more expensive than the current
>> sorting by oid value. But I guess it won't be that expensive.


Like I mentioned upthread... in expand_inherited_rtentry(), if we
replace find_all_inheritors() with something else that returns oids in
canonical order, that will change the order in which child tables
get locked, which increases the chance of deadlock: the
callers of find_all_inheritors() will lock them in one order, while
callers of expand_inherited_rtentry() will lock them in a different
order. Even in the current code, I think there is a chance of
deadlocks because RelationGetPartitionDispatchInfo() and
find_all_inheritors() have different lock ordering.

Now, to get the oids of a partitioned table's children sorted in
canonical order (i.e. using the partition bound values), we need to
either use the partition bounds to sort the oids the way it is
done in RelationBuildPartitionDesc(), or open the parent table and get
its Relation->rd_partdesc->oids[], which are already sorted in
canonical order. So if we generate oids this way in
find_all_inheritors() and find_inheritance_children(), that will
give consistent ordering everywhere. But this method is quite
expensive compared to the way oids are generated and sorted by
oid value in find_inheritance_children().

In both expand_inherited_rtentry() and
RelationGetPartitionDispatchInfo(), each of the child tables are
opened.

So, in both of these functions, what we can do is call a new
function partition_tree_walker() which does the following:
1. Lock the children in the existing order (i.e. sorted by oid
values) using the same function find_all_inheritors(). Rename
find_all_inheritors() to lock_all_inheritors(..., bool return_oids),
which returns the oid list only if requested.
2. And then scan through each of the partition

Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-05 Thread Michael Paquier
On Wed, Jul 5, 2017 at 10:19 AM, Masahiko Sawada  wrote:
> I feel that since we cannot force a WAL switch on the standby
> server, we need to find a new way to do so. I'm not sure, but it might
> be hard work and too late for PG10. Or did you mean that you have an
> idea for this?

Why not refactor do_pg_stop_backup() a bit so that the wait phase
happens even if a backup is started in recovery? Right now wait_for_archive
is ignored because no wait happens and the stop point is directly
returned to the caller. For the wait to actually happen, I don't
have a better idea than documenting the fact that a segment switch
needs to be enforced manually on the primary. That's better than
having users include WAL in their base backups but not actually
having everything they need. And I think that documenting that
properly is better than restricting things that should work.

In most workloads, multiple WAL segments can be generated per second,
and in even more of them a new segment is generated in less
than a minute, so waiting for a segment switch on the primary should
not be a problem for most users. The log message letting the user know
about the wait should be more informative when things happen on a
standby, like "waiting for segment to be finished or switched on the
primary".

If the restriction approach is preferred, I think that the check
should happen in do_pg_stop_backup as well, and not in
pg_stop_backup_v2 as your patch suggests. pg_basebackup is not able to
do non-exclusive backups but this may happen some day, who knows..
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers