Re: [PATCH] Speedup truncates of relation forks

2019-09-16 Thread Michael Paquier
On Tue, Sep 17, 2019 at 01:44:12AM +, Jamison, Kirk wrote:
> On Friday, September 13, 2019 10:06 PM (GMT+9), Fujii Masao wrote:
>> On Fri, Sep 13, 2019 at 9:51 PM Alvaro Herrera 
>> wrote:
 As committed, the smgrdounlinkfork case is actually dead code; it's
 never called from anywhere.  I left it in place just in case we want
 it someday.
>>>
>>> but if no use has appeared in 7 years, I say it's time to kill it.
>> 
>> +1
> 
> The consensus is we remove it, right?

Yes.  Just adding my +1 to nuke the function.
--
Michael




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Michael Paquier
On Mon, Sep 16, 2019 at 03:26:10PM +0900, Tattsu Yama wrote:
> I should have explained the API changes more. I added cmdtype as a given
> parameter for all functions (See below).
> Therefore, I suppose that my patch is similar to the following fix as you
> mentioned on -hackers.

Yes, that's an option I mentioned here, but it has drawbacks:
https://www.postgresql.org/message-id/20190914024547.gb15...@paquier.xyz

I have just looked at it again after a small rebase and there are
issues with the design of your patch:
- When aborting a transaction, we need to enforce a reset of the
command ID used in st_progress_command to PROGRESS_COMMAND_INVALID.
Unfortunately, your patch does not consider the case where an error
happens while a command ID is set, causing a command to still be
tracked in the next transactions of the session.  Even worse, it
prevents pgstat_progress_start_command() from being called again in
this case for another command.
- CLUSTER can rebuild indexes, and we'd likely want to be able to
track some of the information from CREATE INDEX for CLUSTER.

The second issue is perhaps fine as it is not really straightforward
to share the same progress phases across multiple commands, and we
could live without it for now, or require a follow-up patch to make
the information of CREATE INDEX available to CLUSTER.

Now, the first issue is of another caliber and a no-go :(

On HEAD, pgstat_progress_end_command() cannot stack multiple
commands, so calling it in cascade may erase the progress state of
another command (and it is not designed for that anyway), which is
what happens with CLUSTER when reindex_index() starts a new progress
report.  Still, the simplicity of the current infrastructure is very
safe when it comes to failure handling, making sure that a reset
happens as long as the command ID is not invalid.  Your patch makes
that part unpredictable.
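
To make the failure mode concrete, here is a schematic sketch (not code
from the patch; it only uses the pgstat_progress_* API that exists on
HEAD, and the OIDs and call placement are hypothetical):

/* CLUSTER starts reporting progress in the backend's single slot. */
pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
/* ... CLUSTER phases get reported ... */

/* reindex_index() then starts its own report, overwriting the CLUSTER
 * state, because st_progress_command is a single slot, not a stack: */
pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, indexOid);
/* ... */
pgstat_progress_end_command();	/* resets to PROGRESS_COMMAND_INVALID */

/* At this point the CLUSTER progress state is gone rather than restored. */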
--
Michael




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-09-16 Thread Michael Paquier
On Tue, Sep 17, 2019 at 09:23:45AM +0530, Amit Kapila wrote:
> We always return a single tuple/record from this function, so do we
> really need to return SETOF record or just returning record is
> sufficient?

Right (with the doc update).

> If you want to use the same size array, then you might want to just
> memset the previous array rather than first freeing it and then again
> allocating it.  This is not a big point, so any which way is fine.

Sure.  This is less expensive though, so I have changed it the way you
are suggesting on my local branch.
--
Michael




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-09-16 Thread Amit Kapila
On Tue, Sep 17, 2019 at 6:14 AM Michael Paquier  wrote:
>
> On Mon, Sep 16, 2019 at 11:11:06AM -0300, Alvaro Herrera wrote:
> > On 2019-Sep-16, Michael Paquier wrote:
>
> Thanks, fixed.
>
> Amit, what do you think?  Does the patch match with what you have in
> mind?
>

*
 CREATE FUNCTION heap_tuple_infomask_flags(
t_infomask integer,
t_infomask2 integer,
-   decode_combined boolean DEFAULT false)
-RETURNS text[]
+   raw_flags OUT text[],
+   combined_flags OUT text[])
+RETURNS SETOF record

We always return a single tuple/record from this function, so do we
really need to return SETOF record or just returning record is
sufficient?

*
+ pfree(flags);
+ values[0] = PointerGetDatum(a);

- pfree(d);
+ /*
+ * Build set of combined flags.  Use the same size as previously for the
+ * allocation, this likely wastes a couple of bytes but it keeps the code
+ * simple.
+ */
+ cnt = 0;
+ flags = (Datum *) palloc0(sizeof(Datum) * bitcnt);

If you want to use the same size array, then you might want to just
memset the previous array rather than first freeing it and then again
allocating it.  This is not a big point, so any which way is fine.
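
For reference, the reuse being suggested would be something like this
(a sketch only, using the variable names from the quoted hunk):

	/*
	 * Reuse the first Datum array for the combined flags instead of doing
	 * pfree() followed by a second palloc0() of the same size.
	 */
	memset(flags, 0, sizeof(Datum) * bitcnt);
	cnt = 0;
	/* ... then fill flags[] with the combined-flag datums as before ... */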

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Tatsuro Yamada

Hi Alvaro!


Is this fix strictly necessary for pg12, or is this something that we
can leave for pg13?


Not only me but many DBAs need this progress report feature on PG12,
therefore I'm trying to fix the problem. If you send another patch to
fix the problem, and it is more elegant than mine, I can withdraw my patch.
Anyway, I want to avoid this feature being reverted.
Do you have any ideas to fix the problem?


I committed a fix for the originally reported problem as da47e43dc32e in
branch REL_12_STABLE.  Is that insufficient, and if so why?



Oops, I misunderstood. I now realize you committed your patch to
fix the problem. Thanks! I'll test it later. :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=da47e43dc32e3c5916396f0cbcfa974b371e4875



I tested your patch (da47e43d) and it works fine. Thanks! :)
So, my patch improving the progress reporting API can be left for PG13.


#Test scenario
===
[Session #1]
select * from pg_stat_progress_cluster ; \watch 0.0001

[Session #2]
create table hoge as select a from generate_series(1, 10) a;
create index ind_hoge1 on hoge(a);
create index ind_hoge2 on hoge((a%2));
create index ind_hoge3 on hoge((a%3));
create index ind_hoge4 on hoge((a%4));
create index ind_hoge5 on hoge((a%5));
cluster hoge using ind_hoge1;
===

#Test result
===
22283|13593|postgres|16384|CLUSTER|initializing|0|0|0|0|0|0
...
22283|13593|postgres|16384|CLUSTER|rebuilding index|16387|10|10|0|0|0   <= Increasing from 0 to 5
22283|13593|postgres|16384|CLUSTER|rebuilding index|16387|10|10|0|0|1
22283|13593|postgres|16384|CLUSTER|rebuilding index|16387|10|10|0|0|2
22283|13593|postgres|16384|CLUSTER|rebuilding index|16387|10|10|0|0|3
22283|13593|postgres|16384|CLUSTER|rebuilding index|16387|10|10|0|0|4
22283|13593|postgres|16384|CLUSTER|performing final cleanup|16387|10|10|0|0|5
===


Thanks,
Tatsuro Yamada






Re: Fwd: Extending range type operators to cope with elements

2019-09-16 Thread David Fetter
On Sun, Sep 15, 2019 at 04:30:52PM +0200, Esteban Zimanyi wrote:
> > So yes, I've had a need for those operators in the past. What I don't
> know is whether adding these functions will be worth the catalog clutter.
> 
> The operators are tested and running within MobilityDB. It concerns lines
> 231-657 for the C code in file
> https://github.com/ULB-CoDE-WIT/MobilityDB/blob/master/src/rangetypes_ext.c
> 
> and lines 32-248 for the SQL code in file
> https://github.com/ULB-CoDE-WIT/MobilityDB/blob/master/src/sql/07_rangetypes_ext.in.sql
> 
> Since you don't really use PR, please let me know whether I can be of
> any help.

It's not done by pull request at this time. Instead, it is done by sending
patches to this mailing list.

http://wiki.postgresql.org/wiki/Development_information
http://wiki.postgresql.org/wiki/Submitting_a_Patch
https://wiki.postgresql.org/wiki/So,_you_want_to_be_a_developer%3F
http://www.interdb.jp/pg/

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: refactoring - share str2*int64 functions

2019-09-16 Thread Michael Paquier
On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote:
> The code paths of the patch calling pg_strtouint64_check to parse
> completion tags (spi.c and pg_stat_statements.c) should complain in
> those cases as the format of the tags for SELECT and COPY is fixed.
> 
> In order to unify the parsing interface and put all the conversion
> routines in a single place, I still think that the patch has value so
> I would still keep it (with a fix for the queryId parsing of course),
> but there is much more to it.

Forgot to mention that another angle of attack would be of course to
add some control flags to the optimized parsing functions to make them
more permissive regarding the handling of trailing characters, by not
treating it as a syntax error when the character after the parsed
value is not the zero terminator.
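
A minimal standalone sketch of what such a flag could look like (a
hypothetical function, not part of the patch, built only on the
standard strtoull()):

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical permissive/strict wrapper: with allow_trailing = false, any
 * character left after the digits is treated as a syntax error; with
 * allow_trailing = true, parsing simply stops at the first non-digit, which
 * is what strtoul()/strtoull() do on their own.
 */
static bool
parse_uint64_flags(const char *str, uint64_t *result, bool allow_trailing)
{
	char	   *end;

	errno = 0;
	*result = strtoull(str, &end, 10);
	if (errno != 0 || end == str)
		return false;			/* overflow, or no digits at all */
	if (!allow_trailing && *end != '\0')
		return false;			/* trailing junk rejected in strict mode */
	return true;
}

With allow_trailing = true, a node field string like "12089 :relkind v"
parses to 12089 and stops at the space; in strict mode it is rejected,
which is the incompatibility discussed in this thread.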
--
Michael




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Tatsuro Yamada

Hi Alvaro!


Is this fix strictly necessary for pg12, or is this something that we
can leave for pg13?


Not only me but many DBAs need this progress report feature on PG12,
therefore I'm trying to fix the problem. If you send another patch to
fix the problem, and it is more elegant than mine, I can withdraw my patch.
Anyway, I want to avoid this feature being reverted.
Do you have any ideas to fix the problem?


I committed a fix for the originally reported problem as da47e43dc32e in
branch REL_12_STABLE.  Is that insufficient, and if so why?



Oops, I misunderstood. I now realize you committed your patch to
fix the problem. Thanks! I'll test it later. :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=da47e43dc32e3c5916396f0cbcfa974b371e4875


Thanks,
Tatsuro Yamada





Re: refactoring - share str2*int64 functions

2019-09-16 Thread Michael Paquier
On Mon, Sep 16, 2019 at 10:08:19AM -0700, Andres Freund wrote:
> On 2019-09-14 15:02:36 +0900, Michael Paquier wrote:
>> - Switching OID to use pg_strtoint32 causes a failure with initdb.
> 
> Needs to be debugged too. Although I suspect this might just be that you
> need to use unsigned variant.

No, that's not it.  My last message had a typo as I used the unsigned 
variant.  Anyway, by switching the routines of readfuncs.c to use the
new _check wrappers it is easy to see what the actual issue is.  And
here we go with one example:
FATAL:  invalid input syntax for type uint32: "12089 :relkind v"

So, the root of the problem is that the optimized conversion routines
complain if the end of the string includes unexpected characters when
converting a node from text, which is not something that strtoXX
complains about.  So our wrappers atooid() and atoui() accept more
types of strings in input as they rely on the system's strtol().  And
that accounts for the failures with UINT and OID.  atoi() is also more
flexible on that, which explains the failures for INT, as well as
atol() for LONG (this shows a failure in the regression tests, not at
initdb time though).

One may think that this restriction does not actually apply to
READ_UINT64_FIELD because the routine involves nothing other than
queryId.  However, once I enable -DWRITE_READ_PARSE_PLAN_TREES
-DCOPY_PARSE_PLAN_TREES in the builds, queryId parsing also complains
with the patch.  So unless we redesign the node reads we are bound to
keep around the strtoXX wrapper of HEAD called pg_strtouint64() to
avoid an incompatibility when parsing the 64-bit query ID.  We could
keep that isolated in readfuncs.c close to the declarations of
atoui() and strtobool() though.  This also points out that
pg_strtouint64 of HEAD is inconsistent with its signed relatives in
the treatment of the input string.

The code paths of the patch calling pg_strtouint64_check to parse
completion tags (spi.c and pg_stat_statements.c) should complain in
those cases as the format of the tags for SELECT and COPY is fixed.

In order to unify the parsing interface and put all the conversion
routines in a single place, I still think that the patch has value so
I would still keep it (with a fix for the queryId parsing of course),
but there is much more to it.

>> So while I agree with you that a switch should be doable, there is a
>> large set of issues to ponder about here, and the patch already does a
>> lot, so I really think that we had better do a closer lookup at those
>> issues separately, once the basics are in place, and consider them if
>> they actually make sense.  There is much more than just doing a direct
>> switch in this area with the family of ato*() system calls.
> 
> I have no problem applying this separately, but I really don't think
> it's wise to apply this before these problems have been debugged.

Sure.  No problem with that line of reasoning.
--
Michael




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Alvaro Herrera
On 2019-Sep-17, Tatsuro Yamada wrote:

> On 2019/09/16 23:12, Alvaro Herrera wrote:

> > Is this fix strictly necessary for pg12, or is this something that we
> > can leave for pg13?
> 
> Not only me but many DBAs need this progress report feature on PG12,
> therefore I'm trying to fix the problem. If you send another patch to
> fix the problem, and it is more elegant than mine, I can withdraw my patch.
> Anyway, I want to avoid this feature being reverted.
> Do you have any ideas to fix the problem?

I committed a fix for the originally reported problem as da47e43dc32e in
branch REL_12_STABLE.  Is that insufficient, and if so why?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Tatsuro Yamada

Hi Alvaro,


On 2019/09/16 23:12, Alvaro Herrera wrote:

On 2019-Sep-16, Tattsu Yama wrote:


I should have explained the API changes more. I added cmdtype as a given
parameter for all functions (See below).
Therefore, I suppose that my patch is similar to the following fix as you
mentioned on -hackers.


Is this fix strictly necessary for pg12, or is this something that we
can leave for pg13?



Not only me but many DBAs need this progress report feature on PG12,
therefore I'm trying to fix the problem. If you send another patch to
fix the problem, and it is more elegant than mine, I can withdraw my patch.
Anyway, I want to avoid this feature being reverted.
Do you have any ideas to fix the problem?


Thanks,
Tatsuro Yamada








RE: [PATCH] Speedup truncates of relation forks

2019-09-16 Thread Jamison, Kirk
On Friday, September 13, 2019 10:06 PM (GMT+9), Fujii Masao wrote:
> On Fri, Sep 13, 2019 at 9:51 PM Alvaro Herrera 
> wrote:
> >
> > On 2019-Sep-13, Fujii Masao wrote:
> >
> > > On Mon, Sep 9, 2019 at 3:52 PM Jamison, Kirk 
> wrote:
> >
> > > > > Please add a preliminary patch that removes the function.  Dead
> > > > > code is good, as long as it is gone.  We can get it pushed ahead of
> the rest of this.
> > > >
> > > > Alright. I've attached a separate patch removing the smgrdounlinkfork.
> > >
> > > Per the past discussion, some people want to keep this "dead"
> > > function for some reasons. So, in my opinion, it's better to just
> > > enclose the function with #if NOT_USED and #endif, to keep the
> > > function itself as it is, and then to start new discussion on
> > > hackers about the removal of that separatedly from this patch.
> >
> > I searched for anybody requesting to keep the function.  I couldn't
> > find anything.  Tom said in 2012:
> > https://www.postgresql.org/message-id/1471.1339106...@sss.pgh.pa.us
> 
> Yes. And I found Andres.
> https://www.postgresql.org/message-id/20180621174129.hogefyopje4xaznu@alap3.anarazel.de
> 
> > > As committed, the smgrdounlinkfork case is actually dead code; it's
> > > never called from anywhere.  I left it in place just in case we want
> > > it someday.
> >
> > but if no use has appeared in 7 years, I say it's time to kill it.
> 
> +1

The consensus is we remove it, right?
Re-attaching the patch that removes the dead code: smgrdounlinkfork().

---
I've also addressed Fujii-san's comments below in the latest attached
speedup-truncate-rel patch (v8).
> Here are other comments for the latest patch:
> 
> + block = visibilitymap_truncate_prepare(rel, 0);
> + if (BlockNumberIsValid(block))
> + fork = VISIBILITYMAP_FORKNUM;
> +
> + smgrtruncate(rel->rd_smgr, &fork, 1, &block);
> 
> If visibilitymap_truncate_prepare() returns InvalidBlockNumber,
> smgrtruncate() should not be called.
> 
> + FreeSpaceMapVacuumRange(rel, first_removed_nblocks,
> + InvalidBlockNumber);
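
For reference, the guard being asked for would look roughly like this
(a sketch using the names from the quoted snippet of the v8 patch, not
the final code):

	block = visibilitymap_truncate_prepare(rel, 0);
	if (BlockNumberIsValid(block))
	{
		fork = VISIBILITYMAP_FORKNUM;
		smgrtruncate(rel->rd_smgr, &fork, 1, &block);
	}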

Thank you again for the review!

Regards,
Kirk Jamison


v1-0001-Remove-deadcode-smgrdounlinkfork.patch
Description: v1-0001-Remove-deadcode-smgrdounlinkfork.patch


v8-0001-Speedup-truncates-of-relation-forks.patch
Description: v8-0001-Speedup-truncates-of-relation-forks.patch


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-09-16 Thread Michael Paquier
On Mon, Sep 16, 2019 at 11:11:06AM -0300, Alvaro Herrera wrote:
> On 2019-Sep-16, Michael Paquier wrote:
>> On Mon, Sep 16, 2019 at 11:46:16AM +0530, Amit Kapila wrote:
>> Okay, using two separate columns leads to the attached.  Any thoughts?
>> This also includes a fix for cases with IS_LOCKED_ONLY and UPGRADED.
> 
> I like how it looks in the expected test output.  Didn't review the code
> closely, but it looks reasonable in a quick glance.

Thanks for the review.

> Whitespace nitpick: pgindent will do something very annoying with
> this:
>
>> +	PG_RETURN_DATUM(
>> +		HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
> 
> I suggest to split the line in the first comma, not in the parens.

Indeed.  I am switching to using a HeapTuple as an intermediate step
instead.

>> +  Combined flags include for example HEAP_XMIN_FROZEN
>> +  which is defined as a set of HEAP_XMIN_COMMITTED
>> +  and HEAP_XMIN_INVALID.
> 
> I suggest something like "Combined flags are displayed for source-level
> macros that take into account the value of more than one raw bit, such
> as HEAP_XMIN_FROZEN".  (We probably don't want an exhaustive list, which
> becomes annoying to maintain; users can refer to the source file.)

Yes, I didn't want to provide a list for that exact reason, and your
suggested change sounds fine to me.

> There's a typo "bites" in a comment.

Thanks, fixed.

Amit, what do you think?  Does the patch match with what you have in
mind?
--
Michael
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 6a09d46a57..c8f6e1b4c6 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -86,80 +86,55 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
 -- always be the same in all test runs. we show raw flags by
 -- default: HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID.
 VACUUM FREEZE test1;
-SELECT t_infomask, t_infomask2, flags
+SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
- LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(flags);
- t_infomask | t_infomask2 |   flags   
-+-+---
-   2816 |   2 | {HEAP_XMAX_INVALID,HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID}
+ LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(raw_flags, combined_flags);
+ t_infomask | t_infomask2 | raw_flags |   combined_flags   
++-+---+
+   2816 |   2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
 (1 row)
 
 -- output the decoded flag HEAP_XMIN_FROZEN instead
-SELECT t_infomask, t_infomask2, flags
+SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
- LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2, true) m(flags);
- t_infomask | t_infomask2 |flags 
-+-+--
-   2816 |   2 | {HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
+ LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(raw_flags, combined_flags);
+ t_infomask | t_infomask2 | raw_flags |   combined_flags   
++-+---+
+   2816 |   2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
 (1 row)
 
 -- tests for decoding of combined flags
 -- HEAP_XMAX_SHR_LOCK = (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
-SELECT heap_tuple_infomask_flags(x'0050'::int, 0, true);
- heap_tuple_infomask_flags 

- {HEAP_XMAX_SHR_LOCK}
-(1 row)
-
-SELECT heap_tuple_infomask_flags(x'0050'::int, 0, false);
-  heap_tuple_infomask_flags  
--
- {HEAP_XMAX_EXCL_LOCK,HEAP_XMAX_KEYSHR_LOCK}
+SELECT * FROM heap_tuple_infomask_flags(x'0050'::int, 0);
+  raw_flags  |combined_flags
+-+--
+ {HEAP_XMAX_KEYSHR_LOCK,HEAP_XMAX_EXCL_LOCK} | {HEAP_XMAX_SHR_LOCK}
 (1 row)
 
 -- HEAP_XMIN_FROZEN = (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)
-SELECT heap_tuple_infomask_flags(x'0300'::int, 0, true);
- heap_tuple_infomask_flags 

- {HEAP_XMIN_FROZEN}
-(1 row)
-
-SELECT heap_tuple_infomask_flags(x'0300'::int, 0, false);
-heap_tuple_infomask_flags
--
- {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID}
+SELECT * FROM heap_tuple_infomask_flags(x'0300'::int, 0);
+

Re: SQL/JSON: JSON_TABLE

2019-09-16 Thread Nikita Glukhov

On 23.07.2019 16:58, Pavel Stehule wrote:


I got warning

ar crs libpgcommon.a base64.o config_info.o controldata_utils.o d2s.o 
exec.o f2s.o file_perm.o ip.o keywords.o kwlookup.o link-canary.o 
md5.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o rmtree.o 
saslprep.o scram-common.o string.o unicode_norm.o username.o wait_error.>

...skipping...
clauses.c:1076:3: warning: this ‘if’ clause does not guard... 
[-Wmisleading-indentation]

 1076 |   if (ExecEvalJsonNeedsSubTransaction(jsexpr, NULL))
      |   ^~
clauses.c:1078:4: note: ...this statement, but the latter is 
misleadingly indented as if it were guarded by the ‘if’

 1078 |    return true;
      |    ^~
gcc -Wall -Wmissing-protot


Fixed in 38th version. Thanks.



Regress tests diff is not empty - see attached file


Unfortunately, this is not reproducible on my machine, but really seems to be
a bug.



some strange fragments from code:

    deparseExpr(node->arg, context);
-   if (node->relabelformat != COERCE_IMPLICIT_CAST)
+   if (node->relabelformat != COERCE_IMPLICIT_CAST &&
+       node->relabelformat == COERCE_INTERNAL_CAST)


There obviously should be

node->relabelformat != COERCE_INTERNAL_CAST

Fixed in 38th version. Thanks.


Now, "format" is a type_func_name_keyword, so when you use it, nobody
can use "format" as a column name. It can break a lot of applications.
"format" is a common name. It is rather unfortunate, and it can affect
a lot of users.


FORMAT was moved to unreserved_keywords in the 38th version. I remember that
there were conflicts with FORMAT, but now it works as an unreserved_keyword.



This patch set (JSON functions & JSON_TABLE) has more than 20K rows.
Moreover, more than a few features are implemented.


Is it possible to isolate these features better (more deeply), please? I have
nothing against any implemented feature, but it is hard to work
intensively (to test hard) on such a large patch. JSON_TABLE has only
184kB, can we start with this patch?


SQLJSON_FUNCTIONS has 760kB - it is maybe too much for one feature, 
one patch.



Patch 0001 is simply a squash of all 7 patches from the thread
"SQL/JSON: functions".  These patches are preliminary for JSON_TABLE.

Only patch 0002 needs to be reviewed in this thread.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Nondeterministic collations vs. text_pattern_ops

2019-09-16 Thread Tom Lane
Whilst poking at the leakproofness-of-texteq issue, I realized
that there's an independent problem caused by the nondeterminism
patch.  To wit, that the text_pattern_ops btree opclass uses
texteq as its equality operator, even though that operator is
no longer guaranteed to be bitwise equality.  That means that
depending on which collation happens to get attached to the
operator, equality might be inconsistent with the other members
of the opclass, leading to who-knows-what bad results.

bpchar_pattern_ops has the same issue with respect to bpchareq.

The obvious fix for this is to invent separate new equality operators,
but that's actually rather disastrous for performance, because
text_pattern_ops indexes would no longer be able to use WHERE clauses
using plain equality.  That also feeds into whether equality clauses
deduced from equivalence classes will work for them (nope, not any
more).  People using such indexes are just about certain to be
bitterly unhappy.

We may not have any choice but to do that, though --- I sure don't
see any other easy fix.  If we could be certain that the collation
attached to the operator is deterministic, then it would still work
with a pattern_ops index, but that's not a concept that the index
infrastructure has got right now.

Whatever we do about this is likely to require a catversion bump,
meaning we've got to fix it *now*.

regards, tom lane




Re: Define jsonpath functions as stable

2019-09-16 Thread Chapman Flack
On 09/16/19 17:10, Tom Lane wrote:

> I was initially troubled
> by the fact that XML Schema regexps are implicitly anchored, ie must
> match the whole string; that's a huge difference from POSIX.  However,
> 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
> predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
> defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
> except for some weirdness around newline matching; and that spec
> clearly says that fn:matches treats its pattern argument as NOT anchored.

Yeah, it's a layer cake. XML Schema regexps[1] are implicitly anchored and
don't have any metacharacters devoted to anchoring.

XQuery regexps layer onto[2] XML Schema regexps, adding ^ and $ anchors,
rescinding the implicit anchored-ness, adding reluctant quantifiers,
capturing groups, and back-references, and defining flags.

Then ISO SQL adds a third layer changing the newline semantics, affecting
^, $, ., \s, and \S.

Regards,
-Chap


[1] https://www.w3.org/TR/xmlschema-2/#regexs
[2] https://www.w3.org/TR/xpath-functions-31/#regex-syntax




Re: Define jsonpath functions as stable

2019-09-16 Thread Jonathan S. Katz
On 9/16/19 5:10 PM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> On 9/16/19 11:20 AM, Tom Lane wrote:
>>> I think we could possibly get away with not having any special marker
>>> on regexes, but just explaining in the documentation that "features
>>> so-and-so are not implemented".  Writing that text would require closer
>>> analysis than I've seen in this thread as to exactly what the differences
>>> are.
> 
>> +1, and likely would need some example strings too that highlight the
>> difference in how they are processed.
> 
> I spent an hour digging through these specs.

Thanks! That sounds like quite the endeavor...

>  I was initially troubled
> by the fact that XML Schema regexps are implicitly anchored, ie must
> match the whole string; that's a huge difference from POSIX.  However,
> 19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
> predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
> defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
> except for some weirdness around newline matching; and that spec
> clearly says that fn:matches treats its pattern argument as NOT anchored.
> So it looks like we end up in the same place as POSIX for this.
> 
> Otherwise, the pattern language differences I could find are all details
> of character class expressions (bracket expressions, such as "[a-z0-9]")
> and escapes that are character class shorthands:
> 
> * We don't have "character class subtraction".  I'd be pretty hesitant
> to add that to our regexp language because it seems to change "-" into
> a metacharacter, which would break an awful lot of regexps.  I might
> be misunderstanding their syntax for it, because elsewhere that spec
> explicitly claims that "-" is not a metacharacter.

Using something I could understand[1] it looks like the syntax is like:

[a-z-[aeiou]]

e.g. all the consonants of the alphabet. I don't believe that would
break many, if any, regexps. I also don't know what kind of effort it
would take to add that in, given I had not looked at the regexp code
until today (and only at some of the amusing comments in the header
file, which suggest nobody expected the code to still be read 20
years later), but it would likely not be a v12 problem.

> * Character class elements can be #xNN (NN being hex digits), which seems
> equivalent to POSIX \xNN as long as you're using UTF8 encoding.  Again,
> the compatibility costs of allowing that don't seem attractive, since #
> isn't a metacharacter today.

Seems reasonable.

> * Character class elements can be \p{UnicodeProperty} or
> the complement \P{UnicodeProperty}, where there are a bunch of different
> possible properties.  Perhaps we could add that someday; since there's no
> reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
> compatibility hit.  But I'm content to document this as unimplemented
> for now.

+1.

> * XQuery adds character class shorthands \i (complement \I) for "initial
> name characters" and \c (complement \C) for "NameChar".  Same as above;
> maybe add someday, but no hurry.

+1.

> * It looks like XQuery's \w class might allow more characters than our
> interpretation does, and hence \W allows fewer.  But since \w devolves
> to what libc thinks the "alnum" class is, it's at least possible that
> some locales might do the same thing XQuery calls for.

I'd still add this to the "to document" list.

> * The SQL-spec newline business mentioned above is a possible exception:
> it appears to require that when '.' is allowed to match newlines, a
> single '.' should match a '\r\n' Windows newline.  I think we can
> document that and move on.

+1.
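
As a concrete illustration of that gap (a standalone sketch using POSIX
<regex.h>, which our regexp semantics are closest to; this is not
PostgreSQL code):

#include <regex.h>
#include <stdio.h>

int
main(void)
{
	regex_t		re;
	int			rc;

	/* Under POSIX rules a single '.' matches exactly one character. */
	if (regcomp(&re, "^a.b$", REG_EXTENDED) != 0)
		return 1;

	/* "a\r\nb" has two characters between 'a' and 'b', so this does not
	 * match, whereas the SQL-spec reading quoted above would want a single
	 * '.' to swallow the whole "\r\n" pair. */
	rc = regexec(&re, "a\r\nb", 0, NULL, 0);
	printf("a.b vs a<CR><LF>b: %s\n", rc == 0 ? "match" : "no match");

	regfree(&re);
	return 0;
}

This prints "no match", which is the kind of example string the
documentation could show.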

> * The x flag in XQuery is defined as ignoring all whitespace in
> the pattern except within character class expressions.  Spencer's
> x flag does mostly that, but it thinks that "\ " means a literal space
> whereas XQuery explicitly says that the space is ignored and the
> backslash applies to the next non-space character.  (That's just
> weird, in my book.)  Also, Spencer's x mode causes # to begin
> a comment extending to EOL, which is a nice thing XQuery hasn't
> got, and it says you can't put spaces within multi-character
> symbols like "(?:", which presumably is allowed with XQuery's "x".
> 
> I feel a bit uncomfortable with these inconsistencies in x-flag
> rules.  We could probably teach the regexp library to have an
> alternate expanded mode that matches XQuery's rules, but that's
> not a project to tackle for v12.

That does not sound fun by any means. But likely that would be a part of
an overall effort to implement XQuery rules.

>  I tentatively recommend that
> we remove the jsonpath "x" flag for the time being.

I would add the alternative suggestion of just removing the claim that
"x" is supported from the documentation... but it is likely better to
just remove the flag + docs.

> Also, I noted some things that seem to be flat out sloppiness
> in the XQuery flag conversions:
> 
> * The 

Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Thomas Munro
On Tue, Sep 17, 2019 at 3:09 AM Kuntal Ghosh  wrote:
> On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
> > Agreed.  I added a line to break out of that loop if !block->in_use.
> >
> I think we should skip the block if !block->in_use. Because, the undo
> buffer can be registered in a subsequent block as well. For different
> operations, we can use different block_id to register the undo buffer
> in the redo record.

Oops, right.  So it should just be added to the if condition.  Will do.

-- 
Thomas Munro
https://enterprisedb.com




Re: [PATCH][PROPOSAL] Add enum releation option type

2019-09-16 Thread Alvaro Herrera
On 2019-Sep-13, Nikolay Shaplov wrote:

> I've played around with it, and did some googling, but without much success.
> If we are moving this way (and this way seems to be a good one), I need your
> help, because this thing is beyond my C knowledge; I will not manage it myself.

So I kinda did it ... and didn't like the result very much.

Partly, maybe the problem is not one that this patch introduces, but
just a pre-existing problem in reloptions: namely, does an access/
module (gist, rel) depend on reloptions, or does reloptions.c depend on
access modules?  It seems that there is no actual modularity separation
between these; currently both seem to depend on each other, if only
because there's a central list (arrays) of valid reloptions.  If we did
away with that and instead had each module define its own reloptions,
maybe that problem would go away.  But how would that work?
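
For illustration, module-side registration using the add_enum_reloption()
interface from the attached patch might look roughly like this (the
module, option, and enum names here are hypothetical):

/* Hypothetical values an access method module would define for itself. */
typedef enum MyBufferingMode
{
	MY_BUFFERING_ON,
	MY_BUFFERING_OFF,
	MY_BUFFERING_AUTO
} MyBufferingMode;

static relopt_enum_elt_def myBufferingValues[] =
{
	{"on", MY_BUFFERING_ON},
	{"off", MY_BUFFERING_OFF},
	{"auto", MY_BUFFERING_AUTO},
	{(const char *) NULL}
};

void
my_module_init(void)
{
	/* The module, not reloptions.c, owns the list of valid values. */
	add_enum_reloption(RELOPT_KIND_GIST, "buffering",
					   "Enables buffering build for this index",
					   myBufferingValues, MY_BUFFERING_AUTO);
}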

Also: I'm not sure about the stuff that coerces the enum value to an int
generically; that works fine with my compiler but I'm not sure it's
kosher.

Thirdly: I'm not sure that what I suggested is syntactically valid,
because C doesn't let you define an array in an initializer in the
middle of another array.  If there is a way, it requires some syntax
trick that I'm not familiar with.  So I had to put the valid-values
arrays separate from the main enum options, after all, which kinda sucks.

Finally (but this is easily fixable), with this patch the
allocate_reloption stuff is broken (needs to pstrdup() the "Valid values
are" message, which needs to be passed as a new argument).

All in all, I don't know what to think of this.  It seemed like a good
idea, but maybe in practice it's not so great.

Do I hear other opinions?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 20f4ed3c38..7b57eab1c5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -433,7 +433,22 @@ static relopt_real realRelOpts[] =
 	{{NULL}}
 };
 
-static relopt_string stringRelOpts[] =
+relopt_enum_elt_def gistBufferingOptValues[] =
+{
+	{ "on", GIST_OPTION_BUFFERING_ON },
+	{ "off", GIST_OPTION_BUFFERING_OFF },
+	{ "auto", GIST_OPTION_BUFFERING_AUTO },
+	{ (const char *) NULL }
+};
+
+relopt_enum_elt_def viewCheckOptValues[] =
+{
+	{ "local", VIEW_OPTION_CHECK_OPTION_LOCAL },
+	{ "cascaded", VIEW_OPTION_CHECK_OPTION_CASCADED },
+	{ (const char *) NULL }
+};
+
+static relopt_enum enumRelOpts[] =
 {
 	{
 		{
@@ -442,10 +457,9 @@ static relopt_string stringRelOpts[] =
 			RELOPT_KIND_GIST,
 			AccessExclusiveLock
 		},
-		4,
-		false,
-		gistValidateBufferingOption,
-		"auto"
+		gistBufferingOptValues,
+		GIST_OPTION_BUFFERING_AUTO,
+		"Valid values are \"on\", \"off\", and \"auto\"."
 	},
 	{
 		{
@@ -454,11 +468,19 @@ static relopt_string stringRelOpts[] =
 			RELOPT_KIND_VIEW,
 			AccessExclusiveLock
 		},
-		0,
-		true,
-		validateWithCheckOption,
-		NULL
+		viewCheckOptValues,
+		VIEW_OPTION_CHECK_OPTION_NOT_SET,
+		"Valid values are \"local\" and \"cascaded\"."
 	},
+	{
+		{
+			NULL
+		}
+	}
+};
+
+static relopt_string stringRelOpts[] =
+{
 	/* list terminator */
 	{{NULL}}
 };
@@ -505,6 +527,12 @@ initialize_reloptions(void)
    realRelOpts[i].gen.lockmode));
 		j++;
 	}
+	for (i = 0; enumRelOpts[i].gen.name; i++)
+	{
+		Assert(DoLockModesConflict(enumRelOpts[i].gen.lockmode,
+   enumRelOpts[i].gen.lockmode));
+		j++;
+	}
 	for (i = 0; stringRelOpts[i].gen.name; i++)
 	{
 		Assert(DoLockModesConflict(stringRelOpts[i].gen.lockmode,
@@ -543,6 +571,14 @@ initialize_reloptions(void)
 		j++;
 	}
 
+	for (i = 0; enumRelOpts[i].gen.name; i++)
+	{
+		relOpts[j] = &enumRelOpts[i].gen;
+		relOpts[j]->type = RELOPT_TYPE_ENUM;
+		relOpts[j]->namelen = strlen(relOpts[j]->name);
+		j++;
+	}
+
 	for (i = 0; stringRelOpts[i].gen.name; i++)
 	{
		relOpts[j] = &stringRelOpts[i].gen;
@@ -640,6 +676,9 @@ allocate_reloption(bits32 kinds, int type, const char *name, const char *desc)
 		case RELOPT_TYPE_REAL:
 			size = sizeof(relopt_real);
 			break;
+		case RELOPT_TYPE_ENUM:
+			size = sizeof(relopt_enum);
+			break;
 		case RELOPT_TYPE_STRING:
 			size = sizeof(relopt_string);
 			break;
@@ -718,6 +757,24 @@ add_real_reloption(bits32 kinds, const char *name, const char *desc, double defa
 	add_reloption((relopt_gen *) newoption);
 }
 
+/*
+ * add_enum_reloption
+ *		Add a new enum reloption
+ */
+void
+add_enum_reloption(bits32 kinds, const char *name, const char *desc,
+   relopt_enum_elt_def *members, int default_val)
+{
+	relopt_enum *newoption;
+
+	newoption = (relopt_enum *) allocate_reloption(kinds, RELOPT_TYPE_ENUM,
+   name, desc);
+	newoption->members = members;
+	newoption->default_val = default_val;
+
+	add_reloption((relopt_gen *) newoption);
+}
+
 /*
  * add_string_reloption
  *		Add a new string reloption
@@ -1234,6 +1291,35 @@ 

Re: [proposal] de-TOAST'ing using a iterator

2019-09-16 Thread Alvaro Herrera
On 2019-Sep-10, Binguo Bao wrote:

> +/*
> + * Support for de-TOASTing toasted value iteratively. "need" is a pointer
> + * between the beginning and end of iterator's ToastBuffer. The macro
> + * de-TOASTs all bytes before "need" into iterator's ToastBuffer.
> + */
> +#define PG_DETOAST_ITERATE(iter, need) \
> +	do { \
> +		Assert((need) >= (iter)->buf->buf && (need) <= (iter)->buf->capacity); \
> +		while (!(iter)->done && (need) >= (iter)->buf->limit) { \
> +			detoast_iterate(iter); \
> +		} \
> +	} while (0)
>  /* WARNING -- unaligned pointer */
>  #define PG_DETOAST_DATUM_PACKED(datum) \
>  	pg_detoast_datum_packed((struct varlena *) DatumGetPointer(datum))

In broad terms this patch looks pretty good to me.  I only have a small
quibble with this API definition in fmgr.h -- namely that it forces us
to export the definition of all the structs (that could otherwise be
private to toast_internals.h) in order to satisfy callers of this macro.
I am wondering if it would be possible to change detoast_iterate and
PG_DETOAST_ITERATE in a way that those details remain hidden -- I am
thinking something like "if this returns NULL, then iteration has
finished"; and relieve the macro from doing the "->buf->buf" and
"->buf->limit" checks.  I think changing that would require a change in
how the rest of the code is structured around this (the textpos internal
function), but seems like it would be better overall.

(AFAICS that would enable us to expose much less about the
iterator-related structs to detoast.h -- you should be able to move the
struct defs to toast_internals.h)

Then again, it might be just wishful thinking, but it seems worth
considering at least.
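
For what it's worth, one possible shape of that thinner interface
(purely a sketch of the idea, not code from the patch; DetoastIterator
is the iterator type assumed from the patch, and the function is
assumed to be cheap when no further work is needed) could be:

/*
 * Sketch only: detoast_iterate() advances the iterator and returns a pointer
 * to the end of the data de-TOASTed so far, or NULL once iteration is done,
 * so callers never dereference iter->buf themselves.
 */
extern const char *detoast_iterate(DetoastIterator iter);

#define PG_DETOAST_ITERATE(iter, need) \
	do { \
		const char *limit_; \
		while ((limit_ = detoast_iterate(iter)) != NULL && (need) >= limit_) \
			; \
	} while (0)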

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Define jsonpath functions as stable

2019-09-16 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 9/16/19 11:20 AM, Tom Lane wrote:
>> I think we could possibly get away with not having any special marker
>> on regexes, but just explaining in the documentation that "features
>> so-and-so are not implemented".  Writing that text would require closer
>> analysis than I've seen in this thread as to exactly what the differences
>> are.

> +1, and likely would need some example strings too that highlight the
> difference in how they are processed.

I spent an hour digging through these specs.  I was initially troubled
by the fact that XML Schema regexps are implicitly anchored, ie must
match the whole string; that's a huge difference from POSIX.  However,
19075-6 says that jsonpath like_regex works the same as the LIKE_REGEX
predicate in SQL; and SQL:2011 "9.18 XQuery regular expression matching"
defines LIKE_REGEX to work exactly like XQuery's fn:matches function,
except for some weirdness around newline matching; and that spec
clearly says that fn:matches treats its pattern argument as NOT anchored.
So it looks like we end up in the same place as POSIX for this.

Otherwise, the pattern language differences I could find are all details
of character class expressions (bracket expressions, such as "[a-z0-9]")
and escapes that are character class shorthands:

* We don't have "character class subtraction".  I'd be pretty hesitant
to add that to our regexp language because it seems to change "-" into
a metacharacter, which would break an awful lot of regexps.  I might
be misunderstanding their syntax for it, because elsewhere that spec
explicitly claims that "-" is not a metacharacter.

* Character class elements can be #xNN (NN being hex digits), which seems
equivalent to POSIX \xNN as long as you're using UTF8 encoding.  Again,
the compatibility costs of allowing that don't seem attractive, since #
isn't a metacharacter today.

* Character class elements can be \p{UnicodeProperty} or
the complement \P{UnicodeProperty}, where there are a bunch of different
possible properties.  Perhaps we could add that someday; since there's no
reason to escape "p" or "P" today, this doesn't seem like it'd be a huge
compatibility hit.  But I'm content to document this as unimplemented
for now.

* XQuery adds character class shorthands \i (complement \I) for "initial
name characters" and \c (complement \C) for "NameChar".  Same as above;
maybe add someday, but no hurry.

* It looks like XQuery's \w class might allow more characters than our
interpretation does, and hence \W allows fewer.  But since \w devolves
to what libc thinks the "alnum" class is, it's at least possible that
some locales might do the same thing XQuery calls for.

* Likewise, any other discrepancies between the Unicode-centric character
class definitions in XQuery and what our stuff does are well within the
boundaries of locale variances.  So I don't feel too bad about that.

* The SQL-spec newline business mentioned above is a possible exception:
it appears to require that when '.' is allowed to match newlines, a
single '.' should match a '\r\n' Windows newline.  I think we can
document that and move on.

* The x flag in XQuery is defined as ignoring all whitespace in
the pattern except within character class expressions.  Spencer's
x flag does mostly that, but it thinks that "\ " means a literal space
whereas XQuery explicitly says that the space is ignored and the
backslash applies to the next non-space character.  (That's just
weird, in my book.)  Also, Spencer's x mode causes # to begin
a comment extending to EOL, which is a nice thing XQuery hasn't
got, and it says you can't put spaces within multi-character
symbols like "(?:", which presumably is allowed with XQuery's "x".

I feel a bit uncomfortable with these inconsistencies in x-flag
rules.  We could probably teach the regexp library to have an
alternate expanded mode that matches XQuery's rules, but that's
not a project to tackle for v12.  I tentatively recommend that
we remove the jsonpath "x" flag for the time being.

Also, I noted some things that seem to be flat out sloppiness
in the XQuery flag conversions:

* The newline-matching flags (m and s flags) can be mapped to
features of Spencer's library, but jsonpath_gram.y does so
incorrectly.

* XQuery says that the q flag overrides m, s, and x flags, which is
exactly the opposite of what our code does; besides which the code
is flag-order-sensitive which is just wrong.

These last two are simple to fix and we should just go do it.
Otherwise, I think we're okay with regarding Spencer's library
as being a sufficiently close approximation to LIKE_REGEX.
We need some documentation work though.

regards, tom lane




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-16 Thread Tomas Vondra

On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote:



On 16.09.2019 19:54, Alexey Kondratov wrote:

On 30.08.2019 18:59, Konstantin Knizhnik wrote:


I think that instead of defining savepoints it is simpler and more 
efficient to use


BeginInternalSubTransaction + 
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction


as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr



Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start. So
they seem to be identical from the performance perspective, which
is also stated in the comment section:


Yes, they definitely use the same mechanism and most likely
provide similar performance.
But BeginInternalSubTransaction does not require generating a
savepoint name, which seems to be redundant in this case.





Anyway, I've profiled my apply worker (flamegraph is
attached) and it spends the vast majority of its time (>90%) applying
changes. So the problem is not in the savepoints themselves, but in
the fact that we first apply all changes and then abort all the
work. Not sure that it is possible to do something in this case.




Looks like the only way to increase apply speed is to do it in 
parallel: make it possible to concurrently execute non-conflicting 
transactions.




True, although it seems like a massive can of worms to me. I'm not aware
of a way to identify non-conflicting transactions in advance, so it would
have to be implemented as optimistic apply, with detection of and
recovery from conflicts.

I'm not against doing that, and I'm willing to spend some time on reviews
etc. but it seems like a completely separate effort.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: block-level incremental backup

2019-09-16 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost  wrote:
> > I disagree with this on a couple of levels.  The first is pretty simple-
> > we don't have all of the information.  The user may have some reason to
> > believe that timestamp-based is a bad idea, for example, and therefore
> > having an option to perform a checksum-based backup makes sense.  rsync
> > is a pretty good tool in my view and it has a very similar option-
> > because there are trade-offs to be made.  LSN is great, if you don't
> > mind reading every file of your database start-to-finish every time, but
> > in a running system which hasn't suffered from clock skew or other odd
> > issues (some of which we can also detect), it's pretty painful to scan
> > absolutely everything like that for an incremental.
> 
> There's a separate thread on using WAL-scanning to avoid having to
> scan all the data every time. I pointed it out to you early in this
> thread, too.

As discussed nearby, not everything that needs to be included in the
backup is actually going to be in the WAL though, right?  How would that
ever be able to handle the case where someone starts the server under
wal_level = logical, takes a full backup, then restarts with wal_level =
minimal, writes out a bunch of new data, and then restarts back to
wal_level = logical and takes an incremental?

How would we even detect that such a thing happened?

> > If you track the checksum of the file in the manifest then it's a pretty
> > strong validation that the backup repo hasn't been corrupted between the
> > backup and the restore.  Of course, the database could have been
> > corrupted at the source, and perhaps that's what you were getting at
> > with your 'limited extent' but that isn't what I was referring to.
> 
> Yeah, that all seems fair. Without the checksum, you can only validate
> that you have the right files and that they are the right sizes, which
> is not bad, but the checksums certainly make it stronger. But,
> wouldn't having to checksum all of the files add significantly to the
> cost of taking the backup? If so, I can imagine that some people might
> want to pay that cost but others might not. If it's basically free to
> checksum the data while we have it in memory anyway, then I guess
> there's little to be lost.

On larger systems, so many of the files are 1GB in size that checking
the file size is quite close to meaningless.  Yes, having to checksum
all of the files definitely adds to the cost of taking the backup, but
to avoid it we need strong assurances that a given file hasn't been
changed since our last full backup.  WAL, today at least, isn't quite
that, and timestamps can possibly be fooled with, so if you'd like to be
particularly careful, there doesn't seem to be a lot of alternatives.

> > I'm pretty baffled by this argument, particularly in this context.  We
> > already have tooling around trying to manage WAL archives in core- see
> > pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
> > about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
> > a set of tar files, or a data directory) could happily be backed up
> > using whatever Enterprise tool folks want to use- in much the same way
> > that a pgbackrest repo is also able to be backed up using whatever
> > Enterprise tools someone wishes to use.  We designed it quite carefully
> > to work with exactly that use-case, so the distinction here is quite
> > lost on me.  Perhaps you could clarify what use-case these changes to
> > pg_basebackup solve, when working with a Netbackup or Tivoli system,
> > that pgbackrest doesn't, since you bring it up here?
> 
> I'm not an expert on any of those systems, but I doubt that
> everybody's OK with backing everything up to a pgbackrest repository
> and then separately backing up that repository to some other system.
> That sounds like a pretty large storage cost.

I'm not asking you to be an expert on those systems, just to help me
understand the statements you're making.  How is backing up to a
pgbackrest repo different than running a pg_basebackup in the context of
using some other Enterprise backup system?  In both cases, you'll have a
full copy of the backup (presumably compressed) somewhere out on a disk
or filesystem which is then backed up by the Enterprise tool.

> > As for if we should be sending more to the server, or asking the server
> > to send more to us, I don't really have a good feel for what's "best".
> > At least one implementation I'm familiar with builds a manifest on the
> > PG server side and then compares the results of that to the manifest
> > stored with the backup (where that comparison is actually done is on
> > whatever system the "backup" was started from, typically a backup
> > server).  Perhaps there's an argument for sending the manifest from the
> > backup repository to PostgreSQL for it to then compare against the data
> > directory 

Re: Bug in GiST paring heap comparator

2019-09-16 Thread Alexander Korotkov
On Mon, Sep 16, 2019 at 3:47 PM Nikita Glukhov  wrote:
> On 13.09.2019 20:17, Alexander Korotkov wrote:
>
> On Fri, Sep 13, 2019 at 5:23 PM Nikita Glukhov  
> wrote:
>
> I have moved handling of NULL ordering keys from opclasses to the common
> SP-GiST code, but really I don't like how it is implemented now. Maybe it's
> worth moving handling of NULL order-by keys to an even higher level,
> so that AMs don't have to worry about NULLs?
>
> Yes, the optimizer could remove "col op NULL" clauses from ORDER BY
> if op is a strict operator.  And then such clauses wouldn't be passed to
> the AM.  But I see this as a future improvement.  For backpatching we
> should solve this on the AM side.
>
> Also I left usages of IndexOrderByDistance in opclasses. I think that
> may help to minimize opclass changes in the future.
>
> Could you please extract this as a separate patch.  We can consider
> this for master, but we shouldn't backpatch this.
>
> Propagation of IndexOrderByDistance in SP-GiST was extracted into a separate
> patch #2.

The first patch, with minor editing from me and a commit message, is attached.
I'm going to push it if there are no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Fix-GiST-and-SP-GiST-ordering-by-distance-for-NULLs-v4.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-09-16 Thread Konstantin Knizhnik




On 16.09.2019 19:54, Alexey Kondratov wrote:

On 30.08.2019 18:59, Konstantin Knizhnik wrote:


I think that instead of defining savepoints it is simpler and more 
efficient to use


BeginInternalSubTransaction + 
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction


as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr



Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start. So
they seem to be identical from the performance perspective, which is
also stated in the comment section:


Yes, they definitely use the same mechanism and most likely
provide similar performance.
But BeginInternalSubTransaction does not require generating a
savepoint name, which seems to be redundant in this case.





Anyway, I've profiled my apply worker (flamegraph is
attached) and it spends the vast majority of its time (>90%) applying
changes. So the problem is not in the savepoints themselves, but in
the fact that we first apply all changes and then abort all the work.
Not sure that it is possible to do something in this case.




Looks like the only way to increase apply speed is to do it in parallel: 
make it possible to concurrently execute non-conflicting transactions.







Re: block-level incremental backup

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 1:10 PM Stephen Frost  wrote:
> I disagree with this on a couple of levels.  The first is pretty simple-
> we don't have all of the information.  The user may have some reason to
> believe that timestamp-based is a bad idea, for example, and therefore
> having an option to perform a checksum-based backup makes sense.  rsync
> is a pretty good tool in my view and it has a very similar option-
> because there are trade-offs to be made.  LSN is great, if you don't
> mind reading every file of your database start-to-finish every time, but
> in a running system which hasn't suffered from clock skew or other odd
> issues (some of which we can also detect), it's pretty painful to scan
> absolutely everything like that for an incremental.

There's a separate thread on using WAL-scanning to avoid having to
scan all the data every time. I pointed it out to you early in this
thread, too.

> If you track the checksum of the file in the manifest then it's a pretty
> strong validation that the backup repo hasn't been corrupted between the
> backup and the restore.  Of course, the database could have been
> corrupted at the source, and perhaps that's what you were getting at
> with your 'limited extent' but that isn't what I was referring to.

Yeah, that all seems fair. Without the checksum, you can only validate
that you have the right files and that they are the right sizes, which
is not bad, but the checksums certainly make it stronger. But,
wouldn't having to checksum all of the files add significantly to the
cost of taking the backup? If so, I can imagine that some people might
want to pay that cost but others might not. If it's basically free to
checksum the data while we have it in memory anyway, then I guess
there's little to be lost.

> I'm pretty baffled by this argument, particularly in this context.  We
> already have tooling around trying to manage WAL archives in core- see
> pg_archivecleanup.  Further, we're talking about pg_basebackup here, not
> about Netbackup or Tivoli, and the results of a pg_basebackup (that is,
> a set of tar files, or a data directory) could happily be backed up
> using whatever Enterprise tool folks want to use- in much the same way
> that a pgbackrest repo is also able to be backed up using whatever
> Enterprise tools someone wishes to use.  We designed it quite carefully
> to work with exactly that use-case, so the distinction here is quite
> lost on me.  Perhaps you could clarify what use-case these changes to
> pg_basebackup solve, when working with a Netbackup or Tivoli system,
> that pgbackrest doesn't, since you bring it up here?

I'm not an expert on any of those systems, but I doubt that
everybody's OK with backing everything up to a pgbackrest repository
and then separately backing up that repository to some other system.
That sounds like a pretty large storage cost.

> As for if we should be sending more to the server, or asking the server
> to send more to us, I don't really have a good feel for what's "best".
> At least one implementation I'm familiar with builds a manifest on the
> PG server side and then compares the results of that to the manifest
> stored with the backup (where that comparison is actually done is on
> whatever system the "backup" was started from, typically a backup
> server).  Perhaps there's an argument for sending the manifest from the
> backup repository to PostgreSQL for it to then compare against the data
> directory but I'm not really sure how it could possibly do that more
> efficiently and that's moving work to the PG server that it doesn't
> really need to do.

I agree with all that, but... if the server builds a manifest on the
PG server that is to be compared with the backup's manifest, the one
the PG server builds can't really include checksums, I think. To get
the checksums, it would have to read the entire cluster while building
the manifest, which sounds insane. Presumably it would have to build a
checksum-free version of the manifest, and then the client could
checksum the files as they're streamed down and write out a revised
manifest that adds the checksums.
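
To make that concrete, here is a minimal sketch of the client-side approach
-- plain C, with FNV-1a standing in for whatever digest the manifest would
actually use; none of this is pg_basebackup code:

#include <stdint.h>
#include <stdio.h>

static uint64_t
fnv1a_update(uint64_t hash, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        hash ^= buf[i];
        hash *= 1099511628211ULL;       /* FNV-1a prime */
    }
    return hash;
}

/*
 * Copy the streamed file to disk while folding every byte into the
 * checksum, so no second pass over the file is needed for the manifest.
 */
static uint64_t
copy_and_checksum(FILE *in, FILE *out)
{
    unsigned char buf[8192];
    uint64_t    hash = 14695981039346656037ULL; /* FNV-1a offset basis */
    size_t      n;

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        fwrite(buf, 1, n, out);
        hash = fnv1a_update(hash, buf, n);
    }
    return hash;        /* recorded in the revised manifest entry */
}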

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-09-16 Thread Peter Geoghegan
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
 wrote:
> Attached is v14 based on v12 (v13 changes are not merged).
>
> In this version, I fixed the bug you mentioned and also fixed nbtinsert,
> so that it doesn't save newposting in xlog record anymore.

Cool.

> I tested the patch with nbtree_wal_test, and found out that the real issue is
> not the dedup WAL records themselves, but the full page writes that they
> trigger.
> Here are the test results (config is standard, except fsync=off to speed up tests):
>
> 'FPW on' and 'FPW off' are tests on v14.
> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that it makes sense to focus on synthetic cases without
FPWs/FPIs from checkpoints. At least for now.

> With random insertions into the btree it's highly possible that deduplication
> will often be the first write after a checkpoint, and thus will trigger an
> FPW, even if only a few tuples were compressed.

I find that hard to believe. Deduplication only occurs when we're
about to split the page. If that's almost as likely to occur as a
simple insert, then we're in big trouble (maybe it's actually true,
but if it is then that's the real problem). Also, fewer pages for the
index naturally leads to far fewer FPIs after a checkpoint.

I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the
same case on v13. It was practically the same as the master branch,
apart from the huge difference in FPIs for the XLOG rmgr. Aside from
that one huge difference, there was a similar volume of the same types
of WAL records in each case. Mostly leaf inserts, and far fewer
internal page inserts. I suppose this isn't surprising.

It probably makes sense for the final version of the patch to increase
the volume of WAL a little overall, since the savings for internal
page stuff cannot make up for the cost of having to WAL log something
extra (deduplication operations) on leaf pages, regardless of the size
of those extra dedup WAL records (I am ignoring FPIs after a
checkpoint in this analysis). So the patch is more or less certain to
add *some* WAL overhead in cases that benefit, and that's okay. But,
it adds way too much WAL overhead today (even in v14), for reasons
that we don't understand yet, which is not okay.

I may have misunderstood your approach to WAL-logging in v12. I
thought that you were WAL-logging things that didn't change, which
doesn't seem to be the case with v14. I thought that v12 was very
similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does
its WAL-logging. v14 looks good, though.

 "pg_waldump -z" and "pg_waldump --stats=record" will break down the
contributing factor of FPIs, so it should be possible to account for
the overhead in the test case exactly. We can debug the problem by
using pg_waldump to count the absolute number of each type of record,
and the size of each type of record.

(Thinks some more...)

I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():

/*
 * Copy the original page's LSN into leftpage, which will become the
 * updated version of the page.  We need this because XLogInsert will
 * examine the LSN and possibly dump it in a page image.
 */
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);

Note that this happens at the start of _bt_split() -- the temp page
buffer based on origpage starts out with the same LSN as origpage.
This is an important step of the WAL volume optimization used by
_bt_split().
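
For illustration, the equivalent step in _bt_dedup_one_page() might look
roughly like this (the surrounding temp-copy setup is my assumption, not
code from any posted version of the patch):

Page        page = BufferGetPage(buf);
Page        newpage;

/* temp copy that the deduplication pass rewrites */
newpage = PageGetTempPageCopySpecial(page);

/*
 * Copy the original page's LSN into the temp copy, as _bt_split() does,
 * so that XLogInsert() sees the true LSN when deciding whether a
 * full-page image must be taken.
 */
PageSetLSN(newpage, PageGetLSN(page));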

> That's why there is no significant difference with the log_newpage_buffer()
> approach.
> And that's why "lazy" deduplication doesn't help to decrease the amount of WAL.

The term "lazy deduplication" is seriously overloaded here. I think
that this could cause miscommunications. Let me list the possible
meanings of that term here:

1. First of all, the basic approach to deduplication is already lazy,
unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
a page split. I'm 100% sure that we both think that that works well
compared to an eager approach (like GIN's).

2. Second of all, there is the need to incrementally WAL log. It looks
like v14 does that well, in that it doesn't create
"xlrec_dedup.n_intervals" space when it isn't truly needed. That's
good.

3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.

4. Finally, there is the possibility that we could make deduplication
incremental, in order to avoid work that won't be needed altogether --
this would probably be combined with 3. Not sure where I stand on
this, either.

We should try to be careful when using these terms, as there is a very
real danger of talking past each other.

> Another, and more realistic, approach is to make deduplication less intensive:
> if the freed space is less than some threshold, fall back to not changing the
> page at all and not generating an xlog record.

I see 

Re: range test for hash index?

2019-09-16 Thread Paul A Jungwirth
On Mon, Sep 16, 2019 at 5:28 AM Amit Kapila  wrote:
> I don't see this function on the master branch.  Is this function name
> correct?  Are you looking at some different branch?

Sorry about that! You're right, I was on my multirange branch. But I
see the same thing on latest master (but calling hash_range instead of
hash_range_internal).

Paul




Re: block-level incremental backup

2019-09-16 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 9:30 AM Robert Haas  wrote:
> > > Isn't some operations where at the end we directly call heap_sync
> > > without writing WAL will have a similar problem as well?
> >
> > Maybe.  Can you give an example?
> 
> Looking through the code, I found two cases where we do this.  One is
> a bulk insert operation with wal_level = minimal, and the other is
> CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
> cases we are generating new blocks whose LSNs will be 0/0. So, I think
> we need a rule that if the server is asked to back up all blocks in a
> file with LSNs > some threshold LSN, it must also include any blocks
> whose LSN is 0/0. Those blocks are either uninitialized or are
> populated without WAL logging, so they always need to be copied.

I'm not sure I see a way around it but this seems pretty unfortunate-
every single incremental backup will have all of those included even
though the full backup likely also does (I say likely since someone
could do a full backup, set the WAL to minimal, load a bunch of data,
and then restart back to a WAL level where we can do a new backup, and
then do an incremental, so we don't *know* that the full includes those
blocks unless we also track a block-level checksum or similar).  Then
again, doing these kinds of server bounces to change the WAL level
around is, hopefully, relatively rare..

> Outside of unlogged and temporary tables, I don't know of any case
> where we make a critical modification to an already-existing block
> without bumping the LSN. I hope there is no such case.

I believe we all do. :)

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Define jsonpath functions as stable

2019-09-16 Thread Jonathan S. Katz
On 9/16/19 11:20 AM, Tom Lane wrote:
> "Jonathan S. Katz"  writes:
>> It sounds like the easiest path to completion, without potentially adding
>> future headaches or pushing back the release too far, would be something
>> like these examples:
> 
>>  $.** ? (@ like_regex "O(w|v)" pg flag "i")
>>  $.** ? (@ like_regex "O(w|v)" pg)
> 
>> If it's using POSIX regexp, I would +1 using "posix" instead of "pg"
> 
> I agree that we'd be better off to say "POSIX".  However, having just
> looked through the references Chapman provided, it seems to me that
> the regex language Henry Spencer's library provides is awful darn
> close to what XPath is asking for.  The main thing I see in the XML/XPath
> specs that we don't have is a bunch of character class escapes that are
> specifically tied to Unicode character properties.  We could possibly
> add code to implement those, but I'm not sure how it'd work in non-UTF8
> database encodings.

Maybe take a page from the pg_saslprep implementation. For some cases
where the string in question would issue a "reject" under normal
SASLprep[1] considerations (really stringprep[2]), PostgreSQL just lets
the string pass through to the next step, without alteration.

What's implied here is that if the string is UTF-8, it goes through
SASLprep, but if not, it is just passed through.

So perhaps the answer is that if we implement XQuery, the escapes for
Unicode character properties are only honored if the encoding is set to
UTF-8, and ignored otherwise. We would have to document that said
escapes only work on UTF-8 encodings.

>  There may also be subtle differences in the behavior
> of character class escapes that we do have in common, such as "\s" for
> white space; but again I'm not sure that those are any different than
> what you get naturally from encoding or locale variations.
>
> I think we could possibly get away with not having any special marker
> on regexes, but just explaining in the documentation that "features
> so-and-so are not implemented".  Writing that text would require closer
> analysis than I've seen in this thread as to exactly what the differences
> are.

+1, and likely would need some example strings too that highlight the
difference in how they are processed.

And again, if we end up updating the behavior in the future, it becomes
a part of our standard deprecation notice at the beginning of the
release notes, though one that could require a lot of explanation.

Jonathan

[1] https://tools.ietf.org/html/rfc4013
[2] https://www.ietf.org/rfc/rfc3454.txt



signature.asc
Description: OpenPGP digital signature


Re: refactoring - share str2*int64 functions

2019-09-16 Thread Fabien COELHO


Bonjour Michaël,


Otherwise, this batch of changes looks ok to me.


Thanks.


About v15: applies cleanly, compiles, "make check" ok.

While re-reading the patch, I noticed a bit of repetition in the pg_strtou?int*
comments. I'm wondering whether it would make sense to write one global
comment before each 3-type series to avoid that.


--
Fabien.

Re: Commit fest 2019-09

2019-09-16 Thread Alvaro Herrera
The third week for this commitfest starts.  The numbers now:

  statusstring  │ week1 │ week2 │ week3
┼───┼───┼───
 Needs review   │   165 │   138 │   116
 Waiting on Author  │30 │44 │51
 Ready for Committer│11 │ 5 │ 8
 Returned with Feedback │ 1 │ 4 │ 5
 Moved to next CF   │ 2 │ 4 │ 4
 Committed  │14 │23 │32
 Rejected   │ 1 │ 1 │ 1
 Withdrawn  │ 4 │ 9 │11
(8 rows)

It seems to me that we're moving forward at a decent pace, if not super
fast.  At least, we're keeping patch authors busy!  The number of
patches waiting on committers is steadily low, which seems to say
that committers are doing a pretty decent job as the final instance for
patches.  Still, there's a lot of stuff that just sits there ... and of
course committers could help with that also.

That said, we do need non-committer contributors to review other
people's patches.  Please do your part and review patches of at least
equivalent complexity to the patches you're submitting!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-09-16 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost  wrote:
> > In a number of cases, trying to make sure that on a failover or copy of
> > the backup the next 'incremental' is really an 'incremental' is
> > dangerous.  A better strategy to address this, and the other issues
> > realized on this thread recently, is to:
> >
> > - Have a manifest of every file in each backup
> > - Always back up new files that weren't in the prior backup
> > - Keep a checksum of each file
> > - Track the timestamp of each file as of when it was backed up
> > - Track the file size of each file
> > - Track the starting timestamp of each backup
> > - Always include files with a modification time after the starting
> >   timestamp of the prior backup, or if the file size has changed
> > - In the event of any anomalies (which includes things like a timeline
> >   switch), use checksum matching (aka 'delta checksum backup') to
> >   perform the backup instead of using timestamps (or just always do that
> >   if you want to be particularly careful- having an option for it is
> >   great)
> > - Probably other things I'm not thinking of off-hand, but this is at
> >   least a good start.  Make sure to checksum this information too.
> 
> I agree with some of these ideas but not all of them.  I think having
> a backup manifest is a good idea; that would allow taking a new
> incremental backup to work from the manifest rather than the data
> directory, which could be extremely useful, because it might be a lot
> faster and the manifest could also be copied to a machine other than
> the one where the entire backup is stored. If the backup itself has
> been pushed off to S3 or whatever, you can't access it quickly, but
> you could keep the manifest around.

Yes, those are also good reasons for having a manifest.

> I also agree that backing up all files that weren't in the previous
> backup is a good strategy.  I proposed that fairly explicitly a few
> emails back; but also, the contrary is obviously nonsense. And I also
> agree with, and proposed, that we record the size along with the file.

Sure, I didn't mean to imply that there was something wrong with that.
Including the checksum and other metadata is also valuable, both for
helping to identify corruption in the backup archive and for forensics,
if not for other reasons.

> I don't really agree with your comments about checksums and
> timestamps.  I think that, if possible, there should be ONE method of
> determining whether a block has changed in some important way, and I
> think if we can make LSN work, that would be for the best. If you use
> multiple methods of detecting changes without any clearly-defined
> reason for so doing, maybe what you're saying is that you don't really
> believe that any of the methods are reliable but if we throw the
> kitchen sink at the problem it should come out OK. Any bugs in one
> mechanism are likely to be masked by one of the others, but that's not
> as good as one method that is known to be altogether reliable.

I disagree with this on a couple of levels.  The first is pretty simple-
we don't have all of the information.  The user may have some reason to
believe that timestamp-based is a bad idea, for example, and therefore
having an option to perform a checksum-based backup makes sense.  rsync
is a pretty good tool in my view and it has a very similar option-
because there are trade-offs to be made.  LSN is great, if you don't
mind reading every file of your database start-to-finish every time, but
in a running system which hasn't suffered from clock skew or other odd
issues (some of which we can also detect), it's pretty painful to scan
absolutely everything like that for an incremental.

Perhaps the discussion has already moved on to having some way of our
own to track if a given file has changed without having to scan all of
it- if so, that's a discussion I'd be interested in.  I'm not against
other approaches here besides timestamps if there's a solid reason why
they're better and they're also able to avoid scanning the entire
database.

> > By having a manifest for each backed up file for each backup, you also
> > gain the ability to validate that a backup in the repository hasn't been
> > corrupted post-backup, a feature that at least some other database
> > backup and restore systems have (referring specifically to the big O in
> > this particular case, but I bet others do too).
> 
> Agreed. The manifest only lets you validate to a limited extent, but
> that's still useful.

If you track the checksum of the file in the manifest then it's a pretty
strong validation that the backup repo hasn't been corrupted between the
backup and the restore.  Of course, the database could have been
corrupted at the source, and perhaps that's what you were getting at
with your 'limited extent' but that isn't what I was referring to.

Claiming that the backup has been 'validated' by only looking at 

Re: refactoring - share str2*int64 functions

2019-09-16 Thread Andres Freund
Hi,

On 2019-09-14 15:02:36 +0900, Michael Paquier wrote:
> On Fri, Sep 13, 2019 at 06:38:31PM -0700, Andres Freund wrote:
> > On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
> >> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
> >> Attached is an updated patch?  How does it look?  I have left the
> >> parts of readfuncs.c for now as there are more issues behind that than
> >> doing a single switch, short reads are one, long reads a second.
> > 
> > Hm? I don't know what you mean by those issues.
> 
> I think that we have more issues than it seems.  For example:
> - Switching UINT to use pg_strtouint32() causes an incompatibility
> issue compared to atoui().

"An incompatibility" is, uh, vague.


> - Switching INT to use pg_strtoint32() causes a set of warnings as for
> example with AttrNumber:
> 72 |  (void) pg_strtoint32(token, _node->fldname)
>|  ^
>|  |
>|  AttrNumber * {aka short int *}
> And it is not like we should use a cast either, as we could hide real
> issues.  Hence it seems to me that we need to have a new routine
> definition for shorter integers and switch more flags to that.

Yea.


> - Switching LONG to use pg_strtoint64() leads to another set of
> issues, particularly one could see an assertion failure related to Agg
> nodes.  I am not sure either that we should use int64 here as the size
> can be at least 32b.

That seems pretty clearly something that needs to be debugged before
applying this series. If there's such a failure, it indicates that
there's either a problem in this patchset, or a pre-existing problem in
readfuncs.


> - Switching OID to use pg_strtoint32 causes a failure with initdb.

Needs to be debugged too. Although I suspect this might just be that you
need to use unsigned variant.


> So while I agree with you that a switch should be doable, there is a
> large set of issues to ponder about here, and the patch already does a
> lot, so I really think that we had better do a closer lookup at those
> issues separately, once the basics are in place, and consider them if
> they actually make sense.  There is much more than just doing a direct
> switch in this area with the family of ato*() system calls.

I have no problem applying this separately, but I really don't think
it's wise to apply this before these problems have been debugged.

Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 11:09 AM Kuntal Ghosh
 wrote:
> Okay. In that case, we need to rethink the cases for multi-inserts and
> non-in-place updates, both of which currently insert multiple undo
> records corresponding to a single WAL record. For multi-inserts, it can
> be solved easily by moving all the offset information into the payload.
> But, for non-in-place updates, we insert one undo record for the update
> and one for the insert. Wondering whether we have to insert two WAL
> records - one for the update and one for the new insert.

No, I think the solution is to put the information about both halves
of the non-in-place update in the same undo record.  I think the only
reason why that's tricky is because we've got two block numbers and
two offsets, and the only reason that's a problem is because
UnpackedUndoRecord only has one field for each of those things, and
that goes right back to Heikki's comments about the format not being
flexible enough. If you see some other problem, it would be
interesting to know what it is.

One thing I've been thinking about is: suppose that you're following
the undo chain for a tuple and you come to a non-in-place update
record. Can you get confused? I don't think so, because you can
compare the TID for which you're following the chain to the new TID
and the old TID in the record and it should match one or the other but
not both. But I don't think you even really need to do that much: if
you started with a deleted item, the first thing in the undo chain has
to be a delete or non-in-place update that got rid of it. And if you
started with a non-deleted item, then the beginning of the undo chain,
if it hasn't been discarded yet, will be the insert or non-in-place
update that created it. There's nowhere else that you can hit a
non-in-place update, and no room (that I can see) for any ambiguity.

It seems to me that zheap went wrong in ending up with separate undo
types for in-place and non-in-place updates. Why not just have ONE
kind of undo record that describes an update, and allow that update to
have either one TID or two TIDs depending on which kind of update it
is? There may be a reason, but I don't know what it is, unless it's
just that the UnpackedUndoRecord idea that I invented wasn't flexible
enough and nobody thought of generalizing it. Curious to hear your
thoughts on this.
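
To illustrate what I mean, here's a sketch of the idea -- none of these
names are from the zheap/undo patches, it's purely illustrative:

/* One undo record type covering both update flavors. */
#define UNDO_UPDATE_NONINPLACE  0x01

typedef struct UndoRecUpdate
{
    uint8           flags;      /* UNDO_UPDATE_NONINPLACE set or not */
    ItemPointerData old_tid;    /* tuple version being replaced */
    ItemPointerData new_tid;    /* valid only for non-in-place updates */
} UndoRecUpdate;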

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 10:38 AM Stephen Frost  wrote:
> In a number of cases, trying to make sure that on a failover or copy of
> the backup the next 'incremental' is really an 'incremental' is
> dangerous.  A better strategy to address this, and the other issues
> realized on this thread recently, is to:
>
> - Have a manifest of every file in each backup
> - Always back up new files that weren't in the prior backup
> - Keep a checksum of each file
> - Track the timestamp of each file as of when it was backed up
> - Track the file size of each file
> - Track the starting timestamp of each backup
> - Always include files with a modification time after the starting
>   timestamp of the prior backup, or if the file size has changed
> - In the event of any anomalies (which includes things like a timeline
>   switch), use checksum matching (aka 'delta checksum backup') to
>   perform the backup instead of using timestamps (or just always do that
>   if you want to be particularly careful- having an option for it is
>   great)
> - Probably other things I'm not thinking of off-hand, but this is at
>   least a good start.  Make sure to checksum this information too.

I agree with some of these ideas but not all of them.  I think having
a backup manifest is a good idea; that would allow taking a new
incremental backup to work from the manifest rather than the data
directory, which could be extremely useful, because it might be a lot
faster and the manifest could also be copied to a machine other than
the one where the entire backup is stored. If the backup itself has
been pushed off to S3 or whatever, you can't access it quickly, but
you could keep the manifest around.

I also agree that backing up all files that weren't in the previous
backup is a good strategy.  I proposed that fairly explicitly a few
emails back; but also, the contrary is obviously nonsense. And I also
agree with, and proposed, that we record the size along with the file.

I don't really agree with your comments about checksums and
timestamps.  I think that, if possible, there should be ONE method of
determining whether a block has changed in some important way, and I
think if we can make LSN work, that would be for the best. If you use
multiple methods of detecting changes without any clearly-defined
reason for so doing, maybe what you're saying is that you don't really
believe that any of the methods are reliable but if we throw the
kitchen sink at the problem it should come out OK. Any bugs in one
mechanism are likely to be masked by one of the others, but that's not
as good as one method that is known to be altogether reliable.

> By having a manifest for each backed up file for each backup, you also
> gain the ability to validate that a backup in the repository hasn't been
> corrupted post-backup, a feature that at least some other database
> backup and restore systems have (referring specifically to the big O in
> this particular case, but I bet others do too).

Agreed. The manifest only lets you validate to a limited extent, but
that's still useful.

> Having a system of keeping track of which backups are full and which are
> differential in an overall system also gives you the ability to do
> things like expiration in a sensible way, including handling WAL
> expiration.

True, but I'm not sure that functionality belongs in core. It
certainly needs to be possible for out-of-core code to do this part of
the work if desired, because people want to integrate with enterprise
backup systems, and we can't come in and say, well, you back up
everything else using Netbackup or Tivoli, but for PostgreSQL you have
to use pg_backrest. I mean, maybe you can win that argument, but I
know I can't.

> I'd like to clarify that while I would like to have an easier way to
> parallelize backups, that's a relatively minor complaint- the much
> bigger issue that I have with this feature is that trying to address
> everything correctly while having only the amount of information that
> could be passed on the command-line about the prior full/incremental is
> going to be extremely difficult, complicated, and likely to lead to
> subtle bugs in the actual code, and probably less than subtle bugs in
> how users end up using it, since they'll have to implement the
> expiration and tracking of information between backups themselves
> (unless something's changed in that part during this discussion- I admit
> that I've not read every email in this thread).

Well, the evidence seems to show that you are right, at least to some
extent. I consider it a positive good if the client needs to give the
server only a limited amount of information. After all, you could
always take an incremental backup by shipping every byte of the
previous backup to the server, having it compare everything to the
current contents, and having it then send you back the stuff that is
new or different. But that would be dumb, because most of the point of
an incremental backup is to save on 

Re: block-level incremental backup

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 9:30 AM Robert Haas  wrote:
> > Isn't some operations where at the end we directly call heap_sync
> > without writing WAL will have a similar problem as well?
>
> Maybe.  Can you give an example?

Looking through the code, I found two cases where we do this.  One is
a bulk insert operation with wal_level = minimal, and the other is
CLUSTER or VACUUM FULL with wal_level = minimal. In both of these
cases we are generating new blocks whose LSNs will be 0/0. So, I think
we need a rule that if the server is asked to back up all blocks in a
file with LSNs > some threshold LSN, it must also include any blocks
whose LSN is 0/0. Those blocks are either uninitialized or are
populated without WAL logging, so they always need to be copied.
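
Expressed as code, the rule I have in mind is roughly this (names are
illustrative only, not from any posted patch):

/*
 * A block must be sent in an incremental backup if its LSN is newer
 * than the threshold, or if it carries no LSN at all (0/0), which
 * covers blocks populated without WAL logging.
 */
static bool
block_needs_backup(Page page, XLogRecPtr threshold_lsn)
{
    XLogRecPtr  lsn = PageGetLSN(page);

    return XLogRecPtrIsInvalid(lsn) || lsn > threshold_lsn;
}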

Outside of unlogged and temporary tables, I don't know of any case
where we make a critical modification to an already-existing block
without bumping the LSN. I hope there is no such case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-09-16 Thread Anastasia Lubennikova

13.09.2019 4:04, Peter Geoghegan wrote:

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan  wrote:

I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally.

It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.


Attached is v14 based on v12 (v13 changes are not merged).

In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.

I tested the patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they
trigger.
Here are the test results (config is standard, except fsync=off to speed up
tests):


'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

+-------------------+-----------+-----------+----------------+-----------+
|        ---        |  FPW on   |  FPW off  | FORCE_NO_IMAGE |  master   |
+-------------------+-----------+-----------+----------------+-----------+
| time              | 09:12 min | 06:56 min | 06:24 min      | 08:10 min |
| nbtree_wal_volume | 8083 MB   | 2128 MB   | 2327 MB        | 2439 MB   |
| index_size        | 169 MB    | 169 MB    | 169 MB         | 1118 MB   |
+-------------------+-----------+-----------+----------------+-----------+


With random insertions into the btree it's highly possible that
deduplication will often be the first write after a checkpoint, and thus
will trigger an FPW, even if only a few tuples were compressed.
That's why there is no significant difference with the log_newpage_buffer()
approach.

And that's why "lazy" deduplication doesn't help to decrease the amount of WAL.

Also, since the index is packed way better than before, it probably
benefits less from wal_compression.


One possible "fix" to decrease WAL amplification is to add 
REGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().
As you can see from test result, it really eliminates the problem of 
inadequate WAL amount.

However, I doubt that it is a crash-safe idea.

Another, and more realistic, approach is to make deduplication less
intensive:
if the freed space is less than some threshold, fall back to not changing
the page at all and not generating an xlog record.
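
A minimal sketch of what I have in mind (the threshold name and value are
just placeholders, not something from the patch):

/* Placeholder threshold: skip a dedup pass that frees too little space. */
#define BT_DEDUP_MIN_FREED      (BLCKSZ / 32)   /* ~256 bytes on 8kB pages */

static bool
dedup_is_worthwhile(Size space_freed)
{
    /* Below the threshold we leave the page untouched and emit no WAL. */
    return space_freed >= BT_DEDUP_MIN_FREED;
}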


Probably that was the reason why the patch became faster after I added
BT_COMPRESS_THRESHOLD in early versions:
not because deduplication itself is CPU-bound or something, but because
the WAL load decreased.


So I propose to develop this idea. The question is how to choose the threshold.
I wouldn't like to introduce new user settings.  Any ideas?

I also noticed that the number of checkpoints differs between tests:
select checkpoints_req from pg_stat_bgwriter;

+-----------------+--------+---------+----------------+--------+
|       ---       | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-----------------+--------+---------+----------------+--------+
| checkpoints_req |   16   |    7    |       8        |   10   |
+-----------------+--------+---------+----------------+--------+

And I struggle to explain the reason for this.
Do you understand what can cause the difference?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 	   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-			ItemPointerGetBlockNumber(&(itup->t_tid)),
-			ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			ItemPointerGetBlockNumber(tid),
+			ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("could not find tuple using search from 

Re: Define jsonpath functions as stable

2019-09-16 Thread Tom Lane
"Jonathan S. Katz"  writes:
> It sounds like the easiest path to completion, without potentially adding
> future headaches or pushing back the release too far, would be something
> like these examples:

>   $.** ? (@ like_regex "O(w|v)" pg flag "i")
>   $.** ? (@ like_regex "O(w|v)" pg)

> If it's using POSIX regexp, I would +1 using "posix" instead of "pg"

I agree that we'd be better off to say "POSIX".  However, having just
looked through the references Chapman provided, it seems to me that
the regex language Henry Spencer's library provides is awful darn
close to what XPath is asking for.  The main thing I see in the XML/XPath
specs that we don't have is a bunch of character class escapes that are
specifically tied to Unicode character properties.  We could possibly
add code to implement those, but I'm not sure how it'd work in non-UTF8
database encodings.  There may also be subtle differences in the behavior
of character class escapes that we do have in common, such as "\s" for
white space; but again I'm not sure that those are any different than
what you get naturally from encoding or locale variations.

I think we could possibly get away with not having any special marker
on regexes, but just explaining in the documentation that "features
so-and-so are not implemented".  Writing that text would require closer
analysis than I've seen in this thread as to exactly what the differences
are.

regards, tom lane




Re: Default JIT setting in V12

2019-09-16 Thread Jeff Janes
On Wed, Sep 4, 2019 at 11:24 AM Andres Freund  wrote:

> Hi,
>
> On 2019-09-04 07:51:16 -0700, Andres Freund wrote:
>


> Or better, something slightly more complete, like the attached (which
> affects both code-gen time optimizations (which are more like peephole
> ones), and both function/global ones that are cheap).
>

Yes, that does completely solve the issue I raised.  It makes JIT either
better or at least harmless, even when falling into the gap between
jit_above_cost and jit_optimize_above_cost.

What TPC-H implementation do you use/recommend?  This one
https://wiki.postgresql.org/wiki/DBT-3?

Cheers,

Jeff


Re: POC: Cleaning up orphaned files using undo logs

2019-09-16 Thread Kuntal Ghosh
Hello Thomas,

On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro  wrote:
>
> > 1. In UndoLogAllocateInRecovery, when we find the current log number
> > from the list of registered blocks, we don't check whether the
> > block->in_use flag is true or not. In XLogResetInsertion, we just
> > reset the in_use flag without resetting the blocks[]->rnode information.
> > So, if we don't check the in_use flag, it's possible that we'll
> > consult some block information from the previous WAL record. IMHO,
> > just adding an in_use check in UndoLogAllocateInRecovery will solve
> > the problem.
>
> Agreed.  I added a line to break out of that loop if !block->in_use.
>
I think we should skip the block if !block->in_use, because the undo
buffer can be registered in a subsequent block as well. For different
operations, we can use a different block_id to register the undo buffer
in the redo record.

> BTW I am planning to simplify that code considerably, based on a plan
> to introduce a new rule: there can be only one undo record and
> therefore only one undo allocation per WAL record.
>
Okay. In that case, we need to rethink the cases for multi-inserts and
non-in-place updates, both of which currently insert multiple undo
records corresponding to a single WAL record. For multi-inserts, it can
be solved easily by moving all the offset information into the payload.
But, for non-in-place updates, we insert one undo record for the update
and one for the insert. Wondering whether we have to insert two WAL
records - one for the update and one for the new insert.

> > 2. A transaction, inserts one undo record and generated a WAL record
> > for the same, say at WAL location 0/2000A000. Next, the undo record
> > gets discarded and WAL is generated to update the meta.discard pointer
> > at location 0/2000B000.  At the same time, an ongoing checkpoint with
> > checkpoint.redo at 0/2000 flushes the latest meta.discard pointer.
> > Now, the system crashes.
> > Now, the recovery starts from the location 0/2000. When the
> > recovery of 0/2000A000 happens, it sees the undo record that it's
> > about to insert, is already discarded as per meta.discard (flushed by
> > checkpoint). In this case, should we just skip inserting the undo
> > record?
>
> I see two options:
>
> 1.  We make it so that if you're allocating in recovery and discard >
> insert, we'll just set discard = insert so you can proceed.  The code
> in undofile_get_segment_file() already copes with missing files during
> recovery.
>
Interesting. This should work.

>
> > 3. Currently, we create a backup image of the unlogged part of the
> > undo log's metadata only when some backend allocates some space from
> > the undo log (in UndoLogAllocate). This helps us restore the unlogged
> > meta part after a checkpoint.
> > When we perform an undo action, we also update the undo action
> > progress and emit a WAL record. The same operation can be performed by
> > the undo worker, which doesn't allocate any space from the undo log.
> > So, if an undo worker emits a WAL record to update the undo action
> > progress after a checkpoint, it'll not be able to WAL log the backup
> > image of the meta unlogged part. IMHO, this breaks the recovery logic
> > of unlogged part of undo meta.
>
> I thought that was OK because those undo data updates don't depend on
> the insert pointer.  But I see what you mean: the next modification of
> the page that DOES depend on the insert pointer might not log the
> meta-data if it's not the first WAL record to touch it after a
> checkpoint.  Rats.  I'll have to think about that some more.
Cool.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Define jsonpath functions as stable

2019-09-16 Thread Jonathan S. Katz
Hi,

On 7/29/19 8:33 PM, Chapman Flack wrote:
> On 07/29/19 18:27, Alexander Korotkov wrote:
> 
>> What do you think about renaming existing operator from like_regex to
>> pg_like_regex?  Or introducing special flag indicating that PostgreSQL
>> regex engine is used ('p' for instance)?
> 
> Renaming the operator is simple and certainly solves the problem.
> 
> I don't have a strong technical argument for or against any of:
> 
> 
> $.** ? (@ pg_like_regex "O(w|v)" flag "i")
> $.** ? (@ pg_like_regex "O(w|v)")
> 
> 
> $.** ? (@ like_regex "O(w|v)" pg flag "i")
> $.** ? (@ like_regex "O(w|v)" pg)
> 
> 
> $.** ? (@ like_regex "O(w|v)" flag "ip")
> $.** ? (@ like_regex "O(w|v)" flag "p")
> 
> 
> It seems more of an aesthetic judgment (on which I am no particular
> authority).
> 
> I think I would be -0.3 on the third approach just because of the need
> to still spell out ' flag "p"' when there is no other flag you want.
> 
> I assume the first two approaches would be about equally easy to
> implement, assuming there's a parser that already has an optional
> production for "flag" STRING.
> 
> Both of the first two seem pretty safe from colliding with a
> future addition to the standard.
> 
> To my aesthetic sense, pg_like_regex feels like "another operator
> to remember" while like_regex ... pg feels like "ok, a slight variant
> on the operator from the spec".
> 
> Later on, if a conformant version is added, the grammar might be a bit
> simpler with just one name and an optional pg.
> 
> Going with a flag, there is some question of the likelihood of
> the chosen flag letter being usurped by the standard at some point.
> 
> I'm leaning toward a flag for now in my own effort to provide the five SQL
> functions (like_regex, occurrences_regex, position_regex, substring_regex,
> and translate_regex), as for the time being it will be as an extension,
> so no custom grammar for me, and I don't really want to make five
> pg_* variant function names, and have that expand to ten function names
> someday if the real ones are implemented. (Hmm, I suppose I could add
> an optional function argument, distinct from flags; that would be
> analogous to adding a pg in the grammar ... avoids overloading the flags,
> avoids renaming the functions.)

Looking at this thread and [1] and the current state of open items[2], a
few thoughts:

It sounds like the easiest path to completion, without potentially adding
future headaches or pushing back the release too far, would be something
like these examples:

$.** ? (@ like_regex "O(w|v)" pg flag "i")
$.** ? (@ like_regex "O(w|v)" pg)

If it's using POSIX regexp, I would +1 using "posix" instead of "pg"

That said, from a user standpoint, it's slightly annoying to have to
include that keyword every time, and could potentially mean changing /
testing quite a bit of code once we do support XQuery regexps. Based on
how we currently handle regular expressions, we've already conditioned
users to expect a certain behavior, and it would be inconsistent if we
do one thing in one place, and another thing here, so I would like for
us to be cognizant of that.

Reading the XQuery spec that Chapman provided[3], it sounds like there
are some challenges present if we were to try to implement XQuery-based
regexps.

I do agree with Alvaro's comment ("We have an opportunity to do
better")[4], but I think we have to weigh the likelihood of actually
supporting the XQuery behaviors before we add more burden to our users.
Based on what needs to be done, it does not sound like it is any time soon.

My first choice would be to leave it as is. We can make it abundantly
clear that, if we make changes in a future version, we will advise our users
on what actions to take, and counsel on any behavior changes.

My second choice is to have a flag that makes it clear what kind of
regexes are being used, in which case "posix" -- this is abundantly
clearer to the user, but still default, at present, to using "posix"
expressions. If we ever do add the XQuery ones, we can debate whether we
default to the standard at that time, and if we do, we treat it like we
treat other deprecation issues and make abundantly clear what the
behavior is now.

Thanks,

Jonathan

[1]
https://www.postgresql.org/message-id/flat/5CF28EA0.80902%40anastigmatix.net
[2] https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items
[3]
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL/XML_Standards#XML_Query_regular_expressions
[4]
https://www.postgresql.org/message-id/20190618154907.GA6049%40alvherre.pgsql



signature.asc
Description: OpenPGP digital signature


Re: block-level incremental backup

2019-09-16 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila  wrote:
> > Can we think of using creation time for file?  Basically, if the file
> > creation time is later than backup-labels "START TIME:", then include
> > that file entirely.  I think one big point against this is clock skew
> > like what if somebody tinkers with the clock.  And also, this can
> > cover cases like
> > what Jeevan has pointed but might not cover other cases which we found
> > problematic.
> 
> Well that would mean, for example, that if you copied the data
> directory from one machine to another, the next "incremental" backup
> would turn into a full backup. That sucks. And in other situations,
> like resetting the clock, it could mean that you end up with a corrupt
> backup without any real ability for PostgreSQL to detect it. I'm not
> saying that it is impossible to create a practically useful system
> based on file time stamps, but I really don't like it.

In a number of cases, trying to make sure that on a failover or copy of
the backup the next 'incremental' is really an 'incremental' is
dangerous.  A better strategy to address this, and the other issues
realized on this thread recently, is to:

- Have a manifest of every file in each backup
- Always back up new files that weren't in the prior backup
- Keep a checksum of each file
- Track the timestamp of each file as of when it was backed up
- Track the file size of each file
- Track the starting timestamp of each backup
- Always include files with a modification time after the starting
  timestamp of the prior backup, or if the file size has changed
- In the event of any anomalies (which includes things like a timeline
  switch), use checksum matching (aka 'delta checksum backup') to
  perform the backup instead of using timestamps (or just always do that
  if you want to be particularly careful- having an option for it is
  great)
- Probably other things I'm not thinking of off-hand, but this is at
  least a good start.  Make sure to checksum this information too.

I agree entirely that it is dangerous to simply rely on creation time as
compared to some other time, or to rely on modification time of a given
file across multiple backups (which has been shown to reliably cause
corruption, at least with rsync and its 1-second granularity on
modification time).

By having a manifest for each backed up file for each backup, you also
gain the ability to validate that a backup in the repository hasn't been
corrupted post-backup, a feature that at least some other database
backup and restore systems have (referring specifically to the big O in
this particular case, but I bet others do too).

Having a system of keeping track of which backups are full and which are
differential in an overall system also gives you the ability to do
things like expiration in a sensible way, including handling WAL
expiration.

As also mentioned up-thread, this likely also allows you to have a
simpler approach to parallelizing the overall backup.

I'd like to clarify that while I would like to have an easier way to
parallelize backups, that's a relatively minor complaint- the much
bigger issue that I have with this feature is that trying to address
everything correctly while having only the amount of information that
could be passed on the command-line about the prior full/incremental is
going to be extremely difficult, complicated, and likely to lead to
subtle bugs in the actual code, and probably less than subtle bugs in
how users end up using it, since they'll have to implement the
expiration and tracking of information between backups themselves
(unless something's changed in that part during this discussion- I admit
that I've not read every email in this thread).

> > One related point is how do incremental backups handle the case where
> > vacuum truncates the relation partially?  Basically, with current
> > patch/design, it doesn't appear that such information can be passed
> > via incremental backup.  I am not sure if this is a problem, but it
> > would be good if we can somehow handle this.
> 
> As to this, if you're taking a full backup of a particular file,
> there's no problem.  If you're taking a partial backup of a particular
> file, you need to include the current length of the file and the
> identity and contents of each modified block.  Then you're fine.

I would also expect this to be fine but if there's an example of where
this is an issue, please share.  The only issue that I can think of
off-hand is orphaned-file risk, whereby you have something like CREATE
DATABASE or perhaps ALTER TABLE .. SET TABLESPACE or such, take a
backup while that's happening, but that doesn't complete during the
backup (or recovery, or perhaps even in some other scenarios, it's
unfortunately quite complicated).  This orphaned file risk isn't newly
discovered but fixing it is pretty complicated- would love to discuss
ideas around how to handle it.

> > Isn't 

Re: Extending range type operators to cope with elements

2019-09-16 Thread Esteban Zimanyi
>
>
> So yes, I've had a need for those operators in the past. What I don't know
> is whether adding these functions will be worth the catalog clutter.
>

The operators are tested and running within MobilityDB. It concerns lines
231-657 for the C code in file
https://github.com/ULB-CoDE-WIT/MobilityDB/blob/master/src/rangetypes_ext.c

and lines 32-248 for the SQL code in file
https://github.com/ULB-CoDE-WIT/MobilityDB/blob/master/src/sql/07_rangetypes_ext.in.sql


Since you don't really use PRs, please let me know whether I can be of
any help.

Regards

Esteban

-- 

Prof. Esteban Zimanyi
Department of Computer & Decision Engineering  (CoDE) CP 165/15
Universite Libre de Bruxelles
Avenue F. D. Roosevelt 50
B-1050 Brussels, Belgium
fax: + 32.2.650.47.13
tel: + 32.2.650.31.85
e-mail: ezima...@ulb.ac.be
Internet: http://code.ulb.ac.be/



Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-09-16 Thread Alvaro Herrera
On 2019-Sep-16, Michael Paquier wrote:

> On Mon, Sep 16, 2019 at 11:46:16AM +0530, Amit Kapila wrote:
> > I don't see much use of separating information for infomask and infomask2.
> 
> Okay, using two separate columns leads to the attached.  Any thoughts?
> This also includes a fix for cases with IS_LOCKED_ONLY and UPGRADED.

I like how it looks in the expected test output.  Didn't review the code
closely, but it looks reasonable in a quick glance.

Whitespace nitpick: pgindent will do something very annoying with this:

> + PG_RETURN_DATUM(
> + HeapTupleGetDatum(heap_form_tuple(tupdesc, values, 
> nulls)));

I suggest to split the line in the first comma, not in the parens.

> +  Combined flags include for example HEAP_XMIN_FROZEN
> +  which is defined as a set of HEAP_XMIN_COMMITTED
> +  and HEAP_XMIN_INVALID.

I suggest something like "Combined flags are displayed for source-level
macros that take into account the value of more than one raw bit, such
as HEAP_XMIN_FROZEN".  (We probably don't want an exhaustive list, which
becomes annoying to maintain; users can refer to the source file.)

There's a typo "bites" in a comment.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Alvaro Herrera
On 2019-Sep-16, Tattsu Yama wrote:

> I should have explained the API changes more. I added cmdtype as a given
> parameter for all functions (See below).
> Therefore, I suppose that my patch is similar to the following fix as you
> mentioned on -hackers.

Is this fix strictly necessary for pg12, or is this something that we
can leave for pg13?


-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ecpglib major version changed

2019-09-16 Thread Tom Lane
Peter Eisentraut  writes:
> The ecpglib major version (SO_MAJOR_VERSION) was changed in
> bd7c95f0c1a38becffceb3ea7234d57167f6d4bf (Add DECLARE STATEMENT support
> to ECPG.), but I don't see anything in that patch that would warrant
> that.  I think we should undo that change.

Ouch.  Yeah, that's a Big Deal from a packaging standpoint.
Why would it be necessary?

regards, tom lane




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2019-09-16 Thread Michael Paquier
On Mon, Sep 16, 2019 at 11:46:16AM +0530, Amit Kapila wrote:
> I don't see much use of separating information for infomask and infomask2.

Okay, using two separate columns leads to the attached.  Any thoughts?
This also includes a fix for cases with IS_LOCKED_ONLY and UPGRADED.
--
Michael
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 6a09d46a57..c8f6e1b4c6 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -86,80 +86,55 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
 -- always be the same in all test runs. we show raw flags by
 -- default: HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID.
 VACUUM FREEZE test1;
-SELECT t_infomask, t_infomask2, flags
+SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
- LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(flags);
- t_infomask | t_infomask2 |   flags   
-+-+---
-   2816 |   2 | {HEAP_XMAX_INVALID,HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID}
+ LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(raw_flags, combined_flags);
+ t_infomask | t_infomask2 | raw_flags |   combined_flags   
++-+---+
+   2816 |   2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
 (1 row)
 
 -- output the decoded flag HEAP_XMIN_FROZEN instead
-SELECT t_infomask, t_infomask2, flags
+SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
- LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2, true) m(flags);
- t_infomask | t_infomask2 |flags 
-+-+--
-   2816 |   2 | {HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
+ LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) m(raw_flags, combined_flags);
+ t_infomask | t_infomask2 | raw_flags |   combined_flags   
++-+---+
+   2816 |   2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
 (1 row)
 
 -- tests for decoding of combined flags
 -- HEAP_XMAX_SHR_LOCK = (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
-SELECT heap_tuple_infomask_flags(x'0050'::int, 0, true);
- heap_tuple_infomask_flags 

- {HEAP_XMAX_SHR_LOCK}
-(1 row)
-
-SELECT heap_tuple_infomask_flags(x'0050'::int, 0, false);
-  heap_tuple_infomask_flags  
--
- {HEAP_XMAX_EXCL_LOCK,HEAP_XMAX_KEYSHR_LOCK}
+SELECT * FROM heap_tuple_infomask_flags(x'0050'::int, 0);
+  raw_flags  |combined_flags
+-+--
+ {HEAP_XMAX_KEYSHR_LOCK,HEAP_XMAX_EXCL_LOCK} | {HEAP_XMAX_SHR_LOCK}
 (1 row)
 
 -- HEAP_XMIN_FROZEN = (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)
-SELECT heap_tuple_infomask_flags(x'0300'::int, 0, true);
- heap_tuple_infomask_flags 

- {HEAP_XMIN_FROZEN}
-(1 row)
-
-SELECT heap_tuple_infomask_flags(x'0300'::int, 0, false);
-heap_tuple_infomask_flags
--
- {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID}
+SELECT * FROM heap_tuple_infomask_flags(x'0300'::int, 0);
+raw_flags|   combined_flags   
+-+
+ {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID} | {HEAP_XMIN_FROZEN}
 (1 row)
 
 -- HEAP_MOVED = (HEAP_MOVED_IN | HEAP_MOVED_OFF)
-SELECT heap_tuple_infomask_flags(x'C000'::int, 0, true);
- heap_tuple_infomask_flags 

- {HEAP_MOVED}
+SELECT * FROM heap_tuple_infomask_flags(x'C000'::int, 0);
+   raw_flags| combined_flags 
++
+ {HEAP_MOVED_OFF,HEAP_MOVED_IN} | {HEAP_MOVED}
 (1 row)
 
-SELECT heap_tuple_infomask_flags(x'C000'::int, 0, false);
-   heap_tuple_infomask_flags
-
- {HEAP_MOVED_IN,HEAP_MOVED_OFF}
-(1 row)
-
--- HEAP_LOCKED_UPGRADED = (HEAP_XMAX_IS_MULTI | HEAP_XMAX_LOCK_ONLY)
-SELECT heap_tuple_infomask_flags(x'1080'::int, 0, true);
- heap_tuple_infomask_flags 

- {HEAP_LOCKED_UPGRADED}
-(1 row)
-
-SELECT heap_tuple_infomask_flags(x'1080'::int, 0, false);
-heap_tuple_infomask_flags 
---
- {HEAP_XMAX_LOCK_ONLY,HEAP_XMAX_IS_MULTI}
+SELECT * FROM 

Re: block-level incremental backup

2019-09-16 Thread Robert Haas
On Mon, Sep 16, 2019 at 4:31 AM Amit Kapila  wrote:
> This seems to be a blocking problem for the LSN based design.

Well, only the simplest version of it, I think.

> Can we think of using creation time for file?  Basically, if the file
> creation time is later than backup-labels "START TIME:", then include
> that file entirely.  I think one big point against this is clock skew
> like what if somebody tinkers with the clock.  And also, this can
> cover cases like
> what Jeevan has pointed but might not cover other cases which we found
> problematic.

Well that would mean, for example, that if you copied the data
directory from one machine to another, the next "incremental" backup
would turn into a full backup. That sucks. And in other situations,
like resetting the clock, it could mean that you end up with a corrupt
backup without any real ability for PostgreSQL to detect it. I'm not
saying that it is impossible to create a practically useful system
based on file time stamps, but I really don't like it.

> I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
> have similar problems.

I'm not sure quite what you mean by that.  Can you elaborate? It
appears to me that the XLR_SPECIAL_REL_UPDATE operations are all
things that create files, remove files, or truncate files, and the
sketch in my previous email would handle the first two of those cases
correctly.  See below for the third.

> One related point is how do incremental backups handle the case where
> vacuum truncates the relation partially?  Basically, with current
> patch/design, it doesn't appear that such information can be passed
> via incremental backup.  I am not sure if this is a problem, but it
> would be good if we can somehow handle this.

As to this, if you're taking a full backup of a particular file,
there's no problem.  If you're taking a partial backup of a particular
file, you need to include the current length of the file and the
identity and contents of each modified block.  Then you're fine.
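
To make that concrete, a partial-file entry in such a backup stream only
needs to carry the file's current length plus the modified blocks; a purely
illustrative layout (names and layout are mine, not the format from any
posted patch) could look like this:

#include <stdint.h>

/*
 * Illustrative only: header for one partially backed-up file in a
 * hypothetical incremental backup stream.  Storing the current file
 * length lets restore replay truncations/extensions; the (block number,
 * block image) pairs that follow cover each modified 8kB block.
 */
typedef struct IncrementalFileHeader
{
    uint32_t    pathlen;        /* length of the relative file path that follows */
    uint64_t    file_length;    /* current length of the file, in bytes */
    uint32_t    nblocks;        /* number of modified blocks included */
    /* then: pathlen bytes of path, then nblocks x (uint32_t blkno, 8192-byte image) */
} IncrementalFileHeader;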

> Isn't some operations where at the end we directly call heap_sync
> without writing WAL will have a similar problem as well?

Maybe.  Can you give an example?

> Similarly,
> it is not very clear if unlogged relations are handled in some way if
> not, the same could be documented.

I think that we don't need to back up the contents of unlogged
relations at all, right? Restoration from an online backup always
involves running recovery, and so unlogged relations will anyway get
zapped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-16 Thread Tom Lane
Martijn van Oosterhout  writes:
> On Mon, 16 Sep 2019 at 00:14, Tom Lane  wrote:
>> ... I also think we
>> can simplify the handling of other-database listeners by including
>> them in the set signaled by SignalBackends, but only if they're
>> several pages behind.  So that leads me to the attached patch;
>> what do you think?

> I think I like the idea of having SignalBackend do the waking up a
> slow backend but I'm not enthused by the "lets wake up (at once)
> everyone that is behind". That's one of the issues I was explicitly
> trying to solve. If there are any significant number of "slow"
> backends then we get the "thundering herd" again.

But do we care?  With asyncQueueAdvanceTail gone from the listeners,
there's no longer an exclusive lock for them to contend on.  And,
again, I failed to see any significant contention even in HEAD as it
stands; so I'm unconvinced that you're solving a live problem.

regards, tom lane




Re: (Re)building index using itself or another index of the same table

2019-09-16 Thread Arseny Sher

Tomas Vondra  writes:

> On Thu, Sep 12, 2019 at 11:08:28AM -0400, Tom Lane wrote:

>>I have exactly no faith that this fixes things enough to make such
>>cases supportable.  And I have no interest in opening that can of
>>worms anyway.  I'd rather put in some code to reject database
>>accesses in immutable functions.
>>
>
> Same here. My hunch is a non-trivaial fraction of applications using
> this "trick" is silently broken in various subtle ways.

Ok, I see the point. However, a "could not read block" error might seem
quite scary to users; it looks like data corruption. How about ERRORing
out in get_relation_info instead, rather than silently skipping indexes
that are currently being rebuilt, as in the attached? Even if this
doesn't cover all cases, at least the one scenario observed in the field
would get a better error message.
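
Sketch of the idea in C (the plancat.c hunk of the attached patch is
truncated below, so this is just my reading of it, not the actual hunk):
inside get_relation_info()'s loop over the relation's indexes, reject any
index that is currently being rebuilt instead of letting the scan fail
later with "could not read block":

    if (ReindexIsProcessingIndex(indexoid))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("cannot use index \"%s\" while it is being reindexed",
                        RelationGetRelationName(indexRelation))));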

Rejecting database access completely in immutable functions would be
unfortunate for our particular case, because this GIN index on an
expression that joins the indexed table to itself multiple times (and
thus uses a btree index) is, well, useful. Here is a brief description
of the case. The indexed table stores postal addresses, which are
hierarchical in nature (e.g. country-region-city-street-house). A single
row is one element at any depth (e.g. a region or a house); each row
stores a link to its parent in the parent_guid column, thus establishing
the hierarchy (e.g. a house links to its street).

The task is to get the full address by typing random parts of it
(imagine the typing hints in Google Maps). For that, FTS is used. A GIN
index is built on full addresses, and to form the full address the table
is climbed about six times (the hierarchy depth) by following the
parent_guid chain.

We could materialize full addresses in the table and eliminate the need
to form them in the index expression, but that would seriously increase
the amount of required storage -- GIN doesn't store the indexed columns
in full, so it is cheaper to 'materialize' the full addresses only inside
the index.


Surely this is a hack which cheats the system. We might imagine creating
some functionality (a kind of index referring to multiple rows of the
table -- or even rows of different tables) that makes it unnecessary, but
such functionality doesn't exist today, and the hack is useful if you
understand the risk.


>>> One might argue that function selecting from table can hardly be called
>>> immutable, and immutability is required for index expressions. However,
>>> if user is sure table contents doesn't change, why not?
>>
>>If the table contents never change, why are you doing VACUUM FULL on it?
>>
>
> It's possible the columns referenced by the index expression are not
> changing, but some additional columns are updated.

Yeah. Also, the table can be CLUSTERed without VACUUM FULL.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From f5b9c433bf387a9ddbe318dfea2b96c02c4a945e Mon Sep 17 00:00:00 2001
From: Arseny Sher 
Date: Thu, 12 Sep 2019 17:35:16 +0300
Subject: [PATCH] ERROR out early on attempt to touch user indexes while they
 are being (re)built.

The existing ReindexIsProcessingIndex check is consulted only in genam.c and
thus enforced only for system catalogs. Check it also in the planner, so that
indexes which are currently being rebuilt are never tried. Also call
SetReindexProcessing in index_create to defend against the index being used
during its own creation.

Without this, VACUUM FULL or just CREATE INDEX might fail with something like

ERROR:  could not read block 3534 in file "base/41366676/56697497": read only 0 of 8192 bytes

if there are indexes whose use could be considered while those very indexes
are being (re)built, i.e. when an index expression scans the indexed table.

Such an error might seem scary, so catch this earlier.
---
 src/backend/catalog/index.c| 22 --
 src/backend/optimizer/util/plancat.c   |  9 +
 src/test/regress/expected/create_index.out | 15 +++
 src/test/regress/sql/create_index.sql  | 13 +
 4 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3e1d40662d..5bc764ce46 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1174,7 +1174,22 @@ index_create(Relation heapRelation,
 	}
 	else
 	{
-		index_build(heapRelation, indexRelation, indexInfo, false, true);
+		/* ensure SetReindexProcessing state isn't leaked */
+		PG_TRY();
+		{
+			/* Suppress use of the target index while building it */
+			SetReindexProcessing(heapRelationId, indexRelationId);
+
+			index_build(heapRelation, indexRelation, indexInfo, false, true);
+		}
+		PG_CATCH();
+		{
+			/* Make sure flag gets cleared on error exit */
+			ResetReindexProcessing();
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+		ResetReindexProcessing();
 	}
 
 	/*
@@ -1379,7 +1394,10 @@ index_concurrently_build(Oid heapRelationId,
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
 
-	/* Now build the index */
+	/*
+	

Re: Bug in GiST paring heap comparator

2019-09-16 Thread Nikita Glukhov

On 13.09.2019 20:17, Alexander Korotkov wrote:


On Fri, Sep 13, 2019 at 5:23 PM Nikita Glukhov  wrote:

I have moved handling of NULL ordering keys from opclasses to the common
SP-GiST code, but really I don't like how it is implemented now. Maybe it's
worth moving the handling of NULL order-by keys to an even higher level,
so that the AM doesn't have to worry about NULLs?

Yes, the optimizer could remove "col op NULL" clauses from ORDER BY
if op is a strict operator, and then such clauses wouldn't be passed to
the AM.  But I see this as a future improvement.  For backpatching we
should solve this on the AM side.


Also I left the usages of IndexOrderByDistance in the opclasses. I think
that may help to minimize opclass changes in the future.

Could you please extract this as a separate patch?  We can consider
it for master, but we shouldn't backpatch it.


Propagation of IndexOrderByDistance in SP-GiST was extracted into a separate
patch #2.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>From e651a985b832e8429c8bab69ec3a7f7c39cdf2dd Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Tue, 10 Sep 2019 17:26:26 +0300
Subject: [PATCH 1/5] Fix GiST and SP-GiST ordering by distance for NULLs

---
 src/backend/access/gist/gistget.c | 68 -
 src/backend/access/gist/gistscan.c| 18 +++---
 src/backend/access/index/indexam.c| 22 +++
 src/backend/access/spgist/spgscan.c   | 74 +++
 src/include/access/genam.h| 10 ++-
 src/include/access/gist_private.h | 27 ++---
 src/include/access/spgist_private.h   |  8 ++-
 src/test/regress/expected/create_index_spgist.out | 10 +++
 src/test/regress/sql/create_index_spgist.sql  |  5 ++
 src/tools/pgindent/typedefs.list  |  1 +
 10 files changed, 142 insertions(+), 101 deletions(-)

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index db633a9..22d790d 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -112,9 +112,8 @@ gistkillitems(IndexScanDesc scan)
  * Similarly, *recheck_distances_p is set to indicate whether the distances
  * need to be rechecked, and it is also ignored for non-leaf entries.
  *
- * If we are doing an ordered scan, so->distancesValues[] and
- * so->distancesNulls[] is filled with distance data from the distance()
- * functions before returning success.
+ * If we are doing an ordered scan, so->distances[] is filled with distance
+ * data from the distance() functions before returning success.
  *
  * We must decompress the key in the IndexTuple before passing it to the
  * sk_funcs (which actually are the opclass Consistent or Distance methods).
@@ -135,8 +134,7 @@ gistindex_keytest(IndexScanDesc scan,
 	GISTSTATE  *giststate = so->giststate;
 	ScanKey		key = scan->keyData;
 	int			keySize = scan->numberOfKeys;
-	double	   *distance_value_p;
-	bool	   *distance_null_p;
+	IndexOrderByDistance *distance_p;
 	Relation	r = scan->indexRelation;
 
 	*recheck_p = false;
@@ -155,8 +153,8 @@ gistindex_keytest(IndexScanDesc scan,
 			elog(ERROR, "invalid GiST tuple found on leaf page");
 		for (i = 0; i < scan->numberOfOrderBys; i++)
 		{
-			so->distanceValues[i] = -get_float8_infinity();
-			so->distanceNulls[i] = false;
+			so->distances[i].value = -get_float8_infinity();
+			so->distances[i].isnull = false;
 		}
 		return true;
 	}
@@ -240,8 +238,7 @@ gistindex_keytest(IndexScanDesc scan,
 
 	/* OK, it passes --- now let's compute the distances */
 	key = scan->orderByData;
-	distance_value_p = so->distanceValues;
-	distance_null_p = so->distanceNulls;
+	distance_p = so->distances;
 	keySize = scan->numberOfOrderBys;
 	while (keySize > 0)
 	{
@@ -256,8 +253,8 @@ gistindex_keytest(IndexScanDesc scan,
 		if ((key->sk_flags & SK_ISNULL) || isNull)
 		{
 			/* Assume distance computes as null */
-			*distance_value_p = 0.0;
-			*distance_null_p = true;
+			distance_p->value = 0.0;
+			distance_p->isnull = true;
 		}
 		else
 		{
@@ -294,13 +291,12 @@ gistindex_keytest(IndexScanDesc scan,
 	 ObjectIdGetDatum(key->sk_subtype),
 	 PointerGetDatum());
 			*recheck_distances_p |= recheck;
-			*distance_value_p = DatumGetFloat8(dist);
-			*distance_null_p = false;
+			distance_p->value = DatumGetFloat8(dist);
+			distance_p->isnull = false;
 		}
 
 		key++;
-		distance_value_p++;
-		distance_null_p++;
+		distance_p++;
 		keySize--;
 	}
 
@@ -313,8 +309,7 @@ gistindex_keytest(IndexScanDesc scan,
  *
  * scan: index scan we are executing
  * pageItem: search queue item identifying an index page to scan
- * myDistanceValues: distances array associated with pageItem, or NULL at the root
- * myDistanceNulls: null flags for myDistanceValues array, or NULL at the root
+ * myDistances: distances array associated with pageItem, or NULL at the root
  * tbm: if not NULL, 

Re: range test for hash index?

2019-09-16 Thread Amit Kapila
On Mon, Sep 16, 2019 at 7:23 AM Paul A Jungwirth
 wrote:
>
> On Sat, Sep 14, 2019 at 5:13 AM Amit Kapila  wrote:
> > In general, hash_range is covered by some of the existing tests,
> > but I don't know which test.  See the code coverage report here:
> > https://coverage.postgresql.org/src/backend/utils/adt/rangetypes.c.gcov.html
>
> Thanks! I did some experimenting, and the current test code *only*
> calls `hash_range_internal` when we force it like this:
>

I don't see this function on the master branch.  Is this function name
correct?  Are you looking at some different branch?

> set enable_nestloop=f;
> set enable_hashjoin=t;
> set enable_mergejoin=f;
> select * from numrange_test natural join numrange_test2 order by nr;
>
> But if I create that index as a hash index instead, we also call it
> for these inserts and selects (except for the empty ranges):
>
> create table numrange_test2(nr numrange);
> create index numrange_test2_hash_idx on numrange_test2 (nr);
>
> INSERT INTO numrange_test2 VALUES('[, 5)');
> INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));
> INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2));
> INSERT INTO numrange_test2 VALUES(numrange(1.1, 2.2,'()'));
> INSERT INTO numrange_test2 VALUES('empty');
>
> select * from numrange_test2 where nr = 'empty'::numrange;
> select * from numrange_test2 where nr = numrange(1.1, 2.2);
> select * from numrange_test2 where nr = numrange(1.1, 2.3);
>
> (None of that is surprising, right? :-)
>
> So that seems like more confirmation that it was always intended to be
> a hash index.

Yes, it indicates that.

Jeff/Heikki, to me the issue pointed out by Paul looks like an oversight
in commit 4429f6a9e3.  Can you think of any other reason?  If not, I
can commit this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




ecpglib major version changed

2019-09-16 Thread Peter Eisentraut
The ecpglib major version (SO_MAJOR_VERSION) was changed in
bd7c95f0c1a38becffceb3ea7234d57167f6d4bf (Add DECLARE STATEMENT support
to ECPG.), but I don't see anything in that patch that would warrant
that.  I think we should undo that change.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A problem presentaion about ECPG, DECLARE STATEMENT

2019-09-16 Thread Peter Eisentraut
On 2019-09-11 18:04, Michael Meskes wrote:
>>>> Is it acceptable for PG12?
>>> In general absolutely.
>>
>> It seems far too late to be considering any major rewrite for v12;
>> even assuming that there was consensus on the rewrite being an
>> improvement, which I bet there won't be.
> 
> Oops, I read 13. Yes, it's obviously way too late for 12. Sorry for the
> noise.
> 
> In this case I'd like to details about what is wrong with the
> implementation.

I tried finding some information about where the idea for this statement
came from but couldn't find any.  The documentation references Oracle
and DB2, and while they indeed do have this statement, it doesn't seem
to be used for the same purpose.  The only purpose in ECPG appears to be
to associate a statement with a connection, but for example the DB2
implementation doesn't even have the AT clause, so I don't see how that
could be the same.

Moreover, I've been wondering about the behavior detail given in the
table at
.
 In scenario 3, the declare statement says con1 but the subsequent
dynamic statement says con2, and as a result of that, con1 is used.
This is not intuitive, I'd say, but given that there is no indication of
where this statement came from or whose idea it follows, it's unclear
how to evaluate that.
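
For readers without that table at hand, my reading of scenario 3 in ECPG
terms is roughly the following (connection, statement, and variable names
are made up for illustration):

EXEC SQL BEGIN DECLARE SECTION;
const char *query = "SELECT current_database()";
EXEC SQL END DECLARE SECTION;

EXEC SQL CONNECT TO testdb AS con1;
EXEC SQL CONNECT TO testdb AS con2;

/* bind the statement name to con1 ... */
EXEC SQL AT con1 DECLARE stmt STATEMENT;

/* ... and, per the table, this then runs on con1, not con2 */
EXEC SQL AT con2 PREPARE stmt FROM :query;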

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Improve performance of NOTIFY over many databases (v2)

2019-09-16 Thread Martijn van Oosterhout
Hi Tom,

On Mon, 16 Sep 2019 at 00:14, Tom Lane  wrote:
>
> I spent some more time thinking about this, and I'm still not too
> satisfied with this patch's approach.  It seems to me the key insights
> we're trying to make use of are:
>
> 1. We don't really need to keep the global tail pointer exactly
> up to date.  It's bad if it falls way behind, but a few pages back
> is fine.

Agreed.

> 2. When sending notifies, only listening backends connected to our
> own database need be awakened immediately.  Backends connected to
> other DBs will need to advance their queue pointer sometime, but
> again it doesn't need to be right away.

Agreed.

> 3. It's bad for multiple processes to all be trying to do
> asyncQueueAdvanceTail concurrently: they'll contend for exclusive
> access to the AsyncQueueLock.  Therefore, having the listeners
> do it is really the wrong thing, and instead we should do it on
> the sending side.

Agreed, but I'd add that in databases that are largely idle there may
never be a sender, so the listeners there need to have their pointers
advanced some other way.

> However, the patch as presented doesn't go all the way on point 3,
> instead having listeners maybe-or-maybe-not do asyncQueueAdvanceTail
> in asyncQueueReadAllNotifications.  I propose that we should go all
> the way and just define tail-advancing as something that happens on
> the sending side, and only once every few pages.  I also think we
> can simplify the handling of other-database listeners by including
> them in the set signaled by SignalBackends, but only if they're
> several pages behind.  So that leads me to the attached patch;
> what do you think?

I think I like the idea of having SignalBackends do the waking up of a
slow backend, but I'm not enthused by the "let's wake up (at once)
everyone that is behind". That's one of the issues I was explicitly
trying to solve. If there is any significant number of "slow"
backends then we get the "thundering herd" again. If the number of
slow backends exceeds the number of cores then commits across the
system could be held up for quite a while (which is what caused me to
write this patch; multiple seconds was not unusual).

The maybe/maybe-not in asyncQueueReadAllNotifications is "if I was
behind, then I probably got woken up, hence I need to wake up someone
else", thus ensuring the cleanup proceeds in an orderly fashion while
leaving gaps where the lock isn't held, allowing COMMITs to proceed.

> BTW, in my hands it seems like point 2 (skip wakening other-database
> listeners) is the only really significant win here, and of course
> that only wins when the notify traffic is spread across a fair number
> of databases.  Which I fear is not the typical use-case.  In single-DB
> use-cases, point 2 helps not at all.  I had a really hard time measuring
> any benefit from point 3 --- I eventually saw a noticeable savings
> when I tried having one notifier and 100 listen-only backends, but
> again that doesn't seem like a typical use-case.  I could not replicate
> your report of lots of time spent in asyncQueueAdvanceTail's lock
> acquisition.  I wonder whether you're using a very large max_connections
> setting and we already fixed most of the problem with that in bca6e6435.
> Still, this patch doesn't seem to make any cases worse, so I don't mind
> if it's just improving unusual use-cases.

I'm not sure if it's an unusual use-case, but it is my use-case :).
Specifically, there are 100+ instances of the same application running
on the same cluster with wildly different usage patterns. Some will be
idle because no-one is logged in, some will be quite busy. Although
there are only 2 listeners per database, that's still a lot of
listeners that can be behind. Though I agree that bca6e6435 will have
mitigated quite a lot (yes, max_connections is quite high). Another
mitigation would be to spread across more smaller database clusters,
which we need to do anyway.

That said, your approach is conceptually simpler, which is also worth
something, and it gets essentially all the same benefits for more
normal use cases. If QUEUE_CLEANUP_DELAY were raised a bit, we could
mitigate the rest on the client side by having idle databases send
dummy notifies every now and then to trigger cleanup for their
database (see the sketch below). The flip side is that slow backends
will then have further to catch up, thus holding the lock longer. It's
not worth making it configurable, so we have to guess, but 16 is
perhaps a good compromise.
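
For what it's worth, the client-side keepalive could be as small as this
(a sketch only; the channel name and connection handling are illustrative):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* connection parameters are taken from the environment */
    PGconn     *conn = PQconnectdb("");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /*
     * An empty notify makes an otherwise idle database act as a sender
     * once in a while, which is what triggers the sending-side cleanup
     * discussed above.
     */
    res = PQexec(conn, "SELECT pg_notify('dummy_cleanup', '')");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "notify failed: %s", PQerrorMessage(conn));

    PQclear(res);
    PQfinish(conn);
    return 0;
}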

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-09-16 Thread Tomas Vondra

On Sun, Sep 15, 2019 at 09:33:33PM -0400, James Coleman wrote:

On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
 wrote:

>> ...
>>
>> I think this may be a thinko, as this plan demonstrates - but I'm not
>> sure about it. I wonder if this might be penalizing some other types of
>> plans (essentially anything with limit + gather).
>>
>> Attached is a WIP patch fixing this by considering both startup and
>> total cost (by calling compare_path_costs_fuzzily).
>
>It seems to me that this is likely a bug, and not just a change
>needed for this. Do you think it's better addressed in a separate
>thread? Or retain it as part of this patch for now (and possibly break
>it out later)? On the other hand, it's entirely possible that someone
>more familiar with parallel plan limitations could explain why the
>above comment holds true. That makes me lean towards asking in a new
>thread.
>

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.


I did some more thinking about this, and I can't currently come up
with a way to reproduce this issue outside of this patch. It doesn't
seem reasonable to me to assume that there's anything inherent about
this patch that means it's the only way we can end up with a partial
path with a low startup cost we'd want to prefer.

Part of me wants to pull it over to a separate thread just to get
additional feedback, but I'm not sure how useful that is given we
don't currently have an example case outside of this patch.



Hmm, I see.

While I initially suggested starting a separate thread only if we have
an example not involving an incremental sort, that's probably not a hard
requirement. I think it's fine to start a thread briefly explaining the
issue and pointing to the incremental sort thread for an actual example.



One thing to note though: the current patch does not also modify
add_partial_path_precheck, which likewise does not take startup cost
into account, so we probably need to update that for completeness's
sake.



Good point. It does indeed seem to make the same assumption about only
comparing total cost before calling add_path_precheck.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: refactoring - share str2*int64 functions

2019-09-16 Thread Michael Paquier
On Sat, Sep 14, 2019 at 10:24:10AM +0200, Fabien COELHO wrote:
> It should rather call pg_strtoint16? And possibly switch the "short int"
> declaration to int16?

Sure, but you get into other problems if using the 16-bit version for
some other fields, which is why it seems to me that we should add an
extra routine for shorts.  So for now I prefer discarding this part.

> I do not think that "pg_strtoint_error" should be inlinable. The function is
> unlikely to be called, so it is not performance critical to inline it, and
> would enlarge the executable needlessly. However, the "pg_strto*_check"
> variants should be inlinable, as you have done.

Makes sense.

> About the code, on several instances of:
> 
>/* skip leading spaces */
>  while (likely(*ptr) && isspace((unsigned char) *ptr)) …
> 
> I would drop the "likely(*ptr)".

Right as well.  There were two places out of six with that pattern.

> And on several instances of:
> 
>!unlikely(isdigit((unsigned char) *ptr)))
> 
> ISTM that we want "unlikely(!isdigit((unsigned char) *ptr)))". Parsing
> !unlikely leads to false conclusion and a headache:-)

That part was actually inconsistent with the rest.
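
For reference, the intended shape of the test is shown below in a
standalone sketch (not a hunk from the patch; the unlikely() definition
mimics the one in c.h and assumes GCC/Clang):

#include <ctype.h>
#include <stdbool.h>

#define unlikely(x) __builtin_expect((x) != 0, 0)

/* Parse optional leading spaces followed by at least one digit. */
static bool
parse_digits_prefix(const char *str, unsigned long *result)
{
    const char *ptr = str;
    unsigned long val = 0;

    /* skip leading spaces */
    while (isspace((unsigned char) *ptr))
        ptr++;

    /* the hint wraps the whole negated test, not just isdigit() */
    if (unlikely(!isdigit((unsigned char) *ptr)))
        return false;

    while (isdigit((unsigned char) *ptr))
        val = val * 10 + (*ptr++ - '0');    /* overflow not handled here */

    *result = val;
    return true;
}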

> Otherwise, this batch of changes looks ok to me.

Thanks.
--
Michael
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..7af05fc009 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1022,7 +1022,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		/* parse command tag to retrieve the number of affected rows. */
 		if (completionTag &&
 			strncmp(completionTag, "COPY ", 5) == 0)
-			rows = pg_strtouint64(completionTag + 5, NULL, 10);
+			rows = pg_strtouint64_check(completionTag + 5);
 		else
 			rows = 0;
 
diff --git a/contrib/spi/refint.c b/contrib/spi/refint.c
index adf0490f85..b063f1e122 100644
--- a/contrib/spi/refint.c
+++ b/contrib/spi/refint.c
@@ -306,7 +306,7 @@ check_foreign_key(PG_FUNCTION_ARGS)
 		/* internal error */
 		elog(ERROR, "check_foreign_key: too short %d (< 5) list of arguments", nargs);
 
-	nrefs = pg_strtoint32(args[0]);
+	nrefs = pg_strtoint32_check(args[0]);
 	if (nrefs < 1)
 		/* internal error */
 		elog(ERROR, "check_foreign_key: %d (< 1) number of references specified", nrefs);
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 2c0ae395ba..1272b4ee04 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -2338,8 +2338,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 	CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
 
 	if (strncmp(completionTag, "SELECT ", 7) == 0)
-		_SPI_current->processed =
-			pg_strtouint64(completionTag + 7, NULL, 10);
+		_SPI_current->processed = pg_strtouint64_check(completionTag + 7);
 	else
 	{
 		/*
@@ -2361,8 +2360,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 else if (IsA(stmt->utilityStmt, CopyStmt))
 {
 	Assert(strncmp(completionTag, "COPY ", 5) == 0);
-	_SPI_current->processed = pg_strtouint64(completionTag + 5,
-			 NULL, 10);
+	_SPI_current->processed = pg_strtouint64_check(completionTag + 5);
 }
 			}
 
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index a9bd47d937..ac2bf2f2c4 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -286,10 +286,10 @@ pq_parse_errornotice(StringInfo msg, ErrorData *edata)
 edata->hint = pstrdup(value);
 break;
 			case PG_DIAG_STATEMENT_POSITION:
-edata->cursorpos = pg_strtoint32(value);
+edata->cursorpos = pg_strtoint32_check(value);
 break;
 			case PG_DIAG_INTERNAL_POSITION:
-edata->internalpos = pg_strtoint32(value);
+edata->internalpos = pg_strtoint32_check(value);
 break;
 			case PG_DIAG_INTERNAL_QUERY:
 edata->internalquery = pstrdup(value);
@@ -316,7 +316,7 @@ pq_parse_errornotice(StringInfo msg, ErrorData *edata)
 edata->filename = pstrdup(value);
 break;
 			case PG_DIAG_SOURCE_LINE:
-edata->lineno = pg_strtoint32(value);
+edata->lineno = pg_strtoint32_check(value);
 break;
 			case PG_DIAG_SOURCE_FUNCTION:
 edata->funcname = pstrdup(value);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..2a1e9be3eb 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -32,6 +32,7 @@
 
 #include 
 
+#include "common/string.h"
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "nodes/extensible.h"
@@ -80,7 +81,7 @@
 #define READ_UINT64_FIELD(fldname) \
 	token = pg_strtok();		/* skip :fldname */ \
 	token = pg_strtok();		/* get field value */ \
-	local_node->fldname = pg_strtouint64(token, NULL, 10)
+	(void) pg_strtouint64(token, _node->fldname)
 
 /* Read a long integer field (anything written as ":fldname %ld") */
 #define READ_LONG_FIELD(fldname) \

Re: block-level incremental backup

2019-09-16 Thread Amit Kapila
On Mon, Sep 16, 2019 at 7:22 AM Robert Haas  wrote:
>
> On Thu, Sep 12, 2019 at 9:13 AM Jeevan Chalke
>  wrote:
> > I had a look over this issue and observed that when the new database is
> > created, the catalog files are copied as-is into the new directory
> > corresponding to a newly created database. And as they are just copied,
> > the LSN on those pages are not changed. Due to this incremental backup
> > thinks that its an existing file and thus do not copy the blocks from
> > these new files, leading to the failure.
>
> *facepalm*
>
> Well, this shoots a pretty big hole in my design for this feature. I
> don't know why I didn't think of this when I wrote out that design
> originally. Ugh.
>
> Unless we change the way that CREATE DATABASE and any similar
> operations work so that they always stamp pages with new LSNs, I think
> we have to give up on the idea of being able to take an incremental
> backup by just specifying an LSN.
>

This seems to be a blocking problem for the LSN-based design.  Can we
think of using the creation time of a file?  Basically, if the file
creation time is later than the backup label's "START TIME:", then
include that file entirely.  I think one big point against this is
clock skew -- what if somebody tinkers with the clock?  Also, this can
cover cases like the one Jeevan has pointed out, but it might not
cover other cases which we found problematic.

>  We'll instead need to get a list of
> files from the server first, and then request the entirety of any that
> we don't have, plus the changed blocks from the ones that we do have.
> I guess that will make Stephen happy, since it's more like the design
> he wanted originally (and should generalize more simply to parallel
> backup).
>
> One question I have is: is there any scenario in which an existing
> page gets modified after the full backup and before the incremental
> backup but does not end up with an LSN that follows the full backup's
> start LSN?
>

I think the operations covered by WAL flag XLR_SPECIAL_REL_UPDATE will
have similar problems.

One related point is how incremental backups handle the case where
vacuum truncates the relation partially.  Basically, with the current
patch/design, it doesn't appear that such information can be passed
via the incremental backup.  I am not sure if this is a problem, but
it would be good if we could somehow handle this.

Won't operations where at the end we directly call heap_sync without
writing WAL have a similar problem as well?  Similarly, it is not very
clear whether unlogged relations are handled in some way; if not, that
could be documented.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] CLUSTER command progress monitor

2019-09-16 Thread Tattsu Yama
Hi Michael,


>> > Attached file is WIP patch. In my patch, I added "command id" to all
>> APIs of
>> > progress reporting to isolate commands. Therefore, it doesn't allow to
>> > cascade updating system views. And my patch is on WIP so it needs
>> clean-up
>> > and test.
>> > I share it anyway. :)
>>
>> +   if (cmdtype == PROGRESS_COMMAND_INVALID ||
>> beentry->st_progress_command == cmdtype)
>> +   {
>> +   PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
>> +   beentry->st_progress_param[index] = val;
>> +   PGSTAT_END_WRITE_ACTIVITY(beentry);
>> +   }
>> You basically don't need the progress reports if the command ID is
>> invalid, no?
>>
>
>
> Ah, right.
> I'll check and fix that today. :)
>
>

I fixed the patch based on your comment.
Please find the attached file.  :)

I should have explained the API changes more. I added cmdtype as a
parameter to all the functions (see below).
Therefore, I suppose that my patch is similar to the following fix you
mentioned on -hackers.

- Allow only reporting for a given command ID, which would basically
> require to pass down the command ID to progress update APIs and bypass an
> update
> if the command ID provided by caller does not match the existing one
> started (?).


# pgstat.c
pgstat_progress_start_command(ProgressCommandType cmdtype, ...)
   - Progress reporting starts when beentry->st_progress_command is
PROGRESS_COMMAND_INVALID

pgstat_progress_end_command(ProgressCommandType cmdtype, ...)
   - Progress reporting ends when beentry->st_progress_command equals cmdtype

pgstat_progress_update_param(ProgressCommandType cmdtype, ...) and
pgstat_progress_update_multi_param(ProgressCommandType cmdtype, ...)
   - Progress parameters are updated only if beentry->st_progress_command
equals cmdtype

Note:
cmdtype means the ProgressCommandType below:

# pgstat.h
typedef enum ProgressCommandType
{
PROGRESS_COMMAND_INVALID,
PROGRESS_COMMAND_VACUUM,
PROGRESS_COMMAND_CLUSTER,
PROGRESS_COMMAND_CREATE_INDEX
} ProgressCommandType;
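
As a call-sequence sketch (argument lists other than the new cmdtype are
assumed unchanged from the current API; the phase constants are the
existing ones from commands/progress.h; the caller is hypothetical):

/* Hypothetical caller, to show the intended isolation between commands. */
static void
report_cluster_progress(Oid tableOid)
{
    /* starts only while st_progress_command is PROGRESS_COMMAND_INVALID */
    pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);

    /* applied only while st_progress_command equals PROGRESS_COMMAND_CLUSTER */
    pgstat_progress_update_param(PROGRESS_COMMAND_CLUSTER,
                                 PROGRESS_CLUSTER_PHASE,
                                 PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);

    /* a nested start/end with a different cmdtype is simply ignored */
    pgstat_progress_end_command(PROGRESS_COMMAND_CLUSTER);
}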

Thanks,
Tatsuro Yamada


v2_fix_progress_report_for_cluster.patch
Description: Binary data