Re: pglz performance
On 02.08.2019 21:20, Andres Freund wrote:
> Another question is whether we'd actually want to include the code in core directly, or use system libraries (and if some packagers might decide to disable that, for whatever reason). I'd personally say we should have an included version, and a --with-system-... flag that uses the system one.

+1
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar wrote:
> On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila wrote:
> >
> > I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions,
>
> Please find my comments so far.
..
> 4.
> +void
> +undoaction_redo(XLogReaderState *record)
> +{
> +	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
> +
> +	switch (info)
> +	{
> +		case XLOG_UNDO_APPLY_PROGRESS:
> +			undo_xlog_apply_progress(record);
> +			break;
>
> For HotStandby it doesn't make sense to apply this WAL, as this progress is only required when we try to apply the undo action after restart, but in HotStandby we never apply undo actions.

Hmm, I think it is required. Think what happens if Hotstandby is later promoted to master and a large part of the undo is already applied? In such a case, we can skip the already applied undo.

> 6.
> +	if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno))
> +		slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false);
> +
> +	Assert(slot != NULL);
>
> We are passing missing_ok as false in UndoLogGetSlot. But, not sure why we are expecting that the undo log can not be dropped. In a multi-log transaction it's possible that the tablespace containing the next undo log is already dropped?

If the transaction spans multiple logs, then both the logs should be in the same tablespace. So, how is it possible to drop the tablespace when part of the undo is still pending? AFAICS, the code in choose_undo_tablespace() doesn't seem to allow switching tablespace for the same transaction, but I agree that if someone used a different algorithm, then it might be possible. I think the important question is whether we should allow the same transaction's undo to span across tablespaces? If so, then what you are telling makes sense and we should handle that; if not, then I think we are fine here.

One might argue that there should be some stronger checks to ensure that the same transaction will always get the undo logs from the same tablespace, but I think that is a different thing than what you are raising here. Thomas, others, do you have any opinion on this matter?

In FindUndoEndLocationAndSize, there is a check if the next log is discarded (Case 4: If the transaction is overflowed to ...), won't this case (considering it is possible) get detected by that check?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Avoid full GIN index scan when possible
Attached 6th version of the patches.

On 01.08.2019 22:28, Tom Lane wrote:
> Alexander Korotkov writes:
> > On Thu, Aug 1, 2019 at 9:59 PM Tom Lane wrote:
> > > While I've not attempted to fix that here, I wonder whether we shouldn't fix it by just forcing forcedRecheck to true in any case where we discard an ALL qualifier.
> >
> > +1 for setting forcedRecheck in any case we discard an ALL qualifier. ISTM the real-life number of cases where we can skip the recheck here is negligible, and it doesn't justify the complexity.
>
> Yeah, that was pretty much what I was thinking --- by the time we got it fully right considering nulls and multicolumn indexes, the cases where not rechecking could actually do something useful would be pretty narrow. And a bitmap heap scan is always going to have to visit the heap, IIRC, so how much could skipping the recheck really save?

I have simplified patch #1, setting forcedRecheck for all discarded ALL quals. (This solution is very close to the earliest unpublished version of the patch.) More accurate recheck-forcing logic was moved into patch #2 (multicolumn indexes were fixed). This patch also contains ginFillScanKey() refactoring and a new internal mode GIN_SEARCH_MODE_NOT_NULL that is used only for GinScanKey.xxxConsistentFn initialization and is transformed into GIN_SEARCH_MODE_ALL before GinScanEntry initialization. The cost estimation seems to be correct for both patch #1 and patch #2 and has been left untouched since v05.

> BTW, it's not particularly the fault of this patch, but: what does it even mean to specify GIN_SEARCH_MODE_ALL with a nonzero number of keys?

It might mean we would like to see all the results which don't contain the given key.

> Ah, right, I forgot that the consistent-fn might look at the match results.

Also I decided to go further and tried to optimize (patch #3) the case of GIN_SEARCH_MODE_ALL with a nonzero number of keys.
Full GIN scan can be avoided in queries like this contrib/intarray query: "arr @@ '1' AND arr @@ '!2'" (search arrays containing 1 and not containing 2). Here we have two keys:
- key '1' with GIN_SEARCH_MODE_DEFAULT
- key '2' with GIN_SEARCH_MODE_ALL

Key '2' requires a full scan, which can be avoided with the forced recheck. This query is equivalent to the single-qual query "a @@ '1 & !2'", which emits only one GIN key '1' with recheck.

Below is an example for the contrib/intarray operator @@:

=# CREATE EXTENSION intarray;
=# CREATE TABLE t (a int[]);
=# INSERT INTO t SELECT ARRAY[i] FROM generate_series(1, 100) i;
=# CREATE INDEX ON t USING gin (a gin__int_ops);
=# SET enable_seqscan = OFF;

-- master
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a @@ '1' AND a @@ '!2';
                              QUERY PLAN
 Bitmap Heap Scan on t  (cost=1695.45..16007168.16 rows=5019 width=24) (actual time=66.955..66.956 rows=1 loops=1)
   Recheck Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
   Heap Blocks: exact=1
   Buffers: shared hit=6816
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..1694.19 rows=5019 width=0) (actual time=66.950..66.950 rows=1 loops=1)
         Index Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
         Buffers: shared hit=6815
 Planning Time: 0.086 ms
 Execution Time: 67.076 ms
(9 rows)

-- equivalent single-qual query
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a @@ '1 & !2';
                              QUERY PLAN
 Bitmap Heap Scan on t  (cost=78.94..7141.57 rows=5025 width=24) (actual time=0.019..0.019 rows=1 loops=1)
   Recheck Cond: (a @@ '1 & !2'::query_int)
   Heap Blocks: exact=1
   Buffers: shared hit=8
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..77.68 rows=5025 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: (a @@ '1 & !2'::query_int)
         Buffers: shared hit=7
 Planning Time: 0.056 ms
 Execution Time: 0.039 ms
(9 rows)

-- with patch #3
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE a @@ '1' AND a @@ '!2';
                              QUERY PLAN
 Bitmap Heap Scan on t  (cost=75.45..7148.16 rows=5019 width=24) (actual time=0.019..0.020 rows=1 loops=1)
   Recheck Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
   Heap Blocks: exact=1
   Buffers: shared hit=6
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..74.19 rows=5019 width=0) (actual time=0.011..0.012 rows=1 loops=1)
         Index Cond: ((a @@ '1'::query_int) AND (a @@ '!2'::query_int))
         Buffers: shared hit=5
 Planning Time: 0.059 ms
 Execution Time: 0.040 ms
(9 rows)

Patch #3 again contains a similar ugly solution -- we have to remove already initialized
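The idea behind patch #3, as applied to the contrib/intarray example above, can be sketched in a toy model (Python, not PostgreSQL code; all names here are invented for illustration): instead of executing the GIN_SEARCH_MODE_ALL key '!2' as a full index scan, drop it from the index scan entirely and compensate by forcing a recheck of the complete qualifier against each heap tuple returned by the remaining key.

```python
# Toy sketch of "avoid the full GIN scan for an ALL key by forcing recheck".
# The index maps array elements to sets of TIDs, mimicking a GIN posting list.

def index_scan(index, key):
    """Simulate a GIN index scan for one key: return candidate TIDs."""
    return set(index.get(key, ()))

def recheck(tuple_arr, required, forbidden):
    """Recheck the full qualifier ('required AND NOT forbidden') on the heap tuple."""
    return required in tuple_arr and forbidden not in tuple_arr

heap = {tid: [tid] for tid in range(1, 101)}   # rows ARRAY[1] .. ARRAY[100]
index = {}                                     # element -> set of TIDs
for tid, arr in heap.items():
    for elem in arr:
        index.setdefault(elem, set()).add(tid)

# "a @@ '1' AND a @@ '!2'": scan only key '1'; key '!2' never touches the index.
candidates = index_scan(index, 1)
results = [tid for tid in candidates if recheck(heap[tid], 1, 2)]
```

Only the tuples matching key '1' are visited, and the '!2' condition is verified during the (forced) recheck, matching the buffer counts in the patched EXPLAIN output above.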
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi,

On 2019-08-02 20:57:20 -0400, Stephen Frost wrote:
> No, I’m saying that we already *have* a library and we can add a few functions to it and if people want to leverage those functions then they can write glue code to do so, just like was done with libpq. The argument that “we shouldn’t put code into the common library because only tools written in C can use the common library” is what I was specifically taking exception with and your response doesn’t change my opinion of that argument one bit.

Wait, which library is this? And which code is suitable for being put in a library right now? We're WAY WAY past feature freeze. This isn't the time to rewrite guc.c, guc-file.l to be suitable for running outside of a backend environment.

> Apparently I don’t have the experiences that you do as I’ve not seen a lot of systems which are constantly rewriting the conf file to the point where keeping the versions would be likely to add up to anything interesting.

Shrug. I've e.g. seen people continuously (every few minutes or so) change autovacuum settings depending on load and observed response times. Which isn't even a crazy thing to do.

> Designing the system around “well, we don’t think you’ll modify the file very much from an external tool, so we just won’t worry about it, but if you use alter system then we will clean things up” certainly doesn’t strike me as terribly principled.

Well. You shouldn't change postgresql.conf.auto while the server is running, for fairly obvious reasons. Therefore external tools not using ALTER SYSTEM only make sense when the server is not running. And I don't think it's crazy to assume that PG servers where you'd regularly change the config are running most of the time.

And again, we're talking about v12 here. I don't think anybody is arguing that we shouldn't provide library/commandline tools to make changes to postgresql.auto.conf conveniently possible without duplicating lines. BUT not for v12, especially not because as the person arguing for this, you've not provided a patch providing such a library.

> > > Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.
> > >
> > > Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.
> >
> > That just seems like an entirely different project. It seems blindingly obvious that we can't keep the entire history in the file that we're going to be parsing on a regular basis. Having some form of config history tracking might be interesting, but I think it's utterly and completely independent from what we need to fix for v12.
>
> We don’t parse the file on anything like a “regular” basis.

Well, every time somebody does pg_reload_conf(), which for systems that do frequent ALTER SYSTEMs, is kinda frequent too...

Greetings,

Andres Freund
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 2, 2019 at 5:34 PM Peter Geoghegan wrote:
> I wonder if some variety of block nested loop join would be helpful here. I'm not aware of any specific design that would help with Floris' case, but the idea of reducing the number of scans required on the inner side by buffering outer side tuples (say based on the "k=:val" constant) seems like it might generalize well enough. I suggest Floris look into that possibility. This paper might be worth a read:
>
> https://dl.acm.org/citation.cfm?id=582278

Actually, having looked at the test case in more detail, that now seems less likely. The test case seems designed to reward making it cheaper to access one specific tuple among a fairly large group of related tuples -- reducing the number of inner scans is not going to be possible there.

If this really is totally representative of the case that Floris cares about, I suppose that the approach taken more or less makes sense. Unfortunately, it doesn't seem like an optimization that many other users would find compelling, partly because it's only concerned with fixed overheads, and partly because most queries don't actually look like this.

--
Peter Geoghegan
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Fri, Aug 2, 2019 at 20:46 Andres Freund wrote:
> Hi,
>
> On 2019-08-02 20:27:25 -0400, Stephen Frost wrote:
> > On Fri, Aug 2, 2019 at 18:47 Tom Lane wrote:
> > > Stephen Frost writes:
> > > > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> > > > > > The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable.
> > > > >
> > > > > +1
> > > >
> > > > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
> > >
> > > I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?
> >
> > What if you want to access PG and your tool isn’t written in C? You use a module, extension, package, whatever, that provides the glue between what your language wants and what the C library provides. There’s psycopg2 for python, DBD::Pg for Perl, et al, and they use libpq. There’s languages that like to write their own too, like the JDBC driver, the Golang driver, but that doesn’t mean we shouldn’t provide libpq or that non-C tools can’t leverage libpq. This argument is just not sensible.
>
> Oh, come on. Are you seriously suggesting that a few commands to add a new config setting to postgresql.auto.conf will cause a lot of people to write wrappers around $new_config_library in their language of choice, because they did the same for libpq? And that we should design such a library, for v12?

No, I’m saying that we already *have* a library and we can add a few functions to it and if people want to leverage those functions then they can write glue code to do so, just like was done with libpq. The argument that “we shouldn’t put code into the common library because only tools written in C can use the common library” is what I was specifically taking exception with and your response doesn’t change my opinion of that argument one bit.

> > > I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile.
> >
> > Uh, if you don’t ever run ALTER SYSTEM then you could also “run out of disk space” due to external tools modifying by just adding to the file.
>
> That was commented upon in the emails you're replying to? It seems hardly likely that you'd get enough config entries to make that problematic while postgres is not running. While running it's a different story.

Apparently I don’t have the experiences that you do as I’ve not seen a lot of systems which are constantly rewriting the conf file to the point where keeping the versions would be likely to add up to anything interesting.

Designing the system around “well, we don’t think you’ll modify the file very much from an external tool, so we just won’t worry about it, but if you use alter system then we will clean things up” certainly doesn’t strike me as terribly principled.

> > Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.
> >
> > Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.
>
> That just seems like an entirely different project. It seems blindingly obvious that we can't keep the entire history in the file that we're going to be parsing on a regular basis. Having some form of config history tracking might be interesting, but I think it's utterly and completely independent from what we need to fix for v12.

We don’t parse the file on anything like a “regular” basis.

Thanks,

Stephen
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi,

On 2019-08-02 20:27:25 -0400, Stephen Frost wrote:
> On Fri, Aug 2, 2019 at 18:47 Tom Lane wrote:
> > Stephen Frost writes:
> > > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> > > > > The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable.
> > > >
> > > > +1
> > >
> > > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
> >
> > I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?
>
> What if you want to access PG and your tool isn’t written in C? You use a module, extension, package, whatever, that provides the glue between what your language wants and what the C library provides. There’s psycopg2 for python, DBD::Pg for Perl, et al, and they use libpq. There’s languages that like to write their own too, like the JDBC driver, the Golang driver, but that doesn’t mean we shouldn’t provide libpq or that non-C tools can’t leverage libpq. This argument is just not sensible.

Oh, come on. Are you seriously suggesting that a few commands to add a new config setting to postgresql.auto.conf will cause a lot of people to write wrappers around $new_config_library in their language of choice, because they did the same for libpq? And that we should design such a library, for v12?

> > I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile.
>
> Uh, if you don’t ever run ALTER SYSTEM then you could also “run out of disk space” due to external tools modifying by just adding to the file.

That was commented upon in the emails you're replying to? It seems hardly likely that you'd get enough config entries to make that problematic while postgres is not running. While running it's a different story.

> Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.
>
> Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.

That just seems like an entirely different project. It seems blindingly obvious that we can't keep the entire history in the file that we're going to be parsing on a regular basis. Having some form of config history tracking might be interesting, but I think it's utterly and completely independent from what we need to fix for v12.

It seems pretty clear that there's more people disagreeing with your position than agreeing with you. Because of this conflict there's not been progress on this for weeks. I think it's beyond time that we just do the minimal thing for v12, and then continue from there in v13.

- Andres
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 3:52 PM Tom Lane wrote:
> Better ... but I'm the world's second worst Perl programmer, so I have little to say about whether it's idiomatic.

Perhaps Michael can weigh in here? I'd rather hear a second opinion on v4 of the patch before proceeding.

--
Peter Geoghegan
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 2, 2019 at 1:43 PM Tom Lane wrote:
> Meh. I think the odds that you got this 100% right are small, and the odds that it would be maintainable are smaller. There's too much that can happen if you're not holding any lock --- and there's a lot of active work on btree indexes, which could break whatever assumptions you might make today.

I agree that this sounds very scary.

> BTW, you haven't even really made the case that optimizing a query that behaves this way is the right thing to be doing ... maybe some other plan shape that isn't a nestloop around a LIMIT query would be a better solution.

I wonder if some variety of block nested loop join would be helpful here. I'm not aware of any specific design that would help with Floris' case, but the idea of reducing the number of scans required on the inner side by buffering outer side tuples (say based on the "k=:val" constant) seems like it might generalize well enough. I suggest Floris look into that possibility. This paper might be worth a read:

https://dl.acm.org/citation.cfm?id=582278

(Though it also might not be worth a read -- I haven't actually read it myself.)

--
Peter Geoghegan
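For readers unfamiliar with the technique being suggested, the core idea of a block nested loop join can be sketched as follows (a toy Python illustration of the textbook algorithm, not anything from the thread's patch): buffer a block of outer-side tuples and let a single pass over the inner side serve the whole block, rather than one inner scan per outer tuple.

```python
# Textbook block nested loop join: the inner side is scanned once per
# *block* of outer tuples instead of once per outer tuple.

def block_nested_loop_join(outer, inner, join_key, block_size=2):
    results = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]   # buffered outer tuples
        keys = {join_key(o) for o in block}
        # One pass over the inner side serves every tuple in the block.
        for i in inner:
            if join_key(i) in keys:
                for o in block:
                    if join_key(o) == join_key(i):
                        results.append((o, i))
    return results

outer = [(1, 'a'), (2, 'b'), (1, 'c')]
inner = [(1, 'x'), (3, 'y')]
pairs = block_nested_loop_join(outer, inner, lambda t: t[0])
```

With block_size=2, the three outer tuples cost two inner scans rather than three; larger blocks reduce the inner-scan count further at the price of buffering memory.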
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Fri, Aug 2, 2019 at 18:47 Tom Lane wrote:
> Stephen Frost writes:
> > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote:
> > > > The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable.
> > >
> > > +1
> >
> > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
>
> I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?

What if you want to access PG and your tool isn’t written in C? You use a module, extension, package, whatever, that provides the glue between what your language wants and what the C library provides. There’s psycopg2 for python, DBD::Pg for Perl, et al, and they use libpq. There’s languages that like to write their own too, like the JDBC driver, the Golang driver, but that doesn’t mean we shouldn’t provide libpq or that non-C tools can’t leverage libpq. This argument is just not sensible.

I agree entirely that we want to be able to modify auto.conf without having PG running (and without using single mode, bleh, that’s horrid..). I think we can accept that we can’t necessarily *validate* that every option is acceptable, but that’s not the same as being able to simply parse the file and modify a value.

> I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile.

Uh, if you don’t ever run ALTER SYSTEM then you could also “run out of disk space” due to external tools modifying by just adding to the file.

Personally, I don’t buy the “run out of disk space” argument but if we are going to go there then we should apply it appropriately.

Having the history of changes to auto.conf would actually be quite useful, imv, and worth a bit of disk space (heck, it’s not exactly uncommon for people to keep their config files in git repos..). I’d suggest we also include the date/time of when the modification was made.

Thanks,

Stephen
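The proposal under discussion (run through the .auto.conf file, remove any duplicates, and append the new entry at the end) can be sketched in a few lines. This is a deliberately minimal Python toy, not the backend's C implementation: real GUC parsing handles quoting, escaping, and validation, none of which is modeled here.

```python
# Minimal "last entry wins" sketch of the proposed ALTER SYSTEM behaviour:
# drop earlier occurrences of the setting (e.g. duplicates appended by
# external tools), then append the new value at the end of the file.

def set_auto_conf(lines, name, value):
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            kept.append(line)                 # keep blank/comment lines
            continue
        key = stripped.split('=', 1)[0].strip()
        if key != name:                       # remove duplicates of `name`
            kept.append(line)
    kept.append(f"{name} = '{value}'")        # new entry goes last
    return kept

conf = ["# Do not edit this file manually!",
        "work_mem = '4MB'",
        "work_mem = '8MB'",                   # appended later by an external tool
        "shared_buffers = '128MB'"]
new_conf = set_auto_conf(conf, 'work_mem', '16MB')
```

Because the parser takes the last occurrence of a setting as authoritative, external tools that only append remain correct, and ALTER SYSTEM's rewrite keeps the file from growing without bound.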
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings,

On Fri, Aug 2, 2019 at 19:36 Ian Barwick wrote:
> On 8/3/19 8:24 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2019-08-03 08:22:29 +0900, Ian Barwick wrote:
> > > What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.
> > >
> > > I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.
> >
> > I think my main problem with that is that it duplicates a nontrivial amount of code.
>
> That is indeed part of the ugliness of the implementation.

I agree that duplicate code isn’t good -- the goal would be to eliminate the duplication by having it be common code instead of duplicated. We have other code that’s common to the frontend and backend, and I don’t doubt that we will have more going forward...

Thanks,

Stephen
Re: Remove HeapTuple and Buffer dependency for predicate locking functions
On Wed, Jul 31, 2019 at 2:06 PM Andres Freund wrote:
> Looking at the code as of master, we currently have:

Super awesome feedback and insights, thank you!

> - PredicateLockTuple() calls SubTransGetTopmostTransaction() to figure out whether the tuple has been locked by the current transaction. That check afaict just should be TransactionIdIsCurrentTransactionId(), without all the other stuff that's done today.

Agree. The attached v1-0002 patch does that now. Please let me know if that's what you meant.

> TransactionIdIsCurrentTransactionId() imo ought to be optimized to always check for the top level transactionid first - that's a good bet today, but even more so for the upcoming AMs that won't have separate xids for subtransactions. Alternatively we shouldn't make that a binary search for each subtrans level, but just have a small simplehash hashtable for xids.

The v1-0001 patch checks GetTopTransactionIdIfAny() first in TransactionIdIsCurrentTransactionId(), which seems better in general and more so for the future. That mostly meets the needs of the current discussion. The alternative of not using binary search seems like a bigger refactoring and should be handled as a separate optimization exercise outside of this thread.

> - CheckForSerializableConflictOut() wants to get the toplevel xid for the tuple, because that's the one the predicate hashtable stores.
>
> In your patch you've already moved the HTSV() call etc out of CheckForSerializableConflictOut(). I'm somewhat inclined to think that the SubTransGetTopmostTransaction() call ought to go along with that. I don't really think that belongs in predicate.c, especially if most/all new AMs don't use subtransaction ids.
>
> The only downside is that currently the TransactionIdEquals(xid, GetTopTransactionIdIfAny()) check avoids the SubTransGetTopmostTransaction() check.
>
> But again, the better fix for that seems to be to improve the generic code. As written the check won't prevent a subtrans lookup for heap when subtransactions are in use, and it's IME pretty common for tuples to get looked at again in the transaction that has created them. So I'm somewhat inclined to think that SubTransGetTopmostTransaction() should have a fast-path for the current transaction - probably just employing TransactionIdIsCurrentTransactionId().

That optimization, as Kuntal also mentioned, seems like something which can be done on top of the current patch afterwards.

> I don't really see what we gain by having the subtrans handling in the predicate code. Especially given that we've already moved the HTSV() handling out, it seems architecturally the wrong place to me - but I admit that that's a fuzzy argument. The relevant mapping should be one line in the caller.

Okay, I moved the subtransaction handling out of CheckForSerializableConflictOut() and have it alongside HTSV() now.

The reason I felt like leaving the subtransaction handling in a generic place was that it might be premature to think no future AM will need it. Plus, all the serializable function APIs having the same expectations is easier: PredicateLockTuple() can be passed a top or subtransaction id and can handle it, but with the change CheckForSerializableConflictOut() can only be fed the top transaction ID. But it's fine, and I can see the point that an AM needing it can easily get the top transaction ID and feed it, as heap does.

> I wonder if it'd be worth combining the TransactionIdIsCurrentTransactionId() calls in the heap cases that currently do both, PredicateLockTuple() and HeapCheckForSerializableConflictOut(). The heap_fetch() case probably isn't commonly that hot a path, but heap_hot_search_buffer() is.

Maybe; will give thought to it separately from the current patch.

> Minor notes:
> - I don't think 'insert_xid' is necessarily great - it could also be the updating xid etc. And while you can argue that an update is an insert in the current heap, that's not the case for future AMs.
> - to me
> @@ -1621,7 +1622,8 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
>  			if (valid)
>  			{
>  				ItemPointerSetOffsetNumber(tid, offnum);
> -				PredicateLockTuple(relation, heapTuple, snapshot);
> +				PredicateLockTID(relation, &(heapTuple)->t_self, snapshot,
> +								 HeapTupleHeaderGetXmin(heapTuple->t_data));
>  				if (all_dead)
>  					*all_dead = false;
>  				return true;
>
> What are those parens - as placed they can't do anything. Did you intend to write &(heapTuple->t_self)? Even that is pretty superfluous, but it at least clarifies the precedence.

Fixed. No idea what I was thinking there; I mostly feel I intended to have it as &(heapTuple->t_self).

I'm also a bit co
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 8:24 AM, Andres Freund wrote:
> Hi,
>
> On 2019-08-03 08:22:29 +0900, Ian Barwick wrote:
> > What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.
> >
> > I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.
>
> I think my main problem with that is that it duplicates a nontrivial amount of code.

That is indeed part of the ugliness of the implementation.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi,

On 2019-08-03 08:22:29 +0900, Ian Barwick wrote:
> What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.
>
> I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.

I think my main problem with that is that it duplicates a nontrivial amount of code.

Greetings,

Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 7:56 AM, Andres Freund wrote:
> Hi,
>
> On 2019-08-02 18:47:07 -0400, Tom Lane wrote:
> > Stephen Frost writes:
> > > I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach.
> >
> > I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C?
>
> I think a commandline tool to perform the equivalent of ALTER SYSTEM on a shutdown cluster would be a great tool. It's easy enough to add something with broken syntax, and further down the road such a tool could not only ensure the syntax is correct, but also validate individual settings as much as possible (obviously there's some hairy issues here).

What I came up with shoehorned a stripped-down version of the backend config parser into fe_utils and provides a function to modify pg.auto.conf in much the same way ALTER SYSTEM does, but with only the basic syntax checking provided by the parser of course. And for completeness a client utility which can be called by scripts etc.

I can clean it up and submit it later for reference (got distracted by other things recently) though I don't think it's a particularly good solution due to the lack of actual checks for the provided GUCs (and the implementation is ugly anyway); something like what Andres suggests below would be far better.

> Quite possibly the most realistic way to implement something like that would be a postgres commandline switch, which'd start up far enough to perform GUC checks and execute AlterSystem(), and then shut down again. We already have -C, I think such an option could reasonably be implemented alongside it. Obviously this is widely out of scope for v12.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Hi, On 2019-08-02 10:54:35 +0200, Julien Rouhaud wrote: > On Thu, Aug 1, 2019 at 11:05 PM Andres Freund wrote: > > > > I'm actually quite unconvinced that it's sensible to update the global > > value for nested queries. That'll mean e.g. the log_line_prefix and > > pg_stat_activity values are most of the time going to be bogus while > > nested, because the querystring that's associated with those will *not* > > be the value that the queryid corresponds to. elog.c uses > > debug_query_string to log the statement, which is only updated for > > top-level queries (outside of some exceptions like parallel workers for > > parallel queries in a function or stuff like that). And pg_stat_activity > > is also only updated for top level queries. > > Having the nested queryid seems indeed quite broken for > log_line_prefix. However having the nested queryid in > pg_stat_activity would be convenient to track what is a long stored > functions currently doing. Maybe we could expose something like > top_level_queryid and current_queryid instead? Given that the query string is the toplevel one, I think that'd just be confusing. And given the fact that it adds *substantial* additional complexity, I'd just rip the subcommand bits out. Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 8:09 AM, Tom Lane wrote: I wrote: Andres Freund writes: I think a commandline tool to perform the equivalent of ALTER SYSTEM on a shutdown cluster would be a great tool. Perhaps, but ... Obviously this is widely out of scope for v12. ... this. Although, there's always echo "alter system set work_mem = 4242;" | postgres --single Maybe we could recommend that to tools that need to do potentially-repetitive modifications? The slight problem with that, particularly with the use case I am concerned with (writing replication configuration), is: [2019-08-03 08:14:21 JST] FATAL: 0A000: standby mode is not supported by single-user servers (I may be missing something obvious of course) Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
I wrote: > Andres Freund writes: >> I think a commandline tool to perform the equivalent of ALTER SYSTEM on >> a shutdown cluster would be a great tool. > Perhaps, but ... >> Obviously this is widely out of scope for v12. > ... this. Although, there's always echo "alter system set work_mem = 4242;" | postgres --single Maybe we could recommend that to tools that need to do potentially-repetitive modifications? regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Andres Freund writes: > On 2019-08-02 18:47:07 -0400, Tom Lane wrote: >> I don't find that to be necessary or even desirable. Many (most?) of the >> situations where this would be important wouldn't have access to a running >> backend, and maybe not to any PG code at all --- what if your tool isn't >> written in C? > I think a commandline tool to perform the equivalent of ALTER SYSTEM on > a shutdown cluster would be a great tool. Perhaps, but ... > Obviously this is widely out of scope for v12. ... this. It's entirely insane to think we're going to produce any such thing for v12 (much less back-patch it into prior versions). In the short term I don't think there's any workable alternative except to decree that "just append to the end" is a supported way to alter pg.auto.conf. But, as you said, it's also not sane for ALTER SYSTEM to behave that way, because it won't cope for long with repetitive modifications. I think we can get away with the "just append" recommendation for most external drivers because they won't be doing that. If they are, they'll need to be smarter, and maybe some command-line tool would make their lives simpler down the line. But we aren't providing that in this cycle. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On Fri, Aug 02, 2019 at 06:38:46PM -0400, Stephen Frost wrote: Greetings, On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: Tomas Vondra writes: > There seems to be a consensus that this this not a pg_basebackup issue > (i.e. duplicate values don't make the file invalid), and it should be > handled in ALTER SYSTEM. Yeah. I doubt pg_basebackup is the only actor that can create such situations. > The proposal seems to be to run through the .auto.conf file, remove any > duplicates, and append the new entry at the end. That seems reasonable. +1 I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach. The idea that alter system should be the only thing that doesn’t just append changes to the file is just going to lead to confusion and bugs down the road. I don't remember any suggestions ALTER SYSTEM should be the only thing that can rewrite the config file, but maybe it's buried somewhere in the thread history. The current proposal certainly does not prohibit any external tool from doing so, it just says we should expect duplicates. As I said before, an alternative could be to make alter system simply always append and declare that to be the way to update parameters in the auto.conf. That just seems strange, TBH. There was a discussion whether to print warnings about the duplicates. I > personally see not much point in doing that - if we consider duplicates > to be expected, and if ALTER SYSTEM has the license to rework the config > file any way it wants, why warn about it? Personally I agree that warnings are unnecessary. And at least Magnus and I disagree with that, as I recall from this thread. Let’s have a clean and clear way to modify the auto.conf and have everything that touches the file update it in a consistent way. 
Well, I personally don't feel very strongly about it. I think the warnings will be a nuisance, bothering people about expected stuff, but I'm not willing to fight against it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-02 18:47:07 -0400, Tom Lane wrote: > Stephen Frost writes: > > I disagree that this should only be addressed in alter system, as I’ve said > > before and as others have agreed with. Having one set of code that can be > > used to update parameters in the auto.conf and then have that be used by > > pg_basebackup, alter system, and external tools, is the right approach. > > I don't find that to be necessary or even desirable. Many (most?) of the > situations where this would be important wouldn't have access to a running > backend, and maybe not to any PG code at all --- what if your tool isn't > written in C? I think a commandline tool to perform the equivalent of ALTER SYSTEM on a shutdown cluster would be a great tool. It's easy enough to add something with broken syntax, and further down the road such a tool could not only ensure the syntax is correct, but also validate individual settings as much as possible (obviously there's some hairy issues here). Quite possibly the most realistic way to implement something like that would be a postgres commandline switch, which'd start up far enough to perform GUC checks and execute AlterSystem(), and then shut down again. We already have -C, I think such an option could reasonably be implemented alongside it. Obviously this is widely out of scope for v12. Greetings, Andres Freund
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
Peter Geoghegan writes: > How about the attached? I've simply removed the "if ($oid > $prev_oid > + 2)" test. Better ... but I'm the world's second worst Perl programmer, so I have little to say about whether it's idiomatic. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, On 2019-08-02 18:38:46 -0400, Stephen Frost wrote: > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: > > > Tomas Vondra writes: > > > There seems to be a consensus that this this not a pg_basebackup issue > > > (i.e. duplicate values don't make the file invalid), and it should be > > > handled in ALTER SYSTEM. > > > > Yeah. I doubt pg_basebackup is the only actor that can create such > > situations. > > > > > The proposal seems to be to run through the .auto.conf file, remove any > > > duplicates, and append the new entry at the end. That seems reasonable. > > > > +1 > I disagree that this should only be addressed in alter system, as I’ve said > before and as others have agreed with. Having one set of code that can be > used to update parameters in the auto.conf and then have that be used by > pg_basebackup, alter system, and external tools, is the right approach. > > The idea that alter system should be the only thing that doesn’t just > append changes to the file is just going to lead to confusion and bugs down > the road. To me that seems like an alternative that needs a good chunk more work than just having ALTER SYSTEM fix things up, and isn't actually likely to prevent such scenarios from occurring in practice. Providing a decent API to change config files from various places, presumably including a commandline utility to do so, would be a nice feature, but it seems vastly out of scope for v12. My vote is to fix this via ALTER SYSTEM in v12, and then for whoever is interested enough to provide better tools down the road. > As I said before, an alternative could be to make alter system simply > always append and declare that to be the way to update parameters in the > auto.conf. Why would that be a good idea? We'd just take longer and longer to parse it. There are people that change database settings on a regular and automated basis using ALTER SYSTEM. > > There was a discussion whether to print warnings about the duplicates. 
I > > > personally see not much point in doing that - if we consider duplicates > > > to be expected, and if ALTER SYSTEM has the license to rework the config > > > file any way it wants, why warn about it? > > > > Personally I agree that warnings are unnecessary. +1 Greetings, Andres Freund
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Stephen Frost writes: > On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: >>> The proposal seems to be to run through the .auto.conf file, remove any >>> duplicates, and append the new entry at the end. That seems reasonable. >> +1 > I disagree that this should only be addressed in alter system, as I’ve said > before and as others have agreed with. Having one set of code that can be > used to update parameters in the auto.conf and then have that be used by > pg_basebackup, alter system, and external tools, is the right approach. I don't find that to be necessary or even desirable. Many (most?) of the situations where this would be important wouldn't have access to a running backend, and maybe not to any PG code at all --- what if your tool isn't written in C? I think it's perfectly fine to say that external tools need only append to the file, which will require no special tooling. But then we need ALTER SYSTEM to be willing to clean out duplicates, if only so you don't run out of disk space after awhile. regards, tom lane
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 3:19 PM Tom Lane wrote: > The "if ($oid > $prev_oid + 2)" test seems unnecessary. > It's certainly wrong to keep iterating beyond the first > oid that's > $suggestion. Sorry. That was just carelessness on my part. (Being the world's worst Perl programmer is no excuse.) How about the attached? I've simply removed the "if ($oid > $prev_oid + 2)" test. -- Peter Geoghegan v4-0001-unused_oids-suggestion.patch Description: Binary data
Re: tableam vs. TOAST
Hi, On 2019-08-01 12:23:42 -0400, Robert Haas wrote: > Roughly, on both master and with the patches, the first one takes > about 4.2 seconds, the second 7.5, and the third 1.2. The third one > is the fastest because it doesn't do any compression. Since it does > less irrelevant work than the other two cases, it has the best chance > of showing up any performance regression that the patch has caused -- > if any regression existed, I suppose that it would be an increased > per-toast-fetch or per-toast-chunk overhead. However, I can't > reproduce any such regression. My first attempt at testing that case > showed the patch about 1% slower, but that wasn't reliably > reproducible when I did it a bunch more times. So as far as I can > figure, this patch does not regress performance in any > easily-measurable way. Hm, those all include writing, right? And for read-only we don't expect any additional overhead, correct? The write overhead is probably too large to show a bit of function call overhead - but if so, it'd probably be on unlogged tables? And with COPY, because that utilizes multi_insert, which means more toasting in a shorter amount of time? .oO(why does everyone attach attachments out of order? Is that a gmail thing?) > From a4c858c75793f0f8aff7914c572a6615ea5babf8 Mon Sep 17 00:00:00 2001 > From: Robert Haas > Date: Mon, 8 Jul 2019 11:58:05 -0400 > Subject: [PATCH 1/4] Split tuptoaster.c into three separate files. > > detoast.c/h contain functions required to detoast a datum, partially > or completely, plus a few other utility functions for examining the > size of toasted datums. > > toast_internals.c/h contain functions that are used internally to the > TOAST subsystem but which (mostly) do not need to be accessed from > outside. > > heaptoast.c/h contains code that is intrinsically specific to the > heap AM, either because it operates on HeapTuples or is based on the > layout of a heap page. 
> > detoast.c and toast_internals.c are placed in > src/backend/access/common rather than src/backend/access/heap. At > present, both files still have dependencies on the heap, but that will > be improved in a future commit. I wonder if toasting.c should be moved too? If anybody doesn't know git's --color-moved, it makes patches like this a lot easier to review. > index 000..582af147ea1 > --- /dev/null > +++ b/src/include/access/detoast.h > @@ -0,0 +1,92 @@ > +/*- > + * > + * detoast.h > + *Access to compressed and external varlena values. > > Hm. Does it matter that that also includes stuff like expanded datums? > > + * Copyright (c) 2000-2019, PostgreSQL Global Development Group > + * > + * src/include/access/detoast.h > + * > + *- > + */ > +#ifndef DETOAST_H > +#define DETOAST_H trailing whitespace after "#ifndef DETOAST_H ". > commit 60d51e6510c66f79c51e43fe22730fe050d87854 > Author: Robert Haas > Date: 2019-07-08 12:02:16 -0400 > > Create an API for inserting and deleting rows in TOAST tables. > > This moves much of the non-heap-specific logic from toast_delete and > toast_insert_or_update into a helper functions accessible via a new > header, toast_helper.h. Using the functions in this module, a table > AM can implement creation and deletion of TOAST table rows with > much less code duplication than was possible heretofore. Some > table AMs won't want to use the TOAST logic at all, but for those > that do this will make that easier. > > Discussion: > http://postgr.es/m/CA+TgmoZv-=2iwm4jcw5zhjel18hf96+w1yjeyrngmydkffn...@mail.gmail.com Hm. This leaves toast_insert_or_update() as a name. That makes it sound like it's generic toast code, rather than heap specific? It's definitely nice how a lot of repetitive code has been deduplicated. Also makes it easier to see how algorithmically expensive toast_insert_or_update() is :(. > /* >* Second we look for attributes of attstorage 'x' or 'e' that are still >* inline, and make them external. 
But skip this if there's no toast >* table to push them to. >*/ > while (heap_compute_data_size(tupleDesc, > toast_values, > toast_isnull) > maxDataLen && > rel->rd_rel->reltoastrelid != InvalidOid) Shouldn't this condition be the other way round? > if (for_compression) > skip_colflags |= TOASTCOL_INCOMPRESSIBLE; > > for (i = 0; i < numAttrs; i++) > { > Form_pg_attribute att = TupleDescAttr(tupleDesc, i); > > if ((ttc->ttc_attr[i].tai_colflags & skip_colflags) != 0) > continue; > if (VARATT_IS_EXTERNAL(DatumGetPointer(ttc->ttc_values[i]))) > continue; /* can't happen, > toast_action would be 'p' */
Re: jsonb_plperl bug
Ivan Panchenko writes: > I have found a bug in jsonb_plperl extension. A possible fix is proposed > below. > ... > + /* SVt_PV without POK flag is also NULL */ > + if(SvTYPE(in) == SVt_PV) Ugh. Doesn't Perl provide some saner way to determine the type of a SV? The core code seems to think that SvOK() is a sufficient test for an undef. Should we be doing that before the switch, perhaps? (My underlying concern here is mostly about whether we have other similar bugs. There are a lot of places checking SvTYPE.) regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
On 8/3/19 7:27 AM, Tom Lane wrote: Tomas Vondra writes: There seems to be a consensus that this this not a pg_basebackup issue (i.e. duplicate values don't make the file invalid), and it should be handled in ALTER SYSTEM. Yeah. I doubt pg_basebackup is the only actor that can create such situations. The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable. +1 There was a discussion whether to print warnings about the duplicates. I personally see not much point in doing that - if we consider duplicates to be expected, and if ALTER SYSTEM has the license to rework the config file any way it wants, why warn about it? Personally I agree that warnings are unnecessary. Having played around with the pg.auto.conf stuff for a while, my feeling is that ALTER SYSTEM does indeed have a license to rewrite it (which is what currently happens anyway, with comments and include directives [1] being silently removed), so it seems reasonable to remove duplicate entries and ensure the correct one is processed. [1] surprisingly, any include directives present are honoured, which seems crazy to me, see: https://www.postgresql.org/message-id/flat/8c8bcbca-3bd9-dc6e-8986-04a5abdef142%402ndquadrant.com The main issue however is that no code was written yet. Seems like it ought to be relatively simple ... but I didn't look. The patch I originally sent does exactly this. The thread then drifted off into a discussion about providing ways for applications to properly write to pg.auto.conf while PostgreSQL is not running; I have a patch for that which I can submit later (though it is a thing of considerable ugliness). Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Greetings, On Fri, Aug 2, 2019 at 18:27 Tom Lane wrote: > Tomas Vondra writes: > > There seems to be a consensus that this this not a pg_basebackup issue > > (i.e. duplicate values don't make the file invalid), and it should be > > handled in ALTER SYSTEM. > > Yeah. I doubt pg_basebackup is the only actor that can create such > situations. > > > The proposal seems to be to run through the .auto.conf file, remove any > > duplicates, and append the new entry at the end. That seems reasonable. > > +1 I disagree that this should only be addressed in alter system, as I’ve said before and as others have agreed with. Having one set of code that can be used to update parameters in the auto.conf and then have that be used by pg_basebackup, alter system, and external tools, is the right approach. The idea that alter system should be the only thing that doesn’t just append changes to the file is just going to lead to confusion and bugs down the road. As I said before, an alternative could be to make alter system simply always append and declare that to be the way to update parameters in the auto.conf. > There was a discussion whether to print warnings about the duplicates. I > > personally see not much point in doing that - if we consider duplicates > > to be expected, and if ALTER SYSTEM has the license to rework the config > > file any way it wants, why warn about it? > > Personally I agree that warnings are unnecessary. And at least Magnus and I disagree with that, as I recall from this thread. Let’s have a clean and clear way to modify the auto.conf and have everything that touches the file update it in a consistent way. Thanks, Stephen
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Tomas Vondra writes: > There seems to be a consensus that this this not a pg_basebackup issue > (i.e. duplicate values don't make the file invalid), and it should be > handled in ALTER SYSTEM. Yeah. I doubt pg_basebackup is the only actor that can create such situations. > The proposal seems to be to run through the .auto.conf file, remove any > duplicates, and append the new entry at the end. That seems reasonable. +1 > There was a discussion whether to print warnings about the duplicates. I > personally see not much point in doing that - if we consider duplicates > to be expected, and if ALTER SYSTEM has the license to rework the config > file any way it wants, why warn about it? Personally I agree that warnings are unnecessary. > The main issue however is that no code was written yet. Seems like it ought to be relatively simple ... but I didn't look. regards, tom lane
Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Hi, This thread discusses an issue that's tracked as an open item for pg12, but it's been quiet for the last ~1 month. I think it's probably time to decide what to do with it. The thread is a bit long, so let me sum up what the issue is and what options we have. The problem is that ALTER SYSTEM does not handle duplicate entries in the postgresql.auto.conf file correctly: it simply modifies the first item, but the value is then overridden by the duplicate items. This contradicts the idea that duplicate GUCs are allowed and that we should use the last item. This bug seems to have existed since ALTER SYSTEM was introduced, so it's not a clear PG12 item. But it was made more prominent by the removal of recovery.conf in PG12, because pg_basebackup now appends stuff to postgresql.auto.conf and may easily create duplicate items. There seems to be a consensus that this is not a pg_basebackup issue (i.e. duplicate values don't make the file invalid), and it should be handled in ALTER SYSTEM. The proposal seems to be to run through the .auto.conf file, remove any duplicates, and append the new entry at the end. That seems reasonable. There was a discussion whether to print warnings about the duplicates. I personally see not much point in doing that - if we consider duplicates to be expected, and if ALTER SYSTEM has the license to rework the config file any way it wants, why warn about it? The main issue however is that no code was written yet. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
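[Editor's sketch] The proposed behaviour — run through postgresql.auto.conf, drop duplicate entries, and append the new setting last so that "last entry wins" semantics hold — can be sketched as follows. This is an illustrative Python sketch, not the actual C implementation in guc.c, and the simplistic `name = value` parsing is an assumption:

```python
def rewrite_auto_conf(lines, name, value):
    """Deduplicate existing entries (keeping the last occurrence of each,
    matching GUC last-wins semantics), then append the new setting at the
    end so it takes effect."""
    settings = {}  # dict preserves insertion order; re-insert moves to end
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # drop comments, much as ALTER SYSTEM already does
        key, _, val = line.partition("=")
        key = key.strip()
        settings.pop(key, None)   # remove earlier duplicate, if any
        settings[key] = val.strip()
    settings.pop(name, None)
    settings[name] = f"'{value}'"
    return [f"{k} = {v}" for k, v in settings.items()]

lines = [
    "work_mem = '4MB'",
    "primary_conninfo = 'host=old'",
    "primary_conninfo = 'host=new'",  # duplicate appended by an external tool
]
print(rewrite_auto_conf(lines, "work_mem", "8MB"))
```

Note how the duplicate `primary_conninfo` collapses to its last value, so an external tool that merely appends to the file stays compatible with this rewrite.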
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
Peter Geoghegan writes: > Attached is v3, which implements your suggestion, generating output > like the above. I haven't written a line of Perl in my life prior to > today, so basic code review would be helpful. The "if ($oid > $prev_oid + 2)" test seems unnecessary. It's certainly wrong to keep iterating beyond the first oid that's > $suggestion. Perhaps you meant to go back and try a different suggestion if there's not at least 2 free OIDs? But then there needs to be an outer loop around both of these loops. regards, tom lane
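[Editor's sketch] The intended behaviour — take a random suggestion in the 8000-8999 range, then return the first run of consecutive free OIDs at or after it, without iterating past that first run — can be sketched like this. This is a Python illustration of the algorithm under discussion, not the actual Perl in the unused_oids script:

```python
import random

def suggest_oid_block(used, n_needed, lo=8000, hi=8999, rng=random):
    """Pick a random suggestion in [lo, hi], then scan forward for the
    first run of n_needed consecutive unused OIDs starting at or after it.
    Stop at the first suitable run -- iterating beyond it would defeat
    the point of the random starting position."""
    suggestion = rng.randint(lo, hi)
    oid = suggestion
    while oid + n_needed - 1 <= hi:
        run = [oid + i for i in range(n_needed)]
        if not any(o in used for o in run):
            return run
        oid += 1
    return None  # no room above the suggestion; a caller could retry

used = {8000, 8001, 8005}
print(suggest_oid_block(used, 2))
```

For simplicity this sketch does not wrap around below the suggestion when the top of the range is full; a retry loop (or Tom's "outer loop" suggestion) would handle that case.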
Re: Optimize single tuple fetch from nbtree index
Hi Tom, Thanks for your quick reply! > Regardless of whether there's actually a LIMIT 1? That seems disastrous > for every other case than the narrow one where the optimization wins. > Because every other case is now going to have to touch the index page > twice. That's more CPU and about double the contention --- if you could > not measure any degradation from that, you're not measuring the right > thing. I thought the same as well at first. Note that I did measure degradation of 2-3% as mentioned on some cases, but initially I also expected worse. Do you have any ideas on cases that would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as bottleneck. I know they are the bottleneck for the LIMIT 1 query (because these improve by a factor 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them. > In principle, you could pass down knowledge of whether there's a LIMIT, > using the same mechanism used to enable top-N sorting. But it'd have to > also go through the AM interface layer, so I'm not sure how messy this > would be. This was an idea I had as well and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected already. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest for the basic idea of this patch, I will implement it. >> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, >> descend the tree to a leaf page and read just the first (or possibly two) >> tuples. It won't touch the rest of the page yet. 
If indeed just one tuple >> was required, there won't be a call to _bt_next and we're done. If we do >> need more than one tuple, _bt_next will resume reading tuples from the index >> page at the point where we left off. > How do you know how many index entries you have to fetch to get a tuple that's live/visible to the query? Indeed we don't know that - that's why this initial patch does not make any assumptions about this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimation of this and whether or not that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next again which will resume reading the page as normal. I'm open to implement any suggestions that may improve this. >> - We need to take into account page splits, insertions and vacuums while we >> do not have the read-lock in between _bt_first and the first call to >> _bt_next. This made my patch quite a bit more complicated than my initial >> implementation. > Meh. I think the odds that you got this 100% right are small, and the > odds that it would be maintainable are smaller. There's too much that > can happen if you're not holding any lock --- and there's a lot of active > work on btree indexes, which could break whatever assumptions you might > make today. Agreed, which is also why I posted this initial version of the patch here already, to get some input from the experts on this topic what assumptions can be made now and in the future. If it turns out that it's completely not feasible to do an optimization like this, because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth: the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which runs the modified bits of code quite often and in parallel. 
There may be plenty of edge cases left however... > I'm not unalterably opposed to doing something like this, but my sense > is that the complexity and probable negative performance impact on other > cases are not going to look like a good trade-off for optimizing the > case at hand. I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to make it possible to add some extra optimizations, especially if that could lead to 2-3x the TPS for these kind of queries. However, it indeed needs to be within a reasonable complexity. If it turns out that in order for us to optimize this, we need to add a lot of extra complexity, it may not be worth it to add it. > BTW, you haven't even really made the case that optimizing a query that > behaves this way is the right thing to be doing ... maybe some other > plan shape that isn't a nestloop around a LIMIT query would be a better > solution. It is prett
jsonb_plperl bug
Hi, I have found a bug in the jsonb_plperl extension. A possible fix is proposed below. jsonb_plperl is the contrib module which defines TRANSFORM functions for the jsonb data type and the PL/Perl procedural language. The bug can be reproduced as follows: CREATE EXTENSION plperl; CREATE EXTENSION jsonb_plperl; CREATE OR REPLACE FUNCTION text2jsonb (text) RETURNS jsonb LANGUAGE plperl TRANSFORM FOR TYPE jsonb AS $$ my $x = shift; my $ret = {a=>$x}; return $ret; $$; SELECT text2jsonb(NULL); SELECT text2jsonb('11'); SELECT text2jsonb(NULL); The last SELECT produces a strange error: ERROR: cannot transform this Perl type to jsonb Brief research has shown that the problem is incomplete logic inside the transform function. The reason can be illustrated by the following Perl one-liner: perl -MDevel::Peek -e 'sub x { my $x = shift; Dump $x; warn "\n\n"; }; x(undef); x("a"); x(undef); ' It outputs: SV = NULL(0x0) at 0x73a1b8 REFCNT = 1 FLAGS = (PADMY) SV = PV(0x71da50) at 0x73a1b8 REFCNT = 1 FLAGS = (PADMY,POK,pPOK) PV = 0x7409a0 "a"\0 CUR = 1 LEN = 16 SV = PV(0x71da50) at 0x73a1b8 REFCNT = 1 FLAGS = (PADMY) PV = 0x7409a0 "a"\0 CUR = 1 LEN = 16 This shows that the internal representation of the same undef in Perl differs between the first and third function calls: Perl reuses the lexical variable, probably for optimization reasons. The current jsonb_plperl implementation works well for the first (most evident) case, but does not work at all for the third, which results in the abovementioned error. The attached patch solves this issue and defines corresponding tests. 
Regards, Ivan diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out index 6dc090a..b784ca1 100644 --- a/contrib/jsonb_plperl/expected/jsonb_plperl.out +++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out @@ -228,6 +228,31 @@ SELECT roundtrip('{"1": {"2": [3, 4, 5]}, "2": 3}', 'HASH'); {"1": {"2": [3, 4, 5]}, "2": 3} (1 row) +CREATE FUNCTION text2jsonb (text) RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +my $x = shift; +return {a=>$x}; +$$; +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + +SELECT text2jsonb('11'); + text2jsonb +- + {"a": "11"} +(1 row) + +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + \set VERBOSITY terse \\ -- suppress cascade details DROP EXTENSION plperl CASCADE; -NOTICE: drop cascades to 7 other objects +NOTICE: drop cascades to 8 other objects diff --git a/contrib/jsonb_plperl/expected/jsonb_plperlu.out b/contrib/jsonb_plperl/expected/jsonb_plperlu.out index 434327b..7fe0721 100644 --- a/contrib/jsonb_plperl/expected/jsonb_plperlu.out +++ b/contrib/jsonb_plperl/expected/jsonb_plperlu.out @@ -255,6 +255,31 @@ INFO: $VAR1 = {'1' => {'2' => ['3','4','5']},'2' => '3'}; {"1": {"2": [3, 4, 5]}, "2": 3} (1 row) +CREATE FUNCTION text2jsonb (text) RETURNS jsonb +LANGUAGE plperlu +TRANSFORM FOR TYPE jsonb +AS $$ +my $x = shift; +return {a=>$x}; +$$; +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + +SELECT text2jsonb('11'); + text2jsonb +- + {"a": "11"} +(1 row) + +SELECT text2jsonb(NULL); + text2jsonb +- + {"a": null} +(1 row) + \set VERBOSITY terse \\ -- suppress cascade details DROP EXTENSION plperlu CASCADE; -NOTICE: drop cascades to 7 other objects +NOTICE: drop cascades to 8 other objects diff --git a/contrib/jsonb_plperl/jsonb_plperl.c b/contrib/jsonb_plperl/jsonb_plperl.c index 79c5f57..5244ab7 100644 --- a/contrib/jsonb_plperl/jsonb_plperl.c +++ b/contrib/jsonb_plperl/jsonb_plperl.c @@ -257,6 +257,12 @@ 
SV_to_JsonbValue(SV *in, JsonbParseState **jsonb_state, bool is_elem) } else { +/* SVt_PV without POK flag is also NULL */ +if(SvTYPE(in) == SVt_PV) +{ + out.type = jbvNull; + break; +} /* * XXX It might be nice if we could include the Perl type in * the error message. diff --git a/contrib/jsonb_plperl/sql/jsonb_plperl.sql b/contrib/jsonb_plperl/sql/jsonb_plperl.sql index 8b062df..622141b 100644 --- a/contrib/jsonb_plperl/sql/jsonb_plperl.sql +++ b/contrib/jsonb_plperl/sql/jsonb_plperl.sql @@ -99,6 +99,17 @@ SELECT roundtrip('{"1": "string1"}', 'HASH'); SELECT roundtrip('{"1": {"2": [3, 4, 5]}, "2": 3}', 'HASH'); +CREATE FUNCTION text2jsonb (text) RETURNS jsonb +LANGUAGE plperl +TRANSFORM FOR TYPE jsonb +AS $$ +my $x = shift; +return {a=>$x}; +$$; + +SELECT text2jsonb(NULL); +SELECT text2jsonb('11'); +SELECT text2jsonb(NULL); \set VERBOSITY terse \\ -- suppress cascade details DROP EXTENSION plperl CASCADE; diff --git a/contrib/jsonb_plperl/sql/jsonb_plperlu.sql b/contrib/jsonb_plperl/sql/jsonb_plperlu.sql index 8d8e841..9981c37 100644 --- a/contrib/jsonb_plperl/sql/jsonb_plperlu.sql +++ b/contrib/jsonb_plperl/sql
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 2:52 PM Isaac Morland wrote: > Noob question here: why not start with the next unused OID in the range, and > on the other hand reserve the range for sequentially-assigned values? The general idea is to avoid OID collisions while a patch is under development. Choosing a value that aligns nicely with already-allocated OIDs makes these collisions much more likely, which commit a6417078 addressed back in March. We want a random choice among patches, but OIDs used within a patch should be consecutive. (There is still some chance of a collision, but you have to be fairly unlucky to have that happen under the system introduced by commit a6417078.) It's probably the case that most patches that create a new pg_proc entry only create one. The question of consecutive OIDs only comes up with a fairly small number of patches. -- Peter Geoghegan
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, 2 Aug 2019 at 16:49, Tom Lane wrote: > Peter Geoghegan writes: > > I've taken your patch, and changed the wording a bit. I think that > > it's worth being a bit more explicit. The attached revision produces > > output that looks like this: > > > Patches should use a more-or-less consecutive range of OIDs. > > Best practice is to make a random choice in the range 8000-. > > Suggested random unused OID: 9099 > Noob question here: why not start with the next unused OID in the range, and on the other hand reserve the range for sequentially-assigned values?
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 1:49 PM Tom Lane wrote: > Maybe s/make a/start with/ ? > Also, once people start doing this, it'd be unfriendly to suggest > 9099 if 9100 is already committed. There should be some attention > to *how many* consecutive free OIDs will be available if one starts > at the suggestion. How about this wording?: Patches should use a more-or-less consecutive range of OIDs. Best practice is to start with a random choice in the range 8000-. Suggested random unused OID: 9591 (409 consecutive OID(s) available starting here) Attached is v3, which implements your suggestion, generating output like the above. I haven't written a line of Perl in my life prior to today, so basic code review would be helpful. -- Peter Geoghegan v3-0001-unused_oids-suggestion.patch Description: Binary data
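The selection rule the v3 patch describes can be sketched in a few lines of Python (suggest_oid is an invented name, and the 8000-9999 bounds are assumed from the suggested OIDs shown in this thread, not taken from the patch itself):

```python
import random

def suggest_oid(used, lo=8000, hi=9999, seed=None):
    """Pick a random unused OID in [lo, hi] and report how many
    consecutive OIDs are free starting from it."""
    rng = random.Random(seed)
    free = [oid for oid in range(lo, hi + 1) if oid not in used]
    start = rng.choice(free)
    # Count the run of consecutive free OIDs beginning at the suggestion.
    run = 0
    while start + run <= hi and (start + run) not in used:
        run += 1
    return start, run
```

Running it again with a different seed gives a different random starting point, matching the "just run it again for a different suggestion" behaviour discussed in this thread.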
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-08-01 18:38:09 +0900, Etsuro Fujita wrote: > On Thu, Aug 1, 2019 at 10:33 AM Amit Langote wrote: > > If it's the approach of adding a resultRelation Index field to > > ForeignScan node, I tried and had to give up, realizing that we don't > > maintain ResultRelInfos in an array that is indexable by RT indexes. > > It would've worked if es_result_relations had mirrored es_range_table, > > although that probably complicates how the individual ModifyTable > > nodes attach to that array. We know at plan time what the resultRelation offset for a ModifyTable node is. We just need to transport that to the respective foreign scan node, and update it properly in setrefs? Then we can index es_result_relations without any additional mapping? Maybe I'm missing something? I think all we need to do is to have setrefs.c:set_plan_refs() iterate over ->fdwDirectModifyPlans or such, and set the respective node's result_relation_offset or whatever we're naming it to splan->resultRelIndex + offset from fdwDirectModifyPlans? > > In any case, given this discussion, further hacking on a global > > variable like es_result_relations may be a course we might not want > > to pursue. I don't think es_result_relations really is a problem - it doesn't have to change while processing individual subplans / partitions / whatnot. If we needed a mapping between rtis and result indexes, I'd not see a problem. Doubtful it's needed though. There's a fundamental difference between EState->es_result_relations and EState->es_result_relation_info. The former stays static during the whole query once initialized, whereas es_result_relation_info changes depending on which relation we're processing. The latter is what makes the code more complicated, because we cannot ever return early etc. Similarly, ModifyTableState->mt_per_subplan_tupconv_maps is not a problem, it stays static, but e.g. 
mtstate->mt_transition_capture is a problem, because we have to change for each subplan / routing / partition movement. Greetings, Andres Freund
Re: [PROPOSAL] Temporal query processing with range types
Hi, I have rebased the patch and am currently reviewing it on master (1e2fddfa33d3c7cc93ca3ee0f32852699bd3e012). On Mon, Jul 1, 2019 at 4:45 PM Thomas Munro wrote: > On Wed, Apr 3, 2019 at 2:12 AM Ibrar Ahmed wrote: > > I started looking at the patch; there are a couple of problems with the > patch. The first one is the OID conflict, which I fixed on my machine. The > second problem is an assertion failure. I think you have not compiled the > PostgreSQL code with assertions enabled. > > Hi Peter, > > Looks like there was some good feedback for this WIP project last time > around. It's currently in "Needs Review" status in the July > Commitfest. To encourage more review and see some automated compile > and test results, could we please have a fresh rebase? The earlier > patches no longer apply. > > Thanks, > > -- > Thomas Munro > https://enterprisedb.com > -- Ibrar Ahmed 001_temporal_query_processing_with_range_types_v4.patch Description: Binary data
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
Peter Geoghegan writes: > I've taken your patch, and changed the wording a bit. I think that > it's worth being a bit more explicit. The attached revision produces > output that looks like this: > Patches should use a more-or-less consecutive range of OIDs. > Best practice is to make a random choice in the range 8000-. > Suggested random unused OID: 9099 Maybe s/make a/start with/ ? Also, once people start doing this, it'd be unfriendly to suggest 9099 if 9100 is already committed. There should be some attention to *how many* consecutive free OIDs will be available if one starts at the suggestion. You could perhaps print "9099 (42 OIDs available starting here)", and if the user doesn't like the amount of headroom in that, they could just run it again for a different suggestion. regards, tom lane
Re: Optimize single tuple fetch from nbtree index
Floris Van Nee writes: > Every time the index scan is done, all tuples from the leaf page are > read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an > exception for this *only* the first time amgettuple gets called. Regardless of whether there's actually a LIMIT 1? That seems disastrous for every other case than the narrow one where the optimization wins. Because every other case is now going to have to touch the index page twice. That's more CPU and about double the contention --- if you could not measure any degradation from that, you're not measuring the right thing. In principle, you could pass down knowledge of whether there's a LIMIT, using the same mechanism used to enable top-N sorting. But it'd have to also go through the AM interface layer, so I'm not sure how messy this would be. > This calls _bt_first in nbtsearch.c, which will, if there are scankeys, > descend the tree to a leaf page and read just the first (or possibly two) > tuples. It won't touch the rest of the page yet. If indeed just one tuple was > required, there won't be a call to _bt_next and we're done. If we do need > more than one tuple, _bt_next will resume reading tuples from the index page > at the point where we left off. How do you know how many index entries you have to fetch to get a tuple that's live/visible to the query? > - We need to take into account page splits, insertions and vacuums while we > do not have the read-lock in between _bt_first and the first call to > _bt_next. This made my patch quite a bit more complicated than my initial > implementation. Meh. I think the odds that you got this 100% right are small, and the odds that it would be maintainable are smaller. There's too much that can happen if you're not holding any lock --- and there's a lot of active work on btree indexes, which could break whatever assumptions you might make today. 
I'm not unalterably opposed to doing something like this, but my sense is that the complexity and probable negative performance impact on other cases are not going to look like a good trade-off for optimizing the case at hand. BTW, you haven't even really made the case that optimizing a query that behaves this way is the right thing to be doing ... maybe some other plan shape that isn't a nestloop around a LIMIT query would be a better solution. regards, tom lane
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-08-03 05:20:35 +0900, Etsuro Fujita wrote: > On Sat, Aug 3, 2019 at 3:01 AM Andres Freund wrote: > > On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > > > I looked into trying to do the things I mentioned above and it seems > > > to me that revising BeginDirectModify()'s API to receive the > > > ResultRelInfo directly as Andres suggested might be the best way > > > forward. I've implemented that in the attached 0001. > > > Fujita-san, do you have any comments on the FDW API change? Or anybody > > else? > > > > I'm a bit worried about the move of BeginDirectModify() into > > nodeModifyTable.c - it just seems like an odd control flow to me. Not > > allowing any intermediate nodes between ForeignScan and ModifyTable also > > seems like an undesirable restriction for the future. I realize that we > > already do that for BeginForeignModify() (just btw, that already accepts > > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify > > makes sense), but it still seems like the wrong direction to me. > > > > The need for that move, I assume, comes from needing to know the correct > > ResultRelInfo, correct? I wonder if we shouldn't instead determine that > > at plan time (in setrefs.c), somewhat similar to how we determine > > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? > > I'd vote for that; I created a patch for that [1]. > > [1] > https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com Oh, missed that. But that's not quite what I'm proposing. I don't like ExecFindResultRelInfo at all. What's the point of it? Its introduction is still an API break - I don't understand why that break is better than just passing the ResultRelInfo directly to BeginDirectModify()? I want to again remark that BeginForeignModify() does get the ResultRelInfo - it should have been done the same when adding direct modify. 
Even if you need the loop - which I don't think is right - it should live somewhere that individual FDWs don't have to care about. - Andres
Re: SQL:2011 PERIODS vs Postgres Ranges?
The patch does not work. postgres=# CREATE TABLE foo (id int,r int4range, valid_at tsrange, CONSTRAINT bar_pk PRIMARY KEY (r, valid_at WITHOUT OVERLAPS)); CREATE TABLE postgres=# CREATE TABLE bar (id int,r int4range, valid_at tsrange, CONSTRAINT bar_fk FOREIGN KEY (r, PERIOD valid_at) REFERENCES foo); ERROR: cache lookup failed for type 0 The new status of this patch is: Waiting on Author
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 20:12, Peter Geoghegan wrote: > On Fri, Aug 2, 2019 at 1:42 AM Julien Rouhaud wrote: > > Trivial patch for that attached. > > Thanks! > > > The output is now like: > > > > [...] > > Using an oid in the 8000- range is recommended. > > For instance: 9427 > > > > (checking that the suggested random oid is not used yet.) > > I've taken your patch, and changed the wording a bit. I think that > it's worth being a bit more explicit. The attached revision produces > output that looks like this: > > Patches should use a more-or-less consecutive range of OIDs. > Best practice is to make a random choice in the range 8000-. > Suggested random unused OID: 9099 > > I would like to push this patch shortly. How do people feel about this > wording? (It's based on the documentation added by commit a6417078.) > I'm fine with it! >
First draft of back-branch release notes is done
See https://git.postgresql.org/pg/commitdiff/082c9f5f761ced18a6f014f2638096f6a8228164 Please send comments/corrections before Sunday. regards, tom lane
Re: partition routing layering in nodeModifyTable.c
Hi, On Sat, Aug 3, 2019 at 3:01 AM Andres Freund wrote: > On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > > I looked into trying to do the things I mentioned above and it seems > > to me that revising BeginDirectModify()'s API to receive the > > ResultRelInfo directly as Andres suggested might be the best way > > forward. I've implemented that in the attached 0001. > Fujita-san, do you have any comments on the FDW API change? Or anybody > else? > > I'm a bit worried about the move of BeginDirectModify() into > nodeModifyTable.c - it just seems like an odd control flow to me. Not > allowing any intermediate nodes between ForeignScan and ModifyTable also > seems like an undesirable restriction for the future. I realize that we > already do that for BeginForeignModify() (just btw, that already accepts > resultRelInfo as a parameter, so being symmetrical for BeginDirectModify > makes sense), but it still seems like the wrong direction to me. > > The need for that move, I assume, comes from needing to know the correct > ResultRelInfo, correct? I wonder if we shouldn't instead determine that > at plan time (in setrefs.c), somewhat similar to how we determine > ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? I'd vote for that; I created a patch for that [1]. > Then we could just have BeginForeignModify, BeginDirectModify, > BeginForeignScan all be called from ExecInitForeignScan(). I think so too. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK15%3DoFHmWNND5vopfokSGfn6jMXVvnHa7K7P49F7k1hWPQ%40mail.gmail.com
Re: using index or check in ALTER TABLE SET NOT NULL
Tomas Vondra writes: > I think there's a consensus to change INFO to DEBUG1 in pg12, and then > maybe implement something like VERBOSE mode in the future. Objections? ISTM the consensus is "we'd rather reduce the verbosity, but we don't want to give up test coverage". So what's blocking this is lack of a patch to show that there's another way to verify what code path was taken. > As for the reduction of test coverage, can't we deduce whether a > constraint was used from data in pg_stats or something like that? Not sure how exactly ... and we've already learned that pg_stats isn't too reliable. regards, tom lane
Re: using index or check in ALTER TABLE SET NOT NULL
On Wed, Jun 12, 2019 at 08:34:57AM +1200, David Rowley wrote: On Tue, 11 Jun 2019 at 03:35, Sergei Kornilov wrote: > Does anyone think we shouldn't change the INFO message in ATTACH > PARTITION to a DEBUG1 in PG12? Seems no one wants to vote against this change. I'm concerned about two things: 1. The patch reduces the test coverage of ATTACH PARTITION. We now have no way to ensure the constraint was used to validate the rows in the partition. 2. We're inconsistent with what we do for SET NOT NULL and ATTACH PARTITION. We raise an INFO message when we use a constraint for ATTACH PARTITION and only a DEBUG1 for SET NOT NULL. Unfortunately, my two concerns conflict with each other. We're getting close to beta3/rc1, and this thread was idle for ~1 month. I think there's a consensus to change INFO to DEBUG1 in pg12, and then maybe implement something like VERBOSE mode in the future. Objections? As for the reduction of test coverage, can't we deduce whether a constraint was used from data in pg_stats or something like that? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote: Hi, On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote: Hmmm, I don't remember the details of those patches so I didn't realize it allows incremental recompression. If that's possible, that would mean existing systems can start using it. Which is good. That depends on what you mean by "incremental"? A single toasted datum can only have one compression type, because we only update them all in one anyway. But different datums can be compressed differently. I meant different toast values using different compression algorithms, sorry for the confusion. Another question is whether we'd actually want to include the code in core directly, or use system libraries (and if some packagers might decide to disable that, for whatever reason). I'd personally say we should have an included version, and a --with-system-... flag that uses the system one. OK. I'd say to require a system library, but that's a minor detail. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - implement strict TPC-B benchmark
Hi, On 2019-08-02 10:34:24 +0200, Fabien COELHO wrote: > > Hello Andres, > > Thanks a lot for these feedbacks and comments. > > > Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1 > > e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's > > a pretty significant share. > > Fine, but what is the corresponding server load? 211%? 611%? And what actual > time is spent in pgbench itself, vs libpq and syscalls? System wide pgbench, including libpq, is about 22% of the whole system. As far as I can tell there's a number of things that are wrong: - prepared statement names are recomputed for every query execution - variable name lookup is done for every command, rather than once, when parsing commands - a lot of string->int->string type back and forths > Conclusion: pgbench-specific overheads are typically (much) below 10% of the > total client-side cpu cost of a transaction, while over 90% of the cpu cost > is spent in libpq and system, for the worst case do-nothing query. I don't buy that that's the actual worst case, or even remotely close to it. I e.g. see higher pgbench overhead for the *modify* case than for the pgbench's readonly case. And that's because some of the meta commands are slow, in particular everything related to variables. And the modify case just has more variables. > > > + 12.35% pgbench pgbench[.] threadRun > > +3.54% pgbench pgbench[.] dopr.constprop.0 > > +3.30% pgbench pgbench[.] fmtint > > +1.93% pgbench pgbench[.] getVariable > > ~ 21%, probably some inlining has been performed, because I would have > expected to see significant time in "advanceConnectionState". Yea, there's plenty inlining. Note dopr() is string processing. > > +2.95% pgbench libpq.so.5.13 [.] PQsendQueryPrepared > > +2.15% pgbench libpq.so.5.13 [.] pqPutInt > > +4.47% pgbench libpq.so.5.13 [.] pqParseInput3 > > +1.66% pgbench libpq.so.5.13 [.] pqPutMsgStart > > +1.63% pgbench libpq.so.5.13 [.] pqGetInt > > ~ 13% A lot of that is really stupid. 
We need to improve libpq. That PQsendQueryGuts (attributed to PQsendQueryPrepared here) builds the command with many separate pqPut* calls, which reside in another translation unit, is pretty sad. > > +3.16% pgbench libc-2.28.so [.] __strcmp_avx2 > > +2.95% pgbench libc-2.28.so [.] malloc > > +1.85% pgbench libc-2.28.so [.] ppoll > > +1.85% pgbench libc-2.28.so [.] __strlen_avx2 > > +1.85% pgbench libpthread-2.28.so [.] __libc_recv > > ~ 11%, str is a pain… Not sure who is calling though, pgbench or > libpq. Both. Most of the strcmp is from getQueryParams()/getVariable(). The dopr() is from pg_*printf, which is mostly attributable to preparedStatementName() and getVariable(). > This is basically 47% pgbench, 53% lib*, on the sample provided. I'm unclear > about where system time is measured. It was excluded in this profile, both to reduce profiling costs, and to focus on pgbench. Greetings, Andres Freund
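Two of the overheads listed earlier in this mail, recomputing prepared-statement names and looking up variables by name on every execution, are both fixable by doing the work once at parse time. A toy Python model of that refactoring (Script, Command, and the slot scheme are invented for illustration; the real code is pgbench's C):

```python
class Script:
    """Toy model: resolve variable names to integer slots once, at parse
    time, so the per-execution hot path never does a string lookup."""

    def __init__(self):
        self.var_names = []    # parallel arrays: slot -> name
        self.var_values = []   # slot -> value

    def define_var(self, name, value):
        self.var_names.append(name)
        self.var_values.append(value)
        return len(self.var_names) - 1        # slot index, cacheable

    def resolve(self, name):
        # O(n) string search, done once per command at parse time.
        return self.var_names.index(name)

    def get(self, slot):
        # O(1), what the per-transaction hot path should use.
        return self.var_values[slot]


class Command:
    """Toy model: the prepared-statement name is formatted once when the
    command is parsed, not on every execution."""

    def __init__(self, command_id, sql):
        self.sql = sql
        self.prep_name = "P_%d" % command_id  # cached, never recomputed

    def execute_args(self):
        return (self.prep_name, self.sql)     # hot path: no formatting
```

In pgbench terms this corresponds to resolving names while parsing the script rather than inside the per-transaction state machine, though the mapping here is only a sketch.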
Re: pglz performance
Hi, On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote: > Hmmm, I don't remember the details of those patches so I didn't realize > it allows incremental recompression. If that's possible, that would mean > existing systems can start using it. Which is good. That depends on what you mean by "incremental"? A single toasted datum can only have one compression type, because we only update them all in one anyway. But different datums can be compressed differently. > Another question is whether we'd actually want to include the code in > core directly, or use system libraries (and if some packagers might > decide to disable that, for whatever reason). I'd personally say we should have an included version, and a --with-system-... flag that uses the system one. Greetings, Andres Freund
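The per-datum compression type this gradual switch-over relies on can be modelled simply: each stored value carries a one-byte method tag, so readers can still decompress old pglz values while new writes use the new method. A toy Python sketch (pglz_like, lz4_like, and ToastStore are invented stand-ins, not the actual TOAST format):

```python
# Stand-ins for real compressors; the leading byte is the method tag.
def pglz_like(data):
    return b"P" + data

def lz4_like(data):
    return b"L" + data

DECOMPRESSORS = {
    b"P"[0]: lambda c: c[1:],
    b"L"[0]: lambda c: c[1:],
}

class ToastStore:
    """Toy model: each stored datum carries its own method tag, so a
    store can switch its default compressor without rewriting old rows."""

    def __init__(self, compressor):
        self.compressor = compressor  # method used for new writes
        self.rows = {}

    def write(self, key, data):
        self.rows[key] = self.compressor(data)

    def read(self, key):
        compressed = self.rows[key]
        # Dispatch on the tag, independent of the current default.
        return DECOMPRESSORS[compressed[0]](compressed)
```

Old rows stay readable after the default is flipped, which is the sense in which existing systems could "start using" a new algorithm without a bulk recompression pass.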
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 1:42 AM Julien Rouhaud wrote: > Trivial patch for that attached. Thanks! > The output is now like: > > [...] > Using an oid in the 8000- range is recommended. > For instance: 9427 > > (checking that the suggested random oid is not used yet.) I've taken your patch, and changed the wording a bit. I think that it's worth being a bit more explicit. The attached revision produces output that looks like this: Patches should use a more-or-less consecutive range of OIDs. Best practice is to make a random choice in the range 8000-. Suggested random unused OID: 9099 I would like to push this patch shortly. How do people feel about this wording? (It's based on the documentation added by commit a6417078.) -- Peter Geoghegan v2-0001-unused_oids-suggestion.patch Description: Binary data
Re: partition routing layering in nodeModifyTable.c
Hi, On 2019-07-31 17:04:38 +0900, Amit Langote wrote: > I looked into trying to do the things I mentioned above and it seems > to me that revising BeginDirectModify()'s API to receive the > ResultRelInfo directly as Andres suggested might be the best way > forward. I've implemented that in the attached 0001. Patches that > were previously 0001, 0002, and 0003 are now 0002, 0003, and 0004, > respectively. 0002 is now a patch to "remove" > es_result_relation_info. Thanks! Some minor quibbles aside, the non-FDW patches look good to me. Fujita-san, do you have any comments on the FDW API change? Or anybody else? I'm a bit worried about the move of BeginDirectModify() into nodeModifyTable.c - it just seems like an odd control flow to me. Not allowing any intermediate nodes between ForeignScan and ModifyTable also seems like an undesirable restriction for the future. I realize that we already do that for BeginForeignModify() (just btw, that already accepts resultRelInfo as a parameter, so being symmetrical for BeginDirectModify makes sense), but it still seems like the wrong direction to me. The need for that move, I assume, comes from needing to know the correct ResultRelInfo, correct? I wonder if we shouldn't instead determine that at plan time (in setrefs.c), somewhat similar to how we determine ModifyTable.resultRelIndex. Doesn't look like that'd be too hard? Then we could just have BeginForeignModify, BeginDirectModify, BeginForeignScan all be called from ExecInitForeignScan(). Patch 04 is such a nice improvement. Besides getting rid of a substantial amount of code, it also makes the control flow a lot easier to read. > @@ -4644,9 +4645,7 @@ GetAfterTriggersTableData(Oid relid, CmdType cmdType) > * If there are no triggers in 'trigdesc' that request relevant transition > * tables, then return NULL. > * > - * The resulting object can be passed to the ExecAR* functions. 
The caller > - * should set tcs_map or tcs_original_insert_tuple as appropriate when > dealing > - * with child tables. > + * The resulting object can be passed to the ExecAR* functions. > * > * Note that we copy the flags from a parent table into this struct (rather > * than subsequently using the relation's TriggerDesc directly) so that we > can > @@ -5750,14 +5749,26 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo > *relinfo, >*/ > if (row_trigger && transition_capture != NULL) > { > - TupleTableSlot *original_insert_tuple = > transition_capture->tcs_original_insert_tuple; > - TupleConversionMap *map = transition_capture->tcs_map; > + TupleTableSlot *original_insert_tuple; > + PartitionRoutingInfo *pinfo = relinfo->ri_PartitionInfo; > + TupleConversionMap *map = pinfo ? > + > pinfo->pi_PartitionToRootMap : > + > relinfo->ri_ChildToRootMap; > booldelete_old_table = > transition_capture->tcs_delete_old_table; > boolupdate_old_table = > transition_capture->tcs_update_old_table; > boolupdate_new_table = > transition_capture->tcs_update_new_table; > boolinsert_new_table = > transition_capture->tcs_insert_new_table; > > /* > + * Get the originally inserted tuple from the global variable > and set > + * the latter to NULL because any given tuple must be read only > once. > + * Note that the TransitionCaptureState is shared across many > calls > + * to this function. > + */ > + original_insert_tuple = > transition_capture->tcs_original_insert_tuple; > + transition_capture->tcs_original_insert_tuple = NULL; Maybe I'm missing something, but original_insert_tuple is not a global variable? 
> @@ -888,7 +889,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate, > PartitionTupleRouting *proute, > PartitionDispatch dispatch, > ResultRelInfo *partRelInfo, > - int partidx) > + int partidx, > + bool is_update_result_rel) > { > MemoryContext oldcxt; > PartitionRoutingInfo *partrouteinfo; > @@ -935,10 +937,15 @@ ExecInitRoutingInfo(ModifyTableState *mtstate, > if (mtstate && > (mtstate->mt_transition_capture || > mtstate->mt_oc_transition_capture)) > { > - partrouteinfo->pi_PartitionToRootMap = > - > convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc), > - > RelationGetDescr(partRelInfo->ri_PartitionRoot), > -
Re: pglz performance
On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote: Hi, On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote: On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote: > Hi, > > On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote: > > We have some kind of "roadmap" of "extensible pglz". We plan to > > provide an implementation in the November CF. > > I don't understand why it's a good idea to improve the compression side > > of pglz. There are plenty of other people who spent a lot of time > > developing better compression algorithms. > Isn't it beneficial for existing systems, which will be stuck with pglz even if we end up adding other algorithms? Why would they be stuck continuing to *compress* with pglz? As we fully retoast on write anyway we can just gradually switch over to the better algorithm. Decompression speed is another story, of course. Hmmm, I don't remember the details of those patches so I didn't realize it allows incremental recompression. If that's possible, that would mean existing systems can start using it. Which is good. Another question is whether we'd actually want to include the code in core directly, or use system libraries (and if some packagers might decide to disable that, for whatever reason). But yeah, I agree you may have a point about optimizing pglz compression. Here's what I had a few years back: https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de see also https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de I think we should refresh something like that patch, and: - make the compression algorithm GUC an enum, rename - add --with-system-lz4 - obviously refresh the copy of lz4 - drop snappy That's a reasonable plan, I guess. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
Hi, On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote: > On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote: > > Hi, > > > > On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote: > > > We have some kind of "roadmap" of "extensible pglz". We plan to > > > provide an implementation in the November CF. > > > > I don't understand why it's a good idea to improve the compression side > > > of pglz. There are plenty of other people who spent a lot of time > > > developing better compression algorithms. > > > > Isn't it beneficial for existing systems, which will be stuck with pglz > even if we end up adding other algorithms? Why would they be stuck continuing to *compress* with pglz? As we fully retoast on write anyway we can just gradually switch over to the better algorithm. Decompression speed is another story, of course. Here's what I had a few years back: https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de see also https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de I think we should refresh something like that patch, and: - make the compression algorithm GUC an enum, rename - add --with-system-lz4 - obviously refresh the copy of lz4 - drop snappy Greetings, Andres Freund
Re: pglz performance
On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote: Hi, On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote: We have some kind of "roadmap" of "extensible pglz". We plan to provide an implementation in the November CF. I don't understand why it's a good idea to improve the compression side of pglz. There are plenty of other people who spent a lot of time developing better compression algorithms. Isn't it beneficial for existing systems, which will be stuck with pglz even if we end up adding other algorithms? Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will enhance the compression ratio. It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data. Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use. But also a system command can say "invoke decompression from extension". Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly. This will make hard-chosen compression unneeded, but seems overly hacky. But there will be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot. I think this is way too complicated, and will provide not particularly much benefit for the majority of users. I agree with this. I do see value in the feature, but probably not as a drop-in replacement for the default compression algorithm. I'd compare it to the "custom compression methods" patch that was submitted some time ago. In fact, I'll argue that we should flat out reject any such patch until we have at least one decent default compression algorithm in core. 
You're trying to work around a poor compression algorithm with complicated dictionary improvement, that require user interaction, and only will work in a relatively small subset of the cases, and will very often increase compression times. I wouldn't be so strict I guess. But I do agree an algorithm that requires additional steps (training, ...) is unlikely to be a good candidate for default instance compression alorithm. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pglz performance
Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
> We have some kind of "roadmap" of "extensible pglz". We plan to provide
> an implementation in November's CF.

I don't understand why it's a good idea to improve the compression side of pglz. There are plenty of other people that have spent a lot of time developing better compression algorithms.

> Currently, pglz starts with an empty cache map: there are no prior 4k bytes
> before the start. We can add an imaginary prefix to any data with common
> substrings: this will enhance the compression ratio.
> It is hard to decide on a training data set for this "common prefix". So we
> want to produce an extension with an aggregate function which produces some
> "adapted common prefix" from the user's data.
> Then we can "reserve" a few negative bytes for "decompression commands". Such
> a command can instruct the database on which common prefix to use.
> But a system command can also say "invoke decompression from an extension".
>
> Thus, the user will be able to train database compression on his data and
> substitute pglz compression with a custom compression method seamlessly.
>
> This would make a hard-chosen compression algorithm unneeded, but seems
> overly hacky. But there would be no need to have lz4, zstd, brotli, lzma and
> others in core. Why not provide e.g. "time series compression"? Or "DNA
> compression"? Whatever gun the user wants for his foot.

I think this is way too complicated, and will provide not particularly much benefit for the majority of users. In fact, I'll argue that we should flat out reject any such patch until we have at least one decent default compression algorithm in core.

You're trying to work around a poor compression algorithm with a complicated dictionary improvement that requires user interaction, will only work in a relatively small subset of cases, and will very often increase compression times.

Greetings,

Andres Freund
Re: Add client connection check during the execution of the query
On 18.07.2019 6:19, Tatsuo Ishii wrote:
> I noticed that there are some confusions in the doc and code regarding what
> the new configuration parameter means. According to the doc:
>
> +Default value is zero - it disables connection
> +checks, so the backend will detect client disconnection only when trying
> +to send a response to the query.
>
> But guc.c comment says:
>
> + gettext_noop("Sets the time interval for checking connection with the client."),
> + gettext_noop("A value of -1 disables this feature. Zero selects a suitable default value."),
>
> Probably the doc is correct since the actual code does so.

Yes, the value -1 is not even accepted due to the specified range.

> tps = 67715.993428 (including connections establishing)
> tps = 67717.251843 (excluding connections establishing)
>
> So the performance is about 5% down with the feature enabled in this case.
> For me, 5% down is not subtle. Probably we should warn about this in the doc.

I also see some performance degradation, although it is not so large in my case (I used pgbench with scale factor 10 and ran the same command as you). In my case the difference, 103k vs. 105k TPS, is smaller than 2%.

It seems to me that it is not necessary to enable the timeout for each command:

@@ -4208,6 +4210,9 @@ PostgresMain(int argc, char *argv[],
  */
  CHECK_FOR_INTERRUPTS();
  DoingCommandRead = false;
+ if (client_connection_check_interval)
+  enable_timeout_after(SKIP_CLIENT_CHECK_TIMEOUT,
+       client_connection_check_interval);

 /*
  * (5) turn off the idle-in-transaction timeout

Unlike the statement timeout or the idle-in-transaction timeout, the precise start of the measured interval is not important here. So it is possible to enable it once before the main backend loop:

@@ -3981,6 +3983,10 @@ PostgresMain(int argc, char *argv[],
  if (!IsUnderPostmaster)
   PgStartTime = GetCurrentTimestamp();

+ if (client_connection_check_interval)
+  enable_timeout_after(SKIP_CLIENT_CHECK_TIMEOUT,
+       client_connection_check_interval);
+
 /*
  * POSTGRES main processing loop begins here
  *

But actually I do not see much difference from moving the timeout-enabling code. Moreover, the difference in performance almost does not depend on the value of the timeout. I set it to 100 seconds with a 30-second pgbench run (so the timeout never fired and recv() was never called) and there is still a small difference in performance. After some experiments I found out that the mere presence of an active timer incurs a small performance penalty. You can easily check this: set, for example, statement_timeout to the same large value (100 seconds) and you will get the same small slowdown. So recv() itself is not the source of the problem. Actually, any system call (except perhaps fsync) performed less often than once per second cannot have a noticeable impact on performance. So I do not think that recv(MSG_PEEK) can cause any performance problem on Windows or any other platform.

But I wonder why we cannot just perform poll() with the POLLOUT flag and a zero timeout. If the OS has detected a closed connection, it should return POLLHUP, shouldn't it? I am not sure whether that is a more portable or more efficient way - it just seems (to me) a slightly more natural way to check whether the connection is still alive.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Index Skip Scan
Hi, On 8/2/19 8:14 AM, Dmitry Dolgov wrote: And this too (slightly rewritten:). We will soon post the new version of patch with updates about UniqueKey from Jesper. Yes. We decided to send this now, although there is still feedback from David that needs to be considered/acted on. The patches can be reviewed independently, but we will send them as a set from now on. Development of UniqueKey will be kept separate though [1]. Note, that while UniqueKey can form the foundation of optimizations for GROUP BY queries it isn't the focus for this patch series. Contributions are very welcomed of course. [1] https://github.com/jesperpedersen/postgres/tree/uniquekey Best regards, Jesper >From 35018a382d792d6ceeb8d0e9d16bc14ea2e3f148 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 2 Aug 2019 07:52:08 -0400 Subject: [PATCH 1/2] Unique key Design by David Rowley. Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 +++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile| 2 +- src/backend/optimizer/path/allpaths.c | 8 ++ src/backend/optimizer/path/costsize.c | 5 + src/backend/optimizer/path/indxpath.c | 41 +++ src/backend/optimizer/path/pathkeys.c | 72 ++-- src/backend/optimizer/path/uniquekey.c | 147 + src/backend/optimizer/plan/planner.c | 18 ++- src/backend/optimizer/util/pathnode.c | 12 ++ src/backend/optimizer/util/tlist.c | 1 - src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 18 +++ src/include/nodes/print.h | 2 +- src/include/optimizer/pathnode.h | 1 + src/include/optimizer/paths.h | 11 ++ 16 files changed, 377 insertions(+), 15 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekey.c diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index e6ce8e2110..9a4f3e8e4b 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); 
WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list); WRITE_NODE_FIELD(query_pathkeys); + WRITE_NODE_FIELD(query_uniquekeys); WRITE_NODE_FIELD(group_pathkeys); WRITE_NODE_FIELD(window_pathkeys); WRITE_NODE_FIELD(distinct_pathkeys); @@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node) WRITE_BOOL_FIELD(pk_nulls_first); } +static void +_outUniqueKey(StringInfo str, const UniqueKey *node) +{ + WRITE_NODE_TYPE("UNIQUEKEY"); + + WRITE_NODE_FIELD(eq_clause); +} + static void _outPathTarget(StringInfo str, const PathTarget *node) { @@ -4073,6 +4084,9 @@ outNode(StringInfo str, const void *obj) case T_PathKey: _outPathKey(str, obj); break; + case T_UniqueKey: +_outUniqueKey(str, obj); +break; case T_PathTarget: _outPathTarget(str, obj); break; diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c index 4ecde6b421..62c9d4ef7a 100644 --- a/src/backend/nodes/print.c +++ b/src/backend/nodes/print.c @@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable) printf(")\n"); } +/* + * print_uniquekeys - + * unique_key an UniqueKey + */ +void +print_uniquekeys(const List *uniquekeys, const List *rtable) +{ + ListCell *l; + + printf("("); + foreach(l, uniquekeys) + { + UniqueKey *unique_key = (UniqueKey *) lfirst(l); + EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause; + ListCell *k; + bool first = true; + + /* chase up */ + while (eclass->ec_merged) + eclass = eclass->ec_merged; + + printf("("); + foreach(k, eclass->ec_members) + { + 
EquivalenceMember *mem = (EquivalenceMember *) lfirst(k); + + if (first) +first = false; + else +printf(", "); + print_expr((Node *) mem->em_expr, rtable); + } + printf(")"); + if (lnext(uniquekeys, l)) + printf(", "); + } + printf(")\n"); +} + /* * print_tl * print targetlist in a more legible way. diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile index 6864a62132..8249a6b6db 100644 --- a/src/backend/optimizer/path/Makefile +++ b/src/backend/optimizer/path/Makefile @@ -13,6 +13,6 @@ top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \ - joinpath.o
Re: pglz performance
Thanks for looking into this!

> On 2 Aug 2019, at 19:43, Tomas Vondra wrote:
>
> On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
>>
>> It took me some time to understand that your memcpy optimization is
>> correct ;) Seems that the comments are not explanatory enough... will
>> try to fix. I have tested different ways of optimizing this fragment of
>> code, but failed to outperform your implementation!

JFYI, we tried optimizations with constant-size memcpy (optimized into inline assembly instead of a call), unrolling the literal loop, and some others. None of these worked better.

>> But ... below are results for lz4:
>>
>> Decompressor score (sum of all times):
>> NOTICE: Decompressor lz4_decompress result 3.660066
>> Compressor score (sum of all times):
>> NOTICE: Compressor lz4_compress result 10.288594
>>
>> There is a 2x advantage in decompression speed and a 10x advantage in
>> compression speed. So maybe instead of "hacking" the pglz algorithm we
>> should better switch to lz4?
>
> I think we should just bite the bullet and add an initdb option to pick
> the compression algorithm. That's been discussed repeatedly, but we never
> ended up actually doing that. See for example [1].
>
> If there's anyone willing to put some effort into getting this feature
> over the line, I'm willing to do reviews & commit. It's a seemingly
> small change with rather insane potential impact.
>
> But even if we end up doing that, it still makes sense to optimize the
> hell out of pglz, because existing systems will still use it
> (pg_upgrade can't switch from one compression algorithm to another).

We have some kind of "roadmap" of "extensible pglz". We plan to provide an implementation in November's CF.

Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will enhance the compression ratio. It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data. Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use. But a system command can also say "invoke decompression from an extension".

Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly.

This would make a hard-chosen compression algorithm unneeded, but seems overly hacky. But there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot.

Best regards, Andrey Borodin.
[PATCH] Improve performance of NOTIFY over many databases (v2)
Hoi hackers,

Here is a reworked version of the previous patches. The original three patches have been collapsed into one, as given the changes discussed it didn't make sense to keep them separate. There are now two patches (the third is just to help with testing):

Patch 1: Tracks the listening backends in a list so non-listening backends can be quickly skipped over. This is separate because it's orthogonal to the rest of the changes and there are other ways to do this.

Patch 2: This is the meat of the change. It implements all the suggestions discussed:

- The queue tail is now only updated lazily, whenever the notify queue moves to a new page. This did require a new global to track this state through the transaction commit, but it seems worth it.
- Only backends for the current database are signalled when a notification is made.
- Slow backends are woken up one at a time rather than all at once.
- A backend is allowed to lag up to 4 SLRU pages behind before being signalled. This is a tradeoff between how often to get woken up versus how much work to do once woken up.
- All the relevant comments have been updated to describe the new algorithm. Locking should also be correct now.

This means that in the normal case, where listening backends get a notification occasionally, no-one will ever be considered slow. An exclusive lock for cleanup will happen about once per SLRU page. There are still the exclusive locks on adding notifications, but that's unavoidable.

One minor issue is that pg_notification_queue_usage() will now return a small but non-zero number (about 3e-6) even when nothing is really going on. This could be fixed by having it take an exclusive lock instead and updating to the latest values, but that barely seems worth it.

Performance-wise it's even better than my original patches, with about a 20-25% reduction in CPU usage in my test setup (using the test script sent previously).
Here is the log output from my postgres, where you see the signalling in action:

--
16:42:48.673 [10188] martijn@test_131 DEBUG: PreCommit_Notify
16:42:48.673 [10188] martijn@test_131 DEBUG: NOTIFY QUEUE = (74,896)...(79,0)
16:42:48.673 [10188] martijn@test_131 DEBUG: backendTryAdvanceTail -> true
16:42:48.673 [10188] martijn@test_131 DEBUG: AtCommit_Notify
16:42:48.673 [10188] martijn@test_131 DEBUG: ProcessCompletedNotifies
16:42:48.673 [10188] martijn@test_131 DEBUG: backendTryAdvanceTail -> false
16:42:48.673 [10188] martijn@test_131 DEBUG: asyncQueueAdvanceTail
16:42:48.673 [10188] martijn@test_131 DEBUG: waking backend 137 (pid 10055)
16:42:48.673 [10055] martijn@test_067 DEBUG: ProcessIncomingNotify
16:42:48.673 [10187] martijn@test_131 DEBUG: ProcessIncomingNotify
16:42:48.673 [10055] martijn@test_067 DEBUG: asyncQueueAdvanceTail
16:42:48.673 [10055] martijn@test_067 DEBUG: waking backend 138 (pid 10056)
16:42:48.673 [10187] martijn@test_131 DEBUG: ProcessIncomingNotify: done
16:42:48.673 [10055] martijn@test_067 DEBUG: ProcessIncomingNotify: done
16:42:48.673 [10056] martijn@test_067 DEBUG: ProcessIncomingNotify
16:42:48.673 [10056] martijn@test_067 DEBUG: asyncQueueAdvanceTail
16:42:48.673 [10056] martijn@test_067 DEBUG: ProcessIncomingNotify: done
16:42:48.683 [9991] martijn@test_042 DEBUG: Async_Notify(changes)
16:42:48.683 [9991] martijn@test_042 DEBUG: PreCommit_Notify
16:42:48.683 [9991] martijn@test_042 DEBUG: NOTIFY QUEUE = (75,7744)...(79,32)
16:42:48.683 [9991] martijn@test_042 DEBUG: AtCommit_Notify
-

Have a nice weekend.

--
Martijn van Oosterhout http://svana.org/kleptog/

From 82366f1dbc0fc234fdd10dbc15519b3cf7104684 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout
Date: Tue, 23 Jul 2019 16:49:30 +0200
Subject: [PATCH 1/3] Maintain queue of listening backends to speed up loops

Especially the loop to advance the tail pointer is called fairly often while holding an exclusive lock.
--- src/backend/commands/async.c | 45 +++- 1 file changed, 40 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c index 6e9c580ec6..ba0b1baecc 100644 --- a/src/backend/commands/async.c +++ b/src/backend/commands/async.c @@ -217,6 +217,7 @@ typedef struct QueueBackendStatus { int32 pid; /* either a PID or InvalidPid */ Oid dboid; /* backend's database OID, or InvalidOid */ + int nextListener; /* backendid of next listener, 0=last */ QueuePosition pos; /* backend has read queue up to here */ } QueueBackendStatus; @@ -247,6 +248,7 @@ typedef struct AsyncQueueControl QueuePosition tail; /* the global tail is equivalent to the pos of * the "slowest" backend */ TimestampTz lastQueueFillWarn; /* time of last queue-full msg */ + int firstListener; /* backendId of first listener, 0=none */ QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER]; /* backend[0] is not used; used entries are from [1] to [MaxBackends] */ } AsyncQueueControl; @@ -257,8 +259,11 @@ static Asy
Optimize single tuple fetch from nbtree index
Hi hackers,

While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.

Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k, ts DESC):

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

And, even more importantly, when this query gets called in an inner loop like:

SELECT *
FROM generate_series(:start_ts, :end_ts, :interval) ts
  -- perhaps thousands of iterations; could also be a loop over values of
  -- 'k' rather than timestamps, this is just an example
CROSS JOIN LATERAL (
  SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1
) _;

With time-series data this case often arises, as you have a certain natural key for which you store updates as they occur. Getting the state of k at a specific time then boils down to the given query, which is almost always the fastest way to get this information, since the index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).

Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception for this *only* the first time amgettuple gets called. This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done.
If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.

There are a few caveats:

- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 but the first tuple returned does not match other scan conditions; LIMIT 1 but the tuple returned is not visible; no LIMIT at all but only a few matching rows.
- We need to take into account page splits, insertions and vacuums while we do not hold the read lock between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.

I did performance tests for some best-case and worst-case test scenarios. TPS results were stable and reproducible in re-runs on my otherwise idle server. Attached are the full results and how to reproduce them. I picked test cases that show the best performance as well as the worst performance compared to master. Summary: the greatest performance improvement can be seen for the cases with the subquery in a nested loop. In a nested loop of 100 times, the performance is roughly two times better; for 1 times the performance is roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with LIMIT 3 (or similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst-case test case, because it has to lock the page again and then read it, just for one tuple. In this case, I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to having to lock/unlock the same page in both _bt_first and _bt_next.

A few concrete questions for the community:

- Does the community also see this as a useful optimization?
- Is the way it is currently implemented safe? I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.
- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?

I'd also welcome any feedback on the process - this is my first patch, and while I tried to follow the guidelines, I may have missed something along the way.

Attachments:
- out_limit.txt: pgbench resu
Re: Memory-Bounded Hash Aggregation
On Fri, Aug 02, 2019 at 08:11:19AM -0700, Jeff Davis wrote:
> On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
> > I'm late to the party.
>
> You are welcome to join any time!
>
> > These two approaches both spill the input tuples, what if the skewed
> > groups are not encountered before the hash table fills up? The spill
> > files' size and disk I/O could be downsides.
>
> Let's say the worst case is that we encounter 10 million groups of size
> one first; just enough to fill up memory. Then, we encounter a single
> additional group of size 20 million, and need to write out all of those
> 20 million raw tuples. That's still not worse than Sort+GroupAgg, which
> would need to write out all 30 million raw tuples (in practice Sort is
> pretty fast so may still win in some cases, but not by any huge amount).
>
> > Greenplum spills all the groups by writing the partial aggregate
> > states, resets the memory context, processes incoming tuples and
> > builds an in-memory hash table, then reloads and combines the spilled
> > partial states at last. How does this sound?
>
> That can be done as an add-on to approach #1 by evicting the entire hash
> table (writing out the partial states), then resetting the memory
> context. It does add to the complexity though, and would only work for
> the aggregates that support serializing and combining partial states. It
> also might be a net loss to do the extra work of initializing and
> evicting a partial state if we don't have large enough groups to
> benefit.
>
> Given that the worst case isn't worse than Sort+GroupAgg, I think it
> should be left as a future optimization. That would give us time to tune
> the process to work well in a variety of cases.

+1 to leaving that as a future optimization.

I think it's clear there's no perfect eviction strategy - for every algorithm we came up with we can construct a data set on which it performs terribly (I'm sure we could do that for the approach used by Greenplum, for example).

So I think it makes sense to do what Jeff proposed, and then maybe try improving that in the future with a switch to a different eviction strategy based on some heuristics.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Memory-Bounded Hash Aggregation
On Fri, 2019-08-02 at 14:44 +0800, Adam Lee wrote:
> I'm late to the party.

You are welcome to join any time!

> These two approaches both spill the input tuples, what if the skewed
> groups are not encountered before the hash table fills up? The spill
> files' size and disk I/O could be downsides.

Let's say the worst case is that we encounter 10 million groups of size one first; just enough to fill up memory. Then, we encounter a single additional group of size 20 million, and need to write out all of those 20 million raw tuples. That's still not worse than Sort+GroupAgg, which would need to write out all 30 million raw tuples (in practice Sort is pretty fast so may still win in some cases, but not by any huge amount).

> Greenplum spills all the groups by writing the partial aggregate states,
> resets the memory context, processes incoming tuples and builds an
> in-memory hash table, then reloads and combines the spilled partial
> states at last. How does this sound?

That can be done as an add-on to approach #1 by evicting the entire hash table (writing out the partial states), then resetting the memory context. It does add to the complexity though, and would only work for the aggregates that support serializing and combining partial states. It also might be a net loss to do the extra work of initializing and evicting a partial state if we don't have large enough groups to benefit.

Given that the worst case isn't worse than Sort+GroupAgg, I think it should be left as a future optimization. That would give us time to tune the process to work well in a variety of cases.

Regards,
Jeff Davis
Re: Patch to document base64 encoding
On Fri, 02 Aug 2019 10:44:43 -0400 Tom Lane wrote:
> I don't really have a problem with repeating the entries for other
> functions that exist in both text and bytea variants, either.

Ok. Thanks. I'll repeat entries then.

Regards,
Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Patch to document base64 encoding
On 8/2/19 10:32 AM, Karl O. Pinc wrote:
> But I'm not happy with putting any function that works with
> bytea into the binary string section. This would mean moving,
> say, length() out of the regular string section.

I'm not sure why. The bytea section already has an entry for its length() function. There are also length() functions for bit, character, lseg, path, tsvector. I don't really think of those as "a length() function" that works on all those types. I think of a variety of types, many of which offer a length() function. That seems to be reflected in the way the docs are arranged.

Regards,
-Chap
Re: Patch to document base64 encoding
"Karl O. Pinc" writes: > But I'm not happy with putting any function that works with > bytea into the binary string section. This would mean moving, > say, length() out of the regular string section. There's a > lot of functions that work on both string and bytea inputs > and most (not all, see below) are functions that people > typically associate with string data. Well, there are two different length() functions --- length(text) and length(bytea) are entirely different things, they don't even measure in the same units. I think documenting them separately is the right thing to do. I don't really have a problem with repeating the entries for other functions that exist in both text and bytea variants, either. There aren't that many. > What I think I'd like to do is add a column to the table > in the string section that says whether or not the function > works with both string and bytea. Meh. Seems like what that would mostly do is ensure that neither page is understandable on its own. regards, tom lane
Re: pglz performance
On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
> On 27.06.2019 21:33, Andrey Borodin wrote:
> > > On 13 May 2019, at 12:14, Michael Paquier wrote:
> > > Decompression can matter a lot for mostly-read workloads and
> > > compression can become a bottleneck for heavy-insert loads, so
> > > improving compression or decompression should be two separate
> > > problems, not two problems linked. Any improvement in one or the
> > > other, or even both, is nice to have.
> >
> > Here's a patch hacked by Vladimir for compression. Key differences (as
> > far as I see; maybe Vladimir will post a more complete list of
> > optimizations):
> > 1. Use functions instead of macro-functions: not surprisingly, they
> > are easier to optimize and place fewer constraints on the compiler.
> > 2. More compact hash table: use indexes instead of pointers.
> > 3. More robust segment comparison: like memcmp, but returns the index
> > of the first differing byte.
> >
> > In a weighted mix of different data (same as for compression), the
> > overall speedup is x1.43 on my machine. The current implementation is
> > integrated into the test_pglz suite for benchmarking purposes [0].
> >
> > Best regards, Andrey Borodin.
> > [0] https://github.com/x4m/test_pglz
>
> It took me some time to understand that your memcpy optimization is
> correct ;) Seems that the comments are not explanatory enough...
> I have tested different ways of optimizing this fragment of code, but
> failed to outperform your implementation! Results on my computer are
> similar to yours:
>
> Decompressor score (sum of all times):
> NOTICE: Decompressor pglz_decompress_hacked result 6.627355
> NOTICE: Decompressor pglz_decompress_hacked_unrolled result 7.497114
> NOTICE: Decompressor pglz_decompress_hacked8 result 7.412944
> NOTICE: Decompressor pglz_decompress_hacked16 result 7.792978
> NOTICE: Decompressor pglz_decompress_vanilla result 10.652603
>
> Compressor score (sum of all times):
> NOTICE: Compressor pglz_compress_vanilla result 116.970005
> NOTICE: Compressor pglz_compress_hacked result 89.706105
>
> But ... below are results for lz4:
>
> Decompressor score (sum of all times):
> NOTICE: Decompressor lz4_decompress result 3.660066
> Compressor score (sum of all times):
> NOTICE: Compressor lz4_compress result 10.288594
>
> There is a 2x advantage in decompression speed and a 10x advantage in
> compression speed. So maybe instead of "hacking" the pglz algorithm we
> should better switch to lz4?

I think we should just bite the bullet and add an initdb option to pick the compression algorithm. That's been discussed repeatedly, but we never ended up actually doing that. See for example [1].

If there's anyone willing to put some effort into getting this feature over the line, I'm willing to do reviews & commit. It's a seemingly small change with rather insane potential impact.

But even if we end up doing that, it still makes sense to optimize the hell out of pglz, because existing systems will still use it (pg_upgrade can't switch from one compression algorithm to another).

regards

[1] https://www.postgresql.org/message-id/flat/55341569.1090107%402ndquadrant.com

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Patch to document base64 encoding
On Tue, 30 Jul 2019 23:44:49 +0200 (CEST) Fabien COELHO wrote:
> Personally, I'd be ok with having a separate "conversion function" table, and also with Tom's suggestion to have the string section hold "only simple string" functions, and if any binary appears it is moved into the binary section.
I'll make a "conversion function" table and put it in the binary section. But I'm not happy with putting every function that works with bytea into the binary string section. This would mean moving, say, length() out of the regular string section. There are a lot of functions that work on both string and bytea inputs, and most (not all, see below) are functions that people typically associate with string data. What I think I'd like to do is add a column to the table in the string section that says whether or not the function works with both string and bytea. The result would be: The hash functions (md5(), sha256(), etc.) would move to the string section, because they work on both strings and binary data. So the binary function table would shorten considerably. There would be a new table for conversions between bytea and string (both directions), placed in the binary string section. So the binary string section would be "just simple bytea", plus conversion functions. Kind of the opposite of Tom's suggestion. Please let me know what you think. Thanks. Regards, Karl
Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Recent failures in IsolationCheck deadlock-hard
Thomas Munro writes: > There have been five failures on three animals like this, over the > past couple of months: Also worth noting is that anole failed its first try at the new deadlock-parallel isolation test: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-01%2015%3A48%3A16 What that looks like is the queries got stuck and eventually isolationtester gave up and canceled the test. So I'm suspicious that there's a second bug in the parallel deadlock detection code. Possibly relevant factoids: all three of the animals in question run HEAD with force_parallel_mode = regress, and there's reason to think that their timing behavior could be different from other animals (anole and gharial run on HPUX, while hyrax uses CLOBBER_CACHE_ALWAYS). regards, tom lane
Re: pglz performance
On 27.06.2019 21:33, Andrey Borodin wrote:
On 13 May 2019, at 12:14, Michael Paquier wrote:
Decompression can matter a lot for mostly-read workloads and compression can become a bottleneck for heavy-insert loads, so improving compression or decompression should be treated as two separate problems, not two linked ones. Any improvement in one or the other, or even both, is nice to have.
Here's a patch hacked up by Vladimir for compression. Key differences (as far as I can see; maybe Vladimir will post a more complete list of optimizations):
1. Use functions instead of macros: not surprisingly, they are easier to optimize and impose fewer constraints on the compiler.
2. More compact hash table: use indexes instead of pointers.
3. More robust segment comparison: like memcmp, but returning the index of the first differing byte.
In a weighted mix of different data (same as for compression), the overall speedup is x1.43 on my machine. The current implementation is integrated into the test_pglz suite for benchmarking purposes [0]. Best regards, Andrey Borodin. [0] https://github.com/x4m/test_pglz
It took me some time to understand that your memcpy optimization is correct ;) I have tested different ways of optimizing this fragment of code, but failed to outperform your implementation! Results on my computer are similar to yours:
Decompressor score (summ of all times):
NOTICE: Decompressor pglz_decompress_hacked result 6.627355
NOTICE: Decompressor pglz_decompress_hacked_unrolled result 7.497114
NOTICE: Decompressor pglz_decompress_hacked8 result 7.412944
NOTICE: Decompressor pglz_decompress_hacked16 result 7.792978
NOTICE: Decompressor pglz_decompress_vanilla result 10.652603
Compressor score (summ of all times):
NOTICE: Compressor pglz_compress_vanilla result 116.970005
NOTICE: Compressor pglz_compress_hacked result 89.706105
But ...
below are results for lz4:
Decompressor score (summ of all times):
NOTICE: Decompressor lz4_decompress result 3.660066
Compressor score (summ of all times):
NOTICE: Compressor lz4_compress result 10.288594
There is a 2x advantage in decompression speed and a 10x advantage in compression speed. So maybe instead of "hacking" the pglz algorithm we should switch to lz4? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
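Scores like the ones quoted above come down to timing a codec round-trip. A minimal harness for such an experiment is sketched below; Python's standard library has no lz4 binding, so zlib stands in for the compressor, and the scoring idea (sum the per-run times after checking correctness) only mirrors the test_pglz approach rather than reusing its code:

```python
import time
import zlib

def score(compress, decompress, payload, loops=50):
    """Return (compress_seconds, decompress_seconds) over `loops` runs,
    after verifying the codec round-trips the payload correctly."""
    packed = compress(payload)
    assert decompress(packed) == payload  # correctness before speed
    t0 = time.perf_counter()
    for _ in range(loops):
        compress(payload)
    c_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(loops):
        decompress(packed)
    d_time = time.perf_counter() - t0
    return c_time, d_time

payload = b"mostly repetitive heap page content " * 1024
c_time, d_time = score(zlib.compress, zlib.decompress, payload)
```

Swapping in pglz or lz4 bindings for the two callables would reproduce the kind of head-to-head numbers shown in the thread.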
Re: block-level incremental backup
On Thu, Aug 1, 2019 at 5:06 PM Jeevan Chalke wrote:
> On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke wrote:
>> I am almost done writing the patch for pg_combinebackup and will post soon.
>
> Attached is a patch which implements the pg_combinebackup utility used to combine a full basebackup with one or more incremental backups.
> I have tested it manually and it works for all best cases.
> Let me know if you have any inputs/suggestions/review comments?

Some comments:

1) There will be some link files created for tablespaces; we might require some special handling for them.

2)
+ while (numretries <= maxretries)
+ {
+ rc = system(copycmd);
+ if (rc == 0)
+ return;
+
+ pg_log_info("could not copy, retrying after %d seconds",
+ sleeptime);
+ pg_usleep(numretries++ * sleeptime * 100L);
+ }
Retry functionality is handled only for copying of full files; should we handle retries for copying of partial files too?

3)
+ maxretries = atoi(optarg);
+ if (maxretries < 0)
+ {
+ pg_log_error("invalid value for maxretries");
+ fprintf(stderr, _("%s: -r maxretries must be >= 0\n"), progname);
+ exit(1);
+ }
+ break;
+ case 's':
+ sleeptime = atoi(optarg);
+ if (sleeptime <= 0 || sleeptime > 60)
+ {
+ pg_log_error("invalid value for sleeptime");
+ fprintf(stderr, _("%s: -s sleeptime must be between 1 and 60\n"), progname);
+ exit(1);
+ }
+ break;
We could have some range for maxretries similar to sleeptime.

4)
+ fp = fopen(filename, "r");
+ if (fp == NULL)
+ {
+ pg_log_error("could not read file \"%s\": %m", filename);
+ exit(1);
+ }
+
+ labelfile = malloc(statbuf.st_size + 1);
+ if (fread(labelfile, 1, statbuf.st_size, fp) != statbuf.st_size)
+ {
+ pg_log_error("corrupted file \"%s\": %m", filename);
+ free(labelfile);
+ exit(1);
+ }
Should we check for malloc failure?

5) Should we add a display of progress, as the backup may take some time? This can be added as an enhancement. We can get others' opinions on this.
6)
+ if (nIncrDir == MAX_INCR_BK_COUNT)
+ {
+ pg_log_error("too many incremental backups to combine");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ IncrDirs[nIncrDir] = optarg;
+ nIncrDir++;
+ break;
If the backup count increases, providing the input may be difficult. Should the user instead provide all the incremental backups from a parent folder, with the ordering of incremental backups handled internally?

7)
+ if (isPartialFile)
+ {
+ if (verbose)
+ pg_log_info("combining partial file \"%s.partial\"", fn);
+
+ combine_partial_files(fn, IncrDirs, nIncrDir, subdirpath, outfn);
+ }
+ else
+ copy_whole_file(infn, outfn);
Add verbose output for copying a whole file too.

8) We could also check whether approximately enough disk space is available before starting to combine the backup; this can be added as an enhancement. We can get others' opinions on this.

9)
+ printf(_(" -i, --incr-backup=DIRECTORY incremental backup directory (maximum %d)\n"), MAX_INCR_BK_COUNT);
+ printf(_(" -o, --output-dir=DIRECTORY combine backup into directory\n"));
+ printf(_("\nGeneral options:\n"));
+ printf(_(" -n, --no-clean do not clean up after errors\n"));
The help text "combine backup into directory" could be "combine backup directory".

10)
+/* Max number of incremental backups to be combined. */
+#define MAX_INCR_BK_COUNT 10
+
+/* magic number in incremental backup's .partial file */
MAX_INCR_BK_COUNT could be increased a little; some applications use 1 full backup at the beginning of the month and 30 incremental backups for the rest of the days in the month.

Regards, Vignesh
EnterpriseDB: http://www.enterprisedb.com
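The retry loop reviewed above boils down to a bounded number of copy attempts with a sleep between them. A rough stand-alone sketch of that shape (hypothetical names, not the patch's code):

```python
import shutil
import time

def copy_with_retry(src, dst, maxretries=3, sleeptime=1):
    """Try the copy up to maxretries+1 times, sleeping a growing
    multiple of sleeptime seconds between attempts; return success."""
    for attempt in range(maxretries + 1):
        try:
            shutil.copyfile(src, dst)
            return True
        except OSError:
            time.sleep(attempt * sleeptime)
    return False
```

One detail worth flagging in review: pg_usleep() takes microseconds, so the quoted numretries * sleeptime * 100L sleeps far less than the "retrying after %d seconds" message suggests; the units look inconsistent.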
Recent failures in IsolationCheck deadlock-hard
Hi,

There have been five failures on three animals like this, over the past couple of months:

 step s6a7: LOCK TABLE a7;
 step s7a8: LOCK TABLE a8;
 step s8a1: LOCK TABLE a1;
-step s8a1: <... completed>
 step s7a8: <... completed>
-error in steps s8a1 s7a8: ERROR: deadlock detected
+step s8a1: <... completed>
+ERROR: deadlock detected
 step s8c: COMMIT;
 step s7c: COMMIT;
 step s6a7: <... completed>

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-07-18%2021:57:59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-07-10%2005:59:16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2019-07-08%2015:02:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-23%2004:17:09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2019-06-12%2021:46:24

Before that there were some like that a couple of years back:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-09%2021:58:03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2021:58:04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-08%2005:19:17
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-07%2000:23:39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2017-04-05%2018:58:04

-- Thomas Munro https://enterprisedb.com
Re: Index Skip Scan
> On Thu, Jul 25, 2019 at 1:21 PM Kyotaro Horiguchi wrote:
> > I feel uncomfortable looking into indexinfo there. Couldn't we use indexskipprefix == -1 to signal !amcanskip from create_index_path?
Looks like it's not that straightforward to do this only in create_index_path, since to make this decision we need to have both parts, indexinfo and the distinct keys.
> Yeah, your explanation was perfect for me. What I failed to understand was what is expected to be done in that case. I reconsidered and understood that:
> For example, for the following query:
> select distinct (a, b) a, b, c from t where c < 100;
> skip scan returns one tuple for each distinct set of (a, b) with an arbitrary one of c. If the chosen c doesn't match the qual and there is some c that does match the qual, we miss that tuple.
> If this is correct, an explanation like the above might help.
Yes, that's correct; I've added this to the comments.
> Maybe something like the following will work *for me*:p
> | When we are fetching a cursor in the backward direction, return the tuples that forward fetching should have returned. In other words, we return the last scanned tuple in a DISTINCT set. Skip to that tuple before returning the first tuple.
And this too (slightly rewritten:). We will soon post the new version of the patch with updates about UniqueKey from Jesper.
Re: Commitfest 2019-07, the first of five* for PostgreSQL 13
I guess the CF app could show those kinds of metrics, but having a written report from a human seems to be a good idea (I got it from Alvaro's blog [1]). The CF is now closed, and here are the final numbers:

         status         | w1 | w2 | w3 | w4 | final
------------------------+----+----+----+----+-------
 Committed              | 32 | 41 | 49 | 59 |    64
 Moved to next CF       |  5 |  6 |  6 |  6 |   145
 Rejected               |  2 |  2 |  2 |  2 |     2
 Returned with feedback |  2 |  2 |  2 |  2 |     9
 Withdrawn              |  8 |  8 |  9 | 10 |    11

In percentages, we returned and rejected 5%, withdrew 5%, committed 28%, and pushed 62% to the next 'fest. That's a wrap. Thanks everyone. [1] https://www.2ndquadrant.com/en/blog/managing-a-postgresql-commitfest/
Thanks. Attached is a small graphical display of CF results over time. -- Fabien.
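The quoted percentages can be checked against the final column; small differences come down to rounding. This is just the arithmetic, not CF app code:

```python
final = {
    "Committed": 64,
    "Moved to next CF": 145,
    "Rejected": 2,
    "Returned with feedback": 9,
    "Withdrawn": 11,
}
total = sum(final.values())  # 231 patches in the final tally
committed_pct = round(100 * final["Committed"] / total)      # ~28%
moved_pct = round(100 * final["Moved to next CF"] / total)   # ~63%, quoted as 62%
returned_rejected_pct = round(
    100 * (final["Rejected"] + final["Returned with feedback"]) / total)  # ~5%
withdrawn_pct = round(100 * final["Withdrawn"] / total)      # ~5%
```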
Re: Store FullTransactionId in TwoPhaseFileHeader/GlobalTransactionData
On Fri, Aug 2, 2019 at 6:37 AM Thomas Munro wrote: > Thanks. This looks pretty reasonable to me, and I don't think we need > to worry about the subxid list for now. Why not just do them all at once? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Built-in connection pooler
On 02.08.2019 12:57, DEV_OPS wrote: Hello Konstantin would you please re-base this patch? I'm going to test it, and back port into PG10 stable and PG9 stable thank you very much Thank you. Rebased patch is attached. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c91e3e1..119daac 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,137 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is enabled. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached new connections are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements. + Tainted backend can server only one client. + + + The default value is 10, so up to 10 backends will serve each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to main "port" will be assigned dedicated backends, + while client connected to proxy port will be connected to backends through proxy which + performs transaction level scheduling. + + + The default value is 6543. 
+ + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets number of connection proxies. + Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing). + Each proxy launches its own subset of backends. + So maximal number of non-tainted backends is session_pool_size*connection_proxies*databases*roles. + + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies scheduling policy for assigning session to proxies in case of + connection pooling. Default policy is round-robin. + + + With round-robin policy postmaster cyclicly scatter sessions between proxies. + + + With random policy postmaster randomly choose proxy for new session. + + + With load-balancing policy postmaster choose proxy with lowest load average. + Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections. + + + + + + idle_pool_worker_timeout (integer) + + idle_pool_worker_timeout configuration parameter + + + + + Terminate an idle connection pool worker after the specified number of milliseconds. + The default value is 0, so pool workers are never terminated. + + + + + + restart_pooler_on_reload (string) + + restart_pooler_on_reload configuration parameter + + + + + Restart session pool workers once pg_reload_conf() is called. + The default value is false. + + + + unix_socket_directories (string) diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml new file mode 100644 index 000..bc6547b --- /dev/null +++ b/doc/src/sgml/connpool.sgml @@ -0,0 +1,174 @@ + + + + Connection pooling + + + built-in connection pool proxy + + + +PostgreSQL spawns a separate process (backend) for each client. 
+For large number of clients this model can consume a large number of system +resources and lead to significant performance degradation, especially on computers with large +number of CPU cores. The
Re: Fix typos
On Fri, Aug 2, 2019 at 12:11 AM Michael Paquier wrote: > On Thu, Aug 01, 2019 at 11:01:59PM -0400, Alvaro Herrera wrote: > > I think slight variations don't really detract from the value of the > > product, and consider the odd variation a reminder of the diversity of > > the project. I don't suggest that we purposefully introduce spelling > > variations, or that we refrain from fixing ones that appear in code > > we're changing, but I don't see the point in changing a line for the > > sole reason of standardising the spelling of a word. > > Agreed. This always reminds me of ANALYZE vs. ANALYSE where we don't > actually document the latter :) > I didn't know about that. That's a fun one! > > > That said, I'm not a native English speaker. > > Neither am I. > I am. Consistency is nice but either reads fine to me. Only brought it up as I didn't see many other usages so seemed out of place. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/
Re: Should we add xid_current() or a int8->xid cast?
On Thu, Jul 25, 2019 at 1:11 PM Thomas Munro wrote: > On Thu, Jul 25, 2019 at 12:42 PM Tom Lane wrote: > > Andres Freund writes: > > > On 2019-07-24 20:34:39 -0400, Tom Lane wrote: > > >> Yeah, I would absolutely NOT recommend that you open that can of worms > > >> right now. We have looked at adding unsigned integer types in the past > > >> and it looked like a mess. > > > > > I assume Thomas was thinking more of another bespoke type like xid, just > > > wider. There's some notational advantage in not being able to > > > immediately do math etc on xids. > > > > Well, we could invent an xid8 type if we want, just don't try to make > > it part of the numeric hierarchy (as indeed xid isn't). > > Yeah, I meant an xid64/xid8/fxid/pg_something/... type that isn't a > kind of number. I played around with an xid8 type over here (not tested much yet, in particular not tested on 32 bit box): https://www.postgresql.org/message-id/CA%2BhUKGKbQtX8E5TEdcZaYhTxqLqrvcpN1Vjb7eCu2bz5EACZbw%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
Re: Fix typos
On Thu, Aug 1, 2019 at 10:18 PM Tom Lane wrote: > It's British vs. American spelling. For the most part, Postgres > follows American spelling, but there's the odd Briticism here and > there. Thanks for the explanation. I thought that might be the case but didn't find any other usages of "serialise" so was not sure. > I'm not sure whether it's worth trying to standardize. > I think the most recent opinion on this was Munro's: > > > https://www.postgresql.org/message-id/ca+hukgjz-pdmgwxroiwvn-aeg4-ajdwj3gwdqkosa8g65sp...@mail.gmail.com Either reads fine to me and the best rationale I can think of for going with one spelling is to not have the same "fix" come up again. If there is a desire to change this, attached is updated to include one more instance of "materialise" and a change to the commit message to match some similar ones I found in the past. Regards, -- Sehrope Sarkuni Founder & CEO | JackDB, Inc. | https://www.jackdb.com/ From 20c018b9350cb17e87930cebb66b715bc4b9fcf4 Mon Sep 17 00:00:00 2001 From: Sehrope Sarkuni Date: Fri, 2 Aug 2019 05:58:35 -0400 Subject: [PATCH] Use American spelling for "serialize" and "materalize" --- contrib/pgstattuple/pgstatapprox.c | 2 +- contrib/xml2/xpath.c| 2 +- src/backend/access/transam/README | 4 ++-- src/backend/storage/buffer/bufmgr.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c index 636c8d40ac..83b3270554 100644 --- a/contrib/pgstattuple/pgstatapprox.c +++ b/contrib/pgstattuple/pgstatapprox.c @@ -278,7 +278,7 @@ pgstattuple_approx_internal(Oid relid, FunctionCallInfo fcinfo) errmsg("cannot access temporary tables of other sessions"))); /* - * We support only ordinary relations and materialised views, because we + * We support only ordinary relations and materialized views, because we * depend on the visibility map and free space map for our estimates about * unscanned pages. 
*/ diff --git a/contrib/xml2/xpath.c b/contrib/xml2/xpath.c index 1e5b71d9a0..11749b6c47 100644 --- a/contrib/xml2/xpath.c +++ b/contrib/xml2/xpath.c @@ -573,7 +573,7 @@ xpath_table(PG_FUNCTION_ARGS) errmsg("xpath_table must be called as a table function"))); /* - * We want to materialise because it means that we don't have to carry + * We want to materialize because it means that we don't have to carry * libxml2 parser state between invocations of this function */ if (!(rsinfo->allowedModes & SFRM_Materialize)) diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index ad4083eb6b..49b6809c82 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -609,8 +609,8 @@ or more buffer locks be held concurrently, you must lock the pages in appropriate order, and not release the locks until all the changes are done. Note that we must only use PageSetLSN/PageGetLSN() when we know the action -is serialised. Only Startup process may modify data blocks during recovery, -so Startup process may execute PageGetLSN() without fear of serialisation +is serialized. Only Startup process may modify data blocks during recovery, +so Startup process may execute PageGetLSN() without fear of serialization problems. All other processes must only call PageSet/GetLSN when holding either an exclusive buffer lock or a shared lock plus buffer header lock, or be writing the data block directly rather than through shared buffers diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6f3a402854..3d9e0a3488 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3519,7 +3519,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) /* * Set the page LSN if we wrote a backup block. We aren't supposed * to set this when only holding a share lock but as long as we - * serialise it somehow we're OK. We choose to set LSN while + * serialize it somehow we're OK. 
We choose to set LSN while * holding the buffer header lock, which causes any reader of an * LSN who holds only a share lock to also obtain a buffer header * lock before using PageGetLSN(), which is enforced in -- 2.17.1
Re: Store FullTransactionId in TwoPhaseFileHeader/GlobalTransactionData
On Fri, Aug 2, 2019 at 1:38 AM vignesh C wrote:
> On Thu, Aug 1, 2019 at 5:36 PM Thomas Munro wrote:
> > On Thu, Aug 1, 2019 at 9:32 PM vignesh C wrote:
> > > There is also one more comment which is yet to be concluded. The comment discusses changing subxids, which are of TransactionId type, to FullTransactionId type when written to the two-phase transaction file. We could not conclude this, as the data is stored similarly in TransactionStateData.
> > No comment on that question or the patch yet, but could you please add this to the next Commitfest so that cfbot starts testing it?
> I have added it to the commitfest.

Thanks. This looks pretty reasonable to me, and I don't think we need to worry about the subxid list for now. Here's a version I ran through pgindent to fix some whitespace problems. The pg_prepared_xacts view seems like as good a place as any to start showing FullTransactionId values to users. Here's an experimental patch on top to do that, introducing a new "xid8" type. After pg_resetwal -e 10 -D pgdata you can make transactions that look like this:

postgres=# select transaction, gid from pg_prepared_xacts ;
   transaction   | gid
-----------------+-----
 429496729600496 | tx1
(1 row)

-- Thomas Munro https://enterprisedb.com
0001-Use-FullTransactionId-for-two-phase-commit.patch Description: Binary data
0002-Add-SQL-type-xid8-to-expose-FullTransactionId-to-use.patch Description: Binary data
0003-Use-xid8-in-the-pg_prepared_xacts-view.patch Description: Binary data
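The sample value above decomposes cleanly if one models a FullTransactionId as a 32-bit epoch packed above a 32-bit xid: 429496729600496 splits into epoch 100000 and xid 496 under that packing. A quick check of the arithmetic (this models only the packing, not the server code):

```python
XID_BITS = 32

def full_xid(epoch, xid):
    """Pack a 32-bit epoch above a 32-bit xid, FullTransactionId-style."""
    assert 0 <= epoch < 2**XID_BITS and 0 <= xid < 2**XID_BITS
    return (epoch << XID_BITS) | xid

def split_xid(fxid):
    """Recover (epoch, xid) from the packed 64-bit value."""
    return fxid >> XID_BITS, fxid & (2**XID_BITS - 1)
```

full_xid(100000, 496) reproduces the 429496729600496 shown in the pg_prepared_xacts sample.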
Re: Partial join
On Thu, Aug 1, 2019 at 7:46 PM Arne Roland wrote:
> Hello Richard,
> thanks for your quick reply.
> > We need to fix this.
> Do you have a better idea than just keeping the old quals - possibly just the ones that get eliminated - in a separate data structure? Is the push down of quals the only case of elimination of quals, counting only the ones which happen before the restrict lists are generated?

In your case, the restriction 'sl = sl' is just not generated for the join, because it forms an EC with a const, which is not considered when generating join clauses. Please refer to the code snippet below:

@@ -1164,8 +1164,8 @@ generate_join_implied_equalities(PlannerInfo *root,
 List *sublist = NIL;
 /* ECs containing consts do not need any further enforcement */
 if (ec->ec_has_const)
 continue;

Thanks
Richard
Re: Partial join
On Thu, Aug 1, 2019 at 10:15 PM Tom Lane wrote:
> Richard Guo writes:
> > For the third query, a rough investigation shows that the quals 'sl = 5' and 'sc.sl = sg.sl' will form an equivalence class and generate two implied equalities: 'sc.sl = 5' and 'sg.sl = 5', which can be pushed down to the base rels. One consequence of the deduction is that when constructing restrict lists for the joinrel, we lose the original restriction 'sc.sl = sg.sl', and this would fail the check have_partkey_equi_join(), which checks whether there exists an equi-join condition for each pair of partition keys. As a result, this joinrel would not be considered as an input to further partitionwise joins.
> > We need to fix this.
> Uh ... why? The pushed-down restrictions should result in pruning away any prunable partitions at the scan level, leaving nothing for the partitionwise join code to do.

Hmm.. in the case of multiple partition keys, for range partitioning, if we have no clauses for a given key, any later keys would not be considered for partition pruning. That is to say, for a table 'p partition by range (k1, k2)', quals like 'k2 = Const' would not prune partitions. For the query:

select * from p as t1 join p as t2 on t1.k1 = t2.k1 and t1.k2 = t2.k2 and t1.k2 = 2;

since we don't consider ECs containing consts when generating join clauses, we don't have the restriction 't1.k2 = t2.k2' when building the joinrel. As a result, partitionwise join is not considered, as it requires an equi-join condition for each pair of partition keys. Is this a problem? What's your opinion? Thanks Richard
A couple of random BF failures in kerberosCheck
Hello, https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=elver&dt=2019-07-24%2003%3A22%3A17 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-08-02%2007%3A17%3A25 I wondered if this might be like the recently fixed problem with slapd not being ready to handle requests yet, since we start up krb5kdc first and then don't do anything explicit to wait for it, but it doesn't look like an obvious failure to reach it. It looks like test 3 on elver connected successfully but didn't like the answer it got for this query: SELECT gss_authenticated AND encrypted from pg_stat_gssapi where pid = pg_backend_pid(); -- Thomas Munro https://enterprisedb.com
Re: [HACKERS] Cached plans and statement generalization
On 02.08.2019 at 10:57, Konstantin Knizhnik wrote:
On 02.08.2019 11:25, Daniel Migowski wrote:
I have two suggestions however:
1. Please allow gathering information about the autoprepared statements by returning them in the pg_prepared_statements view. I would love to monitor their usage as well as the memory consumption that occurs. I suggested a patch to display that in https://www.postgresql.org/message-id/41ed3f5450c90f4d8381bc4d8df6bbdcf02e1...@exchangeserver.ikoffice.de
Sorry, but there is a pg_autoprepared_statements view. I think it will be better to distinguish explicitly and implicitly prepared statements, won't it? Do you want to add more information to this view? Right now it shows the query string, the types of parameters, and the number of executions of this query.
Sorry, I didn't notice it in the patch. If there is another view for those, that's perfectly fine.
2. Please not only use an LRU list, but maybe something which would prefer statements that get reused at all. We often create ad hoc statements with parameters which don't really need to be kept. Maybe I can suggest an implementation of an LRU list where reuse of a statement not only pulls it to the top of the list but also increases a reuse counter. When a statement would then drop off the end of the list, one checks whether the reuse count is non-zero, and only really drops it if the reuse count is zero. Otherwise the reuse count is decremented (or halved) and the element is also placed at the start of the LRU list, so it is kept a bit longer.
There are many variations of LRU and alternatives: clock, segmented LRU, ... I agree that classical LRU may not be the best replacement policy for autoprepare. An application either uses static queries or constructs them dynamically. It will be better to distinguish queries executed at least twice from one-shot queries. So maybe SLRU would be the best choice.
But actually the situation you have described (we often create ad hoc statements with parameters which don't really need to be kept) is not applicable to autoprepare. Autoprepare deals only with statements which are actually executed. We have another patch which limits the number of stored prepared plans to avoid memory overflow (in case of using stored procedures which implicitly prepare all issued queries). Here the choice of replacement policy is more critical.
Alright, I just wanted to have something better than an LRU so that it is not a step back from the implementation in the JDBC driver for us. Kind regards, Daniel Migowski
Re: [HACKERS] Cached plans and statement generalization
On 02.08.2019 11:25, Daniel Migowski wrote:
On 01.08.2019 at 18:56, Konstantin Knizhnik wrote:
I decided to implement your proposal. A much simpler version of the autoprepare patch is attached. On my computer I got the following results:
pgbench -M simple -S: 22495 TPS
pgbench -M extended -S: 47633 TPS
pgbench -M prepared -S: 47683 TPS
So autoprepare speeds up execution of queries sent using the extended protocol more than twice, and it is almost the same as with explicitly prepared statements. I failed to create a regression test for it because I do not know how to force psql to use the extended protocol. Any advice is welcome.
I am very interested in such a patch, because currently I use the same functionality within my JDBC driver, and having this directly in PostgreSQL would surely speed things up. I have two suggestions however:
1. Please allow gathering information about the autoprepared statements by returning them in the pg_prepared_statements view. I would love to monitor their usage as well as the memory consumption that occurs. I suggested a patch to display that in https://www.postgresql.org/message-id/41ed3f5450c90f4d8381bc4d8df6bbdcf02e1...@exchangeserver.ikoffice.de
Sorry, but there is a pg_autoprepared_statements view. I think it will be better to distinguish explicitly and implicitly prepared statements, won't it? Do you want to add more information to this view? Right now it shows the query string, the types of parameters, and the number of executions of this query.
2. Please not only use an LRU list, but maybe something which would prefer statements that get reused at all. We often create ad hoc statements with parameters which don't really need to be kept. Maybe I can suggest an implementation of an LRU list where reuse of a statement not only pulls it to the top of the list but also increases a reuse counter. When a statement would then drop off the end of the list, one checks whether the reuse count is non-zero, and only really drops it if the reuse count is zero.
Else the reuse count is decremented (or halved) and the element is also placed at the start of the LRU list, so it is kept a bit longer.
There are many variations of LRU and alternatives: clock, segmented LRU, ... I agree that classical LRU may not be the best replacement policy for autoprepare. An application either uses static queries or constructs them dynamically. It will be better to distinguish queries executed at least twice from one-shot queries. So maybe SLRU would be the best choice. But actually the situation you have described (we often create ad hoc statements with parameters which don't really need to be kept) is not applicable to autoprepare. Autoprepare deals only with statements which are actually executed. We have another patch which limits the number of stored prepared plans to avoid memory overflow (in case of using stored procedures which implicitly prepare all issued queries). Here the choice of replacement policy is more critical.
Thanks, Daniel Migowski -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
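The eviction policy proposed above can be made concrete with a small sketch. This is an assumed design based on the description in the thread, not the patch's code: a hit moves an entry to the MRU end and bumps its reuse count; an entry falling off the LRU end is evicted only if its count is zero, otherwise the count is halved and the entry gets another trip through the list.

```python
from collections import OrderedDict

class ReuseAwareLRU:
    """LRU list with a per-entry reuse counter and second chances."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> reuse count, LRU end first

    def access(self, key):
        if key in self.entries:
            self.entries[key] += 1
            self.entries.move_to_end(key)  # pull to the MRU end
            return
        while len(self.entries) >= self.capacity:
            victim, count = next(iter(self.entries.items()))
            del self.entries[victim]
            if count == 0:
                break  # a true one-shot entry: evict it and stop
            # reused entries get a second chance with a decayed count
            self.entries[victim] = count // 2
            self.entries.move_to_end(victim)
        self.entries[key] = 0
```

This behaves like a crude segmented LRU: one-shot statements fall out quickly, while statements reused even once survive at least one extra round, which is exactly the distinction between one-shot and repeated queries discussed above.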
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Thu, Aug 1, 2019 at 11:05 PM Andres Freund wrote: > > I'm actually quite unconvinced that it's sensible to update the global > value for nested queries. That'll mean e.g. the log_line_prefix and > pg_stat_activity values are most of the time going to be bogus while > nested, because the querystring that's associated with those will *not* > be the value that the queryid corresponds to. elog.c uses > debug_query_string to log the statement, which is only updated for > top-level queries (outside of some exceptions like parallel workers for > parallel queries in a function or stuff like that). And pg_stat_activity > is also only updated for top level queries. Having the nested queryid does indeed seem quite broken for log_line_prefix. However, having the nested queryid in pg_stat_activity would be convenient to track what a long-running stored function is currently doing. Maybe we could expose something like top_level_queryid and current_queryid instead?
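The top_level_queryid / current_queryid split proposed above can be modeled as a stack that nested executor entry/exit pushes and pops. This is a hypothetical sketch of the idea only; the class and method names are not PostgreSQL APIs:

```python
class QueryIdTracker:
    """Tracks query IDs across nested execution, exposing both the
    top-level ID (stable, matching the debug_query_string used for
    log_line_prefix-style reporting) and the currently executing one
    (useful for a pg_stat_activity-style view)."""

    def __init__(self):
        self._stack = []

    def executor_start(self, queryid):
        # Called when a (possibly nested) query starts executing.
        self._stack.append(queryid)

    def executor_end(self):
        # Called when the innermost query finishes.
        self._stack.pop()

    @property
    def top_level_queryid(self):
        return self._stack[0] if self._stack else None

    @property
    def current_queryid(self):
        return self._stack[-1] if self._stack else None
```

With this split, log output keyed to the top-level ID stays consistent with the logged query string, while a monitoring view can still show what a stored function is doing internally.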
Re: The unused_oids script should have a reminder to use the 8000-8999 OID range
On Fri, Aug 2, 2019 at 6:21 AM Michael Paquier wrote: > > On Thu, Aug 01, 2019 at 06:59:06PM -0700, Peter Geoghegan wrote: > > Seems like I should propose a patch this time around. I don't do Perl, > > but I suppose I could manage something as trivial as this. > > Well, that new project policy is not that well-advertised then, see > for example the recent 5925e55, c085e1c and 313f87a. So having some > kind of safety net would be nice. Trivial patch for that attached. The output is now like: [...] Using an OID in the 8000-8999 range is recommended. For instance: 9427 (checking that the suggested random OID is not used yet.) suggest_unused_oid.diff Description: Binary data
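The suggestion logic described here — pick a random OID in the development range and verify it is not already used — can be sketched as follows. The real unused_oids script is Perl and scans the catalog headers; this standalone model simply takes a set of used OIDs as input, and the function name is a hypothetical illustration:

```python
import random

def suggest_unused_oid(used_oids, low=8000, high=8999, seed=None):
    """Return a random unused OID in [low, high], mirroring the idea of
    suggesting non-colliding development OIDs to patch authors."""
    rng = random.Random(seed)
    free = [oid for oid in range(low, high + 1) if oid not in used_oids]
    if not free:
        raise RuntimeError("no free OIDs left in the development range")
    return rng.choice(free)
```

Building the free list first (rather than retrying random picks) guarantees termination and makes it trivial to detect an exhausted range.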
Re: pgbench - implement strict TPC-B benchmark
Hello Andres, Thanks a lot for this feedback and these comments. "Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1 e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's a pretty significant share." Fine, but what is the corresponding server load? 211%? 611%? And what actual time is spent in pgbench itself, vs libpq and syscalls? Figures and discussion below. "And before you argue that that's just about a read-only workload:" I'm fine with worst case scenarios :-) Let's do the worst on my laptop with 2 cores running at 2.2 GHz:

(0) we can run a script that really does nearly nothing:

sh> cat nope.sql
\sleep 0 # do not sleep, so stay awake…

sh> time pgbench -f nope.sql -T 10 -r
latency average = 0.000 ms
tps = 12569499.226367 (excluding connections establishing) # 12.6M
statement latencies in milliseconds:
0.000 \sleep 0
real 0m10.072s, user 0m10.027s, sys 0m0.012s

Unsurprisingly pgbench is at about 100% CPU load, and the transaction cost (transaction loop and stat collection) is 0.080 µs (1/12.6M) per script execution (one client on one thread).

(1) a pgbench complex-commands-only script:

sh> cat set.sql
\set x random_exponential(1, :scale * 10, 2.5) + 2.1
\set y random(1, 9) + 17.1 * :x
\set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end

sh> time pgbench -f set.sql -T 10 -r
latency average = 0.001 ms
tps = 1304989.729560 (excluding connections establishing) # 1.3M
statement latencies in milliseconds:
0.000 \set x random_exponential(1, :scale * 10, 2.5) + 2.1
0.000 \set y random(1, 9) + 17.1 * :x
0.000 \set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end
real 0m10.038s, user 0m10.003s, sys 0m0.000s

Again the pgbench load is near 100%, with only pgbench stuff (thread loop, expression evaluation, variables, stat collection) costing about 0.766 µs of CPU per script execution. This is about 10 times the previous case; 90% of the pgbench CPU cost is in expressions and variables, which is not a surprise. 
Probably this under-a-µs could be reduced… but what overall improvement would it provide? An answer with the last test:

(2) a ridiculously small SQL query, tested through a local unix socket:

sh> cat empty.sql
; # yep, an empty query!

sh> time pgbench -f empty.sql -T 10 -r
latency average = 0.016 ms
tps = 62206.501709 (excluding connections establishing) # 62.2K
statement latencies in milliseconds:
0.016 ;
real 0m10.038s, user 0m1.754s, sys 0m3.867s

We are adding minimal libpq and underlying system calls to pgbench's internal CPU costs, with the most favorable (or worst :-) SQL query over the most favorable postgres connection. The apparent load is about (1.754+3.867)/10.038 = 56%, so the CPU cost per script is 0.56 / 62206.5 = 9 µs, over 100 times the cost of the do-nothing script (0), and over 10 times the cost of the complex expression command script (1). Conclusion: pgbench-specific overheads are typically (much) below 10% of the total client-side CPU cost of a transaction, while over 90% of the CPU cost is spent in libpq and the system, even for the worst-case do-nothing query. A perfect bench driver with zero overhead would reduce the CPU cost by at most 10%, because you still have to talk to the database through the system. If the pgbench cost were divided by two, which would be a reasonable achievement, the benchmark client cost would be reduced by 5%. Wow? I have already given some thought in the past to optimizing pgbench, especially to avoid long switches (eg in expression evaluation) and maybe to improve variable management, but as shown above I would not expect a gain worth the effort, and assume that such a patch would probably be justly rejected, because for a realistic benchmark script these costs are already much less than the other inevitable libpq/syscall costs. That does not mean that nothing needs to be done, but the situation is currently quite good. 
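The overhead arithmetic above can be checked mechanically — the per-script CPU cost is the apparent CPU load share divided by the TPS (all numbers below are taken from the transcripts in this message; the helper function is just for illustration):

```python
def per_script_cpu_us(user_s, sys_s, wall_s, tps):
    """Client-side CPU microseconds spent per script execution:
    apparent CPU load share divided by throughput."""
    load = (user_s + sys_s) / wall_s          # apparent CPU load share
    return load / tps * 1_000_000, load

# Case (2): empty query over a local unix socket
cost2_us, load2 = per_script_cpu_us(1.754, 3.867, 10.038, 62206.5)
# load2 ≈ 0.56 (56%), cost2_us ≈ 9 µs per script

# Case (0): the do-nothing \sleep 0 script
cost0_us, load0 = per_script_cpu_us(10.027, 0.012, 10.072, 12569499.2)
# cost0_us ≈ 0.08 µs per script, i.e. over 100x cheaper than case (2)
```

This confirms the conclusion drawn in the message: the interpreter-only cost (case 0) is roughly two orders of magnitude below the cost of even an empty query once libpq and syscalls are involved.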
In conclusion, ISTM that current pgbench allows saturating a postgres server with a client significantly smaller than the server, which seems like a reasonable benchmarking situation. Any other driver in any other language would necessarily incur the same kind of costs. [...] "And the largest part of the overhead is in pgbench's interpreter loop:" Indeed, the figures below are very interesting! Thanks for collecting them.

+ 12.35%  pgbench  pgbench        [.] threadRun
+  3.54%  pgbench  pgbench        [.] dopr.constprop.0
+  3.30%  pgbench  pgbench        [.] fmtint
+  1.93%  pgbench  pgbench        [.] getVariable

~ 21%; probably some inlining has been performed, because I would have expected to see significant time in "advanceConnectionState".

+  2.95%  pgbench  libpq.so.5.13  [.] PQsendQueryPrepared
+  2.15%  pgbench  libpq.so.5.13  [.] pqPutInt
+  4.47%  pgbench  libpq.so.5.13  [.] pqParseInput3
+  1.6
Re: [HACKERS] Cached plans and statement generalization
Am 01.08.2019 um 18:56 schrieb Konstantin Knizhnik: I decided to implement your proposal. A much simpler version of the autoprepare patch is attached. On my computer I got the following results:
pgbench -M simple -S 22495 TPS
pgbench -M extended -S 47633 TPS
pgbench -M prepared -S 47683 TPS
So autoprepare speeds up the execution of queries sent using the extended protocol by more than a factor of two, making it almost as fast as with explicitly prepared statements. I failed to create a regression test for it because I do not know how to force psql to use the extended protocol. Any advice is welcome. I am very interested in such a patch, because currently I implement the same functionality within my JDBC driver, and having this directly in PostgreSQL would surely speed things up. I have two suggestions however: 1. Please allow gathering information about the autoprepared statements by returning them in the pg_prepared_statements view. I would love to monitor their usage as well as the memory consumption that occurs. I suggested a patch to display that in https://www.postgresql.org/message-id/41ed3f5450c90f4d8381bc4d8df6bbdcf02e1...@exchangeserver.ikoffice.de 2. Please do not use only an LRU list, but maybe something which would prefer statements that get reused at all. We often create ad hoc statements with parameters which don't really need to be kept. Maybe I can suggest an implementation of an LRU list where reuse of a statement not only pulls it to the top of the list but also increases a reuse counter. When a statement would then drop off the end of the list, one checks if the reuse count is non-zero, and only really drops it if the reuse count is zero. Otherwise the reuse count is decremented (or halved) and the element is also placed at the start of the LRU list, so it is kept a bit longer. Thanks, Daniel Migowski
Re: idea: log_statement_sample_rate - bottom limit for sampling
On 8/1/19 12:04 PM, Tomas Vondra wrote: > On Thu, Aug 01, 2019 at 11:47:46AM +0200, Adrien Nayrat wrote: >> Hi, >> >> As we are at the end of this CF and there are still discussions about whether >> we should revert log_statement_sample_limit and log_statement_sample_rate, or >> try to fix it in v12, I moved this patch to the next commit fest and changed its >> status from "ready for committer" to "need review". I hope I didn't make a mistake. >> > > Thanks. The RFC status was clearly stale, so thanks for updating. I should > have done that after my review. I think the patch would be moved to the > next CF at the end, but I might be wrong. In any case, I don't think > you've made any mistake. > > As for the sampling patch - I think we'll end up reverting the feature for > v12 - it's far too late to rework it at this point. Sorry about that, I > know it's not a warm feeling when you get something done, and then it gets > reverted at the last minute. :-( > Don't worry, I understand. It is better to add a straightforward GUC in v13 than a confusing one in v12 that we will regret.
Re: complier warnings from ecpg tests
Hi >> Thanks for the set of flags. So this comes from the use of -Og, and >> the rest of the tree does not complain. The issue is that gcc >> complains about the buffer not being large enough, but %d only uses up >> to 2 characters so there is no overflow. In order to fix the issue it >> is fine enough to increase the buffer size to 28 bytes, so I would >> recommend just doing that. This is similar to the business done in >> 3a4b891. > > And fixed with a9f301d. Thank you! My compiler is now quiet. Regards, Sergei
Re: [proposal] de-TOAST'ing using a iterator
On Tue, Jul 30, 2019 at 8:20 PM Binguo Bao wrote: > > John Naylor wrote on Monday, 2019-07-29 at 11:49: >> >> 1). For every needle comparison, text_position_next_internal() >> calculates how much of the value is needed and passes that to >> detoast_iterate(), which then calculates if it has to do something or >> not. This is a bit hard to follow. There might also be a performance >> penalty -- the following is just a theory, but it sounds plausible: >> The CPU can probably correctly predict that detoast_iterate() will >> usually return the same value it did last time, but it still has to >> call the function and make sure, which I imagine is more expensive >> than advancing the needle. Ideally, we want to call the iterator only >> if we have to. >> >> In the attached patch (applies on top of your v5), >> text_position_next_internal() simply compares hptr to the detoast >> buffer limit, and calls detoast_iterate() until it can proceed. I >> think this is clearer. > > > Yes, I think this is a general scenario where the caller continually > calls detoast_iterate until it gets enough data, so I think such operations can > be extracted as a macro, as I did in patch v6. In the macro, the > detoast_iterate > function is called only when the data requested by the caller is greater than > the > buffer limit. I like the use of a macro here. However, I think we can find a better location for the definition. See the header comment of fmgr.h: "Definitions for the Postgres function manager and function-call interface." Maybe tuptoaster.h is as good a place as any? >> I also noticed that no one updates or looks at >> "toast_iter.done" so I removed that as well. > > > toast_iter.done is now updated when the buffer limit reaches the buffer capacity. > So, I added it back. Okay. >> 2). detoast_iterate() and fetch_datum_iterate() return a value but we >> don't check it or do anything with it. Should we do something with it? 
>> It's also not yet clear if we should check the iterator state instead >> of return values. I've added some XXX comments as a reminder. We >> should also check the return value of pglz_decompress_iterate(). > > > IMO, we need to provide users with a simple iterative interface. > Using the required data pointer to compare with the buffer limit is an easy > way. > And the application scenarios of the iterator are mostly read operations. > So I think there is no need to return a value, and the iterator needs to > throw an > exception for some wrong calls, such as when all the data has been iterated > but the user still calls the iterator. Okay, I see these functions now return void. The original pglz_decompress() returned a value that was checked to detect corruption. Is there a similar check we can do for the iterative version? >> 3). Speaking of pglz_decompress_iterate(), I diff'd it with >> pglz_decompress(), and I have some questions on it: >> >> a). >> + srcend = (const unsigned char *) (source->limit == source->capacity >> ? source->limit : (source->limit - 4)); >> >> What does the 4 here mean in this expression? > > > Since we fetch chunks one by one, if we make srcend equal to the source > buffer limit, then in the while loop "while (sp < srcend && dp < destend)", sp may exceed the > source buffer limit and read unallocated bytes. Why is this? That tells me the limit is incorrect. Can the setter not determine the right value? > Giving a four-byte buffer can prevent sp from exceeding the source buffer > limit. Why 4? That's a magic number. Why not 2, or 27? > If we have read all the chunks, we don't need to be careful about crossing the > border, > and can just make srcend equal to the source buffer limit. I've added comments to explain > it in patch v6. That's a good thing to comment on, but it doesn't explain why. This logic seems like a band-aid and I think a committer would want this to be handled in a more principled way. 
>> Is it possible it's >> compensating for this bit in init_toast_buffer()? >> >> + buf->limit = VARDATA(buf->buf); >> >> It seems the initial limit should also depend on whether the datum is >> compressed, right? Can we just do this: >> >> + buf->limit = buf->position; > > > I'm afraid not. buf->position points to the data portion of the buffer, but > the beginning of > the chunks we read may contain header information. For example, for > compressed data chunks, > the first four bytes record the size of raw data, this means that limit is > four bytes ahead of position. > This initialization doesn't cause errors, although the position is less than > the limit in other cases. > Because we always fetch chunks first, then decompress it. I see what you mean now. This could use a comment or two to explain the stated constraints may not actually be satisfied at initialization. >> b). >> - while (sp < srcend && dp < destend) >> ... >> + while (sp + 1 < srcend && dp < destend && >> ... >> >> Why is it here "sp + 1"? > > > Ignore it, I set the inactive state of detoast_iter->ctrl to 8 in patch v
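The consumption pattern discussed in this thread — the caller compares its read position against the detoast buffer limit and calls the iterator only while more data is actually needed — can be sketched independently of the C macro. All names here are hypothetical stand-ins, not the patch's actual symbols:

```python
class DetoastBuffer:
    """Toy stand-in for an incremental de-TOAST buffer: iterate()
    appends one chunk; 'limit' marks how far the data is usable."""

    def __init__(self, chunks):
        self._pending = list(chunks)   # TOAST chunks not yet fetched
        self.data = bytearray()

    @property
    def limit(self):
        return len(self.data)

    @property
    def done(self):
        return not self._pending

    def iterate(self):
        # Corresponds to one detoast_iterate() call: make more bytes usable.
        self.data.extend(self._pending.pop(0))

def read_upto(buf, needed):
    """Caller-side loop: iterate only while the requested offset lies
    beyond the buffer limit, mirroring the macro's cheap comparison."""
    while buf.limit < needed and not buf.done:
        buf.iterate()
    return bytes(buf.data[:needed])
```

The cheap `limit < needed` comparison is the hot path; the iterator call happens only on the rare occasions when the buffer actually has to grow, which is the branch-prediction argument made earlier in the thread.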
Re: Hash join explain is broken
Hi, On 2019-07-02 10:50:02 -0400, Tom Lane wrote: > I wrote: > > Andres Freund writes: > >> Tom, any comments? Otherwise I'll go ahead, and commit after a round or > >> two of polishing. > > > Sorry for not getting to this sooner --- I'll try to look tomorrow. > > I took a look, and I think this is going in the right direction. > We definitely need a test case corresponding to the live bug, > and I think the comments could use more work, and there are some > cosmetic things (I wouldn't add the new struct Hash field at the > end, for instance). I finally pushed a substantially polished version of this. I ended up moving, as I had wondered about, hashoperator and hashcollation computation to the planner too - without that we would end up with two very similar loops during plan and execution time. I've added a test that puts subplans just about everywhere possible in a hash join - it's the only reliable way I found to trigger errors (only during EXPLAIN, as deparsing there tries to find the associated node, for column names etc, and failed because the subplan referenced an INNER_VAR, even though Hash doesn't have an inner plan). Makes the test query a bit hard to read, but I didn't get any better ideas, and it doesn't seem too bad. Thanks Tom for the review, thanks Alexander and Nikita for the report. Sorry that it took this long. Greetings, Andres Freund
Proposal: Clean up RangeTblEntry nodes after query preparation
Hello, I currently have the problem that a simple prepared statement for a query like select * from vw_report_invoice where id = $1 results in 33MB memory consumption on my side. It is a query that does about >20 joins over partially wide tables, but only a very small subset of columns is really needed. I already suspected that PreparedStatements contain all the metadata about all used tables, and it turns out it is even worse (that metadata is even copied into memory multiple times). The reason for the absurd memory consumption is the RangeTblEntry nodes, which are created for every touched table and for every join performed. I printed the query lists created after preparation and found multiple RangeTblEntry nodes containing >4000 columns, including the names of the columns as well as a list of all column types, even when only a very small subset is really required. The minimum memory consumption is 46 bytes + the name of the column, guessing it will be 64+32? bytes when palloced, resulting in ~400kB of memory for just one of the many RTEs in the large query, where maybe 50 columns are really used. One can test this easily. 
When I create two simple tables:

CREATE TABLE invoice (
  id serial PRIMARY KEY,
  number varchar(10),
  amount float8,
  customer_id int4
);

CREATE TABLE invoiceitem (
  id serial PRIMARY KEY,
  invoice_id int4,
  position int4,
  quantity float8,
  priceperunit float8,
  amount float8,
  description text
);

ALTER TABLE invoiceitem ADD CONSTRAINT fk_invoiceitem_invoice_id
  FOREIGN KEY (invoice_id) REFERENCES invoice(id);

And now I prepare a simple join over these tables:

PREPARE invoicequerytest AS
  SELECT inv.id, inv.number, item.id, item.description
  FROM invoice inv
  LEFT JOIN invoiceitem item ON item.invoice_id = inv.id
  WHERE inv.id = $1;

The pprint-ed RTE for the join alone is this:

{RTE
  :alias <>
  :eref {ALIAS :aliasname unnamed_join :colnames ("id" "number" "amount" "customer_id" "id" "invoice_id" "position" "quantity" "priceperunit" "amount" "description")}
  :rtekind 2
  :jointype 1
  :joinaliasvars (
    {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1}
    {VAR :varno 1 :varattno 2 :vartype 1043 :vartypmod 14 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 2 :location -1}
    {VAR :varno 1 :varattno 3 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 3 :location -1}
    {VAR :varno 1 :varattno 4 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 4 :location -1}
    {VAR :varno 2 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 1 :location -1}
    {VAR :varno 2 :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 2 :location -1}
    {VAR :varno 2 :varattno 3 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 3 :location -1}
    {VAR :varno 2 :varattno 4 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 4 :location -1}
    {VAR :varno 2 :varattno 5 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 5 :location -1}
    {VAR :varno 2 :varattno 6 :vartype 701 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 2 :varoattno 6 :location -1}
    {VAR :varno 2 :varattno 7 :vartype 25 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 2 :varoattno 7 :location -1}
  )
  :lateral false
  :inh false
  :inFromCl true
  :requiredPerms 0
  :checkAsUser 0
  :selectedCols (b)
  :insertedCols (b)
  :updatedCols (b)
  :securityQuals <>
}

It contains every col from both tables. Useful cols: "id" "number" "id" "invoice_id" "description" Useless because completely unreferenced cols: "amount" "customer_id" "position