connection timeout hint

2023-12-03 Thread Jeff Janes
When one tries to connect to a server and port that are protected by a
firewall, one gets messages like this:

Unix:
psql: error: connection to server at "192.168.0.26", port 5432 failed:
Connection timed out
Is the server running on that host and accepting TCP/IP connections?

Windows:
psql: error: connection to server at "192.168.0.26", port 5432 failed:
Connection timed out (0x274C/10060)
Is the server running on that host and accepting TCP/IP connections?

But the hint given is unhelpful, and even positively misleading.  If the
port is blocked by a firewall, that doesn't imply the database server is
not listening (if only one could get to it), and it doesn't even matter
whether the database server is listening.  If for some reason it weren't
listening as well as being blocked, making it listen wouldn't help as long
as it remains blocked at the firewall.

Is there some portable way to detect this cause of the connection problem
(connection timeout) and issue a more suitable hint
explicitly mentioning firewalls and routers, or perhaps just no hint at all?

As far as I know, only a firewall causes this problem, at least on a
persistent basis.  Maybe you could see it sporadically on a vastly
overloaded server or a server caught in the process of rebooting.  It would
be better to give a hint that is correct the vast majority of the time than
one that is wrong the vast majority of the time.

There are a lot of questions about this on, for example, stackoverflow.  I
think people might be better able to figure it out for themselves if the
hint were not actively leading them astray.

Cheers,

Jeff


Re: Use FD_CLOEXEC on ListenSockets (was Re: Refactoring backend fork+exec code)

2023-08-28 Thread Jeff Janes
On Thu, Aug 24, 2023 at 10:05 AM Heikki Linnakangas  wrote:

> On 24/08/2023 15:48, Thomas Munro wrote:
> > LGTM.  I vaguely recall thinking that it might be better to keep
> > EXEC_BACKEND and !EXEC_BACKEND working the same which might be why I
> > didn't try this one, but it looks fine with the comment to explain, as
> > you have it.  (It's a shame we can't use O_CLOFORK.)
>
> Yeah, O_CLOFORK would be nice..
>
> Committed, thanks!
>
>
Since this commit, I'm getting a lot (63 per restart) of messages:

 LOG:  could not close client or listen socket: Bad file descriptor

All I have to do to get the message is turn logging_collector = on and
restart.

The close failure condition existed before the commit, it just wasn't
logged before.  So, did the extra logging added here just uncover a
pre-existing bug?

The LOG message is sent to the terminal, not to the log file.

Cheers,

Jeff


awkward cancellation of parallel queries on standby.

2023-03-26 Thread Jeff Janes
When a parallel query gets cancelled on a standby due to
max_standby_streaming_delay, it happens rather awkwardly.  I get two errors
stacked up, a query cancellation followed by a connection termination.

I use `pgbench -R 1 -T3600 -P5` on the master to generate a light but
steady stream of HOT pruning records, and then run `select
sum(a.abalance*b.abalance) from pgbench_accounts a join pgbench_accounts b
using (bid);` on the standby not in a transaction block to be a
long-running parallel query (scale factor of 20)

I also set max_standby_streaming_delay = 0.  That isn't necessary, but it
saves wear and tear on my patience.

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be
removed.
FATAL:  terminating connection due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be
removed.

This happens quite reliably.  In psql, these sometimes both show up
immediately, and sometimes only the first one shows up immediately and then
the second one appears upon the next communication to the backend.

I don't know if this is actually a problem.  It isn't for me as I don't do
this kind of thing outside of testing, but it seems untidy and I can see it
being frustrating from a catch-and-retry perspective and from a log-spam
perspective.

It looks like the backend gets signalled by the startup process, and then
it signals the postmaster to signal the parallel workers, and then they
ignore it for quite a long time (tens to hundreds of ms).  By the time they
get around to responding, someone has decided to escalate things.  Which
doesn't seem to be useful, because no one can do anything until the workers
respond anyway.

This behavior seems to go back a long way, but the propensity for both
messages to show up at the same time vs. in different round-trips changes
from version to version.

Is this something we should do something about?

Cheers,

Jeff


Bug with ICU for merge join

2023-03-24 Thread Jeff Janes
Ever since 27b62377b47f9e7bf58613, I have been getting "ERROR:  mergejoin
input data is out of order" for the attached reproducer.

I get this on Ubuntu 20.04 and 22.04, whether initdb was run under LC_ALL=C
or under LANG=en_US.UTF-8.

It is not my query; I don't really know what its point is.  I just got this
error while looking into its performance and accidentally running it
against 16dev.

Cheers,

Jeff


bad_merge.sql
Description: Binary data


Index ordering after IS NULL

2022-09-10 Thread Jeff Janes
On a two-column btree index, we can constrain the first column with
equality and read the rows in order by the 2nd column.  But we can't
constrain the first column by IS NULL and still read the rows in order by
the 2nd column.  But why not?  Surely the structure of the btree index
would allow for this to work.

Example:

create table if not exists j as select case when random()<0.9 then
floor(random()*10)::int end b, random() c from generate_series(1,100);
create index if not exists j_b_c_idx on j (b,c);
set enable_sort TO off;
explain analyze select * from j where b is null order by c limit 10;
explain analyze select * from j where b =8 order by c limit 10;

The first uses a sort despite it being disabled.

Cheers,

Jeff


Re: Instrumented pages/tuples frozen in autovacuum's server log out (and VACUUM VERBOSE)

2022-08-31 Thread Jeff Janes
On Sat, Aug 20, 2022 at 7:29 PM Peter Geoghegan  wrote:


> XIDs processed: 45 pages from table (100.00% of total) had 1 tuples
> frozen
>

I like this addition, but I would also like to see how many pages got newly
set to all frozen by the vacuum.  Would that be a convenient thing to also
report here?

Also, isn't all of vacuuming about XID processing?  I think "frozen:" would
be a more suitable line prefix.

Cheers,

Jeff


Re: num_sa_scans in genericcostestimate

2022-08-31 Thread Jeff Janes
On Sun, Aug 21, 2022 at 2:45 PM Jeff Janes  wrote:


> ...
>


> The context for this is that I was looking at cases where btree indexes
> were not using all the columns they could, but rather shoving some of the
> conditions down into a Filter unnecessarily/unhelpfully.  This change
> doesn't fix that, but it does seem to be moving in the right direction.
>

Added to commitfest.


> This does cause a regression test failure due to an (apparently?)
> uninteresting plan change.
>

Looking more at the regression test plan change, it points up an
interesting question which is only tangentially related to this patch.

With patch applied:

[local] 417536 regression=# explain analyze SELECT thousand, tenthous FROM
tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
  QUERY PLAN

---
 Sort  (cost=4.55..4.56 rows=1 width=8) (actual time=0.100..0.101 rows=2
loops=1)
   Sort Key: thousand
   Sort Method: quicksort  Memory: 25kB
   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
 (cost=0.29..4.50 rows=1 width=8) (actual time=0.044..0.048 rows=2 loops=1)
 Index Cond: ((thousand < 2) AND (tenthous = ANY
('{1001,3000}'::integer[])))
 Heap Fetches: 0
 Planning Time: 1.040 ms
 Execution Time: 0.149 ms
(8 rows)


[local] 417536 regression=# set enable_sort TO off ;


[local] 417536 regression=# explain analyze SELECT thousand, tenthous FROM
tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
   QUERY PLAN

-
 Index Only Scan using tenk1_thous_tenthous on tenk1  (cost=0.29..4.71
rows=1 width=8) (actual time=0.021..0.024 rows=2 loops=1)
   Index Cond: (thousand < 2)
   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
   Rows Removed by Filter: 18
   Heap Fetches: 0
 Planning Time: 0.156 ms
 Execution Time: 0.039 ms
(7 rows)

Why does having the =ANY in the "Index Cond:" rather than the "Filter:"
inhibit it from understanding that the rows will still be delivered in
order by "thousand"?

Cheers,

Jeff

>


num_sa_scans in genericcostestimate

2022-08-21 Thread Jeff Janes
When costing a btree index scan, num_sa_scans gets computed twice, once in
btcostestimate and once in genericcostestimate.  But the computations are
different.  It looks like the generic one includes every =ANY on any column
in the index, while the bt one includes only =ANY on columns for which all
the preceding index columns are tested for equality.

It looks like the generic one was there first, and the bt-specific one was
added later to improve planning of btree indexes.  But then shouldn't the
value be passed down to generic, rather than recomputed (differently)?
I've attached a patch to do that.  Generic still computes the value itself
for other-than-btree indexes.

Or, is there a reason we want a different value to be used in
genericcostestimate?

The context for this is that I was looking at cases where btree indexes
were not using all the columns they could, but rather shoving some of the
conditions down into a Filter unnecessarily/unhelpfully.  This change
doesn't fix that, but it does seem to be moving in the right direction.

This does cause a regression test failure due to an (apparently?)
uninteresting plan change.

Cheers,

Jeff


eq_any.patch
Description: Binary data


15beta1 tab completion of extension versions

2022-06-18 Thread Jeff Janes
Extension version strings need to be quoted.  Either double or single
quotes will work.  In released psql clients, tab completion offers double
quoted suggestions:

alter extension pg_trgm update TO 
"1.3"  "1.4"  "1.5"  "1.6"

But commit 02b8048ba5 broke that: it now offers unquoted version strings
which, if used as offered, lead to syntax errors.

The code change seems to have been intentional, but I don't think the
behavior change was intended.  While the version string might not be an
identifier, it still needs to be treated as if it were one.
Putting pg_catalog.quote_ident back
into Query_for_list_of_available_extension_versions* fixes it, but might
not be the best way to fix it.

commit 02b8048ba5dc36238f3e7c3c58c5946220298d71 (HEAD, refs/bisect/bad)
Author: Tom Lane 
Date:   Sun Jan 30 13:33:23 2022 -0500

psql: improve tab-complete's handling of variant SQL names.


Cheers,

Jeff


Re: PostgreSQL 15 Beta 1 release announcement draft

2022-05-23 Thread Jeff Janes
On Sat, May 14, 2022 at 2:52 PM Jonathan S. Katz 
wrote:

> Hi,
>
> Attached is a draft of the release announcement for the PostgreSQL 15
> Beta 1 release. The goal of this announcement is to raise awareness
> around many of the new features appearing in PostgreSQL 15 and to
> encourage people to test. The success of the PostgreSQL 15 GA depends
> heavily on people testing during the Beta period!
>

I have some belated feedback.  I was excited to try this on Windows (I
don't have a build system for that), so I followed the first link in the
message, to https://www.postgresql.org/download/.  At first glance there is
nothing about beta there, but there is a prominent Windows icon, so I
clicked on that.  That led to EDB, where there is no apparent way to
download the beta, just the released versions.  I poked around EDB a bit
without finding anything promising, then backed out of all that poking
around and eventually all the way back to /download, where I scrolled down
and finally found the link to https://www.postgresql.org/download/snapshots/,
which tells me what I need to know.  But at this point I was more annoyed
than excited.

An invitation to download the beta should take me directly to the page
relevant to doing that.  I shouldn't have to read the page backwards, or do
a breadth-first traversal, to get to the right place efficiently.  People
will click on the first link that seems relevant, and "Windows" on the
generic download page certainly seems relevant to the beta for Windows,
until after you have scrolled down and found the beta/RC-specific link
instead.  (I now recall being annoyed by this in a prior year as well; I
guess I have a bad memory for avoiding mistakes but a good memory for
recalling them.)  Also, the download page should probably say "binary
packages and installers" where it currently says "There are source code and
binary packages of beta and release candidates", although I guess that is
not about the announcement itself.

Cheers,

Jeff


Re: Query generates infinite loop

2022-05-04 Thread Jeff Janes
On Wed, Apr 20, 2022 at 5:43 PM Tom Lane  wrote:

> I wrote:
> > it's true that infinities as generate_series endpoints are going
> > to work pretty oddly, so I agree with the idea of forbidding 'em.
>
> > Numeric has infinity as of late, so the numeric variant would
> > need to do this too.
>
> Oh --- looks like numeric generate_series() already throws error for
> this, so we should just make the timestamp variants do the same.
>

The regression test you added for this change causes an infinite loop when
run against an unpatched server with --install-check.  That is a bit
unpleasant.  Is there something we can and should do about that?  I was
expecting regression test failures of course but not an infinite loop
leading towards disk exhaustion.
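
For anyone following along, the kind of call at issue looks roughly like
this (a sketch): on an unpatched server it never terminates, while with the
patch it now raises an error:

select count(*) from generate_series(now(), 'infinity'::timestamptz, interval '1 day');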

Cheers,

Jeff


track_io_timing default setting

2021-12-09 Thread Jeff Janes
Can we change the default setting of track_io_timing to on?

I see a lot of questions, such as over at stackoverflow or
dba.stackexchange.com, where people ask for help with plans that would be
much more useful were this on.  Maybe they just don't know better; maybe
they can't turn it on because they are not a superuser.
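
For anyone unfamiliar with the setting, this is the kind of plan detail at
stake; a minimal sketch, assuming a session privileged enough to change the
setting (pgbench_accounts is just an example table):

set track_io_timing = on;
explain (analyze, buffers) select count(*) from pgbench_accounts;
-- with the setting on, the buffer statistics gain "I/O Timings: read=..." lines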

I can't imagine a lot of people who care much about its performance impact
will be running the latest version of PostgreSQL on
ancient/weird systems that have slow clock access. (And the few who do can
just turn it off for their system).

For systems with fast user-space clock access, I've never seen this setting
being turned on make a noticeable dent in performance.  Maybe I just never
tested enough in the most adverse scenario (which I guess would be a huge
FS cache, a small shared buffers, and a high CPU count with constant
churning of pages that hit the FS cache but miss shared buffers--not a
system I have handy to do a lot of tests with.)

Cheers,

Jeff


Bitmap reuse

2021-07-20 Thread Jeff Janes
For some queries PostgreSQL can spend most of its time creating the exact
same bitmap over and over.  For example, in the below case: (also attached
as a file because line-wrapping is going to make a mess of it)

drop table if exists foo;
create table foo (x daterange, i int, t text);
insert into foo select daterange(x::date,x::date+3), random()*3000 from
(select now()-interval '3 years'*random() as x from
generate_series(1,1e6))foo;
vacuum analyze foo;
create index ON  foo using gist  ( x);
create index ON  foo   ( i);
explain (analyze, buffers) select * from generate_series(1,20) g(i), foo
where x && '[2019-08-09,2019-08-11)' and g.i=foo.i;

  QUERY PLAN

---
 Nested Loop  (cost=170.21..3563.24 rows=33 width=54) (actual
time=1.295..24.890 rows=28 loops=1)
   Buffers: shared hit=543 read=8
   I/O Timings: read=0.040
   ->  Function Scan on generate_series g  (cost=0.00..0.20 rows=20
width=4) (actual time=0.007..0.014 rows=20 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=170.21..178.13 rows=2 width=50)
(actual time=1.238..1.240 rows=1 loops=20)
 Recheck Cond: ((i = g.i) AND (x &&
'[2019-08-09,2019-08-11)'::daterange))
 Heap Blocks: exact=28
 Buffers: shared hit=543 read=8
 I/O Timings: read=0.040
 ->  BitmapAnd  (cost=170.21..170.21 rows=2 width=0) (actual
time=1.234..1.234 rows=0 loops=20)
   Buffers: shared hit=515 read=8
   I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_i_idx  (cost=0.00..6.92
rows=333 width=0) (actual time=0.031..0.031 rows=327 loops=20)
 Index Cond: (i = g.i)
 Buffers: shared hit=55 read=8
 I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..161.78
rows=5000 width=0) (actual time=1.183..1.183 rows=3670 loops=20)
 Index Cond: (x && '[2019-08-09,2019-08-11)'::daterange)
 Buffers: shared hit=460

Note that the fast bitmap index scan is parameterized to the other side of
the nested loop, so it has to be recomputed each time, while the slow one is
parameterized to a constant, so it could in principle just be reused.

What kind of infrastructure would be needed to detect this case and reuse
that bitmap?

Cheers,

Jeff
drop table if exists foo;
create table foo (x daterange, i int, t text);
insert into foo select daterange(x::date,x::date+3), random()*3000 from (select 
now()-interval '3 years'*random() as x from generate_series(1,1e6))foo;
vacuum analyze foo;
create index ON  foo using gist  ( x);
create index ON  foo   ( i);
explain (analyze, buffers) select * from generate_series(1,20) g(i), foo where 
x && '[2019-08-09,2019-08-11)' and g.i=foo.i;
\q
  QUERY PLAN
   
---
 Nested Loop  (cost=170.21..3563.24 rows=33 width=54) (actual 
time=1.295..24.890 rows=28 loops=1)
   Buffers: shared hit=543 read=8
   I/O Timings: read=0.040
   ->  Function Scan on generate_series g  (cost=0.00..0.20 rows=20 width=4) 
(actual time=0.007..0.014 rows=20 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=170.21..178.13 rows=2 width=50) (actual 
time=1.238..1.240 rows=1 loops=20)
 Recheck Cond: ((i = g.i) AND (x && 
'[2019-08-09,2019-08-11)'::daterange))
 Heap Blocks: exact=28
 Buffers: shared hit=543 read=8
 I/O Timings: read=0.040
 ->  BitmapAnd  (cost=170.21..170.21 rows=2 width=0) (actual 
time=1.234..1.234 rows=0 loops=20)
   Buffers: shared hit=515 read=8
   I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_i_idx  (cost=0.00..6.92 rows=333 
width=0) (actual time=0.031..0.031 rows=327 loops=20)
 Index Cond: (i = g.i)
 Buffers: shared hit=55 read=8
 I/O Timings: read=0.040
   ->  Bitmap Index Scan on foo_x_idx  (cost=0.00..161.78 rows=5000 
width=0) (actual time=1.183..1.183 rows=3670 loops=20)
 Index Cond: (x && '[2019-08-09,2019-08-11)'::daterange)
 Buffers: shared hit=460
 Planning:
   Buffers: shared hit=47 read=4
   I/O Timings: read=0.034
 Planning Time: 0.380 ms
 Execution Time: 24.937 ms
(24 rows)

Time: 26.094 ms


Re: SQL-standard function body

2021-04-22 Thread Jeff Janes
On Wed, Apr 7, 2021 at 3:55 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

>
> Committed.  Thanks!
>
>
This commit breaks line continuation prompts for unbalanced parentheses in
the psql binary.  Skimming through this thread, I don't see that this is
intentional or has been noticed before.

with psql -X

Before:

jjanes=# asdf (
jjanes(#

Now:

jjanes=# asdf (
jjanes-#

I've looked through the parts of the commit that change psql, but didn't
see an obvious culprit.

Cheers,

Jeff


DETAIL for wrong scram password

2021-02-27 Thread Jeff Janes
When md5 password authentication fails, the server log file has a helpful
detail to say why, usually one of:

DETAIL:  Role "none" does not exist.
DETAIL:  User "jjanes" has no password assigned.
DETAIL:  User "jjanes" has an expired password.
DETAIL:  Password does not match for user "jjanes".

But for scram authentication, only the first three of these will be
reported when applicable.  If the password is simply incorrect, then you do
get a DETAIL line reporting which line of pg_hba was used, but you don't
get a DETAIL line reporting the reason for the failure.  It is pretty
unfriendly to make the admin guess what the absence of a DETAIL is supposed
to mean. And as far as I can tell, this is not intentional.

Note that in one case you do get the "does not match" line.  That is if the
user has a scram password assigned and the hba specifies plain-text
'password' as the method.  So if the absence of the DETAIL is intentional,
it is not internally consistent.

The attached patch fixes the issue.  I don't know if this is the correct
location to be installing the message, maybe verify_client_proof should be
doing it instead.  I am also not happy to be testing 'doomed' here, but it
was already checked a few lines up so I didn't want to go to lengths to
avoid doing it here too.

Cheers,

Jeff


scram_password_mismatch.patch
Description: Binary data


Re: memory leak in auto_explain

2021-02-01 Thread Jeff Janes
On Mon, Feb 1, 2021 at 6:09 PM Jeff Janes  wrote:

>
>
> create or replace function gibberish(int) returns text language SQL as $_$
> select left(string_agg(md5(random()::text),''),$1) from
> generate_series(0,$1/32) $_$;
>
> create table j1 as select x, md5(random()::text) as t11, gibberish(1500)
> as t12 from generate_series(1,20e6) f(x);
>

I should have added, found it on HEAD, verified it also in 12.5.

Cheers,

Jeff

>


memory leak in auto_explain

2021-02-01 Thread Jeff Janes
I accidentally tried to populate a test case
while auto_explain.log_min_duration was set to
zero.  auto_explain.log_nested_statements was also on.

create or replace function gibberish(int) returns text language SQL as $_$
select left(string_agg(md5(random()::text),''),$1) from
generate_series(0,$1/32) $_$;

create table j1 as select x, md5(random()::text) as t11, gibberish(1500) as
t12 from generate_series(1,20e6) f(x);

I got logorrhea of course, but I also got a memory leak into the SQL
function context:

  TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
PortalContext: 16384 total in 5 blocks; 5328 free (1 chunks); 11056
used: 
  ExecutorState: 4810120 total in 13 blocks; 4167160 free (74922
chunks); 642960 used
SQL function: 411058232 total in 60 blocks; 4916568 free (4
chunks); 406141664 used: gibberish

The memory usage grew until OOM killer stepped in.

Cheers,

Jeff


Re: BUG #16079: Question Regarding the BUG #16064

2020-12-21 Thread Jeff Janes
On Sun, Dec 20, 2020 at 7:58 PM Stephen Frost  wrote:

>
> * Magnus Hagander (mag...@hagander.net) wrote:
>
>
Changed from bugs to hackers.


> > For the old plaintext "password" method, we log a warning when we parse
> the
> > configuration file.
>

Like Stephen, I don't see such a warning getting logged.


> >
> > Maybe we should do the same for LDAP (and RADIUS)? This seems like a
> better
> > place to put it than to log it at every time it's received?
>
> A dollar short and a year late, but ... +1.


I would suggest going further.  I would make the change on the client side,
and have libpq refuse to send unhashed passwords without having an
environment variable set which allows it.  (Also, change the client
behavior so it defaults to verify-full when PGSSLMODE is not set.)

What is the value of logging on the server side?  I can change the setting
from password to md5 or from ldap to gss, when I notice the log message.
But once compromised or during a MITM attack, the bad guy will just set it
back to the unsafe form and the client will silently go along with it.

Cheers,

Jeff


Re: Supporting = operator in gin/gist_trgm_ops

2020-11-15 Thread Jeff Janes
On Sat, Nov 14, 2020 at 12:31 AM Alexander Korotkov 
wrote:

>
> I went through and revised this patch.  I made the documentation
> statement less categorical.  pg_trgm gist/gin indexes might have lower
> performance of equality operator search than B-tree.  So, we can't
> claim the B-tree index is always not needed.  Also, simple comparison
> operators are <, <=, >, >=, and they are not supported.
>

Is "simple comparison" here a well-known term of art?  If I read the doc as
committed (which doesn't include the sentence above), and if I didn't
already know what it was saying, I would be left wondering which
comparisons those are.  Could we just say "inequality operators"?
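
For concreteness, a sketch of the distinction in question (table and column
names are placeholders):

create extension if not exists pg_trgm;
create index on t using gin (col gin_trgm_ops);
select * from t where col = 'abc';   -- equality: the trigram index can now be used
select * from t where col < 'abc';   -- a "simple comparison": still not supported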

Cheers,

Jeff


moving aggregate bad error message

2020-09-06 Thread Jeff Janes
I was wondering if I could just add minvfunc, and have the rest of the m*
functions be assumed to be the same as their non-moving counterparts.
Apparently the answer is 'no'.  But in the process, I found a bad error
message.  When omitting mfinalfunc when there is a finalfunc, I get the
error:

"ERROR:  moving-aggregate implementation returns type jj_state, but plain
implementation returns type jj_state."  A rather peculiar
complaint, analogous to the infamous "something failed: Success".

Looking at the code, it seems we are testing rettype != finaltype, but
reporting aggmTransType and aggTransType.  Why aren't we reporting what we
are testing?

With the attached patch, it gives the more sensible "ERROR:
 moving-aggregate implementation returns type jj_state, but plain
implementation returns type numeric."

Cheers,

Jeff
drop type jj_state cascade;

CREATE TYPE jj_state AS ( x numeric, y numeric);


CREATE OR REPLACE FUNCTION jj_transition(
  state jj_state,
  val numeric
) RETURNS jj_state AS $$
BEGIN
  RETURN NULL::jj_state;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION jj_final(
  state jj_state
) RETURNS numeric AS $$
BEGIN
  RETURN 5.5;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION jj_inverse(state jj_state, val numeric)
RETURNS jj_state AS $$
BEGIN
  RETURN NULL::jj_state;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE jj(numeric) (
  sfunc  = jj_transition,
  stype  = jj_state,
  finalfunc  = jj_final,
  initcond   = '(0,0)',
  msfunc = jj_transition,
  minvfunc   = jj_inverse,
  mstype = jj_state
  -- mfinalfunc = average_final
);


moving_agg_error.patch
Description: Binary data


Re: Autovac cancellation is broken in v14

2020-08-27 Thread Jeff Janes
On Thu, Aug 27, 2020 at 3:10 PM Jeff Janes  wrote:

> If I create a large table with "CREATE TABLE ... AS SELECT ... from
> generate_series(1,3e7)" with no explicit transactions, then once it is done
> I wait for autovac to kick in, then when I try to build an index on that
> table (or drop the table) the autovac doesn't go away on its own.
>

After a bit more poking at this, I think we are checking if we ourselves
are an autovac process, not doing the intended check of whether the other
guy is one.

Where would be a good spot to add a regression test for this?
"isolation_regression" ?
Cheers,

Jeff


fix_autovac_cancel.patch
Description: Binary data


Autovac cancellation is broken in v14

2020-08-27 Thread Jeff Janes
If I create a large table with "CREATE TABLE ... AS SELECT ... from
generate_series(1,3e7)" with no explicit transactions, then once it is done
I wait for autovac to kick in, then when I try to build an index on that
table (or drop the table) the autovac doesn't go away on its own.
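
A minimal sketch of that sequence (names and sizes chosen arbitrarily):

create table big as select x from generate_series(1,3e7) f(x);
-- wait for autovacuum to start on "big", then in another session:
create index on big (x);   -- should cancel the autovacuum, but instead it just blocks behind it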

Bisects down to:

commit 5788e258bb26495fab65ff3aa486268d1c50b123
Author: Andres Freund 
Date:   Wed Jul 15 15:35:07 2020 -0700

snapshot scalability: Move PGXACT->vacuumFlags to
ProcGlobal->vacuumFlags.

Which makes sense given the parts of the code this touches, although I
don't understand exactly what the problem is.  The problem persists in HEAD
(77c7267c37).

Cheers,

Jeff


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-26 Thread Jeff Janes
On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila  wrote:


>  I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.
>


I'm getting compiler warnings now; src/backend/storage/file/sharedfileset.c
line 288 needs to be:

bool        found PG_USED_FOR_ASSERTS_ONLY = false;

Cheers,

Jeff


tab completion of IMPORT FOREIGN SCHEMA

2020-08-09 Thread Jeff Janes
I use IMPORT FOREIGN SCHEMA a bit to set up systems for testing.  But not
enough that I can ever remember whether INTO or FROM SERVER comes first in
the syntax.
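
For reference, the word order in question looks like this (all the names
here are placeholders):

import foreign schema remote_schema from server my_server into local_schema;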

Here is an improvement to the tab completion, so I don't have to keep
looking it up in the docs.

It accidentally (even before this patch) completes "IMPORT FOREIGN SCHEMA"
with a list of local schemas.  This is probably wrong, but I find this
convenient as I often do this in a loop-back setup where the list of
foreign schemas would be the same as the local ones.  So I don't countermand
that behavior here.

Cheers,

Jeff


foreign_schema_tab_complete.patch
Description: Binary data


Re: estimation problems for DISTINCT ON with FDW

2020-07-25 Thread Jeff Janes
On Fri, Jul 3, 2020 at 5:50 PM Tom Lane  wrote:

>
> OK, I'll go ahead and push the patch I proposed yesterday.
>

Thank you.  I tested 12_STABLE with my real queries on the real data set,
and the "hard coded" estimate of 200 distinct rows (when use_remote_estimte
is turned back on) is enough to get rid of the worst plans I was seeing in
12.3.

Cheers,

Jeff


Re: Loaded footgun open_datasync on Windows

2020-07-23 Thread Jeff Janes
On Fri, Sep 14, 2018 at 3:32 AM Michael Paquier  wrote:

> On Fri, Sep 14, 2018 at 08:43:18AM +0200, Laurenz Albe wrote:
>
> > If it turns out not to break anything, would you consider backpatching?
> > On the one hand it fixes a bug, on the other hand it affects all
> > frontend executables...
>
> Yeah, for this reason I would not do a backpatch.  I have a very hard
> time to believe that any frontend tools on Windows developed by anybody
> rely on files to be opened only by a single process, still if they do
> they would be surprised to see a change of behavior after a minor
> update in case they rely on the concurrency limitations.
>

Reviving an old thread here.

Could it be back-patched in some pg_test_fsync specific variant?  I
don't think we should just ignore the fact that pg_test_fsync on Windows is
unfit for its intended purpose on 4 still-supported versions.


> > I wonder why nobody noticed the problem in pg_test_fsync earlier.
> > Is it that people running Windows care less if their storage is
> > reliable?
>
> likely so.
>

I have noticed this before, but since it wasn't a production machine I just
shrugged it off as being a hazard of using consumer-grade stuff; it didn't
seem to be worth investigating further.

 Cheers,

Jeff


estimation problems for DISTINCT ON with FDW

2020-06-28 Thread Jeff Janes
If I use the attached sql file to set up the database with loop-back
postgres_fdw, and then turn on use_remote_estimate for this query:

select distinct on (id) id, z from fgn.priority order by id, priority desc,z

It issues two queries for the foreign estimate, one with a sort and one
without:

EXPLAIN SELECT id, priority, z FROM public.priority

EXPLAIN SELECT id, priority, z FROM public.priority ORDER BY id ASC NULLS
LAST, priority DESC NULLS FIRST, z ASC NULLS LAST

It doesn't cost out the plan of pushing the DISTINCT ON down to the foreign
side, which is probably the best way to run the query.  I guess it makes
sense that FDW machinery in general doesn't want to try to push a
PostgreSQL specific construct.

But much worse than that, it horribly misestimates the number of unique rows
it will get back, having never asked the remote side for an estimate of
that.

 Result  (cost=100.51..88635.90 rows=1 width=16)
   ->  Unique  (cost=100.51..88635.90 rows=1 width=16)
 ->  Foreign Scan on priority  (cost=100.51..86135.90 rows=100
width=16)

Where does it come up with the idea that these 1,000,000 rows will
DISTINCT/Unique down to just 1 row?   I can't find the place in the code
where that happens.  I suspect it is happening somewhere in the core code
based on data fed into it by postgres_fdw, not in postgres_fdw itself.

This leads to horrible plans if the DISTINCT ON is actually in a subquery
which is joined to other tables, for example.

If you don't use the remote estimates, it at least comes up with a roughly
sane estimate of 200 distinct rows, which is enough to inhibit selection of
the worst plans. Why does an uninformative remote estimate do so much worse
than no remote estimate at all?

Of course I could just disable remote estimates for this table, but then
other queries that use the table without DISTINCT ON suffer.  Another
solution is to ANALYZE the foreign table, but that opens up a can of worms
of its own.

I see this behavior in all supported or in-development versions.

Cheers,

Jeff
create table priority as select floor(random()*100)::int as id, 
floor(random()*10)::int as priority, random() as z from 
generate_series(1,100)f(id) order by random();
vacuum ANALYZE priority;
create index on priority (id, priority,z);
create extension postgres_fdw ;
create server fdw foreign data wrapper postgres_fdw;
create user MAPPING FOR CURRENT_USER SERVER fdw ;
create schema fgn;
import foreign schema public from server fdw into fgn;
explain select distinct on (id) id, z from fgn.priority order by id, priority 
desc,z;
alter foreign table fgn.priority options (use_remote_estimate 'true');
explain select distinct on (id) id, z from fgn.priority order by id, priority 
desc,z;


max_slot_wal_keep_size comment in postgresql.conf

2020-05-26 Thread Jeff Janes
In postgresql.conf, it says:

#max_slot_wal_keep_size = -1  # measured in bytes; -1 disables

I don't know if that is describing the dimension of this parameter or the
units of it, but the default units for it are megabytes, not individual
bytes, so I think it is pretty confusing.
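
To illustrate the confusion: a bare number is taken as megabytes, so, as a
sketch, these two settings are equivalent:

alter system set max_slot_wal_keep_size = 10240;    -- interpreted as 10240 MB, not bytes
alter system set max_slot_wal_keep_size = '10GB';   -- the same thing, with an explicit unit
select pg_reload_conf();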

Cheers,

Jeff


Re: SEQUENCE values (duplicated) in some corner cases when crash happens

2020-05-14 Thread Jeff Janes
On Wed, May 6, 2020 at 1:52 PM Jeremy Schneider  wrote:


> The behavior we're observing is that a nextval() call in a committed
> transaction is not crash-safe. This was discovered because some
> applications were using nextval() to get a guaranteed unique sequence
> number [or so they thought], then the application did some processing
> with the value and later stored it in a relation of the same database.
>
> The nextval() number was not used until the transaction was committed -
>

I don't know what this line means.  You said it was stored in a relation;
surely that needs to have happened through some command which preceded the
commit chronologically, though formally they may have happened atomically.


> but then the fact of a value being generated, returned and committed was
> lost on crash. The nextval() call used in isolation did not seem to
> provide durability.
>

Are you clarifying the original complaint, or is this a new, different
complaint? Vini's test cases don't include any insertions.  Do you have
test cases that can reproduce your complaint?

Cheers,

Jeff


Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-10 Thread Jeff Janes
On Thu, Apr 9, 2020 at 10:04 AM Ashutosh Bapat 
wrote:

> On Thu, Apr 9, 2020 at 12:03 PM Etsuro Fujita 
> wrote:
> >
> > On Thu, Apr 9, 2020 at 2:36 PM Tom Lane  wrote:
> > > Etsuro Fujita  writes:
> > > > Yeah, partition_bounds_merge() is currently called only from
> > > > try_partitionwise_join(), which guarantees that the strategies are
> the
> > > > same.
> >
> > > If there's only one caller and there's not likely to ever be more,
> > > then I tend to agree that you don't need the assertion.
> >
> > It seems unlikely that partition_bounds_merge() will be called from
> > more places in the foreseeable future, so I'd still vote for removing
> > the assertion.
>
> When I wrote that function, I had UNION also in mind. A UNION across
> multiple partitioned relations will be partitioned if we can merge the
> partition bounds in a sensible manner. Of course the current structure
> of that function looks more purposed for join, but it's not difficult
> to convert it to be used for UNION as well. In that case those set of
> functions will have many more callers. So, I will vote to keep that
> assertion now that we have it there.
>

In that case, we really should add the PG_USED_FOR_ASSERTS_ONLY to make the
compiler happy.

Cheers,

Jeff


Re: BUG #16345: ts_headline does not find phrase matches correctly

2020-04-09 Thread Jeff Janes
redirected to hackers.

On Wed, Apr 8, 2020 at 11:02 PM Tom Lane  wrote:

>
> In short then, I propose applying 0001-0006.  I'm not quite sure
> if we should back-patch, or just be content to fix this in HEAD.
> But there's definitely an argument that this has been broken since
> we added phrase search (in 9.6) and deserves to be back-patched.
>
>
Thanks for fixing this.

I am getting a compiler warning, both with and without --enable-cassert.

wparser_def.c: In function 'prsd_headline':
wparser_def.c:2530:2: warning: 'pose' may be used uninitialized in this
function [-Wmaybe-uninitialized]
  mark_fragment(prs, highlightall, bestb, beste);
  ^
wparser_def.c:2384:6: note: 'pose' was declared here
  int   pose,


It makes no sense to me that pose could be used uninitialized on a line
that doesn't use pose at all, so maybe it is a compiler bug or something.

PostgreSQL 13devel-c9b0c67 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit

Cheers,

Jeff


Add absolute value to dict_int

2020-02-27 Thread Jeff Janes
I've seen a few requests on how to make FTS search on the absolute value of
integers.  This question is usually driven by the fact that the text search
parser interprets a separating hyphen ("partnumber-987") as a minus sign.

There is currently no good answer for this that doesn't involve C coding.
I think this feature has a natural and trivial home in the dict_int
extension, so attached is a patch that does that.
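
Usage would look something like the following sketch; the option name
(absval) and the exact output are assumptions for illustration, not
necessarily what the attached patch uses:

create extension if not exists dict_int;
alter text search dictionary intdict (absval = true);   -- hypothetical option name
select ts_lexize('intdict', '-987');                    -- would then return {987}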

There are no changes to the extension creation scripts, so there is no need
to bump the version.  And I think the convention is that we don't bump the
version just to signify a change which implements a new feature when that
doesn't change the creation scripts.

Cheers,

Jeff


dict_int_absval_v001.patch
Description: Binary data


ALTER TEXT SEARCH DICTIONARY tab completion

2020-02-27 Thread Jeff Janes
"ALTER TEXT SEARCH DICTIONARY foobar" can be followed by an open
parenthesis, but that is not offered in tab completion.  That is useful,
because otherwise I have to look up the docs to see if I need a SET or
OPTION(S) or WITH or something before it, just to discover I don't.
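
That is, after the dictionary name the only thing that can follow is a
parenthesized option list, along the lines of this sketch (dictionary and
option names are placeholders):

alter text search dictionary my_dict (StopWords = english);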

The attached one-line patch adds "(".

We can't go beyond that, as available options for each dictionary are not
known in advance.

Cheers,

Jeff


alter_dict_tab_complete.patch
Description: Binary data


bad logging around broken restore_command

2020-02-05 Thread Jeff Janes
If the restore command claims to have succeeded, but fails to have created
a file with the right name (due to typos or escaping or quoting issues, for
example), no complaint is issued at the time, and it then fails later with
a relatively uninformative error message like "could not locate required
checkpoint record".

if (rc == 0)
{
/*
 * command apparently succeeded, but let's make sure the file is
 * really there now and has the correct size.
 */
if (stat(xlogpath, &stat_buf) == 0)
{..
}
else
{
/* stat failed */
if (errno != ENOENT)
ereport(FATAL,
(errcode_for_file_access(),
 errmsg("could not stat file \"%s\": %m",
xlogpath)));
}

I don't see why ENOENT is thought to deserve the silent treatment.  It
seems weird that success gets logged ("restored log file \"%s\" from
archive"), but one particular type of unexpected failure does not.

I've attached a patch which emits a LOG message for ENOENT.  The exact
wording doesn't matter to me, I'm sure someone can improve it.
Alternatively, perhaps the message a few lines down, "could not restore
file", could get promoted from DEBUG2 to LOG when rc indicates success.
But I don't think we need any more messages which say "Something went
wrong: success".

This is based on the question at
https://stackoverflow.com/questions/60056597/couldnt-restore-postgres-v11-from-pg-basebackup.
But this isn't the first time I've seen similar confusion.

Cheers,

Jeff


restore_log.patch
Description: Binary data


Why is pq_begintypsend so slow?

2020-01-11 Thread Jeff Janes
I was using COPY recently and was wondering why BINARY format is not much
(if any) faster than the default format.  Once I switched from mostly
exporting ints to mostly exporting double precisions (7e6 rows of 100
columns, randomly generated), it was faster, but not by as much as I
intuitively thought it should be.

Running 'perf top' to profile a "COPY BINARY .. TO '/dev/null'" on a AWS
m5.large machine running Ubuntu 18.04, with self compiled PostgreSQL:

PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

I saw that the hotspot was pq_begintypsend at 20%, which was twice the
percentage of the next place winner (AllocSetAlloc).  If I drill down into
the function, I see something like the below.  I don't really speak
assembly, but usually when I see an assembly instruction being especially
hot and not being the innermost instruction in a loop, I blame it on CPU
cache misses.  But everything being touched here should already be well
cached, since initStringInfo has just got done setting it up.  And if not
for that, then by the 2nd invocation of appendStringInfoCharMacro it
certainly should be in the cache, yet that one is even slower than the 1st
appendStringInfoCharMacro.

Why is this such a bottleneck?

pq_begintypsend  /usr/local/pgsql/bin/postgres

 0.15 |push  %rbx
 0.09 |mov  %rdi,%rbx
   |   initStringInfo(buf);
 3.03 |callq initStringInfo
   |   /* Reserve four bytes for the bytea length word */
   |   appendStringInfoCharMacro(buf, '\0');
   |movslq 0x8(%rbx),%rax
 1.05 |lea  0x1(%rax),%edx
 0.72 |cmp  0xc(%rbx),%edx
   |jge  b0
 2.92 |mov  (%rbx),%rdx
   |movb  $0x0,(%rdx,%rax,1)
13.76 |mov  0x8(%rbx),%eax
 0.81 |mov  (%rbx),%rdx
 0.52 |add  $0x1,%eax
 0.12 |mov  %eax,0x8(%rbx)
 2.85 |cltq
 0.01 |movb  $0x0,(%rdx,%rax,1)
   |   appendStringInfoCharMacro(buf, '\0');
10.65 |movslq 0x8(%rbx),%rax
   |lea  0x1(%rax),%edx
 0.90 |cmp  0xc(%rbx),%edx
   |jge  ca
 0.54 | 42:  mov  (%rbx),%rdx
 1.84 |movb  $0x0,(%rdx,%rax,1)
13.88 |mov  0x8(%rbx),%eax
 0.03 |mov  (%rbx),%rdx
   |add  $0x1,%eax
 0.33 |mov  %eax,0x8(%rbx)
 2.60 |cltq
 0.06 |movb  $0x0,(%rdx,%rax,1)
   |   appendStringInfoCharMacro(buf, '\0');
 3.21 |movslq 0x8(%rbx),%rax
 0.23 |lea  0x1(%rax),%edx
 1.74 |cmp  0xc(%rbx),%edx
   |jge  e0
 0.21 | 67:  mov  (%rbx),%rdx
 1.18 |movb  $0x0,(%rdx,%rax,1)
 9.29 |mov  0x8(%rbx),%eax
 0.18 |mov  (%rbx),%rdx
   |add  $0x1,%eax
 0.19 |mov  %eax,0x8(%rbx)
 3.14 |cltq
 0.12 |movb  $0x0,(%rdx,%rax,1)
   |   appendStringInfoCharMacro(buf, '\0');
 5.29 |movslq 0x8(%rbx),%rax
 0.03 |lea  0x1(%rax),%edx
 1.45 |cmp  0xc(%rbx),%edx
   |jge  f6
 0.41 | 8c:  mov  (%rbx),%rdx

Cheers,

Jeff


Re: color by default

2020-01-03 Thread Jeff Janes
On Tue, Dec 31, 2019 at 8:35 AM Tom Lane  wrote:

> Peter Eisentraut  writes:
> > With the attached patch, I propose to enable the colored output by
> > default in PG13.
>
> FWIW, I shall be setting NO_COLOR permanently if this gets committed.
> I wonder how many people there are who actually *like* colored output?
> I find it to be invariably less readable than plain B text.
>
>
I find color massively useful for grep and its variants, where the hit can
show up anywhere on the line.  It was also kind of useful for git,
especially "git grep", but on my current system git's colorizing seems
hopelessly borked, so I had to turn it off.

But I turned PG_COLOR on and played with many commands, and must say I
don't really see much of a point.  When most of these command fail, they
only generate a few lines of output, and it isn't hard to spot the error
message.  When pg_restore goes wrong, you get a lot of messages but
colorizing them isn't really helpful.  I don't need 'error' to show up in
red in order to know that I have a lot of errors, especially since the
lines which do report errors always have 'error' as the 2nd word on the
line, where it isn't hard to spot.  If it could distinguish the important
errors from unimportant errors, that would be more helpful.  But if it
could reliably do that, why print the unimportant ones at all?

It doesn't seem like this is useful enough to have it on by default, and
without it being on by default there is no point in having NO_COLOR to turn
it off.  There is something to be said for going with the flow, but the
"emerging standard" seems like it has quite a bit further to emerge before
I think that would be an important reason.

Cheers,

Jeff


Re: [PATCH] Increase the maximum value track_activity_query_size

2020-01-02 Thread Jeff Janes
On Mon, Dec 30, 2019 at 6:46 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

> On Tue, Dec 31, 2019 at 9:38 AM Tom Lane  wrote:
>
> >
> > This doesn't seem like a reason not to allow a higher limit, like a
> > megabyte or so, but I'm not sure that pushing it to the moon would be
> > wise.
> >
>
>
> Just to get a mental handle on the size of queries we might be
> allowing before truncation, I did some very rough arithmetic on what
> well known texts might fit in a megabyte. By my calculations you could
> fit about four Animal Farms or one Madame Bovary in about a megabyte.
> So I think that seems like more than enough :-). (My mind kinda
> explores at the thought of debugging a query as long as Animal Farm.)
>
>
I've seen some pretty big IN-lists and VALUES lists.   They aren't so hard
to debug once you tune out iterations 3 through N-3 of the list members.
Unless they are hard to debug for other reasons.  In these cases, if we
don't just allow bigger texts in general, it would be helpful to instead
"truncate" from the middle, preserving both the beginning and the end of
the query text.
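
For what it's worth, raising the limit today is a one-liner for those who
need it; a sketch, with an arbitrarily chosen value:

alter system set track_activity_query_size = 65536;   -- in bytes; takes effect after a restart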

Cheers,

Jeff


Re: vacuum verbose detail logs are unclear (show debug lines at *start* of each stage?)

2019-12-29 Thread Jeff Janes
On Fri, Dec 20, 2019 at 12:11 PM Justin Pryzby  wrote:

> This is a usability complaint.  If one knows enough about vacuum and/or
> logging, I'm sure there's no issue.
>


> | 11  DEBUG:  "t": found 999 removable, 999 nonremovable row versions in 9
> out of 9 pages
>

I agree the mixture of pre-action and after-action reporting is rather
confusing sometimes.  I'm more concerned about what the user sees in their
terminal, though, rather than the server's log file.

Also, the above quoted line is confusing.  It makes it sound like it found
removable items, but didn't actually remove them.  I think that that is
taking grammatical parallelism too far.  How about something like:

DEBUG:  "t": removed 999 row versions, found 999 nonremovable row versions
in 9 out of 9 pages

Also, I'd appreciate a report on how many hint-bits were set, and how many
pages were marked all-visible and/or frozen.  When I do a manual vacuum, it
is more often for those purposes than it is for removing removable rows
(which autovac generally does a good enough job of).

Also, it is not so clear that "nonremovable rows" includes both live and
recently dead rows, although hopefully reading the next line will clarify
that for a person who has enough background knowledge.



> | 12  DETAIL:  0 dead row versions cannot be removed yet, oldest xmin:
> 130886944
> | 13  There were 0 unused item identifiers.
> | 14  Skipped 0 pages due to buffer pins, 0 frozen pages.
>

It is a bit weird that we don't report skipped all-visible pages here.  It
was implicitly reported in the "in 9 out of 9 pages" message, but I think
it should be reported explicitly as well.

Cheers,

Jeff


Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-29 Thread Jeff Janes
On Tue, Dec 24, 2019 at 12:11 AM Robert Haas  wrote:

> On Sat, Dec 21, 2019 at 1:25 PM Tom Lane  wrote:
> > > What is the overhead here except the memory consumption?
> >
> > The time to copy those strings out of shared storage, any time
> > you query pg_stat_activity.
>
> It seems like you're masterminding this, and I don't know why. It
> seems unlikely that anyone will raise the value unless they have very
> long queries, and if those people would rather pay the overhead of
> copying more data than have their queries truncated, who are we to
> argue?
>

+1

Cheers,

Jeff


Re: [HACKERS] Proposal to add work_mem option to postgres_fdw module

2019-12-17 Thread Jeff Janes
On Fri, Sep 7, 2018 at 9:17 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 05/09/2018 18:46, Peter Eisentraut wrote:
> > On 01/09/2018 06:33, Shinoda, Noriyoshi (PN Japan GCS Delivery) wrote:
> >> Certainly the PQconndefaults function specifies Debug flag for the
> "options" option.
> >> I think that eliminating the Debug flag is the simplest solution.
> >> For attached patches, GUC can be specified with the following syntax.
> >>
> >> CREATE SERVER remsvr1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host
> 'host 1', ..., options '-c work_mem=64MB -c geqo=off');
> >> ALTER SERVER remsv11 OPTIONS (SET options '-c work_mem=64MB -c
> geqo=off');
> >>
> >> However, I am afraid of the effect that this patch will change the
> behavior of official API PQconndefaults.
> >
> > It doesn't change the behavior of the API, it just changes the result of
> > the API function, which is legitimate and the reason we have the API
> > function in the first place.
> >
> > I think this patch is fine.  I'll work on committing it.
>
> I have committed just the libpq change.  The documentation change was
> redundant, because the documentation already stated that all libpq
> options are accepted.  (Arguably, the documentation was wrong before.)
>

Since the effect of this commit was to make postgres_fdw actually comply
with its documentation,
should this have been back-patched?  Is there a danger in backpatching this
change to libpq to older versions?

This seems like more of a bug fix than an added feature.

Cheers,

Jeff


Re: psql's \watch is broken

2019-12-13 Thread Jeff Janes
On Fri, Dec 13, 2019 at 9:45 PM Michael Paquier  wrote:

> On Sat, Dec 14, 2019 at 12:09:51AM +0100, Fabien COELHO wrote:
> >
> >> explain (analyze) select * from pgbench_accounts \watch 1
> >>
> >> It behaves as expected.  But once I break out of the loop with ctrl-C,
> then
> >> if I execute the same thing again it executes the command once, but
> shows
> >> no output and doesn't loop.  It seems like some flag is getting set with
> >> ctrl-C, but then never gets reset.
> >>
> >>
> >> I've not dug into the code itself; I just bisected it.
> >
> > Thanks for the report. I'll look into it.
>
> Looked at it already.   And yes, I can see the difference.  This comes
> from the switch from cancel_pressed to CancelRequested in psql,
> especially PSQLexecWatch() in this case.  And actually, now that I
> look at it, I think that we should simply get rid of cancel_pressed in
> psql completely and replace it with CancelRequested.  This also
> removes the need of having cancel_pressed defined in print.c, which
> was not really wanted originally.  Attached is a patch which addresses
> the issue for me, and cleans up the code while on it.  Fabien, Jeff,
> can you confirm please?
> --
> Michael
>

This works for me.

Thanks,

Jeff


psql's \watch is broken

2019-12-13 Thread Jeff Janes
If I do something like this:

explain (analyze) select * from pgbench_accounts \watch 1

It behaves as expected.  But once I break out of the loop with ctrl-C, then
if I execute the same thing again it executes the command once, but shows
no output and doesn't loop.  It seems like some flag is getting set with
ctrl-C, but then never gets reset.

It was broken in this commit:

commit a4fd3aa719e8f97299dfcf1a8f79b3017e2b8d8b
Author: Michael Paquier 
Date:   Mon Dec 2 11:18:56 2019 +0900

Refactor query cancellation code into src/fe_utils/


I've not dug into the code itself; I just bisected it.

Cheers,

Jeff


Re: disable only nonparallel seq scan.

2019-12-13 Thread Jeff Janes
On Tue, Dec 10, 2019 at 1:32 PM Robert Haas  wrote:

> On Sun, Dec 8, 2019 at 1:24 PM Jeff Janes  wrote:
> > Is there a way to force a meaningful parallel seq scan, or at least the
> planning of one, when the planner wants a non-parallel one?
> >
> > Usually I can do things like this with enable_* settings, but if I `set
> enable_seqscan to off`, it penalizes the parallel seq scan 8 times harder
> than it penalizes the non-parallel one, so the plan does not switch.
> >
> > If I set `force_parallel_mode TO on` then I do get a parallel plan, but
> it is a degenerate one which tells me nothing I want to know.
> >
> > If I `set parallel_tuple_cost = 0` (or in some cases to a negative
> number), I can force it switch, but that destroys the purpose, which is to
> see what the "would have been" plan estimates are for the parallel seq scan
> under the default setting of the cost parameters.
> >
> > I can creep parallel_tuple_cost downward until it switches, and then try
> to extrapolate back up, but this is tedious and not very reliable.
>
> I don't think there's a way to force this, but setting both
> parallel_setup_cost and parallel_tuple_cost to 0 seems to often be
> enough.
>

Yes, that is fine if I want the actual execution results.  And I patch
guc.c to allow negative settings, for when some extra persuasion is needed.

But here I want to see what the planner is thinking, and changing the *cost
settings changes that thinking.  So I want to force the planner to choose
the "next-best" plan under the original cost settings so I can see how far
away they are from each other.  I made a crude patch to add
enable_singleseqscan, which has been letting me get at this information now.

I'm not proposing to apply this particular patch to the code base, but I do
wonder if we can do something about this "dark spot" which no combination
of the current enable_* settings seems to be able to get at.

Cheers,

Jeff


enable_singleseqscan.patch
Description: Binary data


Re: Contention on LWLock buffer_content, due to SHARED lock(?)

2019-12-10 Thread Jeff Janes
On Mon, Dec 9, 2019 at 5:10 PM Jens-Wolfhard Schicke-Uffmann <
drahf...@gmx.de> wrote:

> Hi,
>
> today I observed (on a r5.24xlarge AWS RDS instance, i.e. 96 logical
> cores) lock contention on a buffer content lock due to taking of a
> SHARED lock (I think):
>

What version of PostgreSQL are you using?

Cheers,

Jeff


Re: Index corruption / planner issue with one table in my pg 11.6 instance

2019-12-09 Thread Jeff Janes
On Mon, Dec 9, 2019 at 1:00 PM Jeremy Finzel  wrote:

> I have a table with about 7 million records.  I had a query in which I
> needed 2 indexes added, one for a created timestamp field another for an id
> field; both very high cardinality.
>
> First I noticed the query would not use the timestamp index no matter what
> session config settings I used.  I finally created a temp table copy of the
> table and verified index is used.  Then I rebuilt the main table with
> VACUUM FULL and this caused the index to be used.
>

Were they built with CONCURRENTLY?  Do you have any long-open snapshots?

Cheers,

Jeff

>


disable only nonparallel seq scan.

2019-12-08 Thread Jeff Janes
Is there a way to force a meaningful parallel seq scan, or at least the
planning of one, when the planner wants a non-parallel one?

Usually I can do things like this with enable_* settings, but if I `set
enable_seqscan to off`, it penalizes the parallel seq scan 8 times harder
than it penalizes the non-parallel one, so the plan does not switch.

If I set `force_parallel_mode TO on` then I do get a parallel plan, but it
is a degenerate one which tells me nothing I want to know.

If I `set parallel_tuple_cost = 0` (or in some cases to a negative number),
I can force it switch, but that destroys the purpose, which is to see what
the "would have been" plan estimates are for the parallel seq scan under
the default setting of the cost parameters.

I can creep parallel_tuple_cost downward until it switches, and then try to
extrapolate back up, but this is tedious and not very reliable.

Cheers,

Jeff


Re: WAL archive is lost

2019-11-23 Thread Jeff Janes
On Fri, Nov 22, 2019 at 8:04 AM matsumura@fujitsu.com <
matsumura@fujitsu.com> wrote:

> Hi all
>
> I find a situation that WAL archive file is lost but any WAL segment file
> is not lost.
> It causes for archive recovery to fail. Is this behavior a bug?
>
> example:
>
>   WAL segment files
>   00010001
>   00010002
>   00010003
>
>   Archive files
>   00010001
>   00010003
>
>   Archive file 00010002 is lost but WAL segment files
>   is continuous. Recovery with archive (i.e. PITR) stops at the end of
>   00010001.
>

Will it not archive 00010002 eventually, like at the
conclusion of the next restartpoint?  Or does it get recycled/removed
without ever being archived?  Or does it just hang out forever in pg_wal?



> How to reproduce:
> - Set up replication (primary and standby).
> - Set [archive_mode = always] in standby.
> - WAL receiver exits (i.e. because primary goes down)
>   after receiver inserts the last record in some WAL segment file
>   before receiver notifies the segement file to archiver(create .ready
> file).
>

Do you have a trick for reliably achieving this last step?

Cheers,

Jeff


Coding in WalSndWaitForWal

2019-11-09 Thread Jeff Janes
in src/backend/replication/walsender.c, there is the section quoted below.
It looks like nothing interesting happens between the GetFlushRecPtr just
before the loop starts, and the one inside the loop the first time through
the loop.  If we want to avoid doing  CHECK_FOR_INTERRUPTS(); etc.
needlessly, then we should check the result of  GetFlushRecPtr and return
early if it is sufficiently advanced--before entering the loop.  If we
don't care, then what is the point of updating it twice with no meaningful
action in between?  We could just get rid of the section just before the
loop starts.  The current coding seems confusing, and increases traffic on
a potentially busy spin lock.



/* Get a more recent flush pointer. */
if (!RecoveryInProgress())
RecentFlushPtr = GetFlushRecPtr();
else
RecentFlushPtr = GetXLogReplayRecPtr(NULL);

for (;;)
{
long        sleeptime;

/* Clear any already-pending wakeups */
ResetLatch(MyLatch);

CHECK_FOR_INTERRUPTS();

/* Process any requests or signals received recently */
if (ConfigReloadPending)
{
ConfigReloadPending = false;
ProcessConfigFile(PGC_SIGHUP);
SyncRepInitConfig();
}

/* Check for input from the client */
ProcessRepliesIfAny();

/*
 * If we're shutting down, trigger pending WAL to be written out,
 * otherwise we'd possibly end up waiting for WAL that never gets
 * written, because walwriter has shut down already.
 */
if (got_STOPPING)
XLogBackgroundFlush();

/* Update our idea of the currently flushed position. */
if (!RecoveryInProgress())
RecentFlushPtr = GetFlushRecPtr();
else
RecentFlushPtr = GetXLogReplayRecPtr(NULL);

Cheers,

Jeff


Re: logical replication empty transactions

2019-11-09 Thread Jeff Janes
On Fri, Nov 8, 2019 at 8:59 PM Euler Taveira  wrote:

> Em seg., 21 de out. de 2019 às 21:20, Jeff Janes
>  escreveu:
> >
> > After setting up logical replication of a slowly changing table using
> the built in pub/sub facility, I noticed way more network traffic than made
> sense.  Looking into I see that every transaction in that database on the
> master gets sent to the replica.  99.999+% of them are empty transactions
> ('B' message and 'C' message with nothing in between) because the
> transactions don't touch any tables in the publication, only non-replicated
> tables.  Is doing it this way necessary for some reason?  Couldn't we hold
> the transmission of 'B' until something else comes along, and then if that
> next thing is 'C' drop both of them?
> >
> That is not optimal. Those empty transactions is a waste of bandwidth.
> We can suppress them if no changes will be sent. test_decoding
> implements "skip empty transaction" as you described above and I did
> something similar to it. Patch is attached.
>

Thanks.  I didn't think it would be that simple, because I thought we would
need some way to fake an acknowledgement for any dropped empty
transactions, to keep the LSN advancing and allow WAL to get recycled on
the master.  But it turns out the opposite.  While your patch drops the
network traffic by a lot, there is still a lot of traffic.  Now it is
keep-alives, rather than 'B' and 'C'.  I don't know why I am getting a few
hundred keep alives every second when the timeouts are at their defaults,
but it is better than several thousand 'B' and 'C'.

My setup here was just to create, publish, and subscribe to an inactive
dummy table, while having pgbench running on the master (with unpublished
tables).  I have not created an intentionally slow network, but I am
testing it over wifi, which is inherently kind of slow.
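
In case it helps to reproduce, the setup was essentially this (names are
illustrative):

-- on the publisher, alongside the usual pgbench tables:
create table dummy (id int primary key);
create publication dummy_pub for table dummy;
-- pgbench keeps running against the unpublished pgbench tables

-- on the subscriber:
create table dummy (id int primary key);
create subscription dummy_sub
    connection 'host=192.168.0.15 dbname=jjanes'
    publication dummy_pub;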

Cheers,

Jeff


Re: Logical replication wal sender timestamp bug

2019-11-08 Thread Jeff Janes
On Wed, Nov 6, 2019 at 2:15 AM Michael Paquier  wrote:

> On Tue, Nov 05, 2019 at 01:19:37PM +0900, Michael Paquier wrote:
> > On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
> >> Filling out the timestamp after the message has already been sent is
> taking
> >> "as late as possible" a little too far.  It results in every message
> having
> >> a zero timestamp, other than keep-alives which go through a different
> path.
> >
> > It seems to me that you are right: the timestamp is computed too
> > late.
>
> It is easy enough to reproduce the problem by setting for example
> logical replication between two nodes and pgbench to produce some
> load and then monitor pg_stat_subscription periodically.  However, it
> is a problem since logical decoding has been introduced (5a991ef) so
> committed your fix down to 9.4.
>

Thanks.  This column looks much more reasonable now.

Cheers,

Jeff


Re: cost based vacuum (parallel)

2019-11-04 Thread Jeff Janes
On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila  wrote:

> For parallel vacuum [1], we were discussing what is the best way to
> divide the cost among parallel workers but we didn't get many inputs
> apart from people who are very actively involved in patch development.
> I feel that we need some more inputs before we finalize anything, so
> starting a new thread.
>

Maybe I just don't have experience in the type of system that parallel
vacuum is needed for, but if there is any meaningful IO throttling which is
active, then what is the point of doing the vacuum in parallel in the first
place?

Cheers,

Jeff


Logical replication wal sender timestamp bug

2019-11-02 Thread Jeff Janes
While monitoring pg_stat_subscription, I noticed that last_msg_send_time
was usually NULL, which doesn't make sense.  Why would we have a message,
but not know when it was sent?

Looking in src/backend/replication/walsender.c, there is this:

/* output previously gathered data in a CopyData packet */
pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);

/*
 * Fill the send timestamp last, so that it is taken as late as possible.
 * This is somewhat ugly, but the protocol is set as it's already used for
 * several releases by streaming physical replication.
 */
resetStringInfo(&tmpbuf);
now = GetCurrentTimestamp();
pq_sendint64(&tmpbuf, now);
memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
       tmpbuf.data, sizeof(int64));

Filling out the timestamp after the message has already been sent is taking
"as late as possible" a little too far.  It results in every message having
a zero timestamp, other than keep-alives which go through a different path.

Re-ordering the code blocks as in the attached seems to fix it.  But I have
to wonder, if this has been broken from the start and no one ever noticed,
is this even valuable information to relay in the first place?  We could
just take the column out of the view, and not bother calling
GetCurrentTimestamp() for each message.

Cheers,

Jeff


walsender_timestamp_fix.patch
Description: Binary data


logical replication empty transactions

2019-10-21 Thread Jeff Janes
After setting up logical replication of a slowly changing table using the
built in pub/sub facility, I noticed way more network traffic than made
sense.  Looking into it, I see that every transaction in that database on the
master gets sent to the replica.  99.999+% of them are empty transactions
('B' message and 'C' message with nothing in between) because the
transactions don't touch any tables in the publication, only non-replicated
tables.  Is doing it this way necessary for some reason?  Couldn't we hold
the transmission of 'B' until something else comes along, and then if that
next thing is 'C' drop both of them?

There is a comment for WalSndPrepareWrite which seems to foreshadow such a
thing, but I don't really see how to use it in this case.  I want to drop
two messages, not one.

 * Don't do anything lasting in here, it's quite possible that nothing will
be done
 * with the data.

This applies to all version which have support for pub/sub, including the
recent commits to 13dev.

I've searched through the voluminous mailing list threads for when this
feature was being presented to see if it was already discussed, but since
every word I can think to search on occurs in virtually every message in
the threads in some context or another, I didn't have much luck.

Cheers,

Jeff


Re: [PATCH] Race condition in logical walsender causes long postgresql shutdown delay

2019-09-26 Thread Jeff Janes
On Wed, Sep 11, 2019 at 3:52 PM Alvaro Herrera 
wrote:

>
> Reading over this code, I noticed that the detection of the catch-up
> state ends up being duplicate code, so I would rework that function as
> in the attached patch.
>
> The naming of those flags (got_SIGUSR2, got_STOPPING) is terrible, but
> I'm not going to change that in a backpatchable bug fix.
>

Hi Alvaro, does this count as a review?  And Craig, do you agree with
Alvaro's patch as a replacement for your own?

Thanks,

Jeff


Re: DROP SUBSCRIPTION with no slot

2019-09-24 Thread Jeff Janes
On Tue, Sep 24, 2019 at 6:42 PM Ziga  wrote:

> This also seems to be a problem for somewhat fringe case of subscriptions
> created with connect=false option.
> They cannot be dropped in an obvious way, without knowing the ALTER
> SUBSCRIPTION trick.
>
> For example:
>
> contrib_regression=# create subscription test_sub connection
> 'dbname=contrib_regression' publication test_pub with ( connect=false );
>
>
> WARNING:  tables were not subscribed, you will have to run ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
> CREATE SUBSCRIPTION
>
> contrib_regression=# drop subscription test_sub; -- fails
> ERROR:  could not drop the replication slot "test_sub" on publisher
> DETAIL:  The error was: ERROR:  replication slot "test_sub" does not exist
>
> contrib_regression=# alter subscription test_sub set ( slot_name=none );
> ALTER SUBSCRIPTION
>
> contrib_regression=# drop subscription test_sub; -- succeeds
> DROP SUBSCRIPTION
>
>
> Note that the publication was never refreshed.
> It seems that the first DROP should succeed in the above case.
> Or at least some hint should be given how to fix this.
>

There is no HINT in the error message itself, but there is in the
documentation, see note at end of
https://www.postgresql.org/docs/current/sql-dropsubscription.html.  I agree
with you that the DROP should just work in this case, even more so than in
my case.  But if we go with the argument that doing that is too error
prone, then do we want to include a HINT on how to be error prone more
conveniently?

Cheers,

Jeff


Re: DROP SUBSCRIPTION with no slot

2019-09-24 Thread Jeff Janes
On Tue, Sep 24, 2019 at 5:25 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-09-24 16:31, Jeff Janes wrote:
> > I recently had to cut loose (pg_drop_replication_slot) a logical replica
> > that couldn't keep up and so was threatening to bring down the master.
> >
> > In mopping up on the replica side, I couldn't just drop the
> > subscription, because it couldn't drop the nonexistent slot on the
> > master and so refused to work.  So I had to do a silly little dance
> > where I first disable the subscription, then ALTER SUBSCRIPTION ... SET
> > (slot_name = NONE), then drop it.
> >
> > Wanting to clean up after itself is admirable, but if there is nothing
> > to clean up, why should that be an error condition?
>
> The alternatives seem quite error prone to me.  Better be explicit.
>

If you can connect to the master, and see that the slot already fails to
exist, what is error prone about that?

If someone goes to the effort of setting up a different master, configures
it to receive replica connections, and alters the subscription CONNECTION
parameter on the replica to point to that poisoned master, will an error on
the DROP SUBSCRIPTION really force them to see the error of their ways, or
will they just succeed at explicitly doing the silly dance and finalize the
process of shooting themselves in the foot via a roundabout mechanism?
(And at the point the CONNECTION is changed, he is in the same boat even if
he doesn't try to drop the sub--either way the master has a dangling
slot).  I'm in favor of protecting a fool from his foolishness, except when
it is annoying to the rest of us (Well, annoying to me, I guess. I don't
know yet about the rest of us).

Cheers,

Jeff


DROP SUBSCRIPTION with no slot

2019-09-24 Thread Jeff Janes
I recently had to cut loose (pg_drop_replication_slot) a logical replica
that couldn't keep up and so was threatening to bring down the master.

In mopping up on the replica side, I couldn't just drop the subscription,
because it couldn't drop the nonexistent slot on the master and so refused
to work.  So I had to do a silly little dance where I first disable the
subscription, then ALTER SUBSCRIPTION ... SET (slot_name = NONE), then drop
it.
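
For the archives, the dance spelled out (subscription name made up):

alter subscription mysub disable;
alter subscription mysub set (slot_name = none);
drop subscription mysub;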

Wanting to clean up after itself is admirable, but if there is nothing to
clean up, why should that be an error condition?  Should this be an item on
https://wiki.postgresql.org/wiki/Todo (to whatever extent that is still
used).

Cheers,

Jeff


Re: JSONPATH documentation

2019-09-22 Thread Jeff Janes
On Sun, Sep 22, 2019 at 2:18 PM Jeff Janes  wrote:

> I find the documentation in
> https://www.postgresql.org/docs/12/functions-json.html very confusing.
>
> In table 9.44 take the first entry,
>
> Example JSON
>  {"x": [2.85, -14.7, -9.4]}
>
> Example Query
>   + $.x.floor()
>
> Result
> 2, -15, -10
>
> There are no end to end examples here. How do I apply the example query to
> the example json to obtain the given result?
>

OK, never mind here.  After digging in the regression tests, I did find
jsonb_path_query and friends, and they are in the docs with examples in
table 9.49.  I don't know how I overlooked that in the first place, I guess
I was fixated on operators.  Or maybe by the time I was down in those
functions, I thought I had cycled back up and was looking at 9.44 again.
But I think it would make sense to move the description of jsonpath to its
own page.  It is confusing to have operators within the jsonpath language,
and operators which apply to jsonpath "from the outside", together in the
same page.
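
For anyone else who gets stuck at the same point, the first entry of table
9.44 corresponds to something like:

select jsonb_path_query('{"x": [2.85, -14.7, -9.4]}', '$.x.floor()');
 jsonb_path_query
------------------
 2
 -15
 -10
(3 rows)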

Cheers,

Jeff

>


JSONPATH documentation

2019-09-22 Thread Jeff Janes
I find the documentation in
https://www.postgresql.org/docs/12/functions-json.html very confusing.

In table 9.44 take the first entry,

Example JSON
 {"x": [2.85, -14.7, -9.4]}

Example Query
  + $.x.floor()

Result
2, -15, -10

There are no end to end examples here. How do I apply the example query to
the example json to obtain the given result?

Table 9.47 only gives two operators which apply a jsonpath to a json(b)
object: @? and @@; and neither one of those yield the indicated result from
the first line in 9.44. What does?

Also, I can't really figure out what the descriptions of @? and @@ mean.
Does @? return true if an item exists, even if the value of that item is
false, while @@ returns the truth value of the existing item?

https://www.postgresql.org/docs/12/datatype-json.html#DATATYPE-JSONPATH

"The SQL/JSON path language is fully integrated into the SQL engine". What
does that mean?  If it were only partially integrated, what would that
mean?  Is this providing me with any useful information?  Is this just
saying that this is not a contrib extension module?

What is the difference between "SQL/JSON Path Operators And Methods" and
and "jsonpath Accessors" and why are they not described in the same place,
or at least nearby each other?

Cheers,

Jeff


Re: WAL recycled despite logical replication slot

2019-09-22 Thread Jeff Janes
On Fri, Sep 20, 2019 at 6:25 PM Andres Freund  wrote:

> Hi,
>
> On September 20, 2019 5:45:34 AM PDT, Jeff Janes 
> wrote:
> >While testing something else (whether "terminating walsender process
> >due to
> >replication timeout" was happening spuriously), I had logical
> >replication
> >set up streaming a default pgbench transaction load, with the publisher
> >being 13devel-e1c8743 and subscriber being 12BETA4.  Eventually I
> >started
> >getting errors about requested wal segments being already removed:
> >
> >10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG:  starting logical
> >decoding for slot "sub"
> >10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL:  Streaming
> >transactions committing after 79/EB0B17A0, reading WAL from
> >79/E70736A0.
> >10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR:  requested WAL
> >segment 0001007900E7 has already been removed
> >10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG:  disconnection:
> >session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2
> >port=40830
> >
> >It had been streaming for about 50 minutes before the error showed up,
> >and
> >it showed right when streaming was restarting after one of the
> >replication
> >timeouts.
> >
> >Is there an innocent explanation for this?  I thought logical
> >replication
> >slots provided an iron-clad guarantee that WAL would be retained until
> >it
> >was no longer needed.  I am just using pub/sub, none of the lower level
> >stuff.
>
> It indeed should. What's the content of
> pg_replication_slot for that slot?
>

Unfortunately I don't think I have that preserved.  If I can reproduce the
issue, would preserving data/pg_replslot/sub/state help as well?
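
For next time, I assume the interesting bits would be something like:

select slot_name, plugin, slot_type, active, xmin, catalog_xmin,
       restart_lsn, confirmed_flush_lsn
from pg_replication_slots
where slot_name = 'sub';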

Cheers,

Jeff


Re: WAL recycled despite logical replication slot

2019-09-22 Thread Jeff Janes
On Fri, Sep 20, 2019 at 11:27 AM Tomas Vondra 
wrote:

> >
> >Is there an innocent explanation for this?  I thought logical replication
> >slots provided an iron-clad guarantee that WAL would be retained until it
> >was no longer needed.  I am just using pub/sub, none of the lower level
> >stuff.
> >
>
> I think you're right - this should not happen with replication slots.
> Can you provide more detailed setup instructions, so that I can try to
> reproduce and investigate the isssue?
>

It is a bit messy, because this isn't what I was trying to test.

The basic set up is pretty simple:

On master:

pgbench -i -s 100
create publication pgbench for table pgbench_accounts,  pgbench_branches,
pgbench_history , pgbench_tellers;
pgbench -R200 -c4 -j4 -P60 -T36 -n

on replica:

pgbench -i -s 1
truncate pgbench_history , pgbench_accounts, pgbench_branches,
pgbench_tellers;
create subscription sub CONNECTION 'host=192.168.0.15' publication pgbench;

The messy part:  It looked like the synch was never going to finish, so
first I cut the rate down to -R20.  Then what I thought I did was drop the
primary key on pgbench_accounts (manually doing a kill -15 on the synch
worker to release the lock), wait for the copy to start again and then
finish and then start getting "ERROR:  logical replication target relation
"public.pgbench_accounts" has neither REPLICA IDENTITY index nor PRIMARY
KEY and published relation does not have REPLICA IDENTITY FULL" log
messages, then I re-added the primary key.  Then I increased the -R back to
200, and about 50 minutes later got the WAL already removed error.

But now I can't seem to reproduce this, as the next time I tried to do the
synch with no primary key there doesn't seem to be a commit after the COPY
finishes so once it tries to replay the first update, it hits the above "no
primary key" error and then rolls back the **the entire COPY** as well as
the single-row update, and starts the entire COPY over again before you
have a chance to intervene and build the index.  So I'm guessing now that
either the lack of a commit (which itself seems like a spectacularly bad
idea) is situation dependent, or the very slow COPY had finished between
the time I had decided to drop the primary key, and time I actually
implemented the drop.

Perhaps important here is that the replica is rather underpowered.  Write
IO and fsyncs periodically become painfully slow, which is probably why
there are replication timeouts, and since the problem happened when trying
to reestablish after a timeout I would guess that that is critical to the
issue.

I was running the master with fsync=off, but since the OS never crashed
that should not be the source of corruption.


I'll try some more to reproduce this, but I wanted to make sure there was
actually something here to reproduce, and not just my misunderstanding of
how things are supposed to work.

Cheers,

Jeff


WAL recycled despite logical replication slot

2019-09-20 Thread Jeff Janes
While testing something else (whether "terminating walsender process due to
replication timeout" was happening spuriously), I had logical replication
set up streaming a default pgbench transaction load, with the publisher
being 13devel-e1c8743 and subscriber being 12BETA4.  Eventually I started
getting errors about requested wal segments being already removed:

10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG:  starting logical
decoding for slot "sub"
10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL:  Streaming
transactions committing after 79/EB0B17A0, reading WAL from 79/E70736A0.
10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR:  requested WAL
segment 0001007900E7 has already been removed
10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG:  disconnection:
session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2
port=40830

It had been streaming for about 50 minutes before the error showed up, and
it showed right when streaming was restarting after one of the replication
timeouts.

Is there an innocent explanation for this?  I thought logical replication
slots provided an iron-clad guarantee that WAL would be retained until it
was no longer needed.  I am just using pub/sub, none of the lower level
stuff.

Cheers,

Jeff


Re: Default JIT setting in V12

2019-09-16 Thread Jeff Janes
On Wed, Sep 4, 2019 at 11:24 AM Andres Freund  wrote:

> Hi,
>
> On 2019-09-04 07:51:16 -0700, Andres Freund wrote:
>


> Or better, something slightly more complete, like the attached (which
> affects both code-gen time optimizations (which are more like peephole
> ones), and both function/global ones that are cheap).
>

Yes, that does completely solve the issue I raised.  It makes JIT either
better or at least harmless, even when falling into the gap between
jit_above_cost
and jit_optimize_above_cost.

What TPC-H implementation do you use/recommend?  This one
https://wiki.postgresql.org/wiki/DBT-3?

Cheers,

Jeff


Re: log spam with postgres_fdw

2019-09-15 Thread Jeff Janes
On Sun, Sep 15, 2019 at 11:14 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > When closing the local session which had used postgres_fdw over an ssl
> > connection, I get log spam on the foreign server saying:
> > LOG:  could not receive data from client: Connection reset by peer
> > It is easy to reproduce, but you must be using ssl to do so.
> > On searching, I see that a lot of people have run into this issue, with
> > considerable confusion, but as far as I can see it has never been
> diagnosed.
>
> In
>
> https://www.postgresql.org/message-id/flat/3DPLMQIC.YU6IFMLY.3PLOWL6W%40FQT5M7HS.IFBAANAE.A7GUPCPM
>
>
Thanks, I had not spotted that one, I guess because the log message itself
was not in the subject so it ranked lower.


> we'd concluded that the issue is probably that postgres_fdw has no
> logic to shut down its external connections when the session closes.
> It's not very clear why the SSL dependency, but we speculated that
> adding an on_proc_exit callback to close the connection(s) would help.
>
>
It is easy to reproduce the ssl dependency without any FDW, just by doing a
kill -9 on psql. Apparently the backend processes for unencrypted connections
are happy to be ghosted, while ssl ones are not, which seems like an odd
distinction to make.  So should this be addressed on both sides (the server
not whining, and the client doing the on_proc_exit anyway?).  I can take a
stab at the client side one, but I'm over my head on the ssl connection
handling logic on the server side.

Cheers,

Jeff


Re: Primary keepalive message not appearing in Logical Streaming Replication

2019-09-15 Thread Jeff Janes
On Sun, Sep 15, 2019 at 11:44 AM Michael Loftis  wrote:

>
>
> On Sun, Sep 15, 2019 at 08:36 Virendra Negi  wrote:
>
>> Oh I miss the documentation link there you go
>> https://www.postgresql.org/docs/9.5/protocol-replication.html
>>
>> On Sun, Sep 15, 2019 at 8:05 PM Virendra Negi 
>> wrote:
>>
>>> Agreed but why is there a message specification for it describe in the
>>> documentation  and it ask to client reply back if a particular *bit* is
>>> set.(1 means that the client should reply to this message as soon as
>>> possible, to avoid a timeout disconnect. 0 otherwise)
>>>
>>
> This is unrelated to TCP keepalive. I honestly don't know where the knob
> is to turn these on but the configuration variables you quoted earlier I am
> familiar with and they are not it. Perhaps someone else can chime in with
> how to enable the protocol level keepalive in replication.
>

Protocol-level keepalives are governed by "wal_sender_timeout"

Cheers,

Jeff


log spam with postgres_fdw

2019-09-15 Thread Jeff Janes
I'm sending this to hackers, because it is not exactly a bug, and it can't
be addressed from userland.  I think it is a coding issue, although I
haven't identified the exact code.

When closing the local session which had used postgres_fdw over an ssl
connection, I get log spam on the foreign server saying:

LOG:  could not receive data from client: Connection reset by peer

It is easy to reproduce, but you must be using ssl to do so.

On searching, I see that a lot of people have run into this issue, with
considerable confusion, but as far as I can see it has never been diagnosed.

Is there anything that can be done about this, other than just learning to
ignore it?

Cheers,

Jeff
-- set up ssl, I won't walk through those steps.
\! pgbench -i
create extension postgres_fdw ;
create server ssl foreign data wrapper postgres_fdw OPTIONS ( host '127.0.0.1');
create schema fgn;
create user MAPPING FOR CURRENT_USER SERVER ssl;
import  foreign schema public from server ssl into fgn;
select count(*) from fgn.pgbench_history;
\q


Default JIT setting in V12

2019-09-04 Thread Jeff Janes
Since JIT is on by default in v12, I wanted to revisit the issue raised in
https://www.postgresql.org/message-id/CAMkU=1zVhQ5k5d=YyHNyrigLUNTkOj4=YB17s9--3ts8H-SO=q...@mail.gmail.com

When the total estimated cost is between jit_above_cost and
jit_optimize_above_cost, I get a substantial regression in the attached.
Note that I did not devise this case specifically to cause this problem, I
just stumbled upon it.

JIT, no optimization: 10.5s
JIT, optimization: 3.8s
no JIT:  4.1s

It seems like the unoptimized JIT code is much worse than the general
purpose code.

This is on AWS c4.large, Ubuntu 18.04, installed from PGDG apt repository.
No config changes were made, other than the local ones included in the
script. (Previously there were questions about how LLVM was configured,
that I couldn't really answer well, but here there should be no question as I
didn't compile or configure it at all.)

There were some proposed mitigations in sister threads, but none have been
adopted in v12.

I think it is intuitive, and with empirical evidence, that we do not want
to JIT compile at all unless we are going to optimize the compiled code.

Is there a rationale for, or other examples to show, that it makes sense
for the default value of jit_optimize_above_cost to be 5 fold higher than
the default setting of jit_above_cost?

I think these defaults are setting a trap for our users who aren't really
interested in JIT, and are just upgrading to stay on the most-current
version.  I would propose lowering the default jit_optimize_above_cost to
be the same as jit_above_cost, or set it to 0 so that  jit_above_cost is
always in control and always optimizes.
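
Expressed as settings, the proposal amounts to changing one default, which
anyone can also do locally in the meantime:

-- v12 defaults: jit_above_cost = 100000, jit_optimize_above_cost = 500000
set jit_optimize_above_cost = 100000;  -- same threshold as jit_above_cost
-- or, equivalently for this purpose:
set jit_optimize_above_cost = 0;       -- always optimize whenever JIT is used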

Cheers,

Jeff
drop table if exists i200c200;
create table i200c200 ( pk bigint primary key, 
int1 bigint,
int2 bigint,
int3 bigint,
int4 bigint,
int5 bigint,
int6 bigint,
int7 bigint,
int8 bigint,
int9 bigint,
int10 bigint,
int11 bigint,
int12 bigint,
int13 bigint,
int14 bigint,
int15 bigint,
int16 bigint,
int17 bigint,
int18 bigint,
int19 bigint,
int20 bigint,
int21 bigint,
int22 bigint,
int23 bigint,
int24 bigint,
int25 bigint,
int26 bigint,
int27 bigint,
int28 bigint,
int29 bigint,
int30 bigint,
int31 bigint,
int32 bigint,
int33 bigint,
int34 bigint,
int35 bigint,
int36 bigint,
int37 bigint,
int38 bigint,
int39 bigint,
int40 bigint,
int41 bigint,
int42 bigint,
int43 bigint,
int44 bigint,
int45 bigint,
int46 bigint,
int47 bigint,
int48 bigint,
int49 bigint,
int50 bigint,
int51 bigint,
int52 bigint,
int53 bigint,
int54 bigint,
int55 bigint,
int56 bigint,
int57 bigint,
int58 bigint,
int59 bigint,
int60 bigint,
int61 bigint,
int62 bigint,
int63 bigint,
int64 bigint,
int65 bigint,
int66 bigint,
int67 bigint,
int68 bigint,
int69 bigint,
int70 bigint,
int71 bigint,
int72 bigint,
int73 bigint,
int74 bigint,
int75 bigint,
int76 bigint,
int77 bigint,
int78 bigint,
int79 bigint,
int80 bigint,
int81 bigint,
int82 bigint,
int83 bigint,
int84 bigint,
int85 bigint,
int86 bigint,
int87 bigint,
int88 bigint,
int89 bigint,
int90 bigint,
int91 bigint,
int92 bigint,
int93 bigint,
int94 bigint,
int95 bigint,
int96 bigint,
int97 bigint,
int98 bigint,
int99 bigint,
int100 bigint,
int101 bigint,
int102 bigint,
int103 bigint,
int104 bigint,
int105 bigint,
int106 bigint,
int107 bigint,
int108 bigint,
int109 bigint,
int110 bigint,
int111 bigint,
int112 bigint,
int113 bigint,
int114 bigint,
int115 bigint,
int116 bigint,
int117 bigint,
int118 bigint,
int119 bigint,
int120 bigint,
int121 bigint,
int122 bigint,
int123 bigint,
int124 bigint,
int125 bigint,
int126 bigint,
int127 bigint,
int128 bigint,
int129 bigint,
int130 bigint,
int131 bigint,
int132 bigint,
int133 bigint,
int134 bigint,
int135 bigint,
int136 bigint,
int137 bigint,
int138 bigint,
int139 bigint,
int140 bigint,
int141 bigint,
int142 bigint,
int143 bigint,
int144 bigint,
int145 bigint,
int146 bigint,
int147 bigint,
int148 bigint,
int149 bigint,
int150 bigint,
int151 bigint,
int152 bigint,
int153 bigint,
int154 bigint,
int155 bigint,
int156 bigint,
int157 bigint,
int158 bigint,
int159 bigint,
int160 bigint,
int161 bigint,
int162 bigint,
int163 bigint,
int164 bigint,
int165 bigint,
int166 bigint,
int167 bigint,
int168 bigint,
int169 bigint,
int170 bigint,
int171 bigint,
int172 bigint,
int173 bigint,
int174 bigint,
int175 bigint,
int176 bigint,
int177 bigint,
int178 bigint,
int179 bigint,
int180 bigint,
int181 bigint,
int182 bigint,
int183 bigint,
int184 bigint,
int185 bigint,
int186 bigint,
int187 bigint,
int188 bigint,
int189 bigint,
int190 bigint,
int191 bigint,
int192 bigint,
int193 bigint,
int194 bigint,
int195 bigint,
int196 bigint,
int197 bigint,
int198 bigint,
int199 bigint,
int200 bigint,
char1 varchar(255),
char2 varchar(255),
char3 varchar(255),
char4 varchar(255),
char5 varchar(255),
char6 varchar(255),
char7 varchar(255),
char8 varchar(255),
char9 varchar(255),
char10 varchar(255),
char11 varchar(255),
char12 varchar(255),
char13 varchar(255),
char14 varchar(255),
char15 varchar(255),
char16 varchar(255),
char17 varchar(255),

Re: Can't we give better table bloat stats easily?

2019-08-26 Thread Jeff Janes
On Fri, Aug 16, 2019 at 8:39 PM Greg Stark  wrote:

> Everywhere I've worked I've seen people struggle with table bloat. It's
> hard to even measure how much of it you have or where, let alone actually
> fix it.
>
> If you search online you'll find dozens of different queries estimating
> how much empty space are in your tables and indexes based on pg_stats
> statistics and suppositions about header lengths and padding and plugging
> them into formulas of varying credibility.
>

There is not much we can do to suppress bad advice that people post on
their own blogs.  If wiki.postgresql.org is hosting bad advice, by all
means we should fix that.


> But isn't this all just silliiness these days? We could actually sum up
> the space recorded in the fsm and get a much more trustworthy number in
> milliseconds.
>

If you have bloat problems, then you probably have vacuuming problems.  If
you have vacuuming problems, how much can you trust fsm anyway?
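
(To be concrete, the kind of FSM-summing Greg describes is already possible
with the pg_freespacemap extension, for example:

create extension if not exists pg_freespacemap;
select pg_size_pretty(sum(avail)) as free_space,
       pg_size_pretty(pg_relation_size('pgbench_accounts')) as table_size
from pg_freespace('pgbench_accounts');

The question is how much that number can be trusted when vacuum hasn't been
keeping up.)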

Cheers,

Jeff


pg_basebackup delays closing of stdout

2019-07-23 Thread Jeff Janes
Ever since pg_basebackup was created, it had a comment like this:

 * End of chunk. If requested, and this is the base tablespace
 * write configuration file into the tarfile. When done, close the
 * file (but not stdout).

But, why make the exception for output going to stdout?  If we are done
with it, why not close it?

After a massive maintenance operation, I want to re-seed a streaming
standby, which I start to do by:

pg_basebackup -D - -Ft -P -X none  | pxz > base.tar.xz

But the archiver is way behind, so when it finishes the basebackup part, I
get:

NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to
be archived
WARNING:  pg_stop_backup still waiting for all required WAL segments to be
archived (60 seconds elapsed)
...

The base backup file is not finalized, because pg_basebackup has not closed
its stdout while waiting for the WAL segment to be archived. The file is
incomplete due to data stuck in buffers, so I can't copy it to where I want
and bring up a new streaming replica (which bypasses the WAL archive, so
would otherwise work). Also, if pg_basebackup gets interrupted somehow while
it is waiting for WAL archiving, the backup will be invalid, as it won't
flush the last bit of data.  Of course if it gets interrupted, I would have
to test the backup to make sure it is valid.  But testing it and finding
that it is valid is better than testing it and finding that it is not.

I think it makes sense for pg_basebackup to wait for the WAL to be
archived, but there is no reason for it to hold the base.tar.xz file
hostage while it does so.

If I simply remove the test for strcmp(basedir, "-"), as in the attached, I
get the behavior I desire, and nothing bad seems to happen.  Meaning, "make
check -C src/bin/pg_basebackup/" still passes (but only tested on Linux).

Is there a reason not to do this?

Cheers,

Jeff


pg_basebackup_close_stdout.patch
Description: Binary data


mcv compiler warning

2019-07-05 Thread Jeff Janes
One of the recent mcv commits introduced an unused variable warning.

mcv.c: In function 'statext_mcv_serialize':
mcv.c:914:7: warning: unused variable 'itemlen' [-Wunused-variable]
   int   itemlen = ITEM_SIZE(dim);

The attached fixes it.

Cheers,

Jeff


mcv_assert_warning.patch
Description: Binary data


crash testing suggestions for 12 beta 1

2019-05-23 Thread Jeff Janes
Now that beta is out, I wanted to do some crash-recovery testing where I
inject PANIC-inducing faults and see if it recovers correctly. A long-lived
Perl process keeps track of what it should find after the crash, and
verifies that it finds it.  You will probably be familiar with the general
theme from examples like the threads below.  Would anyone like to nominate
some areas to focus on?  I think the pluggable storage refactoring work
will get inherently tested, so I'm not planning on designing tests
specifically for that (unless there is a non-core plugin I should test
with). Making the ctid a tie-breaker in btree indexes is also tested
inherently (plus I think Peter tested that pretty thoroughly himself with
similar methods).  I've already tested declarative partitioning where the
tuples do a lot of migrating, and tested prepared transactions.  Any other
suggestions for changes that might be risky and should be specifically
targeted for testing?



https://www.postgresql.org/message-id/CAMkU=1xeuubphdwdmb1wjn4+td4kpnenifatbxnk1xzhcw8...@mail.gmail.com


https://www.postgresql.org/message-id/CAMkU=1xBP8cqdS5eK8APHL=x6rhmmm2vg5g+qamduutsycw...@mail.gmail.com


Cheers,

Jeff


improve transparency of bitmap-only heap scans

2019-05-18 Thread Jeff Janes
When bitmap-only heap scans were introduced in v11 (7c70996ebf0949b142a99)
no changes were made to "EXPLAIN".  This makes the feature rather opaque.
You can sometimes figure out what is going by the output of EXPLAIN
(ANALYZE, BUFFERS), but that is unintuitive and fragile.

Looking at the discussion where the feature was added, I think changing the
EXPLAIN just wasn't considered.

The attached patch adds "avoided" to "exact" and "lossy" as a category
under
"Heap Blocks".  Also attached is the example output, as the below will
probably wrap to the point of illegibility:

explain analyze select count(*) from foo  where a=35 and d between 67 and
70;
 QUERY
PLAN
-
 Aggregate  (cost=21451.36..21451.37 rows=1 width=8) (actual
time=103.955..103.955 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=9920.73..21442.44 rows=3570 width=0)
(actual time=100.239..103.204 rows=3950 loops=1)
 Recheck Cond: ((a = 35) AND (d >= 67) AND (d <= 70))
 Heap Blocks: avoided=3718 exact=73
 ->  BitmapAnd  (cost=9920.73..9920.73 rows=3570 width=0) (actual
time=98.666..98.666 rows=0 loops=1)
   ->  Bitmap Index Scan on foo_a_c_idx  (cost=0.00..1682.93
rows=91000 width=0) (actual time=28.541..28.541 rows=99776 loops=1)
 Index Cond: (a = 35)
   ->  Bitmap Index Scan on foo_d_idx  (cost=0.00..8235.76
rows=392333 width=0) (actual time=66.946..66.946 rows=399003 loops=1)
 Index Cond: ((d >= 67) AND (d <= 70))
 Planning Time: 0.458 ms
 Execution Time: 104.487 ms


I think the name of the node should also be changed to "Bitmap Only Heap
Scan", but I didn't implement that as adding another NodeTag looks like a
lot of tedious error prone work to do before getting feedback on whether
the change is desirable in the first place, or the correct approach.

 Cheers,

Jeff
create table foo as select floor(random()*100)::int as a, 
floor(random()*100)::int as b, floor(random()*100)::int as c, 
floor(random()*100)::int as d, floor(random()*100)::int as e from 
generate_series(1,1000);
vacuum ANALYZE ;
create index on foo (a,c);
create index on foo (d);
update foo set d=d+1 where a=35 and c = 70;

explain analyze select count(*) from foo  where a=35 and d between 67 and 70;
 QUERY PLAN 
 
-
 Aggregate  (cost=21451.36..21451.37 rows=1 width=8) (actual 
time=103.955..103.955 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=9920.73..21442.44 rows=3570 width=0) 
(actual time=100.239..103.204 rows=3950 loops=1)
 Recheck Cond: ((a = 35) AND (d >= 67) AND (d <= 70))
 Heap Blocks: avoided=3718 exact=73
 ->  BitmapAnd  (cost=9920.73..9920.73 rows=3570 width=0) (actual 
time=98.666..98.666 rows=0 loops=1)
   ->  Bitmap Index Scan on foo_a_c_idx  (cost=0.00..1682.93 
rows=91000 width=0) (actual time=28.541..28.541 rows=99776 loops=1)
 Index Cond: (a = 35)
   ->  Bitmap Index Scan on foo_d_idx  (cost=0.00..8235.76 
rows=392333 width=0) (actual time=66.946..66.946 rows=399003 loops=1)
 Index Cond: ((d >= 67) AND (d <= 70))
 Planning Time: 0.458 ms
 Execution Time: 104.487 ms



bitmap_only_avoided_v1.patch
Description: Binary data


Re: Usage of epoch in txid_current

2019-05-04 Thread Jeff Janes
On Sat, May 4, 2019 at 1:34 PM Jeff Janes  wrote:

> On Thu, Mar 28, 2019 at 1:30 AM Thomas Munro 
> wrote:
>
>> On Thu, Mar 28, 2019 at 1:48 AM Heikki Linnakangas 
>> wrote:
>> > Once we have the FullTransactionId type and basic macros in place, I'm
>> > sure we could tidy up a bunch of code by using them.
>
>
> Thanks for the reviews!  Pushed.
>>
>
> I think that this might be broken.
>
> We have this change:
>
> @@ -73,7 +75,8 @@ GetNewTransactionId(bool isSubXact)
>
> LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
>
> -   xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
> +   full_xid = ShmemVariableCache->nextFullXid;
> +   xid = XidFromFullTransactionId(full_xid);
>
> But then later on in an little-used code path around line 164:
>
> /* Re-acquire lock and start over */
> LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
> xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
>
> full_xid does not get updated, but then later on full_xid gets returned in
> lieu of xid.
>
> Is there a reason that this is OK?
>
>
Ah, sorry for the noise.  I see that this has been fixed already.  I wasn't
looking at HEAD, or at the other email thread, when I "discovered" this.

Sorry for the noise.

Cheers,

Jeff


Re: Usage of epoch in txid_current

2019-05-04 Thread Jeff Janes
On Thu, Mar 28, 2019 at 1:30 AM Thomas Munro  wrote:

> On Thu, Mar 28, 2019 at 1:48 AM Heikki Linnakangas 
> wrote:
> > Once we have the FullTransactionId type and basic macros in place, I'm
> > sure we could tidy up a bunch of code by using them.


Thanks for the reviews!  Pushed.
>

I think that this might be broken.

We have this change:

@@ -73,7 +75,8 @@ GetNewTransactionId(bool isSubXact)

LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

-   xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+   full_xid = ShmemVariableCache->nextFullXid;
+   xid = XidFromFullTransactionId(full_xid);

But then later on in a little-used code path around line 164:

/* Re-acquire lock and start over */
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);

full_xid does not get updated, but then later on full_xid gets returned in
lieu of xid.

Is there a reason that this is OK?

Cheers,

Jeff


Re: compiler warning in pgcrypto imath.c

2019-05-04 Thread Jeff Janes
On Sat, May 4, 2019 at 3:15 AM Noah Misch  wrote:

>
> I pushed Jeff's patch.
>

Thank you.  I've re-tested it and I get warning-free compilation now.

Cheers,

Jeff


make maintainer-clean and config.cache

2019-05-04 Thread Jeff Janes
In a side-note in another thread, Tom pointed out the speed improvement of
using an autoconf cache when re-building, which sounded nice to me, as
configure takes an annoyingly long time and is not parallelized.

But the config.cache files gets deleted by make maintainer-clean.  Doesn't
that mostly defeat the purpose of having a cache?  Am I doing something
wrong here, or just thinking about it wrong?

time ./configure --config-cache > /dev/null
real 0m21.538s

time ./configure --config-cache > /dev/null
real 0m3.425s

make maintainer-clean > /dev/null ;
## presumably git checkout a new commit here
time ./configure --config-cache > /dev/null
real 0m21.260s

Cheers,

Jeff


Re: pg_upgrade --clone error checking

2019-05-03 Thread Jeff Janes
On Fri, May 3, 2019 at 3:53 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-05-02 20:03, Jeff Janes wrote:
> > It looks like it was designed for early checking, it just wasn't placed
> > early enough.  So changing it is pretty easy, as check_file_clone does
> > not need to be invented, and there is no additional code duplication
> > over what was already there.
> >
> > This patch moves the checking to near the beginning.
>
> I think the reason it was ordered that way is that it wants to do all
> the checks of the old cluster before doing any checks touching the new
> cluster.
>

But is there a reason to want to do that?  I understand we don't want to
keep starting and stopping the clusters needlessly, so we should do
everything we can in one before moving to the other.  But for checks that
don't need a running cluster, why would it matter?  The existence and
contents of PG_VERSION in the new cluster directory are already checked at
the very beginning (and even tries to start it up and shuts it down again
if a pid file also exists), so there is precedent for touching the new
cluster directory at the filesystem level early (albeit in a readonly
manner) and if a pid file exists then doing even more than that.  I didn't
move check_file_clone to before the liveness check is done, out of an
abundance of caution.  But creating a transient file with a name of no
significance ("PG_VERSION.clonetest") in a cluster that is not even running
seems like a very low-risk thing to do.  The payoff is that we get an
inevitable error message much sooner.

Cheers,

Jeff


Re: pg_upgrade --clone error checking

2019-05-02 Thread Jeff Janes
On Thu, May 2, 2019 at 12:28 PM Alvaro Herrera 
wrote:

> On 2019-May-02, Jeff Janes wrote:
>
> >
> > When something is doomed to fail, we should report the failure as early
> as
> > feasibly detectable.
>
> I agree -- this check should be done before checking the database
> contents.  Maybe even before "Checking cluster versions".
>

It looks like it was designed for early checking, it just wasn't placed
early enough.  So changing it is pretty easy, as check_file_clone does not
need to be invented, and there is no additional code duplication over what
was already there.

This patch moves the checking to near the beginning.

It carries the --link mode checking along with it.  That should be done as
well, and doing it as a separate patch would make both patches uglier.

Cheers,

Jeff


pg_upgrade_filesystem.patch
Description: Binary data


Re: pg_upgrade --clone error checking

2019-05-02 Thread Jeff Janes
On Thu, May 2, 2019 at 11:57 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-05-01 22:10, Jeff Janes wrote:
> > With the new pg_upgrade --clone, if we are going to end up throwing the
> > error "file cloning not supported on this platform" (which seems to
> > depend only on ifdefs) I think we should throw it first thing, before
> > any other checks are done and certainly before pg_dump gets run.
>
> Could you explain in more detail what command you are running, what
> messages you are getting, and what you would like to see instead?
>

I'm running:

pg_upgrade --clone -b /home/jjanes/pgsql/REL9_6_12/bin/ -B
/home/jjanes/pgsql/origin_jit/bin/ -d /home/jjanes/pgsql/data_96/ -D
/home/jjanes/pgsql/data_clone/

And I get:

Performing Consistency Checks
-
Checking cluster versions   ok
Checking database user is the install user  ok
Checking database connection settings   ok
Checking for prepared transactions  ok
Checking for reg* data types in user tables ok
Checking for contrib/isn with bigint-passing mismatch   ok
Checking for tables WITH OIDS   ok
Checking for invalid "unknown" user columns ok
Creating dump of global objects ok
Creating dump of database schemas
ok
Checking for presence of required libraries ok

file cloning not supported on this platform
Failure, exiting

I think the error message wording is OK, I think it should be thrown
earlier, before the "Creating dump of database schemas" (which can take a
long time), and preferably before either database is even started.  So
ideally it would be something like:


Performing Consistency Checks
-
Checking cluster versions
Checking file cloning support
File cloning not supported on this platform
Failure, exiting


When something is doomed to fail, we should report the failure as early as
feasibly detectable.

Cheers,

Jeff


pg_upgrade --clone error checking

2019-05-01 Thread Jeff Janes
With the new pg_upgrade --clone, if we are going to end up throwing the
error "file cloning not supported on this platform" (which seems to depend
only on ifdefs) I think we should throw it first thing, before any other
checks are done and certainly before pg_dump gets run.

This might result in some small amount of code duplication, but I think it
would be worth the cost.

For cases where we might throw "could not clone file between old and new
data directories", I wonder if we shouldn't do some kind of dummy copy to
catch that error earlier, as well.  Maybe that one is not worth it.

Cheers,

Jeff


Re: TRACE_SORT defined by default

2019-04-25 Thread Jeff Janes
On Wed, Apr 24, 2019 at 6:04 PM Tom Lane  wrote:

> Peter Geoghegan  writes:
>
> > In
> > any case the current status quo is that it's built by default. I have
> > used it in production, though not very often. It's easy to turn it on
> > and off.
>
> Would any non-wizard really have a use for it?
>

I've had people use it to get some insight into the operation and memory
usage of Aggregate nodes, since those nodes offer nothing useful via
EXPLAIN ANALYZE.  It would be a shame to lose that ability on
package-installed PostgreSQL unless we fix Aggregate node reporting first.
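
For reference, turning it on for a session is just:

set trace_sort = on;
set client_min_messages = log;
-- then run the query of interest; tuplesort emits LOG lines about its
-- memory usage and whether it spilled to disk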

Cheers,

Jeff


Re: Should the docs have a warning about pg_stat_reset()?

2019-04-14 Thread Jeff Janes
On Wed, Apr 10, 2019 at 2:52 PM Bruce Momjian  wrote:

>
> OK, let me step back.  Why are people resetting the statistics
> regularly?  Based on that purpose, does it make sense to clear the
> stats that effect autovacuum?
>

When I've done it (not regularly, thankfully), it was usually because I
failed to type "pg_stat_reset_shared('bgwriter')" or
"pg_stat_statements_reset()" correctly.

I've also been tempted to do it because storing snapshots with a cron job
or something requires effort and planning ahead to set up the tables and
cron and some way to limit the retention, and then running LAG window
functions over the snapshots requires a re-study of the documentation,
because they are a bit esoteric and I don't use them enough to commit the
syntax to memory.  I don't want to see pg_statio_user_indexes often enough
to make elaborate arrangements ahead of time (especially since the
track_io_timing columns are missing from it); but when I do want them, I
want them.  And when I do, I don't want them to be "since the beginning of
time".

When I'm thinking about pg_statio_user_indexes, I am probably not thinking
about autovac, since they have about nothing in common with each other.
(Other than pg_stat_reset operating on both.)

Cheers,

Jeff


Re: pg_upgrade: Pass -j down to vacuumdb

2019-04-03 Thread Jeff Janes
On Fri, Mar 29, 2019 at 5:58 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-28 02:43, Jeff Janes wrote:
> > At first blush I thought it was obvious that you would not want to run
> > analyze-in-stages in parallel.  But after thinking about it some more
> > and reflecting on experience doing some troublesome upgrades, I would
> > reverse that and say it is now obvious you do want at least the first
> > stage of analyze-in-stages, and probably the first two, to run in
> > parallel.  That is not currently an option it supports, so we can't
> > really recommend it in the script or the docs.
>
> So do you think we should copy down the -j option from pg_upgrade, or
> make some other arrangement?
>

For 12, I think we should not pass it down so that it runs automatically,
and just go with a doc-only patch instead, where the text written into
the analyze_new_cluster.sh script recommending hitting Ctrl-C and doing
something else counts as documentation.

I agree with you that the value of -j that was used for pg_upgrade might not
be suitable for vacuumdb, but anyone who tests things thoroughly enough to
know exactly what value to use for each is not going to be the target
audience of analyze_new_cluster.sh anyway.  So I think the -j used in
pg_upgrade can just be written directly into the suggestion, and that would
be good enough for the intended use.

Ideally I think the analyze-in-stages should have the ability to
parallelize the first stage and not the last one, but that is not v12
material.

Even more ideally it should only have two stages, not three.  One stage
runs to generate minimally-tolerable stats before the database is opened to
other users, and one runs after it is open (but hopefully before the big
end-of-month reports get run, or whatever the big statistics-sensitive
queries are on your system).   But we don't really have a concept of "open
to other users" in the first place, and doing it yourself by juggling
pg_hba files is annoying and error prone.  So maybe the first stage could
be run by pg_upgrade itself, while the new server is still running on a
linux socket in a private directory.

The defense of the current three-stage method for analyze-in-stages is that
very few people are likely to know just what the level of
"minimally-tolerable stats" for their system actually are, because upgrades
are rare enough not to learn by experience, and realistic load-generators
are rare enough not to learn from test/QA environments.  If the
default_statistics_target=1 stats are enough, then you got them quickly.
If they aren't, at least you didn't waste too much time collecting them
before moving on to the second-stage default_statistics_target=10 and then
the final stage.  So the three stages are amortizing over our ignorance.

Cheers,

Jeff


Re: [HACKERS] generated columns

2019-03-31 Thread Jeff Janes
On Sat, Mar 30, 2019 at 4:03 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-26 20:50, Pavel Stehule wrote:
> > It is great feature and I'll mark this feature as ready for commit
>
> Committed, thanks.
>

I can't do a same-major-version pg_upgrade across this commit, as the new
pg_dump is trying to dump a nonexistent attgenerated column from the old
database.  Is that acceptable collateral damage?  I thought we usually
avoided that breakage within the dev branch, but I don't know how we do it.

Cheers,

Jeff


Re: pg_upgrade: Pass -j down to vacuumdb

2019-03-27 Thread Jeff Janes
On Tue, Mar 26, 2019 at 7:28 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-25 22:57, Tom Lane wrote:
> > + fprintf(script, "echo %sYou may wish to add --jobs=N for parallel
> analyzing.%s\n",
> > + ECHO_QUOTE, ECHO_QUOTE);
>
> But then you get that information after you have already started the
> script.
>

True, but the same goes for all the other information there, and it sleeps
to let you break out of it.  And I make it a habit to glance through any
scripts someone suggests that I run, so would notice the embedded advice
without running it at all.

I don't find any information about this analyze business on the
> pg_upgrade reference page.  Maybe a discussion there could explain the
> different paths better than making the output script extra complicated.
>
> Essentially: If you want a slow and gentle analyze, use the supplied
> script.  If you want a fast analyze, use vacuumdb, perhaps with an
> appropriate --jobs option.  Note that pg_upgrade --jobs and vacuumdb
> --jobs are resource-bound in different ways, so the same value might not
> be appropriate for both.
>
>
To me, analyze-in-stages is not about gentleness at all.  For example, it
does nothing to move vacuum_cost_delay away from its default of 0.  Rather,
it is about slamming the bare minimum statistics in there as fast as
possible, so your database doesn't keel over from horrible query plans on
even simple queries as soon as you reopen.  I want the database to survive
long enough for the more complete statistics to be gathered.  If you have
quickly accumulated max_connection processes all running horrible query
plans that never finish, your database might as well still be closed for
all the good it does the users. And all the load generated by those is
going to make the critical ANALYZE all that much slower.

At first blush I thought it was obvious that you would not want to run
analyze-in-stages in parallel.  But after thinking about it some more and
reflecting on experience doing some troublesome upgrades, I would reverse
that and say it is now obvious you do want at least the first stage of
analyze-in-stages, and probably the first two, to run in parallel.  That is
not currently an option it supports, so we can't really recommend it in the
script or the docs.

But we could at least adopt the more straightforward patch, and suggest that if
they don't want analyze-in-stages they should consider doing the big-bang
analyze in parallel.

Cheers,

Jeff


compiler warning in pgcrypto imath.c

2019-03-22 Thread Jeff Janes
When compiling on an AWS 64 bit Arm machine, I get this compiler warning:

imath.c: In function 's_ksqr':
imath.c:2590:6: warning: variable 'carry' set but not used
[-Wunused-but-set-variable]
  carry;
  ^

With this version() output:

PostgreSQL 12devel on aarch64-unknown-linux-gnu, compiled by gcc
(Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit

The attached patch adds PG_USED_FOR_ASSERTS_ONLY to silence it.  Perhaps
there is a better way, given that we want to change imath.c as little as
possible from its upstream?

Cheers,

Jeff


pgcrypto_warning.patch
Description: Binary data


Re: jsonpath

2019-03-16 Thread Jeff Janes
On Sat, Mar 16, 2019 at 5:36 AM Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

>
> So, pushed.  Many thanks to reviewers and authors!
>

I think these files have to be cleaned up by "make maintainer-clean"

./src/backend/utils/adt/jsonpath_gram.c
./src/backend/utils/adt/jsonpath_scan.c

Cheers,

Jeff


Re: GiST VACUUM

2019-03-15 Thread Jeff Janes
On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas  wrote:

> On 05/03/2019 02:26, Andrey Borodin wrote:
> >> I also tried your amcheck tool with this. It did not report any
> >> errors.
> >>
> >> Attached is also latest version of the patch itself. It is the
> >> same as your latest patch v19, except for some tiny comment
> >> kibitzing. I'll mark this as Ready for Committer in the commitfest
> >> app, and will try to commit it in the next couple of days.
> >
> > That's cool! I'll work on 2nd step of these patchset to make
> > blockset data structure prettier and less hacky.
>
> Committed the first patch. Thanks for the patch!
>

Thank you.  This is a transformational change; it will allow GiST indexes
larger than RAM to be used in some cases where they were simply not
feasible to use before.  On a HDD, it resulted in a 50 fold improvement in
vacuum time, and the machine went from unusably unresponsive to merely
sluggish during the vacuum.  On a SSD (albeit a very cheap laptop one, and
exposed from Windows host to Ubuntu over VM Virtual Box) it is still a 30
fold improvement, from a far faster baseline.  Even on an AWS instance with
a "GP2" SSD volume, which normally shows little benefit from sequential
reads, I get a 3 fold speed up.

I also ran this through a lot of crash-recovery testing using simulated
torn-page writes using my traditional testing harness with high concurrency
(AWS c4.4xlarge and a1.4xlarge using 32 concurrent update processes) and
did not encounter any problems.  I tested both with btree_gist on a scalar
int, and on tsvector with each tsvector having 101 tokens.

I did notice that the space freed up in the index by vacuum doesn't seem to
get re-used very efficiently, but that is an ancestral problem independent
of this change.

Cheers,

Jeff


Hash index initial size is too large given NULLs or partial indexes

2019-03-08 Thread Jeff Janes
Referring to this thread:

https://dba.stackexchange.com/questions/231647/why-are-partial-postgresql-hash-indices-not-smaller-than-full-indices

When a hash index is created on a populated table, it estimates the number
of buckets to start out with based on the number of tuples returned
by estimate_rel_size.  But this number ignores both the fact that NULLs are
not stored in hash indexes, and that partial indexes exist.  This can lead
to much too large hash indexes.  Doing a re-index just repeats the logic,
so doesn't fix anything.  Fill factor also can't fix it, as you are not
allowed to increase that beyond 100.

This goes back to when the pre-sizing was implemented in 2008
(c9a1cc694abef737548a2a).  It seems to be an oversight, rather than
something that was considered.

Is this a bug that should be fixed?  Or if getting a more accurate estimate
is not possible or not worthwhile, add a code comment about that?
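
A quick way to reproduce it (a sketch; exact sizes will vary, but both indexes
come out sized for the full row count rather than for the rows actually
indexed):

create table t as
  select case when g % 100 = 0 then g end as x   -- 99% of the values are NULL
  from generate_series(1, 1000000) as g;
create index t_hash on t using hash (x);
create index t_hash_partial on t using hash (x) where x is not null;
select relname, pg_size_pretty(pg_relation_size(oid))
  from pg_class where relname in ('t_hash', 't_hash_partial');
reindex index t_hash;  -- repeats the same estimate, so the size does not shrink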

Cheers,

Jeff


Re: Should we increase the default vacuum_cost_limit?

2019-03-08 Thread Jeff Janes
On Wed, Mar 6, 2019 at 2:54 PM Andrew Dunstan <
andrew.duns...@2ndquadrant.com> wrote:

>
> On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> > On 3/5/19 14:14, Andrew Dunstan wrote:
> >> This patch is tiny, seems perfectly reasonable, and has plenty of
> >> support. I'm going to commit it shortly unless there are last minute
> >> objections.
> > +1
> >
>
> done.
>

Now that this is done, the default value is only 5x below the hard-coded
maximum of 10,000.
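
For reference, the new default and the ceiling are both visible in pg_settings:

select name, boot_val, max_val
  from pg_settings where name = 'vacuum_cost_limit';  -- boot_val 2000, max_val 10000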

This seems a bit odd, and not very future-proof.  Especially since the
hard-coded maximum appears to have no logic to it anyway, at least none
that is documented.  Is it just mindless nannyism?

Any reason not to increase by at least a factor of 10, but preferably the
largest value that does not cause computational problems (which I think
would be INT_MAX)?

Cheers,

Jeff


Re: Index Skip Scan

2019-02-28 Thread Jeff Janes
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello.
>
> At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthali...@gmail.com>
> wrote:
> > A bit of adjustment after nodes/relation -> nodes/pathnodes.
>
> I had a look on this.
>
> The name "index skip scan" is a different feature from the
> feature with that name on other products, which means "index scan
> with a postfix key (mainly of a multi-column key) that scans
> ignoring the prefixing part".  As Thomas suggested, I'd suggest that
> we call it "index hop scan". (I can accept Hopscotch, either:p)
>

I think that what we have proposed here is just an incomplete
implementation of what other products call a skip scan, not a fundamentally
different thing.  They don't ignore the prefix part, they use that part in
a way to cancel itself out to give the same answer, but faster.  I think
they would also use this skip method to get distinct values if that is what
is requested.  But they would go beyond that to also use it to do something
similar to the plan we get with this:

Set up:
pgbench -i -s50
create index on pgbench_accounts (bid, aid);
alter table pgbench_accounts drop constraint pgbench_accounts_pkey ;

Query:
explain analyze with t as (select distinct bid from pgbench_accounts )
  select pgbench_accounts.* from pgbench_accounts join t using (bid) where
aid=5;

If we accept this patch, I hope it would be expanded in the future to give
similar performance as the above query does even when the query is written
in its more natural way of:

explain analyze select * from pgbench_accounts where aid=5;

(which currently takes 200ms, rather than the 0.9ms taken for the one
benefiting from skip scan)

I don't think we should give it a different name, just because our initial
implementation is incomplete.

Or do you think our implementation of one feature does not really get us
closer to implementing the other?

Cheers,

Jeff


Re: Index Skip Scan

2019-02-28 Thread Jeff Janes
On Wed, Feb 20, 2019 at 11:33 AM Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <
> jesper.peder...@redhat.com> wrote:
> >
> > Dmitry and I will look at this and take it into account for the next
> > version.
>
> In the meantime, just to not forget, I'm going to post another version
> with a
> fix for cursor fetch backwards, which was crashing before.


This version of the patch can return the wrong answer.

create index on pgbench_accounts (bid, aid);
begin; declare c  cursor for select distinct on (bid) bid, aid from
pgbench_accounts order by bid, aid;
fetch 2 from c;
 bid |   aid
-----+---------
   1 |       1
   2 | 100,001

fetch backward 1 from c;
 bid |   aid
-----+---------
   1 | 100,000

Without the patch, instead of getting a wrong answer, I get an error:

ERROR:  cursor can only scan forward
HINT:  Declare it with SCROLL option to enable backward scan.

If I add "SCROLL", then I do get the right answer with the patch.
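
For completeness, the SCROLL variant is just:

begin;
declare c scroll cursor for select distinct on (bid) bid, aid from
pgbench_accounts order by bid, aid;
fetch 2 from c;
fetch backward 1 from c;   -- now returns bid=1, aid=1 again, as expected
commit;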

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-28 Thread Jeff Janes
On Sun, Feb 24, 2019 at 11:09 AM Jeff Janes  wrote:

> I've moved this to the hackers list, and added Teodor and Alexander of the
> bloom extension, as I would like to hear their opinions on the costing.
>

My previous patch had accidentally included a couple of lines from a different
thing I was working on (a memory leak fix, now committed), so this patch
removes that diff.

I'm adding it to the commitfest targeting v13.  I'm more interested in
feedback on the conceptual issues rather than stylistic ones, as I would
probably merge the two functions together before proposing something to
actually be committed.

Should we be trying to estimate the false positive rate, and charge the
cpu_tuple_cost, cpu_operator_cost, and IO costs of visiting the table to
recheck and reject those?  I don't think other index types do that, and I'm
inclined to think the burden should be on the user not to create silly
indexes that produce an overwhelming number of false positives.

Cheers,

Jeff


bloom_cost_v3.patch
Description: Binary data


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-26 Thread Jeff Janes
On Thu, Feb 21, 2019 at 1:05 AM Pavan Deolasee 
wrote:

> Hi,
>
> Jeff Janes raised an issue [1] about PD_ALL_VISIBLE not being set
> correctly while loading data via COPY FREEZE and had also posted a draft
> patch.
>
> I now have what I think is a more complete patch. I took a slightly
> different approach and instead of setting PD_ALL_VISIBLE bit initially and
> then not clearing it during insertion, we now recheck the page for
> all-frozen, all-visible tuples just before switching to a new page. This
> allows us to then also mark set the visibility map bit, like we do in
> vacuumlazy.c
>
> Some special treatment is required to handle the last page before bulk
> insert is shut down. We could have chosen not to do anything special for the
> last page and let it remain unfrozen, but I thought it makes sense to take
> that extra effort so that we can completely freeze the table and set all VM
> bits at the end of COPY FREEZE.
>
> Let me know what you think.
>

Hi Pavan, thanks for picking this up.

After doing a truncation and '\copy ... with (freeze)' of a table with long
data, I find that the associated toast table has a handful of unfrozen
blocks.  I don't know if that is an actual problem, but it does seem a bit
odd, and thus suspicious.

perl -le 'print join "", map rand(), 1..500 foreach 1..100' > foo

create table foobar1 (x text);
begin;
truncate foobar1;
\copy foobar1 from foo with (freeze)
commit;
select all_visible,all_frozen,pd_all_visible, count(*) from
pg_visibility('pg_toast.pg_toast_25085') group by 1,2,3;
 all_visible | all_frozen | pd_all_visible |  count
-------------+------------+----------------+---------
 f           | f          | f              |      18
 t           | t          | t              | 530,361
(2 rows)
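
To find the toast table name for your own run (the OID suffix will differ),
something like this works:

select reltoastrelid::regclass from pg_class where relname = 'foobar1';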

Cheers,

Jeff



Re: Bloom index cost model seems to be wrong

2019-02-24 Thread Jeff Janes
I've moved this to the hackers list, and added Teodor and Alexander of the
bloom extension, as I would like to hear their opinions on the costing.

On Tue, Feb 12, 2019 at 4:17 PM Tom Lane  wrote:

>
> It's possible that a good cost model for bloom is so far outside
> genericcostestimate's ideas that trying to use it is not a good
> idea anyway.
>
>
I'm attaching a crude patch based over your refactored header files.

I just copied genericcostestimate into bloom, and made a few changes.

I think one change should be conceptually uncontroversial, which is to
change the IO cost from random_page_cost to seq_page_cost.  Bloom indexes
are always scanned in their entirety.
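
One way to see that directly (a sketch, assuming the bloom extension is
installed; enable_seqscan is forced off only because the current costing might
otherwise pick a seq scan):

create extension if not exists bloom;
create table tb as
  select (random() * 1000)::int as i1, (random() * 1000)::int as i2
  from generate_series(1, 1000000);
create index tb_bloom on tb using bloom (i1, i2);
set enable_seqscan = off;
explain (analyze, buffers) select * from tb where i1 = 5 and i2 = 17;
reset enable_seqscan;

The buffers reported for the Bitmap Index Scan node stay essentially the same
no matter how selective the quals are, because the index is read end to end.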

The other change is not to charge any cpu_operator_cost per tuple.  Bloom
doesn't go through the ADT machinery; it just does very fast
bit-twiddling.  I could assign a fraction of a cpu_operator_cost,
multiplied by bloomLength rather than list_length(indexQuals), to this
bit-twiddling. But I think that that fraction would need to be quite close
to zero, so I just removed it.

When everything is in memory, Bloom still gets way overcharged for CPU
usage even without the cpu_operator_cost.  This patch doesn't do anything
about that.  I don't know if the default value of cpu_index_tuple_cost is
way too high, or if Bloom just shouldn't be charging the full value of it
in the first place given the way it accesses index tuples.  For comparison,
when using a Btree as an equality filter on a non-leading column, most of
the time goes to index_getattr.  Should the time spent there be loaded on
cpu_operator_cost or onto cpu_index_tuple_cost?  It is not strictly spent
in the operator, but fetching the parameter to be used in an operator is
more closely a per-operator problem than a per-tuple problem.

Most of genericcostestimate still applies.  For example, ScalarArrayOpExpr
handling, and Mackert-Lohman formula.  It is a shame that all of that has
to be copied.

There are some other parts of genericcostestimate that probably don't apply
(OrderBy, for example) but I left them untouched for now to make it easier
to reconcile changes to the real genericcostestimate with the copy.

For ScalarArrayOpExpr, it would be nice to scan the index once and add to
the bitmap all branches of the =ANY in one index scan, but I don't see the
machinery to do that.  It would be a matter for another patch anyway, other
than the way it would change the cost estimate.

Cheers,

Jeff


bloom_cost_v2.patch
Description: Binary data


Re: Problems with plan estimates in postgres_fdw

2019-02-20 Thread Jeff Janes
On Wed, Jan 30, 2019 at 6:12 AM Etsuro Fujita 
wrote:

> (2018/12/28 15:50), Etsuro Fujita wrote:
> > Attached is a new version of the
> > patch.
>
> Here is an updated version of the patch set.  Changes are:
>
> * In the previous version, LIMIT without OFFSET was not performed
> remotely, as the cost was calculated the same as the cost of performing
> it locally.  (Actually, such LIMIT was performed remotely in a case in
> the postgres_fdw regression test, but that was due to a bug :(.)  I
> think we should prefer performing such LIMIT remotely as that avoids
> extra row fetches from the remote that performing it locally might
> cause, so I tweaked the costs estimated in estimate_path_cost_size(), to
> ensure that we'll prefer performing such LIMIT remotely.
>

With your tweaks, I'm still not seeing the ORDER-less LIMIT get pushed down
when using use_remote_estimate in a simple test case, either with this set
of patches or with the V4 set.  However, without use_remote_estimate, the
LIMIT now gets pushed down with these patches when it does not in HEAD.

See attached test case, to be run in new database named 'foo'.

Cheers,

Jeff
create extension postgres_fdw ;
create server foo foreign data wrapper postgres_fdw options 
(use_remote_estimate  'true', fetch_size '10');
\! pgbench -i -s20 foo
create user MAPPING FOR public SERVER foo ;
create schema schema2;
import FOREIGN SCHEMA public from server foo into schema2;
explain (verbose,analyze) select * from schema2.pgbench_accounts limit 1;


Re: Actual Cost

2019-02-17 Thread Jeff Janes
On Sat, Feb 16, 2019 at 10:33 PM Donald Dong  wrote:

> On Feb 16, 2019, at 6:44 PM, Tomas Vondra wrote:
> >
> > On 2/17/19 3:40 AM, David Fetter wrote:
> >>
> >> As someone not volunteering to do any of the work, I think it'd be a
> >> nice thing to have.  How large an effort would you guess it would be
> >> to build a proof of concept?
> >
> > I don't quite understand what is meant by "actual cost metric" and/or
> > how is that different from running EXPLAIN ANALYZE.
>
> Here is an example:
>
> Hash Join  (cost=3.92..18545.70 rows=34 width=32) (actual cost=3.92..18500
> time=209.820..1168.831 rows=47 loops=3)
>
> Now we have the actual time. Time can have a high variance (a change
> in system load, or just noise), but I think the actual cost would be
> less likely to change due to external factors.
>

I don't think there is any way to assign an actual cost.  For example how
do you know if a buffer read was "actually" seq_page_cost or
random_page_cost?  And if there were a way, it too would have a high
variance.

What would I find very useful is a verbosity option to get the cost
estimates expressed as a multiplier of each *_cost parameter, rather than
just as a scalar.  And at the whole-query level, get an rusage report
rather than just wall-clock duration.  And if the HashAggregate node under
"explain analyze" would report memory and bucket stats; and if
the Aggregate node would report...anything.
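
A crude way to back a single multiplier out today is to lean on the cost model
being linear in each parameter (the query here is just a placeholder):

explain select * from pgbench_accounts where aid < 1000;  -- note total cost C1
set random_page_cost = 8;                                 -- double the default of 4
explain select * from pgbench_accounts where aid < 1000;  -- note total cost C2
reset random_page_cost;

(C2 - C1) / 4 is then the number of random_page_cost units charged to the
plan, provided the plan shape didn't change between the two runs; doing that
for every parameter of every node quickly gets tedious, which is the point.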

Cheers,

Jeff


Re: TupleTableSlot abstraction

2019-02-16 Thread Jeff Janes
On Fri, Nov 16, 2018 at 7:46 PM Andres Freund  wrote:

> Hi,
>
> On 2018-11-13 15:30:21 -0800, Andres Freund wrote:
> > What I'm now planning to do is to go through the big comment in
> > tuptable.h and update that to the new world.  While I'm not convinced
> > that that that's the best place for it, it needs to be accurate.
> >
> > Furthermore:
> > - More comment polishing
> > - I'll probably split the commits up a bit further (particulary JIT
> >   ignoring virtual tuple slots, inlining the hot path of
> >   slot_getsomeattrs())
> > - serious commit message polishing
>
> I've done all of that now, and pushed it.  Thanks Ashutosh, Amit
> Khandekar and everyone else.
>

Hi Andres and all,

This commit, 763f2edd92095b1ca2 "Rejigger materializing and fetching a
HeapTuple from a slot", introduces a memory leak into the ExecutorState
context, which can be triggered by this statement:

CREATE TABLE tbloom AS
   SELECT
 (random() * 100)::int as i1,
 (random() * 100)::int as i2,
 (random() * 100)::int as i3,
 (random() * 100)::int as i4,
 (random() * 100)::int as i5,
 (random() * 100)::int as i6
   FROM
  generate_series(1,5000);

By blind analogy to the changes made to matview.c, I think that createas.c
is missing a heap_freetuple, as attached.

It fixes the leak, and still passes "make check".
Cheers,

Jeff


createas_leak_fix.patch
Description: Binary data


Re: Make relcache init write errors not be fatal

2018-12-22 Thread Jeff Janes
On Sat, Dec 22, 2018 at 8:54 PM Andres Freund  wrote:

> Hi,
>
> On 2018-12-22 20:49:58 -0500, Jeff Janes wrote:
> > After running a testing server out of storage, I tried to track down why
> it
> > was so hard to get it back up again.  (Rather than what I usually do
> which
> > is just throwing it away and making the test be smaller).
> >
> > I couldn't start a backend because it couldn't write the relcache init
> file.
> >
> > I found this comment, but it did not carry its sentiment to completion:
> >
> > /*
> >  * We used to consider this a fatal error, but we might as well
> >  * continue with backend startup ...
> >  */
> >
> > With the attached patch applied, I could at least get a backend going so
> I
> > could drop some tables/indexes and free up space.
>
> Why is this a good idea?  It'll just cause hard to debug performance
> issues imo.
>
>
You get lots of WARNINGs, so it shouldn't be too hard to debug.  And once
you drop a table or an index, the init will succeed and you won't have
the performance issues anymore.

The alternative, barring finding extraneous data on the same partition that
can be removed, seems to be having indefinite downtime until you can locate
a larger hard drive and move everything to it, or using dangerous hacks.

Cheers,

Jeff


Make relcache init write errors not be fatal

2018-12-22 Thread Jeff Janes
After running a testing server out of storage, I tried to track down why it
was so hard to get it back up again.  (Rather than doing what I usually do,
which is just throwing it away and making the test smaller.)

I couldn't start a backend because it couldn't write the relcache init file.

I found this comment, but it did not carry its sentiment to completion:

/*
 * We used to consider this a fatal error, but we might as well
 * continue with backend startup ...
 */

With the attached patch applied, I could at least get a backend going so I
could drop some tables/indexes and free up space.

I'm not enamoured with the implementation of passing a flag down
to write_item, but it seemed better than making write_item return an error
code and then checking the return status in a dozen places.  Maybe we could
turn write_item into a macro, so the macro can implement the "return" from
the outer function directly?

Cheers,

Jeff


relcache_init_v1.patch
Description: Binary data

