Re: BUG #16079: Question Regarding the BUG #16064
On Sun, Dec 20, 2020 at 7:58 PM Stephen Frost wrote:
> * Magnus Hagander (mag...@hagander.net) wrote:
> > Changed from bugs to hackers.
> > For the old plaintext "password" method, we log a warning when we parse the
> > configuration file.

Like Stephen, I don't see such a warning getting logged.

> > Maybe we should do the same for LDAP (and RADIUS)? This seems like a better
> > place to put it than to log it at every time it's received?
>
> A dollar short and a year late, but ... +1.

I would suggest going further. I would make the change on the client side, and have libpq refuse to send unhashed passwords without having an environment variable set which allows it. (Also, change the client behavior so it defaults to verify-full when PGSSLMODE is not set.)

What is the value of logging on the server side? I can change the setting from password to md5 or from ldap to gss when I notice the log message. But once compromised, or during a MITM attack, the bad guy will just set it back to the unsafe form and the client will silently go along with it.

Cheers,

Jeff
DETAIL for wrong scram password
When md5 password authentication fails, the server log file has a helpful DETAIL line to say why, usually one of:

DETAIL:  Role "none" does not exist.
DETAIL:  User "jjanes" has no password assigned.
DETAIL:  User "jjanes" has an expired password.
DETAIL:  Password does not match for user "jjanes".

But for scram authentication, only the first three of these will be reported when applicable. If the password is simply incorrect, you do get a DETAIL line reporting which line of pg_hba was used, but you don't get a DETAIL line reporting the reason for the failure. It is pretty unfriendly to make the admin guess what the absence of a DETAIL is supposed to mean, and as far as I can tell, this is not intentional.

Note that in one case you do get the "does not match" line: if the user has a scram password assigned and the hba line specifies plain-text 'password' as the method. So if the absence of the DETAIL is intentional, it is not internally consistent.

The attached patch fixes the issue. I don't know if this is the correct location to be installing the message; maybe verify_client_proof should be doing it instead. I am also not happy to be testing 'doomed' here, but it was already checked a few lines up, so I didn't want to go to lengths to avoid doing it here too.

Cheers,

Jeff

scram_password_mismatch.patch
Description: Binary data
Re: Supporting = operator in gin/gist_trgm_ops
On Sat, Nov 14, 2020 at 12:31 AM Alexander Korotkov wrote:
>
> I went through and revised this patch. I made the documentation
> statement less categorical. pg_trgm gist/gin indexes might have lower
> performance of equality operator search than B-tree. So, we can't
> claim the B-tree index is always not needed. Also, simple comparison
> operators are <, <=, >, >=, and they are not supported.
>

Is "simple comparison" here a well-known term of art? If I read the doc as committed (which doesn't include the sentence above), and if I didn't already know what it was saying, I would be left wondering which comparisons those are. Could we just say "inequality operators"?

Cheers,

Jeff
memory leak in auto_explain
I accidentally tried to populate a test case while auto_explain.log_min_duration was set to zero. auto_explain.log_nested_statements was also on.

create or replace function gibberish(int) returns text language SQL as $_$
  select left(string_agg(md5(random()::text),''),$1) from generate_series(0,$1/32)
$_$;

create table j1 as
  select x, md5(random()::text) as t11, gibberish(1500) as t12
  from generate_series(1,20e6) f(x);

I got logorrhea of course, but I also got a memory leak into the SQL function context:

TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
  PortalContext: 16384 total in 5 blocks; 5328 free (1 chunks); 11056 used:
    ExecutorState: 4810120 total in 13 blocks; 4167160 free (74922 chunks); 642960 used
      SQL function: 411058232 total in 60 blocks; 4916568 free (4 chunks); 406141664 used: gibberish

The memory usage grew until the OOM killer stepped in.

Cheers,

Jeff
Re: memory leak in auto_explain
On Mon, Feb 1, 2021 at 6:09 PM Jeff Janes wrote:
>
> create or replace function gibberish(int) returns text language SQL as $_$
>   select left(string_agg(md5(random()::text),''),$1) from generate_series(0,$1/32)
> $_$;
>
> create table j1 as select x, md5(random()::text) as t11, gibberish(1500)
> as t12 from generate_series(1,20e6) f(x);
>

I should have added, I found it on HEAD and verified it also in 12.5.

Cheers,

Jeff
track_io_timing default setting
Can we change the default setting of track_io_timing to on? I see a lot of questions, such as over at stackoverflow or dba.stackexchange.com, where people ask for help with plans that would be much more useful were this on. Maybe they just don't know better, maybe they can't turn it on because they are not a superuser. I can't imagine a lot of people who care much about its performance impact will be running the latest version of PostgreSQL on ancient/weird systems that have slow clock access. (And the few who do can just turn it off for their system). For systems with fast user-space clock access, I've never seen this setting being turned on make a noticeable dent in performance. Maybe I just never tested enough in the most adverse scenario (which I guess would be a huge FS cache, a small shared buffers, and a high CPU count with constant churning of pages that hit the FS cache but miss shared buffers--not a system I have handy to do a lot of tests with.) Cheers, Jeff
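For anyone wanting to see the effect being discussed, a minimal sketch (the table name and the numbers are just placeholders; the "I/O Timings" line only appears when the setting is on and the node actually did reads):

-- requires superuser, which is exactly the problem for the askers mentioned above
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM pgbench_accounts;
-- with the setting on, each node's output gains lines like:
--   Buffers: shared hit=2176 read=162724
--   I/O Timings: read=1234.567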
Re: Loaded footgun open_datasync on Windows
On Fri, Sep 14, 2018 at 3:32 AM Michael Paquier wrote: > On Fri, Sep 14, 2018 at 08:43:18AM +0200, Laurenz Albe wrote: > > > If it turns out not to break anything, would you consider backpatching? > > On the one hand it fixes a bug, on the other hand it affects all > > frontend executables... > > Yeah, for this reason I would not do a backpatch. I have a very hard > time to believe that any frontend tools on Windows developed by anybody > rely on files to be opened only by a single process, still if they do > they would be surprised to see a change of behavior after a minor > update in case they rely on the concurrency limitations. > Reviving an old thread here. Could it be back-patched in some pg_test_fsync specific variant? I don't think we should just ignore the fact that pg_test_fsync on Windows is unfit for its intended purpose on 4 still-supported versions. > > I wonder why nobody noticed the problem in pg_test_fsync earlier. > > Is it that people running Windows care less if their storage is > > reliable? > > likely so. > I have noticed this before, but since it wasn't a production machine I just shrugged it off as being a hazard of using consumer-grade stuff; it didn't seem to be worth investigating further. Cheers, Jeff
Re: estimation problems for DISTINCT ON with FDW
On Fri, Jul 3, 2020 at 5:50 PM Tom Lane wrote:
>
> OK, I'll go ahead and push the patch I proposed yesterday.
>

Thank you. I tested 12_STABLE with my real queries on the real data set, and the "hard coded" estimate of 200 distinct rows (when use_remote_estimate is turned back on) is enough to get rid of the worst plans I was seeing in 12.3.

Cheers,

Jeff
tab completion of IMPORT FOREIGN SCHEMA
I use IMPORT FOREIGN SCHEMA a bit to set up systems for testing. But not enough that I can ever remember whether INTO or FROM SERVER comes first in the syntax. Here is an improvement to the tab completion, so I don't have to keep looking it up in the docs. It accidentally (even before this patch) completes "IMPORT FOREIGN SCHEMA" with a list of local schemas. This is probably wrong, but I find this convenient as I often do this in a loop-back setup where the list of foreign schema would be the same as the local ones. So I don't countermand that behavior here. Cheers, Jeff foreign_schema_tab_complete.patch Description: Binary data
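For reference, this is the word order the completion is meant to help with, in a loop-back setup like the one described (server, schema, and table names here are illustrative):

IMPORT FOREIGN SCHEMA public
  LIMIT TO (pgbench_accounts, pgbench_history)  -- optional; EXCEPT (...) is also allowed here
  FROM SERVER loopback
  INTO fgn;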
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila wrote:
> I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.
>

I'm getting compiler warnings now. src/backend/storage/file/sharedfileset.c line 288 needs to be:

    bool        found PG_USED_FOR_ASSERTS_ONLY = false;

Cheers,

Jeff
Autovac cancellation is broken in v14
If I create a large table with "CREATE TABLE ... AS SELECT ... from generate_series(1,3e7)" with no explicit transactions, then once it is done I wait for autovac to kick in, then when I try to build an index on that table (or drop the table) the autovac doesn't go away on its own.

Bisects down to:

commit 5788e258bb26495fab65ff3aa486268d1c50b123
Author: Andres Freund
Date:   Wed Jul 15 15:35:07 2020 -0700

    snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags.

Which makes sense given the parts of the code this touches, although I don't understand exactly what the problem is.

The problem persists in HEAD (77c7267c37).

Cheers,

Jeff
Re: Autovac cancellation is broken in v14
On Thu, Aug 27, 2020 at 3:10 PM Jeff Janes wrote:
> If I create a large table with "CREATE TABLE ... AS SELECT ... from
> generate_series(1,3e7)" with no explicit transactions, then once it is done
> I wait for autovac to kick in, then when I try to build an index on that
> table (or drop the table) the autovac doesn't go away on its own.
>

After a bit more poking at this, I think we are checking whether we ourselves are an autovac process, rather than doing the intended check of whether the other guy is one.

Where would be a good spot to add a regression test for this? "isolation_regression"?

Cheers,

Jeff

fix_autovac_cancel.patch
Description: Binary data
moving aggregate bad error message
I was wondering if I could just add minvfunc, and have the rest of the m* functions be assumed to be the same as their non-moving counterparts. Apparently the answer is 'no'. But in the process, I found a bad error message.

When omitting mfinalfunc when there is a finalfunc, I get the error:

"ERROR:  moving-aggregate implementation returns type jj_state, but plain implementation returns type jj_state."

A rather peculiar complaint, analogous to the infamous "something failed: Success". Looking at the code, it seems we are testing rettype != finaltype, but reporting aggmTransType and aggTransType. Why aren't we reporting what we are testing?

With the attached patch, it gives the more sensible:

"ERROR:  moving-aggregate implementation returns type jj_state, but plain implementation returns type numeric."

Cheers,

Jeff

drop type jj_state cascade;

CREATE TYPE jj_state AS (
    x numeric,
    y numeric);

CREATE OR REPLACE FUNCTION jj_transition( state jj_state, val numeric )
RETURNS jj_state AS $$
BEGIN
    RETURN NULL::jj_state;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION jj_final( state jj_state )
RETURNS numeric AS $$
BEGIN
    RETURN 5.5;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION jj_inverse(state jj_state, val numeric)
RETURNS jj_state AS $$
BEGIN
    RETURN NULL::jj_state;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE jj(numeric) (
    sfunc = jj_transition,
    stype = jj_state,
    finalfunc = jj_final,
    initcond = '(0,0)',
    msfunc = jj_transition,
    minvfunc = jj_inverse,
    mstype = jj_state
    -- mfinalfunc = average_final
);

moving_agg_error.patch
Description: Binary data
Re: WIP: Covering + unique indexes.
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev wrote:
> Thanks to everyone, pushed.
>

Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int indnatts = tupleDescriptor->natts;

Cheers,

Jeff
Default JIT setting in V12
Since JIT is on by default in v12, I wanted to revisit the issue raised in https://www.postgresql.org/message-id/CAMkU=1zVhQ5k5d=YyHNyrigLUNTkOj4=YB17s9--3ts8H-SO=q...@mail.gmail.com When the total estimated cost is between jit_above_cost and jit_optimize_above_cost, I get a substantial regression in the attached. Note that I did not devise this case specifically to cause this problem, I just stumbled upon it. JIT, no optimization: 10.5s JIT, optimization: 3.8s no JIT: 4.1s It seems like the unoptimized JIT code is much worse than the general purpose code. This is on AWS c4.large, Ubuntu 18.04, installed from PGDG apt repository. No config changes were made, other than the local ones included in the script. (Previously there were questions about how LLVM was configured, that I couldn't really answer well, but here there should be question as I didn't compile or configure it at all.) There were some proposed mitigations in sister threads, but none have been adopted in v12. I think it is intuitive, and with empirical evidence, that we do not want to JIT compile at all unless we are going to optimize the compiled code. Is there a rationale for, or other examples to show, that it makes sense for the default value of jit_optimize_above_cost to be 5 fold higher than the default setting of jit_above_cost? I think these defaults are setting a trap for our users who aren't really interested in JIT, and are just upgrading to stay on the most-current version. I would propose lowering the default jit_optimize_above_cost to be the same as jit_above_cost, or set it to 0 so that jit_above_cost is always in control and always optimizes. Cheers, Jeff drop table if exists i200c200; create table i200c200 ( pk bigint primary key, int1 bigint, int2 bigint, int3 bigint, int4 bigint, int5 bigint, int6 bigint, int7 bigint, int8 bigint, int9 bigint, int10 bigint, int11 bigint, int12 bigint, int13 bigint, int14 bigint, int15 bigint, int16 bigint, int17 bigint, int18 bigint, int19 bigint, int20 bigint, int21 bigint, int22 bigint, int23 bigint, int24 bigint, int25 bigint, int26 bigint, int27 bigint, int28 bigint, int29 bigint, int30 bigint, int31 bigint, int32 bigint, int33 bigint, int34 bigint, int35 bigint, int36 bigint, int37 bigint, int38 bigint, int39 bigint, int40 bigint, int41 bigint, int42 bigint, int43 bigint, int44 bigint, int45 bigint, int46 bigint, int47 bigint, int48 bigint, int49 bigint, int50 bigint, int51 bigint, int52 bigint, int53 bigint, int54 bigint, int55 bigint, int56 bigint, int57 bigint, int58 bigint, int59 bigint, int60 bigint, int61 bigint, int62 bigint, int63 bigint, int64 bigint, int65 bigint, int66 bigint, int67 bigint, int68 bigint, int69 bigint, int70 bigint, int71 bigint, int72 bigint, int73 bigint, int74 bigint, int75 bigint, int76 bigint, int77 bigint, int78 bigint, int79 bigint, int80 bigint, int81 bigint, int82 bigint, int83 bigint, int84 bigint, int85 bigint, int86 bigint, int87 bigint, int88 bigint, int89 bigint, int90 bigint, int91 bigint, int92 bigint, int93 bigint, int94 bigint, int95 bigint, int96 bigint, int97 bigint, int98 bigint, int99 bigint, int100 bigint, int101 bigint, int102 bigint, int103 bigint, int104 bigint, int105 bigint, int106 bigint, int107 bigint, int108 bigint, int109 bigint, int110 bigint, int111 bigint, int112 bigint, int113 bigint, int114 bigint, int115 bigint, int116 bigint, int117 bigint, int118 bigint, int119 bigint, int120 bigint, int121 bigint, int122 bigint, int123 bigint, int124 bigint, int125 bigint, int126 bigint, int127 bigint, int128 bigint, 
int129 bigint, int130 bigint, int131 bigint, int132 bigint, int133 bigint, int134 bigint, int135 bigint, int136 bigint, int137 bigint, int138 bigint, int139 bigint, int140 bigint, int141 bigint, int142 bigint, int143 bigint, int144 bigint, int145 bigint, int146 bigint, int147 bigint, int148 bigint, int149 bigint, int150 bigint, int151 bigint, int152 bigint, int153 bigint, int154 bigint, int155 bigint, int156 bigint, int157 bigint, int158 bigint, int159 bigint, int160 bigint, int161 bigint, int162 bigint, int163 bigint, int164 bigint, int165 bigint, int166 bigint, int167 bigint, int168 bigint, int169 bigint, int170 bigint, int171 bigint, int172 bigint, int173 bigint, int174 bigint, int175 bigint, int176 bigint, int177 bigint, int178 bigint, int179 bigint, int180 bigint, int181 bigint, int182 bigint, int183 bigint, int184 bigint, int185 bigint, int186 bigint, int187 bigint, int188 bigint, int189 bigint, int190 bigint, int191 bigint, int192 bigint, int193 bigint, int194 bigint, int195 bigint, int196 bigint, int197 bigint, int198 bigint, int199 bigint, int200 bigint, char1 varchar(255), char2 varchar(255), char3 varchar(255), char4 varchar(255), char5 varchar(255), char6 varchar(255), char7 varchar(255), char8 varchar(255), char9 varchar(255), char10 varchar(255), char11 varchar(255), char12 varchar(255), char13 varchar(255), char14 varchar(255), char15 varchar(255), char16 varchar(255), char17 varchar(255), char1
log spam with postgres_fdw
I'm sending this to hackers because it is not exactly a bug, and it can't be addressed from userland. I think it is a coding issue, although I haven't identified the exact code.

When closing the local session which had used postgres_fdw over an ssl connection, I get log spam on the foreign server saying:

LOG:  could not receive data from client: Connection reset by peer

It is easy to reproduce, but you must be using ssl to do so. On searching, I see that a lot of people have run into this issue, with considerable confusion, but as far as I can see it has never been diagnosed.

Is there anything that can be done about this, other than just learning to ignore it?

Cheers,

Jeff

-- set up ssl first; I won't walk through those steps.
\! pgbench -i
create extension postgres_fdw;
create server ssl foreign data wrapper postgres_fdw OPTIONS ( host '127.0.0.1');
create schema fgn;
create user MAPPING FOR CURRENT_USER SERVER ssl;
import foreign schema public from server ssl into fgn;
select count(*) from fgn.pgbench_history;
\q
Re: Primary keepalive message not appearing in Logical Streaming Replication
On Sun, Sep 15, 2019 at 11:44 AM Michael Loftis wrote: > > > On Sun, Sep 15, 2019 at 08:36 Virendra Negi wrote: > >> Oh I miss the documentation link there you go >> https://www.postgresql.org/docs/9.5/protocol-replication.html >> >> On Sun, Sep 15, 2019 at 8:05 PM Virendra Negi >> wrote: >> >>> Agreed but why is there a message specification for it describe in the >>> documentation and it ask to client reply back if a particular *bit* is >>> set.(1 means that the client should reply to this message as soon as >>> possible, to avoid a timeout disconnect. 0 otherwise) >>> >> > This is unrelated to TCP keepalive. I honestly don't know where the knob > is to turn these on but the configuration variables you quoted earlier I am > familiar with and they are not it. Perhaps someone else can chime in with > how to enable the protocol level keepalive in replication. > Protocol-level keepalives are governed by "wal_sender_timeout" Cheers, Jeff
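For completeness, a quick way to check or change that setting (a sketch; 60s is the shipped default, and the parameter can be changed with a reload, no restart needed):

SHOW wal_sender_timeout;          -- 60s by default
ALTER SYSTEM SET wal_sender_timeout = '30s';
SELECT pg_reload_conf();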
Re: log spam with postgres_fdw
On Sun, Sep 15, 2019 at 11:14 AM Tom Lane wrote: > Jeff Janes writes: > > When closing the local session which had used postgres_fdw over an ssl > > connection, I get log spam on the foreign server saying: > > LOG: could not receive data from client: Connection reset by peer > > It is easy to reproduce, but you must be using ssl to do so. > > On searching, I see that a lot of people have run into this issue, with > > considerable confusion, but as far as I can see it has never been > diagnosed. > > In > > https://www.postgresql.org/message-id/flat/3DPLMQIC.YU6IFMLY.3PLOWL6W%40FQT5M7HS.IFBAANAE.A7GUPCPM > > Thanks, I had not spotted that one, I guess because the log message itself was not in the subject so it ranked lower. > we'd concluded that the issue is probably that postgres_fdw has no > logic to shut down its external connections when the session closes. > It's not very clear why the SSL dependency, but we speculated that > adding an on_proc_exit callback to close the connection(s) would help. > > It is easy to reproduce the ssl dependency without any FDW, just by doing a kill -9 on psql. Apparently the backend process for unencrypted connections are happy to be ghosted, while ssl ones are not; which seems like an odd distinction to make. So should this be addressed on both sides (the server not whining, and the client doing the on_proc_exit anyway?). I can take a stab at the client side one, but I'm over my head on the ssl connection handling logic on the server side. Cheers, Jeff
Re: Default JIT setting in V12
On Wed, Sep 4, 2019 at 11:24 AM Andres Freund wrote:
> Hi,
>
> On 2019-09-04 07:51:16 -0700, Andres Freund wrote:
>
> Or better, something slightly more complete, like the attached (which
> affects both code-gen time optimizations (which are more like peephole
> ones), and both function/global ones that are cheap).
>

Yes, that does completely solve the issue I raised. It makes JIT either better or at least harmless, even when falling into the gap between jit_above_cost and jit_optimize_above_cost.

What TPC-H implementation do you use/recommend? This one: https://wiki.postgresql.org/wiki/DBT-3?

Cheers,

Jeff
WAL recycled despite logical replication slot
While testing something else (whether "terminating walsender process due to replication timeout" was happening spuriously), I had logical replication set up streaming a default pgbench transaction load, with the publisher being 13devel-e1c8743 and the subscriber being 12BETA4. Eventually I started getting errors about requested WAL segments having already been removed:

10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG:  starting logical decoding for slot "sub"
10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL:  Streaming transactions committing after 79/EB0B17A0, reading WAL from 79/E70736A0.
10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR:  requested WAL segment 0001007900E7 has already been removed
10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG:  disconnection: session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2 port=40830

It had been streaming for about 50 minutes before the error showed up, and it showed up right when streaming was restarting after one of the replication timeouts.

Is there an innocent explanation for this? I thought logical replication slots provided an iron-clad guarantee that WAL would be retained until it was no longer needed. I am just using pub/sub, none of the lower-level stuff.

Cheers,

Jeff
Re: WAL recycled despite logical replication slot
On Fri, Sep 20, 2019 at 11:27 AM Tomas Vondra wrote:
>
> >Is there an innocent explanation for this? I thought logical replication
> >slots provided an iron-clad guarantee that WAL would be retained until it
> >was no longer needed. I am just using pub/sub, none of the lower level
> >stuff.
> >
>
> I think you're right - this should not happen with replication slots.
> Can you provide more detailed setup instructions, so that I can try to
> reproduce and investigate the issue?
>

It is a bit messy, because this isn't what I was trying to test. The basic setup is pretty simple.

On the master:

pgbench -i -s 100
create publication pgbench for table pgbench_accounts, pgbench_branches, pgbench_history, pgbench_tellers;
pgbench -R200 -c4 -j4 -P60 -T36 -n

On the replica:

pgbench -i -s 1
truncate pgbench_history, pgbench_accounts, pgbench_branches, pgbench_tellers;
create subscription sub CONNECTION 'host=192.168.0.15' publication pgbench;

The messy part: it looked like the sync was never going to finish, so first I cut the rate down to -R20. Then what I thought I did was drop the primary key on pgbench_accounts (manually doing a kill -15 on the sync worker to release the lock), wait for the copy to start again and then finish, and then start getting "ERROR: logical replication target relation "public.pgbench_accounts" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL" log messages, at which point I re-added the primary key. Then I increased the -R back to 200, and about 50 minutes later got the WAL-already-removed error.

But now I can't seem to reproduce this. The next time I tried to do the sync with no primary key, there doesn't seem to be a commit after the COPY finishes, so once it tries to replay the first update it hits the above "no primary key" error and then rolls back **the entire COPY** as well as the single-row update, and starts the entire COPY over again before you have a chance to intervene and build the index. So I'm guessing now that either the lack of a commit (which itself seems like a spectacularly bad idea) is situation dependent, or the very slow COPY had finished between the time I had decided to drop the primary key and the time I actually implemented the drop.

Perhaps important here is that the replica is rather underpowered. Write IO and fsyncs periodically become painfully slow, which is probably why there are replication timeouts, and since the problem happened when trying to reestablish after a timeout, I would guess that that is critical to the issue. I was running the master with fsync=off, but since the OS never crashed, that should not be the source of corruption.

I'll try some more to reproduce this, but I wanted to make sure there was actually something here to reproduce, and not just my misunderstanding of how things are supposed to work.

Cheers,

Jeff
Re: WAL recycled despite logical replication slot
On Fri, Sep 20, 2019 at 6:25 PM Andres Freund wrote: > Hi, > > On September 20, 2019 5:45:34 AM PDT, Jeff Janes > wrote: > >While testing something else (whether "terminating walsender process > >due to > >replication timeout" was happening spuriously), I had logical > >replication > >set up streaming a default pgbench transaction load, with the publisher > >being 13devel-e1c8743 and subscriber being 12BETA4. Eventually I > >started > >getting errors about requested wal segments being already removed: > > > >10863 sub idle 0 2019-09-19 17:14:58.140 EDT LOG: starting logical > >decoding for slot "sub" > >10863 sub idle 0 2019-09-19 17:14:58.140 EDT DETAIL: Streaming > >transactions committing after 79/EB0B17A0, reading WAL from > >79/E70736A0. > >10863 sub idle 58P01 2019-09-19 17:14:58.140 EDT ERROR: requested WAL > >segment 0001007900E7 has already been removed > >10863 sub idle 0 2019-09-19 17:14:58.144 EDT LOG: disconnection: > >session time: 0:00:00.030 user=jjanes database=jjanes host=10.0.2.2 > >port=40830 > > > >It had been streaming for about 50 minutes before the error showed up, > >and > >it showed right when streaming was restarting after one of the > >replication > >timeouts. > > > >Is there an innocent explanation for this? I thought logical > >replication > >slots provided an iron-clad guarantee that WAL would be retained until > >it > >was no longer needed. I am just using pub/sub, none of the lower level > >stuff. > > It indeed should. What's the content of > pg_replication_slot for that slot? > Unfortunately I don't think I have that preserved. If I can reproduce the issue, would preserving data/pg_replslot/sub/state help as well? Cheers, Jeff
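For next time, a sketch of what I would capture from the publisher when the error appears, alongside the pg_replslot state directory (the slot name matches the one above; columns are from the pg_replication_slots view):

SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub';

SELECT pg_current_wal_lsn();   -- to compare against restart_lsn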
JSONPATH documentation
I find the documentation in https://www.postgresql.org/docs/12/functions-json.html very confusing.

In table 9.44, take the first entry:

Example JSON: {"x": [2.85, -14.7, -9.4]}
Example Query: $.x.floor()
Result: 2, -15, -10

There are no end-to-end examples here. How do I apply the example query to the example JSON to obtain the given result? Table 9.47 only gives two operators which apply a jsonpath to a json(b) object, @? and @@, and neither one of those yields the indicated result from the first line in 9.44. What does?

Also, I can't really figure out what the descriptions of @? and @@ mean. Does @? return true if an item exists, even if the value of that item is false, while @@ returns the truth value of the existing item?

https://www.postgresql.org/docs/12/datatype-json.html#DATATYPE-JSONPATH says "The SQL/JSON path language is fully integrated into the SQL engine". What does that mean? If it were only partially integrated, what would that mean? Is this providing me with any useful information? Is it just saying that this is not a contrib extension module?

What is the difference between "SQL/JSON Path Operators And Methods" and "jsonpath Accessors", and why are they not described in the same place, or at least near each other?

Cheers,

Jeff
Re: JSONPATH documentation
On Sun, Sep 22, 2019 at 2:18 PM Jeff Janes wrote:
> I find the documentation in
> https://www.postgresql.org/docs/12/functions-json.html very confusing.
>
> In table 9.44, take the first entry:
>
> Example JSON: {"x": [2.85, -14.7, -9.4]}
> Example Query: $.x.floor()
> Result: 2, -15, -10
>
> There are no end-to-end examples here. How do I apply the example query to
> the example JSON to obtain the given result?
>

OK, never mind here. After digging in the regression tests, I did find jsonb_path_query and friends, and they are in the docs with examples in table 9.49. I don't know how I overlooked that in the first place; I guess I was fixated on operators. Or maybe by the time I was down in those functions, I thought I had cycled back up and was looking at 9.44 again.

But I think it would make sense to move the description of jsonpath to its own page. It is confusing to have operators within the jsonpath language, and operators which apply to jsonpath "from the outside", together on the same page.

Cheers,

Jeff
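For the archives, here is the end-to-end example I was originally looking for, using the function rather than the operators (the output shown is simply the result the docs table states):

SELECT jsonb_path_query('{"x": [2.85, -14.7, -9.4]}', '$.x.floor()');
 jsonb_path_query
------------------
 2
 -15
 -10
(3 rows)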
DROP SUBSCRIPTION with no slot
I recently had to cut loose (pg_drop_replication_slot) a logical replica that couldn't keep up and so was threatening to bring down the master. In mopping up on the replica side, I couldn't just drop the subscription, because it couldn't drop the nonexistent slot on the master and so refused to work. So I had to do a silly little dance where I first disable the subscription, then ALTER SUBSCRIPTION ... SET (slot_name = NONE), then drop it. Wanting to clean up after itself is admirable, but if there is nothing to clean up, why should that be an error condition? Should this be an item on https://wiki.postgresql.org/wiki/Todo (to whatever extent that is still used). Cheers, Jeff
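For anyone hitting the same thing, the dance described above looks like this (a sketch, assuming a subscription named sub whose slot on the master is already gone):

ALTER SUBSCRIPTION sub DISABLE;
ALTER SUBSCRIPTION sub SET (slot_name = NONE);   -- only allowed while the subscription is disabled
DROP SUBSCRIPTION sub;                           -- now no longer tries to drop the remote slot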
Re: DROP SUBSCRIPTION with no slot
On Tue, Sep 24, 2019 at 5:25 PM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-09-24 16:31, Jeff Janes wrote: > > I recently had to cut loose (pg_drop_replication_slot) a logical replica > > that couldn't keep up and so was threatening to bring down the master. > > > > In mopping up on the replica side, I couldn't just drop the > > subscription, because it couldn't drop the nonexistent slot on the > > master and so refused to work. So I had to do a silly little dance > > where I first disable the subscription, then ALTER SUBSCRIPTION ... SET > > (slot_name = NONE), then drop it. > > > > Wanting to clean up after itself is admirable, but if there is nothing > > to clean up, why should that be an error condition? > > The alternatives seem quite error prone to me. Better be explicit. > If you can connect to the master, and see that the slot already fails to exist, what is error prone about that? If someone goes to the effort of setting up a different master, configures it to receive replica connections, and alters the subscription CONNECTION parameter on the replica to point to that poisoned master, will an error on the DROP SUBSCRIPTION really force them to see the error of their ways, or will they just succeed at explicitly doing the silly dance and finalize the process of shooting themselves in the foot via a roundabout mechanism? (And at the point the CONNECTION is changed, he is in the same boat even if he doesn't try to drop the sub--either way the master has a dangling slot). I'm in favor of protecting a fool from his foolishness, except when it is annoying to the rest of us (Well, annoying to me, I guess. I don't know yet about the rest of us). Cheers, Jeff
Re: DROP SUBSCRIPTION with no slot
On Tue, Sep 24, 2019 at 6:42 PM Ziga wrote: > This also seems to be a problem for somewhat fringe case of subscriptions > created with connect=false option. > They cannot be dropped in an obvious way, without knowing the ALTER > SUBSCRIPTION trick. > > For example: > > contrib_regression=# create subscription test_sub connection > 'dbname=contrib_regression' publication test_pub with ( connect=false ); > > > WARNING: tables were not subscribed, you will have to run ALTER > SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables > CREATE SUBSCRIPTION > > contrib_regression=# drop subscription test_sub; -- fails > ERROR: could not drop the replication slot "test_sub" on publisher > DETAIL: The error was: ERROR: replication slot "test_sub" does not exist > > contrib_regression=# alter subscription test_sub set ( slot_name=none ); > ALTER SUBSCRIPTION > > contrib_regression=# drop subscription test_sub; -- succeeds > DROP SUBSCRIPTION > > > Note that the publication was never refreshed. > It seems that the first DROP should succeed in the above case. > Or at least some hint should be given how to fix this. > There is no HINT in the error message itself, but there is in the documentation, see note at end of https://www.postgresql.org/docs/current/sql-dropsubscription.html. I agree with you that the DROP should just work in this case, even more so than in my case. But if we go with the argument that doing that is too error prone, then do we want to include a HINT on how to be error prone more conveniently? Cheers, Jeff
Re: [PATCH] Race condition in logical walsender causes long postgresql shutdown delay
On Wed, Sep 11, 2019 at 3:52 PM Alvaro Herrera wrote: > > Reading over this code, I noticed that the detection of the catch-up > state ends up being duplicate code, so I would rework that function as > in the attached patch. > > The naming of those flags (got_SIGUSR2, got_STOPPING) is terrible, but > I'm not going to change that in a backpatchable bug fix. > Hi Alvaro, does this count as a review? And Craig, do you agree with Alvaro's patch as a replacement for your own? Thanks, Jeff
logical replication empty transactions
After setting up logical replication of a slowly changing table using the built-in pub/sub facility, I noticed way more network traffic than made sense. Looking into it, I see that every transaction in that database on the master gets sent to the replica. 99.999+% of them are empty transactions ('B' message and 'C' message with nothing in between) because the transactions don't touch any tables in the publication, only non-replicated tables.

Is doing it this way necessary for some reason? Couldn't we hold the transmission of 'B' until something else comes along, and then if that next thing is 'C', drop both of them?

There is a comment for WalSndPrepareWrite which seems to foreshadow such a thing, but I don't really see how to use it in this case; I want to drop two messages, not one.

 * Don't do anything lasting in here, it's quite possible that nothing will be done
 * with the data.

This applies to all versions which have support for pub/sub, including the recent commits to 13dev. I've searched through the voluminous mailing list threads from when this feature was being presented to see if it was already discussed, but since every word I can think to search on occurs in virtually every message in the threads in some context or another, I didn't have much luck.

Cheers,

Jeff
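A rough way to watch this from SQL, without setting up a second server (a sketch only: it assumes wal_level=logical and an existing publication named pub, and pulls pgoutput's protocol messages back as bytea rows):

SELECT pg_create_logical_replication_slot('spy', 'pgoutput');

CREATE TABLE not_published (x int);   -- deliberately not added to the publication
INSERT INTO not_published VALUES (1);

-- Even though these transactions touched no published table, Begin ('B')
-- and Commit ('C') message pairs still come out of the slot:
SELECT substr(data, 1, 1) AS msg_type
FROM pg_logical_slot_peek_binary_changes('spy', NULL, NULL,
        'proto_version', '1', 'publication_names', 'pub');

SELECT pg_drop_replication_slot('spy');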
Re: Bloom index cost model seems to be wrong
On Sun, Feb 24, 2019 at 11:09 AM Jeff Janes wrote: > I've moved this to the hackers list, and added Teodor and Alexander of the > bloom extension, as I would like to hear their opinions on the costing. > My previous patch had accidentally included a couple lines of a different thing I was working on (memory leak, now-committed), so this patch removes that diff. I'm adding it to the commitfest targeting v13. I'm more interested in feedback on the conceptual issues rather than stylistic ones, as I would probably merge the two functions together before proposing something to actually be committed. Should we be trying to estimate the false positive rate and charging cpu_tuple_cost and cpu_operator_cost the IO costs for visiting the table to recheck and reject those? I don't think other index types do that, and I'm inclined to think the burden should be on the user not to create silly indexes that produce an overwhelming number of false positives. Cheers, Jeff bloom_cost_v3.patch Description: Binary data
Re: Index Skip Scan
On Wed, Feb 20, 2019 at 11:33 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen <jesper.peder...@redhat.com> wrote:
> >
> > Dmitry and I will look at this and take it into account for the next
> > version.
>
> In the meantime, just to not forget, I'm going to post another version with a
> fix for cursor fetch backwards, which was crashing before.

This version of the patch can return the wrong answer.

create index on pgbench_accounts (bid, aid);
begin;
declare c cursor for select distinct on (bid) bid, aid from pgbench_accounts order by bid, aid;

fetch 2 from c;
 bid |   aid
-----+---------
   1 |       1
   2 | 100,001

fetch backward 1 from c;
 bid |   aid
-----+---------
   1 | 100,000

Without the patch, instead of getting a wrong answer, I get an error:

ERROR:  cursor can only scan forward
HINT:  Declare it with SCROLL option to enable backward scan.

If I add "SCROLL", then I do get the right answer with the patch.

Cheers,

Jeff
Re: Index Skip Scan
On Thu, Jan 31, 2019 at 1:32 AM Kyotaro HORIGUCHI < horiguchi.kyot...@lab.ntt.co.jp> wrote: > Hello. > > At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthali...@gmail.com> > wrote in aa+fz3guncutf52q1sufb7ise37tjpsd...@mail.gmail.com> > > A bit of adjustment after nodes/relation -> nodes/pathnodes. > > I had a look on this. > > The name "index skip scan" is a different feature from the > feature with the name on other prodcuts, which means "index scan > with postfix key (of mainly of multi column key) that scans > ignoring the prefixing part" As Thomas suggested I'd suggest that > we call it "index hop scan". (I can accept Hopscotch, either:p) > I think that what we have proposed here is just an incomplete implementation of what other products call a skip scan, not a fundamentally different thing. They don't ignore the prefix part, they use that part in a way to cancel itself out to give the same answer, but faster. I think they would also use this skip method to get distinct values if that is what is requested. But they would go beyond that to also use it to do something similar to the plan we get with this: Set up: pgbench -i -s50 create index on pgbench_accounts (bid, aid); alter table pgbench_accounts drop constraint pgbench_accounts_pkey ; Query: explain analyze with t as (select distinct bid from pgbench_accounts ) select pgbench_accounts.* from pgbench_accounts join t using (bid) where aid=5; If we accept this patch, I hope it would be expanded in the future to give similar performance as the above query does even when the query is written in its more natural way of: explain analyze select * from pgbench_accounts where aid=5; (which currently takes 200ms, rather than the 0.9ms taken for the one benefiting from skip scan) I don't think we should give it a different name, just because our initial implementation is incomplete. Or do you think our implementation of one feature does not really get us closer to implementing the other? Cheers, Jeff
Re: Should we increase the default vacuum_cost_limit?
On Wed, Mar 6, 2019 at 2:54 PM Andrew Dunstan <andrew.duns...@2ndquadrant.com> wrote:
>
> On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> > On 3/5/19 14:14, Andrew Dunstan wrote:
> >> This patch is tiny, seems perfectly reasonable, and has plenty of
> >> support. I'm going to commit it shortly unless there are last minute
> >> objections.
> > +1
>
> done.

Now that this is done, the default value is only 5x below the hard-coded maximum of 10,000. This seems a bit odd, and not very future-proof, especially since the hard-coded maximum appears to have no logic to it anyway, at least none that is documented. Is it just mindless nannyism?

Any reason not to increase it by at least a factor of 10, but preferably to the largest value that does not cause computational problems (which I think would be INT_MAX)?

Cheers,

Jeff
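To see the two numbers being discussed side by side (a sketch; per the above, setting/boot_val should now show 2000 and max_val 10000):

SELECT name, boot_val, setting, max_val
FROM pg_settings
WHERE name = 'vacuum_cost_limit';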
Hash index initial size is too large given NULLs or partial indexes
Referring to this thread: https://dba.stackexchange.com/questions/231647/why-are-partial-postgresql-hash-indices-not-smaller-than-full-indices When a hash index is created on a populated table, it estimates the number of buckets to start out with based on the number of tuples returned by estimate_rel_size. But this number ignores both the fact that NULLs are not stored in hash indexes, and that partial indexes exist. This can lead to much too large hash indexes. Doing a re-index just repeats the logic, so doesn't fix anything. Fill factor also can't fix it, as you are not allowed to increase that beyond 100. This goes back to when the pre-sizing was implemented in 2008 (c9a1cc694abef737548a2a). It seems to be an oversight, rather than something that was considered. Is this a bug that should be fixed? Or if getting a more accurate estimate is not possible or not worthwhile, add a code comment about that? Cheers, Jeff
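A minimal way to see the over-sizing from that report (names and data volume are illustrative; per the above, the partial index comes out about as large as the full one because the initial bucket count is pre-sized from the whole table's estimated row count):

CREATE TABLE hash_demo AS
    SELECT g AS id, g % 100 AS grp
    FROM generate_series(1, 1000000) g;

CREATE INDEX hash_full ON hash_demo USING hash (id);
CREATE INDEX hash_part ON hash_demo USING hash (id) WHERE grp = 0;  -- matches ~1% of rows

SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname IN ('hash_full', 'hash_part');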
Re: GiST VACUUM
On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas wrote: > On 05/03/2019 02:26, Andrey Borodin wrote: > >> I also tried your amcheck tool with this. It did not report any > >> errors. > >> > >> Attached is also latest version of the patch itself. It is the > >> same as your latest patch v19, except for some tiny comment > >> kibitzing. I'll mark this as Ready for Committer in the commitfest > >> app, and will try to commit it in the next couple of days. > > > > That's cool! I'll work on 2nd step of these patchset to make > > blockset data structure prettier and less hacky. > > Committed the first patch. Thanks for the patch! > Thank you. This is a transformational change; it will allow GiST indexes larger than RAM to be used in some cases where they were simply not feasible to use before. On a HDD, it resulted in a 50 fold improvement in vacuum time, and the machine went from unusably unresponsive to merely sluggish during the vacuum. On a SSD (albeit a very cheap laptop one, and exposed from Windows host to Ubuntu over VM Virtual Box) it is still a 30 fold improvement, from a far faster baseline. Even on an AWS instance with a "GP2" SSD volume, which normally shows little benefit from sequential reads, I get a 3 fold speed up. I also ran this through a lot of crash-recovery testing using simulated torn-page writes using my traditional testing harness with high concurrency (AWS c4.4xlarge and a1.4xlarge using 32 concurrent update processes) and did not encounter any problems. I tested both with btree_gist on a scalar int, and on tsvector with each tsvector having 101 tokens. I did notice that the space freed up in the index by vacuum doesn't seem to get re-used very efficiently, but that is an ancestral problem independent of this change. Cheers, Jeff
Re: jsonpath
On Sat, Mar 16, 2019 at 5:36 AM Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
>
> So, pushed. Many thanks to reviewers and authors!
>

I think these files have to be cleaned up by "make maintainer-clean":

./src/backend/utils/adt/jsonpath_gram.c
./src/backend/utils/adt/jsonpath_scan.c

Cheers,

Jeff
compiler warning in pgcrypto imath.c
When compiling on an AWS 64-bit ARM machine, I get this compiler warning:

imath.c: In function 's_ksqr':
imath.c:2590:6: warning: variable 'carry' set but not used [-Wunused-but-set-variable]
   carry;
      ^

With this version():

PostgreSQL 12devel on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit

The attached patch adds PG_USED_FOR_ASSERTS_ONLY to silence it. Perhaps there is a better way, given that we want to change imath.c as little as possible from its upstream?

Cheers,

Jeff

pgcrypto_warning.patch
Description: Binary data
Re: pg_upgrade: Pass -j down to vacuumdb
On Tue, Mar 26, 2019 at 7:28 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-03-25 22:57, Tom Lane wrote: > > + fprintf(script, "echo %sYou may wish to add --jobs=N for parallel > analyzing.%s\n", > > + ECHO_QUOTE, ECHO_QUOTE); > > But then you get that information after you have already started the > script. > True, but the same goes for all the other information there, and it sleeps to let you break out of it. And I make it a habit to glance through any scripts someone suggests that I run, so would notice the embedded advice without running it at all. I don't find any information about this analyze business on the > pg_upgrade reference page. Maybe a discussion there could explain the > different paths better than making the output script extra complicated. > > Essentially: If you want a slow and gentle analyze, use the supplied > script. If you want a fast analyze, use vacuumdb, perhaps with an > appropriate --jobs option. Note that pg_upgrade --jobs and vacuumdb > --jobs are resource-bound in different ways, so the same value might not > be appropriate for both. > > To me, analyze-in-stages is not about gentleness at all. For example, it does nothing to move vacuum_cost_delay away from its default of 0. Rather, it is about slamming the bare minimum statistics in there as fast as possible, so your database doesn't keel over from horrible query plans on even simple queries as soon as you reopen. I want the database to survive long enough for the more complete statistics to be gathered. If you have quickly accumulated max_connection processes all running horrible query plans that never finish, your database might as well still be closed for all the good it does the users. And all the load generated by those is going to make the critical ANALYZE all that much slower. At first blush I thought it was obvious that you would not want to run analyze-in-stages in parallel. But after thinking about it some more and reflecting on experience doing some troublesome upgrades, I would reverse that and say it is now obvious you do want at least the first stage of analyze-in-stages, and probably the first two, to run in parallel. That is not currently an option it supports, so we can't really recommend it in the script or the docs. But we could at least adopt the more straightforward patch, suggest that if they don't want analyze-in-stages they should consider doing the big-bang analyze in parallel. Cheers, Jeff
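For context, the stages are roughly just this, run over every database; stage one is what gets minimally usable statistics in place as quickly as possible (a from-memory sketch of what vacuumdb --analyze-in-stages issues, not a verbatim copy of it):

-- stage 1: cheapest possible statistics, so the planner has *something*
SET default_statistics_target = 1;
ANALYZE;

-- stage 2: better, still fast
SET default_statistics_target = 10;
ANALYZE;

-- stage 3: full statistics at the normal target
RESET default_statistics_target;
ANALYZE;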
Re: [HACKERS] generated columns
On Sat, Mar 30, 2019 at 4:03 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-03-26 20:50, Pavel Stehule wrote: > > It is great feature and I'll mark this feature as ready for commit > > Committed, thanks. > I can't do a same-major-version pg_upgrade across this commit, as the new pg_dump is trying to dump a nonexistent attgenerated column from the old database. Is that acceptable collateral damage? I thought we usually avoided that breakage within the dev branch, but I don't know how we do it. Cheers, Jeff
pg_basebackup delays closing of stdout
Ever since pg_basebackup was created, it has had a comment like this:

 * End of chunk. If requested, and this is the base tablespace
 * write configuration file into the tarfile. When done, close the
 * file (but not stdout).

But why make the exception for output going to stdout? If we are done with it, why not close it?

After a massive maintenance operation, I want to re-seed a streaming standby, which I start to do by:

pg_basebackup -D - -Ft -P -X none | pxz > base.tar.xz

But the archiver is way behind, so when it finishes the basebackup part, I get:

NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to be archived
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)
...

The base backup file is not finalized, because pg_basebackup has not closed its stdout while waiting for the WAL segment to be archived. The file is incomplete due to data stuck in buffers, so I can't copy it to where I want it and bring up a new streaming replica (which bypasses the WAL archive, so would otherwise work).

Also, if pg_basebackup gets interrupted somehow while it is waiting for WAL archiving, the backup will be invalid, as it won't flush the last bit of data. Of course if it gets interrupted, I would have to test the backup to make sure it is valid. But testing it and finding that it is valid is better than testing it and finding that it is not.

I think it makes sense for pg_basebackup to wait for the WAL to be archived, but there is no reason for it to hold the base.tar.xz file hostage while it does so. If I simply remove the test for strcmp(basedir, "-"), as in the attached, I get the behavior I desire, and nothing bad seems to happen. Meaning, "make check -C src/bin/pg_basebackup/" still passes (but only tested on Linux).

Is there a reason not to do this?

Cheers,

Jeff

pg_basebackup_close_stdout.patch
Description: Binary data
Re: Can't we give better table bloat stats easily?
On Fri, Aug 16, 2019 at 8:39 PM Greg Stark wrote:
> Everywhere I've worked I've seen people struggle with table bloat. It's
> hard to even measure how much of it you have or where, let alone actually
> fix it.
>
> If you search online you'll find dozens of different queries estimating
> how much empty space are in your tables and indexes based on pg_stats
> statistics and suppositions about header lengths and padding and plugging
> them into formulas of varying credibility.
>

There is not much we can do to suppress bad advice that people post on their own blogs. If wiki.postgresql.org is hosting bad advice, by all means we should fix that.

> But isn't this all just silliiness these days? We could actually sum up
> the space recorded in the fsm and get a much more trustworthy number in
> milliseconds.
>

If you have bloat problems, then you probably have vacuuming problems. If you have vacuuming problems, how much can you trust fsm anyway?

Cheers,

Jeff
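For what it's worth, the FSM-summing approach Greg describes is already easy to sketch with contrib/pg_freespacemap (table name is illustrative, and the number is only as fresh as the last vacuum of that table, which is the caveat above):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

SELECT pg_size_pretty(sum(avail)) AS free_space_per_fsm
FROM pg_freespace('pgbench_accounts');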
Re: TODO list (was Re: Contributing with code)
On Tue, Jan 2, 2018 at 2:48 PM, Robert Haas wrote: > On Sun, Dec 31, 2017 at 2:02 PM, David G. Johnston > wrote: > > It probably needs three sub-sections. Fist the raw ideas put forth by > > people not capable of implementation but needing capabilities; these get > > moved to one of two sections: ideas that have gotten some attention by > core > > that have merit but don't have development interest presently; and one > like > > this that have gotten the some attention and that core doesn't feel > would be > > worth maintaining even if someone was willing to develop it. We already > > have this in practice but maybe a bit more formality would help. > > > > I'm not seeing that having it, even if incorrect, does harm. > > It causes people to waste time developing features we don't want. > We don't want them at all, or we just don't want a naive implementation of them? If we don't want them at all, then surely we should remove those items. Or move them to the bottom of the page, where there is a section just for such things. That way people can at least see that it has been considered and rejected. And for things that we do want, it is nice to have links to the emails of where it was discussed/attempted before. This is especially useful because searching the archives is very inefficient due to nearly every word you want to search on being a stop word or a very common word with many meanings. It is much easier to find it on the todo list if it is there than to search the archives. > > It also has a note at the top saying we think it's complete, but we > don't think that, or I don't think it anyway. > Yeah, I don't care for that, either. Cheers, Jeff
Re: TODO list (was Re: Contributing with code)
On Tue, Jan 2, 2018 at 6:42 PM, Joshua D. Drake wrote: > On 01/02/2018 11:17 AM, Robert Haas wrote: > >> On Sun, Dec 31, 2017 at 2:31 PM, Peter Geoghegan wrote: >> >>> On Sun, Dec 31, 2017 at 10:42 AM, Tom Lane wrote: >>> If we're not going to maintain/curate it properly, I agree it's not worth keeping it around. But I'd rather see somebody put some effort into it ... >>> If somebody was going to resolve to put some effort into maintaining >>> it to a high standard then it probably would have happened already. >>> The fact that it hasn't happened tells us plenty. >>> >> +1, and well said. >> > > O.k. what does it tell us though? Is it a resource issue? Is it a barrier > of entry issue? Lack of ownership/ruthlessness. While I can edit it to remove items that don't seem desirable (or comprehensible, or whatever) I'm not likely to do so, unless I'm the one who added it in the first place. Maybe it made more sense or was more important to someone else, like the person who added it. At one time many of the items didn't have links to the relevant email discussions (or more detailed wiki pages of their own), so those would have been good targets for purging but I think Bruce hunted down and added links for most of them. Another problem is that wikimedia doesn't have a "git blame" like feature. I've been frustrated before trying to figure out who added an item and when, so I could research it a bit more. > What does deleting it solve? What problems (and there is a very large > obvious one) are caused by deleting it? > > Right now, the TODO list is the "only" portal to "potential" things we > "might" want. If we delete it we are just creating yet another barrier of > entry to potential contribution. I think we need to consider an alternative > solution because of that. > There are various "roadmaps" floating around (wiki and elsewhere), but they aren't very prominent or easy to find. They seem to mostly be minutes from meetings, but you wouldn't know to look for them if you weren't at the meeting. Cheers, Jeff
Re: Speeding up pg_upgrade
On Thu, Dec 7, 2017 at 11:28 AM, Justin Pryzby wrote: > On Tue, Dec 05, 2017 at 09:01:35AM -0500, Bruce Momjian wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > zero-downtime upgrades. ... we discussed speeding up pg_upgrade. > > > > There are clusters that take a long time to dump the schema from the old > > cluster > > Maybe it isn't representative of a typical case, but I can offer a data > point: > > For us, we have ~40 customers with DBs ranging in size from <100GB to ~25TB > (for which ~90% is on a ZFS tablespace with compression). We have what's > traditionally considered to be an excessive number of child tables, which > works > okay since planning time is unimportant to us for the report queries which > hit > them. Some of the tables are wide (historically up to 1600 columns). > Some of > those have default values on nearly every column, and pg_attrdef was large > (>500MB), causing pg_dump --section pre-data to be slow (10+ minutes). > Since > something similar is run by pg_upgrade, I worked around the issue for now > by > dropping defaults on the historic children in advance of upgrades (at some > point I'll figure out what I have to do to allow DROPing DEFAULTs). It's > not > the first time we've seen an issue with larger number of children*columns. > This is probably worth fixing independent of other ways of speeding up pg_upgrade. It spends most of its time making the column names unique while de-parsing the DEFAULT clause for each column. But I don't think it ever outputs the column name which results from that deparsing, and since there is only one table involved, the names should already be unique anyway, unless I am missing something. The time seems to be quadratic in number of columns if all columns have defaults, or proportional to the product of number of columns in table and the number of columns with defaults. The CREATE TABLE has a similar problem upon restoring the dump. Cheers, Jeff pg_dump_default.sh Description: Bourne shell script
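A minimal way to build the shape Justin describes, from psql (the column count is illustrative and kept under the 1600-column limit; the original report involved many such child tables, so a single table may need to be multiplied to show dramatic pg_dump --section pre-data timings):

CREATE TABLE wide_defaults ();

SELECT format('ALTER TABLE wide_defaults ADD COLUMN c%s int DEFAULT 0', g)
FROM generate_series(1, 1000) AS g \gexec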
Re: [HACKERS] Removing useless DISTINCT clauses
On Mon, Nov 6, 2017 at 1:16 AM, David Rowley wrote: > In [1] we made a change to process the GROUP BY clause to remove any > group by items that are functionally dependent on some other GROUP BY > items. > > This really just checks if a table's PK columns are entirely present > in the GROUP BY clause and removes anything else belonging to that > table. > > All this seems to work well, but I totally failed to consider that the > exact same thing applies to DISTINCT too. > > Over in [2], Rui Liu mentions that the planner could do a better job > for his case. > > Using Rui Liu's example: > > CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text); > INSERT into test_tbl select generate_series(1,1000), 'test'; > > Master: > > postgres=# explain analyze verbose select distinct col, k from > test_tbl order by k limit 1000; >QUERY > PLAN > > > - > Limit (cost=1658556.19..1658563.69 rows=1000 width=9) (actual > time=8934.962..8935.495 rows=1000 loops=1) >Output: col, k >-> Unique (cost=1658556.19..1733557.50 rows=1175 width=9) > (actual time=8934.961..8935.460 rows=1000 loops=1) > Output: col, k > -> Sort (cost=1658556.19..1683556.63 rows=1175 width=9) > (actual time=8934.959..8935.149 rows=1000 loops=1) >Output: col, k >Sort Key: test_tbl.k, test_tbl.col >Sort Method: external merge Disk: 215128kB >-> Seq Scan on public.test_tbl (cost=0.00..154056.75 > rows=1175 width=9) (actual time=0.062..1901.728 rows=1000 > loops=1) > Output: col, k > Planning time: 0.092 ms > Execution time: 8958.687 ms > (12 rows) > > Patched: > > postgres=# explain analyze verbose select distinct col, k from > test_tbl order by k limit 1000; > > QUERY PLAN > > > -- > Limit (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895 > rows=1000 loops=1) >Output: col, k >-> Unique (cost=0.44..338745.50 rows=1175 width=9) (actual > time=0.029..0.814 rows=1000 loops=1) > Output: col, k > -> Index Scan using test_tbl_pkey on public.test_tbl > (cost=0.44..313745.06 rows=1175 width=9) (actual time=0.026..0.452 > rows=1000 loops=1) >Output: col, k > Planning time: 0.152 ms > Execution time: 0.985 ms > (8 rows) > > A patch to implement this is attached. > > Couldn't the Unique node be removed entirely? If k is a primary key, you can't have duplicates in need of removal. Or would that be a subject for a different patch? I think remove_functionally_dependant_groupclauses should have a more generic name, like remove_functionally_dependant_clauses. Cheers, Jeff
Re: Why standby restores some WALs many times from archive?
On Sat, Dec 30, 2017 at 4:20 AM, Michael Paquier wrote: > On Sat, Dec 30, 2017 at 04:30:07AM +0300, Sergey Burladyan wrote: > > We use this scripts: > > https://github.com/avito-tech/dba-utils/tree/master/pg_archive > > > > But I can reproduce problem with simple cp & mv: > > archive_command: > > test ! -f /var/lib/postgresql/wals/%f && \ > > test ! -f /var/lib/postgresql/wals/%f.tmp && \ > > cp %p /var/lib/postgresql/wals/%f.tmp && \ > > mv /var/lib/postgresql/wals/%f.tmp /var/lib/postgresql/wals/%f > > This is unsafe. PostgreSQL expects the WAL segment archived to be > flushed to disk once the archive command has returned its result to the > backend. Don't be surprised if you get corrupted instances or that you > are not able to recover up to a consistent point if you need to roll in > a backup. Note that the documentation of PostgreSQL provides a simple > example of archive command, which is itself bad enough not to use. > True, but that doesn't explain the current situation, as it reproduces without an OS-level crash, so a missing sync would not be relevant. (And on some systems, mv'ing a file will force it to be synced under some conditions, so it might be safe anyway.) I thought I'd seen something recently in the mailing lists or commit log about an off-by-one error which causes it to re-fetch the previous file rather than the current file, if the previous file ends with just the right type of record and amount of padding. But now I can't find it. Cheers, Jeff
Re: Possible performance regression with pg_dump of a large number of relations
On Thu, Jan 11, 2018 at 5:26 PM, Luke Cowell wrote: > I've been troubleshooting an issue with slow pg_dump times on postgres > 9.6.6. I believe something changed between 9.5.10 and 9.6.6 that has made > dumps significantly slower for databases with a large number of relations. > I posted this in irc and someone suggested that I should post this here. > I'm sorry if this isn't the right place. > > To simulate the issue I generated 150,000 relations spread across 1000 > schemas (this roughly reflects my production setup). > > ```ruby > File.write "many_relations.sql", (15 / 150).times.flat_map {|n| > [ >"create schema s_#{n};", >150.times.map do |t| > "create table s_#{n}.test_#{t} (id int);" >end >] > }.join("\n") > ``` > > I have 2 identical pieces of hardware. I've installed 9.5 on one and 9.6 > on the other. I've run the same generated piece of sql in a fresh database > on both systems. > > On my 9.5.10 system: > > time pg_dump -n s_10 testing > /dev/null > real0m5.492s > user0m1.424s > sys 0m0.184s > > On my 9.6.6 system: > > time pg_dump -n s_10 testing > /dev/null > real0m27.342s > user0m1.748s > sys 0m0.248s > I don't get quite as large a regression as you do, from 6s to 19s. It looks like there are multiple of them, but the biggest is caused by: commit 5d589993cad212f7d556d52cc1e42fe18f65b057 Author: Stephen Frost Date: Fri May 6 14:06:50 2016 -0400 pg_dump performance and other fixes That commit covered a few different things, and I don't what improvement it mentions is the one that motivated this, but the key change was to add this query: EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON(c.oid = pip.objoid AND pip.classoid = (SELECT oid FROM pg_class WHERE relname = 'pg_class') AND pip.objsubid = at.attnum)WHERE at.attrelid = c.oid AND at.attnum>0 and ((SELECT count(acl) FROM (SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner))) AS acl EXCEPT SELECT unnest(coalesce(pip.initprivs,acldefault('c',c.relowner as foo) >1 OR (SELECT count(acl) FROM (SELECT unnest(coalesce(pip.initprivs,acldefault('c',c.relowner))) AS acl EXCEPT SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner as foo) >0))AS changed_acl Considering it runs 2 subqueries for every column (including the 6 hidden system columns) of every table, even ones that don't end up getting dumped out, it is no wonder it is slow. If you were just dumping the database with 150,000 objects, I wouldn't worry about a 20 second regression. But I assume you intend to loop over every schema and dump each individually? Cheers, Jeff
Re: Possible performance regression with pg_dump of a large number of relations
On Fri, Jan 12, 2018 at 8:01 AM, Jeff Janes wrote: > That commit covered a few different things, and I don't what improvement > it mentions is the one that motivated this, but the key change was to add > this query: > > EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON(c.oid > = pip.objoid AND pip.classoid = (SELECT oid FROM pg_class WHERE relname = > 'pg_class') AND pip.objsubid = at.attnum)WHERE at.attrelid = c.oid AND > at.attnum>0 and ((SELECT count(acl) FROM (SELECT > unnest(coalesce(at.attacl,acldefault('c',c.relowner))) > AS acl EXCEPT SELECT > unnest(coalesce(pip.initprivs,acldefault('c',c.relowner > as foo) >1 OR (SELECT count(acl) FROM (SELECT > unnest(coalesce(pip.initprivs,acldefault('c',c.relowner))) > AS acl EXCEPT SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner > as foo) >0))AS changed_acl > Sorry, that query reflects some munging I did to it. The real part added to the query is: EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON(c.oid = pip.objoid AND pip.classoid = (SELECT oid FROM pg_class WHERE relname = 'pg_class') AND pip.objsubid = at.attnum)WHERE at.attrelid = c.oid AND ((SELECT array_agg(acl) FROM (SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner))) AS acl EXCEPT SELECT unnest(coalesce(pip.initprivs,acldefault('c',c.relowner as foo) IS NOT NULL OR (SELECT array_agg(acl) FROM (SELECT unnest(coalesce(pip.initprivs,acldefault('c',c.relowner))) AS acl EXCEPT SELECT unnest(coalesce(at.attacl,acldefault('c',c.relowner as foo) IS NOT NULL OR NULL IS NOT NULL OR NULL IS NOT NULL))AS changed_ac Cheers, Jeff
Logical replication wal sender timestamp bug
While monitoring pg_stat_subscription, I noticed that last_msg_send_time was usually NULL, which doesn't make sense. Why would we have a message, but not know when it was sent? Looking in src/backend/replication/walsender.c, there is this: /* output previously gathered data in a CopyData packet */ pq_putmessage_noblock('d', ctx->out->data, ctx->out->len); /* * Fill the send timestamp last, so that it is taken as late as possible. * This is somewhat ugly, but the protocol is set as it's already used for * several releases by streaming physical replication. */ resetStringInfo(&tmpbuf); now = GetCurrentTimestamp(); pq_sendint64(&tmpbuf, now); memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)], tmpbuf.data, sizeof(int64)); Filling out the timestamp after the message has already been sent is taking "as late as possible" a little too far. It results in every message having a zero timestamp, other than keep-alives which go through a different path. Re-ordering the code blocks as in the attached seems to fix it. But I have to wonder, if this has been broken from the start and no one ever noticed, is this even valuable information to relay in the first place? We could just take the column out of the view, and not bother calling GetCurrentTimestamp() for each message. Cheers, Jeff walsender_timestamp_fix.patch Description: Binary data
Re: cost based vacuum (parallel)
On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila wrote: > For parallel vacuum [1], we were discussing what is the best way to > divide the cost among parallel workers but we didn't get many inputs > apart from people who are very actively involved in patch development. > I feel that we need some more inputs before we finalize anything, so > starting a new thread. > Maybe I just don't have experience in the type of system that parallel vacuum is needed for, but if there is any meaningful IO throttling in effect, then what is the point of doing the vacuum in parallel in the first place? Cheers, Jeff
Re: Logical replication wal sender timestamp bug
On Wed, Nov 6, 2019 at 2:15 AM Michael Paquier wrote: > On Tue, Nov 05, 2019 at 01:19:37PM +0900, Michael Paquier wrote: > > On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote: > >> Filling out the timestamp after the message has already been sent is > taking > >> "as late as possible" a little too far. It results in every message > having > >> a zero timestamp, other than keep-alives which go through a different > path. > > > > It seems to me that you are right: the timestamp is computed too > > late. > > It is easy enough to reproduce the problem by setting for example > logical replication between two nodes and pgbench to produce some > load and then monitor pg_stat_subscription periodically. However, it > is a problem since logical decoding has been introduced (5a991ef) so > committed your fix down to 9.4. > Thanks. This column looks much more reasonable now. Cheers, Jeff
Re: logical replication empty transactions
On Fri, Nov 8, 2019 at 8:59 PM Euler Taveira wrote: > On Mon, Oct 21, 2019 at 21:20, Jeff Janes wrote: > > > > After setting up logical replication of a slowly changing table using > the built-in pub/sub facility, I noticed way more network traffic than made > sense. Looking into it, I see that every transaction in that database on the > master gets sent to the replica. 99.999+% of them are empty transactions > ('B' message and 'C' message with nothing in between) because the > transactions don't touch any tables in the publication, only non-replicated > tables. Is doing it this way necessary for some reason? Couldn't we hold > the transmission of 'B' until something else comes along, and then if that > next thing is 'C' drop both of them? > > > That is not optimal. Those empty transactions is a waste of bandwidth. > We can suppress them if no changes will be sent. test_decoding > implements "skip empty transaction" as you described above and I did > something similar to it. Patch is attached. > Thanks. I didn't think it would be that simple, because I thought we would need some way to fake an acknowledgement for any dropped empty transactions, to keep the LSN advancing and allow WAL to get recycled on the master. But it turns out to be the opposite. While your patch drops the network traffic by a lot, there is still a lot of traffic. Now it is keep-alives, rather than 'B' and 'C'. I don't know why I am getting a few hundred keep-alives every second when the timeouts are at their defaults, but it is better than several thousand 'B' and 'C'. My setup here was just to create, publish, and subscribe to an inactive dummy table, while having pgbench running on the master (with unpublished tables). I have not created an intentionally slow network, but I am testing it over wifi, which is inherently kind of slow. Cheers, Jeff
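A minimal sketch of that setup, in case anyone wants to reproduce it (host names and object names here are placeholders, not the exact ones I used):

```sql
-- On the publisher, which also holds the ordinary (unpublished) pgbench tables:
CREATE TABLE dummy (id int PRIMARY KEY);
CREATE PUBLICATION dummy_pub FOR TABLE dummy;

-- On the subscriber:
CREATE TABLE dummy (id int PRIMARY KEY);
CREATE SUBSCRIPTION dummy_sub
    CONNECTION 'host=master-host port=5432 dbname=postgres'
    PUBLICATION dummy_pub;

-- Then run pgbench against the publisher (its tables are not in the
-- publication) and watch the walsender traffic for the subscription:
--   pgbench -T 600 postgres
```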
Coding in WalSndWaitForWal
in src/backend/replication/walsender.c, there is the section quoted below. It looks like nothing interesting happens between the GetFlushRecPtr just before the loop starts, and the one inside the loop the first time through the loop. If we want to avoid doing CHECK_FOR_INTERRUPTS(); etc. needlessly, then we should check the result of GetFlushRecPtr and return early if it is sufficiently advanced--before entering the loop. If we don't care, then what is the point of updating it twice with no meaningful action in between? We could just get rid of the section just before the loop starts. The current coding seems confusing, and increases traffic on a potentially busy spin lock. /* Get a more recent flush pointer. */ if (!RecoveryInProgress()) RecentFlushPtr = GetFlushRecPtr(); else RecentFlushPtr = GetXLogReplayRecPtr(NULL); for (;;) { long sleeptime; /* Clear any already-pending wakeups */ ResetLatch(MyLatch); CHECK_FOR_INTERRUPTS(); /* Process any requests or signals received recently */ if (ConfigReloadPending) { ConfigReloadPending = false; ProcessConfigFile(PGC_SIGHUP); SyncRepInitConfig(); } /* Check for input from the client */ ProcessRepliesIfAny(); /* * If we're shutting down, trigger pending WAL to be written out, * otherwise we'd possibly end up waiting for WAL that never gets * written, because walwriter has shut down already. */ if (got_STOPPING) XLogBackgroundFlush(); /* Update our idea of the currently flushed position. */ if (!RecoveryInProgress()) RecentFlushPtr = GetFlushRecPtr(); else RecentFlushPtr = GetXLogReplayRecPtr(NULL); Cheers, Jeff
Re: WAL archive is lost
On Fri, Nov 22, 2019 at 8:04 AM matsumura@fujitsu.com < matsumura@fujitsu.com> wrote: > Hi all > > I find a situation that WAL archive file is lost but any WAL segment file > is not lost. > It causes for archive recovery to fail. Is this behavior a bug? > > example: > > WAL segment files > 00010001 > 00010002 > 00010003 > > Archive files > 00010001 > 00010003 > > Archive file 00010002 is lost but WAL segment files > is continuous. Recovery with archive (i.e. PITR) stops at the end of > 00010001. > Will it not archive 00010002 eventually, like at the conclusion of the next restartpoint? or does it get recycled/removed without ever being archived? Or does it just hang out forever in pg_wal? > How to reproduce: > - Set up replication (primary and standby). > - Set [archive_mode = always] in standby. > - WAL receiver exits (i.e. because primary goes down) > after receiver inserts the last record in some WAL segment file > before receiver notifies the segement file to archiver(create .ready > file). > Do you have a trick for reliably achieving this last step? Cheers, Jeff
disable only nonparallel seq scan.
Is there a way to force a meaningful parallel seq scan, or at least the planning of one, when the planner wants a non-parallel one? Usually I can do things like this with the enable_* settings, but if I `set enable_seqscan to off`, it penalizes the parallel seq scan 8 times harder than it penalizes the non-parallel one, so the plan does not switch. If I set `force_parallel_mode TO on` then I do get a parallel plan, but it is a degenerate one which tells me nothing I want to know. If I `set parallel_tuple_cost = 0` (or in some cases to a negative number), I can force it to switch, but that destroys the purpose, which is to see what the "would have been" plan estimates are for the parallel seq scan under the default setting of the cost parameters. I can creep parallel_tuple_cost downward until it switches, and then try to extrapolate back up, but this is tedious and not very reliable. Cheers, Jeff
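To spell out the attempts described above (the table and column here are hypothetical stand-ins for whatever query is being investigated):

```sql
-- Penalizes both flavors of seq scan, the parallel one harder, so the plan never flips:
SET enable_seqscan TO off;
EXPLAIN SELECT * FROM big_table WHERE some_column < 100;

-- Gives a "parallel" plan, but only a degenerate single-worker Gather over the same serial plan:
RESET enable_seqscan;
SET force_parallel_mode TO on;
EXPLAIN SELECT * FROM big_table WHERE some_column < 100;

-- Forces the switch, but changes the very cost estimates I'm trying to inspect:
RESET force_parallel_mode;
SET parallel_tuple_cost TO 0;
EXPLAIN SELECT * FROM big_table WHERE some_column < 100;
```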
Re: Index corruption / planner issue with one table in my pg 11.6 instance
On Mon, Dec 9, 2019 at 1:00 PM Jeremy Finzel wrote: > I have a table with about 7 million records. I had a query in which I > needed 2 indexes added, one for a created timestamp field another for an id > field; both very high cardinality. > > First I noticed the query would not use the timestamp index no matter what > session config settings I used. I finally created a temp table copy of the > table and verified index is used. Then I rebuilt the main table with > VACUUM FULL and this caused the index to be used. > Were they built with CONCURRENTLY? Do you have any long-open snapshots? Cheers, Jeff >
Re: Contention on LWLock buffer_content, due to SHARED lock(?)
On Mon, Dec 9, 2019 at 5:10 PM Jens-Wolfhard Schicke-Uffmann < drahf...@gmx.de> wrote: > Hi, > > today I observed (on a r5.24xlarge AWS RDS instance, i.e. 96 logical > cores) lock contention on a buffer content lock due to taking of a > SHARED lock (I think): > What version of PostgreSQL are you using? Cheers, Jeff
Re: disable only nonparallel seq scan.
On Tue, Dec 10, 2019 at 1:32 PM Robert Haas wrote: > On Sun, Dec 8, 2019 at 1:24 PM Jeff Janes wrote: > > Is there a way to force a meaningful parallel seq scan, or at least the > planning of one, when the planner wants a non-parallel one? > > > > Usually I can do things like with with enable_* setting, but if I `set > enable_seqscan to off`, it penalizes the parallel seq scan 8 times harder > than it penalizes the non-parallel one, so the plan does not switch. > > > > If I set `force_parallel_mode TO on` then I do get a parallel plan, but > it is a degenerate one which tells me nothing I want to know. > > > > If I `set parallel_tuple_cost = 0` (or in some cases to a negative > number), I can force it switch, but that destroys the purpose, which is to > see what the "would have been" plan estimates are for the parallel seq scan > under the default setting of the cost parameters. > > > > I can creep parallel_tuple_cost downward until it switches, and then try > to extrapolate back up, but this tedious and not very reliable. > > I don't think there's a way to force this, but setting both > parallel_setup_cost and parallel_tuple_cost to 0 seems to often be > enough. > Yes, that is fine if I want the actual execution results. And I patch guc.c to allow negative settings, for when some extra persuasion is needed. But here I want to see what the planner is thinking, and changing the *cost settings changes that thinking. So I want to force the planner to choose the "next-best" plan under the original cost settings so I can see how far away they are from each other. I made a crude patch to add enable_singleseqscan, which has been letting me get at this information now. I'm not proposing to apply this particular patch to the code base, but I do wonder if we can do something about this "dark spot" which no combination of current enable_* setting seems to be able to get at. Cheers, Jeff enable_singleseqscan.patch Description: Binary data
psql's \watch is broken
If I do something like this: explain (analyze) select * from pgbench_accounts \watch 1 It behaves as expected. But once I break out of the loop with ctrl-C, then if I execute the same thing again it executes the command once, but shows no output and doesn't loop. It seems like some flag is getting set with ctrl-C, but then never gets reset. It was broken in this commit: commit a4fd3aa719e8f97299dfcf1a8f79b3017e2b8d8b Author: Michael Paquier Date: Mon Dec 2 11:18:56 2019 +0900 Refactor query cancellation code into src/fe_utils/ I've not dug into the code itself; I just bisected it. Cheers, Jeff
Re: psql's \watch is broken
On Fri, Dec 13, 2019 at 9:45 PM Michael Paquier wrote: > On Sat, Dec 14, 2019 at 12:09:51AM +0100, Fabien COELHO wrote: > > > >> explain (analyze) select * from pgbench_accounts \watch 1 > >> > >> It behaves as expected. But once I break out of the loop with ctrl-C, > then > >> if I execute the same thing again it executes the command once, but > shows > >> no output and doesn't loop. It seems like some flag is getting set with > >> ctrl-C, but then never gets reset. > >> > >> > >> I've not dug into code itself, I just bisected it. > > > > Thanks for the report. I'll look into it. > > Looked at it already. And yes, I can see the difference. This comes > from the switch from cancel_pressed to CancelRequested in psql, > especially PSQLexecWatch() in this case. And actually, now that I > look at it, I think that we should simply get rid of cancel_pressed in > psql completely and replace it with CancelRequested. This also > removes the need of having cancel_pressed defined in print.c, which > was not really wanted originally. Attached is a patch which addresses > the issue for me, and cleans up the code while on it. Fabien, Jeff, > can you confirm please? > -- > Michael > This works for me. Thanks, Jeff
Re: [HACKERS] Proposal to add work_mem option to postgres_fdw module
On Fri, Sep 7, 2018 at 9:17 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 05/09/2018 18:46, Peter Eisentraut wrote: > > On 01/09/2018 06:33, Shinoda, Noriyoshi (PN Japan GCS Delivery) wrote: > >> Certainly the PQconndefaults function specifies Debug flag for the > "options" option. > >> I think that eliminating the Debug flag is the simplest solution. > >> For attached patches, GUC can be specified with the following syntax. > >> > >> CREATE SERVER remsvr1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host > 'host 1', ..., options '-c work_mem=64MB -c geqo=off'); > >> ALTER SERVER remsv11 OPTIONS (SET options '-c work_mem=64MB -c > geqo=off'); > >> > >> However, I am afraid of the effect that this patch will change the > behavior of official API PQconndefaults. > > > > It doesn't change the behavior of the API, it just changes the result of > > the API function, which is legitimate and the reason we have the API > > function in the first place. > > > > I think this patch is fine. I'll work on committing it. > > I have committed just the libpq change. The documentation change was > redundant, because the documentation already stated that all libpq > options are accepted. (Arguably, the documentation was wrong before.) > Since the effect of this commit was to make postgres_fdw actually comply with its documentation, should this have been back-patched? Is there a danger in backpatching this change to libpq to older versions? This seems like more of a bug fix than an added feature. Cheers, Jeff
Re: [PATCH] Increase the maximum value track_activity_query_size
On Tue, Dec 24, 2019 at 12:11 AM Robert Haas wrote: > On Sat, Dec 21, 2019 at 1:25 PM Tom Lane wrote: > > > What is the overhead here except the memory consumption? > > > > The time to copy those strings out of shared storage, any time > > you query pg_stat_activity. > > It seems like you're masterminding this, and I don't know why. It > seems unlikely that anyone will raise the value unless they have very > long queries, and if those people would rather pay the overhead of > copying more data than have their queries truncated, who are we to > argue? > +1 Cheers, Jeff
Re: vacuum verbose detail logs are unclear (show debug lines at *start* of each stage?)
On Fri, Dec 20, 2019 at 12:11 PM Justin Pryzby wrote: > This is a usability complaint. If one knows enough about vacuum and/or > logging, I'm sure there's no issue. > > | 11 DEBUG: "t": found 999 removable, 999 nonremovable row versions in 9 > out of 9 pages > I agree the mixture of pre-action and after-action reporting is rather confusing sometimes. I'm more concerned about what the user sees in their terminal, though, rather than the server's log file. Also, the above quoted line is confusing. It makes it sound like it found removable items, but didn't actually remove them. I think that is taking grammatical parallelism too far. How about something like: DEBUG: "t": removed 999 row versions, found 999 nonremovable row versions in 9 out of 9 pages Also, I'd appreciate a report on how many hint bits were set, and how many pages were marked all-visible and/or frozen. When I do a manual vacuum, it is more often for those purposes than it is for removing removable rows (which autovac generally does a good enough job of). Also, it is not so clear that "nonremovable rows" includes both live and recently dead rows. Although hopefully reading the next line will clarify that, for the person who has enough background knowledge. > | 12 DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: > 130886944 > | 13 There were 0 unused item identifiers. > | 14 Skipped 0 pages due to buffer pins, 0 frozen pages. > It is a bit weird that we don't report skipped all-visible pages here. It was implicitly reported in the "in 9 out of 9 pages" message, but I think it should be reported explicitly as well. Cheers, Jeff
Re: [PATCH] Increase the maximum value track_activity_query_size
On Mon, Dec 30, 2019 at 6:46 PM Andrew Dunstan < andrew.duns...@2ndquadrant.com> wrote: > On Tue, Dec 31, 2019 at 9:38 AM Tom Lane wrote: > > > > > This doesn't seem like a reason not to allow a higher limit, like a > > megabyte or so, but I'm not sure that pushing it to the moon would be > > wise. > > > > > Just to get a mental handle on the size of queries we might be > allowing before truncation, I did some very rough arithmetic on what > well known texts might fit in a megabyte. By my calculations you could > fit about four Animal Farms or one Madame Bovary in about a megabyte. > So I think that seems like more than enough :-). (My mind kinda > explores at the thought of debugging a query as long as Animal Farm.) > > I've seen some pretty big IN-lists and VALUES lists. They aren't so hard to debug once you tune out iterations 3 through N-3 of the list members. Unless they are hard to debug for other reasons. In these cases, it would be helpful, if not just allowing bigger texts in general, to instead "truncate" from the middle, preserving both the beginning and the end of the query text. Cheers, Jeff
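Roughly the effect I mean, expressed against pg_stat_activity just for illustration (the length threshold is arbitrary):

```sql
-- Keep the head and the tail of the query text, rather than only the head:
SELECT pid,
       CASE WHEN length(query) <= 1000 THEN query
            ELSE left(query, 497) || ' ... ' || right(query, 497)
       END AS query_abridged
FROM pg_stat_activity;
```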
Re: color by default
On Tue, Dec 31, 2019 at 8:35 AM Tom Lane wrote: > Peter Eisentraut writes: > > With the attached patch, I propose to enable the colored output by > > default in PG13. > > FWIW, I shall be setting NO_COLOR permanently if this gets committed. > I wonder how many people there are who actually *like* colored output? > I find it to be invariably less readable than plain B&W text. > > I find color massively useful for grep and its variants, where the hit can show up anywhere on the line. It was also kind of useful for git, especially "git grep", but on my current system git's colorizing seems hopelessly borked, so I had to turn it off. But I turned PG_COLOR on and played with many commands, and must say I don't really see much of a point. When most of these commands fail, they only generate a few lines of output, and it isn't hard to spot the error message. When pg_restore goes wrong, you get a lot of messages but colorizing them isn't really helpful. I don't need 'error' to show up in red in order to know that I have a lot of errors, especially since the lines which do report errors always have 'error' as the 2nd word on the line, where it isn't hard to spot. If it could distinguish the important errors from unimportant errors, that would be more helpful. But if it could reliably do that, why print the unimportant ones at all? It doesn't seem like this is useful enough to have it on by default, and without it being on by default there is no point in having NO_COLOR to turn it off. There is something to be said for going with the flow, but the "emerging standard" seems like it has quite a bit further to emerge before I think that would be an important reason. Cheers, Jeff
Why is pq_begintypsend so slow?
I was using COPY recently and was wondering why BINARY format is not much (if any) faster than the default format. Once I switched from mostly exporting ints to mostly exporting double precisions (7e6 rows of 100 columns, randomly generated), it was faster, but not by as much as I intuitively thought it should be. Running 'perf top' to profile a "COPY BINARY .. TO '/dev/null'" on a AWS m5.large machine running Ubuntu 18.04, with self compiled PostgreSQL: PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit I saw that the hotspot was pq_begintypsend at 20%, which was twice the percentage as the next place winner (AllocSetAlloc). If I drill down into teh function, I see something like the below. I don't really speak assembly, but usually when I see an assembly instruction being especially hot and not being the inner most instruction in a loop, I blame it on CPU cache misses. But everything being touched here should already be well cached, since initStringInfo has just got done setting it up. And if not for that, then the by the 2nd invocation of appendStringInfoCharMacro it certainly should be in the cache, yet that one is even slower than the 1st appendStringInfoCharMacro. Why is this such a bottleneck? pq_begintypsend /usr/local/pgsql/bin/postgres 0.15 |push %rbx 0.09 |mov %rdi,%rbx | initStringInfo(buf); 3.03 |callq initStringInfo | /* Reserve four bytes for the bytea length word */ | appendStringInfoCharMacro(buf, '\0'); |movslq 0x8(%rbx),%rax 1.05 |lea 0x1(%rax),%edx 0.72 |cmp 0xc(%rbx),%edx |jge b0 2.92 |mov (%rbx),%rdx |movb $0x0,(%rdx,%rax,1) 13.76 |mov 0x8(%rbx),%eax 0.81 |mov (%rbx),%rdx 0.52 |add $0x1,%eax 0.12 |mov %eax,0x8(%rbx) 2.85 |cltq 0.01 |movb $0x0,(%rdx,%rax,1) | appendStringInfoCharMacro(buf, '\0'); 10.65 |movslq 0x8(%rbx),%rax |lea 0x1(%rax),%edx 0.90 |cmp 0xc(%rbx),%edx |jge ca 0.54 | 42: mov (%rbx),%rdx 1.84 |movb $0x0,(%rdx,%rax,1) 13.88 |mov 0x8(%rbx),%eax 0.03 |mov (%rbx),%rdx |add $0x1,%eax 0.33 |mov %eax,0x8(%rbx) 2.60 |cltq 0.06 |movb $0x0,(%rdx,%rax,1) | appendStringInfoCharMacro(buf, '\0'); 3.21 |movslq 0x8(%rbx),%rax 0.23 |lea 0x1(%rax),%edx 1.74 |cmp 0xc(%rbx),%edx |jge e0 0.21 | 67: mov (%rbx),%rdx 1.18 |movb $0x0,(%rdx,%rax,1) 9.29 |mov 0x8(%rbx),%eax 0.18 |mov (%rbx),%rdx |add $0x1,%eax 0.19 |mov %eax,0x8(%rbx) 3.14 |cltq 0.12 |movb $0x0,(%rdx,%rax,1) | appendStringInfoCharMacro(buf, '\0'); 5.29 |movslq 0x8(%rbx),%rax 0.03 |lea 0x1(%rax),%edx 1.45 |cmp 0xc(%rbx),%edx |jge f6 0.41 | 8c: mov (%rbx),%rdx Cheers, Jeff
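Stepping back to the workload: something along these lines reproduces an equivalent test table and the COPY in question (a sketch only, run from psql; the exact sizes are not important, the output just needs to be dominated by float8 columns):

```sql
-- ~7e6 rows of 100 random double precision columns (adjust sizes to taste):
SELECT format('CREATE TABLE wide_doubles AS SELECT %s FROM generate_series(1, 7000000)',
              string_agg('random() AS c' || i, ', '))
FROM generate_series(1, 100) AS i \gexec

-- Profile the backend (e.g. with perf top) while running:
COPY wide_doubles TO '/dev/null' (FORMAT binary);
```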
Re: pg_upgrade: Pass -j down to vacuumdb
On Fri, Mar 29, 2019 at 5:58 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-03-28 02:43, Jeff Janes wrote: > > At first blush I thought it was obvious that you would not want to run > > analyze-in-stages in parallel. But after thinking about it some more > > and reflecting on experience doing some troublesome upgrades, I would > > reverse that and say it is now obvious you do want at least the first > > stage of analyze-in-stages, and probably the first two, to run in > > parallel. That is not currently an option it supports, so we can't > > really recommend it in the script or the docs. > > So do you think we should copy down the -j option from pg_upgrade, or > make some other arrangement? > For 12, I think we should not pass it down so that it runs automatically, and just go with a doc-only patch instead. Where the text written into the analyze_new_cluster.sh recommending to hit Ctrl-C and do something else is counted as documentation. I agree with you that the value of -j that was used for pg_ugrade might not be suitable for vacuumdb, but anyone who tests things thoroughly enough to know exactly what value to use for each is not going to be the target audience of analyze_new_cluster.sh anyway. So I think the -j used in pg_upgrade can just be written directly into the suggestion, and that would be good enough for the intended use. Ideally I think the analyze-in-stages should have the ability to parallelize the first stage and not the last one, but that is not v12 material. Even more ideally it should only have two stages, not three. One stage runs to generate minimally-tolerable stats before the database is opened to other users, and one runs after it is open (but hopefully before the big end-of-month reports get run, or whatever the big statistics-sensitive queries are on your system). But we don't really have a concept of "open to other users" in the first place, and doing it yourself by juggling pg_hba files is annoying and error prone. So maybe the first stage could be run by pg_upgrade itself, while the new server is still running on a linux socket in a private directory. The defense of the current three-stage method for analyze-in-stages is that very few people are likely to know just what the level of "minimally-tolerable stats" for their system actually are, because upgrades are rare enough not to learn by experience, and realistic load-generators are rare enough not to learn from test/QA environments. If the default_statistics_target=1 stats are enough, then you got them quickly. If they aren't, at least you didn't waste too much time collecting them before moving on to the second-stage default_statistics_target=10 and then the final stage. So the three stages are amortizing over our ignorance. Cheers, Jeff
Re: Should the docs have a warning about pg_stat_reset()?
On Wed, Apr 10, 2019 at 2:52 PM Bruce Momjian wrote: > > OK, let me step back. Why are people resetting the statistics > regularly? Based on that purpose, does it make sense to clear the > stats that effect autovacuum? > When I've done it (not regularly, thankfully), it was usually because I failed to type "pg_stat_reset_shared('bgwriter')" or "pg_stat_statements_reset()" correctly. I've also been tempted to do it because storing snapshots with a cron job or something requires effort and planning ahead to set up the tables and cron and some way to limit the retention, and then running LAG window functions over the snapshots requires a re-study of the documentation, because they are a bit esoteric and I don't use them enough to commit the syntax to memory. I don't want to see pg_statio_user_indexes often enough to make elaborate arrangements ahead of time (especially since the track_io_timing columns are missing from it); but when I do want them, I want them. And when I do, I don't want them to be "since the beginning of time". When I'm thinking about pg_statio_user_indexes, I am probably not thinking about autovac, since they have about nothing in common with each other. (Other than pg_stat_reset operating on both.) Cheers, Jeff
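For what it's worth, the sort of scaffolding I mean is not much code, it is just annoying to set up and re-learn each time (a sketch; the snapshot table name and the index picked are made up):

```sql
-- Run periodically, e.g. from cron via psql:
CREATE TABLE IF NOT EXISTS statio_snap AS
    SELECT now() AS snap_time, * FROM pg_statio_user_indexes WITH NO DATA;
INSERT INTO statio_snap SELECT now(), * FROM pg_statio_user_indexes;

-- Later, deltas between consecutive snapshots for one index:
SELECT snap_time,
       idx_blks_read - lag(idx_blks_read) OVER w AS blks_read_delta,
       idx_blks_hit  - lag(idx_blks_hit)  OVER w AS blks_hit_delta
FROM statio_snap
WHERE indexrelname = 'my_index'
WINDOW w AS (ORDER BY snap_time);
```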
Re: TRACE_SORT defined by default
On Wed, Apr 24, 2019 at 6:04 PM Tom Lane wrote: > Peter Geoghegan writes: > > > In > > any case the current status quo is that it's built by default. I have > > used it in production, though not very often. It's easy to turn it on > > and off. > > Would any non-wizard really have a use for it? > I've had people use it to get some insight into the operation and memory usage of Aggregate nodes, since those nodes offer nothing useful via EXPLAIN ANALYZE. It would be a shame to lose that ability on package-installed PostgreSQL unless we fix Aggregate node reporting first. Cheers, Jeff
pg_upgrade --clone error checking
With the new pg_upgrade --clone, if we are going to end up throwing the error "file cloning not supported on this platform" (which seems to depend only on ifdefs) I think we should throw it first thing, before any other checks are done and certainly before pg_dump gets run. This might result in some small amount of code duplication, but I think it would be worth the cost. For cases where we might throw "could not clone file between old and new data directories", I wonder if we shouldn't do some kind of dummy copy to catch that error earlier, as well. Maybe that one is not worth it. Cheers, Jeff
Re: pg_upgrade --clone error checking
On Thu, May 2, 2019 at 11:57 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-05-01 22:10, Jeff Janes wrote: > > With the new pg_upgrade --clone, if we are going to end up throwing the > > error "file cloning not supported on this platform" (which seems to > > depend only on ifdefs) I think we should throw it first thing, before > > any other checks are done and certainly before pg_dump gets run. > > Could you explain in more detail what command you are running, what > messages you are getting, and what you would like to see instead? > I'm running: pg_upgrade --clone -b /home/jjanes/pgsql/REL9_6_12/bin/ -B /home/jjanes/pgsql/origin_jit/bin/ -d /home/jjanes/pgsql/data_96/ -D /home/jjanes/pgsql/data_clone/ And I get: Performing Consistency Checks - Checking cluster versions ok Checking database user is the install user ok Checking database connection settings ok Checking for prepared transactions ok Checking for reg* data types in user tables ok Checking for contrib/isn with bigint-passing mismatch ok Checking for tables WITH OIDS ok Checking for invalid "unknown" user columns ok Creating dump of global objects ok Creating dump of database schemas ok Checking for presence of required libraries ok file cloning not supported on this platform Failure, exiting I think the error message wording is OK, I think it should be thrown earlier, before the "Creating dump of database schemas" (which can take a long time), and preferably before either database is even started. So ideally it would be something like: Performing Consistency Checks - Checking cluster versions Checking file cloning support File cloning not supported on this platform Failure, exiting When something is doomed to fail, we should report the failure as early as feasibly detectable. Cheers, Jeff
Re: pg_upgrade --clone error checking
On Thu, May 2, 2019 at 12:28 PM Alvaro Herrera wrote: > On 2019-May-02, Jeff Janes wrote: > > > > > When something is doomed to fail, we should report the failure as early > as > > feasibly detectable. > > I agree -- this check should be done before checking the database > contents. Maybe even before "Checking cluster versions". > It looks like it was designed for early checking, it just wasn't placed early enough. So changing it is pretty easy, as check_file_clone does not need to be invented, and there is no additional code duplication over what was already there. This patch moves the checking to near the beginning. It carries the --link mode checking along with it. That should be done as well, and doing it as a separate patch would make both patches uglier. Cheers, Jeff pg_upgrade_filesystem.patch Description: Binary data
Re: pg_upgrade --clone error checking
On Fri, May 3, 2019 at 3:53 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2019-05-02 20:03, Jeff Janes wrote: > > It looks like it was designed for early checking, it just wasn't placed > > early enough. So changing it is pretty easy, as check_file_clone does > > not need to be invented, and there is no additional code duplication > > over what was already there. > > > > This patch moves the checking to near the beginning. > > I think the reason it was ordered that way is that it wants to do all > the checks of the old cluster before doing any checks touching the new > cluster. > But is there a reason to want to do that? I understand we don't want to keep starting and stopping the clusters needlessly, so we should do everything we can in one before moving to the other. But for checks that don't need a running cluster, why would it matter? The existence and contents of PG_VERSION in the new cluster directory are already checked at the very beginning (and pg_upgrade even tries to start it up and shut it down again if a pid file also exists), so there is precedent for touching the new cluster directory at the filesystem level early (albeit in a read-only manner), and if a pid file exists, for doing even more than that. I didn't move check_file_clone to before the liveness check is done, out of an abundance of caution. But creating a transient file with a name of no significance ("PG_VERSION.clonetest") in a cluster that is not even running seems like a very low risk thing to do. The payoff is that we get an inevitable error message much sooner. Cheers, Jeff
make maintainer-clean and config.cache
In a side note in another thread, Tom pointed out the speed improvements of using an autoconf cache when re-building, which sounded nice to me, as configure takes an annoyingly long time and is not parallelized. But the config.cache file gets deleted by make maintainer-clean. Doesn't that mostly defeat the purpose of having a cache? Am I doing something wrong here, or just thinking about it wrong? time ./configure --config-cache > /dev/null real 0m21.538s time ./configure --config-cache > /dev/null real 0m3.425s make maintainer-clean > /dev/null ; ## presumably git checkout a new commit here time ./configure --config-cache > /dev/null real 0m21.260s Cheers, Jeff
Re: compiler warning in pgcrypto imath.c
On Sat, May 4, 2019 at 3:15 AM Noah Misch wrote: > > I pushed Jeff's patch. > Thank you. I've re-tested it and I get warning-free compilation now. Cheers, Jeff
Re: Usage of epoch in txid_current
On Thu, Mar 28, 2019 at 1:30 AM Thomas Munro wrote: > On Thu, Mar 28, 2019 at 1:48 AM Heikki Linnakangas > wrote: > > Once we have the FullTransactionId type and basic macros in place, I'm > > sure we could tidy up a bunch of code by using them. > Thanks for the reviews! Pushed. I think that this might be broken. We have this change: @@ -73,7 +75,8 @@ GetNewTransactionId(bool isSubXact) LWLockAcquire(XidGenLock, LW_EXCLUSIVE); - xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid); + full_xid = ShmemVariableCache->nextFullXid; + xid = XidFromFullTransactionId(full_xid); But then later on in a little-used code path around line 164: /* Re-acquire lock and start over */ LWLockAcquire(XidGenLock, LW_EXCLUSIVE); xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid); full_xid does not get updated, but then later on full_xid gets returned in lieu of xid. Is there a reason that this is OK? Cheers, Jeff
Re: Usage of epoch in txid_current
On Sat, May 4, 2019 at 1:34 PM Jeff Janes wrote: > On Thu, Mar 28, 2019 at 1:30 AM Thomas Munro > wrote: > >> On Thu, Mar 28, 2019 at 1:48 AM Heikki Linnakangas >> wrote: >> > Once we have the FullTransactionId type and basic macros in place, I'm >> > sure we could tidy up a bunch of code by using them. > > > Thanks for the reviews! Pushed. >> > > I think that this might be broken. > > We have this change: > > @@ -73,7 +75,8 @@ GetNewTransactionId(bool isSubXact) > > LWLockAcquire(XidGenLock, LW_EXCLUSIVE); > > - xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid); > + full_xid = ShmemVariableCache->nextFullXid; > + xid = XidFromFullTransactionId(full_xid); > > But then later on in an little-used code path around line 164: > > /* Re-acquire lock and start over */ > LWLockAcquire(XidGenLock, LW_EXCLUSIVE); > xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid); > > full_xid does not get updated, but then later on full_xid gets returned in > lieu of xid. > > Is there a reason that this is OK? > > Ah, sorry for the noise. I see that this has been fixed already. I wasn't looking at HEAD, or at the other email thread, when I "discovered" this. Sorry for the noise. Cheers, Jeff
improve transparency of bitmap-only heap scans
When bitmap-only heap scans were introduced in v11 (7c70996ebf0949b142a99) no changes were made to "EXPLAIN". This makes the feature rather opaque. You can sometimes figure out what is going by the output of EXPLAIN (ANALYZE, BUFFERS), but that is unintuitive and fragile. Looking at the discussion where the feature was added, I think changing the EXPLAIN just wasn't considered. The attached patch adds "avoided" to "exact" and "lossy" as a category under "Heap Blocks". Also attached is the example output, as the below will probably wrap to the point of illegibility: explain analyze select count(*) from foo where a=35 and d between 67 and 70; QUERY PLAN - Aggregate (cost=21451.36..21451.37 rows=1 width=8) (actual time=103.955..103.955 rows=1 loops=1) -> Bitmap Heap Scan on foo (cost=9920.73..21442.44 rows=3570 width=0) (actual time=100.239..103.204 rows=3950 loops=1) Recheck Cond: ((a = 35) AND (d >= 67) AND (d <= 70)) Heap Blocks: avoided=3718 exact=73 -> BitmapAnd (cost=9920.73..9920.73 rows=3570 width=0) (actual time=98.666..98.666 rows=0 loops=1) -> Bitmap Index Scan on foo_a_c_idx (cost=0.00..1682.93 rows=91000 width=0) (actual time=28.541..28.541 rows=99776 loops=1) Index Cond: (a = 35) -> Bitmap Index Scan on foo_d_idx (cost=0.00..8235.76 rows=392333 width=0) (actual time=66.946..66.946 rows=399003 loops=1) Index Cond: ((d >= 67) AND (d <= 70)) Planning Time: 0.458 ms Execution Time: 104.487 ms I think the name of the node should also be changed to "Bitmap Only Heap Scan", but I didn't implement that as adding another NodeTag looks like a lot of tedious error prone work to do before getting feedback on whether the change is desirable in the first place, or the correct approach. Cheers, Jeff create table foo as select floor(random()*100)::int as a, floor(random()*100)::int as b, floor(random()*100)::int as c, floor(random()*100)::int as d, floor(random()*100)::int as e from generate_series(1,1000); vacuum ANALYZE ; create index on foo (a,c); create index on foo (d); update foo set d=d+1 where a=35 and c = 70; explain analyze select count(*) from foo where a=35 and d between 67 and 70; QUERY PLAN - Aggregate (cost=21451.36..21451.37 rows=1 width=8) (actual time=103.955..103.955 rows=1 loops=1) -> Bitmap Heap Scan on foo (cost=9920.73..21442.44 rows=3570 width=0) (actual time=100.239..103.204 rows=3950 loops=1) Recheck Cond: ((a = 35) AND (d >= 67) AND (d <= 70)) Heap Blocks: avoided=3718 exact=73 -> BitmapAnd (cost=9920.73..9920.73 rows=3570 width=0) (actual time=98.666..98.666 rows=0 loops=1) -> Bitmap Index Scan on foo_a_c_idx (cost=0.00..1682.93 rows=91000 width=0) (actual time=28.541..28.541 rows=99776 loops=1) Index Cond: (a = 35) -> Bitmap Index Scan on foo_d_idx (cost=0.00..8235.76 rows=392333 width=0) (actual time=66.946..66.946 rows=399003 loops=1) Index Cond: ((d >= 67) AND (d <= 70)) Planning Time: 0.458 ms Execution Time: 104.487 ms bitmap_only_avoided_v1.patch Description: Binary data
crash testing suggestions for 12 beta 1
Now that beta is out, I wanted to do some crash-recovery testing where I inject PANIC-inducing faults and see if it recovers correctly. A long-lived Perl process keeps track of what it should find after the crash, and verifies that it finds it. You will probably be familiar with the general theme from examples like the threads below. Would anyone like to nominate some areas to focus on? I think the pluggable storage refactoring work will get inherently tested, so I'm not planning on designing tests specifically for that (unless there is a non-core plugin I should test with). Making the ctid a tie-breaker in btree indexes is also tested inherently (plus I think Peter tested that pretty thoroughly himself with similar methods). I've already tested declarative partitioning where the tuples do a lot of migrating, and tested prepared transactions. Any other suggestions for changes that might be risky and should be specifically targeted for testing? https://www.postgresql.org/message-id/CAMkU=1xeuubphdwdmb1wjn4+td4kpnenifatbxnk1xzhcw8...@mail.gmail.com https://www.postgresql.org/message-id/CAMkU=1xBP8cqdS5eK8APHL=x6rhmmm2vg5g+qamduutsycw...@mail.gmail.com Cheers, Jeff
Re: WAL archive (archive_mode = always) ?
On Fri, Oct 19, 2018 at 10:00 AM Adelino Silva < adelino.j.si...@googlemail.com> wrote: > Hi, > > What is the advantage to use archive_mode = always in a slave server > compared to archive_mode = on (shared WAL archive) ? > I only see duplication of Wal files, what is the purpose of this feature ? > You might not have a shared wal archive in the first place. For example, your only access to the master is through the streaming replication protocol, but you want to maintain a local WAL archive anyway so you can PITR locally for testing or debugging purposes. Cheers, Jeff
Re: WAL archive (archive_mode = always) ?
On Mon, Oct 22, 2018 at 5:06 AM Adelino Silva < adelino.j.si...@googlemail.com> wrote: > Hello Takayuki, > > Sorry can you explain how we can same network bandwidth by not sending the > WAL archive from the primary to the standby(s). > I possible scenario is have to multiple standby servers in same host for > same master. or other scenarios exists ? > Before archive_mode = always became available, we've had to stream the WAL twice, once to the hot standby for immediate application, and once to pg_receivexlog for archival purposes. So it doubled the bandwidth usage. Cheers, Jeff
Re: Estimating number of distinct values.
On Wed, Oct 24, 2018 at 10:07 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > Real number of distinct value for this dataset is about 10 millions. For > some reasons, sampling using random blocks and Vitter algorithm produces > worser results than just examining first 3 rows of the table: It is a known problem with our sampling method that once a block is chosen, it then oversamples rows from that block[1]. So you get too many blocks with 0 rows sampled or 2 or more rows sampled, and too few with exactly one row sampled. If rows with the same value are clustered together into the same blocks, this will find too many duplicates and really skew the Duj1 estimate, because we are feeding it a biased sample. Tomas was working on a patch to make the sampling truly random[2], but I think he abandoned it to work on the multivariate statistics instead. It is hard to tell if the IO implications of no longer reading sampled blocks in physical order would be acceptable, because everyone has different hardware, data, and ideas of what is acceptable. [1] https://www.postgresql.org/message-id/CAMkU=1wrh_jopycayukbdqy4dwhsx1-1e2s0vvgfrryfxde...@mail.gmail.com [2] https://www.postgresql.org/message-id/5588D644.1000909%402ndquadrant.com Cheers, Jeff
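For context, the estimator being fed here is (if I recall the comments in analyze.c correctly) the Duj1 estimator of Haas and Stokes, roughly:

```latex
\hat{D} \;=\; \frac{n \, d}{\, n - f_1 + f_1 \, n / N \,}
```

where n is the number of sampled rows, N the estimated total number of rows, d the number of distinct values in the sample, and f1 the number of values seen exactly once in the sample. A block-clustered sample depresses f1, which pushes the denominator toward n and the estimate down toward d itself, far below the true number of distinct values.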
Re: Buildfarm failures for hash indexes: buffer leaks
On Tue, Oct 23, 2018 at 10:51 AM Andres Freund wrote: > On 2018-10-23 13:54:31 +0200, Fabien COELHO wrote: > > > > Hello Tom & Amit, > > > > > > > Both animals use gcc experimental versions, which may rather > underline a > > > > > new bug in gcc head rather than an existing issue in pg. Or not. > > > > > > > It is possible, but what could be the possible theory? > > > > > > It seems like the two feasible theories are (1) gcc bug, or (2) buffer > > > leak that only occurs in very narrow circumstances, perhaps from a race > > > condition. Given that the hash index code hasn't changed meaningfully > > > in several months, I thought (1) seemed more probable. > > > > Yep, that is my thought as well. > > > FWIW, my animal 'serinus', which runs debian's gcc-snapshot shows the same > problem: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2018-10-22%2006%3A34%3A02 > > So it seems much more likely to be 1). > > > > The problem is that this kind of issue is not simple to wrap-up as a gcc > bug > > report, unlike other earlier instances that I forwarded to clang & gcc > dev > > teams. > > > > I'm in favor in waiting before trying to report it, to check whether the > > probable underlying gcc problem is detected, reported by someone else, > and > > fixed in gcc head. If it persists, then we'll see. > > I suspect the easiest thing to narrow it down would be to bisect the > problem in gcc :( > Their commit r265241 is what broke the PostgreSQL build. It also broke the compiler itself--at that commit it was no longer possible to build itself. I had to --disable-bootstrap in order to get a r265241 compiler to test PostgreSQL on. Their commit r265375 fixed the ability to compile itself, but built PostgreSQL binaries remain broken there and thereafter. I ran this on a AWS c5d.4xlarge ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912 (ami-0f65671a86f061fcd) Configuring PostgreSQL with ./configure --enable-cassert is necessary in order for the subsequent "make check" to experience the failure. I'm using PostgreSQL v12dev commit 5953c99697621174f, but the problem is not sensitive to that and reproduces back to v10.0. I haven't the slightest idea what to do with this information, and GCC's cryptic SVN commit messages don't offer much insight to me. Cheers, Jeff
Don't wake up to check trigger file if none is configured
A streaming replica waiting on WAL from the master will wake up every 5 seconds to check for a trigger file. This is pointless if no trigger file has been configured. The attached patch suppresses the timeout when there is no trigger file configured. A minor thing to be sure, but there was a campaign a couple years ago to remove such spurious wake-ups, so maybe this change is worthwhile. I noticed that the existing codebase does not have a consensus on what to pass to WaitLatch for the timeout when the timeout isn't relevant. I picked 0, but -1L also has precedent. Cheers, Jeff WAL_sleep_not_triggered.patch Description: Binary data
Re: Don't wake up to check trigger file if none is configured
On Sat, Nov 24, 2018 at 11:29 AM Jeff Janes wrote: > A streaming replica waiting on WAL from the master will wake up every 5 > seconds to check for a trigger file. This is pointless if no trigger file > has been configured. > > The attached patch suppresses the timeout when there is no trigger file > configured. > Rebased over the removal of recovery.conf and renaming of TriggerFile. If the promote_trigger_file param is someday made responsive to SIGHUP, I think this will just continue to do the right thing without further modification. Cheers, Jeff WAL_sleep_not_triggered_v2.patch Description: Binary data
Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.
On Wed, Jan 17, 2018 at 4:49 PM, David Gould wrote: > > Please add the attached patch and this discussion to the open commit fest. > The > original bugs thread is here: 2018011254.1408.8342@wrigleys.postgresql.org. > > Bug reference: 15005 > Logged by: David Gould > Email address: da...@sonic.net > PostgreSQL version: 10.1 and earlier > Operating system: Linux > Description: > > ANALYZE can make pg_class.reltuples wildly inaccurate compared to the > actual > row counts for tables that are larger than the default_statistics_target. > > Example from one of a clients production instances: > > # analyze verbose pg_attribute; > INFO: analyzing "pg_catalog.pg_attribute" > INFO: "pg_attribute": scanned 3 of 24519424 pages, containing 6475 > live rows and 83 dead rows; 6475 rows in sample, 800983035 estimated total > rows. > > This is a large complex database - pg_attribute actually has about five > million rows and needs about one hundred thousand pages. However it has > become extremely bloated and is taking 25 million pages (192GB!), about 250 > times too much. This happened despite aggressive autovacuum settings and a > periodic bloat monitoring script. Since pg_class.reltuples was 800 million, > the bloat monitoring script did not detect that this table was bloated and > autovacuum did not think it needed vacuuming. > I can see how this issue would prevent ANALYZE from fixing the problem, but I don't see how it could have caused the problem in the first place. In your demonstration case, you had to turn off autovac in order to get it to happen, and then when autovac is turned back on, it is all primed for an autovac to launch, go through, touch almost all of the pages, and fix it for you. How did your original table get into a state where this wouldn't happen? Maybe a well-timed crash caused n_dead_tup to get reset to zero and that is why autovac is not kicking in? What are the pg_stat_user_tables numbers and the state of the visibility map for your massively bloated table, if you still have them? In any event, I agree with your analysis that ANALYZE should set the number of tuples from scratch. After all, it sets the other estimates, such as MCV, from scratch, and those are much more fragile to sampling than just the raw number of tuples are. But if the default target is set to 1, that would scan only 300 pages. I think that is a little low of a sample size to base an estimate on, but it isn't clear to me that using 300 pages plus whacking them around with exponential averaging is really going to be much better. And if you set your default target to 1, that is more-or-less what you signed up for. It is a little weird to have VACUUM incrementally update and then ANALYZE compute from scratch and discard the previous value, but no weirder than what we currently do, which is to have ANALYZE incrementally update despite the fact that it is specifically designed to representatively sample the entire table. So I don't think we need to decide what to do about VACUUM before we can do something about ANALYZE. So I support your patch. There is probably more investigation and work that could be done in this area, but those could be different patches, not blocking this one. Cheers, Jeff
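To be concrete about the pg_stat / visibility map question above, something like the following is what I had in mind (assuming the pg_visibility extension is available, and substituting the actual bloated table):

```sql
-- The counters autovacuum is driven by:
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_autoanalyze
FROM pg_stat_all_tables
WHERE relname = 'pg_attribute';

-- What the system currently believes about the table's size:
SELECT relpages, reltuples
FROM pg_class
WHERE oid = 'pg_catalog.pg_attribute'::regclass;

-- Visibility map state:
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT * FROM pg_visibility_map_summary('pg_catalog.pg_attribute'::regclass);
```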
Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.
On Sun, Mar 4, 2018 at 3:18 PM, David Gould wrote: > On Sun, 4 Mar 2018 07:49:46 -0800 > Jeff Janes wrote: > > > On Wed, Jan 17, 2018 at 4:49 PM, David Gould wrote: > ... > > > > Maybe a well-timed crash caused n_dead_tup to get reset to zero and that > is > > why autovac is not kicking in? What are the pg_stat_user_table number > and > > the state of the visibility map for your massively bloated table, if you > > still have them? > > ... > > The main pain points are that when reltuples gets inflated there is no way > to fix it, auto vacuum stops looking at the table and hand run ANALYZE > can't > reset the reltuples. The only cure is VACUUM FULL, but that is not really > practical without unacceptable amounts of downtime. > But why won't an ordinary manual VACUUM (not FULL) fix it? That seems like a critical thing to figure out. As for preventing it in the first place, based on your description of your hardware and operations, I was going to say you need to increase the max number of autovac workers, but then I remembered you from "Autovacuum slows down with large numbers of tables. More workers makes it slower" ( https://www.postgresql.org/message-id/20151030133252.3033.4249%40wrigleys.postgresql.org). So you are probably still suffering from that? Your patch from then seemed to be pretty invasive and so controversial. I had a trivial but fairly effective patch at the time, but it is now less trivial because of how shared catalogs are dealt with (commit 15739393e4c3b64b9038d75) and I haven't rebased it over that issue. Cheers, Jeff
WARNING in parallel index creation.
If I run: pgbench -i -s30 And then create the function: CREATE OR REPLACE FUNCTION foobar(text) RETURNS text LANGUAGE plperl IMMUTABLE PARALLEL SAFE STRICT COST 1 AS $function$ return scalar reverse($_[0]); $function$; Then when I create an index, I get a warning: jjanes=# create index on pgbench_accounts (foobar(filler)); WARNING: cannot set parameters during a parallel operation WARNING: cannot set parameters during a parallel operation If I create the index again within the same session, there is no WARNING. This only occurs if plperl.on_init is set in the postgresql.conf file. It doesn't seem to matter what it is set to, even the empty string triggers the warning. plperl.on_init='' As far as I can tell the index is created correctly despite the warning. Cheers, Jeff
neqjoinsel versus "refresh materialized view concurrently"
The following commit has caused a devastating performance regression in concurrent refresh of MV: commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97 Author: Tom Lane Date: Wed Nov 29 22:00:29 2017 -0500 Fix neqjoinsel's behavior for semi/anti join cases. The below reproduction goes from taking about 1 second to refresh, to taking an amount of time I don't have the patience to measure. drop table foobar2 cascade; create table foobar2 as select * from generate_series(1,20); create materialized view foobar3 as select * from foobar2; create unique index on foobar3 (generate_series ); analyze foobar3; refresh materialized view CONCURRENTLY foobar3 ; When I interrupt the refresh, I get a message including this line: CONTEXT: SQL statement "SELECT newdata FROM pg_temp_3.pg_temp_16420 newdata WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16420 newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1" So it makes sense that the commit in question could have caused a change in the execution plan. Because these are temp tables, I can't easily get my hands on them to investigate further. Cheers, Jeff
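One way to poke at the plan without access to the transient pg_temp table is sketched below (not from the thread): substitute an ordinary table with the same contents ("mvdata" is just a stand-in name) and run the refresh's duplicate-check query against it with EXPLAIN. Whether it reproduces exactly the same plan as the real temp table is not guaranteed:

-- Stand-in for the pg_temp_NNN table built by the concurrent refresh.
create table mvdata as select * from foobar3;
analyze mvdata;

explain
SELECT newdata FROM mvdata newdata
WHERE newdata IS NOT NULL AND EXISTS
  (SELECT * FROM mvdata newdata2
   WHERE newdata2 IS NOT NULL
     AND newdata2 OPERATOR(pg_catalog.*=) newdata
     AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid)
LIMIT 1;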
Re: neqjoinsel versus "refresh materialized view concurrently"
On Tue, Mar 13, 2018 at 4:57 PM, Thomas Munro wrote: > On Wed, Mar 14, 2018 at 12:29 PM, Tom Lane wrote: > > Thomas Munro writes: > >> There is a fundamental and complicated estimation problem lurking here > >> of course and I'm not sure what to think about that yet. Maybe there > >> is a very simple fix for this particular problem: > > > > Ah, I see you thought of the same hack I did. > > > > I think this may actually be a good fix, and here's the reason: this plan > > is in fact being driven entirely off planner default estimates, because > > we don't have any estimation code that knows what to do with > > "wholerowvar *= wholerowvar". I'm suspicious that we could drop the > > preceding ANALYZE as being a waste of cycles, except maybe it's finding > > out the number of rows for us. In any case, LIMIT 1 is only a good idea > > to the extent that the planner knows what it's doing, and this is an > > example where it demonstrably doesn't and won't any time soon. > > Hmm. I wonder if the ANALYZE might have been needed to avoid the > nested loop plan at some point in history. > > Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test > scenario and some smaller and larger examples I tried. The query is > already executed with SPI_execute(..., 1) so it'll give up after one > row anyway. The regression test includes a case that causes a row to > be produced here and that's passing ('ERROR: new data for > materialized view "mvtest_mv" contains duplicate rows without any null > columns'). > Is there any good way to make the regression tests fail if the plan reverts to the bad one? The only thing I can think of would be to make the table bigger so the regression test becomes "noticeably slower", but that is pretty vague, and it is not user friendly to have the test formally pass while just hoping it is slow enough for someone to investigate. Cheers, Jeff
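For what it's worth, the usual way regression tests pin a plan shape is EXPLAIN (COSTS OFF), whose output goes into the expected file so a planner change shows up as a diff rather than as a slowdown. This is only a generic sketch, not part of the committed fix, and since the query here is built internally by REFRESH it could only be asserted on a hand-written equivalent:

-- Generic sketch of a plan-shape assertion in a regression test.
EXPLAIN (COSTS OFF)
SELECT m.* FROM foobar3 m
WHERE EXISTS (SELECT 1 FROM foobar3 m2
              WHERE m2.generate_series = m.generate_series
                AND m2.ctid <> m.ctid);
-- The expected output records the node types, so reverting to the bad plan
-- would make the test fail visibly.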
Re: Ambigous Plan - Larger Table on Hash Side
On Tue, Mar 13, 2018 at 4:02 AM, Narendra Pradeep U U < narendra.prad...@zohocorp.com> wrote: > Hi, > Thanks everyone for your suggestions. I would like to add explain > analyze of both the plans so that we can have broader picture. > > I have a work_mem of 1000 MB. > Is it possible to repeat with 2000MB or 3000MB? It would be interesting to see what the estimated cost and the actual time would be if there were only 1 batch rather than 2. Also, can you repeat all of these with EXPLAIN (ANALYZE, TIMING OFF)? Sometimes the act of measuring the times can distort the times by quite a bit. (It will still give an overall execution time, it just won't try to attribute that time to the individual steps) Cheers, Jeff
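To illustrate the request concretely, here is a self-contained toy sketch; the table names, sizes, and work_mem values are just examples, and the real run would of course use the original join instead:

-- Toy stand-in tables, only to demonstrate the invocation.
CREATE TEMP TABLE small AS SELECT g AS id FROM generate_series(1, 100000) g;
CREATE TEMP TABLE big   AS SELECT g AS id FROM generate_series(1, 1000000) g;
ANALYZE small;
ANALYZE big;

SET work_mem = '2000MB';          -- then repeat with '3000MB'
EXPLAIN (ANALYZE, TIMING OFF)     -- per-node timing off; total runtime still reported
SELECT count(*) FROM small s JOIN big b USING (id);
RESET work_mem;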
Re: INOUT parameters in procedures
On Wed, Mar 14, 2018 at 10:46 AM, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > committed > > I'm getting compiler warnings: pl_exec.c: In function 'exec_stmt_call': pl_exec.c:2089:8: warning: variable 'numargs' set but not used [-Wunused-but-set-variable] int numargs; ^ select version(); PostgreSQL 11devel-6b960aa on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit Cheers, Jeff
Re: INOUT parameters in procedures
On Thu, Mar 15, 2018 at 6:58 AM, Tom Lane wrote: > Jeff Janes writes: > > I'm getting compiler warnings: > > pl_exec.c: In function 'exec_stmt_call': > > pl_exec.c:2089:8: warning: variable 'numargs' set but not used > > Not fixed by 8df5a1c868cc28f89ac6221cff8e2b5c952d0eb6? > I think you meant to type "now fixed by". (unless your compiler is pickier than mine) Cheers Jeff
Bug with ICU for merge join
Ever since 27b62377b47f9e7bf58613, I have been getting "ERROR: mergejoin input data is out of order" for the attached reproducer. I get this on Ubuntu 20.04 and 22.04, whether initdb was run under LC_ALL=C or under LANG=en_US.UTF-8. It is not my query, I don't really know what its point is. I just got this error while looking into the performance of it and accidentally running it against 16dev. Cheers, Jeff bad_merge.sql Description: Binary data
awkward cancellation of parallel queries on standby.
When a parallel query gets cancelled on a standby due to max_standby_streaming_delay, it happens rather awkwardly. I get two errors stacked up, a query cancellation followed by a connection termination. I use `pgbench -R 1 -T3600 -P5` on the master to generate a light but steady stream of HOT pruning records, and then run `select sum(a.abalance*b.abalance) from pgbench_accounts a join pgbench_accounts b using (bid);` on the standby not in a transaction block to be a long-running parallel query (scale factor of 20). I also set max_standby_streaming_delay = 0. That isn't necessary, but it saves wear and tear on my patience. ERROR: canceling statement due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. FATAL: terminating connection due to conflict with recovery DETAIL: User query might have needed to see row versions that must be removed. This happens quite reliably. In psql, these sometimes both show up immediately, and sometimes only the first one shows up immediately and then the second one appears upon the next communication to the backend. I don't know if this is actually a problem. It isn't for me as I don't do this kind of thing outside of testing, but it seems untidy and I can see it being frustrating from a catch-and-retry perspective and from a log-spam perspective. It looks like the backend gets signalled by the startup process, and then it signals the postmaster to signal the parallel workers, and then they ignore it for quite a long time (tens to hundreds of ms). By the time they get around to responding, someone has decided to escalate things. Which doesn't seem to be useful, because no one can do anything until the workers respond anyway. This behavior seems to go back a long way, but the propensity for both messages to show up at the same time vs. in different round-trips changes from version to version. Is this something we should do something about? Cheers, Jeff
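A sketch of the standby-side piece of the reproduction described above; it assumes max_standby_streaming_delay = 0 is already set on the standby and that the pgbench run mentioned above is generating HOT-pruning records on the primary:

-- Confirm the conflict-cancellation setting on the standby.
SHOW max_standby_streaming_delay;

-- Long-running parallel query (pgbench scale factor 20), run outside a
-- transaction block so the conflict cancellation can hit it mid-flight.
SELECT sum(a.abalance * b.abalance)
FROM pgbench_accounts a
JOIN pgbench_accounts b USING (bid);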
connection timeout hint
When one tries to connect to a server and port which is protected by a firewall, one gets messages like this: Unix: psql: error: connection to server at "192.168.0.26", port 5432 failed: Connection timed out Is the server running on that host and accepting TCP/IP connections? Windows: psql: error: connection to server at "192.168.0.26", port 5432 failed: Connection timed out (0x274C/10060) Is the server running on that host and accepting TCP/IP connections? But the hint given is unhelpful, and even positively misleading. If the port is blocked by a firewall, that doesn't imply the database server is not listening (if only one could get to it), and it doesn't matter whether the database server is listening anyway. If for some reason it weren't listening as well as being blocked, making it listen wouldn't help as long as it remains blocked at the firewall. Is there some portable way to detect this cause of the connection problem (connection timeout) and issue a more suitable hint explicitly mentioning firewalls and routers, or perhaps just no hint at all? As far as I know, only a firewall causes this problem, at least on a persistent basis. Maybe you could see it sporadically on a vastly overloaded server or a server caught in the process of rebooting. It would be better to give a hint that is correct the vast majority of the time than one that is wrong the vast majority of the time. There are a lot of questions about this on, for example, stackoverflow. I think people might be better able to figure it out for themselves if the hint were not actively leading them astray. Cheers, Jeff
Re: SQL-standard function body
On Wed, Apr 7, 2021 at 3:55 PM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > > Committed. Thanks! > > This commit breaks line continuation prompts for unbalanced parentheses in the psql binary. Skimming through this thread, I don't see that this is intentional or has been noticed before. with psql -X Before: jjanes=# asdf ( jjanes(# Now: jjanes=# asdf ( jjanes-# I've looked through the parts of the commit that change psql, but didn't see an obvious culprit. Cheers, Jeff
Re: Query generates infinite loop
On Wed, Apr 20, 2022 at 5:43 PM Tom Lane wrote: > I wrote: > > it's true that infinities as generate_series endpoints are going > > to work pretty oddly, so I agree with the idea of forbidding 'em. > > > Numeric has infinity as of late, so the numeric variant would > > need to do this too. > > Oh --- looks like numeric generate_series() already throws error for > this, so we should just make the timestamp variants do the same. > The regression test you added for this change causes an infinite loop when run against an unpatched server with --install-check. That is a bit unpleasant. Is there something we can and should do about that? I was expecting regression test failures of course but not an infinite loop leading towards disk exhaustion. Cheers, Jeff
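For anyone else who trips over this, a sketch of the kind of query at issue (I have not checked that it matches the committed regression test exactly): an infinite endpoint makes the timestamp variants loop forever on an unpatched server, while the numeric variant already errors out, which is the behavior the patch extends to timestamps:

-- Loops forever on an unpatched server; errors out with the fix applied.
SELECT * FROM generate_series(timestamp '2022-01-01', timestamp 'infinity', interval '1 day');

-- The numeric variant already rejects infinite endpoints.
SELECT * FROM generate_series(1::numeric, 'infinity'::numeric, 1::numeric);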
Re: PostgreSQL 15 Beta 1 release announcement draft
On Sat, May 14, 2022 at 2:52 PM Jonathan S. Katz wrote: > Hi, > > Attached is a draft of the release announcement for the PostgreSQL 15 > Beta 1 release. The goal of this announcement is to raise awareness > around many of the new features appearing in PostgreSQL 15 and to > encourage people to test. The success of the PostgreSQL 15 GA depends > heavily on people testing during the Beta period! > I have some belated feedback. I was excited to try this on Windows (I don't have a build system for that) and so followed the first link in the message, to https://www.postgresql.org/download/. At first glance there is nothing about beta there, but there is a prominent Windows icon so I click on that. And then to EDB, but there is no apparent way to download beta, just the released versions. I poked around EDB a bit but didn't find anything promising, then backed out of all that poking around and eventually all the way back to /download, where I scrolled down and finally found the link to https://www.postgresql.org/download/snapshots/ which tells me what I need to know. But at this point I was more annoyed than excited. An invitation to download the beta should take me directly to the page relevant to doing that. I shouldn't have to read the page backwards, or do a breadth-first traversal, to get to the right place efficiently. People will click on the first link that seems relevant, and "Windows" on the generic download page certainly seems relevant to Beta for Windows, until after you have scrolled down to find the beta/RC specific link instead. (I now recall being annoyed by this in a prior year as well; I guess I have a bad memory for avoiding mistakes but a good memory for recalling them.) Also, the download page should probably say "binary packages and installers" where it currently says "There are source code and binary packages of beta and release candidates", although I guess that is not about the announcement itself. Cheers, Jeff
15beta1 tab completion of extension versions
Extension version strings need to be quoted. Either double or single quotes will work. In released psql clients, tab completion offers double quoted suggestions: alter extension pg_trgm update TO "1.3" "1.4" "1.5" "1.6" But commit 02b8048ba5 broke that: it now offers unquoted version strings which, if used as offered, lead to syntax errors. The code change seems to have been intentional, but I don't think the behavior change was intended. While the version string might not be an identifier, it still needs to be treated as if it were one. Putting pg_catalog.quote_ident back into Query_for_list_of_available_extension_versions* fixes it, but might not be the best way to fix it. commit 02b8048ba5dc36238f3e7c3c58c5946220298d71 (HEAD, refs/bisect/bad) Author: Tom Lane Date: Sun Jan 30 13:33:23 2022 -0500 psql: improve tab-complete's handling of variant SQL names. Cheers, Jeff
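For context, a quick sketch of what the completion needs to produce; per the above, the version string must be quoted (either quoting style is accepted), and the unquoted form now offered fails to parse:

-- Accepted (single or double quotes both work):
ALTER EXTENSION pg_trgm UPDATE TO '1.6';
ALTER EXTENSION pg_trgm UPDATE TO "1.6";

-- What the completion offers after 02b8048ba5; fails with a syntax error:
ALTER EXTENSION pg_trgm UPDATE TO 1.6;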