Re: Need help identifying a periodic performance issue.
On Thu, Nov 18, 2021 at 04:39:42PM +1300, Thomas Munro wrote:
> On Thu, Nov 18, 2021 at 1:18 PM Robert Creager wrote:
> > So, how do I go about capturing more information for the big brains (you
> > guys) to help figure this out? I have all our resources at mine (and hence
> > your) disposal.
>
> As a workaround, does it help if you issue DISCARD PLANS before your
> COPY jobs, or alternatively start with a fresh connection? I'm
> guessing that something like this is happening.
>
> -- set up the auto_explain extension to show the internal foreign key check queries' plans
> load 'auto_explain';
> set auto_explain.log_nested_statements = true;
> set auto_explain.log_min_duration = 0;
> set auto_explain.log_analyze = true;

..and SET client_min_messages=debug;

> drop table if exists r, s cascade;
> create table r (i int primary key);
> create table s (i int references r(i));
>
> -- collect stats showing r as empty
> analyze r;
>
> -- execute RI query 6 times to lock the plan (inserts fail, log shows seq scan)
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
>
> insert into r select generate_series(1, 100);
>
> -- once more, we still get a seq scan, which is by now a bad idea
> insert into s values (42);
>
> discard plans;
>
> -- once more, now we get an index scan
> insert into s values (42);

It also seems to work if one does SET plan_cache_mode=force_custom_plan;

Robert might try that, either in postgresql.conf, or SET in the client that's
doing COPY. Robert is using jdbc, which (as I recall) has this problem more
often than other clients. But, in this case, I think JDBC isn't causing the
problem.

--
Justin
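For illustration, a minimal sketch of that workaround in the COPY session
(assuming a session where SET can be issued first; ds3.job_entry is the
destination table shown later in the thread):

SET plan_cache_mode = force_custom_plan;  -- re-plan every execution with current statistics
COPY ds3.job_entry FROM STDIN;            -- the internal RI-check queries fired by this COPY
                                          -- can no longer reuse a cached seq-scan plan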
Re: Need help identifying a periodic performance issue.
On Thu, Nov 18, 2021 at 1:18 PM Robert Creager wrote:
> So, how do I go about capturing more information for the big brains (you
> guys) to help figure this out? I have all our resources at mine (and hence
> your) disposal.

As a workaround, does it help if you issue DISCARD PLANS before your
COPY jobs, or alternatively start with a fresh connection? I'm
guessing that something like this is happening.

-- set up the auto_explain extension to show the internal foreign key check queries' plans
load 'auto_explain';
set auto_explain.log_nested_statements = true;
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;

drop table if exists r, s cascade;
create table r (i int primary key);
create table s (i int references r(i));

-- collect stats showing r as empty
analyze r;

-- execute RI query 6 times to lock the plan (inserts fail, log shows seq scan)
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);

insert into r select generate_series(1, 100);

-- once more, we still get a seq scan, which is by now a bad idea
insert into s values (42);

discard plans;

-- once more, now we get an index scan
insert into s values (42);
Re: Need help identifying a periodic performance issue.
> On Nov 17, 2021, at 4:18 PM, Justin Pryzby wrote:
>
> On Wed, Nov 17, 2021 at 09:54:14PM +0000, Robert Creager wrote:
> > We are able to move up to Postgres 13.5, in our ports tree, if that would
> > help. We used pg_upgrade to get from 9.6 to 13.3, so that should work fine
> > going instead to 13.5. We’re almost branching/releasing our code, so it’s
> > not a good time, but if it may help with this problem, we’ll deal with it.
>
> To be clear, I have no specific reason to believe it would help.

I figured as much, and told our gang that also. Was looking through the
release notes a little; need to finish with 13.4 and look at 13.5.

> But it would be silly to chase down a problem that someone already fixed 10
> months ago (the source of this problem, or something else that comes up).

Yeah, trying to figure out how feasible it is for us to do quickly. Ok, we
have approval, and will be working on an upgraded build in the morning.

> In fact I suspect it won't help, and there's an issue with your schema, or
> autovacuum, or postgres.

Well, if it’s our schema, 9.6 didn’t care about it; that doesn’t mean there
wasn’t a problem though, and I understand that completely.

So, how do I go about capturing more information for the big brains (you
guys) to help figure this out? I have all our resources at mine (and hence
your) disposal.

> Note that since v10, the version scheme uses only two components, and 13.3
> to 13.5 is a minor release, similar to 9.6.3 to 9.6.5. So you don't need
> to use pg_upgrade - just update the binaries.

Good, but the previous release of our product was at 9.6, so we’re currently
using pg_upgrade to do that, automated (storage appliance).

Just talking out loud: switching to 13.5 shouldn’t cause us any upgrade
issues other than figuring out builds and dependencies and some upgrade
testing. And a straight re-run with 13.5 on top of the db should work just
fine. When I talk out loud, I sometimes catch the stupid things I’m missing,
and it allows others to point out the stupid things I’m missing...

Best,
Robert
Re: Need help identifying a periodic performance issue.
On Wed, Nov 17, 2021 at 09:54:14PM +0000, Robert Creager wrote:
> We are able to move up to Postgres 13.5, in our ports tree, if that would
> help. We used pg_upgrade to get from 9.6 to 13.3, so that should work fine
> going instead to 13.5. We’re almost branching/releasing our code, so it’s
> not a good time, but if it may help with this problem, we’ll deal with it.

To be clear, I have no specific reason to believe it would help. But it
would be silly to chase down a problem that someone already fixed 10 months
ago (the source of this problem, or something else that comes up). In fact
I suspect it won't help, and there's an issue with your schema, or
autovacuum, or postgres.

Note that since v10, the version scheme uses only two components, and 13.3
to 13.5 is a minor release, similar to 9.6.3 to 9.6.5. So you don't need to
use pg_upgrade - just update the binaries.

https://www.postgresql.org/docs/13/release-13-5.html

--
Justin
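A quick sanity check once the binaries are swapped (a trivial sketch, not
from the thread):

SHOW server_version;  -- should report 13.5, running against the same data
                      -- directory with no pg_upgrade step needed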
Re: Need help identifying a periodic performance issue.
On Thu, Nov 18, 2021 at 8:28 AM Tom Lane wrote:
> Justin Pryzby writes:
> > It shows that the process is running FK triggers.
>
> Indeed, and doing a seqscan therein. Normally I'd suppose that
> this reflects a lack of an index, but RI_FKey_check should always
> be doing something that matches the referenced table's unique
> constraint, so why isn't it using that?

I wonder if the reference tables are empty sometimes, and there's an unlucky
sequence of events that results in cached RI plans with seq scans being used
later in the same session after the tables are populated.
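One way to probe that hypothesis (a hedged diagnostic sketch; the table
names come from the schema shown elsewhere in the thread) is to compare the
planner's stored row estimates with the live row counts and analyze times:

SELECT s.relname,
       c.reltuples::bigint AS planner_row_estimate,
       s.n_live_tup,
       s.last_analyze,
       s.last_autoanalyze
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.relname IN ('blob', 'job', 'job_chunk');
-- reltuples near 0 next to a large n_live_tup would be consistent with
-- statistics (and cached RI plans) from when the table was empty.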
Re: Need help identifying a periodic performance issue.
On Nov 17, 2021, at 12:00 PM, Justin Pryzby wrote:
>
> On Wed, Nov 17, 2021 at 05:51:05PM +0000, Robert Creager wrote:
> > postgres`HeapTupleSatisfiesVisibility+0x42
> > postgres`heapgetpage+0x237
> > postgres`heapgettup_pagemode+0x5ad
> > postgres`heap_getnextslot+0x52
> > postgres`SeqNext+0x71
> > postgres`ExecScan+0xc9
> > postgres`ExecLockRows+0x7b
> > postgres`standard_ExecutorRun+0x10a
> > postgres`_SPI_execute_plan+0x524
> > postgres`SPI_execute_snapshot+0x116
> > postgres`ri_PerformCheck+0x29e
> > postgres`RI_FKey_check+0x5d3
> > postgres`RI_FKey_check_ins+0x21
> > postgres`ExecCallTriggerFunc+0x105
> > postgres`afterTriggerInvokeEvents+0x605
> > postgres`AfterTriggerEndQuery+0x7a
> > postgres`CopyFrom+0xaca
> > postgres`DoCopy+0x553
> > postgres`standard_ProcessUtility+0x5f9
> > postgres`ProcessUtility+0x28
> >   55
>
> It shows that the process is running FK triggers.
>
> Would you show \d for the table which is the destination of COPY, and for
> other tables to which it has FK constraints.

Two tables being copied into. I chased the first FK tables from the
job_entry. I can do the entire thing if you want. There are bunches...

tapesystem=# \d ds3.job_entry
               Table "ds3.job_entry"
   Column    |  Type   | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 blob_id     | uuid    |           | not null |
 chunk_id    | uuid    |           | not null |
 id          | uuid    |           | not null |
 job_id      | uuid    |           | not null |
 order_index | integer |           | not null |
Indexes:
    "job_entry_pkey" PRIMARY KEY, btree (id)
    "job_entry_blob_id_idx" btree (blob_id)
    "job_entry_chunk_id_idx" btree (chunk_id)
    "job_entry_job_id_blob_id_key" UNIQUE CONSTRAINT, btree (job_id, blob_id)
    "job_entry_job_id_idx" btree (job_id)
    "job_entry_order_index_chunk_id_key" UNIQUE CONSTRAINT, btree (order_index, chunk_id)
Foreign-key constraints:
    "job_entry_blob_id_fkey" FOREIGN KEY (blob_id) REFERENCES ds3.blob(id) ON UPDATE CASCADE ON DELETE CASCADE
    "job_entry_chunk_id_fkey" FOREIGN KEY (chunk_id) REFERENCES ds3.job_chunk(id) ON UPDATE CASCADE ON DELETE CASCADE
    "job_entry_job_id_fkey" FOREIGN KEY (job_id) REFERENCES ds3.job(id) ON UPDATE CASCADE ON DELETE CASCADE

tapesystem=# \d ds3.job_chunk
                         Table "ds3.job_chunk"
          Column           |              Type              | Collation | Nullable | Default
---------------------------+--------------------------------+-----------+----------+---------
 blob_store_state          | ds3.job_chunk_blob_store_state |           | not null |
 chunk_number              | integer                        |           | not null |
 id                        | uuid                           |           | not null |
 job_id                    | uuid                           |           | not null |
 node_id                   | uuid                           |           |          |
 pending_target_commit     | boolean                        |           | not null |
 read_from_azure_target_id | uuid                           |           |          |
 read_from_ds3_target_id   | uuid                           |           |          |
 read_from_pool_id         | uuid                           |           |          |
 read_from_s3_target_id    | uuid                           |           |          |
 read_from_tape_id         | uuid                           |           |          |
Indexes:
    "job_chunk_pkey" PRIMARY KEY, btree (id)
    "job_chunk_blob_store_state_idx" btree (blob_store_state)
    "job_chunk_chunk_number_job_id_key" UNIQUE CONSTRAINT, btree (chunk_number, job_id)
    "job_chunk_job_id_idx" btree (job_id)
    "job_chunk_node_id_idx" btree (node_id)
    "job_chunk_read_from_azure_target_id_idx" btree (read_from_azure_target_id)
    "job_chunk_read_from_ds3_target_id_idx" btree (read_from_ds3_target_id)
    "job_chunk_read_from_pool_id_idx" btree (read_from_pool_id)
    "job_chunk_read_from_s3_target_id_idx" btree (read_from_s3_target_id)
    "job_chunk_read_from_tape_id_idx" btree (read_from_tape_id)
Foreign-key constraints:
    "job_chunk_job_id_fkey" FOREIGN KEY (job_id) REFERENCES ds3.job(id) ON UPDATE CASCADE ON DELETE CASCADE
    "job_chunk_node_id_fkey" FOREIGN KEY (node_id) REFERENCES ds3.node(id) ON UPDATE CASCADE ON DELETE SET NULL
    "job_chunk_read_from_azure_target_id_fkey" FOREIGN KEY (read_from_azure_target_id) REFERENCES target.azure_target(id) ON UPDATE CASCADE ON DELETE SET NULL
    "job_chunk_read_from_ds3_target_id_fkey" FOREIGN KEY (read_from_ds3_target_id) REFERENCES target.ds3_target(id) ON UPDATE CASCADE ON DELETE SET NULL
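As a cross-check of the FK layout above, a hedged sketch of a catalog query
(standard pg_constraint columns; the output shape is illustrative) listing
the constraints whose RI triggers fire during the COPY:

SELECT con.conname,
       con.conrelid::regclass  AS referencing_table,
       con.confrelid::regclass AS referenced_table
FROM pg_constraint con
WHERE con.contype = 'f'
  AND con.conrelid IN ('ds3.job_entry'::regclass, 'ds3.job_chunk'::regclass);
-- Each referenced key is necessarily backed by a unique index (PostgreSQL
-- requires one to create the FK), which is why the seq scans are surprising.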
Re: Need help identifying a periodic performance issue.
Justin Pryzby writes:
> It shows that the process is running FK triggers.

Indeed, and doing a seqscan therein. Normally I'd suppose that
this reflects a lack of an index, but RI_FKey_check should always
be doing something that matches the referenced table's unique
constraint, so why isn't it using that?

			regards, tom lane
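The internal RI check is roughly the following query shape (a sketch after
the queries ri_triggers.c generates; the UUID literal is a placeholder),
which makes it easy to see which plan the planner currently favors:

PREPARE ri_check(uuid) AS
  SELECT 1 FROM ONLY ds3.job x WHERE x.id = $1 FOR KEY SHARE OF x;
EXPLAIN EXECUTE ri_check('00000000-0000-0000-0000-000000000000');
-- Expected: an index scan on ds3.job's primary key; a seq scan here would
-- match what the dtrace stack is showing.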
Re: Need help identifying a periodic performance issue.
On Wed, Nov 17, 2021 at 05:51:05PM +0000, Robert Creager wrote:
> postgres`HeapTupleSatisfiesVisibility+0x42
> postgres`heapgetpage+0x237
> postgres`heapgettup_pagemode+0x5ad
> postgres`heap_getnextslot+0x52
> postgres`SeqNext+0x71
> postgres`ExecScan+0xc9
> postgres`ExecLockRows+0x7b
> postgres`standard_ExecutorRun+0x10a
> postgres`_SPI_execute_plan+0x524
> postgres`SPI_execute_snapshot+0x116
> postgres`ri_PerformCheck+0x29e
> postgres`RI_FKey_check+0x5d3
> postgres`RI_FKey_check_ins+0x21
> postgres`ExecCallTriggerFunc+0x105
> postgres`afterTriggerInvokeEvents+0x605
> postgres`AfterTriggerEndQuery+0x7a
> postgres`CopyFrom+0xaca
> postgres`DoCopy+0x553
> postgres`standard_ProcessUtility+0x5f9
> postgres`ProcessUtility+0x28
>   55

It shows that the process is running FK triggers.

Would you show \d for the table which is the destination of COPY, and for
other tables to which it has FK constraints.

Also, do you have any long-running transactions? In your first message, you
showed no other queries except "idle" ones (not idle-in-transaction), but I
figured I'd ask anyway. Does your COPY job run in a transaction block?

You're running pg13.2, so it would be interesting to know if the problem
exists under 13.5.

--
Justin
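For the long-running-transaction question, a hedged example of what to look
for (standard pg_stat_activity columns):

SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
-- Old transactions (especially "idle in transaction") keep dead rows
-- visible, inflating any seq scans the RI checks fall back to.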
Re: Need help identifying a periodic performance issue.
On Nov 17, 2021, at 10:51 AM, Robert Creager wrote:

> On Nov 15, 2021, at 10:50 PM, Thomas Munro wrote:
> >
> > On Tue, Nov 16, 2021 at 5:43 PM Robert Creager wrote:
> > > One CPU is pegged, the data has been sent over STDIN, so Postgres is
> > > not waiting for more, there are no other queries running using this
> > > select:
> >
> > So PostgreSQL is eating 100% CPU, with no value shown in
> > wait_event_type, and small numbers of system calls are counted. In
> > that case, is there an interesting user stack that jumps out with a
> > profiler during the slowdown (or the kernel version, stack())?
> >
> > sudo dtrace -n 'profile-99 /arg0/ { @[ustack()] = count(); } tick-10s { exit(0); }'

Ok, here are the logs around a dtrace, included. I have dtraces every
1m10s, and the two I looked at, the last entry, were the same.

Also, the WAL settings are different on this machine; I had changed them on
the original email submittal to see if it was related.

tapesystem=# SELECT name, current_setting(name), source FROM pg_settings WHERE source NOT IN ('default', 'override');
              name               |        current_setting         |       source
---------------------------------+--------------------------------+--------------------
 application_name                | psql                           | client
 autovacuum_analyze_scale_factor | 0.05                           | configuration file
 autovacuum_analyze_threshold    | 5000                           | configuration file
 autovacuum_max_workers          | 8                              | configuration file
 autovacuum_vacuum_cost_delay    | 5ms                            | configuration file
 autovacuum_vacuum_scale_factor  | 0.1                            | configuration file
 autovacuum_vacuum_threshold     | 5000                           | configuration file
 checkpoint_completion_target    | 0.9                            | configuration file
 checkpoint_timeout              | 15min                          | configuration file
 client_encoding                 | UTF8                           | client
 DateStyle                       | ISO, MDY                       | configuration file
 default_text_search_config      | pg_catalog.english             | configuration file
 dynamic_shared_memory_type      | posix                          | configuration file
 effective_cache_size            | 6546MB                         | configuration file
 effective_io_concurrency        | 200                            | configuration file
 full_page_writes                | off                            | configuration file
 hot_standby                     | off                            | configuration file
 lc_messages                     | C                              | configuration file
 lc_monetary                     | C                              | configuration file
 lc_numeric                      | C                              | configuration file
 lc_time                         | C                              | configuration file
 listen_addresses                | *                              | configuration file
 log_autovacuum_min_duration     | 1s                             | configuration file
 log_checkpoints                 | on                             | configuration file
 log_connections                 | on                             | configuration file
 log_destination                 | syslog                         | configuration file
 log_disconnections              | on                             | configuration file
 log_duration                    | off                            | configuration file
 log_line_prefix                 | db=%d,user=%u,app=%a,client=%h | configuration file
 log_lock_waits                  | on                             | configuration file
 log_min_duration_sample         | 500ms                          | configuration file
 log_min_duration_statement      | 1s                             | configuration file
 log_statement_sample_rate       | 0.01                           | configuration file
 log_temp_files                  | 0                              | configuration file
 log_timezone                    | UTC                            | configuration file
 maintenance_work_mem            | 1964MB                         | configuration file
 max_connections                 | 250                            | configuration file
 max_parallel_workers_per_gather | 8                              | configuration file
 max_replication_slots           | 0                              | configuration file
 max_stack_depth                 | 32MB                           | configuration file
 max_wal_senders                 | 0                              | configuration file