Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Justin Pryzby
On Thu, Nov 18, 2021 at 04:39:42PM +1300, Thomas Munro wrote:
> On Thu, Nov 18, 2021 at 1:18 PM Robert Creager  
> wrote:
> > So, how do I go about capturing more information for the big brains (you 
> > guys) to help figure this out?  I have all our resources at mine (and hence 
> > your) disposal.
> 
> As a workaround, does it help if you issue DISCARD PLANS before your
> COPY jobs, or alternatively start with a fresh connection?  I'm
> guessing that something like this is happening.
> 
> -- set up the auto_explain extension to show the internal foreign key check queries' plans
> load 'auto_explain';
> set auto_explain.log_nested_statements = true;
> set auto_explain.log_min_duration = 0;
> set auto_explain.log_analyze = true;

..and SET client_min_messages=debug;

> drop table if exists r, s cascade;
> create table r (i int primary key);
> create table s (i int references r(i));
> 
> -- collect stats showing r as empty
> analyze r;
> 
> -- execute RI query 6 times to lock the plan (inserts fail, log shows seq scan)
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> insert into s values (42);
> 
> insert into r select generate_series(1, 100);
> 
> -- once more, we still get a seq scan, which is by now a bad idea
> insert into s values (42);
> 
> discard plans;
>
> -- once more, now we get an index scan
> insert into s values (42);

It also seems to work if one does SET plan_cache_mode=force_custom_plan;

Robert might try that, either in postgresql.conf, or SET in the client that's
doing COPY.
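
For example (a sketch of both options; ALTER SYSTEM writes to postgresql.auto.conf, so there's no need to edit postgresql.conf by hand):

-- per-session, on the connection that runs the COPY:
SET plan_cache_mode = force_custom_plan;

-- or instance-wide, picked up on the next config reload:
ALTER SYSTEM SET plan_cache_mode = force_custom_plan;
SELECT pg_reload_conf();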

Robert is using JDBC, which (as I recall) runs into this problem more often than
other clients.  But in this case, I think JDBC isn't causing the problem.

-- 
Justin




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Thomas Munro
On Thu, Nov 18, 2021 at 1:18 PM Robert Creager  wrote:
> So, how do I go about capturing more information for the big brains (you 
> guys) to help figure this out?  I have all our resources at mine (and hence 
> your) disposal.

As a workaround, does it help if you issue DISCARD PLANS before your
COPY jobs, or alternatively start with a fresh connection?  I'm
guessing that something like this is happening.

-- set up the auto_explain extension to show the internal foreign key check queries' plans
load 'auto_explain';
set auto_explain.log_nested_statements = true;
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;

drop table if exists r, s cascade;
create table r (i int primary key);
create table s (i int references r(i));

-- collect stats showing r as empty
analyze r;

-- execute RI query 6 times to lock the plan (inserts fail, log shows seq scan)
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);
insert into s values (42);

insert into r select generate_series(1, 100);

-- once more, we still get a seq scan, which is by now a bad idea
insert into s values (42);

discard plans;

-- once more, now we get an index scan
insert into s values (42);
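
Applied to the reported workload, the workaround would look roughly like this on the connection that does the bulk load (a sketch only; the target table and column list here are illustrative, taken from the schema shown elsewhere in the thread):

-- throw away cached plans so the FK-check queries get re-planned
-- against current statistics before the load
discard plans;
copy ds3.job_entry (blob_id, chunk_id, id, job_id, order_index) from stdin;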




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Robert Creager


> On Nov 17, 2021, at 4:18 PM, Justin Pryzby  wrote:
> 
> On Wed, Nov 17, 2021 at 09:54:14PM +, Robert Creager wrote:
> > We are able to move up to Postgres 13.5, in our ports tree, if that would 
> > help. We used pg_upgrade to get from 9.6 to 13.3, so that should work fine 
> > going instead to 13.5. We’re almost branching/releasing our code, so it’s 
> > not a good time, but if it may help with this problem, we’ll deal with it.
> 
> To be clear, I have no specific reason to believe it would help.

I figured as much, and told our gang that as well.  I was looking through the 
release notes a little; I still need to finish 13.4 and look at 13.5.

> But it would be silly to chase down a problem that someone already fixed 10
> months ago (the source of this problem, or something else that comes up).

Yeah, I'm trying to figure out how feasible it is for us to do quickly.  OK, we 
have approval, and we'll be working on an upgraded build in the morning.

> 
> In fact I suspect it won't help, and there's an issue with your schema, or
> autovacuum, or postgres.

Well, if it’s our schema, 9.6 didn’t care about it; that doesn’t mean there isn’t 
a problem, though, and I understand that completely.

So, how do I go about capturing more information for the big brains (you guys) 
to help figure this out?  I have all our resources at mine (and hence your) 
disposal.

> 
> Note that since v10, the version scheme uses only two components, and 13.3 to
> 13.5 is a minor release, similar to 9.6.3 to 9.6.5. So you don't need to use
> pg_upgrade - just update the binaries.

Good, but the previous release of our product was at 9.6, so we’re currently 
using pg_upgrade to handle that, automated (storage appliance).  Just thinking 
out loud: switching to 13.5 shouldn’t cause us any upgrade issues other than 
figuring out builds and dependencies and some upgrade testing, and a straight 
re-run with 13.5 on top of the db should work just fine.  When I talk out loud, 
I sometimes catch the stupid things I’m missing, and it lets others point out 
the stupid things I’m missing...

Best,
Robert



Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Justin Pryzby
On Wed, Nov 17, 2021 at 09:54:14PM +, Robert Creager wrote:
> We are able to move up to Postgres 13.5, in our ports tree, if that would 
> help.  We used pg_upgrade to get from 9.6 to 13.3, so that should work fine 
> going instead to 13.5.  We’re almost branching/releasing our code, so it’s 
> not a good time, but if it may help with this problem, we’ll deal with it.

To be clear, I have no specific reason to believe it would help.
But it would be silly to chase down a problem that someone already fixed 10
months ago (the source of this problem, or something else that comes up).

In fact I suspect it won't help, and there's an issue with your schema, or
autovacuum, or postgres.
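
If you want to rule out the autovacuum angle, something like this shows when the tables involved were last vacuumed/analyzed (just a sketch; adjust the schema name to match yours):

SELECT schemaname, relname, last_autovacuum, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'ds3'
ORDER BY relname;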

Note that since v10, the version scheme uses only two components, and 13.3 to
13.5 is a minor release, similar to 9.6.3 to 9.6.5.  So you don't need to use
pg_upgrade - just update the binaries.

https://www.postgresql.org/docs/13/release-13-5.html
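
After swapping the binaries and restarting, a quick sanity check that the server is really running the new minor release (trivial, but worth doing after an automated upgrade):

SELECT version();      -- full version string
SHOW server_version;   -- just the number, e.g. 13.5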

-- 
Justin




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Thomas Munro
On Thu, Nov 18, 2021 at 8:28 AM Tom Lane  wrote:
> Justin Pryzby  writes:
> > It shows that the process is running FK triggers.
>
> Indeed, and doing a seqscan therein.  Normally I'd suppose that
> this reflects a lack of an index, but RI_FKey_check should always
> be doing something that matches the referenced table's unique
> constraint, so why isn't it using that?

I wonder if the reference tables are empty sometimes, and there's an
unlucky sequence of events that results in cached RI plans with seq
scans being used later in the same session after the tables are
populated.
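
For context, the check that RI_FKey_check runs for each inserted row is roughly of this shape (a simplified sketch, written against the toy tables r and s used elsewhere in this thread; the real statement is generated internally and goes through the plan cache):

SELECT 1 FROM ONLY r x WHERE i = $1 FOR KEY SHARE OF x;

If the plan for that statement was cached while r was empty, a seq scan could keep being reused after r is populated.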




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Robert Creager


On Nov 17, 2021, at 12:00 PM, Justin Pryzby <pry...@telsasoft.com> wrote:

On Wed, Nov 17, 2021 at 05:51:05PM +, Robert Creager wrote:
 postgres`HeapTupleSatisfiesVisibility+0x42
 postgres`heapgetpage+0x237
 postgres`heapgettup_pagemode+0x5ad
 postgres`heap_getnextslot+0x52
 postgres`SeqNext+0x71
 postgres`ExecScan+0xc9
 postgres`ExecLockRows+0x7b
 postgres`standard_ExecutorRun+0x10a
 postgres`_SPI_execute_plan+0x524
 postgres`SPI_execute_snapshot+0x116
 postgres`ri_PerformCheck+0x29e
 postgres`RI_FKey_check+0x5d3
 postgres`RI_FKey_check_ins+0x21
 postgres`ExecCallTriggerFunc+0x105
 postgres`afterTriggerInvokeEvents+0x605
 postgres`AfterTriggerEndQuery+0x7a
 postgres`CopyFrom+0xaca
 postgres`DoCopy+0x553
 postgres`standard_ProcessUtility+0x5f9
 postgres`ProcessUtility+0x28
  55

It shows that the process is running FK triggers.
Would you show \d for the table which is the destination of COPY, and for the other
tables to which it has FK constraints?

Two tables are being copied into.  I chased down the first FK tables from job_entry; 
I can do the entire thing if you want.  There are bunches...

tapesystem=# \d ds3.job_entry
              Table "ds3.job_entry"
   Column    |  Type   | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 blob_id     | uuid    |           | not null |
 chunk_id    | uuid    |           | not null |
 id          | uuid    |           | not null |
 job_id      | uuid    |           | not null |
 order_index | integer |           | not null |
Indexes:
    "job_entry_pkey" PRIMARY KEY, btree (id)
    "job_entry_blob_id_idx" btree (blob_id)
    "job_entry_chunk_id_idx" btree (chunk_id)
    "job_entry_job_id_blob_id_key" UNIQUE CONSTRAINT, btree (job_id, blob_id)
    "job_entry_job_id_idx" btree (job_id)
    "job_entry_order_index_chunk_id_key" UNIQUE CONSTRAINT, btree (order_index, chunk_id)
Foreign-key constraints:
    "job_entry_blob_id_fkey" FOREIGN KEY (blob_id) REFERENCES ds3.blob(id) ON UPDATE CASCADE ON DELETE CASCADE
    "job_entry_chunk_id_fkey" FOREIGN KEY (chunk_id) REFERENCES ds3.job_chunk(id) ON UPDATE CASCADE ON DELETE CASCADE
    "job_entry_job_id_fkey" FOREIGN KEY (job_id) REFERENCES ds3.job(id) ON UPDATE CASCADE ON DELETE CASCADE

tapesystem=# \d ds3.job_chunk
                              Table "ds3.job_chunk"
           Column           |              Type              | Collation | Nullable | Default
----------------------------+--------------------------------+-----------+----------+---------
 blob_store_state           | ds3.job_chunk_blob_store_state |           | not null |
 chunk_number               | integer                        |           | not null |
 id                         | uuid                           |           | not null |
 job_id                     | uuid                           |           | not null |
 node_id                    | uuid                           |           |          |
 pending_target_commit      | boolean                        |           | not null |
 read_from_azure_target_id  | uuid                           |           |          |
 read_from_ds3_target_id    | uuid                           |           |          |
 read_from_pool_id          | uuid                           |           |          |
 read_from_s3_target_id     | uuid                           |           |          |
 read_from_tape_id          | uuid                           |           |          |
Indexes:
    "job_chunk_pkey" PRIMARY KEY, btree (id)
    "job_chunk_blob_store_state_idx" btree (blob_store_state)
    "job_chunk_chunk_number_job_id_key" UNIQUE CONSTRAINT, btree (chunk_number, job_id)
    "job_chunk_job_id_idx" btree (job_id)
    "job_chunk_node_id_idx" btree (node_id)
    "job_chunk_read_from_azure_target_id_idx" btree (read_from_azure_target_id)
    "job_chunk_read_from_ds3_target_id_idx" btree (read_from_ds3_target_id)
    "job_chunk_read_from_pool_id_idx" btree (read_from_pool_id)
    "job_chunk_read_from_s3_target_id_idx" btree (read_from_s3_target_id)
    "job_chunk_read_from_tape_id_idx" btree (read_from_tape_id)
Foreign-key constraints:
    "job_chunk_job_id_fkey" FOREIGN KEY (job_id) REFERENCES ds3.job(id) ON UPDATE CASCADE ON DELETE CASCADE
    "job_chunk_node_id_fkey" FOREIGN KEY (node_id) REFERENCES ds3.node(id) ON UPDATE CASCADE ON DELETE SET NULL
    "job_chunk_read_from_azure_target_id_fkey" FOREIGN KEY (read_from_azure_target_id) REFERENCES target.azure_target(id) ON UPDATE CASCADE ON DELETE SET NULL
    "job_chunk_read_from_ds3_target_id_fkey" FOREIGN KEY (read_from_ds3_target_id) REFERENCES target.ds3_target(id) ON UPDATE CASCADE ON DELETE SET NULL
 

Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Tom Lane
Justin Pryzby  writes:
> It shows that the process is running FK triggers.

Indeed, and doing a seqscan therein.  Normally I'd suppose that
this reflects a lack of an index, but RI_FKey_check should always
be doing something that matches the referenced table's unique
constraint, so why isn't it using that?

regards, tom lane




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Justin Pryzby
On Wed, Nov 17, 2021 at 05:51:05PM +, Robert Creager wrote:
>   postgres`HeapTupleSatisfiesVisibility+0x42
>   postgres`heapgetpage+0x237
>   postgres`heapgettup_pagemode+0x5ad
>   postgres`heap_getnextslot+0x52
>   postgres`SeqNext+0x71
>   postgres`ExecScan+0xc9
>   postgres`ExecLockRows+0x7b
>   postgres`standard_ExecutorRun+0x10a
>   postgres`_SPI_execute_plan+0x524
>   postgres`SPI_execute_snapshot+0x116
>   postgres`ri_PerformCheck+0x29e
>   postgres`RI_FKey_check+0x5d3
>   postgres`RI_FKey_check_ins+0x21
>   postgres`ExecCallTriggerFunc+0x105
>   postgres`afterTriggerInvokeEvents+0x605
>   postgres`AfterTriggerEndQuery+0x7a
>   postgres`CopyFrom+0xaca
>   postgres`DoCopy+0x553
>   postgres`standard_ProcessUtility+0x5f9
>   postgres`ProcessUtility+0x28
>55

It shows that the process is running FK triggers.
Would you show \d for the table which is the destination of COPY, and for the other
tables to which it has FK constraints?

Also, do you have any long-running transactions?
In your first message, you showed no other queries except "idle" ones (not
idle-in-transaction) but I figured I'd ask anyway.
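
One way to double-check, if you like (a sketch using pg_stat_activity):

-- sessions with an open transaction, oldest first
SELECT pid, state, xact_start, query_start, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;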

Does your COPY job run in a transaction block?

You're running pg13.2, so it would be interesting to know if the problem exists
under 13.5.

-- 
Justin




Re: Need help identifying a periodic performance issue.

2021-11-17 Thread Robert Creager


On Nov 17, 2021, at 10:51 AM, Robert Creager <robe...@spectralogic.com> wrote:



On Nov 15, 2021, at 10:50 PM, Thomas Munro <thomas.mu...@gmail.com> wrote:

On Tue, Nov 16, 2021 at 5:43 PM Robert Creager <robe...@spectralogic.com> wrote:
One CPU is pegged, the data has been sent over STDIN so Postgres is not 
waiting for more, and there are no other queries running, per this select:

So PostgreSQL is eating 100% CPU, with no value shown in
wait_event_type, and small numbers of system calls are counted.  In
that case, is there an interesting user stack that jumps out with a
profiler during the slowdown (or the kernel version, using stack() instead of ustack())?

sudo dtrace -n 'profile-99 /arg0/ { @[ustack()] = count(); } tick-10s { exit(0); }'

OK, here are the logs from around a dtrace run, included.  I have dtraces every 1m10s, 
and the two I looked at (their last entries) were the same.

Also, the WAL settings are different on this machine; I had changed them from the 
original email submittal to see if it was related.

tapesystem=# SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');
              name               |        current_setting         |       source
---------------------------------+--------------------------------+--------------------
 application_name                | psql                           | client
 autovacuum_analyze_scale_factor | 0.05                           | configuration file
 autovacuum_analyze_threshold    | 5000                           | configuration file
 autovacuum_max_workers          | 8                              | configuration file
 autovacuum_vacuum_cost_delay    | 5ms                            | configuration file
 autovacuum_vacuum_scale_factor  | 0.1                            | configuration file
 autovacuum_vacuum_threshold     | 5000                           | configuration file
 checkpoint_completion_target    | 0.9                            | configuration file
 checkpoint_timeout              | 15min                          | configuration file
 client_encoding                 | UTF8                           | client
 DateStyle                       | ISO, MDY                       | configuration file
 default_text_search_config      | pg_catalog.english             | configuration file
 dynamic_shared_memory_type      | posix                          | configuration file
 effective_cache_size            | 6546MB                         | configuration file
 effective_io_concurrency        | 200                            | configuration file
 full_page_writes                | off                            | configuration file
 hot_standby                     | off                            | configuration file
 lc_messages                     | C                              | configuration file
 lc_monetary                     | C                              | configuration file
 lc_numeric                      | C                              | configuration file
 lc_time                         | C                              | configuration file
 listen_addresses                | *                              | configuration file
 log_autovacuum_min_duration     | 1s                             | configuration file
 log_checkpoints                 | on                             | configuration file
 log_connections                 | on                             | configuration file
 log_destination                 | syslog                         | configuration file
 log_disconnections              | on                             | configuration file
 log_duration                    | off                            | configuration file
 log_line_prefix                 | db=%d,user=%u,app=%a,client=%h | configuration file
 log_lock_waits                  | on                             | configuration file
 log_min_duration_sample         | 500ms                          | configuration file
 log_min_duration_statement      | 1s                             | configuration file
 log_statement_sample_rate       | 0.01                           | configuration file
 log_temp_files                  | 0                              | configuration file
 log_timezone                    | UTC                            | configuration file
 maintenance_work_mem            | 1964MB                         | configuration file
 max_connections                 | 250                            | configuration file
 max_parallel_workers_per_gather | 8                              | configuration file
 max_replication_slots           | 0                              | configuration file
 max_stack_depth                 | 32MB                           | configuration file
 max_wal_senders                 | 0                              | configuration file