Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-05 Thread Maxim Boguk
On Thu, Jan 5, 2023 at 1:31 PM MichaelDBA  wrote:

>
> What happens if you take pg_stat_statements out of the picture (remove
> from shared_preload_libraries)?  Does your BIND problem go away?
>

I didn't test this idea, because it requires a restart of the database (which
cannot be done quickly), and without pg_stat_statements there will be no
adequate performance monitoring of the database.
But I'm pretty sure that the issue will go away with pg_stat_statements
disabled.

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-05 Thread MichaelDBA
Well if you find out for sure, please let me know.  I'm very interested 
in the outcome of this problem.


Maxim Boguk wrote on 1/5/2023 6:44 AM:



On Thu, Jan 5, 2023 at 1:31 PM MichaelDBA wrote:



What happens if you take pg_stat_statements out of the picture
(remove from shared_preload_libraries)?  Does your BIND problem go
away?


I didn't test this idea, because it requires a restart of the database
(which cannot be done quickly), and without pg_stat_statements there will
be no adequate performance monitoring of the database.
But I'm pretty sure that the issue will go away with 
pg_stat_statements disabled.


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678







Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-05 Thread MichaelDBA


What happens if you take pg_stat_statements out of the picture (remove 
from shared_preload_libraries)?  Does your BIND problem go away?


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-05 Thread Maxim Boguk
On Sat, Dec 31, 2022 at 2:26 PM Maxim Boguk  wrote:

> Hi,
>
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for a (less than 10 second)
> latency spike wasn't in the EXECUTE stage but in the BIND stage.
> At the same time graphical monitoring shows that during this few-second
> period there were some queries waiting in the BIND stage.
>
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock waiting over 100ms (>deadlock_timeout)
> should be in the log.
> But in the log I see only spikes on slow BIND but not lock waits logged.
> (
> grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration'
> | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
> ...
>   9 2022-12-29 00:12:5
>   2 2022-12-29 00:13:1
>   3 2022-12-29 00:13:5
> !!!  68 2022-12-29 00:14:0
>   5 2022-12-29 00:14:1
>   3 2022-12-29 00:14:2
>   2 2022-12-29 00:14:3
> ).
> But no lock waits on the BIND stage logged during the problem period (and
> no lock waits in general).
> Similar issues happen a few times per day without any visible pattern (but
> on the same tables usually).
> No CPU or IO load/latency spikes found during problem periods.
> No EXECUTE slowdown found in the log during that time.
>


Follow-up research of this issue led me to the following results:
every logged spike of BIND/PARSE response time correlated with the
corresponding backend waiting on
wait_event_type = LWLock
wait_event = pg_stat_statements
and all of these spikes happened during an increment of the
pg_stat_statements_info.dealloc counter.
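
A rough sketch of how this correlation can be observed (assuming PostgreSQL
14+, where the pg_stat_statements_info view exists), sampled repeatedly
while a spike is happening:

-- backends currently blocked on the pg_stat_statements LWLock
SELECT count(*) AS waiting_on_pgss
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event = 'pg_stat_statements';

-- deallocation counter; a jump here that coincides with the waits above
-- points at entry eviction as the trigger
SELECT dealloc, stats_reset FROM pg_stat_statements_info;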

Some searching about this issue led me to the following blog post about a
similar issue:
https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html

However, we already have pg_stat_statements.max=1, so a further increase
of this parameter seems counterproductive (the size of
14/main/pg_stat_tmp/pgss_query_texts.stat is already over 20MB).
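
As a rough sanity check (assuming the pg_stat_statements view is queryable
from this database), the number of stored entries can be compared against
pg_stat_statements.max together with the total size of the stored query texts:

-- how full the pg_stat_statements hashtable is, and how much query text it holds
SELECT count(*) AS entries,
       current_setting('pg_stat_statements.max') AS max_entries,
       pg_size_pretty(sum(length(query))::bigint) AS total_query_text_size
FROM pg_stat_statements;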


Open questions remain:
1) Is it expected behaviour of pg_stat_statements to block every BIND/PARSE
during deallocation of the least used entries for the whole period of the cleanup?


2) Is there any recommended workaround for this issue for systems with a
strict latency SLA?
(Blocking every database query that uses the extended query protocol for
200-500ms, 50+ times per day at random times, isn't acceptable in our case,
unfortunately.)


3) Why do only BIND/PARSE block, but not EXECUTE?
(Maybe some difference in the implementation of the plan vs exec
pg_stat_statements counters?)


Kind Regards,
Maxim


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 6:55 PM MichaelDBA  wrote:

> You said it's a dedicated server, but pgbouncer is running locally,
> right?  PGBouncer has a small footprint, but is the CPU high for it?
>

There are 4 pgbouncer processes in so_reuseport mode.
I never saw more than 40% of a single CPU core per pgbouncer process
(most of the time under 20%).
So it's unlikely to be a result of pgbouncer being overloaded.


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread MichaelDBA
You said it's a dedicated server, but pgbouncer is running locally, 
right?  PGBouncer has a small footprint, but is the CPU high for it?


Maxim Boguk wrote on 1/1/2023 11:51 AM:



On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA wrote:


Hi Maxim,

10-20 active, concurrent connections is way below any CPU load
problem you should have with 48 available vCPUs.
You never explicitly said what the load is, so what is it in the
context of the 1,5,15?


LA is 10-15 all the time; the servers are really overprovisioned (2-3x by
available CPU resources) because the application is quite sensitive to
database latency.
And during these latency spikes EXECUTE works without any issues
(i.e. only PARSE/BIND suck).



--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678




Regards,

Michael Vitale

michael...@sqlexec.com 

703-600-9343






Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA  wrote:

> Hi Maxim,
>
> 10-20 active, concurrent connections is way below any CPU load problem you
> should have with 48 available vCPUs.
> You never explicitly said what the load is, so what is it in the context
> of the 1,5,15?
>
>
LA is 10-15 all the time; the servers are really overprovisioned (2-3x by
available CPU resources) because the application is quite sensitive to
database latency.
And during these latency spikes EXECUTE works without any issues (i.e.
only PARSE/BIND suck).


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread MichaelDBA

Hi Maxim,

10-20 active, concurrent connections is way below any CPU load problem 
you should have with 48 available vCPUs.
You never explicitly said what the load is, so what is it in the context 
of the 1,5,15?


Maxim Boguk wrote on 1/1/2023 11:30 AM:

1) usual load (i.e. no anomalies):
10-20 concurrent queries running (i.e. the issue isn't related to a load
spike or similar anomalies)

additionally 5-10 short idle-in-transaction sessions (a usual amount too)
total around 300 active connections to the database (behind a local
pgbouncer in transaction mode)



Regards,

Michael Vitale

michael...@sqlexec.com 

703-600-9343






Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 3:27 PM MichaelDBA  wrote:

> Howdy,
>
> Few additional questions:
>
>1. How many concurrent, active connections are running when these BIND
>problems occur?  select count(*) from pg_stat_activity where state in
>('active','idle in transaction')
>2. Are the queries using gigantic IN () values?
>3. Perhaps unrelated, but is log_temp_files turned on, and if so, do
>you have a lot of logs related to that?
>
> Regards,
> Michael Vitale, just another PG DBA
>

1) usual load (i.e. no anomalies):
10-20 concurrent queries running (i.e. the issue isn't related to a load spike
or similar anomalies)
additionally 5-10 short idle-in-transaction sessions (a usual amount too)
total around 300 active connections to the database (behind a local pgbouncer
in transaction mode)

2) No... long BIND for huge parameter lists is a known issue for me, but in
this case there is nothing like that... just (every?) PARSE/BIND stuck for a
short period (including ones which don't require pg_statistic table
access)...
Here are some funny samples from the latest spike:
2023-01-01 15:45:09.151 UTC 2421121  from [local]
[vxid:109/20732521 txid:0] [BIND] LOG:  duration: 338.830 ms  bind
: ROLLBACK
2023-01-01 15:45:09.151 UTC 2365255  from [local] [vxid:41/21277531
txid:2504447286] [PARSE] LOG:  duration: 338.755 ms  parse :
select nextval ('jobs_id_seq')
along with normal select/insert/update/delete operations stuck for a short
time too...

3) log_temp_files is on for sure; I found no correlation with temp file
usage, and no correlation between latency spikes and logged autovacuum
actions either.

PS: '[BIND] LOG:  duration: 338.830 ms  bind : ROLLBACK' on a
definitely not overloaded and perfectly healthy server is probably the most
curious log entry of 2022 for me.

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread MichaelDBA

Howdy,

Few additional questions:

1. How many concurrent, active connections are running when these BIND
   problems occur?  select count(*) from pg_stat_activity where state
   in ('active','idle in transaction')
2. Are the queries using gigantic IN () values?
3. Perhaps unrelated, but is log_temp_files turned on, and if so, do you
   have a lot of logs related to that?

Regards,
Michael Vitale, just another PG DBA









Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby  wrote:

> On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> > Hi,
> >
> > When performing post-mortem analysis of some short latency spikes on a
> > heavily loaded database, I found that the reason for a (less than 10 second)
> > latency spike wasn't in the EXECUTE stage but in the BIND stage.
> > At the same time graphical monitoring shows that during this few-second
> > period there were some queries waiting in the BIND stage.
> >
> > Logging setup:
> > log_min_duration_statement=200ms
> > log_lock_waits=on
> > deadlock_timeout=100ms
> > So I expected that every lock waiting over 100ms (>deadlock_timeout)
> should
> > be in the log.
> > But in the log I see only spikes on slow BIND but not lock waits logged.
>
> What version postgres?  What settings have non-default values ?
> What OS/version?  What environment/hardware?  VM/image/provider/...
> What are the queries that are running BIND ?  What parameter types ?
> Are the slow BINDs failing?  Are their parameters being logged ?
> What else is running besides postgres ?  Are the DB clients local or
> remote ?  It shouldn't matter, but what client library?
>

What version of postgres? - 14.6

What settings have non-default values ? - a lot (it's a 48-core Amazon EC2
server with 396GB of RAM)
(i.e. it is a carefully tuned database for a particular workload)

What OS/version? - Ubuntu 20.04LTS

What environment/hardware? - a 48-core Amazon EC2 server with 396GB of RAM
and local NVMe storage
(i3en.12xlarge)

What are the queries that are running BIND ?  - nothing special; during the
problem period a lot of completely different queries become stuck in the
BIND and PARSE stages, but no long-duration (>100ms) EXECUTE calls are found.
In general it feels like the whole BIND/PARSE machinery locks up for a short
period.
 LOG SAMPLE ==
2023-01-01 09:07:31.622 UTC 1848286 ** from [local] [vxid:84/20886619
txid:0] [PARSE] LOG:  duration: 235.472 ms  parse : SELECT
COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >
= '2022-12-31 09:07:31.35') AND (job_stats_master.created_at <
'2023-01-01 09:07:31.35') AND "job_stats_master"."employer_id" = 
AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" =
 AND "job_stats_master"."ip_matching_id" = *
2023-01-01 09:07:31.622 UTC 1898699 **  from [local] [vxid:158/22054921
txid:0] [BIND] LOG:  duration: 231.274 ms  bind : SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 *** from [local] [vxid:158/22054921
txid:0] [BIND] DETAIL:  parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416
txid:0] [BIND] LOG:  duration: 231.024 ms  bind : SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416
txid:0] [BIND] DETAIL:  parameters: $1 = '0'

... 5 pages of BIND/PARSE of different queries, unrelated to each other,
logged with over 100ms runtime

2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673
txid:0] [BIND] LOG:  duration: 139.372 ms  bind : SELECT
employers.*, third_party_employer_pixels.facebook_pixel_id AS
facebook_pixel_id, third_party_employer_pixels.google_pixel_id   AS
google_pixel_id, third_party_employer_pixels.google_actionsAS
google_actions, employer_pixel_configurations.solutionAS
tracking_solution, employer_pixel_configurations.domain_name AS
domain, settings.use_multiple_bids   FROM employers LEFT JOIN
third_party_employer_pixels   ON third_party_employer_pixels.employer_id =
employers.id LEFT JOIN employer_pixel_configurations ON
employer_pixel_configurations.id = employers.id LEFT JOIN settings
 ON settings.id = employers.setting_id WHERE employers.id =
$1
2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673
txid:0] [BIND] DETAIL:  parameters: $1 = '*'
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997
txid:0] [BIND] LOG:  duration: 120.237 ms  bind : SELECT
job_boards.*, enterprises.product_type,
feed_settings.use_employer_exported_name as use_employer_exported_name,
integration_job_board_settings.integration_status as integration_status
FROM job_boards LEFT JOIN integration_job_board_settings ON
integration_job_board_settings.id =
job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON
enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON
feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997
txid:0] [BIND] DETAIL:  parameters: $1 = ''
===
What is really curious in the log: every one of the 100+ queries stuck in
the PARSE/BIND stage was logged (and thus unstuck) at the same exact
moment... which highly likely means that they all had been 

Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2022-12-31 Thread Justin Pryzby
On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> Hi,
> 
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for a (less than 10 second)
> latency spike wasn't in the EXECUTE stage but in the BIND stage.
> At the same time graphical monitoring shows that during this few-second
> period there were some queries waiting in the BIND stage.
> 
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock waiting over 100ms (>deadlock_timeout) should
> be in the log.
> But in the log I see only spikes on slow BIND but not lock waits logged.

What version postgres?  What settings have non-default values ?
What OS/version?  What environment/hardware?  VM/image/provider/...
What are the queries that are running BIND ?  What parameter types ?
Are the slow BINDs failing?  Are their parameters being logged ?
What else is running besides postgres ?  Are the DB clients local or
remote ?  It shouldn't matter, but what client library?

https://wiki.postgresql.org/wiki/Slow_Query_Questions

-- 
Justin




How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2022-12-31 Thread Maxim Boguk
Hi,

When performing post-mortem analysis of some short latency spikes on a
heavily loaded database, I found that the reason for a (less than 10 second)
latency spike wasn't in the EXECUTE stage but in the BIND stage.
At the same time graphical monitoring shows that during this few-second
period there were some queries waiting in the BIND stage.

Logging setup:
log_min_duration_statement=200ms
log_lock_waits=on
deadlock_timeout=100ms
So I expected that every lock waiting over 100ms (>deadlock_timeout) should
be in the log.
But in the log I see only spikes on slow BIND but not lock waits logged.
(
grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration' |
perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
...
  9 2022-12-29 00:12:5
  2 2022-12-29 00:13:1
  3 2022-12-29 00:13:5
!!!  68 2022-12-29 00:14:0
  5 2022-12-29 00:14:1
  3 2022-12-29 00:14:2
  2 2022-12-29 00:14:3
).
But no lock waits on the BIND stage logged during the problem period (and
no lock waits in general).
Similar issues happen a few times per day without any visible pattern (but
on the same tables usually).
No CPU or IO load/latency spikes found during problem periods.
No EXECUTE slowdown found in the log during that time.

So currently I have two hypotheses in research:
1) during the BIND stage not every lock wait is logged
2) there is some non-lock-related intermittent slowdown of BIND

I ask for any ideas on how to debug this issue (the duration of such a spike
is usually under 1s, but given how many TPS the database is serving, 1s is
too much and affects end users).
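
For reference, a minimal wait-event sampling sketch over pg_stat_activity,
the kind of probe that could be left running (for example via \watch 0.1 in
psql) while such a spike is happening:

-- show active backends and what they are currently waiting on
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for,
       left(query, 60) AS query_prefix
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
ORDER BY running_for DESC;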


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678