Re: Postgres Out Of Memory Crash

2023-11-02 Thread Laurenz Albe
On Thu, 2023-11-02 at 09:12 +, Avi Weinberg wrote:
> I'm using a Patroni Postgres installation and noticed that twice already
> Postgres crashed due to running out of memory.  I'm using logical replication
> with around 30-40 active subscribers on this machine.  The machine has 128GB,
> but only 32GB is allocated to Postgres.  How can I find out what is actually
> causing the out-of-memory issue?  Is it caused by a suboptimal Postgres
> configuration, or something else?

You should look into the PostgreSQL log.  That should show a message like

  LOG:  server process (PID 16024) was terminated by signal 9: Killed
  DETAIL:  Failed process was running: SELECT ...

It is not certain, but often that statement is the one that used up
all that memory.  At least it is a starting point for your investigation.
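
For example, something like this should find the relevant lines (assuming
your server log ends up in the "log" directory below the data directory;
adjust the path to match your log_destination/log_directory settings):

  grep -E 'terminated by signal 9|out of memory' \
       /home/postgres/pgdata/pgroot/data/log/*.log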

Yours,
Laurenz Albe




Re: pg_checksums?

2023-11-02 Thread Paul Förster
Hi Nikolay,

> On Nov 2, 2023, at 07:36, Nikolay Samokhvalov wrote:

> There is also a good trick described in
> https://www.crunchydata.com/blog/fun-with-pg_checksums to avoid
> accidental start of Postgres:
[...]
> Additionally, I compiled some thoughts about running pg_checksums
> without downtime (Patroni-friendly, of course) here:
> https://twitter.com/samokhvalov/status/1719961485160689993.

These two links are very interesting. Thanks very much.

Cheers
Paul



Postgres Out Of Memory Crash

2023-11-02 Thread Avi Weinberg
Hi experts,

I'm using a Patroni Postgres installation and noticed that twice already
Postgres crashed due to running out of memory.  I'm using logical replication
with around 30-40 active subscribers on this machine.  The machine has 128GB,
but only 32GB is allocated to Postgres.  How can I find out what is actually
causing the out-of-memory issue?  Is it caused by a suboptimal Postgres
configuration, or something else?

/usr/lib/postgresql/13/bin/postgres -D /home/postgres/pgdata/pgroot/data 
--config-file=/home/postgres/pgdata/pgroot/data/postgresql.conf --port=5432 
--cluster_name=postgres-cluster --wal_level=logical --hot_standby=on 
--max_connections=533 --max_wal_senders=90 --max_prepared_transactions=0 
--max_locks_per_transaction=64 --track_commit_timestamp=on 
--max_replication_slots=90 --max_worker_processes=30 --wal_log_hints=on
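
For what it's worth, this is roughly how I look at the memory-related
settings and the replication load (a generic sketch, not output from this
cluster):

  SELECT name, setting, unit
  FROM pg_settings
  WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                 'logical_decoding_work_mem', 'max_connections');

  SELECT slot_name, slot_type, active FROM pg_replication_slots;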


Oct 27 07:05:31 node2 kernel: postgres invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=993
Oct 27 07:05:31 node2 kernel: postgres cpuset=docker-6ae67e04710619972d3b1ab5d4c69c318d001c2da47fecee121cdc60279a14a0.scope mems_allowed=0
Oct 27 07:05:31 node2 kernel: CPU: 6 PID: 15536 Comm: postgres Kdump: loaded Tainted: G    T 3.10.0-1160.el7.x86_64 #1
Oct 27 07:05:31 node2 kernel: Hardware name: Kontron MSP8040/4008, BIOS Core: 5.11, MSP804x: 1.57.0943FC77 05/06/2020
Oct 27 07:05:31 node2 kernel: Call Trace:
Oct 27 07:05:31 node2 kernel: [] dump_stack+0x19/0x1b
Oct 27 07:05:31 node2 kernel: [] dump_header+0x90/0x229
Oct 27 07:05:31 node2 kernel: [] ? ep_poll_callback+0xf8/0x220
Oct 27 07:05:31 node2 kernel: [] ? find_lock_task_mm+0x56/0xc0
Oct 27 07:05:31 node2 kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60
Oct 27 07:05:31 node2 kernel: [] oom_kill_process+0x2cd/0x490
Oct 27 07:05:31 node2 kernel: [] mem_cgroup_oom_synchronize+0x55c/0x590
Oct 27 07:05:31 node2 kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Oct 27 07:05:31 node2 kernel: [] pagefault_out_of_memory+0x14/0x90
Oct 27 07:05:31 node2 kernel: [] mm_fault_error+0x6a/0x157
Oct 27 07:05:31 node2 kernel: [] __do_page_fault+0x491/0x500
Oct 27 07:05:31 node2 kernel: [] do_page_fault+0x35/0x90
Oct 27 07:05:31 node2 kernel: [] page_fault+0x28/0x30
Oct 27 07:05:31 node2 kernel: Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5984f099_30b0_4506_8730_6c72e7c02b78.slice/docker-6ae67e0471061997$
Oct 27 07:05:31 node2 kernel: memory: usage 32768000kB, limit 32768000kB, failcnt 144867
Oct 27 07:05:31 node2 kernel: memory+swap: usage 32768000kB, limit 9007199254740988kB, failcnt 0
Oct 27 07:05:31 node2 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Oct 27 07:05:31 node2 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5984f099_30b0_4506_8730_6c72e7c02b78.slice: cache:$
Oct 27 07:05:31 node2 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5984f099_30b0_4506_8730_6c72e7c02b78.slice/docker-$
Oct 27 07:05:31 node2 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5984f099_30b0_4506_8730_6c72e7c02b78.slice/docker-$
Oct 27 07:05:31 node2 kernel: [ pid ]   uid  tgid  total_vm     rss nr_ptes swapents oom_score_adj name
Oct 27 07:05:31 node2 kernel: [13159]     0 13159      2391       3             0          -998 pause
Oct 27 07:05:31 node2 kernel: [13322]     0 13322      1095      97       8     0           993 dumb-init
Oct 27 07:05:31 node2 kernel: [13335]     0 13335      1156     171       8     0           993 sh
Oct 27 07:05:31 node2 kernel: [13411]     0 13411      1137      98       8     0           993 runsvdir
Oct 27 07:05:31 node2 kernel: [13438]     0 13438      1099      98       7     0           993 runsv
Oct 27 07:05:31 node2 kernel: [13439]     0 13439      1099      98       7     0           993 runsv
Oct 27 07:05:31 node2 kernel: [13440]   101 13440     27026    1186      54     0           993 pgqd
Oct 27 07:05:31 node2 kernel: [13441]   101 13441    155215    8237     101     0           993 patroni
Oct 27 07:05:31 node2 kernel: [19532]   101 19532   1740033   46817     171     0           993 postgres
Oct 27 07:05:31 node2 kernel: [19542]   101 19542   1767874    6713     121     0           993 postgres
Oct 27 07:05:31 node2 kernel: [19546]   101 19546   1740173 1445031    3166     0           993 postgres
Oct 27 07:05:31 node2 kernel: [19547]   101 19547   1740069   20060     171     0           993 postgres
Oct 27 07:05:31 node2 kernel: [19548]   101 19548   1740027    4821      86     0           993 postgres
Oct 27 07:05:31 node2 kernel: [19549]   101 19549   1740283    1011      91     0           993 postgres

Re: pg_dump/pg_restore --jobs practical limit?

2023-11-02 Thread Tomek
Hi!
When I really want to use all the resources, I set the number of jobs to the
number of CPUs plus 1.  There is probably no reason to run more jobs than you
have CPUs :-).
Unfortunately, pg_dump will not allocate more than one worker process to a
table, even a huge one (unless it is partitioned), so sometimes there is no
point in defining many jobs: when you have one or two big tables and the rest
are relatively small, it makes no difference whether you define 4 or 10 jobs.
But yes, testing is the best way to find out :-).
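
As a rough sketch (the database name and path here are made up), a parallel
dump and restore with the directory format looks like:

   pg_dump -Fd -j 8 -f /backup/mydb.dir mydb
   pg_restore -j 8 -d mydb /backup/mydb.dir

(Remember that parallel pg_dump requires the directory output format, -Fd.)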

   Regards Tomek

On Thu, Nov 2, 2023 at 02:20, Ron wrote:

> On 11/1/23 15:42, Laurenz Albe wrote:
>
> On Wed, 2023-11-01 at 13:09 -0500, Ron wrote:
>
> I will "soon" migrate some 1+ TB database from Pg 9.6.24 on RHEL 6 VMs to Pg
> 14.latest on RHEL 8 VMs.  The VMs have 10Gbps "NICs", SAN-based LUNs managed
> by LVM, and are all on ESX blades.  nproc count on some is 16 and on others
> is 32.
>
> Does anyone have experience as to the point of diminishing returns?
>
> IOW, can I crank them processes up to --jobs=30, will I see no gain -- or
> even degradation -- after, for example, --jobs=24?
>
> This would be for both pg_dump and pg_restore (which would be run on the
> RHEL 8 VM).
>
> Test, test, test.  Theoretical considerations are pretty worthless,
>
>
> Which is why I asked if anyone has experience.
>
>  and it is easy to measure that.
>
>
> Not necessarily.  Our test systems are way too small (only good enough to
> validate that the script works correctly), and there's always something
> (sometimes a lot, sometimes just "some") going on in production, whether
> it's my customer's work, or the SAN (like snapshotting *every* VM and
> then copying the snapshots to the virtual tape device), or something else.
>
> Sure, but are the new systems busy already?
> Ideally you could run tests on them before they are put into production.
>
>
> Testing pg_restore with different --jobs values will be easier.  pg_dump
> is what's going to be reading from a constantly varying system.
>
> --
> Born in Arizona, moved to Babylonia.
>


Re: pg_checksums?

2023-11-02 Thread Nikolay Samokhvalov
On Mon, Oct 30, 2023 at 6:57 AM Alexander Kukushkin wrote:
...
> As Michael already said, the following workflow works just fine (I did it 
> dozens of times):
> 1. enable checksums on the standby node
> 2. start the standby and let it catch up with the primary
> 3. switchover to a standby node
> 4. enable checksums on the former primary (now replica).
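
As a sketch, on each node that workflow looks roughly like this (service
names and config paths depend on your setup, and patronictl flags vary
between versions):

  # with Postgres stopped on the replica:
  pg_checksums --enable -D "$PGDATA"
  # start it again, wait until it has caught up, then:
  patronictl -c /etc/patroni/patroni.yml switchover
  # ...and repeat the pg_checksums step on the former primary.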

There is also a good trick described in
https://www.crunchydata.com/blog/fun-with-pg_checksums to avoid
accidental start of Postgres:

after pg_ctl stop and before pg_checksums --enable, do:
  mv data/pg_twophase data/pg_twophase.DO_NOT_START_THIS_DATABASE

and once pg_checksums --enable is done, move it back.
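
Putting the whole thing together (a sketch that assumes the data directory
is literally "data"):

  pg_ctl -D data stop
  mv data/pg_twophase data/pg_twophase.DO_NOT_START_THIS_DATABASE
  pg_checksums --enable -D data
  mv data/pg_twophase.DO_NOT_START_THIS_DATABASE data/pg_twophase
  pg_ctl -D data start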

Additionally, I compiled some thoughts about running pg_checksums
without downtime (Patroni-friendly, of course) here:
https://twitter.com/samokhvalov/status/1719961485160689993.