If the dump was taken with pg_dump -Fd and pg_restore -j has no effect on
restore time, that’s a good clue.
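For example (just a rough sketch; the database name and dump path below are
placeholders), timing a serial run against a parallel run of the same
directory-format dump makes that comparison explicit:

  time pg_restore -F d -j 1  -d mydb /path/to/dumpdir
  time pg_restore -F d -j 12 -d mydb /path/to/dumpdir

If the wall-clock times are roughly equal, client-side parallelism is not the
limiter and the bottleneck is on the server side.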
You can start by testing deferred checkpoints: set max_wal_size = 1TB
and checkpoint_timeout = 10h, and see how this affects the pg_restore (it
should then be limited by WAL write throughput).  Perhaps also increase
wal_buffers to 128MB.  The idea is to identify (by elimination) the write
choke point before starting to tune for it.
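A minimal sketch of that experiment (the data directory path is a placeholder;
wal_buffers only takes effect after a restart, while the other two settings
just need a reload):

  psql -d postgres -c "ALTER SYSTEM SET max_wal_size = '1TB'"
  psql -d postgres -c "ALTER SYSTEM SET checkpoint_timeout = '10h'"
  psql -d postgres -c "ALTER SYSTEM SET wal_buffers = '128MB'"
  pg_ctl restart -D /path/to/pgdata    # restart also picks up wal_buffers

Remember to revert these after the test; they are only meant to push
checkpoints out of the way while you measure.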

Probably irrelevant to your problem, but you should set the *_io_concurrency
parameters to 200 since you use SSDs.
pg_restore rebuilds indices, so also make sure the settings relevant to index
building are set appropriately (see max_parallel_* and *_io_concurrency).
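Something along these lines (again only a sketch; the parallel-worker count is
an arbitrary example, not a recommendation for your 24-vCPU box):

  psql -d postgres -c "ALTER SYSTEM SET effective_io_concurrency = 200"
  psql -d postgres -c "ALTER SYSTEM SET maintenance_io_concurrency = 200"
  psql -d postgres -c "ALTER SYSTEM SET max_parallel_maintenance_workers = 8"
  psql -d postgres -c "SELECT pg_reload_conf()"   # reload is enough for these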

Kiriakos Georgiou

> On Dec 5, 2025, at 5:30 AM, MentionTheElephant <[email protected]> wrote:
> 
> Hello,
> 
> I would greatly appreciate your insight into an issue where pg_restore
> runs significantly slower than expected, even though the underlying
> storage shows very high random write throughput. I am trying to
> understand which PostgreSQL mechanisms or system layers I should
> investigate next in order to pinpoint the bottleneck and improve
> restore performance.
> 
> The central question is: What should I examine further to understand
> why checkpoint processing becomes the dominant bottleneck during
> restore, despite fsync=off, synchronous_commit=off, and excellent
> random write latency?
> 
> Below is a detailed description of the environment, the behavior
> observed, the steps I have already taken, and the research performed
> so far.
> 
> During pg_restore, execution time remains extremely long: around 2+
> hours using a custom-format dump and over 4 hours using directory
> format. The machine consistently demonstrates high random write
> performance (median latency ~5 ms, ~45k random write IOPS), yet
> PostgreSQL logs show very long checkpoints where the write phase
> dominates (hundreds to thousands of seconds). Checkpoints appear to
> stall the entire restore process.
> 
> I have tested multiple combinations of dump formats (custom and
> directory) and parallel jobs (j = 1, 12, 18). The restore duration
> barely changes. This strongly suggests that the bottleneck is not
> client-side parallelism but internal server behavior—specifically the
> checkpoint write phase.
> 
> Example log excerpts show checkpoint write times consistently in the
> range of 600–1100 seconds, with large numbers of buffers written (from
> hundreds of thousands to over 1.6 million). Sync times remain
> negligible because fsync is disabled, reinforcing the suspicion that
> PostgreSQL's internal buffer flushing and write throttling mechanisms
> are the source of slowdown, not WAL or filesystem sync.
> 
> Given that:
> 
> * Storage is fast,
> * fsync and synchronous commits are disabled,
> * full_page_writes is off,
> * wal_level is minimal,
> * autovacuum is off,
> * the restore is the only workload,
> 
> I am trying to determine what further PostgreSQL internals or Linux
> I/O mechanisms may explain why these checkpoints are taking orders of
> magnitude longer than the device’s raw write characteristics would
> suggest.
> 
> I am particularly looking for guidance on:
> 
> * Whether backend or checkpointer write throttling may still be
> limiting write concurrency even during bulk restore,
> * Whether XFS on Hyper-V VHDX + LVM + battery-backed SSD could
> introduce any serialization invisible to raw I/O tests,
> * Whether certain parameters (e.g., effective_io_concurrency,
> maintenance_io_concurrency, wal_writer settings, combine limits,
> io_uring behavior) could unintentionally reduce write throughput,
> * Whether parallel pg_restore is inherently constrained by global
> buffer flushing behavior,
> * Any other PostgreSQL mechanisms that could cause prolonged
> checkpoint write durations even with crash-safety disabled.
> 
> Below are the configuration values and environment details referenced above.
> 
> Machine:
> Hyper-V VM
> 24 vCPU
> 80 GB RAM
> Ubuntu 24.04.3 (kernel 6.8.0-88)
> PostgreSQL 18.1
> 
> Database size:
> ~700 GB across two tablespaces on separate disks (freshly restored)
> 
> Storage layout:
> Each disk is its own VHDX
> LVM on battery-backed SSD array
> XFS for PGDATA
> Barriers disabled
> 
> Random write performance (steady state):
> Median latency: 5.1 ms
> IOPS: ~45.6k
> 
> Restore tests:
> pg_restore custom format: ~2h+
> pg_restore directory format: ~4h+
> Parallelism tested with j = 1, 12, 18, 24
> 
> Representative checkpoint log entries:
> (write phases ranging 76–1079 seconds, buffer writes up to 1.6M)
> 
> postgresql.conf (relevant parts):
> shared_buffers = 20GB
> work_mem = 150MB
> maintenance_work_mem = 8GB
> effective_io_concurrency = 1
> maintenance_io_concurrency = 1
> io_max_combine_limit = 512kB
> io_combine_limit = 1024kB
> io_method = io_uring
> 
> fsync = off
> synchronous_commit = off
> wal_sync_method = fdatasync
> full_page_writes = off
> wal_compression = lz4
> 
> checkpoint_timeout = 60min
> checkpoint_completion_target = 0.9
> max_wal_size = 80GB
> min_wal_size = 10GB
> 
> effective_cache_size = 65GB
> autovacuum = off
> max_locks_per_transaction = 256
> 
> If anyone has encountered similar behavior or can recommend specific
> PostgreSQL subsystems, kernel settings, or I/O patterns worth
> investigating, I would be very grateful for advice. My main goal is to
> understand why checkpoint writes are so slow relative to the
> hardware’s demonstrated capabilities, and how to safely accelerate the
> restore workflow.
> 
> Thank you in advance for any guidance.
> 
> 
> 
