Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-10-23 Thread Michael Paquier
On Tue, Oct 03, 2023 at 04:20:45PM +0900, Michael Paquier wrote: > If there's no interest in this patch set after the next CF, I'm OK to > drop it. The state of HEAD is at least correct in the OOM cases now. I have been thinking about this patch for the last few days, and in light of 6b18b3fe2c2f

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-10-03 Thread Michael Paquier
On Tue, Sep 26, 2023 at 03:48:07PM +0900, Michael Paquier wrote: > By the way, anything that I am proposing here cannot be backpatched > because of the infrastructure changes required in walreader.c, so I am > going to create a second thread with something that could be > backpatched (yeah, likely

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-09-25 Thread Michael Paquier
On Thu, Aug 10, 2023 at 02:59:07PM +0900, Michael Paquier wrote: > My apologies if I sounded unclear here. It seems to me that we should > wrap the patch on [1] first, and get it backpatched. At least that > makes for less conflicts when 0001 gets merged for HEAD when we are > able to set a prope

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Thu, Aug 10, 2023 at 02:47:51PM +0900, Kyotaro Horiguchi wrote: > At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier > wrote in >> On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote: >> > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi >> > wrote in >> >> We hav

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Thu, 10 Aug 2023 13:33:48 +0900, Michael Paquier wrote in > On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote: > > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi > > wrote in > >> We have treated every kind of broken data as end-of-recovery, like > >> incorrect

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Thu, Aug 10, 2023 at 10:15:40AM +0900, Kyotaro Horiguchi wrote: > At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi > wrote in >> At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier >> wrote in While it's a kind of bug in total, we encountered a case where an excessively

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Thu, 10 Aug 2023 10:00:17 +0900 (JST), Kyotaro Horiguchi wrote in > At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier > wrote in > > > While it's a kind of bug in total, we encountered a case where an > > > excessively large xl_tot_len actually came from a corrupted > > > record. [1] > >

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier wrote in > > While it's a kind of bug in total, we encountered a case where an > > excessively large xl_tot_len actually came from a corrupted > > record. [1] > > Right, I remember this one. I think that Thomas was pretty much right > that thi

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Wed, Aug 09, 2023 at 05:00:49PM +0900, Kyotaro Horiguchi wrote: > Looks fine. Okay, I've updated the patch in consequence. I'll look at 0001 again at the beginning of next week. > While it's a kind of bug in total, we encountered a case where an > excessively large xl_tot_len actually came fr

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Michael Paquier
On Wed, Aug 09, 2023 at 04:13:53PM +0900, Kyotaro Horiguchi wrote: > I'm not certain if message_deferred is a property of the error > struct. Callers don't seem to need that information. True enough, will remove. > The name "XLOG_RADER_NONE" seems too generic. XLOG_READER_NOERROR will > be cleare

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-09 Thread Kyotaro Horiguchi
At Wed, 9 Aug 2023 15:03:21 +0900, Michael Paquier wrote in > I have spent more time on 0001, polishing it and fixing a few bugs > that I have found while reviewing the whole. Most of them were > related to mistakes in resetting the error state when expected. I > have also expanded DecodeXLogR

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-08 Thread Michael Paquier
On Tue, Aug 08, 2023 at 05:44:03PM +0900, Kyotaro Horiguchi wrote: > I like the overall direction. Though, I'm considering enclosing the > errormsg and errorcode in a struct. Yes, this suggestion makes sense as it simplifies all the WAL routines that need to report back a complete error state, and

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-08 Thread Kyotaro Horiguchi
At Tue, 8 Aug 2023 16:29:49 +0900, Michael Paquier wrote in > On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote: > > I believe this approach is sufficient to determine whether the error > > is OOM or not. If total_len is currupted and has an excessively large > > value, it's high

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-08 Thread Michael Paquier
On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote: > I believe this approach is sufficient to determine whether the error > is OOM or not. If total_len is currupted and has an excessively large > value, it's highly unlikely that all subsequent pages for that length > will be consist

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-07 Thread Michael Paquier
On Tue, Aug 01, 2023 at 04:39:54PM -0700, Jeff Davis wrote: > On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote: > > Probably I'm missing something, but if memory allocation is required > > during WAL replay and it fails, wouldn't it be a better solution to > > log the error and terminat

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-01 Thread Kyotaro Horiguchi
At Tue, 01 Aug 2023 15:28:54 +0900 (JST), Kyotaro Horiguchi wrote in > I thoght that the failure on a stanby results in continuing to retry > reading the next record. However, I found that there's a case where > start process stops in response to OOM [1]. I've examined the calls to MemoryContex

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-01 Thread Jeff Davis
On Tue, 2023-08-01 at 16:14 +0300, Aleksander Alekseev wrote: > Probably I'm missing something, but if memory allocation is required > during WAL replay and it fails, wouldn't it be a better solution to > log the error and terminate the DBMS immediately? We need to differentiate between: 1. No va

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-08-01 Thread Aleksander Alekseev
Hi, > As far as I can see, PerformWalRecovery() uses LOG as elevel > [...] > On top of my mind, any solution I can think of needs to add more > information to XLogReaderState, where we'd either track the type of > error that happened close to errormsg_buf which is where these errors > are tracked,

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-07-31 Thread Kyotaro Horiguchi
At Tue, 01 Aug 2023 15:28:54 +0900 (JST), Kyotaro Horiguchi wrote in > While we will not agree, we could establish a defalut behavior where > an OOM during recovery immediately triggers an ERROR. Then, we could > introduce a *GUC* that causes recovery to regard OOM as an > end-of-recovery error.

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-07-31 Thread Kyotaro Horiguchi
At Tue, 1 Aug 2023 14:03:36 +0900, Michael Paquier wrote in > On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote: > > I believe a database server is not supposed to be executed under such > > a memory-constrained environment. > > I don't really follow this argument. The backend

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-07-31 Thread Michael Paquier
On Tue, Aug 01, 2023 at 01:51:13PM +0900, Kyotaro Horiguchi wrote: > I believe a database server is not supposed to be executed under such > a memory-constrained environment. I don't really follow this argument. The backend and the frontends are reliable on OOM, where we generate ERRORs or even F

Re: Incorrect handling of OOM in WAL replay leading to data loss

2023-07-31 Thread Kyotaro Horiguchi
At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier wrote in > A colleague, Ethan Mertz (in CC), has discovered that we don't handle > correctly WAL records that are failing because of an OOM when > allocating their required space. In the case of Ethan, we have bumped > on the failure after an a