Re: the s_lock_stuck on perform_spin_delay

2024-02-08 Thread Andy Fan
Hi, > There are some similarities between this and > https://www.postgresql.org/message-id/20240207184050.rkvpuudq7huijmkb%40awork3.anarazel.de > as described in that email. Thanks for this information. > > > Hm, I think this might make this code a bit more expensive. It's cheaper, both > in

Re: the s_lock_stuck on perform_spin_delay

2024-02-07 Thread Andres Freund
Hi, There are some similarities between this and https://www.postgresql.org/message-id/20240207184050.rkvpuudq7huijmkb%40awork3.anarazel.de as described in that email. On 2024-01-25 15:24:17 +0800, Andy Fan wrote: > From 4c8fd0ab71299e57fbeb18dec70051bd1d035c7c Mon Sep 17 00:00:00 2001 > From:

Re: the s_lock_stuck on perform_spin_delay

2024-01-24 Thread Andy Fan
Andres Freund writes: > On 2024-01-18 14:00:58 -0500, Robert Haas wrote: >> > The LockBufHdr also used init_local_spin_delay / perform_spin_delay >> > infrastruce and then it has the same issue like ${subject}, it is pretty >> > like the code in s_lock; Based on my current knowledge, I think

Re: the s_lock_stuck on perform_spin_delay

2024-01-24 Thread Andy Fan
Hi, > Hi, > > On 2024-01-22 15:18:35 +0800, Andy Fan wrote: >> I used sigismember(, SIGQUIT) to detect if a process is doing a >> quickdie, however this is bad not only because it doesn't work on >> Windows, but also it has too poor performance even it impacts on >> USE_ASSERT_CHECKING build

Re: the s_lock_stuck on perform_spin_delay

2024-01-23 Thread Andres Freund
Hi, On 2024-01-22 15:18:35 +0800, Andy Fan wrote: > I used sigismember(, SIGQUIT) to detect if a process is doing a > quickdie, however this is bad not only because it doesn't work on > Windows, but also it has too poor performance even it impacts on > USE_ASSERT_CHECKING build only. In v8, I

Re: the s_lock_stuck on perform_spin_delay

2024-01-23 Thread Andres Freund
Hi, On 2024-01-18 14:00:58 -0500, Robert Haas wrote: > > The LockBufHdr also used init_local_spin_delay / perform_spin_delay > > infrastruce and then it has the same issue like ${subject}, it is pretty > > like the code in s_lock; Based on my current knowledge, I think we > > should add the check

Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Andy Fan
Robert Haas writes: > On Mon, Jan 22, 2024 at 12:13 PM Andy Fan wrote: >> > On Mon, Jan 22, 2024 at 11:58 AM Andy Fan wrote: >> >> I get your point! Acquiring an already held spinlock in quickdie is >> >> unlikely to happen, but since our existing infrastructure can handle it, >> >> then

Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Robert Haas
On Mon, Jan 22, 2024 at 12:13 PM Andy Fan wrote: > > On Mon, Jan 22, 2024 at 11:58 AM Andy Fan wrote: > >> I get your point! Acquiring an already held spinlock in quickdie is > >> unlikely to happen, but since our existing infrastructure can handle it, > >> then there is no reason to bypass it.

Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Andy Fan
Robert Haas writes: > On Mon, Jan 22, 2024 at 11:58 AM Andy Fan wrote: >> I get your point! Acquiring an already held spinlock in quickdie is >> unlikely to happen, but since our existing infrastructure can handle it, >> then there is no reason to bypass it. > > No, the existing

Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Robert Haas
On Mon, Jan 22, 2024 at 11:58 AM Andy Fan wrote: > I get your point! Acquiring an already held spinlock in quickdie is > unlikely to happen, but since our existing infrastructure can handle it, > then there is no reason to bypass it. No, the existing infrastructure cannot handle that at all. --

Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Andy Fan
Robert Haas writes: > On Mon, Jan 22, 2024 at 2:22 AM Andy Fan wrote: >> I used sigismember(, SIGQUIT) to detect if a process is doing a >> quickdie, however this is bad not only because it doesn't work on >> Windows, but also it has too poor performance even it impacts on >>

Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Robert Haas
On Mon, Jan 22, 2024 at 2:22 AM Andy Fan wrote: > I used sigismember(, SIGQUIT) to detect if a process is doing a > quickdie, however this is bad not only because it doesn't work on > Windows, but also it has too poor performance even it impacts on > USE_ASSERT_CHECKING build only. In v8, I

Re: the s_lock_stuck on perform_spin_delay

2024-01-21 Thread Andy Fan
Andy Fan writes: > I found a speical case about checking it in errstart. So commit 3 in v7 > is added. > > commit 757c67c1d4895ce6a523bcf5217af8eb2351e2a1 (HEAD -> s_stuck_v3) > Author: yizhi.fzh > Date: Mon Jan 22 07:14:29 2024 +0800 > > Bypass SpinLock checking in SIGQUIT signal

Re: the s_lock_stuck on perform_spin_delay

2024-01-21 Thread Andy Fan
Hi, Andy Fan writes: > Hi, > > v6 attached which addressed all the items Robert suggested except the > following 2 open items. They are handled differently. > >> >> Here is the summary of the open-items, it would be great that Andres and >> Matthias have a look at this when they have time. >>

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan
Hi, v6 attached which addressed all the items Robert suggested except the following 2 open items. They are handled differently. > > Here is the summary of the open-items, it would be great that Andres and > Matthias have a look at this when they have time. > >>> The LockBufHdr also used

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan
Hi, Here is the summary of the open-items, it would be great that Andres and Matthias have a look at this when they have time. 1. Shall we treat the LockBufHdr as a SpinLock usage. >> The LockBufHdr also used init_local_spin_delay / perform_spin_delay >> infrastruce and then it has the same

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 1:30 PM Andy Fan wrote: > > You added an #include to dynahash.c despite making no other changes to > > the file. > > That's mainly because I put the code into miscadmin.h and spin.h depends > on miscadmin.h with MACROs. That's not a good reason. Headers need to include

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan
Hi Robert, Thanks for your attention! > On Thu, Jan 18, 2024 at 7:56 AM Andy Fan wrote: >> Do you still have interest with making some progress on this topic? > > Some, but it's definitely not my top priority. I wish I could give as > much attention as everyone would like to everyone's

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 7:56 AM Andy Fan wrote: > Do you still have interest with making some progress on this topic? Some, but it's definitely not my top priority. I wish I could give as much attention as everyone would like to everyone's patches, but I can't even come close. I think that the

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan
ssible to be misused in many cases. e.g. Acquiring another LWLocks or regular locks, memory allocation. In this patch, all of such cases may increase the timing of holding the spin lock. In our culture, if any spin lock can't be acquired for some-time, a crash should happen. See the s_lock_stuck in perform_sp

Re: the s_lock_stuck on perform_spin_delay

2024-01-14 Thread Andy Fan
cquired for some-time, a crash should happen. See the s_lock_stuck in perform_spin_delay. CHECK_FOR_INTERRUPTS should be avoided as well when holding a spin lock. Depends on what signals are left to handle, PG may raise error/fatal which would cause the code jump to some other places which is h

Re: the s_lock_stuck on perform_spin_delay

2024-01-12 Thread Robert Haas
On Wed, Jan 10, 2024 at 10:17 PM Andy Fan wrote: > fixed in v2. Timing the spinlock wait seems like a separate patch from the new sanity checks. I suspect that the new sanity checks should only be armed in assert-enabled builds. I'm doubtful that this method of tracking the wait time will be

Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Andy Fan
Robert Haas writes: > Thanks for jumping in with a review, Matthias! FWIW, Matthias is also the first one for this proposal at this thread, thanks for that as well! -- Best Regards Andy Fan

Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Andy Fan
Hi Matthias, Thanks for the review! Matthias van de Meent writes: > On Wed, 10 Jan 2024 at 02:44, Andy Fan wrote: >> Hi, >> >> I want to know if Andres or you have plan >> to do some code review. I don't expect this would happen very soon, just >> want to make sure this will not happen that

Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Robert Haas
Thanks for jumping in with a review, Matthias! On Wed, Jan 10, 2024 at 8:03 AM Matthias van de Meent wrote: > I'm not 100% sure on the policy of this, but theoretically you could > use LockAquireExtended(dontWait=true) while holding a spin lock, as > that would not have an unknown duration. Then

Re: the s_lock_stuck on perform_spin_delay

2024-01-10 Thread Matthias van de Meent
On Wed, 10 Jan 2024 at 02:44, Andy Fan wrote: > Hi, > > I want to know if Andres or you have plan > to do some code review. I don't expect this would happen very soon, just > want to make sure this will not happen that both of you think the other > one will do, but actually none of them does it

Re: the s_lock_stuck on perform_spin_delay

2024-01-09 Thread Andy Fan
Hi, Robert Haas writes: > On Mon, Jan 8, 2024 at 9:40 PM Andy Fan wrote: >> The singler handler I was refering to is 'CHECK_FOR_INTERRUPTS', Based >> on this, spin_lock and lwlock are acted pretty differently. > > CHECK_FOR_INTERRUPTS() is not a signal handler, hmm, I knew this but I

Re: the s_lock_stuck on perform_spin_delay

2024-01-09 Thread Robert Haas
On Mon, Jan 8, 2024 at 9:40 PM Andy Fan wrote: > The singler handler I was refering to is 'CHECK_FOR_INTERRUPTS', Based > on this, spin_lock and lwlock are acted pretty differently. CHECK_FOR_INTERRUPTS() is not a signal handler, and it's OK to acquire and release spin locks or lwlocks there. We

Re: the s_lock_stuck on perform_spin_delay

2024-01-08 Thread Andy Fan
Hi, Robert Haas writes: > On Sun, Jan 7, 2024 at 9:52 PM Andy Fan wrote: >> > I think we should add cassert-only infrastructure tracking whether we >> > currently hold spinlocks, are in a signal handler and perhaps a few other >> > states. That'd allow us to add assertions like: >> .. >> > -

Re: the s_lock_stuck on perform_spin_delay

2024-01-08 Thread Robert Haas
On Sun, Jan 7, 2024 at 9:52 PM Andy Fan wrote: > > I think we should add cassert-only infrastructure tracking whether we > > currently hold spinlocks, are in a signal handler and perhaps a few other > > states. That'd allow us to add assertions like: > .. > > - no lwlocks or ... while in signal

Re: the s_lock_stuck on perform_spin_delay

2024-01-07 Thread Andy Fan
Hi! > > I think we should add cassert-only infrastructure tracking whether we > currently hold spinlocks, are in a signal handler and perhaps a few other > states. That'd allow us to add assertions like: .. > - no lwlocks or ... while in signal handlers I *wish* lwlocks should *not* be held

Re: the s_lock_stuck on perform_spin_delay

2024-01-06 Thread Andy Fan
Hi, > On 2024-01-05 14:19:23 -0500, Robert Haas wrote: >> On Fri, Jan 5, 2024 at 2:11 PM Andres Freund wrote: >> > I see it fairly regularly. Including finding several related bugs that >> > lead to >> > stuck systems last year (signal handlers are a menace). >> >> In that case, I think this

Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Andres Freund
Hi, On 2024-01-05 14:19:23 -0500, Robert Haas wrote: > On Fri, Jan 5, 2024 at 2:11 PM Andres Freund wrote: > > I see it fairly regularly. Including finding several related bugs that lead > > to > > stuck systems last year (signal handlers are a menace). > > In that case, I think this proposal

Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Andres Freund
Hi, On 2024-01-05 10:20:39 +0800, Andy Fan wrote: > Andres Freund writes: > > On 2024-01-04 14:59:06 +0800, Andy Fan wrote: > >> My question is if someone doesn't obey the rule by mistake (everyone > >> can make mistake), shall we PANIC on a production environment? IMO I > >> think it can be a

Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 2:11 PM Andres Freund wrote: > I see it fairly regularly. Including finding several related bugs that lead to > stuck systems last year (signal handlers are a menace). In that case, I think this proposal is dead. I can't personally testify to this code being a force for

Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Andres Freund
Hi, On 2024-01-05 08:51:53 -0500, Robert Haas wrote: > On Thu, Jan 4, 2024 at 6:06 PM Andres Freund wrote: > > I think we should add infrastructure to detect bugs like this during > > development, but not PANICing when this happens in production seems > > completely > > non-viable. > > I mean

Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Robert Haas
On Thu, Jan 4, 2024 at 6:06 PM Andres Freund wrote: > I think we should add infrastructure to detect bugs like this during > development, but not PANICing when this happens in production seems completely > non-viable. I mean +1 for the infrastructure, but "completely non-viable"? Why? I've only

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andy Fan
Hi, Andres Freund writes: > > On 2024-01-04 14:59:06 +0800, Andy Fan wrote: >> My question is if someone doesn't obey the rule by mistake (everyone >> can make mistake), shall we PANIC on a production environment? IMO I >> think it can be a WARNING on a production environment and be a stuck

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andres Freund
Hi, On 2024-01-04 17:03:18 -0600, Jim Nasby wrote: > On 1/4/24 10:33 AM, Tom Lane wrote: > > Robert Haas writes: > > On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: > > We should be making an effort to ban coding patterns like > "return with spinlock still

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andres Freund
Hi, On 2024-01-04 14:59:06 +0800, Andy Fan wrote: > My question is if someone doesn't obey the rule by mistake (everyone > can make mistake), shall we PANIC on a production environment? IMO I > think it can be a WARNING on a production environment and be a stuck > when 'ifdef

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Jim Nasby
On 1/4/24 10:33 AM, Tom Lane wrote: Robert Haas writes: On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: We should be making an effort to ban coding patterns like "return with spinlock still held", because they're just too prone to errors similar to this one. I agree

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andres Freund
Hi, On 2024-01-04 12:04:07 -0500, Robert Haas wrote: > On Thu, Jan 4, 2024 at 11:33 AM Tom Lane wrote: > > I believe it's (a). No matter what the reason for a stuck spinlock > > is, the only reliable method of getting out of the problem is to > > blow things up and start over. The patch

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Robert Haas
On Thu, Jan 4, 2024 at 11:33 AM Tom Lane wrote: > I believe it's (a). No matter what the reason for a stuck spinlock > is, the only reliable method of getting out of the problem is to > blow things up and start over. The patch proposed at the top of this > thread would leave the system unable

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Tom Lane
Robert Haas writes: > On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: >> We should be making an effort to ban coding patterns like >> "return with spinlock still held", because they're just too prone >> to errors similar to this one. > I agree that we don't want to add overhead, and also about

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Robert Haas
On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: > I'm not a fan of adding overhead to such a performance-critical > thing as spinlocks in order to catch coding errors that are easily > detectable statically. IMV the correct usage of spinlocks is that > they should only be held across *short,

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Tom Lane
Robert Haas writes: > I'm not sure that the approach this patch takes is correct in detail, > but I kind of agree with you about the overall point. I mean, the idea > of the PANIC is to avoid having the system just sit there in a state > from which it will never recover ... but it can also have

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Andy Fan
a similar manner as > START_CRIT_SECTION/END_CRIT_SECTION. > > Kind regards, > > Matthias van de Meent > Neon (https://neon.tech) -- Best Regards Andy Fan >From 7d7fd0f0e9b13a290bfffaec0ad40773191155f2 Mon Sep 17 00:00:00 2001 From: "yizhi.fzh" Date: Thu, 4 Jan 2024 14:33:37 +0800 Sub

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Robert Haas
On Thu, Jan 4, 2024 at 2:09 AM Andy Fan wrote: > My question is if someone doesn't obey the rule by mistake (everyone > can make mistake), shall we PANIC on a production environment? IMO I > think it can be a WARNING on a production environment and be a stuck > when 'ifdef USE_ASSERT_CHECKING'. >

Re: the s_lock_stuck on perform_spin_delay

2024-01-04 Thread Matthias van de Meent
On Thu, 4 Jan 2024 at 08:09, Andy Fan wrote: > > My question is if someone doesn't obey the rule by mistake (everyone > can make mistake), shall we PANIC on a production environment? IMO I > think it can be a WARNING on a production environment and be a stuck > when 'ifdef USE_ASSERT_CHECKING'. >

the s_lock_stuck on perform_spin_delay

2024-01-03 Thread Andy Fan
Don't call s_lock_stuck in production environment In the current implementation, if a spin lock is misused, the s_lock_stuck in perform_spin_delay can cause the entire system to PANIC. In order to balance fault tolerance and the ability to detect incorrect usage, we can use WARNING to replace