Re: [Cluster-devel] [DLM PATCH 0/6] Misc DLM Improvements Regarding Socket Errors
On Mon, Feb 15, 2016 at 04:16:17PM -0500, Bob Peterson wrote:
> I think the "right thing to do" at this point is this:
>
> 1. Patch #1 is already upstream
> 2. Patch #2 stands on its own, so I think this should go forward.
> 3. Combine patches 3, 4 and 5, which ought to provide a comprehensive fix
>    for the other problems listed in #1.
> 4. The rest of the patches, I can post as separate patches because they are
>    code cleanups, not related to the original problems of #1.
>
> Let me know your thoughts on the subject. If you like this plan, I can
> re-test and post replacement patches tomorrow (hopefully).

That sounds good.
Re: [Cluster-devel] [DLM PATCH 0/6] Misc DLM Improvements Regarding Socket Errors
----- Original Message -----
> ----- Original Message -----
> > On Wed, Feb 10, 2016 at 01:55:26PM -0500, Bob Peterson wrote:
> > > I've been doing a bunch of recovery testing with DLM and discovered some
> > > issues. This collection of 6 patches addresses those issues. Some of them
> > > are of my own making, introduced by the recent patches that made DLM
> > > print socket connection errors, and recovery from those errors.
> >
> > Thanks Bob, perhaps I've not been paying close enough attention, but it's
> > unclear to me how this patch set relates to the most acute issue we have
> > at the moment, which are the problems introduced here:
> >
> > From b3a5bbfd780d9e9291f5f257be06e9ad6db11657 Mon Sep 17 00:00:00 2001
> > From: Bob Peterson
> > Date: Thu, 27 Aug 2015 09:34:47 -0500
> > Subject: [PATCH] dlm: print error from kernel_sendpage
> >
> > Print a dlm-specific error when a socket error occurs
> > when sending a dlm message.
> >
> > Signed-off-by: Bob Peterson
> > Signed-off-by: David Teigland
> >
> > Could we begin with one patch that's easy to track that directly resolves
> > the issues with that commit (perhaps even a revert if it's not simple to
> > fix directly)? That brings us back to a known-good place, from which we
> > can look at cleanups and changes.
> >
> Hi Dave,
>
> My goal has always been to attain stability, which I think I've finally
> achieved.
>
> The problem is: While testing the dlm in multiple recovery situations,
> Nate and I discovered multiple problems. Until recently, no one has tried
> to run recovery tests on an upstream DLM, so I think we're finding some
> old bugs that have been there for a while, as well as bugs with b3a5bbfd,
> which you mentioned.
>
> I agree that some of these patches might be unnecessary improvements.
> I'll try to pare them down to what is absolutely necessary and what
> is not. I'll also document exactly why the necessary ones are needed.

Hi Dave,

Here is some more information on the set of DLM patches I recently posted,
and where things stand:

1. Patch: dlm: print error from kernel_sendpage
   Commit: b3a5bbfd780d9e9291f5f257be06e9ad6db11657
   Advantages: It allows dlm to report socket errors.
   Disadvantages: It caused some major problems:

   Problem #1: nodeid_to_addr ends up occasionally being called from
               softirq context, which is a problem because it takes a
               spinlock.
   Problem #2: The first condition also does "return;" rather than calling
               the original error report. This is a problem because the
               original error report needs to be called to do socket
               cleanup. The sunrpc implementation avoids this by doing that
               socket cleanup manually inside its own error_report function.
   Problem #3: It saves off the sk_error_report callback, but it never
               restores the callback to its original value.
   Problem #4: It only saves off the sk_error_report callback, but not any
               of the other three callbacks. All four really ought to be
               saved and restored once dlm is done with the socket, like
               sunrpc does.
   Problem #5: If two competing socket errors occur, lowcomms_error_report
               could, in theory, be called twice, causing socket cleanup
               (from the original error_report function) to happen twice,
               which results in a kernel panic (the details of which escape
               me, but I could maybe recreate it).

2. Patch: DLM: Replace nodeid_to_addr with kernel_getpeername
   Advantages: It fixes problem #1 above.
   Disadvantages: It doesn't fix any of the other problems.

3. Patch: DLM: Call original error report when socket is NULL
   Advantages: It fixes problem #2 above.
   Disadvantages: It introduces a new problem below.

   Problem: Error report recursion problem: Depending on timing, if/when
            add_sock is called multiple times for the same socket, it saves
            off the original sk_error_report multiple times. The first time,
            it saves off the proper one and replaces it with
            lowcomms_error_report. The second time, it saves
            lowcomms_error_report, which means that the next time
            lowcomms_error_report is called, it calls itself recursively
            without bound until the system crashes and is fenced.

   NOTE #1: This problem is, in fact, already in the code today, for the
            second two paths through lowcomms_error_report. This patch only
            makes the first path do the same thing. In other words, the
            problem is already there; this patch just makes it a lot more
            likely to happen.

   NOTE #2: There are two ways to fix it. The first is to make dlm do the
            socket cleanup, like sunrpc does. I don't like that because any
            cleanup introduced in the calling code needs to be echoed to
            dlm, and whoever makes that kind of change won't know to do it.
            The second is to clean up th
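
[Editorial illustration] The error report recursion described under item 3 can be modelled in userspace. The sketch below is not dlm code: `fake_sock`, `fake_con`, `add_sock_unguarded`, `add_sock_guarded`, and the `MAX_DEPTH` cutoff are hypothetical stand-ins, and the guard shown is only one possible way to avoid saving the handler twice, not necessarily what the posted patches do. It shows that a second save stores the replacement handler as the "original", so the handler re-enters itself and the real cleanup never runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the kernel's struct sock
 * or dlm's struct connection. */
struct fake_sock {
    void (*error_report)(struct fake_sock *sk);
    void *user_data;
};

struct fake_con {
    struct fake_sock *sock;
    void (*orig_error_report)(struct fake_sock *sk);
    int depth;                  /* number of times our handler was entered */
};

#define MAX_DEPTH 8             /* stand-in for "until the system crashes" */

static int cleanup_calls;       /* counts calls to the real cleanup path */

/* Plays the role of the socket layer's original error report. */
static void original_error_report(struct fake_sock *sk)
{
    (void)sk;
    cleanup_calls++;
}

/* Plays the role of lowcomms_error_report: log, then chain to the saved one. */
static void lowcomms_error_report_sim(struct fake_sock *sk)
{
    struct fake_con *con = sk->user_data;

    if (++con->depth > MAX_DEPTH)
        return;                 /* the real code has no such cutoff */
    con->orig_error_report(sk);
}

/* Unguarded save: a second call stores our own handler as the "original". */
static void add_sock_unguarded(struct fake_con *con, struct fake_sock *sk)
{
    con->sock = sk;
    sk->user_data = con;
    con->orig_error_report = sk->error_report;
    sk->error_report = lowcomms_error_report_sim;
}

/* Guarded save (assumption: one possible fix): save only if the socket
 * does not already point at our handler. */
static void add_sock_guarded(struct fake_con *con, struct fake_sock *sk)
{
    con->sock = sk;
    sk->user_data = con;
    if (sk->error_report != lowcomms_error_report_sim) {
        con->orig_error_report = sk->error_report;
        sk->error_report = lowcomms_error_report_sim;
    }
}
```

Calling `add_sock_unguarded` twice on the same socket and then invoking `sk.error_report(&sk)` drives `depth` past `MAX_DEPTH` without ever reaching `original_error_report`; with `add_sock_guarded` the handler runs once and the original cleanup is called once.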
Re: [Cluster-devel] [DLM PATCH 0/6] Misc DLM Improvements Regarding Socket Errors
On Thu, Feb 11, 2016 at 01:39:09PM -0500, Bob Peterson wrote:
> The problem is: While testing the dlm in multiple recovery situations,
> Nate and I discovered multiple problems. Until recently, no one has tried
> to run recovery tests on an upstream DLM,

(Let's distinguish tcp connection testing/recovery vs locking
testing/recovery. I agree we've never looked at the tcp connections too
much since the node is typically dead anyway.)

> I agree that some of these patches might be unnecessary improvements.
> I'll try to pare them down to what is absolutely necessary and what
> is not. I'll also document exactly why the necessary ones are needed.

Improvements are fine, I was just confused about which were fixes vs
cleanups.

> I'll also try to post them in order of highest priority and repost
> them as individual patches rather than a set.
>
> The recovery tests are somewhat slow, so this will take some time.
>
> BTW, Have you had a chance to look at the patch I posted on 18 January,
> titled "DLM: Replace nodeid_to_addr with kernel_getpeername"?
> That definitely fixes one bug in patch b3a5bbfd which you mentioned.

Great, thanks, that's the key one that I'd missed or forgotten.

> I assume you're not suggesting I combine that patch with other patches
> to stabilize b3a5bbfd, right? As you well know, this is very touchy
> code and it's easier to diagnose and debug a larger number of smaller
> patches.

No, I don't have any concerns with the other improvements/fixes you have
since the main issue was fixed in that nodeid_to_addr replacement.
Re: [Cluster-devel] [DLM PATCH 0/6] Misc DLM Improvements Regarding Socket Errors
----- Original Message -----
> On Wed, Feb 10, 2016 at 01:55:26PM -0500, Bob Peterson wrote:
> > I've been doing a bunch of recovery testing with DLM and discovered some
> > issues. This collection of 6 patches addresses those issues. Some of them
> > are of my own making, introduced by the recent patches that made DLM
> > print socket connection errors, and recovery from those errors.
>
> Thanks Bob, perhaps I've not been paying close enough attention, but it's
> unclear to me how this patch set relates to the most acute issue we have
> at the moment, which are the problems introduced here:
>
> From b3a5bbfd780d9e9291f5f257be06e9ad6db11657 Mon Sep 17 00:00:00 2001
> From: Bob Peterson
> Date: Thu, 27 Aug 2015 09:34:47 -0500
> Subject: [PATCH] dlm: print error from kernel_sendpage
>
> Print a dlm-specific error when a socket error occurs
> when sending a dlm message.
>
> Signed-off-by: Bob Peterson
> Signed-off-by: David Teigland
>
> Could we begin with one patch that's easy to track that directly resolves
> the issues with that commit (perhaps even a revert if it's not simple to
> fix directly)? That brings us back to a known-good place, from which we
> can look at cleanups and changes.
>
Hi Dave,

My goal has always been to attain stability, which I think I've finally
achieved.

The problem is: While testing the dlm in multiple recovery situations,
Nate and I discovered multiple problems. Until recently, no one has tried
to run recovery tests on an upstream DLM, so I think we're finding some
old bugs that have been there for a while, as well as bugs with b3a5bbfd,
which you mentioned.

I agree that some of these patches might be unnecessary improvements.
I'll try to pare them down to what is absolutely necessary and what
is not. I'll also document exactly why the necessary ones are needed.

I'll also try to post them in order of highest priority and repost
them as individual patches rather than a set.

The recovery tests are somewhat slow, so this will take some time.

BTW, Have you had a chance to look at the patch I posted on 18 January,
titled "DLM: Replace nodeid_to_addr with kernel_getpeername"?
That definitely fixes one bug in patch b3a5bbfd which you mentioned.

I assume you're not suggesting I combine that patch with other patches
to stabilize b3a5bbfd, right? As you well know, this is very touchy
code and it's easier to diagnose and debug a larger number of smaller
patches.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [DLM PATCH 0/6] Misc DLM Improvements Regarding Socket Errors
On Wed, Feb 10, 2016 at 01:55:26PM -0500, Bob Peterson wrote:
> I've been doing a bunch of recovery testing with DLM and discovered some
> issues. This collection of 6 patches addresses those issues. Some of them
> are of my own making, introduced by the recent patches that made DLM
> print socket connection errors, and recovery from those errors.

Thanks Bob, perhaps I've not been paying close enough attention, but it's
unclear to me how this patch set relates to the most acute issue we have
at the moment, which are the problems introduced here:

From b3a5bbfd780d9e9291f5f257be06e9ad6db11657 Mon Sep 17 00:00:00 2001
From: Bob Peterson
Date: Thu, 27 Aug 2015 09:34:47 -0500
Subject: [PATCH] dlm: print error from kernel_sendpage

Print a dlm-specific error when a socket error occurs
when sending a dlm message.

Signed-off-by: Bob Peterson
Signed-off-by: David Teigland

Could we begin with one patch that's easy to track that directly resolves
the issues with that commit (perhaps even a revert if it's not simple to
fix directly)? That brings us back to a known-good place, from which we
can look at cleanups and changes.
Re: [Cluster-devel] [DLM PATCH 0/6] Misc DLM Improvements Regarding Socket Errors
On Wed, Feb 10, 2016 at 7:55 PM, Bob Peterson wrote:
> I've been doing a bunch of recovery testing with DLM and discovered some
> issues. This collection of 6 patches addresses those issues. Some of them
> are of my own making, introduced by the recent patches that made DLM
> print socket connection errors, and recovery from those errors.
>
> The first patch changes the TCP "connect to sock" function to more closely
> match the SCTP version of the function. The idea is to not create a kernel
> socket until we have a valid node address, like it does in the SCTP path.
>
> The second patch removes a "return" from lowcomms_error_report that should
> not be there. The return was causing it to bypass calling the original
> error report code, thus skipping an important part in the reporting.
>
> The third patch changes function tcp_create_listen_sock so that its
> error path is consistent. Only one of its error paths was setting
> con->sock to NULL, but it should be done in both cases.
>
> The fourth patch eliminates a useless goto, to make the code more clear.
>
> The fifth patch adds a layer of locking by way of the sk->sk_callback_lock
> which is needed to prevent multiple send/receive sockets from
> interfering with one another when reporting the socket errors and
> subsequent recovery. This makes it similar to how sunrpc handles errors.
>
> The sixth and final patch makes the socket error code save and restore
> all four callbacks, whereas before we were only saving and restoring the
> error report callback.

This patch set makes removing lockspaces a lot more robust for me. One test
case that triggers NULL pointer dereferences in callbacks from TCP to DLM
regularly without these is removing a lockspace on three cluster nodes
"simultaneously".

Thanks,
Andreas
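
[Editorial illustration] The fifth and sixth patch descriptions above (take the callback lock, then save and restore all four callbacks) can be sketched in userspace as follows. This is a model under stated assumptions, not the patch itself: `fake_sock`, `fake_con`, `install_callbacks`, `restore_callbacks`, and the `dlm_*`/`tcp_*` stubs are hypothetical names, and a `pthread_mutex_t` stands in for the kernel's `sk->sk_callback_lock`. The four slots mirror the kernel's `sk_data_ready`, `sk_state_change`, `sk_write_space`, and `sk_error_report`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct fake_sock;
typedef void (*sk_cb_t)(struct fake_sock *sk);

/* Hypothetical userspace stand-in for the parts of struct sock involved. */
struct fake_sock {
    pthread_mutex_t callback_lock;  /* stand-in for sk->sk_callback_lock */
    void *user_data;
    sk_cb_t data_ready, state_change, write_space, error_report;
};

/* Hypothetical stand-in for the per-connection state that remembers
 * the socket's original callbacks. */
struct fake_con {
    struct fake_sock *sock;
    sk_cb_t orig_data_ready, orig_state_change;
    sk_cb_t orig_write_space, orig_error_report;
};

/* Stubs playing the role of the socket layer's own callbacks. */
static void tcp_data_ready(struct fake_sock *sk)   { (void)sk; }
static void tcp_state_change(struct fake_sock *sk) { (void)sk; }
static void tcp_write_space(struct fake_sock *sk)  { (void)sk; }
static void tcp_error_report(struct fake_sock *sk) { (void)sk; }

/* Stubs playing the role of the replacement callbacks. */
static void dlm_data_ready(struct fake_sock *sk)   { (void)sk; }
static void dlm_state_change(struct fake_sock *sk) { (void)sk; }
static void dlm_write_space(struct fake_sock *sk)  { (void)sk; }
static void dlm_error_report(struct fake_sock *sk) { (void)sk; }

static void fake_sock_init(struct fake_sock *sk)
{
    pthread_mutex_init(&sk->callback_lock, NULL);
    sk->user_data = NULL;
    sk->data_ready = tcp_data_ready;
    sk->state_change = tcp_state_change;
    sk->write_space = tcp_write_space;
    sk->error_report = tcp_error_report;
}

/* Save all four originals and install the replacements under the lock. */
static void install_callbacks(struct fake_con *con, struct fake_sock *sk)
{
    pthread_mutex_lock(&sk->callback_lock);
    con->sock = sk;
    con->orig_data_ready = sk->data_ready;
    con->orig_state_change = sk->state_change;
    con->orig_write_space = sk->write_space;
    con->orig_error_report = sk->error_report;
    sk->user_data = con;
    sk->data_ready = dlm_data_ready;
    sk->state_change = dlm_state_change;
    sk->write_space = dlm_write_space;
    sk->error_report = dlm_error_report;
    pthread_mutex_unlock(&sk->callback_lock);
}

/* Put all four originals back once the connection is done with the socket. */
static void restore_callbacks(struct fake_con *con)
{
    struct fake_sock *sk = con->sock;

    if (!sk)
        return;
    pthread_mutex_lock(&sk->callback_lock);
    sk->user_data = NULL;
    sk->data_ready = con->orig_data_ready;
    sk->state_change = con->orig_state_change;
    sk->write_space = con->orig_write_space;
    sk->error_report = con->orig_error_report;
    pthread_mutex_unlock(&sk->callback_lock);
    con->sock = NULL;
}
```

After `install_callbacks` followed by `restore_callbacks`, every slot on the socket points at its original function again and `user_data` is cleared, so a late callback cannot reach a connection that has been torn down.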