Re: [Valgrind-users] helgrind markup for lock-ordering
On Thu, 2012-11-08 at 00:18 +0100, David Faure wrote: On Wednesday 07 November 2012 23:00:51 Philippe Waroquiers wrote: discover the bug is related to the doubful construct, not to a race condition If there's no race condition and no deadlock, I'm not sure what bug you want to detect :-) The doubtful construct. Such kind of construct is in my opinion better removed, and better detect that via a tool. Note that currently, laog is producing messages which should be considered as lock order warning, not as for sure there is a deadlock order problem. Yes, but why does it warn about lock order? Because it could cause deadlocks. Not necessarily. Doubtful construct like the above could cause e.g. livelocks if the inverted (failing) trylocks are not properly handled. So, even if a wrong order does not necessarily cause a deadlock, it can be a good idea to detect these doubtful constructions and remove them. I agree, this is about potential deadlocks, not actual deadlocks. But trylock 1 + trylock 2 vs trylock2 + trylock1 (the case we're talking about in this part of the mail) is not even a potential deadlock. It can't ever deadlock. So there's nothing to warn about. A compiler produces warnings and errors. warnings are useful to detect potential problems. Similarly, a wrong lock order (even if not deadlocking) might be something you do not want. And so, have e.g. an option to continue to warn for such a case (i.e. continue to have the current helgrind behaviour) would be useful for many users. The trylock is one case of warning, not an error which could/should be improved. But there are others e.g. laog does not understand the concept of guard locks which is (IIUC): each thread can acquire a single lock in a set of locks. If a thread wants to acquire more than one lock (in any order then), it first has to acquire the guard lock, and then can lock in any order any nr of locks in the lock set. With this guard lock, not possible to have a deadlock, but for sure this is not understood by the current helgrind laog algorithm. Right. That one definitely needs annotations in the source code, I would think. There's no way for the tool to detect that these mutexes all go together. Search on the web for good lock algorithm, which solves the problem of guard lock detection in deadlock lock graph analysis. I gave it a try, but I'm hitting a problem with exactly that, passing isTryLock to that code. isTryLock is set in HG_PTHREAD_MUTEX_LOCK_PRE and similar, while the above code is called from HG_PTHREAD_MUTEX_LOCK_POST and similar. If I understand correctly, adding an argument to the _POST variant would break source compatibility for the existing userland macros? I do not think it breaks compatibility for two reasons: 1. I believe these are system client requests, executed by helgrind interception code, not by client code. 2. it is possible to add an argument, because the requests all have a fixed nr of arguments, with unused arguments passed as 0. As long as really using such an unused argument keeps the same semantic for the 0 value, it is ok to add such arguments. (e.g. I have in 3.7.0 added a new argument to the memcheck leak search client request to do delta leak search, without breaking the compatibility: 0 means full leak search 1 means delta leak search I'll finish the patch then, but only if you agree with the approach, otherwise this would be dead code, i.e. a wasted effort. At this point you don't seem fully convinced :) Because I believe the below 3 threads 3 locks is the counter-example which shows that simply adding the trylock in the graph does not work. Note that convincing me is neither a sufficient nor necessary condition to have a complex helgrind patch accepted :). But I suspect that the insertion of a trylock in the graph might later on cause a 'wrong' cycle to be detected. E.g. (L = lock, T = trylock, L and T followed by lock nr) threadA L1 T2 threadB L2 L3 threadC L3 L1 cannot deadlock (I think :) if threadA releases lock 1 when T2 fails. Well, if T2 fails then we have no cycle, and if it succeeds we have a real potential deadlock. The question is whether we want to remember the T2 attempt (and warn later) even when T2 fails. I would say, if it failed, it's like it didn't happen. If T2 succeeds, there is of course no deadlock : in this case, threadA can do the required work, and then release the locks. So, with T2, the above cannot deadlock (if we assume the heuristic of threadA is to release L1 when T2 fails). If the heuristic of threadA is rather to just retry T2, then we have a deadlock. So, in summary: if T2 succeeds, for sure no deadlock (at least for this run :). if T2 fails, then depending on threadA heuristic, we might have a deadlock (probably better to call it a livelock, as threadA will continue to burn CPU to
Re: [Valgrind-users] helgrind markup for lock-ordering
On Tuesday 06 November 2012 22:56:32 Philippe Waroquiers wrote: On Tue, 2012-11-06 at 13:43 +0100, David Faure wrote: On Monday 05 November 2012 23:19:42 Philippe Waroquiers wrote: On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote: The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp (from https://bugs.kde.org/show_bug.cgi?id=243232) shows that an optimization inside Qt leads to a helgrind warning about wrong lock ordering, making the use of that feature impossible for detecting actual problems elsewhere (i.e. I have to use --track-lockorders=no all the time). Technically if we ignore the distinction between lock and tryLock, helgrind is right, we did lock the mutexes in reverse order. But because it's a tryLock, it couldn't possibly deadlock. Should helgrind simply ignore all pthread_mutex_trylock calls, for the lockorder detection, even if they succeed? I think so, actually (by definition they couldn't deadlock, which is what track-lockorders is all about). A deadlock can appear if there is a mixture of trylock and lock on the same lock. So, trylock cannot just be ignored. E.g. Thread 1:trylock mutex1 lock mutex2 Thread 2:lock mutex2 lock mutex1 might deadlock. True. This means that only trylock in second place should be ignored. More on this below. More generally, I guess that you mean trylock in last place should be ignored (rather than the special case of 2nd place). Right. This might be difficult to implement as each time a lock is taken, helgrind checks for order violation. I suspect a later lock operation might then transform a trylock in last place to a trylock which is now not anymore in last place. Right. So let's record it, but let's not warn at the precise time the successful trylock happens. If another lock happens later, out of order due to the successful trylock, then yes, let's warn. But of course, when the trylock operation has just been done, this trylock is last place and so if we would ignore it, then this would be similar to always ignore the trylock, which is not ok. Right, not ignore it as if it didn't happen, but ignore it as in don't warn right now, and still record it. Currently, helgrind maintains a graph of lock order. I suspect we might need different graph node types and/or edge types to cope with trylock. For sure, more investigations needed looking in depth at the current algorithm. I think it would be good enough to add the successful trylock to the graph, just without a warning at that point. Even without mixture, isn't the below slightly bizarre/dangerous ? Thread 1: trylock mutex1 trylock mutex2 Thread 2: trylock mutex2 trylock mutex1 No deadlock can ever happen here. Yes, no deadlock can occur. However, this is really a really doubful construction. The question is: should helgrind report a lock order warning for such constructs ? I don't think so, due to the valid use cases I'm showing here. The idea of helgrind is that it detects lock order problems and/or race condition problems *even* if no deadlock happens and/or if no race condition really happened. Maybe it is very unlikely that the trylock fails. Still would be nice to discover the bug. And if the trylock does not fail, then the race condition will then not be detected by helgrind. The simpler construct that can lead to this problem is * trylock mutex1 * access shared data If the trylock fails, then we have a race condition. If it succeeds, then there is no problem. I don't see how you can detect the potential race condition with helgrind when the trylock succeeds, unless you go as far as saying all trylocks are bad (which was actually my thinking until reading the code of QOrderedMutexLocker which has a valid use case for trylock -- maybe there are more, of course). But this seems impossible to detect. I mean, the code could say if (trylock succeeds) // do something else // do something else You're working on a runtime analysis tool, not on a static analysis tool, so you can't know anything about the branch that is not being taken in a given run of helgrind. Therefore I see no way of saying there could have been a race if this trylock had failed, but it actually succeeded in this run. So I don't disagree on it would be nice to discover the bug, but if it's impossible then let's drop the idea and come back to what is actually possible :) Yes this is exactly that QOrderedMutexLocker does. Thread 1: lock mutex1 lock mutex2 Thread 2: lock mutex2 trylock mutex1 if that failed, unlock mutex2 lock mutex1 lock mutex2
Re: [Valgrind-users] helgrind markup for lock-ordering
On Wed, 2012-11-07 at 10:51 +0100, David Faure wrote: The idea of helgrind is that it detects lock order problems and/or race condition problems *even* if no deadlock happens and/or if no race condition really happened. Maybe it is very unlikely that the trylock fails. Still would be nice to discover the bug. And if the trylock does not fail, then the race condition will then not be detected by helgrind. The simpler construct that can lead to this problem is * trylock mutex1 * access shared data discover the bug is related to the doubful construct, not to a race condition (as said above, if the trylock does not fail, no way to detect the race condition that would happen if the application does not properly handle the trylock failure). Note that currently, laog is producing messages which should be considered as lock order warning, not as for sure there is a deadlock order problem. The trylock is one case of warning, not an error which could/should be improved. But there are others e.g. laog does not understand the concept of guard locks which is (IIUC): each thread can acquire a single lock in a set of locks. If a thread wants to acquire more than one lock (in any order then), it first has to acquire the guard lock, and then can lock in any order any nr of locks in the lock set. With this guard lock, not possible to have a deadlock, but for sure this is not understood by the current helgrind laog algorithm. Properly handling guard locks implies a more sophisticated algorithm than the current laog (search for good lock algorithm). This is too monolithic thinking :) An application that uses a given framework (e.g. Qt) could very well be doing things differently than the framework does internally. As a developer debugging a large application written by other people, using a framework written by other people, how can I guarantee helgrind that all trylock uses follow a single design pattern? Of course, this cannot be guaranteed. But the idea is not that the command line option(s) are covering the full range of possible design patterns. The idea is that they should cover 'out of the box' a reasonable range of such patterns. This is true e.g. for memcheck (out of the box, it supports malloc compatible libs, but otherwise you need annotations). Helgrind should also support some 'out of the box' locking logics. The difficult question is what should be covered by the command line options 'out of the box' (the nirvana being an algorithm that would work for everything without annotations). At the time of the trylock, it is the last one - no warning at that precise moment. This sounds like a simple enough change in the current algorithm? Basically adding one if() ... if only I knew where ;) To avoid doing a lock order warning when doing the trylock is easy I believe: in hg_main.c:3697, put a condition 'if (!is_a_try_lock) before: other = laog__do_dfs_from_to(lk, thr-locksetA); if (other) { (where is_a_trylock has to be given by the caller). I think it is almost mechanical work to add arguments to *POST event handlers and corresponding requests to transfer the is a try lock from the helgrind interception to the line 3697). But I suspect that the insertion of a trylock in the graph might later on cause a 'wrong' cycle to be detected. E.g. (L = lock, T = trylock, L and T followed by lock nr) threadA L1 T2 threadB L2 L3 threadC L3 L1 cannot deadlock (I think :) if threadA releases lock 1 when T2 fails. But when L3 L1 will be inserted, a cycle will be found via T2 (if the graph has not remembered this is a trylock). So, I am still (somewhat intuitively) thinking that we need to have nodes and/or edges marked with this is a trylock and have the graph exploration taking these marking into account to not generate such false warnings. Far to be a mathematical proof, I know :). Philippe -- LogMeIn Central: Instant, anywhere, Remote PC access and management. Stay in control, update software, and manage PCs from one command center Diagnose problems and improve visibility into emerging IT issues Automate, monitor and manage. Do more in less time with Central http://p.sf.net/sfu/logmein12331_d2d ___ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users
Re: [Valgrind-users] helgrind markup for lock-ordering
On Wednesday 07 November 2012 23:00:51 Philippe Waroquiers wrote: On Wed, 2012-11-07 at 10:51 +0100, David Faure wrote: The idea of helgrind is that it detects lock order problems and/or race condition problems *even* if no deadlock happens and/or if no race condition really happened. Maybe it is very unlikely that the trylock fails. Still would be nice to discover the bug. And if the trylock does not fail, then the race condition will then not be detected by helgrind. The simpler construct that can lead to this problem is * trylock mutex1 * access shared data discover the bug is related to the doubful construct, not to a race condition If there's no race condition and no deadlock, I'm not sure what bug you want to detect :-) Note that currently, laog is producing messages which should be considered as lock order warning, not as for sure there is a deadlock order problem. Yes, but why does it warn about lock order? Because it could cause deadlocks. I agree, this is about potential deadlocks, not actual deadlocks. But trylock 1 + trylock 2 vs trylock2 + trylock1 (the case we're talking about in this part of the mail) is not even a potential deadlock. It can't ever deadlock. So there's nothing to warn about. The trylock is one case of warning, not an error which could/should be improved. But there are others e.g. laog does not understand the concept of guard locks which is (IIUC): each thread can acquire a single lock in a set of locks. If a thread wants to acquire more than one lock (in any order then), it first has to acquire the guard lock, and then can lock in any order any nr of locks in the lock set. With this guard lock, not possible to have a deadlock, but for sure this is not understood by the current helgrind laog algorithm. Right. That one definitely needs annotations in the source code, I would think. There's no way for the tool to detect that these mutexes all go together. At the time of the trylock, it is the last one - no warning at that precise moment. This sounds like a simple enough change in the current algorithm? Basically adding one if() ... if only I knew where ;) To avoid doing a lock order warning when doing the trylock is easy I believe: in hg_main.c:3697, put a condition 'if (!is_a_try_lock) before: other = laog__do_dfs_from_to(lk, thr-locksetA); if (other) { (where is_a_trylock has to be given by the caller). I gave it a try, but I'm hitting a problem with exactly that, passing isTryLock to that code. isTryLock is set in HG_PTHREAD_MUTEX_LOCK_PRE and similar, while the above code is called from HG_PTHREAD_MUTEX_LOCK_POST and similar. If I understand correctly, adding an argument to the _POST variant would break source compatibility for the existing userland macros? I think it is almost mechanical work to add arguments to *POST event handlers and corresponding requests to transfer the is a try lock from the helgrind interception to the line 3697). Ah OK, so you don't seem to see a problem with adding an argument :) No source compatibility for the user request macros is OK? I'll finish the patch then, but only if you agree with the approach, otherwise this would be dead code, i.e. a wasted effort. At this point you don't seem fully convinced :) But I suspect that the insertion of a trylock in the graph might later on cause a 'wrong' cycle to be detected. E.g. (L = lock, T = trylock, L and T followed by lock nr) threadA L1 T2 threadB L2 L3 threadC L3 L1 cannot deadlock (I think :) if threadA releases lock 1 when T2 fails. Well, if T2 fails then we have no cycle, and if it succeeds we have a real potential deadlock. The question is whether we want to remember the T2 attempt (and warn later) even when T2 fails. I would say, if it failed, it's like it didn't happen. Do you actually store failed trylocks currently? But when L3 L1 will be inserted, a cycle will be found via T2 (if the graph has not remembered this is a trylock). So, I am still (somewhat intuitively) thinking that we need to have nodes and/or edges marked with this is a trylock and have the graph exploration taking these marking into account to not generate such false warnings. My idea is rather that a successful trylock is just like a lock (except that it shouldn't warn at that precise moment, but if any later real-lock happens out of order with it, then we should warn, this is why successful trylocks should be recorded), and a failed trylock should NOT be recorded. So at this point I don't see a need for remembering this was a trylock. I don't see how this could matter later on. -- David Faure, fa...@kde.org, http://www.davidfaure.fr Working on KDE, in particular KDE Frameworks 5 -- LogMeIn Central: Instant, anywhere, Remote PC access and management. Stay in control, update software, and manage PCs from one
Re: [Valgrind-users] helgrind markup for lock-ordering
On Monday 05 November 2012 23:19:42 Philippe Waroquiers wrote: On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote: The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp (from https://bugs.kde.org/show_bug.cgi?id=243232) shows that an optimization inside Qt leads to a helgrind warning about wrong lock ordering, making the use of that feature impossible for detecting actual problems elsewhere (i.e. I have to use --track-lockorders=no all the time). Technically if we ignore the distinction between lock and tryLock, helgrind is right, we did lock the mutexes in reverse order. But because it's a tryLock, it couldn't possibly deadlock. Should helgrind simply ignore all pthread_mutex_trylock calls, for the lockorder detection, even if they succeed? I think so, actually (by definition they couldn't deadlock, which is what track-lockorders is all about). A deadlock can appear if there is a mixture of trylock and lock on the same lock. So, trylock cannot just be ignored. E.g. Thread 1:trylock mutex1 lock mutex2 Thread 2:lock mutex2 lock mutex1 might deadlock. True. This means that only trylock in second place should be ignored. More on this below. Even without mixture, isn't the below slightly bizarre/dangerous ? Thread 1: trylock mutex1 trylock mutex2 Thread 2: trylock mutex2 trylock mutex1 No deadlock can ever happen here. If the 2nd trylock fails, what is the plan B ? If the program then accesses shared data, a race condition will happen and will be detected by helgrind anyway. So ignoring the ordering of these trylocks is ok, I would think. Of course helgrind must record we got the lock, for the race condition detection feature, but it shouldn't warn about the wrong order of the locking, since it can't possibly deadlock. Not that the above would be good programming practice, of course, but helgrind can't say anything about it if all the locks were acquired. It will warn in another run, where some trylock fails, and a race ensues. It seems that a task must unlock all locks and restart from scratch in the above case. Yes this is exactly that QOrderedMutexLocker does. Thread 1: lock mutex1 lock mutex2 Thread 2: lock mutex2 trylock mutex1 if that failed, unlock mutex2 lock mutex1 lock mutex2 If the trylock fails (because thread1 was first), then it unlocks and restarts from scratch. I can't see a deadlock risk with that, so ideally helgrind shouldn't warn. I guess we might need an option such as: --trylock-logic={normal-lock|local-retry|full-retry} normal-lock = current behaviour local-retry means the task would re-trylock full-retry means that the plan B is to unlock all locks and retry everything. I don't see how this can be a global option. Some piece of code (like QOrderedMutexLocker) might have the full retry logic above, but other pieces of code might do something different - e.g. something wrong. It doesn't make sense to me to tell helgrind this is what all the code paths are going to do about tryLock, that's impossible to predict in a complex program. Let me present another case: Thread 1: lock mutex1 lock mutex2 Thread 2: lock mutex2 trylock mutex1 if that fails, unlock mutex2 and give up This could happen for a non-critical operation that can be canceled if it can't be done immediately. Again, no deadlock possible, so helgrind shouldn't warn about a successful trylock being out of order. And yet this isn't a full retry, so I don't think --trylock-logic=full-retry is the solution. Deadlock can only happen if both threads use normal locking as their second operation. A trylock as the second operation doesn't deadlock. Or maybe we would need the same info but with an annotation of the lock operation and/or of the lock itself ? Sounds like an annotation of the trylock operation, unless we agree on my next statement: I am quite sure the simple logic trylock can be ignored is not ok for all cases. Right. Let me refine that: trylock can be ignored as the second operation, i.e. helgrind shouldn't issue the out-of-order-locking-warning at the precise moment it's seeing a successful out-of-order trylock. If we can agree on that, then there's no need for an annotation in the code. PS: see my previous email about VALGRIND_HG_ENABLE_CHECKING not working. Is this known? Should I report a bug? Always good to report a bug if you have found a bug :). Mail will be forgotten, bug entries are less likely to be lost. OK, will do. On the other hand I'm getting more reaction about the tryLock issue here than on bug 243232 :-). Anyway, thanks for your input, much appreciated. -- David Faure, fa...@kde.org, http://www.davidfaure.fr Working on KDE, in particular KDE
Re: [Valgrind-users] helgrind markup for lock-ordering
On Tue, 2012-11-06 at 13:43 +0100, David Faure wrote: On Monday 05 November 2012 23:19:42 Philippe Waroquiers wrote: On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote: The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp (from https://bugs.kde.org/show_bug.cgi?id=243232) shows that an optimization inside Qt leads to a helgrind warning about wrong lock ordering, making the use of that feature impossible for detecting actual problems elsewhere (i.e. I have to use --track-lockorders=no all the time). Technically if we ignore the distinction between lock and tryLock, helgrind is right, we did lock the mutexes in reverse order. But because it's a tryLock, it couldn't possibly deadlock. Should helgrind simply ignore all pthread_mutex_trylock calls, for the lockorder detection, even if they succeed? I think so, actually (by definition they couldn't deadlock, which is what track-lockorders is all about). A deadlock can appear if there is a mixture of trylock and lock on the same lock. So, trylock cannot just be ignored. E.g. Thread 1:trylock mutex1 lock mutex2 Thread 2:lock mutex2 lock mutex1 might deadlock. True. This means that only trylock in second place should be ignored. More on this below. More generally, I guess that you mean trylock in last place should be ignored (rather than the special case of 2nd place). This might be difficult to implement as each time a lock is taken, helgrind checks for order violation. I suspect a later lock operation might then transform a trylock in last place to a trylock which is now not anymore in last place. But of course, when the trylock operation has just been done, this trylock is last place and so if we would ignore it, then this would be similar to always ignore the trylock, which is not ok. Currently, helgrind maintains a graph of lock order. I suspect we might need different graph node types and/or edge types to cope with trylock. For sure, more investigations needed looking in depth at the current algorithm. Even without mixture, isn't the below slightly bizarre/dangerous ? Thread 1: trylock mutex1 trylock mutex2 Thread 2: trylock mutex2 trylock mutex1 No deadlock can ever happen here. Yes, no deadlock can occur. However, this is really a really doubful construction. The question is: should helgrind report a lock order warning for such constructs ? If the 2nd trylock fails, what is the plan B ? If the program then accesses shared data, a race condition will happen and will be detected by helgrind anyway. So ignoring the ordering of these trylocks is ok, I would think. Of course helgrind must record we got the lock, for the race condition detection feature, but it shouldn't warn about the wrong order of the locking, since it can't possibly deadlock. The idea of helgrind is that it detects lock order problems and/or race condition problems *even* if no deadlock happens and/or if no race condition really happened. Maybe it is very unlikely that the trylock fails. Still would be nice to discover the bug. And if the trylock does not fail, then the race condition will then not be detected by helgrind. Not that the above would be good programming practice, of course, but helgrind can't say anything about it if all the locks were acquired. It will warn in another run, where some trylock fails, and a race ensues. It seems that a task must unlock all locks and restart from scratch in the above case. Yes this is exactly that QOrderedMutexLocker does. Thread 1: lock mutex1 lock mutex2 Thread 2: lock mutex2 trylock mutex1 if that failed, unlock mutex2 lock mutex1 lock mutex2 QOrderedMutexLocker really retries locking in an other order than the operations done by Thread 2 ? Or is that a typo ? If the trylock fails (because thread1 was first), then it unlocks and restarts from scratch. I can't see a deadlock risk with that, so ideally helgrind shouldn't warn. I guess we might need an option such as: --trylock-logic={normal-lock|local-retry|full-retry} normal-lock = current behaviour local-retry means the task would re-trylock full-retry means that the plan B is to unlock all locks and retry everything. I don't see how this can be a global option. Some piece of code (like QOrderedMutexLocker) might have the full retry logic above, but other pieces of code might do something different - e.g. something wrong. It doesn't make sense to me to tell helgrind this is what all the code paths are going to do about tryLock, that's impossible to predict in a complex program. For sure, generally, an application can do a big variety of behaviours. I suspect however that an application might (this is sane at least) try to