> The comment:
>
> 	* When target == current, it would be safe just to call
> 	* splice_attaching() right here.  But if we're inside a
> 	* callback,
>
> just to clarify, "inside a callback" means inside utrace_report_xxx(),
> not only inside utrace_engine_ops->report_xxx(), right?
Certainly what I mean when I say "a callback" is one of the functions whose
pointer lives in struct utrace_engine_ops.  But I don't see how the
distinction you make could even be meaningful here.  A utrace_attach_task()
call "inside utrace_report_foo()" could only possibly mean one made by a
->report_foo() function that utrace_report_foo() calls, since obviously
there are no hard-wired utrace_attach_task() calls in utrace.c itself.

> 	  that would mean the new engine also gets
> 	* notified about the event that precipitated its own
> 	* creation.
>
> engine->flags == 0, so it should not be notified until the caller
> does utrace_set_events() later, right?

Right.  The case in question is a callback doing:

	new_engine = utrace_attach_task(current, ...);
	utrace_set_events(new_engine, <includes this event>);

Then new_engine would get the "this event" callback at the end of the very
same reporting loop containing its creation.  (This is what happened before
I changed the code as the comment describes.)

> > This is not what the user wants.
>
> It is not clear to me why the user doesn't want this.

Jim Keniston is the user who doesn't want this.
https://www.redhat.com/archives/utrace-devel/2008-December/msg00051.html

> I understand this as follows.  If we add the new engine to the ->attached
> list, and if the target is inside a callback, the target can later race
> with (say) utrace_set_events().  The target can see "engine->flags & event"
> and call start_callback/finish_callback before utrace_set_events() completes.

It's not a race question.  There are no guarantees for such races.  (That
is, utrace_set_events() calls on target!=current make no guarantee about
reporting an event that might already have started.  Only if the target was
already stopped by your engine when you made the call can you be sure that
no such event can be in progress.)  The scenario we are talking about here
is fully synchronous: the target itself is inside a callback, calling
utrace_set_events() on itself.
> Another question.  In any case I don't understand why do we really need
> two lists.

We want the common case of a reporting pass to take no locks.  The
->attached list is never touched when the target is not quiescent.  (The
target uses the lock to synchronize when it transitions between being
quiescent and not.)  Once you believe the quiescence logic, this makes it
easy to be confident about the unlocked use of that list in reporting
passes.  It's used in totally vanilla ways, and modified in totally
vanilla ways.

> Let's suppose we implement the new trivial helper,
>
> 	list_for_each_entry_xxx(pos, head, tail, member)
>
> it stops when "pos" reaches "tail", not "head".  Then REPORT_CALLBACKS()
> can just read "tail = utrace->attached->prev" (under ->lock, or
> utrace_add_engine() can use list_add_rcu) before list_for_each_entry_xxx.
>
> This way we can kill ->attaching, no?

This is a lot like what the old utrace code did before the introduction of
the two lists (when the engine struct was managed using RCU).

This is just an optimization over what we have now.  It saves the
->attaching space (i.e. two words in struct utrace), the splice_attaching()
logic (pretty cheap), and the sometimes-superfluous resume report after an
attach.  The cost is some very touchy fine-grained complexity in convincing
ourselves (and reviewers) that the list traversal and modification is
always correct.

I've already implied that anything taking any locks in every vanilla
reporting pass is a non-starter.  I'm insisting on preemptive optimization
here because this is the case that is most important to optimize.  The
overhead of a reporting pass applies to situations like every system call
entry, with an engine callback that quickly filters out the vast majority
of calls (i.e. "if (regs->foo != __NR_bar) return;" or something about
that cheap).
So we think about the reporting-pass overhead as something that might be
done a million times a second, and accordingly think carefully about that
hot path.  In contrast, we are talking here about optimizing attach, and
saving a couple of words of data structure space that will already be
cache-hot.

We are not actually following any RCU rules at all, so to use
list_add_tail_rcu would really just mean that we are relying on our own
fancy special list-mutation scheme and proving/documenting that it is
correct.  It just happens to have the same implementation details as
list_add_tail_rcu, so we must either copy those innards and document why
they are right for our uses, or document how the list_add_tail_rcu innards
happen to match what is right for our uses and keep track of any future
implementation changes in rculist.h that might diverge from what we rely
on.  That proof and documentation entails hairy logic about SMP ordering
and memory barriers and so forth.  Frankly, that all seems like much more
touchy hair than the utrace-indirect logic (for less benefit), and we've
already decided to avoid that for the first cut because LKML reviewers
found it too hairy to contemplate.

Next, consider that e.g. Renzo Davoli has proposed reversing the engine
order used for certain reporting passes (syscall entry vs exit having
inverse order).  (I'm not discussing the merits of that change, it's just
an example.)  Right now, a change like that would be a simple choice about
the desirable API, with no implementation complexity to worry about at
all, just s/list_for_each/&_reverse/.

In the same vein, we contemplate for the future having engines on a
priority list, or some other means to add new engines somewhere other than
at the end of the list.  When we've resolved what interfaces we want for
that, it will be straightforward to implement whatever it is using normal
list.h calls (or even to use a different kind of list data structure).
Many choices like those are likely to conflict with whatever clever
SMP-safe list magic we can devise, if we start on that sort of
optimization now.

I could easily be quite wrong about the performance trade-offs of a lock
vs splice_attaching() and cache effects, etc.  But before we get to
worrying about that performance in great detail, the complexity argument
stands.  This is an optimization to consider later on: both after the
upstream review has accepted the simpler code into the kernel to begin
with, and after we have a more mature set of uses of the API and have
refined the details of the API semantics on merits broader than such
micro-optimization.


Thanks,
Roland