Date: Tue, 13 Mar 2018 03:27:50 +0100
From: Kamil Rytarowski <n...@gmx.com>
Message-ID: <24955bcc-ce92-150f-75cc-60bc62287...@gmx.com>
  | The approach with iterating over pid_table[] didn't work well for me and
  | I've abandoned it quickly.

I am not surprised at that.

  | I've been testing a patch splitting KERN_PROC_PID out into a new
  | branch/function:
  | My impression is that these changes make the code much less readable,

Agreed, though it is usually possible to improve that by refactoring and
similar (but most likely not worth the effort here.)

  | and the performance gain.... was negligible or even negative (!) in a
  | regular running systems with 20-30 processes. I expect that such
  | interface is clearly not a quick-path.

Not a surprise there either - where it might make a difference is when the
number of processes is more in the thousands.

I'm not sure how 20-30 processes counts as a "regular running system"
though; my laptop has > 130 processes at the minute, and it is more or less
idling (replying to this e-mail is its current real task) - other times it
will be doing a build, perhaps with -j4 or -j8, and be running many more.

  | I was testing the proposal by Christos swapping zombie and allproc lists
  | and I've found that this code is stable for the original problem.

It will certainly arrange that you not find a process twice, at the expense
of making it more likely that you not find a process at all. I am still
not sure that I like that tradeoff - given that we have unpredictable
behaviour, I still think that always including all processes observed (even
if that means sometimes some are returned twice) is the right solution,
rather than sometimes omitting some that could have been returned (procs
which existed before the sysctl started and still exist after it is done.)

In your message yesterday, which I did not reply to (I had not had time to
do so yet, and now will not, other than this) you said ...
  | I expect an interface like sysctl(3) to get a quick snapshot of the current
  | system,

and I think that's a faulty expectation - to generate a snapshot it would
need to freeze everything (and I mean everything) while it copies out all
the data to user space.

The only mechanism I have ever seen which could do that for the proc table
was an ancient "table driver" for 6th edition vintage systems - and it
worked only because:

	there were no multiprocessors

	there was no paging

hence the system call to read the entire contents of the process (or file,
or tty, mount, ...) table was guaranteed not to block (to be able to do
the read, the process had to be resident and running, and if it was
resident, all of it was, so no waiting for pages to be brought back), and
as it was in control of the CPU, and in the kernel, no preemption could
happen, so nothing would change anywhere while it was running.

Of course, as soon as the sys call was done and the data was in user
space, it could have become "incorrect" even in that system, as even then
there was no guarantee a context switch might not happen before the return
to user level - so before the reading process got to examine the data,
everything might have altered. But it was a snapshot of an instant of
time in the past.

There is no snapshot now, real or intended - just a (slow, or at least,
sluggish) scan of a changing data structure, with everything that implies
about the consistency of the data returned. What's more, we do not want
more than that - the uses for this (which is any random process deciding
to fetch all processes) should not be permitted to slow down the system.
The only way to make a snapshot type of result would be to lock out every
possible change for the duration of the sys call - everything else just
freezes.

[Later quotes are again back from the message I am replying to.]
Note that to the caller there are even more races than are apparent here:
the sysctl can always return processes that no longer exist by the time
the user process gets to examine the results, and new processes that could
have been returned may now exist - they just did not when the list was
scanned. None of that is an issue to be "fixed".

  | I've included in this patch additionally:
  | - Short-circuit break for KERN_PROC_PID.

This alone would have fixed the original problem of getting the process
returned twice (or the kernel attempting to make that happen, finding
insufficient user buffer space provided, and returning ENOMEM).

  | - Removal of redundant "if (kbuf)" and "if (marker)" checks.
  | - Update of comments regarding potential optimization, explaining why
  |   we don't want it.
  | - ESRCH returned for badly written software, that is not prepared to
  |   see no result (or see race). This also makes the case of replaying a
  |   call easier to detect.

All that is fine.

  | I've not observed any ATF regression in local tests

For your ptrace() tests, I would not expect that - while the sysctl
interface allows truly unrelated processes to perform proc table scans,
and while the processes in your tests are technically unrelated (to the
kernel they seem that way), they're not so much in practice, as the
process doing the scan is more or less in control of the others of
interest, telling them what to do by sending them messages - so, for
example, the situation of a process id having been reused by another
process (the sysctl returns a process, but it is not the one you were
looking for) should be impossible in those scenarios (assuming there are
no message sending bugs and the code synchronisation is properly
designed).

Are there any other ATF tests that use this sysctl? Clearly nothing else
should be affected by altering the way it works...
What I'd suggest is adding one more test, where you create processes much
like in the ptrace tests, but without any ptrace involvement: the child
simply does fork/wait (with random small delays - of the order of
milliseconds - between those steps), and its child simply exits after a
small random delay, with the 1st (direct) child sending messages
containing the pid of the current grandchild to the ATF created process,
which will be in a loop doing continuous sysctl() calls looking for that
exact process.

If it ever finds the process after having been told it does not exist
(ESRCH), I think that's a problem. It will not start looking for a pid
until after it has been created, when a message is received telling it
that, and will stop looking for that pid after a new grandchild exists,
implying that the first no longer exists - so unless the kernel uses the
same pid for 2 successive grandchildren (which I think we should prevent
happening if it is even a remote possibility now), the only way we should
get found/ESRCH/found is if the ESRCH happened as the sysctl scan occurred
just when the process was moving from allproc to zombproc (so the 2nd
"found" is of the zombie now located on zombproc).

I also suspect that if you test again now, with the sole change to the
current code being that the order of the list scans be reverted to the way
it was (allproc and then zombproc), the issues you were having with the
ptrace() tests will all be gone.

kre