Re: Is there a current known getrusage() problem (amd64)
On Apr 6, 1:59am, k...@munnari.oz.au (Robert Elz) wrote: -- Subject: Re: Is there a current known getrusage() problem (amd64) | Date:Thu, 5 Apr 2018 16:57:31 + (UTC) | From:chris...@astron.com (Christos Zoulas) | Message-ID: | | | Does it change anything to define it: | | | | >#define CLKF_USERMODE(frame) (!KERNELMODE((frame)->cf_if.if_tf.tf_cs)) | | No, no change, though it looks as if perhaps it should make a difference, | if the comments are to be believed anyway. Oh well. christos
Re: Is there a current known getrusage() problem (amd64)
Date:Thu, 5 Apr 2018 16:57:31 + (UTC) From:chris...@astron.com (Christos Zoulas) Message-ID: | Does it change anything to define it: | | >#define CLKF_USERMODE(frame) (!KERNELMODE((frame)->cf_if.if_tf.tf_cs)) No, no change, though it looks as if perhaps it should make a difference, if the comments are to be believed anyway. kre
Re: Is there a current known getrusage() problem (amd64)
Date:Thu, 5 Apr 2018 19:00:50 +0200 From:Manuel Bouyer Message-ID: <20180405170050.ga...@mail.soc.lip6.fr> | maybe it was on i386 and not amd64 ? While the answer to that is still no, on thinking about it, I wonder whether you would expect it to matter. >From what I can see, the USERMODE() test on i386 is just the same as the amd64 one - so are you expecting that there would be a difference? kre
Re: Is there a current known getrusage() problem (amd64)
Date:Thu, 5 Apr 2018 16:57:31 + (UTC) From:chris...@astron.com (Christos Zoulas) Message-ID: | Does it change anything to define it: | | >#define CLKF_USERMODE(frame) (!KERNELMODE((frame)->cf_if.if_tf.tf_cs)) | | like sh3 does? That's easy to test, so I shall... kre
Re: Is there a current known getrusage() problem (amd64)
Date:Thu, 5 Apr 2018 19:00:50 +0200 From:Manuel Bouyer Message-ID: <20180405170050.ga...@mail.soc.lip6.fr> | maybe it was on i386 and not amd64 ? No, I have barely ever run i386 XEN - and then only for i386 system testing. But it was a long time ago (NetBSD 6 vintage). kre
Re: Is there a current known getrusage() problem (amd64)
On Thu, Apr 05, 2018 at 08:16:38PM +0700, Robert Elz wrote: > Date:Thu, 05 Apr 2018 18:08:34 +0700 > From:Robert Elz > Message-ID: <9363.1522926...@jinx.noi.kre.to> > > | But if I had to guess where this problem might have originated, I'm going > | to suspect arch/amd64/include/segments.h rev 1.34 > > Not that one, reverting that (all of it, not just the one file) made no > difference. > > Not sure I will ever work out what the right method is - I am sure > someone knows though (but once upon a time it was possible to > set up cgd's on a XEN DomU, I have done it - and the cgdconfig -g > command is the first step of that - which indicates that sometime in the > past this did work.) maybe it was on i386 and not amd64 ? -- Manuel Bouyer NetBSD: 26 ans d'experience feront toujours la difference --
Re: Is there a current known getrusage() problem (amd64)
In article <9363.1522926...@jinx.noi.kre.to>, Robert Elz wrote: >Date:Wed, 4 Apr 2018 17:05:51 -0400 (EDT) >From:Mouse >Message-ID: <201804042105.raa04...@stone.rodents-montreal.org> > > | >> Hmm...have you tried [various suggestions]? > | > No, to all of those - and all good suggestions (and since this is a > | > XEN DomU I can give it just 1 CPU as well...) > | > | Might also be interesting to see whether/how the symptoms change on > | real hardware, if you have any such available; with xen in the mix > | there are a bunch more possible places bugs could be causing trouble. > >I suspect they probably will, and yes, I could test this there (been meaning >to update my laptop kernel anyway...) but I am now not sure I will need to >do that, as some progress has been made (almost by fluke). > > | see that name: what, you're trying to configure Chris Demetriou?) >:-) > > >What I saw that "revealed all" is from sysctl ... > >kern.cp_time: user = 0, nice = 0, sys = 6764, intr = 107, idle = 4968970 > >The test system has been up, idling, for about 14 hours, so that idle >time is right. The intr time (about 1 second) seems reasonable. > >But all the time being counted as system time, and none user time, is >what I was seeing from the time command as well. > >What this means to me is that the test in kern_clock.c > >if (CLKF_USERMODE(frame)) { > >is failing, and most likely just on XEN, as we know that the user/kernel >distinction is done differently there than is on real hardware. > >On amd64, CLKF_USERMODE is just > >#define CLKF_USERMODE(frame)USERMODE((frame)->cf_if.if_tf.tf_cs) Does it change anything to define it: >#define CLKF_USERMODE(frame)(!KERNELMODE((frame)->cf_if.if_tf.tf_cs)) like sh3 does? christos
Re: Is there a current known getrusage() problem (amd64)
Date:Thu, 05 Apr 2018 18:08:34 +0700 From:Robert Elz Message-ID: <9363.1522926...@jinx.noi.kre.to> | But if I had to guess where this problem might have originated, I'm going | to suspect arch/amd64/include/segments.h rev 1.34 Not that one, reverting that (all of it, not just the one file) made no difference. Not sure I will ever work out what the right method is - I am sure someone knows though (but once upon a time it was possible to set up cgd's on a XEN DomU, I have done it - and the cgdconfig -g command is the first step of that - which indicates that sometime in the past this did work.) kre
Re: Is there a current known getrusage() problem (amd64)
Oh - I meant to add, then forgot - the reason why the rusage() starts out looking OK, and then fails, is because of the way it is computed. As Michael van Elst said, the actual time cosumed by the process is accurately measured. The time spent in its various modes is sampled (if a clock interrupt (1/hz) fires in user mode, a user mode tick is added, if in kernel mode, a system time tick is counted, and if in an interrupt, an interrupt time tick is added.) Before the first tick happens, time is still accumulated - but the rusage code has no idea whether it was user/system/interrupt time, so it simply splits 50/50 (half is assumed to be user, and half system). Once the first tick happens, the times are divided according to the ratio of the number of ticks. This is how the up/down/... that Michael descriibed can happen - initially time is 50/50, then the first tick happens in either user or system mode, and now we have 100/0 one way or the other - so one of them doubled, the other went to zero. As more ticks happen, things tend to get more rational, and for any process that has lasted a suitable number of ticks, the division of the time it has accumulated should be about right. Because of the XEN USERMODE() issue from the earlier reply, all the ticks are being counted as system ticks, none at all as user mode. So, until the first tick happens (somewhere between 0 and 10ms after the process starts) we see both user and system time, and they will appear equal (cgdconfig never looks at system time, so I did not notice this.) Once that first tick happens, since it will always be a system time tick (currently, on XEN) all of the CPU time consumed is counted as system time, and user time is locked at 0. Note that cgdconfig would care about none of these issues - it spends all its time in user mode, ticks will be (should be) counted as user mode ticks, almost exclusively, and eventually enough cpu time will be assumulated that the user mode rusage() will be enough for its purpose (it now knows how much work to do to use up enough, but not too much, CPU time) - which I assume then gets used later when it actually encrypts something. kre
Re: Is there a current known getrusage() problem (amd64)
Date:Wed, 4 Apr 2018 17:05:51 -0400 (EDT) From:Mouse Message-ID: <201804042105.raa04...@stone.rodents-montreal.org> | >> Hmm...have you tried [various suggestions]? | > No, to all of those - and all good suggestions (and since this is a | > XEN DomU I can give it just 1 CPU as well...) | | Might also be interesting to see whether/how the symptoms change on | real hardware, if you have any such available; with xen in the mix | there are a bunch more possible places bugs could be causing trouble. I suspect they probably will, and yes, I could test this there (been meaning to update my laptop kernel anyway...) but I am now not sure I will need to do that, as some progress has been made (almost by fluke). | see that name: what, you're trying to configure Chris Demetriou?) :-) What I saw that "revealed all" is from sysctl ... kern.cp_time: user = 0, nice = 0, sys = 6764, intr = 107, idle = 4968970 The test system has been up, idling, for about 14 hours, so that idle time is right. The intr time (about 1 second) seems reasonable. But all the time being counted as system time, and none user time, is what I was seeing from the time command as well. What this means to me is that the test in kern_clock.c if (CLKF_USERMODE(frame)) { is failing, and most likely just on XEN, as we know that the user/kernel distinction is done differently there than is on real hardware. On amd64, CLKF_USERMODE is just #define CLKF_USERMODE(frame)USERMODE((frame)->cf_if.if_tf.tf_cs) and #define USERMODE(c)(ISPL(c) == SEL_UPL) These seem to apply unchanged for XEN - whether that is appropriate or not I have no idea. This is about where I hit a brick wall, as now we're into details of x86 hardware, and how it is modified for XEN. But if I had to guess where this problem might have originated, I'm going to suspect arch/amd64/include/segments.h rev 1.34 "Fix a huge privilege separation vulnerability in Xen-amd64." I am going to (locally) revert that and see if the problem goes away. If it does, then I will leave it for maxv to make appropriate changes to USERMODE() for XEN so it works correctly. kre
Re: Is there a current known getrusage() problem (amd64)
>> Hmm...have you tried [various suggestions]? > No, to all of those - and all good suggestions (and since this is a > XEN DomU I can give it just 1 CPU as well...) Might also be interesting to see whether/how the symptoms change on real hardware, if you have any such available; with xen in the mix there are a bunch more possible places bugs could be causing trouble. > time cgdconfig ... (In passing, I always twitch a little, even if only mentally, when I see that name: what, you're trying to configure Chris Demetriou?) > and that reports (when I killed it the cgdconfig, in a couple of > attempts) 3.xx and 5.xx seconds elapsed time, and the same amount of > system time and 0 user time. > [...], that is clearly backwards. Indeed. I'm wondering if whatever system/user time sampling is driven off of is something xen tries to be "smart" about and, as is far too common when something tries to be smart, botches. I've occasionally seen systems which have load averages up in the hundreds but you'd never know it from interactive response, because that load is all phantom load from processes that wake up every clock tick, do almost nothing, and go back to sleep, taking maybe a percent or two of the CPU - but, when the run queue is sampled every tick, they're always runnable. Something related might be happening here at some level. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Is there a current known getrusage() problem (amd64)
Date:Wed, 4 Apr 2018 15:43:48 -0400 (EDT) From:Mouse Message-ID: <201804041943.paa05...@stone.rodents-montreal.org> | Hmm...have you tried a non-MP kernel? Have you tried nailing your test | down on each of the available CPUs? Have you tried running various | numbers of infinite-loop CPU-burners at the same time? No, to all of those - and all good suggestions (and since this is a XEN DomU I can give it just 1 CPU as well...) WIll do all that, tomorrow... (or, that is, today, but later, much later...) What I have done more recently is time cgdconfig ... and that reports (when I killed it the cgdconfig, in a couple of attempts) 3.xx and 5.xx seconds elapsed time, and the same amount of system time and 0 user time. SInce this program (used this way) does just about no sys calls (other than my diagnostic printfs, which should consume almost nothing - there are not all that many) but is CPU intensive, that is clearly backwards. I have not been able to kill it while it is still in the period when it is getting sane user cpu times, to see what the time command says then (that usually ends within a few milliseconfds of CPU (and elapsed) time.) I might see if I can rig a debugger session and breakpoint it, so I can keep it within those initial steps, and then kill it. Or with your suggestions, and what I can see already, I might just be able to read the kernel code and spot a bug. kre
Re: Is there a current known getrusage() problem (amd64)
>> aren't there known issues with the tc you mentioned and SMP systems? > Don't know - but one possibility that would fit the symptoms would be > if there are multiple usage counters (one per cpu) and somehow the > values are just coming from one of them, and it isn't the right one. Hmm...have you tried a non-MP kernel? Have you tried nailing your test down on each of the available CPUs? Have you tried running various numbers of infinite-loop CPU-burners at the same time? Each of those might hint at the actual problem. Maybe ktrace could use a "migrated to a different CPU" tracepoint? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Is there a current known getrusage() problem (amd64)
Date:Thu, 05 Apr 2018 04:08:08 +1000 From:matthew green Message-ID: <19468.1522865...@splode.eterna.com.au> | this sounds like the timecounter goes backwards issue. Almost anything is possible, but I doubt it. This is far too consistent (that is, it does not just happen sometimes and I was "lucky" to catch it, it happens every time - for this process anyway, which is the only one I know that loops calling getrusage() and not doing any other sys calls.) The number of iterations before it switches to 0 varies (I have seen it from the very first call, once) and of course, before it goes to 0, the actual values vary, but every time it gives a 0 usage value, and never changes after. | aren't there known issues with the tc you mentioned and | SMP systems? Don't know - but one possibility that would fit the symptoms would be if there are multiple usage counters (one per cpu) and somehow the values are just coming from one of them, and it isn't the right one. Or something like that (sounds very implausible, I'd expect them to be summed.) Certainly the way it is being used (on a more or less standalone system doing nothing else, and wanting to simply consume CPU until it decides it has had enough) would fit - usage from all CPUs except "the one" would be zero, only the one the process is running on would (should) be different. kre ps: I doubt this has anything to do with timekeeping in any way at all, I suspect a getrusage() specific problem, relating to collecting the data - not to accumulating it in the first place (but that's just a gut feeling.)
re: Is there a current known getrusage() problem (amd64)
this sounds like the timecounter goes backwards issue. for a long on the sparc port time we saw issues where getrusage() would return garbage (very large) values for times. after a lot of debugging, i tracked it down to the tc giving a lower value than it should have, without ever getting close to overflowing, which ends up sending negative runtime down into the system. aren't there known issues with the tc you mentioned and SMP systems? this is the fix i applied for sparc: http://mail-index.netbsd.org/source-changes/2018/01/12/msg091064.html but note that it has a common source, not per-cpu one. .mrg.
Re: Is there a current known getrusage() problem (amd64)
Date:Wed, 4 Apr 2018 17:16:32 - (UTC) From:mlel...@serpens.de (Michael van Elst) Message-ID: | That's what you get with statistics. Not what I am seeing I don't think. I understand those issues, and none of that matters for the code in question (which has been there for a long long time.) | As a result, the user and system times reported by getrusage | are neither precise nor necessarily monotonic. While not exactly OK, that's a don't care for this use. All that matters is that the values are something like an approximation for the time used (when the CPU has been 100% busy in this process for a second or two, for example). What is happening is that the results are going back to 0. Not near 0, not fluctuation, they jump to 0, and stay fixed there. That's not statistics, that is something going wrong. What I don't know at the minute is what that is (other than a quick glance, I haven't started on debugging the kernel end). kre
Re: Is there a current known getrusage() problem (amd64)
k...@munnari.oz.au (Robert Elz) writes: >pkcs5_pbkdf2_time(24, 64): -1017 us (0.1017 -> 0.0) >Somehow, getrusage(RUSAGE_SELF, &end) has resulted in 0 >for the user time (which then results in negative consumed user >time - or that's what is calculated.) That's what you get with statistics. User and system time are sampled very coarsely. The reported times are interpolated from the total run time (which is measured precisely) and the ratio of user vs. system time. As a result, the user and system times reported by getrusage are neither precise nor necessarily monotonic. See bin/10201, kern/30115. -- -- Michael van Elst Internet: mlel...@serpens.de "A potential Snark may lurk in every tree."