Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Christos Zoulas
On Apr 6,  1:59am, k...@munnari.oz.au (Robert Elz) wrote:
-- Subject: Re: Is there a current known getrusage() problem (amd64)

| Date:Thu, 5 Apr 2018 16:57:31 + (UTC)
| From:chris...@astron.com (Christos Zoulas)
| Message-ID:  
| 
|   | Does it change anything to define it:
|   |
|   | >#define CLKF_USERMODE(frame)
(!KERNELMODE((frame)->cf_if.if_tf.tf_cs))
| 
| No, no change, though it looks as if perhaps it should make a difference,
| if the comments are to be believed anyway.

Oh well.

christos


Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Date:Thu, 5 Apr 2018 16:57:31 + (UTC)
From:chris...@astron.com (Christos Zoulas)
Message-ID:  

  | Does it change anything to define it:
  |
  | >#define CLKF_USERMODE(frame)
(!KERNELMODE((frame)->cf_if.if_tf.tf_cs))

No, no change, though it looks as if perhaps it should make a difference,
if the comments are to be believed anyway.

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Date:Thu, 5 Apr 2018 19:00:50 +0200
From:Manuel Bouyer 
Message-ID:  <20180405170050.ga...@mail.soc.lip6.fr>

  | maybe it was on i386 and not amd64 ?

While the answer to that is still no, on thinking about it, I wonder
whether you would expect it to matter.

>From what I can see, the USERMODE() test on i386 is just the
same as the amd64 one - so are you expecting that there would
be a difference?

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Date:Thu, 5 Apr 2018 16:57:31 + (UTC)
From:chris...@astron.com (Christos Zoulas)
Message-ID:  

  | Does it change anything to define it:
  |
  | >#define CLKF_USERMODE(frame)
(!KERNELMODE((frame)->cf_if.if_tf.tf_cs))
  |
  | like sh3 does?

That's easy to test, so I shall...

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Date:Thu, 5 Apr 2018 19:00:50 +0200
From:Manuel Bouyer 
Message-ID:  <20180405170050.ga...@mail.soc.lip6.fr>

  | maybe it was on i386 and not amd64 ?

No, I have barely ever run i386 XEN - and then only for i386 system testing.
But it was a long time ago (NetBSD 6 vintage).

kre




Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Manuel Bouyer
On Thu, Apr 05, 2018 at 08:16:38PM +0700, Robert Elz wrote:
> Date:Thu, 05 Apr 2018 18:08:34 +0700
> From:Robert Elz 
> Message-ID:  <9363.1522926...@jinx.noi.kre.to>
> 
>   | But if I had to guess where this problem might have originated, I'm going
>   | to suspect arch/amd64/include/segments.h rev 1.34
> 
> Not that one, reverting that (all of it, not just the one file) made no 
> difference.
> 
> Not sure I will ever work out what the right method is - I am sure
> someone knows though (but once upon a time it was possible to
> set up cgd's on a XEN DomU, I have done it - and the cgdconfig -g
> command is the first step of that - which indicates that sometime in the
> past this did work.)

maybe it was on i386 and not amd64 ?

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Christos Zoulas
In article <9363.1522926...@jinx.noi.kre.to>,
Robert Elz   wrote:
>Date:Wed, 4 Apr 2018 17:05:51 -0400 (EDT)
>From:Mouse 
>Message-ID:  <201804042105.raa04...@stone.rodents-montreal.org>
>
>  | >> Hmm...have you tried [various suggestions]?
>  | > No, to all of those - and all good suggestions (and since this is a
>  | > XEN DomU I can give it just 1 CPU as well...)
>  |
>  | Might also be interesting to see whether/how the symptoms change on
>  | real hardware, if you have any such available; with xen in the mix
>  | there are a bunch more possible places bugs could be causing trouble.
>
>I suspect they probably will, and yes, I could test this there (been meaning
>to update my laptop kernel anyway...)   but I am now not sure I will need to
>do that, as some progress has been made (almost by fluke).
>
>  | see that name: what, you're trying to configure Chris Demetriou?)
>:-)
>
>
>What I saw that "revealed all" is from sysctl ...
>
>kern.cp_time: user = 0, nice = 0, sys = 6764, intr = 107, idle = 4968970
>
>The test system has been up, idling, for about 14 hours, so that idle
>time is right.   The intr time (about 1 second) seems reasonable.
>
>But all the time being counted as system time, and none user time, is
>what I was seeing from the time command as well.
>
>What this means to me is that the test in kern_clock.c
>
>if (CLKF_USERMODE(frame)) {
>
>is failing, and most likely just on XEN, as we know that the user/kernel
>distinction is done differently there than is on real hardware.
>
>On amd64, CLKF_USERMODE is just
>
>#define CLKF_USERMODE(frame)USERMODE((frame)->cf_if.if_tf.tf_cs)

Does it change anything to define it:

>#define CLKF_USERMODE(frame)(!KERNELMODE((frame)->cf_if.if_tf.tf_cs))

like sh3 does?

christos



Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Date:Thu, 05 Apr 2018 18:08:34 +0700
From:Robert Elz 
Message-ID:  <9363.1522926...@jinx.noi.kre.to>

  | But if I had to guess where this problem might have originated, I'm going
  | to suspect arch/amd64/include/segments.h rev 1.34

Not that one, reverting that (all of it, not just the one file) made no 
difference.

Not sure I will ever work out what the right method is - I am sure
someone knows though (but once upon a time it was possible to
set up cgd's on a XEN DomU, I have done it - and the cgdconfig -g
command is the first step of that - which indicates that sometime in the
past this did work.)

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Oh - I meant to add, then forgot - the reason why the rusage() starts
out looking OK, and then fails, is because of the way it is computed.

As Michael van Elst said, the actual time cosumed by the process is
accurately measured.   The time spent in its various modes is sampled
(if a clock interrupt (1/hz) fires in user mode, a user mode tick is added,
if in kernel mode, a system time tick is counted, and if in an interrupt,
an interrupt time tick is added.)

Before the first tick happens, time is still accumulated - but the rusage
code has no idea whether it was user/system/interrupt time, so it
simply splits 50/50 (half is assumed to be user, and half system).

Once the first tick happens, the times are divided according to the
ratio of the number of ticks.   This is how the up/down/... that Michael
descriibed can happen - initially time is 50/50, then the first tick happens
in either user or system mode, and now we have 100/0 one way or the
other - so one of them doubled, the other went to zero.  As more ticks
happen, things tend to get more rational, and for any process that has
lasted a suitable number of ticks, the division of the time it has
accumulated should be about right.

Because of the XEN USERMODE() issue from the earlier reply, all the
ticks are being counted as system ticks, none at all as user mode.
So, until the first tick happens (somewhere between 0 and 10ms after the
process starts) we see both user and system time, and they will appear
equal (cgdconfig never looks at system time, so I did not notice this.)
Once that first tick happens, since it will always be a system time tick
(currently, on XEN) all of the CPU time consumed is counted as system
time, and user time is locked at 0.

Note that cgdconfig would care about none of these issues - it spends
all its time in user mode, ticks will be (should be) counted as user mode
ticks, almost exclusively, and eventually enough cpu time will be assumulated
that the user mode rusage() will be enough for its purpose (it now knows how
much work to do to use up enough, but not too much, CPU time) - which I
assume then gets used later when it actually encrypts something.

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-05 Thread Robert Elz
Date:Wed, 4 Apr 2018 17:05:51 -0400 (EDT)
From:Mouse 
Message-ID:  <201804042105.raa04...@stone.rodents-montreal.org>

  | >> Hmm...have you tried [various suggestions]?
  | > No, to all of those - and all good suggestions (and since this is a
  | > XEN DomU I can give it just 1 CPU as well...)
  |
  | Might also be interesting to see whether/how the symptoms change on
  | real hardware, if you have any such available; with xen in the mix
  | there are a bunch more possible places bugs could be causing trouble.

I suspect they probably will, and yes, I could test this there (been meaning
to update my laptop kernel anyway...)   but I am now not sure I will need to
do that, as some progress has been made (almost by fluke).

  | see that name: what, you're trying to configure Chris Demetriou?)
:-)


What I saw that "revealed all" is from sysctl ...

kern.cp_time: user = 0, nice = 0, sys = 6764, intr = 107, idle = 4968970

The test system has been up, idling, for about 14 hours, so that idle
time is right.   The intr time (about 1 second) seems reasonable.

But all the time being counted as system time, and none user time, is
what I was seeing from the time command as well.

What this means to me is that the test in kern_clock.c

if (CLKF_USERMODE(frame)) {

is failing, and most likely just on XEN, as we know that the user/kernel
distinction is done differently there than is on real hardware.

On amd64, CLKF_USERMODE is just

#define CLKF_USERMODE(frame)USERMODE((frame)->cf_if.if_tf.tf_cs)

and

#define USERMODE(c)(ISPL(c) == SEL_UPL)

These seem to apply unchanged for XEN - whether that is appropriate or
not I have no idea.

This is about where I hit a brick wall, as now we're into details of x86 
hardware, and how it is modified for XEN.

But if I had to guess where this problem might have originated, I'm going
to suspect arch/amd64/include/segments.h rev 1.34

"Fix a huge privilege separation vulnerability in Xen-amd64."

I am going to (locally) revert that and see if the problem goes away.
If it does, then I will leave it for maxv to make appropriate changes
to USERMODE() for XEN so it works correctly.

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread Mouse
>> Hmm...have you tried [various suggestions]?
> No, to all of those - and all good suggestions (and since this is a
> XEN DomU I can give it just 1 CPU as well...)

Might also be interesting to see whether/how the symptoms change on
real hardware, if you have any such available; with xen in the mix
there are a bunch more possible places bugs could be causing trouble.

>   time cgdconfig ...

(In passing, I always twitch a little, even if only mentally, when I
see that name: what, you're trying to configure Chris Demetriou?)

> and that reports (when I killed it the cgdconfig, in a couple of
> attempts)  3.xx and 5.xx seconds elapsed time, and the same amount of
> system time  and 0 user time.

> [...], that is clearly backwards.

Indeed.  I'm wondering if whatever system/user time sampling is driven
off of is something xen tries to be "smart" about and, as is far too
common when something tries to be smart, botches.  I've occasionally
seen systems which have load averages up in the hundreds but you'd
never know it from interactive response, because that load is all
phantom load from processes that wake up every clock tick, do almost
nothing, and go back to sleep, taking maybe a percent or two of the CPU
- but, when the run queue is sampled every tick, they're always
runnable.  Something related might be happening here at some level.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread Robert Elz
Date:Wed, 4 Apr 2018 15:43:48 -0400 (EDT)
From:Mouse 
Message-ID:  <201804041943.paa05...@stone.rodents-montreal.org>

  | Hmm...have you tried a non-MP kernel?  Have you tried nailing your test
  | down on each of the available CPUs?  Have you tried running various
  | numbers of infinite-loop CPU-burners at the same time?

No, to all of those - and all good suggestions (and since this is a XEN DomU
I can give it just 1 CPU as well...)

WIll do all that, tomorrow... (or, that is, today, but later, much later...)

What I have done more recently is

time cgdconfig ...

and that reports (when I killed it the cgdconfig, in a couple
of attempts)  3.xx and 5.xx seconds elapsed time,
and the same amount of system time  and 0 user time.

SInce this program (used this way) does just about no sys
calls (other than my diagnostic printfs, which should consume
almost nothing - there are not all that many) but is CPU
intensive, that is clearly backwards.

I have not been able to kill it while it is still in
the period when it is getting sane user cpu times,
to see what the time command says then (that usually
ends within a few milliseconfds of CPU (and elapsed) time.)

I might see if I can rig a debugger session and breakpoint it,
so I can keep it within those initial steps, and then kill it.

Or with your suggestions, and what I can see already,
I might just be able to read the kernel code and spot
a bug.

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread Mouse
>> aren't there known issues with the tc you mentioned and SMP systems?
> Don't know - but one possibility that would fit the symptoms would be
> if there are multiple usage counters (one per cpu) and somehow the
> values are just coming from one of them, and it isn't the right one.

Hmm...have you tried a non-MP kernel?  Have you tried nailing your test
down on each of the available CPUs?  Have you tried running various
numbers of infinite-loop CPU-burners at the same time?

Each of those might hint at the actual problem.

Maybe ktrace could use a "migrated to a different CPU" tracepoint?

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread Robert Elz
Date:Thu, 05 Apr 2018 04:08:08 +1000
From:matthew green 
Message-ID:  <19468.1522865...@splode.eterna.com.au>

  | this sounds like the timecounter goes backwards issue.

Almost anything is possible, but I doubt it.   This is far
too consistent (that is, it does not just happen sometimes
and I was "lucky" to catch it, it happens every time - for
this process anyway, which is the only one I know that
loops calling getrusage() and not doing any other sys calls.)

The number of iterations before it switches to 0 varies (I
have seen it from the very first call, once) and of course,
before it goes to 0, the actual values vary, but every
time it gives a 0 usage value, and never changes after.

  | aren't there known issues with the tc you mentioned and
  | SMP systems?

Don't know - but one possibility that would fit the symptoms
would be if there are multiple usage counters (one per cpu)
and somehow the values are just coming from one of them,
and it isn't the right one.   Or something like that (sounds
very implausible, I'd expect them to be summed.)

Certainly the way it is being used (on a more or less
standalone system doing nothing else, and wanting
to simply consume CPU until it decides it has had enough)
would fit - usage from all CPUs except "the one" would be
zero, only the one the process is running on would (should)
be different.

kre

ps: I doubt this has anything to do with timekeeping in any
way at all, I suspect a getrusage() specific problem, relating
to collecting the data - not to accumulating it in the first place
(but that's just a gut feeling.)



re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread matthew green
this sounds like the timecounter goes backwards issue.

for a long on the sparc port time we saw issues where
getrusage() would return garbage (very large) values for
times.  after a lot of debugging, i tracked it down to
the tc giving a lower value than it should have, without
ever getting close to overflowing, which ends up sending
negative runtime down into the system.

aren't there known issues with the tc you mentioned and
SMP systems?

this is the fix i applied for sparc:

http://mail-index.netbsd.org/source-changes/2018/01/12/msg091064.html

but note that it has a common source, not per-cpu one.


.mrg.


Re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread Robert Elz
Date:Wed, 4 Apr 2018 17:16:32 - (UTC)
From:mlel...@serpens.de (Michael van Elst)
Message-ID:  

  | That's what you get with statistics.

Not what I am seeing I don't think.

I understand those issues, and none of that matters for the
code in question (which has been there for a long long time.)

  | As a result, the user and system times reported by getrusage
  | are neither precise nor necessarily monotonic.

While not exactly OK, that's a don't care for this use.

All that matters is that the values are something like an
approximation for the time used (when the CPU has been
100% busy in this process for a second or two, for example).

What is happening is that the results are going back to 0.
Not near 0, not fluctuation, they jump to 0, and stay fixed
there.

That's not statistics, that is something going wrong.   What
I don't know at the minute is what that is (other than a quick
glance, I haven't started on debugging the kernel end).

kre



Re: Is there a current known getrusage() problem (amd64)

2018-04-04 Thread Michael van Elst
k...@munnari.oz.au (Robert Elz) writes:

>pkcs5_pbkdf2_time(24, 64): -1017 us (0.1017 -> 0.0)

>Somehow, getrusage(RUSAGE_SELF, &end) has resulted in 0
>for the user time (which then results in negative consumed user
>time - or that's what is calculated.)

That's what you get with statistics.

User and system time are sampled very coarsely. The reported
times are interpolated from the total run time (which is measured
precisely) and the ratio of user vs. system time.

As a result, the user and system times reported by getrusage
are neither precise nor necessarily monotonic.

See bin/10201, kern/30115.


-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."