On Wed, 23 Mar 2016, John Baldwin wrote:

On Wednesday, March 23, 2016 09:58:42 AM Konstantin Belousov wrote:
On Mon, Mar 21, 2016 at 11:12:57AM -0700, John Baldwin wrote:
On Saturday, March 19, 2016 05:22:16 AM Konstantin Belousov wrote:
On Fri, Mar 18, 2016 at 07:48:49PM +0000, John Baldwin wrote:

-       for (x = 0; x < delay; x += 5) {
+       for (x = 0; x < delay; x++) {
                if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
                    APIC_DELSTAT_IDLE)
                        return (1);
-               DELAY(5);
+               DELAY(1);
        }
        return (0);
 }

Ideally we would structure the loop differently.  I think it is more
efficient WRT latency to only block execution with ia32_pause() and to
compare getbinuptime() results to calculate the elapsed time on each
loop step.
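
Roughly the loop structure I mean, sketched here with the sbintime_t
wrappers around binuptime() for brevity; only the lapic/APIC names are
taken from the existing code, the rest (function name, conversion) is
illustrative and not the actual patch:

static int
lapic_ipi_wait_tc(int delay_us)
{
	sbintime_t end;

	/* Bound the wait by uptime instead of counting DELAY() steps. */
	end = sbinuptime() + delay_us * SBT_1US;
	for (;;) {
		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
		    APIC_DELSTAT_IDLE)
			return (1);
		if (sbinuptime() >= end)
			return (0);
		ia32_pause();
	}
}

The loop then exits as soon as the IPI is delivered instead of burning
fixed DELAY() steps, and the timeout is bounded by real elapsed time.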

Yes.  I've thought about using the TSC directly to do that, but folks are
worried about the TSC being unstable due to vcpus in a VM migrating
across physical CPUs.  DELAY() does seem to DTRT in that case assuming the
hypervisor doesn't advertise an invariant TSC via cpuid.  We'd have to
essentially duplicate DELAY() (really delay_tc()) inline.
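
For concreteness, the inline duplication would amount to something like
this sketch; it assumes the existing rdtsc() and tsc_freq and ignores
the non-invariant-TSC and wraparound cases entirely:

static int
lapic_ipi_wait_tsc(int delay_us)
{
	uint64_t end;

	/* Convert the timeout to TSC ticks; assumes tsc_freq is valid. */
	end = rdtsc() + tsc_freq / 1000000 * (uint64_t)delay_us;
	while (rdtsc() < end) {
		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
		    APIC_DELSTAT_IDLE)
			return (1);
		ia32_pause();
	}
	return (0);
}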

If the TSC has the behaviour you described, i.e. suddenly jumping by
random steps on a single CPU from the point of view of the kernel, then
the system is seriously misbehaving.  The timekeeping stuff would be badly
broken regardless of the ipi_wait().  I do not see why we should worry
about that in ipi_wait().

I proposed a slightly different thing, i.e. using the timekeeping code to
indirect to the TSC if it is configured so.  Below is the proof-of-concept
patch; the use of nanouptime() may be too naive, and binuptime() would
cause a tiny bit less overhead, but I do not want to think about the
arithmetic.

As you noted, the issue is if a timecounter needs locks (e.g. i8254),
though outside of that I think the patch is great. :-/  Of course, if the
TSC isn't advertised as invariant, DELAY() is talking to the timecounter
directly as well.

The i8254 locks work better in practice than in theory.  Timecounter code
is called from very low levels (fast interrupt handlers) and must work
from there.  And the i8254 timecounter does work in fast interrupt handlers.
The above loop is slightly (?) lower level, so it must be more careful.

DELAY() talks directly to the i8254 if the TSC is not invariant and
the timecounter uses the i8254.  Then the timecounter is slow and
otherwise unusable for DELAY(), since it would deadlock in ddb, so the
old i8254 DELAY(), which is faster and more careful, is used.  The same
(fudged recursive) locking would work here.  But you don't want to use
the i8254 or any other slow timecounter hardware or software.  They
all have a large latency of ~1 usec minimum.

However, I think we probably can use the TSC.  The only specific note I got
from Ryan (cc'd) was about the TSC being unstable as a timecounter under KVM.
That doesn't mean that the TSC is non-monotonic on a single vCPU.  In fact,

It also doesn't need to be invariant provided it is usually monotonic
and doesn't jump ahead by a lot.  Or you can just use a calibrated
loop.  The calibration gets complicated if the CPU is throttled or
otherwise has a variable frequency.  One case is a loop with
ia32_pause() in it.  The pause length can be calibrated for most cases
and is probably longer than the rest of the loop, but it is hard to
be sure that the CPU didn't change it without telling you.  Long loops
can easily recalibrate themselves by checking an external timer not
very often, but that doesn't work for short loops (ones shorter than
the timer access time).
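
For the easy cases the calibration can be a single measurement against
the timecounter, something like the sketch below; pause_calibrate() and
npauses_per_us are made-up names, and a one-shot measurement is exactly
what breaks when the CPU later changes the pause length without telling
you:

static u_int npauses_per_us = 1;	/* set once, e.g. at boot */

static void
pause_calibrate(void)
{
	sbintime_t t0, t1;
	u_int i;

	/* Time a burst of pauses against the timecounter once. */
	t0 = sbinuptime();
	for (i = 0; i < 100000; i++)
		ia32_pause();
	t1 = sbinuptime();
	if (t1 - t0 >= SBT_1US)
		npauses_per_us = 100000 / ((t1 - t0) / SBT_1US);
	if (npauses_per_us == 0)
		npauses_per_us = 1;
}

static void
pause_spin(u_int delay_us)
{
	u_int i;

	/* No timer access at all on the hot path. */
	for (i = 0; i < delay_us * npauses_per_us; i++)
		ia32_pause();
}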

thinking about this more, I have a different theory to explain how the TSC
can be out of whack on different vCPUs even if the hardware TSC is in sync
in the physical CPUs underneath.

One of the things present in the VMCS on Intel CPUs using VT-x is a TSC
adjustment.  The hypervisor can alter this TSC adjustment during a VM-exit to
alter the offset between the TSC in the guest and the "real" TSC value in the
physical CPU itself.  One way a hypervisor might use this is to try to
"pause" the TSC during a VM-exit by taking TSC timestamps at the start and
end of a VM-exit and adding that delta to the TSC offset just before each
VM-entry.  However, if you have two vCPUs, one of which is running in the
guest and one of which is handling a VM-exit in the hypervisor, the TSC on
the first vCPU will run while the effective TSC of the second vCPU is paused.
When the second vCPU resumes after a VM-entry, its TSC will now "unpause",
but it will lag the first vCPU by however long it took to handle its VM-exit.

It wouldn't surprise me if KVM was doing this.  bhyve does not do this to my
knowledge (so the TSC is actually still usable as a timecounter under bhyve
for some value of "usable").  However, even with this TSC pausing/unpausing,
the TSC would still increase monotonically on a single vCPU.  For the purposes
of DELAY() (and other spin loops on a pinned thread such as in
lapic_ipi_wait()), that is all you need.

Is monotonic really enough?  Suppose you want to wait at least 1 usec.  Then
you can't trust the timer if it does a combination of jumps that add up to
a significant fraction of 1 usec.

To minimise latency, I would try a tight loop with occasional checks.  E.g.,
10-1000 lapic reads separated by ia32_pause()'s, then check the time.  It
isn't clear how to minimise power use for loops like this.  I couldn't
find anything better than mwait for cooling in loops in ddb i/o.
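
That shape of loop, sketched with made-up names and an arbitrary batch
of 100 polls between the (relatively expensive) time checks:

static int
lapic_ipi_wait_batched(int delay_us)
{
	sbintime_t end;
	int i;

	end = sbinuptime() + delay_us * SBT_1US;
	do {
		/* A batch of cheap status polls between time checks. */
		for (i = 0; i < 100; i++) {
			if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
			    APIC_DELSTAT_IDLE)
				return (1);
			ia32_pause();
		}
	} while (sbinuptime() < end);
	return (0);
}

The batch size trades a little overshoot past the timeout against the
cost of reading the timecounter.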

Bruce