
Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Andi Kleen


> 
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants. 

It is.  More hooks in this area wouldn't be appreciated.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Maybe hooking into genapic is the right way to mop up all the uses of
> send_IPI and its variants.  But from a quick grep it doesn't look like
> they get called from too many places...  Most of the callers seem to be
> in arch/i386/kernel/smp.c, so they should be pretty easy to isolate.

Yeah, we'll see once we are crashing and debugging some code ;-)

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Jeremy Fitzhardinge

Chris Wright wrote:
> * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
>   
>> Maybe hooking into genapic is the right way to mop up all the uses of
>> send_IPI and its variants.  But from a quick grep it doesn't look like
>> they get called from too many places...  Most of the callers seem to be
>> in arch/i386/kernel/smp.c, so they should be pretty easy to isolate.
>> 
>
> Yeah, we'll see once we are crashing and debugging some code ;-)
>   
It's the Linux way (tm).

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
> that's a concern, but maybe we can tease it apart in a sensible way.

Yes, that's exactly what I'm saying.  Same with above (the native stuff), since
we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-)  Of course,
dom0 will be another can of worms, but one at a time.

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Jeremy Fitzhardinge

Chris Wright wrote:
> * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
>   
>> I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
>> that's a concern, but maybe we can tease it apart in a sensible way.
>> 
>
> Yes, that's exactly what I'm saying.  Same with above (the native stuff), since
> we don't want a bunch of apic_read type of pv_ops (oh, wait... ;-)  Of course,
> dom0 will be another can of worms, but one at a time.
>   

Yeah, well we're already talking about a two-level model to accommodate
VMI, since it wants the mostly native SMP stuff except for the actual
apic operations.

Maybe hooking into genapic is the right way to mop up all the uses of
send_IPI and its variants.  But from a quick grep it doesn't look like
they get called from too many places...  Most of the callers seem to be
in arch/i386/kernel/smp.c, so they should be pretty easy to isolate.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Jeremy Fitzhardinge

Chris Wright wrote:
> * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
>   
>> Chris Wright wrote:
>> 
>>> I agree with that, but I think that's esp. for things like create and launch
>>> new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
>>> as well.
>>>   
>>>   
>> Well, native would fall back to using the existing arch/i386 versions of
>> those functions, so that's reasonably straightforward.
>> 
>
> It's the fact that we need to leave code in the kernel to run on native,
> but also do something dynamically with that same code when running
> paravirt that I'm referring to.

Why would it be any different to all the other code we've got behind
native pvops?

The ideal simplified case is that we rename
smp_send_stop/send_reschedule/prepare_cpus/etc to native_* versions.  In
the !PARAVIRT case we just call the native_* version directly; in
PARAVIRT we call via the native pv_ops structure.  Under Xen, all these
would be implemented independently from the native versions.
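
A minimal sketch of that arrangement, assuming a hypothetical ops table
(smp_ops_sketch), a hypothetical CONFIG_PARAVIRT_SKETCH switch and stub Xen
functions; none of these names are taken from an existing tree, and the
function bodies only record which backend ran:

```c
/* Illustrative only: the ops table, the config switch and the Xen stubs
 * are hypothetical names, not an existing kernel interface. */
#define CONFIG_PARAVIRT_SKETCH 1   /* remove to get the direct-call build */

enum backend { BACKEND_NONE, BACKEND_NATIVE, BACKEND_XEN };
static enum backend last_backend = BACKEND_NONE; /* who handled the last call */

/* The existing arch/i386 implementations, renamed to native_*. */
static void native_smp_send_reschedule(int cpu) { (void)cpu; last_backend = BACKEND_NATIVE; }
static void native_smp_send_stop(void) { last_backend = BACKEND_NATIVE; }

/* Independent Xen implementations (stubs here). */
static void xen_smp_send_reschedule(int cpu) { (void)cpu; last_backend = BACKEND_XEN; }
static void xen_smp_send_stop(void) { last_backend = BACKEND_XEN; }

struct smp_ops_sketch {
    void (*send_reschedule)(int cpu);
    void (*send_stop)(void);
};

/* Starts out native; a hypervisor backend would overwrite it early in boot. */
static struct smp_ops_sketch smp_ops = {
    .send_reschedule = native_smp_send_reschedule,
    .send_stop       = native_smp_send_stop,
};

static const struct smp_ops_sketch xen_smp_ops = {
    .send_reschedule = xen_smp_send_reschedule,
    .send_stop       = xen_smp_send_stop,
};

#ifdef CONFIG_PARAVIRT_SKETCH
/* PARAVIRT: every caller goes through the ops structure. */
static inline void smp_send_reschedule(int cpu) { smp_ops.send_reschedule(cpu); }
static inline void smp_send_stop(void) { smp_ops.send_stop(); }
#else
/* !PARAVIRT: callers reach the native_* versions directly. */
static inline void smp_send_reschedule(int cpu) { native_smp_send_reschedule(cpu); }
static inline void smp_send_stop(void) { native_smp_send_stop(); }
#endif
```

With the switch defined, assigning xen_smp_ops to smp_ops redirects every
caller without touching the native_* code.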

> No, it's not the IPI itself, it's the way it's often accessed by the rest of
> the kernel (which is intertwined with genapic).  I'm happy to avoid apic
> altogether since it's effectively worthless for Xen other than
> integrating into the existing infrastructure.
>   

I guess by "rest of the kernel" you mean other stuff in arch/i386.  Yes,
that's a concern, but maybe we can tease it apart in a sensible way.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Chris Wright wrote:
> > * Daniel Arai ([EMAIL PROTECTED]) wrote:
> >   
> >> There's no good way to override __send_IPI_shortcut.  I suppose we could add
> >> paravirt ops for __send_IPI_shortcut and every other op that touches the
> >> APIC.
> >> 
> >
> > While that's basically what we did in Xen, it would make more sense to
> > build it into genapic which would give us one common abstraction to base
> > from.  We should avoid adding pv_ops when existing infrastructure exists.
> 
> I was looking at cutting in at a much higher level.  The interface in
>  is a good match for Xen, so I was going to investigate
> making pv_ops at that level and see how it falls out.

I agree with that, but I think that's esp. for things like create and launch
new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
as well.

thanks,
-chris

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> Chris Wright wrote:
> > I agree with that, but I think that's esp. for things like create and launch
> > new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
> > as well.
> >   
> 
> Well, native would fall back to using the existing arch/i386 versions of
> those functions, so that's reasonably straightforward.

It's the fact that we need to leave code in the kernel to run on native,
but also do something dynamically with that same code when running
paravirt that I'm referring to.  Xen punts on this right now by
#ifdef'ing away as happy as can be.

> There'll need to
> be a bit of internal rearrangement so that the Xen code can call in to
> do things like set up the pda/gdt and other bits of CPU state.
> 
> I don't think IPI is especially interesting in itself, is it?   It's a
> necessary mechanism to implement smp_call_function(), but Xen can do IPI
> without having to invoke any of the existing apic-based IPI code.  The
> other main user of IPI is cross-cpu tlb shootdown, but Xen has much more
> efficient mechanisms than IPI for that (so we'll need to make the tlb
> pv_ops interface a little wider to pass down a cpuset).

No, it's not the IPI itself, it's the way it's often accessed by the rest of
the kernel (which is intertwined with genapic).  I'm happy to avoid apic
altogether since it's effectively worthless for Xen other than
integrating into the existing infrastructure.

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> > While that's basically what we did in Xen, it would make more sense 
> > to build it into genapic which would give us one common abstraction 
> > to base from.  We should avoid adding pv_ops when existing 
> > infrastructure exists.
> 
> I was looking at cutting in at a much higher level.  The interface in 
>  is a good match for Xen, so I was going to investigate 
> making pv_ops at that level and see how it falls out.

yes, yes, yes. Finally someone with a clue about APIs ;-)

Basically, we want to think about the hypercall API more like a system 
call API, not like a hardware API! There will probably still be lowlevel 
details like ptes for a long time - but even those are not quite 
necessary.

And the reason is really fundamental: those system-call alike APIs are 
going to be the /most stable ones/ over time! 'Send stuff from A to B' 
or 'notify X about event Y' is /ALOT/ more stable across hardware 
variations than 'IDTs, vectors, apics or ptes'. And that is so precisely 
because these are fundamental actions that physical matter can do, and 
those do not get changed when new silicon comes out. In that sense Xen's 
hypervisor API is saner than VMI.

the most highlevel API is what UML uses today (and it clearly overdoes 
abstraction), still i was able to get basic UML performance close to 
native performance, via extending a few Linux system calls to enable the 
management of multiple sets of pagetables (each represented by a 
separate fd) via a single hypervisor-level process, and feeding back raw 
pagefault events to the hypervisor. (that was UML's SKAS concept 
combined with sys_remap_file_pages_prot() and sys_vcpu())

Now the practical problem with UML is that nobody has tried to make an 
UML native+guest 'shared kernel image', and hence it's unusable for 
distros. But there is no conceptual problem with UML's virtualization 
model.

Ingo

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Jeremy Fitzhardinge

Chris Wright wrote:
> I agree with that, but I think that's esp. for things like create and launch
> new vcpu.  The IPI bit I'm not as clear on, nor running this all on native
> as well.
>   

Well, native would fall back to using the existing arch/i386 versions of
those functions, so that's reasonably straightforward.  There'll need to
be a bit of internal rearrangement so that the Xen code can call in to
do things like set up the pda/gdt and other bits of CPU state.

I don't think IPI is especially interesting in itself, is it?   It's a
necessary mechanism to implement smp_call_function(), but Xen can do IPI
without having to invoke any of the existing apic-based IPI code.  The
other main user of IPI is cross-cpu tlb shootdown, but Xen has much more
efficient mechanisms than IPI for that (so we'll need to make the tlb
pv_ops interface a little wider to pass down a cpuset).
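
A rough sketch of what "a little wider" could mean, assuming a hypothetical
32-bit CPU mask type and a hypothetical ops structure. The native version
is modelled as interrupting each target CPU, the hypervisor version as one
call that takes the whole set; the counters stand in for the real work:

```c
#include <stdint.h>

/* Hypothetical mask type: bit n set means CPU n must flush. */
typedef uint32_t cpumask_sketch_t;

/* Hypothetical widened interface: the whole target set is passed down. */
struct tlb_ops_sketch {
    void (*flush_tlb_others)(cpumask_sketch_t cpus);
};

static unsigned int interrupts_delivered; /* native model: one per target CPU */
static unsigned int hypercalls_made;      /* hypervisor model: one per request */

/* Native model: every CPU in the set takes a shootdown interrupt. */
static void native_flush_tlb_others(cpumask_sketch_t cpus)
{
    for (int cpu = 0; cpu < 32; cpu++)
        if (cpus & (UINT32_C(1) << cpu))
            interrupts_delivered++;
}

/* Hypervisor model: hand the full set over in a single call. */
static void hv_flush_tlb_others(cpumask_sketch_t cpus)
{
    if (cpus)
        hypercalls_made++;
}

static const struct tlb_ops_sketch native_tlb_ops = { native_flush_tlb_others };
static const struct tlb_ops_sketch hv_tlb_ops     = { hv_flush_tlb_others };
```

The point of passing the set down is that the backend, not the caller,
decides how to reach the other CPUs.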

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Jeremy Fitzhardinge

Chris Wright wrote:
> * Daniel Arai ([EMAIL PROTECTED]) wrote:
>   
>> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
>> paravirt ops for __send_IPI_shortcut and every other op that touches the 
>> APIC. 
>> 
>
> While that's basically what we did in Xen, it would make more sense to
> build it into genapic which would give us one common abstraction to base
> from.  We should avoid adding pv_ops when existing infrastructure exists.
>   

I was looking at cutting in at a much higher level.  The interface in
 is a good match for Xen, so I was going to investigate
making pv_ops at that level and see how it falls out.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Daniel Arai ([EMAIL PROTECTED]) wrote:
> Chris, would you like to work together on this?  I don't know what Xen's 
> requirements are for the APIC interface.  Do you think we could come up 
> with something that would fit both of our needs, and maybe also be usable 
> for some of the subarch-specific code?

Sure, we just have a pretty small genapic_xen, and then enough (hackery,
this should be sorted out) to use that genapic and have an effective
override for __send_IPI_shortcut.

thanks,
-chris

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Ingo Molnar


* Chris Wright <[EMAIL PROTECTED]> wrote:

> > Chris, would you like to work together on this?  I don't know what 
> > Xen's requirements are for the APIC interface.  Do you think we 
> > could come up with something that would fit both of our needs, and 
> > maybe also be usable for some of the subarch-specific code?
> 
> Sure, we just have a pretty small genapic_xen, and then enough 
> (hackery, this should be sorted out) to use that genapic and have an 
> effective override for __send_IPI_shortcut.

genapic is still too lowlevel: as Thomas mentioned what we want is a 
virtual interrupt controller used by /all/ hypervisors (and mapped to 
their respective hypervisor ABIs via the backend).

Ingo

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Daniel Arai


Chris Wright wrote:
> * Daniel Arai ([EMAIL PROTECTED]) wrote:
>> There's no good way to override __send_IPI_shortcut.  I suppose we could add
>> paravirt ops for __send_IPI_shortcut and every other op that touches the APIC.
>
> While that's basically what we did in Xen, it would make more sense to
> build it into genapic which would give us one common abstraction to base
> from.  We should avoid adding pv_ops when existing infrastructure exists.

I agree with this.

Chris, would you like to work together on this?  I don't know what Xen's 
requirements are for the APIC interface.  Do you think we could come up with 
something that would fit both of our needs, and maybe also be usable for some of 
the subarch-specific code?


Dan.

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Zachary Amsden ([EMAIL PROTECTED]) wrote:
> s/do/will (smpboot.c)

Well the current Xen mechanism rather dodges all of that (for bits like
IPI apicid).

thanks,
-chris

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Chris Wright

* Daniel Arai ([EMAIL PROTECTED]) wrote:
> There's no good way to override __send_IPI_shortcut.  I suppose we could add 
> paravirt ops for __send_IPI_shortcut and every other op that touches the 
> APIC. 

While that's basically what we did in Xen, it would make more sense to
build it into genapic which would give us one common abstraction to base
from.  We should avoid adding pv_ops when existing infrastructure exists.

thanks,
-chris

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Rusty Russell

On Thu, 2007-03-08 at 09:01 +0100, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> 
> > > Your implementation is almost the perfect prototype, if you move the 
> > > 128 bit hackery into the hypervisor and hide it away from the kernel 
> > > :)
> > 
> > The point is to use the tsc to avoid making any hypercalls, so dealing 
> > with the tsc->ns conversion has to happen on the guest side somehow.
> 
> you are obsessed with avoiding a hypercall, but why? Granted it's slow 
> especially on things like SVM/VMX, but it's not fundamentally slow. We 
> definitely do not want to design our whole APIs and abstractions around 
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls 
> to be put into silicon just as much as SYSENTER was put into silicon. 

Indeed, I expect them to fall somewhere between system calls and context
switches.  Perhaps not slow, but definitely worth minimising.

> Anyway, in terms of guest time code, a /big/ amount of design junk can 
> be avoided by not trying to do sillynesses like 'virtual time'. The TSC 
> is awfully unreliable.

You mean stolen time?

I find this whole discussion really irritating, to be honest.  I just
want Thomas to implement the timer code for lguest, because that code
scares me...

I look forward to your patch 8)
Rusty.



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Jeremy Fitzhardinge

Ingo Molnar wrote:
> you are obsessed with avoiding a hypercall, but why? Granted it's slow 
> especially on things like SVM/VMX, but it's not fundamentally slow. We 
> definitely do not want to design our whole APIs and abstractions around 
> the temporary notion that 'hypercalls are slow'.

Sure.  But the specific case we're talking about here is a 300 line
clock driver.  Nothing about its implementation has any effect on the
kernel's APIs or abstractions.

>  I'd expect hypercalls 
> to be put into silicon just as much as SYSENTER was put into silicon. 
>   
Sysenter is marginally faster than int $80, but not massively so.  I
guess Xen could use sysenter now for hypercalls, since it's only useful
for getting into ring 0.

> Anyway, in terms of guest time code, a /big/ amount of design junk can 
> be avoided by not trying to do sillynesses like 'virtual time'. 
Well, if you have a hypervisor scheduler multiplexing vcpus onto a real
cpu at 100hz and a kernel scheduler multiplexing processes onto a vcpu
at 100hz, then you're going to get a lot of disappointed processes who
nominally got their 10ms real-time slice, but it was all spent on some
other vcpu.   It's important that the kernel's scheduler know how much
vcpu time each process really got, rather than basing its scheduling on
the amount of real time that passed.
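
As a small illustration of that accounting, assuming a hypothetical sample
structure that carries both elapsed real time and the time the hypervisor
spent running something else on this vcpu (the layout and names are made
up for the example):

```c
#include <stdint.h>

/* Hypothetical per-vcpu clock sample; not an existing hypervisor ABI. */
struct vcpu_clock_sample {
    uint64_t wall_ns;   /* real time elapsed since boot */
    uint64_t stolen_ns; /* time this vcpu was runnable but not running */
};

/* Time a process actually ran on the vcpu between two samples:
 * real time elapsed minus the part that went to other vcpus. */
static uint64_t task_ran_ns(struct vcpu_clock_sample start,
                            struct vcpu_clock_sample end)
{
    uint64_t real   = end.wall_ns - start.wall_ns;
    uint64_t stolen = end.stolen_ns - start.stolen_ns;

    return stolen >= real ? 0 : real - stolen;
}
```

A process whose nominal 10ms slice saw 7ms go to another vcpu would be
charged 3ms rather than 10ms.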

> The TSC 
> is awfully unreliable.
>   
Sure.

> /THIS/ is the kind of junk we are trying to protect Linux against. 
>   

What?  That Xen happens to use the tsc as part of its hypervisor
interface?  A fact that's completely isolated from the rest of the
kernel behind the clock subsystem?
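
For concreteness, a guest-side tsc->ns conversion can be sketched roughly
as follows. The structure is hypothetical (not the actual Xen interface),
the multiplier is assumed to be a 32.32 fixed-point nanoseconds-per-cycle
value, and unsigned __int128 is a GCC/Clang extension used here to hold
the wide intermediate product:

```c
#include <stdint.h>

/* Hypothetical time record published by the hypervisor to the guest. */
struct tsc_time_info {
    uint64_t tsc_base;      /* TSC value at the last update */
    uint64_t ns_base;       /* nanoseconds since boot at that update */
    uint32_t tsc_to_ns_mul; /* ns per cycle, 32.32 fixed point */
};

/* Convert a raw TSC reading to nanoseconds without a hypercall. */
static uint64_t tsc_to_ns(const struct tsc_time_info *t, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - t->tsc_base;
    /* 64 x 32 bits can overflow 64 bits, hence the wide product. */
    unsigned __int128 prod = (unsigned __int128)delta * t->tsc_to_ns_mul;

    return t->ns_base + (uint64_t)(prod >> 32);
}
```

With a 2GHz TSC the multiplier would be 0x80000000 (0.5 ns per cycle), so
2000 cycles after the base the result is ns_base + 1000.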

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Keir Fraser

On 8/3/07 08:01, "Ingo Molnar" <[EMAIL PROTECTED]> wrote:

> you are obsessed with avoiding a hypercall, but why? Granted it's slow
> especially on things like SVM/VMX, but it's not fundamentally slow. We
> definitely do not want to design our whole APIs and abstractions around
> the temporary notion that 'hypercalls are slow'. I'd expect hypercalls
> to be put into silicon just as much as SYSENTER was put into silicon.

If syscalls are already so fast, why does Linux have vgettimeofday()?

 -- Keir



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> > Your implementation is almost the perfect prototype, if you move the 
> > 128 bit hackery into the hypervisor and hide it away from the kernel 
> > :)
> 
> The point is to use the tsc to avoid making any hypercalls, so dealing 
> with the tsc->ns conversion has to happen on the guest side somehow.

you are obsessed with avoiding a hypercall, but why? Granted it's slow 
especially on things like SVM/VMX, but it's not fundamentally slow. We 
definitely do not want to design our whole APIs and abstractions around 
the temporary notion that 'hypercalls are slow'. I'd expect hypercalls 
to be put into silicon just as much as SYSENTER was put into silicon. 
Anyway, in terms of guest time code, a /big/ amount of design junk can 
be avoided by not trying to do sillynesses like 'virtual time'. The TSC 
is awfully unreliable.

really, it's a bit as if Linus looked at his 386DX CPU when he bought it 
16 years ago and decided that: "this CPU executes 16-bit code much 
faster than 32-bit code, so lets base this new toy OS on 16-bit code. 
Sure, it's a bit of a pain to use, compared to 32-bit code, but users 
demand performance!".

/THIS/ is the kind of junk we are trying to protect Linux against. 
Basically hypervisors are a way to prolong hardware legacies, and 
because, unlike real hardware, software ABIs don't actually burn out with 
time, and people are stubborn about using them, their effects are a lot 
worse and a lot longer than that of legacy hardware.

Ingo

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-08 Thread Zachary Amsden


Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
>> But more importantly, we want a kernel that can run both on native hardware and
>> in a paravirtualized environment.  Linux doesn't really provide abstractions for
>> replacing the appropriate code.  We tried to hook into the source code at a
>> level that seemed possible.
>
> Again. You just refuse to change your implementation and you want to
> keep it by arguing how hard it is because there are no abstractions.


It is no longer possible to change our _hypervisor_ implementation.  The 
Linux side of our code is entirely flexible, and we are trying to change 
it, but it hasn't always been clear what you want us to do.



> Your prayer wheel argument of missing abstractions and easiness of
> emulating things is annoying. If you think it is better to emulate APIC,
> please emulate it without paravirt ops. If you want the speed
> improvement, work with us to create the interfaces and abstractions
> which are necessary to have a sane, maintainable and useful for all
> hypervisors implementation.


That's what we are doing.  Our prayer wheel would be easier appeased if 
you actually told us which parts of the VMI timer you objected to.  As I 
understand it now:


1) We should not call into external functions in other time sources; any 
common code should be merged up
2) We should not be using global_clock_event; it is a horrible hack 
which you want to remove
3) We should not use the smp_apic_timer_interrupt assembly code which 
calls up to the lapic timer handlers
4) We should not add our own assembly code to call out to a local timer 
handler (from Ingo)


These last two points create a conflict which is a little tricky to 
solve.  We can't add our own custom timer handler, and we can't re-use 
the APIC timer handler.  But there is no timer handler available on i386 
that works, since the handlers will fall back to either PIC or IO-APIC 
edge handling.  Using either of those for the local timer interrupt on 
SMP does not work because they assume traditional IRQ semantics - an IRQ 
raised from the bus should be serviced by one processor.  Re-raises of 
the same IRQ on remote processors are locked out by the handler, and 
dropped.  Thus simultaneous local timers firing on multiple CPUs cause 
only one to be serviced.
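
A toy model of that lock-out, written as plain user-space C rather than
kernel code. The function names and the single shared in-progress flag are
assumptions made for illustration; they only mimic the behaviour described
above, where simultaneous raises on several CPUs collapse into one
serviced interrupt, against a per-CPU handler that services each raise:

```c
#include <stdbool.h>

#define NCPUS 4

/* Global-IRQ model: one in-progress flag shared by all CPUs.  The raises
 * are treated as simultaneous, so the flag is still set when the other
 * CPUs arrive and their raises are dropped. */
static int service_shared_flag(const bool raised[NCPUS], bool serviced[NCPUS])
{
    bool in_progress = false;
    int count = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        serviced[cpu] = false;
        if (!raised[cpu])
            continue;
        if (in_progress)
            continue;       /* locked out by the handler, and dropped */
        in_progress = true;
        serviced[cpu] = true;
        count++;
    }
    return count;
}

/* Per-CPU model: no shared state, every CPU services its own raise. */
static int service_percpu(const bool raised[NCPUS], bool serviced[NCPUS])
{
    int count = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        serviced[cpu] = raised[cpu];
        if (raised[cpu])
            count++;
    }
    return count;
}
```

With all four CPUs raising at once, the shared-flag model services one
timer and the per-CPU model services four.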


This does not work for local timer interrupts in NO_HZ mode, because 
they must always be serviced so that they can reschedule the next local 
timer.  I have a proposed solution to this issue, but it fails to work 
when the IO-APIC assumes control of all IRQs based on ACPI results 
(which we control, but can't change because of compatibility issues with 
other operating systems).


My proposal is to keep IRQ-0 as the timer interrupt, on all CPUs, but 
fire it from the LAPIC after local apic timers get initialized.  We 
would do this by converting the irq handler using set_irq_handler(0, 
handle_percpu_irq).  The only problem is the IO-APIC code will want to 
take over IRQ0 and convert it to an edge triggered IO-APIC interrupt.  
But for the local irq handlers to work, we have to keep them using the 
handle_percpu_irq handler, and can't let the IO-APIC steal these 
vectors.  There is no way to do this conditionally for just a specific set of 
IRQs in tree today, so we would need to add a special case to io_apic.c 
to allow early boot code to reserve specific vectors so they are not 
subsumed by the IO-APIC.  This seems reasonable, but is a special case.


If, on the other hand, we are allowed to use our own assembly code to 
call out to our local timer handler (dropping constraint #4 above), we 
can simply rewire LOCAL_TIMER_VECTOR to point to this code, but now we 
must emulate the semantics of irq_enter / leave / etc inside our code, 
which is also not the cleanest solution.  We used to do this, and it 
caught flak I believe from Ingo.


The basic problem is that a local IRQ doesn't behave like a global IRQ, 
and the i386 backend is unaware of how to set up any local IRQs except 
in the case of local APIC, but you have told us we should not re-use the 
APIC handlers by overloading global_clock_event.  The patches we sent 
out recently did just this, but seemed to meet even more violence than 
our previous way of doing things.


So the question is, which approach do you prefer?

Zach

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 17:01 -0800, Daniel Arai wrote:
> Thomas Gleixner wrote:
> 
> > You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> > so why is it sooo desirable to emulate apics instead of substituting it
> > by a small and sane replacement ? Just because you happen to have an
> > LAPIC emulator ? That's no reason to wire yourself into the kernel code
> > and make it harder to change and maintain.
> 
> There are several reasons why it's desirable to emulate the APIC.  As you 
> mentioned, we already have APIC emulation, and APIC emulation isn't a huge 
> bottleneck on most workloads.  Our code works, the Linux code works, and 
> replacing both pieces of code with something "small and sane" isn't going to 
> improve performance very much, so why bother?  Any hypervisor implementation
> is going to be a tradeoff between what's easy to implement in the hypervisor,
> what's easy to implement in the guest operating system, and what's
> performance critical.

It is not about performance. It is about maintainability. 

> Secondly, not all (para-)virtualized operating systems will want to use
> abstracted devices.  Some virtual operating systems will be given direct
> access to hardware devices, and will need to run the actual driver for that
> device and not some abstracted device driver.  So I don't buy your argument
> that every piece of the kernel that interacts with a paravirtualized driver
> should have a "small and sane replacement."

Err. We talk about paravirtualized Linux and not about what you have to
emulate to get Windows running. I don't care at all. Do you really
expect that we have to accept your design decisions, just because they
allow you to make your life easy ? This is exactly what you are using
paravirt ops for: a backdoor to throw your hackery at the kernel and
leave us with the mess of hardwired crap.

> But more importantly, we want a kernel that can run both on native hardware
> and in a paravirtualized environment.  Linux doesn't really provide
> abstractions for replacing the appropriate code.  We tried to hook into the
> source code at a level that seemed possible.

Again. You just refuse to change your implementation and you want to
keep it by arguing how hard it is because there are no abstractions.

I went through the business of creating abstractions into hardwired
hairballs twice. I know exactly what I'm talking about. It _IS_ hard
work, but at the end it makes the code better and more maintainable. You
do nothing for that, but expect that we live with your addons to the
hairball.

> There's no good way to override __send_IPI_shortcut.  I suppose we could add
> paravirt ops for __send_IPI_shortcut and every other op that touches the
> APIC.  But there are dozens of functions in apic.c that would need to be
> included in paravirt ops.  And for our implementation, we really just want to
> override apic_read and apic_write, since we can make these faster when done
> through hypercalls than through memory accesses.  If we were to make these
> paravirt ops, their implementations would be the same, except with a
> different apic_read and apic_write.  This is a whole lot of useless code
> duplication.

No it is not. #include  is an abstraction and
__send_IPI ... is the i386 low level implementation.

You insist on hooking yourself into the low level code instead of hooking
into the high level code, because it is _YOUR_ implementation and we
have to accept it as is.

This is completely the wrong way. We will get the same crap and the same
discussion for every other architecture we are going to support with
paravirt ops. And probably for every other hypervisor implementation
which has a different way of doing things.

> Most of the interrupt system is not written in such a way that multiple APIC
> implementations can be selected from at boot time.  This is an absolute
> requirement so that the same kernel can boot on native and in a
> paravirtualized environment.  While this could be implemented, it seems like
> a waste of time, since we can just emulate something similar to a real
> interrupt system and not change things very much.

Waste of your precious time. I'm working on low level code and
abstractions and from now on I also have to take care not to break
_YOUR_ implementation. You are going to waste _MY_ time and I'm going to
fight that forever.

Your prayer wheel argument of missing abstractions and easiness of
emulating things is annoying. If you think it is better to emulate APIC,
please emulate it without paravirt ops. If you want the speed
improvement, work with us to create the interfaces and abstractions
which are necessary to get an implementation that is sane, maintainable
and useful for all hypervisors.

tglx


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 17:23 -0800, Jeremy Fitzhardinge wrote:
> Daniel Arai wrote:
> > But more importantly, we want a kernel that can run both on native
> > hardware and in a paravirtualized environment.  Linux doesn't really
> > provide abstractions for replacing the appropriate code.  We tried to hook
> > into the source code at a level that seemed possible.
> >   
> 
> Xen doesn't support any kind of apic emulation, so we'll need to hook
> anything which relies on an apic.  The ipi code you quote below will
> probably be one of those.
> 
> My opinion is that pv_ops shouldn't have raw apic operations, but
> instead have appropriate high-level interfaces to achieve the same
> ends.  Zach's counter-argument was basically yours: that the VMI code
> will use a lot of the native code except for the actual apic operations.
> 
> I can live with VMI emulating apics if it wants, so long as it does it
> in private and doesn't make a big scene about it.  We'll need the
> high-level interfaces regardless.

I can't, because it reaches out into non-private parts of the low level
implementation and does not help to disentangle things or make the
overall code better. No, it forces its own view of the world on us
without giving us anything back.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Daniel Arai wrote:
> But more importantly, we want a kernel that can run both on native hardware
> and in a paravirtualized environment.  Linux doesn't really provide
> abstractions for replacing the appropriate code.  We tried to hook into the
> source code at a level that seemed possible.
>   

Xen doesn't support any kind of apic emulation, so we'll need to hook
anything which relies on an apic.  The ipi code you quote below will
probably be one of those.

My opinion is that pv_ops shouldn't have raw apic operations, but
instead have appropriate high-level interfaces to achieve the same
ends.  Zach's counter-argument was basically yours: that the VMI code
will use a lot of the native code except for the actual apic operations.

I can live with VMI emulating apics if it wants, so long as it does it
in private and doesn't make a big scene about it.  We'll need the
high-level interfaces regardless.
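
As a rough illustration of what "high-level" means here, the following is a
minimal user-space sketch in which the hook sits at "send an IPI" rather than
at "write an APIC register". Every name in it (struct ipi_ops, select_ipi_ops,
the counters) is made up for this example and is not the actual pv_ops layout.

```c
/*
 * Hypothetical high-level IPI interface: callers ask for an IPI and
 * never see APIC registers. Not the real pv_ops structure.
 */
struct ipi_ops {
	const char *name;
	void (*send_ipi_allbutself)(int vector);
};

/* Counters standing in for real side effects, so the sketch is observable. */
static int apic_icr_writes;	/* native path: would write the APIC ICR */
static int hypercalls;		/* hypervisor path: would issue a hypercall */

static void native_send_ipi_allbutself(int vector)
{
	(void)vector;
	apic_icr_writes++;
}

static void hv_send_ipi_allbutself(int vector)
{
	(void)vector;
	hypercalls++;
}

static struct ipi_ops native_ipi_ops = {
	.name = "native",
	.send_ipi_allbutself = native_send_ipi_allbutself,
};

static struct ipi_ops hv_ipi_ops = {
	.name = "hypervisor",
	.send_ipi_allbutself = hv_send_ipi_allbutself,
};

/* Chosen once at boot; native is the default. */
static struct ipi_ops *ipi = &native_ipi_ops;

static void select_ipi_ops(int on_hypervisor)
{
	ipi = on_hypervisor ? &hv_ipi_ops : &native_ipi_ops;
}

/* A caller in the style of smp_call_function(): no APIC knowledge needed. */
static void smp_call_function_sketch(int vector)
{
	ipi->send_ipi_allbutself(vector);
}
```

With this shape a hypervisor without any APIC emulation only has to supply
its own ops table; nothing on the caller side mentions the APIC.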

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Daniel Arai


Thomas Gleixner wrote:
> You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> so why is it sooo desirable to emulate apics instead of substituting it
> by a small and sane replacement ? Just because you happen to have an
> LAPIC emulator ? That's no reason to wire yourself into the kernel code
> and make it harder to change and maintain.


There are several reasons why it's desirable to emulate the APIC.  As you 
mentioned, we already have APIC emulation, and APIC emulation isn't a huge 
bottleneck on most workloads.  Our code works, the Linux code works, and 
replacing both pieces of code with something "small and sane" isn't going to 
improve performance very much, so why bother?  Any hypervisor implementation is 
going to be a tradeoff between what's easy to implement in the hypervisor, 
what's easy to implement in the guest operating system, and what's performance 
critical.


Secondly, not all (para-)virtualized operating systems will want to use 
abstracted devices.  Some virtual operating systems will be given direct access 
to hardware devices, and will need to run the actual driver for that device and 
not some abstracted device driver.  So I don't buy your argument that every 
piece of the kernel that interacts with a paravirtualized driver should have a 
"small and sane replacement."


But more importantly, we want a kernel that can run both on native hardware and 
in a paravirtualized environment.  Linux doesn't really provide abstractions for 
replacing the appropriate code.  We tried to hook into the source code at a
level that seemed possible.


For example, take smp_call_function().  What this essentially does is call 
send_IPI_allbutself().


void fastcall send_IPI_self(int vector)
{
	__send_IPI_shortcut(APIC_DEST_SELF, vector);
}

void __send_IPI_shortcut(unsigned int shortcut, int vector)
{
	/*
	 * Subtle. In the case of the 'never do double writes' workaround
	 * we have to lock out interrupts to be safe.  As we don't care
	 * of the value read we use an atomic rmw access to avoid costly
	 * cli/sti.  Otherwise we use an even cheaper single atomic write
	 * to the APIC.
	 */
	unsigned int cfg;

	/*
	 * Wait for idle.
	 */
	apic_wait_icr_idle();

	/*
	 * No need to touch the target chip field
	 */
	cfg = __prepare_ICR(shortcut, vector);

	/*
	 * Send the IPI. The write to APIC_ICR fires this off.
	 */
	apic_write_around(APIC_ICR, cfg);
}


There's no good way to override __send_IPI_shortcut.  I suppose we could add 
paravirt ops for __send_IPI_shortcut and every other op that touches the APIC. 
But there are dozens of functions in apic.c that would need to be included in 
paravirt ops.  And for our implementation, we really just want to override 
apic_read and apic_write, since we can make these faster when done through 
hypercalls than through memory accesses.  If we were to make these paravirt ops, 
their implementations would be the same, except with a different apic_read and 
apic_write.  This is a whole lot of useless code duplication.
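
A minimal user-space sketch of the approach described above, in which only
the register write is replaceable and the IPI logic is shared. The
apic_write_op pointer, the fake register page and hv_apic_write are
assumptions made for this illustration, not the VMI code; the two constants
mirror the usual i386 values but should be treated as illustrative.

```c
#include <stdint.h>

#define APIC_ICR	0x300	/* interrupt command register offset (illustrative) */
#define APIC_DEST_SELF	0x40000	/* "self" destination shorthand (illustrative) */

/* Stand-in for the memory-mapped APIC page. */
static uint32_t fake_apic_regs[0x400 / 4];
/* Counts how often the hypothetical hypercall path was taken. */
static unsigned int hypercall_count;

/* Native flavour: a plain store to the (fake) register page. */
static void native_apic_write(unsigned long reg, uint32_t v)
{
	fake_apic_regs[reg / 4] = v;
}

/* Hypervisor flavour: same effect, but routed through a "hypercall". */
static void hv_apic_write(unsigned long reg, uint32_t v)
{
	hypercall_count++;
	fake_apic_regs[reg / 4] = v;
}

/* The single low-level hook; everything above it stays shared. */
static void (*apic_write_op)(unsigned long reg, uint32_t v) = native_apic_write;

/* Shared IPI logic, modelled loosely on __send_IPI_shortcut. */
static void send_ipi_shortcut_sketch(uint32_t shortcut, int vector)
{
	uint32_t cfg = shortcut | (uint32_t)vector;

	apic_write_op(APIC_ICR, cfg);
}
```

The sketch shows why this is attractive to an implementation that already
emulates an APIC: one pointer changes and no IPI code is duplicated. It also
shows the cost raised elsewhere in this thread: the shared code still assumes
APIC register semantics.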


Most of the interrupt system is not written in such a way that multiple APIC
implementations can be selected from at boot time.  This is an absolute 
requirement so that the same kernel can boot on native and in a paravirtualized 
environment.  While this could be implemented, it seems like a waste of time, 
since we can just emulate something similar to a real interrupt system and not 
change things very much.


Dan Arai
VMware, Inc.

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> Sigh. The cut zero hairball is already in mainline. :(
>   

Yes, there were a couple of unfortunate patches in that series, but they
got fast-tracked in with the promise they would get fixed asap.

> Sure. If the clockevent API is changed, then the users get fixed. This
> is not my main concern. The "oh we reuse the PIT interrupt" reachout is
> what makes life hard. VMI already does this extensively and I'm frightened
> by it.
>   

Well, I think they know what's expected of them now.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 15:33 -0800, Jeremy Fitzhardinge wrote:
> > On the other hand we yet see things like:
> >
> > /* We use normal irq0 handler on cpu0. */
> > time_init_hook();
> >
> > Which is just reaching into the kernel code directly and does not handle
> > the clock event interrupt self contained. clockevents is not bound to
> > IRQ0 and this kind of hackery is exactly what we need to avoid in order
> > to get this maintainable.
> >   
> 
> Yes, I'm definitely not arguing with you about this.  I think the first
> cut vmi time code was pretty questionable, but I have confidence they'll
> fix it up before submission.

Sigh. The cut zero hairball is already in mainline. :(

> The point is that when you put the xen and vmi implementations next to
> each other you find that 1) in each case there's a pretty small
> abstraction distance between the clock interface and the hypercall
> interface, and 2) there's very little code which can be shared between
> the two.  Which means that adding another layer of abstraction to
> protect the clock code from paravirtualized time devices is just going
> to add fat without much benefit.

Fair enough.

> > Yes, if they are used in a sane and self contained way without reaching
> > all over the place and expecting that those functions, which are not
> > part of the paravirt interfaces will work for ever.
> >   
> 
> 100% agree.  If the interfaces change, then we'll change the code using
> them like any other kernel code would.  If the new interfaces are hard
> to make work then that's a problem, but one would hope that would get
> shaken out as part of the normal kernel development process.

Sure. If the clockevent API is changed, then the users get fixed. This
is not my main concern. The "oh we reuse the PIT interrupt" reachout is
what makes life hard. VMI already does this extensively and I'm frightened
by it.

> The point is that this code under and around the paravirt_ops interface
> is just normal Linux code, and we expect to participate in the normal
> kernel development process, with all the usual
> discussions/arguments/negotiations over interface changes.  If the code
> loses all its maintainers and becomes orphaned, unresponsive to
> interface changes, then it's like any other dead driver: mark it
> CONFIG_BROKEN and wait for someone to fix it.  But for now and the
> foreseeable future these are going to be actively supported and
> maintained pieces of code.

Ack.

> > You are not increasing the entanglement with the rest of the system,
> > when you use a self contained device on top of an existing core kernel
> > infrastructure, which has a paravirt backend. Quite the contrary, you
> > have one piece of virtual hardware which is connected to the kernel and
> > interacts with the various incarnations on the other side, which can as
> > well live inside the kernel code. Granted it is another level of
> > indirection, but I'd be happy to have only to deal with one of those
> > beasts.
> >   
> 
> Right.  But at that point the interface doesn't really have much of a
> technical basis.  It's really a political border at which you can hand
> off responsibility and make it ours.  I quite understand your
> motivation, but I think you're solving a problem that hasn't happened
> yet, and one that we'd all like to avoid.

Granted.

> I know the vmi time code has coloured your view here, but I surely hope
> it can be got into a better state before posting.  I'm biased of course,
> but I would rather hope that all these drivers we're talking about will
> be as stylistically clean as the Xen time code (which has room for
> improvement, of course).
> 
> There is, however, a median solution which keeps the number of clock
> drivers down but also doesn't involve extending pv_ops.  We can just
> create paravirt_clocksource/paravirt_clockevent helper wrappers, with
> their own internal interfaces to act as a facade for the
> hypervisor-specific code.  I don't think there's much point in doing
> this now, but maybe it will become appealing once we start dealing with
> things like stolen time.

We'll see.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Dan Hecht wrote:
> On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
>> I know the vmi time code has coloured your view here, but I surely hope
>> it can be got into a better state before posting.  I'm biased of course,
>> but I would rather hope that all these drivers we're talking about will
>> be as stylistically clean as the Xen time code (which has room for
>> improvement, of course).
>>
>
> Could you send us comments on where you feel the style needs some
> fixing up?

I think Thomas has covered this in quite a bit of detail already.  But
the fact that the code mentions "apic" or "pit" at all seems
unfortunate, but I guess that's what you have to work with.

> VMI encapsulates all the implementation details away from the kernel,
> whereas the Xen time code puts it all out there in the kernel[...]

This is not an exercise in "my hypervisor is better than yours", it's a
matter of getting clean implementations within the constraints of each
hypervisor interface.  The Xen code may be more verbose than the
corresponding VMI code, but it's self-contained and doesn't make any
demands on the rest of the kernel.

The concern is that the vmi code reaches out and does things like set
global_clock_event, calls time_init_hook and so on - basically
complicating the already ugly lapic/pic legacy time mess, and therefore
making yourself part of the tangle if anyone wants to go in there and
change it.

The question is whether you can make the vmi clock implementation
free-standing, in that it has no dependencies other than well defined
interfaces like the clock api itself, the normal (non-legacy) interrupt
api and, of course, the underlying VMI interface.  But no reach-arounds
into the lapic/pit code.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 15:25 -0800, Zachary Amsden wrote:
> > Looking at vmitimer.c and the number of hardcoded assumptions are
> > telling me, that we are heading in exactly the opposite direction.
> >   
> 
> No, VMI timer is unique because for SMP, it is based on the APIC.  On 
> i386, SMP is hardwired to depend on the APIC, and so we simply re-use 
> the pieces of it which are there, with the same assumptions about irqs, 
> and hardware behavior, good or bad.  We just have a different way of 
> telling the LAPIC when to deliver interrupts.

This is exactly the point. There is no benefit in reusing 3 lines of
lapic interrupt handler code and therefore reaching into it. clockevents
are not connected to lapic on SMP by any means. They are designed to be
self contained, so please use them as designed.

> The alternative is to pretty much completely copy apic.c into vmi.c or 
> vmitimer.c, which seems a rather bad idea, since now two copies of 
> nearly identical code need to be maintained.

You managed to avoid the usage of other code (i.e. PIT / HPET) already,
so why is it sooo desirable to emulate apics instead of substituting it
by a small and sane replacement ? Just because you happen to have an
LAPIC emulator ? That's no reason to wire yourself into the kernel code
and make it harder to change and maintain.

> > Yes, if they are used in a sane and self contained way without reaching
> > all over the place and expecting that those functions, which are not
> > part of the paravirt interfaces will work for ever.
> >   
> 
> But we definitely need pieces of the core APIC dependent code.  Xen 
> needs pieces of it too, but very select pieces for SMP boot.  The 
> ugliness you point out is there, but the reason it is there is not 
> because the paravirt code is cluttered, it is because the i386 code is 
> so hardwired to use the APIC model that there is pain separating from it.
>
> The correct solution here is to properly separate the APIC, SMP, and 
> timer code so the logic of it which we want to reuse is separated from 
> the hardware dependence.  Clock events and clocksources take care of 
> most of the timer issues, but there is still ugliness from SMP timer 
> events depending on having part of the APIC infrastructure for wiring 
> the interrupt gates.

Again: clockevents do not require APIC and do not depend on any APIC
wiring. Your hypervisor is working that way.

> > No it's not an absolute blocker, as long as we can take care, that the
> > number of incarnations is 
> >
> > - designed to be shareable between hypervisors which have the same time
> > model
> > - common code like the 128 bit math is in a shared library
> > - self contained and not reaching out into core kernel code for no good
> > reason
> >   
> > Same goes for clock events, interrupts and other core facilities.
> 
> I think that is what everyone wants.  This is an iterative process.  We 
> certainly don't want to reach out into core kernel code unless there is 
> a good reason to do so, and with every development of clock events, 
> sources, and interrupts, we have less of a reason to do so, and the code 
> gets cleaner and more maintainable.

We have to avoid this reachout in the first place. It just adds more
hardwires into the hairball and makes it harder to disentangle.

If you want the virtualization support in the kernel, then please
understand that "we hardwire now and we'll fix it up once the core kernel
developers serve us the solution on a silver platter" is not going to
work. Please work with us on a proper solution upfront instead of
throwing random hackery with the lame excuse "for a good reason" at us.

You knew exactly, that clockevents & co are on the way to mainline and
there was enough time to work with us on a proper solution. No, you
decided to ignore it, even after people pointed it out to you way before
the 2.6.21 merge window. Now we have the hardwire in place and we can
wait for you to fix it whenever it seems to fit into the vmware business
plan.

I'm not going to accept any further reachout unless there is an urgent
bugfix in the release cycle, which does not allow a proper solution. But
be sure that the backout patch will hit -mm immediately.

tglx


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 03:33 PM, Jeremy Fitzhardinge wrote:
> I know the vmi time code has coloured your view here, but I surely hope
> it can be got into a better state before posting.  I'm biased of course,
> but I would rather hope that all these drivers we're talking about will
> be as stylistically clean as the Xen time code (which has room for
> improvement, of course).

Could you send us comments on where you feel the style needs some fixing
up?


VMI encapsulates all the implementation details away from the kernel,
whereas the Xen time code puts it all out there in the kernel (see
snippet below).  What happens when Xen wants to change the way it
implements "system time"?  It loses compatibility with all existing
kernels.


In VMI terms, the code to read "system time" from the hypervisor is this 
one-liner (it can be written in any "style" you want; the fact is, it's 
just an interface call to the VMI-layer):


vmi_timer_ops.get_cycle_counter(VMI_CYCLES_REAL);

In Xen terms, the same code to accomplish that is:

/*
 * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
 * yielding a 64-bit result.
 */
static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
{
	u64 product;
#ifdef __i386__
	u32 tmp1, tmp2;
#endif

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

#ifdef __i386__
	__asm__ (
		"mul  %5   ; "
		"mov  %4,%%eax ; "
		"mov  %%edx,%4 ; "
		"mul  %5   ; "
		"xor  %5,%5; "
		"add  %4,%%eax ; "
		"adc  %5,%%edx ; "
		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
#elif __x86_64__
	__asm__ (
		"mul %%rdx ; shrd $32,%%rdx,%%rax"
		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
#else
#error implement me!
#endif

	return product;
}

static u64 get_nsec_offset(struct shadow_time_info *shadow)
{
	u64 now, delta;
	rdtscll(now);
	delta = now - shadow->tsc_timestamp;
	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
}

static cycle_t xen_clocksource_read(void)
{
	struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
	cycle_t ret;

	get_time_values_from_xen();

	ret = shadow->system_timestamp + get_nsec_offset(shadow);

	put_cpu_var(shadow_time);

	return ret;
}
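
For readers who do not want to decode the inline assembly: both the i386 and
the x86_64 variants of scale_delta() above compute (delta * mul_frac) >> 32
after applying the shift. A portable sketch of the same arithmetic, assuming
only C99 fixed-width types (the name scale_delta_portable is made up for this
illustration), could look like this:

```c
#include <stdint.h>

/*
 * Portable sketch of scale_delta(): shift delta, then multiply by a
 * 32-bit binary fraction, i.e. return (delta * mul_frac) >> 32.
 * The 96-bit intermediate product is built from two 32x32->64 multiplies,
 * which mirrors what the i386 assembly does.
 */
static uint64_t scale_delta_portable(uint64_t delta, uint32_t mul_frac, int shift)
{
	uint64_t lo, hi;

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* Low 32 bits of delta times the fraction; only the top half survives. */
	lo = (delta & 0xffffffffULL) * mul_frac;
	/* High 32 bits of delta times the fraction; already scaled by 2^32. */
	hi = (delta >> 32) * mul_frac;

	/* Cannot overflow: the full product is < 2^96, so the result is < 2^64. */
	return (lo >> 32) + hi;
}
```

For example, a mul_frac of 0x80000000 represents one half, so
scale_delta_portable(1000, 0x80000000, 0) yields 500. This is the kind of
helper that could sit in a shared library rather than in each hypervisor's
time code.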

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Alan Cox

> Yep, the tsc has myriad problems; for Xen it's the best of a bad lot.
> Unfortunately in 10 years no clearly better alternative has appeared;
> maybe in 10 years there will be one.  It might even be the tsc.

TSC is essentially unusable for any kind of time related work. And I'd
disagree about the alternatives - the HPET and ACPI timers are not bad,
the CMOS timer can be used as an interrupting timer source, and there is
the old PC timer chip. All are superior to the TSC.

Finally for performance management work you've got cycle counters in the
debug side (with interrupt on overflow) which allow you to do management
of resources by cpu ticks or by memory bandwidth utilisation (Sun btw
have a fascinating paper somewhere on the latter)

Alan

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Zachary Amsden


Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
>> Xen needs pieces of it too, but very select pieces for SMP boot.
>
> We do?  Send the SMP Xen code over, because I don't have it here.

s/do/will (smpboot.c)


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Zachary Amsden wrote:
> Xen needs pieces of it too, but very select pieces for SMP boot.

We do?  Send the SMP Xen code over, because I don't have it here.

Thanks,
J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> Still there is a difference between using existing kernel interfaces and
> abusing them in a way which makes modifications to the core kernel code
> hard and unmaintainable. See below.
>   

I completely agree.  "Using the kernel interfaces" doesn't mean "this
random hack happens to work", it means "use the interface as intended as
a fully-fledged client".  If the interface doesn't work for our use,
then we can negotiate with the appropriate people on how to extend it
properly.

> On the other hand we yet see things like:
>
> /* We use normal irq0 handler on cpu0. */
> time_init_hook();
>
> Which is just reaching into the kernel code directly and does not handle
> the clock event interrupt self contained. clockevents is not bound to
> IRQ0 and this kind of hackery is exactly what we need to avoid in order
> to get this maintainable.
>   

Yes, I'm definitely not arguing with you about this.  I think the first
cut vmi time code was pretty questionable, but I have confidence they'll
fix it up before submission.

The point is that when you put the xen and vmi implementations next to
each other you find that 1) in each case there's a pretty small
abstraction distance between the clock interface and the hypercall
interface, and 2) there's very little code which can be shared between
the two.  Which means that adding another layer of abstraction to
protect the clock code from paravirtualized time devices is just going
to add fat without much benefit.

> Yes, if they are used in a sane and self contained way without reaching
> all over the place and expecting that those functions, which are not
> part of the paravirt interfaces will work for ever.
>   

100% agree.  If the interfaces change, then we'll change the code using
them like any other kernel code would.  If the new interfaces are hard
to make work then that's a problem, but one would hope that would get
shaken out as part of the normal kernel development process.

The point is that this code under and around the paravirt_ops interface
is just normal Linux code, and we expect to participate in the normal
kernel development process, with all the usual
discussions/arguments/negotiations over interface changes.  If the code
loses all its maintainers and becomes orphaned, unresponsive to
interface changes, then it's like any other dead driver: mark it
CONFIG_BROKEN and wait for someone to fix it.  But for now and the
foreseeable future these are going to be actively supported and
maintained pieces of code.

> You are not increasing the entanglement with the rest of the system,
> when you use a self contained device on top of an existing core kernel
> infrastructure, which has a paravirt backend. Quite the contrary, you
> have one piece of virtual hardware which is connected to the kernel and
> interacts with the various incarnations on the other side, which can as
> well live inside the kernel code. Granted it is another level of
> indirection, but I'd be happy to have only to deal with one of those
> beasts.
>   

Right.  But at that point the interface doesn't really have much of a
technical basis.  It's really a political border at which you can hand
off responsibility and make it ours.  I quite understand your
motivation, but I think you're solving a problem that hasn't happened
yet, and one that we'd all like to avoid.

I know the vmi time code has coloured your view here, but I surely hope
it can be got into a better state before posting.  I'm biased of course,
but I would rather hope that all these drivers we're talking about will
be as stylistically clean as the Xen time code (which has room for
improvement, of course).

There is, however, a median solution which keeps the number of clock
drivers down but also doesn't involve extending pv_ops.  We can just
create paravirt_clocksource/paravirt_clockevent helper wrappers, with
their own internal interfaces to act as a facade for the
hypervisor-specific code.  I don't think there's much point in doing
this now, but maybe it will become appealing once we start dealing with
things like stolen time.
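
To make the facade idea a little more concrete, here is a minimal user-space
sketch. It is only an illustration under stated assumptions: struct
pv_clock_backend, paravirt_clocksource_register() and the toy backend are
hypothetical names, and none of this is existing kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical internal interface each hypervisor would fill in.
 * The facade is the only thing the generic clock code would talk to.
 */
struct pv_clock_backend {
	const char *name;
	uint64_t (*read_ns)(void);	/* required: current time in ns */
	uint64_t (*stolen_ns)(void);	/* optional: stolen time, may be NULL */
};

static const struct pv_clock_backend *pv_clock;

/* Register a backend; returns 0 on success, -1 if it is unusable. */
static int paravirt_clocksource_register(const struct pv_clock_backend *b)
{
	if (b == NULL || b->read_ns == NULL)
		return -1;
	pv_clock = b;
	return 0;
}

/* What a single shared clocksource driver would call. */
static uint64_t paravirt_clocksource_read(void)
{
	return pv_clock ? pv_clock->read_ns() : 0;
}

/* Backends without a stolen-time notion simply report zero. */
static uint64_t paravirt_stolen_time(void)
{
	return (pv_clock && pv_clock->stolen_ns) ? pv_clock->stolen_ns() : 0;
}

/* Toy backend for demonstration: time is whatever toy_now says. */
static uint64_t toy_now;

static uint64_t toy_read_ns(void)
{
	return toy_now;
}

static const struct pv_clock_backend toy_backend = {
	.name = "toy",
	.read_ns = toy_read_ns,
};
```

The point of the shape is that there is one clock driver as far as the core
code is concerned, while the hypervisor-specific parts stay behind a small
table that does not touch pv_ops.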

> No it's not an absolute blocker, as long as we can take care, that the
> number of incarnations is 
>
> - designed to be shareable between hypervisors which have the same time
> model
> - common code like the 128 bit math is in a shared library
> - self contained and not reaching out into core kernel code for no good
> reason

Yep.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Zachary Amsden


Thomas Gleixner wrote:
> On the other hand we yet see things like:
>
> /* We use normal irq0 handler on cpu0. */
> time_init_hook();
>
> Which is just reaching into the kernel code directly and does not handle
> the clock event interrupt self contained. clockevents is not bound to
> IRQ0 and this kind of hackery is exactly what we need to avoid in order
> to get this maintainable.
>
> Once this is used by paravirt implementations a change to the
> mach-default implementation will break stuff left and right.


We've fixed that already.  Thanks for pointing it out.  We were just 
trying to re-use code.



> Also the whole LAPIC business is so horrible, that it hurts. The generic
> interrupt layer is there since almost a year and we still see the crude
> emulation of hardware and assumptions of irq0 setup all over the place.
>
> We carefully need to define, which existing kernel interfaces are used /
> hooked in which way.
>
> If the paravirt implementations actually use the already available
> abstractions in the way in which those abstractions are designed, then
> we get into a maintainable design. If there are shortcomings on those
> abstractions we need to fix them in a sane way or provide a _common_
> workaround (e.g. 128 bit math back and forth library) without impacting
> the main kernel code.
>
> Looking at vmitimer.c and the number of hardcoded assumptions are
> telling me, that we are heading in exactly the opposite direction.


No, VMI timer is unique because for SMP, it is based on the APIC.  On 
i386, SMP is hardwired to depend on the APIC, and so we simply re-use 
the pieces of it which are there, with the same assumptions about irqs, 
and hardware behavior, good or bad.  We just have a different way of 
telling the LAPIC when to deliver interrupts.


The alternative is to pretty much completely copy apic.c into vmi.c or 
vmitimer.c, which seems a rather bad idea, since now two copies of 
nearly identical code need to be maintained.



> Yes, if they are used in a sane and self contained way without reaching
> all over the place and expecting that those functions, which are not
> part of the paravirt interfaces will work for ever.


But we definitely need pieces of the core APIC dependent code.  Xen 
needs pieces of it too, but very select pieces for SMP boot.  The 
ugliness you point out is there, but the reason it is there is not 
because the paravirt code is cluttered, it is because the i386 code is 
so hardwired to use the APIC model that there is pain separating from it.


The correct solution here is to properly separate the APIC, SMP, and 
timer code so the logic of it which we want to reuse is separated from 
the hardware dependence.  Clock events and clocksources take care of 
most of the timer issues, but there is still ugliness from SMP timer 
events depending on having part of the APIC infrastructure for wiring 
the interrupt gates.



No it's not an absolute blocker, as long as we can take care, that the
number of incarnations is 


- designed to be shareable between hypervisors which have the same time
model
- common code like the 128 bit math is in a shared library
- self contained and not reaching out into core kernel code for no good
reason
  
Same goes for clock events, interrupts and other core facilities.


I think that is what everyone wants.  This is an iterative process.  We 
certainly don't want to reach out into core kernel code unless there is 
a good reason to do so, and with every development of clock events, 
sources, and interrupts, we have less of a reason to do so, and the code 
gets cleaner and more maintainable.


Zach

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 14:05 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > This is tinkering of the best. My understanding of the paravirt
> > discussion at Kernel Summit was, that paravirt ops are exactly there to
> > prevent the above random hackery in the kernel and to allow _ALL_
> > hypervisors to interact via a sane interface inside of the kernel.
> >   
> 
> No, I don't think that was ever the intent.  The idea was to create a
> new interface for things which don't currently have an interface in the
> kernel, such as how to run the CPU in ring 1 and manage pagetable
> updates.  But an important and explicit intent of the project was to use
> existing kernel interfaces where possible, rather than try to make
> pv_ops a monster all-encompassing interface.

Maybe I misunderstood.

Still there is a difference between using existing kernel interfaces and
abusing them in a way which makes modifications to the core kernel code
hard and unmaintainable. See below.

> Using the new time infrastructure was an explicit example of that.  We
> anticipated that different hypervisors would have different ways of
> doing time, but all would be easily accommodated by the
> clocksource/events infrastructure, and so each would have its own
> implementation for these interfaces.  From the kernel's perspective,
> they're just another time device, and we manage to avoid making any core
> kernel changes, or bloating the pv_ops interface.  It seems like a
> natural use of the clock subsystem's design.

On the other hand we yet see things like:

/* We use normal irq0 handler on cpu0. */
time_init_hook();

Which is just reaching into the kernel code directly and does not handle
the clock event interrupt self contained. clockevents is not bound to
IRQ0 and this kind of hackery is exactly what we need to avoid in order
to get this maintainable.

Once this is used by paravirt implementations a change to the
mach-default implementation will break stuff left and right.

Also the whole LAPIC business is so horrible, that it hurts. The generic
interrupt layer has been there for almost a year and we still see the crude
emulation of hardware and assumptions of irq0 setup all over the place.

We carefully need to define, which existing kernel interfaces are used /
hooked in which way.

If the paravirt implementations actually use the already available
abstractions in the way in which those abstractions are designed, then
we get into a maintainable design. If there are shortcomings on those
abstractions we need to fix them in a sane way or provide a _common_
workaround (e.g. 128 bit math back and forth library) without impacting
the main kernel code.

Looking at vmitimer.c, the number of hardcoded assumptions tells me
that we are heading in exactly the opposite direction.

> > You are just perverting the whole idea of a standardized
> > paravirtualization interface.
> >
> > These things can be done for clocksources, clockevents, interrupts (the
> > generic irq code allows this) and probably for a whole bunch of other
> > stuff.
> >   
> 
> Yes, exactly.  The entirety of the Xen support consists of not only an
> implementation of the paravirt_ops interface, but also the Xen
> clocksource and clockevents and the Xen irqchip.  My hope and intent is
> that we can shrink the paravirt_ops interface in favour of using
> existing generally useful kernel interfaces.

Yes, if they are used in a sane and self contained way without reaching
all over the place and expecting that those functions, which are not
part of the paravirt interfaces will work for ever.

> > The current paravirt interface is completely insane and will explode
> > into an unmaintainable nightmare within no time, if we keep accepting
> > that crap further.
> >   
> 
> No, that's exactly what we've been trying to avoid.
> 
> If we start patching in new paravirt_ops to deal with time, interrupts,
> or whatever piece of functionality which already has a perfectly good
> kernel interface, then we're just increasing the size of the pv_ops
> interface, its entanglement with the rest of the system and the amount
> of potential legacy stuff which gets dragged around as the interface
> evolves.

You are not increasing the entanglement with the rest of the system,
when you use a self contained device on top of an existing core kernel
infrastructure, which has a paravirt backend. Quite the contrary, you
have one piece of virtual hardware which is connected to the kernel and
interacts with the various incarnations on the other side, which can as
well live inside the kernel code. Granted it is another level of
indirection, but I'd be happy to have only to deal with one of those
beasts.

> As hardware gets better at supporting virtualization directly, we're
> going to see more hybrid para- and fully- virtualized hypervisor
> interfaces.  The result will be that more and more of paravirt_ops will
> be implemented by the "native" versions of the functions

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 02:31 PM, Thomas Gleixner wrote:

> Please make these things self contained and not relying on whatever
> time_init_hook() contains.



Fixing up the code to do this now

thanks,
Dan

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 14:17 -0800, Zachary Amsden wrote:
> Thomas Gleixner wrote:
> > Simply because you _ABUSE_ time_init_hook() to set it up. Keep it self
> > contained and do not impose restrictions on the kernel core code, which
> > we have to maintain.
> >   
> 
> But time_init_hook is supposed to be abused.  That is its purpose - to 
> be a hook for different time devices on SGI Visual Workstation and 
> Voyager.  And we don't actually abuse it anymore, we just bypass it 
> because the default timer init path wants to setup the PIT or the HPET, 
> neither of which should be used in paravirt.

It is there for those hardware platforms, but using it inside your clock
event device is _JUST_ wrong.

Please make these things self contained and not relying on whatever
time_init_hook() contains.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Zachary Amsden


Thomas Gleixner wrote:

> Simply because you _ABUSE_ time_init_hook() to set it up. Keep it self
> contained and do not impose restrictions on the kernel core code, which
> we have to maintain.
  


But time_init_hook is supposed to be abused.  That is its purpose - to 
be a hook for different time devices on SGI Visual Workstation and 
Voyager.  And we don't actually abuse it anymore, we just bypass it 
because the default timer init path wants to setup the PIT or the HPET, 
neither of which should be used in paravirt.


Zach



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 13:34 -0800, Dan Hecht wrote:
> On 03/07/2007 01:40 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> >> That would certainly be ideal.  We'll look at the xen, vmi, lguest and
> >> kvm paravirtualized time models and see how much they really have in
> >> common.  I'm a bit curious about how vmi's time events make their way
> >> back into the system.
> > 
> > By the crude mechanism I'm fighting.
> >
> 
> Hmm?  They make their way back via interrupts.  How is that crude?

Simply because you _ABUSE_ time_init_hook() to set it up. Keep it self
contained and do not impose restrictions on the kernel core code, which
we have to maintain.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> This is tinkering of the best. My understanding of the paravirt
> discussion at Kernel Summit was, that paravirt ops are exactly there to
> prevent the above random hackery in the kernel and to allow _ALL_
> hypervisors to interact via a sane interface inside of the kernel.
>   

No, I don't think that was ever the intent.  The idea was to create a
new interface for things which don't currently have an interface in the
kernel, such as how to run the CPU in ring 1 and manage pagetable
updates.  But an important and explicit intent of the project was to use
existing kernel interfaces where possible, rather than try to make
pv_ops a monster all-encompassing interface.

Using the new time infrastructure was an explicit example of that.  We
anticipated that different hypervisors would have different ways of
doing time, but all would be easily accommodated by the
clocksource/events infrastructure, and so each would have its own
implementation for these interfaces.  From the kernel's perspective,
they're just another time device, and we manage to avoid making any core
kernel changes, or bloating the pv_ops interface.  It seems like a
natural use of the clock subsystem's design.

> You are just perverting the whole idea of a standardized
> paravirtualization interface.
>
> These things can be done for clocksources, clockevents, interrupts (the
> generic irq code allows this) and probably for a whole bunch of other
> stuff.
>   

Yes, exactly.  The entirety of the Xen support consists of not only an
implementation of the paravirt_ops interface, but also the Xen
clocksource and clockevents and the Xen irqchip.  My hope and intent is
that we can shrink the paravirt_ops interface in favour of using
existing generally useful kernel interfaces.

> The current paravirt interface is completely insane and will explode
> into an unmaintainable nightmare within no time, if we keep accepting
> that crap further.
>   

No, that's exactly what we've been trying to avoid.

If we start patching in new paravirt_ops to deal with time, interrupts,
or whatever piece of functionality which already has a perfectly good
kernel interface, then we're just increasing the size of the pv_ops
interface, its entanglement with the rest of the system and the amount
of potential legacy stuff which gets dragged around as the interface
evolves.

As hardware gets better at supporting virtualization directly, we're
going to see more hybrid para- and fully- virtualized hypervisor
interfaces.  The result will be that more and more of paravirt_ops will
be implemented by the "native" versions of the functions; maybe at some
point the whole thing will evaporate away.

It's not a huge reach to expect the hardware vendors to get a clue about
time hardware (scratch that, of course it is, but we can always hope)
and come up with something that is directly usable from either an OS
running natively or from within a virtual machine.  In that case, I'm
sure you'd agree it would warrant a real clocksource/event
implementation.  In the scheme I'm proposing, that's no big deal; you
just register the hardware driver, and that's that.  But what you're
proposing leaves this vestigial interface sitting in pv_ops, doing
nothing other than being redundant.

My principal goal here is to get the Xen code into the kernel, and I'm
being pragmatic about it.  If you think having a xen_clocksource is an
absolute blocker to merging this stuff, then I'll add the interface to
pv_ops, and we'll work out how to wire all the hypervisors up underneath
that interface.  But I think it's precisely the wrong way to go from an
overall kernel perspective.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 13:42 -0800, Dan Hecht wrote:
> On 03/07/2007 12:40 PM, Thomas Gleixner wrote:
> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> 
> First of all, I'm not "demanding" anything. I'm just trying to have a 
> technical discussion about the issues.  If it comes out that absolute 
> expiry can't be done cleanly, and the cost outweighs the benefit, then 
> so be it.  But, what's so wrong about having the discussion?
> 
> When you do have match register (or count and compare, whatever you want 
> to call it) based timers in real hardware, the relative expiry interface 
> in software is a bit suboptimal.  You still have no idea how much time 
> has already gone by between the time you calculated the delta and when 
> you setup the hardware (you have a pretty good estimate, but can't know 
> for sure unless you disable caches and all other sources of 
> non-determinate latencies).  So, you will always be a little late in 
> your timer firing.  You may argue that no client of clockevents cares 
> about this little bit of lateness.  But, it does exist, and can be 
> solved with a software interface that talks in terms of absolute expiries.

With sane hardware yes. But there is no sane hardware. You need a (<=)
match machinery instead of the available (==) ones, which introduce
extra latencies and incorrectness. See arch/i386/kernel/hpet.c. We can
end up with returning -ETIME and an interrupt, as we have no control
over SMM code and such crap at all. For such devices the delta based
expiry is actually faster, as it avoids the calculation of wraps and the
possible 128 bit math in the reprogramming path.
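
The (==) versus (<=) point can be sketched as a small userspace model. This is not the kernel's HPET code; the struct, the function name and the 'latency' parameter are all hypothetical, and the model only assumes a free-running 32 bit counter with an equality-only comparator. If anything delays the write of the compare register long enough, the counter has already passed the match value and the caller has to be told with -ETIME.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: free-running counter with an equality-only comparator. */
struct match_timer {
	uint32_t counter;	/* free-running counter value */
	uint32_t compare;	/* interrupt fires only when counter == compare */
};

/*
 * Program an event 'delta' ticks ahead through a relative interface.
 * 'latency' models the ticks that elapse between reading the counter and
 * the compare register write taking effect (SMM, cache misses, a
 * descheduled vcpu).
 * Returns 0 if the match value is still ahead of the counter, -ETIME if
 * the counter has already passed it.
 */
static int match_timer_program(struct match_timer *t, uint32_t delta,
			       uint32_t latency)
{
	uint32_t target = t->counter + delta;

	t->counter += latency;		/* time passes before the write lands */
	t->compare = target;

	/* The signed difference stays correct across counter wraparound. */
	return ((int32_t)(t->counter - target) > 0) ? -ETIME : 0;
}
```

With a (<=) comparator the late case would simply fire at once; with (==) the event would be lost until the counter wraps, which is why the read-back check is needed at all.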

This correctness discussion is purely hypothetical on current real world
hardware.

> Perhaps we can't get around the 128-bit math problem, or maybe we can 
> think of a clever solution.  If we can't, then maybe fixing the lateness 
> is not worth the cost of 128-bit math.  But, maybe there is a clean way 
> around the 128-bit math and we just need to approach it from another angle.

Please put the clever solution inside of the clockevent. I can provide
the absolute time in nanoseconds without making you touch the
clockevent->next_event variable.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 12:40 PM, Thomas Gleixner wrote:

> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.


First of all, I'm not "demanding" anything. I'm just trying to have a 
technical discussion about the issues.  If it comes out that absolute 
expiry can't be done cleanly, and the cost outweighs the benefit, then 
so be it.  But, what's so wrong about having the discussion?


When you do have match register (or count and compare, whatever you want 
to call it) based timers in real hardware, the relative expiry interface 
in software is a bit suboptimal.  You still have no idea how much time 
has already gone by between the time you calculated the delta and when 
you setup the hardware (you have a pretty good estimate, but can't know 
for sure unless you disable caches and all other sources of 
non-determinate latencies).  So, you will always be a little late in 
your timer firing.  You may argue that no client of clockevents cares 
about this little bit of lateness.  But, it does exist, and can be 
solved with a software interface that talks in terms of absolute expiries.
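
A small sketch of the lateness argument, assuming a simplified model in which 'stolen' cycles pass between computing the delta and programming the timer. The function names are hypothetical and time is counted in abstract cycles. With a relative interface the stolen time is added on top of the delta; with an absolute interface the expiry was fixed when it was computed, so the event fires on time unless the expiry has already passed.

```c
#include <stdint.h>

/* Relative interface: the delta is applied to whatever 'now' is when the
 * timer finally gets programmed, so stolen time pushes the event out. */
static uint64_t fires_at_relative(uint64_t computed_at, uint64_t delta,
				  uint64_t stolen)
{
	uint64_t programmed_at = computed_at + stolen;

	return programmed_at + delta;
}

/* Absolute interface: the expiry is fixed when it is computed.  If it is
 * already in the past when programmed, the event fires immediately. */
static uint64_t fires_at_absolute(uint64_t computed_at, uint64_t delta,
				  uint64_t stolen)
{
	uint64_t programmed_at = computed_at + stolen;
	uint64_t expiry = computed_at + delta;

	return expiry > programmed_at ? expiry : programmed_at;
}
```

In both cases the absolute variant is late by at most the amount by which the stolen time exceeds the delta, while the relative variant is always late by the full stolen time.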


Perhaps we can't get around the 128-bit math problem, or maybe we can 
think of a clever solution.  If we can't, then maybe fixing the lateness 
is not worth the cost of 128-bit math.  But, maybe there is a clean way 
around the 128-bit math and we just need to approach it from another angle.


Dan

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 01:40 PM, Thomas Gleixner wrote:

> On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
>> That would certainly be ideal.  We'll look at the xen, vmi, lguest and
>> kvm paravirtualized time models and see how much they really have in
>> common.  I'm a bit curious about how vmi's time events make their way
>> back into the system.
>
> By the crude mechanism I'm fighting.

Hmm?  They make their way back via interrupts.  How is that crude?

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 01:21 PM, Thomas Gleixner wrote:

On Wed, 2007-03-07 at 11:49 -0800, Dan Hecht wrote:
Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's ours 
for reference (please excuse any formatting issues); it's also lean. 
We'll send out a proper patch later after some more testing:


Ah. Bitching loud enough speeds things up. :)



We've always planned to do this.  We just didn't want to create the 
dependency between paravirt_ops and clockevents too early such that they 
would depend on each other to merge to main line.  Now that they are 
both there, we are all for it.



/** vmi clockevent */

static struct clock_event_device vmi_global_clockevent;

static inline u32 vmi_alarm_wiring(struct clock_event_device *evt)
{
	return (evt == &vmi_global_clockevent) ?
		VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT;
}

static void vmi_timer_set_mode(enum clock_event_mode mode,
			       struct clock_event_device *evt)
{
	u32 wiring;
	cycle_t now, cycles_per_hz;
	BUG_ON(!irqs_disabled());

	wiring = vmi_alarm_wiring(evt);
	if (wiring == VMI_ALARM_WIRED_LVTT)
		/* Route the interrupt to the correct vector */
		apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR);


Wire that in the hypervisor.


	switch (mode) {
	case CLOCK_EVT_MODE_ONESHOT:
		break;
	case CLOCK_EVT_MODE_PERIODIC:
		cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
		(void)do_div(cycles_per_hz, HZ);
		now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
		vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC,
					now, cycles_per_hz);


paravirt_ops->paravirt_clockevent->set_periodic(vcpu, period);



Huh?  paravirt_ops isn't a hypervisor interface, it's just a linux code 
abstraction.  The code on both sides of paravirt_ops is *linux* code, 
any way you cut it.  clockevents is already a linux code abstraction. 
why introduce the redundancy?




		break;
	case CLOCK_EVT_MODE_UNUSED:
	case CLOCK_EVT_MODE_SHUTDOWN:


paravirt_ops->paravirt_clockevent->stop_event(vcpu, mode);



You would be introducing the same redundancy.




		switch (evt->mode) {
		case CLOCK_EVT_MODE_ONESHOT:
			vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
			break;
		case CLOCK_EVT_MODE_PERIODIC:
			vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
			break;
		default:
			break;
		}
		break;
	default:
		break;
	}
}


This whole vmi_timer_ops thing is horrible. All hypervisors can share 
paravirt_ops->paravirt_clockevent and retrieve the methods on boot.




vmi_timer_ops.whatever is where the kernel <-> hypervisor boundary is 
crossed for VMI.



static int vmi_timer_next_event(unsigned long delta,
				struct clock_event_device *evt)
{
	/* Unfortunately, set_next_event interface only passes relative
	 * expiry, but we want absolute expiry.  It'd be better if we
	 * were passed an absolute expiry, since a bunch of time may
	 * have been stolen between the time the delta is computed and
	 * when we set the alarm below. */
	cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT));

	BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT);
	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,
				now + delta, 0);
	return 0;
}


Great. Now we have:

	s64 event = startup_offset + ktime_to_ns(evt->next_event);

	if (HYPERVISOR_set_timer_op(event) < 0)
		BUG();
and

	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT, now + delta, 0);

What will the next implementations look like?

	lguest_program_timer(delta + lguest_current_time(), LGUEST_TIMER_SHOOT_ONCE);

	virt_nextgen_ops.set_timer_event(delta, NO_WE_NEED_NO_FLAGS);

...

This is tinkering of the best. My understanding of the paravirt
discussion at Kernel Summit was, that paravirt ops are exactly there to
prevent the above random hackery in the kernel and to allow _ALL_
hypervisors to interact via a sane interface inside of the kernel.



No, that was not the point of paravirt_ops.  It is actually the complete 
opposite of the intention of paravirt_ops.  paravirt_ops' intent is 
exactly to allow for *multiple* hypervisor ABIs to exist in the kernel.


At kernel summit, paravirt_ops was proposed to allow for multiple 
hypervisor ABI's to be targeted by the kernel.  The code on both sides 
of paravirt_ops is *linux* code.



You are just perverting the whole idea of a standardized
paravirtualization interface.

This things can be done f

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > I tend to disagree. The clockevents infrastructure was designed to cope
> > with the existing mess of real hardware. The discussion over the last
> > days exposed me to even more exotic designs than the hardware vendors
> > were able to deliver until now.
> >   
> 
> It's a different but related problem domain.  It's also an increasingly
> common execution environment for a kernel to find itself in.  Dealing
> with proper paravirtualized timer devices is a big improvement over
> trying to reliably deal with fully virtualized hardware timers, which
> simply can't make the same guarantees that real hardware can make - such
> as "you will definitely get N ns of CPU time between doing the
> delta->absolute computation and programming the match register".

That's exactly the reason why we want only _ONE_ proper virtualized
timer device instead of 10 new variants of broken hardware.

> > I know exactly where you are heading:
> >
> > Offload the handling of hypervisor design decisions to the kernel and
> > let us deal with that. So we need to implement 128 bit math to convert
> > back and forth and I expect more interesting things to creep up. 
> >   
> 
> I wouldn't put it that way.  We've been getting a lot of pressure to
> keep the pv_ops interface as small as possible.  Reusing existing kernel
> interfaces rather than making up new ones is a good way to do that.  The
> clock infrastructure certainly cleans things up; earlier Xen patches
> made a complete copy of the old kernel/time.c and hacked it around,
> which isn't what anyone wants to do.

All you need is exactly ONE paravirt clockevent device and ONE paravirt
clocksource for _ALL_ hypervisors. Cast that into stone with a
paravirt_ops->clockwhatever interface and we are all happy.

> > All this is of _NO_ use and benefit for the kernel itself.
> >   
> 
> Lots of people want to run Linux in virtual machines.  If we can make
> sane kernel changes to help those users, then that is of use and benefit
> to the kernel.

The above will give a real benefit as it is a well defined interface,
which can be verified on both ends.

> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> >   
> 
> Not really.  Xen and VMI interfaces both use absolute monotonic time for
> timeouts, which is certainly a common case for such interfaces
> (pthread_cond_timedwait, for example).  Converting delta to absolute is
> clearly simple, but it does introduce an added bit of non-determinism if
> your CPU can be preempted from outside at any time.  I presume SMM or
> similar interrupts can cause the same problem on real hardware.

As I said before: I have no objection against expanding / changing the
clockevents interface to deliver absolute expiry time, which we have
already handy.

I just refuse for a good reason to convert it from ktime_t (nanoseconds)
to an absolute cycle value. This can be done on the hypervisor side of
the paravirt clock event device. Same applies for clocksources. The ones
which need nanosecond from/to whatever conversion can do it _IN_ the
hypervisor and not in 10 different grades of madness in the kernel code.
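
The overflow behind the 128 bit remarks can be sketched in userspace C. This assumes the usual fixed-point scaling cycles = (ns * mult) >> shift and a hypothetical 2 GHz counter with mult = 2^23 and shift = 22; those values and all function names are illustrative and not taken from any driver. A short delta fits comfortably in 64 bits, but an absolute nanosecond value after some uptime does not, so the intermediate product has to be carried in more than 64 bits.

```c
#include <stdint.h>

/* Plain 64 bit scaling: fine for short deltas, but the intermediate
 * product silently wraps at 2^64 for large absolute values. */
static uint64_t ns_to_cycles64(uint64_t ns, uint32_t mult, uint32_t shift)
{
	return (ns * mult) >> shift;
}

/* Nonzero when ns * mult does not fit into 64 bits. */
static int scale_overflows64(uint64_t ns, uint32_t mult)
{
	return mult != 0 && ns > UINT64_MAX / mult;
}

/*
 * Same scaling with a 96 bit intermediate product, built from two
 * 64 bit partial products.  Assumes shift <= 32 and that the final
 * result fits into 64 bits.
 */
static uint64_t ns_to_cycles_wide(uint64_t ns, uint32_t mult, uint32_t shift)
{
	uint64_t lo = (ns & 0xFFFFFFFFu) * mult;	/* low partial product */
	uint64_t hi = (ns >> 32) * mult;		/* weighted by 2^32 */

	hi += lo >> 32;		/* carry into the upper part */
	lo &= 0xFFFFFFFFu;

	return (hi << (32 - shift)) + (lo >> shift);
}
```

For example, 1 ms converts to 2,000,000 cycles with either helper, while an absolute value of 30 days in nanoseconds only converts correctly with the wide variant.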

> > We can optimize this by skipping the conversion via a feature flag.

> The clocksource needed the shift for ntp warping.  Does the clockevent
> need a shift at all?  Could I just set mult/shift to 1/0?

Yes.

> > Your implementation is almost the perfect prototype, if you move the
> > 128 bit hackery into the hypervisor and hide it away from the kernel :)
> >   
> The point is to use the tsc to avoid making any hypercalls, so dealing
> with the tsc->ns conversion has to happen on the guest side somehow.

I understand that you want to make this as fast as possible, but TSC is
broken in more than one way and it just makes me barf, when we have yet
another way of dealing with it in the kernel.

Please keep the paravirt interface abstract and treat it in the same way
we treat the kernel - userspace API. The kernel hides all this hardware
crap away from the user space and the same applies for a sane paravirt
interface. This is also a benefit in terms of portability. 

For devices, which already live on top of an abstraction layer in the
kernel, e.g. clocksources, clockevents, interrupts, we can share one
implementation across multiple platforms.

> > One of these is perfectly fine for _ALL_ of the hypervisor folks.
> > Anything else is just a backwards decision for the kernel.
> >   
> That would certainly be ideal.  We'll look at the xen, vmi, lguest and
> kvm paravirtualized time models and see how much they really have in
> common.  I'm a bit curious about how vmi's time events make their way
> back into the system.

By the crude mechanism I'm figh

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 01:19 PM, Thomas Gleixner wrote:

On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:

On 03/07/2007 12:57 PM, Thomas Gleixner wrote:

On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:

Dan Hecht wrote:

Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
ours for reference (please excuse any formatting issues); it's also
lean. We'll send out a proper patch later after some more testing:
So the interrupt side of the clockevent comes through the virtual apic? 
Where does evt->handle_event get called?



/* We use normal irq0 handler on cpu0. */
time_init_hook();

That's exactly the thing I ranted about before. We keep the historic
view of emulated hardware and just wrap it into enough glue code instead
of doing an abstract design, which just gets rid of those hardware
assumptions at all. That's the big advantage of paravirtualization, but
the current way on paravirt ops is just ignoring this.

Are you saying you would prefer we create our own irq handler something 
like this rather than using the standard i386 handlers?


irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
{
	local_event->event_handler(local_event);
	return IRQ_HANDLED;
}

??  That's fine with me.


I prefer _ONE_ generic abstract implementation of a clock event, which
can be used by all hypervisors. Please keep all your wiring and ideas of
how to best emulate an i386 system away from the kernel as far as you
can.

Please sit down with the other hypervisor folks and define the five
functions you need to interact between clockevents and the particular
hypervisor and implement it once.

Then you can change and evolve your idea of how to handle them best in your
hypervisor code, where it belongs.
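
One way to picture the single shared interface proposed here is a small ops table that each hypervisor fills in at boot. This is only a sketch under assumptions: no such structure exists in the kernel, the three operations below are a guess at a subset of the functions meant, and every name (pv_clockevent_ops, the toy_* backend) is hypothetical. The toy backend just records what was requested so the dispatch can be followed.

```c
#include <stddef.h>
#include <stdint.h>

#define PV_MAX_VCPUS 4

/* Hypothetical shared interface; none of these names exist in the kernel. */
struct pv_clockevent_ops {
	int (*set_periodic)(int vcpu, uint64_t period_ns);
	int (*set_oneshot)(int vcpu, uint64_t expiry_ns);  /* absolute, in ns */
	int (*stop_event)(int vcpu);
};

/* Toy hypervisor backend that only records the last request per vcpu. */
struct toy_hv_timer {
	int armed;
	int periodic;
	uint64_t value_ns;
};

static struct toy_hv_timer toy_hv_timers[PV_MAX_VCPUS];

static int toy_vcpu_valid(int vcpu)
{
	return vcpu >= 0 && vcpu < PV_MAX_VCPUS;
}

static int toy_set_periodic(int vcpu, uint64_t period_ns)
{
	if (!toy_vcpu_valid(vcpu) || period_ns == 0)
		return -1;
	toy_hv_timers[vcpu].armed = 1;
	toy_hv_timers[vcpu].periodic = 1;
	toy_hv_timers[vcpu].value_ns = period_ns;
	return 0;
}

static int toy_set_oneshot(int vcpu, uint64_t expiry_ns)
{
	if (!toy_vcpu_valid(vcpu))
		return -1;
	toy_hv_timers[vcpu].armed = 1;
	toy_hv_timers[vcpu].periodic = 0;
	toy_hv_timers[vcpu].value_ns = expiry_ns;
	return 0;
}

static int toy_stop_event(int vcpu)
{
	if (!toy_vcpu_valid(vcpu))
		return -1;
	toy_hv_timers[vcpu].armed = 0;
	return 0;
}

static const struct pv_clockevent_ops toy_hv_ops = {
	.set_periodic	= toy_set_periodic,
	.set_oneshot	= toy_set_oneshot,
	.stop_event	= toy_stop_event,
};

/* Selected once at boot; generic clock event code calls only through this. */
static const struct pv_clockevent_ops *pv_clockevent = &toy_hv_ops;
```

The generic side would then call, for instance, pv_clockevent->set_oneshot(vcpu, expiry_ns) and never see how a particular hypervisor wires the event.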



Okay, I guess we are essentially back to the "XEN & VMI" thread.  Let's 
just keep that discussion in one place.


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 11:49 -0800, Dan Hecht wrote:
> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's ours 
> for reference (please excuse any formatting issues); it's also lean. 
> We'll send out a proper patch later after some more testing:

Ah. Bitching loud enough speeds things up. :)

> /** vmi clockevent */
> 
> static struct clock_event_device vmi_global_clockevent;
> 
> static inline u32 vmi_alarm_wiring(struct clock_event_device *evt)
> {
>   return (evt == &vmi_global_clockevent) ?
>   VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT;
> }
> 
> static void vmi_timer_set_mode(enum clock_event_mode mode,
>  struct clock_event_device *evt)
> {
>   u32 wiring;
>   cycle_t now, cycles_per_hz;
>   BUG_ON(!irqs_disabled());
> 
>   wiring = vmi_alarm_wiring(evt);
>   if (wiring == VMI_ALARM_WIRED_LVTT)
>   /* Route the interrupt to the correct vector */
>   apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR);

Wire that in the hypervisor.

>   switch (mode) {
>   case CLOCK_EVT_MODE_ONESHOT:
>   break;
>   case CLOCK_EVT_MODE_PERIODIC:
>   cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
>   (void)do_div(cycles_per_hz, HZ);
>   now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
>   vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC,
>   now, cycles_per_hz);

paravirt_ops->paravirt_clockevent->set_periodic(vcpu, period);

>   break;
>   case CLOCK_EVT_MODE_UNUSED:
>   case CLOCK_EVT_MODE_SHUTDOWN:

paravirt_ops->paravirt_clockevent->stop_event(vcpu, mode);


>   switch (evt->mode) {
>   case CLOCK_EVT_MODE_ONESHOT:
>   vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
>   break;
>   case CLOCK_EVT_MODE_PERIODIC:
>   vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
>   break;
>   default:
>   break;
>   }
>   break;
>   default:
>   break;
>   }
> }

This whole vmi_timer_ops thing is horrible. All hypervisors can share 
paravirt_ops->paravirt_clockevent and retrieve the methods on boot.
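
To make that concrete, here is a minimal sketch of what one shared method table could look like. Everything in it is hypothetical: the struct, the function names and the toy backend are invented for illustration and do not come from any posted patch or tree. The point is only that the generic path never sees hypervisor-specific wiring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared method table; none of these names exist in any tree. */
struct pv_clockevent_ops {
	int  (*set_next_event)(unsigned int vcpu, uint64_t delta_ns);
	void (*set_periodic)(unsigned int vcpu, uint64_t period_ns);
	void (*stop_event)(unsigned int vcpu);
};

/* Filled in once at boot by whichever hypervisor backend is detected. */
static const struct pv_clockevent_ops *pv_clockevent;

static int pv_clockevent_register(const struct pv_clockevent_ops *ops)
{
	assert(ops != NULL);
	/* Refuse a backend that leaves any method unimplemented. */
	if (!ops->set_next_event || !ops->set_periodic || !ops->stop_event)
		return -1;
	pv_clockevent = ops;
	return 0;
}

/* The one generic programming path; no hypervisor specifics in here. */
static int pv_program_event(unsigned int vcpu, uint64_t delta_ns)
{
	assert(pv_clockevent != NULL);
	return pv_clockevent->set_next_event(vcpu, delta_ns);
}

/* Toy backend standing in for a real hypervisor, for demonstration only. */
static uint64_t demo_last_delta_ns;

static int demo_set_next_event(unsigned int vcpu, uint64_t delta_ns)
{
	(void)vcpu;
	demo_last_delta_ns = delta_ns;
	return 0;
}

static void demo_set_periodic(unsigned int vcpu, uint64_t period_ns)
{
	(void)vcpu;
	(void)period_ns;
}

static void demo_stop_event(unsigned int vcpu)
{
	(void)vcpu;
}

static const struct pv_clockevent_ops demo_ops = {
	.set_next_event = demo_set_next_event,
	.set_periodic   = demo_set_periodic,
	.stop_event     = demo_stop_event,
};
```

A backend would call pv_clockevent_register(&demo_ops) once during boot; after that the generic code only ever goes through pv_program_event().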

> static int vmi_timer_next_event(unsigned long delta,
>   struct clock_event_device *evt)
> {
>   /* Unfortunately, set_next_event interface only passes relative
>    * expiry, but we want absolute expiry.  It'd be better if we
>    * were passed an absolute expiry, since a bunch of time may
>    * have been stolen between the time the delta is computed and
>    * when we set the alarm below. */
>   cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT));
> 
>   BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT);
>   vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,
>   now + delta, 0);
>   return 0;
> }

Great. Now we have:

	s64 event = startup_offset + ktime_to_ns(evt->next_event);

	if (HYPERVISOR_set_timer_op(event) < 0)
		BUG();

and

	vmi_timer_ops.set_alarm(vmi_alarm_wiring(evt) | VMI_ONESHOT,
				now + delta, 0);

What will the next implementations look like?

	lguest_program_timer(delta + lguest_current_time(),
			     LGUEST_TIMER_SHOOT_ONCE);

	virt_nextgen_ops.set_timer_event(delta, NO_WE_NEED_NO_FLAGS);

...

This is tinkering at its best. My understanding of the paravirt
discussion at Kernel Summit was that paravirt ops are there exactly to
prevent the above random hackery in the kernel and to allow _ALL_
hypervisors to interact via a sane interface inside the kernel.

You are just perverting the whole idea of a standardized
paravirtualization interface.

These things can be done for clocksources, clockevents, interrupts (the
generic irq code allows this) and probably for a whole bunch of other
stuff.

The current paravirt interface is completely insane and will explode
into an unmaintainable nightmare within no time, if we keep accepting
that crap further.

No thanks.

> #ifdef CONFIG_X86_LOCAL_APIC
> 
> /* Replacement for lapic timer local clock event.
>   * paravirt_ops.setup_boot_clock  = vmi_nop
>   *   (continue using global_clock_event on cpu0)
>   * paravirt_ops.setup_secondary_clock = vmi_timer_setup_local_alarm
>   */
> void __devinit vmi_timer_setup_local_alarm(void)
> {
>   struct clock_event_device *evt = &__get_cpu_var(local_clock_events);
> 
>   /* Then, start it back up as a local clockevent device. */
>   memcpy(evt, &vmi_clockevent, sizeof(*evt));
>   evt->cpumask = cpumask_of_cpu(smp_processor_id());
> 
>   printk(KERN_WARNING "vmi: registering clock event %s. mult=%lu shift=%u\n",
>  evt->name, evt->mult, evt->shift);
>   clockevents_register_device(e

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 13:02 -0800, Dan Hecht wrote:
> On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> > On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> >> Dan Hecht wrote:
> >>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> >>> ours for reference (please excuse any formating issues); it's also
> >>> lean. We'll send out a proper patch later after some more testing:
> >> So the interrupt side of the clockevent comes through the virtual apic? 
> >> Where does evt->handle_event get called?
> > 
> > 
> >> /* We use normal irq0 handler on cpu0. */
> >> time_init_hook();
> > 
> > That's exactly the thing I ranted about before. We keep the historic
> > view of emulated hardware and just wrap it into enough glue code instead
> > of doing an abstract design, which just gets rid of those hardware
> > assumptions at all. That's the big advantage of paravirtualization, but
> > the current way on paravirt ops is just ignoring this.
> > 
> 
> Are you saying you would prefer we create our own irq handler something 
> like this rather than using the standard i386 handlers?
> 
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
> local_event->event_handler(local_event);
> return IRQ_HANDLED;
> }
> 
> ??  That's fine with me.

I prefer _ONE_ generic abstract implementation of a clock event, which
can be used by all hypervisors. Please keep all your wiring and ideas of
how to best emulate an i386 system as far away from the kernel as you
can.

Please sit down with the other hypervisor folks and define the five
functions you need to interact between clockevents and the particular
hypervisor and implement it once.

Then you can change and evolve your idea of how to handle them best in
your hypervisor code, where it belongs.

tglx

 


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 12:49 -0800, Dan Hecht wrote:
> On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> > Dan Hecht wrote:
> >> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> >> ours for reference (please excuse any formating issues); it's also
> >> lean. We'll send out a proper patch later after some more testing:
> > 
> > So the interrupt side of the clockevent comes through the virtual apic? 
> > Where does evt->handle_event get called?
> > 
> 
> Yeah, we use the same interrupt handlers as normal i386: timer_interrupt 
> and smp_apic_timer_interrupt.  That way we don't need to duplicate the 
> interrupt handler code.

Oh well. Here we are again. 2 hypervisors - 4 different views on how to
inject events into the kernel.

This is the completely wrong approach. Paravirtualization should not abuse
existing hardware drivers. It should just provide its own sane
abstract implementation.

Please stop this _NOW_

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> I tend to disagree. The clockevents infrastructure was designed to cope
> with the existing mess of real hardware. The discussion over the last
> days exposed me to even more exotic designs than the hardware vendors
> were able to deliver until now.
>   

It's a different but related problem domain.  It's also an increasingly
common execution environment for a kernel to find itself in.  Dealing
with proper paravirtualized timer devices is a big improvement over
trying to reliably deal with fully virtualized hardware timers, which
simply can't make the same guarantees that real hardware can make - such
as "you will definitely get N ns of CPU time between doing the
delta->absolute computation and programming the match register".

> I know exactly where you are heading:
>
> Offload the handling of hypervisor design decisions to the kernel and
> let us deal with that. So we need to implement 128 bit math to convert
> back and forth and I expect more interesting things to creep up. 
>   

I wouldn't put it that way.  We've been getting a lot of pressure to
keep the pv_ops interface as small as possible.  Reusing existing kernel
interfaces rather than making up new ones is a good way to do that.  The
clock infrastructure certainly cleans things up; earlier Xen patches
made a complete copy of the old kernel/time.c and hacked it around,
which isn't what anyone wants to do.

> All this is of _NO_ use and benefit for the kernel itself.
>   

Lots of people want to run Linux in virtual machines.  If we can make
sane kernel changes to help those users, then that is of use and benefit
to the kernel.

> Real hardware copes well with relative deltas for the events, even when
> it is match register based. I thought long about the support for
> absolute expiry values in cycles and decided against them to avoid that
> math hackery, which you folks now demand.
>   

Not really.  Xen and VMI interfaces both use absolute monotonic time for
timeouts, which is certainly a common case for such interfaces
(pthread_cond_timedwait, for example).  Converting delta to absolute is
clearly simple, but it does introduce an added bit of non-determinism if
your CPU can be preempted from outside at any time.  I presume SMM or
similar interrupts can cause the same problem on real hardware.

I guess the worst case for real hardware is an absolute-time match
register which only compares for match==now rather than match<=now,
since you could completely lose the time event if you miss the deadline.
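
A small sketch of that difference, assuming a plain 64-bit monotonic counter. The helper names are made up for illustration and do not model any particular timer hardware or hypervisor interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Turn a relative delta into an absolute deadline on a monotonic clock. */
static uint64_t deadline_from_delta(uint64_t now, uint64_t delta)
{
	assert(delta <= UINT64_MAX - now);	/* no wraparound in this sketch */
	return now + delta;
}

/* Comparator that fires once the deadline has been reached or passed. */
static bool fires_le(uint64_t match, uint64_t now)
{
	return match <= now;
}

/* Comparator that fires only at the exact matching instant. */
static bool fires_eq(uint64_t match, uint64_t now)
{
	return match == now;
}
```

If the deadline is computed as 150 but the vcpu is preempted until the counter reads 170, fires_le() still reports the event while fires_eq() never will, which is the lost-event case described above.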

>> static const struct clock_event_device xen_clockevent = {
>>  .name = "xen",
>>  .features = CLOCK_EVT_FEAT_ONESHOT,
>>
>>  .max_delta_ns = 0x7fff,
>>  .min_delta_ns = 100,/* ? */
>>
>>  .mult = 1<<XEN_SHIFT,
>>  .shift = XEN_SHIFT,
>> 
>
> We can optimize this by skipping the conversion via a feature flag.
>   

The clocksource needed the shift for ntp warping.  Does the clockevent
need a shift at all?  Could I just set mult/shift to 1/0?
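
For reference, a sketch of the conversion in question, assuming the clockevents core turns a nanosecond delta into device units with a plain multiply and shift. The function name is made up; this is not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: device units = (delta_ns * mult) >> shift. */
static uint64_t ns_to_device_ticks(uint64_t delta_ns, uint32_t mult,
				   uint32_t shift)
{
	assert(shift < 64);
	/* Overflow of the 64-bit product is ignored in this sketch. */
	return (delta_ns * mult) >> shift;
}
```

With mult/shift of 1/0 the conversion is the identity, and the same holds for mult = 1 << shift as long as the product does not overflow. Whether the core accepts 1/0 in practice is not something this sketch can confirm.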

> Your implementation is almost the perfect prototype, if you move the
> 128 bit hackery into the hypervisor and hide it away from the kernel :)
>   

The point is to use the tsc to avoid making any hypercalls, so dealing
with the tsc->ns conversion has to happen on the guest side somehow.

> One of these is perfectly fine for _ALL_ of the hypervisor folks.
> Anything else is just a backwards decision for the kernel.
>   

That would certainly be ideal.  We'll look at the xen, vmi, lguest and
kvm paravirtualized time models and see how much they really have in
common.  I'm a bit curious about how vmi's time events make their way
back into the system.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Dan Hecht wrote:
> Are you saying you would prefer we create our own irq handler
> something like this rather than using the standard i386 handlers?
>
> irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
> {
>local_event->event_handler(local_event);
>return IRQ_HANDLED;
> }
>
> ??  That's fine with me.

It does make the code self-contained.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 12:57 PM, Thomas Gleixner wrote:
> On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
>> Dan Hecht wrote:
>>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
>>> ours for reference (please excuse any formating issues); it's also
>>> lean. We'll send out a proper patch later after some more testing:
>> So the interrupt side of the clockevent comes through the virtual apic? 
>> Where does evt->handle_event get called?
> 
> 
>> /* We use normal irq0 handler on cpu0. */
>> time_init_hook();
> 
> That's exactly the thing I ranted about before. We keep the historic
> view of emulated hardware and just wrap it into enough glue code instead
> of doing an abstract design, which just gets rid of those hardware
> assumptions at all. That's the big advantage of paravirtualization, but
> the current way on paravirt ops is just ignoring this.
> 

Are you saying you would prefer we create our own irq handler something 
like this rather than using the standard i386 handlers?

irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
{
   local_event->event_handler(local_event);
   return IRQ_HANDLED;
}

??  That's fine with me.

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 12:11 PM, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
>> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
>> ours for reference (please excuse any formating issues); it's also
>> lean. We'll send out a proper patch later after some more testing:
> 
> So the interrupt side of the clockevent comes through the virtual apic? 
> Where does evt->handle_event get called?
> 

Yeah, we use the same interrupt handlers as normal i386: timer_interrupt 
and smp_apic_timer_interrupt.  That way we don't need to duplicate the 
interrupt handler code.


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 12:11 -0800, Jeremy Fitzhardinge wrote:
> Dan Hecht wrote:
> > Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> > ours for reference (please excuse any formating issues); it's also
> > lean. We'll send out a proper patch later after some more testing:
> 
> So the interrupt side of the clockevent comes through the virtual apic? 
> Where does evt->handle_event get called?

> /* We use normal irq0 handler on cpu0. */
> time_init_hook();

That's exactly the thing I ranted about before. We keep the historic
view of emulated hardware and just wrap it into enough glue code instead
of doing an abstract design, which gets rid of those hardware
assumptions altogether. That's the big advantage of paravirtualization,
but the current approach to paravirt ops is just ignoring this.

tglx


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 09:41 -0800, Jeremy Fitzhardinge wrote:
> Other hypervisors may take other approaches, depending on what the real
> underlying hardware is and the real requirements.  One could imagine a
> hypervisor exposing an hpet mapping, for example, or just having some
> kind of completely synthetic time source.
> 
> The point is that if we were to build an abstraction layer over all of
> these just so that we could have a single clocksource/event
> implementation, it would be pretty much equivalent to the existing clock
> infrastructure, and would add no value.

I tend to disagree. The clockevents infrastructure was designed to cope
with the existing mess of real hardware. The discussion over the last
days exposed me to even more exotic designs than the hardware vendors
were able to deliver until now.

> I was very pleased when I saw the clocksource/event mechanisms go into
> the kernel because it means different hypervisors can have a clock*
> implementation to match their own particular time model/interface
> without having to clutter up the pv_ops interface, and still have a
> well-defined interface to the rest of the kernel's time infrastructure.

I know exactly where you are heading:

Offload the handling of hypervisor design decisions to the kernel and
let us deal with that. So we need to implement 128 bit math to convert
back and forth and I expect more interesting things to creep up. 

All this is of _NO_ use and benefit for the kernel itself.

Real hardware copes well with relative deltas for the events, even when
it is match register based. I thought long about the support for
absolute expiry values in cycles and decided against them to avoid that
math hackery, which you folks now demand.

> I don't think having a clock implementation for each hypervisor is such
> a big deal.  The Xen one, for example, is 300 lines of straightforward code.
> 
> > Abstractions for the abstractions sake are braindead. There is no real
> > reason to implement 128 bit math into that path just to make the virtual
> > clockevent device look like real hardware.
> >
> > The abstraction of clockevents helps you to get rid of hardwired
> > hardware assumptions, but you insist on creating them artificially for
> > reasons which are beyond my grasp.
> >   
> The hypervisor may present abstracted time hardware, but there is real
> time hardware under there somewhere, and there are benefits to making
> the abstraction as thin as possible.

Yeah, it's much faster to do the conversion in the kernel and not in the
hypervisor thin layer. See also below.

> Xen chooses to express its time
> interfaces in ns and so is a good direct match for the Linux time
> infrastructure, but it still has to do the 128-bit cycles<->ns conversion
> *somewhere*, because the underlying hardware is still using cycles.  It
> sounds like the VMWare folks have chosen to directly use cycles in order
> to avoid that conversion altogether.

Neither the host OS nor the hypervisors use cycles as the main unit for
their own time related code. They all have the required conversion code
already available.

The historical design of hypervisors was based on emulating the hardware
1:1. So the TSC needs to be a TSC and the LAPIC a LAPIC. 

Paravirtualized guests can use smarter virtual hardware which is exposed
to the kernel. Using paravirtualization only to speed up the emulation
of legacy crap without thinking about the overall possible enhancements
is just backwards. 

Paravirtualization is a technique that presents a software interface to
virtual machines that is similar but not identical to that of the
underlying hardware.

clockevents allow you to do that easily and simply, but you insist on a
1:1 conversion of your current design and offload the legacy burden of
your historical hardware usage to the kernel developers. No thanks.

Also let's compare the code flow for a Linux guest on a Linux host:

cycles based:

program_next_event()
convert to a virtual cycle value
call into the emulated clock event device
call into the hypervisor
convert to nanoseconds
arm a hrtimer
convert to real hardware cycles

nanosecond based:

program_next_event()
call into the emulated clock event device
call into the hypervisor
arm a hrtimer
convert to real hardware cycles

> > Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
> > instead of writing up lengthy excuses, why it is s hard and takes
> > sooo much time and the current interface is sooo insufficient.
> >   
> 
> Yep, it worked out well.  The only warty thing in there is the asm
> 128-bit math needed in scale_delta() to convert tsc cycles to ns.  John
> Stultz had suggested (on a much earlier incarnation of this code) that
> it could be generally useful and could be hoisted to somewhere more
> common.  I've included the

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Dan Hecht wrote:
> Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's
> ours for reference (please excuse any formating issues); it's also
> lean. We'll send out a proper patch later after some more testing:

So the interrupt side of the clockevent comes through the virtual apic? 
Where does evt->handle_event get called?

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Dan Hecht


On 03/07/2007 11:05 AM, Jeremy Fitzhardinge wrote:
> James Morris wrote:
>> It seems to me that it could be useful to have a library of common virtual 
>> time code (entirely separate from pv_ops), to avoid re-implementing some 
>> apparently common requirements, such as: handling TSC frequency changes, 
>> stolen time accounting, synthetic programmable clockevent etc.
> 
> Well, let's put our clock* implementations next to each other and see how
> much common code there is to be factored out.
> 
> The Xen time code is pretty lean.  There's not much difference in
> abstraction between the clocksource/event interface and the hypervisor
> interface, so there's just not very much code there.
> 

Jeremy, I saw you sent out the Xen version earlier, thanks.  Here's ours 
for reference (please excuse any formating issues); it's also lean. 
We'll send out a proper patch later after some more testing:


---

/*
 * VMI paravirtual timer support routines.
 *
 * Copyright (C) 2007, VMware, Inc.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
 * NON INFRINGEMENT.  See the GNU General Public License for more
 * details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *
 */

#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 

#include 

#define VMI_ONESHOT  (VMI_ALARM_IS_ONESHOT  | VMI_CYCLES_REAL)
#define VMI_PERIODIC (VMI_ALARM_IS_PERIODIC | VMI_CYCLES_REAL)

static inline u32 vmi_counter(u32 flags)
{
	/* Given VMI_ONESHOT or VMI_PERIODIC, return the corresponding
	 * cycle counter. */
	return flags & VMI_ALARM_COUNTER_MASK;
}

/* paravirt_ops.get_wallclock = vmi_get_wallclock */
unsigned long vmi_get_wallclock(void)
{
	unsigned long long wallclock;
	wallclock = vmi_timer_ops.get_wallclock(); // nsec
	(void)do_div(wallclock, 1000000000);       // sec

	return wallclock;
}

/* paravirt_ops.set_wallclock = vmi_set_wallclock */
int vmi_set_wallclock(unsigned long now)
{
	return 0;
}

/* paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles */
unsigned long long vmi_get_sched_cycles(void)
{
	return vmi_timer_ops.get_cycle_counter(VMI_CYCLES_AVAILABLE);
}

/* paravirt_ops.get_cpu_khz = vmi_cpu_khz */
unsigned long vmi_cpu_khz(void)
{
	unsigned long long khz;
	khz = vmi_timer_ops.get_cycle_frequency();
	(void)do_div(khz, 1000);
	return khz;
}

/** vmi clockevent */

static struct clock_event_device vmi_global_clockevent;

static inline u32 vmi_alarm_wiring(struct clock_event_device *evt)
{
	return (evt == &vmi_global_clockevent) ?
		VMI_ALARM_WIRED_IRQ0 : VMI_ALARM_WIRED_LVTT;
}

static void vmi_timer_set_mode(enum clock_event_mode mode,
			       struct clock_event_device *evt)
{
	u32 wiring;
	cycle_t now, cycles_per_hz;
	BUG_ON(!irqs_disabled());

	wiring = vmi_alarm_wiring(evt);
	if (wiring == VMI_ALARM_WIRED_LVTT)
		/* Route the interrupt to the correct vector */
		apic_write_around(APIC_LVTT, LOCAL_TIMER_VECTOR);

	switch (mode) {
	case CLOCK_EVT_MODE_ONESHOT:
		break;
	case CLOCK_EVT_MODE_PERIODIC:
		cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
		(void)do_div(cycles_per_hz, HZ);
		now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
		vmi_timer_ops.set_alarm(wiring | VMI_PERIODIC,
					now, cycles_per_hz);
		break;
	case CLOCK_EVT_MODE_UNUSED:
	case CLOCK_EVT_MODE_SHUTDOWN:
		switch (evt->mode) {
		case CLOCK_EVT_MODE_ONESHOT:
			vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
			break;
		case CLOCK_EVT_MODE_PERIODIC:
			vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
			break;
		default:
			break;
		}
		break;
	default:
		break;
	}
}

static int vmi_timer_next_event(unsigned long delta,
struct clock_event_device *evt)
{
/* Unfortunately, set_next_event interface only passes relative
 * expiry, but we want absolute expiry.  It'd be better if we
 * were passed an absolute expiry, since a bunch of time may
 * have been st

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

James Morris wrote:
> It seems to me that it could be useful to have a library of common virtual 
> time code (entirely separate from pv_ops), to avoid re-implementing some 
> apparently common requirements, such as: handling TSC frequency changes, 
> stolen time accounting, synthetic programmable clockevent etc.
>   

Well, let's put our clock* implementations next to each other and see how
much common code there is to be factored out.

The Xen time code is pretty lean.  There's not much difference in
abstraction between the clocksource/event interface and the hypervisor
interface, so there's just not very much code there.

One immediate candidate is the scale_delta() function which does the
necessary tsc cycles->ns conversion.  I think that will be generally
useful and should be put somewhere common rather than copied.
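
As a rough illustration of what such a common helper might look like, here is a portable sketch of a tsc-cycles-to-ns scaling that avoids asm by splitting the 64-bit delta into 32-bit halves. It assumes a 32.32 fixed-point multiplier plus a binary shift; the signature is an assumption for this sketch, not the code from the Xen patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a cycle delta to nanoseconds:
 *   ns = ((delta << shift) * mul_frac) >> 32
 * where mul_frac is a 32.32 fixed-point fraction. A negative shift means
 * shift right. The wide product is built from two 64-bit multiplies.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	uint64_t hi, lo;

	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	hi = delta >> 32;		/* upper 32 bits of the delta */
	lo = delta & 0xffffffffu;	/* lower 32 bits of the delta */

	/* (hi * 2^32 + lo) * mul_frac >> 32 == hi * mul_frac + (lo * mul_frac >> 32) */
	return hi * mul_frac + ((lo * mul_frac) >> 32);
}
```

With mul_frac = 0x80000000 (that is, 0.5) and shift = 0, a delta of 1000 cycles scales to 500 ns. Results that do not fit in 64 bits are silently truncated in this sketch.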

I think stolen time is a bit more core, and in principle applies to
non-virtualized systems as well (such as time stolen by SMM and
discontinuities caused by suspend/resume).  The key piece is a monotonic
clock which advances while a vcpu is actually running on a real cpu,
since that should be used to determine how much time each process has
been running for.

Maybe it will just fall out if we start moving to a state-transition
process time accounting rather than the current sample-based one.  Is
there an actual plan to do that, or is it at the handwaving stage?
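
To show what I mean by state-transition accounting, a toy sketch: time is charged to whichever state the vcpu was in at each transition, rather than sampled on a tick. All names here are hypothetical and nothing like this exists in the tree as far as I know.

```c
#include <assert.h>
#include <stdint.h>

enum vcpu_state { VCPU_RUNNING = 0, VCPU_STOLEN = 1, VCPU_NR_STATES };

struct vcpu_acct {
	enum vcpu_state state;			/* state since last_ns */
	uint64_t last_ns;			/* time of the last transition */
	uint64_t total_ns[VCPU_NR_STATES];	/* accumulated time per state */
};

static void vcpu_acct_init(struct vcpu_acct *a, enum vcpu_state s,
			   uint64_t now_ns)
{
	a->state = s;
	a->last_ns = now_ns;
	a->total_ns[VCPU_RUNNING] = 0;
	a->total_ns[VCPU_STOLEN] = 0;
}

/* Charge the elapsed interval to the old state, then switch states. */
static void vcpu_acct_transition(struct vcpu_acct *a, enum vcpu_state next,
				 uint64_t now_ns)
{
	assert(now_ns >= a->last_ns);	/* the clock must be monotonic */
	a->total_ns[a->state] += now_ns - a->last_ns;
	a->state = next;
	a->last_ns = now_ns;
}
```

The running total is then exactly the "clock which advances while a vcpu is actually running" mentioned above, and process time could be charged against that instead of wall time.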

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 13:11 -0500, James Morris wrote:
> On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:
> 
> > I was very pleased when I saw the clocksource/event mechanisms go into
> > the kernel because it means different hypervisors can have a clock*
> > implementation to match their own particular time model/interface
> > without having to clutter up the pv_ops interface, and still have a
> > well-defined interface to the rest of the kernel's time infrastructure.
> 
> It seems to me that it could be useful to have a library of common virtual 
> time code (entirely separate from pv_ops), to avoid re-implementing some 
> apparently common requirements, such as: handling TSC frequency changes, 
> stolen time accounting, synthetic programmable clockevent etc.

Yes please. Expose sane emulated silicon to the kernel core and maintain
your hypervisor decisions behind that silicon instead of exposing us to
10 different silicon versions with 20 bugs each.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Wed, 2007-03-07 at 10:28 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > /For you/ it's certainly no big deal, you dont have to fix it up and you 
> > dont have to keep it flexible ;)
> >   
> 
> How flexible does it need to be?  Its a simple time source and event
> driver.  How flexible does the pit driver need to be?  It's just a small
> leaf node hanging off a large existing piece of kernel infrastructure.
> 
> > and really, i'm not expecting miracles, i've never seen any hardware 
> > vendor argue /against/ support for their own hardware =B-)
> >   
> 
> And since when has it been kernel policy to argue against including a
> well written, self-contained, vendor-provided driver for a piece of
> hardware?

The difference is that we have not much influence on the design
decisions of silicon vendors. We usually see them when the shit already
has been morphed into solid silicon.

Software emulated silicon _IS_ actually under our control. And we want
to have it as sane as possible.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Ingo Molnar wrote:
> ugh. Please take it from me: i've watched the Linux time code walk its 
> long, rocky 10+ years road. One of the first mistakes was when we made 
> the TSC the center of the i386-time universe. (incidentally, it was me 
> who did the first steps of that, as a rookie kernel hacker) We got cured 
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
> beginning of that same road. Meet in another 10 years? ;)

Yep, the tsc has myriad problems; for Xen it's the best of a bad lot. 
Unfortunately in 10 years no clearly better alternative has appeared;
maybe in 10 years there will be one.  It might even be the tsc.

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Ingo Molnar wrote:
> /For you/ it's certainly no big deal, you dont have to fix it up and you 
> dont have to keep it flexible ;)
>   

How flexible does it need to be?  It's a simple time source and event
driver.  How flexible does the pit driver need to be?  It's just a small
leaf node hanging off a large existing piece of kernel infrastructure.

> and really, i'm not expecting miracles, i've never seen any hardware 
> vendor argue /against/ support for their own hardware =B-)
>   

And since when has it been kernel policy to argue against including a
well written, self-contained, vendor-provided driver for a piece of
hardware?

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread James Morris

On Wed, 7 Mar 2007, Jeremy Fitzhardinge wrote:

> I was very pleased when I saw the clocksource/event mechanisms go into
> the kernel because it means different hypervisors can have a clock*
> implementation to match their own particular time model/interface
> without having to clutter up the pv_ops interface, and still have a
> well-defined interface to the rest of the kernel's time infrastructure.

It seems to me that it could be useful to have a library of common virtual 
time code (entirely separate from pv_ops), to avoid re-implementing some 
apparently common requirements, such as: handling TSC frequency changes, 
stolen time accounting, synthetic programmable clockevent etc.

- James
-- 
James Morris
<[EMAIL PROTECTED]>

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread James Morris

On Wed, 7 Mar 2007, Ingo Molnar wrote:

> 
> * Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> 
> > Xen, for example, uses the tsc as the principal timebase in the
> > hypervisor interface. [...]
> 
> ugh. Please take it from me: i've watched the Linux time code walk its 
> long, rocky 10+ years road. One of the first mistakes was when we made 
> the TSC the center of the i386-time universe. (incidentally, it was me 
> who did the first steps of that, as a rookie kernel hacker) We got cured 
> out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
> beginning of that same road. Meet in another 10 years? ;)

What do you suggest instead ?

(Digging into this for lguest now...)



- James
-- 
James Morris
<[EMAIL PROTECTED]>

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Ingo Molnar


* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> I don't think having a clock implementation for each hypervisor is 
> such a big deal.  The Xen one, for example, is 300 lines of 
> straightforward code.

/For you/ it's certainly no big deal, you don't have to fix it up and you
don't have to keep it flexible ;)

and really, i'm not expecting miracles, i've never seen any hardware 
vendor argue /against/ support for their own hardware =B-)

Ingo

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Xen, for example, uses the tsc as the principal timebase in the
> hypervisor interface. [...]

ugh. Please take it from me: i've watched the Linux time code walk its 
long, rocky 10+ years road. One of the first mistakes was when we made 
the TSC the center of the i386-time universe. (incidentally, it was me 
who did the first steps of that, as a rookie kernel hacker) We got cured 
out of that in v2.6.19, v2.6.20 and v2.6.21. Granted, Xen is only at the 
beginning of that same road. Meet in another 10 years? ;)

Ingo

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> That's a pure academic exercise. When we are at the point where
> nanoseconds are too coarse - sometime after we both retired - the
> internal resolution will be femtoseconds or whatever fits.
>
> Again: paravirt should use a common infrastructure for this. Virtual
> clocksource and virtual clockevent devices, which operate on ktime_t and
> not on some artificial clock chip emulation frequency. The backend
> implementation will be still per hypervisor, but we have _ONE_ device
> emulation model, which is exposed to the kernel instead of five.
>   

Different hypervisors have different time interfaces for good reasons -
mostly because the real hardware is such a mess, and there's no clear
"good" answer.  In other words, for the same reason that the new clock
infrastructure exists.

Xen, for example, uses the tsc as the principal timebase in the
hypervisor interface. A shared memory region is updated from time to
time with the tsc frequency and other parameters, and the guest is
expected to compute the current time in ns by extrapolating using the
current tsc value.  This only works because the hypervisor goes to some
effort to synchronize the tsc between the (real) cpus, but it's otherwise
much the same as using the raw tsc.
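
As a rough illustration of the extrapolation described above, here is a minimal user-space sketch. The struct and function names (time_snapshot, guest_time_ns) are invented for this example, and the scale_delta() here is only an assumption about what such a helper does, not the posted code. The 64x32-bit multiply is split into two halves so that no 128-bit type is needed.

```c
#include <stdint.h>

/* Hypothetical snapshot of the values the hypervisor publishes. */
struct time_snapshot {
	uint64_t tsc_timestamp;   /* TSC value when the snapshot was taken */
	uint64_t system_time_ns;  /* ns since boot at that moment */
	uint32_t tsc_to_nsec_mul; /* ns per cycle as a 32-bit binary fraction */
	int      tsc_shift;       /* power-of-two pre-scale for the cycle delta */
};

/* Convert a cycle delta to ns: ((delta << shift) * mul) >> 32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int shift)
{
	uint64_t hi, lo;

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* Split the multiply so the 96-bit product is never held whole. */
	hi = delta >> 32;
	lo = delta & 0xffffffffu;
	return hi * mul + ((lo * mul) >> 32);
}

/* Extrapolate the current time from the last snapshot and a fresh TSC read. */
static uint64_t guest_time_ns(const struct time_snapshot *s, uint64_t tsc_now)
{
	return s->system_time_ns +
	       scale_delta(tsc_now - s->tsc_timestamp,
			   s->tsc_to_nsec_mul, s->tsc_shift);
}
```

With a multiplier of 0x80000000 (0.5 ns per cycle, i.e. a 2 GHz tsc) and a shift of 0, a delta of 2000 cycles comes out as 1000 ns. A real guest would also have to re-read the snapshot if the hypervisor updated it mid-read, which this sketch leaves out.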

Other hypervisors may take other approaches, depending on what the real
underlying hardware is and the real requirements.  One could imagine a
hypervisor exposing an hpet mapping, for example, or just having some
kind of completely synthetic time source.

The point is that if we were to build an abstraction layer over all of
these just so that we could have a single clocksource/event
implementation, it would be pretty much equivalent to the existing clock
infrastructure, and would add no value.

I was very pleased when I saw the clocksource/event mechanisms go into
the kernel because it means different hypervisors can have a clock*
implementation to match their own particular time model/interface
without having to clutter up the pv_ops interface, and still have a
well-defined interface to the rest of the kernel's time infrastructure.

I don't think having a clock implementation for each hypervisor is such
a big deal.  The Xen one, for example, is 300 lines of straightforward code.

> Abstractions for abstraction's sake are braindead. There is no real
> reason to implement 128 bit math into that path just to make the virtual
> clockevent device look like real hardware.
>
> The abstraction of clockevents helps you to get rid of hardwired
> hardware assumptions, but you insist on creating them artificially for
> reasons which are beyond my grasp.
>   

The hypervisor may present abstracted time hardware, but there is real
time hardware under there somewhere, and there are benefits to making
the abstraction as thin as possible.  Xen chooses to express its time
interfaces in ns and so is a good direct match for the Linux time
infrastructure, but it still has to do the 128-bit cycles<->ns conversion
*somewhere*, because the underlying hardware is still using cycles.  It
sounds like the VMware folks have chosen to directly use cycles in order
to avoid that conversion altogether.

> Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
> instead of writing up lengthy excuses, why it is sooo hard and takes
> sooo much time and the current interface is sooo insufficient.
>   

Yep, it worked out well.  The only warty thing in there is the asm
128-bit math needed in scale_delta() to convert tsc cycles to ns.  John
Stultz had suggested (on a much earlier incarnation of this code) that
it could be generally useful and could be hoisted to somewhere more
common.  I've included the whole thing below.

J

--

#include 
#include 
#include 
#include 

#include 

#include 
#include 
#include 

#include "xen-ops.h"

#define XEN_SHIFT 22

/* These are periodically updated in shared_info, and then copied here. */
struct shadow_time_info {
	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
	u32 tsc_to_nsec_mul;
	int tsc_shift;
	u32 version;
};

static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);

/* Xen time at startup */
static s64 startup_offset;

unsigned long xen_cpu_khz(void)
{
	/* 10^6 ns per ms, as 32.32 fixed point, divided by ns per cycle */
	u64 cpu_khz = 1000000ULL << 32;
	const struct vcpu_time_info *info =
		&HYPERVISOR_shared_info->vcpu_info[0].time;

	do_div(cpu_khz, info->tsc_to_system_mul);
	if (info->tsc_shift < 0)
		cpu_khz <<= -info->tsc_shift;
	else
		cpu_khz >>= info->tsc_shift;

	return cpu_khz;
}

/*
 * Reads a consistent set of time-base values from Xen, into a shadow data
 * area.
 */
static void get_time_values_from_xen(void)
{
	struct vcpu_time_info   *src;
	struct shadow_time_info *dst;

	src = &read_pda(xen.vcpu)->time;
	dst = &get_cpu_var(shadow_time);

	do {

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-07 Thread Thomas Gleixner

On Tue, 2007-03-06 at 18:08 -0800, Dan Hecht wrote:
> > IMO the paravirt interfaces should use nanoseconds anyway for both
> > readout and next event programming. That way the conversion is done in
> > the hypervisor once and the clocksources and clockevents are simple and
> > unified (except for the underlying hypervisor calls).
> > 
> 
> I disagree.  The clocksource/clockevents layer are always going to have 
> to convert nanoseconds to/from hardware units, so why not use it?  And, 
> some guests (say, a future version of linux that does trace-based 
> process accounting) may want higher resolution than nanoseconds for 
> certain uses. 

That's a pure academic exercise. When we are at the point where
nanoseconds are too coarse - sometime after we both retired - the
internal resolution will be femtoseconds or whatever fits.

Again: paravirt should use a common infrastructure for this. Virtual
clocksource and virtual clockevent devices, which operate on ktime_t and
not on some artificial clock chip emulation frequency. The backend
implementation will be still per hypervisor, but we have _ONE_ device
emulation model, which is exposed to the kernel instead of five.

On a Linux based host, you probably end up with a hrtimer on the host
side to schedule the next event on the guest. So why do we need to
convert ktime_t to some virtual frequency in the guest so we can convert
it back into ktime_t on the host ?

Abstractions for abstraction's sake are braindead. There is no real
reason to implement 128 bit math into that path just to make the virtual
clockevent device look like real hardware.

The abstraction of clockevents helps you to get rid of hardwired
hardware assumptions, but you insist on creating them artificially for
reasons which are beyond my grasp.

> In any case, this is beside the point; I'd prefer to 
> stick to using the clockevents interface in the way it was intended 
> rather than reaching into ->next_event.

Sigh. The gain is, that you still have a good reason, why you can't move
to the clockevents interface.

Jeremy spent a couple of hours to get NO_HZ running for Xen yesterday
instead of writing up lengthy excuses, why it is sooo hard and takes
sooo much time and the current interface is sooo insufficient.

tglx


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

On Tue, 2007-03-06 at 17:44 -0800, Dan Hecht wrote:
> >>> 2) As I said above. The time accounting for virtualization needs to be
> >>> fixed in a generic way.
> >>>
> >>> I'm not going to accept some weird hackery for virtualization, which is
> >>> of exactly ZERO value for the kernel itself. Quite the contrary it will
> >>> make the cleanup harder and introduce another hard to remove thing,
> >>> which will in the worst case last for ever.
> >>>
> >> Okay, to confirm I'm on the same page as you, you want to move process 
> >> time accounting from being periodic sampled based to being trace based? 
> >> i.e. at the system-call/interrupt boundaries, read clocksource and 
> >> compute directly the amount of system/user/process time?
> > 
> > At least for the paravirt guests this is the correct approach. Once the
> > CPU vendors come up with a sane solution for a reliable and fast clock
> > source we might use that on real hardware as well.
> > 
> 
> I thought your preference was to not do things differently from real 
> hardware?  I guess this case you are okay with since you'd like to see 
> the real hardware case follow eventually?

Real hardware _IS_ broken and slow. If we add the facilities for
virtualization we want it in a way, which is usable by real hardware as
well.

> > Yes, with today's hardware it is simply a PITA. PowerPC has some basic
> > support for this though, IIRC.
> > 
> 
> I think S390 maybe too.

One more reason to make it a generic solution rather than some extra
hackery.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> Ooops. I completely forgot, that you get the absolute expiry time
> already in ktime_t format (nanoseconds) when dev->set_next_event() is
> called.
>
>   dev->next_event = expires;
>
> is done right before the call. 
>
> So it's already there for free.
>   

OK, but a trap for young players (ie, me): the absolute time is in ns
since kernel boot, but the hypervisor wants an absolute time in ns since
system boot.  Everything works reasonably well for the first guest
started early, so be sure to take a snapshot of hypervisor time early in
order to get the correction...
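
The correction can be sketched in a few lines. This is only an illustration under assumptions: the function names are hypothetical, and hv_now_ns stands in for whatever call reads the hypervisor's clock.

```c
#include <stdint.h>

/* Hypervisor time minus kernel time, sampled once, early in guest boot. */
static int64_t startup_offset_ns;

/* hv_now_ns: hypervisor clock (ns since the host booted), hypothetical input.
 * kernel_now_ns: the guest kernel's own clock (ns since the guest booted). */
static void record_startup_offset(int64_t hv_now_ns, int64_t kernel_now_ns)
{
	startup_offset_ns = hv_now_ns - kernel_now_ns;
}

/* Translate a kernel-relative absolute expiry into the hypervisor timebase. */
static int64_t to_hypervisor_time(int64_t kernel_expiry_ns)
{
	return kernel_expiry_ns + startup_offset_ns;
}
```

For a guest started right after the host the offset is close to zero, which is why the bug stays hidden for the first guest; for a guest started an hour later, every expiry passed through unchanged would lie an hour in the hypervisor's past.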

J


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 05:18 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:53 -0800, Dan Hecht wrote:
>>> Ooops. I completely forgot, that you get the absolute expiry time
>>> already in ktime_t format (nanoseconds) when dev->set_next_event() is
>>> called.
>>>
>>> dev->next_event = expires;
>>>
>>> is done right before the call.
>>>
>>> So it's already there for free.
>>
>> Okay.  I noticed that but didn't think it was okay to use since it
>> didn't seem like it was set up for the clock_event_device code's use, so
>> seemed like a conceptual interface violation to go digging around in
>> there.
>
> Yes it is.
>
> I just wanted to point out that you can use it until I'm awake enough to
> implement it properly.

Well, we'll probably just live with using the relative expiry for the
first pass, and then revisit this later once that is working, rather
than resort to hacking it out by reading ->next_event.

>> Also, wasn't one of the points of clockevents to prevent the device code
>> from doing conversions between nanoseconds and clicks themselves?  Don't
>> we really want the clockevents generic layer to do this conversion
>> between monotonic nanoseconds to absolute device clicks and then give
>> the device code that value, so the device layer doesn't perform any
>> conversions?
>
> Right. But this applies only to deltas, as the conversion of absolute
> time values gets ugly, i.e. 128bit math

Yeah, hopefully we can come up with a clean way to do this.  But, like I
said earlier, until we do, we'll stick with the relative expiry.

> IMO the paravirt interfaces should use nanoseconds anyway for both
> readout and next event programming. That way the conversion is done in
> the hypervisor once and the clocksources and clockevents are simple and
> unified (except for the underlying hypervisor calls).

I disagree.  The clocksource/clockevents layer are always going to have
to convert nanoseconds to/from hardware units, so why not use it?  And,
some guests (say, a future version of linux that does trace-based
process accounting) may want higher resolution than nanoseconds for
certain uses.  In any case, this is beside the point; I'd prefer to
stick to using the clockevents interface in the way it was intended
rather than reaching into ->next_event.

thanks,
Dan

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 05:22 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
>>>> accounting would be wrong.  Instead, we should allow the
>>>> tick_sched_timer in cases (c) and (d) to have runtime configurable
>>>> period, and then scale the time value accordingly before passing to
>>>> account_system_time.  This is probably something the Xen folks will want
>>>> also, since I think Xen itself only gets 100hz hard timer, and so it can
>>>> implement at best a oneshot virtual timer with 100hz resolution.  Any
>>>> objections to us doing something like this?
>>>
>>> Yes. It's gross hackery.
>>>
>>> 1) We want to have a cleanup of the tick assumptions _all_ over the
>>> place and this is going to be real hard work.
>>>
>>> 2) As I said above. The time accounting for virtualization needs to be
>>> fixed in a generic way.
>>>
>>> I'm not going to accept some weird hackery for virtualization, which is
>>> of exactly ZERO value for the kernel itself. Quite the contrary it will
>>> make the cleanup harder and introduce another hard to remove thing,
>>> which will in the worst case last for ever.
>>
>> Okay, to confirm I'm on the same page as you, you want to move process
>> time accounting from being periodic sampled based to being trace based?
>> i.e. at the system-call/interrupt boundaries, read clocksource and
>> compute directly the amount of system/user/process time?
>
> At least for the paravirt guests this is the correct approach. Once the
> CPU vendors come up with a sane solution for a reliable and fast clock
> source we might use that on real hardware as well.

I thought your preference was to not do things differently from real
hardware?  I guess this case you are okay with since you'd like to see
the real hardware case follow eventually?

In any case, in paravirt the costs of reading timers and doing system
call transitions are a bit different than on native, so we'll need to
figure out what makes sense given those costs.

>> Do you know if anyone has explored this?  I thought there was a
>> discussion about this a while back but it was rejected due to the
>> sample-based approach having much lower overheads on high system call
>> rate workloads.
>
> Yes, with today's hardware it is simply a PITA. PowerPC has some basic
> support for this though, IIRC.

I think S390 maybe too.

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

On Tue, 2007-03-06 at 16:42 -0800, Dan Hecht wrote:
> >> accounting would be wrong.  Instead, we should allow the 
> >> tick_sched_timer in cases (c) and (d) to have runtime configurable 
> >> period, and then scale the time value accordingly before passing to 
> >> account_system_time.  This is probably something the Xen folks will want 
> >> also, since I think Xen itself only gets 100hz hard timer, and so it can 
> >> implement at best a oneshot virtual timer with 100hz resolution.  Any 
> >> objections to us doing something like this?
> > 
> > Yes. It's gross hackery. 
> > 
> > 1) We want to have a cleanup of the tick assumptions _all_ over the
> > place and this is going to be real hard work.
> > 
> > 2) As I said above. The time accounting for virtualization needs to be
> > fixed in a generic way.
> > 
> > I'm not going to accept some weird hackery for virtualization, which is
> > of exactly ZERO value for the kernel itself. Quite the contrary it will
> > make the cleanup harder and introduce another hard to remove thing,
> > which will in the worst case last for ever.
> >
> 
> Okay, to confirm I'm on the same page as you, you want to move process 
> time accounting from being periodic sampled based to being trace based? 
> i.e. at the system-call/interrupt boundaries, read clocksource and 
> compute directly the amount of system/user/process time?

At least for the paravirt guests this is the correct approach. Once the
CPU vendors come up with a sane solution for a reliable and fast clock
source we might use that on real hardware as well.
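
To make the trace-based idea concrete, here is a small sketch under stated assumptions: the types and the account_transition() name are invented for illustration, and now_ns stands in for a clocksource read taken at each user/kernel boundary.

```c
#include <stdint.h>

enum cpu_mode { MODE_USER, MODE_SYSTEM };

struct task_times {
	uint64_t user_ns;
	uint64_t system_ns;
	uint64_t last_stamp_ns;  /* clock reading at the previous boundary */
	enum cpu_mode mode;      /* mode the task has been in since then */
};

/* Called at every user/kernel boundary with a fresh clock reading. */
static void account_transition(struct task_times *t, uint64_t now_ns,
			       enum cpu_mode new_mode)
{
	uint64_t delta = now_ns - t->last_stamp_ns;

	/* Charge the elapsed interval to the mode that is now ending. */
	if (t->mode == MODE_USER)
		t->user_ns += delta;
	else
		t->system_ns += delta;

	t->last_stamp_ns = now_ns;
	t->mode = new_mode;
}
```

Unlike tick sampling, the result does not depend on the tick period at all, but every boundary costs a clock read, which is the overhead concern raised in the quoted text.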

> Do you know if anyone has explored this?  I thought there was a 
> discussion about this a while back but it was rejected due to the 
> sample-based approach having much lower overheads on high system call 
> rate workloads.

Yes, with today's hardware it is simply a PITA. PowerPC has some basic
support for this though, IIRC.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

On Tue, 2007-03-06 at 16:53 -0800, Dan Hecht wrote:
> > Ooops. I completely forgot, that you get the absolute expiry time
> > already in ktime_t format (nanoseconds) when dev->set_next_event() is
> > called.
> > 
> > dev->next_event = expires;
> > 
> > is done right before the call. 
> > 
> > So it's already there for free.
> > 
> >
> 
> Okay.  I noticed that but didn't think it was okay to use since it 
> didn't seem like it was set up for the clock_event_device code's use, so 
> seemed like a conceptual interface violation to go digging around in 
> there.

Yes it is. 

I just wanted to point out that you can use it until I'm awake enough to
implement it properly.

> Also, wasn't one of the points of clockevents to prevent the device code 
> from doing conversions between nanoseconds and clicks themselves?  Don't 
> we really want the clockevents generic layer to do this conversion 
> between monotonic nanoseconds to absolute device clicks and then give
> the device code that value, so the device layer doesn't perform any 
> conversions?

Right. But this applies only to deltas, as the conversion of absolute
time values gets ugly, i.e. 128bit math
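
A sketch of why deltas are the easy case, assuming the usual mult/shift scaling with a 32-bit multiplier. The evt_scale struct and both helper names are invented for this example. A delta clamped to a few seconds keeps the 64-bit product in range, while an absolute timestamp multiplied by the same factor does not.

```c
#include <stdint.h>

/* Hypothetical device scaling: ticks = (ns * mult) >> shift. */
struct evt_scale {
	uint32_t mult;
	uint32_t shift;
	uint64_t max_delta_ns;  /* largest delta the device accepts */
};

/* Returns 1 if ns * mult fits in 64 bits, 0 if it would overflow. */
static int product_fits_u64(uint64_t ns, uint32_t mult)
{
	return mult == 0 || ns <= UINT64_MAX / mult;
}

/* Scale a relative expiry; the clamp keeps the product within 64 bits. */
static uint64_t delta_ns_to_ticks(const struct evt_scale *s, uint64_t delta_ns)
{
	if (delta_ns > s->max_delta_ns)
		delta_ns = s->max_delta_ns;
	return (delta_ns * s->mult) >> s->shift;
}
```

With mult = 0x80000000 and shift = 32 (a 500 MHz device), a 4 second delta still fits, but one day of uptime in nanoseconds already overflows the 64-bit product, hence the need for wider arithmetic on absolute values.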

IMO the paravirt interfaces should use nanoseconds anyway for both
readout and next event programming. That way the conversion is done in
the hypervisor once and the clocksources and clockevents are simple and
unified (except for the underlying hypervisor calls).

> On an unrelated note, can you explain what the difference between 
> CLOCK_EVT_MODE_UNUSED and CLOCK_EVT_MODE_SHUTDOWN modes are and what the 
> legal state transitions are? (or point me to a document describing 
> this).  At least on i386, all clock event devices treat them the same; 
> do we really need both?

UNUSED:
The device is registered, but not used by any clockevents client

SHUTDOWN:
The device is registered, claimed by a clockevents client, but
momentarily not active.

The clock events device can treat UNUSED and SHUTDOWN basically in the
same way.
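
In driver terms that might look like the sketch below. The enum is a local stand-in defined only for this example (the PERIODIC and ONESHOT entries are included for completeness), and fake_timer is a made-up device.

```c
/* Local stand-in for the clockevents mode enum, for illustration only. */
enum clock_event_mode {
	CLOCK_EVT_MODE_UNUSED,
	CLOCK_EVT_MODE_SHUTDOWN,
	CLOCK_EVT_MODE_PERIODIC,
	CLOCK_EVT_MODE_ONESHOT
};

/* Made-up timer device state. */
struct fake_timer {
	int armed;
	int periodic;
};

static void fake_timer_set_mode(struct fake_timer *t, enum clock_event_mode mode)
{
	switch (mode) {
	case CLOCK_EVT_MODE_UNUSED:
	case CLOCK_EVT_MODE_SHUTDOWN:
		/* Either way nobody wants events right now: stop the timer. */
		t->armed = 0;
		t->periodic = 0;
		break;
	case CLOCK_EVT_MODE_PERIODIC:
		t->armed = 1;
		t->periodic = 1;
		break;
	case CLOCK_EVT_MODE_ONESHOT:
		/* Stays idle until the next event is programmed. */
		t->armed = 0;
		t->periodic = 0;
		break;
	}
}
```

The distinction matters to the core (is the device claimed or not), not to the hardware, which is why the two cases can share one branch.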

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 04:49 PM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 16:35 -0800, Dan Hecht wrote:
>>>> There is no problem for realtime uses, as the reprogramming path is
>>>> running with local interrupts disabled. I can see the point for paravirt
>>>> and I'm not opposed to change / expand the interface for that. It might
>>>> be done by an extra clockevents feature flag, which requests absolute
>>>> time instead of relative time.
>>>
>>> I'm not sure how much different it makes overall.  It's true that
>>> absolute time would be a more useful interface, but because the guest
>>> vcpu can be preempted at any time, we could miss the timeout
>>> regardless.  In Xen if you set a timeout for the past you get an
>>> immediate interrupt; I presume the clockevent code can deal with that?
>>
>> That's the problem though, you won't know to set it for the past since
>> the expiry is relative.  When the vcpu starts running again, it will set
>> the timer to expire X ns from now, not Xns from when the timer was
>> requested.
>
> Ooops. I completely forgot, that you get the absolute expiry time
> already in ktime_t format (nanoseconds) when dev->set_next_event() is
> called.
>
> dev->next_event = expires;
>
> is done right before the call.
>
> So it's already there for free.




Okay.  I noticed that but didn't think it was okay to use since it 
didn't seem like it was set up for the clock_event_device code's use, so 
seemed like a conceptual interface violation to go digging around in 
there.


Also, wasn't one of the points of clockevents to prevent the device code 
from doing conversions between nanoseconds and clicks themselves?  Don't 
we really want the clockevents generic layer to do this conversion 
between monotonic nanoseconds to absolute device clicks and then give
the device code that value, so the device layer doesn't perform any 
conversions?



On an unrelated note, can you explain what the difference between 
CLOCK_EVT_MODE_UNUSED and CLOCK_EVT_MODE_SHUTDOWN modes are and what the 
legal state transitions are? (or point me to a document describing 
this).  At least on i386, all clock event devices treat them the same; 
do we really need both?



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

On Tue, 2007-03-06 at 16:35 -0800, Dan Hecht wrote:
> >> There is no problem for realtime uses, as the reprogramming path is
> >> running with local interrupts disabled. I can see the point for paravirt
> >> and I'm not opposed to change / expand the interface for that. It might
> >> be done by an extra clockevents feature flag, which requests absolute
> >> time instead of relative time.
> >>   
> > 
> > I'm not sure how much different it makes overall.  It's true that
> > absolute time would be a more useful interface, but because the guest
> > vcpu can be preempted at any time, we could miss the timeout
> > regardless.  In Xen if you set a timeout for the past you get an
> > immediate interrupt; I presume the clockevent code can deal with that?
> > 
> 
> That's the problem though, you won't know to set it for the past since 
> the expiry is relative.  When the vcpu starts running again, it will set 
> the timer to expire X ns from now, not Xns from when the timer was 
> requested.

Ooops. I completely forgot, that you get the absolute expiry time
already in ktime_t format (nanoseconds) when dev->set_next_event() is
called.

dev->next_event = expires;

is done right before the call. 

So it's already there for free.

tglx



Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 03:53 PM, Thomas Gleixner wrote:
>> 2) Virtual interrupts have a relatively high overhead as compared with
>> native interrupts.  So, in vmitime, we wanted to be able to lower the
>> timer interrupt rate at runtime, even if HZ is a compile time constant
>> (and set to something high, like 1000hz).  While we could hack this in
>> by using evt->min_delta_ns, it wouldn't really work since process time
>> accounting would be wrong.  Instead, we should allow the
>> tick_sched_timer in cases (c) and (d) to have runtime configurable
>> period, and then scale the time value accordingly before passing to
>> account_system_time.  This is probably something the Xen folks will want
>> also, since I think Xen itself only gets 100hz hard timer, and so it can
>> implement at best a oneshot virtual timer with 100hz resolution.  Any
>> objections to us doing something like this?
>
> Yes. It's gross hackery.
>
> 1) We want to have a cleanup of the tick assumptions _all_ over the
> place and this is going to be real hard work.
>
> 2) As I said above. The time accounting for virtualization needs to be
> fixed in a generic way.
>
> I'm not going to accept some weird hackery for virtualization, which is
> of exactly ZERO value for the kernel itself. Quite the contrary it will
> make the cleanup harder and introduce another hard to remove thing,
> which will in the worst case last for ever.



Okay, to confirm I'm on the same page as you, you want to move process 
time accounting from being periodic sampled based to being trace based? 
i.e. at the system-call/interrupt boundaries, read clocksource and 
compute directly the amount of system/user/process time?


Do you know if anyone has explored this?  I thought there was a 
discussion about this a while back but it was rejected due to the 
sample-based approach having much lower overheads on high system call 
rate workloads.


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 04:24 PM, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
>>> 3) clockevent set_next_event interface is suboptimal for paravirt (and
>>> probably realtime-ish uses).  The problem is that the expiry is passed
>>> as a relative time.  On paravirt, an arbitrary amount of (stolen) time
>>> may have passed since the delta was computed and when the timer device
>>> is programmed, causing that next interrupt to be too far out in the
>>> future.  It seems a better interface for set_next_event would be to pass
>>> the current time and the absolute expiry.  Actually, I sent email to
>>> Thomas and Ingo about this (and some other clockevents/hrtimer feedback)
>>> in July 2006, but never heard back.  Thoughts?
>>
>> There is no problem for realtime uses, as the reprogramming path is
>> running with local interrupts disabled. I can see the point for paravirt
>> and I'm not opposed to change / expand the interface for that. It might
>> be done by an extra clockevents feature flag, which requests absolute
>> time instead of relative time.
>
> I'm not sure how much different it makes overall.  It's true that
> absolute time would be a more useful interface, but because the guest
> vcpu can be preempted at any time, we could miss the timeout
> regardless.  In Xen if you set a timeout for the past you get an
> immediate interrupt; I presume the clockevent code can deal with that?



That's the problem though, you won't know to set it for the past since 
the expiry is relative.  When the vcpu starts running again, it will set 
the timer to expire X ns from now, not Xns from when the timer was 
requested.
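
A toy model of the difference, with made-up helper names: the timer is requested at one time, but the vcpu only reaches the programming step after some stolen time has passed.

```c
#include <stdint.h>

/* Relative programming: the delta is applied at whatever time the
 * (possibly preempted) vcpu actually reaches the timer device. */
static uint64_t fire_time_relative(uint64_t delta_ns, uint64_t programmed_at_ns)
{
	return programmed_at_ns + delta_ns;
}

/* Absolute programming: the deadline does not move, however late the
 * programming step runs. */
static uint64_t fire_time_absolute(uint64_t expiry_ns, uint64_t programmed_at_ns)
{
	(void)programmed_at_ns;
	return expiry_ns;
}
```

Say the timer is requested at t = 1000 ns with a 500 ns delta, and 2000 ns are stolen before it is programmed at t = 3000 ns. The relative form fires at 3500 ns, late by exactly the stolen time; the absolute form still names 1500 ns, which the backend can recognise as already in the past.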


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

On Tue, 2007-03-06 at 16:24 -0800, Jeremy Fitzhardinge wrote:
> >> 3) clockevent set_next_event interface is suboptimal for paravirt (and 
> >> probably realtime-ish uses).  The problem is that the expiry is passed 
> >> as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
> >> may have passed since the delta was computed and when the timer device 
> >> is programmed, causing that next interrupt to be too far out in the 
> >> future.  It seems a better interface for set_next_event would be to pass 
> >> the current time and the absolute expiry.  Actually, I sent email to 
> >> Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
> >> in July 2006, but never heard back.  Thoughts?
> >> 
> >
> > There is no problem for realtime uses, as the reprogramming path is
> > running with local interrupts disabled. I can see the point for paravirt
> > and I'm not opposed to change / expand the interface for that. It might
> > be done by an extra clockevents feature flag, which requests absolute
> > time instead of relative time.
> >   
> 
> I'm not sure how much difference it makes overall.  It's true that
> absolute time would be a more useful interface, but because the guest
> vcpu can be preempted at any time, we could miss the timeout
> regardless.  In Xen if you set a timeout for the past you get an
> immediate interrupt; I presume the clockevent code can deal with that?

Yep. You also can return -ETIME so it just works w/o an interrupt.

tglx
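
A minimal sketch of the -ETIME convention mentioned above, assuming an absolute-expiry variant of set_next_event. The struct and both function names are hypothetical and this is plain userspace C, not the kernel interface; it only shows how a backend could report "already expired" and how a caller could then handle the event without waiting for an interrupt.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-vcpu timer state; not a kernel structure. */
struct hyper_timer {
    uint64_t expiry_ns; /* last deadline handed to the imaginary hypervisor */
    int armed;          /* non-zero while a deadline is pending */
};

/*
 * Program an absolute deadline. If it already lies in the past, arm nothing
 * and return -ETIME so the caller can run the expiry path directly.
 */
int hyper_set_next_event_abs(struct hyper_timer *t, uint64_t now_ns,
                             uint64_t expiry_ns)
{
    if (expiry_ns <= now_ns) {
        t->armed = 0;
        return -ETIME;
    }
    t->expiry_ns = expiry_ns;
    t->armed = 1;
    return 0;
}

/*
 * Caller side: returns 1 if the event was already due and must be handled
 * now, 0 if an interrupt will deliver it later.
 */
int hyper_timer_arm_or_expire(struct hyper_timer *t, uint64_t now_ns,
                              uint64_t expiry_ns)
{
    return hyper_set_next_event_abs(t, now_ns, expiry_ns) == -ETIME;
}
```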





Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Jeremy Fitzhardinge

Thomas Gleixner wrote:
> All paravirt users probably want to have NO_HZ, so PARAVIRT might simply
> depend on NO_HZ. Of course I might be wrong :)
>   

Xen can deal either way, but tickless is certainly preferred.

> OTOH the stolen time accounting should be fixed in general and not rely
> on "it happens to work now" assumptions. And it should be done for _ALL_
> hypervisors in the same way, i.e. in the generic code.
>   

Yep.  We'll need to come up with a common story for that. 

>>  This is probably something the Xen folks will want 
>> also, since I think Xen itself only gets 100hz hard timer, and so it can 
>> implement at best a oneshot virtual timer with 100hz resolution.  Any 
>> objections to us doing something like this?
>> 

Xen has a nanosecond resolution one-shot timer which I'm using for
this.  There's also a 100Hz tick which gets in the way a bit (it will
appear as a stream of spurious timeouts), but we'll turn that off soon.

>> 3) clockevent set_next_event interface is suboptimal for paravirt (and 
>> probably realtime-ish uses).  The problem is that the expiry is passed 
>> as a relative time.  On paravirt, an arbitrary amount of (stolen) time 
>> may have passed since the delta was computed and when the timer device 
>> is programmed, causing that next interrupt to be too far out in the 
>> future.  It seems a better interface for set_next_event would be to pass 
>> the current time and the absolute expiry.  Actually, I sent email to 
>> Thomas and Ingo about this (and some other clockevents/hrtimer feedback) 
>> in July 2006, but never heard back.  Thoughts?
>> 
>
> There is no problem for realtime uses, as the reprogramming path is
> running with local interrupts disabled. I can see the point for paravirt
> and I'm not opposed to change / expand the interface for that. It might
> be done by an extra clockevents feature flag, which requests absolute
> time instead of relative time.
>   

I'm not sure how much difference it makes overall.  It's true that
absolute time would be a more useful interface, but because the guest
vcpu can be preempted at any time, we could miss the timeout
regardless.  In Xen if you set a timeout for the past you get an
immediate interrupt; I presume the clockevent code can deal with that?

J

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

Dan,

On Tue, 2007-03-06 at 13:07 -0800, Dan Hecht wrote:
> > Why is this so non-trivial ? All you have to do is _NOT_ register
> > PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
> > which uses the hypervisor timer emulation instead of real hardware.
> > 
> > clockevents breaks the hardwired assumptions of the old timer code and
> > allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
> > stuff like
> > 
> >     /* Disable PIT. */
> >     outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
> > 
> 
> Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and 
> we still need to stop it.

I guess you have access to the source code of this virtual BIOS. So this
is a real cute technical solution.

ROTFL. The number of lame excuses in this whole virtualization
discussion is amazing.

> In any case, clockevents doesn't really make it easier nor harder as far 
> as init goes.  In the pre-clockevent days, we replace setup_pit_timer, 
> setup_boot_clock, setup_secondary_clock.  With clockevents, I think the 
> hook points are the same.  Mostly just need to allow the per-cpu 
> lapic_event to be generalized to local_clock_events that can be set to 
> whatever device we want.  The other thing on i386 is just some minor 
> annoyances due to initially setting up only the PIT on cpu0 on irq 0 and 
> then later setting up per-cpu timer on lvtt, and making this all play 
> nice with paravirt timers.  But these are just details and just require 
> some minor changes and will be working, but it just takes some massaging.

Nothing forces you to follow that low level hardware scheme. That's
_WHY_ clockevents are there. Create a per cpu clock event source, which
uses whatever interrupt you want (you just need to be able to pin it to
the cpu)
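
As a rough illustration of the shape being described (one event device per cpu, driven by whatever interrupt the hypervisor raises), here is a userspace mock. Everything in it is hypothetical: `struct hyper_ce_device` is a simplified stand-in and not the kernel's clock event device structure, and the "hypervisor" is a stub that only records the requested delta.

```c
#include <stddef.h>
#include <stdint.h>

#define HYPER_NR_CPUS 4 /* hypothetical number of vcpus */

/* Hypothetical stand-in for a clock event device; NOT the kernel's struct. */
struct hyper_ce_device {
    const char *name;
    unsigned int cpu;       /* vcpu this device is pinned to */
    unsigned int irq;       /* interrupt raised on expiry */
    uint64_t next_delta_ns; /* last delta passed to the hypervisor stub */
    unsigned long events;   /* expiries delivered so far */
    int (*set_next_event)(struct hyper_ce_device *dev, uint64_t delta_ns);
    void (*event_handler)(struct hyper_ce_device *dev);
};

static struct hyper_ce_device hyper_ce_devices[HYPER_NR_CPUS];

/* Stub for "arm a one-shot timer"; a real backend would call the hypervisor. */
static int hyper_ce_set_next_event(struct hyper_ce_device *dev,
                                   uint64_t delta_ns)
{
    dev->next_delta_ns = delta_ns;
    return 0;
}

/* Default handler for the mock: just count the expiry. */
static void hyper_ce_count_event(struct hyper_ce_device *dev)
{
    dev->events++;
}

/* Set up the device for one cpu; no PIT/HPET/APIC timer is involved. */
struct hyper_ce_device *hyper_ce_register(unsigned int cpu, unsigned int irq)
{
    struct hyper_ce_device *dev;

    if (cpu >= HYPER_NR_CPUS)
        return NULL;
    dev = &hyper_ce_devices[cpu];
    dev->name = "hyper-ce";
    dev->cpu = cpu;
    dev->irq = irq;
    dev->next_delta_ns = 0;
    dev->events = 0;
    dev->set_next_event = hyper_ce_set_next_event;
    dev->event_handler = hyper_ce_count_event;
    return dev;
}

/* Per-cpu interrupt entry: dispatch to the device that belongs to this cpu. */
void hyper_ce_interrupt(unsigned int cpu)
{
    if (cpu < HYPER_NR_CPUS && hyper_ce_devices[cpu].event_handler)
        hyper_ce_devices[cpu].event_handler(&hyper_ce_devices[cpu]);
}
```

The only requirement the mock encodes is the one stated above: the interrupt has to arrive on the cpu that owns the device, which is why dispatch is indexed by cpu.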

> So, that is not the real reason to move over to clockevents. 

It is partially, because clockevents remove the hardcoded hardware
assumptions.

> The real 
> reason is to use the generic interrupt handlers.  We understand that, 
> and will get to that point.  In the meantime, we are harming no one. 
> Our code has zero effect when you are booting natively or on a non-VMI 
> hypervisor.

The "we are harming no one" argument is a great excuse to push random
hackery into the kernel. Once it is there, there is no rush to fix it
because it works (for you).

That's exactly the point which is discussed in the "Xen & VMI" thread.
We open up a can of worms and within no time we have 5 or more different
solutions for the same problem. If we do not look carefully at this, we
have no way to do any changes in the core code w/o breaking one of those
hypervisor interfaces. The in tree / FOSS hypervisor interfaces might be
fixable, but those which throw a binary blob to the kernel are not.

I completely agree with Ingo, that this whole paravirt business starts
to crawl across the kernel spreading paralysis all over the place.

We have already enough trouble with real hardware, so we want to
carefully avoid that we get broken virtual hardware as an extra workload
via paravirt ops.

> >> We worked around this by keeping NO_IDLE_HZ support, which now 
> >> you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
> >> and it is working fine.  We understand the benefits of moving to the CE 
> >> model - but it cannot be done overnight.
> > 
> > This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
> > irq_enter() and irq_exit() so the clockevents code is actually invoked.
> > I have not looked close enough why this does work at all.
> > 
> 
> I believe this was just a quick fix in response to Ingo breaking the VMI 
> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
>
> (The two work okay together because when using NO_IDLE_HZ, the hooks are 
> deeper in a custom safe_halt routine which isn't registered when using 
> nohz mode at runtime, and conversely, the nohz code is guarded at 
> runtime by the ts->nohz_mode.  So, the two really can co-exist at 
> compile time).

It is guarded by the fact, that you are not registering clockevent
devices. It's not guarded by design. It happens to work.

> Again, no one is arguing that we shouldn't move to clockevents, it's 
> just a matter of time (sorry, no pun intended).

clockevents have been around for quite a time - pun intended :). They
did not surface surprisingly with 2.6.21-rc1.

> The vmi-time code was introduced to solve some shortcomings of the old 
> (pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was 
> especially painful for virtualization.  Certainly, 
> clocksource/clockevents/NO_HZ solves many of the problems (basically, 
> moving away from counting interrupts to using time sources).  e.g. xtime 
> updating is no longer a worry with the new timeofday/clocksource stuff. 
>   But there are some that may not quite be solved, listed below.  (I 
> know I'm not telling you anything

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 02:21 PM, Andi Kleen wrote:
>> I believe this was just a quick fix in response to Ingo breaking the VMI
>> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical
>> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
>
> Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users.


The only thing NO_IDLE_HZ=y "forces" on other users is some extra code 
(which you are going to get no matter what with CONFIG_PARAVIRT).   It 
doesn't force them to use this code.  It just provides a few extra 
routines that a paravirt_ops backend might want to call back into (I 
think both vmi and xen backends use these routines and that is why it 
became associated with CONFIG_PARAVIRT rather than CONFIG_VMI).


Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Andi Kleen

> I believe this was just a quick fix in response to Ingo breaking the VMI 
> build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
> reason why NO_IDLE_HZ=y can't coexist with NO_HZ.

Well it's nasty that you force NO_IDLE_HZ on all of paravirt ops users.
I think the right solution is to make VMI depend on (not select) NO_IDLE_HZ
until you can fix your code to work with dynticks properly.

-Andi

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Dan Hecht


On 03/06/2007 02:59 AM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
>>> a proper CE device also has the added bonus of making high-res timers
>>> guests work automatically. It should be simple: just pass it through to
>>> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has
>>> essentially no guest-side complexity.
>>
>> It is not so simple.  In theory it works great.  In reality, the i386
>> implementation is completely hardwired to work the way hardware works,
>> and breaking the clockevent code out of the deep ties to the APIC is
>> extremely non-trivial.  We tried, and could not accomplish it for 2.6.21
>> because the hrtimers integration was complex, and introduced many bugs
>> for us.
>
> Why is this so non-trivial ? All you have to do is _NOT_ register
> PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
> which uses the hypervisor timer emulation instead of real hardware.
>
> clockevents breaks the hardwired assumptions of the old timer code and
> allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
> stuff like
>
>     /* Disable PIT. */
>     outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */


Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and 
we still need to stop it.


In any case, clockevents doesn't really make it easier nor harder as far 
as init goes.  In the pre-clockevent days, we replace setup_pit_timer, 
setup_boot_clock, setup_secondary_clock.  With clockevents, I think the 
hook points are the same.  Mostly just need to allow the per-cpu 
lapic_event to be generalized to local_clock_events that can be set to 
whatever device we want.  The other thing on i386 is just some minor 
annoyances due to initially setting up only the PIT on cpu0 on irq 0 and 
then later setting up per-cpu timer on lvtt, and making this all play 
nice with paravirt timers.  But these are just details and just require 
some minor changes and will be working, but it just takes some massaging.


So, that is not the real reason to move over to clockevents.  The real 
reason is to use the generic interrupt handlers.  We understand that, 
and will get to that point.  In the meantime, we are harming no one. 
Our code has zero effect when you are booting natively or on a non-VMI 
hypervisor.


>> We worked around this by keeping NO_IDLE_HZ support, which now
>> you deprecated.  So now we are using NO_HZ without a hyper-CE device,
>> and it is working fine.  We understand the benefits of moving to the CE
>> model - but it cannot be done overnight.
>
> This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
> irq_enter() and irq_exit() so the clockevents code is actually invoked.
> I have not looked close enough why this does work at all.



I believe this was just a quick fix in response to Ingo breaking the VMI 
build yesterday by disabling NO_IDLE_HZ on us.  There is no technical 
reason why NO_IDLE_HZ=y can't coexist with NO_HZ.


(The two work okay together because when using NO_IDLE_HZ, the hooks are 
deeper in a custom safe_halt routine which isn't registered when using 
nohz mode at runtime, and conversely, the nohz code is guarded at 
runtime by the ts->nohz_mode.  So, the two really can co-exist at 
compile time).


Again, no one is arguing that we shouldn't move to clockevents, it's 
just a matter of time (sorry, no pun intended).


The vmi-time code was introduced to solve some shortcomings of the old 
(pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was 
especially painful for virtualization.  Certainly, 
clocksource/clockevents/NO_HZ solves many of the problems (basically, 
moving away from counting interrupts to using time sources).  e.g. xtime 
updating is no longer a worry with the new timeofday/clocksource stuff. 
 But there are some that may not quite be solved, listed below.  (I 
know I'm not telling you anything new, but I might as well flesh it out 
for the other paravirt folks while the code is fresh in my mind):


1) Stolen time (virtual cpu is ready to run but not running): this is 
handled inconsistently between the various clockevent handlers / 
CLOCK_EVT_MODE_ONESHOT combinations:


 a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you 
define "periodic" timer in a paravirtual world.  If you do something 
like Xen-style where you send periodic events only to running vcpus, 
then this handler suffers from some of the same problems as the old i386 
timer handler:
  - jiffies updated according to the number of interrupts you get, so 
falls behind monotonic time.  generally, counting timer interrupts is 
bad for paravirt.
  - process time updated according to the number of interrupts, so 
falls behind monotonic time.  This is probably okay though, since it is 
essentially tracking (mono - stolen) time.  I.e. the missing time is stolen.
  - jiffies updated only by boot cpu, which is a problem for paravirt 
since the boot vcpu can be descheduled while the other vcpus are scheduled.
  -

Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

2007-03-06 Thread Thomas Gleixner

On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
> > a proper CE device also has the added bonus of making high-res timers 
> > guests work automatically. It should be simple: just pass it through to 
> > your hypervisor, a hyper-CE-device, like a hyper-clocksource device has 
> > essentially no guest-side complexity.
> >   
> 
> It is not so simple.  In theory it works great.  In reality, the i386 
> implementation is completely hardwired to work the way hardware works, 
> and breaking the clockevent code out of the deep ties to the APIC is 
> extremely non-trivial.  We tried, and could not accomplish it for 2.6.21 
> because the hrtimers integration was complex, and introduced many bugs 
> for us.

Why is this so non-trivial ? All you have to do is _NOT_ register
PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
which uses the hypervisor timer emulation instead of real hardware.

clockevents breaks the hardwired assumptions of the old timer code and
allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
stuff like

    /* Disable PIT. */
    outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
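
For readers who want to check the comment on that write, here is a small userspace sketch that splits an 8253/8254 mode/command byte into its fields. The struct and function names are made up for illustration; the bit layout is the standard PIT control word (counter select in bits 7-6, access mode in bits 5-4, operating mode in bits 3-1, BCD flag in bit 0).

```c
#include <stdint.h>

/* Fields of the 8253/8254 PIT mode/command byte (hypothetical helper type). */
struct pit_mode_fields {
    unsigned int channel; /* bits 7-6: counter select */
    unsigned int access;  /* bits 5-4: 0 latch, 1 LSB only, 2 MSB only,
                             3 LSB then MSB */
    unsigned int mode;    /* bits 3-1: operating mode, returned raw */
    unsigned int bcd;     /* bit 0: 0 = binary, 1 = BCD */
};

/* Split a mode/command byte into its four fields. */
struct pit_mode_fields pit_decode_mode(uint8_t value)
{
    struct pit_mode_fields f;

    f.channel = (value >> 6) & 0x3;
    f.access  = (value >> 4) & 0x3;
    f.mode    = (value >> 1) & 0x7;
    f.bcd     = value & 0x1;
    return f;
}
```

Decoding 0x3a gives channel 0, access 3 (LSB then MSB), mode 5 and binary counting, which matches the comment in the quoted snippet.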

> We worked around this by keeping NO_IDLE_HZ support, which now 
> you deprecated.  So now we are using NO_HZ without a hyper-CE device, 
> and it is working fine.  We understand the benefits of moving to the CE 
> model - but it cannot be done overnight.

This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
irq_enter() and irq_exit() so the clockevents code is actually invoked.
I have not looked close enough why this does work at all.

I have the feeling that "working fine" means something like "does not
explode".

We really want to fix this now instead of pushing some hack into the kernel
without knowing why it works.

tglx

