Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-07-11 Thread Roland McGrath
> That should work well.  But how does the handler know whether a ptrace
> trigger occurred?  I can think of several possible ways, none of them
> very attractive.  Simply checking the vdr6 value might not work.  The
> simplest approach would be to see if the trigger callback address is
> equal to ptrace_triggered -- it's a hack but it is reliable.

That's what I did in the powerpc code.  You might recall I originally
argued for not using a regular hw_breakpoint struct and callback for ptrace
at all.  (I still think it could wind up pretty clean and tight to have it
purely a special case using its own data structures without a struct
hw_breakpoint.  Only the priority stuff has to do something special to
treat ptrace-in-use as a registration with the right priority.)

> For that matter, knowing when to set vdr6 is a little tricky.  I guess
> it should be set whenever a debug exception occurs in user mode (which
> includes both breakpoints and single-step events).  But what about
> ptrace triggers while the CPU is in kernel mode?  Should they set the
> four DR_TRAPn bits in vdr6 and leave the rest alone?

When do_debug sets TIF_SINGLESTEP, it will lead to a SIGTRAP on the way
back to user mode.  The idea is that it should appear to user mode like the
syscall was any hardware instruction that got the step trap.  So it follows
(and matches existing behavior) to set DR_STEP in vdr6 in this case too.

> I don't.  I used TIF_DEBUG because it was already there and it was 
> atomic.  But setting hw_breakpoint_info is equally atomic, so there's 
> no reason to keep TIF_DEBUG.

I would not remove TIF_DEBUG where it exists now.  It was added on x86 and
x86_64 as an optimization so the common case tests and decides not to call
__switch_to_xtra with one instruction.  Don't lose that optimization.

> > The num_installed/num_kbps stuff feels a little hokey when it's really a
> > flag because the maximum number is one.  It seems like I could make it
> > tighter with some more finesse in the arch-specific hook options, so that
> > chbi and thbi each just store dabr, dabr!=0 means "mine gets installed",
> > and the switch in is just chbi->dabr?:thbi->dabr or something like that.
> 
> You certainly can do that in the hook routines.  But the generic code
> still needs to use num_installed (which doesn't get used very much) and
> num_kbps.

What I meant is using some arch hooks instead of those fields in the
generic code.  On machines where there is a count to keep, they would just
be trivial accessors (could be one-line macros).  On powerpc, they would be
implemented slightly differently and return 1 or 0.

> > Some uses might be happy with trigger-before, but I don't see much benefit.
> 
> Other than ptrace backward-compatibility.

Right, I wasn't suggesting losing that.

> I never have either.  Possibly you might want to change the value just 
> before the read, based on the address of the code doing the reading.  
> But I've never heard of anyone doing that.

Ah, that's a thought.  Ok.  I was already tending towards flexibility to
let someone do that if they wanted to (on trigger-before machines).

> > int hw_breakpoint_triggers_before(struct hw_breakpoint *);
> > int hw_breakpoint_can_resume(struct hw_breakpoint *);
> > 
> > or perhaps taking (unsigned int type) instead, in .
> > i.e. for x86:
> > 
> > #define hw_breakpoint_triggers_before(type) ((type) == 
> > HW_BREAKPOINT_EXECUTE)
> > #define hw_breakpoint_can_resume(type)  1
> > 
> > and powerpc:
> > 
> > #define hw_breakpoint_triggers_before(any)  1
> > #define hw_breakpoint_can_resume(any)   0
> 
> I prefer the second alternative.  For the first, you'd have to register 
> the breakpoint before knowing how it will behave!

Yes, I was sort of thinking it up while I typed there.

> In general that sounds good.  But do we really want the register call
> to fail if extra handlers are defined?  That approach makes portable
> drivers harder to write.  Maybe it would be better to fail only if all
> of the arch-supported handler alternatives are NULL.

My rationale is that judiciously rejecting impossible settings in fact
makes it easier to write (correct) portable drivers.  If you set a callback
function that will never be called, you are confused and are going to have
the logic go wrong in your code.  If you can't get started while under the
delusion that your function is going to be called, then you won't waste all
that time on subtle debugging trying to figure out why it's not getting called.

> > We'd still want hw_breakpoint_can_resume to tell whether you can return
> > from a pre_handler and continue with no a post_handler, without needing to
> > unregister the breakpoint.  That's true on ia64, while on powerpc you
> > either have to clear the breakpoint or request the post_handler stepping 
> > logic.
> 
> Unregistering the breakpoint isn't good on SMP systems, since it would
> be unregistered on all CPUs.  I think it would better to requir

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-07-06 Thread Alan Stern
On Wed, 27 Jun 2007, Roland McGrath wrote:

> In the first battle just to make it compile, the only issue was that you
> assume every machine has TIF_DEBUG, which is in fact an implementation
> detail chosen lately by i386 and x86_64.  AFAIK the only reason for it
> there is just to make a cheap test of multiple bits in the hot path
> deciding to call __switch_to_xtra.  Do you rely on it meaning something
> more precise than just being a shorthand for hw_breakpoint_info!=NULL?

Going over the code, I remembered that TIF_DEBUG really does mean moree
than just hw_breakpoint_info != NULL.  It means that the thread
actually has some breakpoints registered.

Why keep the hw_breakpoint_info structure if there are no registered 
breakpoints?  I did it so that the virtualized DR[0-3] values would 
remain intact.

For other processors that have only one debug register, this won't
matter so much.  But of course there are references to TIF_DEBUG in the 
arch-independent code.  Do you think there would be any problem about 
reserving a bit for TIF_DEBUG in the other architectures?

Alan Stern

P.S.: I'm just now getting around to doing the stuff we discussed last
week.  It has been a busy time...  At OLS somebody asked when 
hw-breakpoint would get into the mainline.  I guessed that it would be 
a few months before it is added to -mm.  Solving this 
pre/post-notification issue will be difficult.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-28 Thread Alan Stern
On Wed, 27 Jun 2007, Roland McGrath wrote:

> I did the first crack at a powerpc port.  I'd appreciate your comments on
> this patch.  It should not be incorporated, isn't finished, probably breaks
> ptrace, etc.  I'm posting it now just to get any thoughts you have raised
> by seeing the second machine share the intended arch-independent code.

I'll go over it in detail and get back to you later.

> I just translated your implementation to powerpc terms without thinking
> about it much.  If you see anything that you aren't sure is right, please
> tell me and don't presume there is some powerpc-specific reason it's
> different.  More likely I just didn't think it through.
> 
> In the first battle just to make it compile, the only issue was that you
> assume every machine has TIF_DEBUG, which is in fact an implementation
> detail chosen lately by i386 and x86_64.  AFAIK the only reason for it
> there is just to make a cheap test of multiple bits in the hot path
> deciding to call __switch_to_xtra.  Do you rely on it meaning something
> more precise than just being a shorthand for hw_breakpoint_info!=NULL?

I don't.  I used TIF_DEBUG because it was already there and it was 
atomic.  But setting hw_breakpoint_info is equally atomic, so there's 
no reason to keep TIF_DEBUG.

> Incidentally, I think it would be nice if kernel/hw_breakpoint.c itself had
> all the #include's for everything it uses directly.  arch hw_breakpoint.c
> files probably only need  and one or two others to
> define what they need before #include "../../../kernel/hw_breakpoint.c".

Okay.

> The num_installed/num_kbps stuff feels a little hokey when it's really a
> flag because the maximum number is one.  It seems like I could make it
> tighter with some more finesse in the arch-specific hook options, so that
> chbi and thbi each just store dabr, dabr!=0 means "mine gets installed",
> and the switch in is just chbi->dabr?:thbi->dabr or something like that.

You certainly can do that in the hook routines.  But the generic code
still needs to use num_installed (which doesn't get used very much) and
num_kbps.

> As we get more machines, more cleanups along these lines will probably make
> sense.  (Also, before the next person not me or you tries a port, we could
> use for the generic hw_breakpoint.c to get some comments at the top making
> explicit what the arch file is expected to define in its types, etc.)

Good idea.  Splitting the code into two pieces made it considerably 
more opaque.

> With just the included change to the generic code for the TIF_DEBUG, this
> kind of works.  That is, it doesn't break everything else and I can use
> bptest, sort of.  I didn't even try ptrace, I probably broke that.
> 
> It works enough to make clear the main new wrinkle.  On powerpc, the data
> breakpoint exception is a fault before the instruction executes, not a trap
> after it.  The load/store will not complete until the breakpoint is cleared.
> With this patch, you can use bptest to generate a tight loop of bp0 triggers.

A nice new way to hang your machine!  :-)

> For ptrace compatibility, userland already expects to deal with this.  gdb
> has it as per-machine implementation options how ptrace watchpoints behave,
> and for powerpc it knows to remove the watchpoint, step, and reinsert it.
> 
> One approach for hw_breakpoint is just to expose in asm/hw_breakpoint.h
> some standard macros saying how things behave, and caveat emptor.  But I
> don't like that much.  I think things will just wind up being confused and
> inadvertently unportable if the important semantics vary too much between
> machines.  The point of the whole facility is to make watchpoints easy to
> use, after all.

Agreed.

> Some uses might be happy with trigger-before, but I don't see much benefit.

Other than ptrace backward-compatibility.

> For writing, the trigger function can look at the memory before it's
> changed.  But you could just as well have recorded the old value before
> setting the breakpoint, as you have to for trigger-after--and to see both
> old and new values you then need to single-step to get the new value, which
> trigger-after handles with a single exception.  For reading, the trigger
> function can change the memory before it's read.  But likewise, you could
> just as well have changed it before setting the breakpoint--you know noone
> will have read the new value until your trigger anyway.  (I have never used
> a read-triggered breakpoint, so I'm rather vague on those use scenarios.)

I never have either.  Possibly you might want to change the value just 
before the read, based on the address of the code doing the reading.  
But I've never heard of anyone doing that.

> The third machine whose manual I have handy is ia64.  It has instruction
> and data breakpoints that are both trigger-before.  It has processor flags
> similar to x86's RF for both, to ignore one or both breakpoint flavor for
> one instruction.  That makes it cheap to continue past the br

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-28 Thread Alan Stern
On Wed, 27 Jun 2007, Roland McGrath wrote:

> > In theory we should get an exception with both DR_STEP and DR_TRAPn 
> > set, meaning that neither notifier will return NOTIFY_STOP.  But if the 
> > kprobes handler clears DR_STEP in the DR6 image passed to the 
> > hw_breakpoint handler, it should work out better.
> 
> It's since occurred to me that kprobes can and should do:
> 
>   args->err &= ~(unsigned long) DR_STEP;
>   if (args->err == 0)
>   return NOTIFY_STOP;
> 
> This doesn't affect do_debug directly, but it will change the value seen by
> the next notifier.  So if hw_breakpoint_handler is responsible for setting
> vdr6 based on its args->err value, we should win.

Exactly what I had in mind.

> > > vdr6 belongs wholly to hw_breakpoint, no other code refers to it
> > > directly.  hw_breakpoint's notifier sets vdr6 with non-DR_TRAPn bits,
> > > if it's a user-mode exception.  If it's a ptrace exception it also
> > > sets the mapped DR_TRAPn bits.  If it's not a ptrace exception and
> > > only DR_TRAPn bits were newly set, then it returns NOTIFY_STOP.  If
> > > it's a spurious exception from lazy db7 setting, hw_breakpoint just
> > > returns NOTIFY_STOP early.
> > 
> > That sounds not quite right.  To a user-space debugger, a system call
> > should appear as an atomic operation.  If multiple ptrace exceptions
> > occur during a system call, all the relevant DR_TRAPn bits should be
> > set in vdr6 together and all the other ones reset.  How can we arrange
> > that?
> 
> That would be nice.  But it's more than the old code did.  I don't feel any
> strong need to improve the situation when using ptrace.  The old code
> disabled breakpoints after the first hit, so userland would only see the
> first DR_TRAPn bit.  (Even if it didn't, with the blind copying of the
> hardware %db6 value, we now know it would only see one DR_TRAPn bit still
> set after a second exception.)  With my suggestion above, userland would
> only see the last DR_TRAPn bit.  So it's not worse.
> 
> In the ptrace case, we know it's always going to wind up with a signal
> before it finishes and returns to user mode.  So one approach would be in
> e.g. do_notify_resume, do:
> 
>   if (thread_info_flags & _TIF_DEBUG)
>   current->thread.hw_breakpoint_info->someflag = 0;
> 
> Then ptrace_triggered could set someflag, and know from it still being set
> on entry that it's a second trigger without getting back to user mode yet
> (and so accumulate bits instead reset old ones).
> 
> But I just would not bother improving ptrace beyond the status quo for a
> corner case noone has cared about in practice so far.  In sensible
> mechanisms of the future, nothing will examine db6 values directly.

Come to think of it, I believe that gdb doesn't check beyond the first 
DR_TRAPn bit it finds set.  I can live with reporting only the last 
hit.

> > There's also the question of whether to send the SIGTRAP.  If
> > extraneous bits are set in DR6 (e.g., because the CPU always sets some
> > extra bits) then we will never get NOTIFY_STOP.  Nevertheless, the
> > signal should not always be sent.
> 
> Yeah.  The current Intel manual describes all the unspecified DR6 bits as
> explicitly reserved and set to 1 (except 0x1000 reserved and 0).  If new
> meanings are assigned in future chips, presumably those will only be
> enabled by some new explicit cr/msr setting.  Those might be enabled by
> some extra module or something, but there is only so much we can do to
> accomodate.  I think the best plan is that notifiers should do:
> 
>   args->err &= ~bits_i_recognize_as_mine;
>   if (!(args->err & known_bits))
>   return NOTIFY_STOP;
> 
> known_bits are the ones we use, plus 0x8000 (DR_SWITCH/BS) and 0x2000 (BD).
> (Those two should be impossible without some strange new kernel bug.)
> Probably should write it as ~DR_STATUS_RESERVED, to parallel existing macros.
> 
> Then we only possibly interfere with a newfangled debug exception flavor
> that occurs in the same one debug exception for an instruction also
> triggering for hw_breakpoint or step.  In the more likely cases of a new
> flavor of exception happening by itself, or the aforementioned strange new
> kernel bugs, we will get to the bottom of do_debug and do the SIGTRAP.
> 
> For this plan, hw_breakpoint_handler also needs not to return NOTIFY_STOP
> as a special case for a ptrace trigger.

That should work well.  But how does the handler know whether a ptrace
trigger occurred?  I can think of several possible ways, none of them
very attractive.  Simply checking the vdr6 value might not work.  The
simplest approach would be to see if the trigger callback address is
equal to ptrace_triggered -- it's a hack but it is reliable.

For that matter, knowing when to set vdr6 is a little tricky.  I guess
it should be set whenever a debug exception occurs in user mode (which
includes both breakpoints and single-step events).  But what about
ptrace triggers wh

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-27 Thread Roland McGrath
I did the first crack at a powerpc port.  I'd appreciate your comments on
this patch.  It should not be incorporated, isn't finished, probably breaks
ptrace, etc.  I'm posting it now just to get any thoughts you have raised
by seeing the second machine share the intended arch-independent code.

I just translated your implementation to powerpc terms without thinking
about it much.  If you see anything that you aren't sure is right, please
tell me and don't presume there is some powerpc-specific reason it's
different.  More likely I just didn't think it through.

In the first battle just to make it compile, the only issue was that you
assume every machine has TIF_DEBUG, which is in fact an implementation
detail chosen lately by i386 and x86_64.  AFAIK the only reason for it
there is just to make a cheap test of multiple bits in the hot path
deciding to call __switch_to_xtra.  Do you rely on it meaning something
more precise than just being a shorthand for hw_breakpoint_info!=NULL?

Incidentally, I think it would be nice if kernel/hw_breakpoint.c itself had
all the #include's for everything it uses directly.  arch hw_breakpoint.c
files probably only need  and one or two others to
define what they need before #include "../../../kernel/hw_breakpoint.c".

The num_installed/num_kbps stuff feels a little hokey when it's really a
flag because the maximum number is one.  It seems like I could make it
tighter with some more finesse in the arch-specific hook options, so that
chbi and thbi each just store dabr, dabr!=0 means "mine gets installed",
and the switch in is just chbi->dabr?:thbi->dabr or something like that.
As we get more machines, more cleanups along these lines will probably make
sense.  (Also, before the next person not me or you tries a port, we could
use for the generic hw_breakpoint.c to get some comments at the top making
explicit what the arch file is expected to define in its types, etc.)

With just the included change to the generic code for the TIF_DEBUG, this
kind of works.  That is, it doesn't break everything else and I can use
bptest, sort of.  I didn't even try ptrace, I probably broke that.

It works enough to make clear the main new wrinkle.  On powerpc, the data
breakpoint exception is a fault before the instruction executes, not a trap
after it.  The load/store will not complete until the breakpoint is cleared.
With this patch, you can use bptest to generate a tight loop of bp0 triggers.

For ptrace compatibility, userland already expects to deal with this.  gdb
has it as per-machine implementation options how ptrace watchpoints behave,
and for powerpc it knows to remove the watchpoint, step, and reinsert it.

One approach for hw_breakpoint is just to expose in asm/hw_breakpoint.h
some standard macros saying how things behave, and caveat emptor.  But I
don't like that much.  I think things will just wind up being confused and
inadvertently unportable if the important semantics vary too much between
machines.  The point of the whole facility is to make watchpoints easy to
use, after all.

Some uses might be happy with trigger-before, but I don't see much benefit.
For writing, the trigger function can look at the memory before it's
changed.  But you could just as well have recorded the old value before
setting the breakpoint, as you have to for trigger-after--and to see both
old and new values you then need to single-step to get the new value, which
trigger-after handles with a single exception.  For reading, the trigger
function can change the memory before it's read.  But likewise, you could
just as well have changed it before setting the breakpoint--you know noone
will have read the new value until your trigger anyway.  (I have never used
a read-triggered breakpoint, so I'm rather vague on those use scenarios.)

The third machine whose manual I have handy is ia64.  It has instruction
and data breakpoints that are both trigger-before.  It has processor flags
similar to x86's RF for both, to ignore one or both breakpoint flavor for
one instruction.  That makes it cheap to continue past the breakpoint since
you don't have to clear and reset it.  But for getting new values from
data-write breakpoints, it still requires a single-step and second stop,
like powerpc.  (Incidentally ia64 has another interesting feature, which I
think the generic code accomodates nicely as an upward-compatible addition
just by changing the len arg in the register and arch_* calls to unsigned long,
and adding an arch_validate_len that can short-circuit the generic length
and alignment check.)

So, I'd like your thoughts on the whole situation.  The starting point we
can do without anything else is:

int hw_breakpoint_triggers_before(struct hw_breakpoint *);
int hw_breakpoint_can_resume(struct hw_breakpoint *);

or perhaps taking (unsigned int type) instead, in .
i.e. for x86:

#define hw_breakpoint_triggers_before(type) ((type) == HW_BREAKPOINT_EXECUTE)
#define hw_breakpoint_can_resume(type)  1

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-27 Thread Roland McGrath
> So far this work has all been based on the vanilla kernel.  Should I 
> switch over to basing it on -mm?

It doesn't much matter at the moment.  Sticking with vanilla is the easiest
for you and me testing it right now.

> > Calling send_sigtrap twice during the same exception does happen to be
> > harmless, but I don't think it should be presumed to be.  It is just not
> > the right way to go about things that you send a signal twice when there
> > is one signal you want to generate.
> 
> What happens when there are two ptrace exceptions at different points
> during the same system call?  Won't we end up sending the signal twice
> no matter what?

Well then that is two signals for good reason, so that is a different
story.  It winds up indistinguishable from only sending the second, but as
far as the organization of the code and thinking about the semantics, twice
is right in this case and once is right in the simpler case.

> In theory we should get an exception with both DR_STEP and DR_TRAPn 
> set, meaning that neither notifier will return NOTIFY_STOP.  But if the 
> kprobes handler clears DR_STEP in the DR6 image passed to the 
> hw_breakpoint handler, it should work out better.

It's since occurred to me that kprobes can and should do:

args->err &= ~(unsigned long) DR_STEP;
if (args->err == 0)
return NOTIFY_STOP;

This doesn't affect do_debug directly, but it will change the value seen by
the next notifier.  So if hw_breakpoint_handler is responsible for setting
vdr6 based on its args->err value, we should win.

> > vdr6 belongs wholly to hw_breakpoint, no other code refers to it
> > directly.  hw_breakpoint's notifier sets vdr6 with non-DR_TRAPn bits,
> > if it's a user-mode exception.  If it's a ptrace exception it also
> > sets the mapped DR_TRAPn bits.  If it's not a ptrace exception and
> > only DR_TRAPn bits were newly set, then it returns NOTIFY_STOP.  If
> > it's a spurious exception from lazy db7 setting, hw_breakpoint just
> > returns NOTIFY_STOP early.
> 
> That sounds not quite right.  To a user-space debugger, a system call
> should appear as an atomic operation.  If multiple ptrace exceptions
> occur during a system call, all the relevant DR_TRAPn bits should be
> set in vdr6 together and all the other ones reset.  How can we arrange
> that?

That would be nice.  But it's more than the old code did.  I don't feel any
strong need to improve the situation when using ptrace.  The old code
disabled breakpoints after the first hit, so userland would only see the
first DR_TRAPn bit.  (Even if it didn't, with the blind copying of the
hardware %db6 value, we now know it would only see one DR_TRAPn bit still
set after a second exception.)  With my suggestion above, userland would
only see the last DR_TRAPn bit.  So it's not worse.

In the ptrace case, we know it's always going to wind up with a signal
before it finishes and returns to user mode.  So one approach would be in
e.g. do_notify_resume, do:

if (thread_info_flags & _TIF_DEBUG)
current->thread.hw_breakpoint_info->someflag = 0;

Then ptrace_triggered could set someflag, and know from it still being set
on entry that it's a second trigger without getting back to user mode yet
(and so accumulate bits instead reset old ones).

But I just would not bother improving ptrace beyond the status quo for a
corner case noone has cared about in practice so far.  In sensible
mechanisms of the future, nothing will examine db6 values directly.

> There's also the question of whether to send the SIGTRAP.  If
> extraneous bits are set in DR6 (e.g., because the CPU always sets some
> extra bits) then we will never get NOTIFY_STOP.  Nevertheless, the
> signal should not always be sent.

Yeah.  The current Intel manual describes all the unspecified DR6 bits as
explicitly reserved and set to 1 (except 0x1000 reserved and 0).  If new
meanings are assigned in future chips, presumably those will only be
enabled by some new explicit cr/msr setting.  Those might be enabled by
some extra module or something, but there is only so much we can do to
accomodate.  I think the best plan is that notifiers should do:

args->err &= ~bits_i_recognize_as_mine;
if (!(args->err & known_bits))
return NOTIFY_STOP;

known_bits are the ones we use, plus 0x8000 (DR_SWITCH/BS) and 0x2000 (BD).
(Those two should be impossible without some strange new kernel bug.)
Probably should write it as ~DR_STATUS_RESERVED, to parallel existing macros.

Then we only possibly interfere with a newfangled debug exception flavor
that occurs in the same one debug exception for an instruction also
triggering for hw_breakpoint or step.  In the more likely cases of a new
flavor of exception happening by itself, or the aforementioned strange new
kernel bugs, we will get to the bottom of do_debug and do the SIGTRAP.

For this plan, hw_breakpoint_handler also needs not to return NOTIFY_STOP
as a special case for a ptrac

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-26 Thread Alan Stern
On Tue, 26 Jun 2007, Roland McGrath wrote:

> > Here's the next iteration.  The arch-specific parts are now completely 
> > encapsulated.  validate_settings is in a form which should be workable 
> > on all architectures.  And the address, length, and type are passed as 
> > arguments to register_{kernel,user}_hw_breakpoint().
> 
> I like it!

Good.  My earlier stubbornness was caused by a desire to allow static
initializers, but now I see that specifying the values in the
registration call really isn't all that bad.

> > I haven't tried to modify Kconfig at all.  To do it properly would
> > require making ptrace configurable, which is not something I want to
> > tackle at the moment.
> 
> You don't need to worry about that.  Under utrace, CONFIG_PTRACE is
> already separate and can be turned off.  I don't think we need really to
> finish the Kconfig stuff at all before I merge it into the utrace code.

So far this work has all been based on the vanilla kernel.  Should I 
switch over to basing it on -mm?


> Calling send_sigtrap twice during the same exception does happen to be
> harmless, but I don't think it should be presumed to be.  It is just not
> the right way to go about things that you send a signal twice when there
> is one signal you want to generate.

What happens when there are two ptrace exceptions at different points
during the same system call?  Won't we end up sending the signal twice
no matter what?

> Also, send_sigtrap is an i386-only function (not even x86_64 has the
> same).  Only x86_64 will share this actual code, but all others will be
> modelled on it.  I think it makes things simplest across the board if
> the standard form is that when there is a ptrace exception, the notifier
> does not return NOTIFY_STOP, so it falls through to the existing SIGTRAP
> arch code.
> 
> So, hmm.  In the old do_debug code, if a notifier returns NOTIFY_STOP,
> it bails immediately, before the db6 value is saved in current->thread.
> This is the normal theory of notify_die use, where NOTIFY_STOP means to
> completely swallow the event as if it never happened.  In the event
> there were some third party notifier involved, it ought to be able to
> swallow its magic exceptions as before and have no user-visible db6
> change happen at the time of that exception.  So how about this:
> 
>   get_debugreg(condition, 6);
>   set_debugreg(0UL, 6);   /* The CPU does not clear it.  */
> 
>   if (notify_die(DIE_DEBUG, "debug", regs, condition, error_code,
>   SIGTRAP) == NOTIFY_STOP)
>   return;
> 
> The kprobes notifier uses max priority, so it will run first.  Its
> notifier code uses my version.  For a single-step that belongs to it,
> it will return NOTIFY_STOP and nothing else happens (noone touches
> vdr6).  (I think I'm dredging up old territory by asking what happens
> when kprobes steps over an insn that hits a data breakpoint, but I
> don't recall atm.)

In theory we should get an exception with both DR_STEP and DR_TRAPn 
set, meaning that neither notifier will return NOTIFY_STOP.  But if the 
kprobes handler clears DR_STEP in the DR6 image passed to the 
hw_breakpoint handler, it should work out better.

> vdr6 belongs wholly to hw_breakpoint, no other code refers to it
> directly.  hw_breakpoint's notifier sets vdr6 with non-DR_TRAPn bits,
> if it's a user-mode exception.  If it's a ptrace exception it also
> sets the mapped DR_TRAPn bits.  If it's not a ptrace exception and
> only DR_TRAPn bits were newly set, then it returns NOTIFY_STOP.  If
> it's a spurious exception from lazy db7 setting, hw_breakpoint just
> returns NOTIFY_STOP early.

That sounds not quite right.  To a user-space debugger, a system call
should appear as an atomic operation.  If multiple ptrace exceptions
occur during a system call, all the relevant DR_TRAPn bits should be
set in vdr6 together and all the other ones reset.  How can we arrange
that?

There's also the question of whether to send the SIGTRAP.  If
extraneous bits are set in DR6 (e.g., because the CPU always sets some
extra bits) then we will never get NOTIFY_STOP.  Nevertheless, the
signal should not always be sent.

> > @@ -484,7 +495,8 @@ int copy_thread(int nr, unsigned long cl
> >  
> > err = 0;
> >   out:
> > -   if (err && p->thread.io_bitmap_ptr) {
> > +   if (err) {
> > +   flush_thread_hw_breakpoint(p);
> > kfree(p->thread.io_bitmap_ptr);
> > p->thread.io_bitmap_max = 0;
> > }
> 
> This can call kfree(NULL).  I would leave the original code alone, i.e.:
> 
>   if (err)
>   flush_thread_hw_breakpoint(p);
>   if (err && p->thread.io_bitmap_ptr) {
>   kfree(p->thread.io_bitmap_ptr);
>   p->thread.io_bitmap_max = 0;
>   }

I disagree.  kfree() is documented to return harmlessly when passed a
NULL pointer, and lots of places in the kernel have been changed to
remove useless tests for NULL before calls to kfre

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-26 Thread Alan Stern
On Tue, 26 Jun 2007, Roland McGrath wrote:

> I needed the attached patch on top of the bptest patch for the current
> code.  Btw, that is a very nice little tester!

I had already made some of those changes (the ones needed to make 
bptest build with the new hw_breakpoint code).  I'll add in the others.

> Below that is a patch to go on top of your current patch, with x86-64
> support.  I've only tried a few trivial tests with bptest (including an
> 8-byte bp), which worked great.  It is a pretty faithful copy of your i386
> changes.  I'm still not sure we have all that right, but you might as well
> incorporate this into your patch.  You should change the x86_64 code in
> parallel with any i386 changes we decide on later, and I can test it and
> send you any typo fixups or whatnot.

Right.  I may update a few comments...

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-26 Thread Roland McGrath
> All right, I'll change it.  And I'll encapsulate those fields.  I still 
> think it will accomplish nothing more than hiding some implementation 
> details which don't really need to be hidden.

It makes me a little happier, and I at least consider that a substantial
accomplishment.  ;-)

> It's below.  The patch logs the value of DR6 when each debug interrupt 
> occurs, and it adds another sysfs attribute to the bptest driver.  The 
> attribute is named "test", and it contains the value that the IRQ 
> handler will write back to DR6.  Combine this with the Alt-SysRq-P 
> change already submitted, and you can get a clear view of what's going 
> on.

Thanks.  I haven't played with this.

> I see.  So I could add a CONFIG_HW_BREAKPOINT option and make 
> CONFIG_PTRACE depend on it.  That will be simple enough.

Right.  

> Do you think it would make sense to allow utrace without hw-breakpoint?

Sure.  There's no special reason to want to turn hw-breakpoint off, but
it is a naturally separable option.

> Here's the next iteration.  The arch-specific parts are now completely 
> encapsulated.  validate_settings is in a form which should be workable 
> on all architectures.  And the address, length, and type are passed as 
> arguments to register_{kernel,user}_hw_breakpoint().

I like it!

> I haven't tried to modify Kconfig at all.  To do it properly would
> require making ptrace configurable, which is not something I want to
> tackle at the moment.

You don't need to worry about that.  Under utrace, CONFIG_PTRACE is
already separate and can be turned off.  I don't think we need really to
finish the Kconfig stuff at all before I merge it into the utrace code.

> I changed the Kprobes single-step routine along the lines you 
> suggested, but added a little extra.  See what you think.
[...]
> The test for early termination of the exception handler is now back the
> way it was.  However I didn't change the test for deciding whether to 
> send a SIGTRAP.  Under the current circumstances I don't see how it 
> could ever be wrong.  (On the other hand, the code will end up calling 
> send_sigtrap() twice when a ptrace exception occurs: once in the ptrace 
> trigger routine and once in do_debug.  That won't matter will it?  I 
> would expect send_sigtrap() to be idempotent.)

Calling send_sigtrap twice during the same exception does happen to be
harmless, but I don't think it should be presumed to be.  It is just not
the right way to go about things that you send a signal twice when there
is one signal you want to generate.

Also, send_sigtrap is an i386-only function (not even x86_64 has the
same).  Only x86_64 will share this actual code, but all others will be
modelled on it.  I think it makes things simplest across the board if
the standard form is that when there is a ptrace exception, the notifier
does not return NOTIFY_STOP, so it falls through to the existing SIGTRAP
arch code.

So, hmm.  In the old do_debug code, if a notifier returns NOTIFY_STOP,
it bails immediately, before the db6 value is saved in current->thread.
This is the normal theory of notify_die use, where NOTIFY_STOP means to
completely swallow the event as if it never happened.  In the event
there were some third party notifier involved, it ought to be able to
swallow its magic exceptions as before and have no user-visible db6
change happen at the time of that exception.  So how about this:

get_debugreg(condition, 6);
set_debugreg(0UL, 6);   /* The CPU does not clear it.  */

if (notify_die(DIE_DEBUG, "debug", regs, condition, error_code,
SIGTRAP) == NOTIFY_STOP)
return;

The kprobes notifier uses max priority, so it will run first.  Its
notifier code uses my version.  For a single-step that belongs to it,
it will return NOTIFY_STOP and nothing else happens (noone touches
vdr6).  (I think I'm dredging up old territory by asking what happens
when kprobes steps over an insn that hits a data breakpoint, but I
don't recall atm.)

vdr6 belongs wholly to hw_breakpoint, no other code refers to it
directly.  hw_breakpoint's notifier sets vdr6 with non-DR_TRAPn bits,
if it's a user-mode exception.  If it's a ptrace exception it also
sets the mapped DR_TRAPn bits.  If it's not a ptrace exception and
only DR_TRAPn bits were newly set, then it returns NOTIFY_STOP.  If
it's a spurious exception from lazy db7 setting, hw_breakpoint just
returns NOTIFY_STOP early.

The rest of the old do_debug code stays as it is, only clear_dr7 goes.

> Are you going to the Ottawa Linux Symposium?

I am not.

> @@ -484,7 +495,8 @@ int copy_thread(int nr, unsigned long cl
>  
>   err = 0;
>   out:
> - if (err && p->thread.io_bitmap_ptr) {
> + if (err) {
> + flush_thread_hw_breakpoint(p);
>   kfree(p->thread.io_bitmap_ptr);
>   p->thread.io_bitmap_max = 0;
>   }

This can call kfree(NULL).  I would leave the original code alone, i.e.:

  

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-26 Thread Roland McGrath
I needed the attached patch on top of the bptest patch for the current
code.  Btw, that is a very nice little tester!

Below that is a patch to go on top of your current patch, with x86-64
support.  I've only tried a few trivial tests with bptest (including an
8-byte bp), which worked great.  It is a pretty faithful copy of your i386
changes.  I'm still not sure we have all that right, but you might as well
incorporate this into your patch.  You should change the x86_64 code in
parallel with any i386 changes we decide on later, and I can test it and
send you any typo fixups or whatnot.


Thanks,
Roland


--- bptest/bptest.c.~1~ 2007-06-25 04:08:06.0 -0700
+++ bptest/bptest.c 2007-06-26 01:14:20.0 -0700
@@ -147,17 +147,17 @@ static DRIVER_ATTR(call, 0200, NULL, cal
 /* Breakpoint callbacks */
 static void bptest_triggered(struct hw_breakpoint *bp, struct pt_regs *regs)
 {
-   printk(KERN_INFO "Breakpoint %d triggered\n", bp - bps);
+   printk(KERN_INFO "Breakpoint %d triggered\n", (int) (bp - bps));
 }
 
 static void bptest_installed(struct hw_breakpoint *bp)
 {
-   printk(KERN_INFO "Breakpoint %d installed\n", bp - bps);
+   printk(KERN_INFO "Breakpoint %d installed\n", (int) (bp - bps));
 }
 
 static void bptest_uninstalled(struct hw_breakpoint *bp)
 {
-   printk(KERN_INFO "Breakpoint %d uninstalled\n", bp - bps);
+   printk(KERN_INFO "Breakpoint %d uninstalled\n", (int) (bp - bps));
 }
 
 
@@ -204,7 +204,7 @@ static ssize_t bp_show(int n, char *buf)
 
a = -1;
if (type == 'e') {
-   const void *addr = hw_breakpoint_get_kaddr(bp);
+   const void *addr = hw_breakpoint_get_kaddress(bp);
 
if (addr == r0)
a = 0;
@@ -215,7 +215,7 @@ static ssize_t bp_show(int n, char *buf)
else if (addr == r3)
a = 3;
} else {
-   const unsigned char *p = hw_breakpoint_get_kaddr(bp);
+   const unsigned char *p = hw_breakpoint_get_kaddress(bp);
 
if (p >= bytes && p < bytes + NUM_BYTES)
a = p - bytes;
@@ -233,6 +233,7 @@ static ssize_t bp_store(int n, const cha
char atype;
unsigned len, type;
int i;
+   void *addr;
 
if (count <= 1) {
printk(KERN_INFO "bptest: bp%d: format:  priority type "
@@ -290,7 +291,7 @@ static ssize_t bp_store(int n, const cha
return -EINVAL;
}
if (atype == 'e')
-   hw_breakpoint_kinit(bp, rtns[a], len, type);
+   addr = rtns[a];
else {
switch (alen) {
 #ifdef HW_BREAKPOINT_LEN_1
@@ -311,14 +312,14 @@ static ssize_t bp_store(int n, const cha
return -EINVAL;
break;
}
-   hw_breakpoint_kinit(bp, &bytes[a], len, type);
+   addr = &bytes[a];
}
 
bp->triggered = bptest_triggered;
bp->installed = bptest_installed;
bp->uninstalled = bptest_uninstalled;
 
-   i = register_kernel_hw_breakpoint(bp);
+   i = register_kernel_hw_breakpoint(bp, addr, len, type);
if (i < 0) {
printk(KERN_WARNING "bptest: bp%d: failed to register %d\n",
n, i);
---
 arch/i386/kernel/hw_breakpoint.c   |5 --
 arch/x86_64/ia32/ia32_aout.c   |   10 
 arch/x86_64/ia32/ptrace32.c|   65 
 arch/x86_64/kernel/Makefile|3 +
 arch/x86_64/kernel/kprobes.c   |   14 +-
 arch/x86_64/kernel/machine_kexec.c |2 
 arch/x86_64/kernel/process.c   |   46 +++---
 arch/x86_64/kernel/ptrace.c|   72 +--
 arch/x86_64/kernel/signal.c|8 ---
 arch/x86_64/kernel/smpboot.c   |4 +
 arch/x86_64/kernel/suspend.c   |   17 +---
 arch/x86_64/kernel/traps.c |   75 +
 include/asm-x86_64/debugreg.h  |   30 ++
 include/asm-x86_64/hw_breakpoint.h |   50 
 include/asm-x86_64/processor.h |   10 +---
 include/asm-x86_64/suspend.h   |3 -
 16 files changed, 184 insertions(+), 230 deletions(-)

Index: b/arch/x86_64/kernel/kprobes.c
===
--- a/arch/x86_64/kernel/kprobes.c
+++ b/arch/x86_64/kernel/kprobes.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void jprobe_return_end(void);
 static void __kprobes arch_copy_kprobe(struct kprobe *p);
@@ -652,8 +653,17 @@ int __kprobes kprobe_exceptions_notify(s
ret = NOTIFY_STOP;
break;
case DIE_DEBUG:
-   if (post_kprobe_handler(args->regs))
-   ret = NOTIFY_STOP;
+   /*
+* The DR6 value is stored in args->err.
+* If DR_STEP is set and it's ours, we s

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-25 Thread Alan Stern
On Mon, 25 Jun 2007, Roland McGrath wrote:

> I added this on top of your patch to make it compile (and look a little 
> nicer).
> With that, bptest worked nicely.

I'll merge this with the rest of the patch.

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-25 Thread Alan Stern
On Mon, 25 Jun 2007, Roland McGrath wrote:

> > "A waste to store one"?  Waste of what?  It isn't a waste of space; the 
> > space would otherwise be unused.  Waste of an instruction, perhaps.
> 
> Yes.  

Of course, calling register_kernel_hw_breakpoint() with three extra
arguments is a waste of an instruction also, if one of those arguments 
isn't used.

And yet it's not clear that either of these really is a waste.  Suppose
somebody ports code from x86 to PPC64 and leaves a breakpoint length
set to HW_BREAKPOINT_LEN_4.  Clearly we would want to return an error.  
This means that the length value _has_ to be tested, even if it won't
be used for anything.  And this means the length _has_ to be passed
along somehow, either as an argument or as a field value.

> > You might want to examine the check in validate_settings() for address 
> > alignment; it might not be valid if other values get stored in the 
> > low-order bits of the address.  This is a tricky point; it's not safe 
> > to mix bits around unless you know that the data values are correct, 
> > but in validate_settings() you don't yet know that.
> 
> This is why I didn't bring up encoded addresses earlier on. :-)  
> 
> These kinds of issues are why I prefer unambiguously opaque arch-specific
> encodings.  validate_settings is indeed wrong for the natural ppc encoding.
> 
> The values must be set by a call that can return an error.  That means you
> can't really have a static initializer macro, unless it's intended to mean
> "unspecified garbage if not used exactly right".  I favor just going back
> to passing three more args to register_kernel_hw_breakpoint.

All right, I'll change it.  And I'll encapsulate those fields.  I still 
think it will accomplish nothing more than hiding some implementation 
details which don't really need to be hidden.


> > Tests show that my CPU does not clear DR_STEP when a data breakpoint is
> > hit.  Conversely, the DR_TRAPn bits are cleared even when a single-step 
> > exception occurs.
> 
> Ok, this is pretty consistent with what the newest Intel manuals say.
> 
> > If you're interested, I can send you the code I used to do this testing
> > so you can try it on your machine.
> 
> Ok.

It's below.  The patch logs the value of DR6 when each debug interrupt 
occurs, and it adds another sysfs attribute to the bptest driver.  The 
attribute is named "test", and it contains the value that the IRQ 
handler will write back to DR6.  Combine this with the Alt-SysRq-P 
change already submitted, and you can get a clear view of what's going 
on.


> > We have three things to consider: ptrace, utrace, and hw-breakpoint.  
> > Ultimately hw-breakpoint should become part of utrace; we might not
> > want to bother with a standalone version.
> 
> It is not hard to make it a separate option, so there is no reason not to.
> 
> > Furthermore, hw-breakpoint takes over the ptrace's mechanism for
> > breakpoint handling.  If we want to allow a configuration where ptrace
> > is present and hw-breakpoint isn't, then I would have to add an
> > alternate implementation containing only support for the legacy
> > interface.
> 
> I was not suggesting that.  CONFIG_PTRACE would require HW_BREAKPOINT on
> machines where arch ptrace code uses it.

I see.  So I could add a CONFIG_HW_BREAKPOINT option and make 
CONFIG_PTRACE depend on it.  That will be simple enough.

Do you think it would make sense to allow utrace without hw-breakpoint?


> > I made a few other changes to do_debug.  For instance, it no longer 
> > checks whether notify_die() returns NOTIFY_STOP.  That check was a 
> > mistake to begin with; NOTIFY_STOP merely means to cut the notifier 
> > chain short -- it doesn't mean that the debug exception can be ignored.  
> 
> This is incorrect.  The usage of notify_die in all other cases, at least of
> machine exceptions on x86, is to test for == NOTIFY_STOP and when true
> short-circuit the normal effect of the exception (signal, oops).  The
> notifiers should return NOTIFY_STOP if they consumed the exception wholly.
> If none uses NOTIFY_STOP, then the normal user signal should happen.

All right, I'll fix that back up.

> > Also it sends the SIGTRAP when any of the DR_STEP or DR_TRAPn bits are 
> > set in vdr6; this is now the appropriate condition.
> 
> From what you've said, DR_STEP will remain set on a later debug exception.
> So if a non-ptrace hw breakpoint consumed the exception and left no
> DR_TRAPn bits set, the thread would generate a second SIGTRAP from the
> prior single-step.  Currently userland expects to have to clear DR_STEP in
> dr6 via ptrace itself, but does not expect it can get a duplicate SIGTRAP
> if it doesn't.

No, because do_debug always writes a 0 to DR6 after reading it;  
consequently DR_STEP does not remain set on later exceptions.  Unless
we do something like this we would never know whether we entered the
handler because of a single-step exception or not.

But the same effect could occur because of a bo

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-25 Thread Roland McGrath
I added this on top of your patch to make it compile (and look a little nicer).
With that, bptest worked nicely.

---
 arch/i386/kernel/kprobes.c |   19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

Index: b/arch/i386/kernel/kprobes.c
===
--- a/arch/i386/kernel/kprobes.c
+++ b/arch/i386/kernel/kprobes.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void jprobe_return_end(void);
 
@@ -660,16 +661,16 @@ int __kprobes kprobe_exceptions_notify(s
ret = NOTIFY_STOP;
break;
case DIE_DEBUG:
-
-   /* The DR6 value is stored in args->err */
-#define DR6(args->err)
-
-   if ((DR6 & DR_STEP) && post_kprobe_handler(args->regs)) {
-   if ((DR6 & ~DR_STEP) == 0)
-   ret = NOTIFY_STOP;
-   }
+   /*
+* The %db6 value is stored in args->err.
+* If DR_STEP is the only bit set and it's ours,
+* we should eat this exception.
+*/
+   if ((args->err & DR_STEP) &&
+   post_kprobe_handler(args->regs) &&
+   (args->err & ~DR_STEP) == 0)
+   ret = NOTIFY_STOP;
break;
-#undef DR6
 
case DIE_GPF:
case DIE_PAGE_FAULT:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-25 Thread Roland McGrath
> "A waste to store one"?  Waste of what?  It isn't a waste of space; the 
> space would otherwise be unused.  Waste of an instruction, perhaps.

Yes.  

> It is now possible for an implementation to store things in a 
> machine-dependent fashion; I have added accessor routines as you 
> suggested.  But I also left the fields as they were; the documentation 
> mentions that they won't necessarily contain any particular values.

People usually read the documentation after the fields named like they can
guess what they contain have values that confuse them, not before.

> You might want to examine the check in validate_settings() for address 
> alignment; it might not be valid if other values get stored in the 
> low-order bits of the address.  This is a tricky point; it's not safe 
> to mix bits around unless you know that the data values are correct, 
> but in validate_settings() you don't yet know that.

This is why I didn't bring up encoded addresses earlier on. :-)  

These kinds of issues are why I prefer unambiguously opaque arch-specific
encodings.  validate_settings is indeed wrong for the natural ppc encoding.

The values must be set by a call that can return an error.  That means you
can't really have a static initializer macro, unless it's intended to mean
"unspecified garbage if not used exactly right".  I favor just going back
to passing three more args to register_kernel_hw_breakpoint.

> Tests show that my CPU does not clear DR_STEP when a data breakpoint is
> hit.  Conversely, the DR_TRAPn bits are cleared even when a single-step 
> exception occurs.

Ok, this is pretty consistent with what the newest Intel manuals say.

> If you're interested, I can send you the code I used to do this testing
> so you can try it on your machine.

Ok.

> > I still think it's the proper thing to make it conditional, not always
> > built in.  But it's a pedantic point.
> 
> We have three things to consider: ptrace, utrace, and hw-breakpoint.  
> Ultimately hw-breakpoint should become part of utrace; we might not
> want to bother with a standalone version.

It is not hard to make it a separate option, so there is no reason not to.

> Furthermore, hw-breakpoint takes over the ptrace's mechanism for
> breakpoint handling.  If we want to allow a configuration where ptrace
> is present and hw-breakpoint isn't, then I would have to add an
> alternate implementation containing only support for the legacy
> interface.

I was not suggesting that.  CONFIG_PTRACE would require HW_BREAKPOINT on
machines where arch ptrace code uses it.

> I made a few other changes to do_debug.  For instance, it no longer 
> checks whether notify_die() returns NOTIFY_STOP.  That check was a 
> mistake to begin with; NOTIFY_STOP merely means to cut the notifier 
> chain short -- it doesn't mean that the debug exception can be ignored.  

This is incorrect.  The usage of notify_die in all other cases, at least of
machine exceptions on x86, is to test for == NOTIFY_STOP and when true
short-circuit the normal effect of the exception (signal, oops).  The
notifiers should return NOTIFY_STOP if they consumed the exception wholly.
If none uses NOTIFY_STOP, then the normal user signal should happen.

> Also it sends the SIGTRAP when any of the DR_STEP or DR_TRAPn bits are 
> set in vdr6; this is now the appropriate condition.

>From what you've said, DR_STEP will remain set on a later debug exception.
So if a non-ptrace hw breakpoint consumed the exception and left no
DR_TRAPn bits set, the thread would generate a second SIGTRAP from the
prior single-step.  Currently userland expects to have to clear DR_STEP in
dr6 via ptrace itself, but does not expect it can get a duplicate SIGTRAP
if it doesn't.


Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-06-13 Thread Roland McGrath
> I really don't understand your point here.  What's wrong with bp_show?  
> Is it all the preprocessor conditionals?  I thought that was how we had 
> agreed portable code should determine which types and lengths were 
> supported on a particular architecture.

That part is fine.  The problem is fetching the hw_breakpoint.len field
directly and expecting it to contain the API values.  In an implementation
done as I've been referring to, there is no need for any field to contain
the HW_BREAKPOINT_LEN_8 value, and it's a waste to store one.  If it were
hw_breakpoint_get_len(bp), that would be fine.

> Consider that the definition of struct hw_breakpoint is in
> include/asm-generic/.  [...]
> The one thing which makes sense to me is that some architectures might 
> want to store type and/or length bits in along with the address field.  

Indeed, that is the natural thing (and all the bits needed) on several.
I hadn't raised this before since I was having so much trouble already
convincing you about storing things in machine-dependent fashion so that
users cannot just use the struct fields directly.

I really think it would be cleanest all around to use just:

struct arch_hw_breakpoint info;

in place of address union, len, type in struct hw_breakpoint.  Then each
arch provides hw_breakpoint_get_{kaddr,uaddr,len,type} inlines.  For
storing, each arch can define hw_breakpoint_init(addr, len, type) (or
maybe k/u variants).  This can be used by callers directly if you want to
keep register_hw_breakpoint to one argument, or could just be internal if
register_hw_breakpoint takes the three more args.  If callers use it
directly, there can also be an INIT_ARCH_HW_BREAKPOINT(addr, len, type)
for use in struct hw_breakpoint_init initializers.

On x86 use:

struct arch_hw_breakpoint_info {
union {
const void  *kernel;
const void  __user  *user;
unsigned long   va;
}   address;
u8  len;
u8  type;
} __attribute__((packed));

and the size of struct hw_breakpoint won't increase.

> > What about DR_STEP?  i.e., if DR_STEP was set from a single-step and then
> > there was a DR_TRAPn debug exception, is DR_STEP still set?  If DR_TRAPn
> > was set and then you single-step, is DR_TRAPn cleared?
> 
> I didn't experiment with using DR_STEP.  There wasn't any simple way to
> cause a single-step exception.  Perhaps if I were more familiar with
> kprobes...

It's easy for user mode with gdb.  kprobes is simple to use, and it
always does a single-step to execute (a copy of) the instruction that 
was overwritten with the breakpoint.  So, write a module that does:

int testvar=0;
asm(".globl testme; testme: movl $17,testvar; ret");
void testme();
testinit() {
... register kprobe at &testme ...
... register hw_breakpoint at &testvar ...
testme()
}

Your kprobe handlers don't have to actually do anything at all, if you
are just hacking the low-level code so see what %dr6 values you get at
each trap.

> I decided on something simpler than messing around with Kconfig.  

I still think it's the proper thing to make it conditional, not always
built in.  But it's a pedantic point.

> This is getting pretty close to a final form.  The patch below is for 
> 2.6.22-rc3.  See what you think...

Indeed I think we have come nearly as far as we will before we have a few
arch ports get done and some heavy use to find the rough edges.  Thanks
very much for being so accomodating to all my criticism, which I hope has
been constructive.

> +inline const void *hw_breakpoint_get_kaddr(struct hw_breakpoint *bp)

These need to be static inline.  Here you're defining a global function
in every .o file that uses the header.

> + get_debugreg(dr6, 6);
> + set_debugreg(0, 6); /* DR6 may or may not be cleared by the CPU */
> + if (dr6 & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3))
> + tsk->thread.vdr6 = 0;

Some comment here about this conditional clearing, please.

> +
> +/*
> + * HW breakpoint additions
> + */
> +
> +#define HB_NUM   4   /* Number of hardware breakpoints */

Need #ifdef __KERNEL__ around all these additions to debugreg.h.

> +static inline void arch_update_thbi(struct thread_hw_breakpoint *thbi,

For local functions in a source file (not a header), it's standard form
now just to define them static, not static inline.  For these trivial
ones, the compiler will always inline them.


Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-23 Thread Roland McGrath
> > This does not happen in reality.  Breakpoints can only be set by the
> > debugger, not by the program itself.  The debugger should always eat the 
> > trap.
> 
> Hmmm.  I put in a little extra code to account for the possibility that
> a program might want to set hardware breakpoints in itself.  Should
> this be removed?

Do you just mean a register_hw_breakpoint call made on current?  That
certainly ought to work.  That's still "the debugger", i.e. in utracespeak
the tracing engine.  My point was that there will never be a facility
intended for a program to use hw_breakpoint to generate a signal that gets
delivered to a handler in the vanilla way.  There's always some "outside"
agent who asked for the breakpoint and who is responsible for responding to
the traps it causes, never the program itself so as it would make sense for
it to actually see the signal in the end.

> > I guess my main objection to having .type and .len is the false implied
> > documentation of their presence and names, leading to people thinking they
> > can look at those values.  In fact, they are machine-specific and
> > implementation-specific bits of no intrinsic use to anyone else.
> 
> The fact that they are machine-specific and implementation-specific 
> doesn't necessarily make them of no use.  See the driver below.

The code in bp_show is exactly the kind of wrong I want to prevent.  When I
say they are machine-specific and implementation-specific, I mean there is
no specified part of the interface to which you can presume they correspond
directly.  The powerpc implementation will not have any field that is set
to HW_BREAKPOINT_LEN_8 and may well have none set to the type macros
either.  If you want to have some machine-specific macros or inlines to
yield the HW_BREAKPOINT_* values for a struct hw_breakpoint, then fine.

> Allow me to rephrase: When a debug exception occurs, the real DR6 value
> should be copied to vdr6, except that kprobes should adjust DR_STEP and
> hw_breakpoint should adjust the DR_TRAPn bits appropriately.  There's
> some question about what value the debug exception handler should write
> back to DR6, if anything.  

Agreed.

> As for what users expect of the low four bits, you are definitely 
> wrong.  My tests with gdb show that it relies on the CPU to clear those 
> bits whenever a data breakpoint is hit; it doesn't clear them itself 
> and it doesn't work properly if the kernel keeps virtualized versions 
> of them set.  That's on a Pentium 4 and on an AMD Duron.

Ok.  We were both going on what the manual said and I was assuming that
some chip had actually behaved that way and thus that's what users expect.

>   Values written back to DR6 were retained in the register until 
>   the next debug exception occurred.

Ok.  This behavior is invisible anyway.

>   When the exception handler read DR6, the 0x0ff0 bits were
>   set every time.  The 0x1000 bit was never set, even if it
>   had been turned on before the exception occurred.

Ok.  That is not really surprising.

>   No matter what values were stored in the low four bits
>   beforehand, when the exception occurred DR6 had only the
>   bit for the debug register which was triggered.

Ok.  This makes the users' expectations make sense.  Maybe we can get the
Intel and AMD people to change the manual not to be misleading about this
(it says something terse about "never clears" and without more details I
read it as "never clears any bit, ever").  

What about DR_STEP?  i.e., if DR_STEP was set from a single-step and then
there was a DR_TRAPn debug exception, is DR_STEP still set?  If DR_TRAPn
was set and then you single-step, is DR_TRAPn cleared?

>   If the handler wrote back any of BS, BT, or BD to DR6, then
>   the system misbehaved.  I don't know exactly what happened,
>   but my shell process ended and the debug handler got called
>   over and over again (as if stuck in a loop) for several
>   seconds.

Yowza.  That is really surprising.

> In light of these results, the best approach appears to be either to 
> leave DR6 alone or to set it to 0.

Agreed.  I suspect clearing it to zero is the right thing (given what the
hardware manuals say), even if it appears that DR_STEP and DR_TRAPn do
reset each other on the chips we have on hand.

> Below is a patch containing a driver meant for testing kernel hardware 
> breakpoints.  Instructions are in the comments at the top.  You can 
> build the driver by typing "make M=bptest" at the top level.

Thanks.

> The patch also adjust the Alt-SysRq-P handler to print out the debug 
> register values along with all the other stuff.

I think you should post that little patch (and equivalent for x86_64) by
itself.  There's no reason that shouldn't go right in.

> I took a look at seqlock.h.  It turns out not to be a good match for my 
> requirements; the header file specifically says that it won't work with 
> data that contains pointers.

There i

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-17 Thread Alan Stern
On Sun, 13 May 2007, Roland McGrath wrote:

> You may need some memory barriers around the switching/restart stuff.
> In fact, I think it would be better not to delve into reinventing the
> low-level bits there at all.  Instead use read_seqcount_retry there
> (linux/seqlock.h).  Using that read_seqcount_begin's value as the
> number to compare in thbi would also give a 32-bit sequence number.

I took a look at seqlock.h.  It turns out not to be a good match for my 
requirements; the header file specifically says that it won't work with 
data that contains pointers.  But changing over to regular 32-bit 
sequence numbers was straightforward.

The "switching"/"restart" stuff doesn't need memory barriers because
all the communication is between two routines on the same CPU.  Nor are 
memory barriers needed in the rest of the code for the kernel 
breakpoint updates; the IPI mechanism already provides its own.

However there is one oddball case which does seem to require a memory
barrier: when a new CPU comes online (either for the first time or
during return from hibernation).  There's a hook to load the initial
debug register values, and it runs in an atomic context so I can't
grab the mutex.  The hook is called in two places:

arch/i386/power/cpu.c: fix_processor_context(), and
arch/i386/kernel/smpboot.c: start_secondary().

A memory barrier is necessary to avoid chaos if another CPU should
happen to update the kernel breakpoint settings at the same time.  If
you can suggest a way around it, please do.

> It looks to me like there is quite a lot to be shared.  Of course the
> code can refer to constants like HB_NUM, they just have to be defined
> per machine.  The dr7 stuff can all be a couple of simple arch_foo
> hooks, which will be empty on other machines.  All of the list-managing
> logic, the prio stuff, etc., would be bad to copy.
> 
> The two flavors could probably be accomodated cleanly with an
> HB_TYPES_NUM macro that's 1 on x86 and 2 on ia64, and is used in loops
> around some of the calls.  I'm not suggesting you try to figure out
> that code structure ahead of time.  But I don't think it will be a big
> barrier to code sharing.

Okay, if I don't worry about machines with two sets of code & data
debug registers (HB_TYPES_NUM = 2) then yes, quite a lot of the code is
sharable.  There will be a few arch-specific hooks to:

Store the values into the debug registers;

Take care of the DR7 calculations;

Do address limit verification (see whether a pointer
lies in user space or kernel space).

Nothing more seems to be needed.  Then there will be unsharable code, 
including:

Dumping the debug registers while creating an aout-type
core image;

All the legacy ptrace stuff;

The notify-handler itself.

Does all that sound about right?  

> I also want to get this machine-independent code sharing going for
> real.  I'd like to have powerpc working as the non-x86 demonstration
> before we declare things in good shape.  I don't expect you to write
> any powerpc support, but I hope I can get you to do the arch code
> separation to make the way for it.  If you'll take a crack at it, I'll
> fill in and test the powerpc bits and I think we'll get something very
> satisfactory ironed out pretty fast.  

How should this be arranged so that it can build okay on all platforms,
even ones where the low-level support code hasn't been written?  Maybe 
an arch-dependent CONFIG_HW_BREAKPOINT option?

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-16 Thread Alan Stern
On Mon, 14 May 2007, Roland McGrath wrote:

> > It seems to me that signal handlers must run with a copy of the original
> > EFLAGS stored on the stack.
> 
> Of course.  I'm talking about how the registers get changed to set up the
> signal handler to start running, not how the interrupted registers are
> saved on the user stack.  There is no issue with the stored eflags image;
> the "privileged" flags like RF are ignored by sigreturn anyway.

Ah, okay.  Yes, clearly the new EFLAGS for the signal handler should 
have RF turned off.  This should always be true, regardless of 
debugging.

> > Also, what if the signal handler was entered as a result of encountering 
> > an instruction breakpoint?  
> 
> This does not happen in reality.  Breakpoints can only be set by the
> debugger, not by the program itself.  The debugger should always eat the trap.

Hmmm.  I put in a little extra code to account for the possibility that
a program might want to set hardware breakpoints in itself.  Should
this be removed?

> The earlier quote I gave was from an AMD64 manual.  A 1995 Intel manual I
> have says, "All Intel Architecture processors manage the RF flag as follows,"
> and proceeds to give the "all faults except instruction breakpoint" behavior
> I quoted from the AMD manual earlier.  Hence I sincerely doubt that this
> varies among Intel and AMD processors.  Someone else will have to help us
> know about other makers' processors.  So far I have no reason to suspect that
> any processor behaves differently (aside from generic cynicism ;-).

And I no longer have any 386 CPUs to test...

> > It's a relatively minor issue.  On machines with fixed-length breakpoints, 
> > the .len field can be ignored.  Conversely, leaving it out would require 
> > using bitmasks to extract the type and length values from a combined .bits 
> > field.  I don't see any advantage.
> 
> I guess my main objection to having .type and .len is the false implied
> documentation of their presence and names, leading to people thinking they
> can look at those values.  In fact, they are machine-specific and
> implementation-specific bits of no intrinsic use to anyone else.

The fact that they are machine-specific and implementation-specific 
doesn't necessarily make them of no use.  See the driver below.

> That line from the manual is what we were both going on originally, and
> then you described the conflicting behavior.  I was trying to ascertain
> whether chips really do vary, or if the manual was just inaccurate about
> the single common way it actually behaves.  I take it you have in fact
> observed different behaviors on different chips?

No; I have tested only a couple of systems and I don't have a wide
variety of machines available.

> There are two possible kinds of "conservative" here.  To be conservative
> with respect to the existing behavior on a given chip, whatever that may
> be, we should never clear %dr6 completely, and instead should always
> mirror its bits to vdr7, only mapping the low four bits around to present
> the virtualized order.  The only bits we'd ever clear in hardware are
> those DR_TRAPn bits corresponding to the registers allocated to non-ptrace
> uses, and kprobes should clear DR_STEP.  And note that when vdr6 is
> changed by ptrace, we should reset the hardware %dr6 accordingly, to match
> existing kernel behavior should users change debugreg[6] via ptrace.
> 
> To be conservative in the sense of reliable user-level behavior despite
> chip oddities would be a little different.  Firstly, I think we should
> mirror all the "extra" bits from hardware to vdr7 blindly, i.e. everything
> but DR_STEP and DR_TRAPn.  That way if any chip comes along that sets new
> bits for new features or whatnot, users can at least see the new hardware
> bits via ptrace before hw_breakpoint gets updated to support them more
> directly.  For the low four bits, I think what users expect is that no
> bits are ever implicitly cleared, so they accumulate to say which drN has
> hit since the last time ptrace was used to clear vdr6.

Allow me to rephrase: When a debug exception occurs, the real DR6 value
should be copied to vdr6, except that kprobes should adjust DR_STEP and
hw_breakpoint should adjust the DR_TRAPn bits appropriately.  There's
some question about what value the debug exception handler should write
back to DR6, if anything.  When switching to a new task, the DR_TRAPn
bits in vdr6 could be de-virtualized somehow and the result loaded into
DR6, but again, it might be safest to leave DR6 alone.

As for what users expect of the low four bits, you are definitely 
wrong.  My tests with gdb show that it relies on the CPU to clear those 
bits whenever a data breakpoint is hit; it doesn't clear them itself 
and it doesn't work properly if the kernel keeps virtualized versions 
of them set.  That's on a Pentium 4 and on an AMD Duron.

I did some testing to see how the CPU behaves when the debug handler
writes different values back to DR

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-14 Thread Roland McGrath
> It seems to me that signal handlers must run with a copy of the original
> EFLAGS stored on the stack.

Of course.  I'm talking about how the registers get changed to set up the
signal handler to start running, not how the interrupted registers are
saved on the user stack.  There is no issue with the stored eflags image;
the "privileged" flags like RF are ignored by sigreturn anyway.

> Also, what if the signal handler was entered as a result of encountering 
> an instruction breakpoint?  

This does not happen in reality.  Breakpoints can only be set by the
debugger, not by the program itself.  The debugger should always eat the trap.

> You're right about wanting to clear RF when changing the PC via ptrace or
> when setting a new execution breakpoint (provided the new breakpoint's
> address is equal to the current PC value).

Starting a signal handler is "warping the PC" equivalent to changing it via
ptrace for purposes of this discussion.  In case the new PC is the site of
another breakpoint, RF must be clear.

> Do you know how gdb handles instruction breakpoints, and in particular, 
> how it resumes execution after a breakpoint?

AFAICT it never actually uses hardware instruction breakpoints, only data
watchpoints.  I wouldn't be surprised if noone has ever really used
instruction breakpoint settings in x86 hardware debug registers on Linux.
(Frankly, I don't much expect them to start either.  This level of detail
about instruction breakpoints is largely academic.  I am a stickler for
getting the details right if we're going to allow using them at all.  
But I think really everyone only cares about data watchpoints.)

> But it doesn't matter.  We're up against an API incompatibility here.  

That's a red herring.  gdb is the compatibility case, not the real API user.

> Under the circumstances I think we should just leave it out.

That is fine.  If the flutter issue comes up, we can address it later.

> On the 386, either GE or LE had to be set for DR breakpoints to work 
> properly.  Later on (I don't remember if it was in the 486 or the Pentium) 
> this restriction was removed.  I don't know whether those bits do anything 
> at all on modern CPUs.

I'm moderately sure they do nothing on modern CPUs.  Intel says they're
ignored as of Pentium, but recommends setting both bits if you care at all.
In practice, I don't think we'll ever hear about the inexactness on a
pre-Pentium processor from not setting the bits.  But I'd follow the Intel
manual and set both.

> My 80386 Programmer's Reference Manual says:

The earlier quote I gave was from an AMD64 manual.  A 1995 Intel manual I
have says, "All Intel Architecture processors manage the RF flag as follows,"
and proceeds to give the "all faults except instruction breakpoint" behavior
I quoted from the AMD manual earlier.  Hence I sincerely doubt that this
varies among Intel and AMD processors.  Someone else will have to help us
know about other makers' processors.  So far I have no reason to suspect that
any processor behaves differently (aside from generic cynicism ;-).

> I suppose you might register a breakpoint and find that it isn't installed 
> immediately, but then it could get installed and actually trigger before 
> you managed to unregister it.  Does that count as a "difficult race"?  

Yes, that is really the kind of thing I had in mind.  For user breakpoints it
shouldn't be an issue, since the thread shouldn't have been let run in between.

> Presumably the work done by the trigger callback would get ignored.

That is in the "difficult race" category to ensure.  I would not presume.

> Maybe it doesn't have to be so bad.  If there were _two_ global copies of
> the kernel bp settings, one for the old pre-IPI state and one for the new,
> then the handler could simply look up the DR# in the appropriate copy.  
> This would remove the need to store the settings in the per-CPU area.

I think that is what I suggested an iteration or two ago.  Installing new
state means making a fresh data structure and installing a pointer to it,
leaving the old (immutable) one to be freed by RCU.

> It's a relatively minor issue.  On machines with fixed-length breakpoints, 
> the .len field can be ignored.  Conversely, leaving it out would require 
> using bitmasks to extract the type and length values from a combined .bits 
> field.  I don't see any advantage.

I guess my main objection to having .type and .len is the false implied
documentation of their presence and names, leading to people thinking they
can look at those values.  In fact, they are machine-specific and
implementation-specific bits of no intrinsic use to anyone else.

> Ah, you haven't understood the purpose of the gennum.  In fact 8 bits 
> isn't too small -- far from it!  It's too _large_; a single bit would 
> suffice.  I made it an 8-bit value just because that was easier.

If it's actually a flag, then treating it any other way is just confusing.
I can't see how it's easier for anyone.

> 

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-14 Thread Alan Stern
On Sun, 13 May 2007, Roland McGrath wrote:

> This makes me think about RF a little more.  If you ever set it, there are
> some places we need to clear it too.  That is, when the PC is being changed
> before returning to user mode, which is in signals and in ptrace.  If the
> PC is changing to other than the breakpoint location hit by the handler
> that set RF, we need clear RF so that the first instruction at the changed
> PC can be a breakpoint hit of its own and not get masked.  In fact, it may
> also be necessary to clear RF when freshly setting a new instruction
> breakpoint (when RF is set because the stop was not a debug exception at
> all), so that it isn't skipped if the PC happens to be right there already.

It seems to me that signal handlers must run with a copy of the original
EFLAGS stored on the stack.  Otherwise, when the handler returned the
former context wouldn't be fully restored.  But I don't know enough about
the signal handling code to see how to turn off RF in the stored EFLAGS
image.

Also, what if the signal handler was entered as a result of encountering 
an instruction breakpoint?  In that case you would want to keep RF on to 
prevent an infinite loop.

You're right about wanting to clear RF when changing the PC via ptrace or
when setting a new execution breakpoint (provided the new breakpoint's
address is equal to the current PC value).

Do you know how gdb handles instruction breakpoints, and in particular, 
how it resumes execution after a breakpoint?


> > Come to think of it, we don't really need modify_user_hw_breakpoint at
> > all.  It could be replaced by an {unregister(old); register(new);}
> > sequence.  Unless you think there's some pressing reason to keep it, my
> > inclination is to do away with it.
> 
> I sort of wondered from the beginning why it was there.  The rationale I
> can see is to avoid flutter.  That is, when unregistering frees up a slot
> for a lower-priority allocation waiting in the wings, and then the new
> registration will just displace it again.  The priority list diddling is
> wasted work to get back to just how it was before, but more importantly you
> don't want to have those callbacks for a momentarily-available slot coming
> and going.  I don't know if this can really come up with the current code.

That may be what I originally had in mind; I no longer remember.

But it doesn't matter.  We're up against an API incompatibility here.  
gdb doesn't allow you to modify breakpoints; it forces you to delete the
old one and add a new one.  It's only an artifact of the x86 architecture
that gdb implements this by reusing debug registers.  So even if the 
modify_user_hw_breakpoint() routine were kept, gdb wouldn't really want to 
make use of it.

Under the circumstances I think we should just leave it out.


> > As I understand it, setting one of those bits is necessary on the 386 but
> > not necessary for later processors.  Should this be controlled by a
> > runtime (or compile time) check?  For that matter, do those bits have any
> > effect at all on a Pentium?
> 
> I've never heard of anyone using them, but I don't know the full story.

On the 386, either GE or LE had to be set for DR breakpoints to work 
properly.  Later on (I don't remember if it was in the 486 or the Pentium) 
this restriction was removed.  I don't know whether those bits do anything 
at all on modern CPUs.


> The documentation I have says that RF is set in the trap frame on the stack
> (i.e. pt_regs.eflags) by every other kind of exception.  However, for a
> debug exception that is due to an instruction breakpoint, RF=0 in the trap
> frame and the manual explicitly says that the handler must set the bit so
> that iret will resume and execute it rather than hit the breakpoint again.
> 
> [later:]
> > It also turns out that some CPUs don't automatically set the RF bit in 
> > the EFLAGS image on the stack.  Intel recommends that the OS always set 
> > that bit whenever a debug exception occurs, so that's what I did.
> 
> Is this really "some CPUs"?  Or is it actually always as I described above
> (i.e. RF set usually but cleared for an instruction breakpoint hit)?

My 80386 Programmer's Reference Manual says:

... an instruction-address breakpoint exception is a fault.

And:

When it detects a fault, the processor automatically sets
RF in the flags image that it pushes onto the stack.

And:

The processor automatically sets RF in the EFLAGS image
on the stack before entry into any fault handler.  Upon
entry into the fault handler for instruction address
breakpoints, for example, RF is set in the EFLAGS image
on the stack...

That seems to be pretty clear.  So the behavior can vary according to the 
processor type.


> > If callers want to give up when a kernel breakpoint isn't installed 
> > immediately, all they have to do is check the return value from 
> > register_kernel_hw_breakpoint and call unregister_kerne

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-13 Thread Roland McGrath
Sorry again about the delay.  

> I trust we are moving closer to a final, usable form.

Indeed, I think it is getting there.

> I think there are probably still a few small things wrong with it.  For
> instance, the RF setting isn't right; I misunderstood the Intel manual.  
> It should get set only when the latest debug interrupt was for an
> instruction breakpoint.

This makes me think about RF a little more.  If you ever set it, there are
some places we need to clear it too.  That is, when the PC is being changed
before returning to user mode, which is in signals and in ptrace.  If the
PC is changing to other than the breakpoint location hit by the handler
that set RF, we need clear RF so that the first instruction at the changed
PC can be a breakpoint hit of its own and not get masked.  In fact, it may
also be necessary to clear RF when freshly setting a new instruction
breakpoint (when RF is set because the stop was not a debug exception at
all), so that it isn't skipped if the PC happens to be right there already.

> Come to think of it, we don't really need modify_user_hw_breakpoint at
> all.  It could be replaced by an {unregister(old); register(new);}
> sequence.  Unless you think there's some pressing reason to keep it, my
> inclination is to do away with it.

I sort of wondered from the beginning why it was there.  The rationale I
can see is to avoid flutter.  That is, when unregistering frees up a slot
for a lower-priority allocation waiting in the wings, and then the new
registration will just displace it again.  The priority list diddling is
wasted work to get back to just how it was before, but more importantly you
don't want to have those callbacks for a momentarily-available slot coming
and going.  I don't know if this can really come up with the current code.

> Hmm...  Maybe I could store a pointer to the DR6 value in args.err instead
> of the value itself...

Ugh.

> As I understand it, setting one of those bits is necessary on the 386 but
> not necessary for later processors.  Should this be controlled by a
> runtime (or compile time) check?  For that matter, do those bits have any
> effect at all on a Pentium?

I've never heard of anyone using them, but I don't know the full story.

> My Intel manual says that the CPU automatically sets the RF bit in the
> EFLAGS image stored on the stack by the debug exception.  Hence the
> handler doesn't have to worry about it.  That's why I removed it from the 
> existing code.

The documentation I have says that RF is set in the trap frame on the stack
(i.e. pt_regs.eflags) by every other kind of exception.  However, for a
debug exception that is due to an instruction breakpoint, RF=0 in the trap
frame and the manual explicitly says that the handler must set the bit so
that iret will resume and execute it rather than hit the breakpoint again.

[later:]
> It also turns out that some CPUs don't automatically set the RF bit in 
> the EFLAGS image on the stack.  Intel recommends that the OS always set 
> that bit whenever a debug exception occurs, so that's what I did.

Is this really "some CPUs"?  Or is it actually always as I described above
(i.e. RF set usually but cleared for an instruction breakpoint hit)?

> If callers want to give up when a kernel breakpoint isn't installed 
> immediately, all they have to do is check the return value from 
> register_kernel_hw_breakpoint and call unregister_kernel_hw_breakpoint.  
> If you really want it, I could add an extra "fail if not installed" 
> argument flag.

The important thing is that there aren't any difficult races (i.e. what you
get with callbacks).  If register with no callback followed by unregister
on seeing "registered but not installed" return value is simple and cheap,
that is fine.

> For user breakpoints, the whole notion is almost meaningless.  Even if the
> breakpoint was allocated a debug register initially, it could get
> displaced by the time the debuggee task next runs.

It's no less meaningful than for a kernel allocation.  In neither case is
there a guarantee you'll keep it forever.  What callers I had in mind want
is a quick answer when the answer is negative at the time of the call, so
they just punt on the complexity of dealing with a positive answer.

> Again, this was referring to existing code which I basically copied 
> without fully understanding.  Does the new code in do_debug do the right 
> thing with regard to TF?

It looks right to me.  That is, it preserves the existing behavior for
kernel-mode traps, and does not touch TF at all for user-mode traps.

> > > + /* Block kernel breakpoint updates from other CPUs */
> > > + local_irq_save(flags);
> > 
> > I have a feeling this is more costly than we want, though I don't really
> > know.  It seems to me that things in struct cpu_hw_breakpoint are not
> > really per-CPU, except for bp_task.  They are "current global state",
> > right?
> 
> Not really, since changes to the debug registers on multiple CPUs cannot
> be made

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-05-11 Thread Alan Stern
On Wed, 28 Mar 2007, Roland McGrath wrote:

> Sorry I've been slow in responding to your most recent version.
> I fell into a large hole and couldn't get out until I fixed some bugs.

Has the same thing happened again?  There hasn't been any feedback on the 
most recent version of hw_breakpoint emailed on April 13:

http://marc.info/?l=linux-kernel&m=117661223820357&w=2

I think there are probably still a few small things wrong with it.  For
instance, the RF setting isn't right; I misunderstood the Intel manual.  
It should get set only when the latest debug interrupt was for an
instruction breakpoint.

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-29 Thread Alan Stern
On Wed, 28 Mar 2007, Roland McGrath wrote:

> Sorry I've been slow in responding to your most recent version.
> I fell into a large hole and couldn't get out until I fixed some bugs.

That's okay; the same thing happens to everyone from time to time.


> > No, I'm not confused and neither are you.  I realize there's no functional 
> > difference between the two sets of enable bits, since Linux doesn't use 
> > hardware task-switching.  I just like to keep things neatly separated, 
> > that's all.
> 
> I don't really know what more to say.  I like things neatly separated too,
> most especially between clearly meaningful and misleading yet apparently
> meaningful.  Overloading unrelated hardware bits for no material purpose
> will never make sense to me.

Well, I can change it easily enough.


> > > How would that happen?  This would mean that some user process has been
> > > allowed to enable ioperm for some io port that kernel drivers also send to
> > > from interrupt handlers.  Can that ever happen?
> > 
> > I haven't checked the ioperm code to be certain, but it seems like the 
> > sort of thing somebody might want to do on occasion.
> 
> I checked the ioperm code.  As far as I can tell, if you have CAP_SYS_RAWIO
> then you can use ioperm to enable any port you want, so this will always be
> possible.  (That seems a bit nutty to me, hence my earlier reaction.)

At this stage it's a moot point anyway.


> > > > It gives drivers a way to tell whether or not the breakpoint is 
> > > > currently
> > > > installed without having to do explicit tracking of installed() and 
> > > > uninstalled() callbacks.
> > > 
> > > How could that ever be used that would not be racy and thus buggy?  A
> > > registration call on another CPU could cause a change and callback just
> > > after you fetched the value.
> > 
> > Not if you have interrupts disabled.  Debug register settings are 
> > disseminated from one CPU to the others by means of an IPI.
> 
> But the callback is not per-CPU, it is per-registration.  When a new
> registration displaces an old one, or a deregistration stops displacing
> one, isn't the status change in the data structure and the callback made
> right away, by the thread doing the registration/deregistration call?
> Another CPU with interrupts disabled won't have its breakpoints actually
> change, but the data structure will have changed.  Isn't that right?

Not quite right.  The status change in the data structure isn't made until
after all the IPIs have completed, which won't happen until all the CPUs
have responded.  A CPU with interrupts disabled will therefore delay the
status change as well as the debug-register update, and so the value of
bp->status would remain valid until interrupts were enabled.


> > I also decided against adding a .bits member.  It doesn't really gain very 
> > much; the savings in encoding the breakpoint values is trivial -- one line 
> > of code on i386.  
> 
> I think this is a mistake.  On other machines, there is no need for more
> than one field and .len will go unused.  There is no reason not to encode
> ahead of time.  If it's useful for callers to be able to extract the type
> and length from a struct hw_breakpoint, it's easy to define macros or
> inlines in asm/hw_breakpoint.h that go the other way.

I'm not entirely convinced.

Consider first that the computation involved in encoding .bits is just a
little more complicated than one would like to do at compile time.  (At
least, it is on x86_64.)  Callers would have to do it themselves
dynamically, or else pass len and type as arguments to the
register_hw_breakpoint routines (which would mean an unused argument on
those other machines).  In general it just makes things harder on callers.

The fact the .len would go unused on some architectures shouldn't matter.  
It's just a u8; .len and .type together take up no more space than .bits 
would.

However, if you insist I can still change things over.

> > And it helps to have the original length and type values available for
> > use by the ptrace routines.  
> 
> There's no reason __modify_user_hw_breakpoint can't use an encoded value.
> modify_user_hw_breakpoint can do checking and encoding before taking the lock.

Come to think of it, we don't really need modify_user_hw_breakpoint at
all.  It could be replaced by an {unregister(old); register(new);}
sequence.  Unless you think there's some pressing reason to keep it, my
inclination is to do away with it.


> My review comments below are about your patch of 3/22 (described as
> "actually works with 2.6.21-rc4").
> 
> > -   if (notify_die(DIE_DEBUG, "debug", regs, condition, error_code,
> > -   SIGTRAP) == NOTIFY_STOP)
> > +   args.regs = regs;
> > +   args.str = "debug";
> > +   get_debugreg(args.err, 6);
> > +   set_debugreg(0, 6); /* DR6 is never cleared by the CPU */
> > +   args.trapnr = error_code;
> > +   args.signr = SIGTRAP;
> > +   if (atomic_notifier_call_chain(&i386die

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-28 Thread Roland McGrath
Sorry I've been slow in responding to your most recent version.
I fell into a large hole and couldn't get out until I fixed some bugs.

> No, I'm not confused and neither are you.  I realize there's no functional 
> difference between the two sets of enable bits, since Linux doesn't use 
> hardware task-switching.  I just like to keep things neatly separated, 
> that's all.

I don't really know what more to say.  I like things neatly separated too,
most especially between clearly meaningful and misleading yet apparently
meaningful.  Overloading unrelated hardware bits for no material purpose
will never make sense to me.

> > How would that happen?  This would mean that some user process has been
> > allowed to enable ioperm for some io port that kernel drivers also send to
> > from interrupt handlers.  Can that ever happen?
> 
> I haven't checked the ioperm code to be certain, but it seems like the 
> sort of thing somebody might want to do on occasion.

I checked the ioperm code.  As far as I can tell, if you have CAP_SYS_RAWIO
then you can use ioperm to enable any port you want, so this will always be
possible.  (That seems a bit nutty to me, hence my earlier reaction.)

> > > It gives drivers a way to tell whether or not the breakpoint is currently
> > > installed without having to do explicit tracking of installed() and 
> > > uninstalled() callbacks.
> > 
> > How could that ever be used that would not be racy and thus buggy?  A
> > registration call on another CPU could cause a change and callback just
> > after you fetched the value.
> 
> Not if you have interrupts disabled.  Debug register settings are 
> disseminated from one CPU to the others by means of an IPI.

But the callback is not per-CPU, it is per-registration.  When a new
registration displaces an old one, or a deregistration stops displacing
one, isn't the status change in the data structure and the callback made
right away, by the thread doing the registration/deregistration call?
Another CPU with interrupts disabled won't have its breakpoints actually
change, but the data structure will have changed.  Isn't that right?

> I implemented most of the changes we discussed.  Ignoring the length for 
> execute breakpoints turned out not to be a good idea because it would 
> affect the way ptrace works, so the code verifies it just like any other 
> kind of breakpoint.

I think that's perfectly fine.

> I also decided against adding a .bits member.  It doesn't really gain very 
> much; the savings in encoding the breakpoint values is trivial -- one line 
> of code on i386.  

I think this is a mistake.  On other machines, there is no need for more
than one field and .len will go unused.  There is no reason not to encode
ahead of time.  If it's useful for callers to be able to extract the type
and length from a struct hw_breakpoint, it's easy to define macros or
inlines in asm/hw_breakpoint.h that go the other way.

> And it helps to have the original length and type values available for
> use by the ptrace routines.  

There's no reason __modify_user_hw_breakpoint can't use an encoded value.
modify_user_hw_breakpoint can do checking and encoding before taking the lock.

> In fact, I decided to add a superfluous bit to the type code.  That's to
> help disambiguate between length and type values; it's easy to mix the
> two of them up.

It doesn't hurt to add some constant bits to the API values that you just
mask out (after checking) as part of the encoding convsersion.  In fact,
it's a good way to make sure people are using the right macros.

> Likewise, the length macros don't give the encoded values; that's so
> people can just specify the length directly instead of using the macro.

I object to this.  The purpose of having macros is to give a constrained
set of values that can be used at compile time.  Letting people use literal
numbers is just asking for people to write their calls unportably.  For
machines that support only LEN8, I'd planned to define it to some magic
bit pattern just so it could be checked and rule out cavalier users who
don't pay attention to the macros.

> There's a small question about the value of the error_code argument for 
> send_sigtrap().  The value passed into do_debug() isn't available in
> ptrace_triggered() -- but since it is always 0, that's what I'm using.  
> I'm not sure what it's supposed to mean anyway.

task->thread.trap_no and task->thread.error_code usually store the values
from the hardware trap frame when a trap is turned into a signal (e.g. see
do_trap).  This is the hardware trap number, which is 1 for do_debug.  The
error code is a word of the hardware trap frame whose meaning is specified
differently for each trap number.  For debug traps, it's always zero.
For other kinds of traps, it is sometimes useful to have the value.


My review comments below are about your patch of 3/22 (described as
"actually works with 2.6.21-rc4").

> - if (notify_die(DIE_DEBUG, "debug", regs, condition, error_

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-14 Thread Alan Stern
On Tue, 13 Mar 2007, Roland McGrath wrote:

> > Yes, the code could be reworked by moving some of the data from the CPU
> > hw-breakpoint info into the thread's info.  I'll see how much simpler it
> > ends up being.
> 
> I don't quite understand that characterization of the kind of change I'm
> advocating.

It's easy to explain: Your sample code had a tdr[] array in the thread's
hw_breakpoint area and my patch did not; adding it amounts to moving some
of the data from the CPU's info into the thread's info.

> > It isn't quite that easy.  Even though the number of user breakpoints may
> > not have changed, their identities may have.  So the unlikely case has to
> > encompass two possibilities: the number of installable user breakpoints
> > has changed, or any user breakpoints have been registered or unregistered.
> 
> Why does it matter?  When a new user breakpoint was made the
> highest-priority one, it ought to update tdr[0..3] right then before the
> registration call returns.  It seems fine to me for it to make an
> uninstalled callback right away rather than at the thread's next switch-in.
> But even if you wanted to delay it, you could just set active_dr7 to zero
> or something so that the unlikely case triggers.

That's basically what I intend to do.  Although instead of keeping track 
of an active_dr7, I'll keep track of num_kbps as of the last time the 
thread ran.  The unlikely case can be triggered by setting the value to 
-1.

> The plan I suggested relies on setting want_dr7 with the enable bits that
> do include the ones the kernel uses (for contested slots).  Of course it
> works as well to use either bit for this, as long as you're consistent.

With my scheme it won't matter.  You'll see.

> But as I've said at least twice already, there is no actual meaning
> whatsoever to choosing one enable bit over the other.  It's just confusing
> and misleading to have the code make special efforts to set one rather than
> the other for different cases.  You talk about them as if they meant
> something, which keeps making me wonder if you're confused.  Since the
> hardware doesn't care which bit you set, you could overload them to record
> a bit and a half of information there if really wanted to, but you're not
> even doing that, unless I'm confused.

No, I'm not confused and neither are you.  I realize there's no functional 
difference between the two sets of enable bits, since Linux doesn't use 
hardware task-switching.  I just like to keep things neatly separated, 
that's all.


> > Maybe.  I always had in the back of my mind the possibility that there
> > might be a user I/O breakpoint set.  It could be triggered by an interrupt
> > handler even in the SIGKILL case.  But since we're not supporting I/O
> > breakpoints now, that's a moot point.
> 
> How would that happen?  This would mean that some user process has been
> allowed to enable ioperm for some io port that kernel drivers also send to
> from interrupt handlers.  Can that ever happen?

I haven't checked the ioperm code to be certain, but it seems like the 
sort of thing somebody might want to do on occasion.

Another aspect perhaps worth mentioning is that user breakpoints are 
active only when the task is running.  Hence breakpoints for user data 
affected by AIO operations won't necessarily be triggered; with async I/O 
the transfers can occur while a different task is running.  Of course 
there's nothing new about this; the same has been true for ptrace all 
along.


> > Actually the code _doesn't_ already know what's there; the chbi area
> > doesn't include any storage for the kernel DR7 value.  I figured it was at
> > least as easy to read it from the CPU register as to read it from memory.  
> > But maybe that's not true; according to my ancient processor manual, moves
> > to/from debug registers take many more clock cycles than moves to/from
> > memory.
> 
> The purpose of the chbi area is to optimize this path.  Make it store
> whatever precomputed values are most convenient for the hot paths.  This
> path doesn't need num_kbps, it needs kdr7.  So precompute that and do that
> one load, instead of a load of chbi->num_bkps we don't otherwise need plus
> a load from kdr7_masks that can be avoided altogether on hot paths.

I will endeavor to optimize switch_to_thread_hw_breakpoint.  However it 
will turn out that the hot patch really does to use chbi->num_kbps and 
chbi->kdr7_mask.  Again, you'll see...


> > No.  If a debugger has removed some user breakpoints since the last time
> > the thread ran, the chbi->bps[] entries could still be present.  Likewise
> > if the previously-running task had more breakpoints than the current one.
> 
> I don't really get why user breakpoints would be in chbi->bps at all.
> When a debug trap hits, you can check kdr7 or whatnot to see if it was a
> kernel allocation, and otherwise look in current->thbi->bps to find it.

You're right; I'll do it that way.


> > Since PPC doesn't allow lengths shorter 

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-13 Thread Roland McGrath
> Yes, the code could be reworked by moving some of the data from the CPU
> hw-breakpoint info into the thread's info.  I'll see how much simpler it
> ends up being.

I don't quite understand that characterization of the kind of change I'm
advocating.  If the common case path in context switch has really anything
at all more than the example I gave, something is wrong.

> It isn't quite that easy.  Even though the number of user breakpoints may
> not have changed, their identities may have.  So the unlikely case has to
> encompass two possibilities: the number of installable user breakpoints
> has changed, or any user breakpoints have been registered or unregistered.

Why does it matter?  When a new user breakpoint was made the
highest-priority one, it ought to update tdr[0..3] right then before the
registration call returns.  It seems fine to me for it to make an
uninstalled callback right away rather than at the thread's next switch-in.
But even if you wanted to delay it, you could just set active_dr7 to zero
or something so that the unlikely case triggers.

> > For the masks to work as I described, you need to use the same enable bit
> > (or both) for kernel and user allocations.  It really doesn't matter which
> > one you use, since all of Linux is "local" for the sense of the dr7 enable
> > bits (i.e. you should just use DR_GLOBAL_ENABLE).
> 
> This shouldn't be necessary.  So long as DR_GLOBAL_ENABLE always belongs
> to the kernel's part of DR7 and DR_LOCAL_ENABLE always belongs to the
> thread's part there will be no interference between them.

The plan I suggested relies on setting want_dr7 with the enable bits that
do include the ones the kernel uses (for contested slots).  Of course it
works as well to use either bit for this, as long as you're consistent.
But as I've said at least twice already, there is no actual meaning
whatsoever to choosing one enable bit over the other.  It's just confusing
and misleading to have the code make special efforts to set one rather than
the other for different cases.  You talk about them as if they meant
something, which keeps making me wonder if you're confused.  Since the
hardware doesn't care which bit you set, you could overload them to record
a bit and a half of information there if really wanted to, but you're not
even doing that, unless I'm confused.

> Maybe.  I always had in the back of my mind the possibility that there
> might be a user I/O breakpoint set.  It could be triggered by an interrupt
> handler even in the SIGKILL case.  But since we're not supporting I/O
> breakpoints now, that's a moot point.

How would that happen?  This would mean that some user process has been
allowed to enable ioperm for some io port that kernel drivers also send to
from interrupt handlers.  Can that ever happen?

> Actually the code _doesn't_ already know what's there; the chbi area
> doesn't include any storage for the kernel DR7 value.  I figured it was at
> least as easy to read it from the CPU register as to read it from memory.  
> But maybe that's not true; according to my ancient processor manual, moves
> to/from debug registers take many more clock cycles than moves to/from
> memory.

The purpose of the chbi area is to optimize this path.  Make it store
whatever precomputed values are most convenient for the hot paths.  This
path doesn't need num_kbps, it needs kdr7.  So precompute that and do that
one load, instead of a load of chbi->num_bkps we don't otherwise need plus
a load from kdr7_masks that can be avoided altogether on hot paths.

I don't really know about the slowness of reading debug registers, though I
would guess it is slower than most common operations.  But regardless, you
can avoid it because kdr7 is something you need anyway, so you're not
replacing it with a load but letting a load you already had kill two birds.

> No.  If a debugger has removed some user breakpoints since the last time
> the thread ran, the chbi->bps[] entries could still be present.  Likewise
> if the previously-running task had more breakpoints than the current one.

I don't really get why user breakpoints would be in chbi->bps at all.
When a debug trap hits, you can check kdr7 or whatnot to see if it was a
kernel allocation, and otherwise look in current->thbi->bps to find it.

> I don't like using DR_LEN_1, because it would force asm/debugreg.h to be 
> #included by any user of hw_breakpoint.  The raw numerical value should do 
> just as well.

Agreed.  (I just used DR_LEN_1 as shorthand and was not hot on including
asm/debugreg.h in asm/hw_breakpoint.h in the actual version.)

> > On powerpc, the address breakpoint is always for an 8-byte address range.
> 
> So there's no way to trap on accesses to a particular byte within a
> string?

There's no way to tell which of the 8 bytes were accessed, AFAIK.  It's the
same as LEN8 on x86_64 or LEN[42] on i386: some byte in there was accessed.

> Better yet, if type is HW_BREAKPOINT_TYPE_EXECUTE then just ignore the
> caller's

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-13 Thread Alan Stern
On Tue, 13 Mar 2007, Roland McGrath wrote:

> > At that moment D, running on CPU 1, decides to unregister a breakpoint in
> > T.  Clearing TIF_DEBUG now doesn't do any good -- it's too late; CPU 0 has
> > already tested it.  CPU 1 goes in and alters the user breakpoint data,
> > maybe even deallocating a breakpoint structure that CPU 0 is about to
> > read.  Not a good situation.
> 
> You make it sound like a really good case for RCU. ;-)
> 
> > No way to tell when a task being debugged is started up by anything other 
> > than its debugger?  Hmmm, in that case maybe it would be better to use 
> > RCU.  It won't add much overhead to anything but the code for registering 
> > and unregistering user breakpoints.
> 
> Indeed, it is for this sort of thing.

Well, if I have to use it then I will.

>  Still, it feels like a bit too much
> is going on in switch_to_thread_hw_breakpoint for the common case.
> It seems to me it ought to be something as simple as this:
> 
>   if (unlikely((thbi->want_dr7 &~ chbi->kdr7) != thbi->active_tdr7) {
>   /* Need to make some installed or uninstalled callbacks.  */
>   if (thbi->active_tdr7 & chbi->kdr7)
>   uninstalled callbacks;
>   else
>   installed callbacks;
>   recompute active_dr7, want_dr7;
>   }
> 
>   switch (thbi->active_bkpts) {
>   case 4:
>   set_debugreg(0, thbi->tdr[0]);
>   case 3:
>   set_debugreg(1, thbi->tdr[1]);
>   case 2:
>   set_debugreg(2, thbi->tdr[2]);
>   case 1:
>   set_debugreg(3, thbi->tdr[3]);
>   }
>   set_debugreg(7, chbi->kdr7 | thbi->active_tdr7);

Yes, the code could be reworked by moving some of the data from the CPU
hw-breakpoint info into the thread's info.  I'll see how much simpler it
ends up being.

> Only in the unlikely case do you need to worry about synchronization,
> whether it's RCU or spin locks or whatever.  The idea is that breakpoint
> installation when doing the fancy stuff would reduce it to "the four
> breakpoints I would like, in order" (tdr[3] containing the highest-priority
> one), and dr7 masks describing what dr7 you were using last time you were
> running (active_tdr7), and that plus the enable bits you would like to have
> set (want_dr7).  The unlikely case is when the number of kernel debugregs
> consumed changed since the last time you switched in, so you go recompute
> active_tdr7.  (Put the body of that if in another function.)

It isn't quite that easy.  Even though the number of user breakpoints may
not have changed, their identities may have.  So the unlikely case has to
encompass two possibilities: the number of installable user breakpoints
has changed, or any user breakpoints have been registered or unregistered.

> For the masks to work as I described, you need to use the same enable bit
> (or both) for kernel and user allocations.  It really doesn't matter which
> one you use, since all of Linux is "local" for the sense of the dr7 enable
> bits (i.e. you should just use DR_GLOBAL_ENABLE).

This shouldn't be necessary.  So long as DR_GLOBAL_ENABLE always belongs
to the kernel's part of DR7 and DR_LOCAL_ENABLE always belongs to the
thread's part there will be no interference between them.

> It's perfectly safe to access all this stuff while it might be getting
> overwritten, and worst case you switch in some user breakpoints you didn't
> want.  That only happens in the SIGKILL case, when you never hit user mode
> again and don't care.

Maybe.  I always had in the back of my mind the possibility that there
might be a user I/O breakpoint set.  It could be triggered by an interrupt
handler even in the SIGKILL case.  But since we're not supporting I/O
breakpoints now, that's a moot point.

> > +void switch_to_thread_hw_breakpoint(struct task_struct *tsk)
> [...]
> > +   /* Keep the DR7 bits that refer to kernel breakpoints */
> > +   get_debugreg(dr7, 7);
> > +   dr7 &= kdr7_masks[chbi->num_kbps];
> 
> I don't understand what this part is for.  Why fetch dr7 from the CPU?  You
> already know what's there.  All you need is the current dr7 bits belonging
> to kernel allocations, i.e. a chbi->kdr7 mask.

Actually the code _doesn't_ already know what's there; the chbi area
doesn't include any storage for the kernel DR7 value.  I figured it was at
least as easy to read it from the CPU register as to read it from memory.  
But maybe that's not true; according to my ancient processor manual, moves
to/from debug registers take many more clock cycles than moves to/from
memory.

> > +   if (tsk && test_tsk_thread_flag(tsk, TIF_DEBUG)) {
> 
> switch_to_thread_hw_breakpoint is on the context switch path.  On that
> path, this test can never be false.  The context switch path should not
> have any unnecessary conditionals.  If you want to share code with some
> other places that now call switch_to_thread_hw_breakpoint, they can share a
> common inline

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-13 Thread Alan Cox
On Tue, 13 Mar 2007 01:00:50 -0700 (PDT)
Roland McGrath <[EMAIL PROTECTED]> wrote:

> > Well, I can add in the test for 0, but finding the set of always-on bits
> > in DR6 will have to be done separately.  Isn't it possible that different
> > CPUs could have different bits?
> 
> I don't know, but it seems unlikely.  AFAIK all CPUs are presumed to have
> the same CPUID results, for example.

No. We merge the CPUID information to get a shared set of capability bits.

Generic PC systems with a mix of PII and PIII are possible. The voyager
architecture can have even more peculiar combinations of processor
modules installed.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-13 Thread Roland McGrath
> Well, I can add in the test for 0, but finding the set of always-on bits
> in DR6 will have to be done separately.  Isn't it possible that different
> CPUs could have different bits?

I don't know, but it seems unlikely.  AFAIK all CPUs are presumed to have
the same CPUID results, for example.

> At that moment D, running on CPU 1, decides to unregister a breakpoint in
> T.  Clearing TIF_DEBUG now doesn't do any good -- it's too late; CPU 0 has
> already tested it.  CPU 1 goes in and alters the user breakpoint data,
> maybe even deallocating a breakpoint structure that CPU 0 is about to
> read.  Not a good situation.

You make it sound like a really good case for RCU. ;-)

> No way to tell when a task being debugged is started up by anything other 
> than its debugger?  Hmmm, in that case maybe it would be better to use 
> RCU.  It won't add much overhead to anything but the code for registering 
> and unregistering user breakpoints.

Indeed, it is for this sort of thing.  Still, it feels like a bit too much
is going on in switch_to_thread_hw_breakpoint for the common case.
It seems to me it ought to be something as simple as this:

if (unlikely((thbi->want_dr7 &~ chbi->kdr7) != thbi->active_tdr7) {
/* Need to make some installed or uninstalled callbacks.  */
if (thbi->active_tdr7 & chbi->kdr7)
uninstalled callbacks;
else
installed callbacks;
recompute active_dr7, want_dr7;
}

switch (thbi->active_bkpts) {
case 4:
set_debugreg(0, thbi->tdr[0]);
case 3:
set_debugreg(1, thbi->tdr[1]);
case 2:
set_debugreg(2, thbi->tdr[2]);
case 1:
set_debugreg(3, thbi->tdr[3]);
}
set_debugreg(7, chbi->kdr7 | thbi->active_tdr7);

Only in the unlikely case do you need to worry about synchronization,
whether it's RCU or spin locks or whatever.  The idea is that breakpoint
installation when doing the fancy stuff would reduce it to "the four
breakpoints I would like, in order" (tdr[3] containing the highest-priority
one), and dr7 masks describing what dr7 you were using last time you were
running (active_tdr7), and that plus the enable bits you would like to have
set (want_dr7).  The unlikely case is when the number of kernel debugregs
consumed changed since the last time you switched in, so you go recompute
active_tdr7.  (Put the body of that if in another function.)

For the masks to work as I described, you need to use the same enable bit
(or both) for kernel and user allocations.  It really doesn't matter which
one you use, since all of Linux is "local" for the sense of the dr7 enable
bits (i.e. you should just use DR_GLOBAL_ENABLE).

It's perfectly safe to access all this stuff while it might be getting
overwritten, and worst case you switch in some user breakpoints you didn't
want.  That only happens in the SIGKILL case, when you never hit user mode
again and don't care.

> +void switch_to_thread_hw_breakpoint(struct task_struct *tsk)
[...]
> + /* Keep the DR7 bits that refer to kernel breakpoints */
> + get_debugreg(dr7, 7);
> + dr7 &= kdr7_masks[chbi->num_kbps];

I don't understand what this part is for.  Why fetch dr7 from the CPU?  You
already know what's there.  All you need is the current dr7 bits belonging
to kernel allocations, i.e. a chbi->kdr7 mask.

> + if (tsk && test_tsk_thread_flag(tsk, TIF_DEBUG)) {

switch_to_thread_hw_breakpoint is on the context switch path.  On that
path, this test can never be false.  The context switch path should not
have any unnecessary conditionals.  If you want to share code with some
other places that now call switch_to_thread_hw_breakpoint, they can share a
common inline for part of the guts.

> + set_debugreg(dr7, 7);   /* Disable user bps while switching */

What is this for?  The kernel's dr7 bits are already set.  Why does it
matter if bits enabling user breakpoints are set too?  No user breakpoint
can be hit on this CPU before this function returns.

> + /* Clear any remaining stale bp pointers */
> + while (--i >= chbi->num_kbps)
> + chbi->bps[i] = NULL;

Why is this done here?  This can be done when the kernel allocations are
installed/uninstalled.

> @@ -15,6 +15,7 @@ struct die_args {
>   long err;
>   int trapnr;
>   int signr;
> + int ret;
>  };

I don't understand why you added this at all.

>  fastcall void __kprobes do_debug(struct pt_regs * regs, long error_code)
[...]
> + if ((args.err & (DR_STEP|DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) ||
> + args.ret)
> + send_sigtrap(tsk, regs, error_code);

The args.err test is fine.  A notifier that wants the SIGTRAP sent should
just leave the appropriate DR_* bit set rather than clear it.

In hw_breakpoint_handler, you could just change:

if (i >= chbi->num_kbps)
  

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-08 Thread Roland McGrath
> That sounds like a rather fragile approach to avoiding a minimal amount of 
> work.  Debug exceptions don't occur very often, and when they do it won't 
> matter too much if we go through some extra notifier-chain callouts.

When single-stepping occurs it happens repeatedly many times, and that
doesn't need any more overhead than it already has.  

> It turns out that this won't work correctly unless I use something
> stronger, like a spinlock or RCU.  Either one seems like overkill.

What is the problem with just clearing TIF_DEBUG?  It just means that in
the SIGKILL case, the dying thread won't switch in its local debugregs.
The global kernel allocations will already be set in the processor from the
previous context, and old user-address allocations do no harm since we
won't run in user mode again before switching out at the end of do_exit.

> Is there any way to find out from within the
> switch_to_thread_hw_breakpoint routine whether the task is in this unusual
> state?  (By which I mean the task is being debugged and the debugger
> hasn't told it to start running.)  Would (tsk->exit_code == SIGKILL) work?  

That won't necessarily work.  There isn't any cheap check that won't also
catch a task preempted on its way to stopping for the debugger.  

> If not, can we add a TIF_DEBUG_STOPPED flag?  

I'm not clear on what that would mean, but it's probably not an idea I like.

> Or should I just go with a spinlock?

If it's really necessary, but it hasn't proved so for any other switched
per-thread state.  As long as you aren't doing per-thread kernel-mode
allocations, I don't see why you need anything other than TIF_DEBUG.

> Is SIGKILL the only way this can happen?

It should be, but there might be some stray wake_up_process calls in the
kernel that can violate [up]trace's supposed monopoly on TASK_TRACED (or
duopoly with SIGKILL, I suppose I should say).  If there is no SIGKILL,
then the task will just call schedule again nearly immediately to go back
to blocking, which will switch out unless there is a second wakeup right
then.

> In a similar vein, I need a reliable way to know whether a task has gone 
> through exit_thread().  If it has, then its hw_breakpoint area has been 
> deallocated and a new one must not be allocated.  Will (tsk->flags & 
> PF_EXITING) always be true once that happens?

PF_EXITING it set after there is no possibility of returning to user mode,
but a while before exit_thread, when you might still want kernel-mode
breakpoints.  If the only per-thread allocations you support are for user
mode, then you can certainly refuse to do any when PF_EXITING is set.


Thanks,
Roland

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-07 Thread Alan Stern
On Tue, 6 Mar 2007, Roland McGrath wrote:

> > > Yeah, I guess that's right.  It should still return NOTIFY_STOP when
> > > args->err has no other bits set, so notifiers aren't called with zero.
> > 
> > In practice that might not work.  On my machine, at least, reads of DR6
> > return ones in all the reserved bit positions.
> 
> Does that mean asm("mov %1,%%dr6; mov %%dr6,%0" : "=r" (mask) : "r" (0)); 
> puts in mask the set of reserved bits?  We could collect that value at CPU
> startup and mask it off args->err, then OR it back into vdr6.

That sounds like a rather fragile approach to avoiding a minimal amount of 
work.  Debug exceptions don't occur very often, and when they do it won't 
matter too much if we go through some extra notifier-chain callouts.


Back to a previous topic:

> > The actual guarantee I need is that nobody will switch_to() the task while
> > my routines are running.
>
> You can't get that.  It can always be woken for SIGKILL (which is a good
> thing).  What you are guaranteed is that if it does, it will never return
> to user mode.  So it has to be ok for switching in to use the bits in any
> intermediate state you might get them, meaning any possible garbage state
> is harmful only to user mode or is otherwise recoverable (worst case
> perhaps the exception handler has to know to ignore some traps).  This is
> already true with ptrace and ->thread.debugreg, as well as the normal user
> registers.  In your case, if you wanted to be paranoid you could clear
> TIF_DEBUG before you touch anything, and set it again only after you're
> done (with memory barriers as needed).

It turns out that this won't work correctly unless I use something
stronger, like a spinlock or RCU.  Either one seems like overkill.

Is there any way to find out from within the
switch_to_thread_hw_breakpoint routine whether the task is in this unusual
state?  (By which I mean the task is being debugged and the debugger
hasn't told it to start running.)  Would (tsk->exit_code == SIGKILL) work?  
If not, can we add a TIF_DEBUG_STOPPED flag?  Or should I just go with a 
spinlock?

Is SIGKILL the only way this can happen?

In a similar vein, I need a reliable way to know whether a task has gone 
through exit_thread().  If it has, then its hw_breakpoint area has been 
deallocated and a new one must not be allocated.  Will (tsk->flags & 
PF_EXITING) always be true once that happens?

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-06 Thread Roland McGrath
> > Yeah, I guess that's right.  It should still return NOTIFY_STOP when
> > args->err has no other bits set, so notifiers aren't called with zero.
> 
> In practice that might not work.  On my machine, at least, reads of DR6
> return ones in all the reserved bit positions.

Does that mean asm("mov %1,%%dr6; mov %%dr6,%0" : "=r" (mask) : "r" (0)); 
puts in mask the set of reserved bits?  We could collect that value at CPU
startup and mask it off args->err, then OR it back into vdr6.


Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-06 Thread Alan Stern
On Mon, 5 Mar 2007, Roland McGrath wrote:

> > The actual guarantee I need is that nobody will switch_to() the task while
> > my routines are running.
> 
> You can't get that.  It can always be woken for SIGKILL (which is a good
> thing).  What you are guaranteed is that if it does, it will never return
> to user mode.  So it has to be ok for switching in to use the bits in any
> intermediate state you might get them, meaning any possible garbage state
> is harmful only to user mode or is otherwise recoverable (worst case
> perhaps the exception handler has to know to ignore some traps).  This is
> already true with ptrace and ->thread.debugreg, as well as the normal user
> registers.  In your case, if you wanted to be paranoid you could clear
> TIF_DEBUG before you touch anything, and set it again only after you're
> done (with memory barriers as needed).

Guess I'll have to take that approach.  The new additions to __switch_to() 
follow a linked list, and obviously it's not safe to do that while the 
list is being updated.  (No, I'm not about to start using RCU for this!)


> > If someone really needs to do that, they can always put their own call to
> > (un)register_kernel_hwbkpt() at the entry(exit) to the complex subsystems.  
> > Or perhaps it should be a job for systemtap, which would use hwbkpt to do
> > the actual work.
> 
> But you don't have an option to avoid interrupting other CPUs to update,
> which is not necessary or desireable for this usage.  That's what I was
> referring to.  If it's not trivial to add, it isn't needed now.

I see your point.  If you want to monitor a certain location in kernel
space but only while in the context of a specific task, the new framework
doesn't provide any way of doing it.  One approach could be to use a
regular kernel breakpoint and return immediately if the task of interest
isn't the current one.  Beyond that, let's punt.


> > Which implies that do_debug needs to decide whether or not to issue 
> > SIGTRAP.  Presumably the condition will be that any of the DR_STEP or 
> > DR_TRAPn bits remain set after the notifier chain has run.  This means the 
> > kprobes code will have to be modified to clear DR_STEP in args->err.
> 
> Yeah, I guess that's right.  It should still return NOTIFY_STOP when
> args->err has no other bits set, so notifiers aren't called with zero.

In practice that might not work.  On my machine, at least, reads of DR6
return ones in all the reserved bit positions.

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-05 Thread Roland McGrath
> Presumably you mean that hw-breakpoint.c shouldn't do anything at all on
> single-step exceptions.  

Right.

> So far I've been developing under 2.6.21-rc, which doesn't have utrace.
> But eventually this will be submitted by way of -mm, which does.  The
> easiest approach would be to make the whole thing conditional on
> CONFIG_UTRACE.

That is fine with me.

> The actual guarantee I need is that nobody will switch_to() the task while
> my routines are running.

You can't get that.  It can always be woken for SIGKILL (which is a good
thing).  What you are guaranteed is that if it does, it will never return
to user mode.  So it has to be ok for switching in to use the bits in any
intermediate state you might get them, meaning any possible garbage state
is harmful only to user mode or is otherwise recoverable (worst case
perhaps the exception handler has to know to ignore some traps).  This is
already true with ptrace and ->thread.debugreg, as well as the normal user
registers.  In your case, if you wanted to be paranoid you could clear
TIF_DEBUG before you touch anything, and set it again only after you're
done (with memory barriers as needed).

> If someone really needs to do that, they can always put their own call to
> (un)register_kernel_hwbkpt() at the entry(exit) to the complex subsystems.  
> Or perhaps it should be a job for systemtap, which would use hwbkpt to do
> the actual work.

But you don't have an option to avoid interrupting other CPUs to update,
which is not necessary or desireable for this usage.  That's what I was
referring to.  If it's not trivial to add, it isn't needed now.

> Not nearly as hot as switch_to()!  But I'll do it.

That's why it's got a cheap TIF_DEBUG check with unlikely().

> That may be so, but the only way to access that part of the state is via
> ptrace.  Think of it this way: The debug register settings really should
> not be part of the thread's virtual state.  If we had some other, more
> logical API for managing breakpoints in a task then ptrace_bps[] wouldn't
> be necessary at all (other than for backward compatibility perhaps).

As things are in utrace, there will continue to be a utrace method of
setting the (virtual) "raw" debugregs, even if ptrace per se is not involved.
(So all I'm saying really is I'm on a personal campaign against the letter P.)

OTOH, your point is well taken.  Once your stuff is integrated, there is no
real reason that thread-virtualized "raw" debug registers need to be
accessible via utrace_regset.  Perhaps I should drop it.  Then those calls
will be used purely by ptrace compatibility and can be #ifdef CONFIG_PTRACE.

> Which implies that do_debug needs to decide whether or not to issue 
> SIGTRAP.  Presumably the condition will be that any of the DR_STEP or 
> DR_TRAPn bits remain set after the notifier chain has run.  This means the 
> kprobes code will have to be modified to clear DR_STEP in args->err.

Yeah, I guess that's right.  It should still return NOTIFY_STOP when
args->err has no other bits set, so notifiers aren't called with zero.


Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-05 Thread Roland McGrath
> > Please not.  That might save a few bytes, but it makes the interface a
> > lot harder to understand for users.  We really discourage over-loaded
> > interfaces in Linux.
> 
> I agree with Christoph.  Plenty of other interfaces in the kernel do the 
> same thing.

I don't think a single hook for both "on" and "off" notification is harder
to understand at all, it's natural.  But this is a very tiny point.

> I'll be happy to rename the structures and the files if Roland doesn't 
> mind.

I didn't much like "debugpoint" but I'm not wedded to vowellessness.
hw_breakpoint is fine by me.


Thanks,
Roland
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-05 Thread Alan Stern
On Sun, 4 Mar 2007, Roland McGrath wrote:

> Thanks, Alan.  Great work.  I have some suggestions for changes.
> 
> > I pretty much copied the existing code for handling vm86 mode
> > and single-step exceptions, without fully understanding it.
> > 
> > The code doesn't virtualize the BS (single-step) flag in DR6
> > for userspace.  It could be added, but I wonder whether it is
> > really needed.
> 
> There is only one TF flag, it and the DR_STEP bit in dr6 are part of the
> unitary thread state.  You should not be doing anything at all on
> single-step exceptions.

Presumably you mean that hw-breakpoint.c shouldn't do anything at all on
single-step exceptions.  do_debug does have to know about them, because it
has to know whether or not to send a SIGTRAP.  Separation of duties; I
tried to move too much out of do_debug.


> > Setting user breakpoints on I/O ports should require permissions
> > checking.  I haven't tried to figure out how that works or
> > how to implement it yet.
> 
> I would just leave the I/O breakpoint feature out for the first cut.  
> See if there is demand for it.  

Good enough.

> It requires setting CR4.DE, which we don't do.  The validation is
> relatively simple, comparing against the io_bitmap_ptr data (if any).
> You can check at insertion time (requiring doing ioperm before inserting
> an I/O breakpoint), and then check again at exception time if the hit
> was in kernel mode and ignore it for a user-only breakpoint, or perhaps
> check again and eject the breakpoint if it's been cleared from the
> ioperm bitmap.

It's easier simply to ignore the whole issue.  :-)  Fortunately many other 
architectures don't have to worry about it at all.


> > It seems likely that some of the new routines should be marked
> > "__kprobes", but I don't know which, or even what that annotation
> > is supposed to mean.
> 
> That annotation is for code that is in the kprobes maintenance code path
> (where you cannot insert kprobes).  do_debug has it for the notify_die
> path that might lead to kprobes.  The only thing in your code that
> should need it is your notifier function, which might get called for an
> exception that turns out to be a kprobes single-step.  There shouldn't
> be any problem inserting kprobes into the other hwbkpt code.

Got it.


> > The parts relating to kernel breakpoints could be made conditional
> > on a Kconfig option.  The amount of code space saved would be
> > relatively small; I'm not sure that it would be worthwhile.
> 
> In a utrace merge, the user parts can be made conditional on CONFIG_UTRACE.
> Then with both turned off, the code goes away completely.  It's unlikely it
> will ever be turned off, but it is a clean way to go about things in case
> someone wants the smallest possible config for a limited-use installation.

So far I've been developing under 2.6.21-rc, which doesn't have utrace.  
But eventually this will be submitted by way of -mm, which does.  The
easiest approach would be to make the whole thing conditional on 
CONFIG_UTRACE.


> > +   void*data;
> 
> Leave this out.  Let the caller embed struct hwbkpt in his larger struct
> and use container_of.

Okay.


> > +/*
> > + * The tsk argument in the following three routines will usually be a
> > + * process being PTRACEd by the current task, normally a debugger.
> > + * It is also legal for tsk to be the current task.  In either case we
> > + * can guarantee that tsk will not start running on another CPU while
> > + * its breakpoints are being modified.  If that happened it could cause
> > + * a crash.
> > + */
> > +int register_user_hwbkpt(struct task_struct *tsk, struct hwbkpt *bp);
> > +void unregister_user_hwbkpt(struct task_struct *tsk, struct hwbkpt *bp);
> > +int modify_user_hwbkpt(struct task_struct *tsk, struct hwbkpt *bp,
> > +   void *address, u8 len, u8 type);
> 
> These are the kinds of constraints utrace is there to make doable.
> (I'm assuming that guarantee is really "will not start running in
> user mode", since SIGKILL is always possible.)

The actual guarantee I need is that nobody will switch_to() the task while
my routines are running.

>  In a utrace merge,
> I think we want the only entry points for this to be a utrace-based
> interface that explicitly requires utrace's notion of quiescence
> (as its accessors for thread register data do).

Yes; I had in mind that the user-breakpoint part would be called only by 
utrace and/or ptrace.  That's why the routines aren't EXPORTed.  But I 
wrote it in more general form, to make it easier to understand.  I'll add 
some comments about this.


> > +/*
> > + * Kernel breakpoints are not associated with any particular thread.
> > + */
> > +int register_kernel_hwbkpt(struct hwbkpt *bp);
> > +void unregister_kernel_hwbkpt(struct hwbkpt *bp);
> 
> Potentially kernel users might want per-thread installations, if
> it's simple to provide the option.  e.g., a probe at some 

Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 11:16:48AM -0500, Alan Stern wrote:
> > Making this unconditional is pointless and just makes things harder to
> > read, so please don't do it.  (The same is true for utrace, but Roland
> > has unfortunately still not replied to my mail mentioning it :P)
> 
> Sorry, I don't understand what you're saying.  I would think that making
> it _conditional_ would make things harder to read, because of all the
> extra "#ifdef" and "#endif" lines plus the need to keep two different
> versions of the code in mind.
> 
> Did you mean to say "conditional" instead of "unconditional"?

Yes, I did mean that.  Sorry for the confusion :)

> Incidentally, I do believe that for certain applications (embedded
> devices, for instance) it makes sense to avoid including all this code.  
> The cleanest way to do that would be to make both PTRACE and UTRACE
> configurable.

PTRACE configurable makes a lot of sense, especially as we want to get
rid of it very long term.  Making UTRACE configurable aswel as all
these tracehooks wrappers just make the code utterly unreadable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-05 Thread Alan Stern
On Mon, 5 Mar 2007, Christoph Hellwig wrote:

> On Sun, Mar 04, 2007 at 11:01:36PM -0800, Roland McGrath wrote:
> > >   The parts relating to kernel breakpoints could be made conditional
> > >   on a Kconfig option.  The amount of code space saved would be
> > >   relatively small; I'm not sure that it would be worthwhile.
> > 
> > In a utrace merge, the user parts can be made conditional on CONFIG_UTRACE.
> > Then with both turned off, the code goes away completely.  It's unlikely it
> > will ever be turned off, but it is a clean way to go about things in case
> > someone wants the smallest possible config for a limited-use installation.
> 
> Making this unconditional is pointless and just makes things harder to
> read, so please don't do it.  (The same is true for utrace, but Roland
> has unfortunately still not replied to my mail mentioning it :P)

Sorry, I don't understand what you're saying.  I would think that making
it _conditional_ would make things harder to read, because of all the
extra "#ifdef" and "#endif" lines plus the need to keep two different
versions of the code in mind.

Did you mean to say "conditional" instead of "unconditional"?

Incidentally, I do believe that for certain applications (embedded
devices, for instance) it makes sense to avoid including all this code.  
The cleanest way to do that would be to make both PTRACE and UTRACE
configurable.


> > > + void(*installed)(struct hwbkpt *);
> > > + void(*uninstalled)(struct hwbkpt *);
> > 
> > Save space in the struct by having just one function for both installed
> > and uninstalled, taking an argument.  Probably a caller should be able to
> > pass a null function here to say that the registration call should fail if
> > it can't be installed due to higher-priority or no-callback registrations
> > existing, and that its registration cannot be ejected by another (i.e., an
> > ill-behaved user).
> 
> Please not.  That might save a few bytes, but it makes the interface a
> lot harder to understand for users.  We really discourage over-loaded
> interfaces in Linux.

I agree with Christoph.  Plenty of other interfaces in the kernel do the 
same thing.


> > > +struct thread_hwbkpt {   /* HW breakpoint info for a thread */
> 
> Can this and all the file names please get an actually readable name?
> E.g. hw_breakpoint.  We're not IBM managers that needs to save every
> cent on superflous sillables :)

I'll be happy to rename the structures and the files if Roland doesn't 
mind.

Alan Stern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-05 Thread Christoph Hellwig
On Sun, Mar 04, 2007 at 11:01:36PM -0800, Roland McGrath wrote:
> > The parts relating to kernel breakpoints could be made conditional
> > on a Kconfig option.  The amount of code space saved would be
> > relatively small; I'm not sure that it would be worthwhile.
> 
> In a utrace merge, the user parts can be made conditional on CONFIG_UTRACE.
> Then with both turned off, the code goes away completely.  It's unlikely it
> will ever be turned off, but it is a clean way to go about things in case
> someone wants the smallest possible config for a limited-use installation.

Making this unconditional is pointless and just makes things harder to
read, so please don't do it.  (The same is true for utrace, but Roland
has unfortunately still not replied to my mail mentioning it :P)

> > +   void(*installed)(struct hwbkpt *);
> > +   void(*uninstalled)(struct hwbkpt *);
> 
> Save space in the struct by having just one function for both installed
> and uninstalled, taking an argument.  Probably a caller should be able to
> pass a null function here to say that the registration call should fail if
> it can't be installed due to higher-priority or no-callback registrations
> existing, and that its registration cannot be ejected by another (i.e., an
> ill-behaved user).

Please not.  That might save a few bytes, but it makes the interface a
lot harder to understand for users.  We really discourage over-loaded
interfaces in Linux.

> > +struct thread_hwbkpt { /* HW breakpoint info for a thread */

Can this and all the file names please get an actually readable name?
E.g. hw_breakpoint.  We're not IBM managers that needs to save every
cent on superflous sillables :)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-04 Thread Roland McGrath
Thanks, Alan.  Great work.  I have some suggestions for changes.

>   I pretty much copied the existing code for handling vm86 mode
>   and single-step exceptions, without fully understanding it.
> 
>   The code doesn't virtualize the BS (single-step) flag in DR6
>   for userspace.  It could be added, but I wonder whether it is
>   really needed.

There is only one TF flag, it and the DR_STEP bit in dr6 are part of the
unitary thread state.  You should not be doing anything at all on
single-step exceptions.

>   Setting user breakpoints on I/O ports should require permissions
>   checking.  I haven't tried to figure out how that works or
>   how to implement it yet.

I would just leave the I/O breakpoint feature out for the first cut.  
See if there is demand for it.  

It requires setting CR4.DE, which we don't do.  The validation is
relatively simple, comparing against the io_bitmap_ptr data (if any).
You can check at insertion time (requiring doing ioperm before inserting
an I/O breakpoint), and then check again at exception time if the hit
was in kernel mode and ignore it for a user-only breakpoint, or perhaps
check again and eject the breakpoint if it's been cleared from the
ioperm bitmap.

>   It seems likely that some of the new routines should be marked
>   "__kprobes", but I don't know which, or even what that annotation
>   is supposed to mean.

That annotation is for code that is in the kprobes maintenance code path
(where you cannot insert kprobes).  do_debug has it for the notify_die
path that might lead to kprobes.  The only thing in your code that
should need it is your notifier function, which might get called for an
exception that turns out to be a kprobes single-step.  There shouldn't
be any problem inserting kprobes into the other hwbkpt code.

>   The parts relating to kernel breakpoints could be made conditional
>   on a Kconfig option.  The amount of code space saved would be
>   relatively small; I'm not sure that it would be worthwhile.

In a utrace merge, the user parts can be made conditional on CONFIG_UTRACE.
Then with both turned off, the code goes away completely.  It's unlikely it
will ever be turned off, but it is a clean way to go about things in case
someone wants the smallest possible config for a limited-use installation.

> + void(*installed)(struct hwbkpt *);
> + void(*uninstalled)(struct hwbkpt *);

Save space in the struct by having just one function for both installed
and uninstalled, taking an argument.  Probably a caller should be able to
pass a null function here to say that the registration call should fail if
it can't be installed due to higher-priority or no-callback registrations
existing, and that its registration cannot be ejected by another (i.e., an
ill-behaved user).

> + void*data;

Leave this out.  Let the caller embed struct hwbkpt in his larger struct
and use container_of.

> +/*
> + * The tsk argument in the following three routines will usually be a
> + * process being PTRACEd by the current task, normally a debugger.
> + * It is also legal for tsk to be the current task.  In either case we
> + * can guarantee that tsk will not start running on another CPU while
> + * its breakpoints are being modified.  If that happened it could cause
> + * a crash.
> + */
> +int register_user_hwbkpt(struct task_struct *tsk, struct hwbkpt *bp);
> +void unregister_user_hwbkpt(struct task_struct *tsk, struct hwbkpt *bp);
> +int modify_user_hwbkpt(struct task_struct *tsk, struct hwbkpt *bp,
> + void *address, u8 len, u8 type);

These are the kinds of constraints utrace is there to make doable.
(I'm assuming that guarantee is really "will not start running in
user mode", since SIGKILL is always possible.)  In a utrace merge,
I think we want the only entry points for this to be a utrace-based
interface that explicitly requires utrace's notion of quiescence
(as its accessors for thread register data do).

> +/*
> + * Kernel breakpoints are not associated with any particular thread.
> + */
> +int register_kernel_hwbkpt(struct hwbkpt *bp);
> +void unregister_kernel_hwbkpt(struct hwbkpt *bp);

Potentially kernel users might want per-thread installations, if
it's simple to provide the option.  e.g., a probe at some syscall
entry that installs a watchpoint while calling into complex
subsystems, and then removes it before returning.

> @@ -379,15 +381,15 @@ void exit_thread(void)
>   tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
>   put_cpu();
>   }
> + flush_thread_hwbkpt(tsk);

I'd make this do if (unlikely(test_tsk_thread_flag(tsk, TIF_DEBUG))),
or make it an inline doing that test before calling out to the guts.
Most of the time the hwbkpt code need not be on a hot path, and the
exit path will be one.

> +struct thread_hwbkpt {   /* HW breakpoint info for a thread */
> +
> + /* utrace support */
> + struct

[RFC] hwbkpt: Hardware breakpoints (was Kwatch)

2007-03-02 Thread Alan Stern
Roland and Prasanna:

Here's my first attempt, lightly tested, at an hwbkpt implementation.  It
includes copious comments, so it shouldn't be too hard to figure out (if
you read the files in the right order).  The patch below is meant for
2.6.21-rc2; porting it to -mm shouldn't be very hard.

There are still several loose ends and unanswered questions.

I pretty much copied the existing code for handling vm86 mode
and single-step exceptions, without fully understanding it.

The code doesn't virtualize the BS (single-step) flag in DR6
for userspace.  It could be added, but I wonder whether it is
really needed.

Unlike the existing code, DR7 is re-enabled upon returning from
a debug interrupt.  That means it doesn't have to be enabled
when delivering a SIGTRAP.

Setting user breakpoints on I/O ports should require permissions
checking.  I haven't tried to figure out how that works or
how to implement it yet.

It seems likely that some of the new routines should be marked
"__kprobes", but I don't know which, or even what that annotation
is supposed to mean.

When CPUs go on- or off-line, their debug registers need to be
initialized or cleared.  I did a little bit of that, but more is
needed.  In particular, CPU hotplugging and kexec have to take
this into account.

The parts relating to kernel breakpoints could be made conditional
on a Kconfig option.  The amount of code space saved would be
relatively small; I'm not sure that it would be worthwhile.

Probably there are some more issues I haven't thought of.  Anyway, let me 
know what you think.

Alan Stern



Index: 2.6.21-rc2/include/asm-i386/hwbkpt.h
===
--- /dev/null
+++ 2.6.21-rc2/include/asm-i386/hwbkpt.h
@@ -0,0 +1,185 @@
+#ifndef_I386_HWBKPT_H
+#define_I386_HWBKPT_H
+
+#include 
+#include 
+
+/**
+ * struct hwbkpt - unified kernel/user-space hardware breakpoint
+ * @node: internal linked-list management
+ * @triggered: callback invoked when the breakpoint is hit
+ * @installed: callback invoked when the breakpoint is installed
+ * @uninstalled: callback invoked when the breakpoint is uninstalled
+ * @data: private data for use by the breakpoint owner
+ * @address: location (virtual address) of the breakpoint
+ * @len: extent of the breakpoint address (1, 2, or 4 bytes)
+ * @type: breakpoint type (write-only, read/write, execute, or I/O)
+ * @priority: requested priority level
+ * @status: current registration/installation status
+ *
+ * %hwbkpt structures are the kernel's way of representing hardware
+ * breakpoints.  These can be either execution breakpoints (triggered
+ * on instruction execution) or data breakpoints (also known as
+ * "watchpoints", triggered on data access), and the breakpoint's
+ * target address can be located in either kernel space or user space.
+ *
+ * The @address, @len, and @type fields are standard, indicating the
+ * location of the breakpoint, its extent in bytes, and the type of
+ * access that will trigger the breakpoint.  Possible values for @len
+ * are 1, 2, and 4.  Possible values for @type are %HWBKPT_WRITE
+ * (triggered on write access), %HWBKPT_RW (triggered on read or
+ * write access), %HWBKPT_IO (triggered on I/O-space access), and
+ * %HWBKPT_EXECUTE (triggered on instruction execution).  Certain
+ * restrictions apply: %HWBKPT_EXECUTE requires that @len be 1, and
+ * %HWBKPT_IO is available only on processors with Debugging Extensions.
+ *
+ * In register_user_hwbkpt() and modify_user_hwbkpt(), @address must
+ * refer to a location in user space (unless @type is %HWBKPT_IO).
+ * The breakpoint will be active only while the requested task is
+ * running.  Conversely, in register_kernel_hwbkpt() @address must
+ * refer to a location in kernel space, and the breakpoint will be
+ * active on all CPUs regardless of the task being run.
+ *
+ * When a breakpoint gets hit, the @triggered callback is invoked
+ * in_interrupt with a pointer to the %hwbkpt structure and the
+ * processor registers.  %HWBKPT_EXECUTE traps occur before the
+ * breakpointed instruction executes; all other types of trap occur
+ * after the memory or I/O access has taken place.  All breakpoints
+ * are disabled while @triggered runs, to avoid recursive traps and
+ * allow unhindered access to breakpointed memory.
+ *
+ * Hardware breakpoints are implemented using the CPU's debug registers,
+ * which are a limited hardware resource.  Requests to register a
+ * breakpoint will always succeed (provided the member entries are
+ * valid), but the breakpoint may not be installed in a debug register
+ * right away.  Physical debug registers are allocated based on the
+ * priority level stored in @priority (higher values indicate higher
+ * priority).  User-space breakpoints within a single thread compe