Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-04-12 Thread Ingo Molnar

* Masami Hiramatsu  wrote:

> On Thu, 30 Mar 2017 08:53:32 +0200
> Ingo Molnar  wrote:
> 
> > 
> > * Masami Hiramatsu  wrote:
> > 
> > > > So this is something I missed while the original code was merged, but 
> > > > the concept 
> > > > looks a bit weird: why do we do any "allocation" while a handler is 
> > > > executing?
> > > > 
> > > > That's fundamentally fragile. What's the maximum number of parallel 
> > > > 'kretprobe_instance' required per kretprobe - one per CPU?
> > > 
> > > It depends on where we put the probe. If the probed function can block
> > > (yield to other tasks), then we need as many instances as there are
> > > threads on the system that can invoke the function. So, ultimately, it is
> > > the same as the function_graph tracer: we need one for each thread.
> > 
> > So then put it into task_struct (assuming there's no 
> > kretprobe-inside-kretprobe 
> > nesting allowed).
> 
> No, it is possible to put several kretprobes on the same thread; e.g. if
> func1() is called from func2(), the user can put kretprobes on both
> functions at the same time.
> So a possible solution is to allocate a new return stack for each task_struct,
> which is what the function-graph tracer does.
> 
> Anyway, I'm considering integrating kretprobe_instance with the ret_stack.
> It will increase memory usage for kretprobes, but it can provide a safer way
> to allocate kretprobe_instance.

Ok, that sounds good to me.

Thanks,

Ingo


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-30 Thread Alban Crequy
On Thu, Mar 30, 2017 at 8:53 AM, Ingo Molnar  wrote:
>
> * Masami Hiramatsu  wrote:
>
>> > So this is something I missed while the original code was merged, but the 
>> > concept
>> > looks a bit weird: why do we do any "allocation" while a handler is 
>> > executing?
>> >
>> > That's fundamentally fragile. What's the maximum number of parallel
>> > 'kretprobe_instance' required per kretprobe - one per CPU?
>>
>> It depends on where we put the probe. If the probed function can block
>> (yield to other tasks), then we need as many instances as there are
>> threads on the system that can invoke the function. So, ultimately, it is
>> the same as the function_graph tracer: we need one for each thread.
>
> So then put it into task_struct (assuming there's no 
> kretprobe-inside-kretprobe
> nesting allowed). There's just no way in hell we should be calling any complex
> kernel function from kernel probes!

Some kprobes are called from interrupt context. We have a kprobe on
tcp_set_state(), and this is sometimes called when the network card
receives a packet.

> I mean, think about it, a kretprobe can be installed in a lot of places, and 
> now
> we want to call get_free_pages() from it?? This would add a massive amount of
> fragility.
>
> Instrumentation must be _simple_, every patch that adds more complexity to the
> most fundamental code path of it should raise a red flag ...
>
> So let's make this more robust, ok?
>
> Thanks,
>
> Ingo

Thanks,
Alban


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-30 Thread Masami Hiramatsu
On Thu, 30 Mar 2017 08:53:32 +0200
Ingo Molnar  wrote:

> 
> * Masami Hiramatsu  wrote:
> 
> > > So this is something I missed while the original code was merged, but the 
> > > concept 
> > > looks a bit weird: why do we do any "allocation" while a handler is 
> > > executing?
> > > 
> > > That's fundamentally fragile. What's the maximum number of parallel 
> > > 'kretprobe_instance' required per kretprobe - one per CPU?
> > 
> > It depends on where we put the probe. If the probed function can block
> > (yield to other tasks), then we need as many instances as there are
> > threads on the system that can invoke the function. So, ultimately, it is
> > the same as the function_graph tracer: we need one for each thread.
> 
> So then put it into task_struct (assuming there's no 
> kretprobe-inside-kretprobe 
> nesting allowed).

No, it is possible to put several kretprobes on the same thread; e.g. if
func1() is called from func2(), the user can put kretprobes on both
functions at the same time.
So a possible solution is to allocate a new return stack for each task_struct,
which is what the function-graph tracer does.

Anyway, I'm considering integrating kretprobe_instance with the ret_stack.
It will increase memory usage for kretprobes, but it can provide a safer way
to allocate kretprobe_instance.

> There's just no way in hell we should be calling any complex 
> kernel function from kernel probes!

OK, so let's drop this, since it may easily cause a deadlock...


> I mean, think about it, a kretprobe can be installed in a lot of places, and 
> now 
> we want to call get_free_pages() from it?? This would add a massive amount of 
> fragility.

I thought it was safe because GFP_ATOMIC is safe in an interrupt handler.

> Instrumentation must be _simple_, every patch that adds more complexity to 
> the 
> most fundamental code path of it should raise a red flag ...
> 
> So let's make this more robust, ok?

Yeah, in that case, I think Alban's patch is enough at this point, since
it lets users tune their kretprobe events so that they are not missed.

Thank you,

> 
> Thanks,
> 
>   Ingo


-- 
Masami Hiramatsu 


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-30 Thread Ingo Molnar

* Masami Hiramatsu  wrote:

> > So this is something I missed while the original code was merged, but the 
> > concept 
> > looks a bit weird: why do we do any "allocation" while a handler is 
> > executing?
> > 
> > That's fundamentally fragile. What's the maximum number of parallel 
> > 'kretprobe_instance' required per kretprobe - one per CPU?
> 
> It depends on where we put the probe. If the probed function can block
> (yield to other tasks), then we need as many instances as there are
> threads on the system that can invoke the function. So, ultimately, it is
> the same as the function_graph tracer: we need one for each thread.

So then put it into task_struct (assuming there's no kretprobe-inside-kretprobe 
nesting allowed). There's just no way in hell we should be calling any complex 
kernel function from kernel probes!

I mean, think about it, a kretprobe can be installed in a lot of places, and 
now 
we want to call get_free_pages() from it?? This would add a massive amount of 
fragility.

Instrumentation must be _simple_, every patch that adds more complexity to the 
most fundamental code path of it should raise a red flag ...

So let's make this more robust, ok?

Thanks,

Ingo


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-29 Thread Masami Hiramatsu
On Wed, 29 Mar 2017 10:18:48 -0700
Josh Stone  wrote:

> On 03/29/2017 01:25 AM, Masami Hiramatsu wrote:
> > On Wed, 29 Mar 2017 08:30:05 +0200
> > Ingo Molnar  wrote:
> >>
> >> * Masami Hiramatsu  wrote:
> >>
> >>> @@ -1824,6 +1823,30 @@ void unregister_jprobes(struct jprobe **jps, int 
> >>> num)
> >>>  EXPORT_SYMBOL_GPL(unregister_jprobes);
> >>>  
> >>>  #ifdef CONFIG_KRETPROBES
> >>> +
> >>> +/* Try to use free instance first, if failed, try to allocate new 
> >>> instance */
> >>> +struct kretprobe_instance *kretprobe_alloc_instance(struct kretprobe *rp)
> >>> +{
> >>> + struct kretprobe_instance *ri = NULL;
> >>> + unsigned long flags = 0;
> >>> +
> >>> + raw_spin_lock_irqsave(&rp->lock, flags);
> >>> + if (!hlist_empty(&rp->free_instances)) {
> >>> + ri = hlist_entry(rp->free_instances.first,
> >>> + struct kretprobe_instance, hlist);
> >>> + hlist_del(&ri->hlist);
> >>> + }
> >>> + raw_spin_unlock_irqrestore(&rp->lock, flags);
> >>> +
> >>> + /* Populate max active instance if possible */
> >>> + if (!ri && rp->maxactive < KRETPROBE_MAXACTIVE_ALLOC) {
> >>> + ri = kmalloc(sizeof(*ri) + rp->data_size, GFP_ATOMIC);
> >>> + if (ri)
> >>> + rp->maxactive++;
> >>> + }
> >>> +
> >>> + return ri;
> >>> +}
> >>>  /*
> >>>   * This kprobe pre_handler is registered with every kretprobe. When probe
> >>>   * hits it will set up the return probe.
> >>> @@ -1846,14 +1869,8 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> >>> struct pt_regs *regs)
> >>>   }
> >>>  
> >>>   /* TODO: consider to only swap the RA after the last pre_handler fired 
> >>> */
> >>> - hash = hash_ptr(current, KPROBE_HASH_BITS);
> >>> - raw_spin_lock_irqsave(&rp->lock, flags);
> >>> - if (!hlist_empty(&rp->free_instances)) {
> >>> - ri = hlist_entry(rp->free_instances.first,
> >>> - struct kretprobe_instance, hlist);
> >>> - hlist_del(&ri->hlist);
> >>> - raw_spin_unlock_irqrestore(&rp->lock, flags);
> >>> -
> >>> + ri = kretprobe_alloc_instance(rp);
> >>> + if (ri) {
> >>>   ri->rp = rp;
> >>>   ri->task = current;
> >>>  
> >>> @@ -1868,13 +1885,13 @@ static int pre_handler_kretprobe(struct kprobe 
> >>> *p, struct pt_regs *regs)
> >>>  
> >>>   /* XXX(hch): why is there no hlist_move_head? */
> >>> INIT_HLIST_NODE(&ri->hlist);
> >>> + hash = hash_ptr(current, KPROBE_HASH_BITS);
> >>> kretprobe_table_lock(hash, &flags);
> >>> hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
> >>> kretprobe_table_unlock(hash, &flags);
> >>> - } else {
> >>> + } else
> >>> rp->nmissed++;
> >>> - raw_spin_unlock_irqrestore(&rp->lock, flags);
> >>> - }
> >>> +
> >>>   return 0;
> >>>  }
> >>>  NOKPROBE_SYMBOL(pre_handler_kretprobe);
> >>
> >> So this is something I missed while the original code was merged, but the 
> >> concept 
> >> looks a bit weird: why do we do any "allocation" while a handler is 
> >> executing?
> >>
> >> That's fundamentally fragile. What's the maximum number of parallel 
> >> 'kretprobe_instance' required per kretprobe - one per CPU?
> > 
> > It depends on where we put the probe. If the probed function can block
> > (yield to other tasks), then we need as many instances as there are
> > threads on the system that can invoke the function. So, ultimately, it is
> > the same as the function_graph tracer: we need one for each thread.
> 
> Isn't it also possible that the function may be reentrant?  Whether by
> plain recursion or an interrupt call, this leads to multiple live
> instances even for a given thread.

Yes, that's another possible case, but I don't think it is so serious in the
kernel, because we have a very limited kernel stack, which means the recursion
may not go very deep.

Thank you,

-- 
Masami Hiramatsu 


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-29 Thread Josh Stone
On 03/29/2017 01:25 AM, Masami Hiramatsu wrote:
> On Wed, 29 Mar 2017 08:30:05 +0200
> Ingo Molnar  wrote:
>>
>> * Masami Hiramatsu  wrote:
>>
>>> @@ -1824,6 +1823,30 @@ void unregister_jprobes(struct jprobe **jps, int num)
>>>  EXPORT_SYMBOL_GPL(unregister_jprobes);
>>>  
>>>  #ifdef CONFIG_KRETPROBES
>>> +
>>> +/* Try to use free instance first, if failed, try to allocate new instance 
>>> */
>>> +struct kretprobe_instance *kretprobe_alloc_instance(struct kretprobe *rp)
>>> +{
>>> +   struct kretprobe_instance *ri = NULL;
>>> +   unsigned long flags = 0;
>>> +
>>> +   raw_spin_lock_irqsave(&rp->lock, flags);
>>> +   if (!hlist_empty(&rp->free_instances)) {
>>> +   ri = hlist_entry(rp->free_instances.first,
>>> +   struct kretprobe_instance, hlist);
>>> +   hlist_del(&ri->hlist);
>>> +   }
>>> +   raw_spin_unlock_irqrestore(&rp->lock, flags);
>>> +
>>> +   /* Populate max active instance if possible */
>>> +   if (!ri && rp->maxactive < KRETPROBE_MAXACTIVE_ALLOC) {
>>> +   ri = kmalloc(sizeof(*ri) + rp->data_size, GFP_ATOMIC);
>>> +   if (ri)
>>> +   rp->maxactive++;
>>> +   }
>>> +
>>> +   return ri;
>>> +}
>>>  /*
>>>   * This kprobe pre_handler is registered with every kretprobe. When probe
>>>   * hits it will set up the return probe.
>>> @@ -1846,14 +1869,8 @@ static int pre_handler_kretprobe(struct kprobe *p, 
>>> struct pt_regs *regs)
>>> }
>>>  
>>> /* TODO: consider to only swap the RA after the last pre_handler fired 
>>> */
>>> -   hash = hash_ptr(current, KPROBE_HASH_BITS);
>>> -   raw_spin_lock_irqsave(&rp->lock, flags);
>>> -   if (!hlist_empty(&rp->free_instances)) {
>>> -   ri = hlist_entry(rp->free_instances.first,
>>> -   struct kretprobe_instance, hlist);
>>> -   hlist_del(&ri->hlist);
>>> -   raw_spin_unlock_irqrestore(&rp->lock, flags);
>>> -
>>> +   ri = kretprobe_alloc_instance(rp);
>>> +   if (ri) {
>>> ri->rp = rp;
>>> ri->task = current;
>>>  
>>> @@ -1868,13 +1885,13 @@ static int pre_handler_kretprobe(struct kprobe *p, 
>>> struct pt_regs *regs)
>>>  
>>> /* XXX(hch): why is there no hlist_move_head? */
>>> INIT_HLIST_NODE(&ri->hlist);
>>> +   hash = hash_ptr(current, KPROBE_HASH_BITS);
>>> kretprobe_table_lock(hash, &flags);
>>> hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
>>> kretprobe_table_unlock(hash, &flags);
>>> -   } else {
>>> +   } else
>>> rp->nmissed++;
>>> -   raw_spin_unlock_irqrestore(&rp->lock, flags);
>>> -   }
>>> +
>>> return 0;
>>>  }
>>>  NOKPROBE_SYMBOL(pre_handler_kretprobe);
>>
>> So this is something I missed while the original code was merged, but the 
>> concept 
>> looks a bit weird: why do we do any "allocation" while a handler is 
>> executing?
>>
>> That's fundamentally fragile. What's the maximum number of parallel 
>> 'kretprobe_instance' required per kretprobe - one per CPU?
> 
> It depends on where we put the probe. If the probed function can block
> (yield to other tasks), then we need as many instances as there are
> threads on the system that can invoke the function. So, ultimately, it is
> the same as the function_graph tracer: we need one for each thread.

Isn't it also possible that the function may be reentrant?  Whether by
plain recursion or an interrupt call, this leads to multiple live
instances even for a given thread.



Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-29 Thread Masami Hiramatsu
On Wed, 29 Mar 2017 08:30:05 +0200
Ingo Molnar  wrote:
> 
> * Masami Hiramatsu  wrote:
> 
> > @@ -1824,6 +1823,30 @@ void unregister_jprobes(struct jprobe **jps, int num)
> >  EXPORT_SYMBOL_GPL(unregister_jprobes);
> >  
> >  #ifdef CONFIG_KRETPROBES
> > +
> > +/* Try to use free instance first, if failed, try to allocate new instance 
> > */
> > +struct kretprobe_instance *kretprobe_alloc_instance(struct kretprobe *rp)
> > +{
> > +   struct kretprobe_instance *ri = NULL;
> > +   unsigned long flags = 0;
> > +
> > +   raw_spin_lock_irqsave(&rp->lock, flags);
> > +   if (!hlist_empty(&rp->free_instances)) {
> > +   ri = hlist_entry(rp->free_instances.first,
> > +   struct kretprobe_instance, hlist);
> > +   hlist_del(&ri->hlist);
> > +   }
> > +   raw_spin_unlock_irqrestore(&rp->lock, flags);
> > +
> > +   /* Populate max active instance if possible */
> > +   if (!ri && rp->maxactive < KRETPROBE_MAXACTIVE_ALLOC) {
> > +   ri = kmalloc(sizeof(*ri) + rp->data_size, GFP_ATOMIC);
> > +   if (ri)
> > +   rp->maxactive++;
> > +   }
> > +
> > +   return ri;
> > +}
> >  /*
> >   * This kprobe pre_handler is registered with every kretprobe. When probe
> >   * hits it will set up the return probe.
> > @@ -1846,14 +1869,8 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> > struct pt_regs *regs)
> > }
> >  
> > /* TODO: consider to only swap the RA after the last pre_handler fired 
> > */
> > -   hash = hash_ptr(current, KPROBE_HASH_BITS);
> > -   raw_spin_lock_irqsave(&rp->lock, flags);
> > -   if (!hlist_empty(&rp->free_instances)) {
> > -   ri = hlist_entry(rp->free_instances.first,
> > -   struct kretprobe_instance, hlist);
> > -   hlist_del(&ri->hlist);
> > -   raw_spin_unlock_irqrestore(&rp->lock, flags);
> > -
> > +   ri = kretprobe_alloc_instance(rp);
> > +   if (ri) {
> > ri->rp = rp;
> > ri->task = current;
> >  
> > @@ -1868,13 +1885,13 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> > struct pt_regs *regs)
> >  
> > /* XXX(hch): why is there no hlist_move_head? */
> > INIT_HLIST_NODE(&ri->hlist);
> > +   hash = hash_ptr(current, KPROBE_HASH_BITS);
> > kretprobe_table_lock(hash, &flags);
> > hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
> > kretprobe_table_unlock(hash, &flags);
> > -   } else {
> > +   } else
> > rp->nmissed++;
> > -   raw_spin_unlock_irqrestore(&rp->lock, flags);
> > -   }
> > +
> > return 0;
> >  }
> >  NOKPROBE_SYMBOL(pre_handler_kretprobe);
> 
> So this is something I missed while the original code was merged, but the 
> concept 
> looks a bit weird: why do we do any "allocation" while a handler is executing?
> 
> That's fundamentally fragile. What's the maximum number of parallel 
> 'kretprobe_instance' required per kretprobe - one per CPU?

It depends on where we put the probe. If the probed function can block
(yield to other tasks), then we need as many instances as there are
threads on the system that can invoke the function. So, ultimately, it is
the same as the function_graph tracer: we need one for each thread.

> 
> If so then we should preallocate all of them when they are installed and not 
> do 
> any alloc/free dance when executing them.
> 
> This will also speed them up, and increase robustness all around.

I see; kretprobe already does that, and I have kept it in the code.

By default, kretprobe allocates NR_CPUS kretprobe_instance objects for each
kretprobe. For the usual use case (functions deeper inside the kernel) that is
OK. However, as Lukasz reported, a function near the syscall entry may require
more instances. In that case, the kretprobe user needs to increase maxactive
before registering the kretprobe, which Alban's patch makes possible.

However, the next question is how many instances are actually needed.
Users may have to go through a trial & error loop to find out. Professional
users will do that, but light users may not want to.

I'm also considering providing a "knob" in debugfs for disabling this dynamic
allocation feature, which will help users who would like to avoid memory
allocation in kretprobes.

Thank you,

-- 
Masami Hiramatsu 


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-29 Thread Ingo Molnar

* Masami Hiramatsu  wrote:

> @@ -1824,6 +1823,30 @@ void unregister_jprobes(struct jprobe **jps, int num)
>  EXPORT_SYMBOL_GPL(unregister_jprobes);
>  
>  #ifdef CONFIG_KRETPROBES
> +
> +/* Try to use free instance first, if failed, try to allocate new instance */
> +struct kretprobe_instance *kretprobe_alloc_instance(struct kretprobe *rp)
> +{
> + struct kretprobe_instance *ri = NULL;
> + unsigned long flags = 0;
> +
> + raw_spin_lock_irqsave(&rp->lock, flags);
> + if (!hlist_empty(&rp->free_instances)) {
> + ri = hlist_entry(rp->free_instances.first,
> + struct kretprobe_instance, hlist);
> + hlist_del(&ri->hlist);
> + }
> + raw_spin_unlock_irqrestore(&rp->lock, flags);
> +
> + /* Populate max active instance if possible */
> + if (!ri && rp->maxactive < KRETPROBE_MAXACTIVE_ALLOC) {
> + ri = kmalloc(sizeof(*ri) + rp->data_size, GFP_ATOMIC);
> + if (ri)
> + rp->maxactive++;
> + }
> +
> + return ri;
> +}
>  /*
>   * This kprobe pre_handler is registered with every kretprobe. When probe
>   * hits it will set up the return probe.
> @@ -1846,14 +1869,8 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> struct pt_regs *regs)
>   }
>  
>   /* TODO: consider to only swap the RA after the last pre_handler fired 
> */
> - hash = hash_ptr(current, KPROBE_HASH_BITS);
> - raw_spin_lock_irqsave(&rp->lock, flags);
> - if (!hlist_empty(&rp->free_instances)) {
> - ri = hlist_entry(rp->free_instances.first,
> - struct kretprobe_instance, hlist);
> - hlist_del(&ri->hlist);
> - raw_spin_unlock_irqrestore(&rp->lock, flags);
> -
> + ri = kretprobe_alloc_instance(rp);
> + if (ri) {
>   ri->rp = rp;
>   ri->task = current;
>  
> @@ -1868,13 +1885,13 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> struct pt_regs *regs)
>  
>   /* XXX(hch): why is there no hlist_move_head? */
>   INIT_HLIST_NODE(&ri->hlist);
> + hash = hash_ptr(current, KPROBE_HASH_BITS);
>   kretprobe_table_lock(hash, &flags);
>   hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
>   kretprobe_table_unlock(hash, &flags);
> - } else {
> + } else
>   rp->nmissed++;
> - raw_spin_unlock_irqrestore(&rp->lock, flags);
> - }
> +
>   return 0;
>  }
>  NOKPROBE_SYMBOL(pre_handler_kretprobe);

So this is something I missed while the original code was merged, but the 
concept 
looks a bit weird: why do we do any "allocation" while a handler is executing?

That's fundamentally fragile. What's the maximum number of parallel 
'kretprobe_instance' required per kretprobe - one per CPU?

If so then we should preallocate all of them when they are installed and not do 
any alloc/free dance when executing them.

This will also speed them up, and increase robustness all around.

Thanks,

Ingo