Re: [RFC v2] epoll: avoid spinlock contention with wfcqueue

2013-03-18 Thread Mathieu Desnoyers
* This could be rearranged to delay the deactivation of epi->ws
>* instead, but then epi->ws would temporarily be out of sync
> -  * with ep_is_linked().
> +  * with epi->state.
>*/
>   ws = ep_wakeup_source(epi);
>   if (ws) {
> @@ -1468,8 +1457,6 @@ static int ep_send_events_proc(struct eventpoll *ep, 
> struct list_head *head,
>   __pm_relax(ws);
>   }
>  
> - list_del_init(&epi->rdllink);
> -
>   revents = ep_item_poll(epi, &pt);
>  
>   /*
> @@ -1481,46 +1468,37 @@ static int ep_send_events_proc(struct eventpoll *ep, 
> struct list_head *head,
>   if (revents) {
>   if (__put_user(revents, &uevent->events) ||
>   __put_user(epi->event.data, &uevent->data)) {
> - list_add(&epi->rdllink, head);
> - ep_pm_stay_awake(epi);
> - return eventcnt ? eventcnt : -EFAULT;
> + if (!eventcnt)
> + eventcnt = -EFAULT;
> + break;
>   }
> - eventcnt++;
> +
>   uevent++;
> - if (epi->event.events & EPOLLONESHOT)
> + if (++eventcnt == maxevents)
> + n = NULL; /* stop iteration */
> +
> + if (epi->event.events & EPOLLONESHOT) {
>   epi->event.events &= EP_PRIVATE_BITS;
> - else if (!(epi->event.events & EPOLLET)) {
> - /*
> -  * If this file has been added with Level
> -  * Trigger mode, we need to insert back inside
> -  * the ready list, so that the next call to
> -  * epoll_wait() will check again the events
> -  * availability. At this point, no one can 
> insert
> -  * into ep->rdllist besides us. The epoll_ctl()
> -  * callers are locked out by
> -  * ep_scan_ready_list() holding "mtx" and the
> -  * poll callback will queue them in ep->ovflist.
> -  */
> - list_add_tail(&epi->rdllink, &ep->rdllist);
> - ep_pm_stay_awake(epi);
> + } else if (!(epi->event.events & EPOLLET)) {
> +         ep_level_trigger(ep, epi);
> + continue;
>   }
>   }
> +
> +	/*
> +	 * We set EP_STATE_DEQUEUE before dequeueing to avoid losing
> +	 * events from ep_poll_callback invocations that fire before we
> +	 * can set EP_STATE_IDLE; ep_poll_callback must spin while we
> +	 * are in EP_STATE_DEQUEUE (for the duration of the dequeue).
> +	 */
> + atomic_xchg(&epi->state, EP_STATE_DEQUEUE);
> + __wfcq_dequeue(&ep->txlhead, &ep->txltail);
> + atomic_xchg(&epi->state, EP_STATE_IDLE);
>   }
>  
>   return eventcnt;
>  }
[...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC v2] epoll: avoid spinlock contention with wfcqueue

2013-03-18 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Eric Wong  wrote:
> > > > I'm posting this lightly tested version since I may not be able to do
> > > > more testing/benchmarking until the weekend.
> > > 
> > > Still lightly tested (on an initramfs KVM, no real applications, yet).
> > > 
> > > > Davide's totalmess is still running, so that's probably a good sign :)
> > > > http://www.xmailserver.org/totalmess.c
> > > 
> > > Ditto :)  Also testing with eponeshotmt, which is close to my target
> > > use case: http://yhbt.net/eponeshotmt.c
> > > 
> > > > I will look for more ways to break this (and benchmark when I stop
> > > > finding ways to break it).  No real applications tested, yet, and
> > > > I think I can improve upon this, too.
> > > 
> > > No real apps, yet, and I need to make sure this doesn't cause
> > > regressions for the traditional single-threaded event loop case.
> > > 
> > > This is the use case I mainly care about (multiple tasks calling
> > > epoll_wait(maxevents=1) to divide work).
> > > 
> > > Time to wait on 4 million events (4 threads generating events,
> > > 4 threads calling epoll_wait(maxevents=1) 1 million times each,
> > > 10 eventfd file descriptors (fewer descriptors means higher
> > > chance of contention for epi->state inside ep_poll_callback).
> > > 
> > > Before:
> > > $ eponeshotmt -t 4 -w 4 -f 10 -c 100
> > > real    0m 9.58s
> > > user    0m 1.22s
> > > sys     0m 37.08s
> > > 
> > > After:
> > > $ eponeshotmt -t 4 -w 4 -f 10 -c 100
> > > real    0m 6.49s
> > > user    0m 1.28s
> > > sys     0m 24.66s
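
The maxevents=1 work-dividing pattern being measured above is, roughly,
the following (a hypothetical sketch, not eponeshotmt itself; error
handling and EPOLLONESHOT re-arming are omitted, and it assumes the
eventfd descriptors were registered with ev.data.fd):

/* Each worker thread pulls one ready descriptor at a time from a
 * shared epoll instance and consumes a single eventfd event. */
#include <sys/epoll.h>
#include <stdint.h>
#include <unistd.h>

static void *worker(void *arg)
{
	int epfd = *(int *)arg;
	struct epoll_event ev;
	uint64_t val;

	for (;;) {
		int n = epoll_wait(epfd, &ev, 1, -1);	/* maxevents=1 */

		if (n == 1)
			read(ev.data.fd, &val, sizeof(val)); /* consume one event */
	}
	return NULL;
}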
> > 
> > Nice! This looks like a 31% speedup with 4 cores. It would be nice to
> > see how this evolves when the number of cores and threads increase. I
> > also notice that you turned the spinlock_irqsave into a mutex. Maybe
> > a comparison with a simple spinlock (rather than the mutex) will lead to
> > interesting findings. (note that this spinlock will likely not need to
> > have IRQ off, as enqueue won't need to hold the spinlock).
> 
> Unfortunately, 4 cores is all I have right now.  I'm hoping others can
> help test with more cores.
> 
> I added the mutex lock to ep_poll since it's now required for
> ep_events_available.  Another upside is that ep_events_available is
> always coherent with the ep_send_events loop, so there's no chance of a
> task entering ep_send_events on an empty ready list.
> 
> I was planning on making the mutex cover a wider scope for ep_poll
> before I discovered wfcqueue.  I noticed ep->lock was very contended
> (and always dominating lock_stat).
> 
> Previously with ep_poll + ep_scan_ready_list + ep_send_events_proc,
> it was something like this where a spin lock was taken 3 times in
> quick succession:
> 
>   ep_poll:
>   spin_lock;
>   check ep_events_available;
>   spin_unlock;
> 
>   ep_send_events:
>   ep_scan_ready_list:
>   mutex_lock
>   spin_lock
>   ...
>   spin_unlock
> 
>   ep_send_events_proc
> 
>   spin_lock
>   ...
>   spin_unlock
>   mutex_unlock
> 
> ep->lock was getting bounced all over since ep_poll_callback (which also
> takes ep->lock) was constantly firing.  This is made worse when several
> threads are calling ep_poll.  The exclusive waitqueue-ing of ep_poll
> doesn't help much because of the sheer number of ep_poll_callback wakeups.

OK, yep, that's where the wait-free queue with head and tail on separate
cache lines really helps.
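
To make the cache-line point concrete, a sketch (the field names here
are illustrative, not from the patch; ____cacheline_aligned_in_smp is
the usual kernel annotation for this):

/* Keep the enqueue side (tail, hit by ep_poll_callback) and the
 * dequeue side (head, hit by the epoll_wait consumers) of the ready
 * queue on different cache lines to avoid false sharing. */
struct eventpoll_sketch {
	struct wfcq_head rdlhead ____cacheline_aligned_in_smp;
	struct wfcq_tail rdltail ____cacheline_aligned_in_smp;
	/* ... remaining eventpoll fields ... */
};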

> 
> > Some comments below,
> > 
> > [...]
> > > +static int ep_send_events(struct eventpoll *ep,
> > > +   struct epoll_event __user *uevent, int maxevents)
> > > +{
> > > + int eventcnt = 0;
> > >   unsigned int revents;
> > >   struct epitem *epi;
> > > - struct epoll_event __user *uevent;
> > >   struct wakeup_source *ws;
> > > + struct wfcq_node *node, *n;
> > > + enum epoll_item_state state;
> > >   poll_table pt;
> > >  
> > >   ini

Re: [PATCH] tracepoints: prevents null probe from being added

2013-03-20 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Wed, 2013-03-20 at 12:18 +0900, kpark3...@gmail.com wrote:
> > From: Sahara 
> > 
> > Somehow tracepoint_entry_add/remove_probe functions allow a null probe
> > function.
> 
> You actually hit this in practice, or is this just something that you
> observe from code review?
> 
> >  Especially on getting a null probe in remove function, it seems
> > to be used to remove all probe functions in the entry.
> 
> Hmm, that actually sounds like a feature.

Yep. It's been a long time since I wrote this code, but the removal code
seems to use NULL probe pointer to remove all probes for a given
tracepoint.

I'd be tempted to just validate non-NULL probe within
tracepoint_entry_add_probe() and let other sites as is, just in case
anyone would be using this feature.

I cannot say that I have personally used this "remove all" feature much
though.

Thanks,

Mathieu

> 
> > But the code does not handle this as expected, since tracepoint_entry
> > keeps the funcs array's last func as NULL in order to mark the end
> > of the array, and a NULL func is also used in the for-loop to detect
> > the end of the loop. So if there is a NULL func in the entry's funcs,
> > the for-loop ends abruptly in the middle of the operation.
> > Also, checking whether probe is null inside the for-loop is not efficient.
> > 
> > Signed-off-by: Sahara 
> > ---
> >  kernel/tracepoint.c |   18 --
> >  1 files changed, 12 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> > index 0c05a45..30f427e 100644
> > --- a/kernel/tracepoint.c
> > +++ b/kernel/tracepoint.c
> > @@ -112,7 +112,10 @@ tracepoint_entry_add_probe(struct tracepoint_entry 
> > *entry,
> > int nr_probes = 0;
> > struct tracepoint_func *old, *new;
> >  
> > -   WARN_ON(!probe);
> > +   if (unlikely(!probe)) {
> > +   WARN_ON(!probe);
> > +   return ERR_PTR(-EINVAL);
> > +   }
> 
> Um, you want:
> 
>   if (WARN_ON(!probe))
>   return ERR_PTR(-EINVAL);
> 
> >  
> > debug_print_probes(entry);
> > old = entry->funcs;
> > @@ -147,15 +150,19 @@ tracepoint_entry_remove_probe(struct tracepoint_entry 
> > *entry,
> >  
> > old = entry->funcs;
> >  
> > +   if (unlikely(!probe)) {
> > +   WARN_ON(!probe);
> > +   return ERR_PTR(-EINVAL);
> > +   }
> 
> Here too if it wasn't intended to allow removal of all probes from a
> tracepoint.
> 
> > +
> > if (!old)
> > return ERR_PTR(-ENOENT);
> >  
> > debug_print_probes(entry);
> > /* (N -> M), (N > 1, M >= 0) probes */
> > for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> > -   if (!probe ||
> > -   (old[nr_probes].func == probe &&
> > -old[nr_probes].data == data))
> > +   if (old[nr_probes].func == probe &&
> > +old[nr_probes].data == data)
> > nr_del++;
> > }
> >  
> > @@ -173,8 +180,7 @@ tracepoint_entry_remove_probe(struct tracepoint_entry 
> > *entry,
> > if (new == NULL)
> > return ERR_PTR(-ENOMEM);
> > for (i = 0; old[i].func; i++)
> > -   if (probe &&
> > -   (old[i].func != probe || old[i].data != data))
> > +   if (old[i].func != probe || old[i].data != data)
> 
> This makes it look like the null probe was intentional.
> 
> -- Steve
> 
> > new[j++] = old[i];
> > new[nr_probes - nr_del].func = NULL;
> > entry->refcount = nr_probes - nr_del;
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoints: prevents null probe from being added

2013-03-20 Thread Mathieu Desnoyers
* Keun-O Park (kpark3...@gmail.com) wrote:
> On Thu, Mar 21, 2013 at 8:01 AM, Steven Rostedt  wrote:
> > On Wed, 2013-03-20 at 14:01 -0400, Mathieu Desnoyers wrote:
> >> * Steven Rostedt (rost...@goodmis.org) wrote:
> >> > On Wed, 2013-03-20 at 12:18 +0900, kpark3...@gmail.com wrote:
> >> > > From: Sahara 
> >> > >
> >> > > Somehow tracepoint_entry_add/remove_probe functions allow a null probe
> >> > > function.
> >> >
> >> > You actually hit this in practice, or is this just something that you
> >> > observe from code review?
> >> >
> >> > >  Especially on getting a null probe in remove function, it seems
> >> > > to be used to remove all probe functions in the entry.
> >> >
> >> > Hmm, that actually sounds like a feature.
> >>
> >> Yep. It's been a long time since I wrote this code, but the removal code
> >> seems to use NULL probe pointer to remove all probes for a given
> >> tracepoint.
> >>
> >> I'd be tempted to just validate non-NULL probe within
> >> tracepoint_entry_add_probe() and let other sites as is, just in case
> >> anyone would be using this feature.
> >>
> >> I cannot say that I have personally used this "remove all" feature much
> >> though.
> >>
> >
> > I agree. I don't see anything wrong in leaving the null probe feature in
> > the removal code. But updating the add code looks like a proper change.
> >
> > -- Steve
> >
> >
> 
> Hello Steve & Mathieu,
> If we want to leave the null probe feature enabled, I think it would
> be better modifying the code like the following for code efficiency.
> 
> @@ -112,7 +112,8 @@ tracepoint_entry_add_probe(struct tracepoint_entry *entry,
> int nr_probes = 0;
> struct tracepoint_func *old, *new;
> 
> -   WARN_ON(!probe);
> +   if (WARN_ON(!probe))
> +   return ERR_PTR(-EINVAL);
> 
> debug_print_probes(entry);
> old = entry->funcs;
> @@ -152,14 +153,15 @@ tracepoint_entry_remove_probe(struct tracepoint_entry 
> *ent
> 
> debug_print_probes(entry);
> /* (N -> M), (N > 1, M >= 0) probes */
> -   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> -   if (!probe ||
> -   (old[nr_probes].func == probe &&
> -old[nr_probes].data == data))
> -   nr_del++;
> +   if (probe) {
> +   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> +   if (old[nr_probes].func == probe &&
> +old[nr_probes].data == data)
> +   nr_del++;
> +   }
> }
> 
> -   if (nr_probes - nr_del == 0) {
> +   if (!probe || nr_probes - nr_del == 0) {

We might want to do:

if (probe) {
  ...
} else {
  nr_del = nr_probes;
}

if (nr_probes - nr_del == 0) {
   ...
}

rather than:

if (probe) {
  ...
}

if (!probe || nr_probes - nr_del == 0) {
   ...
}

Using nr_del makes the code easier to follow IMHO.

Thanks,

Mathieu

> /* N -> 0, (N > 1) */
> entry->funcs = NULL;
> entry->refcount = 0;
> 
> Because we know a null probe can no longer be handed to
> tracepoint_entry_add_probe, we don't have to check whether the probe is
> null within the for-loop. If the probe is null, it's enough to add
> !probe to 'if (nr_probes - nr_del == 0)'. And with an additional
> if-clause wrapping the for-loop, we avoid falling through the for-loop
> when probe is null.
> 
> @@ -173,8 +172,7 @@ tracepoint_entry_remove_probe(struct tracepoint_entry 
> *entry
> if (new == NULL)
> return ERR_PTR(-ENOMEM);
> for (i = 0; old[i].func; i++)
> -   if (probe &&
> -   (old[i].func != probe || old[i].data != data))
> +   if (old[i].func != probe || old[i].data != data)
> new[j++] = old[i];
> new[nr_probes - nr_del].func = NULL;
> entry->refcount = nr_probes - nr_del;
> 
> We don't have to check the probe here either; we know probe is always non-null here.
> Thanks.
> 
> -- Kpark

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] tracepoints: prevents null probe from being added

2013-03-20 Thread Mathieu Desnoyers
* Keun-O Park (kpark3...@gmail.com) wrote:
> On Thu, Mar 21, 2013 at 11:45 AM, Mathieu Desnoyers
>  wrote:
> > * Keun-O Park (kpark3...@gmail.com) wrote:
> >> On Thu, Mar 21, 2013 at 8:01 AM, Steven Rostedt  
> >> wrote:
> >> > On Wed, 2013-03-20 at 14:01 -0400, Mathieu Desnoyers wrote:
> >> >> * Steven Rostedt (rost...@goodmis.org) wrote:
> >> >> > On Wed, 2013-03-20 at 12:18 +0900, kpark3...@gmail.com wrote:
> >> >> > > From: Sahara 
> >> >> > >
> >> >> > > Somehow tracepoint_entry_add/remove_probe functions allow a null 
> >> >> > > probe
> >> >> > > function.
> >> >> >
> >> >> > You actually hit this in practice, or is this just something that you
> >> >> > observe from code review?
> >> >> >
> >> >> > >  Especially on getting a null probe in remove function, it seems
> >> >> > > to be used to remove all probe functions in the entry.
> >> >> >
> >> >> > Hmm, that actually sounds like a feature.
> >> >>
> >> >> Yep. It's been a long time since I wrote this code, but the removal code
> >> >> seems to use NULL probe pointer to remove all probes for a given
> >> >> tracepoint.
> >> >>
> >> >> I'd be tempted to just validate non-NULL probe within
> >> >> tracepoint_entry_add_probe() and let other sites as is, just in case
> >> >> anyone would be using this feature.
> >> >>
> >> >> I cannot say that I have personally used this "remove all" feature much
> >> >> though.
> >> >>
> >> >
> >> > I agree. I don't see anything wrong in leaving the null probe feature in
> >> > the removal code. But updating the add code looks like a proper change.
> >> >
> >> > -- Steve
> >> >
> >> >
> >>
> >> Hello Steve & Mathieu,
> >> If we want to leave the null probe feature enabled, I think it would
> >> be better modifying the code like the following for code efficiency.
> >>
> >> @@ -112,7 +112,8 @@ tracepoint_entry_add_probe(struct tracepoint_entry 
> >> *entry,
> >> int nr_probes = 0;
> >> struct tracepoint_func *old, *new;
> >>
> >> -   WARN_ON(!probe);
> >> +   if (WARN_ON(!probe))
> >> +   return ERR_PTR(-EINVAL);
> >>
> >> debug_print_probes(entry);
> >> old = entry->funcs;
> >> @@ -152,14 +153,15 @@ tracepoint_entry_remove_probe(struct 
> >> tracepoint_entry *ent
> >>
> >> debug_print_probes(entry);
> >> /* (N -> M), (N > 1, M >= 0) probes */
> >> -   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> >> -   if (!probe ||
> >> -   (old[nr_probes].func == probe &&
> >> -old[nr_probes].data == data))
> >> -   nr_del++;
> >> +   if (probe) {
> >> +   for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> >> +   if (old[nr_probes].func == probe &&
> >> +old[nr_probes].data == data)
> >> +   nr_del++;
> >> +   }
> >> }
> >>
> >> -   if (nr_probes - nr_del == 0) {
> >> +   if (!probe || nr_probes - nr_del == 0) {
> >
> > We might want to do:
> >
> > if (probe) {
> >   ...
> > } else {
> >   nr_del = nr_probes;
> > }
> >
> > if (nr_probes - nr_del == 0) {
> >...
> > }
> 
> This code has a problem.
> nr_probes is initialized as zero.

yes,

> And, in order to get a correct count of probes,
> we need to go through the for-loop even though probe is null.
> So with the above code, nr_del will be zero. Anyhow, the code will fall
> through the if-clause (nr_probes - nr_del == 0).
> It looks odd to me.

Ah, I see what you mean: the nr_del = nr_probes assignment is useless,
because both nr_probes and nr_del are equal to 0. So we could go for:

if (probe) {
for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
if (old[nr_probes].func == probe &&
 old[nr_probes].data == data)
nr_del++;
}
}

if (nr_probes - nr_del == 0) {
...
} else {
...
}

Does it look better ?
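
For completeness, a consolidated sketch of how the removal path could
look with this structure (modelled on the kernel/tracepoint.c code of
that era; allocate_probes() and debug_print_probes() are assumed to be
the existing helpers there, and the return type follows the existing
function):

static void *
tracepoint_entry_remove_probe(struct tracepoint_entry *entry,
			      void *probe, void *data)
{
	int nr_probes = 0, nr_del = 0, i;
	struct tracepoint_func *old, *new;

	old = entry->funcs;
	if (!old)
		return ERR_PTR(-ENOENT);

	debug_print_probes(entry);
	/* (N -> M), (N > 1, M >= 0) probes; NULL probe means "remove all" */
	if (probe) {
		for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
			if (old[nr_probes].func == probe &&
			    old[nr_probes].data == data)
				nr_del++;
		}
	}

	if (nr_probes - nr_del == 0) {
		/* N -> 0, (N > 1) */
		entry->funcs = NULL;
		entry->refcount = 0;
	} else {
		int j = 0;
		/* N -> M, (N > 1, M > 0), + 1 for the NULL terminator */
		new = allocate_probes(nr_probes - nr_del + 1);
		if (new == NULL)
			return ERR_PTR(-ENOMEM);
		for (i = 0; old[i].func; i++)
			if (old[i].func != probe || old[i].data != data)
				new[j++] = old[i];
		new[nr_probes - nr_del].func = NULL;
		entry->refcount = nr_probes - nr_del;
		entry->funcs = new;
	}
	debug_print_probes(entry);
	return old;
}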

Thanks,

Mathieu

> 
> -- Kpark
> 
> >
> > rather than:
> >
> > if (probe) {
> >   ...
> > }
> >
> > if (!probe || nr_probes - nr_del == 0) {
> >...
> > }
> >
> > Using nr_del makes the code easier to follow IMHO.
> >
> > Thanks,
> >
> > Mathieu
> >

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] wfcqueue: add function for unsynchronized prepend

2013-04-02 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> In some situations, it is necessary to prepend a node to a queue.
> For epoll, this is necessary for two rare conditions:
> 
> * when the user triggers -EFAULT
> * when reinjecting elements from the ovflist (implemented as a stack)

This approach makes sense.

In terms of API naming, I wonder if "prepend" is the right counterpart
for "enqueue". Maybe "enqueue_first" or "enqueue_head" would be better
suited?

Currently, we have an "append" function used internally, but it's not
exposed by the API.

Just for fun, I tried making your "prepend" wait-free (thus not
requiring a mutex), but it's really not obvious, because of its impact
on the splice operation and the dequeue-last-node operation.
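
For illustration, a minimal usage sketch (the function and lock names
are hypothetical; the caller holds whatever mutex already serializes
its dequeue side, since __wfcq_prepend() issues no barriers):

/* Re-inject a node at the head of the queue, e.g. after a failed
 * copy to user space, under the caller's existing lock. */
static void requeue_front(struct wfcq_head *head, struct wfcq_tail *tail,
			  struct wfcq_node *node, struct mutex *lock)
{
	mutex_lock(lock);
	__wfcq_prepend(head, tail, node);
	mutex_unlock(lock);
}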

Thoughts ?

Thanks,

Mathieu

> 
> Signed-off-by: Eric Wong 
> Cc: Mathieu Desnoyers 
> ---
>  This is on top of my other patch to implement __wfcq_enqueue
> 
>  include/linux/wfcqueue.h | 35 +++
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/wfcqueue.h b/include/linux/wfcqueue.h
> index a452ab9..4cb8f22 100644
> --- a/include/linux/wfcqueue.h
> +++ b/include/linux/wfcqueue.h
> @@ -56,15 +56,17 @@
>   * [5] __wfcq_first
>   * [6] __wfcq_next
>   * [7] __wfcq_enqueue
> + * [8] __wfcq_prepend
>   *
> - * [1] [2] [3] [4] [5] [6] [7]
> - * [1]  -   -   -   -   -   -   X
> - * [2]  -   -   -   -   -   -   X
> - * [3]  -   -   X   X   X   X   X
> - * [4]  -   -   X   -   X   X   X
> - * [5]  -   -   X   X   -   -   X
> - * [6]  -   -   X   X   -   -   X
> - * [7]  X   X   X   X   X   X   X
> + * [1] [2] [3] [4] [5] [6] [7] [8]
> + * [1]  -   -   -   -   -   -   X   X
> + * [2]  -   -   -   -   -   -   X   X
> + * [3]  -   -   X   X   X   X   X   X
> + * [4]  -   -   X   -   X   X   X   X
> + * [5]  -   -   X   X   -   -   X   X
> + * [6]  -   -   X   X   -   -   X   X
> + * [7]  X   X   X   X   X   X   X   X
> + * [8]  X   X   X   X   X   X   X   X
>   *
>   * Besides locking, mutual exclusion of dequeue, splice and iteration
>   * can be ensured by performing all of those operations from a single
> @@ -441,6 +443,23 @@ static inline enum wfcq_ret __wfcq_splice(
>  }
>  
>  /*
> + * __wfcq_prepend: prepend a node into a queue, requiring mutual exclusion.
> + *
> + * No memory barriers are issued.  Mutual exclusion is the responsibility
> + * of the caller.
> + */
> +static inline void __wfcq_prepend(struct wfcq_head *head,
> + struct wfcq_tail *tail, struct wfcq_node *node)
> +{
> + node->next = head->node.next;
> + head->node.next = node;
> +
> + /* if the queue was empty before, it is no longer empty now */
> + if (tail->p == &head->node)
> + tail->p = node;
> +}
> +
> +/*
>   * __wfcq_for_each: Iterate over all nodes in a queue,
>   * without dequeuing them.
>   * @head: head of the queue (struct wfcq_head pointer).
> -- 
> Eric Wong
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [tracepoint] cargo-culting considered harmful...

2013-01-25 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Wed, Jan 23, 2013 at 10:55:24PM +, Al Viro wrote:
> > In samples/tracepoints/tracepoint-probe-sample.c:
> > /*
> >  * Here the caller only guarantees locking for struct file and struct inode.
> >  * Locking must therefore be done in the probe to use the dentry.
> >  */
> > static void probe_subsys_event(void *ignore,   
> >struct inode *inode, struct file *file)
> > {
> > path_get(&file->f_path);
> > dget(file->f_path.dentry);
> > printk(KERN_INFO "Event is encountered with filename %s\n",
> > file->f_path.dentry->d_name.name);
> > dput(file->f_path.dentry);
> > path_put(&file->f_path);
> > }
> > 
> > note that
> > * file->f_path is already pinned down by open(), path_get() does not
> > provide anything extra.
> > * file->f_path.dentry is already pinned by open() *and* path_get()
> > just above that dget().
> > * ->d_name.name *IS* *NOT* *PROTECTED* by pinning dentry down,
> > whether it's done once or thrice.
> > 
> > I do realize that it's just an example, but perhaps we should rename that
> > file to match the contents?  The only question is whether it should be
> > git mv samples/tracepoints/{tracepoint-probe-sample,cargo-cult}.c
> > or git mv samples cargo-cult...
> 
> I wonder if we should just remove the samples/tracepoints/ all together.
> The tracepoint code is now only used internally by the trace_event code,
> and there should not be any users of tracepoints directly.

Yep, I'd be OK with removing this example, since now all users are
expected to user TRACE_EVENT(), which is built on top of tracepoints.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [tracepoint] cargo-culting considered harmful...

2013-01-25 Thread Mathieu Desnoyers
* Al Viro (v...@zeniv.linux.org.uk) wrote:
> On Wed, Jan 23, 2013 at 03:51:47PM -0800, Andrew Morton wrote:
> 
> > > note that
> > >   * file->f_path is already pinned down by open(), path_get() does not
> > > provide anything extra.
> > >   * file->f_path.dentry is already pinned by open() *and* path_get()
> > > just above that dget().
> > >   * ->d_name.name *IS* *NOT* *PROTECTED* by pinning dentry down,
> > > whether it's done once or thrice.
> > 
> > I guess the first two are obvious (or at least, expected).  But the
> > third isn't.

Hi Al,

I agree that the tracepoint example should be removed. There is one
extra piece of module code that I think would require fixing (see below).

> 
> ->d_name.name is changed by rename() (as one could expect).  Grabbing
> a reference to dentry will not prevent rename() from happening.  ->i_mutex
> on parent will, but you either need to play with retries (grab reference
> to parent, grab ->i_mutex, check that it's still our parent, if we'd lost
> the race and someone had renamed the sucker - unlock ->i_mutex, dput,
> repeat) *or* to have our dentry looked up with parent locked, with ->i_mutex
> on said parent still held (which happens to cover the majority of valid
> uses in fs code - ->lookup(), ->create(), ->unlink(), rename(), etc. are
> all called that way, so the name of dentry passed to such methods is stable
> for the duration of the method).
> 
> ->d_lock on dentry is also sufficient, but that obviously means that you
> can't block while holding it.
> 
> > Where should a kernel developer go to learn these things? 
> > include/linux/dcache.h doesn't mention d_name locking rules, nor does
> > Documentation/filesystems/vfs.txt.
> 
> See directory locking rules in there; the crucial point is that dentry
> name is changed by rename() *and* that results of a race can be worse than
> just running into a partially rewritten name - long names are allocated
> separately and walking through a stale pointer you might end up in freed
> memory.
> 
> It's a mess, unfortunately, and $BIGNUM other uses of ->i_mutex make it only
> nastier.  Once in a while I go hunting for races in that area, usally with
> a bunch of fixes coming out of such run ;-/

In light of what you are saying here, am I right to think that the
following code is broken with respect to locking around its use of
filp->f_dentry->d_name.name?

static
void lttng_enumerate_task_fd(struct lttng_session *session,
struct task_struct *p, char *tmp)
{
struct fdtable *fdt;
struct file *filp;
unsigned int i;
const unsigned char *path;

task_lock(p);
if (!p->files)
goto unlock_task;
spin_lock(&p->files->file_lock);
fdt = files_fdtable(p->files);
for (i = 0; i < fdt->max_fds; i++) {
filp = fcheck_files(p->files, i);
if (!filp)
continue;
path = d_path(&filp->f_path, tmp, PAGE_SIZE);
/* Make sure we give at least some info */
trace_lttng_statedump_file_descriptor(session, p, i,
IS_ERR(path) ?
filp->f_dentry->d_name.name :
path);
    }
spin_unlock(&p->files->file_lock);
unlock_task:
task_unlock(p);
}

Since tracepoints never block, holding the ->d_lock around the
trace_lttng_statedump_file_descriptor() tracepoint should probably be
enough to make it correct. Am I missing anything ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [tracepoint] cargo-culting considered harmful...

2013-01-25 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Fri, 2013-01-25 at 09:38 -0500, Mathieu Desnoyers wrote:
> 
> > Yep, I'd be OK with removing this example, since now all users are
> > expected to user TRACE_EVENT(), which is built on top of tracepoints.
> 
> Can I  get your Acked-by for the following patch?

Sure,

Acked-by: Mathieu Desnoyers 

Thanks!

Mathieu

> 
> -- Steve
> 
> commit 867a31fab0a064e54147371425f9fdef933e3d1d
> Author: Steven Rostedt 
> Date:   Fri Jan 25 09:46:36 2013 -0500
> 
> tracing: Remove tracepoint sample code
> 
> The tracepoint sample code was used to teach developers how to
> create their own tracepoints. But now the trace_events have been
> added as a higher level that is used directly by developers today.
> 
> Only the trace_event code should use the tracepoint interface
> directly and no new tracepoints should be added.
> 
> Besides, the example had a race condition with the use of the
>  ->d_name.name dentry field, as pointed out by Al Viro.
> 
> Best just to remove the code so it wont be used by other developers.
> 
> Cc: Al Viro 
> Cc: Mathieu Desnoyers 
> Signed-off-by: Steven Rostedt 
> 
> diff --git a/samples/Kconfig b/samples/Kconfig
> index 7b6792a..6181c2c 100644
> --- a/samples/Kconfig
> +++ b/samples/Kconfig
> @@ -5,12 +5,6 @@ menuconfig SAMPLES
>  
>  if SAMPLES
>  
> -config SAMPLE_TRACEPOINTS
> - tristate "Build tracepoints examples -- loadable modules only"
> - depends on TRACEPOINTS && m
> - help
> -   This build tracepoints example modules.
> -
>  config SAMPLE_TRACE_EVENTS
>   tristate "Build trace_events examples -- loadable modules only"
>   depends on EVENT_TRACING && m
> diff --git a/samples/Makefile b/samples/Makefile
> index 5ef08bb..1a60c62 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -1,4 +1,4 @@
>  # Makefile for Linux samples code
>  
> -obj-$(CONFIG_SAMPLES)+= kobject/ kprobes/ tracepoints/ trace_events/ 
> \
> +obj-$(CONFIG_SAMPLES)+= kobject/ kprobes/ trace_events/ \
>  hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
> diff --git a/samples/tracepoints/Makefile b/samples/tracepoints/Makefile
> deleted file mode 100644
> index 36479ad..000
> --- a/samples/tracepoints/Makefile
> +++ /dev/null
> @@ -1,6 +0,0 @@
> -# builds the tracepoint example kernel modules;
> -# then to use one (as root):  insmod 
> -
> -obj-$(CONFIG_SAMPLE_TRACEPOINTS) += tracepoint-sample.o
> -obj-$(CONFIG_SAMPLE_TRACEPOINTS) += tracepoint-probe-sample.o
> -obj-$(CONFIG_SAMPLE_TRACEPOINTS) += tracepoint-probe-sample2.o
> diff --git a/samples/tracepoints/tp-samples-trace.h 
> b/samples/tracepoints/tp-samples-trace.h
> deleted file mode 100644
> index 4d46be9..000
> --- a/samples/tracepoints/tp-samples-trace.h
> +++ /dev/null
> @@ -1,11 +0,0 @@
> -#ifndef _TP_SAMPLES_TRACE_H
> -#define _TP_SAMPLES_TRACE_H
> -
> -#include/* for struct inode and struct file */
> -#include 
> -
> -DECLARE_TRACE(subsys_event,
> - TP_PROTO(struct inode *inode, struct file *file),
> - TP_ARGS(inode, file));
> -DECLARE_TRACE_NOARGS(subsys_eventb);
> -#endif
> diff --git a/samples/tracepoints/tracepoint-probe-sample.c 
> b/samples/tracepoints/tracepoint-probe-sample.c
> deleted file mode 100644
> index 744c0b9..000
> --- a/samples/tracepoints/tracepoint-probe-sample.c
> +++ /dev/null
> @@ -1,57 +0,0 @@
> -/*
> - * tracepoint-probe-sample.c
> - *
> - * sample tracepoint probes.
> - */
> -
> -#include 
> -#include 
> -#include 
> -#include "tp-samples-trace.h"
> -
> -/*
> - * Here the caller only guarantees locking for struct file and struct inode.
> - * Locking must therefore be done in the probe to use the dentry.
> - */
> -static void probe_subsys_event(void *ignore,
> -struct inode *inode, struct file *file)
> -{
> - path_get(&file->f_path);
> - dget(file->f_path.dentry);
> - printk(KERN_INFO "Event is encountered with filename %s\n",
> - file->f_path.dentry->d_name.name);
> - dput(file->f_path.dentry);
> - path_put(&file->f_path);
> -}
> -
> -static void probe_subsys_eventb(void *ignore)
> -{
> - printk(KERN_INFO "Event B is encountered\n");
> -}
> -
> -static int __init tp_sample_trace_init(void)
> -{
> - int ret;
> -
> - ret = register_trace_subsys_event(probe_subsys_event, NULL);
> - WARN_ON(ret);
> - ret = register_trace_subsys_eventb(pr

Re: [tracepoint] cargo-culting considered harmful...

2013-01-25 Thread Mathieu Desnoyers
* Al Viro (v...@zeniv.linux.org.uk) wrote:
> On Fri, Jan 25, 2013 at 09:49:53AM -0500, Mathieu Desnoyers wrote:
> > static
> > void lttng_enumerate_task_fd(struct lttng_session *session,
> > struct task_struct *p, char *tmp)
> > {
> > struct fdtable *fdt;
> > struct file *filp;
> > unsigned int i;
> > const unsigned char *path;
> > 
> > task_lock(p);
> > if (!p->files)
> > goto unlock_task;
> > spin_lock(&p->files->file_lock);
> > fdt = files_fdtable(p->files);
> > for (i = 0; i < fdt->max_fds; i++) {
> > filp = fcheck_files(p->files, i);
> > if (!filp)
> > continue;
> > path = d_path(&filp->f_path, tmp, PAGE_SIZE);
> > /* Make sure we give at least some info */
> > trace_lttng_statedump_file_descriptor(session, p, i,
> > IS_ERR(path) ?
> > filp->f_dentry->d_name.name :
> > path);
> > }
> > spin_unlock(&p->files->file_lock);
> > unlock_task:
> > task_unlock(p);
> > }
> 
> *cringe*
> 
> a) yes, it needs d_lock for that ->d_name access
> b) iterate_fd() is there for purpose; use it, instead of open-coding the
> damn loop.  Something like
> 
> struct ctx {
>   char *page;
>   struct lttng_session *session,
>   struct task_struct *p;
> };
>   
> static int dump_one(void *p, struct file *file, unsigned fd)
> {
>   struct ctx *ctx = p;
>   const char *s = d_path(&file->f_path, ctx->page, PAGE_SIZE);
>   struct dentry *dentry;
>   if (!IS_ERR(s)) {
>   trace_lttng_statedump_file_descriptor(ctx->session, ctx->p, fd, 
> s);
>   return 0;
>   }
>   /* Make sure we give at least some info */
>   dentry = file->f_path.dentry;
>   spin_lock(&dentry->d_lock);
>   trace_lttng_statedump_file_descriptor(ctx->session, ctx->p, fd,
>   dentry->d_name);
>   spin_unlock(&dentry->d_lock);
>   return 0;
> }
> 
> ...
>   task_lock(p);
>   iterate_fd(p->files, 0, dump_one, &(struct ctx){tmp, session, p});
>   task_unlock(p);
> 
> assuming it wouldn't be better to pass tmp/session/p as the single pointer
> to struct in the first place - I don't know enough about the callers of
> that sucker to tell.  And yes, iterate_fd() will DTRT if given NULL as the
> first argument.  The second argument is "which descriptor should I start
> from?", callback is called for everything present in the table starting from
> that place until it returns non-zero or the end of table is reached...

Thanks !! Modulo a couple of trivial nits, I've integrated your
suggestions. I'm creating a lttng_iterate_fd() wrapper for older kernels
(yeah.. we deal with kernels back to 2.6.32).

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH] Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys

2013-02-25 Thread Mathieu Desnoyers
Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
compat_process_vm_rw() shows that the compatibility code requires an
explicit "access_ok()" check before calling
compat_rw_copy_check_uvector(). The same difference seems to appear when
we compare fs/read_write.c:do_readv_writev() to
fs/compat.c:compat_do_readv_writev().

This subtle difference between the compat and non-compat requirements
should probably be debated, as it seems to be error-prone. In fact,
there are two other sites that use this function in the Linux kernel,
and they both seem to get it wrong:

Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
also ends up calling compat_rw_copy_check_uvector() through
aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
be missing. Same situation for
security/keys/compat.c:compat_keyctl_instantiate_key_iov().

I propose that we add the access_ok() check directly into
compat_rw_copy_check_uvector(), so callers don't have to worry about it,
and it therefore makes the compat call code similar to its non-compat
counterpart. Place the access_ok() check in the same location where
copy_from_user() can trigger a -EFAULT error in the non-compat code, so
the ABI behaviors are alike on both compat and non-compat.

While we are here, fix compat_do_readv_writev() so it checks for
compat_rw_copy_check_uvector() negative return values.

And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
handling.

Acked-by: Linus Torvalds 
Acked-by: Al Viro 
Signed-off-by: Mathieu Desnoyers 
---
 fs/compat.c|   15 +++
 mm/process_vm_access.c |8 
 security/keys/compat.c |4 ++--
 3 files changed, 9 insertions(+), 18 deletions(-)

Index: linux/fs/compat.c
===
--- linux.orig/fs/compat.c
+++ linux/fs/compat.c
@@ -558,6 +558,10 @@ ssize_t compat_rw_copy_check_uvector(int
}
*ret_pointer = iov;
 
+   ret = -EFAULT;
+   if (!access_ok(VERIFY_READ, uvector, nr_segs*sizeof(*uvector)))
+   goto out;
+
/*
 * Single unix specification:
 * We should -EINVAL if an element length is not >= 0 and fitting an
@@ -1080,17 +1084,12 @@ static ssize_t compat_do_readv_writev(in
if (!file->f_op)
goto out;
 
-   ret = -EFAULT;
-   if (!access_ok(VERIFY_READ, uvector, nr_segs*sizeof(*uvector)))
-   goto out;
-
-   tot_len = compat_rw_copy_check_uvector(type, uvector, nr_segs,
+   ret = compat_rw_copy_check_uvector(type, uvector, nr_segs,
   UIO_FASTIOV, iovstack, &iov);
-   if (tot_len == 0) {
-   ret = 0;
+   if (ret <= 0)
goto out;
-   }
 
+   tot_len = ret;
ret = rw_verify_area(type, file, pos, tot_len);
if (ret < 0)
goto out;
Index: linux/mm/process_vm_access.c
===
--- linux.orig/mm/process_vm_access.c
+++ linux/mm/process_vm_access.c
@@ -429,12 +429,6 @@ compat_process_vm_rw(compat_pid_t pid,
if (flags != 0)
return -EINVAL;
 
-   if (!access_ok(VERIFY_READ, lvec, liovcnt * sizeof(*lvec)))
-   goto out;
-
-   if (!access_ok(VERIFY_READ, rvec, riovcnt * sizeof(*rvec)))
-   goto out;
-
if (vm_write)
rc = compat_rw_copy_check_uvector(WRITE, lvec, liovcnt,
  UIO_FASTIOV, iovstack_l,
@@ -459,8 +453,6 @@ free_iovecs:
kfree(iov_r);
if (iov_l != iovstack_l)
kfree(iov_l);
-
-out:
return rc;
 }
 
Index: linux/security/keys/compat.c
===
--- linux.orig/security/keys/compat.c
+++ linux/security/keys/compat.c
@@ -40,12 +40,12 @@ static long compat_keyctl_instantiate_ke
   ARRAY_SIZE(iovstack),
   iovstack, &iov);
if (ret < 0)
-   return ret;
+   goto err;
if (ret == 0)
goto no_payload_free;
 
ret = keyctl_instantiate_key_common(id, iov, ioc, ret, ringid);
-
+err:
if (iov != iovstack)
    kfree(iov);
    return ret;
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH] Add leading underscores to filldir, fillonedir, compat_fillonedir, filldir64

2013-02-25 Thread Mathieu Desnoyers
Document that these functions expect access_ok() to be performed on
dirent between reception from user-space and call to these functions.

Signed-off-by: Mathieu Desnoyers 
CC: Al Viro 
---
 fs/compat.c  |4 ++--
 fs/readdir.c |   14 +++---
 2 files changed, 9 insertions(+), 9 deletions(-)

Index: linux/fs/compat.c
===
--- linux.orig/fs/compat.c
+++ linux/fs/compat.c
@@ -834,7 +834,7 @@ struct compat_readdir_callback {
int result;
 };
 
-static int compat_fillonedir(void *__buf, const char *name, int namlen,
+static int __compat_fillonedir(void *__buf, const char *name, int namlen,
loff_t offset, u64 ino, unsigned int d_type)
 {
struct compat_readdir_callback *buf = __buf;
@@ -879,7 +879,7 @@ asmlinkage long compat_sys_old_readdir(u
buf.result = 0;
buf.dirent = dirent;
 
-   error = vfs_readdir(f.file, compat_fillonedir, &buf);
+   error = vfs_readdir(f.file, __compat_fillonedir, &buf);
if (buf.result)
error = buf.result;
 
Index: linux/fs/readdir.c
===
--- linux.orig/fs/readdir.c
+++ linux/fs/readdir.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(vfs_readdir);
  *
  * "count=1" is a special case, meaning that the buffer is one
  * dirent-structure in size and that the code can't handle more
- * anyway. Thus the special "fillonedir()" function for that
+ * anyway. Thus the special "__fillonedir()" function for that
  * case (the low-level handlers don't need to care about this).
  */
 
@@ -70,7 +70,7 @@ struct readdir_callback {
int result;
 };
 
-static int fillonedir(void * __buf, const char * name, int namlen, loff_t 
offset,
+static int __fillonedir(void * __buf, const char * name, int namlen, loff_t 
offset,
  u64 ino, unsigned int d_type)
 {
struct readdir_callback * buf = (struct readdir_callback *) __buf;
@@ -115,7 +115,7 @@ SYSCALL_DEFINE3(old_readdir, unsigned in
buf.result = 0;
buf.dirent = dirent;
 
-   error = vfs_readdir(f.file, fillonedir, &buf);
+   error = vfs_readdir(f.file, __fillonedir, &buf);
if (buf.result)
error = buf.result;
 
@@ -143,7 +143,7 @@ struct getdents_callback {
int error;
 };
 
-static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
+static int __filldir(void * __buf, const char * name, int namlen, loff_t 
offset,
   u64 ino, unsigned int d_type)
 {
struct linux_dirent __user * dirent;
@@ -206,7 +206,7 @@ SYSCALL_DEFINE3(getdents, unsigned int,
buf.count = count;
buf.error = 0;
 
-   error = vfs_readdir(f.file, filldir, &buf);
+   error = vfs_readdir(f.file, __filldir, &buf);
if (error >= 0)
error = buf.error;
lastdirent = buf.previous;
@@ -227,7 +227,7 @@ struct getdents_callback64 {
int error;
 };
 
-static int filldir64(void * __buf, const char * name, int namlen, loff_t 
offset,
+static int __filldir64(void * __buf, const char * name, int namlen, loff_t 
offset,
 u64 ino, unsigned int d_type)
 {
struct linux_dirent64 __user *dirent;
@@ -286,7 +286,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int
buf.count = count;
buf.error = 0;
 
-   error = vfs_readdir(f.file, filldir64, &buf);
+   error = vfs_readdir(f.file, __filldir64, &buf);
if (error >= 0)
    error = buf.error;
lastdirent = buf.previous;

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Reproduceable SATA lockup on 3.7.8 with SSD

2013-02-25 Thread Mathieu Desnoyers
* Marc MERLIN (m...@merlins.org) wrote:
> Howdy,
> 
> I seem to have the same problem (or similar) as Mathieu Desnoyers in
> https://lkml.org/lkml/2013/2/22/437
> 
> I can reliably get my SSD to drop from the SATA bus given the right workload
> on linux.
> 
> How can I tell if it's Linux's fault or the drive's fault?

Here is a pseudo-git-blame checklist that might be useful for accurate
finger-pointing when a drive fails:

- try diagnostic tools from your drive vendor; if they report your drive
  as bad, then it might just be your drive failing,
- try to run a SMART test from smartmontools,
- try to reproduce your issue with a simple test-case (trying my test
  program might help) that clearly fails quickly, and all the time, on
  your problematic hardware,
- find out if there are known firmware upgrades for your drive provided
  by your vendor, try them out,
- find out if there are known BIOS upgrades for your machine provided by
  your vendor, try them out,
- try test-case on various kernel versions,
- try test-case on various distributions (just in case),
- try test-case with power management disabled in your machine's BIOS,
- try test-case with other SSD drives of the exact same model as
  yours, so you can see if it's just you own drive failing,
- try moving your drive to a different machine (same model, different
  model), and see if the test-case still fails,
- try with other SSD drives (from other vendors) on your machine,
- check whether your partition mount options enable TRIM or not, and try to
  disable TRIM explicitly (see mount(8), discard/nodiscard options),
- try using a different filesystem (just in case),
- try using a different block I/O scheduler,
- try using your drive vendor's SSD eraser to reinitialize your entire
  disk (yes, you will lose all your data). This might be useful if
  TRIM handling has changed after a firmware upgrade, for instance.

With all those results in hand, it should become easier to identify the
cause of your problem. My personal research currently indicates that all
the Intel SSDSC2BW180A3L drives found in the Lenovo x230 laptops I have
tested so far (4 different laptops) fail after a couple of minutes
with my simple random-access-write workload. Moving the drives into a
different laptop (x200) does not help (it still fails).

Good luck!

Mathieu

> 
> Thanks,
> Marc
> 
> - Forwarded message from Marc MERLIN  -
> 
> From: Marc MERLIN 
> To: linux-...@vger.kernel.org
> 
> Hopefully this is the right list. I know that IDE!=SATA, but I can't find
> a SATA list.
> Please redirect me if needed.
> 
> Hardware:
> Lenovo T530, 64bit kernel and userland.
> Hadware is shown below, but 2 drives, one SSD (OCZ-VERTEX4) and one HD 
> (Hitachi HTS54101).
> 
> The SSD will lock up reliably if I do a specific mencoder command that reads 
> MP4
> files and rewrites them to another file in the same directory.
> 
> The log of what happens is shown below; the drive is eventually taken off the 
> bus.
> Once I reboot, it's back, as if nothing happened.
> If I do the same command on the HD, it works, but of course timings will be 
> different
> since the HD is slower.
> 
> How can I tell if it's the SSD's firmware's fault, or the linux SATA/AHCI code
> that is buggy?
> 
> Thanks,
> Marc
> 
> Failure log:
> ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x6 frozen
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:00:00:38:13/04:00:33:00:00/40 tag 0 ncq 524288 out
>  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:08:00:3c:13/04:00:33:00:00/40 tag 1 ncq 524288 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> (snipped)
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:e8:00:30:13/04:00:33:00:00/40 tag 29 ncq 524288 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1.00: failed command: WRITE FPDMA QUEUED
> ata1.00: cmd 61/00:f0:00:34:13/04:00:33:00:00/40 tag 30 ncq 524288 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1: hard resetting link
> ata1: link is slow to respond, please be patient (ready=0)
> ata1: COMRESET failed (errno=-16)
> ata1: hard resetting link
> ata1: link is slow to respond, please be patient (ready=0)
> ata1: COMRESET failed (errno=-16)
> ata1: hard resetting link
> ata1: link is slow to respond, please be patient (ready=0)
> ata1: COMRESET failed (errno=-16)
> ata1: limiting SATA link speed to 3.0 Gbps
> ata1: hard resetting link
> ata1: COMRESET failed (errno=-1

[BUG] Lenovo x230: SATA errors with 180GB Intel 520 SSD under heavy write load

2013-02-22 Thread Mathieu Desnoyers
Hi,

We spent a couple of days cornering what appears to be an issue with the
Intel 520 SSD drives in Lenovo x230 laptops. It was first showing up
on a clean Debian installation, while installing a guest operating
system into a VM. Looking around on forums, there appear to be some
people having issues with database workloads too. So I decided to create
a small user-space program to reproduce the problem. IMPORTANT: Before
you try it, be ready for a system crash. It's available at:

git://git.efficios.com/test-ssd.git

direct link to .c file:
https://git.efficios.com/?p=test-ssd.git;a=blob;f=test-ssd-write.c;hb=refs/heads/master

This program simply performs random-access 4KB writes into a single
file.
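
The workload is roughly the following (a simplified sketch, not the
actual test-ssd-write.c; the file name and size are simply taken from
the example run shown further below):

/* Hammer one file with random-offset 4KB writes until a write fails. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	const off_t filesize = 209715200;	/* 200MB, as in the example run */
	int fd = open("somefileondisk", O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	memset(buf, 0xaa, sizeof(buf));
	for (;;) {
		off_t off = (off_t)(random() % (filesize / 4096)) * 4096;

		if (pwrite(fd, buf, sizeof(buf), off) < 0)
			break;
	}
	close(fd);
	return 0;
}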

Executive summary of our findings (the details are in the
test-ssd-write.c header in the git repo):

- We reproduced this issue on 4 x230 machines (all our x230 have 180GB
  Intel drives, and they are all affected),
- We took a SSD from one of the machines, moved it into an x200, and the
  problem still occurs,
- The problem seems to occur independently of the filesystem (reproduced
  on ext3 and ext4),
- Problem reproduced by test-ssd-write.c (git tree above): After less
  than 5 minutes of the heavy write workload, we get SATA errors and we
  need to cold reboot the machine to access the drive again. Example
  usage (don't forget to prepare for a computer freeze):

  ./test-ssd-write somefileondisk 209715200 1234 -z

  (see options by just running ./test-ssd-write)

The problem occurs with drive model SSDSC2BW180A3L, with both firmwares
LE1i and LF1i (those are Lenovo firmwares). We could reproduce the issue
on 3.2 (Debian), 3.5 (Debian), 3.7.9 (Arch) distribution kernels. We
could reproduce it with x230 BIOS G2ET90WW (2.50) 2012-20-12 and
G2ET86WW (2.06) 2012-11-13, but since it can be reproduced on a x200
too, it does not appear to be a BIOS issue.

We tried the program on a range of other SSD drives, one of them
including the same SandForce 2281 controller (details in the
test-ssd-write.c header). So our current guess is that the Lenovo
firmware on the SSD might be part of the problem, but it would be good
if we could confirm that Intel's own firmware works fine.

Thoughts, ideas, hints about who to contact on this issue would be very
much welcome,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys

2013-03-13 Thread Mathieu Desnoyers
* Linus Torvalds (torva...@linux-foundation.org) wrote:
> On Tue, Mar 12, 2013 at 11:04 AM, Linus Torvalds
>  wrote:
> >
> > That said, the patch definitely looks like the right thing to do. I'll 
> > apply it.
> 
> Hmm. I applied it, and then pretty much immediately realized what was
> problematic about it. What about the fs/bio.c one?  This all felt like
> it was still a work-in-progress, and I'm not sure if you had more
> comments or patches coming along?
> 
> Anyway, this particular patch got applied. Does that obviate the need
> for the fs/bio.c one? I didn't follow all the call-chains...

The fs/bio.c issue is unrelated to this patch, so that separate issue
still stands. I did not get any feedback on that private RFC though.

Should I repost it CCing lkml ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > Ported to the Linux kernel from Userspace RCU library, at commit
> > 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
> > 
> > Ref: http://git.lttng.org/userspace-rcu.git
> > 
> > It is provided as a starting point only. Test cases should be ported
> > from Userspace RCU to kernel space and thoroughly ran on a wide range of
> > architectures before considering this port production-ready.
> 
> Thanks, this seems to work.  Will post an early epoll patch in a minute.
> 
> Minor comments below.
> 
> > +/*
> > + * Load a data from shared memory.
> > + */
> > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> 
> When iterating through the queue by dequeueing, I needed a way
> to get the tail at the start of the iteration and use that as
> a sentinel while iterating, so I access the tail like this:
> 
>   struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> 
> I hope this is supported... it seems to work :)

Ideally, users would stick to the exposed APIs to do these things; if
this access is really needed, maybe it's a sign that we need to extend
the API.

> 
> Unlike most queue users, I need to stop iteration to prevent the same
> item from appearing in the events returned by epoll_wait; since a
> dequeued item may appear back in the wfcqueue immediately.

I think your use-case is very similar to our urcu call_rcu
implementation. I would recommend using wfcq in the following way:

When you want to dequeue, define, on the stack:

struct wfcq_head qhead;
struct wfcq_tail qtail;
struct wfcq_node *node, *n;
enum wfcq_ret ret;

wfcq_init(&qhead, &qtail);

/*
 * Then use splice to move the entire source queue into the local queue.
 * Don't forget to grab the appropriate mutexes for epoll_q here.
 */
ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);

switch (ret) {
case WFCQ_RET_SRC_EMPTY:
return -ENOENT; /* No events to handle */
case WFCQ_RET_DEST_EMPTY:
case WFCQ_RET_DEST_NON_EMPTY:
break;
}

/*
 * From this point, you can release the epoll_q lock and simply iterate
 * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
 * if you need to free the nodes at the same time.
 */
__wfcq_for_each_safe(&qhead, &qtail, node, n) {
...
}

The advantage of using splice() over dequeue() is that you will reduce
the amount of interactions between concurrent enqueue and dequeue
operations on the head and tail of the same queue.

> 
> > +struct wfcq_head {
> > +   struct wfcq_node node;
> > +   struct mutex lock;
> > +};
> 
> I'm not using this lock at all since I already have ep->mtx (which also
> protects the ep->rbr).  Perhaps it should not be included; normal linked
> lists and most data structures I see in the kernel do not provide their
> own locks, either.

Good point. The Linux kernel habit is to leave locks outside of
those structures whenever possible, so users can pick the right lock for
their needs. I will remove it, and remove all the "helpers" that use this
lock as well.

> 
> > +static inline void wfcq_init(struct wfcq_head *head,
> > +   struct wfcq_tail *tail)
> > +{
> > +   /* Set queue head and tail */
> > +   wfcq_node_init(&head->node);
> > +   tail->p = &head->node;
> > +   mutex_init(&head->lock);
> > +}
> 
> There's no corresponding mutex_destroy, so I'm just destroying it
> myself...

Since I remove it, it will become a non-issue.

After thinking about it, I plan to disable preemption around the xchg and
store within __wfcq_append. This is a luxury we have at kernel level
that we don't have in user space, so it might be good to use it. This
will allow me to remove the 10ms sleep from the adaptive busy-wait in
___wfcq_busy_wait(), and replace it with a simple busy-wait. This will
eliminate those possible latency peaks from the dequeue/splice/iteration
side. It's a shame to have the possibility of a 10ms sleep due to
preemption within a 2-instruction window. preempt_disable() will fix this,
while adding a very low overhead to the enqueue path in preemptible kernels.
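
The idea, roughly (a sketch of the internal append path only, not the
final patch; CMM_STORE_SHARED is the helper from the posted header):

/* Enqueue path: publish the new tail, then link it from the old tail.
 * Disabling preemption across these two steps bounds how long a
 * dequeuer may have to busy-wait on a NULL next pointer. */
static inline bool ___wfcq_append(struct wfcq_head *head,
		struct wfcq_tail *tail,
		struct wfcq_node *new_head,
		struct wfcq_node *new_tail)
{
	struct wfcq_node *old_tail;

	preempt_disable();
	old_tail = xchg(&tail->p, new_tail);
	CMM_STORE_SHARED(old_tail->next, new_head);
	preempt_enable();

	/* tell the caller whether the queue was non-empty before */
	return old_tail != &head->node;
}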

I'll send a new patch version soon,

Thanks for your comments!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation (v2)

2013-03-14 Thread Mathieu Desnoyers
Ported to the Linux kernel from Userspace RCU library, at commit
108a92e5b97ee91b2b902dba2dd2e78aab42f420.

Ref: http://git.lttng.org/userspace-rcu.git

It is provided as a starting point only. Test cases should be ported
from Userspace RCU to kernel space and thoroughly ran on a wide range of
architectures before considering this port production-ready.

Changelog since v1:
- Remove the internal mutex and helpers. Let the caller handle it.
- Disable preemption within the append operation, thus removing the sleep
  from dequeue and therefore removing the blocking/nonblocking API distinction.

Signed-off-by: Mathieu Desnoyers 
CC: Lai Jiangshan 
CC: Paul E. McKenney 
CC: Stephen Hemminger 
CC: Davide Libenzi 
CC: Eric Wong 
---
 include/linux/wfcqueue.h |  444 +++
 1 file changed, 444 insertions(+)

Index: linux/include/linux/wfcqueue.h
===
--- /dev/null
+++ linux/include/linux/wfcqueue.h
@@ -0,0 +1,444 @@
+#ifndef _LINUX_WFCQUEUE_H
+#define _LINUX_WFCQUEUE_H
+
+/*
+ * linux/wfcqueue.h
+ *
+ * Concurrent Queue with Wait-Free Enqueue/Busy-Waiting Dequeue
+ *
+ * Copyright 2010-2013 - Mathieu Desnoyers 
+ * Copyright 2011-2012 - Lai Jiangshan 
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Concurrent Queue with Wait-Free Enqueue/Busy-Waiting Dequeue
+ *
+ * This queue has been designed and implemented collaboratively by
+ * Mathieu Desnoyers and Lai Jiangshan. Inspired from
+ * half-wait-free/half-blocking queue implementation done by Paul E.
+ * McKenney.
+ *
+ * Mutual exclusion of wfcq_* / __wfcq_* API
+ *
+ * Synchronization table:
+ *
+ * External synchronization techniques described in the API below are
+ * required between pairs marked with "X". No external synchronization is
+ * required between pairs marked with "-".
+ *
+ * Legend:
+ * [1] wfcq_enqueue
+ * [2] __wfcq_splice (destination queue)
+ * [3] __wfcq_dequeue
+ * [4] __wfcq_splice (source queue)
+ * [5] __wfcq_first
+ * [6] __wfcq_next
+ *
+ * [1] [2] [3] [4] [5] [6]
+ * [1]  -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -
+ * [3]  -   -   X   X   X   X
+ * [4]  -   -   X   -   X   X
+ * [5]  -   -   X   X   -   -
+ * [6]  -   -   X   X   -   -
+ *
+ * Besides locking, mutual exclusion of dequeue, splice and iteration
+ * can be ensured by performing all of those operations from a single
+ * thread, without requiring any lock.
+ */
+
+/*
+ * Load data from shared memory.
+ */
+#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
+
+/*
+ * Identify a shared store.
+ */
+#define CMM_STORE_SHARED(x, v) ({ ACCESS_ONCE(x) = (v); })
+
+enum wfcq_ret {
+   WFCQ_RET_DEST_EMPTY =   0,
+   WFCQ_RET_DEST_NON_EMPTY =   1,
+   WFCQ_RET_SRC_EMPTY =2,
+};
+
+struct wfcq_node {
+   struct wfcq_node *next;
+};
+
+/*
+ * Do not put head and tail on the same cache-line if concurrent
+ * enqueue/dequeue are expected from many CPUs. This eliminates
+ * false-sharing between enqueue and dequeue.
+ */
+struct wfcq_head {
+   struct wfcq_node node;
+};
+
+struct wfcq_tail {
+   struct wfcq_node *p;
+};
+
+/*
+ * wfcq_node_init: initialize wait-free queue node.
+ */
+static inline void wfcq_node_init(struct wfcq_node *node)
+{
+   node->next = NULL;
+}
+
+/*
+ * wfcq_init: initialize wait-free queue.
+ */
+static inline void wfcq_init(struct wfcq_head *head,
+   struct wfcq_tail *tail)
+{
+   /* Set queue head and tail */
+   wfcq_node_init(&head->node);
+   tail->p = &head->node;
+}
+
+/*
+ * wfcq_empty: return whether wait-free queue is empty.
+ *
+ * No memory barrier is issued. No mutual exclusion is required.
+ *
+ * We perform the test on head->node.next to check if the queue is
+ * possibly empty, but we confirm this by checking if the tail pointer
+ * points to the head node because the tail pointer is the linearisation
+ * point of the enqueuers. Just checking the head next pointer could
+ * make a queue appear empty if an enqueuer is preempted for a long time
+ * between xchg() and setting the previous node's next pointer.
+ */
+stat

Re: [RFC] epoll: avoid spinlock contention with wfcqueue

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> I'm posting this lightly tested version since I may not be able to do
> more testing/benchmarking until the weekend.
> 
> Davide's totalmess is still running, so that's probably a good sign :)
> http://www.xmailserver.org/totalmess.c
> I will look for more ways to break this (and benchmark when I stop
> finding ways to break it).  No real applications tested, yet, and
> I think I can improve upon this, too.
> 
> This depends on a couple of patches sitting in -mm and a few
> more I've posted on LKML, for convenience everything is here:
> 
>   http://yhbt.net/epoll-wfcqueue-v3.8.2-20130314.mbox
> (should apply cleanly to 3.9-rc* since there's no epoll changes in that)
> 
> --8<---
> From 139f0d4528c3fabc6a54e47be73ba9990b42cdd8 Mon Sep 17 00:00:00 2001
> From: Eric Wong 
> Date: Thu, 14 Mar 2013 02:37:12 +
> Subject: [PATCH] epoll: avoid spinlock contention with wfcqueue
> 
> This is not a proper commit, I've barely tested this.
> 
> Replace the spinlock-protected linked list ready list with wfcqueue.
> 
> There is still a per-epitem atomic variable which may still spin.  The
> likelihood of contention is very low since it's not shared by the entire
> structure; the state is private to each epitem.
> 
> Things changed/removed:
> 
> * ep->ovflist - the atomic, per-epitem state field prevents
>   event loss during ep_send_events.
> 
> * ep_scan_ready_list - not enough generic code between users
>   anymore to warrant this.  ep_poll_readyevents_proc (used for
>   poll) is read-only, ep_send_events (used for epoll_wait)
>   dequeues.
> 
> * ep->lock renamed to ep->wqlock; this only protects waitqueues now.
>   (will experiment with making it trylock in some places, next)
> 
> * EPOLL_CTL_DEL/close() on a ready file will not immediately release
>   epitem memory, epoll_wait() must be called since there's no way to
>   delete a ready item from wfcqueue in O(1) time.  In practice this
>   should not be a problem, any valid app using epoll must call
>   epoll_wait occasionally.  Unfreed epitems still count against
>   max_user_watches to protect against local DoS.  This should be the
>   only possibly-noticeable change (in case there's an app that blindly
>   adds/deletes things from the rbtree but never calls epoll_wait)
> 
> Barely-tested-by: Eric Wong 
> Cc: Davide Libenzi 
> Cc: Al Viro 
> Cc: Andrew Morton 
> Cc: Mathieu Desnoyers 
> ---
>  fs/eventpoll.c | 631 
> ++---
>  1 file changed, 292 insertions(+), 339 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index a4e4ad7..6159b85 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -40,6 +40,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * LOCKING:
> @@ -47,15 +48,13 @@
>   *
>   * 1) epmutex (mutex)
>   * 2) ep->mtx (mutex)
> - * 3) ep->lock (spinlock)
> + * 3) ep->wqlock (spinlock)
>   *
> - * The acquire order is the one listed above, from 1 to 3.
> - * We need a spinlock (ep->lock) because we manipulate objects
> - * from inside the poll callback, that might be triggered from
> - * a wake_up() that in turn might be called from IRQ context.
> - * So we can't sleep inside the poll callback and hence we need
> - * a spinlock. During the event transfer loop (from kernel to
> - * user space) we could end up sleeping due a copy_to_user(), so
> + * The acquire order is the one listed above, from 1 to 2.
> + *
> + * We only have a spinlock (ep->wqlock) to manipulate the waitqueues.
> + * During the event transfer loop (from kernel to user space)
> + * we could end up sleeping due a copy_to_user(), so
>   * we need a lock that will allow us to sleep. This lock is a
>   * mutex (ep->mtx). It is acquired during the event transfer loop,
>   * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
> @@ -82,8 +81,8 @@
>   * of epoll file descriptors, we use the current recursion depth as
>   * the lockdep subkey.
>   * It is possible to drop the "ep->mtx" and to use the global
> - * mutex "epmutex" (together with "ep->lock") to have it working,
> - * but having "ep->mtx" will make the interface more scalable.
> + * mutex "epmutex" to have it working,  but having "ep->mtx" will
> + * make the interface more scalable.
>   * Events that require holding "epmutex" are very rare, while for
>   * normal operations the epoll private "ep->mtx" will guarantee
>   * a better

Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers  wrote:
> > > > +/*
> > > > + * Load a data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> > > 
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > > 
> > >   struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > 
> > > I hope this is supported... it seems to work :)
> > 
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
> 
> Right.  If I can use splice, I will not need this.  more comments below
> on splice...
> 
> > > Unlike most queue users, I need to stop iteration to prevent the same
> > > item from appearing in the events returned by epoll_wait; since a
> > > dequeued item may appear back in the wfcqueue immediately.
> > 
> > I think your use-case is very similar to our urcu call_rcu
> > implementation. I would recommend to use wfcq in the following way:
> > 
> > When you want to dequeue, define, on the stack:
> > 
> > struct wfcq_head qhead;
> > struct wfcq_tail qtail;
> > struct wfcq_node *node, *n;
> > enum wfcq_ret ret;
> > 
> > wfcq_init(&qhead, &qtail);
> > 
> > /*
> >  * Then use splice to move the entire source queue into the local queue.
> >  * Don't forget to grab the appropriate mutexes for eqpoll_q here.
> >  */
> > ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);
> > 
> > switch (ret) {
> > case WFCQ_RET_SRC_EMPTY:
> > return -ENOENT; /* No events to handle */
> > case WFCQ_RET_DEST_EMPTY:
> > case WFCQ_RET_DEST_NON_EMPTY:
> > break;
> > }
> > 
> > /*
> >  * From this point, you can release the epoll_q lock and simply iterate
> >  * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
> >  * if you need to free the nodes at the same time.
> >  */
> > __wfcq_for_each_safe(&qhead, &qtail, node, n) {
> > ...
> > }
> > 
> > The advantage of using splice() over dequeue() is that you will reduce
> > the amount of interactions between concurrent enqueue and dequeue
> > operations on the head and tail of the same queue.
> 
> I wanted to use splice here originally, but epoll_wait(2) may not
> consume the entire queue, as it's limited to maxevents specified by the
> user calling epoll_wait.

I see,

> 
> With unconsumed elements, I need to preserve ordering of the queue to
> avoid starvation.  So I would either need to:
> 
> a) splice the unconsumed portion back to the head of the shared queue.
>I'm not sure if this is possible while elements are being enqueued...

That would be a double-ended queue. I haven't thought this problem
through yet.

>Using a mutex for splicing back to unconsumed elements is OK, and
>probably required anyways since we need to keep EPOLLONESHOT
>unmasking synced with dequeue.
> 
> b) preserve the unconsumed spliced queue across epoll_wait invocations
>but that requires checking both queues for event availability...

I think b) should be preferred over a).

Basically, instead of having the "unconsumed element" queue on the stack
as I suggested, you might want to keep it across epoll_wait invocations.

Before you start digging into the unconsumed element queue, you start
by splicing the content of the shared queue into the tail of the
unconsumed queue. Then you simply dequeue elements from the unconsumed
queue until you either reach the end or hit the maximum number requested.

You should note that with this scheme, you'll have to dequeue items from
the unconsumed queue rather than just iterating over the elements. The
nice side of consuming all the elements (in the temp queue on the local
stack) is that you don't care about dequeuing, since the entire queue
will vanish. However, in this case, since you care about keeping the
queue after a partial iteration, you need to dequeue from it.

And yes, this approach involves checking both queues for event
availability. Hopefully none of this will be too much of an issue
performance-wise.
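
To make b) concrete, something like the following sketch for the
epoll_wait() side, run under ep->mtx (the field and helper names,
e.g. txlhead/txltail and ep_send_one(), are illustrative assumptions,
not the actual patch):

        /* refill the unconsumed queue from the shared ready queue, O(1) */
        __wfcq_splice(&ep->txlhead, &ep->txltail,
                      &ep->rdlhead, &ep->rdltail);

        /* then hand out at most maxevents from the unconsumed queue */
        while (eventcnt < maxevents) {
                struct wfcq_node *node;

                node = __wfcq_dequeue(&ep->txlhead, &ep->txltail);
                if (!node)
                        break;
                /* copy to user, handle EPOLLONESHOT/level-trigger; returns 0 or 1 */
                eventcnt += ep_send_one(ep, node, uevent + eventcnt);
        }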

Another approach could be to let you work directly on the shared queue:

I could possibly implement a

void __wfcq_snapshot(struct wfcq_head *head,
struc

Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Peter Hurley (pe...@hurleysoftware.com) wrote:
> On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> > +/*
> > + * Do not put head and tail on the same cache-line if concurrent
> > + * enqueue/dequeue are expected from many CPUs. This eliminates
> > + * false-sharing between enqueue and dequeue.
> > + */
> > +struct wfcq_head {
> > +   struct wfcq_node node;
> > +   struct mutex lock;
> > +};
> > +
> > +struct wfcq_tail {
> > +   struct wfcq_node *p;
> > +};
> 
> 
> If you want to force separate cachelines for SMP, this would be
> 
> struct wfcq_head {
>   struct wfcq_node node;
>   struct mutex lock;
> } cacheline_aligned_in_smp;
> 
> struct wfcq_tail {
>   struct wfcq_node *p;
> } cacheline_aligned_in_smp;

Well, the thing is: I don't want to force it. The queue head and tail
can be used in a few ways:

1) tail used by frequent enqueue on one CPU, head used for frequent
   dequeues on another CPU. In this case, we want head/tail on different
   cache lines.
2) same scenario as 1), but head and tail are placed in per-cpu data
   of the two CPUs. We don't need to align each structure explicitly.
3) tail and head used locally, e.g. on the stack, as splice destination
   followed by iteration or dequeue from the same CPU. We don't want to
   waste precious memory space in this case, so no alignment.

So as you see, only (1) actually requires explicit alignment of head and
tail.
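
For case (1), the caller can request the separation explicitly when
embedding the queue; a minimal sketch (the structure name is just an
example):

struct dispatch_queue {
        struct wfcq_head head ____cacheline_aligned_in_smp;
        struct wfcq_tail tail ____cacheline_aligned_in_smp;
};

whereas cases (2) and (3) simply embed the two structures without any
alignment attribute.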

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers  wrote:
> > > > The advantage of using splice() over dequeue() is that you will reduce
> > > > the amount of interactions between concurrent enqueue and dequeue
> > > > operations on the head and tail of the same queue.
> > > 
> > > I wanted to use splice here originally, but epoll_wait(2) may not
> > > consume the entire queue, as it's limited to maxevents specified by the
> > > user calling epoll_wait.
> > 
> > I see,
> > 
> > > 
> > > With unconsumed elements, I need to preserve ordering of the queue to
> > > avoid starvation.  So I would either need to:
> > > 
> > > a) splice the unconsumed portion back to the head of the shared queue.
> > >I'm not sure if this is possible while elements are being enqueued...
> > 
> > That would be a double-ended queue. I haven't thought this problem
> > through yet.
> > 
> > >Using a mutex for splicing back to unconsumed elements is OK, and
> > >probably required anyways since we need to keep EPOLLONESHOT
> > >unmasking synced with dequeue.
> > > 
> > > b) preserve the unconsumed spliced queue across epoll_wait invocations
> > >but that requires checking both queues for event availability...
> > 
> > I think b) should be preferred over a).
> > 
> > Basically, instead of having the "unconsumed element" queue on the stack
> > as I suggested, you might want to keep it across epoll_wait invocations.
> > 
> > Before you start digging in the unconsumed element queue, you just start
> > by splicing the content of the shared queue into the tail of unconsumed
> > queue. Then, you simply dequeue the unconsumed queue elements until you
> > either reach the end or the max nr.
> > 
> > You should note that with this scheme, you'll have to dequeue items from
> > the unconsumed queue rather than just iterating over the elements. The
> > nice side of consuming all the elements (in the temp queue on the local
> > stack) is that you don't care about dequeuing, since the entire queue
> > will vanish. However, in this case, since you care about keeping the
> > queue after a partial iteration, you need to dequeue from it.
> > 
> > And yes, this approach involves checking both queues for event
> > availability. Hopefully none of this will be too much of an issue
> > performance-wise.
> 
> Right.  I will try this, I don't think the check will be too expensive.
> 
> When dequeuing from the unconsumed queue, perhaps there should be a
> "dequeue_local" function which omits the normal barriers required
> for the shared queue.
> 
> With a splice and without needing barriers for iteration, this sounds good.

Well actually, __wfcq_dequeue() is really not that expensive. In terms
of synchronization, here is what it typically does:

node = ___wfcq_node_sync_next(&head->node);
  -> busy wait if node->next is NULL. This check is needed even if we
 work on a "local" queue, because the O(1) splice operation does not
 walk over every node to check for NULL next pointer: this is left
 to the dequeue/iteration operations.
if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
  -> only taken if we are getting the last node of the queue. Happens
 at most once per batch.
}

head->node.next = next;
  -> a simple store.

smp_read_barrier_depends();
  -> no-op on everything but Alpha.

return node;

So my recommendation would be to be careful before trying to come up
with flavors that remove barriers if those are not actually hurting the
fast-path significantly. By dequeue fast-path, I mean what needs to be
executed for dequeue of each node.

> 
> > Another approach could be to let you work directly on the shared queue:
> > 
> > I could possibly implement a
> > 
> > void __wfcq_snapshot(struct wfcq_head *head,
> > struct wfcq_tail *tail);
> > 
> > That would save a tail shapshot that would then be used to stop
> > iteration, dequeue and splice at the location of the tail snapshot. And
> > 
> > void __wfcq_snapshot_reset(struct wfcq_head *head,
> > struct wfcq_tail *tail);
> > 
> > would set the tail snapshot pointer back to NULL.
> > 
> > This would require a few extra checks, but nothing very expensive I
> > expect.
> > 
> > Thoughts ?
> 
> I'm not sure I follow, would using it be something like this?
> 
&

Re: [PATCH] Documentation: Remove text on tracepoint samples

2013-03-15 Thread Mathieu Desnoyers
* Paul Bolle (pebo...@tiscali.nl) wrote:
> The tracepoint sample code got removed. Remove a few lines on its usage
> too.
> 
> Signed-off-by: Paul Bolle 

Thanks!

Acked-by: Mathieu Desnoyers 

> ---
>  Documentation/trace/tracepoints.txt | 15 ---
>  1 file changed, 15 deletions(-)
> 
> diff --git a/Documentation/trace/tracepoints.txt 
> b/Documentation/trace/tracepoints.txt
> index c0e1cee..da49437 100644
> --- a/Documentation/trace/tracepoints.txt
> +++ b/Documentation/trace/tracepoints.txt
> @@ -81,7 +81,6 @@ tracepoint_synchronize_unregister() must be called before 
> the end of
>  the module exit function to make sure there is no caller left using
>  the probe. This, and the fact that preemption is disabled around the
>  probe call, make sure that probe removal and module unload are safe.
> -See the "Probe example" section below for a sample probe module.
>  
>  The tracepoint mechanism supports inserting multiple instances of the
>  same tracepoint, but a single definition must be made of a given
> @@ -100,17 +99,3 @@ core kernel image or in modules.
>  If the tracepoint has to be used in kernel modules, an
>  EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be
>  used to export the defined tracepoints.
> -
> -* Probe / tracepoint example
> -
> -See the example provided in samples/tracepoints
> -
> -Compile them with your kernel.  They are built during 'make' (not
> -'make modules') when CONFIG_SAMPLE_TRACEPOINTS=m.
> -
> -Run, as root :
> -modprobe tracepoint-sample (insmod order is not important)
> -modprobe tracepoint-probe-sample
> -cat /proc/tracepoint-sample (returns an expected error)
> -rmmod tracepoint-sample tracepoint-probe-sample
> -dmesg
> -- 
> 1.7.11.7
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-16 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Eric Wong  wrote:
> > Mathieu Desnoyers  wrote:
> > > * Eric Wong (normalper...@yhbt.net) wrote:
> > > > Mathieu Desnoyers  wrote:
> > > > > +/*
> > > > > + * Load a data from shared memory.
> > > > > + */
> > > > > +#define CMM_LOAD_SHARED(p)   ACCESS_ONCE(p)
> > > > 
> > > > When iterating through the queue by dequeueing, I needed a way
> > > > to get the tail at the start of the iteration and use that as
> > > > a sentinel while iterating, so I access the tail like this:
> > > > 
> > > > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > > 
> > > > I hope this is supported... it seems to work :)
> > > 
> > > Ideally it would be good if users could try using the exposed APIs to do
> > > these things, or if it's really needed, maybe it's a sign that we need
> > > to extend the API.
> > 
> > Right.  If I can use splice, I will not need this.  more comments below
> > on splice...
> 
> Even with splice, I think I need to see the main tail at the start of
> iteration to maintain compatibility (for weird apps that might care).

Thanks for providing this detailed scenario. I think there is an
important aspect of the splice usage I suggested on which we are not
fully understanding each other. I will annotate your scenario below with
clarifications:

> 
> Consider this scenario:
> 
>   1) main.queue has 20 events
> 
>   2) epoll_wait(maxevents=16) called by user
> 
>   3) splice all 20 events into unconsumed.queue, main.queue is empty
> 
>   4) put_user + dequeue on 16 events from unconsumed.queue
>  # unconsumed.queue has 4 left at this point
> 
>   5) main.queue gets several more events enqueued at any point after 3.

Let's suppose 11 events are enqueued into main.queue after 3.

> 
>   6) epoll_wait(maxevents=16) called by user again

Before 7), here is what should be done:

6.5) splice all new events from main.queue into unconsumed.queue.
 unconsumed.queue will now contain 4 + 11 = 15 events. Note
 that splice will preserve the right order of events within
 unconsumed.queue.

> 
>   7) put_user + dequeue on 4 remaining items in unconsumed.queue
> 
>  We can safely return 4 events back to the user at this point.

Step 7) will now return 15 events from unconsumed.queue.

> 
>  However, this might break compatibility for existing users.  I'm
>  not sure if there's any weird apps which know/expect the event
>  count they'll get from epoll_wait, but maybe there is one...

With the new step 6.5), I don't think the behavior will change compared
to what is already there.

> 
>   8) We could perform a splice off main.queue to fill the remaining
>  slots the user requested, but we do not know if the things we
>  splice from main.queue at this point were just dequeued in 7.
> 
>  If we loaded the main.queue.tail before 7, we could safely splice
>  into unconsumed.queue and know when to stop when repeating the
>  put_user + dequeue loop.

We can achieve the same thing by doing step 6.5) at the beginning of
epoll_wait(). It's important to do it at the beginning of epoll_wait for
the reason you discuss in 8) : if you wait until you notice that
unconsumed.queue is empty before refilling it from main.queue, you won't
be able to know if the events in main.queue were added after the first
event was dequeued.

Step 6.5) should be performed each time upon entry into epoll_wait(). It
does not matter if unconsumed.queue would happen to have enough events
to fill in the maxevents request or not (and you don't want to iterate
on the unconsumed.queue needlessly to count them): you can just do a
O(1) splice from main.queue into unconsumed.queue, and your original
semantic should be preserved.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable

2012-08-28 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On 08/25/2012 06:24 AM, Mathieu Desnoyers wrote:
> > * Tejun Heo (t...@kernel.org) wrote:
> >> Hello,
> >>
> >> On Sat, Aug 25, 2012 at 12:59:25AM +0200, Sasha Levin wrote:
> >>> Thats the thing, the amount of things of things you can do with a given 
> >>> bucket
> >>> is very limited. You can't add entries to any point besides the head 
> >>> (without
> >>> walking the entire list).
> >>
> >> Kinda my point.  We already have all the hlist*() interface to deal
> >> with such cases.  Having something which is evidently the trivial
> >> hlist hashtable and advertises as such in the interface can be
> >> helpful.  I think we need that more than we need anything fancy.
> >>
> >> Heh, this is a debate about which one is less insignificant.  I can
> >> see your point.  I'd really like to hear what others think on this.
> >>
> >> Guys, do we want something which is evidently trivial hlist hashtable
> >> which can use hlist_*() API directly or do we want something better
> >> encapsulated?
> > 
> > My 2 cents, FWIW: I think this specific effort should target a trivially
> > understandable API and implementation, for use-cases where one would be
> > tempted to reimplement his own trivial hash table anyway. So here
> > exposing hlist internals, with which kernel developers are already
> > familiar, seems like a good approach in my opinion, because hiding stuff
> > behind new abstraction might make the target users go away.
> > 
> > Then, as we see the need, we can eventually merge a more elaborate hash
> > table with ponies and whatnot, but I would expect that the trivial hash
> > table implementation would still be useful. There are of course very
> > compelling reasons to use a more featureful hash table: automatic
> > resize, RT-aware updates, scalable updates, etc... but I see a purpose
> > for a trivial implementation. Its primary strong points being:
> > 
> > - it's trivially understandable, so anyone who wants to be really sure
> >   they won't end up debugging the hash table instead of their
> >   work-in-progress code can have a full understanding of it,
> > - it has few dependencies, which makes it easier to understand and
> >   easier to use in some contexts (e.g. early boot).
> > 
> > So I'm in favor of not overdoing the abstraction for this trivial hash
> > table, and honestly I would rather prefer that this trivial hash table
> > stays trivial. A more elaborate hash table should probably come as a
> > separate API.
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> 
> Alright, let's keep it simple then.
> 
> I do want to keep the hash_for_each[rcu,safe] family though.

Just a thought: if the API offered by the simple hash table focuses on
providing a mechanism to find the hash bucket that holds the chain
containing the key being looked up, and then expects the user to use the
hlist API to iterate on the chain (with or without the hlist _rcu
variant), then it might seem consistent that a helper providing
iteration over the entire table would actually just provide iteration on
all buckets, and let the user call the hlist for-each iterator for each
node within the bucket, e.g.:

struct hlist_head *head;
struct hlist_node *pos;

hash_for_each_bucket(ht, head) {
hlist_for_each(pos, head) {
...
}
}

That way you only have to provide one single macro
(hash_for_each_bucket), and rely on the already existing:

- hlist_for_each_entry
- hlist_for_each_safe
- hlist_for_each_entry_rcu
- hlist_for_each_safe_rcu
  .

and various flavors that can appear in the future without duplicating
this API. So you won't even have to create _rcu, _safe, nor _safe_rcu
versions of the hash_for_each_bucket macro.
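
One possible (untested) shape for it, assuming the table is a plain
array of struct hlist_head visible at the call site so that
ARRAY_SIZE() applies:

#define hash_for_each_bucket(name, head)                        \
        for ((head) = &(name)[0];                               \
             (head) < &(name)[ARRAY_SIZE(name)];                \
             (head)++)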

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable

2012-08-28 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On 08/28/2012 12:11 PM, Mathieu Desnoyers wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> On 08/25/2012 06:24 AM, Mathieu Desnoyers wrote:
> >>> * Tejun Heo (t...@kernel.org) wrote:
> >>>> Hello,
> >>>>
> >>>> On Sat, Aug 25, 2012 at 12:59:25AM +0200, Sasha Levin wrote:
> >>>>> Thats the thing, the amount of things of things you can do with a given 
> >>>>> bucket
> >>>>> is very limited. You can't add entries to any point besides the head 
> >>>>> (without
> >>>>> walking the entire list).
> >>>>
> >>>> Kinda my point.  We already have all the hlist*() interface to deal
> >>>> with such cases.  Having something which is evidently the trivial
> >>>> hlist hashtable and advertises as such in the interface can be
> >>>> helpful.  I think we need that more than we need anything fancy.
> >>>>
> >>>> Heh, this is a debate about which one is less insignificant.  I can
> >>>> see your point.  I'd really like to hear what others think on this.
> >>>>
> >>>> Guys, do we want something which is evidently trivial hlist hashtable
> >>>> which can use hlist_*() API directly or do we want something better
> >>>> encapsulated?
> >>>
> >>> My 2 cents, FWIW: I think this specific effort should target a trivially
> >>> understandable API and implementation, for use-cases where one would be
> >>> tempted to reimplement his own trivial hash table anyway. So here
> >>> exposing hlist internals, with which kernel developers are already
> >>> familiar, seems like a good approach in my opinion, because hiding stuff
> >>> behind new abstraction might make the target users go away.
> >>>
> >>> Then, as we see the need, we can eventually merge a more elaborate hash
> >>> table with ponies and whatnot, but I would expect that the trivial hash
> >>> table implementation would still be useful. There are of course very
> >>> compelling reasons to use a more featureful hash table: automatic
> >>> resize, RT-aware updates, scalable updates, etc... but I see a purpose
> >>> for a trivial implementation. Its primary strong points being:
> >>>
> >>> - it's trivially understandable, so anyone who wants to be really sure
> >>>   they won't end up debugging the hash table instead of their
> >>>   work-in-progress code can have a full understanding of it,
> >>> - it has few dependencies, which makes it easier to understand and
> >>>   easier to use in some contexts (e.g. early boot).
> >>>
> >>> So I'm in favor of not overdoing the abstraction for this trivial hash
> >>> table, and honestly I would rather prefer that this trivial hash table
> >>> stays trivial. A more elaborate hash table should probably come as a
> >>> separate API.
> >>>
> >>> Thanks,
> >>>
> >>> Mathieu
> >>>
> >>
> >> Alright, let's keep it simple then.
> >>
> >> I do want to keep the hash_for_each[rcu,safe] family though.
> > 
> > Just a thought: if the API offered by the simple hash table focus on
> > providing a mechanism to find the hash bucket to which belongs the hash
> > chain containing the key looked up, and then expects the user to use the
> > hlist API to iterate on the chain (with or without the hlist _rcu
> > variant), then it might seem consistent that a helper providing
> > iteration over the entire table would actually just provide iteration on
> > all buckets, and let the user call the hlist for each iterator for each
> > node within the bucket, e.g.:
> > 
> > struct hlist_head *head;
> > struct hlist_node *pos;
> > 
> > hash_for_each_bucket(ht, head) {
> > hlist_for_each(pos, head) {
> > ...
> > }
> > }
> > 
> > That way you only have to provide one single macro
> > (hash_for_each_bucket), and rely on the already existing:
> > 
> > - hlist_for_each_entry
> > - hlist_for_each_safe
> > - hlist_for_each_entry_rcu
> > - hlist_for_each_safe_rcu
> >   .
> > 
> > and various flavors that can appear in the future without duplicating
> > this API. So you won't even have to create _rcu, _safe, nor _safe_rcu
> > versio

Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable

2012-08-28 Thread Mathieu Desnoyers
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On 08/28/2012 12:11 PM, Mathieu Desnoyers wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > >> On 08/25/2012 06:24 AM, Mathieu Desnoyers wrote:
> > >>> * Tejun Heo (t...@kernel.org) wrote:
> > >>>> Hello,
> > >>>>
> > >>>> On Sat, Aug 25, 2012 at 12:59:25AM +0200, Sasha Levin wrote:
> > >>>>> Thats the thing, the amount of things of things you can do with a 
> > >>>>> given bucket
> > >>>>> is very limited. You can't add entries to any point besides the head 
> > >>>>> (without
> > >>>>> walking the entire list).
> > >>>>
> > >>>> Kinda my point.  We already have all the hlist*() interface to deal
> > >>>> with such cases.  Having something which is evidently the trivial
> > >>>> hlist hashtable and advertises as such in the interface can be
> > >>>> helpful.  I think we need that more than we need anything fancy.
> > >>>>
> > >>>> Heh, this is a debate about which one is less insignificant.  I can
> > >>>> see your point.  I'd really like to hear what others think on this.
> > >>>>
> > >>>> Guys, do we want something which is evidently trivial hlist hashtable
> > >>>> which can use hlist_*() API directly or do we want something better
> > >>>> encapsulated?
> > >>>
> > >>> My 2 cents, FWIW: I think this specific effort should target a trivially
> > >>> understandable API and implementation, for use-cases where one would be
> > >>> tempted to reimplement his own trivial hash table anyway. So here
> > >>> exposing hlist internals, with which kernel developers are already
> > >>> familiar, seems like a good approach in my opinion, because hiding stuff
> > >>> behind new abstraction might make the target users go away.
> > >>>
> > >>> Then, as we see the need, we can eventually merge a more elaborate hash
> > >>> table with ponies and whatnot, but I would expect that the trivial hash
> > >>> table implementation would still be useful. There are of course very
> > >>> compelling reasons to use a more featureful hash table: automatic
> > >>> resize, RT-aware updates, scalable updates, etc... but I see a purpose
> > >>> for a trivial implementation. Its primary strong points being:
> > >>>
> > >>> - it's trivially understandable, so anyone who wants to be really sure
> > >>>   they won't end up debugging the hash table instead of their
> > >>>   work-in-progress code can have a full understanding of it,
> > >>> - it has few dependencies, which makes it easier to understand and
> > >>>   easier to use in some contexts (e.g. early boot).
> > >>>
> > >>> So I'm in favor of not overdoing the abstraction for this trivial hash
> > >>> table, and honestly I would rather prefer that this trivial hash table
> > >>> stays trivial. A more elaborate hash table should probably come as a
> > >>> separate API.
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Mathieu
> > >>>
> > >>
> > >> Alright, let's keep it simple then.
> > >>
> > >> I do want to keep the hash_for_each[rcu,safe] family though.
> > > 
> > > Just a thought: if the API offered by the simple hash table focus on
> > > providing a mechanism to find the hash bucket to which belongs the hash
> > > chain containing the key looked up, and then expects the user to use the
> > > hlist API to iterate on the chain (with or without the hlist _rcu
> > > variant), then it might seem consistent that a helper providing
> > > iteration over the entire table would actually just provide iteration on
> > > all buckets, and let the user call the hlist for each iterator for each
> > > node within the bucket, e.g.:
> > > 
> > > struct hlist_head *head;
> > > struct hlist_node *pos;
> > > 
> > > hash_for_each_bucket(ht, head) {
> > > hlist_for_each(pos, head) {
> > > ...
> > > }
> > > }
> > > 
> > > That way you only have to pr

Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable

2012-09-04 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Tue, 2012-08-28 at 19:00 -0400, Mathieu Desnoyers wrote:
> 
> > Looking again at:
> > 
> > +#define hash_for_each_size(name, bits, bkt, node, obj, member) 
> > \
> > +   for (bkt = 0; bkt < HASH_SIZE(bits); bkt++) 
> > \
> > +   hlist_for_each_entry(obj, node, &name[bkt], member)
> > 
> > you will notice that a "break" or "continue" in the inner loop will not
> > affect the outer loop, which is certainly not what the programmer would
> > expect!
> > 
> > I advise strongly against creating such error-prone construct.
> > 
> 
> A few existing loop macros do this. But they require a do { } while ()
> approach, and all have a comment.
> 
> It's used by do_each_thread() in sched.h 

Yes. It's worth noting that it is a do_each_thread() /
while_each_thread() pair.


> and ftrace does this as well.
> Look at kernel/trace/ftrace.c at do_for_each_ftrace_rec().

Same here.

> 
> Yes it breaks 'break' but it does not break 'continue' as it would just
> go to the next item that would have been found (like a normal for
> would).

Good point.

So would changing hash_for_each_size() to a
do_each_hash_size()/while_each_hash_size() pair make it clearer that this
contains a double loop? (Along with an appropriate comment about
break.)
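
Something like the following untested sketch, mirroring the quoted
do_for_each_ftrace_rec() style (macro names as proposed above):

/*
 * Double loop: a 'break' in the body will not exit the outer loop,
 * use 'goto' to leave the construct early.
 */
#define do_each_hash_size(name, bits, bkt, node, obj, member)          \
        for ((bkt) = 0; (bkt) < HASH_SIZE(bits); (bkt)++) {            \
                hlist_for_each_entry(obj, node, &(name)[bkt], member) {

#define while_each_hash_size()                                          \
                }                                                       \
        }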

Thanks,

Mathieu

> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable

2012-09-04 Thread Mathieu Desnoyers
* Pedro Alves (pal...@redhat.com) wrote:
> On 09/04/2012 05:30 PM, Pedro Alves wrote:
> > On 09/04/2012 04:35 PM, Steven Rostedt wrote:
> >> On Tue, 2012-08-28 at 19:00 -0400, Mathieu Desnoyers wrote:
> >>
> >>> Looking again at:
> >>>
> >>> +#define hash_for_each_size(name, bits, bkt, node, obj, member)   
> >>>   \
> >>> +   for (bkt = 0; bkt < HASH_SIZE(bits); bkt++)   
> >>>   \
> >>> +   hlist_for_each_entry(obj, node, &name[bkt], member)
> >>>
> >>> you will notice that a "break" or "continue" in the inner loop will not
> >>> affect the outer loop, which is certainly not what the programmer would
> >>> expect!
> >>>
> >>> I advise strongly against creating such error-prone construct.
> >>>
> >>
> >> A few existing loop macros do this. But they require a do { } while ()
> >> approach, and all have a comment.
> >>
> >> It's used by do_each_thread() in sched.h and ftrace does this as well.
> >> Look at kernel/trace/ftrace.c at do_for_each_ftrace_rec().
> >>
> >> Yes it breaks 'break' but it does not break 'continue' as it would just
> >> go to the next item that would have been found (like a normal for
> >> would).
> > 
> > /*
> >  * This is a double for. Do not use 'break' to break out of the loop,
> >  * you must use a goto.
> >  */
> > #define do_for_each_ftrace_rec(pg, rec) \
> > for (pg = ftrace_pages_start; pg; pg = pg->next) {  \
> > int _i; \
> > for (_i = 0; _i < pg->index; _i++) {\
> > rec = &pg->records[_i];
> > 
> > 
> > 
> > You can make 'break' also work as expected if you can embed a little 
> > knowledge
> > of the inner loop's condition in the outer loop's condition.  Sometimes it's
> > trivial, most often when the inner loop's iterator is a pointer that goes
> > NULL at the end, but other times not so much.  Something like (completely 
> > untested):
> > 
> > #define do_for_each_ftrace_rec(pg, rec) \
> > for (pg = ftrace_pages_start, rec = &pg->records[pg->index];\
> >  pg && rec == &pg->records[pg->index];  \
> >  pg = pg->next) {   \
> > int _i; \
> > for (_i = 0; _i < pg->index; _i++) {\
> > rec = &pg->records[_i];
> >
> > 
> > (other variants possible)
> > 
> > IOW, the outer loop only iterates if the inner loop completes.  If there's
> > a break in the inner loop, then the outer loop breaks too.  Of course, it
> > all depends on whether the generated code looks sane or hideous, if
> > the uses of the macro care for it over bug avoidance.
> > 
> 
> BTW, you can also go a step further and remove the need to close with double 
> }},
> with something like:
> 
> #define do_for_each_ftrace_rec(pg, rec)   
>\
> for (pg = ftrace_pages_start, rec = &pg->records[pg->index];  
>\
>  pg && rec == &pg->records[pg->index];
>\
>  pg = pg->next)           
>\
>   for (rec = pg->records; rec < &pg->records[pg->index]; rec++)

Maybe in some cases there might be ways to combine the two loops into
one? I'm not seeing exactly how to do it for this one, but it should
not be impossible. Perhaps the inner loop condition can be moved to the
outer loop, using (blah ? loop1_cond : loop2_cond) to test for
different conditions depending on the context, and doing the same for the
3rd argument of the for() loop. The details elude me for now though, so
maybe it's complete nonsense ;)

It might not be that useful for do_for_each_ftrace_rec, but if we can do
it for the hash table iterator, it might be worth it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 06/23 -v8] handle accurate time keeping over long delays

2008-01-31 Thread Mathieu Desnoyers
nline s64 __get_nsec_offset(void) { return 0; }
> +void timekeeping_accumulate(void) { }
>  #endif
>  
>  /**
> @@ -302,6 +302,7 @@ static int timekeeping_resume(struct sys
>   timespec_add_ns(&xtime, timekeeping_suspend_nsecs);
>   /* re-base the last cycle value */
>   clock->cycle_last = clocksource_read(clock);
> + clock->cycle_accumulated = 0;
>   clock->error = 0;
>   timekeeping_suspended = 0;
>   write_sequnlock_irqrestore(&xtime_lock, flags);
> @@ -448,27 +449,28 @@ static void clocksource_adjust(s64 offse
>   */
>  void update_wall_time(void)
>  {
> - cycle_t offset;
> + cycle_t cycle_now;
>  
>   /* Make sure we're fully resumed: */
>   if (unlikely(timekeeping_suspended))
>   return;
>  
>  #ifdef CONFIG_GENERIC_TIME
> - offset = (clocksource_read(clock) - clock->cycle_last) & clock->mask;
> + cycle_now = clocksource_read(clock);
>  #else
> - offset = clock->cycle_interval;
> + cycle_now = clock->cycle_last + clock->cycle_interval;
>  #endif
> + clocksource_accumulate(clock, cycle_now);
> +
>   clock->xtime_nsec += (s64)xtime.tv_nsec << clock->shift;
>  
>   /* normally this loop will run just once, however in the
>* case of lost or late ticks, it will accumulate correctly.
>*/
> - while (offset >= clock->cycle_interval) {
> + while (clock->cycle_accumulated >= clock->cycle_interval) {
>   /* accumulate one interval */
>   clock->xtime_nsec += clock->xtime_interval;
> - clock->cycle_last += clock->cycle_interval;
> - offset -= clock->cycle_interval;
> + clock->cycle_accumulated -= clock->cycle_interval;
>  
>   if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
>   clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
> @@ -482,13 +484,13 @@ void update_wall_time(void)
>   }
>  
>   /* correct the clock when NTP error is too big */
> - clocksource_adjust(offset);
> + clocksource_adjust(clock->cycle_accumulated);
>  
>   /* store full nanoseconds into xtime */
>   xtime.tv_nsec = (s64)clock->xtime_nsec >> clock->shift;
>   clock->xtime_nsec -= (s64)xtime.tv_nsec << clock->shift;
>  
> - update_xtime_cache(cyc2ns(clock, offset));
> + update_xtime_cache(cyc2ns(clock, clock->cycle_accumulated));
>  
>   /* check to see if there is a new clocksource to use */
>   change_clocksource();
> Index: linux-mcount.git/arch/powerpc/kernel/time.c
> ===
> --- linux-mcount.git.orig/arch/powerpc/kernel/time.c  2008-01-30 
> 14:35:51.0 -0500
> +++ linux-mcount.git/arch/powerpc/kernel/time.c   2008-01-30 
> 14:54:12.0 -0500
> @@ -773,7 +773,8 @@ void update_vsyscall(struct timespec *wa
>   stamp_xsec = (u64) xtime.tv_nsec * XSEC_PER_SEC;
>   do_div(stamp_xsec, 10);
>   stamp_xsec += (u64) xtime.tv_sec * XSEC_PER_SEC;
> - update_gtod(clock->cycle_last, stamp_xsec, t2x);
> + update_gtod(clock->cycle_last-clock->cycle_accumulated,
> + stamp_xsec, t2x);
>  }
>  
>  void update_vsyscall_tz(void)
> 
> -- 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH 21/23 -v8] Add markers to various events

2008-01-31 Thread Mathieu Desnoyers
4:35:48.0 -0500
> +++ linux-mcount.git/arch/x86/mm/fault_32.c   2008-01-30 15:54:06.0 
> -0500
> @@ -311,6 +311,9 @@ fastcall void __kprobes do_page_fault(st
>   /* get the address */
>  address = read_cr2();
>  
> + trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
> +regs->eip, error_code, address);
> +
>   tsk = current;
>  
>   si_code = SEGV_MAPERR;
> Index: linux-mcount.git/arch/x86/mm/fault_64.c
> ===
> --- linux-mcount.git.orig/arch/x86/mm/fault_64.c  2008-01-30 
> 14:35:48.0 -0500
> +++ linux-mcount.git/arch/x86/mm/fault_64.c   2008-01-30 15:54:06.0 
> -0500
> @@ -316,6 +316,9 @@ asmlinkage void __kprobes do_page_fault(
>   /* get the address */
>   address = read_cr2();
>  
> + trace_mark(arch_do_page_fault, "ip %lx err %lx addr %lx",
> +regs->rip, error_code, address);
> +
>   info.si_code = SEGV_MAPERR;
>  
>  
> Index: linux-mcount.git/kernel/hrtimer.c
> ===
> --- linux-mcount.git.orig/kernel/hrtimer.c2008-01-30 14:35:48.0 
> -0500
> +++ linux-mcount.git/kernel/hrtimer.c 2008-01-30 15:54:06.0 -0500
> @@ -709,6 +709,8 @@ static void enqueue_hrtimer(struct hrtim
>   struct hrtimer *entry;
>   int leftmost = 1;
>  
> + trace_mark(kernel_hrtimer_enqueue,
> +"expires %p timer %p", &timer->expires, timer);
>   /*
>* Find the right place in the rbtree:
>*/
> @@ -1130,6 +1132,7 @@ void hrtimer_interrupt(struct clock_even
>  
>   retry:
>   now = ktime_get();
> + trace_mark(kernel_hrtimer_interrupt, "now %p", &now);
>  
>   expires_next.tv64 = KTIME_MAX;
>  
> @@ -1168,6 +1171,10 @@ void hrtimer_interrupt(struct clock_even
>   continue;
>   }
>  
> + trace_mark(kernel_hrtimer_interrupt_expire,
> +"expires %p timer %p",
> +&timer->expires, timer);
> +
>   __run_hrtimer(timer);
>   }
>   spin_unlock(&cpu_base->lock);
> Index: linux-mcount.git/kernel/sched.c
> ===
> --- linux-mcount.git.orig/kernel/sched.c  2008-01-30 15:46:44.0 
> -0500
> +++ linux-mcount.git/kernel/sched.c   2008-01-30 15:54:06.0 -0500
> @@ -90,6 +90,11 @@ unsigned long long __attribute__((weak))
>  #define PRIO_TO_NICE(prio)   ((prio) - MAX_RT_PRIO - 20)
>  #define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
>  
> +#define __PRIO(prio) \
> + ((prio) <= 99 ? 199 - (prio) : (prio) - 120)
> +
> +#define PRIO(p) __PRIO((p)->prio)
> +
>  /*
>   * 'User priority' is the nice value converted to something we
>   * can work with better when scaling various scheduler parameters,
> @@ -1372,6 +1377,9 @@ static void activate_task(struct rq *rq,
>   if (p->state == TASK_UNINTERRUPTIBLE)
>   rq->nr_uninterruptible--;
>  
> + trace_mark(kernel_sched_activate_task,
> +"pid %d prio %d nr_running %ld",
> +p->pid, PRIO(p), rq->nr_running);
>   enqueue_task(rq, p, wakeup);
>   inc_nr_running(p, rq);
>  }
> @@ -1385,6 +1393,9 @@ static void deactivate_task(struct rq *r
>   rq->nr_uninterruptible++;
>  
>   dequeue_task(rq, p, sleep);
> + trace_mark(kernel_sched_deactivate_task,
> +"pid %d prio %d nr_running %ld",
> +p->pid, PRIO(p), rq->nr_running);
>   dec_nr_running(p, rq);
>  }
>  
> 
> -- 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH 06/23 -v8] handle accurate time keeping over long delays

2008-02-01 Thread Mathieu Desnoyers
* John Stultz ([EMAIL PROTECTED]) wrote:
> 
> On Thu, 2008-01-31 at 07:10 -0500, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([EMAIL PROTECTED]) wrote:
> > > From: John Stultz <[EMAIL PROTECTED]>
> > > 
> > > Handle accurate time even if there's a long delay between
> > > accumulated clock cycles.
> > > 
> > 
> > About this one.. we talked a lot about the importance of timekeeping at
> > the first Montreal Tracing Summit this week. Actually, someone
> > mentioned a very interesting point : in order to be able to synchronize
> > traces taken from the machine with traces taken on external hardware
> > (i.e. memory bus tracer on Freescale), taking the "real" counter value
> > rather than using the "cumulated cycles" approach (which creates a
> > virtual counter instead) would be better.
> > 
> > So I would recommend using an algorithm that would return a clock value
> > which is the same as the underlying hardware counter.
> 
> Hmm. It is an interesting issue. Clearly having the raw cycle value
> match up so hardware analysis could be mapped to software timestamps
> would be useful(although obscure) feature. However with the variety of
> clocksources, dealing properly with the clocksource wrap issue (ACPI PM
> for instance wraps about every 5 seconds) also has to be addressed.
> 
> I think you were mentioning an idea that required some work on the read
> side to handle the wraps, basically managing the high order bits by
> hand. This sounds like it would be an additional feature that could be
> added on to the infrastructure being provided in the
> get_monotonic_cycles() patch. No?
> 

Yup, exactly.

> 
> However, all of the above is a separate issue then what this (the
> timekeeping over long delay) patch addresses, as it is not really
> directly related to the get_monotonic_cycles() patch, but instead allows
> for correct timekeeping, making update_wall_time() to function properly
> if it was deferred for longer then the clocksource's wrap time.
> 

I agree, that could apply on top of the monotonic cycles patch. It's
just a different way to see it: dealing with wrapping TSC bits,
returning the LSBs given by the hardware, rather than simply
accumulating time. This is what the patch I sent earlier (which I use in
LTTng) does. It currently expects 32 LSBs to be given by the hardware,
but it would be trivial to extend it to support any given number of
hardware LSBs.

Mathieu

> thanks
> -john
> 
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH 21/23 -v8] Add markers to various events

2008-02-02 Thread Mathieu Desnoyers
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> 5B
> 
> On Thu, 31 Jan 2008, Mathieu Desnoyers wrote:
> 
> > * Steven Rostedt ([EMAIL PROTECTED]) wrote:
> > > This patch adds markers to various events in the kernel.
> > > (interrupts, task activation and hrtimers)
> > >
> >
> > Hi Steven,
> >
> > I would propose the following standard for IRQ handler markers:
> >
> > trace_mark(kernel_irq_entry, "irq_id %u kernel_mode %u", irq,
> >   (regs)?(!user_mode(regs)):(1));
> 
> Are you saying that two markers with the same name is ok?
> That would be great if that is true.
> 

Yep. Just make sure that they have the exact same format string,
otherwise the marker infrastructure will refuse to connect probes to
them (and will printk an error message saying this). You can even put a
marker in an inline function or unrolled loop (many markers will be
generated and this is ok).
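
For instance (illustrative, reusing the irq marker I propose above),
both call sites below share a single marker because the name and format
string match, so one probe registration covers both:

        trace_mark(kernel_irq_entry, "irq_id %u kernel_mode %u",
                   irq, !user_mode(regs));
        /* ... and elsewhere, e.g. in another handler ... */
        trace_mark(kernel_irq_entry, "irq_id %u kernel_mode %u",
                   irq, 1);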

> > 
> > trace_mark(kernel_irq_exit, MARK_NOARGS);
> 
> My patches don't use this (yet) so I'm leaving out adding markers
> that are not used. I'm not disagreeing with adding these, it's just that
> a patch series should only add markers that are actually used.
> 

Ok, I agree with your approach. About using the markers, I wonder
whether it would be best for me to keep concentrating my effort on
getting the low-level stuff into the kernel first, or to switch my focus
to posting the higher-level stuff: the tracer itself, which is mostly a
stand-alone kernel module.

Mathieu

> >
> > So we can know the elaspsed time in irq handlers and whether they are
> > nested on user of kernel code.
> >
> > The same for traps :
> >
> > trace_mark(kernel_arch_trap_entry, "trap_id %d ip #p%ld", trapnr,
> >   instruction_pointer(regs));
> >
> > Where we know the trap number and the instruction pointer that caused
> > the trap. Here again, we should put a :
> >
> > trace_mark(kernel_arch_trap_exit, MARK_NOARGS);
> >
> > At the end of the trap handlers.
> >
> > It makes automatic analysis _much_ easier than trying to gather each and
> > every handler instrumentation which would have a different name...
> >
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH 06/23 -v8] handle accurate time keeping over long delays

2008-02-02 Thread Mathieu Desnoyers
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> 
> 
> 
> On Fri, 1 Feb 2008, Mathieu Desnoyers wrote:
> 
> > > > > accumulated clock cycles.
> > > > >
> > > >
> > > > About this one.. we talked a lot about the importance of timekeeping at
> > > > the first Montreal Tracing Summit this week. Actually, someone
> > > > mentioned a very interesting point : in order to be able to synchronize
> > > > traces taken from the machine with traces taken on external hardware
> > > > (i.e. memory bus tracer on Freescale), taking the "real" counter value
> > > > rather than using the "cumulated cycles" approach (which creates a
> > > > virtual counter instead) would be better.
> > > >
> > > > So I would recommend using an algorithm that would return a clock value
> > > > which is the same as the underlying hardware counter.
> > >
> > > Hmm. It is an interesting issue. Clearly having the raw cycle value
> > > match up so hardware analysis could be mapped to software timestamps
> > > would be useful(although obscure) feature. However with the variety of
> > > clocksources, dealing properly with the clocksource wrap issue (ACPI PM
> > > for instance wraps about every 5 seconds) also has to be addressed.
> > >
> > > I think you were mentioning an idea that required some work on the read
> > > side to handle the wraps, basically managing the high order bits by
> > > hand. This sounds like it would be an additional feature that could be
> > > added on to the infrastructure being provided in the
> > > get_monotonic_cycles() patch. No?
> > >
> >
> > Yup, exactly.
> >
> > >
> > > However, all of the above is a separate issue then what this (the
> > > timekeeping over long delay) patch addresses, as it is not really
> > > directly related to the get_monotonic_cycles() patch, but instead allows
> > > for correct timekeeping, making update_wall_time() function properly
> > > if it was deferred for longer than the clocksource's wrap time.
> > >
> >
> > I agree, that could apply on top of the monotonic cycles patch. It's
> > just a different way to see it : dealing with wrapping TSC bits,
> > returning the LSBs given by the hardware, rather than simply
> > accumulating time. This is what the patch I sent earlier (which I use in
> > LTTng) does. I currently expect 32 LSBs to be given by the hardware,
> > but it would be trivial to extend it to support any given number of
> > hardware LSBs.
> >
> 
> So you are saying that you can trivially make it work with a clock that is,
> say, 24 bits? And this same code can work if we boot up with a clock with
> 32 bits or more?
> 
> -- Steve
> 

Yes, with this updated version. It supports HW clocks with various
numbers of bits. For performance reasons, I limit myself to the 32 LSBs of
the clock even if the clock provides more bits. This module is aimed at
32-bit architectures, because such tricks are not necessary on 64-bit
architectures, which provide atomic 64-bit updates.

(it's only compile-tested)

LTTng timestamp

LTTng synthetic TSC code for timestamping. Extracts a 64-bit TSC from a [0..32]
bit counter, kept up to date by a periodic timer interrupt. Lockless.

> do you actually use the RCU internals? or do you just reimplement an RCU
> algorithm?
> 

Nope, I don't use RCU internals in this code. Preempt disable seemed
like the best way to handle this utterly short code path and I wanted
the write side to be fast enough to be called periodically. What I do is:

- Disable preemption at the read-side :
  it makes sure the pointer I get will point to a data structure that
  will never change while I am in the preempt disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible
  (only need to disable preemption, does not race against other CPUs and
  won't generate cache line bouncing). It also allows dealing with
  unsynchronized TSCs if needed.
- Periodic write side : it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt off region) to last shorter than
the interval between IPI updates so we can guarantee the data structure
it uses won't be modified underneath it. Since the IPI update is
launched every second or so (depending on the frequency of the counter we
are trying to extend), it's more than ok.
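
For illustration, here is a minimal, compile-untested sketch of the scheme just
described -- not the actual patch (whose body is not included in this excerpt);
read_hw_counter_32() is a hypothetical stand-in for the architecture's 32-bit
counter read, and the real LTTng code differs in detail:

#include <linux/compiler.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/types.h>

struct sync_tsc {
	u32 msb;		/* bits [63:32] of the last synthesized value */
	u32 lsb;		/* hardware LSBs sampled at the last update */
};

struct cpu_sync_tsc {
	struct sync_tsc tsc[2];
	unsigned int index;	/* bank currently published to readers */
};

static DEFINE_PER_CPU(struct cpu_sync_tsc, cpu_sync_tsc);

/*
 * Read side: preemption off only.  The bank we pick is not rewritten
 * before the *next* IPI, and the preempt-off region is much shorter
 * than the update period.
 */
static u64 trace_clock_read64(void)
{
	struct cpu_sync_tsc *cpu;
	struct sync_tsc *t;
	u32 hw;
	u64 val;

	preempt_disable();
	cpu = &__get_cpu_var(cpu_sync_tsc);
	t = &cpu->tsc[cpu->index];
	hw = read_hw_counter_32();	/* hypothetical 32-bit counter read */
	val = ((u64)t->msb << 32) | hw;
	if (hw < t->lsb)		/* counter wrapped once since the update */
		val += 1ULL << 32;
	preempt_enable();
	return val;
}

/*
 * Write side: run on each CPU from the periodic IPI, with a period well
 * below the hardware counter wrap time.
 */
static void update_synthetic_tsc(void *unused)
{
	struct cpu_sync_tsc *cpu = &__get_cpu_var(cpu_sync_tsc);
	unsigned int next = 1 - cpu->index;
	u64 now = trace_clock_read64();

	cpu->tsc[next].msb = now >> 32;
	cpu->tsc[next].lsb = (u32)now;
	barrier();			/* fill the spare bank before publishing it */
	cpu->index = next;
}

The two banks plus the index are what make the read side safe with only
preemption disabled: the IPI always fills the spare bank and then flips the
index, so a reader keeps a consistent snapshot as long as its preempt-off
window is shorter than the update period, which is exactly the assumption
stated in (*) above.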

Changelog:

- Support [0..32] bits -> 64 bits.

I voluntarily limit the code to use at most 32 bits of the hardware clock for
performance considerations. If this is a problem it could be changed. Also, the
algori

[PATCH] Linux Kernel Markers Support for Proprietary Modules

2008-02-02 Thread Mathieu Desnoyers
There seem to be good arguments for markers to support proprietary modules, so
I am throwing this one-liner in; let's see how people react. It only makes
sure that a module that has been "forced" to be loaded won't have its markers
used. It is important to keep this check so that the kernel does not crash by
mistakenly expecting the markers part of struct module to be present when the
checksum is incorrect.


Discussion so far :
http://lkml.org/lkml/2008/1/22/226

Jon Masters <[EMAIL PROTECTED]> writes:
I notice in module.c:

#ifdef CONFIG_MARKERS
  if (!mod->taints)
  marker_update_probe_range(mod->markers,
  mod->markers + mod->num_markers, NULL, NULL);
#endif

Is this an attempt to not set a marker for proprietary modules? [...]

* Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
I can't seem to find any discussion about this aspect.  If this is the
intent, it seems misguided to me.  There may instead be a relationship
to TAINT_FORCED_{RMMOD,MODULE}.  Mathieu?

- FChE

On Tue, 2008-01-22 at 22:10 -0500, Mathieu Desnoyers wrote:
On my part, it's mostly a matter of not crashing the kernel when someone
tries to force modprobe of a proprietary module (where the checksums
don't match) on a kernel that supports the markers. Not doing so
causes the markers to look for the marker-specific information in
struct module, which doesn't exist, and OOPSes.

Christoph's point of view is rather more drastic than mine : it's not
interesting for the kernel community to help proprietary module writers,
so it's a good idea not to give them marker support. (I CC'ed him so he
can clarify his position).

* Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
[...]
Another way of looking at this though is that by allowing/encouraging
proprietary module writers to include markers, we and their users get
new diagnostic capabilities.  It constitutes a little bit of opening
up, which IMO we should reward rather than punish.


* Valdis Kletnieks ([EMAIL PROTECTED]) wrote:
On Wed, 23 Jan 2008 09:48:12 EST, Mathieu Desnoyers said:

> This specific one is a kernel policy matter, and I personally don't
> have a strong opinion about it. I agree that you raise a good counter
> argument : it can be useful to proprietary modules users to be able to
> extract tracing information from those modules to argue with their
> vendors that their driver/hardware is broken (a tracer is _very_ useful
> in that kind of situation).

Amen, brother. Been there, done that, got the tshirt (not on Linux, but
other operating systems).

> However, it is also useful to proprietary
> module writers who can benefit from the merged kernel/modules traces.
> Do we want to give them this ability ?

The proprietary module writer has the *source* for the kernel and their module.
There's no way you can prevent the proprietary module writers from using this
feature as long as you allow other module writers to use it.

>   It would surely help writing
> better proprietary kernel modules.

The biggest complaint against proprietary modules is that they make it
impossible for *us* to debug.  And you want to argue *against* a feature that
would allow them to develop better code that causes fewer crashes, and therefore
fewer people *asking* for us to debug it?

Remember - when a user tries a Linux box with a proprietary module, and the
experience sucks because the module sucks, they will walk away thinking
"Linux sucks", not "That module sucks".

It applies on top of 2.6.24-git12.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Acked-by: Jon Masters <[EMAIL PROTECTED]>
CC: "Frank Ch. Eigler" <[EMAIL PROTECTED]>
CC: Jon Masters <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Linus Torvalds <[EMAIL PROTECTED]>
CC: Andrew Morton <[EMAIL PROTECTED]>
---
 kernel/module.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/module.c
===
--- linux-2.6-lttng.orig/kernel/module.c2008-02-02 14:36:54.0 
-0500
+++ linux-2.6-lttng/kernel/module.c 2008-02-02 14:36:56.0 -0500
@@ -2033,7 +2033,7 @@ static struct module *load_module(void _
add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
 #ifdef CONFIG_MARKERS
-   if (!mod->taints)
+   if (!(mod->taints & TAINT_FORCED_MODULE))
marker_update_probe_range(mod->markers,
mod->markers + mod->num_markers, NULL, NULL);
 #endif

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

[patch 0/5] Instrumentation Menu Removal

2008-02-02 Thread Mathieu Desnoyers
Hi Andrew,

Here are the updated instrumentation menu removal patches. They apply on top of
2.6.24-git12. I took care of the underlying architecture changes that happened
recently (avr32 and arm especially).

Patch order :
# instrumentation menu removal
fix-arm-to-play-nicely-with-generic-instrumentation-menu.patch
create-arch-kconfig.patch
add-have-oprofile.patch
add-have-kprobes.patch
move-kconfiginstrumentation-to-arch-kconfig-and-init-kconfig.patch

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 5/5] Move Kconfig.instrumentation to arch/Kconfig and init/Kconfig

2008-02-02 Thread Mathieu Desnoyers
Move the instrumentation Kconfig to

arch/Kconfig for architecture dependent options
  - oprofile
  - kprobes

and

init/Kconfig for architecture independent options
  - profiling
  - markers

Remove the "Instrumentation Support" menu. Everything moves to "General setup".
Delete the kernel/Kconfig.instrumentation file.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Linus Torvalds <[EMAIL PROTECTED]>
CC: Sam Ravnborg <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
---
 arch/Kconfig   |   28 
 arch/alpha/Kconfig |2 -
 arch/arm/Kconfig   |2 -
 arch/blackfin/Kconfig  |2 -
 arch/cris/Kconfig  |2 -
 arch/frv/Kconfig   |2 -
 arch/h8300/Kconfig |2 -
 arch/ia64/Kconfig  |2 -
 arch/m32r/Kconfig  |2 -
 arch/m68k/Kconfig  |2 -
 arch/m68knommu/Kconfig |2 -
 arch/mips/Kconfig  |2 -
 arch/parisc/Kconfig|2 -
 arch/powerpc/Kconfig   |2 -
 arch/ppc/Kconfig   |2 -
 arch/s390/Kconfig  |2 -
 arch/sh/Kconfig|2 -
 arch/sparc/Kconfig |2 -
 arch/sparc64/Kconfig   |2 -
 arch/um/Kconfig|2 -
 arch/v850/Kconfig  |2 -
 arch/x86/Kconfig   |2 -
 arch/xtensa/Kconfig|2 -
 init/Kconfig   |   12 
 kernel/Kconfig.instrumentation |   55 -
 25 files changed, 40 insertions(+), 99 deletions(-)

Index: linux-2.6-lttng/arch/Kconfig
===
--- linux-2.6-lttng.orig/arch/Kconfig   2008-02-02 14:37:03.0 -0500
+++ linux-2.6-lttng/arch/Kconfig2008-02-02 14:37:08.0 -0500
@@ -1,3 +1,31 @@
 #
 # General architecture dependent options
 #
+
+config OPROFILE
+   tristate "OProfile system profiling (EXPERIMENTAL)"
+   depends on PROFILING
+   depends on HAVE_OPROFILE
+   help
+ OProfile is a profiling system capable of profiling the
+ whole system, include the kernel, kernel modules, libraries,
+ and applications.
+
+ If unsure, say N.
+
+config HAVE_OPROFILE
+   def_bool n
+
+config KPROBES
+   bool "Kprobes"
+   depends on KALLSYMS && MODULES
+   depends on HAVE_KPROBES
+   help
+ Kprobes allows you to trap at almost any kernel address and
+ execute a callback function.  register_kprobe() establishes
+ a probepoint and specifies the callback.  Kprobes is useful
+ for kernel debugging, non-intrusive instrumentation and testing.
+ If in doubt, say "N".
+
+config HAVE_KPROBES
+   def_bool n
Index: linux-2.6-lttng/init/Kconfig
===
--- linux-2.6-lttng.orig/init/Kconfig   2008-02-02 14:37:03.0 -0500
+++ linux-2.6-lttng/init/Kconfig2008-02-02 14:37:08.0 -0500
@@ -665,6 +665,18 @@ config SLOB
 
 endchoice
 
+config PROFILING
+   bool "Profiling support (EXPERIMENTAL)"
+   help
+ Say Y here to enable the extended profiling support mechanisms used
+ by profilers such as OProfile.
+
+config MARKERS
+   bool "Activate markers"
+   help
+ Place an empty function call at each marker site. Can be
+ dynamically changed for a probe function.
+
 source "arch/Kconfig"
 
 endmenu# General setup
Index: linux-2.6-lttng/arch/alpha/Kconfig
===
--- linux-2.6-lttng.orig/arch/alpha/Kconfig 2008-02-02 14:37:04.0 
-0500
+++ linux-2.6-lttng/arch/alpha/Kconfig  2008-02-02 14:37:08.0 -0500
@@ -650,8 +650,6 @@ source "drivers/Kconfig"
 
 source "fs/Kconfig"
 
-source "kernel/Kconfig.instrumentation"
-
 source "arch/alpha/Kconfig.debug"
 
 # DUMMY_CONSOLE may be defined in drivers/video/console/Kconfig
Index: linux-2.6-lttng/arch/arm/Kconfig
===
--- linux-2.6-lttng.orig/arch/arm/Kconfig   2008-02-02 14:37:07.0 
-0500
+++ linux-2.6-lttng/arch/arm/Kconfig2008-02-02 14:37:08.0 -0500
@@ -1147,8 +1147,6 @@ endmenu
 
 source "fs/Kconfig"
 
-source "kernel/Kconfig.instrumentation"
-
 source "arch/arm/Kconfig.debug"
 
 source "security/Kconfig"
Index: linux-2.6-lttng/arch/blackfin/Kconfig
===
--- linux-2.6-lttng.orig/arch/blackfin/Kconfig  2008-02-02 14:37:04.0 
-0500
+++ linux-2.6-lttng/arch/blackfin/Kconfig   2008-02-02 14:37:08.0 
-0500
@@ -974,8 +974,6 @@ source &

[patch 2/5] Create arch/Kconfig

2008-02-02 Thread Mathieu Desnoyers
Puts the content of arch/Kconfig in the "General setup" menu.

Linus:

> Should it come with a re-duplication of its content into each
> architecture, which was the case previously ? The oprofile and kprobes
> menu entries were literally cut and pasted from one architecture to
> another. Should we put its content in init/Kconfig then ?

I don't think it's a good idea to go back to making it per-architecture,
although that extensive "depends on " might
indicate that there certainly is room for cleanup there.

And I don't think it's wrong keeping it in kernel/Kconfig.xyz per se, I
just think it's wrong to (a) lump the code together when it really doesn't
necessarily need to and (b) show it to users as some kind of choice that
is tied together (whether it then has common code or not).

On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like

depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32

really shouldn't exist in a file like kernel/Kconfig.instrumentation.

It would be much better to do

depends on ARCH_SUPPORTS_KPROBES

in that generic file, and then architectures that do support it would just
have a

bool ARCH_SUPPORTS_KPROBES
default y

in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...
   

Sam Ravnborg:

Stuff it into a new file: arch/Kconfig
We can then extend this file to include all the 'trailing'
Kconfig things that are anyway equal for all ARCHs.

But it should be kept clean - so if we introduce such a file
then we should use ARCH_HAS_whatever in the arch specific Kconfig
files to enable stuff that is not shared.

[...]

The above suggestion is actually not exactly the best way to do it...
First the naming..
A quick grep shows following usage today (in Kconfig files)
ARCH_HAS        51
ARCH_SUPPORTS   4
HAVE_ARCH   7

ARCH_HAS is the clear winner.


In the common Kconfig file do:

config FOO
depends on ARCH_HAS_FOO
bool "bla bla"

config ARCH_HAS_FOO
def_bool n


In the arch specific Kconfig file in a suitable place do:

config SUITABLE_OPTION
select ARCH_HAS_FOO


The naming of ARCH_HAS_ is fixed and shall be:
ARCH_HAS_<feature>


Only a single line added per architecture.
And we will end up with a (maybe even commented) list of trivial selects.

- Yet another update :

Moving to HAVE_* now.


Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Sam Ravnborg <[EMAIL PROTECTED]>
Cc: Jeff Dike <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>
Cc: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
---
 arch/Kconfig |3 +++
 init/Kconfig |2 ++
 2 files changed, 5 insertions(+)

Index: linux-2.6-lttng/init/Kconfig
===
--- linux-2.6-lttng.orig/init/Kconfig   2008-01-29 15:04:23.0 -0500
+++ linux-2.6-lttng/init/Kconfig2008-01-29 15:22:47.0 -0500
@@ -665,6 +665,8 @@ config SLOB
 
 endchoice
 
+source "arch/Kconfig"
+
 endmenu# General setup
 
 config SLABINFO
Index: linux-2.6-lttng/arch/Kconfig
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-lttng/arch/Kconfig2008-01-29 15:22:47.0 -0500
@@ -0,0 +1,3 @@
+#
+# General architecture dependent options
+#

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 4/5] Add HAVE_KPROBES

2008-02-02 Thread Mathieu Desnoyers
Linus:

On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like

depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32

really shouldn't exist in a file like kernel/Kconfig.instrumentation.

It would be much better to do

depends on ARCH_SUPPORTS_KPROBES

in that generic file, and then architectures that do support it would just
have a

bool ARCH_SUPPORTS_KPROBES
default y

in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...

Changelog:

Actually, I know I gave this as the magic incantation, but now that I see
it, I realize that I should have told you to just use

config KPROBES_SUPPORT
def_bool y

instead, which is a bit denser.

We seem to use both kinds of syntax for these things, but this is really
what "def_bool" is there for...

- Use HAVE_KPROBES
- Use a select

- Yet another update :
Moving to HAVE_* now.

- Update ARM for kprobes support.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Sam Ravnborg <[EMAIL PROTECTED]>
Cc: Jeff Dike <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>
Cc: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
---
 arch/arm/Kconfig   |1 +
 arch/avr32/Kconfig |1 +
 arch/ia64/Kconfig  |1 +
 arch/powerpc/Kconfig   |1 +
 arch/ppc/Kconfig   |1 +
 arch/s390/Kconfig  |1 +
 arch/sparc64/Kconfig   |1 +
 arch/x86/Kconfig   |1 +
 kernel/Kconfig.instrumentation |5 -
 9 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/avr32/Kconfig
===
--- linux-2.6-lttng.orig/arch/avr32/Kconfig 2008-02-01 07:37:17.0 
-0500
+++ linux-2.6-lttng/arch/avr32/Kconfig  2008-02-01 07:37:51.0 -0500
@@ -11,6 +11,7 @@ config AVR32
# that we usually don't need on AVR32.
select EMBEDDED
select HAVE_OPROFILE
+   select HAVE_KPROBES
help
  AVR32 is a high-performance 32-bit RISC microprocessor core,
  designed for cost-sensitive embedded applications, with particular
Index: linux-2.6-lttng/arch/ia64/Kconfig
===
--- linux-2.6-lttng.orig/arch/ia64/Kconfig  2008-02-01 07:37:17.0 
-0500
+++ linux-2.6-lttng/arch/ia64/Kconfig   2008-02-01 07:39:09.0 -0500
@@ -16,6 +16,7 @@ config IA64
select PM if (!IA64_HP_SIM)
select ARCH_SUPPORTS_MSI
select HAVE_OPROFILE
+   select HAVE_KPROBES
default y
help
  The Itanium Processor Family is Intel's 64-bit successor to
Index: linux-2.6-lttng/arch/powerpc/Kconfig
===
--- linux-2.6-lttng.orig/arch/powerpc/Kconfig   2008-02-01 07:37:17.0 
-0500
+++ linux-2.6-lttng/arch/powerpc/Kconfig2008-02-01 07:39:09.0 
-0500
@@ -88,6 +88,7 @@ config PPC
bool
default y
select HAVE_OPROFILE
+   select HAVE_KPROBES
 
 config EARLY_PRINTK
bool
Index: linux-2.6-lttng/arch/ppc/Kconfig
===
--- linux-2.6-lttng.orig/arch/ppc/Kconfig   2008-02-01 07:37:17.0 
-0500
+++ linux-2.6-lttng/arch/ppc/Kconfig2008-02-01 07:39:09.0 -0500
@@ -43,6 +43,7 @@ config PPC
bool
default y
select HAVE_OPROFILE
+   select HAVE_KPROBES
 
 config PPC32
bool
Index: linux-2.6-lttng/arch/s390/Kconfig
===
--- linux-2.6-lttng.orig/arch/s390/Kconfig  2008-02-01 07:37:17.0 
-0500
+++ linux-2.6-lttng/arch/s390/Kconfig   2008-02-01 07:39:09.0 -0500
@@ -52,6 +52,7 @@ mainmenu "Linux Kernel Configuration"
 config S390
def_bool y
select HAVE_OPROFILE
+   select HAVE_KPROBES
 
 source "init/Kconfig"
 
Index: linux-2.6-lttng/arch/sparc64/Kconfig
===
--- linux-2.6-lttng.orig/arch/sparc64/Kconfig   2008-02-01 07:37:17.0 
-0500
+++ linux-2.6-lttng/arch/sparc64/Kconfig2008-02-01 07:39:09.0 
-0500
@@ -9,6 +9,7 @@ config SPARC
bool
default y
select HAVE_OPROFILE
+   select HAVE_KPROBES
 
 config SPARC64
bool
Index: linux-2.6-lttng/kernel/Kconfig.instrumentation
===
--- linux-2.6-lttng.orig/kernel/Kconfig.instrumentation 2008-02-01 
07:37:17.0

[patch 3/5] Add HAVE_OPROFILE

2008-02-02 Thread Mathieu Desnoyers
Linus:
On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like

depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32

really shouldn't exist in a file like kernel/Kconfig.instrumentation.

It would be much better to do

depends on ARCH_SUPPORTS_KPROBES

in that generic file, and then architectures that do support it would just
have a

bool ARCH_SUPPORTS_KPROBES
default y

in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...

Changelog:

Actually, I know I gave this as the magic incantation, but now that I see
it, I realize that I should have told you to just use

config ARCH_SUPPORTS_KPROBES
def_bool y

instead, which is a bit denser.

We seem to use both kinds of syntax for these things, but this is really
what "def_bool" is there for...

Changelog :

- Moving to HAVE_*.
- Add AVR32 oprofile.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Sam Ravnborg <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Cc: Haavard Skinnemoen <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>
Cc: Jeff Dike <[EMAIL PROTECTED]>
Cc: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
---
 arch/alpha/Kconfig |1 +
 arch/arm/Kconfig   |1 +
 arch/avr32/Kconfig |4 +---
 arch/blackfin/Kconfig  |1 +
 arch/ia64/Kconfig  |1 +
 arch/m32r/Kconfig  |1 +
 arch/mips/Kconfig  |1 +
 arch/parisc/Kconfig|1 +
 arch/powerpc/Kconfig   |1 +
 arch/ppc/Kconfig   |1 +
 arch/s390/Kconfig  |1 +
 arch/sh/Kconfig|1 +
 arch/sparc/Kconfig |1 +
 arch/sparc64/Kconfig   |1 +
 arch/x86/Kconfig   |5 +
 kernel/Kconfig.instrumentation |5 -
 16 files changed, 19 insertions(+), 8 deletions(-)

Index: linux-2.6-lttng/kernel/Kconfig.instrumentation
===
--- linux-2.6-lttng.orig/kernel/Kconfig.instrumentation 2008-02-02 
15:05:46.0 -0500
+++ linux-2.6-lttng/kernel/Kconfig.instrumentation  2008-02-02 
15:06:13.0 -0500
@@ -21,7 +21,7 @@ config PROFILING
 config OPROFILE
tristate "OProfile system profiling (EXPERIMENTAL)"
depends on PROFILING && !UML
-   depends on ARCH_SUPPORTS_OPROFILE || ALPHA || ARM || BLACKFIN || IA64 
|| M32R || PARISC || PPC || S390 || SUPERH || SPARC
+   depends on HAVE_OPROFILE
help
  OProfile is a profiling system capable of profiling the
  whole system, include the kernel, kernel modules, libraries,
@@ -29,6 +29,9 @@ config OPROFILE
 
  If unsure, say N.
 
+config HAVE_OPROFILE
+   def_bool n
+
 config KPROBES
bool "Kprobes"
depends on KALLSYMS && MODULES && !UML
Index: linux-2.6-lttng/arch/alpha/Kconfig
===
--- linux-2.6-lttng.orig/arch/alpha/Kconfig 2008-02-02 14:55:45.0 
-0500
+++ linux-2.6-lttng/arch/alpha/Kconfig  2008-02-02 15:06:13.0 -0500
@@ -5,6 +5,7 @@
 config ALPHA
bool
default y
+   select HAVE_OPROFILE
help
  The Alpha is a 64-bit general-purpose processor designed and
  marketed by the Digital Equipment Corporation of blessed memory,
Index: linux-2.6-lttng/arch/arm/Kconfig
===
--- linux-2.6-lttng.orig/arch/arm/Kconfig   2008-02-02 15:05:46.0 
-0500
+++ linux-2.6-lttng/arch/arm/Kconfig2008-02-02 15:06:13.0 -0500
@@ -10,6 +10,7 @@ config ARM
default y
select RTC_LIB
select SYS_SUPPORTS_APM_EMULATION
+   select HAVE_OPROFILE
help
  The ARM series is a line of low-power-consumption RISC chip designs
  licensed by ARM Ltd and targeted at embedded applications and
Index: linux-2.6-lttng/arch/blackfin/Kconfig
===
--- linux-2.6-lttng.orig/arch/blackfin/Kconfig  2008-02-02 14:55:45.0 
-0500
+++ linux-2.6-lttng/arch/blackfin/Kconfig   2008-02-02 15:06:13.0 
-0500
@@ -24,6 +24,7 @@ config RWSEM_XCHGADD_ALGORITHM
 config BLACKFIN
bool
default y
+   select HAVE_OPROFILE
 
 config ZONE_DMA
bool
Index: linux-2.6-lttng/arch/ia64/Kconfig
===
--- linux-2.6-lttng.orig/arch/ia64/Kconfig  2008-02-02 14:55:45.0 
-0500
+++ linux-2.6-

[patch 1/5] Fix ARM to play nicely with generic Instrumentation menu

2008-02-02 Thread Mathieu Desnoyers
The conflicting commit for 
move-kconfiginstrumentation-to-arch-kconfig-and-init-kconfig.patch
is the ARM fix from Linus :

commit 38ad9aebe70dc72df08851bbd1620d89329129ba

He just seemed to agree that my approach (just putting the missing ARM
config options in arch/arm/Kconfig) works too. The main advantage it has
is that it is smaller, does not need a cleanup in the future and does
not break the following patches unnecessarily.

It's just been discussed here

http://lkml.org/lkml/2008/1/15/267

However, Linus might prefer to stay with his own patch and I would
totally understand it that late in the release cycle. Therefore I submit
this for the next release cycle.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Sam Ravnborg <[EMAIL PROTECTED]>
Cc: Jeff Dike <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>
Cc: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
CC: Russell King <[EMAIL PROTECTED]>
---
 arch/arm/Kconfig |   19 +++
 arch/arm/Kconfig.instrumentation |   62 ---
 kernel/Kconfig.instrumentation   |4 +-
 3 files changed, 20 insertions(+), 65 deletions(-)

Index: linux-2.6-lttng/arch/arm/Kconfig
===
--- linux-2.6-lttng.orig/arch/arm/Kconfig   2008-02-02 14:36:22.0 
-0500
+++ linux-2.6-lttng/arch/arm/Kconfig2008-02-02 14:37:00.0 -0500
@@ -135,6 +135,23 @@ config FIQ
 config ARCH_MTD_XIP
bool
 
+if OPROFILE
+
+config OPROFILE_ARMV6
+   def_bool y
+   depends on CPU_V6 && !SMP
+   select OPROFILE_ARM11_CORE
+
+config OPROFILE_MPCORE
+   def_bool y
+   depends on CPU_V6 && SMP
+   select OPROFILE_ARM11_CORE
+
+config OPROFILE_ARM11_CORE
+   bool
+
+endif
+
 config VECTORS_BASE
hex
default 0x if MMU || CPU_HIGH_VECTOR
@@ -1128,7 +1145,7 @@ endmenu
 
 source "fs/Kconfig"
 
-source "arch/arm/Kconfig.instrumentation"
+source "kernel/Kconfig.instrumentation"
 
 source "arch/arm/Kconfig.debug"
 
Index: linux-2.6-lttng/kernel/Kconfig.instrumentation
===
--- linux-2.6-lttng.orig/kernel/Kconfig.instrumentation 2008-02-02 
14:36:22.0 -0500
+++ linux-2.6-lttng/kernel/Kconfig.instrumentation  2008-02-02 
14:37:00.0 -0500
@@ -32,7 +32,7 @@ config OPROFILE
 config KPROBES
bool "Kprobes"
depends on KALLSYMS && MODULES && !UML
-   depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32
+   depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32 
|| (ARM && !XIP_KERNEL)
help
  Kprobes allows you to trap at almost any kernel address and
  execute a callback function.  register_kprobe() establishes
Index: linux-2.6-lttng/arch/arm/Kconfig.instrumentation
===
--- linux-2.6-lttng.orig/arch/arm/Kconfig.instrumentation   2008-02-02 
14:36:22.0 -0500
+++ /dev/null   1970-01-01 00:00:00.0 +
@@ -1,62 +0,0 @@
-menuconfig INSTRUMENTATION
-   bool "Instrumentation Support"
-   default y
-   ---help---
- Say Y here to get to see options related to performance measurement,
- system-wide debugging, and testing. This option alone does not add any
- kernel code.
-
- If you say N, all options in this submenu will be skipped and
- disabled. If you're trying to debug the kernel itself, go see the
- Kernel Hacking menu.
-
-if INSTRUMENTATION
-
-config PROFILING
-   bool "Profiling support (EXPERIMENTAL)"
-   help
- Say Y here to enable the extended profiling support mechanisms used
- by profilers such as OProfile.
-
-config OPROFILE
-   tristate "OProfile system profiling (EXPERIMENTAL)"
-   depends on PROFILING && !UML
-   help
- OProfile is a profiling system capable of profiling the
- whole system, include the kernel, kernel modules, libraries,
- and applications.
-
- If unsure, say N.
-
-config OPROFILE_ARMV6
-   bool
-   depends on OPROFILE && CPU_V6 && !SMP
-   default y
-   select OPROFILE_ARM11_CORE
-
-config OPROFILE_MPCORE
-   bool
-   depends on OPROFILE && CPU_V6 && SMP
-   default y
-   select OPROFILE_ARM11_CORE
-
-config OPROFILE_ARM11_CORE
-   bool
-
-config KPROBES
-   bool "Kprobes"
-   depends on KALLSYMS && MODULES && !UML && !XIP_KERNEL
-   help
- Kprobes allows you to trap at almost any kernel address and
- execute a callback function.  register_kprobe() establishes
- a probepoint and specifies the callback.  Kprobe

[patch 0/3] Kprobes mutex cleanup

2008-02-02 Thread Mathieu Desnoyers
This patchset cleans the kprobes mutexes. It makes usage of the Text Edit Lock
possible.

It applies on top of 2.6.24-git12 in this order :

kprobes-use-mutex-for-insn-pages.patch
kprobes-dont-use-kprobes-mutex-in-arch-code.patch
kprobes-declare-kprobes-mutex-static.patch

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 2/3] Kprobes - do not use kprobes mutex in arch code

2008-02-02 Thread Mathieu Desnoyers
Remove the kprobes mutex from kprobes.h, since it does not belong there. Also
remove all use of this mutex in the architecture specific code, replacing it by
a proper mutex lock/unlock in the architecture agnostic code.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Acked-by: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
---
 arch/ia64/kernel/kprobes.c|2 --
 arch/powerpc/kernel/kprobes.c |2 --
 arch/s390/kernel/kprobes.c|2 --
 arch/x86/kernel/kprobes.c |2 --
 include/linux/kprobes.h   |2 --
 kernel/kprobes.c  |2 ++
 6 files changed, 2 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/include/linux/kprobes.h
===
--- linux-2.6-lttng.orig/include/linux/kprobes.h2008-01-30 
09:06:02.0 -0500
+++ linux-2.6-lttng/include/linux/kprobes.h 2008-01-30 09:15:48.0 
-0500
@@ -35,7 +35,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_KPROBES
 #include 
@@ -192,7 +191,6 @@ static inline int init_test_probes(void)
 #endif /* CONFIG_KPROBES_SANITY_TEST */
 
 extern spinlock_t kretprobe_lock;
-extern struct mutex kprobe_mutex;
 extern int arch_prepare_kprobe(struct kprobe *p);
 extern void arch_arm_kprobe(struct kprobe *p);
 extern void arch_disarm_kprobe(struct kprobe *p);
Index: linux-2.6-lttng/arch/x86/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/arch/x86/kernel/kprobes.c  2008-01-30 
09:06:01.0 -0500
+++ linux-2.6-lttng/arch/x86/kernel/kprobes.c   2008-01-30 09:15:48.0 
-0500
@@ -376,9 +376,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-   mutex_lock(&kprobe_mutex);
free_insn_slot(p->ainsn.insn, (p->ainsn.boostable == 1));
-   mutex_unlock(&kprobe_mutex);
 }
 
 static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb)
Index: linux-2.6-lttng/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/kernel/kprobes.c   2008-01-30 09:14:29.0 
-0500
+++ linux-2.6-lttng/kernel/kprobes.c2008-01-30 09:15:48.0 -0500
@@ -644,7 +644,9 @@ valid_p:
list_del_rcu(&p->list);
kfree(old_p);
}
+   mutex_lock(&kprobe_mutex);
arch_remove_kprobe(p);
+   mutex_unlock(&kprobe_mutex);
} else {
mutex_lock(&kprobe_mutex);
if (p->break_handler)
Index: linux-2.6-lttng/arch/ia64/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/arch/ia64/kernel/kprobes.c 2008-01-30 
09:04:21.0 -0500
+++ linux-2.6-lttng/arch/ia64/kernel/kprobes.c  2008-01-30 09:15:48.0 
-0500
@@ -582,9 +582,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-   mutex_lock(&kprobe_mutex);
free_insn_slot(p->ainsn.insn, 0);
-   mutex_unlock(&kprobe_mutex);
 }
 /*
  * We are resuming execution after a single step fault, so the pt_regs
Index: linux-2.6-lttng/arch/powerpc/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/arch/powerpc/kernel/kprobes.c  2008-01-30 
09:04:21.0 -0500
+++ linux-2.6-lttng/arch/powerpc/kernel/kprobes.c   2008-01-30 
09:15:48.0 -0500
@@ -88,9 +88,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-   mutex_lock(&kprobe_mutex);
free_insn_slot(p->ainsn.insn, 0);
-   mutex_unlock(&kprobe_mutex);
 }
 
 static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs 
*regs)
Index: linux-2.6-lttng/arch/s390/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/arch/s390/kernel/kprobes.c 2008-01-30 
09:04:21.0 -0500
+++ linux-2.6-lttng/arch/s390/kernel/kprobes.c  2008-01-30 09:15:48.0 
-0500
@@ -220,9 +220,7 @@ void __kprobes arch_disarm_kprobe(struct
 
 void __kprobes arch_remove_kprobe(struct kprobe *p)
 {
-   mutex_lock(&kprobe_mutex);
free_insn_slot(p->ainsn.insn, 0);
-   mutex_unlock(&kprobe_mutex);
 }
 
 static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs 
*regs)

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 1/3] Kprobes - use a mutex to protect the instruction pages list.

2008-02-02 Thread Mathieu Desnoyers
Protect the instruction pages list with a specific insn pages mutex, taken in
get_insn_slot() and free_insn_slot(). It makes sure that architectures that do
not need to call arch_remove_kprobe() do not take an unneeded kprobes mutex.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Acked-by: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
---
 kernel/kprobes.c |   27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/kernel/kprobes.c   2007-08-27 11:48:56.0 
-0400
+++ linux-2.6-lttng/kernel/kprobes.c2007-08-27 11:48:58.0 -0400
@@ -95,6 +95,10 @@ enum kprobe_slot_state {
SLOT_USED = 2,
 };
 
+/*
+ * Protects the kprobe_insn_pages list. Can nest into kprobe_mutex.
+ */
+static DEFINE_MUTEX(kprobe_insn_mutex);
 static struct hlist_head kprobe_insn_pages;
 static int kprobe_garbage_slots;
 static int collect_garbage_slots(void);
@@ -131,7 +135,9 @@ kprobe_opcode_t __kprobes *get_insn_slot
 {
struct kprobe_insn_page *kip;
struct hlist_node *pos;
+   kprobe_opcode_t *ret;
 
+   mutex_lock(&kprobe_insn_mutex);
  retry:
hlist_for_each_entry(kip, pos, &kprobe_insn_pages, hlist) {
if (kip->nused < INSNS_PER_PAGE) {
@@ -140,7 +146,8 @@ kprobe_opcode_t __kprobes *get_insn_slot
if (kip->slot_used[i] == SLOT_CLEAN) {
kip->slot_used[i] = SLOT_USED;
kip->nused++;
-   return kip->insns + (i * MAX_INSN_SIZE);
+   ret = kip->insns + (i * MAX_INSN_SIZE);
+   goto end;
}
}
/* Surprise!  No unused slots.  Fix kip->nused. */
@@ -154,8 +161,10 @@ kprobe_opcode_t __kprobes *get_insn_slot
}
/* All out of space.  Need to allocate a new page. Use slot 0. */
kip = kmalloc(sizeof(struct kprobe_insn_page), GFP_KERNEL);
-   if (!kip)
-   return NULL;
+   if (!kip) {
+   ret = NULL;
+   goto end;
+   }
 
/*
 * Use module_alloc so this page is within +/- 2GB of where the
@@ -165,7 +174,8 @@ kprobe_opcode_t __kprobes *get_insn_slot
kip->insns = module_alloc(PAGE_SIZE);
if (!kip->insns) {
kfree(kip);
-   return NULL;
+   ret = NULL;
+   goto end;
}
INIT_HLIST_NODE(&kip->hlist);
hlist_add_head(&kip->hlist, &kprobe_insn_pages);
@@ -173,7 +183,10 @@ kprobe_opcode_t __kprobes *get_insn_slot
kip->slot_used[0] = SLOT_USED;
kip->nused = 1;
kip->ngarbage = 0;
-   return kip->insns;
+   ret = kip->insns;
+end:
+   mutex_unlock(&kprobe_insn_mutex);
+   return ret;
 }
 
 /* Return 1 if all garbages are collected, otherwise 0. */
@@ -207,7 +220,7 @@ static int __kprobes collect_garbage_slo
struct kprobe_insn_page *kip;
struct hlist_node *pos, *next;
 
-   /* Ensure no-one is preepmted on the garbages */
+   /* Ensure no-one is preempted on the garbages */
if (check_safety() != 0)
return -EAGAIN;
 
@@ -231,6 +244,7 @@ void __kprobes free_insn_slot(kprobe_opc
struct kprobe_insn_page *kip;
struct hlist_node *pos;
 
+   mutex_lock(&kprobe_insn_mutex);
hlist_for_each_entry(kip, pos, &kprobe_insn_pages, hlist) {
if (kip->insns <= slot &&
slot < kip->insns + (INSNS_PER_PAGE * MAX_INSN_SIZE)) {
@@ -247,6 +261,7 @@ void __kprobes free_insn_slot(kprobe_opc
 
if (dirty && ++kprobe_garbage_slots > INSNS_PER_PAGE)
collect_garbage_slots();
+   mutex_unlock(&kprobe_insn_mutex);
 }
 #endif
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 3/3] Kprobes - declare kprobe_mutex static

2008-02-02 Thread Mathieu Desnoyers
Since it will not be used by other kernel objects, it makes sense to declare it
static.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Acked-by: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
---
 kernel/kprobes.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/kernel/kprobes.c   2007-08-19 09:09:15.0 
-0400
+++ linux-2.6-lttng/kernel/kprobes.c2007-08-19 17:18:07.0 -0400
@@ -68,7 +68,7 @@ static struct hlist_head kretprobe_inst_
 /* NOTE: change this value only with kprobe_mutex held */
 static bool kprobe_enabled;
 
-DEFINE_MUTEX(kprobe_mutex);/* Protects kprobe_table */
+static DEFINE_MUTEX(kprobe_mutex); /* Protects kprobe_table */
 DEFINE_SPINLOCK(kretprobe_lock);   /* Protects kretprobe_inst_table */
 static DEFINE_PER_CPU(struct kprobe *, kprobe_instance) = NULL;
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 0/2] x86 : Enhance DEBUG_RODATA support

2008-02-02 Thread Mathieu Desnoyers
Hi Ingo,

These patches enhances the DEBUG_RODATA support of the x86 architecture.

They apply on top of 2.6.24-git12 in this order :

#Enhance DEBUG_RODATA support
x86-enhance-debug-rodata-support-alternatives.patch
x86-enhance-debug-rodata-support-for-hotplug-and-kprobes.patch

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 2/2] x86 - Enhance DEBUG_RODATA support for hotplug and kprobes

2008-02-02 Thread Mathieu Desnoyers
Standardize DEBUG_RODATA, removing special cases for hotplug and kprobes.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Andi Kleen <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
CC: Thomas Gleixner <[EMAIL PROTECTED]>
CC: Ingo Molnar <[EMAIL PROTECTED]>
CC: H. Peter Anvin <[EMAIL PROTECTED]>
CC: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/x86/mm/init_32.c |   24 
 arch/x86/mm/init_64.c |   20 +++-
 2 files changed, 11 insertions(+), 33 deletions(-)

Index: linux-2.6-lttng/arch/x86/mm/init_32.c
===
--- linux-2.6-lttng.orig/arch/x86/mm/init_32.c  2008-02-02 15:33:24.0 
-0500
+++ linux-2.6-lttng/arch/x86/mm/init_32.c   2008-02-02 15:33:34.0 
-0500
@@ -740,25 +740,17 @@ void mark_rodata_ro(void)
unsigned long start = PFN_ALIGN(_text);
unsigned long size = PFN_ALIGN(_etext) - start;
 
-#ifndef CONFIG_KPROBES
-#ifdef CONFIG_HOTPLUG_CPU
-   /* It must still be possible to apply SMP alternatives. */
-   if (num_possible_cpus() <= 1)
-#endif
-   {
-   set_pages_ro(virt_to_page(start), size >> PAGE_SHIFT);
-   printk(KERN_INFO "Write protecting the kernel text: %luk\n",
-   size >> 10);
+   set_pages_ro(virt_to_page(start), size >> PAGE_SHIFT);
+   printk(KERN_INFO "Write protecting the kernel text: %luk\n",
+   size >> 10);
 
 #ifdef CONFIG_CPA_DEBUG
-   printk(KERN_INFO "Testing CPA: Reverting %lx-%lx\n",
-   start, start+size);
-   set_pages_rw(virt_to_page(start), size>>PAGE_SHIFT);
+   printk(KERN_INFO "Testing CPA: Reverting %lx-%lx\n",
+   start, start+size);
+   set_pages_rw(virt_to_page(start), size>>PAGE_SHIFT);
 
-   printk(KERN_INFO "Testing CPA: write protecting again\n");
-   set_pages_ro(virt_to_page(start), size>>PAGE_SHIFT);
-#endif
-   }
+   printk(KERN_INFO "Testing CPA: write protecting again\n");
+   set_pages_ro(virt_to_page(start), size>>PAGE_SHIFT);
 #endif
start += size;
size = (unsigned long)__end_rodata - start;
Index: linux-2.6-lttng/arch/x86/mm/init_64.c
===
--- linux-2.6-lttng.orig/arch/x86/mm/init_64.c  2008-02-02 15:33:39.0 
-0500
+++ linux-2.6-lttng/arch/x86/mm/init_64.c   2008-02-02 15:33:59.0 
-0500
@@ -618,23 +618,8 @@ EXPORT_SYMBOL_GPL(rodata_test_data);
 
 void mark_rodata_ro(void)
 {
-   unsigned long start = (unsigned long)_stext, end;
-
-#ifdef CONFIG_HOTPLUG_CPU
-   /* It must still be possible to apply SMP alternatives. */
-   if (num_possible_cpus() > 1)
-   start = (unsigned long)_etext;
-#endif
-
-#ifdef CONFIG_KPROBES
-   start = (unsigned long)__start_rodata;
-#endif
-
-   end = (unsigned long)__end_rodata;
-   start = (start + PAGE_SIZE - 1) & PAGE_MASK;
-   end &= PAGE_MASK;
-   if (end <= start)
-   return;
+   unsigned long start = PFN_ALIGN(_stext),
+   end = PFN_ALIGN(__end_rodata);
 
set_memory_ro(start, (end - start) >> PAGE_SHIFT);
 
@@ -651,6 +636,7 @@ void mark_rodata_ro(void)
set_memory_ro(start, (end-start) >> PAGE_SHIFT);
 #endif
 }
+
 #endif
 
 #ifdef CONFIG_BLK_DEV_INITRD

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 1/2] x86 - Enhance DEBUG_RODATA support - alternatives

2008-02-02 Thread Mathieu Desnoyers
Fix a memcpy that should be a text_poke (in apply_alternatives).

Use kernel_wp_save/kernel_wp_restore in text_poke to support DEBUG_RODATA
correctly and so the CPU HOTPLUG special case can be removed.

Add text_poke_early, for alternatives and paravirt boot-time and module load
time patching.

Notes:
- A macro is used instead of an inline function to deal with circular header
  include otherwise necessary for read_cr0 and preempt_disable/enable.

Changelog:

- Fix text_set and text_poke alignment check (mixed up bitwise AND and OR)
- Remove text_set
- Export add_nops, so it can be used by others.
- Remove x86 test for "wp_works_ok", it will just be ignored by the architecture
  if not supported.
- Document text_poke_early.
- Remove clflush, since it breaks some VIA architectures and is not strictly
  necessary.
- Add kerneldoc to text_poke and text_poke_early.
- Remove arg cr0 from kernel_wp_save/restore. Change the macro name for
  kernel_wp_disable/enable.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Andi Kleen <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
CC: Thomas Gleixner <[EMAIL PROTECTED]>
CC: Ingo Molnar <[EMAIL PROTECTED]>
CC: H. Peter Anvin <[EMAIL PROTECTED]>
---
 arch/x86/kernel/alternative.c |   53 ++
 include/asm-x86/alternative.h |   36 +++-
 2 files changed, 79 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/arch/x86/kernel/alternative.c
===
--- linux-2.6-lttng.orig/arch/x86/kernel/alternative.c  2008-02-02 
14:06:43.0 -0500
+++ linux-2.6-lttng/arch/x86/kernel/alternative.c   2008-02-02 
14:07:29.0 -0500
@@ -173,7 +173,7 @@ static const unsigned char*const * find_
 #endif /* CONFIG_X86_64 */
 
 /* Use this to add nops to a buffer, then text_poke the whole buffer. */
-static void add_nops(void *insns, unsigned int len)
+void add_nops(void *insns, unsigned int len)
 {
const unsigned char *const *noptable = find_nop_table();
 
@@ -186,6 +186,7 @@ static void add_nops(void *insns, unsign
len -= noplen;
}
 }
+EXPORT_SYMBOL_GPL(add_nops);
 
 extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern u8 *__smp_locks[], *__smp_locks_end[];
@@ -219,7 +220,7 @@ void apply_alternatives(struct alt_instr
memcpy(insnbuf, a->replacement, a->replacementlen);
add_nops(insnbuf + a->replacementlen,
 a->instrlen - a->replacementlen);
-   text_poke(instr, insnbuf, a->instrlen);
+   text_poke_early(instr, insnbuf, a->instrlen);
}
 }
 
@@ -407,7 +408,7 @@ void apply_paravirt(struct paravirt_patc
 
/* Pad the rest with nops */
add_nops(insnbuf + used, p->len - used);
-   text_poke(p->instr, insnbuf, p->len);
+   text_poke_early(p->instr, insnbuf, p->len);
}
 }
 extern struct paravirt_patch_site __start_parainstructions[],
@@ -471,18 +472,52 @@ void __init alternative_instructions(voi
 #endif
 }
 
-/*
- * Warning:
+/**
+ * text_poke_early - Update instructions on a live kernel at boot time
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy
+ *
  * When you use this code to patch more than one byte of an instruction
  * you need to make sure that other CPUs cannot execute this code in parallel.
- * Also no thread must be currently preempted in the middle of these 
instructions.
- * And on the local CPU you need to be protected again NMI or MCE handlers
- * seeing an inconsistent instruction while you patch.
+ * Also no thread must be currently preempted in the middle of these
+ * instructions.  And on the local CPU you need to be protected again NMI or 
MCE
+ * handlers seeing an inconsistent instruction while you patch.
+ * Warning: read_cr0 is modified by paravirt, this is why we have _early
+ * versions. They are not in the __init section because they can be used at
+ * module load time.
  */
-void __kprobes text_poke(void *addr, unsigned char *opcode, int len)
+void *text_poke_early(void *addr, const void *opcode, size_t len)
 {
memcpy(addr, opcode, len);
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
   that causes hangs on some VIA CPUs. */
+   return addr;
 }
+
+/**
+ * text_poke - Update instructions on a live kernel
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy
+ *
+ * Only atomic text poke/set should be allowed when not doing early patching.
+ * It means the size must be writable atomically and the address must be 
aligned
+ * in a way that permits an atomic write.
+ */
+void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
+{
+   BUG_ON(len > sizeof(long));
+   BUG_ONlong)addr + len

[patch 1/2] Text Edit Lock - Architecture Independent Code

2008-02-02 Thread Mathieu Desnoyers
This is an architecture-independent synchronization around kernel text
modifications, through use of a global mutex.

A mutex has been chosen so that kprobes, the main user of this, can sleep during
memory allocation between the memory read of the instructions it must replace
and the memory write of the breakpoint.

Another user of this interface: immediate values.

Paravirt and alternatives are always done when SMP is inactive, so there is no
need to use locks.
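
As a hedged illustration (not part of the patch), a code-patching client would
simply bracket its write with the new lock; text_poke() here is the helper from
the DEBUG_RODATA series above, and patch_one_site() is a hypothetical caller:

#include <linux/memory.h>	/* kernel_text_lock()/unlock(), per this patch */
#include <linux/types.h>
#include <asm/alternative.h>	/* text_poke(), per the DEBUG_RODATA patches */

/* Hypothetical caller: serialize one live code patch against other patchers. */
static void patch_one_site(void *addr, const void *opcode, size_t len)
{
	kernel_text_lock();
	text_poke(addr, opcode, len);
	kernel_text_unlock();
}

Because the lock is a mutex, the caller may sleep between lock and unlock
(e.g. to allocate an instruction slot), which is what the kprobes patch in
this series relies on.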

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Andi Kleen <[EMAIL PROTECTED]>
CC: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/memory.h |7 +++
 mm/memory.c|   34 ++
 2 files changed, 41 insertions(+)

Index: linux-2.6-lttng/include/linux/memory.h
===
--- linux-2.6-lttng.orig/include/linux/memory.h 2008-01-30 09:04:21.0 
-0500
+++ linux-2.6-lttng/include/linux/memory.h  2008-01-30 09:16:48.0 
-0500
@@ -93,4 +93,11 @@ extern int memory_notify(unsigned long v
 #define hotplug_memory_notifier(fn, pri) do { } while (0)
 #endif
 
+/*
+ * Take and release the kernel text modification lock, used for code patching.
+ * Users of this lock can sleep.
+ */
+extern void kernel_text_lock(void);
+extern void kernel_text_unlock(void);
+
 #endif /* _LINUX_MEMORY_H_ */
Index: linux-2.6-lttng/mm/memory.c
===
--- linux-2.6-lttng.orig/mm/memory.c2008-01-30 09:06:02.0 -0500
+++ linux-2.6-lttng/mm/memory.c 2008-01-30 09:16:48.0 -0500
@@ -50,6 +50,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -84,6 +86,12 @@ EXPORT_SYMBOL(high_memory);
 
 int randomize_va_space __read_mostly = 1;
 
+/*
+ * mutex protecting text section modification (dynamic code patching).
+ * some users need to sleep (allocating memory...) while they hold this lock.
+ */
+static DEFINE_MUTEX(text_mutex);
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2785,3 +2793,29 @@ void print_vma_addr(char *prefix, unsign
}
up_read(&current->mm->mmap_sem);
 }
+
+/**
+ * kernel_text_lock -   Take the kernel text modification lock
+ *
+ * Insures mutual write exclusion of kernel and modules text live text
+ * modification. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_lock(void)
+{
+   mutex_lock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_lock);
+
+/**
+ * kernel_text_unlock   -   Release the kernel text modification lock
+ *
+ * Insures mutual write exclusion of kernel and modules text live text
+ * modification. Should be used for code patching.
+ * Users of this lock can sleep.
+ */
+void __kprobes kernel_text_unlock(void)
+{
+   mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(kernel_text_unlock);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 0/2] Text Edit Lock

2008-02-02 Thread Mathieu Desnoyers
Hi Andrew,

Here is a resend of the Text Edit Lock patchset (only 2 patches). It consists of
a mutex that serializes code modification. Used by kprobes and the immediate
values.

Dependencies :

#Kprobes mutex cleanup
kprobes-use-mutex-for-insn-pages.patch
kprobes-dont-use-kprobes-mutex-in-arch-code.patch
kprobes-declare-kprobes-mutex-static.patch
#Enhance DEBUG_RODATA support
x86-enhance-debug-rodata-support-alternatives.patch
x86-enhance-debug-rodata-support-for-hotplug-and-kprobes.patch

It applies on top of 2.6.24-git12 in this order:

#Text Edit Lock (depends on Enhance DEBUG_RODATA and kprobes mutex cleanup)
text-edit-lock-architecture-independent-code.patch
text-edit-lock-kprobes-architecture-independent-support.patch


Mathieu



-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 2/2] Text Edit Lock - kprobes architecture independent support

2008-02-02 Thread Mathieu Desnoyers
Use the mutual exclusion provided by the text edit lock in the kprobes code. It
allows coherent manipulation of the kernel code by other subsystems.

Changelog:

Move the kernel_text_lock/unlock out of the for loops.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Acked-by: Ananth N Mavinakayanahalli <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
CC: Roel Kluin <[EMAIL PROTECTED]>
---
 kernel/kprobes.c |   19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/kernel/kprobes.c
===
--- linux-2.6-lttng.orig/kernel/kprobes.c   2008-01-30 09:16:42.0 
-0500
+++ linux-2.6-lttng/kernel/kprobes.c2008-01-30 09:19:28.0 -0500
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -568,9 +569,10 @@ static int __kprobes __register_kprobe(s
goto out;
}
 
+   kernel_text_lock();
ret = arch_prepare_kprobe(p);
if (ret)
-   goto out;
+   goto out_unlock_text;
 
INIT_HLIST_NODE(&p->hlist);
hlist_add_head_rcu(&p->hlist,
@@ -578,7 +580,8 @@ static int __kprobes __register_kprobe(s
 
if (kprobe_enabled)
arch_arm_kprobe(p);
-
+out_unlock_text:
+   kernel_text_unlock();
 out:
mutex_unlock(&kprobe_mutex);
 
@@ -621,8 +624,11 @@ valid_p:
 * enabled - otherwise, the breakpoint would already have
 * been removed. We save on flushing icache.
 */
-   if (kprobe_enabled)
+   if (kprobe_enabled) {
+   kernel_text_lock();
arch_disarm_kprobe(p);
+   kernel_text_unlock();
+   }
hlist_del_rcu(&old_p->hlist);
cleanup_p = 1;
} else {
@@ -644,9 +650,7 @@ valid_p:
list_del_rcu(&p->list);
kfree(old_p);
}
-   mutex_lock(&kprobe_mutex);
arch_remove_kprobe(p);
-   mutex_unlock(&kprobe_mutex);
} else {
mutex_lock(&kprobe_mutex);
if (p->break_handler)
@@ -717,7 +721,6 @@ static int __kprobes pre_handler_kretpro
ri->rp = rp;
ri->task = current;
arch_prepare_kretprobe(ri, regs);
-
/* XXX(hch): why is there no hlist_move_head? */
hlist_del(&ri->uflist);
hlist_add_head(&ri->uflist, &ri->rp->used_instances);
@@ -940,11 +943,13 @@ static void __kprobes enable_all_kprobes
if (kprobe_enabled)
goto already_enabled;
 
+   kernel_text_lock();
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist)
arch_arm_kprobe(p);
}
+   kernel_text_unlock();
 
kprobe_enabled = true;
printk(KERN_INFO "Kprobes globally enabled\n");
@@ -969,6 +974,7 @@ static void __kprobes disable_all_kprobe
 
kprobe_enabled = false;
printk(KERN_INFO "Kprobes globally disabled\n");
+   kernel_text_lock();
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
hlist_for_each_entry_rcu(p, node, head, hlist) {
@@ -976,6 +982,7 @@ static void __kprobes disable_all_kprobe
arch_disarm_kprobe(p);
}
}
+   kernel_text_unlock();
 
mutex_unlock(&kprobe_mutex);
/* Allow all currently running kprobes to complete */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 0/7] Immediate Values

2008-02-02 Thread Mathieu Desnoyers
Hi Andrew,

Here are the updated immediate values for 2.6.24-git12.

Dependencies :

# instrumentation menu removal #merged in kbuild.git
fix-arm-to-play-nicely-with-generic-instrumentation-menu.patch
create-arch-kconfig.patch
add-have-oprofile.patch
add-have-kprobes.patch
move-kconfiginstrumentation-to-arch-kconfig-and-init-kconfig.patch
#
#Kprobes mutex cleanup
kprobes-use-mutex-for-insn-pages.patch
kprobes-dont-use-kprobes-mutex-in-arch-code.patch
kprobes-declare-kprobes-mutex-static.patch
#Enhance DEBUG_RODATA support
x86-enhance-debug-rodata-support-alternatives.patch
x86-enhance-debug-rodata-support-for-hotplug-and-kprobes.patch
#Text Edit Lock (depends on Enhance DEBUG_RODATA and kprobes mutex cleanup)
text-edit-lock-architecture-independent-code.patch
text-edit-lock-kprobes-architecture-independent-support.patch

The patchset applies in the following order :

#Immediate Values
immediate-values-architecture-independent-code.patch
immediate-values-kconfig-menu-in-embedded.patch
immediate-values-x86-optimization.patch
add-text-poke-and-sync-core-to-powerpc.patch
immediate-values-powerpc-optimization.patch
immediate-values-documentation.patch
#
scheduler-profiling-use-immediate-values.patch
#

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 1/7] Immediate Values - Architecture Independent Code

2008-02-02 Thread Mathieu Desnoyers
Immediate values are used as read-mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. This
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _imv_read() version, which uses standard global
variables, and optimized per-architecture imv_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__imv" to place the pointers to the enable
value. The immediate value activation functions sit in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the associated immediate
values, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. This is the case if the variable it refers to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.
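
For illustration, here is a minimal usage sketch of the API this patch
introduces. The flag and function names below are made up; see
Documentation/immediate.txt in this series for the authoritative description.

	#include <linux/immediate.h>

	DEFINE_IMV(char, trace_enable);		/* read-mostly flag, patched in place */
	EXPORT_IMV_SYMBOL(trace_enable);

	void do_work(void)
	{
		/*
		 * On optimized architectures the test compiles to a
		 * load-immediate, so reading the flag touches no data
		 * cache line.
		 */
		if (unlikely(imv_read(trace_enable)))
			do_slow_path();		/* made-up, rarely enabled code */
	}

	void enable_slow_path(void)
	{
		imv_set(trace_enable, 1);	/* patches every imv_read() site */
	}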

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result demonstrates the potential of this infrastructure by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache thrashing:

Random walk L1 and L2 thrashing surrounding a getppid() call:
(note: in this test, do_syscall_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of about 10 times just to get the kernel variables
from memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time by roughly 3.5% (559 / 15589 cycles) in these
conditions.

Changelog:

- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
  - imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
  - Wait until we have gcc support before we add the imv_if macro, since
its form may have to change.
- Don't declare the __imv section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface : remove imv_set_early, keep track of kernel boot
  status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
  architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
---
 include/asm-generic/vmlinux.lds.h |3 
 include/linux/immediate.h |   94 +++
 include/linux/module.h|   16 +++
 init/main.c   |8 +
 kernel/Makefile   |1 
 kernel/immediate.c|  187 ++
 kernel/module.c   |   50 +-
 7 files changed, 358 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/linux/immediate.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-lttng/include/linux/immediate.h   2008-02-02 15:54:04.0 
-0500
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <[EMAIL PROTECTED]>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+   unsigned long var;  /* Pointer to the identifier variable of the
+* immediate value
+*/
+   unsigned long imv;  /*
+* Pointer to the memory location of the
+* immediate value within the instruction.
+*/
+   unsigned char size; /* Type size. */
+} __attribute__ ((packed));
+
+#include <asm/immediate.h>
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module

[patch 7/7] Scheduler Profiling - Use Immediate Values

2008-02-02 Thread Mathieu Desnoyers
Use immediate values, which have a lower d-cache hit cost in the optimized
version, as the condition for the scheduler profiling call.

Changelog :
- Use imv_* instead of immediate_*.
- Follow the white rabbit : kvm_main.c which becomes x86.c.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/x86/kvm/x86.c  |2 +-
 include/linux/profile.h |5 +++--
 kernel/profile.c|   22 +++---
 kernel/sched_fair.c |5 +
 4 files changed, 16 insertions(+), 18 deletions(-)

Index: linux-2.6-lttng/kernel/profile.c
===
--- linux-2.6-lttng.orig/kernel/profile.c   2008-02-01 07:32:04.0 
-0500
+++ linux-2.6-lttng/kernel/profile.c2008-02-01 07:43:02.0 -0500
@@ -42,8 +42,8 @@ static int (*timer_hook)(struct pt_regs 
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
-int prof_on __read_mostly;
-EXPORT_SYMBOL_GPL(prof_on);
+DEFINE_IMV(char, prof_on) __read_mostly;
+EXPORT_IMV_SYMBOL_GPL(prof_on);
 
 static cpumask_t prof_cpu_mask = CPU_MASK_ALL;
 #ifdef CONFIG_SMP
@@ -61,7 +61,7 @@ static int __init profile_setup(char *st
 
if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
-   prof_on = SLEEP_PROFILING;
+   imv_set(prof_on, SLEEP_PROFILING);
if (str[strlen(sleepstr)] == ',')
str += strlen(sleepstr) + 1;
if (get_option(&str, &par))
@@ -74,7 +74,7 @@ static int __init profile_setup(char *st
"kernel sleep profiling requires CONFIG_SCHEDSTATS\n");
 #endif /* CONFIG_SCHEDSTATS */
} else if (!strncmp(str, schedstr, strlen(schedstr))) {
-   prof_on = SCHED_PROFILING;
+   imv_set(prof_on, SCHED_PROFILING);
if (str[strlen(schedstr)] == ',')
str += strlen(schedstr) + 1;
if (get_option(&str, &par))
@@ -83,7 +83,7 @@ static int __init profile_setup(char *st
"kernel schedule profiling enabled (shift: %ld)\n",
prof_shift);
} else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
-   prof_on = KVM_PROFILING;
+   imv_set(prof_on, KVM_PROFILING);
if (str[strlen(kvmstr)] == ',')
str += strlen(kvmstr) + 1;
if (get_option(&str, &par))
@@ -93,7 +93,7 @@ static int __init profile_setup(char *st
prof_shift);
} else if (get_option(&str, &par)) {
prof_shift = par;
-   prof_on = CPU_PROFILING;
+   imv_set(prof_on, CPU_PROFILING);
printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
prof_shift);
}
@@ -104,7 +104,7 @@ __setup("profile=", profile_setup);
 
 void __init profile_init(void)
 {
-   if (!prof_on)
+   if (!_imv_read(prof_on))
return;
 
/* only text is profiled */
@@ -291,7 +291,7 @@ void profile_hits(int type, void *__pc, 
int i, j, cpu;
struct profile_hit *hits;
 
-   if (prof_on != type || !prof_buffer)
+   if (!prof_buffer)
return;
pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
@@ -401,7 +401,7 @@ void profile_hits(int type, void *__pc, 
 {
unsigned long pc;
 
-   if (prof_on != type || !prof_buffer)
+   if (!prof_buffer)
return;
pc = ((unsigned long)__pc - (unsigned long)_stext) >> prof_shift;
atomic_add(nr_hits, &prof_buffer[min(pc, prof_len - 1)]);
@@ -558,7 +558,7 @@ static int __init create_hash_tables(voi
}
return 0;
 out_cleanup:
-   prof_on = 0;
+   imv_set(prof_on, 0);
smp_mb();
on_each_cpu(profile_nop, NULL, 0, 1);
for_each_online_cpu(cpu) {
@@ -585,7 +585,7 @@ static int __init create_proc_profile(vo
 {
struct proc_dir_entry *entry;
 
-   if (!prof_on)
+   if (!_imv_read(prof_on))
return 0;
if (create_hash_tables())
return -1;
Index: linux-2.6-lttng/include/linux/profile.h
===
--- linux-2.6-lttng.orig/include/linux/profile.h2008-02-01 
07:32:04.0 -0500
+++ linux-2.6-lttng/include/linux/profile.h 2008-02-01 07:43:02.0 
-0500
@@ -7,10 +7,11 @@
 #include 
 #include 
 #include 
+#include <linux/immediate.h>
 
 #include 
 
-extern int prof_on __read_mostly;
+DECLARE_IMV(char, prof_on) __read_mostly;
 
 #define CPU_PROFILING  1
 #define SCHED_PROFILING2
@@ -38,7 +39,7 @@ static inline void profile_hit(int type,
/*
 * Speedup f

[patch 6/7] Immediate Values - Documentation

2008-02-02 Thread Mathieu Desnoyers
Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/immediate.txt |  221 
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-lttng/Documentation/immediate.txt 2008-02-01 07:42:01.0 
-0500
@@ -0,0 +1,221 @@
+   Using the Immediate Values
+
+       Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+An immediate value is used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(imv_read(var))) as condition surrounding the code. The
+smallest data type required for the test (an 8 bits char) is preferred, since
+some architectures, such as powerpc, only allow up to 16 bits immediate values.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use imv_set(this_immediate) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 10
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:  77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:+12.1369 cycles
+Memory load based test: +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 markers sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
+
+
+  * Markers impact under heavy memory load
+
+Running a kernel with my LTTng instrumentation set, in a test that
+generates memory pressure (from userspace) by thrashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+This test is done in user-space, so there are some delays due to IRQs
+coming and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results: Linear cache thrashing turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" thrash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 thrashing
+surrounding a getppid()

[patch 4/7] Add text_poke and sync_core to powerpc

2008-02-02 Thread Mathieu Desnoyers
- Needed on architectures where we must surround live instruction modification
  with "WP flag disable".
- Turns into a memcpy on powerpc since there is no WP flag activated for
  instruction pages (yet..).
- Add empty sync_core to powerpc so it can be used in architecture independent
  code.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Paul Mackerras <[EMAIL PROTECTED]>
---
 include/asm-powerpc/cacheflush.h |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/asm-powerpc/cacheflush.h
===
--- linux-2.6-lttng.orig/include/asm-powerpc/cacheflush.h   2007-11-19 
12:05:50.0 -0500
+++ linux-2.6-lttng/include/asm-powerpc/cacheflush.h2007-11-19 
13:27:36.0 -0500
@@ -63,7 +63,9 @@ extern void flush_dcache_phys_range(unsi
 #define copy_from_user_page(vma, page, vaddr, dst, src, len) \
memcpy(dst, src, len)
 
-
+#define text_poke  memcpy
+#define text_poke_earlytext_poke
+#define sync_core()
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 /* internal debugging function */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 5/7] Immediate Values - Powerpc Optimization

2008-02-02 Thread Mathieu Desnoyers
PowerPC optimization of the immediate values which uses a li instruction,
patched with an immediate value.

Changelog:
- Put imv_set and _imv_set in the architecture independent header.
- Pack the __imv section. Use smallest types required for size (char).
- Remove architecture specific update code : now handled by architecture
  agnostic code.
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Paul Mackerras <[EMAIL PROTECTED]>
---
 arch/powerpc/Kconfig|1 
 include/asm-powerpc/immediate.h |   55 
 2 files changed, 56 insertions(+)

Index: linux-2.6-lttng/include/asm-powerpc/immediate.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-lttng/include/asm-powerpc/immediate.h 2008-01-30 
09:33:42.0 -0500
@@ -0,0 +1,55 @@
+#ifndef _ASM_POWERPC_IMMEDIATE_H
+#define _ASM_POWERPC_IMMEDIATE_H
+
+/*
+ * Immediate values. PowerPC architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <[EMAIL PROTECTED]>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm-compat.h>
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _imv_read() instead.
+ */
+#define imv_read(name) \
+   ({  \
+   __typeof__(name##__imv) value;  \
+   BUILD_BUG_ON(sizeof(value) > 8);\
+   switch (sizeof(value)) {\
+   case 1: \
+   asm(".section __imv,\"a\",@progbits\n\t"\
+   PPC_LONG "%c1, ((1f)-1)\n\t"\
+   ".byte 1\n\t"   \
+   ".previous\n\t" \
+   "li %0,0\n\t"   \
+   "1:\n\t"\
+   : "=r" (value)  \
+   : "i" (&name##__imv));  \
+   break;  \
+   case 2: \
+   asm(".section __imv,\"a\",@progbits\n\t"\
+   PPC_LONG "%c1, ((1f)-2)\n\t"\
+   ".byte 2\n\t"   \
+   ".previous\n\t" \
+   "li %0,0\n\t"   \
+   "1:\n\t"\
+   : "=r" (value)  \
+   : "i" (&name##__imv));  \
+   break;  \
+   case 4: \
+   case 8: value = name##__imv;\
+   break;  \
+   };  \
+   value;  \
+   })
+
+#endif /* _ASM_POWERPC_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/powerpc/Kconfig
===
--- linux-2.6-lttng.orig/arch/powerpc/Kconfig   2008-01-30 09:14:21.0 
-0500
+++ linux-2.6-lttng/arch/powerpc/Kconfig    2008-01-30 09:33:42.0 
-0500
@@ -89,6 +89,7 @@ config PPC
default y
select HAVE_OPROFILE
select HAVE_KPROBES
+   select HAVE_IMMEDIATE
 
 config EARLY_PRINTK
bool

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 2/7] Immediate Values - Kconfig menu in EMBEDDED

2008-02-02 Thread Mathieu Desnoyers
Immediate values provide a way to use dynamic code patching to update variables
sitting within the instruction stream. It saves cache lines normally used by
static read-mostly variables. Enable it by default, but let users disable it
through the EMBEDDED menu with the "Disable immediate values" submenu entry.

Note: Since I think that I really should give embedded systems developers using
RO memory the option to disable the immediate values, I chose to leave this
menu option there, in the EMBEDDED menu. Also, the "CONFIG_IMMEDIATE" makes
sense because we want to compile out all the immediate code when we decide not
to use optimized immediate values at all (it removes otherwise unused code).

Changelog:
- Change ARCH_SUPPORTS_IMMEDIATE for ARCH_HAS_IMMEDIATE

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
CC: Adrian Bunk <[EMAIL PROTECTED]>
CC: Andi Kleen <[EMAIL PROTECTED]>
CC: Alexey Dobriyan <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
---
 init/Kconfig |   24 
 1 file changed, 24 insertions(+)

Index: linux-2.6-lttng/init/Kconfig
===
--- linux-2.6-lttng.orig/init/Kconfig   2008-01-29 15:26:54.0 -0500
+++ linux-2.6-lttng/init/Kconfig2008-01-29 15:27:17.0 -0500
@@ -444,6 +444,20 @@ config CC_OPTIMIZE_FOR_SIZE
 config SYSCTL
bool
 
+config IMMEDIATE
+   default y if !DISABLE_IMMEDIATE
+   depends on HAVE_IMMEDIATE
+   bool
+   help
+ Immediate values are used as read-mostly variables that are rarely
+ updated. They use code patching to modify the values inscribed in the
+ instruction stream. It provides a way to save precious cache lines
+ that would otherwise have to be used by these variables. They can be
+ disabled through the EMBEDDED menu.
+
+config HAVE_IMMEDIATE
+   def_bool n
+
 menuconfig EMBEDDED
bool "Configure standard kernel features (for small systems)"
help
@@ -679,6 +693,16 @@ config MARKERS
 
 source "arch/Kconfig"
 
+config DISABLE_IMMEDIATE
+   default y if EMBEDDED
+   bool "Disable immediate values" if EMBEDDED
+   depends on HAVE_IMMEDIATE
+   help
+ Disable code patching based immediate values for embedded systems. It
+ consumes slightly more memory and requires modifying the instruction
+ stream each time a variable is updated. Should really be disabled for
+ embedded systems with read-only text.
+
 endmenu# General setup
 
 config SLABINFO

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[patch 3/7] Immediate Values - x86 Optimization

2008-02-02 Thread Mathieu Desnoyers
x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the imv_mutex).
- Put imv_set and _imv_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.

Ok, so the most flexible solution that I see, that should fit for both
x86 and x86_64 would be :
1 byte  :   "=q" : "a", "b", "c", or "d" register for the i386.  For
   x86-64 it is equivalent to "r" class (for 8-bit
   instructions that do not use upper halves).
2, 4, 8 bytes : "=r" : A register operand is allowed provided that it is in a
   general register.

- "Redux" immediate values : no need to put a breakpoint, therefore, no
  need to know where the instruction starts. It's therefore OK to have a
  REX prefix.

- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
  immediate structure.
- Change the immediate.c update code to support variable length opcodes.
- Vastly simplified, using a busy looping IPI with interrupts disabled.
  Does not protect against NMI nor MCE.
- Pack the __imv section. Use smallest types required for size (char).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Andi Kleen <[EMAIL PROTECTED]>
CC: "H. Peter Anvin" <[EMAIL PROTECTED]>
CC: Chuck Ebbert <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
CC: Thomas Gleixner <[EMAIL PROTECTED]>
CC: Ingo Molnar <[EMAIL PROTECTED]>
CC: Rusty Russell <[EMAIL PROTECTED]>
---
 arch/x86/Kconfig|1 
 include/asm-x86/immediate.h |   77 
 2 files changed, 78 insertions(+)

Index: linux-2.6-lttng/include/asm-x86/immediate.h
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-lttng/include/asm-x86/immediate.h 2008-01-30 09:33:22.0 
-0500
@@ -0,0 +1,77 @@
+#ifndef _ASM_X86_IMMEDIATE_H
+#define _ASM_X86_IMMEDIATE_H
+
+/*
+ * Immediate values. x86 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <[EMAIL PROTECTED]>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm.h>
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _imv_read() instead.
+ * If size is bigger than the architecture long size, fall back on a memory
+ * read.
+ *
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * that will generate an instruction with an 8-byte immediate value (not the 
REX.W
+ * prefixed one that loads a sign extended 32 bits immediate value in a r64
+ * register).
+ */
+#define imv_read(name) \
+   ({  \
+   __typeof__(name##__imv) value;  \
+   BUILD_BUG_ON(sizeof(value) > 8);\
+   switch (sizeof(value)) {\
+   case 1: \
+   asm(".section __imv,\"a\",@progbits\n\t"\
+   _ASM_PTR "%c1, (3f)-%c2\n\t"\
+   ".byte %c2\n\t" \
+   ".previous\n\t" \
+   "mov $0,%0\n\t" \
+   "3:\n\t"\
+   : "=q" (value)  \
+   : "i" (&name##__imv),   \
+ "i" (sizeof(value))); \
+   break;  \
+   case 2: \
+   case 4: \
+   asm(".section __imv,\"a\",@progbits\n\t"\
+   _ASM_PTR

Re: [PATCH] kbuild: Fix instrumentation removal breakage on avr32

2008-02-04 Thread Mathieu Desnoyers
* Haavard Skinnemoen ([EMAIL PROTECTED]) wrote:
> On Sun, 3 Feb 2008 22:10:42 +0100
> Sam Ravnborg <[EMAIL PROTECTED]> wrote:
> 
> > Mathieu Desnoyers (5):
> >   Move Kconfig.instrumentation to arch/Kconfig and init/Kconfig
> 
> AVR32 still includes Kconfig.instrumentation, so it won't build after
> this...
> 
> I did point this out when the patch was submitted, I sent the avr32
> pull request early as promised (more than a week ago), but it still
> broke. Please apply the fix below. This fixes 2.6.24-mm1 too.
> 

This one slipped through.

Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

> Signed-off-by: Haavard Skinnemoen <[EMAIL PROTECTED]>
> ---
>  arch/avr32/Kconfig |2 --
>  1 file changed, 2 deletions(-)
> 
> Index: linux-2.6.24-mm1/arch/avr32/Kconfig
> ===
> --- linux-2.6.24-mm1.orig/arch/avr32/Kconfig  2008-02-04 12:28:31.0 
> +0100
> +++ linux-2.6.24-mm1/arch/avr32/Kconfig   2008-02-04 12:28:36.0 
> +0100
> @@ -236,8 +236,6 @@ source "drivers/Kconfig"
>  
>  source "fs/Kconfig"
>  
> -source "kernel/Kconfig.instrumentation"
> -
>  source "arch/avr32/Kconfig.debug"
>  
>  source "security/Kconfig"

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation

2008-02-04 Thread Mathieu Desnoyers
> Writer:
> 
> void add_op(struct myops *x) {
>   /* x->next may be garbage here */
>   x->next = global_p;
>   smp_wmb();
>   global_p = x;
> }
> 
> Reader:
> 
> void read_op(void)
> {
>   struct myops *p = global_p;
> 
>   while (p != NULL) {
>   p->func();
>   p = next;
>   /* if p->next is garbage we crash */
>   }
> }
> 
> 
> Here, we are missing the read_barrier_depends(). Lets look at the Alpha
> cache issue:
> 
> 
> reader reads the new version of global_p, and then reads the next
> pointer. But since the next pointer is on a different cacheline than
> global_p, it may have somehow had that in it's cache still. So it uses the
> old next pointer which contains the garbage.
> 
> Is that correct?
> 
> But I will have to admit, that I can't see how an aggressive compiler
> might have screwed this up. Being that x is a parameter, and the function
> add_op is not in a header file.
> 

Tell me if I am mistaken, but applying Paul's explanation to your
example would give (I unroll the loop for clarity) :

Writer:

void add_op(struct myops *x) {
/* x->next may be garbage here */
x->next = global_p;
smp_wmb();
global_p = x;
}

Reader:

void read_op(void)
{
struct myops *p = global_p;

  if (p != NULL) {
p->func();
p = p->next;
  /*
   * Suppose the compiler expects that p->next is likely to be equal to
   * p + sizeof(struct myops), uses r1 to store previous p, r2 to store the
   * next p and r3 to store the expected value. Let's look at what the
   * compiler could do for the next loop iteration.
   */
  r2 = r1->next   (1)
  r3 = r1 + sizeof(struct myops)
  r4 = r3->func   (2)
  if (r3 == r2 && r3 != NULL)
call r4

/* if p->next is garbage we crash */
} else
return;

  if (p != NULL) {
p->func();
p = p->next;
/* if p->next is garbage we crash */
} else
return;
  .
}

In this example, we would be reading the expected "r3->func" (2) before
reading the real r1->next (1) value if reads are issued out of order.

Paul, am I correct ? And.. does the specific loop optimization I just
described actually exist ?
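
(For reference, a sketch of how the reader side is usually written to close
this window -- using rcu_dereference(), which hides the needed
read_barrier_depends() -- assuming the add_op()/read_op() example quoted
above; the writer would correspondingly use rcu_assign_pointer() instead of
the open-coded smp_wmb():)

	void read_op(void)
	{
		struct myops *p;

		/*
		 * rcu_dereference() issues the data-dependency barrier that
		 * the original reader is missing, both for the initial load
		 * of global_p and for each p->next load.
		 */
		for (p = rcu_dereference(global_p); p != NULL;
		     p = rcu_dereference(p->next))
			p->func();
	}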

Thanks for your enlightenment :)

Mathieu

> -- Steve
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


[PATCH] spufs support multiple probes markers

2008-02-12 Thread Mathieu Desnoyers
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * Andrew Morton ([EMAIL PROTECTED]) wrote:
> > 
> > fyi, I have this on hold because I spotted spufs build breakage,
> > but I haven't had time to investigate.  powerpc allmodconfig, iirc.
> 
> Christoph told me he would update his sputrace accordingly. Christoph,
> should I do the changes or let you do it ?
> 
> It's mostly the probe function prototype which changes from 
> 
> -typedef void marker_probe_func(const struct marker *mdata,
> -   void *private_data, const char *fmt, ...);
> 
> to
> 
> +typedef void marker_probe_func(void *probe_private, void *call_private,
> +   const char *fmt, va_list *args);
> 
> Where you receive an already ready va_list instead of having to call
> va_start/va_end in the probe.
> 
> Also, there is no more marker_arm/marker_disarm. It is now automatically
> done upon register/unregister.
> 
> Mathieu
> 

Update spufs to the new linux kernel markers API, which supports connecting
more than one probe to a single marker.

(compile-tested only)

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: "Frank Ch. Eigler" <[EMAIL PROTECTED]>
CC: Steven Rostedt <[EMAIL PROTECTED]>
CC: Andrew Morton <[EMAIL PROTECTED]>
---
 arch/powerpc/platforms/cell/spufs/sputrace.c |   31 +--
 1 file changed, 11 insertions(+), 20 deletions(-)

Index: linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c
===
--- linux-2.6-lttng.orig/arch/powerpc/platforms/cell/spufs/sputrace.c   
2008-02-12 15:07:25.0 -0500
+++ linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c
2008-02-12 15:11:39.0 -0500
@@ -146,34 +146,28 @@
wake_up(&sputrace_wait);
 }
 
-static void spu_context_event(const struct marker *mdata,
-   void *private, const char *format, ...)
+static void spu_context_event(void *probe_private, void *call_data,
+   const char *format, va_list *args)
 {
-   struct spu_probe *p = mdata->private;
-   va_list ap;
+   struct spu_probe *p = probe_private;
struct spu_context *ctx;
struct spu *spu;
 
-   va_start(ap, format);
-   ctx = va_arg(ap, struct spu_context *);
-   spu = va_arg(ap, struct spu *);
+   ctx = va_arg(*args, struct spu_context *);
+   spu = va_arg(*args, struct spu *);
 
sputrace_log_item(p->name, ctx, spu);
-   va_end(ap);
 }
 
-static void spu_context_nospu_event(const struct marker *mdata,
-   void *private, const char *format, ...)
+static void spu_context_nospu_event(void *probe_private, void *call_data,
+   const char *format, va_list *args)
 {
-   struct spu_probe *p = mdata->private;
-   va_list ap;
+   struct spu_probe *p = probe_private;
struct spu_context *ctx;
 
-   va_start(ap, format);
-   ctx = va_arg(ap, struct spu_context *);
+   ctx = va_arg(*args, struct spu_context *);
 
sputrace_log_item(p->name, ctx, NULL);
-   va_end(ap);
 }
 
 struct spu_probe spu_probes[] = {
@@ -219,10 +213,6 @@
if (error)
printk(KERN_INFO "Unable to register probe %s\n",
p->name);
-
-   error = marker_arm(p->name);
-   if (error)
-   printk(KERN_INFO "Unable to arm probe %s\n", p->name);
}
 
return 0;
@@ -238,7 +228,8 @@
int i;
 
for (i = 0; i < ARRAY_SIZE(spu_probes); i++)
-   marker_probe_unregister(spu_probes[i].name);
+   marker_probe_unregister(spu_probes[i].name,
+   spu_probes[i].probe_func, &spu_probes[i]);
 
remove_proc_entry("sputrace", NULL);
kfree(sputrace_log);


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Cast cmpxchg64 and cmpxchg64_local result for 386 and 486 - Fix missing parenthesis

2008-02-12 Thread Mathieu Desnoyers
Two pairs of parenthesis were missing around the result cast of cmpxchg64 and
cmpxchg64_local. This is a rather stupid mistake in
"Cast cmpxchg and cmpxchg_local result for 386 and 486". My bad. This fix
should be folded with the previous.

Sorry for this trivial bug which should have never appeared in the first place.
The aim was to fix cmpxchg and cmpxchg_local, which were used in slub. cmpxchg64
and cmpxchg64_local happen to be only used in LTTng currently.

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Cc: Vegard Nossum <[EMAIL PROTECTED]>
Cc: Pekka Enberg <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Thomas Gleixner <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED]
---
 include/asm-x86/cmpxchg_32.h |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/cmpxchg_32.h
===
--- linux-2.6-lttng.orig/include/asm-x86/cmpxchg_32.h   2008-02-12 
17:49:32.0 -0500
+++ linux-2.6-lttng/include/asm-x86/cmpxchg_32.h2008-02-12 
17:50:18.0 -0500
@@ -305,11 +305,11 @@ extern unsigned long long cmpxchg_486_u6
 ({ \
__typeof__(*(ptr)) __ret;   \
if (likely(boot_cpu_data.x86 > 4))  \
-   __ret = __typeof__(*(ptr))__cmpxchg64((ptr),\
+   __ret = (__typeof__(*(ptr)))__cmpxchg64((ptr),  \
(unsigned long long)(o),\
(unsigned long long)(n));   \
else\
-   __ret = __typeof__(*(ptr))cmpxchg_486_u64((ptr),\
+   __ret = (__typeof__(*(ptr)))cmpxchg_486_u64((ptr),  \
(unsigned long long)(o),\
(unsigned long long)(n));   \
__ret;  \
@@ -318,11 +318,11 @@ extern unsigned long long cmpxchg_486_u6
 ({ \
__typeof__(*(ptr)) __ret;   \
if (likely(boot_cpu_data.x86 > 4))  \
-   __ret = __typeof__(*(ptr))__cmpxchg64_local((ptr),  \
+   __ret = (__typeof__(*(ptr)))__cmpxchg64_local((ptr),\
(unsigned long long)(o),\
(unsigned long long)(n));   \
else\
-   __ret = __typeof__(*(ptr))cmpxchg_486_u64((ptr),\
+   __ret = (__typeof__(*(ptr)))cmpxchg_486_u64((ptr),  \
(unsigned long long)(o),\
(unsigned long long)(n));   \
__ret;      \
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [2.6 patch] fix module_update_markers() compile error

2008-02-13 Thread Mathieu Desnoyers
* Adrian Bunk ([EMAIL PROTECTED]) wrote:
> This patch fixes the following compile error with CONFIG_MODULES=n 
> caused by commit fb40bd78b0f91b274879cf5db8facd1e04b6052e:
> 
> <--  snip  -->
> 
> ...
>   CC  kernel/marker.o
> /home/bunk/linux/kernel-2.6/git/linux-2.6/kernel/marker.c: In function 
> ‘marker_update_probes’:
> /home/bunk/linux/kernel-2.6/git/linux-2.6/kernel/marker.c:627: error: too few 
> arguments to function ‘module_update_markers’
> make[2]: *** [kernel/marker.o] Error 1
> 
> <--  snip  -->
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 

Thanks for spotting this.

Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

> ---
> 8d811a4160c6e2cb92391076e0e0b500e1b4a8a2 diff --git a/include/linux/module.h 
> b/include/linux/module.h
> index 330bec0..819c4e8 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -567,8 +567,7 @@ static inline void print_modules(void)
>  {
>  }
>  
> -static inline void module_update_markers(struct module *probe_module,
> -     int *refcount)
> +static inline void module_update_markers(void)
>  {
>  }
>  
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] markers: Fix build for MODULES=n.

2008-02-14 Thread Mathieu Desnoyers
* Paul Mundt ([EMAIL PROTECTED]) wrote:
>   CC  kernel/marker.o
> kernel/marker.c: In function 'marker_update_probes':
> kernel/marker.c:627: error: too few arguments to function 
> 'module_update_markers'
> make[1]: *** [kernel/marker.o] Error 1
> make: *** [kernel] Error 2
> 
> module_update_markers() doesn't take any arguments, update the MODULES=n
> version of it to reflect that.
> 

Hi Paul, thanks for submitting this. Adrian-the-roadrunner has been
faster than you though. :) He already submitted this fix here :

http://lkml.org/lkml/2008/2/13/714

Mathieu


> Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>
> 
> ---
> 
>  include/linux/module.h |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 330bec0..819c4e8 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -567,8 +567,7 @@ static inline void print_modules(void)
>  {
>  }
>  
> -static inline void module_update_markers(struct module *probe_module,
> -     int *refcount)
> +static inline void module_update_markers(void)
>  {
>  }
>  

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] samples: build fix

2008-02-14 Thread Mathieu Desnoyers
* Roland McGrath ([EMAIL PROTECTED]) wrote:
> 
> The samples/ subdirectory contains only modules.
> But the only make run done there is in commands for vmlinux.
> I can't see why this was ever done in this nonstandard fashion.
> As things stand, the modules don't get built by 'make modules'.
> 
> I didn't make the addition of the directory use core-$(CONFIG_SAMPLES)
> because there is no other conditional like that in the top-level Makefile
> and samples/Makefile already uses obj-$(CONFIG_SAMPLES) as if it expects
> always to be included.
> 

Sam, is this ok with the samples intent ? I think as long as we do not
include them with the kernel image and have a "make samples" to build
them, it's ok. Having them built upon make modules seems like a good
idea to me.

Mathieu

> Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
> ---
>  Makefile |5 +
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index c162370..9e9ce33 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -602,7 +602,7 @@ export mod_strip_cmd
>  
>  
>  ifeq ($(KBUILD_EXTMOD),)
> -core-y   += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> +core-y   += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ 
> samples/
>  
>  vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
>$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
> @@ -802,9 +802,6 @@ vmlinux: $(vmlinux-lds) $(vmlinux-init) $(vmlinux-main) 
> vmlinux.o $(kallsyms.o)
>  ifdef CONFIG_HEADERS_CHECK
>   $(Q)$(MAKE) -f $(srctree)/Makefile headers_check
>  endif
> -ifdef CONFIG_SAMPLES
> -     $(Q)$(MAKE) $(build)=samples
> -endif
>   $(call vmlinux-modpost)
>   $(call if_changed_rule,vmlinux__)
>   $(Q)rm -f .old_version

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] spufs support multiple probes markers

2008-02-14 Thread Mathieu Desnoyers
* Christoph Hellwig ([EMAIL PROTECTED]) wrote:
> On Tue, Feb 12, 2008 at 06:56:50PM -0500, Mathieu Desnoyers wrote:
> > Update spufs to the new linux kernel markers API, which supports connecting
> > more than one probe to a single marker.
> 
> Compiles and works for me.  But saying I like the odd API would be lying.
> 

Are there any concerns of yours that I should address then ? The changes
I made to the probe function prototype appeared to be technically
required and caused by variadic argument limitations when it comes to
supporting multiple probes.

I think that the marker arm/disarm removal gets rid of an unnecessary
redundancy.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] samples: build fix

2008-02-14 Thread Mathieu Desnoyers
* Sam Ravnborg ([EMAIL PROTECTED]) wrote:
> On Thu, Feb 14, 2008 at 08:27:52AM -0500, Mathieu Desnoyers wrote:
> > * Roland McGrath ([EMAIL PROTECTED]) wrote:
> > > 
> > > The samples/ subdirectory contains only modules.
> > > But the only make run done there is in commands for vmlinux.
> > > I can't see why this was ever done in this nonstandard fashion.
> > > As things stand, the modules don't get built by 'make modules'.
> > > 
> > > I didn't make the addition of the directory use core-$(CONFIG_SAMPLES)
> > > because there is no other conditional like that in the top-level Makefile
> > > and samples/Makefile already uses obj-$(CONFIG_SAMPLES) as if it expects
> > > always to be included.
> > > 
> > 
> > Sam, is this ok with the samples intent ? I think as long as we do not
> > include them with the kernel image and have a "make samples" to build
> > them, it's ok. Having them built upon make modules seems like a good
> > idea to me.
> 
> The samples code is supposed to be what the name says: 'samples'.
> This is not code that is supposed to be part of the built-in kernel.
> These are not modules that are supposed to be installed when
> installing modules.
> 
> Adding it to core-y as Roland does in the patch below is plain
> wrong as it enables both points above.
> The fact that the present code in samples/ does not do this is
> in this respect irrelevant.
> 
> Do we have a problem with when to build the samples - then let's
> address this issue, but not by trying to upgrade the samples
> to first-class citizens in the kernel - they are not that
> and should not be handled like that.
> 

Then is there some other way to have the samples built upon "make
modules" that would not install them with other modules ?

Mathieu

>   Sam

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [2.6 patch] make marker_debug static

2008-02-14 Thread Mathieu Desnoyers
* Adrian Bunk ([EMAIL PROTECTED]) wrote:
> With the needlessly global marker_debug being static gcc can optimize 
> the unused code away.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 

Thanks,

Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

> ---
> 91577cc8ac60bf9003d0dd037a231db363003740 diff --git a/kernel/marker.c 
> b/kernel/marker.c
> index c4c2cd8..133bdbb 100644
> --- a/kernel/marker.c
> +++ b/kernel/marker.c
> @@ -28,7 +28,7 @@ extern struct marker __start___markers[];
>  extern struct marker __stop___markers[];
>  
>  /* Set to 1 to enable marker debug output */
> -const int marker_debug;
> +static const int marker_debug;
>  
>  /*
>   * markers_mutex nests inside module_mutex. Markers mutex protects the 
> builtin
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] MARKERS depends on MODULES

2008-02-15 Thread Mathieu Desnoyers
* Chris Snook ([EMAIL PROTECTED]) wrote:
> From: Chris Snook <[EMAIL PROTECTED]>
> 
> Make MARKERS depend on MODULES to prevent build failures with certain configs.
> 
> Signed-off-by: Chris Snook <[EMAIL PROTECTED]>
> 
> diff --git a/init/Kconfig b/init/Kconfig
> index dcef8b5..933df15 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -729,6 +729,7 @@ config PROFILING
>  
>  config MARKERS
>   bool "Activate markers"
> + depends on MODULES
>   help
> Place an empty function call at each marker site. Can be
> dynamically changed for a probe function.

It should not be needed. Please try Adrian's fix there :
http://lkml.org/lkml/2008/2/13/714

It should fix your problem.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH v6] hashtable: introduce a small and naive hashtable

2012-09-26 Thread Mathieu Desnoyers
* David Laight (david.lai...@aculab.com) wrote:
> Amazing how something simple gets lots of comments and versions :-)
> 
> > ...
> > + * This has to be a macro since HASH_BITS() will not work on pointers since
> > + * it calculates the size during preprocessing.
> > + */
> > +#define hash_empty(hashtable)  
> > \
> > +({ 
> > \
> > +   int __i;
> > \
> > +   bool __ret = true;  
> > \
> > +   
> > \
> > +   for (__i = 0; __i < HASH_SIZE(hashtable); __i++)
> > \
> > +   if (!hlist_empty(&hashtable[__i]))  
> > \
> > +   __ret = false;  
> > \
> > +   
> > \
> > +   __ret;  
> > \
> > +})
> 
> Actually you could have a #define that calls a function
> passing in the address and size.
> Also, should the loop have a 'break' in it?

+1   Removing unnecessary variables defined within a
statement-expression is indeed something we want, and your suggestion of
a macro calling a static inline is, IMHO, spot-on.

The same should be done for hash_init().

And yes, a break would be welcome in that loop: no need to continue if
we encounter a non-empty hlist.
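
Concretely, the shape I have in mind is something like this (a sketch only,
reusing the HASH_SIZE() helper already defined in the patch):

	static inline bool __hash_empty(struct hlist_head *ht, unsigned int sz)
	{
		unsigned int i;

		for (i = 0; i < sz; i++)
			if (!hlist_empty(&ht[i]))
				return false;

		return true;
	}

	/* Still a macro: HASH_SIZE() only works on the array itself. */
	#define hash_empty(hashtable) __hash_empty(hashtable, HASH_SIZE(hashtable))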

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v6] hashtable: introduce a small and naive hashtable

2012-09-26 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On 09/26/2012 03:59 PM, Steven Rostedt wrote:
> > On Wed, 2012-09-26 at 14:45 +0100, David Laight wrote:
> >> Amazing how something simple gets lots of comments and versions :-)
> >>
> >>> ...
> >>> + * This has to be a macro since HASH_BITS() will not work on pointers 
> >>> since
> >>> + * it calculates the size during preprocessing.
> >>> + */
> >>> +#define hash_empty(hashtable)
> >>> \
> >>> +({   
> >>> \
> >>> + int __i;
> >>> \
> >>> + bool __ret = true;  
> >>> \
> >>> + 
> >>> \
> >>> + for (__i = 0; __i < HASH_SIZE(hashtable); __i++)
> >>> \
> >>> + if (!hlist_empty(&hashtable[__i]))  
> >>> \
> >>> + __ret = false;  
> >>> \
> >>> + 
> >>> \
> >>> + __ret;  
> >>> \
> >>> +})
> >>
> >> Actually you could have a #define that calls a function
> >> passing in the address and size.
> > 
> > Probably would be cleaner to do so.
> 
> I think it's worth it if it was more complex than a simple loop. We
> were doing a similar thing with the _size() functions (see version 4
> of this patch), but decided to remove it since it was becoming too
> complex.

Defining local variables within statement-expressions can have some
unexpected side-effects if the "caller" which embeds the macro uses the
same variable name. See rcu_dereference() as an example (Paul uses an
awfully large number of underscores). It should be avoided whenever
possible.
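
A contrived user-space illustration of the capture I mean (names and values
made up): the macro's local "i" shadows the caller's, so the argument
"rows[i]" ends up indexed by the macro's own counter and every call wrongly
tests the diagonal elements instead of the requested row.

	#include <stdbool.h>
	#include <stdio.h>

	#define ROW_EMPTY(row)				\
	({						\
		int i;					\
		bool empty = true;			\
		for (i = 0; i < 4; i++)			\
			if ((row)[i])			\
				empty = false;		\
		empty;					\
	})

	int main(void)
	{
		int rows[4][4] = { { 0 }, { 1, 0, 0, 0 }, { 0 }, { 0 } };
		int i;

		for (i = 0; i < 4; i++)
			/* (row)[i] expands to rows[i][i], both "i"s being the
			 * macro's local one: row 1 is wrongly reported empty. */
			printf("row %d empty: %d\n", i, ROW_EMPTY(rows[i]));

		return 0;
	}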

> > 
> > 
> >> Also, should the loop have a 'break' in it?
> > 
> > Yeah it should, and could do:
> > 
> > for (i = 0; i < HASH_SIZE(hashtable); i++)
> > if (!hlist_empty(&hashtable[i]))
> > break;
> > 
> > return i < HASH_SIZE(hashtable);


Hrm, Steven, did you drink your morning coffee before writing this ? ;-)
It looks like you made 2 bugs in 4 LOC.

First, the condition should be reversed, because this function returns
whether the hash is empty, not the other way around.

And even then, if we would do:

for (i = 0; i < HASH_SIZE(hashtable); i++)
if (!hlist_empty(&hashtable[i]))
break;
 
return i >= HASH_SIZE(hashtable);

What happens if the last entry of the table is non-empty ?

So I would advise that Sasha keep his original flag-based
implementation, but add the missing break, and move the init and empty
define loops into static inlines.

Thanks,

Mathieu

> 
> Right.
> 
> 
> Thanks,
> Sasha

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v6] hashtable: introduce a small and naive hashtable

2012-09-26 Thread Mathieu Desnoyers
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Wed, 2012-09-26 at 10:39 -0400, Mathieu Desnoyers wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > On 09/26/2012 03:59 PM, Steven Rostedt wrote:
> > > > On Wed, 2012-09-26 at 14:45 +0100, David Laight wrote:
> > > >> Amazing how something simple gets lots of comments and versions :-)
> > > >>
> > > >>> ...
> > > >>> + * This has to be a macro since HASH_BITS() will not work on 
> > > >>> pointers since
> > > >>> + * it calculates the size during preprocessing.
> > > >>> + */
> > > >>> +#define hash_empty(hashtable)
> > > >>> \
> > > >>> +({   
> > > >>> \
> > > >>> + int __i;
> > > >>> \
> > > >>> + bool __ret = true;  
> > > >>> \
> > > >>> + 
> > > >>> \
> > > >>> + for (__i = 0; __i < HASH_SIZE(hashtable); __i++)
> > > >>> \
> > > >>> + if (!hlist_empty(&hashtable[__i]))  
> > > >>> \
> > > >>> + __ret = false;  
> > > >>> \
> > > >>> + 
> > > >>> \
> > > >>> + __ret;  
> > > >>> \
> > > >>> +})
> > > >>
> > > >> Actually you could have a #define that calls a function
> > > >> passing in the address and size.
> > > > 
> > > > Probably would be cleaner to do so.
> > > 
> > > I think it's worth it if it was more complex than a simple loop. We
> > > were doing a similar thing with the _size() functions (see version 4
> > > of this patch), but decided to remove it since it was becoming too
> > > complex.
> > 
> > Defining local variables within statement-expressions can have some
> > unexpected side-effects if the "caller" which embeds the macro use the
> > same variable name. See rcu_dereference() as an example (Paul uses an
> > awefully large number of underscores). It should be avoided whenever
> > possible.
> > 
> > > > 
> > > > 
> > > >> Also, should the loop have a 'break' in it?
> > > > 
> > > > Yeah it should, and could do:
> > > > 
> > > > for (i = 0; i < HASH_SIZE(hashtable); i++)
> > > > if (!hlist_empty(&hashtable[i]))
> > > > break;
> > > > 
> > > > return i < HASH_SIZE(hashtable);
> > 
> > 
> > Hrm, Steven, did you drink your morning coffee before writing this ? ;-)
> > It looks like you made 2 bugs in 4 LOC.
> 
> Coffee yes, but head cold as well. :-p
> 
> > 
> > First, the condition should be reversed, because this function returns
> > whether the hash is empty, not the other way around.
> 
> Bah, I was looking at the code the code and got the ret confused. I
> originally had it the opposite, and then reversed it before sending.
> 
> > 
> > And even then, if we would do:
> > 
> > for (i = 0; i < HASH_SIZE(hashtable); i++)
> > if (!hlist_empty(&hashtable[i]))
> > break;
> >  
> > return i >= HASH_SIZE(hashtable);
> > 
> > What happens if the last entry of the table is non-empty ?
> 
> It still works, as 'i' is not incremented due to the break. And i will
> still be less than HASH_SIZE(hashtable). Did you have *your* cup of
> coffee today? ;-)

Ahh, right! Actually I had it already ;-)

> 
> 
> > 
> > So I would advise that Sasha keep his original flag-based
> > implementation, but add the missing break, and move the init and empty
> > define loops into static inlines.
> > 
> 
> Nah,

Agreed that the flags should be removed. Moving to define + static
inline is still important though.

Thanks,

Mathieu


> 
> -- Steve
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 01/16] hashtable: introduce a small and naive hashtable

2012-10-29 Thread Mathieu Desnoyers
> +
> +static inline bool __hash_empty(struct hlist_head *ht, int sz)

int -> unsigned int.

> +{
> + int i;

int -> unsigned int.

> +
> + for (i = 0; i < sz; i++)
> + if (!hlist_empty(&ht[i]))
> + return false;
> +
> + return true;
> +}
> +
> +/**
> + * hash_empty - check whether a hashtable is empty
> + * @hashtable: hashtable to check
> + *
> + * This has to be a macro since HASH_BITS() will not work on pointers since
> + * it calculates the size during preprocessing.
> + */
> +#define hash_empty(hashtable) __hash_empty(hashtable, HASH_SIZE(hashtable))
> +
> +/**
> + * hash_del - remove an object from a hashtable
> + * @node: &struct hlist_node of the object to remove
> + */
> +static inline void hash_del(struct hlist_node *node)
> +{
> + hlist_del_init(node);
> +}
> +
> +/**
> + * hash_del_rcu - remove an object from a rcu enabled hashtable
> + * @node: &struct hlist_node of the object to remove
> + */
> +static inline void hash_del_rcu(struct hlist_node *node)
> +{
> + hlist_del_init_rcu(node);
> +}
> +
> +/**
> + * hash_for_each - iterate over a hashtable
> + * @name: hashtable to iterate
> + * @bkt: integer to use as bucket loop cursor
> + * @node: the &struct list_head to use as a loop cursor for each entry
> + * @obj: the type * to use as a loop cursor for each entry
> + * @member: the name of the hlist_node within the struct
> + */
> +#define hash_for_each(name, bkt, node, obj, member)  
> \
> + for (bkt = 0, node = NULL; node == NULL && bkt < HASH_SIZE(name); 
> bkt++)\

if "bkt" happens to be a dereferenced pointer (unary operator '*'), we
get into a situation where "*blah" has higher precedence than "=",
higher than "<", but lower than "++". Any thoughts on fixing this ?

> + hlist_for_each_entry(obj, node, &name[bkt], member)
> +
> +/**
> + * hash_for_each_rcu - iterate over a rcu enabled hashtable
> + * @name: hashtable to iterate
> + * @bkt: integer to use as bucket loop cursor
> + * @node: the &struct list_head to use as a loop cursor for each entry
> + * @obj: the type * to use as a loop cursor for each entry
> + * @member: the name of the hlist_node within the struct
> + */
> +#define hash_for_each_rcu(name, bkt, node, obj, member)  
> \
> + for (bkt = 0, node = NULL; node == NULL && bkt < HASH_SIZE(name); 
> bkt++)\

Same comment as above about "bkt".

> + hlist_for_each_entry_rcu(obj, node, &name[bkt], member)
> +
> +/**
> + * hash_for_each_safe - iterate over a hashtable safe against removal of
> + * hash entry
> + * @name: hashtable to iterate
> + * @bkt: integer to use as bucket loop cursor
> + * @node: the &struct list_head to use as a loop cursor for each entry
> + * @tmp: a &struct used for temporary storage
> + * @obj: the type * to use as a loop cursor for each entry
> + * @member: the name of the hlist_node within the struct
> + */
> +#define hash_for_each_safe(name, bkt, node, tmp, obj, member)
> \
> + for (bkt = 0, node = NULL; node == NULL && bkt < HASH_SIZE(name); 
> bkt++)\

Same comment as above about "bkt".

Thanks,

Mathieu

> + hlist_for_each_entry_safe(obj, node, tmp, &name[bkt], member)
> +
> +/**
> + * hash_for_each_possible - iterate over all possible objects hashing to the
> + * same bucket
> + * @name: hashtable to iterate
> + * @obj: the type * to use as a loop cursor for each entry
> + * @node: the &struct list_head to use as a loop cursor for each entry
> + * @member: the name of the hlist_node within the struct
> + * @key: the key of the objects to iterate over
> + */
> +#define hash_for_each_possible(name, obj, node, member, key) 
> \
> + hlist_for_each_entry(obj, node, &name[hash_min(key, HASH_BITS(name))], 
> member)
> +
> +/**
> + * hash_for_each_possible_rcu - iterate over all possible objects hashing to 
> the
> + * same bucket in an rcu enabled hashtable
> + * in a rcu enabled hashtable
> + * @name: hashtable to iterate
> + * @obj: the type * to use as a loop cursor for each entry
> + * @node: the &struct list_head to use as a loop cursor for each entry
> + * @member: the name of the hlist_node within the struct
> + * @key: the key of the objects to iterate over
> + */
> +#define hash_for_each_possible_rcu(name, obj, node, member, key) 
> \
> + hlist_for_each_entry_rcu(obj, node, &name[hash_min(key, 
> HASH_BITS(name))], member)
> +
> +/**

Re: [PATCH v7 06/16] tracepoint: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Switch tracepoints to use the new hashtable implementation. This reduces the 
> amount of
> generic unrelated code in the tracepoints.
> 
> Signed-off-by: Sasha Levin 
> ---
>  kernel/tracepoint.c | 27 +++
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index d96ba22..854df92 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  extern struct tracepoint * const __start___tracepoints_ptrs[];
>  extern struct tracepoint * const __stop___tracepoints_ptrs[];
> @@ -49,8 +50,7 @@ static LIST_HEAD(tracepoint_module_list);
>   * Protected by tracepoints_mutex.
>   */
>  #define TRACEPOINT_HASH_BITS 6
> -#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> -static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> +static DEFINE_HASHTABLE(tracepoint_table, TRACEPOINT_HASH_BITS);
>  
[...]
>  
> @@ -722,6 +715,8 @@ struct notifier_block tracepoint_module_nb = {
>  
>  static int init_tracepoints(void)
>  {
> + hash_init(tracepoint_table);
> +
>   return register_module_notifier(&tracepoint_module_nb);
>  }
>  __initcall(init_tracepoints);

So we have a hash table defined in .bss (therefore entirely initialized
to NULL), and you add a call to "hash_init", which iterates on the whole
array and initialize it to NULL (again) ?

This extra initialization is redundant. I think it should be removed
from here, and hashtable.h should document that hash_init() doesn't need
to be called on zeroed memory (which includes static/global variables,
kzalloc'd memory, etc).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 07/16] net,9p: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Switch 9p error table to use the new hashtable implementation. This reduces 
> the amount of
> generic unrelated code in 9p.
> 
> Signed-off-by: Sasha Levin 
> ---
>  net/9p/error.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/net/9p/error.c b/net/9p/error.c
> index 2ab2de7..a5cc7dd 100644
> --- a/net/9p/error.c
> +++ b/net/9p/error.c
> @@ -34,7 +34,7 @@
>  #include 
>  #include 
>  #include 
> -
> +#include 

missing newline.

>  /**
>   * struct errormap - map string errors from Plan 9 to Linux numeric ids
>   * @name: string sent over 9P
> @@ -50,8 +50,8 @@ struct errormap {
>   struct hlist_node list;
>  };
>  
> -#define ERRHASHSZ 32
> -static struct hlist_head hash_errmap[ERRHASHSZ];
> +#define ERR_HASH_BITS 5
> +static DEFINE_HASHTABLE(hash_errmap, ERR_HASH_BITS);
>  
>  /* FixMe - reduce to a reasonable size */
>  static struct errormap errmap[] = {
> @@ -193,18 +193,17 @@ static struct errormap errmap[] = {
>  int p9_error_init(void)
>  {
>   struct errormap *c;
> - int bucket;
> + u32 hash;
>  
>   /* initialize hash table */
> - for (bucket = 0; bucket < ERRHASHSZ; bucket++)
> - INIT_HLIST_HEAD(&hash_errmap[bucket]);
> + hash_init(hash_errmap);

As for most of the other patches in this series, the hash_init is
redundant for a statically defined hash table.

Thanks,

Mathieu

>  
>   /* load initial error map into hash table */
>   for (c = errmap; c->name != NULL; c++) {
>   c->namelen = strlen(c->name);
> - bucket = jhash(c->name, c->namelen, 0) % ERRHASHSZ;
> + hash = jhash(c->name, c->namelen, 0);
>   INIT_HLIST_NODE(&c->list);
> - hlist_add_head(&c->list, &hash_errmap[bucket]);
> + hash_add(hash_errmap, &c->list, hash);
>   }
>  
>   return 1;
> @@ -223,13 +222,13 @@ int p9_errstr2errno(char *errstr, int len)
>   int errno;
>   struct hlist_node *p;
>   struct errormap *c;
> - int bucket;
> + u32 hash;
>  
>   errno = 0;
>   p = NULL;
>   c = NULL;
> - bucket = jhash(errstr, len, 0) % ERRHASHSZ;
> - hlist_for_each_entry(c, p, &hash_errmap[bucket], list) {
> + hash = jhash(errstr, len, 0);
> + hash_for_each_possible(hash_errmap, c, p, list, hash) {
>   if (c->namelen == len && !memcmp(c->name, errstr, len)) {
>   errno = c->val;
>   break;
> -- 
> 1.7.12.4
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 08/16] block,elevator: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
[...]
> @@ -96,6 +97,8 @@ struct elevator_type
>   struct list_head list;
>  };
>  
> +#define ELV_HASH_BITS 6
> +
>  /*
>   * each queue has an elevator_queue associated with it
>   */
> @@ -105,7 +108,7 @@ struct elevator_queue
>   void *elevator_data;
>   struct kobject kobj;
>   struct mutex sysfs_lock;
> - struct hlist_head *hash;
> + DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
>   unsigned int registered:1;

Hrm, so this is moving "registered" out of the first cache line of
elevator_queue by turning the pointer into a 256- or 512-byte hash table.

Maybe we should consider moving "registered" before the "hash" field ?
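(i.e., a rough sketch of the reordering, keeping only the fields quoted
above and eliding the rest of the struct:)

	struct elevator_queue
	{
		...
		void *elevator_data;
		struct kobject kobj;
		struct mutex sysfs_lock;
		unsigned int registered:1;
		DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
	};

so the bitfield stays in the early part of the structure and the large
hash array goes at the end.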

Thanks,

Mathieu

>  };
>  
> -- 
> 1.7.12.4
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 09/16] SUNRPC/cache: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Switch cache to use the new hashtable implementation. This reduces the amount 
> of
> generic unrelated code in the cache implementation.
> 
> Signed-off-by: Sasha Levin 
> ---
>  net/sunrpc/cache.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index fc2f7aa..0490546 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -524,19 +525,18 @@ EXPORT_SYMBOL_GPL(cache_purge);
>   * it to be revisited when cache info is available
>   */
>  
> -#define  DFR_HASHSIZE  (PAGE_SIZE/sizeof(struct list_head))
> -#define  DFR_HASH(item)  ((((long)item)>>4 ^ (((long)item)>>13)) % DFR_HASHSIZE)
> +#define  DFR_HASH_BITS   9

If we look at a bit of history, mainly commit:

commit 1117449276bb909b029ed0b9ba13f53e4784db9d
Author: NeilBrown 
Date:   Thu Aug 12 17:04:08 2010 +1000

sunrpc/cache: change deferred-request hash table to use hlist.


we'll notice that the only reason why the prior DFR_HASHSIZE was using

  (PAGE_SIZE/sizeof(struct list_head))

instead of

  (PAGE_SIZE/sizeof(struct hlist_head))

is because it has been forgotten in that commit. The intent there is to
make the hash table array fit the page size.

Defining DFR_HASH_BITS arbitrarily as "9" does fulfill this purpose on
architectures with a 4kB page size and 64-bit pointers, but not on some
powerpc configurations and Tile architectures, which have a more exotic
64kB page size, nor of course on the far less exotic 32-bit pointer
architectures.

So defining e.g.:

#include 

#define DFR_HASH_BITS  (PAGE_SHIFT - ilog2(BITS_PER_LONG))

would keep the intended behavior in all cases: use one page for the hash
array.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 10/16] dlm: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
[...]
> @@ -158,34 +159,21 @@ static int dlm_allow_conn;
>  static struct workqueue_struct *recv_workqueue;
>  static struct workqueue_struct *send_workqueue;
>  
> -static struct hlist_head connection_hash[CONN_HASH_SIZE];
> +static struct hlist_head connection_hash[CONN_HASH_BITS];
>  static DEFINE_MUTEX(connections_lock);
>  static struct kmem_cache *con_cache;
>  
>  static void process_recv_sockets(struct work_struct *work);
>  static void process_send_sockets(struct work_struct *work);
>  
> -
> -/* This is deliberately very simple because most clusters have simple
> -   sequential nodeids, so we should be able to go straight to a connection
> -   struct in the array */
> -static inline int nodeid_hash(int nodeid)
> -{
> - return nodeid & (CONN_HASH_SIZE-1);
> -}

There is one thing I dislike about this change: you remove a useful
comment. It's good to be informed of the reason why a direct mapping
"value -> hash" without any dispersion function is preferred here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 11/16] net,l2tp: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
[...]  
> -/* Session hash global list for L2TPv3.
> - * The session_id SHOULD be random according to RFC3931, but several
> - * L2TP implementations use incrementing session_ids.  So we do a real
> - * hash on the session_id, rather than a simple bitmask.
> - */
> -static inline struct hlist_head *
> -l2tp_session_id_hash_2(struct l2tp_net *pn, u32 session_id)
> -{
> - return &pn->l2tp_session_hlist[hash_32(session_id, L2TP_HASH_BITS_2)];
> -
> -}

I understand that you removed this hash function, as well as
"l2tp_session_id_hash" below, but is there any way we could leave those
comments in place ? They look useful.

> -/* Session hash list.
> - * The session_id SHOULD be random according to RFC2661, but several
> - * L2TP implementations (Cisco and Microsoft) use incrementing
> - * session_ids.  So we do a real hash on the session_id, rather than a
> - * simple bitmask.

Ditto.

> - */
> -static inline struct hlist_head *
> -l2tp_session_id_hash(struct l2tp_tunnel *tunnel, u32 session_id)
> -{
> - return &tunnel->session_hlist[hash_32(session_id, L2TP_HASH_BITS)];
> -}
> -
>  /* Lookup a session by id
>   */

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 10/16] dlm: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> [...]
> > @@ -158,34 +159,21 @@ static int dlm_allow_conn;
> >  static struct workqueue_struct *recv_workqueue;
> >  static struct workqueue_struct *send_workqueue;
> >  
> > -static struct hlist_head connection_hash[CONN_HASH_SIZE];
> > +static struct hlist_head connection_hash[CONN_HASH_BITS];
> >  static DEFINE_MUTEX(connections_lock);
> >  static struct kmem_cache *con_cache;
> >  
> >  static void process_recv_sockets(struct work_struct *work);
> >  static void process_send_sockets(struct work_struct *work);
> >  
> > -
> > -/* This is deliberately very simple because most clusters have simple
> > -   sequential nodeids, so we should be able to go straight to a connection
> > -   struct in the array */
> > -static inline int nodeid_hash(int nodeid)
> > -{
> > -   return nodeid & (CONN_HASH_SIZE-1);
> > -}
> 
> There is one thing I dislike about this change: you remove a useful
> comment. It's good to be informed of the reason why a direct mapping
> "value -> hash" without any dispersion function is preferred here.

And now that I come to think of it: you're changing the behavior: you
will now use a dispersion function on the key, which goes against the
intent expressed in this comment.

It might be good to change the hash_add(), hash_add_rcu(), and
hash_for_each_possible*() "key" parameter to a "hash" parameter, and to
let the callers provide a hash value computed by whichever function they
like, rather than enforcing hash_32/hash_64.
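Roughly something along these lines (the *_hashed names are made up
here, purely to illustrate the idea):

	u32 hash = jhash(name, strlen(name), seed);

	hash_add_hashed(tbl, &obj->node, hash);	/* no hash_32() applied inside */
	hash_for_each_possible_hashed(tbl, obj, node, member, hash)
		...

Each subsystem would then keep its jhash/full_name_hash/identity
mapping, and the hashtable code would only mask the value down to the
bucket count.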

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 13/16] lockd: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Switch lockd to use the new hashtable implementation. This reduces the amount 
> of
> generic unrelated code in lockd.
> 
> Signed-off-by: Sasha Levin 
> ---
>  fs/lockd/svcsubs.c | 66 
> +-
>  1 file changed, 36 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> index 0deb5f6..d223a1f 100644
> --- a/fs/lockd/svcsubs.c
> +++ b/fs/lockd/svcsubs.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define NLMDBG_FACILITY  NLMDBG_SVCSUBS
>  
> @@ -28,8 +29,7 @@
>   * Global file hash table
>   */
>  #define FILE_HASH_BITS   7
> -#define FILE_NRHASH  (1<<FILE_HASH_BITS)
> -static struct hlist_head nlm_files[FILE_NRHASH];
> +static DEFINE_HASHTABLE(nlm_files, FILE_HASH_BITS);
>  static DEFINE_MUTEX(nlm_file_mutex);
>  
>  #ifdef NFSD_DEBUG
> @@ -68,7 +68,7 @@ static inline unsigned int file_hash(struct nfs_fh *f)
>   int i;
>   for (i=0; i<NFS2_FHSIZE; i++)
>   tmp += f->data[i];
> - return tmp & (FILE_NRHASH - 1);
> + return tmp;
>  }
>  
>  /*
> @@ -86,17 +86,17 @@ nlm_lookup_file(struct svc_rqst *rqstp, struct nlm_file 
> **result,
>  {
>   struct hlist_node *pos;
>   struct nlm_file *file;
> - unsigned inthash;
> + unsigned intkey;
>   __be32  nfserr;
>  
>   nlm_debug_print_fh("nlm_lookup_file", f);
>  
> - hash = file_hash(f);
> + key = file_hash(f);
>  
>   /* Lock file table */
>   mutex_lock(&nlm_file_mutex);
>  
> - hlist_for_each_entry(file, pos, &nlm_files[hash], f_list)
> + hash_for_each_possible(nlm_files, file, pos, f_list, file_hash(f))

we have a nice example of weirdness about key vs hash here:

1) "key" is computed from file_hash(f)
2) file_hash(f) is computed again and again in hash_for_each_possible()

>   if (!nfs_compare_fh(&file->f_handle, f))
>   goto found;
>  
> @@ -123,7 +123,7 @@ nlm_lookup_file(struct svc_rqst *rqstp, struct nlm_file 
> **result,
>   goto out_free;
>   }
>  
> - hlist_add_head(&file->f_list, &nlm_files[hash]);
> + hash_add(nlm_files, &file->f_list, key);

3) then we use "key" as parameter to hash_add.

Moreover, we're adding dispersion on top of file_hash() with the hash_32()
called under the hood within hashtable.h. Is this intended behavior ?
This should at the very least be documented in the changelog.
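(Presumably the lookup was meant to reuse the value computed just above,
i.e. something like:

	hash_for_each_possible(nlm_files, file, pos, f_list, key)

rather than invoking file_hash(f) a second time.)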

[...]

> +static int __init nlm_init(void)
> +{
> + hash_init(nlm_files);

Useless.

Thanks,

Mathieu

> + return 0;
> +}
> +
> +module_init(nlm_init);
> -- 
> 1.7.12.4
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 14/16] net,rds: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Switch rds to use the new hashtable implementation. This reduces the amount of
> generic unrelated code in rds.
> 
> Signed-off-by: Sasha Levin 
> ---
>  net/rds/bind.c   |  28 +-
>  net/rds/connection.c | 102 
> +++
>  2 files changed, 63 insertions(+), 67 deletions(-)
> 
> diff --git a/net/rds/bind.c b/net/rds/bind.c
> index 637bde5..79d65ce 100644
> --- a/net/rds/bind.c
> +++ b/net/rds/bind.c
> @@ -36,16 +36,16 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "rds.h"
>  
> -#define BIND_HASH_SIZE 1024
> -static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
> +#define BIND_HASH_BITS 10
> +static DEFINE_HASHTABLE(bind_hash_table, BIND_HASH_BITS);
>  static DEFINE_SPINLOCK(rds_bind_lock);
>  
> -static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
> +static u32 rds_hash(__be32 addr, __be16 port)
>  {
> - return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) &
> -   (BIND_HASH_SIZE - 1));
> + return jhash_2words((u32)addr, (u32)port, 0);
>  }
>  
>  static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
> @@ -53,12 +53,12 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, 
> __be16 port,
>  {
>   struct rds_sock *rs;
>   struct hlist_node *node;
> - struct hlist_head *head = hash_to_bucket(addr, port);
> + u32 key = rds_hash(addr, port);
>   u64 cmp;
>   u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
>  
>   rcu_read_lock();
> - hlist_for_each_entry_rcu(rs, node, head, rs_bound_node) {
> + hash_for_each_possible_rcu(bind_hash_table, rs, node, rs_bound_node, 
> key) {

here too, key will be hashed twice:

- once by jhash_2words,
- once by hash_32(),

is this intended ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 15/16] openvswitch: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
[...]
> -static struct hlist_head *hash_bucket(struct net *net, const char *name)
> -{
> - unsigned int hash = jhash(name, strlen(name), (unsigned long) net);
> - return &dev_table[hash & (VPORT_HASH_BUCKETS - 1)];
> -}
> -
>  /**
>   *   ovs_vport_locate - find a port that has already been created
>   *
> @@ -84,13 +76,12 @@ static struct hlist_head *hash_bucket(struct net *net, 
> const char *name)
>   */
>  struct vport *ovs_vport_locate(struct net *net, const char *name)
>  {
> - struct hlist_head *bucket = hash_bucket(net, name);
>   struct vport *vport;
>   struct hlist_node *node;
> + int key = full_name_hash(name, strlen(name));
>  
> - hlist_for_each_entry_rcu(vport, node, bucket, hash_node)
> - if (!strcmp(name, vport->ops->get_name(vport)) &&
> - net_eq(ovs_dp_get_net(vport->dp), net))
> + hash_for_each_possible_rcu(dev_table, vport, node, hash_node, key)

Is applying hash_32() on top of full_name_hash() needed and expected ?

Thanks,

Mathieu

> + if (!strcmp(name, vport->ops->get_name(vport)))
>   return vport;
>  
>   return NULL;
> @@ -174,7 +165,8 @@ struct vport *ovs_vport_add(const struct vport_parms 
> *parms)
>  
>   for (i = 0; i < ARRAY_SIZE(vport_ops_list); i++) {
>   if (vport_ops_list[i]->type == parms->type) {
> - struct hlist_head *bucket;
> + int key;
> + const char *name;
>  
>   vport = vport_ops_list[i]->create(parms);
>   if (IS_ERR(vport)) {
> @@ -182,9 +174,9 @@ struct vport *ovs_vport_add(const struct vport_parms 
> *parms)
>   goto out;
>   }
>  
> - bucket = hash_bucket(ovs_dp_get_net(vport->dp),
> -  vport->ops->get_name(vport));
> - hlist_add_head_rcu(&vport->hash_node, bucket);
> + name = vport->ops->get_name(vport);
> + key = full_name_hash(name, strlen(name));
> + hash_add_rcu(dev_table, &vport->hash_node, key);
>   return vport;
>   }
>   }
> @@ -225,7 +217,7 @@ void ovs_vport_del(struct vport *vport)
>  {
>   ASSERT_RTNL();
>  
> - hlist_del_rcu(&vport->hash_node);
> + hash_del_rcu(&vport->hash_node);
>  
>   vport->ops->destroy(vport);
>  }
> -- 
> 1.7.12.4
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 09/16] SUNRPC/cache: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Linus Torvalds (torva...@linux-foundation.org) wrote:
> On Mon, Oct 29, 2012 at 5:42 AM, Mathieu Desnoyers
>  wrote:
> >
> > So defining e.g.:
> >
> > #include 
> >
> > #define DFR_HASH_BITS  (PAGE_SHIFT - ilog2(BITS_PER_LONG))
> >
> > would keep the intended behavior in all cases: use one page for the hash
> > array.
> 
> Well, since that wasn't true before either because of the long-time
> bug you point out, clearly the page size isn't all that important. I
> think it's more important to have small and simple code, and "9" is
> certainly that, compared to playing ilog2 games with not-so-obvious
> things.
> 
> Because there's no reason to believe that '9' is in any way a worse
> random number than something page-shift-related, is there? And getting
> away from *previous* overly-complicated size calculations that had
> been broken because they were too complicated and random, sounds like
> a good idea.

Good point. I agree that unless we really care about the precise number
of TLB entries and cache lines used by this hash table, we might want to
stay away from page-size and pointer-size based calculation.

It might not hurt to explain this in the patch changelog though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 09/16] SUNRPC/cache: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* J. Bruce Fields (bfie...@fieldses.org) wrote:
> On Mon, Oct 29, 2012 at 11:13:43AM -0400, Mathieu Desnoyers wrote:
> > * Linus Torvalds (torva...@linux-foundation.org) wrote:
> > > On Mon, Oct 29, 2012 at 5:42 AM, Mathieu Desnoyers
> > >  wrote:
> > > >
> > > > So defining e.g.:
> > > >
> > > > #include 
> > > >
> > > > #define DFR_HASH_BITS  (PAGE_SHIFT - ilog2(BITS_PER_LONG))
> > > >
> > > > would keep the intended behavior in all cases: use one page for the hash
> > > > array.
> > > 
> > > Well, since that wasn't true before either because of the long-time
> > > bug you point out, clearly the page size isn't all that important. I
> > > think it's more important to have small and simple code, and "9" is
> > > certainly that, compared to playing ilog2 games with not-so-obvious
> > > things.
> > > 
> > > Because there's no reason to believe that '9' is in any way a worse
> > > random number than something page-shift-related, is there? And getting
> > > away from *previous* overly-complicated size calculations that had
> > > been broken because they were too complicated and random, sounds like
> > > a good idea.
> > 
> > Good point. I agree that unless we really care about the precise number
> > of TLB entries and cache lines used by this hash table, we might want to
> > stay away from page-size and pointer-size based calculation.
> >
> > It might not hurt to explain this in the patch changelog though.
> 
> I'd also be happy to take that as a separate patch now.

FWIW: I've made a nice boo-boo above. It should have been:

#define DFR_HASH_BITS   (PAGE_SHIFT - ilog2(sizeof(struct hlist_head)))

Because we happen to have memory indexed in bytes, not in bits. I guess
this goes a long way toward proving Linus' point about the virtues of
trivial code. ;-)
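(Sanity check of the corrected formula: with 4kB pages and an 8-byte
struct hlist_head, DFR_HASH_BITS = 12 - 3 = 9, i.e. 512 heads * 8 bytes
= 4096 bytes, exactly one page, and matching the "9" now hard-coded in
the patch.)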

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 15/16] openvswitch: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Hi Mathieu,
> 
> On Mon, Oct 29, 2012 at 9:29 AM, Mathieu Desnoyers
>  wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > [...]
> >> -static struct hlist_head *hash_bucket(struct net *net, const char *name)
> >> -{
> >> - unsigned int hash = jhash(name, strlen(name), (unsigned long) net);
> >> - return &dev_table[hash & (VPORT_HASH_BUCKETS - 1)];
> >> -}
> >> -
> >>  /**
> >>   *   ovs_vport_locate - find a port that has already been created
> >>   *
> >> @@ -84,13 +76,12 @@ static struct hlist_head *hash_bucket(struct net *net, 
> >> const char *name)
> >>   */
> >>  struct vport *ovs_vport_locate(struct net *net, const char *name)
> >>  {
> >> - struct hlist_head *bucket = hash_bucket(net, name);
> >>   struct vport *vport;
> >>   struct hlist_node *node;
> >> + int key = full_name_hash(name, strlen(name));
> >>
> >> - hlist_for_each_entry_rcu(vport, node, bucket, hash_node)
> >> - if (!strcmp(name, vport->ops->get_name(vport)) &&
> >> - net_eq(ovs_dp_get_net(vport->dp), net))
> >> + hash_for_each_possible_rcu(dev_table, vport, node, hash_node, key)
> >
> > Is applying hash_32() on top of full_name_hash() needed and expected ?
> 
> Since this was pointed out in several of the patches, I'll answer it
> just once here.
> 
> I've intentionally "allowed" double hashing with hash_32 to keep the
> code simple.
> 
> hash_32() is pretty simple and gcc optimizes it to be almost nothing,
> so doing that costs us a multiplication and a shift. On the other
> hand, we benefit from keeping our code simple - how would we avoid
> doing this double hash? adding a different hashtable function for
> strings? or a new function for already hashed keys? I think we benefit
> a lot from having to mul/shr instead of adding extra lines of code
> here.

This could be done, as I pointed out in another email within this
thread, by changing the "key" argument from add/for_each_possible to an
expected "hash" value, and letting the caller invoke hash_32() if they want.
I doubt this would add a significant amount of complexity for users of
this API, but would allow much more flexibility to choose hash
functions.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 10/16] dlm: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 9:07 AM, Mathieu Desnoyers
>  wrote:
> > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> >> * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> [...]
> >> > @@ -158,34 +159,21 @@ static int dlm_allow_conn;
> >> >  static struct workqueue_struct *recv_workqueue;
> >> >  static struct workqueue_struct *send_workqueue;
> >> >
> >> > -static struct hlist_head connection_hash[CONN_HASH_SIZE];
> >> > +static struct hlist_head connection_hash[CONN_HASH_BITS];
> >> >  static DEFINE_MUTEX(connections_lock);
> >> >  static struct kmem_cache *con_cache;
> >> >
> >> >  static void process_recv_sockets(struct work_struct *work);
> >> >  static void process_send_sockets(struct work_struct *work);
> >> >
> >> > -
> >> > -/* This is deliberately very simple because most clusters have simple
> >> > -   sequential nodeids, so we should be able to go straight to a 
> >> > connection
> >> > -   struct in the array */
> >> > -static inline int nodeid_hash(int nodeid)
> >> > -{
> >> > -   return nodeid & (CONN_HASH_SIZE-1);
> >> > -}
> >>
> >> There is one thing I dislike about this change: you remove a useful
> >> comment. It's good to be informed of the reason why a direct mapping
> >> "value -> hash" without any dispersion function is preferred here.
> 
> Yes, I've removed the comment because it's no longer true with the patch :)
> 
> > And now that I come to think of it: you're changing the behavior : you
> > will now use a dispersion function on the key, which goes against the
> > intent expressed in this comment.
> 
> The comment gave us the information that nodeids are mostly
> sequential, we no longer need to rely on that.

I'm fine with turning a direct + modulo mapping into a dispersed hash as
long as there are no underlying assumptions about sequentiality of value
accesses.

If the access pattern happens to be typically sequential, then adding
dispersion could hurt performance significantly, turning a frequent L1
access into an L2 access, for instance.
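(A rough illustration: with the direct mapping, nodeids 0..63 land in
buckets 0..63, i.e. 64 adjacent hlist_head pointers spanning only 8
64-byte cache lines, whereas a dispersion function scatters those same
64 nodeids across the whole table.)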

> 
> > It might be good to change hash_add(), hash_add_rcu(),
> > hash_for_each_possible*() key parameter for a "hash" parameter, and let
> > the caller provide the hash value computed by the function they like as
> > parameter, rather than enforcing hash_32/hash_64.
> 
> Why? We already proved that hash_32() is more than enough as a hashing
> function, why complicate things?
> 
> Even doing hash_32() on top of another hash is probably a good idea to
> keep things simple.

All I'm asking is: have you made sure that this hash table is not
deliberately kept sequential (without dispersion) to accelerate specific
access patterns ? This should at least be documented in the changelog.

Thanks,

Mathieu


> 
> Thanks,
> Sasha

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 01/16] hashtable: introduce a small and naive hashtable

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 7:29 AM, Mathieu Desnoyers
>  wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> +
> >> + for (i = 0; i < sz; i++)
> >> + INIT_HLIST_HEAD(&ht[sz]);
> >
> > ouch. How did this work ? Has it been tested at all ?
> >
> > sz -> i
> 
> Funny enough, it works perfectly. Generally as a test I boot the
> kernel in a VM and let it fuzz with trinity for a bit, doing that with
> the code above worked flawlessly.
> 
> While it works, it's obviously wrong. Why does it work though? Usually
> there's a list op happening pretty soon after that which brings the
> list into proper state.
> 
> I've been playing with a patch that adds a magic value into list_head
> if CONFIG_DEBUG_LIST is set, and checks that magic in the list debug
> code in lib/list_debug.c.
> 
> Does it sound like something useful? If so I'll send that patch out.

Most of the calls to this initialization function apply it to zeroed
memory (static/kzalloc'd...), which makes it useless. I'd actually be in
favor of removing those redundant calls (as I pointed out in another
email), and documenting that zeroed memory doesn't need to be explicitly
initialized.

Those sites that really need to reinitialize memory, or to initialize it
(if located on the stack or in non-zeroed dynamically allocated memory),
could use a memset to 0, which will likely be faster than setting to
NULL on many architectures.

About testing, I'd recommend taking the few sites that still need the
initialization function and just initializing the array with garbage
before calling it. Things should blow up quite
quickly. Doing it as a one-off thing might be enough to catch any issue.
I don't think we need extra magic numbers to catch issues in this rather
obvious init function.
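Something along these lines would do, as a throwaway check (not meant
to be merged):

	memset(ht, 0x5a, sizeof(ht));	/* poison the statically defined array */
	hash_init(ht);			/* must bring every bucket back to a sane state */
	/* then exercise hash_add()/hash_del() and let CONFIG_DEBUG_LIST complain */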

Thanks,

Mathieu

> 
> 
> Thanks,
> Sasha

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 01/16] hashtable: introduce a small and naive hashtable

2012-10-29 Thread Mathieu Desnoyers
* Tejun Heo (t...@kernel.org) wrote:
> Hello,
> 
> On Mon, Oct 29, 2012 at 12:14:12PM -0400, Mathieu Desnoyers wrote:
> > Most of the calls to this initialization function apply it on zeroed
> > memory (static/kzalloc'd...), which makes it useless. I'd actually be in
> > favor of removing those redundant calls (as I pointed out in another
> > email), and document that zeroed memory don't need to be explicitly
> > initialized.
> > 
> > Those sites that need to really reinitialize memory, or initialize it
> > (if located on the stack or in non-zeroed dynamically allocated memory)
> > could use a memset to 0, which will likely be faster than setting to
> > NULL on many architectures.
> 
> I don't think it's a good idea to optimize out the basic encapsulation
> there.  We're talking about re-zeroing some static memory areas which
> are pretty small.  It's just not worth optimizing out at the cost of
> proper initializtion.  e.g. We might add debug fields to list_head
> later.

Future-proofness for debugging fields is indeed a very compelling
argument. Fair enough!

We might want to document this intent at the top of the initialization
function though, just in case anyone wants to short-circuit it.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 01/16] hashtable: introduce a small and naive hashtable

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 12:14 PM, Mathieu Desnoyers
>  wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> On Mon, Oct 29, 2012 at 7:29 AM, Mathieu Desnoyers
> >>  wrote:
> >> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> >> +
> >> >> + for (i = 0; i < sz; i++)
> >> >> + INIT_HLIST_HEAD(&ht[sz]);
> >> >
> >> > ouch. How did this work ? Has it been tested at all ?
> >> >
> >> > sz -> i
> >>
> >> Funny enough, it works perfectly. Generally as a test I boot the
> >> kernel in a VM and let it fuzz with trinity for a bit, doing that with
> >> the code above worked flawlessly.
> >>
> >> While it works, it's obviously wrong. Why does it work though? Usually
> >> there's a list op happening pretty soon after that which brings the
> >> list into proper state.
> >>
> >> I've been playing with a patch that adds a magic value into list_head
> >> if CONFIG_DEBUG_LIST is set, and checks that magic in the list debug
> >> code in lib/list_debug.c.
> >>
> >> Does it sound like something useful? If so I'll send that patch out.
> >
> > Most of the calls to this initialization function apply it on zeroed
> > memory (static/kzalloc'd...), which makes it useless. I'd actually be in
> > favor of removing those redundant calls (as I pointed out in another
> > email), and document that zeroed memory don't need to be explicitly
> > initialized.
> 
> Why would that make it useless? The idea is that the init functions
> will set the magic field to something random, like:
> 
> .magic = 0xBADBEEF0;
> 
> And have list_add() and friends WARN(.magic != 0xBADBEEF0, "Using an
> uninitialized list\n");
> 
> This way we'll catch all places that don't go through list initialization 
> code.

As I replied to Tejun Heo already, I agree that keeping the
initialization in place makes sense for future-proofness. This intent
should probably be documented in a comment about the initialization
function though, just to make sure nobody will try to skip it.

Thanks,

Mathieu

> 
> 
> Thanks,
> Sasha

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 06/16] tracepoint: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 7:35 AM, Mathieu Desnoyers
>  wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> Switch tracepoints to use the new hashtable implementation. This reduces 
> >> the amount of
> >> generic unrelated code in the tracepoints.
> >>
> >> Signed-off-by: Sasha Levin 
> >> ---
> >>  kernel/tracepoint.c | 27 +++
> >>  1 file changed, 11 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> >> index d96ba22..854df92 100644
> >> --- a/kernel/tracepoint.c
> >> +++ b/kernel/tracepoint.c
> >> @@ -26,6 +26,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>
> >>  extern struct tracepoint * const __start___tracepoints_ptrs[];
> >>  extern struct tracepoint * const __stop___tracepoints_ptrs[];
> >> @@ -49,8 +50,7 @@ static LIST_HEAD(tracepoint_module_list);
> >>   * Protected by tracepoints_mutex.
> >>   */
> >>  #define TRACEPOINT_HASH_BITS 6
> >> -#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> >> -static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> >> +static DEFINE_HASHTABLE(tracepoint_table, TRACEPOINT_HASH_BITS);
> >>
> > [...]
> >>
> >> @@ -722,6 +715,8 @@ struct notifier_block tracepoint_module_nb = {
> >>
> >>  static int init_tracepoints(void)
> >>  {
> >> + hash_init(tracepoint_table);
> >> +
> >>   return register_module_notifier(&tracepoint_module_nb);
> >>  }
> >>  __initcall(init_tracepoints);
> >
> > So we have a hash table defined in .bss (therefore entirely initialized
> > to NULL), and you add a call to "hash_init", which iterates on the whole
> > array and initialize it to NULL (again) ?
> >
> > This extra initialization is redundant. I think it should be removed
> > from here, and hashtable.h should document that hash_init() don't need
> > to be called on zeroed memory (which includes static/global variables,
> > kzalloc'd memory, etc).
> 
> This was discussed in the previous series, the conclusion was to call
> hash_init() either way to keep the encapsulation and consistency.

Agreed,

Thanks,

Mathieu

> 
> It's cheap enough and happens only once, so why not?
> 
> 
> Thanks,
> Sasha

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 15/16] openvswitch: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 11:59 AM, Mathieu Desnoyers
>  wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> Hi Mathieu,
> >>
> >> On Mon, Oct 29, 2012 at 9:29 AM, Mathieu Desnoyers
> >>  wrote:
> >> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> > [...]
> >> >> -static struct hlist_head *hash_bucket(struct net *net, const char 
> >> >> *name)
> >> >> -{
> >> >> - unsigned int hash = jhash(name, strlen(name), (unsigned long) 
> >> >> net);
> >> >> - return &dev_table[hash & (VPORT_HASH_BUCKETS - 1)];
> >> >> -}
> >> >> -
> >> >>  /**
> >> >>   *   ovs_vport_locate - find a port that has already been created
> >> >>   *
> >> >> @@ -84,13 +76,12 @@ static struct hlist_head *hash_bucket(struct net 
> >> >> *net, const char *name)
> >> >>   */
> >> >>  struct vport *ovs_vport_locate(struct net *net, const char *name)
> >> >>  {
> >> >> - struct hlist_head *bucket = hash_bucket(net, name);
> >> >>   struct vport *vport;
> >> >>   struct hlist_node *node;
> >> >> + int key = full_name_hash(name, strlen(name));
> >> >>
> >> >> - hlist_for_each_entry_rcu(vport, node, bucket, hash_node)
> >> >> - if (!strcmp(name, vport->ops->get_name(vport)) &&
> >> >> - net_eq(ovs_dp_get_net(vport->dp), net))
> >> >> + hash_for_each_possible_rcu(dev_table, vport, node, hash_node, key)
> >> >
> >> > Is applying hash_32() on top of full_name_hash() needed and expected ?
> >>
> >> Since this was pointed out in several of the patches, I'll answer it
> >> just once here.
> >>
> >> I've intentionally "allowed" double hashing with hash_32 to keep the
> >> code simple.
> >>
> >> hash_32() is pretty simple and gcc optimizes it to be almost nothing,
> >> so doing that costs us a multiplication and a shift. On the other
> >> hand, we benefit from keeping our code simple - how would we avoid
> >> doing this double hash? adding a different hashtable function for
> >> strings? or a new function for already hashed keys? I think we benefit
> >> a lot from having to mul/shr instead of adding extra lines of code
> >> here.
> >
> > This could be done, as I pointed out in another email within this
> > thread, by changing the "key" argument from add/for_each_possible to an
> > expected "hash" value, and let the caller invoke hash_32() if they want.
> > I doubt this would add a significant amount of complexity for users of
> > this API, but would allow much more flexibility to choose hash
> > functions.
> 
> Most callers do need to do the hashing though, so why add an
> additional step for all callers instead of doing another hash_32 for
> the ones that don't really need it?
> 
> Another question is why do you need flexibility? I think that
> simplicity wins over flexibility here.

I usually try to make things as simple as possible, but not simpler than
the problem being tackled allows. In this case, I would ask the
following question: by standardizing the hash function of all those
pieces of kernel infrastructure on "hash_32()", including submodules of
the kernel network infrastructure and parts of the kernel that can be
fed values coming from user-space (through the VFS), how can you
guarantee that hash_32() won't become the target of a DoS attack, given
that this algorithm is a) known to an attacker, and b) does not have any
randomness? It has been a recent trend to perform DoS attacks on poorly
implemented hashing functions.

This is just one example in an attempt to show why different hash table
users may have different constraints: for a hash table entirely
populated by keys generated internally by the kernel, a random seed
might not be required, but for cases where values are fed by user-space
and from the NIC, I would argue that flexibility to implement a
randomizable hash function beats implementation simplicity any time.

And you could keep the basic use-case simple by providing hints to the
hash_32()/hash_64()/hash_ulong() helpers in comments.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 15/16] openvswitch: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Tejun Heo (t...@kernel.org) wrote:
> Hello,
> 
> On Mon, Oct 29, 2012 at 02:16:48PM -0400, Mathieu Desnoyers wrote:
> > This is just one example in an attempt to show why different hash table
> > users may have different constraints: for a hash table entirely
> > populated by keys generated internally by the kernel, a random seed
> > might not be required, but for cases where values are fed by user-space
> > and from the NIC, I would argue that flexibility to implement a
> > randomizable hash function beats implementation simplicity any time.
> > 
> > And you could keep the basic use-case simple by providing hints to the
> > hash_32()/hash_64()/hash_ulong() helpers in comments.
> 
> If all you need is throwing in a salt value to avoid attacks, can't
> you just do that from caller side?  Scrambling the key before feeding
> it into hash_*() should work, no?

Yes, I think salting the "key" parameter would work.
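A minimal sketch of that, assuming a per-table seed set up once at init
time (the names here are made up for the example):

	static u32 vport_hash_seed __read_mostly;

	/* at init time: get_random_bytes(&vport_hash_seed, sizeof(vport_hash_seed)); */

	u32 key = jhash(name, strlen(name), vport_hash_seed);

	hash_for_each_possible_rcu(dev_table, vport, node, hash_node, key)
		...

The bucket placement then stops being predictable from user-space, even
though the hash_32() step inside hashtable.h stays fixed.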

Thanks,

Mathieu

> 
> Thanks.
> 
> -- 
> tejun

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 06/16] tracepoint: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 2:31 PM, Josh Triplett  wrote:
> > On Mon, Oct 29, 2012 at 01:29:24PM -0400, Sasha Levin wrote:
> >> On Mon, Oct 29, 2012 at 7:35 AM, Mathieu Desnoyers
> >>  wrote:
> >> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> >> Switch tracepoints to use the new hashtable implementation. This 
> >> >> reduces the amount of
> >> >> generic unrelated code in the tracepoints.
> >> >>
> >> >> Signed-off-by: Sasha Levin 
> >> >> ---
> >> >>  kernel/tracepoint.c | 27 +++
> >> >>  1 file changed, 11 insertions(+), 16 deletions(-)
> >> >>
> >> >> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> >> >> index d96ba22..854df92 100644
> >> >> --- a/kernel/tracepoint.c
> >> >> +++ b/kernel/tracepoint.c
> >> >> @@ -26,6 +26,7 @@
> >> >>  #include 
> >> >>  #include 
> >> >>  #include 
> >> >> +#include 
> >> >>
> >> >>  extern struct tracepoint * const __start___tracepoints_ptrs[];
> >> >>  extern struct tracepoint * const __stop___tracepoints_ptrs[];
> >> >> @@ -49,8 +50,7 @@ static LIST_HEAD(tracepoint_module_list);
> >> >>   * Protected by tracepoints_mutex.
> >> >>   */
> >> >>  #define TRACEPOINT_HASH_BITS 6
> >> >> -#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> >> >> -static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> >> >> +static DEFINE_HASHTABLE(tracepoint_table, TRACEPOINT_HASH_BITS);
> >> >>
> >> > [...]
> >> >>
> >> >> @@ -722,6 +715,8 @@ struct notifier_block tracepoint_module_nb = {
> >> >>
> >> >>  static int init_tracepoints(void)
> >> >>  {
> >> >> + hash_init(tracepoint_table);
> >> >> +
> >> >>   return register_module_notifier(&tracepoint_module_nb);
> >> >>  }
> >> >>  __initcall(init_tracepoints);
> >> >
> >> > So we have a hash table defined in .bss (therefore entirely initialized
> >> > to NULL), and you add a call to "hash_init", which iterates on the whole
> >> > array and initialize it to NULL (again) ?
> >> >
> >> > This extra initialization is redundant. I think it should be removed
> >> > from here, and hashtable.h should document that hash_init() don't need
> >> > to be called on zeroed memory (which includes static/global variables,
> >> > kzalloc'd memory, etc).
> >>
> >> This was discussed in the previous series, the conclusion was to call
> >> hash_init() either way to keep the encapsulation and consistency.
> >>
> >> It's cheap enough and happens only once, so why not?
> >
> > Unnecessary work adds up.  Better not to do it unnecessarily, even if by
> > itself it doesn't cost that much.
> >
> > It doesn't seem that difficult for future fields to have 0 as their
> > initialized state.
> 
> Let's put it this way: hlist requires the user to initialize hlist
> head before usage, therefore as a hlist user, hashtable implementation
> must do that.
> 
> We do it automatically when the hashtable user does
> DEFINE_HASHTABLE(), but we can't do that if he does
> DECLARE_HASHTABLE(). This means that the hashtable user must call
> hash_init() whenever he uses DECLARE_HASHTABLE() to create his
> hashtable.
> 
> There are two options here, either we specify that hash_init() should
> only be called if DECLARE_HASHTABLE() was called, which is confusing,
> inconsistent and prone to errors, or we can just say that it should be
> called whenever a hashtable is used.
> 
> The only way to work around it IMO is to get hlist to not require
> initializing before usage, and there are good reasons that that won't
> happen.

Hrm, just a second here.

The argument about hash_init being useful to add magic values in the
future only works for the cases where a hash table is declared with
DECLARE_HASHTABLE(). It's completely pointless with DEFINE_HASHTABLE(),
because we could initialize any debugging variables from within
DEFINE_HASHTABLE().

So I take my "Agreed" back. I disagree with initializing the hash table
twice redundantly. The initialization should be done either by
DEFINE_HASHTABLE() or by hash_init() (for DECLARE_HASHTABLE()), but not
as a useless run-time pass on top of an already statically initialized
hash table.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 06/16] tracepoint: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Tejun Heo (t...@kernel.org) wrote:
> On Mon, Oct 29, 2012 at 11:58:14AM -0700, Tejun Heo wrote:
> > On Mon, Oct 29, 2012 at 02:53:19PM -0400, Mathieu Desnoyers wrote:
> > > The argument about hash_init being useful to add magic values in the
> > > future only works for the cases where a hash table is declared with
> > > DECLARE_HASHTABLE(). It's completely pointless with DEFINE_HASHTABLE(),
> > > because we could initialize any debugging variables from within
> > > DEFINE_HASHTABLE().
> > 
> > You can do that with [0 .. HASH_SIZE - 1] initializer.
> 
> And in general, let's please try not to do optimizations which are
> pointless.  Just stick to the usual semantics.  You have an abstract
> data structure - invoke the initializer before using it.  Sure,
> optimize it if it shows up somewhere.  And here, if we do the
> initializers properly, it shouldn't cause any more actual overhead -
> ie. DEFINE_HASHTABLE() will basicallly boil down to all zero
> assignments and the compiler will put the whole thing in .bss anyway.

Yes, agreed. I was going too far in optimization land by proposing
assumptions on zeroed memory. All I actually really care about is that
we don't end up calling hash_init() on a statically defined (and thus
already initialized) hash table.
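For the DEFINE case, the designated range initializer Tejun mentions
would boil down to something like this (a sketch, not necessarily the
exact form the patch will end up using):

	#define DEFINE_HASHTABLE(name, bits)					\
		struct hlist_head name[1 << (bits)] =				\
			{ [0 ... ((1 << (bits)) - 1)] = HLIST_HEAD_INIT }

which the compiler resolves at build time, so a statically defined table
needs no additional run-time pass.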

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v7 06/16] tracepoint: use new hashtable implementation

2012-10-29 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, Oct 29, 2012 at 2:53 PM, Mathieu Desnoyers
>  wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> On Mon, Oct 29, 2012 at 2:31 PM, Josh Triplett  
> >> wrote:
> >> > On Mon, Oct 29, 2012 at 01:29:24PM -0400, Sasha Levin wrote:
> >> >> On Mon, Oct 29, 2012 at 7:35 AM, Mathieu Desnoyers
> >> >>  wrote:
> >> >> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> >> >> >> Switch tracepoints to use the new hashtable implementation. This 
> >> >> >> reduces the amount of
> >> >> >> generic unrelated code in the tracepoints.
> >> >> >>
> >> >> >> Signed-off-by: Sasha Levin 
> >> >> >> ---
> >> >> >>  kernel/tracepoint.c | 27 +++
> >> >> >>  1 file changed, 11 insertions(+), 16 deletions(-)
> >> >> >>
> >> >> >> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> >> >> >> index d96ba22..854df92 100644
> >> >> >> --- a/kernel/tracepoint.c
> >> >> >> +++ b/kernel/tracepoint.c
> >> >> >> @@ -26,6 +26,7 @@
> >> >> >>  #include 
> >> >> >>  #include 
> >> >> >>  #include 
> >> >> >> +#include 
> >> >> >>
> >> >> >>  extern struct tracepoint * const __start___tracepoints_ptrs[];
> >> >> >>  extern struct tracepoint * const __stop___tracepoints_ptrs[];
> >> >> >> @@ -49,8 +50,7 @@ static LIST_HEAD(tracepoint_module_list);
> >> >> >>   * Protected by tracepoints_mutex.
> >> >> >>   */
> >> >> >>  #define TRACEPOINT_HASH_BITS 6
> >> >> >> -#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> >> >> >> -static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> >> >> >> +static DEFINE_HASHTABLE(tracepoint_table, TRACEPOINT_HASH_BITS);
> >> >> >>
> >> >> > [...]
> >> >> >>
> >> >> >> @@ -722,6 +715,8 @@ struct notifier_block tracepoint_module_nb = {
> >> >> >>
> >> >> >>  static int init_tracepoints(void)
> >> >> >>  {
> >> >> >> + hash_init(tracepoint_table);
> >> >> >> +
> >> >> >>   return register_module_notifier(&tracepoint_module_nb);
> >> >> >>  }
> >> >> >>  __initcall(init_tracepoints);
> >> >> >
> >> >> > So we have a hash table defined in .bss (therefore entirely 
> >> >> > initialized
> >> >> > to NULL), and you add a call to "hash_init", which iterates on the 
> >> >> > whole
> >> >> > array and initialize it to NULL (again) ?
> >> >> >
> >> >> > This extra initialization is redundant. I think it should be removed
> >> >> > from here, and hashtable.h should document that hash_init() don't need
> >> >> > to be called on zeroed memory (which includes static/global variables,
> >> >> > kzalloc'd memory, etc).
> >> >>
> >> >> This was discussed in the previous series, the conclusion was to call
> >> >> hash_init() either way to keep the encapsulation and consistency.
> >> >>
> >> >> It's cheap enough and happens only once, so why not?
> >> >
> >> > Unnecessary work adds up.  Better not to do it unnecessarily, even if by
> >> > itself it doesn't cost that much.
> >> >
> >> > It doesn't seem that difficult for future fields to have 0 as their
> >> > initialized state.
> >>
> >> Let's put it this way: hlist requires the user to initialize hlist
> >> head before usage, therefore as a hlist user, hashtable implementation
> >> must do that.
> >>
> >> We do it automatically when the hashtable user does
> >> DEFINE_HASHTABLE(), but we can't do that if he does
> >> DECLARE_HASHTABLE(). This means that the hashtable user must call
> >> hash_init() whenever he uses DECLARE_HASHTABLE() to create his
> >> hashtable.
> >>
> >> There are two options here, either we specify that hash_init() should
> &
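
To make the DEFINE_HASHTABLE() vs. DECLARE_HASHTABLE() distinction being
discussed above concrete, here is a minimal sketch. It assumes the
hashtable.h API proposed in this series (DEFINE_HASHTABLE,
DECLARE_HASHTABLE, hash_init); the struct, function and variable names
are invented purely for the example:

    #include <linux/hashtable.h>
    #include <linux/slab.h>

    /* DEFINE_HASHTABLE() emits the HLIST_HEAD_INIT initializer, so a
     * static/global table is already usable without hash_init(): its
     * buckets are NULL-initialized by the loader. */
    static DEFINE_HASHTABLE(static_table, 6);

    /* DECLARE_HASHTABLE() only reserves the bucket array, e.g. inside a
     * dynamically allocated object, so the user must call hash_init()
     * before first use. */
    struct my_cache {
            DECLARE_HASHTABLE(buckets, 6);
    };

    static struct my_cache *my_cache_alloc(void)
    {
            struct my_cache *c = kmalloc(sizeof(*c), GFP_KERNEL);

            if (c)
                    hash_init(c->buckets); /* required: kmalloc'd memory */
            return c;
    }

With kzalloc() the buckets would already be zero and the hash_init()
call would be technically redundant, which is exactly the trade-off
(encapsulation/consistency vs. avoidable initialization work) being
debated in the quoted thread.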

Re: [PATCH tip/core/rcu 4/4] rcu: Document alternative RCU/reference-count algorithms

2012-10-30 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
> From: "Paul E. McKenney" 
> 
> The approach for mixing RCU and reference counting listed in the RCU
> documentation only describes one possible approach.  This approach can
> result in failure on the read side, which is nice if you want fresh data,
> but not so good if you want simple code.  This commit therefore adds
> two additional approaches that feature unconditional reference-count
> acquisition by RCU readers.  These approaches are very similar to that
> used in the security code.
> 
> Signed-off-by: Paul E. McKenney 
> ---
>  Documentation/RCU/rcuref.txt |   61 -
>  1 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt
> index 4202ad0..99ca662 100644
> --- a/Documentation/RCU/rcuref.txt
> +++ b/Documentation/RCU/rcuref.txt
> @@ -20,7 +20,7 @@ release_referenced()delete()
>  {{
>  ...  write_lock(&list_lock);
>  atomic_dec(&el->rc, relfunc) ...
> -...  delete_element
> +...  remove_element
>  }write_unlock(&list_lock);
>   ...
>   if (atomic_dec_and_test(&el->rc))
> @@ -52,7 +52,7 @@ release_referenced()delete()
>  {{
>  ...  spin_lock(&list_lock);
>  if (atomic_dec_and_test(&el->rc))   ...
> -call_rcu(&el->head, el_free);   delete_element
> +call_rcu(&el->head, el_free);   remove_element
>  ... spin_unlock(&list_lock);
>  }...
>   if (atomic_dec_and_test(&el->rc))
> @@ -64,3 +64,60 @@ Sometimes, a reference to the element needs to be obtained in the
>  update (write) stream.  In such cases, atomic_inc_not_zero() might be
>  overkill, since we hold the update-side spinlock.  One might instead
>  use atomic_inc() in such cases.
> +
> +It is not always convenient to deal with "FAIL" in the
> +search_and_reference() code path.  In such cases, the
> +atomic_dec_and_test() may be moved from delete() to el_free()
> +as follows:
> +
> +1.   2.
> +add()search_and_reference()
> +{{
> +alloc_object rcu_read_lock();
> +...  search_for_element
> +atomic_set(&el->rc, 1);  atomic_inc(&el->rc);
> +spin_lock(&list_lock);   ...
> + 
> +add_element  rcu_read_unlock();
> +...  }

indentation looks wrong in my mail client for the two lines above (for
the 2. block).

Otherwise, it looks good to me,

Thanks,

Mathieu


> +spin_unlock(&list_lock); 4.
> +}delete()
> +3.   {
> +release_referenced() spin_lock(&list_lock);
> +{...
> +...  remove_element
> +if (atomic_dec_and_test(&el->rc))   spin_unlock(&list_lock);
> +kfree(el);   ...
> +... call_rcu(&el->head, el_free);
> +}...
> +5.   }
> +void el_free(struct rcu_head *rhp)
> +{
> +release_referenced();
> +}
> +
> +The key point is that the initial reference added by add() is not removed
> +until after a grace period has elapsed following removal.  This means that
> +search_and_reference() cannot find this element, which means that the value
> +of el->rc cannot increase.  Thus, once it reaches zero, there are no
> +readers that can or ever will be able to reference the element.  The
> +element can therefore safely be freed.  This in turn guarantees that if
> +any reader finds the element, that reader may safely acquire a reference
> +without checking the value of the reference counter.
> +
> +In cases where delete() can sleep, synchronize_rcu() can be called from
> +delete(), so that el_free() can be subsumed into 
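
For readers who prefer ordinary C over the two-column pseudo-code, here
is a rough sketch of the same pattern. The element layout, the lookup
list and the key comparison are invented for the example; the
reference-count and RCU calls follow the steps described in the quoted
text above:

    #include <linux/atomic.h>
    #include <linux/kernel.h>
    #include <linux/rculist.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct el {
            struct list_head list;
            struct rcu_head head;
            atomic_t rc;
            int key;
    };

    static LIST_HEAD(el_list);
    static DEFINE_SPINLOCK(list_lock);

    static void release_referenced(struct el *p)
    {
            if (atomic_dec_and_test(&p->rc))
                    kfree(p);
    }

    static void el_free(struct rcu_head *rhp)
    {
            /* Drop the initial reference taken by add(). */
            release_referenced(container_of(rhp, struct el, head));
    }

    static void add(struct el *p)
    {
            atomic_set(&p->rc, 1);
            spin_lock(&list_lock);
            list_add_rcu(&p->list, &el_list);
            spin_unlock(&list_lock);
    }

    /* Readers may take a reference unconditionally: the initial
     * reference is only dropped one grace period after list removal,
     * so any element still reachable has rc >= 1. */
    static struct el *search_and_reference(int key)
    {
            struct el *p, *found = NULL;

            rcu_read_lock();
            list_for_each_entry_rcu(p, &el_list, list) {
                    if (p->key == key) {
                            atomic_inc(&p->rc);
                            found = p;
                            break;
                    }
            }
            rcu_read_unlock();
            return found;
    }

    static void delete(struct el *p)
    {
            spin_lock(&list_lock);
            list_del_rcu(&p->list);
            spin_unlock(&list_lock);
            call_rcu(&p->head, el_free);
    }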

Re: [PATCH v8 01/16] hashtable: introduce a small and naive hashtable

2012-10-30 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> This hashtable implementation is using hlist buckets to provide a simple
> hashtable to prevent it from getting reimplemented all over the kernel.
> 
> Signed-off-by: Sasha Levin 

Reviewed-by: Mathieu Desnoyers 

> ---
> 
> Changes from v8:
> 
>  - Addressed comments from Tejun Heo and Mathieu Desnoyers.
> 
> 
>  include/linux/hashtable.h | 196 
> ++
>  1 file changed, 196 insertions(+)
>  create mode 100644 include/linux/hashtable.h
> 
> diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
> new file mode 100644
> index 000..3c1a9cb
> --- /dev/null
> +++ b/include/linux/hashtable.h
> @@ -0,0 +1,196 @@
> +/*
> + * Statically sized hash table implementation
> + * (C) 2012  Sasha Levin 
> + */
> +
> +#ifndef _LINUX_HASHTABLE_H
> +#define _LINUX_HASHTABLE_H
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DEFINE_HASHTABLE(name, bits)					\
> +	struct hlist_head name[1 << (bits)] =				\
> +		{ [0 ... ((1 << (bits)) - 1)] = HLIST_HEAD_INIT }
> +
> +#define DECLARE_HASHTABLE(name, bits)					\
> +	struct hlist_head name[1 << (bits)]
> +
> +#define HASH_SIZE(name) (ARRAY_SIZE(name))
> +#define HASH_BITS(name) ilog2(HASH_SIZE(name))
> +
> +/* Use hash_32 when possible to allow for fast 32bit hashing in 64bit kernels. */
> +#define hash_min(val, bits)						\
> +({									\
> +	sizeof(val) <= 4 ?						\
> +	hash_32(val, bits) :						\
> +	hash_long(val, bits);						\
> +})
> +
> +static inline void __hash_init(struct hlist_head *ht, unsigned int sz)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < sz; i++)
> + INIT_HLIST_HEAD(&ht[i]);
> +}
> +
> +/**
> + * hash_init - initialize a hash table
> + * @hashtable: hashtable to be initialized
> + *
> + * Calculates the size of the hashtable from the given parameter, otherwise
> + * same as hash_init_size.
> + *
> + * This has to be a macro since HASH_BITS() will not work on pointers since
> + * it calculates the size during preprocessing.
> + */
> +#define hash_init(hashtable) __hash_init(hashtable, HASH_SIZE(hashtable))
> +
> +/**
> + * hash_add - add an object to a hashtable
> + * @hashtable: hashtable to add to
> + * @node: the &struct hlist_node of the object to be added
> + * @key: the key of the object to be added
> + */
> +#define hash_add(hashtable, node, key)					\
> +	hlist_add_head(node, &hashtable[hash_min(key, HASH_BITS(hashtable))])
> +
> +/**
> + * hash_add_rcu - add an object to a rcu enabled hashtable
> + * @hashtable: hashtable to add to
> + * @node: the &struct hlist_node of the object to be added
> + * @key: the key of the object to be added
> + */
> +#define hash_add_rcu(hashtable, node, key)				\
> +	hlist_add_head_rcu(node, &hashtable[hash_min(key, HASH_BITS(hashtable))])
> +
> +/**
> + * hash_hashed - check whether an object is in any hashtable
> + * @node: the &struct hlist_node of the object to be checked
> + */
> +static inline bool hash_hashed(struct hlist_node *node)
> +{
> + return !hlist_unhashed(node);
> +}
> +
> +static inline bool __hash_empty(struct hlist_head *ht, unsigned int sz)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < sz; i++)
> + if (!hlist_empty(&ht[i]))
> + return false;
> +
> + return true;
> +}
> +
> +/**
> + * hash_empty - check whether a hashtable is empty
> + * @hashtable: hashtable to check
> + *
> + * This has to be a macro since HASH_BITS() will not work on pointers since
> + * it calculates the size during preprocessing.
> + */
> +#define hash_empty(hashtable) __hash_empty(hashtable, HASH_SIZE(hashtable))
> +
> +/**
> + * hash_del - remove an object from a hashtable
> + * @node: &struct hlist_node of the object to remove
> + */
> +static inline void hash_del(struct hlist_node *node)
> +{
> + hlist_del_init(node);
> +}
> +
> +/**
> + * hash_del_rcu - remove an object from a rcu enabled hashtable
> + * @no

Re: [PATCH v8 06/16] tracepoint: use new hashtable implementation

2012-10-30 Thread Mathieu Desnoyers
* Sasha Levin (levinsasha...@gmail.com) wrote:
> Switch tracepoints to use the new hashtable implementation. This reduces the
> amount of generic unrelated code in the tracepoints.
> 
> Signed-off-by: Sasha Levin 

Reviewed-by: Mathieu Desnoyers 

> ---
>  kernel/tracepoint.c | 25 +
>  1 file changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index d96ba22..5b599f1 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  extern struct tracepoint * const __start___tracepoints_ptrs[];
>  extern struct tracepoint * const __stop___tracepoints_ptrs[];
> @@ -49,8 +50,7 @@ static LIST_HEAD(tracepoint_module_list);
>   * Protected by tracepoints_mutex.
>   */
>  #define TRACEPOINT_HASH_BITS 6
> -#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> -static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> +static DEFINE_HASHTABLE(tracepoint_table, TRACEPOINT_HASH_BITS);
>  
>  /*
>   * Note about RCU :
> @@ -191,16 +191,15 @@ tracepoint_entry_remove_probe(struct tracepoint_entry *entry,
>   */
>  static struct tracepoint_entry *get_tracepoint(const char *name)
>  {
> - struct hlist_head *head;
>   struct hlist_node *node;
>   struct tracepoint_entry *e;
>   u32 hash = jhash(name, strlen(name), 0);
>  
> - head = &tracepoint_table[hash & (TRACEPOINT_TABLE_SIZE - 1)];
> - hlist_for_each_entry(e, node, head, hlist) {
> + hash_for_each_possible(tracepoint_table, e, node, hlist, hash) {
>   if (!strcmp(name, e->name))
>   return e;
>   }
> +
>   return NULL;
>  }
>  
> @@ -210,19 +209,13 @@ static struct tracepoint_entry *get_tracepoint(const char *name)
>   */
>  static struct tracepoint_entry *add_tracepoint(const char *name)
>  {
> - struct hlist_head *head;
> - struct hlist_node *node;
>   struct tracepoint_entry *e;
>   size_t name_len = strlen(name) + 1;
>   u32 hash = jhash(name, name_len-1, 0);
>  
> - head = &tracepoint_table[hash & (TRACEPOINT_TABLE_SIZE - 1)];
> - hlist_for_each_entry(e, node, head, hlist) {
> - if (!strcmp(name, e->name)) {
> - printk(KERN_NOTICE
> - "tracepoint %s busy\n", name);
> - return ERR_PTR(-EEXIST);/* Already there */
> - }
> + if (get_tracepoint(name)) {
> + printk(KERN_NOTICE "tracepoint %s busy\n", name);
> + return ERR_PTR(-EEXIST);/* Already there */
>   }
>   /*
>* Using kmalloc here to allocate a variable length element. Could
> @@ -234,7 +227,7 @@ static struct tracepoint_entry *add_tracepoint(const char *name)
>   memcpy(&e->name[0], name, name_len);
>   e->funcs = NULL;
>   e->refcount = 0;
> - hlist_add_head(&e->hlist, head);
> + hash_add(tracepoint_table, &e->hlist, hash);
>   return e;
>  }
>  
> @@ -244,7 +237,7 @@ static struct tracepoint_entry *add_tracepoint(const char *name)
>   */
>  static inline void remove_tracepoint(struct tracepoint_entry *e)
>  {
> - hlist_del(&e->hlist);
> + hash_del(&e->hlist);
>   kfree(e);
>  }
>  
> -- 
> 1.7.12.4
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: lttng and full nohz

2013-07-08 Thread Mathieu Desnoyers
* Mats Liljegren (liljegren.ma...@gmail.com) wrote:
> I've been investigating why lttng destroys full nohz mode, and the
> root cause is that lttng uses timers for flushing trace buffers. So
> I'm planning on moving the timers to the ticking CPU, so that any CPU
> using full nohz mode can continue to do so even though they might have
> tracepoints.
> 
> I can see that kernel/sched/core.c has the function
> get_nohz_timer_target() which tries to find an idle CPU to allocate
> for a timer that has not specified a CPU to be pinned to.
> 
> My question here is: For full nohz mode, should this still be "only"
> an idle CPU, or should it be translated to a CPU not running in full
> nohz mode? I'd think this could make it a lot easier to allow
> applications to make full use of full nohz.

One thing to be aware of wrt LTTng ring buffer: if you look at
lttng-ring-buffer-client.h, you will notice that we use

  .sync = RING_BUFFER_SYNC_PER_CPU,

as ring buffer synchronization. This means event writes and sub-buffer
switches need to be issued from the CPU owning the buffer; only in very
specific cases, when the owning CPU is offline, can it be touched from a
remote CPU, and from just one (e.g. the CPU hotplug code).

For the LTTng ring buffer, there are two timers to take into account:
switch_timer and read_timer.

The switch_timer is not enabled by default. When it is enabled by the
end-user, it periodically flushes the lttng buffers. If you want to make
this timer execute from a single timer handler and apply to all buffers
(without IPI), you will need to use

  .sync = RING_BUFFER_SYNC_GLOBAL,

to allow concurrent updates to a ring buffer from remote CPUs.

The other timer requires fewer modifications: the read_timer periodically
checks whether poll() needs to be awakened. It just reads the producer
offset position and compares it to the current consumer position. This
one can be moved to a single timer handler that covers all CPUs without
any change to the "sync" choice.
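
In other words, the read_timer handler boils down to something along
these lines. This is a hand-wavy sketch only: the struct and function
names below are illustrative stand-ins, not the actual lttng-modules
symbols, and buffer/waitqueue initialization is elided:

    #include <linux/atomic.h>
    #include <linux/wait.h>

    /* Hypothetical stand-in for one per-CPU buffer/stream. */
    struct rb_stream {
            atomic_long_t produced;        /* producer offset */
            unsigned long consumed;        /* last consumer position */
            wait_queue_head_t read_wait;   /* poll()ers sleep here */
    };

    static void read_timer_tick(struct rb_stream *s)
    {
            /* Read-only check: compare producer vs. consumer position
             * and wake poll() waiters if data is pending. */
            if (atomic_long_read(&s->produced) != s->consumed)
                    wake_up_interruptible(&s->read_wait);
    }

Since it only reads positions and never writes into the buffer, it does
not care which CPU it runs on, which is why it can be centralized
without touching the "sync" choice.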

Please note that the read_timer is currently used by default. It can be
entirely removed if you choose

  .wakeup = RING_BUFFER_WAKEUP_BY_WRITER,

instead of RING_BUFFER_WAKEUP_BY_TIMER. However, if you choose the
wakeup by writer, the tracer will discard events coming from NMI
handlers, because some locks need to be taken by the tracing site in
this mode.

If we care about performance and scalability (we really should), the
right approach would still be to keep RING_BUFFER_SYNC_PER_CPU and to
keep the per-CPU timers for periodic flush (switch_timer). We might want
to hook into the full nohz entry/exit hooks (hopefully they exist) to
move the per-CPU timers off the full nohz CPUs, and enable a new flag on
these ring buffers that would allow dynamically switching between
RING_BUFFER_SYNC_PER_CPU and RING_BUFFER_SYNC_GLOBAL for a given ring
buffer.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

