date:20140613

Re: Re: [tip:perf/kprobes] kprobes, x86: Call exception_enter after kprobes handled

2014-06-13 Thread Masami Hiramatsu

Hi Frederic,

(2014/06/14 2:14), Frederic Weisbecker wrote:
> Hi Masami,
> 
> 2014-04-24 12:59 GMT+02:00 tip-bot for Masami Hiramatsu :
>> Commit-ID:  ecd50f714c421c759354632dd00f70c718c95b10
>> Gitweb: 
>> http://git.kernel.org/tip/ecd50f714c421c759354632dd00f70c718c95b10
>> Author: Masami Hiramatsu 
>> AuthorDate: Thu, 17 Apr 2014 17:17:40 +0900
>> Committer:  Ingo Molnar 
>> CommitDate: Thu, 24 Apr 2014 10:03:00 +0200
>>
>> kprobes, x86: Call exception_enter after kprobes handled
>>
>> Move exception_enter() call after kprobes handler
>> is done. Since the exception_enter() involves
>> many other functions (like printk), it can cause
>> recursive int3/break loop when kprobes probe such
>> functions.
>>
>> Signed-off-by: Masami Hiramatsu 
>> Reviewed-by: Steven Rostedt 
>> Cc: Andrew Morton 
>> Cc: Borislav Petkov 
>> Cc: Jiri Kosina 
>> Cc: Kees Cook 
>> Cc: Rusty Russell 
>> Cc: Seiji Aguchi 
>> Link: 
>> http://lkml.kernel.org/r/20140417081740.26341.10894.st...@ltc230.yrl.intra.hitachi.co.jp
>> Signed-off-by: Ingo Molnar 
> 
> This patch results in exception_enter/exception_exit imbalances:
> 
> arch/x86/kernel/traps.c: In function ‘do_debug’:
> include/linux/context_tracking.h:46:6: warning: ‘prev_state’ may be
> used uninitialized in this function [-Wmaybe-uninitialized]
>if (prev_ctx == IN_USER)
>   ^
> arch/x86/kernel/traps.c:431:17: note: ‘prev_state’ was declared here
>   enum ctx_state prev_state;

Oops, obviously there are bugs...

> An obvious solution would be to change all the goto exit before
> exception_enter() to return from do_debug(). But if there are any users
> of RCU before exception_enter() this won't work. Does
> kprobe_debug_handler() use RCU read side critical sections? I'm also
> worried about kmemcheck...

Having checked the code again, I think it is enough to revert this patch and
to add context_track_user_*() to the kprobes blacklist, since those functions
check in_interrupt() at entry and return immediately. That approach seems
to have no problem. This was my fault. :(

I'll send a bugfix, thank you!
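
To make the imbalance concrete, here is a minimal userspace sketch of the
"return instead of goto exit" shape Frederic describes. It is only a model:
model_exception_enter(), model_exception_exit(), cur_ctx and
handle_debug_trap() are hypothetical stand-ins, not the kernel functions,
and it does not address the RCU concern raised above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's context-tracking state. */
enum ctx_state { IN_KERNEL = 0, IN_USER = 1 };

static enum ctx_state cur_ctx = IN_USER;

/* Model of exception_enter(): switch to kernel context, return the old one. */
static enum ctx_state model_exception_enter(void)
{
	enum ctx_state prev = cur_ctx;

	cur_ctx = IN_KERNEL;
	return prev;
}

/* Model of exception_exit(): restore the context saved on entry. */
static void model_exception_exit(enum ctx_state prev)
{
	cur_ctx = prev;
}

/*
 * Balanced shape: a path that leaves before the enter call returns
 * directly, so the exit call never sees an uninitialized prev_state.
 */
static bool handle_debug_trap(bool handled_by_kprobes)
{
	enum ctx_state prev_state;

	if (handled_by_kprobes)
		return true;	/* never entered, so nothing to exit */

	prev_state = model_exception_enter();
	/* ... notify_die()-style work would go here ... */
	model_exception_exit(prev_state);
	return false;
}
```

With a shared "exit:" label that calls the exit function, the early path
would instead pass an uninitialized prev_state, which is what the
-Wmaybe-uninitialized warning is pointing at.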

> 
>> ---
>>  arch/x86/kernel/traps.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index e5d4a70..ba9abe9 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -327,7 +327,6 @@ dotraplinkage void __kprobes notrace do_int3(struct pt_regs *regs, long error_co
>> if (poke_int3_handler(regs))
>> return;
>>
>> -   prev_state = exception_enter();
>>  #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
>> if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
>> SIGTRAP) == NOTIFY_STOP)
>> @@ -338,6 +337,7 @@ dotraplinkage void __kprobes notrace do_int3(struct pt_regs *regs, long error_co
>> if (kprobe_int3_handler(regs))
>> return;
>>  #endif
>> +   prev_state = exception_enter();
>>
>> if (notify_die(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
>> SIGTRAP) == NOTIFY_STOP)
>> @@ -415,8 +415,6 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
>> unsigned long dr6;
>> int si_code;
>>
>> -   prev_state = exception_enter();
>> -
>> get_debugreg(dr6, 6);
>>
>> /* Filter out all the reserved bits which are preset to 1 */
>> @@ -449,6 +447,7 @@ dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
>> if (kprobe_debug_handler(regs))
>> goto exit;
>>  #endif
>> +   prev_state = exception_enter();
>>
>> if (notify_die(DIE_DEBUG, "debug", regs, (long)&dr6, error_code,
>> SIGTRAP) == NOTIFY_STOP)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-tip-commits" 
>> in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 2/2] Add support for Compact (Bluetooth|USB) keyboard with Trackpoint

2014-06-13 Thread Jiri Kosina

On Fri, 13 Jun 2014, Antonio Ospite wrote:

> > Previously the tpkbd driver had various functions marked "_tp" to indicate 
> > that it's for the "mouse" half of the keyboard as the kernel sees it, 
> > however it does nothing special with the keyboard half. I was intending 
> > (somewhat sloppily) to repurpose this into having versions of each 
> > function for each keyboard, and a common function to switch between them. 
> > Should make it fairly easy to add extra keyboards in the future.
> > 
> > The problem, as ever, is choosing decent names for them. It should 
> > probably be either:-
> > 
> > * tpkbd_input_mapping_usbkbd
> > * tpkbd_input_mapping_compactkbd
> > ...and tpkbd_input_mapping switches between them
> > 
> > or rename the driver to hid-lenovo and do:-
> > 
> 
> I am OK with a rename. Most files in drivers/hid are per-vendor after
> all. Jiri?

Fine by me; the module doesn't take any parameters, so we don't risk
introducing a regression for people who have put parameter settings in
modprobe.conf or similar.

So please go ahead with the rename.

-- 
Jiri Kosina
SUSE Labs

Re: random: Benchamrking fast_mix2

2014-06-13 Thread George Spelvin

> At least for Intel, between its branch predictor and speculative
> execution engine, it doesn't make a difference.  

*Sigh*.  We need live measurement.  My testing (in your test
harness!) showed a noticeable (~10%) speedup.

> When I did a quick comparison of your 64-bit fast_mix2 variant, it's
> much slower than either the 32-bit fast_mix2, or the original fast_mix
> algorithm.

That is f***ing *bizarre*.  For me, it's *significantly* faster.
You *are* compiling -m64, right?  Because I agree with you it'd
be stupid to try to use it on 32-bit machines.

Forcing max-speed CPU:
# ./perftest ./ted64
fast_mix: 419   fast_mix2: 419  fast_mix4: 318
fast_mix: 386   fast_mix2: 419  fast_mix4: 112
fast_mix: 419   fast_mix2: 510  fast_mix4: 328
fast_mix: 420   fast_mix2: 510  fast_mix4: 306
fast_mix: 420   fast_mix2: 510  fast_mix4: 317
fast_mix: 419   fast_mix2: 510  fast_mix4: 318
fast_mix: 362   fast_mix2: 510  fast_mix4: 317
fast_mix: 420   fast_mix2: 510  fast_mix4: 306
fast_mix: 419   fast_mix2: 499  fast_mix4: 318
fast_mix: 420   fast_mix2: 510  fast_mix4: 328

And not:
$ ./ted64
fast_mix: 328   fast_mix2: 430  fast_mix4: 272
fast_mix: 442   fast_mix2: 442  fast_mix4: 272
fast_mix: 442   fast_mix2: 430  fast_mix4: 272
fast_mix: 329   fast_mix2: 442  fast_mix4: 272
fast_mix: 329   fast_mix2: 430  fast_mix4: 272
fast_mix: 328   fast_mix2: 442  fast_mix4: 272
fast_mix: 329   fast_mix2: 431  fast_mix4: 272
fast_mix: 328   fast_mix2: 442  fast_mix4: 272
fast_mix: 328   fast_mix2: 431  fast_mix4: 272
fast_mix: 329   fast_mix2: 442  fast_mix4: 272

And on a Phenom:
$ /tmp/ted64
fast_mix: 250   fast_mix2: 174  fast_mix4: 109
fast_mix: 258   fast_mix2: 170  fast_mix4: 114
fast_mix: 371   fast_mix2: 285  fast_mix4: 109
fast_mix: 516   fast_mix2: 156  fast_mix4: 90
fast_mix: 140   fast_mix2: 184  fast_mix4: 170
fast_mix: 406   fast_mix2: 146  fast_mix4: 88
fast_mix: 185   fast_mix2: 114  fast_mix4: 94
fast_mix: 161   fast_mix2: 116  fast_mix4: 98
fast_mix: 152   fast_mix2: 104  fast_mix4: 94
fast_mix: 352   fast_mix2: 140  fast_mix4: 79

> So given that 32-bit processors tend to be slower, I'm pretty sure
> if we want to add a 64-bit optimization, we'll have to conditionalize
> it on BITS_PER_LONG == 64 and include both the original code and the
> 64-bit optimized code.

Sorry I neglected to say so earlier; that has *always* been my intention.
The 32-bit version is primary; the 64-bit version is a conditional
optimization.

If I can make it faster *and* have more avalanche (and less register
pressure, too), it seems worth the hassle of having two versions.
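
As a rough illustration of keeping both versions behind a word-size test,
here is a userspace sketch. pool_xor_in() is a hypothetical helper, not
fast_mix or fast_mix2, and it tests ULONG_MAX where kernel code would test
BITS_PER_LONG == 64; both branches give the same result, only the operation
width differs.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: XOR a 16-byte sample into a 16-byte pool. */
#if ULONG_MAX > 0xffffffffUL
/* 64-bit build: conditional optimization, two 64-bit XORs. */
static void pool_xor_in(uint32_t pool[4], const uint32_t in[4])
{
	uint64_t p[2], s[2];

	memcpy(p, pool, sizeof(p));
	memcpy(s, in, sizeof(s));
	p[0] ^= s[0];
	p[1] ^= s[1];
	memcpy(pool, p, sizeof(p));
}
#else
/* 32-bit build: the primary version, four 32-bit XORs. */
static void pool_xor_in(uint32_t pool[4], const uint32_t in[4])
{
	int i;

	for (i = 0; i < 4; i++)
		pool[i] ^= in[i];
}
#endif
```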

Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Paul E. McKenney

On Sat, Jun 14, 2014 at 01:39:36AM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 13, 2014 at 04:27:15PM -0700, Paul E. McKenney wrote:
> > On Sat, Jun 14, 2014 at 01:10:35AM +0200, Frederic Weisbecker wrote:
> > > On Fri, Jun 13, 2014 at 03:49:26PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jun 13, 2014 at 02:10:35PM -0700, Josh Triplett wrote:
> > > > > On Fri, Jun 13, 2014 at 01:48:22PM -0700, Paul E. McKenney wrote:
> > > > > > On Fri, Jun 13, 2014 at 09:44:41AM -0700, Josh Triplett wrote:
> > > > > > > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > > > > > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > Is it because we have dynticks CPUs staying too long in the kernel without
> > > > > > > > > > taking any quiescent states? Are we perhaps missing some rcu_user_enter() or
> > > > > > > > > > things?
> > > > > > > > > 
> > > > > > > > > Sort of the former, but combined with the fact that in-kernel CPUs still
> > > > > > > > > need scheduling-clock interrupts for RCU to make progress.  I could
> > > > > > > > > move this to RCU's context-switch hook, but that could be very bad for
> > > > > > > > > workloads that do lots of context switching.
> > > > > > > > 
> > > > > > > > Or I can restart the tick if the CPU stays in the kernel for too long without
> > > > > > > > a tick. I think that's what we were doing before but we removed that because
> > > > > > > > we never implemented it correctly (we sent scheduler IPI that did nothing...)
> > > > > > > 
> > > > > > > I wonder if timer slack would make sense here: when you have at least
> > > > > > > one RCU callback pending, set a timer with a huge amount of timer slack,
> > > > > > > and cancel it if you end up handling the callback via a trip through the
> > > > > > > scheduler.
> > > > > > 
> > > > > > But in this case, we need the tick even if the current CPU has no callbacks
> > > > > > because it might be in an RCU read-side critical section.
> > > > > 
> > > > > Don't we handle that case via the slowpath of rcu_read_unlock, and a
> > > > > flag set via IPI?  ("Oh, that CPU has taken too long to note a quiescent
> > > > > state; send it an IPI to set the special flag that makes unlock do the
> > > > > work.")
> > > > 
> > > > There was once such logic on the force-quiescent-state path, and making
> > > > that handle this new case was my first proposal.  As Frederic pointed
> > > > out, that change requires rcu_needs_cpu()'s cooperation, because otherwise
> > > > the CPU will take the IPI, see that it still has but one runnable task,
> > > > and then keep its scheduling-clock interrupt off.
> > > 
> > > Exactly. So that's what happens currently, we call rcu_kick_nohz_cpu()
> > > on extended grace periods but the IPI doesn't reconsider the tick.
> > > 
> > > In fact it doesn't do anything at all because the scheduler IPI,
> > > when invoked without a reason, doesn't even call irq_enter()/irq_exit(),
> > > so rcu_needs_cpu() isn't quite called from there.
> > > 
> > > Now that's going to change with https://lwn.net/Articles/601836/ if
> > > we convert rcu_kick_nohz_cpu() to tick_nohz_full_kick_cpu().
> > > 
> > > Then we have the choice between two options:
> > > 
> > > * We can add a check in tick_nohz_full_check() and restart the tick if
> > > necessary.
> > > 
> > > * Extend rcu_needs_cpu() to restore a similar periodic mode until the
> > > grace periods get some progress.
> > 
> > If I was to extend rcu_needs_cpu(), I would add a flag and another counter
> > to the rcu_data structure.  If rcu_needs_cpu() saw the flag set and the
> > counter equal to the current ->completed value, it would return true.
> > 
> > I already have the rcu_kick_nohz_cpu() in rcu_implicit_dynticks_qs(),
> > so it is just a matter of also setting the flag and copying ->completed
> > to the new counter at that point.  I currently get to this point if the
> > CPU has managed to run for more than one jiffy without hitting either
> > idle or userspace execution.  Fair enough?
> 
> Perfect for me!

One complication...  So if the grace period has gone on for a long time,
and you are returning to kernel mode, RCU will need the scheduling-clock
tick.  However, in that very same situation, if you are returning to
idle or to NO_HZ_FULL userspace execution, RCU does -not- need the
scheduling-clock tick set.

One way I could do this is to have rcu_needs_cpu() return three values:
zero if RCU doesn't need a scheduling-clock tick for any reason,
one if RCU needs a scheduling-clock tick only when returning to kernel
mode, and two if RCU unconditionally needs the scheduling-clock tick.
Would that work, or is there a better approach?
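
A rough userspace sketch of that three-valued return, combined with the
flag and ->completed snapshot from earlier in the thread, might look like
the following. Every name here (rcu_tick_need, rcu_tick_state,
rcu_needs_tick(), keep_tick(), resume_target) is hypothetical, and treating
pending callbacks as the "unconditional" case is an assumption made only
for illustration.

```c
#include <stdbool.h>

/* Hypothetical names for the three return values. */
enum rcu_tick_need {
	RCU_TICK_NONE = 0,	/* no scheduling-clock tick needed */
	RCU_TICK_IF_KERNEL = 1,	/* needed only when returning to kernel mode */
	RCU_TICK_ALWAYS = 2,	/* needed unconditionally */
};

/* Where the CPU is about to resume execution (also hypothetical). */
enum resume_target { RESUME_KERNEL, RESUME_IDLE, RESUME_NOHZ_USER };

/* Hypothetical per-CPU state holding the flag and the snapshot. */
struct rcu_tick_state {
	bool kicked;		/* set when the CPU was kicked for delaying a GP */
	unsigned long gp_snap;	/* ->completed value copied at that point */
	bool has_callbacks;	/* assumed stand-in for "unconditional" work */
};

static enum rcu_tick_need rcu_needs_tick(const struct rcu_tick_state *st,
					 unsigned long completed)
{
	if (st->has_callbacks)
		return RCU_TICK_ALWAYS;
	/* Same grace period still in progress since the kick? */
	if (st->kicked && st->gp_snap == completed)
		return RCU_TICK_IF_KERNEL;
	return RCU_TICK_NONE;
}

/* Caller side: combine RCU's answer with where the CPU is going next. */
static bool keep_tick(enum rcu_tick_need need, enum resume_target target)
{
	if (need == RCU_TICK_ALWAYS)
		return true;
	return need == RCU_TICK_IF_KERNEL && target == RESUME_KERNEL;
}
```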

[PATCH] staging: ced1401: fix sparse warning for ced1401

2014-06-13 Thread Seunghun Lee

This patch fixes the sparse warning below.

drivers/staging/ced1401/ced_ioc.c:703:30: warning: incorrect type in assignment (different address spaces)
drivers/staging/ced1401/ced_ioc.c:703:30:    expected void *[usertype] lpvBuff
drivers/staging/ced1401/ced_ioc.c:703:30:    got char [noderef] *puBuf

Signed-off-by: Seunghun Lee 
---
 drivers/staging/ced1401/ced_ioc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/ced1401/ced_ioc.c b/drivers/staging/ced1401/ced_ioc.c
index ebbc509..963b941 100644
--- a/drivers/staging/ced1401/ced_ioc.c
+++ b/drivers/staging/ced1401/ced_ioc.c
@@ -700,7 +700,7 @@ static int SetArea(DEVICE_EXTENSION *pdx, int nArea, char __user *puBuf,
/*  kmap() or kmap_atomic() to get a virtual address. page_address will give you */
/*  (null) or at least it does in this context with an x86 machine. */
spin_lock_irq(&pdx->stagedLock);
-   pTA->lpvBuff = puBuf;   /*  keep start of region (user address) */
+   pTA->lpvBuff = (__force void *)puBuf;   /*  keep start of region (user address) */
pTA->dwBaseOffset = ulOffset;   /*  save offset in first page to start of xfer */
pTA->dwLength = dwLength;   /*  Size if the region in bytes */
pTA->pPages = pPages;   /*  list of pages that are used by buffer */
-- 
1.7.9.5
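
For readers unfamiliar with the annotation, here is a small userspace
sketch of what the cast does. The macro definitions are assumed to mirror
the kernel's sparse annotations of that era; struct transfer_area and
remember_user_buffer() are hypothetical stand-ins for the driver's types.

```c
#include <stddef.h>

/*
 * Under sparse (__CHECKER__) these expand to attributes; for an ordinary
 * compiler they expand to nothing.
 */
#ifdef __CHECKER__
# define __user		__attribute__((noderef, address_space(1)))
# define __force	__attribute__((force))
#else
# define __user
# define __force
#endif

/* Hypothetical cut-down version of the driver's transfer-area record. */
struct transfer_area {
	void *buf_start;	/* start of region; still a user address */
};

static void remember_user_buffer(struct transfer_area *ta, char __user *ubuf)
{
	/*
	 * __force tells sparse that dropping the address space is
	 * deliberate. It only silences the warning: the stored pointer
	 * is still a user address and must not be dereferenced directly.
	 */
	ta->buf_start = (__force void *)ubuf;
}
```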


Re: [RESEND PATCH 1/2] ARM: AM43xx: hwmod: add DSS hwmod data

2014-06-13 Thread Felipe Balbi

Hi,

On Sat, Jun 14, 2014 at 02:57:32AM +, Paul Walmsley wrote:
> > > > > From: Sathya Prakash M R 
> > > > > 
> > > > > Add DSS hwmod data for AM43xx.
> > > > > 
> > > > > Cc: Andrew Morton 
> > > > > Acked-by: Rajendra Nayak 
> > > > > Signed-off-by: Sathya Prakash M R 
> > > > > Signed-off-by: Tomi Valkeinen 
> > > > > Signed-off-by: Felipe Balbi 
> > > > > ---
> > > > > 
> > > > > Note that this patch was originally sent on May 9th [1], changes were requested
> > > > > and a new version was sent on May 19th [2], then on May 27th [3] Tomi pinged
> > > > > the maintainer again and got no response.
> > > > > 
> > > > > Without this patch, we cannot get display working on any AM437x devices.
> > > > > 
> > > > > [1] http://marc.info/?l=linux-arm-kernel&m=139963677925227&w=2
> > > > > [2] http://marc.info/?l=linux-arm-kernel&m=140049799425512&w=2
> > > > > [3] http://marc.info/?l=linux-arm-kernel&m=140117232826754&w=2
> > > > > 
> > > > >  arch/arm/mach-omap2/omap_hwmod_43xx_data.c | 98 
> > > > > ++
> > > > >  arch/arm/mach-omap2/prcm43xx.h |  1 +
> > > > >  2 files changed, 99 insertions(+)
> > > 
> > > Sorry for the delay on this.  Have been corresponding with TI management 
> > > to figure out what to do about patches for AM43xx.  I don't have boards 
> > > or 
> > > public documentation for these devices, so it's impossible for me to 
> > > meaningfully review the patches.  Looks like boards and/or public docs 
> > > won't be coming any time soon.
> > > 
> > > So for my part, here's what I'll need to merge any hwmod or PRCM patches 
> > > that involve AM437x:
> > > 
> > > 1. A Reviewed-by: from one of the following folks (which should come from
> > > a different person than who is submitting the patches):
> > > 
> > > Roger Quadros
> > > Nishanth Menon
> > > Rajendra Nayak
> > > Kevin Hilman
> > > Tony Lindgren
> > > 
> > > 2. A Tested-by: from one of the following folks (who can be the same as
> > > the person who is submitting the patches):
> > > 
> > > Nishanth Menon
> > > Rajendra Nayak
> > > Kevin Hilman 
> > > Tony Lindgren
> > 
> > What you're saying here is that it's pointless for anybody else in TI to
> > review and/or test patches because you will only accept such tags from
> > this list of 4 ~ 5 people.
> 
> That might be how you interpreted the E-mail.  But that's not what was 
> written.

of course it was. Read what you wrote:

"here's what I'll need to *merge* any hwmod or PRCM patches that involve
AM437x".

That basically puts down the requirements to getting any patches
accepted and those requirements are the blessings of a handful.

> For the record, I'm pleased to accept Reviewed-by:s and Tested-by:s from 
> anyone.  But, like most maintainers, there are some folks who I think do a 
> better job of reviewing and testing hwmod and PRCM patches than others.
> 
> The people listed above are a first cut at that list.  I'm certainly
> happy to consider adding others, but the reviewers need:
> 
> 1. to have experience with those parts of the kernel;
> 
> 2. to have access to the canonical documentation for AM43xx to review
> against; and

anybody at ti.com has access to those.

> 3. to have some kind of track record doing in-depth reviews of patches
> for that subsystem, or writing clean code for that subsystem.
> 
> 
> Similarly, for testers, the folks listed above are people who:
> 
> 1. could actually have AM43xx boards; and

well, quite a few have rather easy access to multiple (3, to be exact)
different am437x platforms.

> 2. who have a history of testing patches against mainline kernels in 
> public forums, rather than testing against vendor kernels; and

$subject and patch two have both been tested on top of linux-next from
June 10th. Is that bleeding edge enough for you? Moreover, *only* these
two patches were applied on top of Stephen's linux-next.

> 3. who I think would be mortally embarrassed if a patch was broken 
> that they had a Tested-by: for.

right, and when those guys try to get bugs fixed, we spend half a year
discussing pointless might-happen-when-the-sun-dies problems with other
drivers even when... h what the heck, you'll just say I'm mixing
threads again...

The point is that this back and forth has been going on for quite a while now;
on countless occasions we have missed merge windows because this or that
maintainer just stops responding and *nobody* else has the balls to pick the
patch up.

Weeks later social network posts start to arise blaming TI for not
sending patches upstream.

> (N.B. In the case of anything involving DSS, such as this patch, I'd be 
> happy to accept Tested-by:s from Archit or Tomi.)
> 
> If you have other people that you think I'm missing from the above two 
> lists, who meet those requirements, please suggest some names!

the point is about not having a list. Sure, you need to know some folks
who you can trust, but sometimes, when it's clear that the

[RFC] random: is the IRQF_TIMER test working as intended?

2014-06-13 Thread George Spelvin

I'm trying to understand the entropy credit computation in
add_interrupt_randomness.  A few things confuse me, and I'm
wondering if it's intended to be that way.

1) Since the number of samples between spills to the input pool is
   variable (with > 64 samples now possible due to the trylock), wouldn't
   it make more sense to accumulate an entropy estimate?
2) Why only deny entropy credit for back-to-back timer interrupts?
   If both t2 - x and x - t1 are worth credit, why not t2 - t1?
   It seems a lot better (not to mention simpler) to not credit any
   timer interrupt, so x - t1 will get credit but not t2 - x.
3) Why only consider the status of the interrupts when spills occur?
   This is the most confusing. The whole __IRQF_TIMER and last_timer_intr
   logic simply skips over the intermediate samples, so it actually
   detects timer interrupts 64 interrupts (or 1 second) apart.
   Shouldn't that sort of thing actually be looking at *consecutive*
   calls to add_interrupt_randomness?
4) If the above logic denies credit, why deny credit for
   arch_get_random_seed_long as well?

For discussion, here's an example of a change that fixes all of the
above, in patch form.  (The credit_entropy_frac function is omitted but
hopefully obvious.)

The amount of entropy credit particularly needs thought.  I'm currently
using 1/8 of a bit per sample to keep the patch as simple as possible.
This is 8x the current credit if interrupts are frequent, but less if they
occur at less than 8 Hz.  That actually seems on the conservative side
of reasonable to me (1/8 of a bit is odds of 1 in 58.3817), particularly
if there's a cycle timer.
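
The "8x" and "8 Hz" figures can be checked with a deliberately simplified
model, sketched here in userspace C. The assumptions: credit is counted in
1/8-bit units (ENTROPY_SHIFT of 3), the current scheme credits 1 bit per
spill with a spill every 64 samples or once a second, and the proposed
scheme credits one unit per sample. The arch seed credit and the
timer-interrupt denial are ignored, and old_credit_per_sec() and
new_credit_per_sec() are hypothetical names.

```c
#define ENTROPY_SHIFT 3		/* assumed: credit in 1/8-bit units */

/* Current scheme (simplified): 1 bit per spill. */
static unsigned int old_credit_per_sec(unsigned int irq_hz)
{
	unsigned int spills;

	if (irq_hz == 0)
		return 0;		/* no interrupts, no spills */
	spills = irq_hz / 64;		/* spills triggered by the sample count */
	if (spills == 0)
		spills = 1;		/* the one-second timeout forces one spill */
	return spills << ENTROPY_SHIFT;
}

/* Proposed scheme (simplified): 1/8 bit, i.e. one unit, per sample. */
static unsigned int new_credit_per_sec(unsigned int irq_hz)
{
	return irq_hz;
}
```

At 640 interrupts per second the model gives 80 units (10 bits) for the
current scheme and 640 units (80 bits) for the proposed one; at 8 Hz both
give 8 units (1 bit), and below that the proposed scheme credits less.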


diff --git a/drivers/char/random.c b/drivers/char/random.c
index 03c385f5..c877cb65 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -548,9 +548,8 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 struct fast_pool {
__u32   pool[4];
unsigned long   last;
-   unsigned short  count;
+   unsigned short  entropy;/* Entropy, in fractional bits */
unsigned char   rotate;
-   unsigned char   last_timer_intr;
 };
 
 /*
@@ -577,7 +576,6 @@ static void fast_mix(struct fast_pool *f, __u32 input[4])
input_rotate = (input_rotate + 7) & 31;
 
f->rotate = input_rotate;
-   f->count++;
 }
 
 /*
@@ -851,15 +849,33 @@ void add_interrupt_randomness(int irq, int irq_flags)
 
fast_mix(fast_pool, input);
 
-   if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ))
+   /*
+* If we don't have a valid cycle counter, don't give credit for
+* timer interrupts.  Otherwise, credit 1/8 bit per interrupt.
+* (Should there be a difference if there's a cycle counter?)
+*/
+   if (cycles || (irq_flags & __IRQF_TIMER) == 0)
+   credit = 1; /* 1/8 bit */
+   else
+   credit = 0;
+
+   credit += fast_pool->entropy;
+
+   if (credit < 8 << ENTROPY_SHIFT &&
+   !time_after(now, fast_pool->last + HZ)) {
+   fast_pool->entropy = credit;
return;
+   }
+
+   credit = min_t(int, credit, 32 << ENTROPY_SHIFT);
 
r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
if (!spin_trylock(&r->lock)) {
-   fast_pool->count--;
+   fast_pool->entropy = credit;
return;
}
fast_pool->last = now;
+   fast_pool->entropy = 0;
__mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
 
/*
@@ -867,28 +883,13 @@ void add_interrupt_randomness(int irq, int irq_flags)
 * add it to the pool.  For the sake of paranoia count it as
 * 50% entropic.
 */
-   credit = 1;
if (arch_get_random_seed_long(&seed)) {
__mix_pool_bytes(r, &seed, sizeof(seed));
-   credit += sizeof(seed) * 4;
+   credit += sizeof(seed) * 4 << ENTROPY_SHIFT;
}
spin_unlock(&r->lock);
 
-   /*
-* If we don't have a valid cycle counter, and we see
-* back-to-back timer interrupts, then skip giving credit for
-* any entropy, otherwise credit 1 bit.
-*/
-   if (cycles == 0) {
-   if (irq_flags & __IRQF_TIMER) {
-   if (fast_pool->last_timer_intr)
-   credit = 0;
-   fast_pool->last_timer_intr = 1;
-   } else
-   fast_pool->last_timer_intr = 0;
-   }
-
-   credit_entropy_bits(r, credit);
+   credit_entropy_frac(r, credit);
 }
 
 #ifdef CONFIG_BLOCK

Re: [PATCH 2/2] staging: tidspbridge: Fix function pointer spacing in struct definition

2014-06-13 Thread Joe Perches

On Fri, 2014-06-13 at 22:48 -0400, Jeff Oczek wrote:
> Simple coding style changes
[]
> diff --git a/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h 
> b/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
[]
> @@ -168,11 +168,11 @@ struct dbll_attrs {
[]
> -  s32(*fread) (void *, size_t, size_t, void *);
> -  s32(*fseek) (void *, long, int);
> -  s32(*ftell) (void *);
> -  s32(*fclose) (void *);
> - void *(*fopen) (const char *, const char *);
> +  s32 (*fread)(void *, size_t, size_t, void *);
> +  s32 (*fseek)(void *, long, int);
> +  s32 (*ftell)(void *);
> +  s32 (*fclose)(void *);
> + void *(*fopen)(const char *, const char *);
>  };

Better would be to describe the arguments with
variable names and align all the return values

void *(*fopen

s32   (*fread)(void *arg1, size_t val1, size_t val2, void *ptr1);
s32   (*fseek)(void *ptr1, long arg2, int arg3);
s32   (*ftell)(void *ptr);
s32   (*fclose)(void *ptr);
void *(*fopen)(const char *ptr1, const char *ptr2);

where arg, val, ptr are actually useful descriptors
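
As an illustration of what descriptive names could look like, here is a
self-contained sketch with a cut-down callback table backed by an in-memory
buffer. It is not the tidspbridge code: struct file_ops, struct mem_file and
the parameter names are hypothetical, and the argument order is only
assumed to mirror stdio.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

typedef int32_t s32;

/* Hypothetical in-memory "file" used to exercise the callbacks. */
struct mem_file {
	const char *data;
	size_t len;
	size_t pos;
};

/* Callback table with named parameters and aligned return types. */
struct file_ops {
	s32	(*read)(void *buf, size_t size, size_t count, void *file);
	s32	(*seek)(void *file, long offset, int whence);
	s32	(*tell)(void *file);
};

/* Copy up to count items of size bytes; return the number of items read. */
static s32 mem_read(void *buf, size_t size, size_t count, void *file)
{
	struct mem_file *f = file;
	size_t avail, n;

	if (size == 0)
		return 0;
	avail = (f->len - f->pos) / size;
	n = count < avail ? count : avail;
	memcpy(buf, f->data + f->pos, n * size);
	f->pos += n * size;
	return (s32)n;
}

/* Move the position; return 0 on success, -1 if the target is out of range. */
static s32 mem_seek(void *file, long offset, int whence)
{
	struct mem_file *f = file;
	long base, target;

	switch (whence) {
	case SEEK_SET: base = 0; break;
	case SEEK_CUR: base = (long)f->pos; break;
	case SEEK_END: base = (long)f->len; break;
	default: return -1;
	}
	target = base + offset;
	if (target < 0 || (size_t)target > f->len)
		return -1;
	f->pos = (size_t)target;
	return 0;
}

/* Report the current position. */
static s32 mem_tell(void *file)
{
	return (s32)((struct mem_file *)file)->pos;
}

static const struct file_ops mem_ops = {
	.read = mem_read,
	.seek = mem_seek,
	.tell = mem_tell,
};
```

With names like buf, count, offset and whence in the declaration, a reader
can tell which pointer is the buffer and which is the file handle without
consulting the implementation.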


Re: [PATCH v2] mm/vmscan.c: wrap five parameters into shrink_result for reducing the stack consumption

2014-06-13 Thread Chen Yucong

On Fri, 2014-06-13 at 12:28 -0400, Johannes Weiner wrote:
> On Fri, Jun 13, 2014 at 01:21:15PM +0800, Chen Yucong wrote:
> > On Thu, 2014-06-12 at 21:40 -0700, Andrew Morton wrote:
> > > On Fri, 13 Jun 2014 12:36:31 +0800 Chen Yucong  wrote:
> > > 
> > > > @@ -1148,7 +1146,8 @@ unsigned long 
> > > > reclaim_clean_pages_from_list(struct zone *zone,
> > > > .priority = DEF_PRIORITY,
> > > > .may_unmap = 1,
> > > > };
> > > > -   unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5;
> > > > +   unsigned long ret;
> > > > +   struct shrink_result dummy = { };
> > > 
> > > You didn't like the idea of making this static?
> > Sorry! It's my negligence.
> > If we make dummy static, it can help us save more stack.
> > 
> > without change:  
> > 0x810aede8 reclaim_clean_pages_from_list []:184
> > 0x810aeef8 reclaim_clean_pages_from_list []:184
> > 
> > with change: struct shrink_result dummy = {};
> > 0x810aed6c reclaim_clean_pages_from_list []:152
> > 0x810aee68 reclaim_clean_pages_from_list []:152
> > 
> > with change: static struct shrink_result dummy ={};
> > 0x810aed69 reclaim_clean_pages_from_list []:120
> > 0x810aee4d reclaim_clean_pages_from_list []:120
> 
> FWIW, I copied bloat-o-meter and hacked up a quick comparison tool
> that you can feed two outputs of checkstack.pl for a whole vmlinux and
> it shows you the delta.
> 
> The output for your patch (with the static dummy) looks like this:
> 
> +0/-240 -240
> shrink_inactive_list                 136     112     -24
> shrink_page_list                     208     160     -48
> reclaim_clean_pages_from_list        168       -    -168
> 
> (The stack footprint for reclaim_clean_pages_from_list is actually 96
> after your patch, but checkstack.pl skips frames under 100)
> 
Thanks very much for your comparison tool. Its output is more concise.

thx!
cyc

gcc version 4.7.3 (Gentoo 4.7.3-r1 p1.4, pie-0.5.5)
kernel version 3.15(stable)
Intel(R) Core(TM)2 Duo CPU T5670  @ 1.80GHz

The output for this patch (with the static dummy) is:

+0/-144 -144
shrink_inactive_list                 152     120     -32
shrink_page_list                     232     184     -48
reclaim_clean_pages_from_list        184     120     -64

---
gcc version 4.7.2 (Debian 4.7.2-5)
kernel version 3.15(stable)
Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz

The output for this patch (with the static dummy) is:

shrink_inactive_list                 136     120     -16
shrink_page_list                     216     168     -48
reclaim_clean_pages_from_list        184     120     -64
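
For readers who have not seen the patch itself, the idea can be sketched in
userspace C as follows. The field names, shrink_pages() and
reclaim_clean_pages() are hypothetical and the counter arithmetic is made
up; the sketch only shows five out-parameters folded into one struct and a
caller that discards them through a static dummy.

```c
/* Five result counters bundled into one struct (hypothetical fields). */
struct shrink_result {
	unsigned long nr_dirty;
	unsigned long nr_unqueued_dirty;
	unsigned long nr_congested;
	unsigned long nr_writeback;
	unsigned long nr_immediate;
};

/* One pointer argument instead of five unsigned long * arguments. */
static unsigned long shrink_pages(unsigned long nr_pages,
				  struct shrink_result *res)
{
	res->nr_dirty += nr_pages / 4;		/* made-up accounting */
	res->nr_writeback += nr_pages / 8;
	return nr_pages / 2;			/* pretend half were reclaimed */
}

/* A caller that does not care about the counters. */
static unsigned long reclaim_clean_pages(unsigned long nr_pages)
{
	/*
	 * static: the struct lives in .bss rather than in this stack frame.
	 * All callers share it, which is only acceptable because nothing
	 * ever reads the values back.
	 */
	static struct shrink_result dummy;

	return shrink_pages(nr_pages, &dummy);
}
```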



Re: random: Benchamrking fast_mix2

2014-06-13 Thread Theodore Ts'o

On Fri, Jun 13, 2014 at 10:10:14PM -0400, George Spelvin wrote:
> > Unrolling doesn't make much difference, which isn't surprising given
> > that almost all of the differences go away when I commented out the
> > udelay().  Basically, at this point what we're primarily measuring is
> > how well various CPUs' caches work, especially across context switches
> > where other code gets to run in between.
> 
> Huh.  As I was referring to when I talked about the branch
> predictor, I was hoping that removing *conditional* branches would
> help.

At least for Intel, between its branch predictor and speculative
execution engine, it doesn't make a difference.  

> Are you trying for an XOR to memory, or is the idea to remain in
> registers for the entire operation?
>
> I'm not sure an XOR to memory is that much better; it's 2 pool loads
> and 1 pool store either way.  Currently, the store is first (to
> input[]) and then both it and the fast_pool are fetched in fast_mix.
> 
> With an XOR to memory, it's load-store-load, but is that really better?

The second load can be optimized away.  If the compiler isn't smart
enough, the store means that the data is almost certainly still in the
D-cache.  But with a smart compiler (and gcc should be smart enough),
if fast_mix is a static function, gcc will inline fast_mix, and then
it should be able to optimize out the load.  In fact, it might be
smart enough to optimize out the first store, since it should be able
to realize that first store to the pool[] array will get overwritten
by the final store to the pool[] array.

So hopefully, it will remain in registers for the entire operation,
and the compilers will hopefully be smart enough to make the right
thing happen without the code having to be really ugly.

> In case it's useful, below is a small patch I made to
> add_interrupt_randomness to take advantage of 64-bit processors and make
> it a bit clearer what it's doing.  Not submitted officially because:
> 1) I haven't examined the consequences on 32-bit processors carefully yet.

When I did a quick comparison of your 64-bit fast_mix2 variant, it's
much slower than either the 32-bit fast_mix2, or the original fast_mix
algorithm.  So given that 32-bit processors tend to be slower, I'm
pretty sure if we want to add a 64-bit optimization, we'll have to
conditionalize it on BITS_PER_LONG == 64 and include both the original
code and the 64-bit optimized code.
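
As a rough sketch of that kind of conditional (userspace only, with
SKETCH_BITS_PER_LONG standing in for the kernel's BITS_PER_LONG, and the
type and function names made up for illustration), the word width can be
chosen at compile time while the pool stays 128 bits either way:

```c
#include <limits.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BITS_PER_LONG. */
#if ULONG_MAX > 0xffffffffUL
#define SKETCH_BITS_PER_LONG 64
#else
#define SKETCH_BITS_PER_LONG 32
#endif

#if SKETCH_BITS_PER_LONG == 64
typedef uint64_t sketch_word;	/* 2 words cover the 128-bit pool */
#define SKETCH_WORDS 2
#else
typedef uint32_t sketch_word;	/* 4 words cover the 128-bit pool */
#define SKETCH_WORDS 4
#endif

struct sketch_pool {
	sketch_word w[SKETCH_WORDS];
};

/* XOR one pool-sized block of input into the pool, word by word. */
static void sketch_xor_in(struct sketch_pool *f,
			  const sketch_word in[SKETCH_WORDS])
{
	int i;

	for (i = 0; i < SKETCH_WORDS; i++)
		f->w[i] ^= in[i];
}
```

A real version would carry two full mixing functions behind the same
conditional; only the selection mechanism is shown here.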

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RESEND PATCH 1/2] ARM: AM43xx: hwmod: add DSS hwmod data

2014-06-13 Thread Paul Walmsley

Hi

On Fri, 13 Jun 2014, Felipe Balbi wrote:

> On Fri, Jun 13, 2014 at 07:11:58PM +, Paul Walmsley wrote:
> > > > From: Sathya Prakash M R 
> > > > 
> > > > Add DSS hwmod data for AM43xx.
> > > > 
> > > > Cc: Andrew Morton 
> > > > Acked-by: Rajendra Nayak 
> > > > Signed-off-by: Sathya Prakash M R 
> > > > Signed-off-by: Tomi Valkeinen 
> > > > Signed-off-by: Felipe Balbi 
> > > > ---
> > > > 
> > > > Note that this patch was originally sent on May 9th [1], changes were
> > > > requested and a new version was sent on May 19th [2], then on May 27th
> > > > [3] Tomi pinged the maintainer again and got no response.
> > > > 
> > > > Without this patch, we cannot get display working on any AM437x devices.
> > > > 
> > > > [1] http://marc.info/?l=linux-arm-kernel&m=139963677925227&w=2
> > > > [2] http://marc.info/?l=linux-arm-kernel&m=140049799425512&w=2
> > > > [3] http://marc.info/?l=linux-arm-kernel&m=140117232826754&w=2
> > > > 
> > > >  arch/arm/mach-omap2/omap_hwmod_43xx_data.c | 98 ++
> > > >  arch/arm/mach-omap2/prcm43xx.h |  1 +
> > > >  2 files changed, 99 insertions(+)
> > 
> > Sorry for the delay on this.  Have been corresponding with TI management 
> > to figure out what to do about patches for AM43xx.  I don't have boards or 
> > public documentation for these devices, so it's impossible for me to 
> > meaningfully review the patches.  Looks like boards and/or public docs 
> > won't be coming any time soon.
> > 
> > So for my part, here's what I'll need to merge any hwmod or PRCM patches 
> > that involve AM437x:
> > 
> > 1. A Reviewed-by: from one of the following folks (which should come from
> > a different person than who is submitting the patches):
> > 
> > Roger Quadros
> > Nishanth Menon
> > Rajendra Nayak
> > Kevin Hilman
> > Tony Lindgren
> > 
> > 2. A Tested-by: from one of the following folks (who can be the same as 
> > the person who is submitting the patches):
> > 
> > Nishanth Menon
> > Rajendra Nayak
> > Kevin Hilman 
> > Tony Lindgren
> 
> What you're saying here is that it's pointless for anybody else in TI to
> review and/or test patches because you will only accept such tags from
> this list of 4 ~ 5 people.

That might be how you interpreted the E-mail.  But that's not what was 
written.

For the record, I'm pleased to accept Reviewed-by:s and Tested-by:s from 
anyone.  But, like most maintainers, there are some folks who I think do a 
better job of reviewing and testing hwmod and PRCM patches than others.

The people listed above are a first cut at that list.  I'm certainly happy 
to consider adding others, but the reviewers need:

1. to have experience with those parts of the kernel;

2. to have access to the canonical documentation for AM43xx to review 
against; and

3. to have some kind of track record doing in-depth reviews of patches for 
that subsystem, or writing clean code for that subsystem.

Similarly, for testers, the folks listed above are people who:

1. could actually have AM43xx boards; and

2. who have a history of testing patches against mainline kernels in 
public forums, rather than testing against vendor kernels; and

3. who I think would be mortally embarrassed if a patch was broken 
that they had a Tested-by: for.

(N.B. In the case of anything involving DSS, such as this patch, I'd be 
happy to accept Tested-by:s from Archit or Tomi.)

If you have other people that you think I'm missing from the above two 
lists, who meet those requirements, please suggest some names!

> Quite frankly, it's very upsetting to see an affirmation that all the
> work that I (personally) and many others do is seen as "pointless" from
> your side *unless* it gets the blessing from the few folks listed above.

I'd be curious to know how many of the people listed in the Signed-off-by: 
for these patches have double-checked the data against the TRM (or 
whatever documentation is canonical for this chip).  And have thought 
through whether the data actually makes sense with regards to the SoC 
integration.  I consider those to be the prerequisites for reviewing hwmod 
device data patches.  That's what I generally do myself, and that's what I 
expect from trusted reviewers.

> This just makes it ever more difficult for anything, which is clearly
> *BROKEN* to be fixed upstream and will just contribute to people
> vanishing from mainline development.

Sounds like you might be mixing mailing list threads.  

The description for these patches states:

"Add DSS hwmod data for AM43xx"

Unless I'm missing something, these patches add a feature.  They are not 
fixing something that is broken.

> The very fact that you will only accept patches blessed by the gang-of-4
> goes against the very foundations of open source development. Just
> because you don't have access to documentation - and granted, that
> _does_ make things a lot more difficult - does not mean you have to
> consider an entire company as

[PATCH 1/2] staging: tidspbridge: Fix pointer spacing

2014-06-13 Thread Jeff Oczek

Simple coding style changes
This is for Eudyptula Challenge task 10

Signed-off-by: Jeff Oczek 
---
 drivers/staging/tidspbridge/include/dspbridge/dblldefs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h b/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
index 30e0aa0..5e44ba6 100644
--- a/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
+++ b/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
@@ -130,7 +130,7 @@ typedef s32(*dbll_seek_fxn) (void *, long, int);
  *  FALSE:  Failed to find symbol.
  */
 typedef bool(*dbll_sym_lookup) (void *handle, void *parg, void *rmm_handle,
-   const char *name, struct dbll_sym_val ** sym);
+   const char *name, struct dbll_sym_val **sym);
 
 /*
  *   dbll_tell_fxn 
@@ -309,7 +309,7 @@ typedef bool(*dbll_get_c_addr_fxn) (struct dbll_library_obj *lib, char *name,
  *  Ensures:
  */
 typedef int(*dbll_get_sect_fxn) (struct dbll_library_obj *lib,
-   char *name, u32 * addr, u32 * size);
+   char *name, u32 *addr, u32 *size);
 
 /*
  *   dbll_init 
-- 
1.9.1


[PATCH 2/2] staging: tidspbridge: Fix function pointer spacing in struct definition

2014-06-13 Thread Jeff Oczek

Simple coding style changes
This is for the Eudyptula Challenge task 10

Signed-off-by: Jeff Oczek 
---
 drivers/staging/tidspbridge/include/dspbridge/dblldefs.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h b/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
index 5e44ba6..dd3e6eb 100644
--- a/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
+++ b/drivers/staging/tidspbridge/include/dspbridge/dblldefs.h
@@ -168,11 +168,11 @@ struct dbll_attrs {
 *  These file manipulation functions should be compatible with the
 *  "C" run time library functions of the same name.
 */
-s32(*fread) (void *, size_t, size_t, void *);
-s32(*fseek) (void *, long, int);
-s32(*ftell) (void *);
-s32(*fclose) (void *);
-   void *(*fopen) (const char *, const char *);
+s32 (*fread)(void *, size_t, size_t, void *);
+s32 (*fseek)(void *, long, int);
+s32 (*ftell)(void *);
+s32 (*fclose)(void *);
+   void *(*fopen)(const char *, const char *);
 };
 
 /*
-- 
1.9.1


Re: [Patch v5.1 03/03]: hwrng: khwrngd derating per device

2014-06-13 Thread H. Peter Anvin

Makes sense to me.

Feel free to add my

Acked-by: H. Peter Anvin 

On June 13, 2014 7:40:50 PM PDT, Theodore Ts'o  wrote:
>On Thu, Jun 12, 2014 at 12:09:54PM +0200, Torsten Duwe wrote:
>> > 
>> > Did we lose track of this patchset?
>> 
>> Yes. I was already considering a resend.
>
>I've looked it over, and I'm fairly OK with it at this point.  Do
>folks mind if I just run it through the random tree?
>
>I want to add a tracepoint for better debugging, but I can take care
>of that after it's in the random tree.
>
>   - Ted

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

Re: [Patch v5.1 03/03]: hwrng: khwrngd derating per device

2014-06-13 Thread Theodore Ts'o

On Thu, Jun 12, 2014 at 12:09:54PM +0200, Torsten Duwe wrote:
> > 
> > Did we lose track of this patchset?
> 
> Yes. I was already considering a resend.

I've looked it over, and I'm fairly OK with it at this point.  Do
folks mind if I just run it through the random tree?

I want to add a tracepoint for better debugging, but I can take care
of that after it's in the random tree.

- Ted

Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread H. Peter Anvin

On 06/13/2014 05:14 PM, Rafael J. Wysocki wrote:
> 
> How can I obtain a kernel address of the beginning of a given page
> (as represented by struct page) on x86_64 today?
> 

page_to_virt()

-hpa



Re: [PATCH] Fixes two memory leaks in drivers/clk/sunxi/clk-sunxi.c

2014-06-13 Thread Emilio López


Hi,

On 13/06/14 22:47, Nick wrote:

> diff --git a/drivers/clk/sunxi/clk-sunxi.c b/drivers/clk/sunxi/clk-sunxi.c
> index 4264834..07b45d1 100644
> --- a/drivers/clk/sunxi/clk-sunxi.c
> +++ b/drivers/clk/sunxi/clk-sunxi.c
> @@ -41,9 +41,11 @@ static void __init sun4i_osc_clk_setup(struct device_node *node)
>     const char *clk_name = node->name;
>     u32 rate;
>
> -   if (of_property_read_u32(node, "clock-frequency", &rate))
> +   if (of_property_read_u32(node, "clock-frequency", &rate)) {
> +   kfree(fixed);
> +   kfree(gate);

Why are you trying to free these two, when they haven't been allocated yet?

>     return;
> -
> +   }
>     /* allocate fixed-rate and gate clock structs */
>     fixed = kzalloc(sizeof(struct clk_fixed_rate), GFP_KERNEL);
>     if (!fixed)

fixed is allocated here. gate follows suit after it.
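
To make the ordering concrete, here is a small userspace sketch (not the
clk-sunxi code; sketch_setup and its types are made up, and calloc/free
stand in for kzalloc/kfree). An early return that happens before any
allocation has nothing to free; cleanup is only needed on failures that
come after an allocation has succeeded.

```c
#include <stdlib.h>

struct sketch_fixed { unsigned long rate; };
struct sketch_gate  { int bit; };

/*
 * have_rate stands in for a successful property read.
 * Returns 0 on success (the caller then owns *out_fixed and *out_gate),
 * -1 on failure (nothing is left allocated).
 */
static int sketch_setup(int have_rate, unsigned long rate,
			struct sketch_fixed **out_fixed,
			struct sketch_gate **out_gate)
{
	struct sketch_fixed *fixed;
	struct sketch_gate *gate;

	/* Nothing has been allocated yet, so there is nothing to free. */
	if (!have_rate)
		return -1;

	fixed = calloc(1, sizeof(*fixed));
	if (!fixed)
		return -1;

	gate = calloc(1, sizeof(*gate));
	if (!gate) {
		free(fixed);	/* only fixed exists at this point */
		return -1;
	}

	fixed->rate = rate;
	*out_fixed = fixed;
	*out_gate = gate;
	return 0;
}
```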

Cheers,

Emilio

PS: For next time, please use a proper prefix on your patch subject 
("clk: sunxi: " in this case) as well as add a description to your patch 
and a signoff line.


Re: random: Benchamrking fast_mix2

2014-06-13 Thread George Spelvin

> Unrolling doesn't make much difference; which isn't surprising given
> that almost all of the differences go away when I commented out the
> udelay().  Basically, at this point what we're primarily measuring is
> how good various CPU's caches work, especially across context switches
> where other code gets to run in between.

Huh.  As I was referring to when I talked about the branch
predictor, I was hoping that removing *conditional* branches would
help.

> If that's the case, all else being equal, removing the extra memory
> reference for twist_table[] does make sense, and something else I've
> considered doing is to remove the input[] array entirely, and have
> add_interrupt_randomness[] xor values directly into the pool, and then
> let that fast_mix function stir the pool.

Are you trying for an XOR to memory, or is the idea to remain in registers
for the entire operation?

I'm not sure an XOR to memory is that much better; it's 2 pool loads
and 1 pool store either way.  Currently, the store is first (to
input[]) and then both it and the fast_pool are fetched in fast_mix.

With an XOR to memory, it's load-store-load, but is that really better?

In case it's useful, below is a small patch I made to
add_interrupt_randomness to take advantage of 64-bit processors and make
it a bit clearer what it's doing.  Not submitted officially because:
1) I haven't examined the consequences on 32-bit processors carefully yet.
2) It's more of a "code cleanup", meaning personal style preference,
   and you've expressed some pretty strong unhappiness with such churn.

It's also useful preparation for changing to a native 64-bit fast_mix.
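
For anyone who wants to check the "same values" claim in the patch below
without building a kernel, here is a userspace sketch. It assumes 64-bit
cycles and jiffies values (the 32-bit case is the part not yet examined),
takes irq as an unsigned 32-bit value, and uses memcpy instead of the
pointer cast to stay within C aliasing rules. All sketch_* names are
hypothetical.

```c
#include <stdint.h>
#include <string.h>

static uint64_t rol64(uint64_t w, unsigned int s)
{
	return (w << s) | (w >> (64 - s));	/* s must be 1..63 */
}

/* Old layout: four 32-bit words assembled by hand. */
static void sketch_pack32(uint64_t cycles, uint64_t now, uint32_t irq,
			  uint64_t ip, uint32_t out[4])
{
	uint32_t c_high = (uint32_t)(cycles >> 32);
	uint32_t j_high = (uint32_t)(now >> 32);

	out[0] = (uint32_t)cycles ^ j_high ^ irq;
	out[1] = (uint32_t)now ^ c_high;
	out[2] = (uint32_t)ip;
	out[3] = (uint32_t)(ip >> 32);
}

/* New layout: two 64-bit words, then viewed as four 32-bit words. */
static void sketch_pack64(uint64_t cycles, uint64_t now, uint32_t irq,
			  uint64_t ip, uint32_t out[4])
{
	uint64_t in[2] = { cycles ^ irq ^ rol64(now, 32), ip };

	memcpy(out, in, sizeof(in));
}

/* Returns 1 if both layouts agree, allowing the big-endian word swap. */
static int sketch_layouts_agree(uint64_t cycles, uint64_t now,
				uint32_t irq, uint64_t ip)
{
	const uint16_t probe = 1;
	unsigned char first;
	uint32_t a[4], b[4];
	int i, swap;

	memcpy(&first, &probe, 1);
	swap = (first == 0);	/* first byte is 0 on big-endian */

	sketch_pack32(cycles, now, irq, ip, a);
	sketch_pack64(cycles, now, irq, ip, b);

	for (i = 0; i < 4; i++)
		if (a[i] != b[swap ? (i ^ 1) : i])
			return 0;
	return 1;
}
```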

In general, I dislike "pre-compressing" the input; if the input hash isn't
fast enough, fix that for all users rather than adding something ad-hoc.

If you want a fast_mix function with different input and state widths,
I can try to come up with one. (Given the iterated nature of the current
fast_mix2, you can also just add additional seed material between the
rounds.)

> I also think that it's going to be worthwhile to do the RDTSC
> measurement in vitro, and calculate average and max latencies, since
> it's clear that there are real limitations to userspace benchmarking.

I'm not sure you're not making a clever joke about the use of silicon
dioxide in real chips, but don't you mean "in vivo"?

(Also, if we're reading the TSC twice, and we know the delta is noisy
as heck, seed with it!)



commit d3c0a185991a45e420925d040f19e764808b354e
Author: George Spelvin 
Date:   Sat Jun 7 21:16:45 2014 -0400

random: Simplify add_interrupt_randomness using 64-bit math

The same values (except word-swapped on big-endian machines) are passed
to fast_mix, but the code is simpler, smaller, and uses 64-bit operations
if available.

Signed-off-by: George Spelvin 

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 868760e1..acc9bb1a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -563,7 +563,7 @@ struct fast_pool {
  * collector.  It's hardcoded for an 128 bit pool and assumes that any
  * locks that might be needed are taken by the caller.
  */
-static void fast_mix(struct fast_pool *f, __u32 input[4])
+static void fast_mix(struct fast_pool *f, __u32 const input[4])
 {
__u32   w;
unsigned        input_rotate = f->rotate;
@@ -839,22 +839,16 @@ void add_interrupt_randomness(int irq, int irq_flags)
    struct entropy_store    *r;
    struct fast_pool        *fast_pool = &__get_cpu_var(irq_randomness);
    struct pt_regs          *regs = get_irq_regs();
-   unsigned long           now = jiffies;
-   cycles_t                cycles = random_get_entropy();
-   __u32                   input[4], c_high, j_high;
-   __u64                   ip;
    unsigned long           seed;
    int                     credit;
+   unsigned long           now = jiffies;
+   cycles_t                cycles = random_get_entropy();
+   __u64 const input[2] = {
+   cycles ^ irq ^ rol64(now, 32),
+   regs ? instruction_pointer(regs) : _RET_IP_
+   };
 
-   c_high = (sizeof(cycles) > 4) ? cycles >> 32 : 0;
-   j_high = (sizeof(now) > 4) ? now >> 32 : 0;
-   input[0] = cycles ^ j_high ^ irq;
-   input[1] = now ^ c_high;
-   ip = regs ? instruction_pointer(regs) : _RET_IP_;
-   input[2] = ip;
-   input[3] = ip >> 32;
-
-   fast_mix(fast_pool, input);
+   fast_mix(fast_pool, (__u32 const *)input);
 
if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ))
return;

Re: [bisected] pre-3.16 regression on open() scalability

2014-06-13 Thread Paul E. McKenney

On Fri, Jun 13, 2014 at 04:35:01PM -0700, Dave Hansen wrote:
> On 06/13/2014 03:45 PM, Paul E. McKenney wrote:
> > On Fri, Jun 13, 2014 at 01:04:28PM -0700, Dav
> >> So, I bisected it down to this:
> >>
> >>> commit ac1bea85781e9004da9b3e8a4b097c18492d857c
> >>> Author: Paul E. McKenney 
> >>> Date:   Sun Mar 16 21:36:25 2014 -0700
> >>>
> >>> sched,rcu: Make cond_resched() report RCU quiescent states
> >>
> >> Specifically, if I raise RCU_COND_RESCHED_LIM, things get back to their
> >> 3.15 levels.
> >>
> >> Could the additional RCU quiescent states be causing us to be doing more
> >> RCU frees that we were before, and getting less benefit from the lock
> >> batching that RCU normally provides?
> > 
> > Quite possibly.  One way to check would be to use the debugfs files
> > rcu/*/rcugp, which give a count of grace periods since boot for each
> > RCU flavor.  Here "*" is rcu_preempt for CONFIG_PREEMPT and rcu_sched
> > for !CONFIG_PREEMPT.
> > 
> > Another possibility is that someone is invoking cond_resched() in an
> > incredibly tight loop.
> 
> open() does at least a couple of allocations in getname(),
> get_empty_filp() and apparmor_file_alloc_security() in my kernel, and
> each of those does a cond_resched() via the might_sleep() in the slub
> code.  This test is doing ~400k open/closes per second per CPU, so
> that's ~1.2M cond_resched()/sec/CPU, but that's still hundreds of ns
> between calls on average.
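
The arithmetic quoted above can be written out as a tiny helper (the name
sketch_avg_gap_ns is made up): three allocation sites per open() at about
400k opens per second gives about 1.2M calls per second, or roughly
833 ns between calls on average.

```c
/* Average gap in nanoseconds between calls, given the call rate. */
static unsigned long sketch_avg_gap_ns(unsigned long ops_per_sec,
				       unsigned long calls_per_op)
{
	return 1000000000UL / (ops_per_sec * calls_per_op);
}
```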
> 
> I'll do some more ftraces and dig in to those debugfs files early next week.
> 
> > But please feel free to send along your patch, CCing LKML.  Longer
> > term, I probably need to take a more algorithmic approach, but what
> > you have will be useful to benchmarkers until then.
> 
> With the caveat that I exerted approximately 15 seconds of brainpower to
> code it up...patch attached.

Thank you Dave!  And if someone doesn't like it, they can always improve
upon it, right?  ;-)

Thanx, Paul

> ---
> 
>  b/arch/x86/kernel/nmi.c    |    3 +++
>  b/include/linux/rcupdate.h |    2 +-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff -puN arch/x86/kernel/nmi.c~dirty-rcu-hack arch/x86/kernel/nmi.c
> --- a/arch/x86/kernel/nmi.c~dirty-rcu-hack  2014-06-13 16:00:30.257183228 -0700
> +++ b/arch/x86/kernel/nmi.c   2014-06-13 16:00:30.261183407 -0700
> @@ -88,10 +88,13 @@ __setup("unknown_nmi_panic", setup_unkno
> 
>  static u64 nmi_longest_ns = 1 * NSEC_PER_MSEC;
> 
> +u64 RCU_COND_RESCHED_LIM = 256;
>  static int __init nmi_warning_debugfs(void)
>  {
>   debugfs_create_u64("nmi_longest_ns", 0644,
>   arch_debugfs_dir, &nmi_longest_ns);
> + debugfs_create_u64("RCU_COND_RESCHED_LIM", 0644,
> + arch_debugfs_dir, &RCU_COND_RESCHED_LIM);
>   return 0;
>  }
>  fs_initcall(nmi_warning_debugfs);
> diff -puN include/linux/rcupdate.h~dirty-rcu-hack include/linux/rcupdate.h
> --- a/include/linux/rcupdate.h~dirty-rcu-hack  2014-06-13 16:00:35.578421426 -0700
> +++ b/include/linux/rcupdate.h  2014-06-13 16:00:49.863060683 -0700
> @@ -303,7 +303,7 @@ bool __rcu_is_watching(void);
>   * Hooks for cond_resched() and friends to avoid RCU CPU stall warnings.
>   */
> 
> -#define RCU_COND_RESCHED_LIM 256 /* ms vs. 100s of ms. */
> +extern u64 RCU_COND_RESCHED_LIM  /* ms vs. 100s of ms. */
>  DECLARE_PER_CPU(int, rcu_cond_resched_count);
>  void rcu_resched(void);
> 
> _


[PATCH] Fixes two memory leaks in drivers/clk/sunxi/clk-sunxi.c

2014-06-13 Thread Nick

diff --git a/drivers/clk/sunxi/clk-sunxi.c b/drivers/clk/sunxi/clk-sunxi.c
index 4264834..07b45d1 100644
--- a/drivers/clk/sunxi/clk-sunxi.c
+++ b/drivers/clk/sunxi/clk-sunxi.c
@@ -41,9 +41,11 @@ static void __init sun4i_osc_clk_setup(struct device_node *node)
const char *clk_name = node->name;
u32 rate;
 
-   if (of_property_read_u32(node, "clock-frequency", &rate))
+   if (of_property_read_u32(node, "clock-frequency", &rate)) {
+   kfree(fixed);
+   kfree(gate);
return;
-
+   }
/* allocate fixed-rate and gate clock structs */
fixed = kzalloc(sizeof(struct clk_fixed_rate), GFP_KERNEL);
if (!fixed)
-- 
1.9.1


Re: [PATCH v3] printk: allow increasing the ring buffer depending on the number of CPUs

2014-06-13 Thread Luis R. Rodriguez

I am seeing an issue with a small kernel ring buffer on a small system
with this for some reason though. Please hold off on merging this for
now.

 Luis

[PATCH] initramfs: Support initramfs that is more than 2G

2014-06-13 Thread Yinghai Lu

With a 64-bit bzImage and kexec tools, we now support ramdisks larger
than 2G, as they can be placed above 4G.

However, a compressed initramfs image of that size could not be
decompressed properly. It turns out that the image length is an int
during decompression-method detection, so it becomes negative when the
length is more than 2G. Furthermore, the decompressors use an int len
for the inbuf count, which has the same problem.

Change len to long. That should be ok, as long is 32 bits on 32-bit
platforms.
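
A small userspace illustration of the failure mode, assuming an LP64
platform (64-bit long) and a compiler that wraps out-of-range conversions
to int, as gcc does. The helper names are made up; the length used is
the root_img.gz size from the table below.

```c
/* What code sees when it keeps an image length in an int. */
static int sketch_len_as_int(unsigned long len)
{
	return (int)len;	/* implementation-defined when len > INT_MAX */
}

/* The same length kept in a long, as the patch does. */
static long sketch_len_as_long(unsigned long len)
{
	return (long)len;
}
```

With len = 3219399480, the int version comes out negative, so any
"len > 0" style check fails, while the long version keeps the value.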

Tested with the following compressed initramfs images as root with kexec:
gzip, bzip2, xz, lzma, lzop, lz4.
Run time for populate_rootfs():
       size  name             Nehalem-EX  Westmere-EX  Ivybridge-EX
 9034400256  root_img      :         26s          24s           30s
 3561095057  root_img.lz4  :         28s          27s           27s
 3459554629  root_img.lzo  :         29s          29s           28s
 3219399480  root_img.gz   :         64s          62s           49s
 2251594592  root_img.xz   :        262s         260s          183s
 2226366598  root_img.lzma :        386s         376s          277s
 2901482513  root_img.bz2  :        635s         599s

Signed-off-by: Yinghai Lu 

---
 fs/isofs/compress.c|4 ++--
 include/linux/decompress/bunzip2.h |8 
 include/linux/decompress/generic.h |   10 +-
 include/linux/decompress/inflate.h |8 
 include/linux/decompress/unlz4.h   |8 
 include/linux/decompress/unlzma.h  |8 
 include/linux/decompress/unlzo.h   |8 
 include/linux/decompress/unxz.h|8 
 include/linux/zlib.h   |4 ++--
 init/do_mounts_rd.c|   10 +-
 init/initramfs.c   |   22 +++---
 lib/decompress.c   |2 +-
 lib/decompress_bunzip2.c   |   26 +-
 lib/decompress_inflate.c   |   12 ++--
 lib/decompress_unlz4.c |   18 +-
 lib/decompress_unlzma.c|   28 ++--
 lib/decompress_unlzo.c |   12 ++--
 lib/decompress_unxz.c  |   10 +-
 18 files changed, 103 insertions(+), 103 deletions(-)

Index: linux-2.6/include/linux/decompress/generic.h
===
--- linux-2.6.orig/include/linux/decompress/generic.h
+++ linux-2.6/include/linux/decompress/generic.h
@@ -1,11 +1,11 @@
 #ifndef DECOMPRESS_GENERIC_H
 #define DECOMPRESS_GENERIC_H
 
-typedef int (*decompress_fn) (unsigned char *inbuf, int len,
- int(*fill)(void*, unsigned int),
- int(*flush)(void*, unsigned int),
+typedef int (*decompress_fn) (unsigned char *inbuf, long len,
+ long(*fill)(void*, unsigned long),
+ long(*flush)(void*, unsigned long),
  unsigned char *outbuf,
- int *posp,
+ long *posp,
  void(*error)(char *x));
 
 /* inbuf   - input buffer
@@ -33,7 +33,7 @@ typedef int (*decompress_fn) (unsigned c
 
 
 /* Utility routine to detect the decompression method */
-decompress_fn decompress_method(const unsigned char *inbuf, int len,
+decompress_fn decompress_method(const unsigned char *inbuf, long len,
const char **name);
 
 #endif
Index: linux-2.6/init/initramfs.c
===
--- linux-2.6.orig/init/initramfs.c
+++ linux-2.6/init/initramfs.c
@@ -174,7 +174,7 @@ static __initdata enum state {
 } state, next_state;
 
 static __initdata char *victim;
-static __initdata unsigned count;
+static __initdata unsigned long count;
 static __initdata loff_t this_header, next_header;
 
 static inline void __init eat(unsigned n)
@@ -186,7 +186,7 @@ static inline void __init eat(unsigned n
 
 static __initdata char *vcollected;
 static __initdata char *collected;
-static __initdata int remains;
+static __initdata long remains;
 static __initdata char *collect;
 
 static void __init read_into(char *buf, unsigned size, enum state next)
@@ -213,7 +213,7 @@ static int __init do_start(void)
 
 static int __init do_collect(void)
 {
-   unsigned n = remains;
+   unsigned long n = remains;
if (count < n)
n = count;
memcpy(collect, victim, n);
@@ -384,7 +384,7 @@ static __initdata int (*actions[])(void)
[Reset] = do_reset,
 };
 
-static int __init write_buffer(char *buf, unsigned len)
+static long __init write_buffer(char *buf, unsigned long len)
 {
count = len;
victim = buf;
@@ -394,11 +394,11 @@ static int __init write_buffer(char *buf
return len - count;
 }
 
-static int __init flush_buffer(void *bufv, unsigned len)
+static long __init flush_buffer(void *bufv, unsigned long len)
 {
char *buf = (char *) bufv;
-   int written;
-   int origLen = len;
+   long

[PATCH 2/5] mmu_notifier: add action information to address invalidation.

2014-06-13 Thread Jérôme Glisse

From: Jérôme Glisse 

The action information will be useful for new users of the mmu_notifier
API. The action argument differentiates between a vma disappearing, a
page being write protected, or simply a page being unmapped. This allows
new users to take different actions: for instance, on unmap the
resources used to track a vma are still valid and should stay around if
need be, while if the action says that a vma is being destroyed, any
resources used to track this vma can be freed.
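
As a rough userspace sketch of how a listener might use such an argument
(the enum values and function name here are hypothetical stand-ins, not
the enum mmu_action values this patch defines):

```c
#include <stdbool.h>

/* Hypothetical action values, for illustration only. */
enum sketch_mmu_action {
	SKETCH_ACTION_UNMAP,		/* page unmapped, vma still exists */
	SKETCH_ACTION_WRITE_PROTECT,	/* page being write protected */
	SKETCH_ACTION_VMA_DESTROY,	/* vma is going away */
};

/* Returns true if the listener may free its per-vma tracking state. */
static bool sketch_can_free_tracking(enum sketch_mmu_action action)
{
	switch (action) {
	case SKETCH_ACTION_VMA_DESTROY:
		return true;
	case SKETCH_ACTION_UNMAP:
	case SKETCH_ACTION_WRITE_PROTECT:
		return false;
	}
	return false;
}
```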

Signed-off-by: Jérôme Glisse 
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/iommu/amd_iommu_v2.c|  14 ++--
 drivers/misc/sgi-gru/grutlbpurge.c  |   9 ++-
 drivers/xen/gntdev.c|   9 ++-
 fs/proc/task_mmu.c  |   4 +-
 include/linux/hugetlb.h |   7 +-
 include/linux/mmu_notifier.h| 109 +---
 kernel/events/uprobes.c |   6 +-
 mm/filemap_xip.c|   2 +-
 mm/fremap.c |   4 +-
 mm/huge_memory.c|  26 
 mm/hugetlb.c|  19 +++---
 mm/ksm.c|  12 ++--
 mm/memory.c |  23 +++
 mm/mempolicy.c  |   2 +-
 mm/migrate.c|   6 +-
 mm/mmu_notifier.c   |  26 +---
 mm/mprotect.c   |  30 ++---
 mm/mremap.c |   4 +-
 mm/rmap.c   |  31 +++--
 virt/kvm/kvm_main.c |  12 ++--
 21 files changed, 241 insertions(+), 117 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 21ea928..7f7b4f3 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -56,7 +56,8 @@ struct i915_mmu_object {
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
   struct mm_struct *mm,
   unsigned long start,
-  unsigned long end)
+  unsigned long end,
+  enum mmu_action action)
 {
    struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index d4daa05..81ff80b 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -413,21 +413,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
  struct mm_struct *mm,
  unsigned long address,
- pte_t pte)
+ pte_t pte,
+ enum mmu_action action)
 {
__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
   struct mm_struct *mm,
-  unsigned long address)
+  unsigned long address,
+  enum mmu_action action)
 {
__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
  struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_action action)
 {
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -444,7 +448,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
-   unsigned long start, unsigned long end)
+   unsigned long start,
+   unsigned long end,
+   enum mmu_action action)
 {
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..3427bfc 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
   struct mm_struct *mm,
-  unsigned long start, unsigned long end)
+  unsigned long start,

[PATCH 5/5] hmm/dummy: dummy driver to showcase the hmm api v2

2014-06-13 Thread Jérôme Glisse

From: Jérôme Glisse 

This is a dummy driver which fulfills two purposes:
  - showcase the hmm api and give references on how to use it.
  - provide an extensive user space api to stress test hmm.

This is a particularly dangerous module as it allows access to a
mirror of a process address space through its device file. Hence
it should not be enabled by default and only people actively
developing for hmm should use it.

Changed since v1:
  - Fixed all checkpatch.pl issues (ignoring some lines over 80 characters).

Signed-off-by: Jérôme Glisse 
---
 drivers/char/Kconfig   |9 +
 drivers/char/Makefile  |1 +
 drivers/char/hmm_dummy.c   | 1075 
 include/uapi/linux/hmm_dummy.h |   30 ++
 4 files changed, 1115 insertions(+)
 create mode 100644 drivers/char/hmm_dummy.c
 create mode 100644 include/uapi/linux/hmm_dummy.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6e9f74a..199e111 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -600,5 +600,14 @@ config TILE_SROM
  device appear much like a simple EEPROM, and knows
  how to partition a single ROM for multiple purposes.
 
+config HMM_DUMMY
+   tristate "hmm dummy driver to test hmm."
+   depends on HMM
+   default n
+   help
+ Say Y here if you want to build the hmm dummy driver that allow you
+ to test the hmm infrastructure by mapping a process address space
+ in hmm dummy driver device file. When in doubt, say "N".
+
 endmenu
 
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index a324f93..83d89b8 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -61,3 +61,4 @@ obj-$(CONFIG_JS_RTC)  += js-rtc.o
 js-rtc-y = rtc.o
 
 obj-$(CONFIG_TILE_SROM)        += tile-srom.o
+obj-$(CONFIG_HMM_DUMMY)        += hmm_dummy.o
diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
new file mode 100644
index 000..e5431a7
--- /dev/null
+++ b/drivers/char/hmm_dummy.c
@@ -0,0 +1,1075 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse 
+ */
+/* This is a dummy driver made to exercise the HMM (hardware memory management)
+ * API of the kernel. It allows a userspace program to map its whole address
+ * space through the hmm dummy driver file.
+ *
+ * In here, mirror addresses are addresses in the process address space that is
+ * being mirrored, while virtual addresses are the addresses in the current
+ * process that has the hmm dummy dev file mapped (address of the file
+ * mapping).
+ *
+ * You must be careful not to mix one with the other.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define HMM_DUMMY_DEVICE_NAME  "hmm_dummy_device"
+#define HMM_DUMMY_MAX_DEVICES  4
+
+struct hmm_dummy_device;
+
+struct hmm_dummy_mirror {
+   struct file             *filp;
+   struct hmm_dummy_device *ddevice;
+   struct hmm_mirror       mirror;
+   unsigned                minor;
+   pid_t                   pid;
+   struct mm_struct        *mm;
+   unsigned long           *pgdp;
+   struct mutex            mutex;
+   bool                    stop;
+};
+
+struct hmm_dummy_device {
+   struct cdev             cdev;
+   struct hmm_device       device;
+   dev_t                   dev;
+   int                     major;
+   struct mutex            mutex;
+   char                    name[32];
+   /* device file mapping tracking (keep track of all vma) */
+   struct hmm_dummy_mirror *dmirrors[HMM_DUMMY_MAX_DEVICES];
+   struct address_space    *fmapping[HMM_DUMMY_MAX_DEVICES];
+};
+
+/* We only create 2 devices to show the inter-device rmem sharing/migration
+ * capabilities.
+ */
+static struct hmm_dummy_device ddevices[2];
+
+
+/* hmm_dummy_pt - dummy page table, the dummy device fakes its own page table.
+ *
+ * Helper functions to manage the dummy device page table.
+ */
+#define HMM_DUMMY_PTE_VALID    (1UL << 0UL)
+#define HMM_DUMMY_PTE_READ     (1UL << 1UL)
+#define HMM_DUMMY_PTE_WRITE    (1UL << 2UL)
+#define HMM_DUMMY_PTE_DIRTY    (1UL << 3UL)
+#define HMM_DUMMY_PFN_SHIFT    (PAGE_SHIFT)
+
+#define ARCH_PAGE_SIZE         ((unsigned long)PAGE_SIZE)
+#define ARCH_PAGE_SHIFT        ((unsigned long)PAGE_SHIFT)
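
The flag and shift macros above suggest a dummy page table entry that keeps flag bits in the low bits and the page frame number above HMM_DUMMY_PFN_SHIFT. The rest of the driver is cut off in this archive, so the following is only a userspace sketch of how such an entry could be packed and read back. The helper names (dummy_pte_make, dummy_pte_pfn, dummy_pte_writable) and the 4 KiB PAGE_SHIFT are assumptions, not taken from the patch.

```c
#include <stdbool.h>

/* Assumption: 4 KiB pages. In the kernel, PAGE_SHIFT comes from the arch. */
#define PAGE_SHIFT 12

/* Flag bits and shift as defined in the patch. */
#define HMM_DUMMY_PTE_VALID (1UL << 0UL)
#define HMM_DUMMY_PTE_READ  (1UL << 1UL)
#define HMM_DUMMY_PTE_WRITE (1UL << 2UL)
#define HMM_DUMMY_PTE_DIRTY (1UL << 3UL)
#define HMM_DUMMY_PFN_SHIFT (PAGE_SHIFT)

/* Hypothetical helper: pack a pfn and flag bits into one valid entry. */
static unsigned long dummy_pte_make(unsigned long pfn, unsigned long flags)
{
	/* Keep only bits below the pfn field so flags cannot corrupt the pfn. */
	unsigned long flag_mask = (1UL << HMM_DUMMY_PFN_SHIFT) - 1;

	return (pfn << HMM_DUMMY_PFN_SHIFT) | (flags & flag_mask) |
	       HMM_DUMMY_PTE_VALID;
}

/* Hypothetical helper: extract the pfn from an entry. */
static unsigned long dummy_pte_pfn(unsigned long pte)
{
	return pte >> HMM_DUMMY_PFN_SHIFT;
}

/* Hypothetical helper: an entry is writable only if it is also valid. */
static bool dummy_pte_writable(unsigned long pte)
{
	return (pte & HMM_DUMMY_PTE_VALID) && (pte & HMM_DUMMY_PTE_WRITE);
}
```

With this layout, an entry built with only HMM_DUMMY_PTE_READ reports as not writable, and a bare HMM_DUMMY_PTE_WRITE bit without HMM_DUMMY_PTE_VALID is also rejected.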

[PATCH 3/5] mmu_notifier: pass through vma to invalidate_range and invalidate_page

2014-06-13 Thread Jérôme Glisse

From: Jérôme Glisse 

New users of the mmu_notifier interface need to look up the vma in order to
perform the invalidation operation. Instead of redoing a vma lookup
inside the callback, just pass through the vma from the call site where
it is already available.

This needs a small refactoring in memory.c to call invalidate_range on
vma boundaries; the overhead should be low enough.

Signed-off-by: Jérôme Glisse 
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
 drivers/iommu/amd_iommu_v2.c            |  3 +++
 drivers/misc/sgi-gru/grutlbpurge.c      |  6 +-
 drivers/xen/gntdev.c                    |  4 +++-
 fs/proc/task_mmu.c                      | 13 +++
 include/linux/mmu_notifier.h            | 18 +---
 kernel/events/uprobes.c                 |  4 ++--
 mm/filemap_xip.c                        |  2 +-
 mm/fremap.c                             |  6 --
 mm/huge_memory.c                        | 26 +++---
 mm/hugetlb.c                            | 16 +++---
 mm/ksm.c                                |  8 +++
 mm/memory.c                             | 38 ++---
 mm/migrate.c                            |  6 +++---
 mm/mmu_notifier.c                       |  9 +---
 mm/mprotect.c                           |  4 ++--
 mm/mremap.c                             |  4 ++--
 mm/rmap.c                               |  8 +++
 virt/kvm/kvm_main.c                     |  3 +++
 19 files changed, 114 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 7f7b4f3..70bae03 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -55,6 +55,7 @@ struct i915_mmu_object {
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
   struct mm_struct *mm,
+  struct vm_area_struct *vma,
   unsigned long start,
   unsigned long end,
   enum mmu_action action)
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 81ff80b..6f025a1 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -421,6 +421,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
   struct mm_struct *mm,
+  struct vm_area_struct *vma,
   unsigned long address,
   enum mmu_action action)
 {
@@ -429,6 +430,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
  struct mm_struct *mm,
+ struct vm_area_struct *vma,
  unsigned long start,
  unsigned long end,
  enum mmu_action action)
@@ -448,6 +450,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
+   struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
enum mmu_action action)
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 3427bfc..716501b 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
   struct mm_struct *mm,
+  struct vm_area_struct *vma,
   unsigned long start, unsigned long end,
   enum mmu_action action)
 {
@@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-struct mm_struct *mm, unsigned long start,
+struct mm_struct *mm,
+struct vm_area_struct *vma,
+unsigned long start,
 unsigned long end,
 enum mmu_action action)
 {
@@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
+
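
The commit message says memory.c is refactored so that invalidate_range is called on vma boundaries. That part of the diff is cut off in this archive, so the following is only a self-contained userspace sketch of the idea: walk the vmas overlapping a range and invoke the callback once per vma with the range clipped to it. The types and names here (vma_stub, invalidate_fn, invalidate_range_per_vma, sum_len_cb) are stand-ins invented for illustration, not the kernel's.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's vm_area_struct (assumption). */
struct vma_stub {
	unsigned long vm_start;   /* first address covered (inclusive) */
	unsigned long vm_end;     /* first address past the end (exclusive) */
	struct vma_stub *vm_next; /* next vma, sorted by address */
};

/* Callback shape: the vma is handed in, so no lookup is needed inside. */
typedef void (*invalidate_fn)(struct vma_stub *vma, unsigned long start,
			      unsigned long end, void *priv);

/*
 * Call cb once for every vma that overlaps [start, end), with the range
 * clipped to that vma. Returns the number of callbacks made.
 */
static int invalidate_range_per_vma(struct vma_stub *first,
				    unsigned long start, unsigned long end,
				    invalidate_fn cb, void *priv)
{
	int calls = 0;

	for (struct vma_stub *vma = first; vma && vma->vm_start < end;
	     vma = vma->vm_next) {
		unsigned long s = vma->vm_start > start ? vma->vm_start : start;
		unsigned long e = vma->vm_end < end ? vma->vm_end : end;

		if (s >= e)
			continue; /* vma lies entirely below the range */
		cb(vma, s, e, priv);
		calls++;
	}
	return calls;
}

/* Example callback: accumulate the number of bytes invalidated. */
static void sum_len_cb(struct vma_stub *vma, unsigned long start,
		       unsigned long end, void *priv)
{
	(void)vma;
	*(unsigned long *)priv += end - start;
}
```

Holes between vmas produce no callback at all, which is the trade-off the commit message alludes to: a few more calls per range in exchange for never doing a vma lookup inside the notifier.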

[PATCH 1/5] mm: differentiate unmap for vmscan from other unmap.

2014-06-13 Thread Jérôme Glisse

From: Jérôme Glisse 

New code will need to be able to differentiate between a regular unmap and
an unmap triggered by vmscan, in which case we want to be as quick as possible.

Signed-off-by: Jérôme Glisse 
---
 include/linux/rmap.h | 15 ---
 mm/memory-failure.c  |  2 +-
 mm/vmscan.c  |  4 ++--
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index be57450..eddbc07 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -72,13 +72,14 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-   TTU_UNMAP = 1,  /* unmap mode */
-   TTU_MIGRATION = 2,  /* migration mode */
-   TTU_MUNLOCK = 4,/* munlock mode */
-
-   TTU_IGNORE_MLOCK = (1 << 8),/* ignore mlock */
-   TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
-   TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+   TTU_VMSCAN = 1, /* unmap for vmscan */
+   TTU_POISON = 2, /* unmap for poison */
+   TTU_MIGRATION = 4,  /* migration mode */
+   TTU_MUNLOCK = 8,/* munlock mode */
+
+   TTU_IGNORE_MLOCK = (1 << 9),/* ignore mlock */
+   TTU_IGNORE_ACCESS = (1 << 10),  /* don't age */
+   TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index cd8989c..e264b5f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page *p,
 static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
  int trapno, int flags, struct page **hpagep)
 {
-   enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+   enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
int ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f16ffe..11633c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1161,7 +1161,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
}
 
ret = shrink_page_list(_pages, zone, ,
-   TTU_UNMAP|TTU_IGNORE_ACCESS,
+   TTU_VMSCAN|TTU_IGNORE_ACCESS,
, , , , , true);
list_splice(_pages, page_list);
mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
@@ -1514,7 +1514,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (nr_taken == 0)
return 0;
 
-   nr_reclaimed = shrink_page_list(_list, zone, sc, TTU_UNMAP,
+   nr_reclaimed = shrink_page_list(_list, zone, sc, TTU_VMSCAN,
_dirty, _unqueued_dirty, _congested,
_writeback, _immediate,
false);
-- 
1.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
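
As a rough illustration of what patch 1/5 enables, the sketch below copies the new flag values from the diff and shows how code further down the unmap path could tell a vmscan-triggered unmap from a poison-triggered one. The predicate ttu_is_vmscan is hypothetical and does not appear in the patch.

```c
#include <stdbool.h>

/* Flag values as they appear on the "+" side of the rmap.h hunk. */
enum ttu_flags {
	TTU_VMSCAN = 1,                  /* unmap for vmscan */
	TTU_POISON = 2,                  /* unmap for poison */
	TTU_MIGRATION = 4,               /* migration mode */
	TTU_MUNLOCK = 8,                 /* munlock mode */

	TTU_IGNORE_MLOCK = (1 << 9),     /* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 10),   /* don't age */
	TTU_IGNORE_HWPOISON = (1 << 11), /* corrupted page is recoverable */
};

/*
 * Hypothetical predicate: true when the unmap was requested by vmscan,
 * the case the commit message says should be handled as quickly as possible.
 */
static bool ttu_is_vmscan(unsigned int flags)
{
	return (flags & TTU_VMSCAN) != 0;
}
```

The flag combinations used by the two callers in the patch behave as expected: TTU_VMSCAN|TTU_IGNORE_ACCESS (mm/vmscan.c) tests true, while TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS (mm/memory-failure.c) tests false.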

[PATCH 4/5] hmm: heterogeneous memory management v3

2014-06-13 Thread Jérôme Glisse

From: Jérôme Glisse 

Motivation:

Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words, it mirrors a process address space
while allowing regular memory management events, such as page reclamation or
page migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is not linked or
exposed to the address space of the process using it. This separation often
leads to multiple memory copies between the device-owned memory and the
process memory. This of course wastes both cpu cycles and memory.

Over the last few years most of those devices have gained a full mmu, allowing
them to support multiple page tables, page faults and other features that are
found in a cpu mmu. There is now a strong incentive to start leveraging the
capabilities of such devices and to start sharing the process address space,
both to avoid any unnecessary memory copies and to simplify the programming
model of those devices by sharing a unique and common address space with the
process that uses them.

The aim of heterogeneous memory management is to provide a common API that
can be used by any such device in order to mirror a process address space. The
hmm code provides a unique entry point and interfaces itself with the core mm
code of the linux kernel, avoiding duplicate implementations and shielding
device driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on a migrated
range and for migrating it back to system memory, allowing the cpu to resume
its access to the memory.

Another feature hmm intends to provide is support for atomic operations for
the device even if the bus linking the device and the cpu does not have any
such capability.

We expect graphics processing units and network interfaces to be among the
first users of such an api.

Hardware requirement:

Because hmm is intended to be used by device drivers, there are minimum
feature requirements for the hardware mmu:
  - hardware has its own page table per process (can be shared between
    different devices)
  - hardware mmu supports page faults and suspends execution until the page
    fault is serviced by hmm code. The page fault must also trigger some form
    of interrupt so that hmm code can be called by the device driver.
  - hardware must support at least read-only mappings (otherwise it cannot
    access read-only ranges of the process address space).

For better memory management it is highly recommended that the device also
support the following features:
  - hardware mmu sets the access bit in its page table on memory access (like
    a cpu).
  - hardware page table can be updated from the cpu or through a fast path.
  - hardware provides advanced statistics on which ranges of memory it
    accesses the most.
  - hardware differentiates atomic memory accesses from regular accesses,
    allowing support for atomic operations even on platforms whose bus link
    to the device has no atomic support.

Implementation:

The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm device that holds pointers to all the callbacks the hmm
code will make to synchronize the device page table with the cpu page table of
a given process.

For each process it wants to mirror, the device driver must register a mirror
hmm structure that holds all the information specific to the process being
mirrored. Each hmm mirror uniquely links an hmm device with a process address
space (the mm struct).

This design allows several different device drivers to mirror the same process
concurrently. The hmm layer will dispatch to each device driver, as
appropriate, the modifications that are happening to the process address space.

The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to the device page table can have unbounded
completion time, the hmm layer needs the capability to sleep during mmu
notifier callbacks.

This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.

Changed since v1:
  - convert fence to refcounted object
  - change the api to provide pte value directly avoiding useless temporary
special hmm pfn value
  - cleanups & fixes ...

Changed since v2:
  - fixed checkpatch.pl warnings & errors
  - converted to a staging feature

Signed-off-by: Jérôme Glisse 
Signed-off-by: Sherry Cheung 
Signed-off-by: Subhash Gutti 
Signed-off-by: Mark Hairgrove 
Signed-off-by: John Hubbard 
Signed-off-by: Jatin Kumar 
---
 include/linux/hmm.h  |  351 +++
 include/linux/mm.h   |   13

HMM (Heterogeneous Memory Management) v3

2014-06-13 Thread Jérôme Glisse

This is v3 of the HMM patchset; previous discussion can be found at:
http://comments.gmane.org/gmane.linux.kernel.mm/116584

We would like to see this included, especially all the preparatory patches, as
they are cumbersome to rebase. They do not change any behavior except for a
slightly increased stack consumption from adding new arguments to the mmu
notifier callbacks. I believe that those added arguments might be of value not
only to HMM but also to other users of the mmu_notifier API.

I hid the HMM core functions behind staging so people understand this is
not production ready but a base onto which we want to build support for all
HMM features.

In a nutshell, HMM is an API to simplify mirroring a process address space on
a secondary MMU that has its own page table (and most likely a different
page table format, incompatible with the architecture page table). To ensure
that at all times the CPU and the mirroring device use the same page for the
same address of a process, using the mmu notifier API is the only sane way.
This is because each CPU page table update is preceded or followed by a call
to the mmu notifier.

Andrew, if you fear this feature will not be used by anyone, I can ask NVidia
and/or AMD to publicly state their interest in it. So far HMM has been
developed in close collaboration with NVidia, but at Red Hat (and NVidia
is on board here) we want to make this as useful as possible to other
consumers too, and not only for GPUs. So anyone who has hardware with its
own MMU and its own page table and who wishes to mirror a process address
space is welcome to join the discussion and to ask for features or to
discuss the API we expose to the device driver.

As I said in v2, I stripped the remote memory support from this patchset
in order to make it easier to get the foundation in, so that the remote
memory support is easier and less painful to work on.

Changed since v2:
  - Hide core hmm behind staging
  - Fixed up all checkpatch.pl issues
  - Rebased on top of latest linux-next

Note that the dummy driver does not necessarily need to be included; I added
it here so people can see an example driver. I do, however, intend to grow the
functionality of the hmm dummy driver in order to make a test and
regression suite for the core hmm.

Cheers,
Jérôme Glisse

Re: [PATCH] Input - wacom: put a flag when the led are initialized

2014-06-13 Thread Ping Cheng

Hi Benjamin,

On Fri, Jun 13, 2014 at 1:29 PM, Benjamin Tissoires
 wrote:
> This solves a bug with the wireless receiver:

Your patch does get rid of the crash. But, it does not fix it at the
root cause.

> - at plug, the wireless receiver does not know which Wacom device it is
>   connected to, so it does not actually creates all the LEDs

This is the root cause - LEDs are not created for wireless devices.
Neither here, nor later when a real device is connected.

> - when the tablet connects, wacom->wacom_wac.features.type is set to the
>   proper device so that wacom_wac can understand the packets

LEDs are not created for any wireless devices since we don't call
wacom_initialize_leds() when real tablets are connected.

> - when the receiver is unplugged, it detects that a LED should have been
>   created (based on wacom->wacom_wac.features.type) and tries to remove
>   it: crash when removing the sysfs group.

When receiver is unplugged, it remembers the last tablet that
connected to it. If that tablet supports LEDs, wacom_destroy_leds() is
called. But, no LEDs were initialized. That's why it crashes.

> Side effect, we can now safely call several times wacom_destroy_leds().

led_initialized will never be true if we keep wacom_initialize_leds()
inside probe().

To make initialize_leds() and destroy_leds() work for wireless
devices, we need to move them to wacom_wireless_work() where we know
what type of tablet is connected/disconnected.

> Signed-off-by: Benjamin Tissoires 

Thank you for your support. But, sorry

NAKed-by: Ping Cheng 

Ping

> ---
>  drivers/input/tablet/wacom.h | 1 +
>  drivers/input/tablet/wacom_sys.c | 6 ++
>  2 files changed, 7 insertions(+)
>
> diff --git a/drivers/input/tablet/wacom.h b/drivers/input/tablet/wacom.h
> index 70b1e71..f13ad31 100644
> --- a/drivers/input/tablet/wacom.h
> +++ b/drivers/input/tablet/wacom.h
> @@ -120,6 +120,7 @@ struct wacom {
> u8 hlv;   /* status led brightness button pressed (1..127) */
> u8 img_lum;   /* OLED matrix display brightness */
> } led;
> +   bool led_initialized;
> struct power_supply battery;
>  };
>
> diff --git a/drivers/input/tablet/wacom_sys.c b/drivers/input/tablet/wacom_sys.c
> index 94096fd..7087b33 100644
> --- a/drivers/input/tablet/wacom_sys.c
> +++ b/drivers/input/tablet/wacom_sys.c
> @@ -1016,12 +1016,18 @@ static int wacom_initialize_leds(struct wacom *wacom)
> return error;
> }
> wacom_led_control(wacom);
> +   wacom->led_initialized = true;
>
> return 0;
>  }
>
>  static void wacom_destroy_leds(struct wacom *wacom)
>  {
> +   if (!wacom->led_initialized)
> +   return;
> +
> +   wacom->led_initialized = false;
> +
> switch (wacom->wacom_wac.features.type) {
> case INTUOS4S:
> case INTUOS4:
> --
> 1.9.0
>
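
The guard in the quoted patch can be modelled in userspace to show why it stops the crash without addressing the root cause Ping describes. In this sketch the struct, the sysfs_groups counter, and the toy_* function names are all assumptions made for illustration; only the led_initialized flag and its check-then-clear pattern come from the patch.

```c
#include <stdbool.h>

/* Userspace model of the relevant driver state (assumption). */
struct toy_wacom {
	bool led_initialized;
	int sysfs_groups; /* LED sysfs groups currently registered */
};

/* Models wacom_initialize_leds(): the flag is set only if LEDs were created. */
static int toy_initialize_leds(struct toy_wacom *w, bool device_has_leds)
{
	if (!device_has_leds)
		return 0; /* e.g. the wireless receiver at plug time */
	w->sysfs_groups++;
	w->led_initialized = true;
	return 0;
}

/* Models wacom_destroy_leds() with the guard from the patch. */
static void toy_destroy_leds(struct toy_wacom *w)
{
	if (!w->led_initialized)
		return; /* nothing was created, so nothing to remove */

	w->led_initialized = false;
	w->sysfs_groups--;
}
```

Calling toy_destroy_leds() without a prior successful init, or calling it twice, leaves sysfs_groups at zero instead of removing a group that was never registered. As Ping notes, the flag alone still never becomes true for wireless devices as long as initialization only runs in probe().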

Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Rafael J. Wysocki

On Friday, June 13, 2014 05:08:21 PM Kees Cook wrote:
> On Fri, Jun 13, 2014 at 5:14 PM, Rafael J. Wysocki  wrote:
> > On Friday, June 13, 2014 03:59:57 PM Kees Cook wrote:
> >> On Fri, Jun 13, 2014 at 3:54 PM, Rafael J. Wysocki  
> >> wrote:
> >> > On Friday, June 13, 2014 03:07:19 PM Kees Cook wrote:
> >
> > [cut]
> >
> >> > I'll have a closer look at that shortly (it's been quite some time since
> >> > I wrote that code).
> >>
> >> Thanks; I'm trying to get a test environment instrumented too so I can
> >> look at this. (At the very least, it sounds like we'll still need my
> >> patch series for other architectures.)
> >
> > How can I obtain a kernel address of the beginning of a given page
> > (as represented by struct page) on x86_64 today?
> 
> I don't know off the top of my head. I've used virt_to_phys, but
> things like PFN_PHYS(page_to_pfn(page)) maybe? I'm not entirely clear
> which you need, but mm.h seems to have the bulk of what I've seen.

OK, I'm not sure how much sense this makes, but at least it should
illustrate the direction. :-)

---
 arch/x86/power/hibernate_64.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -115,7 +115,7 @@ struct restore_data_record {
unsigned long magic;
 };
 
-#define RESTORE_MAGIC  0x0123456789ABCDEFUL
+#define RESTORE_MAGIC  0x0123456789ABCDF0UL
 
 /**
  * arch_hibernation_header_save - populate the architecture specific part
@@ -128,7 +128,8 @@ int arch_hibernation_header_save(void *a
 
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
-   rdr->jump_address = restore_jump_address;
+
+   rdr->jump_address = virt_to_phys((void *)restore_jump_address);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -143,7 +144,7 @@ int arch_hibernation_header_restore(void
 {
struct restore_data_record *rdr = addr;
 
-   restore_jump_address = rdr->jump_address;
+   restore_jump_address = (unsigned long)phys_to_virt(rdr->jump_address);
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
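
The direction of the patch above, storing a physical jump address in the image header and converting it back on restore, can be shown with a toy userspace model. The model assumes a simple linear mapping (virtual = physical + per-boot offset), which is a simplification: the kernel's real virt_to_phys()/phys_to_virt() are architecture-specific, and the toy_* names and offsets below are invented for illustration.

```c
#include <stdint.h>

/* New magic value from the patch. */
#define TOY_RESTORE_MAGIC 0x0123456789ABCDF0ULL

/* Cut-down model of struct restore_data_record. */
struct toy_record {
	uint64_t jump_address; /* physical address in this model */
	uint64_t magic;
};

/* One booted kernel: virtual = physical + map_offset (toy assumption). */
struct toy_kernel {
	uint64_t map_offset;
};

static uint64_t toy_virt_to_phys(const struct toy_kernel *k, uint64_t virt)
{
	return virt - k->map_offset;
}

static uint64_t toy_phys_to_virt(const struct toy_kernel *k, uint64_t phys)
{
	return phys + k->map_offset;
}

/* Save side: record the jump target as a physical address. */
static void toy_header_save(const struct toy_kernel *k, uint64_t jump_virt,
			    struct toy_record *r)
{
	r->jump_address = toy_virt_to_phys(k, jump_virt);
	r->magic = TOY_RESTORE_MAGIC;
}

/* Restore side: reject unknown formats, then map back into this kernel. */
static int toy_header_restore(const struct toy_kernel *k,
			      const struct toy_record *r, uint64_t *jump_virt)
{
	if (r->magic != TOY_RESTORE_MAGIC)
		return -1;
	*jump_virt = toy_phys_to_virt(k, r->jump_address);
	return 0;
}
```

In this model, an image saved by a kernel with one offset restores correctly under a kernel with a different offset, because only the physical address crosses the boundary. The bumped RESTORE_MAGIC presumably serves to reject images written in the old format, where jump_address held a virtual address.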


Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Kees Cook

On Fri, Jun 13, 2014 at 5:14 PM, Rafael J. Wysocki  wrote:
> On Friday, June 13, 2014 03:59:57 PM Kees Cook wrote:
>> On Fri, Jun 13, 2014 at 3:54 PM, Rafael J. Wysocki  
>> wrote:
>> > On Friday, June 13, 2014 03:07:19 PM Kees Cook wrote:
>
> [cut]
>
>> > I'll have a closer look at that shortly (it's been quite some time since
>> > I wrote that code).
>>
>> Thanks; I'm trying to get a test environment instrumented too so I can
>> look at this. (At the very least, it sounds like we'll still need my
>> patch series for other architectures.)
>
> How can I obtain a kernel address of the beginning of a given page
> (as represented by struct page) on x86_64 today?

I don't know off the top of my head. I've used virt_to_phys, but
things like PFN_PHYS(page_to_pfn(page)) maybe? I'm not entirely clear
which you need, but mm.h seems to have the bulk of what I've seen.

-Kees

-- 
Kees Cook
Chrome OS Security

Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus

2014-06-13 Thread Eric W. Biederman

Rafael Tinoco  writes:

> Okay,
>
> Tests with the same script were done. 
> I'm comparing : master + patch vs 3.15.0-rc5 (last sync'ed rcu commit)
> and 3.9 last bisect good.
>
> Same tests were made. I'm comparing the following versions:
>
> 1) master + suggested patch
> 2) 3.15.0-rc5 (last rcu commit in my clone)
> 3) 3.9-rc2 (last bisect good)

I am having a hard time making sense of your numbers.

If I have read your email correctly my suggested patch caused:
"ip netns add" numbers to improve
1x "ip netns exec" to improve some
2x "ip netns exec" to show no improvement
"ip link add" to show no effect (after the 2x ip netns exec)

This is interesting in a lot of ways.
- This seems to confirm that the only rcu usage in ip netns add
  was switch_task_namespaces.  Which is convenient as that rules
  out most of the network stack when looking for performance oddities.

- "ip netns exec" had an expected performance improvement
- "ip netns exec" is still slow (so something odd is still going on)

- "ip link add" appears immaterial to the performance problem.

It would be interesting to switch the "ip link add" and "ip netns exec"
in your test case to confirm that there is nothing interesting/slow
going on in "ip link add"

Which leaves me with the question of what in "ip netns exec" remains
that is using rcu and is slowing all of this down.

Eric


>      master + sug patch    3.15.0-rc5 (last rcu)   3.9-rc2 (bisect good)
> mark     no   none    all      no  none   all      no
>
> # (netns add) / sec 
>
> 250  125.00 250.00 250.00   20.83 22.73 50.00   83.33
> 500  250.00 250.00 250.00   22.73 22.73 50.00  125.00
> 750  250.00 125.00 125.00   20.83 22.73 62.50  125.00
> 1000 125.00 250.00 125.00   20.83 20.83 50.00  250.00
> 1250 125.00 125.00 250.00   22.73 22.73 50.00  125.00
> 1500 125.00 125.00 125.00   22.73 22.73 41.67  125.00
> 1750 125.00 125.00  83.33   22.73 22.73 50.00   83.33
> 2000 125.00  83.33 125.00   22.73 25.00 50.00  125.00
>
> -> From 3.15 to patched tree, netns add performance was ***
> restored/improved *** OK
>
> # (netns add + 1 x exec) / sec
>
> 250   11.90  14.71  31.25    5.00  6.76 15.63   62.50
> 500   11.90  13.89  31.25    5.10  7.14 15.63   41.67
> 750   11.90  13.89  27.78    5.10  7.14 15.63   50.00
> 1000  11.90  13.16  25.00    4.90  6.41 15.63   35.71
> 1250  11.90  13.89  25.00    4.90  6.58 15.63   27.78
> 1500  11.36  13.16  25.00    4.72  6.25 15.63   25.00
> 1750  11.90  12.50  22.73    4.63  5.56 14.71   20.83
> 2000  11.36  12.50  22.73    4.55  5.43 13.89   17.86
>
> -> From 3.15 to patched tree, performance improves +100% but still -
> 50% of 3.9-rc2
>
> # (netns add + 2 x exec) / sec
>
> 250    6.58   8.62  16.67    2.81  3.97  9.26   41.67
> 500    6.58   8.33  15.63    2.78  4.10  9.62   31.25
> 750    5.95   7.81  15.63    2.69  3.85  8.93   25.00
> 1000   5.95   7.35  13.89    2.60  3.73  8.93   20.83
> 1250   5.81   7.35  13.89    2.55  3.52  8.62   16.67
> 1500   5.81   7.35  13.16    0.00  3.47  8.62   13.89
> 1750   5.43   6.76  13.16    0.00  3.47  8.62   11.36
> 2000   5.32   6.58  12.50    0.00  3.38  8.33    9.26
>
> -> Same as before.
>
> # netns add + 2 x exec + 1 x ip link to netns
>
> 250    7.14   8.33  14.71    2.87  3.97  8.62   35.71
> 500    6.94   8.33  13.89    2.91  3.91  8.93   25.00
> 750    6.10   7.58  13.89    2.75  3.79  8.06   19.23
> 1000   5.56   6.94  12.50    2.69  3.85  8.06   14.71
> 1250   5.68   6.58  11.90    2.58  3.57  7.81   11.36
> 1500   5.56   6.58  10.87    0.00  3.73  7.58   10.00
> 1750   5.43   6.41  10.42    0.00  3.57  7.14    8.62
> 2000   5.21   6.25  10.00    0.00  3.33  7.14    6.94
>
> -> Ip link add to netns did not change performance proportion much.
>
> # netns add + 2 x exec + 2 x ip link to netns
>
> 250    7.35   8.62  13.89    2.94  4.03  8.33   31.25
> 500    7.14   8.06  12.50    2.94  4.03  8.06   20.83
> 750    6.41   7.58  11.90    2.81  3.85  7.81   15.63
> 1000   5.95   7.14  10.87    2.69  3.79  7.35   12.50
> 1250   5.81   6.76  10.00    2.66  3.62  7.14   10.00
> 1500   5.68   6.41   9.62          3.73  6.76    8.06
> 1750   5.32   6.25   8.93          3.68  6.58    7.35
> 2000   5.43   6.10   8.33          3.42  6.10    6.41
>
> -> Same as before.
>
> OBS:
>
> 1) It seems that performance improved for network namespace
> addition, but maybe there can also be some improvement in netns
> execution. This way we might achieve the same performance as 3.9.0-rc2
> (good bisect) had.
>
> 2) These tests were made with 4 cpus only.
>
> 3) Initial charts showed that the 1 cpu case with all cpus as no-cb
> (without this patch) had something like 50% of bisect good. The 4 cpu
> (nocball) case had 26% of bisect good (as shown above in the last
> case -> 31.25 -- 8.33).
>
> 4) With the patch, using 4 cpus and nocball, we now have 44% of bisect
> good performance (against 26% we had).
>
> 5) NOCB_* is still an issue. It is clear that only NOCB_CPU_ALL option
> is giving us something near last good commit performance.
>
> Thank you
>
> Rafael

Re: oraphaned keywords in audit log text [was: Re: [PATCH] integrity: get comm using lock to avoid race in string] printing

2014-06-13 Thread Richard Guy Briggs

On 14/04/02, Richard Guy Briggs wrote:
> On 14/04/02, Mimi Zohar wrote:
> > On Wed, 2014-04-02 at 14:18 -0400, Eric Paris wrote: 
> > > On Wed, 2014-04-02 at 14:12 -0400, Mimi Zohar wrote:
> > > > On Wed, 2014-04-02 at 14:00 -0400, Steve Grubb wrote: 
> > > > > Hello Mimi,
> > > > > 
> > > > > On Wednesday, April 02, 2014 01:39:47 PM Mimi Zohar wrote:
> > > > > > This change is already being upstreamed as commit 73a6b44
> > > > > > "Integrity: Pass commname via get_task_comm()".
> > > > > 
> > > > > While I was looking at Richard's patch, I noticed a few places where
> > > > > cause and op are logged and the string isn't tied together with a _
> > > > > or -. These are in ima/ima_appraise.c line 383, and ima/ima_policy.c
> > > > > lines 333, 657, and 683. Are these fixed upstream? Or should a patch
> > > > > be made?
> > > > 
> > > > Nothing has changed in terms of 'cause' and 'op'.  I would suggest
> > > > making the changes in integrity_audit.c: integrity_audit_msg().
> 
> That function could massage incoming text fields and convert spaces to
> hyphens or underscores, but I'd assume the right place to do it would be
> in the original text.  If you suggest the former, it could just be done
> in audit_log_string(), but then grepping the source for error messages
> would not be nearly as useful.  Is this what you were suggesting?
> 
> > > The question is actually, do you know of anyone who is expecting the
> > > space, instead of a more 'audit standard' - or _ ?  If not, we'll change
> > > it.  If so, we'll discuss more   :)
> > 
> > CC'ing linux-ima-user as well.
> 
> Thanks.

Was there any response from linux-ima-user?

> > Mimi
> 
> - RGB

- RGB

--
Richard Guy Briggs 
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

[GIT PULL] Final round of SCSI updates for the 3.15+ merge window

2014-06-13 Thread James Bottomley

This is just a couple of drivers (hpsa and lpfc) that got left out for
further testing in linux-next.  We also have one fix to a prior
submission (qla2xxx sparse).

The patch is available here:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-for-linus

The short changelog is:

James Smart (16):
  lpfc: Update lpfc version to driver version 10.2.8001.0
  lpfc: Fix ExpressLane priority setup
  lpfc: mark old devices as obsolete
  lpfc: Fix for initializing RRQ bitmap
  lpfc: Fix for cleaning up stale ring flag and sp_queue_event entries
  lpfc: Update lpfc version to driver version 10.2.8000.0
  lpfc: Update Copyright on changed files from 8.3.45 patches
  lpfc: Update Copyright on changed files
  lpfc: Fixed locking for scsi task management commands
  lpfc: Convert runtime references to old xlane cfg param to fof cfg param
  lpfc: Fix FW dump using sysfs
  lpfc: Fix SLI4 s abort loop to process all FCP rings and under ring_lock
  lpfc: Fixed kernel panic in lpfc_abort_handler
  lpfc: Fix locking for postbufq when freeing
  lpfc: Fix locking for lpfc_hba_down_post
  lpfc: Fix dynamic transitions of FirstBurst from on to off

Justin Lindley (1):
  hpsa: change doorbell reset delay to ten seconds

Quinn Tran (1):
  qla2xxx: fix sparse warnings introduced by previous target mode t10-dif patch

Stephen M. Cameron (18):
  hpsa: fix handling of hpsa_volume_offline return value
  hpsa: return -ENOMEM not -1 on kzalloc failure in hpsa_get_device_id
  hpsa: remove messages about volume status VPD inquiry page not supported
  hpsa: report check condition even if no sense data present for ioaccel2 mode
  hpsa: remove bad unlikely annotation from device list updating code
  hpsa: fix event filtering to prevent excessive rescans with old firmware
  hpsa: kill annoying messages about SSD Smart Path retries
  hpsa: define extended_report_lun_entry data structure
  hpsa: Rearrange start_io to avoid one unlock/lock sequence in main io path
  hpsa: avoid unnecessary readl on every command submission
  hpsa: use per-cpu variable for lockup_detected
  hpsa: set irq affinity hints to route MSI-X vectors across CPUs
  hpsa: allocate reply queues individually
  hpsa: choose number of reply queues more intelligently.
  hpsa: remove dev_dbg() calls from hot paths
  hpsa: use gcc aligned attribute instead of manually padding structs
  hpsa: allow passthru ioctls to work with bidirectional commands
  hpsa: remove unused fields from struct ctlr_info

And the diffstat

 drivers/scsi/hpsa.c               | 266 --
 drivers/scsi/hpsa.h               |  42 +++---
 drivers/scsi/hpsa_cmd.h           |  49 +++
 drivers/scsi/lpfc/lpfc.h          |   3 +-
 drivers/scsi/lpfc/lpfc_attr.c     |  23 +--
 drivers/scsi/lpfc/lpfc_bsg.c      |   2 +-
 drivers/scsi/lpfc/lpfc_bsg.h      |   2 +-
 drivers/scsi/lpfc/lpfc_crtn.h     |   6 +-
 drivers/scsi/lpfc/lpfc_debugfs.c  |   4 +-
 drivers/scsi/lpfc/lpfc_els.c      |   2 +-
 drivers/scsi/lpfc/lpfc_hbadisc.c  |   5 +-
 drivers/scsi/lpfc/lpfc_hw.h       |   2 +-
 drivers/scsi/lpfc/lpfc_hw4.h      |   2 +-
 drivers/scsi/lpfc/lpfc_init.c     | 258 +++--
 drivers/scsi/lpfc/lpfc_mem.c      |   2 +-
 drivers/scsi/lpfc/lpfc_scsi.c     |  60 ++--
 drivers/scsi/lpfc/lpfc_scsi.h     |   2 +-
 drivers/scsi/lpfc/lpfc_sli.c      | 297 +-
 drivers/scsi/lpfc/lpfc_sli.h      |   2 +-
 drivers/scsi/lpfc/lpfc_sli4.h     |   2 +-
 drivers/scsi/lpfc/lpfc_version.h  |   6 +-
 drivers/scsi/qla2xxx/qla_def.h    |  16 +-
 drivers/scsi/qla2xxx/qla_target.c |  15 +-
 drivers/scsi/qla2xxx/qla_target.h |  16 +-
 24 files changed, 715 insertions(+), 369 deletions(-)

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Rafael J. Wysocki

On Friday, June 13, 2014 03:59:57 PM Kees Cook wrote:
> On Fri, Jun 13, 2014 at 3:54 PM, Rafael J. Wysocki  wrote:
> > On Friday, June 13, 2014 03:07:19 PM Kees Cook wrote:

[cut]

> > I'll have a closer look at that shortly (it's been quite some time since
> > I wrote that code).
> 
> Thanks; I'm trying to get a test environment instrumented too so I can
> look at this. (At the very least, it sounds like we'll still need my
> patch series for other architectures.)

How can I obtain a kernel address of the beginning of a given page
(as represented by struct page) on x86_64 today?

Rafael


Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Frederic Weisbecker

On Fri, Jun 13, 2014 at 04:27:15PM -0700, Paul E. McKenney wrote:
> On Sat, Jun 14, 2014 at 01:10:35AM +0200, Frederic Weisbecker wrote:
> > On Fri, Jun 13, 2014 at 03:49:26PM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 13, 2014 at 02:10:35PM -0700, Josh Triplett wrote:
> > > > On Fri, Jun 13, 2014 at 01:48:22PM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Jun 13, 2014 at 09:44:41AM -0700, Josh Triplett wrote:
> > > > > > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > > > > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > > > > > Is it because we have dynticks CPUs staying too long in the 
> > > > > > > > > kernel without
> > > > > > > > > taking any quiescent states? Are we perhaps missing some 
> > > > > > > > > rcu_user_enter() or
> > > > > > > > > things?
> > > > > > > > 
> > > > > > > > Sort of the former, but combined with the fact that in-kernel 
> > > > > > > > CPUs still
> > > > > > > > need scheduling-clock interrupts for RCU to make progress.  I 
> > > > > > > > could
> > > > > > > > move this to RCU's context-switch hook, but that could be very 
> > > > > > > > bad for
> > > > > > > > workloads that do lots of context switching.
> > > > > > > 
> > > > > > > Or I can restart the tick if the CPU stays in the kernel for too 
> > > > > > > long without
> > > > > > > a tick. I think that's what we were doing before but we removed 
> > > > > > > that because
> > > > > > > we never implemented it correctly (we sent scheduler IPI that did 
> > > > > > > nothing...)
> > > > > > 
> > > > > > I wonder if timer slack would make sense here: when you have at 
> > > > > > least
> > > > > > one RCU callback pending, set a timer with a huge amount of timer 
> > > > > > slack,
> > > > > > and cancel it if you end up handling the callback via a trip 
> > > > > > through the
> > > > > > scheduler.
> > > > > 
> > > > > But in this case, we need the tick even if the current CPU has no 
> > > > > callbacks
> > > > > because it might be in an RCU read-side critical section.
> > > > 
> > > > Don't we handle that case via the slowpath of rcu_read_unlock, and a
> > > > flag set via IPI?  ("Oh, that CPU has taken too long to note a quiescent
> > > > state; send it an IPI to set the special flag that makes unlock do the
> > > > work.")
> > > 
> > > There was once such logic on the force-quiescent-state path, and making
> > > that handle this new case was my first proposal.  As Frederic pointed
> > > out, that change requires rcu_needs_cpu()'s cooperation, because otherwise
> > > the CPU will take the IPI, see that it still has but one runnable task,
> > > and then keep its scheduling-clock interrupt off.
> > 
> > Exactly. So that's what happens currently, we call rcu_kick_nohz_cpu()
> > on extended grace periods but the IPI doesn't reconsider the tick.
> > 
> > In fact it doesn't do anything at all because the scheduler IPI,
> > when invoked without a reason, doesn't even call irq_enter()/irq_exit(),
> > so rcu_needs_cpu() isn't quite called from there.
> > 
> > Now that's going to change with https://lwn.net/Articles/601836/ if
> > we convert rcu_kick_nohz_cpu() to tick_nohz_full_kick_cpu().
> > 
> > Then we have the choice between two options:
> > 
> > * We can add a check in tick_nohz_full_check() and restart the tick if
> > necessary.
> > 
> > * Extend rcu_needs_cpu() to restore a similar periodic mode until the
> > grace periods get some progress.
> 
> If I was to extend rcu_needs_cpu(), I would add a flag and another counter
> to the rcu_data structure.  If rcu_needs_cpu() saw the flag set and the
> counter equal to the current ->completed value, it would return true.
> 
> I already have the rcu_kick_nohz_cpu() in rcu_implicit_dynticks_qs(),
> so it is just a matter of also setting the flag and copying ->completed
> to the new counter at that point.  I currently get to this point if the
> CPU has managed to run for more than one jiffy without hitting either
> idle or userspace execution.  Fair enough?

Perfect for me!

Thanks.

[PATCH] clkdev: Don't print errors on probe defer

2014-06-13 Thread Stephen Boyd

This error message can spam the logs if you have lots of probe
deferrals due to missing clocks. Just silence the error in this
case because the driver should try again later.

Signed-off-by: Stephen Boyd 
---
 drivers/clk/clkdev.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/clk/clkdev.c b/drivers/clk/clkdev.c
index f890b901c6bc..da4bda8b7fc7 100644
--- a/drivers/clk/clkdev.c
+++ b/drivers/clk/clkdev.c
@@ -101,8 +101,9 @@ struct clk *of_clk_get_by_name(struct device_node *np, const char *name)
if (!IS_ERR(clk))
break;
else if (name && index >= 0) {
-   pr_err("ERROR: could not get clock %s:%s(%i)\n",
-   np->full_name, name ? name : "", index);
+   if (PTR_ERR(clk) != -EPROBE_DEFER)
+   pr_err("ERROR: could not get clock %s:%s(%i)\n",
+   np->full_name, name ? name : "", index);
return clk;
}
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation


Re: [bisected] pre-3.16 regression on open() scalability

2014-06-13 Thread Dave Hansen

On 06/13/2014 03:45 PM, Paul E. McKenney wrote:
> On Fri, Jun 13, 2014 at 01:04:28PM -0700, Dav
>> So, I bisected it down to this:
>>
>>> commit ac1bea85781e9004da9b3e8a4b097c18492d857c
>>> Author: Paul E. McKenney 
>>> Date:   Sun Mar 16 21:36:25 2014 -0700
>>>
>>> sched,rcu: Make cond_resched() report RCU quiescent states
>>
>> Specifically, if I raise RCU_COND_RESCHED_LIM, things get back to their
>> 3.15 levels.
>>
>> Could the additional RCU quiescent states be causing us to be doing more
>> RCU frees than we were before, and getting less benefit from the lock
>> batching that RCU normally provides?
> 
> Quite possibly.  One way to check would be to use the debugfs files
> rcu/*/rcugp, which give a count of grace periods since boot for each
> RCU flavor.  Here "*" is rcu_preempt for CONFIG_PREEMPT and rcu_sched
> for !CONFIG_PREEMPT.
> 
> Another possibility is that someone is invoking cond_resched() in an
> incredibly tight loop.

open() does at least a couple of allocations in getname(),
get_empty_filp() and apparmor_file_alloc_security() in my kernel, and
each of those does a cond_resched() via the might_sleep() in the slub
code.  This test is doing ~400k open/closes per second per CPU, so
that's ~1.2M cond_resched()/sec/CPU, but that's still hundreds of ns
between calls on average.
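
The estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic: it assumes exactly three cond_resched() calls per open() (one per allocation site named above, which may undercount), and the macro and function names are made up for illustration.

```c
/*
 * Assumed figures taken from the estimate in the message; nothing here
 * is measured.
 */
#define OPENS_PER_SEC_PER_CPU	400000ULL	/* ~400k open/close pairs per second */
#define COND_RESCHED_PER_OPEN	3ULL		/* getname, get_empty_filp, apparmor */
#define NSEC_PER_SEC		1000000000ULL

/* cond_resched() calls per second per CPU under the assumptions above. */
static unsigned long long cond_resched_per_sec(void)
{
	return OPENS_PER_SEC_PER_CPU * COND_RESCHED_PER_OPEN;
}

/* Average gap, in whole nanoseconds, between consecutive cond_resched() calls. */
static unsigned long long cond_resched_interval_ns(void)
{
	return NSEC_PER_SEC / cond_resched_per_sec();
}
```

Under these assumptions the rate is 1.2M calls/sec/CPU and the average gap is about 833 ns, which agrees with the "hundreds of ns" figure.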

I'll do some more ftraces and dig in to those debugfs files early next week.

> But please feel free to send along your patch, CCing LKML.  Longer
> term, I probably need to take a more algorithmic approach, but what
> you have will be useful to benchmarkers until then.

With the caveat that I exerted approximately 15 seconds of brainpower to
code it up...patch attached.


---

 b/arch/x86/kernel/nmi.c|3 +++
 b/include/linux/rcupdate.h |2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff -puN arch/x86/kernel/nmi.c~dirty-rcu-hack arch/x86/kernel/nmi.c
--- a/arch/x86/kernel/nmi.c~dirty-rcu-hack	2014-06-13 16:00:30.257183228 -0700
+++ b/arch/x86/kernel/nmi.c	2014-06-13 16:00:30.261183407 -0700
@@ -88,10 +88,13 @@ __setup("unknown_nmi_panic", setup_unkno
 
 static u64 nmi_longest_ns = 1 * NSEC_PER_MSEC;
 
+u64 RCU_COND_RESCHED_LIM = 256;
 static int __init nmi_warning_debugfs(void)
 {
 	debugfs_create_u64("nmi_longest_ns", 0644,
 			arch_debugfs_dir, &nmi_longest_ns);
+	debugfs_create_u64("RCU_COND_RESCHED_LIM", 0644,
+			arch_debugfs_dir, &RCU_COND_RESCHED_LIM);
 	return 0;
 }
 fs_initcall(nmi_warning_debugfs);
diff -puN include/linux/rcupdate.h~dirty-rcu-hack include/linux/rcupdate.h
--- a/include/linux/rcupdate.h~dirty-rcu-hack	2014-06-13 16:00:35.578421426 -0700
+++ b/include/linux/rcupdate.h	2014-06-13 16:00:49.863060683 -0700
@@ -303,7 +303,7 @@ bool __rcu_is_watching(void);
  * Hooks for cond_resched() and friends to avoid RCU CPU stall warnings.
  */
 
-#define RCU_COND_RESCHED_LIM 256	/* ms vs. 100s of ms. */
+extern u64 RCU_COND_RESCHED_LIM;	/* ms vs. 100s of ms. */
 DECLARE_PER_CPU(int, rcu_cond_resched_count);
 void rcu_resched(void);
 
_
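
To show what turning the limit into a runtime variable changes, here is a userspace sketch of a call counter with a tunable threshold. It assumes the counter is reset each time the threshold is reached; the names resched_lim, should_report_qs() and count_reports() are hypothetical and are not the kernel's identifiers, and the kernel's per-CPU counter is reduced to a single variable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Runtime-tunable stand-in for RCU_COND_RESCHED_LIM (default taken from the patch). */
static uint64_t resched_lim = 256;

/* Per-CPU in the kernel; one counter is enough for this sketch. */
static uint64_t cond_resched_count;

/*
 * Hypothetical helper: returns true on every resched_lim-th call, which
 * is where a quiescent state would be reported.  Assumes the counter is
 * reset after each report.
 */
static bool should_report_qs(void)
{
	if (++cond_resched_count < resched_lim)
		return false;
	cond_resched_count = 0;
	return true;
}

/* Count how many reports a burst of 'calls' cond_resched() calls would trigger. */
static uint64_t count_reports(uint64_t calls)
{
	uint64_t reports = 0;

	cond_resched_count = 0;
	for (uint64_t i = 0; i < calls; i++)
		if (should_report_qs())
			reports++;
	return reports;
}
```

With the default of 256, a burst of 1024 calls yields four reports; raising the limit to 1024 yields one, which is the kind of reduction the debugfs knob is meant to let benchmarkers experiment with.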

Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Paul E. McKenney

On Sat, Jun 14, 2014 at 01:10:35AM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 13, 2014 at 03:49:26PM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 13, 2014 at 02:10:35PM -0700, Josh Triplett wrote:
> > > On Fri, Jun 13, 2014 at 01:48:22PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jun 13, 2014 at 09:44:41AM -0700, Josh Triplett wrote:
> > > > > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > > > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > > > > Is it because we have dynticks CPUs staying too long in the 
> > > > > > > > kernel without
> > > > > > > > taking any quiescent states? Are we perhaps missing some 
> > > > > > > > rcu_user_enter() or
> > > > > > > > things?
> > > > > > > 
> > > > > > > Sort of the former, but combined with the fact that in-kernel 
> > > > > > > CPUs still
> > > > > > > need scheduling-clock interrupts for RCU to make progress.  I 
> > > > > > > could
> > > > > > > move this to RCU's context-switch hook, but that could be very 
> > > > > > > bad for
> > > > > > > workloads that do lots of context switching.
> > > > > > 
> > > > > > Or I can restart the tick if the CPU stays in the kernel for too 
> > > > > > long without
> > > > > > a tick. I think that's what we were doing before but we removed 
> > > > > > that because
> > > > > > we never implemented it correctly (we sent scheduler IPI that did 
> > > > > > nothing...)
> > > > > 
> > > > > I wonder if timer slack would make sense here: when you have at least
> > > > > one RCU callback pending, set a timer with a huge amount of timer 
> > > > > slack,
> > > > > and cancel it if you end up handling the callback via a trip through 
> > > > > the
> > > > > scheduler.
> > > > 
> > > > But in this case, we need the tick even if the current CPU has no 
> > > > callbacks
> > > > because it might be in an RCU read-side critical section.
> > > 
> > > Don't we handle that case via the slowpath of rcu_read_unlock, and a
> > > flag set via IPI?  ("Oh, that CPU has taken too long to note a quiescent
> > > state; send it an IPI to set the special flag that makes unlock do the
> > > work.")
> > 
> > There was once such logic on the force-quiescent-state path, and making
> > that handle this new case was my first proposal.  As Frederic pointed
> > out, that change requires rcu_needs_cpu()'s cooperation, because otherwise
> > the CPU will take the IPI, see that it still has but one runnable task,
> > and then keep its scheduling-clock interrupt off.
> 
> Exactly. So that's what happens currently, we call rcu_kick_nohz_cpu()
> on extended grace periods but the IPI doesn't reconsider the tick.
> 
> In fact it doesn't do anything at all because the scheduler IPI,
> when invoked without a reason, doesn't even call irq_enter()/irq_exit(),
> so rcu_needs_cpu() isn't quite called from there.
> 
> Now that's going to change with https://lwn.net/Articles/601836/ if
> we convert rcu_kick_nohz_cpu() to tick_nohz_full_kick_cpu().
> 
> Then we have the choice between two options:
> 
> * We can add a check in tick_nohz_full_check() and restart the tick if
> necessary.
> 
> * Extend rcu_needs_cpu() to restore a similar periodic mode until the
> grace periods get some progress.

If I was to extend rcu_needs_cpu(), I would add a flag and another counter
to the rcu_data structure.  If rcu_needs_cpu() saw the flag set and the
counter equal to the current ->completed value, it would return true.

I already have the rcu_kick_nohz_cpu() in rcu_implicit_dynticks_qs(),
so it is just a matter of also setting the flag and copying ->completed
to the new counter at that point.  I currently get to this point if the
CPU has managed to run for more than one jiffy without hitting either
idle or userspace execution.  Fair enough?

Thanx, Paul
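
The flag-plus-counter scheme described above can be sketched in plain C. This is an illustration under stated assumptions, not the kernel implementation: the struct and field names (sketch_rcu_state, sketch_rcu_data, tick_wanted, tick_wanted_gp) and both helpers are hypothetical, and only the ->completed field is taken from the description.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-ins for the kernel's rcu_state and rcu_data. */
struct sketch_rcu_state {
	unsigned long completed;	/* number of the last completed grace period */
};

struct sketch_rcu_data {
	bool tick_wanted;		/* assumed name for the new flag */
	unsigned long tick_wanted_gp;	/* assumed name: snapshot of ->completed */
};

/*
 * Would run at the point where rcu_kick_nohz_cpu() is already called for
 * a CPU that is holding up the grace period: set the flag and record
 * which grace period was current.
 */
static void sketch_request_tick(struct sketch_rcu_data *rdp,
				const struct sketch_rcu_state *rsp)
{
	rdp->tick_wanted = true;
	rdp->tick_wanted_gp = rsp->completed;
}

/*
 * The extra test rcu_needs_cpu() would make: keep the tick only while
 * the grace period that triggered the request has not yet completed.
 */
static bool sketch_needs_tick(const struct sketch_rcu_data *rdp,
			      const struct sketch_rcu_state *rsp)
{
	return rdp->tick_wanted && rdp->tick_wanted_gp == rsp->completed;
}
```

Comparing against ->completed means the request expires by itself once the grace period ends, so nothing has to clear the flag explicitly for the tick to be allowed to stop again.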


Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Paul E. McKenney

On Sat, Jun 14, 2014 at 01:13:25AM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 13, 2014 at 01:49:03PM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > Is it because we have dynticks CPUs staying too long in the kernel 
> > > > > without
> > > > > taking any quiescent states? Are we perhaps missing some 
> > > > > rcu_user_enter() or
> > > > > things?
> > > > 
> > > > Sort of the former, but combined with the fact that in-kernel CPUs still
> > > > need scheduling-clock interrupts for RCU to make progress.  I could
> > > > move this to RCU's context-switch hook, but that could be very bad for
> > > > workloads that do lots of context switching.
> > > 
> > > Or I can restart the tick if the CPU stays in the kernel for too long 
> > > without
> > > a tick. I think that's what we were doing before but we removed that 
> > > because
> > > we never implemented it correctly (we sent scheduler IPI that did 
> > > nothing...)
> > 
> > That would work for me!
> > 
> > Just out of curiosity, what would you use to determine that the CPU
> > had been in the kernel too long?
> 
> I'd rather deduce that when grace periods completion go past some delay.
> I think that's the requirement for calling rcu_kick_nohz_cpu()?

OK, that does work for me.  ;-)

Thanx, Paul


Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Frederic Weisbecker

On Fri, Jun 13, 2014 at 01:49:03PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > Is it because we have dynticks CPUs staying too long in the kernel 
> > > > without
> > > > taking any quiescent states? Are we perhaps missing some 
> > > > rcu_user_enter() or
> > > > things?
> > > 
> > > Sort of the former, but combined with the fact that in-kernel CPUs still
> > > need scheduling-clock interrupts for RCU to make progress.  I could
> > > move this to RCU's context-switch hook, but that could be very bad for
> > > workloads that do lots of context switching.
> > 
> > Or I can restart the tick if the CPU stays in the kernel for too long 
> > without
> > a tick. I think that's what we were doing before but we removed that because
> > we never implemented it correctly (we sent scheduler IPI that did 
> > nothing...)
> 
> That would work for me!
> 
> Just out of curiosity, what would you use to determine that the CPU
> had been in the kernel too long?

I'd rather deduce that when grace periods completion go past some delay.
I think that's the requirement for calling rcu_kick_nohz_cpu()?

Re: [RESEND PATCH 1/2] ARM: AM43xx: hwmod: add DSS hwmod data

2014-06-13 Thread Felipe Balbi

Hi,

On Fri, Jun 13, 2014 at 07:11:58PM +, Paul Walmsley wrote:
> > > From: Sathya Prakash M R 
> > > 
> > > Add DSS hwmod data for AM43xx.
> > > 
> > > Cc: Andrew Morton 
> > > Acked-by: Rajendra Nayak 
> > > Signed-off-by: Sathya Prakash M R 
> > > Signed-off-by: Tomi Valkeinen 
> > > Signed-off-by: Felipe Balbi 
> > > ---
> > > 
> > > Note that this patch was originally sent on May 9th [1], changes were
> > > requested and a new version was sent on May 19th [2], then on May 27th [3]
> > > Tomi pinged the maintainer again and got no response.
> > > 
> > > Without this patch, we cannot get display working on any AM437x devices.
> > > 
> > > [1] http://marc.info/?l=linux-arm-kernel&m=139963677925227&w=2
> > > [2] http://marc.info/?l=linux-arm-kernel&m=140049799425512&w=2
> > > [3] http://marc.info/?l=linux-arm-kernel&m=140117232826754&w=2
> > > 
> > >  arch/arm/mach-omap2/omap_hwmod_43xx_data.c | 98 ++
> > >  arch/arm/mach-omap2/prcm43xx.h |  1 +
> > >  2 files changed, 99 insertions(+)
> 
> Sorry for the delay on this.  Have been corresponding with TI management 
> to figure out what to do about patches for AM43xx.  I don't have boards or 
> public documentation for these devices, so it's impossible for me to 
> meaningfully review the patches.  Looks like boards and/or public docs 
> won't be coming any time soon.
> 
> So for my part, here's what I'll need to merge any hwmod or PRCM patches 
> that involve AM437x:
> 
> 1. A Reviewed-by: from one of the following folks (which should come from
> a different person than who is submitting the patches):
> 
> Roger Quadros
> Nishanth Menon
> Rajendra Nayak
> Kevin Hilman
> Tony Lindgren
> 
> 2. A Tested-by: from one of the following folks (who can be the same as 
> the person who is submitting the patches):
> 
> Nishanth Menon
> Rajendra Nayak
> Kevin Hilman 
> Tony Lindgren

What you're saying here is that it's pointless for anybody else in TI to
review and/or test patches because you will only accept such tags from
this list of 4 ~ 5 people. It doesn't take a brain surgeon to note how
this won't scale and, if you continue to ignore patches during the
entire development cycle and only reply after it's too late for $this
merge window, it won't help much.

Quite frankly, it's very upsetting to see an affirmation that all the
work that I (personally) and many others do is seen as "pointless" from
your side *unless* it gets the blessing from the few folks listed above.

This just makes it ever more difficult for anything, which is clearly
*BROKEN* to be fixed upstream and will just contribute to people
vanishing from mainline development.

The very fact that you will only accept patches blessed by the gang-of-4
goes against the very foundations of open source development. Just
because you don't have access to documentation - and granted, that
_does_ make things a lot more difficult - does not mean you have to
consider an entire company as a non-trustworthy organization. Especially
when there are so many here who have been doing mainline development for
quite some time.

Anyway, whatever... I just hope that if we go through *another* merge
window without $subject being merged, someone takes the patch because
this already has a ridiculous amount of bureaucratic barriers to patches
which are, to put it very bluntly, *CORRECT*.

ps: $subject in particular, has been tested by 3 different people.
Actually 4, if you consider Darren Etheridge who used $subject to help
me get display working on AM437x SK.

pps: Darren, can you reply with your (according to Paul) pointless
Tested-by ?

-- 
balbi


Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Frederic Weisbecker

On Fri, Jun 13, 2014 at 03:49:26PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 13, 2014 at 02:10:35PM -0700, Josh Triplett wrote:
> > On Fri, Jun 13, 2014 at 01:48:22PM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 13, 2014 at 09:44:41AM -0700, Josh Triplett wrote:
> > > > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > > > Is it because we have dynticks CPUs staying too long in the 
> > > > > > > kernel without
> > > > > > > taking any quiescent states? Are we perhaps missing some 
> > > > > > > rcu_user_enter() or
> > > > > > > things?
> > > > > > 
> > > > > > Sort of the former, but combined with the fact that in-kernel CPUs 
> > > > > > still
> > > > > > need scheduling-clock interrupts for RCU to make progress.  I could
> > > > > > move this to RCU's context-switch hook, but that could be very bad 
> > > > > > for
> > > > > > workloads that do lots of context switching.
> > > > > 
> > > > > Or I can restart the tick if the CPU stays in the kernel for too long 
> > > > > without
> > > > > a tick. I think that's what we were doing before but we removed that 
> > > > > because
> > > > > we never implemented it correctly (we sent scheduler IPI that did 
> > > > > nothing...)
> > > > 
> > > > I wonder if timer slack would make sense here: when you have at least
> > > > one RCU callback pending, set a timer with a huge amount of timer slack,
> > > > and cancel it if you end up handling the callback via a trip through the
> > > > scheduler.
> > > 
> > > But in this case, we need the tick even if the current CPU has no 
> > > callbacks
> > > because it might be in an RCU read-side critical section.
> > 
> > Don't we handle that case via the slowpath of rcu_read_unlock, and a
> > flag set via IPI?  ("Oh, that CPU has taken too long to note a quiescent
> > state; send it an IPI to set the special flag that makes unlock do the
> > work.")
> 
> There was once such logic on the force-quiescent-state path, and making
> that handle this new case was my first proposal.  As Frederic pointed
> out, that change requires rcu_needs_cpu()'s cooperation, because otherwise
> the CPU will take the IPI, see that it still has but one runnable task,
> and then keep its scheduling-clock interrupt off.

Exactly. So that's what happens currently, we call rcu_kick_nohz_cpu()
on extended grace periods but the IPI doesn't reconsider the tick.

In fact it doesn't do anything at all because the scheduler IPI,
when invoked without a reason, doesn't even call irq_enter()/irq_exit(),
so rcu_needs_cpu() isn't quite called from there.

Now that's going to change with https://lwn.net/Articles/601836/ if
we convert rcu_kick_nohz_cpu() to tick_nohz_full_kick_cpu().

Then we have the choice between two options:

* We can add a check in tick_nohz_full_check() and restart the tick if
necessary.

* Extend rcu_needs_cpu() to restore a similar periodic mode until the
grace periods get some progress.

[PATCH 6/9] perf, tools: Allow events with dot

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

The Intel events use a dot to separate event name and unit mask.
Allow dot in names in the scanner, and remove special handling
of dot as EOF. Also remove the hack in jevents to replace dot
with underscore. This way dotted events can be specified
directly by the user.

I'm not fully sure this change to the scanner is correct
(what was the dot special case good for?), but I haven't
found anything that breaks with it so far at least.

Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/util/jevents.c  | 9 +
 tools/perf/util/parse-events.l | 3 +--
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/jevents.c b/tools/perf/util/jevents.c
index 1fae0b7..43550f7 100644
--- a/tools/perf/util/jevents.c
+++ b/tools/perf/util/jevents.c
@@ -100,15 +100,8 @@ static void addfield(char *map, char **dst, const char *sep,
 
 static void fixname(char *s)
 {
-   for (; *s; s++) {
+   for (; *s; s++)
*s = tolower(*s);
-   /*
-* Remove '.' for now, until the parser
-* can deal with it.
-*/
-   if (*s == '.')
-   *s = '_';
-   }
 }
 
 static void fixdesc(char *s)
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 3432995..709fa3b 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -81,7 +81,7 @@ num_dec   [0-9]+
 num_hex        0x[a-fA-F0-9]+
 num_raw_hex    [a-fA-F0-9]+
 name           [a-zA-Z_*?][a-zA-Z0-9_*?]*
-name_minus     [a-zA-Z_*?][a-zA-Z0-9\-_*?]*
+name_minus     [a-zA-Z_*?][a-zA-Z0-9\-_*?.]*
 /* If you add a modifier you need to update check_modifier() */
 modifier_event [ukhpGHSD]+
 modifier_bp    [rwx]{1,3}
@@ -119,7 +119,6 @@ modifier_bp [rwx]{1,3}
return PE_EVENT_NAME;
}
 
-.  |
 <<EOF>>{
BEGIN(INITIAL); yyless(0);
}
-- 
1.9.3
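
For readers who want to see what the widened name_minus pattern accepts, the character classes can be written out as a small C matcher. This is a sketch of the pattern `[a-zA-Z_*?][a-zA-Z0-9\-_*?.]*` only, not the flex scanner itself; the function names are hypothetical, and isalpha()/isalnum() are assumed to run in the default "C" locale so that they match exactly a-z, A-Z and 0-9.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* First character: [a-zA-Z_*?] */
static bool is_name_start(unsigned char c)
{
	return isalpha(c) || c == '_' || c == '*' || c == '?';
}

/* Following characters: [a-zA-Z0-9\-_*?.], the dot being the new addition. */
static bool is_name_cont(unsigned char c)
{
	/* strchr() also matches the terminator, so exclude '\0' explicitly. */
	return isalnum(c) || (c != '\0' && strchr("-_*?.", c) != NULL);
}

/* Returns true if the whole string matches the post-patch name_minus pattern. */
static bool matches_name_minus(const char *s)
{
	if (!is_name_start((unsigned char)*s))
		return false;
	for (s++; *s; s++)
		if (!is_name_cont((unsigned char)*s))
			return false;
	return true;
}
```

A dotted name such as baclears.any now matches as a single token, while a name that starts with a dot is still rejected because the first character class is unchanged.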


[PATCH 9/9] perf, tools: Add a --quiet flag to perf list

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

Add a --quiet flag to perf list to not print the event descriptions
that were earlier added for JSON events. This may be useful to
get a less crowded listing.

It's still default to print descriptions as that is the more useful
default for most users.  Requested by Namhyung Kim.

Before:

% perf list
...
  baclears.any               [Counts the total number when the front end is
                              resteered, mainly when the BPU cannot provide a
                              correct prediction and this is corrected by other
                              branch handling mechanisms at the front end]
  br_inst_exec.all_branches  [Speculative and retired branches]

After:

% perf list --quiet
...
  baclears.any               [Kernel PMU event]
  br_inst_exec.all_branches  [Kernel PMU event]

Signed-off-by: Andi Kleen 
---
 tools/perf/builtin-list.c  | 14 +-
 tools/perf/util/parse-events.c |  4 ++--
 tools/perf/util/parse-events.h |  2 +-
 tools/perf/util/pmu.c  |  4 ++--
 tools/perf/util/pmu.h  |  2 +-
 5 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c
index 086c96f..b064ea4 100644
--- a/tools/perf/builtin-list.c
+++ b/tools/perf/builtin-list.c
@@ -16,16 +16,20 @@
 #include "util/pmu.h"
 #include "util/parse-options.h"
 
+static bool quiet_flag;
+
 int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused)
 {
int i;
const struct option list_options[] = {
OPT_STRING(0, "events-file", _file, "json file",
   "Read event json file"),
+   OPT_BOOLEAN('q', "quiet", &quiet_flag,
+   "Don't print extra event descriptions"),
OPT_END()
};
const char * const list_usage[] = {
-   "perf list [hw|sw|cache|tracepoint|pmu|event_glob]",
+   "perf list [--events-file FILE] [--quiet] [hw|sw|cache|tracepoint|pmu|event_glob]",
NULL
};
 
@@ -35,7 +39,7 @@ int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused)
setup_pager();
 
if (argc == 0) {
-   print_events(NULL, false);
+   print_events(NULL, false, quiet_flag);
return 0;
}
 
@@ -54,15 +58,15 @@ int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused)
 strcmp(argv[i], "hwcache") == 0)
print_hwcache_events(NULL, false);
else if (strcmp(argv[i], "pmu") == 0)
-   print_pmu_events(NULL, false);
+   print_pmu_events(NULL, false, quiet_flag);
else if (strcmp(argv[i], "--raw-dump") == 0)
-   print_events(NULL, true);
+   print_events(NULL, true, quiet_flag);
else {
char *sep = strchr(argv[i], ':'), *s;
int sep_idx;
 
if (sep == NULL) {
-   print_events(argv[i], false);
+   print_events(argv[i], false, quiet_flag);
continue;
}
sep_idx = sep - argv[i];
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 1e15df1..e2badf3 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1231,7 +1231,7 @@ static void print_symbol_events(const char *event_glob, unsigned type,
 /*
  * Print the help text for the event symbols:
  */
-void print_events(const char *event_glob, bool name_only)
+void print_events(const char *event_glob, bool name_only, bool quiet)
 {
if (!name_only) {
printf("\n");
@@ -1246,7 +1246,7 @@ void print_events(const char *event_glob, bool name_only)
 
print_hwcache_events(event_glob, name_only);
 
-   print_pmu_events(event_glob, name_only);
+   print_pmu_events(event_glob, name_only, quiet);
 
if (event_glob != NULL)
return;
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index df094b4..f3ef0dc 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -100,7 +100,7 @@ void parse_events_update_lists(struct list_head *list_event,
   struct list_head *list_all);
 void parse_events_error(void *data, void *scanner, char const *msg);
 
-void print_events(const char *event_glob, bool name_only);
+void print_events(const char *event_glob, bool name_only, bool quiet);
 void print_events_type(u8 type);
 void print_tracepoint_events(const char *subsys_glob, const char *event_glob,
 bool

Re: [PATCH v3 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Rafael J. Wysocki

On Friday, June 13, 2014 01:30:34 PM Kees Cook wrote:
> Distros want to be able to offer CONFIG_RANDOMIZE_BASE as well as
> CONFIG_HIBERNATION in a single kernel. Instead of making kASLR depend on
> !HIBERNATION at compile time, allow kaslr to be selectable at boot time
> (via "kaslr" kernel command line), which will disable hibernation in the
> kernel. In this way the end user can choose which feature they want more
> with hibernation continuing to stay enabled by default (no surprises).
> 
> This also has the benefit of being able to entirely disable hibernation
> from the kernel command line, regardless of kASLR, which is a separately
> desired feature as well.
> 
> v3:
> - switch from EINVAL to EPERM (pavel, jwboyer)
> v2:
> - rework using kernel command line instead of hibernation_mode (rjw)

That looks kind of OK.

Do you want me to push this through my tree?

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/9] perf, tools: Add jsmn `jasmine' JSON parser v3

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

I need a JSON parser. This adds the simplest JSON
parser I could find -- Serge Zaitsev's jsmn `jasmine' --
to the perf library. I merely converted it to (mostly)
Linux style and added support for input that is not 0 terminated.

The parser is quite straightforward and does not
copy any data; it just returns tokens with offsets
into the input buffer. So it's relatively efficient
and simple to use.
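
To illustrate the "tokens with offsets" idea, here is a minimal sketch. It is
not the jsmn code itself: `struct span` and `scan_primitive()` are hypothetical
names, and the delimiter set is an assumption modelled on the non-strict mode.
The input is passed as a pointer plus length, so it does not have to be
0 terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token type: offsets into the caller's buffer, no copies. */
struct span {
	int start;	/* index of the first byte of the token */
	int end;	/* index one past the last byte of the token */
};

/*
 * Scan one JSON primitive (number, true, false, null) starting at *pos.
 * The input is described by (js, len) and need not be 0 terminated.
 * A NUL byte inside the buffer is treated as a delimiter, because
 * strchr() also matches the terminator of the delimiter string.
 * Returns 0 and advances *pos to the delimiter on success, -1 if no
 * token starts at *pos.
 */
static int scan_primitive(const char *js, size_t len, size_t *pos,
			  struct span *out)
{
	size_t start;

	assert(js != NULL && pos != NULL && out != NULL);
	start = *pos;
	while (*pos < len && strchr(" \t\r\n,]}:", js[*pos]) == NULL)
		(*pos)++;
	if (*pos == start)
		return -1;	/* empty token */
	out->start = (int)start;
	out->end = (int)*pos;
	return 0;
}
```

The caller keeps ownership of the buffer and reads the token text as
`js + start` with length `end - start`, which is why no allocation is needed.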

The code is not fully checkpatch clean, but I didn't
want to completely fork the upstream code.

Original source: http://zserge.bitbucket.org/jsmn.html

In addition I added a simple wrapper that mmaps a JSON
file and provides some straightforward access functions.

Used in follow-on patches to parse event files.

v2: Address review feedback.
v3: Minor checkpatch fixes.
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/Makefile.perf |   4 +
 tools/perf/util/jsmn.c   | 313 +++
 tools/perf/util/jsmn.h   |  67 ++
 tools/perf/util/json.c   | 155 +++
 tools/perf/util/json.h   |  13 ++
 5 files changed, 552 insertions(+)
 create mode 100644 tools/perf/util/jsmn.c
 create mode 100644 tools/perf/util/jsmn.h
 create mode 100644 tools/perf/util/json.c
 create mode 100644 tools/perf/util/json.h

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 9670a16..1cd32c5 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -300,6 +300,8 @@ LIB_H += ui/progress.h
 LIB_H += ui/util.h
 LIB_H += ui/ui.h
 LIB_H += util/data.h
+LIB_H += util/jsmn.h
+LIB_H += util/json.h
 
 LIB_OBJS += $(OUTPUT)util/abspath.o
 LIB_OBJS += $(OUTPUT)util/alias.o
@@ -373,6 +375,8 @@ LIB_OBJS += $(OUTPUT)util/stat.o
 LIB_OBJS += $(OUTPUT)util/record.o
 LIB_OBJS += $(OUTPUT)util/srcline.o
 LIB_OBJS += $(OUTPUT)util/data.o
+LIB_OBJS += $(OUTPUT)util/jsmn.o
+LIB_OBJS += $(OUTPUT)util/json.o
 
 LIB_OBJS += $(OUTPUT)ui/setup.o
 LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/util/jsmn.c b/tools/perf/util/jsmn.c
new file mode 100644
index 000..11d1fa1
--- /dev/null
+++ b/tools/perf/util/jsmn.c
@@ -0,0 +1,313 @@
+/*
+ * Copyright (c) 2010 Serge A. Zaitsev
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * Slightly modified by AK to not assume 0 terminated input.
+ */
+
+#include 
+#include "jsmn.h"
+
+/*
+ * Allocates a fresh unused token from the token pool.
+ */
+static jsmntok_t *jsmn_alloc_token(jsmn_parser *parser,
+  jsmntok_t *tokens, size_t num_tokens)
+{
+   jsmntok_t *tok;
+
+   if ((unsigned)parser->toknext >= num_tokens)
+   return NULL;
+   tok = &tokens[parser->toknext++];
+   tok->start = tok->end = -1;
+   tok->size = 0;
+   return tok;
+}
+
+/*
+ * Fills token type and boundaries.
+ */
+static void jsmn_fill_token(jsmntok_t *token, jsmntype_t type,
+   int start, int end)
+{
+   token->type = type;
+   token->start = start;
+   token->end = end;
+   token->size = 0;
+}
+
+/*
+ * Fills next available token with JSON primitive.
+ */
+static jsmnerr_t jsmn_parse_primitive(jsmn_parser *parser, const char *js,
+ size_t len,
+ jsmntok_t *tokens, size_t num_tokens)
+{
+   jsmntok_t *token;
+   int start;
+
+   start = parser->pos;
+
+   for (; parser->pos < len; parser->pos++) {
+   switch (js[parser->pos]) {
+#ifndef JSMN_STRICT
+   /*
+* In strict mode primitive must be followed by ","
+* or "}" or "]"
+*/
+   case ':':
+#endif
+   case '\t':
+   case '\r':
+   case '\n':
+   case ' ':
+   case ',':
+   case ']':
+   case '}':
+   goto found;
+   default:
+   break;
+

[PATCH 3/9] perf, tools: Add support for reading JSON event files v3

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

Add a parser for Intel style JSON event files. This allows
using an Intel event list directly with perf. The Intel
event lists can be quite large and are too big to store
in unswappable kernel memory.

The parser code knows how to convert the JSON fields
to perf fields. The conversion code is straightforward.
It knows very little Intel specific information, and can be easily
extended to handle fields for other CPUs.
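
As a rough illustration of the field conversion, the sketch below turns two
JSON fields into a perf term string. `build_event_terms()` is a hypothetical
helper, not the function in jevents.c, and the mapping of "EventCode" to
`event=` and "UMask" to `umask=` is an assumption based on the x86 PMU format
names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a perf event term string from the "EventCode" and "UMask"
 * values of one JSON entry (hypothetical helper, assumed mapping).
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int build_event_terms(char *buf, size_t size,
			     const char *event_code, const char *umask)
{
	int n;

	assert(buf != NULL && event_code != NULL && umask != NULL);
	n = snprintf(buf, size, "event=%s,umask=%s", event_code, umask);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string is in the same form as the alias definitions that the
existing alias machinery already parses, so no new event syntax is needed.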

The parser code is partially shared with an independent parsing
library, which is 2-clause BSD licenced. To avoid any conflicts I marked
those files as BSD licenced too. As part of perf they become GPLv2.

The events are handled using the existing alias machinery.

We output the BriefDescription in perf list.

Right now the JSON file can be specified as an argument
to perf stat/record/list. Follow-on patches will automate this.

JSON files look like this:

[
  {
"EventCode": "0x00",
"UMask": "0x01",
"EventName": "INST_RETIRED.ANY",
"BriefDescription": "Instructions retired from execution.",
"PublicDescription": "Instructions retired from execution.",
"Counter": "Fixed counter 1",
"CounterHTOff": "Fixed counter 1",
"SampleAfterValue": "203",
"MSRIndex": "0",
"MSRValue": "0",
"TakenAlone": "0",
"CounterMask": "0",
"Invert": "0",
"AnyThread": "0",
"EdgeDetect": "0",
"PEBS": "0",
"PRECISE_STORE": "0",
"Errata": "null",
"Offcore": "0"
  },

v2: Address review feedback. Rename option to --events-file
v3: Add JSON example
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-list.txt   |   6 +
 tools/perf/Documentation/perf-record.txt |   3 +
 tools/perf/Documentation/perf-stat.txt   |   3 +
 tools/perf/Makefile.perf |   2 +
 tools/perf/builtin-list.c|   2 +
 tools/perf/builtin-record.c  |   3 +
 tools/perf/builtin-stat.c|   2 +
 tools/perf/util/jevents.c| 250 +++
 tools/perf/util/jevents.h|   3 +
 tools/perf/util/pmu.c|  14 ++
 tools/perf/util/pmu.h|   2 +
 11 files changed, 290 insertions(+)
 create mode 100644 tools/perf/util/jevents.c
 create mode 100644 tools/perf/util/jevents.h

diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index 6fce6a6..9305a37 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -15,6 +15,12 @@ DESCRIPTION
 This command displays the symbolic event types which can be selected in the
 various perf commands with the -e option.
 
+OPTIONS
+---
+--events-file=::
+Specify JSON event list file to use for parsing events.
+
+
 [[EVENT_MODIFIERS]]
 EVENT MODIFIERS
 ---
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d460049..59778f4 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -214,6 +214,9 @@ if combined with -a or -C options.
 After starting the program, wait msecs before measuring. This is useful to
 filter out the startup phase of the program, which is often very different.
 
+--events-file=::
+Specify JSON event list file to use for parsing events.
+
 SEE ALSO
 
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 29ee857..7adbb08 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -142,6 +142,9 @@ filter out the startup phase of the program, which is often very different.
 
 Print statistics of transactional execution if supported.
 
+--events-file=::
+Specify JSON event list file to use for parsing events.
+
 EXAMPLES
 
 
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 1cd32c5..0016d1a 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -302,6 +302,7 @@ LIB_H += ui/ui.h
 LIB_H += util/data.h
 LIB_H += util/jsmn.h
 LIB_H += util/json.h
+LIB_H += util/jevents.h
 
 LIB_OBJS += $(OUTPUT)util/abspath.o
 LIB_OBJS += $(OUTPUT)util/alias.o
@@ -377,6 +378,7 @@ LIB_OBJS += $(OUTPUT)util/srcline.o
 LIB_OBJS += $(OUTPUT)util/data.o
 LIB_OBJS += $(OUTPUT)util/jsmn.o
 LIB_OBJS += $(OUTPUT)util/json.o
+LIB_OBJS += $(OUTPUT)util/jevents.o
 
 LIB_OBJS += $(OUTPUT)ui/setup.o
 LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c
index 011195e..086c96f 100644
--- a/tools/perf/builtin-list.c
+++ b/tools/perf/builtin-list.c
@@ -20,6 +20,8 @@ int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused)
 {
int i;
const struct option list_options[] = {
+   OPT_STRING(0, "events-file", &json_file, "json file",
+  "Read event json file"),
OPT_END()
};

[PATCH 8/9] perf, tools, test: Add test case for alias and JSON parsing v2

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

Add a simple test case to perf test that runs perf download and parses
all the available events, including json events.

This requires adding an iterator over all events to pmu.c.
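
The callback iterator pattern can be sketched as follows. This is a standalone
illustration with hypothetical names (`struct group`, `iterate_names()`,
`count_names()`), not the pmu.c code. Since a `break` in a nested loop only
leaves the inner loop, the sketch returns directly so that the first non-zero
callback result stops the whole walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: return non-zero to stop the walk. */
typedef int (*name_fn)(const char *name, void *data);

/* Hypothetical stand-in for one PMU and its list of alias names. */
struct group {
	const char *const *names;
	size_t count;
};

/*
 * Call func for every name in every group. Stop at the first non-zero
 * return value and hand it back to the caller.
 */
static int iterate_names(const struct group *groups, size_t ngroups,
			 name_fn func, void *data)
{
	size_t g, i;
	int ret;

	assert(func != NULL);
	for (g = 0; g < ngroups; g++) {
		for (i = 0; i < groups[g].count; i++) {
			ret = func(groups[g].names[i], data);
			if (ret != 0)
				return ret;	/* leaves both loops */
		}
	}
	return 0;
}

/* Example callback: count names, stop when one equals "stop". */
static int count_names(const char *name, void *data)
{
	int *count = data;

	(*count)++;
	return strcmp(name, "stop") == 0 ? -1 : 0;
}
```

A test callback that wants to see every event, like the one in this patch,
simply always returns 0 and records failures in its own counter.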

v2: Rename identifiers
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/Makefile.perf|  1 +
 tools/perf/tests/aliases.c  | 58 +
 tools/perf/tests/builtin-test.c |  4 +++
 tools/perf/tests/tests.h|  1 +
 tools/perf/util/pmu.c   | 18 +
 tools/perf/util/pmu.h   |  2 ++
 6 files changed, 84 insertions(+)
 create mode 100644 tools/perf/tests/aliases.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 0600425..6adb37f 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -419,6 +419,7 @@ endif
 LIB_OBJS += $(OUTPUT)tests/code-reading.o
 LIB_OBJS += $(OUTPUT)tests/sample-parsing.o
 LIB_OBJS += $(OUTPUT)tests/parse-no-sample-id-all.o
+LIB_OBJS += $(OUTPUT)tests/aliases.o
 ifndef NO_DWARF_UNWIND
 ifeq ($(ARCH),$(filter $(ARCH),x86 arm))
 LIB_OBJS += $(OUTPUT)tests/dwarf-unwind.o
diff --git a/tools/perf/tests/aliases.c b/tools/perf/tests/aliases.c
new file mode 100644
index 000..68396e7
--- /dev/null
+++ b/tools/perf/tests/aliases.c
@@ -0,0 +1,58 @@
+/* Check if we can set up all aliases and can read JSON files */
+#include 
+#include "tests.h"
+#include "pmu.h"
+#include "evlist.h"
+#include "parse-events.h"
+
+static struct perf_evlist *evlist;
+
+static int num_events;
+static int failed;
+
+static int test__event(const char *name)
+{
+   int ret;
+
+   /* Not supported for now */
+   if (!strncmp(name, "energy-", 7))
+   return 0;
+
+   ret = parse_events(evlist, name);
+
+   if (ret) {
+   /*
+* We only print on failure because common perf setups
+* have events that cannot be parsed.
+*/
+   fprintf(stderr, "invalid or unsupported event: '%s'\n", name);
+   ret = 0;
+   failed++;
+   } else
+   num_events++;
+   return ret;
+}
+
+int test__aliases(void)
+{
+   int err;
+
+   /* Download JSON files */
+   /* XXX assumes perf is installed */
+   /* For now user must manually download */
+   if (0 && system("perf download > /dev/null") < 0) {
+   /* Don't error out for this for now */
+   fprintf(stderr, "perf download failed\n");
+   }
+
+   evlist = perf_evlist__new();
+   if (evlist == NULL)
+   return -ENOMEM;
+
+   err = pmu_iterate_events(test__event);
+   fprintf(stderr, " Parsed %d events :", num_events);
+   if (failed > 0)
+   pr_debug(" %d events failed", failed);
+   perf_evlist__delete(evlist);
+   return err;
+}
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 6f8b01b..bb37ac2 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -154,6 +154,10 @@ static struct test {
.func = test__hists_cumulate,
},
{
+   .desc = "Test parsing JSON aliases",
+   .func = test__aliases,
+   },
+   {
.func = NULL,
},
 };
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index ed64790..ab92ad9 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -48,6 +48,7 @@ int test__mmap_thread_lookup(void);
 int test__thread_mg_share(void);
 int test__hists_output(void);
 int test__hists_cumulate(void);
+int test__aliases(void);
 
 #if defined(__x86_64__) || defined(__i386__) || defined(__arm__)
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 8714f9a..b87f520 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -869,3 +869,21 @@ bool pmu_have_event(const char *pname, const char *name)
}
return false;
 }
+
+int pmu_iterate_events(int (*func)(const char *name))
+{
+   int ret = 0;
+   struct perf_pmu *pmu;
+   struct perf_pmu_alias *alias;
+
+   perf_pmu__find("cpu"); /* Load PMUs */
+   pmu = NULL;
+   while ((pmu = perf_pmu__scan(pmu)) != NULL) {
+   list_for_each_entry(alias, &pmu->aliases, list) {
+   ret = func(alias->name);
+   if (ret != 0)
+   break;
+   }
+   }
+   return ret;
+}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 583d21e..a8ed283 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -47,5 +47,7 @@ bool pmu_have_event(const char *pname, const char *name);
 
 int perf_pmu__test(void);
 
+int pmu_iterate_events(int (*func)(const char *name));
+
 extern const char *json_file;
 #endif /* __PMU_H */
-- 
1.9.3


[PATCH 5/9] perf, tools: Add perf download to download event files v4

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

Add a downloader to automatically download the right
files from a download site.

This is implemented as a script calling wget, similar to
perf archive. The perf driver automatically calls the right
binary. The downloader is extensible, but currently only
implements an Intel event download.  It would be straightforward
to add other sites too for other vendors.

The downloaded event files are put into ~/.cache/pmu-events, where the
builtin event parser in util/* can find them automatically.

v2: Use ~/.cache
v3: Check for wget. Some cleanups.
v4: Improve manpage.
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-download.txt | 31 
 tools/perf/Documentation/perf-list.txt | 12 ++-
 tools/perf/Makefile.perf   |  5 ++-
 tools/perf/perf-download.sh| 57 ++
 4 files changed, 103 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-download.txt
 create mode 100755 tools/perf/perf-download.sh

diff --git a/tools/perf/Documentation/perf-download.txt b/tools/perf/Documentation/perf-download.txt
new file mode 100644
index 000..9e5b28e
--- /dev/null
+++ b/tools/perf/Documentation/perf-download.txt
@@ -0,0 +1,31 @@
+perf-download(1)
+===
+
+NAME
+
+perf-download - Download event files for current CPU.
+
+SYNOPSIS
+
+[verse]
+'perf download' [vendor-family-model]
+
+DESCRIPTION
+---
+This command automatically downloads the event list for the current CPU and
+stores them in $XDG_CACHE_HOME/pmu-events (or $HOME/.cache/pmu-events).
+The other tools automatically look for them there. The CPU can be also
+specified at the command line.
+
+The downloading is done using http through wget, which needs
+to be installed. When behind a firewall the proxies
+may also need to be set up using "export https_proxy="
+
+The user should regularly call this to download updated event lists
+for the current CPU.
+
+Note the downloaded files are stored per user, so if perf is
+used as both normal user and with sudo the event files may
+also need to be moved to root's home directory with
+sudo mkdir /root/.cache ; sudo cp -r ~/.cache/pmu-events /root/.cache
+after downloading.
diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index 9305a37..2b4eba0 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -61,6 +61,16 @@ Sampling). Examples to use IBS:
  perf record -a -e r076:p ...  # same as -e cpu-cycles:p
  perf record -a -e r0C1:p ...  # use ibs op counting micro-ops
 
+PER CPU EVENT LISTS
+---
+
+For some CPUs (particularly modern Intel CPUs) "perf download" can
+download additional CPU specific event definitions, which then
+become visible in perf list and available in the other perf tools.
+
+This obsoletes the raw event description method described below
+for most cases.
+
 RAW HARDWARE EVENT DESCRIPTOR
 -
 Even when an event is not available in a symbolic form within perf right now,
@@ -123,6 +133,6 @@ types specified.
 SEE ALSO
 
 linkperf:perf-stat[1], linkperf:perf-top[1],
-linkperf:perf-record[1],
+linkperf:perf-record[1], linkperf:perf-download[1],
 http://www.intel.com/Assets/PDF/manual/253669.pdf[Intel® 64 and IA-32 
Architectures Software Developer's Manual Volume 3B: System Programming Guide],
 http://support.amd.com/us/Processor_TechDocs/24593_APM_v2.pdf[AMD64 
Architecture Programmer’s Manual Volume 2: System Programming]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 0016d1a..0600425 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -126,6 +126,7 @@ PYRF_OBJS =
 SCRIPT_SH =
 
 SCRIPT_SH += perf-archive.sh
+SCRIPT_SH += perf-download.sh
 
 grep-libs = $(filter -l%,$(1))
 strip-libs = $(filter-out -l%,$(1))
@@ -877,6 +878,8 @@ install-bin: all install-gtk
$(INSTALL) -d -m 755 '$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
$(call QUIET_INSTALL, perf-archive) \
$(INSTALL) $(OUTPUT)perf-archive -t 
'$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
+   $(call QUIET_INSTALL, perf-download) \
+   $(INSTALL) $(OUTPUT)perf-download -t '$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
 ifndef NO_LIBPERL
$(call QUIET_INSTALL, perl-scripts) \
$(INSTALL) -d -m 755 
'$(DESTDIR_SQ)$(perfexec_instdir_SQ)/scripts/perl/Perf-Trace-Util/lib/Perf/Trace';
 \
@@ -922,7 +925,7 @@ config-clean:
@$(MAKE) -C config/feature-checks clean >/dev/null
 
 clean: $(LIBTRACEEVENT)-clean $(LIBAPIKFS)-clean config-clean
-   $(call QUIET_CLEAN, core-objs)  $(RM) $(LIB_OBJS) $(BUILTIN_OBJS) 
$(LIB_FILE) $(OUTPUT)perf-archive $(OUTPUT)perf.o $(LANG_BINDINGS) $(GTK_OBJS)
+   $(call QUIET_CLEAN, core-objs)  $(RM) $(LIB_OBJS) $(BUILTIN_OBJS) 
$(LIB_FILE) $(OUTPUT)perf-archive $(OUTPUT)/perf-download

[PATCH 4/9] perf, tools: Automatically look for event file name for cpu v3

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

When no JSON event file is specified, automatically look
for a suitable file in ~/.cache/pmu-events. A "perf download" can
automatically add files there for the current CPUs.
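
The lookup can be sketched roughly like this. It is a simplified illustration,
not the code in this patch: `struct cpu_id`, `cpu_id_feed()` and
`event_file_path()` are hypothetical names, and the environment values are
passed in as parameters so the logic can be exercised without touching the
real environment. The `vendor-family-MODEL-core` naming follows the format
string used in cpustr.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the three /proc/cpuinfo fields of interest. */
struct cpu_id {
	char vendor[32];
	int family;
	int model;
	int found;	/* how many of the three fields were seen */
};

/* Feed one "key : value" line; lines with other keys are ignored. */
static void cpu_id_feed(struct cpu_id *id, const char *line)
{
	assert(id != NULL && line != NULL);
	/* %31s leaves room for the terminating 0 in vendor[32] */
	if (sscanf(line, "vendor_id : %31s", id->vendor) == 1)
		id->found++;
	else if (sscanf(line, "model : %d", &id->model) == 1)
		id->found++;
	else if (sscanf(line, "cpu family : %d", &id->family) == 1)
		id->found++;
}

/*
 * Build "<cache dir>/pmu-events/<vendor>-<family>-<MODEL>-core.json".
 * xdg_cache and home stand for $XDG_CACHE_HOME and $HOME; either may be
 * NULL. Returns 0 on success, -1 if the id is incomplete, no directory
 * is known, or the path does not fit in buf.
 */
static int event_file_path(char *buf, size_t size, const struct cpu_id *id,
			   const char *xdg_cache, const char *home)
{
	int n;

	assert(buf != NULL && id != NULL);
	if (id->found < 3)
		return -1;
	if (xdg_cache)
		n = snprintf(buf, size, "%s/pmu-events/%s-%d-%X-core.json",
			     xdg_cache, id->vendor, id->family,
			     (unsigned int)id->model);
	else if (home)
		n = snprintf(buf, size,
			     "%s/.cache/pmu-events/%s-%d-%X-core.json",
			     home, id->vendor, id->family,
			     (unsigned int)id->model);
	else
		return -1;
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A caller would feed the lines of /proc/cpuinfo one by one and pass
`getenv("XDG_CACHE_HOME")` and `getenv("HOME")` for the last two arguments.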

This does not include the actual event files with perf,
but they can be automatically downloaded instead
(implemented in the next patch).

This has the advantage that the events can always be up to date,
because they are freshly downloaded. In oprofile we always
had problems with out of date or incomplete event files.

The event file format is per architecture, but can be
extended for other architectures.

v2: Supports XDG_CACHE_HOME and defaults to ~/.cache/pmu-events
v3: Minor updates and handle EVENTMAP.
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/arch/x86/Makefile  |  1 +
 tools/perf/arch/x86/util/cpustr.c | 34 
 tools/perf/util/jevents.c | 41 +++
 tools/perf/util/jevents.h |  1 +
 tools/perf/util/pmu.c |  2 +-
 5 files changed, 78 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/arch/x86/util/cpustr.c

diff --git a/tools/perf/arch/x86/Makefile b/tools/perf/arch/x86/Makefile
index 1641542..0efeb14 100644
--- a/tools/perf/arch/x86/Makefile
+++ b/tools/perf/arch/x86/Makefile
@@ -14,4 +14,5 @@ LIB_OBJS += $(OUTPUT)arch/$(ARCH)/tests/dwarf-unwind.o
 endif
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/tsc.o
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/cpustr.o
 LIB_H += arch/$(ARCH)/util/tsc.h
diff --git a/tools/perf/arch/x86/util/cpustr.c b/tools/perf/arch/x86/util/cpustr.c
new file mode 100644
index 000..e1cd76c
--- /dev/null
+++ b/tools/perf/arch/x86/util/cpustr.c
@@ -0,0 +1,34 @@
+#include 
+#include 
+#include "../../util/jevents.h"
+
+char *get_cpu_str(void)
+{
+   char *line = NULL;
+   size_t llen = 0;
+   int found = 0, n;
+   char vendor[30];
+   int model, fam;
+   char *res = NULL;
+   FILE *f = fopen("/proc/cpuinfo", "r");
+
+   if (!f)
+   return NULL;
+   while (getline(&line, &llen, f) > 0) {
+   if (sscanf(line, "vendor_id : %29s", vendor) == 1)
+   found++;
+   else if (sscanf(line, "model : %d", &model) == 1)
+   found++;
+   else if (sscanf(line, "cpu family : %d", &fam) == 1)
+   found++;
+   if (found == 3) {
+   n = asprintf(&res, "%s-%d-%X-core", vendor, fam, model);
+   if (n < 0)
+   res = NULL;
+   break;
+   }
+   }
+   free(line);
+   fclose(f);
+   return res;
+}
diff --git a/tools/perf/util/jevents.c b/tools/perf/util/jevents.c
index 943a1fc..1fae0b7 100644
--- a/tools/perf/util/jevents.c
+++ b/tools/perf/util/jevents.c
@@ -33,10 +33,49 @@
 #include 
 #include 
 #include 
+#include "cache.h"
 #include "jsmn.h"
 #include "json.h"
 #include "jevents.h"
 
+__attribute__((weak)) char *get_cpu_str(void)
+{
+   return NULL;
+}
+
+static const char *json_default_name(void)
+{
+   char *cache;
+   char *idstr = get_cpu_str();
+   char *res = NULL;
+   char *home = NULL;
+   char *emap;
+
+   emap = getenv("EVENTMAP");
+   if (emap) {
+   if (access(emap, R_OK) == 0)
+   return emap;
+   if (asprintf(&idstr, "%s-core", emap) < 0)
+   return NULL;
+   }
+
+   cache = getenv("XDG_CACHE_HOME");
+   if (!cache) {
+   home = getenv("HOME");
+   if (!home || asprintf(&cache, "%s/.cache", home) < 0)
+   goto out;
+   }
+   if (cache && idstr)
+   res = mkpath("%s/pmu-events/%s.json",
+cache,
+idstr);
+   if (home)
+   free(cache);
+out:
+   free(idstr);
+   return res;
+}
+
 static void addfield(char *map, char **dst, const char *sep,
 const char *a, jsmntok_t *bt)
 {
@@ -174,6 +213,8 @@ int json_events(const char *fn,
int i, j, len;
char *map;
 
+   if (!fn)
+   fn = json_default_name();
tokens = parse_json(fn, , , );
if (!tokens)
return -EIO;
diff --git a/tools/perf/util/jevents.h b/tools/perf/util/jevents.h
index 4c2b879..6a377a8 100644
--- a/tools/perf/util/jevents.h
+++ b/tools/perf/util/jevents.h
@@ -1,3 +1,4 @@
 int json_events(const char *fn,
int (*func)(void *data, char *name, char *event, char *desc),
void *data);
+char *get_cpu_str(void);
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 9f154af..fa21319 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -433,7 +433,7 @@ static struct perf_pmu *pmu_lookup(const char *name)
if (pmu_aliases(name, ))
return NULL;
 
-   if

[PATCH 2/9] perf, tools: Add support for text descriptions of events and alias add

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

Change pmu.c to allow descriptions of events and add interfaces
to add aliases at runtime from another file. To be used by jevents in the
next patch.

Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/util/pmu.c | 127 ++
 1 file changed, 98 insertions(+), 29 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 7a811eb..baec090 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -14,6 +14,7 @@
 
 struct perf_pmu_alias {
char *name;
+   char *desc;
struct list_head terms;
struct list_head list;
char unit[UNIT_MAX_LEN+1];
@@ -171,17 +172,12 @@ error:
return -1;
 }
 
-static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FILE *file)
+static int __perf_pmu__new_alias(struct list_head *list, char *name,
+char *dir, char *desc, char *val)
 {
struct perf_pmu_alias *alias;
-   char buf[256];
int ret;
 
-   ret = fread(buf, 1, sizeof(buf), file);
-   if (ret == 0)
-   return -EINVAL;
-   buf[ret] = 0;
-
alias = malloc(sizeof(*alias));
if (!alias)
return -ENOMEM;
@@ -190,24 +186,45 @@ static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FI
alias->scale = 1.0;
alias->unit[0] = '\0';
 
-   ret = parse_events_terms(&alias->terms, buf);
+   ret = parse_events_terms(&alias->terms, val);
if (ret) {
+   pr_err("Cannot parse alias %s: %d\n", val, ret);
free(alias);
return ret;
}
 
alias->name = strdup(name);
-   /*
-* load unit name and scale if available
-*/
-   perf_pmu__parse_unit(alias, dir, name);
-   perf_pmu__parse_scale(alias, dir, name);
 
+   if (dir) {
+   /*
+* load unit name and scale if available
+*/
+   perf_pmu__parse_unit(alias, dir, name);
+   perf_pmu__parse_scale(alias, dir, name);
+   }
+
+   alias->desc = desc ? strdup(desc) : NULL;
list_add_tail(&alias->list, list);
 
return 0;
 }
 
+static int perf_pmu__new_alias(struct list_head *list,
+  char *dir,
+  char *name,
+  FILE *file)
+{
+   char buf[256];
+   int ret;
+
+   ret = fread(buf, 1, sizeof(buf), file);
+   if (ret == 0)
+   return -EINVAL;
+   buf[ret] = 0;
+
+   return __perf_pmu__new_alias(list, name, dir, NULL, buf);
+}
+
 /*
  * Process all the sysfs attributes located under the directory
  * specified in 'dir' parameter.
@@ -720,11 +737,51 @@ static char *format_alias_or(char *buf, int len, struct perf_pmu *pmu,
return buf;
 }
 
-static int cmp_string(const void *a, const void *b)
+struct pair {
+   char *name;
+   char *desc;
+};
+
+static int cmp_pair(const void *a, const void *b)
+{
+   const struct pair *as = a;
+   const struct pair *bs = b;
+
+   /* Put downloaded event list last */
+   if (!!as->desc != !!bs->desc)
+   return !!as->desc - !!bs->desc;
+   return strcmp(as->name, bs->name);
+}
+
+static void wordwrap(char *s, int start, int max, int corr)
 {
-   const char * const *as = a;
-   const char * const *bs = b;
-   return strcmp(*as, *bs);
+   int column = start;
+   int n;
+
+   while (*s) {
+   int wlen = strcspn(s, " \t");
+
+   if (column + wlen >= max && column > start) {
+   printf("\n%*s", start, "");
+   column = start + corr;
+   }
+   n = printf("%s%.*s", column > start ? " " : "", wlen, s);
+   if (n <= 0)
+   break;
+   s += wlen;
+   column += n;
+   while (isspace(*s))
+   s++;
+   }
+}
+
+static int get_columns(void)
+{
+   /*
+* Should ask the terminal with TIOCGWINSZ here, but we
+* need the original fd before the pager.
+*/
+   return 79;
 }
 
 void print_pmu_events(const char *event_glob, bool name_only)
@@ -734,21 +791,24 @@ void print_pmu_events(const char *event_glob, bool name_only)
char buf[1024];
int printed = 0;
int len, j;
-   char **aliases;
+   struct pair *aliases;
+   int numdesc = 0;
+   int columns = get_columns();
 
pmu = NULL;
len = 0;
while ((pmu = perf_pmu__scan(pmu)) != NULL)
list_for_each_entry(alias, &pmu->aliases, list)
len++;
-   aliases = malloc(sizeof(char *) * len);
+   aliases = malloc(sizeof(struct pair) * len);
if (!aliases)
return;
pmu = NULL;
j = 0;
while ((pmu = perf_pmu__scan(pmu)) != NULL)

[PATCH 7/9] perf, tools: Query terminal width and use in perf list

2014-06-13 Thread Andi Kleen

From: Andi Kleen 

Automatically adapt the now wider and word wrapped perf list
output to wider terminals. This requires querying the terminal
before the auto pager takes over, and exporting this
information from the pager subsystem.
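
A minimal sketch of the width query, for illustration only:
`output_columns()` is a hypothetical helper, and passing the COLUMNS value and
the fallback as parameters is an assumption made to keep it easy to test. The
important point is that it has to be called while the file descriptor still
refers to the terminal, that is, before the pager replaces stdout with a pipe.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the output width in columns.
 * Order of preference: columns_env (the value of $COLUMNS, may be NULL)
 * if it parses to a positive number, then the terminal size reported
 * for fd, then the caller supplied fallback.
 */
static int output_columns(const char *columns_env, int fd, int fallback)
{
	struct winsize sz;

	assert(fallback > 0);
	if (columns_env && atoi(columns_env) > 0)
		return atoi(columns_env);
	if (isatty(fd) && ioctl(fd, TIOCGWINSZ, &sz) == 0 && sz.ws_col > 0)
		return sz.ws_col;
	return fallback;
}
```

Typical use would be `output_columns(getenv("COLUMNS"), 1, 80)` early in
startup, with the result cached for later printing code.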

Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/util/cache.h |  1 +
 tools/perf/util/pager.c | 15 +++
 tools/perf/util/pmu.c   | 12 ++--
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/cache.h b/tools/perf/util/cache.h
index 7b176dd..07527d6 100644
--- a/tools/perf/util/cache.h
+++ b/tools/perf/util/cache.h
@@ -31,6 +31,7 @@ extern void setup_pager(void);
 extern const char *pager_program;
 extern int pager_in_use(void);
 extern int pager_use_color;
+int pager_get_columns(void);
 
 char *alias_lookup(const char *alias);
 int split_cmdline(char *cmdline, const char ***argv);
diff --git a/tools/perf/util/pager.c b/tools/perf/util/pager.c
index 31ee02d..9761202 100644
--- a/tools/perf/util/pager.c
+++ b/tools/perf/util/pager.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "run-command.h"
 #include "sigchain.h"
+#include <sys/ioctl.h>
 
 /*
  * This is split up from the rest of git so that we can do
@@ -8,6 +9,7 @@
  */
 
 static int spawned_pager;
+static int pager_columns;
 
 static void pager_preexec(void)
 {
@@ -47,9 +49,12 @@ static void wait_for_pager_signal(int signo)
 void setup_pager(void)
 {
const char *pager = getenv("PERF_PAGER");
+   struct winsize sz;
 
if (!isatty(1))
return;
+   if (ioctl(1, TIOCGWINSZ, &sz) == 0)
+   pager_columns = sz.ws_col;
if (!pager) {
if (!pager_program)
perf_config(perf_default_config, NULL);
@@ -98,3 +103,13 @@ int pager_in_use(void)
env = getenv("PERF_PAGER_IN_USE");
return env ? perf_config_bool("PERF_PAGER_IN_USE", env) : 0;
 }
+
+int pager_get_columns(void)
+{
+   char *s;
+
+   s = getenv("COLUMNS");
+   if (s)
+   return atoi(s);
+   return (pager_columns ? pager_columns : 80) - 2;
+}
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index fa21319..8714f9a 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -9,6 +9,7 @@
 #include "pmu.h"
 #include "parse-events.h"
 #include "cpumap.h"
+#include "cache.h"
 #include "jevents.h"
 
 const char *json_file;
@@ -789,15 +790,6 @@ static void wordwrap(char *s, int start, int max, int corr)
}
 }
 
-static int get_columns(void)
-{
-   /*
-* Should ask the terminal with TIOCGWINSZ here, but we
-* need the original fd before the pager.
-*/
-   return 79;
-}
-
 void print_pmu_events(const char *event_glob, bool name_only)
 {
struct perf_pmu *pmu;
@@ -807,7 +799,7 @@ void print_pmu_events(const char *event_glob, bool name_only)
int len, j;
struct pair *aliases;
int numdesc = 0;
-   int columns = get_columns();
+   int columns = pager_get_columns();
 
pmu = NULL;
len = 0;
-- 
1.9.3
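
For anyone who wants to try the width lookup outside perf, here is a minimal stand-alone sketch of the precedence the patch above appears to use: $COLUMNS wins if set, otherwise the TIOCGWINSZ result, otherwise 80, with a margin of 2 subtracted in the last two cases. The helper names choose_columns() and query_tty_columns() are invented for this illustration and are not part of perf.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the perf function itself: decide how many
 * columns to format for.  env_columns is the value of $COLUMNS (or NULL),
 * tty_columns is the width reported by the terminal (0 if unknown).
 */
static int choose_columns(const char *env_columns, int tty_columns)
{
	assert(tty_columns >= 0);
	if (env_columns)
		return atoi(env_columns);
	return (tty_columns ? tty_columns : 80) - 2;
}

/* Ask the terminal behind fd for its width; 0 if fd is not a tty or the ioctl fails. */
static int query_tty_columns(int fd)
{
	struct winsize sz;

	if (!isatty(fd))
		return 0;
	if (ioctl(fd, TIOCGWINSZ, &sz) != 0)
		return 0;
	return sz.ws_col;
}
```

A caller would do the query_tty_columns(1) call before stdout is redirected into the pager, which is the reason the patch does its ioctl in setup_pager().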

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

perf: Add support for full Intel event lists v6

2014-06-13 Thread Andi Kleen

[v2: Review feedback addressed and some minor improvements]
[v3: More review feedback addressed and handle test failures better.
Ported to latest tip/core.]
[v4: Addressed Namhyung's feedback]
[v5: Rebase to latest tree. Minor description update.]
[v6: Rebase. Add acked by from Namhyung and address feedback. Some minor
fixes. Should be good to go now I hope. The period patch was dropped,
as that is already handled. I added an extra patch for a --quiet argument
for perf list]

perf has high level events which are useful in many cases. However
there are some tuning situations where low level events in the CPU
are needed. Traditionally this required specifying the event in 
raw form (very awkward) or using non standard frontends
like ocperf or patching in libpfm.

Intel CPUs can have very large event files (Haswell has ~336 core events,
many more if you add uncore or all the offcore combinations), which is too
large to describe through the kernel interface, as that would require tying
up significant amounts of unswappable memory.

oprofile always had separate event list files that were maintained by 
the CPU vendors. The oprofile events were shipped with the tool.
The Intel events get updated regularly, for example to add references
to the specification updates or add new events.

Unfortunately oprofile usually did not keep up with these updates,
so the events in oprofile were often out of date. In addition
it ties up quite a bit of disk space, mostly for CPUs you don't have.

This patch kit implements another mechanism that avoids these problems.
Intel releases the event lists for CPUs in a standardized JSON format
on a download server.

I implemented an automatic downloader to get the event file for the
current CPU.  The events are stored in ~/.cache/pmu-events.
Then perf adds a parser that converts the JSON format into perf event
aliases, which then can be used directly as any other perf event.

The parsing is done using a simple existing JSON library.

The events are still abstracted for perf, but the abstraction mechanism is
through the downloaded file instead of through the kernel.

The JSON format and perf parser have some minor Intelisms, but they
are simple and small and optional. It's easy to extend, so it would be
possible to use it for other CPUs too, add different pmu attributes, and
add new download sites to the downloader tool.

Currently only core events are supported, uncore may come at a later
point. No kernel changes, all code in perf user tools only.

Some of the parser files are partially shared with a separate event parser
library and are thus 2-clause BSD licensed.

Patches also available from
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/json

Example output:

% perf download 
Downloading models file
Downloading readme.txt
2014-03-05 10:39:33 URL:https://download.01.org/perfmon/readme.txt [10320/10320] -> "readme.txt" [1]
2014-03-05 10:39:34 URL:https://download.01.org/perfmon/mapfile.csv [1207/1207] -> "mapfile.csv" [1]
Downloading events file
% perf list
...
  br_inst_exec.all_branches       [Speculative and retired
                                   branches]
  br_inst_exec.all_conditional    [Speculative and retired
                                   macro-conditional
                                   branches]
  br_inst_exec.all_direct_jmp     [Speculative and retired
                                   macro-unconditional
                                   branches excluding
                                   calls and indirects]
... 333 more new events ...

% perf stat -e br_inst_exec.all_direct_jmp true

 Performance counter stats for 'true':

 6,817  cpu/br_inst_exec.all_direct_jmp/
   

   0.003503212 seconds time elapsed

One nice feature is that a pointer to the specification update is now
included in the description, which will hopefully clear up many problems:

% perf list
...
  mem_load_uops_l3_hit_retired.xsnp_hit  [Retired load uops which
                                          data sources were L3
                                          and cross-core snoop
                                          hits in on-pkg core
                                          cache. Supports address
                                          when precise. Spec
                                          update: HSM26, HSM30
                                          (Precise event)]
...


-Andi

Re: [PATCH 3/3][update] PM / sleep: Introduce command line argument for sleep state enumeration

2014-06-13 Thread Rafael J. Wysocki

On Saturday, June 14, 2014 12:23:15 AM Rafael J. Wysocki wrote:
> On Saturday, June 14, 2014 12:16:26 AM Rafael J. Wysocki wrote:
> > On Friday, June 13, 2014 11:46:12 PM Pavel Machek wrote:
> 
> [...]
> 
> > > > So I'm really not sure what's the problem?  Do you think it's wrong to
> > > > be helpful to users or something?
> > > 
> > > It is not wrong to be helpful, but messed up interface is too big a
> > > price.
> > 
> > Why?  I will have to maintain it after all, right?
> 
> And by the way, the very fact that this workaround is even useful in some
> cases indicates that the interface that we've invented originally is not
> particularly useful to user space.  The reason is that user space is
> supposed to enumerate the sleep states and then present the ones that are
> present in a consistent way to the user.  It basically has to do "Is 'mem'
> present?  Use it if so, but if not, is 'standby' present?  Use it if so,
> etc." every time or squirrel that information somewhere, which isn't
> particularly straightforward.

Moreover, the "mem" and "standby" states are not really well defined on
anything other than ACPI, so "mem" is used by everybody having just one
platform-supported state.  That's why user space doesn't bother to check the
other ones in many cases and it is not really their problem.  It is the
problem of our existing interface that wasn't designed correctly.

My first reaction to this issue was pretty much the same as yours, but when I
started to think more about it, I realized that we messed things up to start
with and now we're just having to deal with the consequences.  So yes, we can
put our heads in the ground and say "that's not our problem, it has to be
addressed in user space", but quite frankly I'm not seeing how we can persuade
user space developers to address it given that the vast majority of systems
use "mem" only anyway and that they don't really understand where the problem
is.

For this reason I'm considering changing the default behavior going forward (so
that "mem" is always present and means "the deepest sleep state available other
than hibernation"), but I don't want to do that in one go.

Rafael
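
To make the enumeration burden discussed above concrete, here is a rough sketch of what a user space tool ends up doing with the space-separated list it reads from /sys/power/state. The function names are invented for this example, and the preference order ("mem", then "standby", then "freeze") is an assumption of the sketch rather than something the kernel mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if word appears as a whole space-separated token in list. */
static int has_token(const char *list, const char *word)
{
	size_t n = strlen(word);
	const char *p = list;

	while ((p = strstr(p, word)) != NULL) {
		int starts = (p == list) || (p[-1] == ' ');
		int ends = (p[n] == '\0') || (p[n] == ' ') || (p[n] == '\n');

		if (starts && ends)
			return 1;
		p += n;
	}
	return 0;
}

/*
 * Hypothetical policy: pick the deepest non-hibernation state that the
 * platform advertises, or NULL if none of the known names is present.
 */
static const char *pick_sleep_state(const char *available)
{
	static const char *const preference[] = { "mem", "standby", "freeze" };
	size_t i;

	assert(available != NULL);
	for (i = 0; i < sizeof(preference) / sizeof(preference[0]); i++)
		if (has_token(available, preference[i]))
			return preference[i];
	return NULL;
}
```

If "mem" always meant "the deepest sleep state available other than hibernation", the whole preference table would collapse to a single write of "mem".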


Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Kees Cook

On Fri, Jun 13, 2014 at 3:54 PM, Rafael J. Wysocki  wrote:
> On Friday, June 13, 2014 03:07:19 PM Kees Cook wrote:
>> On Fri, Jun 13, 2014 at 3:14 PM, Rafael J. Wysocki wrote:
>> > On Friday, June 13, 2014 10:32:56 AM Kees Cook wrote:
>> >> On Fri, Jun 13, 2014 at 3:51 AM, Pavel Machek  wrote:
>> >> > Hi!
>> >> >
>> >> >
>> >> >> >>> Any way we can make them work together instead?
>> >> >> >>
>> >> >> >> I'm sure there is, but I don't know the solution. :)
>> >> >> >>
>> >> >> >> At the very least this gets us one step closer (we can build them
>> >> >> >> together).
>> >> >> >>
>> >> >> >
>> >> >> > But it is really invasive.
>> >> >>
>> >> >> Well, I don't agree there. I actually would like to be able to turn
>> >> >> off hibernation support on distro kernels regardless of kASLR, so I
>> >> >> think this is really killing two birds with one stone.
>> >> >>
>> >> >> > I have to admit to being somewhat fuzzy on what the core problem with
>> >> >> > hibernation and kASLR is... in both cases there is a set of pages
>> >> >> > that need to be installed, some of which will overlap the loader
>> >> >> > kernel.
>> >> >> > What am I missing?
>> >> >>
>> >> >> I don't know how resume works, but I have assumed that the newly
>> >> >> loaded kernel stays in memory and pulls in the vmalloc, kmalloc,
>> >> >> modules, and userspace memory maps from disk. Since these things can
>> >> >> easily contain references to kernel text, if the newly loaded kernel
>> >> >> has moved with regard to the hibernated image, everything breaks.
>> >> >> IIUC, this is similar to why you can't rebuild your kernel and resume
>> >> >> from a different version.
>> >> >
>> >> > x86-64 can resume from different kernel that did the suspend. kASLR
>> >> > should not be too different from that. (You just include kernel text
>> >> > in the hibernation image. It is small enough to do that.)
>> >>
>> >> Oooh, that's very exciting! How does that work (what happens to the
>> >> kernel that booted first, etc)? I assume physical memory layout can't
>> >> change between hibernation and resume? Or, where should I be reading
>> >> code that does this?
>> >
>> > I guess it would help if you were a bit less sarcastic, but perhaps that's
>> > just me.
>>
>> Oh, er, I think that got misunderstood. I'm very rarely sarcastic in
>> online communication. I wasn't being sarcastic here at all. I _do_
>> find it exciting that one can resume with a different kernel! That's
>> been a limitation that plagued me for years. I had no idea that
>> restriction got lifted. I really did mean I was excited. Sorry if that
>> was misunderstood!
>
> Sorry about my misunderstanding. :-)
>
>> > Anyway, the core hibernation code actually works with page frames rather
>> > than with virtual addresses.  Essentially, it creates a bitmap where each
>> > page frame is represented by a single bit and the bits representing free
>> > page frames are unset.  It then allocates as many new pages as there are
>> > set bits in the bitmap and copies the entire contents of the page frames
>> > represented by those bits to new pages it's just allocated. That covers
>> > the entire kernel with its data and all process memory and is saved to
>> > disk storage along with the PFNs of the page frames whose contents have
>> > been copied.
>> >
>> > During resume it simply restores the contents of the saved page frames
>> > into those same page frames if they are available at that time.  For the
>> > page frames that aren't free then it allocates memory to store their
>> > contents temporarily and creates a list of PFNs where that contents should
>> > be moved eventually.  Then, it quiesces all activity of the system and
>> > jumps to arch-specific code that copies data from the temporary memory to
>> > the target page frames (that generally overwrites the boot kernel, so
>> > there's no way back from it).  Finally, it jumps to a specific address
>> > where the hibernated kernel trampoline code should be present.
>> >
>> > I think what fails with kASLR is that last step, because everything else
>> > should be entirely agnostic to the way the virtual addresses are laid out.
>> > I'm not sure how to fix that at the moment, but it should be fixable at
>> > least on x86_64.
>>
>> Very cool. How does the kernel doing the resume identify the
>> trampoline location in the hibernated kernel? If it can handle a
>> different kernel in the hibernation image, I assume there's been some
>> specific identification in the image instead of using what
>> kernel-doing-the-resume thinks the trampoline is (based on its own
>> offsets).
>
> There is a simple mechanism to pass the address to jump to in the image
> header.  Unfortunately, that *is* a virtual address if I remember correctly.
>
> I'll have a closer look at that shortly (it's been quite some time since
> I wrote that code).

Thanks; I'm trying to get a test environment instrumented too so I can
look at this. (At the very least, it sounds like we'll

Re: [PATCH 3/3] ARM: dts: Enable audio support for Peach-pi board

2014-06-13 Thread Doug Anderson

Mark,

On Fri, Jun 13, 2014 at 3:04 PM, Mark Brown  wrote:
> On Fri, Jun 13, 2014 at 02:58:26PM -0700, Doug Anderson wrote:
>
>> Anyway, suffice to say that the i2c core needs to be extended to
>> handle the idea that a single device has more than one "compatible"
>> string.  I'll leave it to an eager reader of this thread to implement
>> this since we can also fix our own problem by just listing "max98091"
>> in "sound/soc/codecs/max98090.c" like has always been done in the
>> past.
>
> Why do you need to register multiple compatible strings (I guess for
> fallback purposes?).

I'm no expert, but I think that's part of device tree isn't it?

In the case of max98090 and max98091, they are incredibly similar
pieces of hardware (I think the max98091 simply has more microphones).
If you've got a driver for a max98090 it will work just fine for a
max98091 but you just won't get the extra microphones.

In cases like this then device tree theory says that you should list
both compatible strings: max98091 and max98090, right?  If your OS has
a driver for max98091 it will use it.  ...if it doesn't but it has a
max98090 driver it will try that one.

As far as I understand we _shouldn't_ lie and just say that we have a
max98090 when we really have a max98091.  The device tree is supposed
to describe the hardware and isn't supposed to care that the OS has a
driver for max98090 but not max98091.

Ironically in our case we have a driver that supports both the 98090
and the 98091 via autodetect.  However, it doesn't know about the
98091 compatible string so if you list yourself as compatible with
98091 then it won't find the driver.

> A quick fix that is about as good is to take the
> first compatible only.

That's how the code works today, actually.  ...but as per above the
current 98090 driver doesn't know about the 98091 compatible string,
so:

compatible = "maxim,max98091", "maxim,max98090";

...won't find the right driver.

--

The quick fix is to add max98091 to the max98090 driver and is what
I'd suggest in this case.  ...but I still think that the above logic
is valid and eventually the i2c core should be fixed.  Please correct
me if I'm wrong.

-Doug
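
As a rough illustration of the fallback matching argued for above, the sketch below walks a node's compatible list from most specific to least specific and returns the first string a driver claims. It is a stand-alone toy, not the i2c core or the OF matching code, and match_compatible() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the first entry of the node's compatible list (most specific
 * first) that also appears in the driver's match table, or NULL.
 */
static const char *match_compatible(const char *const *node_compat, size_t n_node,
				    const char *const *driver_ids, size_t n_ids)
{
	size_t i, j;

	assert(node_compat != NULL && driver_ids != NULL);
	for (i = 0; i < n_node; i++)
		for (j = 0; j < n_ids; j++)
			if (strcmp(node_compat[i], driver_ids[j]) == 0)
				return node_compat[i];
	return NULL;
}
```

Passing n_node = 1 models the "take the first compatible only" behaviour: with the list "maxim,max98091", "maxim,max98090" and a driver that only knows "maxim,max98090", that call finds nothing, while walking both entries falls back to the max98090 match.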

Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Paul E. McKenney

On Fri, Jun 13, 2014 at 02:10:35PM -0700, Josh Triplett wrote:
> On Fri, Jun 13, 2014 at 01:48:22PM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 13, 2014 at 09:44:41AM -0700, Josh Triplett wrote:
> > > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > > Is it because we have dynticks CPUs staying too long in the kernel
> > > > > > without taking any quiescent states? Are we perhaps missing some
> > > > > > rcu_user_enter() or things?
> > > > > 
> > > > > Sort of the former, but combined with the fact that in-kernel CPUs
> > > > > still need scheduling-clock interrupts for RCU to make progress.  I
> > > > > could move this to RCU's context-switch hook, but that could be very
> > > > > bad for workloads that do lots of context switching.
> > > > 
> > > > Or I can restart the tick if the CPU stays in the kernel for too long
> > > > without a tick. I think that's what we were doing before but we removed
> > > > that because we never implemented it correctly (we sent scheduler IPI
> > > > that did nothing...)
> > > 
> > > I wonder if timer slack would make sense here: when you have at least
> > > one RCU callback pending, set a timer with a huge amount of timer slack,
> > > and cancel it if you end up handling the callback via a trip through the
> > > scheduler.
> > 
> > But in this case, we need the tick even if the current CPU has no callbacks
> > because it might be in an RCU read-side critical section.
> 
> Don't we handle that case via the slowpath of rcu_read_unlock, and a
> flag set via IPI?  ("Oh, that CPU has taken too long to note a quiescent
> state; send it an IPI to set the special flag that makes unlock do the
> work.")

There was once such logic on the force-quiescent-state path, and making
that handle this new case was my first proposal.  As Frederic pointed
out, that change requires rcu_needs_cpu()'s cooperation, because otherwise
the CPU will take the IPI, see that it still has but one runnable task,
and then keep its scheduling-clock interrupt off.

The thing that involves rcu_read_unlock_special() is a flag set
by the scheduling-clock interrupt, which doesn't help here.  Also,
if a CPU stays in the kernel for a very long time without passing
through any RCU read-side critical sections, there is nothing that
rcu_read_unlock_special() can do to help.

Thanx, Paul


Re: [bisected] pre-3.16 regression on open() scalability

2014-06-13 Thread Paul E. McKenney

On Fri, Jun 13, 2014 at 01:04:28PM -0700, Dave Hansen wrote:
> Hi Paul,
> 
> I'm seeing a regression when comparing 3.15 to Linus's current tree.
> I'm using Anton Blanchard's will-it-scale "open1" test which creates a
> bunch of processes and does open()/close() in a tight loop:
> 
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/open1.c
> 
> At about 50 cores worth of processes, 3.15 and the pre-3.16 code start
> to diverge, with 3.15 scaling better:
> 
>   http://sr71.net/~dave/intel/3.16-open1regression-0.png
> 
> Some profiles point to a big increase in contention inside slub.c's
> get_partial_node() (the allocation side of the slub code) causing the
> regression.  That particular open() test is known to do a lot of slab
> operations.  But, the odd part is that the slub code hasn't been touched
> much.
> 
> So, I bisected it down to this:
> 
> > commit ac1bea85781e9004da9b3e8a4b097c18492d857c
> > Author: Paul E. McKenney 
> > Date:   Sun Mar 16 21:36:25 2014 -0700
> > 
> > sched,rcu: Make cond_resched() report RCU quiescent states
> 
> Specifically, if I raise RCU_COND_RESCHED_LIM, things get back to their
> 3.15 levels.
> 
> Could the additional RCU quiescent states be causing us to be doing more
> RCU frees than we were before, and getting less benefit from the lock
> batching that RCU normally provides?

Quite possibly.  One way to check would be to use the debugfs files
rcu/*/rcugp, which give a count of grace periods since boot for each
RCU flavor.  Here "*" is rcu_preempt for CONFIG_PREEMPT and rcu_sched
for !CONFIG_PREEMPT.

Another possibility is that someone is invoking cond_resched() in an
incredibly tight loop.

> The top RCU functions in the profiles are as follows:
> 
> > 3.15.0-xxx: 2.58%  open1_processes  [kernel.kallsyms]  [k] file_free_rcu
> > 3.15.0-xxx: 2.45%  open1_processes  [kernel.kallsyms]  [k] __d_lookup_rcu
> > 3.15.0-xxx: 2.41%  open1_processes  [kernel.kallsyms]  [k] rcu_process_callbacks
> > 3.15.0-xxx: 1.87%  open1_processes  [kernel.kallsyms]  [k] __call_rcu.constprop.10
> 
> > 3.16.0-rc0: 2.68%  open1_processes  [kernel.kallsyms]  [k] rcu_process_callbacks
> > 3.16.0-rc0: 2.68%  open1_processes  [kernel.kallsyms]  [k] file_free_rcu
> > 3.16.0-rc0: 1.55%  open1_processes  [kernel.kallsyms]  [k] __call_rcu.constprop.10
> > 3.16.0-rc0: 1.28%  open1_processes  [kernel.kallsyms]  [k] __d_lookup_rcu
> 
> With everything else equal, we'd expect to see all of these _higher_ in
> the profiles on the faster kernel (3.15) since it has more RCU work to do.
> 
> But, they're all _roughly_ the same.  __d_lookup_rcu went up in the
> profile on the fast one (3.15) probably because there _were_ more
> lookups happening there.
> 
> rcu_process_callbacks makes me suspicious.  It went up slightly
> (probably in the noise), but it _should_ have dropped due to there being
> less RCU work to do.
> 
> This supports the theory that there are more callbacks happening than
> before, causing more slab lock contention, which is the actual trigger
> for the performance drop.
> 
> I also hacked in an interface to make RCU_COND_RESCHED_LIM a tunable.
> Making it huge instantly makes my test go fast, and dropping it to 256
> instantly makes it slow.  Some brief toying with it shows that
> RCU_COND_RESCHED_LIM has to be about 100,000 before performance gets
> back to where it was before.

That is way bigger than I would expect.  My bet is that someone is
invoking cond_resched() in a 10s-of-nanoseconds tight loop.

But please feel free to send along your patch, CCing LKML.  Longer
term, I probably need to take a more algorithmic approach, but what
you have will be useful to benchmarkers until then.

Thanx, Paul
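
For readers following the RCU_COND_RESCHED_LIM discussion, here is a toy user-space model of the idea as it is described in this thread: a counter is bumped on every cond_resched() call and a quiescent state is reported only once the counter reaches the limit. The struct and function names are invented, and the real per-CPU kernel code is not reproduced here; this only shows how the limit scales the number of reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a per-CPU counter; the real kernel state is not modelled. */
struct resched_counter {
	unsigned long count;
	unsigned long limit;	/* plays the role of RCU_COND_RESCHED_LIM */
};

/* Called on every simulated cond_resched(); true when a quiescent state would be reported. */
static bool should_report_qs(struct resched_counter *c)
{
	assert(c->limit > 0);
	if (++c->count < c->limit)
		return false;
	c->count = 0;
	return true;
}

/* Count how many reports a run of back-to-back calls produces. */
static unsigned long reports_for(unsigned long calls, unsigned long limit)
{
	struct resched_counter c = { 0, limit };
	unsigned long i, reports = 0;

	for (i = 0; i < calls; i++)
		if (should_report_qs(&c))
			reports++;
	return reports;
}
```

In this model a million calls yield 3906 reports with a limit of 256 but only 10 with a limit of 100,000, which is the kind of ratio that would matter if something really is calling cond_resched() in a very tight loop.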


Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Rafael J. Wysocki

On Friday, June 13, 2014 03:07:19 PM Kees Cook wrote:
> On Fri, Jun 13, 2014 at 3:14 PM, Rafael J. Wysocki  wrote:
> > On Friday, June 13, 2014 10:32:56 AM Kees Cook wrote:
> >> On Fri, Jun 13, 2014 at 3:51 AM, Pavel Machek  wrote:
> >> > Hi!
> >> >
> >> >
> >> >> >>> Any way we can make them work together instead?
> >> >> >>
> >> >> >> I'm sure there is, but I don't know the solution. :)
> >> >> >>
> >> >> >> At the very least this gets us one step closer (we can build them
> >> >> >> together).
> >> >> >>
> >> >> >
> >> >> > But it is really invasive.
> >> >>
> >> >> Well, I don't agree there. I actually would like to be able to turn
> >> >> off hibernation support on distro kernels regardless of kASLR, so I
> >> >> think this is really killing two birds with one stone.
> >> >>
> >> >> > I have to admit to being somewhat fuzzy on what the core problem with
> >> >> > hibernation and kASLR is... in both cases there is a set of pages that
> >> >> > need to be installed, some of which will overlap the loader kernel.
> >> >> > What am I missing?
> >> >>
> >> >> I don't know how resume works, but I have assumed that the newly
> >> >> loaded kernel stays in memory and pulls in the vmalloc, kmalloc,
> >> >> modules, and userspace memory maps from disk. Since these things can
> >> >> easily contain references to kernel text, if the newly loaded kernel
> >> >> has moved with regard to the hibernated image, everything breaks.
> >> >> IIUC, this is similar to why you can't rebuild your kernel and resume
> >> >> from a different version.
> >> >
> >> > x86-64 can resume from different kernel that did the suspend. kASLR
> >> > should not be too different from that. (You just include kernel text
> >> > in the hibernation image. It is small enough to do that.)
> >>
> >> Oooh, that's very exciting! How does that work (what happens to the
> >> kernel that booted first, etc)? I assume physical memory layout can't
> >> change between hibernation and resume? Or, where should I be reading
> >> code that does this?
> >
> > I guess it would help if you were a bit less sarcastic, but perhaps that's
> > just me.
> 
> Oh, er, I think that got misunderstood. I'm very rarely sarcastic in
> online communication. I wasn't being sarcastic here at all. I _do_
> find it exciting that one can resume with a different kernel! That's
> been a limitation that plagued me for years. I had no idea that
> restriction got lifted. I really did mean I was excited. Sorry if that
> was misunderstood!

Sorry about my misunderstanding. :-)

> > Anyway, the core hibernation code actually works with page frames rather
> > than with virtual addresses.  Essentially, it creates a bitmap where each
> > page frame is represented by a single bit and the bits representing free
> > page frames are unset.  It then allocates as many new pages as there are
> > set bits in the bitmap and copies the entire contents of the page frames
> > represented by those bits to new pages it's just allocated. That covers
> > the entire kernel with its data and all process memory and is saved to
> > disk storage along with the PFNs of the page frames whose contents have
> > been copied.
> >
> > During resume it simply restores the contents of the saved page frames
> > into those same page frames if they are available at that time.  For the
> > page frames that aren't free then it allocates memory to store their
> > contents temporarily and creates a list of PFNs where that contents should
> > be moved eventually.  Then, it quiesces all activity of the system and
> > jumps to arch-specific code that copies data from the temporary memory to
> > the target page frames (that generally overwrites the boot kernel, so
> > there's no way back from it).  Finally, it jumps to a specific address
> > where the hibernated kernel trampoline code should be present.
> >
> > I think what fails with kASLR is that last step, because everything else
> > should be entirely agnostic to the way the virtual addresses are laid out.
> > I'm not sure how to fix that at the moment, but it should be fixable at
> > least on x86_64.
> 
> Very cool. How does the kernel doing the resume identify the
> trampoline location in the hibernated kernel? If it can handle a
> different kernel in the hibernation image, I assume there's been some
> specific identification in the image instead of using what
> kernel-doing-the-resume thinks the trampoline is (based on its own
> offsets).

There is a simple mechanism to pass the address to jump to in the image
header.  Unfortunately, that *is* a virtual address if I remember correctly.

I'll have a closer look at that shortly (it's been quite some time since
I wrote that code).

Rafael
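
The save and restore steps described in the quoted explanation can be pictured with a toy user-space model. This is only a sketch under heavy simplification: "memory" is eight 4-byte frames, the in-memory image doubles as the temporary storage for frames the boot kernel still occupies, and none of the names below come from the kernel's hibernation code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NFRAMES   8	/* toy "physical memory": 8 page frames */
#define FRAME_SZ  4	/* 4 bytes per frame keeps the example readable */

struct saved_page {
	unsigned int pfn;		/* frame the contents came from */
	unsigned char data[FRAME_SZ];	/* copy of that frame */
};

struct image {
	unsigned int count;
	struct saved_page pages[NFRAMES];
};

/* Hibernate: copy every in-use frame, remembering its PFN. */
static void snapshot(unsigned char mem[NFRAMES][FRAME_SZ],
		     const bool in_use[NFRAMES], struct image *img)
{
	unsigned int pfn;

	img->count = 0;
	for (pfn = 0; pfn < NFRAMES; pfn++) {
		if (!in_use[pfn])
			continue;
		img->pages[img->count].pfn = pfn;
		memcpy(img->pages[img->count].data, mem[pfn], FRAME_SZ);
		img->count++;
	}
}

/*
 * Resume: frames that are free in the boot kernel are restored in place
 * right away; frames the boot kernel is using are deferred and only
 * overwritten in the final pass.  Returns the number of deferred frames.
 */
static unsigned int restore(unsigned char mem[NFRAMES][FRAME_SZ],
			    const bool boot_in_use[NFRAMES],
			    const struct image *img)
{
	const struct saved_page *deferred[NFRAMES];
	unsigned int i, ndeferred = 0;

	assert(img->count <= NFRAMES);
	for (i = 0; i < img->count; i++) {
		const struct saved_page *p = &img->pages[i];

		if (boot_in_use[p->pfn])
			deferred[ndeferred++] = p;
		else
			memcpy(mem[p->pfn], p->data, FRAME_SZ);
	}
	/* Point of no return: this pass overwrites the boot kernel's frames. */
	for (i = 0; i < ndeferred; i++)
		memcpy(mem[deferred[i]->pfn], deferred[i]->data, FRAME_SZ);
	return ndeferred;
}
```

Nothing in this model depends on virtual addresses, which matches the point that only the final jump into the restored kernel should care about how the virtual layout was randomized.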


Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Alexei Starovoitov

On Fri, Jun 13, 2014 at 2:42 PM, Andy Lutomirski  wrote:
> On Fri, Jun 13, 2014 at 2:37 PM, Alexei Starovoitov  wrote:
>> On Fri, Jun 13, 2014 at 2:25 PM, Andy Lutomirski  wrote:
>>> On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov wrote:
>>>> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook wrote:
>>>>> This adds the new "seccomp" syscall with both an "operation" and "flags"
>>>>> parameter for future expansion. The third argument is a pointer value,
>>>>> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
>>>>> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>>>>>
>>>>> Signed-off-by: Kees Cook
>>>>> Cc: linux-...@vger.kernel.org
>>>>> ---
>>>>>  arch/x86/syscalls/syscall_32.tbl  |    1 +
>>>>>  arch/x86/syscalls/syscall_64.tbl  |    1 +
>>>>>  include/linux/syscalls.h          |    2 ++
>>>>>  include/uapi/asm-generic/unistd.h |    4 ++-
>>>>>  include/uapi/linux/seccomp.h      |    4 +++
>>>>>  kernel/seccomp.c                  |   63 -
>>>>>  kernel/sys_ni.c                   |    3 ++
>>>>>  7 files changed, 69 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>>>>> index d6b867921612..7527eac24122 100644
>>>>> --- a/arch/x86/syscalls/syscall_32.tbl
>>>>> +++ b/arch/x86/syscalls/syscall_32.tbl
>>>>> @@ -360,3 +360,4 @@
>>>>>  351     i386    sched_setattr           sys_sched_setattr
>>>>>  352     i386    sched_getattr           sys_sched_getattr
>>>>>  353     i386    renameat2               sys_renameat2
>>>>> +354     i386    seccomp                 sys_seccomp
>>>>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>>>>> index ec255a1646d2..16272a6c12b7 100644
>>>>> --- a/arch/x86/syscalls/syscall_64.tbl
>>>>> +++ b/arch/x86/syscalls/syscall_64.tbl
>>>>> @@ -323,6 +323,7 @@
>>>>>  314     common  sched_setattr           sys_sched_setattr
>>>>>  315     common  sched_getattr           sys_sched_getattr
>>>>>  316     common  renameat2               sys_renameat2
>>>>> +317     common  seccomp                 sys_seccomp
>>>>>
>>>>>  #
>>>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>>>> index b0881a0ed322..1713977ee26f 100644
>>>>> --- a/include/linux/syscalls.h
>>>>> +++ b/include/linux/syscalls.h
>>>>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>>>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>>>>                           unsigned long idx1, unsigned long idx2);
>>>>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
>>>>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>>>>> +                            const char __user *uargs);
>>>>
>>>> It looks odd to add 'flags' argument to syscall that is not even used.
>>>> I don't think it will be extensible this way.
>>>> 'uargs' is used only in 2nd command as well and it's not 'char __user *'
>>>> but rather 'struct sock_fprog __user *'
>>>> I think it makes more sense to define only first argument as 'int op' and
>>>> the rest as variable length array.
>>>> Something like:
>>>> long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
>>>> then different commands can interpret 'attrs' differently.
>>>> if op == mode_strict, then attrs == NULL, len == 0
>>>> if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
>>>> and nla_data(attrs) is 'struct sock_fprog'
>>>
>>> Eww.  If the operation doesn't imply the type, then I think we've
>>> totally screwed up.
>>>
>>>> If we decide to add new types of filters or new commands, the syscall
>>>> prototype won't need to change. New commands can be added preserving
>>>> backward compatibility.
>>>> The basic TLV concept has been around forever in netlink world. imo makes
>>>> sense to use it with new syscalls. Passing 'struct xxx' into syscalls
>>>> is the thing of the past. TLV style is more extensible. Fields of
>>>> structures can become optional in the future, new fields added, etc.
>>>> 'struct nlattr' brings the same benefits to kernel api as protobuf did
>>>> to user land.
>>>
>>> I see no reason to bring nl_attr into this.
>>>
>>> Admittedly, I've never dealt with nl_attr, but everything
>>> netlink-related I've ever been involved in has involved some sort of
>>> API atrocity.
>>
>> netlink has a lot of legacy and there is genetlink which is not pretty
>> either because of extra socket creation, binding, dealing with packet
>> loss issues, but the key concept of variable length encoding is sound.
>> Right now seccomp has two commands and they already don't fit
>> into single syscall neatly. Are you saying there should be two syscalls
>> here? What about another seccomp related command? Another syscall?
>> imo all seccomp related commands need to be mux/demux-ed under
>> one

Re: [PATCH V2 00/19] irqchip: crossbar: driver fixes

2014-06-13 Thread Joe Perches

On Fri, 2014-06-13 at 22:38 +0200, Thomas Gleixner wrote:
> On Fri, 13 Jun 2014, Jason Cooper wrote:
> > On Fri, Jun 13, 2014 at 09:48:24AM -0700, Joe Perches wrote:
> > > On Fri, 2014-06-13 at 12:37 -0400, Jason Cooper wrote:
> > > > On Fri, Jun 13, 2014 at 09:14:34AM -0700, Joe Perches wrote:
> > > > > On Fri, 2014-06-13 at 11:01 -0400, Jason Cooper wrote:
> > > > > > Please format the subject lines like so:
> > > > > > 
> > > > > >   irqchip: crossbar: Set cb pointer ...
> > > > > >  ^
> > > > > >  |
> > > > > >  \-- note the capitalization
> > > > > 
> > > > > I suggest you don't make this a rule and focus
> > > > > on more important stuff instead.

[elided the bit that describes what a patch subject looks like]

Documentation/SubmittingPatches simply says:

The canonical patch subject line is:

Subject: [PATCH 001/123] subsystem: summary phrase

It doesn't say anything about capitalization.

> Sentences start with an upper case letter. Our brain is trained on
> that rule when parsing a line.

  I don't think patch subjects are sentences.
The docs call them phrases.

> So for people who actually review patches by reading them instead of
> running a spell checker, consistent formatting is more important than
> avoiding the random typo, which our brain just blends out in most of
> the cases. Unfortunately also when the typo is in actual code :(

That part about the code is true.

Anyway, how you spend your time is certainly up to you.
Do what makes you happy.

But if you want this specific form for your patches,
please just document it somewhere in the kernel tree.

I think that commit log subjects are generally relatively
easy to parse as-is and don't need more strictures.

I think it's akin to British/American spelling usage, or to
I versus i.  I just don't care which people use.

I did propose a mechanism to nudge people when proposed
patch subjects don't fit some specific maintainer's idea
of proper.

https://lkml.org/lkml/2010/11/16/245

cheers, Joe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


pull request: bluetooth 2014-06-13

2014-06-13 Thread Gustavo Padovan

Hi John,

This is our first batch of fixes for 3.16. Be aware that two patches here
are not exactly bugfixes:

* 71f28af57066 Bluetooth: Add clarifying comment for conn->auth_type
This commit just adds some important security comments to the code. We found
it important enough to include here for 3.16 since it is security related.

* 9f7ec8871132 Bluetooth: Refactor discovery stopping into its own function
This commit is just a refactor in preparation for a fix in the next
commit (f8680f128b).

All the other patches are fixes for deadlocks and for the Bluetooth protocols,
most of them related to authentication and encryption.

This is rebased on net.git of yesterday, so we need you to pull it first and
then pull from us. This rebase was necessary for us.

Please pull or let me know of any concerns you may have. Thanks!

Gustavo

---
The following changes since commit f9da455b93f6ba076935b4ef4589f61e529ae046:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next (2014-06-12 14:27:40 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth.git for-upstream

for you to fetch changes up to 92d1372e1a9fec00e146b74e8b9ad7a385b9b37f:

  Bluetooth: Allow change security level on ATT_CID in slave role (2014-06-13 14:36:39 +0200)


Johan Hedberg (9):
  Bluetooth: Fix incorrectly overriding conn->src_type
  Bluetooth: Fix check for connection encryption
  Bluetooth: Fix SSP acceptor just-works confirmation without MITM
  Bluetooth: Add clarifying comment for conn->auth_type
  Bluetooth: Fix setting correct authentication information for SMP STK
  Bluetooth: Fix indicating discovery state when canceling inquiry
  Bluetooth: Refactor discovery stopping into its own function
  Bluetooth: Reuse hci_stop_discovery function when cleaning up HCI state
  Bluetooth: Fix locking of hdev when calling into SMP code

Jukka Taimisto (1):
  Bluetooth: Fix deadlock in l2cap_conn_del()

Marcin Kraglak (1):
  Bluetooth: Allow change security level on ATT_CID in slave role

 net/bluetooth/hci_conn.c   |   7 +---
 net/bluetooth/hci_event.c  |  17 --
 net/bluetooth/l2cap_core.c |   8 -
 net/bluetooth/l2cap_sock.c |   5 ---
 net/bluetooth/mgmt.c       | 104
 net/bluetooth/smp.c        |   9 --
 6 files changed, 85 insertions(+), 65 deletions(-)




Re: [PATCH 3/3] ARM: dts: Enable audio support for Peach-pi board

2014-06-13 Thread Mark Brown

On Fri, Jun 13, 2014 at 02:58:26PM -0700, Doug Anderson wrote:

> Anyway, suffice to say that the i2c core needs to be extended to
> handle the idea that a single device has more than one "compatible"
> string.  I'll leave it to an eager reader of this thread to implement
> this since we can also fix our own problem by just listing "max98091"
> in "sound/soc/codecs/max98090.c" like has always been done in the
> past.

Why do you need to register multiple compatible strings (I guess for
fallback purposes)?  A quick fix that is about as good is to take the
first compatible only.



Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Kees Cook

On Fri, Jun 13, 2014 at 3:14 PM, Rafael J. Wysocki  wrote:
> On Friday, June 13, 2014 10:32:56 AM Kees Cook wrote:
>> On Fri, Jun 13, 2014 at 3:51 AM, Pavel Machek  wrote:
>> > Hi!
>> >
>> >
>> >> >>> Any way we can make them work together instead?
>> >> >>
>> >> >> I'm sure there is, but I don't know the solution. :)
>> >> >>
>> >> >> At the very least this gets us one step closer (we can build them 
>> >> >> together).
>> >> >>
>> >> >
>> >> > But it is really invasive.
>> >>
>> >> Well, I don't agree there. I actually would like to be able to turn
>> >> off hibernation support on distro kernels regardless of kASLR, so I
>> >> think this is really killing two birds with one stone.
>> >>
>> >> > I have to admit to being somewhat fuzzy on what the core problem with
>> >> > hibernation and kASLR is... in both cases there is a set of pages that
>> >> > need to be installed, some of which will overlap the loader kernel.
>> >> > What am I missing?
>> >>
>> >> I don't know how resume works, but I have assumed that the newly
>> >> loaded kernel stays in memory and pulls in the vmalloc, kmalloc,
>> >> modules, and userspace memory maps from disk. Since these things can
>> >> easily contain references to kernel text, if the newly loaded kernel
>> >> has moved with regard to the hibernated image, everything breaks.
>> >> IIUC, this is similar why you can't rebuild your kernel and resume
>> >> from a different version.
>> >
>> > x86-64 can resume from different kernel that did the suspend. kASLR
>> > should not be too different from that. (You just include kernel text
>> > in the hibernation image. It is small enough to do that.)
>>
>> Oooh, that's very exciting! How does that work (what happens to the
>> kernel that booted first, etc)? I assume physical memory layout can't
>> change between hibernation and resume? Or, where should I be reading
>> code that does this?
>
> I guess it would help if you were a bit less sarcastic, but perhaps that's
> just me.

Oh, er, I think that got misunderstood. I'm very rarely sarcastic in
online communication. I wasn't being sarcastic here at all. I _do_
find it exciting that one can resume with a different kernel! That's
been a limitation that plagued me for years. I had no idea that
restriction got lifted. I really did mean I was excited. Sorry if that
was misunderstood!

> Anyway, the core hibernation code actually works with page frames rather
> than with virtual addresses.  Essentially, it creates a bitmap where each
> page frame is represented by a single bit and the bits representing free
> page frames are unset.  It then allocates as many new pages as there are
> set bits in the bitmap and copies the entire contents of the page frames
> represented by those bits to new pages it's just allocated. That covers
> the entire kernel with its data and all process memory and is saved to
> disk storage along with the PFNs of the page frames whose contents have
> been copied.
>
> During resume it simply restores the contents of the saved page frames
> into those same page frames if they are available at that time.  For the
> page frames that aren't free then it allocates memory to store their
> contents temporarily and creates a list of PFNs where that contents should
> be moved eventually.  Then, it quiesces all activity of the system and
> jumps to arch-specific code that copies data from the temporary memory to
> the target page frames (that generally overwrites the boot kernel, so there's
> no way back from it).  Finally, it jumps to a specific address where the
> hibernated kernel trampoline code should be present.
>
> I think what fails with kASLR is that last step, because everything else
> should be entirely agnostic to the way the virtual addresses are laid out.
> I'm not sure how to fix that at the moment, but it should be fixable at
> least on x86_64.

Very cool. How does the kernel doing the resume identify the
trampoline location in the hibernated kernel? If it can handle a
different kernel in the hibernation image, I assume there's been some
specific identification in the image instead of using what
kernel-doing-the-resume thinks the trampoline is (based on its own
offsets).

-Kees

-- 
Kees Cook
Chrome OS Security

Re: [PATCH v3] printk: allow increasing the ring buffer depending on the number of CPUs

2014-06-13 Thread Luis R. Rodriguez

On Fri, Jun 13, 2014 at 2:41 PM, Davidlohr Bueso  wrote:
> On Fri, 2014-06-13 at 11:28 -0700, Luis R. Rodriguez wrote:
>> + /*
>> +  * If you set log_buf_len=n kernel parameter LOG_CPU_MIN_BUF_SHIFT will
>> +  * be ignored. LOG_CPU_MIN_BUF_SHIFT is a proactive measure for large
>> +  * systems. With a LOG_BUF_SHIFT of 18 and LOG_CPU_MIN_BUF_SHIFT 12 at
>> +  * we'd require more than 64 CPUs to trigger an increase from the
>> +  * default.
>> +  */
>> + if (!new_log_buf_len && (cpu_extra > __LOG_BUF_LEN / 2))
>  ^ that ! looks wrong.

That check is there so that we ignore the cpu_extra stuff if the
kernel parameter was passed, given that in that case new_log_buf_len
would be set.

> We should be checking for log_buf_len set instead.

When log_buf_len=n is passed as a kernel parameter, log_buf_len_setup()
sets new_log_buf_len. The sanity check that the ring buffer is only
updated when the value passed is greater than the default value is
also done by log_buf_len_setup().

>> + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
>
> You could also move the whole thing below the return statement, that way
> we can avoid double checking new_log_buf_len. Otherwise looks kinda
> weird.

If we did we'd be forcing the kernel parameter to be used to enable
this functionality, but we don't want that.

  Luis
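
A minimal user-space sketch of the decision being debated above may help.
It is not the patch itself: demo_pick_log_buf_len() and its parameters are
hypothetical names, cpu_extra is taken as an input because the thread does
not show how it is computed, and the sketch returns the default length
where the kernel code would simply leave the static buffer in place.

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring the check under discussion.
 *   cmdline_len: value set from log_buf_len=n, or 0 if it was not given
 *   default_len: the compile-time buffer size (__LOG_BUF_LEN in the patch)
 *   cpu_extra:   extra space wanted for the additional CPUs
 */
size_t demo_pick_log_buf_len(size_t cmdline_len, size_t default_len,
			     size_t cpu_extra)
{
	/* An explicit log_buf_len=n wins, so cpu_extra is ignored. */
	if (cmdline_len)
		return cmdline_len;

	/* Only grow when the per-CPU extra exceeds half the default. */
	if (cpu_extra > default_len / 2)
		return default_len + cpu_extra;

	/* Otherwise keep the default buffer. */
	return default_len;
}
```

The first branch corresponds to the `!new_log_buf_len` test: the kernel
parameter does not enable the feature, it only overrides it.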

Re: [PATCH 3/3][update] PM / sleep: Introduce command line argument for sleep state enumeration

2014-06-13 Thread Rafael J. Wysocki

On Saturday, June 14, 2014 12:16:26 AM Rafael J. Wysocki wrote:
> On Friday, June 13, 2014 11:46:12 PM Pavel Machek wrote:

[...]

> > > So I'm really not sure what's the problem?  Do you think it's wrong to be
> > > helpful to users or something?
> > 
> > It is not wrong to be helpful, but messed up interface is too big a
> > price.
> 
> Why?  I will have to maintain it after all, right?

And by the way, the very fact that this workaround is even useful in some cases
indicates that the interface that we've invented originally is not particularly
useful to user space.  The reason is that user space is supposed to
enumerate the sleep states and then present the ones that are present in a
consistent way to the user.  It basically has to do "Is 'mem' present?  Use it
if so, but if not, is 'standby' present?  Use it if so, etc." every time, or
squirrel that information away somewhere, which isn't particularly straightforward.

So perhaps we should change the interface entirely?

Rafael
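
To make the enumeration burden concrete, here is a rough sketch of the
fallback logic user space ends up writing. It assumes the caller has
already read the space-separated labels from /sys/power/state into a
string; demo_has_state() and demo_pick_sleep_state() are hypothetical
names, and the preference order (mem, then standby, then freeze) is only
one plausible policy.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if 'label' appears in 'states' as a whole token, else 0. */
static int demo_has_state(const char *states, const char *label)
{
	size_t len = strlen(label);
	const char *p = states;

	while ((p = strstr(p, label)) != NULL) {
		int starts = (p == states) || p[-1] == ' ';
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p += len;	/* partial match, keep scanning */
	}
	return 0;
}

/* Pick the most preferred sleep state that is listed, or NULL if none is. */
const char *demo_pick_sleep_state(const char *states)
{
	static const char *const preference[] = { "mem", "standby", "freeze" };

	for (size_t i = 0; i < sizeof(preference) / sizeof(preference[0]); i++)
		if (demo_has_state(states, preference[i]))
			return preference[i];
	return NULL;
}
```

A framework that hard-codes "mem" skips this loop entirely, which is why
it finds nothing to use on a platform that only offers "freeze".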


Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Kees Cook

On Fri, Jun 13, 2014 at 2:42 PM, Andy Lutomirski  wrote:
> On Fri, Jun 13, 2014 at 2:37 PM, Alexei Starovoitov  wrote:
>> On Fri, Jun 13, 2014 at 2:25 PM, Andy Lutomirski  wrote:
>>> On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov  
>>> wrote:
>>>> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook  wrote:
>>>>> This adds the new "seccomp" syscall with both an "operation" and "flags"
>>>>> parameter for future expansion. The third argument is a pointer value,
>>>>> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
>>>>> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>>>>>
>>>>> Signed-off-by: Kees Cook 
>>>>> Cc: linux-...@vger.kernel.org
>>>>> ---
>>>>>  arch/x86/syscalls/syscall_32.tbl  |    1 +
>>>>>  arch/x86/syscalls/syscall_64.tbl  |    1 +
>>>>>  include/linux/syscalls.h          |    2 ++
>>>>>  include/uapi/asm-generic/unistd.h |    4 ++-
>>>>>  include/uapi/linux/seccomp.h      |    4 +++
>>>>>  kernel/seccomp.c                  |   63 -
>>>>>  kernel/sys_ni.c                   |    3 ++
>>>>>  7 files changed, 69 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>>>>> index d6b867921612..7527eac24122 100644
>>>>> --- a/arch/x86/syscalls/syscall_32.tbl
>>>>> +++ b/arch/x86/syscalls/syscall_32.tbl
>>>>> @@ -360,3 +360,4 @@
>>>>>  351     i386    sched_setattr           sys_sched_setattr
>>>>>  352     i386    sched_getattr           sys_sched_getattr
>>>>>  353     i386    renameat2               sys_renameat2
>>>>> +354     i386    seccomp                 sys_seccomp
>>>>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>>>>> index ec255a1646d2..16272a6c12b7 100644
>>>>> --- a/arch/x86/syscalls/syscall_64.tbl
>>>>> +++ b/arch/x86/syscalls/syscall_64.tbl
>>>>> @@ -323,6 +323,7 @@
>>>>>  314     common  sched_setattr           sys_sched_setattr
>>>>>  315     common  sched_getattr           sys_sched_getattr
>>>>>  316     common  renameat2               sys_renameat2
>>>>> +317     common  seccomp                 sys_seccomp
>>>>>
>>>>>  #
>>>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>>>> index b0881a0ed322..1713977ee26f 100644
>>>>> --- a/include/linux/syscalls.h
>>>>> +++ b/include/linux/syscalls.h
>>>>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>>>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>>>>  unsigned long idx1, unsigned long idx2);
>>>>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
>>>>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>>>>> +   const char __user *uargs);

>>>> It looks odd to add 'flags' argument to syscall that is not even used.
>>>> I don't think it will be extensible this way.
>>>> 'uargs' is used only in 2nd command as well and it's not 'char __user *'
>>>> but rather 'struct sock_fprog __user *'
>>>> I think it makes more sense to define only first argument as 'int op' and
>>>> the rest as variable length array.
>>>> Something like:
>>>> long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
>>>> then different commands can interpret 'attrs' differently.
>>>> if op == mode_strict, then attrs == NULL, len == 0
>>>> if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
>>>> and nla_data(attrs) is 'struct sock_fprog'
>>>
>>> Eww.  If the operation doesn't imply the type, then I think we've
>>> totally screwed up.
>>>
>>>> If we decide to add new types of filters or new commands, the syscall prototype
>>>> won't need to change. New commands can be added preserving backward
>>>> compatibility.
>>>> The basic TLV concept has been around forever in netlink world. imo makes
>>>> sense to use it with new syscalls. Passing 'struct xxx' into syscalls
>>>> is the thing of the past. TLV style is more extensible. Fields of structures
>>>> can become optional in the future, new fields added, etc.
>>>> 'struct nlattr' brings the same benefits to kernel api as protobuf did
>>>> to user land.
>>>
>>> I see no reason to bring nl_attr into this.
>>>
>>> Admittedly, I've never dealt with nl_attr, but everything
>>> netlink-related I've ever been involved in has involved some sort of
>>> API atrocity.
>>
>> netlink has a lot of legacy and there is genetlink which is not pretty
>> either because of extra socket creation, binding, dealing with packet
>> loss issues, but the key concept of variable length encoding is sound.
>> Right now seccomp has two commands and they already don't fit
>> into single syscall neatly. Are you saying there should be two syscalls
>> here? What about another seccomp related command? Another syscall?
>> imo all seccomp related commands needs to be mux/demux-ed under
>> one

Re: [PATCH v4] lib: add size unit t/p/e to memparse

2014-06-13 Thread David Rientjes

On Fri, 13 Jun 2014, Gui Hecheng wrote:

> diff --git a/lib/cmdline.c b/lib/cmdline.c
> index d4932f7..76a712e 100644
> --- a/lib/cmdline.c
> +++ b/lib/cmdline.c
> @@ -121,11 +121,7 @@ EXPORT_SYMBOL(get_options);
>   *   @retptr: (output) Optional pointer to next char after parse completes
>   *
>   *   Parses a string into a number.  The number stored at @ptr is
> - *   potentially suffixed with %K (for kilobytes, or 1024 bytes),
> - *   %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
> - *   1073741824).  If the number is suffixed with K, M, or G, then
> - *   the return value is the number multiplied by one kilobyte, one
> - *   megabyte, or one gigabyte, respectively.
> + *   potentially suffixed with K, M, G, T, P, E.
>   */
>  
>  unsigned long long memparse(const char *ptr, char **retptr)
> @@ -135,6 +131,15 @@ unsigned long long memparse(const char *ptr, char **retptr)
>   unsigned long long ret = simple_strtoull(ptr, &endptr, 0);
>  
>   switch (*endptr) {
> + case 'E':
> + case 'e':
> + ret <<= 10;
> + case 'P':
> + case 'p':
> + ret <<= 10;
> + case 'T':
> + case 't':
> + ret <<= 10;
>   case 'G':
>   case 'g':
>   ret <<= 10;

Seems fine since unsigned long long is always at least 64 bits, but 
perhaps also change simple_strtoull() to use kstrtoull() at the same time 
since the former is deprecated?
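
For readers who want to try the suffix handling outside the kernel, the
fall-through chain in the patch can be sketched in user space as follows.
This is an illustration rather than the kernel function: demo_memparse()
is a hypothetical name and strtoull() stands in for simple_strtoull().

```c
#include <stdlib.h>

/*
 * Parse a number with an optional K/M/G/T/P/E suffix (either case).
 * Each case shifts by 10 bits and falls through to the next smaller
 * unit, so 'T' ends up shifting by 40 bits in total.
 */
unsigned long long demo_memparse(const char *ptr, char **retptr)
{
	char *endptr;	/* first character after the digits */
	unsigned long long ret = strtoull(ptr, &endptr, 0);

	switch (*endptr) {
	case 'E': case 'e':
		ret <<= 10;
		/* fall through */
	case 'P': case 'p':
		ret <<= 10;
		/* fall through */
	case 'T': case 't':
		ret <<= 10;
		/* fall through */
	case 'G': case 'g':
		ret <<= 10;
		/* fall through */
	case 'M': case 'm':
		ret <<= 10;
		/* fall through */
	case 'K': case 'k':
		ret <<= 10;
		endptr++;	/* consume the suffix character */
		/* fall through */
	default:
		break;
	}

	if (retptr)
		*retptr = endptr;

	return ret;
}
```

With a 64-bit unsigned long long, "1E" is 2^60 and still fits, but the
shift silently wraps once the value reaches 16E, so callers that accept
untrusted input may want an explicit range check.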

Re: [PATCH 3/3][update] PM / sleep: Introduce command line argument for sleep state enumeration

2014-06-13 Thread Rafael J. Wysocki

On Friday, June 13, 2014 11:46:12 PM Pavel Machek wrote:
> Hi!
> 
> > > > > > From: Rafael J. Wysocki 
> > > > > > 
> > > > > > On some systems the platform doesn't support neither
> > > > > > PM_SUSPEND_MEM nor PM_SUSPEND_STANDBY, so PM_SUSPEND_FREEZE is the
> > > > > > only available system sleep state.  However, some user space 
> > > > > > frameworks
> > > > > > only use the "mem" and (sometimes) "standby" sleep state labels, so
> > > > > > the users of those systems need to modify user space in order to be
> > > > > > able to use system suspend at all and that is not always possible.
> > > > > 
> > > > > I'd say we should fix the frameworks, not add option to change kernel
> > > > > interfaces.
> > > > > 
> > > > > Because, as you mentioned, if we add this, we are probably going to
> > > > > get stuck with it forever :-(.
> > > > 
> > > > Unfortunately, fixing the frameworks is rather less than realistic in 
> > > > any
> > > > reasonable time frame, since  Android. :-)
> > > 
> > > Actually, you still have the sources from android, and this issue
> > > sounds almost simple enough for binary patch.
> > > 
> > > Android misuses /proc/sys/vm/drop_caches, too, IIRC. Are we going to
> > > change interface to match their expectations? They have binder and
> > > wakelocks. Are we going to apply those patches just because Android
> > > wants that?
> > 
> > That depends on which versions of Android you're talking about.  The
> > newest ones use the power management interfaces we have upstream.
> 
> Ok, good, so they can fix their code.
> 
> What problem are you solving? Do you have some weird hardware where
> suspend to memory is impossible? 
> 
> > > Android people usually patch their kernels, anyway, so why not add
> > > this one, too?
> > 
> > I'm not talking about Android kernels, but about Android user space.
> 
> I know. Android userspace usually runs on modified kernel, so you can
> simply add your patch. But I don't think its suitable for mainline.  
> 
> > And this is not only about Android, other distros also have user space that
> > uses "mem" only, because nobody has used anything else for a long time 
> > anyway.
> > For the users of those distros, if they don't want to modify user space,
> > having a kernel command line like this is actually helpful.
> 
> Yes, still its wrong place to fix it...

This isn't a fix.  It's a workaround.

> > So I'm really not sure what's the problem?  Do you think it's wrong to be
> > helpful to users or something?
> 
> It is not wrong to be helpful, but messed up interface is too big a
> price.

Why?  I will have to maintain it after all, right?

Rafael


Re: [PATCH] firmware: Add device tree binding for coreboot

2014-06-13 Thread Julius Werner

> This is just to export a fixed log to userspace (like a DMI table) or
> the kernel will actually use the data in some way? Based on the link,
> it looks like the former to me.

I could imagine both. The link is an in-kernel driver that exposes a
log through a sysfs node (in a way that has already been established
on x86 systems, which find the location through EBDA or ACPI entries
instead). We are also using a user-space tool that reads the address
from /proc/device-tree and accesses it through /dev/mem. The areas can
contain many interesting entries (like the location of an early
framebuffer set up by the firmware), so I could also imagine use cases
where the kernel makes use of it directly.

> Don't you need to keep the kernel from allocating this memory by
> using one of the reserved memory mechanisms? The recently added one
> should be able to specify what the memory is reserved for IIRC.

Our bootloader is carving the location out of the /memory node and
adding it to the device tree reserve map. As far as I know, that only
contains a list of raw start and size entries. At any rate, I think
it's useful (and in line with other bindings) to add a more explicit
node like this (if only to make it more easily accessible through
/proc/device-tree).

> /firmware is already used IIRC. What if you have other firmware such
> as Trustzone?

I'm not quite sure how Trusted Foundations works and whether it would
even make sense to use it in parallel to coreboot, but it seems to be
using the /firmware/trusted-foundations subnode so that should be
fine. "firmware" seems to be used by other firmware implementations
(like "samsung,secure-firmware") which are similar in nature to and
mutually exclusive with coreboot, so I thought the node makes sense.
(The kernel should use the compatible string to find it anyway, so a
future name clash would not be world-ending.)

Re: [PATCH 3/3] ARM: dts: Enable audio support for Peach-pi board

2014-06-13 Thread Doug Anderson

Mark,

On Fri, Jun 13, 2014 at 10:13 AM, Doug Anderson  wrote:
> Mark,
>
> On Fri, Jun 13, 2014 at 10:05 AM, Mark Brown  wrote:
>> On Fri, Jun 13, 2014 at 10:03:50AM -0700, Doug Anderson wrote:
>>> On Tue, Jun 10, 2014 at 10:32 PM, Tushar Behera  
>>> wrote:
>>
>>> > Peach-pi board has MAX98090 audio codec connected on HSI2C-7 bus.
>>
>>> If you want to be a stickler about it, peach-pi actually has a
>>> max98091.  That requires code changes to the i2c driver, though.
>>> ...and unfortunately listing two compatible strings for i2c devices is
>>> broken.  :(
>>
>> It is?  We should fix that if it's the case...
>
> Yah, I mentioned it to Mark Rutland at the last ELC and he said he
> might take a look at it, but I probably should have posted something
> up to the i2c list.
>
> I made a half-assed attempt to fix it locally in the ChromeOS but
> quickly found that it was going to be a much bigger job than I had
> time for...
>
> https://chromium-review.googlesource.com/#/c/184406/
>
> IIRC i2c_new_device didn't return an error like I thought it would,
> probably trying to deal with the fact that devices might show up at a
> later point in time.
>
>
> Hrm, now that I think about it I wonder if the right answer is just to
> call i2c_new_device for all the compatible strings even if it doesn't
> return an error.  I'd have to go back and try that and re-explore this
> code...

Nope, that didn't work either.  Now I remember trying that before,
too.  It doesn't like you registering two different devices with the
same address:

[2.582539] DOUG: /i2c@12CD/codec@10 (0) max98091
[2.587360] DOUG: /i2c@12CD/codec@10 (0) max98091
[2.591160] DOUG: /i2c@12CD/codec@10 (1) max98090
[2.596686] i2c i2c-7: Failed to register i2c client max98090 at 0x10 (-16)

If you hack out the check for address business:

sysfs: cannot create duplicate filename '/devices/12cd.i2c/i2c-7/7-0010'

Anyway, suffice to say that the i2c core needs to be extended to
handle the idea that a single device has more than one "compatible"
string.  I'll leave it to an eager reader of this thread to implement
this since we can also fix our own problem by just listing "max98091"
in "sound/soc/codecs/max98090.c" like has always been done in the
past.


-Doug
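
The fallback behaviour discussed in this thread can be sketched as a
single lookup instead of one device registration per compatible string.
This is only an illustration of the idea, not i2c core code:
demo_first_supported() is a hypothetical helper, and the string lists
stand in for a device node's "compatible" property (most specific entry
first) and a driver's match table.

```c
#include <stddef.h>
#include <string.h>

/*
 * Walk the device's compatible list in order and return the index of
 * the first entry that the driver table also contains, or -1 if the
 * driver supports none of them.
 */
int demo_first_supported(const char *const *compat, size_t n_compat,
			 const char *const *supported, size_t n_supported)
{
	for (size_t i = 0; i < n_compat; i++)
		for (size_t j = 0; j < n_supported; j++)
			if (strcmp(compat[i], supported[j]) == 0)
				return (int)i;
	return -1;
}
```

Because only one match is chosen, only one client would be created at
address 0x10, which sidesteps the duplicate-registration failure shown in
the log above. A driver that knows only "max98090" still binds through
the second entry.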

Re: [PATCH 0/2] make kASLR vs hibernation boot-time selectable

2014-06-13 Thread Rafael J. Wysocki

On Friday, June 13, 2014 10:32:56 AM Kees Cook wrote:
> On Fri, Jun 13, 2014 at 3:51 AM, Pavel Machek  wrote:
> > Hi!
> >
> >
> >> >>> Any way we can make them work together instead?
> >> >>
> >> >> I'm sure there is, but I don't know the solution. :)
> >> >>
> >> >> At the very least this gets us one step closer (we can build them 
> >> >> together).
> >> >>
> >> >
> >> > But it is really invasive.
> >>
> >> Well, I don't agree there. I actually would like to be able to turn
> >> off hibernation support on distro kernels regardless of kASLR, so I
> >> think this is really killing two birds with one stone.
> >>
> >> > I have to admit to being somewhat fuzzy on what the core problem with
> >> > hibernation and kASLR is... in both cases there is a set of pages that
> >> > need to be installed, some of which will overlap the loader kernel.
> >> > What am I missing?
> >>
> >> I don't know how resume works, but I have assumed that the newly
> >> loaded kernel stays in memory and pulls in the vmalloc, kmalloc,
> >> modules, and userspace memory maps from disk. Since these things can
> >> easily contain references to kernel text, if the newly loaded kernel
> >> has moved with regard to the hibernated image, everything breaks.
> >> IIUC, this is similar why you can't rebuild your kernel and resume
> >> from a different version.
> >
> > x86-64 can resume from different kernel that did the suspend. kASLR
> > should not be too different from that. (You just include kernel text
> > in the hibernation image. It is small enough to do that.)
> 
> Oooh, that's very exciting! How does that work (what happens to the
> kernel that booted first, etc)? I assume physical memory layout can't
> change between hibernation and resume? Or, where should I be reading
> code that does this?

I guess it would help if you were a bit less sarcastic, but perhaps that's
just me.

Anyway, the core hibernation code actually works with page frames rather
than with virtual addresses.  Essentially, it creates a bitmap where each
page frame is represented by a single bit and the bits representing free
page frames are unset.  It then allocates as many new pages as there are
set bits in the bitmap and copies the entire contents of the page frames
represented by those bits to new pages it's just allocated. That covers
the entire kernel with its data and all process memory and is saved to
disk storage along with the PFNs of the page frames whose contents have
been copied.

During resume it simply restores the contents of the saved page frames
into those same page frames if they are available at that time.  For the
page frames that aren't free, it allocates memory to store their
contents temporarily and creates a list of PFNs where those contents should
be moved eventually.  Then, it quiesces all activity of the system and
jumps to arch-specific code that copies data from the temporary memory to
the target page frames (that generally overwrites the boot kernel, so there's
no way back from it).  Finally, it jumps to a specific address where the
hibernated kernel trampoline code should be present.

I think what fails with kASLR is that last step, because everything else
should be entirely agnostic to the way the virtual addresses are laid out.
I'm not sure how to fix that at the moment, but it should be fixable at
least on x86_64.

Rafael
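
The page-frame bookkeeping described above can be modelled in a few lines
of user-space C. This is a toy under stated assumptions, not the
hibernation core: all demo_* names, NR_FRAMES and FRAME_SIZE are
hypothetical, an array of flags stands in for the bitmap, and the image
itself doubles as the temporary storage for frames that are busy at
resume time.

```c
#include <stdbool.h>
#include <string.h>

#define NR_FRAMES  8	/* toy amount of physical memory */
#define FRAME_SIZE 16	/* toy page size in bytes */

struct demo_mem {
	unsigned char frame[NR_FRAMES][FRAME_SIZE];
	bool in_use[NR_FRAMES];		/* one flag per page frame */
};

struct demo_image {
	int nr;					/* number of saved frames */
	int pfn[NR_FRAMES];			/* where each copy came from */
	unsigned char data[NR_FRAMES][FRAME_SIZE];
};

/* Hibernate: copy every in-use frame and remember its PFN. */
void demo_snapshot(const struct demo_mem *mem, struct demo_image *img)
{
	img->nr = 0;
	for (int pfn = 0; pfn < NR_FRAMES; pfn++) {
		if (!mem->in_use[pfn])
			continue;
		img->pfn[img->nr] = pfn;
		memcpy(img->data[img->nr], mem->frame[pfn], FRAME_SIZE);
		img->nr++;
	}
}

/* Resume: returns how many frames had to wait for the final copy pass. */
int demo_restore(struct demo_mem *boot, const struct demo_image *img)
{
	int deferred[NR_FRAMES];	/* image slots whose target is busy */
	int nr_deferred = 0;

	/* Pass 1: fill target frames that the boot kernel is not using. */
	for (int i = 0; i < img->nr; i++) {
		int pfn = img->pfn[i];

		if (boot->in_use[pfn])
			deferred[nr_deferred++] = i;
		else
			memcpy(boot->frame[pfn], img->data[i], FRAME_SIZE);
	}

	/* Pass 2: overwrite busy frames; the boot kernel does not survive. */
	for (int d = 0; d < nr_deferred; d++) {
		int i = deferred[d];

		memcpy(boot->frame[img->pfn[i]], img->data[i], FRAME_SIZE);
	}

	/* Memory now reflects the image: only its frames count as in use. */
	memset(boot->in_use, 0, sizeof(boot->in_use));
	for (int i = 0; i < img->nr; i++)
		boot->in_use[img->pfn[i]] = true;

	return nr_deferred;
}
```

Nothing in the model refers to virtual addresses, which matches the point
that only the final jump into the restored kernel depends on where its
code is mapped.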


Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Kees Cook

On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov  wrote:
> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook  wrote:
>> This adds the new "seccomp" syscall with both an "operation" and "flags"
>> parameter for future expansion. The third argument is a pointer value,
>> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
>> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>>
>> Signed-off-by: Kees Cook 
>> Cc: linux-...@vger.kernel.org
>> ---
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index b0881a0ed322..1713977ee26f 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>  unsigned long idx1, unsigned long idx2);
>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>> +   const char __user *uargs);
>
> It looks odd to add 'flags' argument to syscall that is not even used.

FWIW, "flags" is given use in the next patch to support the tsync option.

-Kees

-- 
Kees Cook
Chrome OS Security
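
The argument handling described in the quoted changelog can be modelled in
userspace. What follows is only a sketch: the operation numbers and the name
seccomp_model() are made up for illustration, and the rule that strict mode
takes no argument block is an assumption, not something the quoted hunk shows.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operation numbers for this model only. */
enum { MODEL_SET_MODE_STRICT = 0, MODEL_SET_MODE_FILTER = 1 };

/*
 * Userspace model of the checks described in the changelog.
 * Returns 0 if the arguments would be accepted, -EINVAL otherwise.
 */
static long seccomp_model(unsigned int op, unsigned int flags,
                          const char *uargs)
{
    /* "Currently, flags must be 0." */
    if (flags != 0)
        return -EINVAL;

    switch (op) {
    case MODEL_SET_MODE_STRICT:
        /* Assumption: strict mode takes no argument block. */
        return uargs == NULL ? 0 : -EINVAL;
    case MODEL_SET_MODE_FILTER:
        /* The pointer is only meaningful for the filter operation. */
        return uargs != NULL ? 0 : -EINVAL;
    default:
        return -EINVAL;
    }
}
```

Rejecting non-zero flags today is what leaves room for a later patch to give
individual flag bits a meaning without breaking existing callers.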

[GIT PULL] GPIO fix for the v3.16 series

2014-06-13 Thread Linus Walleij

Hi Linus,

sending you this vital fix before leaving for a short vacation so it
does not sit collecting dust in my tree for no good reason.

Apart from this, our v3.16 cycle looks like a good start.

Please pull it in!

Yours,
Linus Walleij


The following changes since commit 963649d735c8b6eb0f97e82c54f02426ff3f1f45:

  Merge tag 'for-linus-3.16-merge-window' of
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs (2014-06-08
14:35:19 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git
tags/gpio-v3.16-2

for you to fetch changes up to 06fc3b70f1dc9c53070fa63a528830f54afc3c38:

  gpio: of: Fix handling for deferred probe for -gpio suffix
(2014-06-12 09:57:00 +0200)


A first GPIO fix for the v3.16 series, this was serious since
it blocks the OMAP boot.


Tony Lindgren (1):
  gpio: of: Fix handling for deferred probe for -gpio suffix

 drivers/gpio/gpiolib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Re: [PATCH 3/3][update] PM / sleep: Introduce command line argument for sleep state enumeration

2014-06-13 Thread Pavel Machek

Hi!

> > > > > From: Rafael J. Wysocki 
> > > > > 
> > > > > On some systems the platform supports neither
> > > > > PM_SUSPEND_MEM nor PM_SUSPEND_STANDBY, so PM_SUSPEND_FREEZE is the
> > > > > only available system sleep state.  However, some user space
> > > > > frameworks only use the "mem" and (sometimes) "standby" sleep state
> > > > > labels, so the users of those systems need to modify user space in
> > > > > order to be able to use system suspend at all, and that is not always
> > > > > possible.
> > > > 
> > > > I'd say we should fix the frameworks, not add option to change kernel
> > > > interfaces.
> > > > 
> > > > Because, as you mentioned, if we add this, we are probably going to
> > > > get stuck with it forever :-(.
> > > 
> > > Unfortunately, fixing the frameworks is rather less than realistic in any
> > > reasonable time frame, since  Android. :-)
> > 
> > Actually, you still have the sources from android, and this issue
> > sounds almost simple enough for binary patch.
> > 
> > Android misuses /proc/sys/vm/drop_caches, too, IIRC. Are we going to
> > change interface to match their expectations? They have binder and
> > wakelocks. Are we going to apply those patches just because Android
> > wants that?
> 
> That depends on which versions of Android you're talking about.  The
> newest ones use the power management interfaces we have upstream.

Ok, good, so they can fix their code.

What problem are you solving? Do you have some weird hardware where
suspend to memory is impossible? 

> > Android people usually patch their kernels, anyway, so why not add
> > this one, too?
> 
> I'm not talking about Android kernels, but about Android user space.

I know. Android userspace usually runs on a modified kernel, so you can
simply add your patch. But I don't think it's suitable for mainline.

> And this is not only about Android, other distros also have user space that
> uses "mem" only, because nobody has used anything else for a long time anyway.
> For the users of those distros, if they don't want to modify user space,
> having a kernel command line like this is actually helpful.

Yes, but it's still the wrong place to fix it...

> So I'm really not sure what's the problem?  Do you think it's wrong to be
> helpful to users or something?

It is not wrong to be helpful, but a messed-up interface is too big a
price.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [PATCH] ACPI/Battery: Retry to get Battery information if failed during probing

2014-06-13 Thread David Rientjes

On Fri, 13 Jun 2014, Lan Tianyu wrote:

> How about this?
> 
> -   result = acpi_battery_update(battery, false);
> -   if (result)
> +
> +   /*
> +* On some machines (e.g. Lenovo Z480) the EC is not stable
> +* during boot, so the battery driver sometimes fails to
> +* probe because it cannot get battery information from
> +* the EC. After several retries, the operation may work,
> +* so retry here and sleep 20ms between retries.
> +*/
> +   while (acpi_battery_update(battery, false) && retry--)
> +   msleep(20);
> +   if (!retry) {
> +   result = -ENODEV;
> goto fail;
> +   }
> +

I think you want --retry and not retry--.  Otherwise it's possible for the 
final call to acpi_battery_update() to succeed and now it's returning 
-ENODEV.
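
The difference between the two loop forms can be checked with a small
userspace model. This is a sketch only: fake_update() is a hypothetical
stand-in for acpi_battery_update() that fails a chosen number of times, the
msleep() is dropped, and -1 stands in for -ENODEV.

```c
/* Number of calls that fake_update() will still fail. */
static int failures_left;

/* Hypothetical stand-in for acpi_battery_update(): non-zero means failure. */
static int fake_update(void)
{
    if (failures_left > 0) {
        failures_left--;
        return -1;
    }
    return 0;
}

/* Post-decrement form, as in the proposed hunk. Returns 0 or -1. */
static int probe_post_decrement(int retry, int failures)
{
    failures_left = failures;
    while (fake_update() && retry--)
        ;                       /* msleep(20) in the real code */
    return !retry ? -1 : 0;
}

/* Pre-decrement form, as suggested in the reply. Returns 0 or -1. */
static int probe_pre_decrement(int retry, int failures)
{
    failures_left = failures;
    while (fake_update() && --retry)
        ;                       /* msleep(20) in the real code */
    return !retry ? -1 : 0;
}
```

With retry = 5, the post-decrement form reports an error when the sixth call
succeeds (retry is exactly 0 at that point), and reports success when every
call fails (retry ends at -1). The pre-decrement form leaves retry at 0 only
when all attempts have failed.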

Re: kmemleak: Unable to handle kernel paging request

2014-06-13 Thread Benjamin Herrenschmidt

On Fri, 2014-06-13 at 09:56 +0100, Catalin Marinas wrote:

> OK, so that's the DART table allocated via alloc_dart_table(). Is
> dart_tablebase removed from the kernel linear mapping after allocation?

Yes.

> If that's the case, we need to tell kmemleak to ignore this block (see
> patch below, untested). But I still can't explain how commit
> d4c54919ed863020 causes this issue.
> 
> (also cc'ing the powerpc list and maintainers)

We remove the DART from the linear mapping because it has to be mapped
non-cachable and having it in the linear mapping would cause cache
paradoxes. We also can't just change the caching attributes in the
linear mapping because we use 16M pages for it and 970 CPUs don't
support cache-inhibited 16M pages :-( And due to the MMU segmentation
model, we also can't mix & match page sizes in that area.

So we just unmap it, and ioremap it elsewhere.

Cheers,
Ben.

> ---8<--
> 
> From 09a7f1c97166c7bdca7ca4e8a4ff2774f3706ea3 Mon Sep 17 00:00:00 2001
> From: Catalin Marinas 
> Date: Fri, 13 Jun 2014 09:44:21 +0100
> Subject: [PATCH] powerpc/kmemleak: Do not scan the DART table
> 
> The DART table allocation is registered to kmemleak via the
> memblock_alloc_base() call. However, the DART table is later unmapped
> and dart_tablebase VA no longer accessible. This patch tells kmemleak
> not to scan this block and avoid an unhandled paging request.
> 
> Signed-off-by: Catalin Marinas 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> ---
>  arch/powerpc/sysdev/dart_iommu.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/dart_iommu.c 
> b/arch/powerpc/sysdev/dart_iommu.c
> index 62c47bb76517..9e5353ff6d1b 100644
> --- a/arch/powerpc/sysdev/dart_iommu.c
> +++ b/arch/powerpc/sysdev/dart_iommu.c
> @@ -476,6 +476,11 @@ void __init alloc_dart_table(void)
>*/
>   dart_tablebase = (unsigned long)
>   __va(memblock_alloc_base(1UL<<24, 1UL<<24, 0x8000L));
> + /*
> +  * The DART space is later unmapped from the kernel linear mapping and
> +  * accessing dart_tablebase during kmemleak scanning will fault.
> +  */
> + kmemleak_no_scan((void *)dart_tablebase);
>  
>   printk(KERN_INFO "DART table allocated at: %lx\n", dart_tablebase);
>  }



Re: [PATCH v2 2/2] Add support for Compact (Bluetooth|USB) keyboard with Trackpoint

2014-06-13 Thread Antonio Ospite

On Thu, 12 Jun 2014 09:56:41 +0100 (BST)
Jamie Lentin  wrote:

> On Wed, 11 Jun 2014, Antonio Ospite wrote:
> 

[...]

> >> +static int tpcompactkbd_input_mapping(struct hid_device *hdev,
> >
> > Maybe name these functions like tpkbd_input_mapping_compact()?
> >
> > This way the namespace is more consistent, and follows the logic of
> > patch 1/2 more closely.
> >
> > Use this scheme at least for functions which have a _tp() counterpart.
> 
> Previously the tpkbd driver had various functions marked "_tp" to indicate 
> that it's for the "mouse" half of the keyboard as the kernel sees it, 
> however it does nothing special with the keyboard half. I was intending 
> (somewhat sloppily) to repurpose this into having versions of each 
> function for each keyboard, and a common function to switch between them. 
> Should make it fairly easy to add extra keyboards in the future.
> 
> The problem, as ever, is choosing decent names for them. It should 
> probably be either:-
> 
> * tpkbd_input_mapping_usbkbd
> * tpkbd_input_mapping_compactkbd
> ...and tpkbd_input_mapping switches between them
> 
> or rename the driver to hid-lenovo and do:-
> 

I am OK with a rename. Most files in drivers/hid are per-vendor after
all. Jiri?

> * lenovo_input_mapping_tpkbd
> * lenovo_input_mapping_compacttp

I think you meant lenovo_input_mapping_compactkbd ? :)

> ...and lenovo_input_mapping switches between them
> 
> The latter seems a bit too invasive, but with the former I'm not sure how
> obvious it would be that the "Compact USB keyboard" is in fact a
> compactkbd, not a usbkbd. The former is probably what I'll go for unless
> you have any thoughts.
> 
[...]

Ciao,
   Antonio

-- 
Antonio Ospite
http://ao2.it

A: Because it messes up the order in which people normally read text.
   See http://en.wikipedia.org/wiki/Posting_style
Q: Why is top-posting such a bad thing?
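
The "one version per keyboard plus a common function that switches between
them" layout discussed above could look roughly like this. It is a simplified
sketch using the first naming scheme from the thread: enum tpkbd_kind, the
integer "usage" argument and the placeholder return values are all
hypothetical, and the real HID input_mapping callback has a different
signature.

```c
/* Hypothetical keyboard kinds; a real driver would look at product IDs. */
enum tpkbd_kind { TPKBD_USBKBD, TPKBD_COMPACTKBD };

/* Per-keyboard versions. The offsets are placeholders for real mappings. */
static int tpkbd_input_mapping_usbkbd(int usage)
{
    return usage + 100;
}

static int tpkbd_input_mapping_compactkbd(int usage)
{
    return usage + 200;
}

/* Common entry point that switches between the per-keyboard versions. */
static int tpkbd_input_mapping(enum tpkbd_kind kind, int usage)
{
    switch (kind) {
    case TPKBD_USBKBD:
        return tpkbd_input_mapping_usbkbd(usage);
    case TPKBD_COMPACTKBD:
        return tpkbd_input_mapping_compactkbd(usage);
    }
    return 0; /* unknown keyboard: leave the usage unmapped */
}
```

Adding another keyboard then means one more enum value, one more
per-keyboard function and one more case in the switch.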

Re: [PATCH] platform/x86/toshiba-apci.c possible bad if test?

2014-06-13 Thread David Rientjes

On Fri, 13 Jun 2014, Azael Avalos wrote:

> Intel test builder caught some warnings, one at the
> KBD backlight mode store while validating for
> correct parameters, and another one that might lead
> to not creating the sysfs group
> 
> Signed-off-by: Azael Avalos 
> ---
>  drivers/platform/x86/toshiba_acpi.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/platform/x86/toshiba_acpi.c
> b/drivers/platform/x86/toshiba_acpi.c
> index fbbe46d..f397594 100644
> --- a/drivers/platform/x86/toshiba_acpi.c
> +++ b/drivers/platform/x86/toshiba_acpi.c
> @@ -1218,7 +1218,7 @@ static ssize_t toshiba_kbd_bl_mode_store(struct
> device *dev,
> int mode = -1;
> int time = -1;
> 
> -   if (sscanf(buf, "%i", &mode) != 1 && (mode != 2 || mode != 1))
> +   if (sscanf(buf, "%i", &mode) != 1 || mode > 2 || mode < 1)
> return -EINVAL;
> 
> /* Set the Keyboard Backlight Mode where:
> @@ -1741,7 +1741,7 @@ static int toshiba_acpi_add(struct acpi_device 
> *acpi_dev)
> 
> ret = sysfs_create_group(&dev->acpi_dev->dev.kobj,
>  &toshiba_attr_group);
> -   if (ret) {
> +   if (ret != 0) {
> dev->sysfs_created = 0;
> goto error;
> }

It may not have been picked up because you've combined unrelated (and 
unnecessary) changes such as your change to toshiba_acpi_add().
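
For reference, the effect of the first hunk can be reproduced in userspace.
This is a sketch with hypothetical function names; it uses the C library
sscanf() rather than the kernel one, and only the two conditions are taken
from the quoted diff.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Condition before the patch: (mode != 2 || mode != 1) is true for every
 * value of mode, so in effect only the sscanf() result is checked.
 */
static int check_mode_old(const char *buf)
{
    int mode = -1;

    if (sscanf(buf, "%i", &mode) != 1 && (mode != 2 || mode != 1))
        return -EINVAL;
    return 0;
}

/* Condition after the patch: reject parse failures and out-of-range modes. */
static int check_mode_new(const char *buf)
{
    int mode = -1;

    if (sscanf(buf, "%i", &mode) != 1 || mode > 2 || mode < 1)
        return -EINVAL;
    return 0;
}
```

An input such as "3" parses successfully, so the old condition accepts it,
while the new one returns -EINVAL.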

Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Andy Lutomirski

On Fri, Jun 13, 2014 at 2:37 PM, Alexei Starovoitov  wrote:
> On Fri, Jun 13, 2014 at 2:25 PM, Andy Lutomirski  wrote:
>> On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov  
>> wrote:
>>> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook  wrote:
>>>> This adds the new "seccomp" syscall with both an "operation" and "flags"
>>>> parameter for future expansion. The third argument is a pointer value,
>>>> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
>>>> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>>>>
>>>> Signed-off-by: Kees Cook 
>>>> Cc: linux-...@vger.kernel.org
>>>> ---
>>>>  arch/x86/syscalls/syscall_32.tbl  |1 +
>>>>  arch/x86/syscalls/syscall_64.tbl  |1 +
>>>>  include/linux/syscalls.h  |2 ++
>>>>  include/uapi/asm-generic/unistd.h |4 ++-
>>>>  include/uapi/linux/seccomp.h  |4 +++
>>>>  kernel/seccomp.c  |   63 
>>>> -
>>>>  kernel/sys_ni.c   |3 ++
>>>>  7 files changed, 69 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/arch/x86/syscalls/syscall_32.tbl 
>>>> b/arch/x86/syscalls/syscall_32.tbl
>>>> index d6b867921612..7527eac24122 100644
>>>> --- a/arch/x86/syscalls/syscall_32.tbl
>>>> +++ b/arch/x86/syscalls/syscall_32.tbl
>>>> @@ -360,3 +360,4 @@
>>>>  351     i386    sched_setattr           sys_sched_setattr
>>>>  352     i386    sched_getattr           sys_sched_getattr
>>>>  353     i386    renameat2               sys_renameat2
>>>> +354     i386    seccomp                 sys_seccomp
>>>> diff --git a/arch/x86/syscalls/syscall_64.tbl 
>>>> b/arch/x86/syscalls/syscall_64.tbl
>>>> index ec255a1646d2..16272a6c12b7 100644
>>>> --- a/arch/x86/syscalls/syscall_64.tbl
>>>> +++ b/arch/x86/syscalls/syscall_64.tbl
>>>> @@ -323,6 +323,7 @@
>>>>  314     common  sched_setattr           sys_sched_setattr
>>>>  315     common  sched_getattr           sys_sched_getattr
>>>>  316     common  renameat2               sys_renameat2
>>>> +317     common  seccomp                 sys_seccomp
>>>>
>>>>  #
>>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>>> index b0881a0ed322..1713977ee26f 100644
>>>> --- a/include/linux/syscalls.h
>>>> +++ b/include/linux/syscalls.h
>>>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>>>  unsigned long idx1, unsigned long idx2);
>>>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int 
>>>> flags);
>>>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>>>> +   const char __user *uargs);
>>>
>>> It looks odd to add 'flags' argument to syscall that is not even used.
>>> I don't think it will be extensible this way.
>>> 'uargs' is used only in 2nd command as well and it's not 'char __user *'
>>> but rather 'struct sock_fprog __user *'
>>> I think it makes more sense to define only first argument as 'int op' and 
>>> the
>>> rest as variable length array.
>>> Something like:
>>> long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
>>> then different commands can interpret 'attrs' differently.
>>> if op == mode_strict, then attrs == NULL, len == 0
>>> if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
>>> and nla_data(attrs) is 'struct sock_fprog'
>>
>> Eww.  If the operation doesn't imply the type, then I think we've
>> totally screwed up.
>>
>>> If we decide to add new types of filters or new commands, the syscall 
>>> prototype
>>> won't need to change. New commands can be added preserving backward
>>> compatibility.
>>> The basic TLV concept has been around forever in netlink world. imo makes
>>> sense to use it with new syscalls. Passing 'struct xxx' into syscalls
>>> is the thing
>>> of the past. TLV style is more extensible. Fields of structures can become
>>> optional in the future, new fields added, etc.
>>> 'struct nlattr' brings the same benefits to kernel api as protobuf did
>>> to user land.
>>
>> I see no reason to bring nl_attr into this.
>>
>> Admittedly, I've never dealt with nl_attr, but everything
>> netlink-related I've ever been involved in has involved some sort of
>> API atrocity.
>
> netlink has a lot of legacy and there is genetlink which is not pretty
> either because of extra socket creation, binding, dealing with packet
> loss issues, but the key concept of variable length encoding is sound.
> Right now seccomp has two commands and they already don't fit
> into a single syscall neatly. Are you saying there should be two syscalls
> here? What about another seccomp related command? Another syscall?
> imo all seccomp related commands need to be mux/demux-ed under
> one syscall. What is the way to mux/demux potentially very different
> commands under one syscall? I cannot think of anything better than
> TLV style. 'struct

Re: [PATCH v3] printk: allow increasing the ring buffer depending on the number of CPUs

2014-06-13 Thread Davidlohr Bueso

On Fri, 2014-06-13 at 11:28 -0700, Luis R. Rodriguez wrote:
> + /*
> +  * If you set log_buf_len=n kernel parameter LOG_CPU_MIN_BUF_SHIFT will
> +  * be ignored. LOG_CPU_MIN_BUF_SHIFT is a proactive measure for large
> +  * systems. With a LOG_BUF_SHIFT of 18 and LOG_CPU_MIN_BUF_SHIFT 12 at
> +  * we'd require more than 64 CPUs to trigger an increase from the
> +  * default.
> +  */
> + if (!new_log_buf_len && (cpu_extra > __LOG_BUF_LEN / 2))
 ^ that ! looks wrong. We should be checking for log_buf_len set 
instead.

> + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;

You could also move the whole thing below the return statement, that way
we can avoid double checking new_log_buf_len. Otherwise looks kinda
weird.
>  
>   if (!new_log_buf_len)
>   return;

Thanks,
Davidlohr


Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Alexei Starovoitov

On Fri, Jun 13, 2014 at 2:25 PM, Andy Lutomirski  wrote:
> On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov  wrote:
>> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook  wrote:
>>> This adds the new "seccomp" syscall with both an "operation" and "flags"
>>> parameter for future expansion. The third argument is a pointer value,
>>> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
>>> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>>>
>>> Signed-off-by: Kees Cook 
>>> Cc: linux-...@vger.kernel.org
>>> ---
>>>  arch/x86/syscalls/syscall_32.tbl  |1 +
>>>  arch/x86/syscalls/syscall_64.tbl  |1 +
>>>  include/linux/syscalls.h  |2 ++
>>>  include/uapi/asm-generic/unistd.h |4 ++-
>>>  include/uapi/linux/seccomp.h  |4 +++
>>>  kernel/seccomp.c  |   63 
>>> -
>>>  kernel/sys_ni.c   |3 ++
>>>  7 files changed, 69 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/x86/syscalls/syscall_32.tbl 
>>> b/arch/x86/syscalls/syscall_32.tbl
>>> index d6b867921612..7527eac24122 100644
>>> --- a/arch/x86/syscalls/syscall_32.tbl
>>> +++ b/arch/x86/syscalls/syscall_32.tbl
>>> @@ -360,3 +360,4 @@
>>>  351     i386    sched_setattr           sys_sched_setattr
>>>  352     i386    sched_getattr           sys_sched_getattr
>>>  353     i386    renameat2               sys_renameat2
>>> +354     i386    seccomp                 sys_seccomp
>>> diff --git a/arch/x86/syscalls/syscall_64.tbl 
>>> b/arch/x86/syscalls/syscall_64.tbl
>>> index ec255a1646d2..16272a6c12b7 100644
>>> --- a/arch/x86/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/syscalls/syscall_64.tbl
>>> @@ -323,6 +323,7 @@
>>>  314     common  sched_setattr           sys_sched_setattr
>>>  315     common  sched_getattr           sys_sched_getattr
>>>  316     common  renameat2               sys_renameat2
>>> +317     common  seccomp                 sys_seccomp
>>>
>>>  #
>>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>> index b0881a0ed322..1713977ee26f 100644
>>> --- a/include/linux/syscalls.h
>>> +++ b/include/linux/syscalls.h
>>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>>  unsigned long idx1, unsigned long idx2);
>>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int 
>>> flags);
>>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>>> +   const char __user *uargs);
>>
>> It looks odd to add 'flags' argument to syscall that is not even used.
>> I don't think it will be extensible this way.
>> 'uargs' is used only in 2nd command as well and it's not 'char __user *'
>> but rather 'struct sock_fprog __user *'
>> I think it makes more sense to define only first argument as 'int op' and the
>> rest as variable length array.
>> Something like:
>> long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
>> then different commands can interpret 'attrs' differently.
>> if op == mode_strict, then attrs == NULL, len == 0
>> if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
>> and nla_data(attrs) is 'struct sock_fprog'
>
> Eww.  If the operation doesn't imply the type, then I think we've
> totally screwed up.
>
>> If we decide to add new types of filters or new commands, the syscall 
>> prototype
>> won't need to change. New commands can be added preserving backward
>> compatibility.
>> The basic TLV concept has been around forever in netlink world. imo makes
>> sense to use it with new syscalls. Passing 'struct xxx' into syscalls
>> is the thing
>> of the past. TLV style is more extensible. Fields of structures can become
>> optional in the future, new fields added, etc.
>> 'struct nlattr' brings the same benefits to kernel api as protobuf did
>> to user land.
>
> I see no reason to bring nl_attr into this.
>
> Admittedly, I've never dealt with nl_attr, but everything
> netlink-related I've ever been involved in has involved some sort of
> API atrocity.

netlink has a lot of legacy and there is genetlink which is not pretty
either because of extra socket creation, binding, dealing with packet
loss issues, but the key concept of variable length encoding is sound.
Right now seccomp has two commands and they already don't fit
into a single syscall neatly. Are you saying there should be two syscalls
here? What about another seccomp related command? Another syscall?
imo all seccomp related commands need to be mux/demux-ed under
one syscall. What is the way to mux/demux potentially very different
commands under one syscall? I cannot think of anything better than
TLV style. 'struct nlattr' is what we have today and I think it works fine.
I'm not suggesting to bring the whole netlink into the picture, but rather
TLV style of encoding different arguments for

Re: [PATCH v7] NVMe: conversion to blk-mq

2014-06-13 Thread Jens Axboe

On 06/13/2014 01:22 PM, Keith Busch wrote:
> One performance oddity we observe is that servicing the interrupt on the
> thread sibling of the core that submitted the I/O is the worst performing
> cpu you can chose; it's actually better to use a different core on the
> same node. At least that's true as long as you're not utilizing the cpus
> for other work, so YMMV.

This doesn't match what I see here. Just ran some test cases - both
sync, and higher QD. For sync performance, core or thread sibling is the
best choice, other CPUs next. That is pretty logical.

For a more loaded run, thread sibling ends up being a better choice than
core, since core runs out of steam (255K vs 275K here). And thread
sibling is still a marginally better choice than some other core on the
same node.

Which pretty much matches my expectations of what the best mappings
would be.


Re: [PATCH v3] printk: allow increasing the ring buffer depending on the number of CPUs

2014-06-13 Thread Davidlohr Bueso

On Fri, 2014-06-13 at 13:44 -0700, Luis R. Rodriguez wrote:
> On Fri, Jun 13, 2014 at 1:36 PM, Davidlohr Bueso  wrote:
> > On Fri, 2014-06-13 at 11:28 -0700, Luis R. Rodriguez wrote:
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index 7228258..3f3356b 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -246,6 +246,7 @@ static u32 clear_idx;
> >>  #define LOG_ALIGN __alignof__(struct printk_log)
> >>  #endif
> >>  #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> >> +#define __LOG_CPU_MIN_BUF_LEN (1 << CONFIG_LOG_CPU_MIN_BUF_SHIFT)
> >>  static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> >>  static char *log_buf = __log_buf;
> >>  static u32 log_buf_len = __LOG_BUF_LEN;
> >> @@ -752,6 +753,17 @@ void __init setup_log_buf(int early)
> >>   unsigned long flags;
> >>   char *new_log_buf;
> >>   int free;
> >> + int cpu_extra = (num_possible_cpus() - 1) *  __LOG_CPU_MIN_BUF_LEN;
> >
> > I think you forgot to drop the - 1 here.
> 
> Actually that was on purpose as I had noticed in your suggestion on
> the other thread to fix the minimum number of CPUs required for this
> patch to be > 64 rather than >= 64 (with the defaults of 18 for log
> shift, and 12 for min cpu log shift). Minor difference, but it was
> intentional. Even though we do require SMP now the -1 can take effect,
> and it seems safer for now to only require this for > 64 CPU
> configurations for the defaults set. I'm happy to remove that as well,
> but bumping the CPU configuration mark for this to > 64 might be
> better for starters unless we do know for sure we also need this for
> 64.

Ah, ok I had missed that. Makes sense now.

Thanks,
Davidlohr
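
The CPU-count threshold implied by the quoted lines can be worked out with a
small helper. This is a sketch under stated assumptions: the function name is
hypothetical, both shifts are plain parameters standing in for
CONFIG_LOG_BUF_SHIFT and CONFIG_LOG_CPU_MIN_BUF_SHIFT, and the condition is
taken from the hunks as quoted in this thread (cpu_extra computed with the
"- 1", compared against __LOG_BUF_LEN / 2).

```c
/*
 * Smallest number of possible CPUs for which
 *   (cpus - 1) * (1 << cpu_min_shift) > (1 << log_buf_shift) / 2
 * holds, i.e. for which the quoted check would grow the buffer.
 */
static unsigned int min_cpus_for_increase(unsigned int log_buf_shift,
                                          unsigned int cpu_min_shift)
{
    unsigned long log_buf_len = 1UL << log_buf_shift;
    unsigned long per_cpu_len = 1UL << cpu_min_shift;
    unsigned int cpus = 1;

    /* Keep adding CPUs until the extra space exceeds half the buffer. */
    while ((cpus - 1) * per_cpu_len <= log_buf_len / 2)
        cpus++;
    return cpus;
}
```

For shifts of 18 and 12 this helper returns 34, which is lower than the 64
mentioned in the thread, so that figure may refer to a different form of the
check. The sketch only does the arithmetic on the lines as quoted here.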


Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Andy Lutomirski

On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov  wrote:
> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook  wrote:
>> This adds the new "seccomp" syscall with both an "operation" and "flags"
>> parameter for future expansion. The third argument is a pointer value,
>> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
>> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>>
>> Signed-off-by: Kees Cook 
>> Cc: linux-...@vger.kernel.org
>> ---
>>  arch/x86/syscalls/syscall_32.tbl  |1 +
>>  arch/x86/syscalls/syscall_64.tbl  |1 +
>>  include/linux/syscalls.h  |2 ++
>>  include/uapi/asm-generic/unistd.h |4 ++-
>>  include/uapi/linux/seccomp.h  |4 +++
>>  kernel/seccomp.c  |   63 
>> -
>>  kernel/sys_ni.c   |3 ++
>>  7 files changed, 69 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/syscalls/syscall_32.tbl 
>> b/arch/x86/syscalls/syscall_32.tbl
>> index d6b867921612..7527eac24122 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -360,3 +360,4 @@
>>  351     i386    sched_setattr           sys_sched_setattr
>>  352     i386    sched_getattr           sys_sched_getattr
>>  353     i386    renameat2               sys_renameat2
>> +354     i386    seccomp                 sys_seccomp
>> diff --git a/arch/x86/syscalls/syscall_64.tbl 
>> b/arch/x86/syscalls/syscall_64.tbl
>> index ec255a1646d2..16272a6c12b7 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -323,6 +323,7 @@
>>  314     common  sched_setattr           sys_sched_setattr
>>  315     common  sched_getattr           sys_sched_getattr
>>  316     common  renameat2               sys_renameat2
>> +317     common  seccomp                 sys_seccomp
>>
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index b0881a0ed322..1713977ee26f 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>  unsigned long idx1, unsigned long idx2);
>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int 
>> flags);
>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>> +   const char __user *uargs);
>
> It looks odd to add 'flags' argument to syscall that is not even used.
> I don't think it will be extensible this way.
> 'uargs' is used only in 2nd command as well and it's not 'char __user *'
> but rather 'struct sock_fprog __user *'
> I think it makes more sense to define only first argument as 'int op' and the
> rest as variable length array.
> Something like:
> long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
> then different commands can interpret 'attrs' differently.
> if op == mode_strict, then attrs == NULL, len == 0
> if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
> and nla_data(attrs) is 'struct sock_fprog'

Eww.  If the operation doesn't imply the type, then I think we've
totally screwed up.

> If we decide to add new types of filters or new commands, the syscall 
> prototype
> won't need to change. New commands can be added preserving backward
> compatibility.
> The basic TLV concept has been around forever in netlink world. imo makes
> sense to use it with new syscalls. Passing 'struct xxx' into syscalls
> is the thing
> of the past. TLV style is more extensible. Fields of structures can become
> optional in the future, new fields added, etc.
> 'struct nlattr' brings the same benefits to kernel api as protobuf did
> to user land.

I see no reason to bring nl_attr into this.

Admittedly, I've never dealt with nl_attr, but everything
netlink-related I've ever been involved in has involved some sort of
API atrocity.

--Andy
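
For comparison, the TLV-style interface proposed in the quoted text can be
modelled in userspace. This is a sketch, not the proposal itself: struct
tlv_attr is a hypothetical header loosely modelled on the idea behind
struct nlattr, and the operation and attribute numbers are invented for
illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical TLV header: total length (header + payload), then a type. */
struct tlv_attr {
    uint16_t len;
    uint16_t type;
};

/* Hypothetical operation and attribute numbers. */
enum { OP_MODE_STRICT = 0, OP_MODE_FILTER = 1 };
enum { ATTR_BPF_FILTER = 1 };

/* Returns 0 if (op, attrs, len) is a consistent request, else -EINVAL. */
static long tlv_seccomp_model(unsigned int op, const struct tlv_attr *attrs,
                              int len)
{
    switch (op) {
    case OP_MODE_STRICT:
        /* "if op == mode_strict, then attrs == NULL, len == 0" */
        return (attrs == NULL && len == 0) ? 0 : -EINVAL;
    case OP_MODE_FILTER:
        /* The buffer must at least hold one attribute header. */
        if (attrs == NULL || len < (int)sizeof(*attrs))
            return -EINVAL;
        /* The attribute's own length must fit inside the buffer. */
        if (attrs->len < sizeof(*attrs) || attrs->len > len)
            return -EINVAL;
        /* The type tag, not the operation alone, selects the payload. */
        if (attrs->type != ATTR_BPF_FILTER)
            return -EINVAL;
        return 0;
    default:
        return -EINVAL;
    }
}
```

The extra type check in the filter case is the point the reply objects to:
with this layout the operation alone no longer determines the payload type.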

[PATCH 2/3] aio: report error from io_destroy() when threads race in io_destroy()

2014-06-13 Thread Benjamin LaHaise

As reported by Anatol Pomozov, io_destroy() fails to report an error when
it loses the race to destroy a given ioctx.  Since there is a difference in
behaviour between the thread that wins the race (which blocks on outstanding
io requests) versus the thread that loses (which returns immediately), wire
up a return code from kill_ioctx() to the io_destroy() syscall.

Signed-off-by: Benjamin LaHaise 
Cc: Anatol Pomozov 
---
 fs/aio.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 908006e..044c1c8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -727,7 +727,7 @@ err:
  * when the processes owning a context have all exited to encourage
  * the rapid destruction of the kioctx.
  */
-static void kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
+static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
struct completion *requests_done)
 {
if (!atomic_xchg(&ctx->dead, 1)) {
@@ -759,10 +759,10 @@ static void kill_ioctx(struct mm_struct *mm, struct 
kioctx *ctx,
 
ctx->requests_done = requests_done;
percpu_ref_kill(&ctx->users);
-   } else {
-   if (requests_done)
-   complete(requests_done);
+   return 0;
}
+
+   return -EINVAL;
 }
 
 /* wait_on_sync_kiocb:
@@ -1219,21 +1219,23 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
if (likely(NULL != ioctx)) {
struct completion requests_done =
COMPLETION_INITIALIZER_ONSTACK(requests_done);
+   int ret;
 
/* Pass requests_done to kill_ioctx() where it can be set
 * in a thread-safe way. If we try to set it here then we have
 * a race condition if two io_destroy() called simultaneously.
 */
-   kill_ioctx(current->mm, ioctx, &requests_done);
+   ret = kill_ioctx(current->mm, ioctx, &requests_done);
percpu_ref_put(&ioctx->users);
 
/* Wait until all IO for the context are done. Otherwise kernel
 * keep using user-space buffers even if user thinks the context
 * is destroyed.
 */
-   wait_for_completion(&requests_done);
+   if (!ret)
+   wait_for_completion(&requests_done);
 
-   return 0;
+   return ret;
}
pr_debug("EINVAL: io_destroy: invalid context id\n");
return -EINVAL;
-- 
1.8.2.1
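
The winner/loser split described in the commit message above can be modelled
in userspace. The following is only a minimal sketch, assuming C11 atomics in
place of the kernel's atomic_xchg(); struct demo_ctx, demo_kill_ctx() and
demo_destroy() are hypothetical names for illustration and are not part of
fs/aio.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kioctx; only the "dead" flag is modelled. */
struct demo_ctx {
	atomic_int dead;
};

/*
 * Returns 0 for the caller that flips "dead" from 0 to 1 (the winner) and
 * -EINVAL for every later caller (the losers).
 */
static int demo_kill_ctx(struct demo_ctx *ctx)
{
	assert(ctx != NULL);

	/* atomic_exchange() returns the previous value of the flag. */
	if (atomic_exchange(&ctx->dead, 1))
		return -EINVAL;

	/* The winner would tear the context down here. */
	return 0;
}

/* Models io_destroy(): only the winner waits for outstanding requests. */
static int demo_destroy(struct demo_ctx *ctx, int *waited)
{
	int ret = demo_kill_ctx(ctx);

	*waited = (ret == 0);
	return ret;
}
```

Calling demo_destroy() twice on the same context returns 0 the first time and
-EINVAL the second time, which mirrors the return code this patch wires up to
the syscall.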


-- 
"Thought is the essence of where you are now."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/3] aio: cleanup: flatten kill_ioctx()

2014-06-13 Thread Benjamin LaHaise

There is no need to have most of the code in kill_ioctx() indented.  Flatten
it.

Signed-off-by: Benjamin LaHaise 
---
 fs/aio.c | 52 ++--
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 044c1c8..79b7e69 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -730,39 +730,39 @@ err:
 static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
struct completion *requests_done)
 {
-   if (!atomic_xchg(&ctx->dead, 1)) {
-   struct kioctx_table *table;
+   struct kioctx_table *table;
 
-   spin_lock(&mm->ioctx_lock);
-   rcu_read_lock();
-   table = rcu_dereference(mm->ioctx_table);
+   if (atomic_xchg(&ctx->dead, 1))
+   return -EINVAL;
 
-   WARN_ON(ctx != table->table[ctx->id]);
-   table->table[ctx->id] = NULL;
-   rcu_read_unlock();
-   spin_unlock(&mm->ioctx_lock);
 
-   /* percpu_ref_kill() will do the necessary call_rcu() */
-   wake_up_all(&ctx->wait);
+   spin_lock(&mm->ioctx_lock);
+   rcu_read_lock();
+   table = rcu_dereference(mm->ioctx_table);
+
+   WARN_ON(ctx != table->table[ctx->id]);
+   table->table[ctx->id] = NULL;
+   rcu_read_unlock();
+   spin_unlock(&mm->ioctx_lock);
 
-   /*
-* It'd be more correct to do this in free_ioctx(), after all
-* the outstanding kiocbs have finished - but by then io_destroy
-* has already returned, so io_setup() could potentially return
-* -EAGAIN with no ioctxs actually in use (as far as userspace
-*  could tell).
-*/
-   aio_nr_sub(ctx->max_reqs);
+   /* percpu_ref_kill() will do the necessary call_rcu() */
+   wake_up_all(&ctx->wait);
 
-   if (ctx->mmap_size)
-   vm_munmap(ctx->mmap_base, ctx->mmap_size);
+   /*
+* It'd be more correct to do this in free_ioctx(), after all
+* the outstanding kiocbs have finished - but by then io_destroy
+* has already returned, so io_setup() could potentially return
+* -EAGAIN with no ioctxs actually in use (as far as userspace
+*  could tell).
+*/
+   aio_nr_sub(ctx->max_reqs);
 
-   ctx->requests_done = requests_done;
-   percpu_ref_kill(&ctx->users);
-   return 0;
-   }
+   if (ctx->mmap_size)
+   vm_munmap(ctx->mmap_base, ctx->mmap_size);
 
-   return -EINVAL;
+   ctx->requests_done = requests_done;
+   percpu_ref_kill(&ctx->users);
+   return 0;
 }
 
 /* wait_on_sync_kiocb:
-- 
1.8.2.1


-- 
"Thought is the essence of where you are now."

[PATCH 0/3] * SUBJECT HERE *

2014-06-13 Thread Benjamin LaHaise

Hello Linus,

Please pull the following 3 changes from git://git.kvack.org/~bcrl/aio-next.git
.  They consist of a couple of code cleanups plus a minor bug fix.  Thanks!

-ben

Benjamin LaHaise (2):
  aio: report error from io_destroy() when threads race in io_destroy()
  aio: cleanup: flatten kill_ioctx()

Fabian Frederick (1):
  fs/aio.c: Remove ctx parameter in kiocb_cancel

 fs/aio.c | 70 +---
 1 file changed, 36 insertions(+), 34 deletions(-)

-- 
1.8.2.1


-- 
"Thought is the essence of where you are now."

[PATCH 1/3] fs/aio.c: Remove ctx parameter in kiocb_cancel

2014-06-13 Thread Fabian Frederick

ctx is no longer used in kiocb_cancel since

57282d8fd74407 ("aio: Kill ki_users")

Cc: Alexander Viro 
Cc: Andrew Morton 
Signed-off-by: Fabian Frederick 
Signed-off-by: Benjamin LaHaise 
---
 fs/aio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 2adbb03..908006e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -477,7 +477,7 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
 }
 EXPORT_SYMBOL(kiocb_set_cancel_fn);
 
-static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
+static int kiocb_cancel(struct kiocb *kiocb)
 {
kiocb_cancel_fn *old, *cancel;
 
@@ -538,7 +538,7 @@ static void free_ioctx_users(struct percpu_ref *ref)
   struct kiocb, ki_list);
 
list_del_init(&req->ki_list);
-   kiocb_cancel(ctx, req);
+   kiocb_cancel(req);
}
 
spin_unlock_irq(&ctx->ctx_lock);
@@ -1587,7 +1587,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 
kiocb = lookup_kiocb(ctx, iocb, key);
if (kiocb)
-   ret = kiocb_cancel(ctx, kiocb);
+   ret = kiocb_cancel(kiocb);
else
ret = -EINVAL;
 
-- 
1.8.2.1


-- 
"Thought is the essence of where you are now."

Re: [PATCH] i2c-taos-evm: Use module_serio_driver()

2014-06-13 Thread Jean Delvare

On Fri, 13 Jun 2014 21:52:24 +0200, Christoph Jaeger wrote:
> Get rid of some boilerplate code by using module_serio_driver().
> 
> Signed-off-by: Christoph Jaeger 
> ---
>  drivers/i2c/busses/i2c-taos-evm.c | 13 +
>  1 file changed, 1 insertion(+), 12 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-taos-evm.c 
> b/drivers/i2c/busses/i2c-taos-evm.c
> index 0576026..10855a0 100644
> --- a/drivers/i2c/busses/i2c-taos-evm.c
> +++ b/drivers/i2c/busses/i2c-taos-evm.c
> @@ -311,19 +311,8 @@ static struct serio_driver taos_drv = {
>   .interrupt  = taos_interrupt,
>  };
>  
> -static int __init taos_init(void)
> -{
> - return serio_register_driver(&taos_drv);
> -}
> -
> -static void __exit taos_exit(void)
> -{
> - serio_unregister_driver(&taos_drv);
> -}
> +module_serio_driver(taos_drv);
>  
>  MODULE_AUTHOR("Jean Delvare ");
>  MODULE_DESCRIPTION("TAOS evaluation module driver");
>  MODULE_LICENSE("GPL");
> -
> -module_init(taos_init);
> -module_exit(taos_exit);

Reviewed-by: Jean Delvare 

Thanks,
-- 
Jean Delvare
SUSE L3 Support

Re: [PATCH v6 6/9] seccomp: add "seccomp" syscall

2014-06-13 Thread Alexei Starovoitov

On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook  wrote:
> This adds the new "seccomp" syscall with both an "operation" and "flags"
> parameter for future expansion. The third argument is a pointer value,
> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
>
> Signed-off-by: Kees Cook 
> Cc: linux-...@vger.kernel.org
> ---
>  arch/x86/syscalls/syscall_32.tbl  |1 +
>  arch/x86/syscalls/syscall_64.tbl  |1 +
>  include/linux/syscalls.h  |2 ++
>  include/uapi/asm-generic/unistd.h |4 ++-
>  include/uapi/linux/seccomp.h  |4 +++
>  kernel/seccomp.c  |   63 
> -
>  kernel/sys_ni.c   |3 ++
>  7 files changed, 69 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl 
> b/arch/x86/syscalls/syscall_32.tbl
> index d6b867921612..7527eac24122 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351i386sched_setattr   sys_sched_setattr
>  352i386sched_getattr   sys_sched_getattr
>  353i386renameat2   sys_renameat2
> +354i386seccomp sys_seccomp
> diff --git a/arch/x86/syscalls/syscall_64.tbl 
> b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1646d2..16272a6c12b7 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314common  sched_setattr   sys_sched_setattr
>  315common  sched_getattr   sys_sched_getattr
>  316common  renameat2   sys_renameat2
> +317common  seccomp sys_seccomp
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0ed322..1713977ee26f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>  unsigned long idx1, unsigned long idx2);
>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int 
> flags);
> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
> +   const char __user *uargs);

It looks odd to add a 'flags' argument to a syscall when it is not even used.
I don't think it will be extensible this way.
'uargs' is used only in the 2nd command as well, and it's not 'char __user *'
but rather 'struct sock_fprog __user *'.
I think it makes more sense to define only the first argument as 'int op' and
the rest as a variable-length array.
Something like:
long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
then different commands can interpret 'attrs' differently.
if op == mode_strict, then attrs == NULL, len == 0
if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
and nla_data(attrs) is 'struct sock_fprog'
If we decide to add new types of filters or new commands, the syscall prototype
won't need to change. New commands can be added while preserving backward
compatibility.
The basic TLV concept has been around forever in the netlink world. imo it
makes sense to use it with new syscalls. Passing 'struct xxx' into syscalls
is a thing of the past. TLV style is more extensible. Fields of structures can
become optional in the future, new fields can be added, etc.
'struct nlattr' brings the same benefits to the kernel api as protobuf did
to user land.
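
To make the TLV idea above concrete, here is a small userspace sketch of
walking a type-length-value buffer. It is only an illustration under stated
assumptions: struct demo_attr is a hypothetical header loosely modelled on
netlink's struct nlattr, the attribute types are invented for the example,
and none of this is the proposed seccomp ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute header, loosely modelled on struct nlattr. */
struct demo_attr {
	uint16_t len;	/* header plus payload, in bytes */
	uint16_t type;	/* selects how the payload is interpreted */
};

#define DEMO_HDRLEN	sizeof(struct demo_attr)
#define DEMO_ALIGN(n)	(((size_t)(n) + 3u) & ~(size_t)3u)

/* Hypothetical attribute types, invented for this sketch. */
enum { DEMO_ATTR_FILTER = 1, DEMO_ATTR_FUTURE = 2 };

/* Append one attribute at buf and return the aligned space it occupies. */
static size_t demo_put_attr(unsigned char *buf, uint16_t type,
			    const void *data, uint16_t data_len)
{
	struct demo_attr hdr = { (uint16_t)(DEMO_HDRLEN + data_len), type };

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + DEMO_HDRLEN, data, data_len);
	return DEMO_ALIGN(hdr.len);
}

/*
 * Return the payload of the first attribute of type "want", or NULL if it
 * is absent or the buffer is malformed. Unknown types are skipped, which is
 * what lets old code tolerate attributes added later.
 */
static const void *demo_find_attr(const unsigned char *buf, size_t buflen,
				  uint16_t want, size_t *payload_len)
{
	size_t off = 0;

	assert(buf != NULL && payload_len != NULL);

	while (off + DEMO_HDRLEN <= buflen) {
		struct demo_attr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < DEMO_HDRLEN || hdr.len > buflen - off)
			return NULL;	/* truncated or corrupt attribute */
		if (hdr.type == want) {
			*payload_len = hdr.len - DEMO_HDRLEN;
			return buf + off + DEMO_HDRLEN;
		}
		off += DEMO_ALIGN(hdr.len);
	}
	return NULL;
}
```

A reader that only knows DEMO_ATTR_FILTER still finds its payload when an
unknown DEMO_ATTR_FUTURE attribute precedes it in the buffer, which is the
backward-compatibility property argued for above.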

>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h 
> b/include/uapi/asm-generic/unistd.h
> index 333640608087..65acbf0e2867 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -699,9 +699,11 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
>  __SYSCALL(__NR_sched_getattr, sys_sched_getattr)
>  #define __NR_renameat2 276
>  __SYSCALL(__NR_renameat2, sys_renameat2)
> +#define __NR_seccomp 277
> +__SYSCALL(__NR_seccomp, sys_seccomp)
>
>  #undef __NR_syscalls
> -#define __NR_syscalls 277
> +#define __NR_syscalls 278
>
>  /*
>   * All syscalls below here should go away really,
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index ac2dc9f72973..b258878ba754 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -10,6 +10,10 @@
>  #define SECCOMP_MODE_STRICT1 /* uses hard-coded filter. */
>  #define SECCOMP_MODE_FILTER2 /* uses user-supplied filter. */
>
> +/* Valid operations for seccomp syscall. */
> +#define SECCOMP_SET_MODE_STRICT0
> +#define SECCOMP_SET_MODE_FILTER1
> +
>  /*
>   * All BPF programs must return a 32-bit value.
>   * The bottom 16-bits are for optional return data.
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 39d32c2904fc..c0cafa9e84af 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -19,6 +19,7 @@
>

Re: [PATCH] firmware: Add device tree binding for coreboot

2014-06-13 Thread Rob Herring

On Fri, Jun 13, 2014 at 3:06 PM, Julius Werner  wrote:
> This patch adds documentation describing a device tree binding for the
> coreboot firmware project (www.coreboot.org). It is meant to be
> dynamically added during boot and contains address definitions for the
> coreboot table (a list of variable-sized descriptors providing
> information about various compile- and run-time generated firmware
> parameters) and the CBMEM area (the structure containing most run-time
> resident memory regions set up by coreboot).
>
> These definitions allow kernel drivers to easily access data contained
> in and pointed to by these regions (such as coreboot's in-memory log).
> (An example implementation can be seen at http://crosreview.com/203371,
> which will be submitted at a later point.)

Is this just to export a fixed log to userspace (like a DMI table), or
will the kernel actually use the data in some way? Based on the link,
it looks like the former to me.

> Signed-off-by: Julius Werner 
> ---
>  .../devicetree/bindings/firmware/coreboot.txt  | 28 
> ++
>  1 file changed, 28 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/firmware/coreboot.txt
>
> diff --git a/Documentation/devicetree/bindings/firmware/coreboot.txt 
> b/Documentation/devicetree/bindings/firmware/coreboot.txt
> new file mode 100644
> index 000..89d7bf3
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/firmware/coreboot.txt
> @@ -0,0 +1,28 @@
> +COREBOOT firmware information
> +
> +The device tree node to communicate the location of coreboot's 
> memory-resident
> +bookkeeping structures to the kernel. Since coreboot itself cannot boot a
> +device-tree-based kernel (yet), this node needs to be inserted by a
> +second-stage bootloader (a coreboot "payload").
> +
> +Required properties:
> + - compatible: Should be "coreboot"
> + - reg: Address and length of the following two memory regions, in order:
> +   1.) The coreboot table. This is a list of variable-sized descriptors
> +   that contain various compile- and run-time generated firmware
> +   parameters. It is identified by the magic string "LBIO" in its first
> +   four bytes. See coreboot's src/include/boot/coreboot_tables.h for
> +   details.
> +   2.) The CBMEM area. This is a downward-growing memory region used by
> +   coreboot to dynamically allocate data structures that remain resident.
> +   It may or may not include the coreboot table as one of its members. It
> +   is identified by a root node descriptor with the magic number
> +   0xc0389479 that resides in the topmost 8 bytes of the area. See
> +   coreboot's src/lib/dynamic_cbmem.c for details.

Don't you need to keep the kernel from allocating this memory by
using one of the reserved memory mechanisms? The recently added one
should be able to specify what the memory is reserved for IIRC.
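
For reference, the binding text quoted above says the coreboot table is
identified by the magic string "LBIO" in its first four bytes. A consumer
could sanity-check a mapped region along these lines. This is only a sketch:
demo_is_coreboot_table() is a hypothetical helper, not an existing kernel or
coreboot function, and how the region gets mapped is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Magic string the quoted binding gives for the coreboot table. */
#define DEMO_CB_MAGIC "LBIO"

/*
 * Return true if the region is at least four bytes long and starts with
 * the "LBIO" magic. Nothing beyond the first four bytes is inspected.
 */
static bool demo_is_coreboot_table(const void *region, size_t len)
{
	assert(region != NULL);
	return len >= 4 && memcmp(region, DEMO_CB_MAGIC, 4) == 0;
}
```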

> +
> +Example:
> +   firmware {

/firmware is already used IIRC. What if you have other firmware such
as Trustzone?

> +   compatible = "coreboot";
> +   reg = <0xfdfea000 0x264>,
> + <0xfdfea000 0x16000>;
> +   };
> --
> 1.8.3.2
>

Re: [RFC PATCH] OF: fix of_find_node_by_path() assumption that of_allnodes is root

2014-06-13 Thread Rob Herring

On Fri, Jun 13, 2014 at 11:49 AM, Frank Rowand  wrote:
> On 6/13/2014 6:52 AM, Rob Herring wrote:
>> On Fri, Jun 13, 2014 at 12:53 AM, Frank Rowand  
>> wrote:
>>> From: Frank Rowand 
>>>
>>> Pantelis Antoniou reports that of_find_node_by_path() is borked because
>>> of_allnodes is not guaranteed to contain the root of the tree after using
>>> any of the dynamic update functions because some other node ends up as
>>> of_allnodes.
>>>
>>> Fixes: c22e650e66b8 of: Make of_find_node_by_path() handle /aliases
>>
>> Is it not possible to do a fix in of_find_node_by_path instead? I just
>
> Yes, the code for that is in https://lkml.org/lkml/2014/5/20/758
>
> Or as Grant said in his reply, just use the fix in of_attach_node() for
> now since he is going to replace the custom list.

Okay, I've applied.

Rob

Re: [PATCH] rcu: Only pin GP kthread when full dynticks is actually used

2014-06-13 Thread Josh Triplett

On Fri, Jun 13, 2014 at 01:48:22PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 13, 2014 at 09:44:41AM -0700, Josh Triplett wrote:
> > On Fri, Jun 13, 2014 at 06:21:32PM +0200, Frederic Weisbecker wrote:
> > > On Fri, Jun 13, 2014 at 09:16:30AM -0700, Paul E. McKenney wrote:
> > > > > Is it because we have dynticks CPUs staying too long in the kernel 
> > > > > without
> > > > > taking any quiescent states? Are we perhaps missing some 
> > > > > rcu_user_enter() or
> > > > > things?
> > > > 
> > > > Sort of the former, but combined with the fact that in-kernel CPUs still
> > > > need scheduling-clock interrupts for RCU to make progress.  I could
> > > > move this to RCU's context-switch hook, but that could be very bad for
> > > > workloads that do lots of context switching.
> > > 
> > > Or I can restart the tick if the CPU stays in the kernel for too long 
> > > without
> > > a tick. I think that's what we were doing before but we removed that 
> > > because
> > > we never implemented it correctly (we sent scheduler IPI that did 
> > > nothing...)
> > 
> > I wonder if timer slack would make sense here: when you have at least
> > one RCU callback pending, set a timer with a huge amount of timer slack,
> > and cancel it if you end up handling the callback via a trip through the
> > scheduler.
> 
> But in this case, we need the tick even if the current CPU has no callbacks
> because it might be in an RCU read-side critical section.

Don't we handle that case via the slowpath of rcu_read_unlock, and a
flag set via IPI?  ("Oh, that CPU has taken too long to note a quiescent
state; send it an IPI to set the special flag that makes unlock do the
work.")

- Josh Triplett

Re: [PATCH v7] NVMe: conversion to blk-mq

2014-06-13 Thread Jens Axboe


On 2014-06-13 13:29, Jens Axboe wrote:

On 06/13/2014 01:22 PM, Keith Busch wrote:

On Fri, 13 Jun 2014, Jens Axboe wrote:

OK, same setup as mine. The affinity hint is really screwing us over, no
question about it. We just need a:

irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
hctx->cpumask);

in the ->init_hctx() methods to fix that up.

That brings us to roughly the same performance, except for the cases
where the dd is run on the thread sibling of the core handling the
interrupt. And granted, with the 16 queues used, that'll happen on
blk-mq. But since you have 32 threads and just 31 IO queues, the non
blk-mq driver must end up sharing for some cases, too.

So what do we care most about here? Consistency, or using all queues at
all costs?


I think we want to use all h/w queues regardless of mismatched sharing. A
24 thread server shouldn't use more of the hardware than a 32.

You're right, the current driver shares the queues on anything with 32
or more cpus with this NVMe controller, but we wrote an algorithm that
allocates the most and tries to group them with their nearest neighbors.

One performance oddity we observe is that servicing the interrupt on the
thread sibling of the core that submitted the I/O is the worst performing
cpu you can choose; it's actually better to use a different core on the
same node. At least that's true as long as you're not utilizing the cpus
for other work, so YMMV.


I played around with the mappings, and stumbled upon some pretty ugly
results. The back story is that on this test box, I limit max C state to
C1 to avoid having too much of a bad time with power management. Running
the dd on a specific core, yields somewhere around 52MB/sec for me.
That's with the right CPU affinity for the irq. If I purposely put it
somewhere else, I end up at 380-390MB/sec. Or if I leave it on the right
CPU but simply do:

perf record -o /dev/null dd if= ...

and run the same thing just traced, I get the high performance as well.

Indeed... So I went to take a look at what is going on. For the slow
case, turbostat tells me I'm spending 80% in C1. For the fast case,
we're down to 20% in C1.

I then turn off C1, but lo and behold, it's still slow and sucky even
if turbostat now verifies that it's spending 0% time in C1.

Now, this smells like scheduling artifacts. I'm going to turn off all
power junk and see what happens. Because at 8x differences between fast
and slow, irq mappings don't really matter at all here. In fact it shows
results contrary to what you'd like to see.


OK, so I think I know what is going on here. If we slow down the next
issue just a little bit, the device will have cached the next read,
essentially getting some parallelism out of a sync read, since it is
sequential. For random 4k reads, it behaves as expected.


For reference, the attached patch brings back the affinity to what we 
want it to be.


We can always diddle with the utilization of the number of hardware 
queues later, I don't see that as a huge issue at all.


--
Jens Axboe

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ee48ac5..8dc5d36 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -178,6 +178,9 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 		nvmeq->hctx = hctx;
 	else
 		WARN_ON(nvmeq->hctx->tags != hctx->tags);
+
+	irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
+hctx->cpumask);
 	hctx->driver_data = nvmeq;
 	return 0;
 }
@@ -581,6 +584,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	enum dma_data_direction dma_dir;
 	int psegs = req->nr_phys_segments;
 	int result = BLK_MQ_RQ_QUEUE_BUSY;
+
 	/*
 	 * Requeued IO has already been prepped
 	 */
@@ -1788,6 +1792,7 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_VIRT_HOLE, ns->queue);
 	queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
 	ns->dev = dev;
 	ns->queue->queuedata = ns;
@@ -1801,7 +1806,6 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	lbaf = id->flbas & 0xf;
 	ns->lba_shift = id->lbaf[lbaf].ds;
 	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
-	blk_queue_max_segments(ns->queue, 1);
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
 	if (dev->max_hw_sectors)
 		blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);

Re: [PATCH] Input - wacom: remove phys field in struct wacom

2014-06-13 Thread Benjamin Tissoires

On Jun 13 2014 or thereabouts, Benjamin Tissoires wrote:
> On Jun 13 2014 or thereabouts, Dmitry Torokhov wrote:
> > On Fri, Jun 13, 2014 at 04:29:04PM -0400, Benjamin Tissoires wrote:
> > > This field is not used, remove it.
> > 
> > We must have lost the assignment, but it should be assigned to phys of
> > corresponding input device. 
> 
> hehe. Even in 2007, when the files moved under drivers/input/tablet,
> there is no mention of assigning input_dev->phys :)
> 
> I can send a patch to fix the other way around if you prefer (add
> input_dev->phys).
> 

Actually, I found the culprit:
$ git describe c5b7c7c395a34f12cdf246d66c1feeff2933d584
v2.6.14-21-gc5b7c7c
---> Thu Sep 15 02:01:47 2005

I should have waited one more year to fix it :)

No one has complained about it for nearly 10 years, so I am not sure
assigning the ->phys field of input_dev will help.

It's your call, Dmitry.

Cheers,
Benjamin
