Re: [PATCH v2] msi: free msi_desc entry only after we've released the kobject

2013-09-30 Thread Bjorn Helgaas
[+cc Russell]

On Sat, Sep 28, 2013 at 11:37:27PM +0200, Veaceslav Falico wrote:
> On Thu, Sep 26, 2013 at 11:59:51AM +0200, Veaceslav Falico wrote:
> >Currently, we first do kobject_put(>kobj) and the kfree(entry),
> >however kobject_put() doesn't guarantee us that it was the last reference
> >and that the kobj isn't used currently by someone else, so after we
> >kfree(entry) with the struct kobject - other users will begin using the
> >freed memory, instead of the actual kobject.
> 
> Hi Bjorn,
> 
> I've seen that you've dropped this bugfix (and the 3 cleanup patches) with
> "Changes Requested", however I don't recall any request to change this.

Oh, sorry, I didn't realize anybody else really looked at patchwork,
much less the reason codes.  I just intended to dispose of those for
now because I think we'll have several more revisions while we sort
things out.  I definitely think this is a serious issue that needs to
be fixed.  But you're right: I haven't given you any specific feedback
yet because I'm not confident about what the right fix is.

I think the current kobject delayed release is too aggressive, in the
sense that even after we've released all references, the object can
still be in sysfs, which causes future creates to fail.  E.g., this
fails:

kset = kset_create_and_add("kobj_test", NULL, NULL);
kset_unregister(kset);
kset = kset_create_and_add("kobj_test", NULL, NULL);  // FAILS

when I think it should succeed.  We don't have a way for the caller to
determine when it's safe to do the second kset_create_and_add().

After we release all references, I think it's OK for the kobject
itself to continue to exist, i.e., we can delay calling t->release().
But it should be impossible to find a kobject with refcount == 0 via
sysfs, so we should be able to create a new one with the same name.

In terms of code, I'm suggesting something like the following:

diff --git a/lib/kobject.c b/lib/kobject.c
index 9621751..4e1c608 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -536,6 +536,32 @@ static struct kobject * __must_check 
kobject_get_unless_zero(struct kobject *kob
return kobj;
 }
 
+static void kobject_free(struct kobject *kobj)
+{
+   struct kobj_type *t = get_ktype(kobj);
+   const char *name = kobj->name;
+
+   if (t && t->release) {
+   pr_debug("kobject: '%s' (%p): calling ktype release\n",
+kobject_name(kobj), kobj);
+   t->release(kobj);
+   }
+
+   /* free name if we allocated it */
+   if (name) {
+   pr_debug("kobject: '%s': free name\n", name);
+   kfree(name);
+   }
+}
+
+#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
+static void kobject_delayed_free(struct work_struct *work)
+{
+   kobject_free(container_of(to_delayed_work(work),
+struct kobject, release));
+}
+#endif
+
 /*
  * kobject_cleanup - free kobject resources.
  * @kobj: object to cleanup
@@ -543,7 +569,6 @@ static struct kobject * __must_check 
kobject_get_unless_zero(struct kobject *kob
 static void kobject_cleanup(struct kobject *kobj)
 {
struct kobj_type *t = get_ktype(kobj);
-   const char *name = kobj->name;
 
pr_debug("kobject: '%s' (%p): %s, parent %p\n",
 kobject_name(kobj), kobj, __func__, kobj->parent);
@@ -567,40 +592,21 @@ static void kobject_cleanup(struct kobject *kobj)
kobject_del(kobj);
}
 
-   if (t && t->release) {
-   pr_debug("kobject: '%s' (%p): calling ktype release\n",
-kobject_name(kobj), kobj);
-   t->release(kobj);
-   }
-
-   /* free name if we allocated it */
-   if (name) {
-   pr_debug("kobject: '%s': free name\n", name);
-   kfree(name);
-   }
-}
-
-#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
-static void kobject_delayed_cleanup(struct work_struct *work)
-{
-   kobject_cleanup(container_of(to_delayed_work(work),
-struct kobject, release));
-}
-#endif
-
-static void kobject_release(struct kref *kref)
-{
-   struct kobject *kobj = container_of(kref, struct kobject, kref);
 #ifdef CONFIG_DEBUG_KOBJECT_RELEASE
pr_debug("kobject: '%s' (%p): %s, parent %p (delayed)\n",
 kobject_name(kobj), kobj, __func__, kobj->parent);
-   INIT_DELAYED_WORK(>release, kobject_delayed_cleanup);
+   INIT_DELAYED_WORK(>release, kobject_delayed_free);
schedule_delayed_work(>release, HZ);
 #else
-   kobject_cleanup(kobj);
+   kobject_free(kobj);
 #endif
 }
 
+static void kobject_release(struct kref *kref)
+{
+   kobject_cleanup(container_of(kref, struct kobject, kref));
+}
+
 /**
  * kobject_put - decrement refcount for object.
  * @kobj: object.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the 

Re: [PATCH -mm] mm, memory-hotpulg: Rename movablenode boot option to movable_node

2013-09-30 Thread Ingo Molnar

* Zhang Yanfei  wrote:

> @@ -153,11 +153,18 @@ config MOVABLE_NODE
>   help
> Allow a node to have only movable memory.  Pages used by the kernel,
> such as direct mapping pages cannot be migrated.  So the corresponding
> +   memory device cannot be hotplugged.  This option allows the following
> +   two things:
> +   - When the system is booting, node full of hotpluggable memory can
> +   be arranged to have only movable memory so that the whole node can
> +   be hotplugged. (need movable_node boot option specified).

So this is _exactly_ what I complained about earlier: why is the 
movable_node boot option needed to get that extra functionality? It's 
clearly not just a drop-in substitute to CONFIG_MOVABLE_NODE but extends 
its functionality, right?

Boot options are _very_ poor user interface. If you don't want to enable 
it by default then turn this sub-functionality into 
CONFIG_MOVABLE_NODE_AUTO and keep it default-off - but don't pretend that 
this is only about CONFIG_MOVABLE_NODE alone - it isnt: as described above 
the 'movable_node' is needed for the full functionality to be available!

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ll_temac: Reset dma descriptors on ndo_open

2013-09-30 Thread Ricardo Ribalda Delgado
Hello David

lp->tx_bd_ci, lp->tx_bd_next...  are only initialized to zero on
temac_of_probe (inside  alloc_etherdev).  Those vars are used to index
the dma descriptors.

The initialization of  lp->tx_bd_v[i].app0 = 0; is redundant, because
it is already done on dma_zalloc_coherent in temac_dma_bd_init called
on open.

What if I move
   lp->tx_bd_ci = 0;
   lp->tx_bd_next = 0;
   lp->tx_bd_tail = 0;
   lp->rx_bd_ci = 0;

to temac_dma_bd_init? Will this be more correct?

Without this patch a script like these kills the card in 1-10 iterations:


ifdown eth0
ifdown eth1
while true

do
ifconfig eth1 10.100.10.100
udhcpc -i eth0
ping 192.168.2.1 -c 5 || break
ifconfig eth0 down


ifconfig eth0 10.100.10.100
udhcpc -i eth1
ping 192.168.2.1 -c 5 || break
ifconfig eth1 down

done


Regards

On Tue, Oct 1, 2013 at 6:21 AM, David Miller  wrote:
> From: Ricardo Ribalda Delgado 
> Date: Fri, 27 Sep 2013 13:24:28 +0200
>
>> The dma descriptors are only initialized on the probe function.
>>
>> If a packet is on the buffer when temac_stop is called, the dma
>> descriptors can be left on a incorrect status where no other package can
>> be sent.
>>
>> So an interface could be left in an usable state after ifdow/ifup.
>>
>> This patch makes sure that the descriptors are in a proper status when
>> the device is started.
>>
>> Signed-off-by: Ricardo Ribalda Delgado 
>
> This analysis is not correct.
>
> In the current driver, the descriptors are allocated and initialized
> in the open function, not the probe function.
>
> I'm not applying this patch.



-- 
Ricardo Ribalda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: spinlock contention of files->file_lock

2013-09-30 Thread Eric Dumazet
On Mon, 2013-09-30 at 20:36 -0700, Eric Dumazet wrote:

> I have a patch mostly working here, and pretty short.

Here is the RFC patch. Unfortunately I cant really give performance
numbers, as a patch like this would need weeks before being tested here.

 fs/file.c   |   66 --
 include/linux/fdtable.h |1 
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 4a78f98..b511f14 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -62,15 +62,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
  */
-static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
+static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt,
+bool shared)
 {
-   unsigned int cpy, set;
+   unsigned int cpy, set, i;
 
BUG_ON(nfdt->max_fds < ofdt->max_fds);
 
cpy = ofdt->max_fds * sizeof(struct file *);
set = (nfdt->max_fds - ofdt->max_fds) * sizeof(struct file *);
-   memcpy(nfdt->fd, ofdt->fd, cpy);
+
+   if (shared) {
+   /* see fd_install() why we need xchg() */
+   for (i = 0; i < ofdt->max_fds; i++)
+   nfdt->fd[i] = xchg(>fd[i], NULL);
+   } else {
+   memcpy(nfdt->fd, ofdt->fd, cpy);
+   }
memset((char *)(nfdt->fd) + cpy, 0, set);
 
cpy = ofdt->max_fds / BITS_PER_BYTE;
@@ -167,8 +175,12 @@ static int expand_fdtable(struct files_struct *files, int 
nr)
cur_fdt = files_fdtable(files);
if (nr >= cur_fdt->max_fds) {
/* Continue as planned */
-   copy_fdtable(new_fdt, cur_fdt);
+
+   write_seqlock(>resize);
+   copy_fdtable(new_fdt, cur_fdt, atomic_read(>count) > 1);
rcu_assign_pointer(files->fdt, new_fdt);
+   write_sequnlock(>resize);
+
if (cur_fdt != >fdtab)
call_rcu(_fdt->rcu, free_fdtable_rcu);
} else {
@@ -258,6 +270,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int 
*errorp)
atomic_set(>count, 1);
 
spin_lock_init(>file_lock);
+   seqlock_init(>resize);
newf->next_fd = 0;
new_fdt = >fdtab;
new_fdt->max_fds = NR_OPEN_DEFAULT;
@@ -453,6 +466,7 @@ struct files_struct init_files = {
.open_fds   = init_files.open_fds_init,
},
.file_lock  = __SPIN_LOCK_UNLOCKED(init_files.file_lock),
+   .resize = __SEQLOCK_UNLOCKED(init_task.resize),
 };
 
 /*
@@ -569,11 +583,24 @@ void __fd_install(struct files_struct *files, unsigned 
int fd,
struct file *file)
 {
struct fdtable *fdt;
-   spin_lock(>file_lock);
+   struct file *old;
+   unsigned int seq;
+
+   rcu_read_lock();
+retry:
+   seq = read_seqbegin(>resize);
fdt = files_fdtable(files);
-   BUG_ON(fdt->fd[fd] != NULL);
-   rcu_assign_pointer(fdt->fd[fd], file);
-   spin_unlock(>file_lock);
+   old = xchg(>fd[fd], file);
+   if (unlikely(old)) {
+   /* dup2() race */
+   fput(old);
+   }
+   if (unlikely(read_seqretry(>resize, seq))) {
+   if (cmpxchg(>fd[fd], file, NULL) == file)
+   goto retry;
+   /* 'file' was consumed by resizer or dup2(), we are done */
+   }
+   rcu_read_unlock();
 }
 
 void fd_install(unsigned int fd, struct file *file)
@@ -783,26 +810,9 @@ static int do_dup2(struct files_struct *files,
struct file *tofree;
struct fdtable *fdt;
 
-   /*
-* We need to detect attempts to do dup2() over allocated but still
-* not finished descriptor.  NB: OpenBSD avoids that at the price of
-* extra work in their equivalent of fget() - they insert struct
-* file immediately after grabbing descriptor, mark it larval if
-* more work (e.g. actual opening) is needed and make sure that
-* fget() treats larval files as absent.  Potentially interesting,
-* but while extra work in fget() is trivial, locking implications
-* and amount of surgery on open()-related paths in VFS are not.
-* FreeBSD fails with -EBADF in the same situation, NetBSD "solution"
-* deadlocks in rather amusing ways, AFAICS.  All of that is out of
-* scope of POSIX or SUS, since neither considers shared descriptor
-* tables and this condition does not arise without those.
-*/
fdt = files_fdtable(files);
-   tofree = fdt->fd[fd];
-   if (!tofree && fd_is_open(fd, fdt))
-   goto Ebusy;
get_file(file);
-   rcu_assign_pointer(fdt->fd[fd], file);
+   tofree = xchg(>fd[fd], file);
__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
__set_close_on_exec(fd, fdt);
@@ -814,10 +824,6 @@ static int 

Re: [PATCH] ethernet: moxa: fix incorrect placement of __initdata tag

2013-09-30 Thread David Miller
From: Bartlomiej Zolnierkiewicz 
Date: Mon, 30 Sep 2013 15:18:27 +0200

> __initdata tag should be placed between the variable name and equal
> sign for the variable to be placed in the intended .init.data section.
> 
> In this particular case __initdata is incorrect as moxart_mac_driver
> can be used after the driver gets initialized.
> 
> Also while at it static-ize moxart_mac_driver.
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz 
> Signed-off-by: Kyungmin Park 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHSET] sysfs: use seq_file and unify regular and bin file handling

2013-09-30 Thread Bjorn Helgaas
On Sat, Sep 28, 2013 at 4:15 PM, Tejun Heo  wrote:
> Hey, again.
>
> On Sat, Sep 28, 2013 at 05:49:30PM -0400, Tejun Heo wrote:
>>  0001-sysfs-remove-unused-sysfs_buffer-pos.patch
>>  0002-sysfs-remove-sysfs_buffer-needs_read_fill.patch
>>  0003-sysfs-remove-sysfs_buffer-ops.patch
>>  0004-sysfs-add-sysfs_open_file_mutex.patch
>>  0005-sysfs-rename-sysfs_buffer-to-sysfs_open_file.patch
>>  0006-sysfs-add-sysfs_open_file-sd-and-file.patch
>>  0007-sysfs-use-transient-write-buffer.patch
>>  0008-sysfs-use-seq_file-when-reading-regular-files.patch
>>  0009-sysfs-prepare-llseek-path-for-unified-regular-bin-fi.patch
>>  0010-sysfs-prepare-path-write-for-unified-regular-bin-fil.patch
>>  0011-sysfs-prepare-read-path-for-unified-regular-bin-file.patch
>>  0012-sysfs-copy-bin-mmap-support-from-fs-sysfs-bin.c-to-f.patch
>>  0013-sysfs-prepare-open-path-for-unified-regular-bin-file.patch
>>  0014-sysfs-merge-regular-and-bin-file-handling.patch
>
> On the second thought, 0011 seems too dangerous, especially for pci IO
> BAR regions.  Grumble, looks like I'll have to break out the bin read
> path.  Please ignore patches >= 0009.  I'll update them and repost.

I don't pretend to understand sysfs or the issue you tripped over with
PCI I/O BAR regions.  But we had a long discussion about those files
[1] last spring, and I'm pretty convinced that it was a mistake to add
them in their current form, and I would support an attempt to rework
them.  We had some ideas about how to do that, but I think everybody
lost interest before anything happened.

Bjorn

[1] http://lkml.kernel.org/r/20130316213519.2974.38954.stgit@amt.stowe
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next PATCH V2] virtio-net: switch to use XPS to choose txq

2013-09-30 Thread David Miller
From: Jason Wang 
Date: Mon, 30 Sep 2013 15:37:17 +0800

> We used to use a percpu structure vq_index to record the cpu to queue
> mapping, this is suboptimal since it duplicates the work of XPS and
> loses all other XPS functionality such as allowing use to configure
> their own transmission steering strategy.
> 
> So this patch switches to use XPS and suggest a default mapping when
> the number of cpus is equal to the number of queues. With XPS support,
> there's no need for keeping per-cpu vq_index and .ndo_select_queue(),
> so they were removed also.
> 
> Cc: Rusty Russell 
> Cc: Michael S. Tsirkin 
> Signed-off-by: Jason Wang 
> ---
> Changes from V1:
> - use cpumask_of() instead of allocate dynamically

This generates build warnings:

drivers/net/virtio_net.c: In function ‘virtnet_set_affinity’:
drivers/net/virtio_net.c:1093:3: warning: passing argument 2 of 
‘netif_set_xps_queue’ discards ‘const’ qualifier from pointer target type 
[enabled by default]
In file included from drivers/net/virtio_net.c:20:0:
include/linux/netdevice.h:2275:5: note: expected ‘struct cpumask *’ but 
argument is of type ‘const struct cpumask *’


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-09-30 Thread Michael Neuling
>> Well we don't have to, I think Mikey wasn't totally clear about that
>> "making all registers volatile" business :-) This is just something we
>> need to handle in assembly if we are going to reclaim the suspended
>> transaction.

Yeah, sorry.  The slow path with all registers as volatile is only
needed if we get pre-empted during the transaction.

>>
>> So basically, what we need is something along the lines of
>> enable_kernel_tm() which checks if there's a suspended user transaction
>> and if yes, kills/reclaims it.
>>
>> Then we also need to handle in our interrupt handlers that we have an
>> active/suspended transaction from a kernel state, which we don't deal
>> with at this point, and do whatever has to be done to kill it... we
>> might get away with something simple if we can state that we only allow
>> kernel transactions at task level and not from interrupt/softirq
>> contexts, at least initially.
>
> Call me a coward, but this is starting to sound a bit scary.  ;-)

We are just wanting to prototype it for now to see if we could make it
go faster.  If it's worth it, then we'd consider the additional
complexity this would bring.

I don't think it'll be that bad, but I'd certainly want to make sure
it's worth it before trying :-)

Mikey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ipc/sem.c: synchronize semop and semctl with IPC_RMID

2013-09-30 Thread Manfred Spraul

Hi Davidlohr,

On 09/30/2013 07:54 PM, Davidlohr Bueso wrote:

Hi Manfred,

On Mon, 2013-09-30 at 11:13 +0200, Manfred Spraul wrote:

After acquiring the semlock spinlock, the operations must test that the
array is still valid.

- semctl() and exit_sem() would walk stale linked lists (ugly, but should
   be ok: all lists are empty)

- semtimedop() would sleep forever - and if woken up due to a signal -
   access memory after free.

Yep, that was next on my list - I had a patch for semtimedop() but was
waiting to rebase it on top of your previous changes. Anyway thanks for
sending this.


The patch standardizes the tests for .deleted, so that all tests in one
function leave the function with the same approach.

Right now, it's a mixture of "goto cleanup", some cleanup and then
"goto further_cleanup" and all cleanup+"return -EIDRM" - that makes the
review much harder.

Davidlohr: Could you please review the patch?
I did some stress test, but probably I didn't hit exactly the modified
lines.

This shouldn't affect performance, if that's what you mean.
All goto's must go to the correct target, free everything, unlock 
everything, do not unlock twice, ...



  One more
read in the critical region won't make any difference. The patch looks
good, just one doubt below.



Signed-off-by: Manfred Spraul 
---
  ipc/sem.c | 43 ++-
  1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 19c8b98..a2fa795 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1229,6 +1229,12 @@ static int semctl_setval(struct ipc_namespace *ns, int 
semid, int semnum,
  
  	sem_lock(sma, NULL, -1);
  
+	if (sma->sem_perm.deleted) {

+   sem_unlock(sma, -1);
+   rcu_read_unlock();
+   return -EIDRM;
+   }
+
curr = >sem_base[semnum];
  
  	ipc_assert_locked_object(>sem_perm);

@@ -1285,10 +1291,8 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
sem_lock(sma, NULL, -1);
if(nsems > SEMMSL_FAST) {
if (!ipc_rcu_getref(sma)) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;
-   goto out_free;
+   goto out_unlock;
}
sem_unlock(sma, -1);
rcu_read_unlock();
@@ -1301,10 +1305,13 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
rcu_read_lock();
sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
err = -EIDRM;

^^ check if nsems > SEMMSL_FAST

-   goto out_free;
+   goto out_unlock;
+   }
+   } else {
+   if (sma->sem_perm.deleted) {
+   err = -EIDRM;
+   goto out_unlock;
}

I'm a bit lost here. Why should we only check the existence of the sem
if nsems <= SEMMSL_FAST? Shouldn't the same should apply either way?

It is checked in both branches:
- the check for "nsems > SEMMSL_FAST" was always there, due to the 
kmalloc, the lock is dropped.


--
Manfred
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: FTRACE_WARN_ON((rec->flags & ~FTRACE_FL_MASK) == 0))

2013-09-30 Thread Dave Jones
On Mon, Sep 30, 2013 at 11:56:06PM -0400, Steven Rostedt wrote:
 
 > > Steve, any further thoughts on this ? I can't do more than a few hours of 
 > > fuzz-testing
 > > without eventually running into this.
 > 
 > I'm currently traveling, but as it seems that you have a pretty good
 > reproducer, can trace what trinity does up to the point it crashes?
 > 
 > I think we can do that with other buffers. If you download the latest
 > trace-cmd, you can run:
 > 
 > trace-cmd record -e syscalls -B trinity trinity 
 > 
 > This will create a ftrace sub buffer called trinity and shouldn't be
 > affected by the other ftrace functions. Then it should trace all the
 > syscalls that trinity does. I guess a strace could be used too.

It seems like trace-cmd needs to be run as root. all hell will break loose if 
trinity gets root privs.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


An Emergency, Please Acknowledge!!!!

2013-09-30 Thread Mrs. Olga Patarkatsishvili


Greetings From Georgia

Greetings in the name of the lord, I am Mrs. Olga Patarkatsishvili,  the widow 
of late Georgian business tycoon Mr. Badri Patarkatsishvili, I have a
business proposal which will be of great benefit to you and myself. I will send 
you further details once i receive your response back. 
Please for security reason, i will strongly recommend that you write me through 
my private email: mrs_olgapatarkatsishvi...@safe-mail.net account 
only.Thanks for your understanding.

Yours truly,

Mrs. Olga Patarkatsishvili.



...:PMMG:...


Re: [PATCH] ll_temac: Reset dma descriptors on ndo_open

2013-09-30 Thread David Miller
From: Ricardo Ribalda Delgado 
Date: Fri, 27 Sep 2013 13:24:28 +0200

> The dma descriptors are only initialized on the probe function.
> 
> If a packet is on the buffer when temac_stop is called, the dma
> descriptors can be left on a incorrect status where no other package can
> be sent.
> 
> So an interface could be left in an usable state after ifdow/ifup.
> 
> This patch makes sure that the descriptors are in a proper status when
> the device is started.
> 
> Signed-off-by: Ricardo Ribalda Delgado 

This analysis is not correct.

In the current driver, the descriptors are allocated and initialized
in the open function, not the probe function.

I'm not applying this patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] regulator: tps65090: remove usage of IS_ERR_OR_NULL

2013-09-30 Thread Manish Badarkhe
This patch changes the driver to avoid the usage of IS_ERR_OR_NULL()
macro.

The case present in this patch simply translates to normal check for pointer,
wheather it is NULL or has an error code.

Signed-off-by: Manish Badarkhe 
---
:100644 100644 bd611cdf.. ad3d4d4... M  drivers/regulator/tps65090-regulator.c
 drivers/regulator/tps65090-regulator.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/regulator/tps65090-regulator.c 
b/drivers/regulator/tps65090-regulator.c
index bd611cdf..ad3d4d4 100644
--- a/drivers/regulator/tps65090-regulator.c
+++ b/drivers/regulator/tps65090-regulator.c
@@ -245,7 +245,7 @@ static int tps65090_regulator_probe(struct platform_device 
*pdev)
if (!tps65090_pdata && tps65090_mfd->dev->of_node)
tps65090_pdata = tps65090_parse_dt_reg_data(pdev,
_reg_matches);
-   if (IS_ERR_OR_NULL(tps65090_pdata)) {
+   if (!tps65090_pdata || IS_ERR(tps65090_pdata)) {
dev_err(>dev, "Platform data missing\n");
return tps65090_pdata ? PTR_ERR(tps65090_pdata) : -EINVAL;
}
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH] ftrace: add set_graph_notrace filter

2013-09-30 Thread Steven Rostedt
On Mon, 2013-09-30 at 16:04 +0900, Namhyung Kim wrote:
> Ping!

Ah I missed this patch. And I'm currently traveling.

I'll start testing it when I get back on Friday, but feel free to ping
me again next Monday.

-- Steve

> 
> On Tue,  3 Sep 2013 14:05:08 +0900, Namhyung Kim wrote:
> > From: Namhyung Kim 
> >
> > The set_graph_notrace filter is analogous to set_ftrace_notrace and
> > can be used for eliminating uninteresting part of function graph trace
> > output.  It also works with set_graph_function nicely.
> >
> >   # cd /sys/kernel/debug/tracing/
> >   # echo do_page_fault > set_graph_function
> >   # perf ftrace live true
> >2)   |  do_page_fault() {
> >2)   |__do_page_fault() {
> >2)   0.381 us|  down_read_trylock();
> >2)   0.055 us|  __might_sleep();
> >2)   0.696 us|  find_vma();
> >2)   |  handle_mm_fault() {
> >2)   |handle_pte_fault() {
> >2)   |  __do_fault() {
> >2)   |filemap_fault() {
> >2)   |  find_get_page() {
> >2)   0.033 us|__rcu_read_lock();
> >2)   0.035 us|__rcu_read_unlock();
> >2)   1.696 us|  }
> >2)   0.031 us|  __might_sleep();
> >2)   2.831 us|}
> >2)   |_raw_spin_lock() {
> >2)   0.046 us|  add_preempt_count();
> >2)   0.841 us|}
> >2)   0.033 us|page_add_file_rmap();
> >2)   |_raw_spin_unlock() {
> >2)   0.057 us|  sub_preempt_count();
> >2)   0.568 us|}
> >2)   |unlock_page() {
> >2)   0.084 us|  page_waitqueue();
> >2)   0.126 us|  __wake_up_bit();
> >2)   1.117 us|}
> >2)   7.729 us|  }
> >2)   8.397 us|}
> >2)   8.956 us|  }
> >2)   0.085 us|  up_read();
> >2) + 12.745 us   |}
> >2) + 13.401 us   |  }
> >   ...
> >
> >   # echo handle_mm_fault > set_graph_notrace
> >   # perf ftrace live true
> >1)   |  do_page_fault() {
> >1)   |__do_page_fault() {
> >1)   0.205 us|  down_read_trylock();
> >1)   0.041 us|  __might_sleep();
> >1)   0.344 us|  find_vma();
> >1)   0.069 us|  up_read();
> >1)   4.692 us|}
> >1)   5.311 us|  }
> >   ...
> >
> > Signed-off-by: Namhyung Kim 
> > ---
> >  include/linux/ftrace.h   |   1 +
> >  kernel/trace/ftrace.c| 118 
> > ++-
> >  kernel/trace/trace.h |  23 +++
> >  kernel/trace/trace_functions_graph.c |  21 ++-
> >  4 files changed, 159 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index 9f15c0064c50..ec85d48619e1 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -721,6 +721,7 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func, int *depth,
> >  extern char __irqentry_text_start[];
> >  extern char __irqentry_text_end[];
> >  
> > +#define FTRACE_NOTRACE_DEPTH 65536
> >  #define FTRACE_RETFUNC_DEPTH 50
> >  #define FTRACE_RETSTACK_ALLOC_SIZE 32
> >  extern int register_ftrace_graph(trace_func_graph_ret_t retfunc,
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index a6d098c6df3f..1b1f3409f788 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -3819,6 +3819,43 @@ static const struct seq_operations 
> > ftrace_graph_seq_ops = {
> > .show = g_show,
> >  };
> >  
> > +int ftrace_graph_notrace_count;
> > +int ftrace_graph_notrace_enabled;
> > +unsigned long ftrace_graph_notrace_funcs[FTRACE_GRAPH_MAX_FUNCS] 
> > __read_mostly;
> > +
> > +static void *
> > +__n_next(struct seq_file *m, loff_t *pos)
> > +{
> > +   if (*pos >= ftrace_graph_notrace_count)
> > +   return NULL;
> > +   return _graph_notrace_funcs[*pos];
> > +}
> > +
> > +static void *
> > +n_next(struct seq_file *m, void *v, loff_t *pos)
> > +{
> > +   (*pos)++;
> > +   return __n_next(m, pos);
> > +}
> > +
> > +static void *n_start(struct seq_file *m, loff_t *pos)
> > +{
> > +   mutex_lock(_lock);
> > +
> > +   /* Nothing, tell g_show to print all functions are enabled */
> > +   if (!ftrace_graph_notrace_enabled && !*pos)
> > +   return (void *)1;
> > +
> > +   return __n_next(m, pos);
> > +}
> > +
> > +static const struct seq_operations ftrace_graph_notrace_seq_ops = {
> > +   .start = n_start,
> > +   .next = n_next,
> > +   .stop = g_stop,
> > +   .show = g_show,
> > +};
> > +
> >  static int
> >  ftrace_graph_open(struct inode *inode, struct file *file)
> >  {
> > @@ -3843,6 +3880,30 @@ ftrace_graph_open(struct inode 

Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

2013-09-30 Thread Paul E. McKenney
On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:

[ . . . ]

> > > +static bool cpuhp_readers_active_check(void)
> > >  {
> > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > + smp_mb(); /* B matches A */
> > > +
> > > + /*
> > > +  * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > +  * we are guaranteed to also see its __cpuhp_refcount increment.
> > > +  */
> > >  
> > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > + return false;
> > >  
> > > + smp_mb(); /* D matches C */
> > 
> > It seems that both barries could be smp_rmb() ? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(), note that
> > __srcu_read_lock() always succeeds unlike get_cpus_online().
> 
> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
> 
> Paul; is there any argument for the MB here as opposed to RMB; and if
> not should we change both these and SRCU?

Given that these memory barriers execute only on the semi-slow path,
why add the complexity of moving from smp_mb() to either smp_rmb()
or smp_wmb()?  Straight smp_mb() is easier to reason about and more
robust against future changes.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: FTRACE_WARN_ON((rec->flags & ~FTRACE_FL_MASK) == 0))

2013-09-30 Thread Steven Rostedt
On Mon, 2013-09-30 at 13:12 -0400, Dave Jones wrote:
> On Wed, Aug 28, 2013 at 03:27:45PM -0400, Steven Rostedt wrote:
> 
>  > > [ 6619.050768] WARNING: CPU: 1 PID: 16351 at kernel/trace/ftrace.c:1640 
> __ftrace_hash_rec_update.part.37+0x20a/0x240()
>  > > [ 6619.053767] Modules linked in: lec snd_seq_dummy bridge stp fuse tun 
> bnep hidp rfcomm nfnetlink ipt_ULOG scsi_transport_iscsi can_bcm can_raw nfc 
> caif_socket caif af_802154 phonet af_rxrpc bluetooth rfkill can llc2 pppoe 
> pppox ppp_generic slhc irda crc_ccitt rds af_key rose x25 atm netrom 
> appletalk ipx p8023 psnap p8022 llc ax25 xfs libcrc32c snd_hda_codec_realtek 
> snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm 
> snd_page_alloc e1000e ptp snd_timer pcspkr pps_core snd soundcore usb_debug
>  > > [ 6619.060523] CPU: 1 PID: 16351 Comm: trinity-child2 Not tainted 
> 3.11.0-rc7+ #31 
>  > > [ 6619.062161]  81a21901 8802267b9ce0 816f9e4f 
> 
>  > > [ 6619.063733]  8802267b9d18 81052dcd  
> 0001
>  > > [ 6619.065309]  8802203a3000  880225e962d0 
> 8802267b9d28
>  > > [ 6619.066895] Call Trace:
>  > > [ 6619.068437]  [] dump_stack+0x54/0x74
>  > > [ 6619.070046]  [] warn_slowpath_common+0x7d/0xa0
>  > > [ 6619.071642]  [] warn_slowpath_null+0x1a/0x20
>  > > [ 6619.073224]  [] 
> __ftrace_hash_rec_update.part.37+0x20a/0x240
>  > > [ 6619.074817]  [] ftrace_shutdown+0xb8/0x160
>  > > [ 6619.076399]  [] unregister_ftrace_function+0x30/0x50
>  > > [ 6619.077983]  [] 
> perf_ftrace_event_register+0x87/0x150
>  > > [ 6619.079565]  [] perf_trace_destroy+0x2c/0x50
>  > > [ 6619.081180]  [] tp_perf_event_destroy+0x9/0x10
>  > > [ 6619.082742]  [] free_event+0xa7/0x300
>  > > [ 6619.084264]  [] __perf_event_exit_task+0xe0/0x130
>  > > [ 6619.085792]  [] perf_event_exit_task+0x1f1/0x230
>  > > [ 6619.087329]  [] do_exit+0x30d/0xcd0
>  > > [ 6619.088860]  [] ? ftrace_call+0x5/0x2f
>  > > [ 6619.090460]  [] do_group_exit+0x4c/0xc0
>  > > [ 6619.092036]  [] SyS_exit_group+0x14/0x20
>  > > [ 6619.093614]  [] tracesys+0xdd/0xe2
>  >
>  > The first failure disables function tracing completely, which is one of
>  > the causes to return an error, which triggered the warning I added.
>  > 
>  > I was hoping that it failed for another reason and then that would
>  > cause things to break. But you said the hash corruption happened first
>  > (the original bug), so that's not the case.
>  > 
>  > Back to staring at code.
> 
> Steve, any further thoughts on this ? I can't do more than a few hours of 
> fuzz-testing
> without eventually running into this.

I'm currently traveling, but as it seems that you have a pretty good
reproducer, can trace what trinity does up to the point it crashes?

I think we can do that with other buffers. If you download the latest
trace-cmd, you can run:

trace-cmd record -e syscalls -B trinity trinity 

This will create a ftrace sub buffer called trinity and shouldn't be
affected by the other ftrace functions. Then it should trace all the
syscalls that trinity does. I guess a strace could be used too.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] apparmor: fix suspicious RCU usage warning in policy.c/policy.h

2013-09-30 Thread Paul E. McKenney
On Sun, Sep 29, 2013 at 08:39:22AM -0700, John Johansen wrote:
> The recent 3.12 pull request for apparmor was missing a couple rcu _protected
> access modifiers. Resulting in the follow suspicious RCU usage

Assuming the lock you called out is the right one (I have no idea!), this
looks good to me!

So why don't we need to worry that RCU read-side critical sections might
have modified the ->base.count field that aa_put_profile() references?
Because the RCU callback function is guaranteed to see the effect of any
RCU read-side critical sections that started before the corresponding
call_rcu() invocation.  This of course assumes that you made the
structure inaccessible to readers before that same call_rcu() function.
(You did do this, didn't you?  If not, you have very big problems over
and above the ->base.count field!)

Thanx, Paul

>  [   29.804534] [ INFO: suspicious RCU usage. ]
>  [   29.804539] 3.11.0+ #5 Not tainted
>  [   29.804541] ---
>  [   29.804545] security/apparmor/include/policy.h:363 suspicious 
> rcu_dereference_check() usage!
>  [   29.804548]
>  [   29.804548] other info that might help us debug this:
>  [   29.804548]
>  [   29.804553]
>  [   29.804553] rcu_scheduler_active = 1, debug_locks = 1
>  [   29.804558] 2 locks held by apparmor_parser/1268:
>  [   29.804560]  #0:  (sb_writers#9){.+.+.+}, at: [] 
> file_start_write+0x27/0x29
>  [   29.804576]  #1:  (>lock){+.+.+.}, at: [] 
> aa_replace_profiles+0x166/0x57c
>  [   29.804589]
>  [   29.804589] stack backtrace:
>  [   29.804595] CPU: 0 PID: 1268 Comm: apparmor_parser Not tainted 3.11.0+ #5
>  [   29.804599] Hardware name: ASUSTeK Computer Inc. UL50VT  
> /UL50VT, BIOS 217 03/01/2010
>  [   29.804602]   8800b95a1d90 8144eb9b 
> 8800b94db540
>  [   29.804611]  8800b95a1dc0 81087439 880138cc3a18 
> 880138cc3a18
>  [   29.804619]  8800b9464a90 880138cc3a38 8800b95a1df0 
> 811f5084
>  [   29.804628] Call Trace:
>  [   29.804636]  [] dump_stack+0x4e/0x82
>  [   29.804642]  [] lockdep_rcu_suspicious+0xfc/0x105
>  [   29.804649]  [] __aa_update_replacedby+0x53/0x7f
>  [   29.804655]  [] __replace_profile+0x11f/0x1ed
>  [   29.804661]  [] aa_replace_profiles+0x410/0x57c
>  [   29.804668]  [] profile_replace+0x35/0x4c
>  [   29.804674]  [] vfs_write+0xad/0x113
>  [   29.804680]  [] SyS_write+0x44/0x7a
>  [   29.804687]  [] system_call_fastpath+0x16/0x1b
>  [   29.804691]
>  [   29.804694] ===
>  [   29.804697] [ INFO: suspicious RCU usage. ]
>  [   29.804700] 3.11.0+ #5 Not tainted
>  [   29.804703] ---
>  [   29.804706] security/apparmor/policy.c:566 suspicious 
> rcu_dereference_check() usage!
>  [   29.804709]
>  [   29.804709] other info that might help us debug this:
>  [   29.804709]
>  [   29.804714]
>  [   29.804714] rcu_scheduler_active = 1, debug_locks = 1
>  [   29.804718] 2 locks held by apparmor_parser/1268:
>  [   29.804721]  #0:  (sb_writers#9){.+.+.+}, at: [] 
> file_start_write+0x27/0x29
>  [   29.804733]  #1:  (>lock){+.+.+.}, at: [] 
> aa_replace_profiles+0x166/0x57c
>  [   29.804744]
>  [   29.804744] stack backtrace:
>  [   29.804750] CPU: 0 PID: 1268 Comm: apparmor_parser Not tainted 3.11.0+ #5
>  [   29.804753] Hardware name: ASUSTeK Computer Inc. UL50VT  
> /UL50VT, BIOS 217 03/01/2010
>  [   29.804756]   8800b95a1d80 8144eb9b 
> 8800b94db540
>  [   29.804764]  8800b95a1db0 81087439 8800b95b02b0 
> 
>  [   29.804772]  8800b9efba08 880138cc3a38 8800b95a1dd0 
> 811f4f94
>  [   29.804779] Call Trace:
>  [   29.804786]  [] dump_stack+0x4e/0x82
>  [   29.804791]  [] lockdep_rcu_suspicious+0xfc/0x105
>  [   29.804798]  [] aa_free_replacedby_kref+0x4d/0x62
>  [   29.804804]  [] ? aa_put_namespace+0x17/0x17
>  [   29.804810]  [] kref_put+0x36/0x40
>  [   29.804816]  [] __replace_profile+0x13a/0x1ed
>  [   29.804822]  [] aa_replace_profiles+0x410/0x57c
>  [   29.804829]  [] profile_replace+0x35/0x4c
>  [   29.804835]  [] vfs_write+0xad/0x113
>  [   29.804840]  [] SyS_write+0x44/0x7a
>  [   29.804847]  [] system_call_fastpath+0x16/0x1b
> 
> Reported-by: miles.l...@gmail.com
> CC: paul...@linux.vnet.ibm.com
> Signed-off-by: John Johansen 
> ---
>  security/apparmor/include/policy.h | 4 +++-
>  security/apparmor/policy.c | 3 ++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/security/apparmor/include/policy.h 
> b/security/apparmor/include/policy.h
> index f2d4b63..c28b0f2 100644
> --- a/security/apparmor/include/policy.h
> +++ b/security/apparmor/include/policy.h
> @@ -360,7 +360,9 @@ static inline void aa_put_replacedby(struct aa_replacedby 
> *p)
>  static inline void __aa_update_replacedby(struct aa_profile *orig,
>  

Re: [PATCH v2] ARM: shmobile: ape6evm: fix incorrect placement of __initdata tag

2013-09-30 Thread Simon Horman
On Mon, Sep 30, 2013 at 08:31:05PM +0200, Laurent Pinchart wrote:
> On Monday 30 September 2013 17:34:36 Bartlomiej Zolnierkiewicz wrote:
> > __initdata tag should be placed between the variable name and equal
> > sign for the variable to be placed in the intended .init.data section.
> > 
> > Signed-off-by: Bartlomiej Zolnierkiewicz 
> > Signed-off-by: Kyungmin Park 
> 
> Acked-by: Laurent Pinchart 

Thanks, I will queue this up.

> 
> > ---
> > v2:
> > - use __initdata as it is OK to do it
> > 
> >  arch/arm/mach-shmobile/board-ape6evm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/mach-shmobile/board-ape6evm.c
> > b/arch/arm/mach-shmobile/board-ape6evm.c index 7627385..8954f55 100644
> > --- a/arch/arm/mach-shmobile/board-ape6evm.c
> > +++ b/arch/arm/mach-shmobile/board-ape6evm.c
> > @@ -86,7 +86,7 @@ static struct gpio_keys_button gpio_buttons[] = {
> > GPIO_KEY(KEY_VOLUMEDOWN,329,"S21"),
> >  };
> > 
> > -static struct __initdata gpio_keys_platform_data ape6evm_keys_pdata = {
> > +static struct gpio_keys_platform_data ape6evm_keys_pdata __initdata = {
> > .buttons= gpio_buttons,
> > .nbuttons   = ARRAY_SIZE(gpio_buttons),
> >  };
> -- 
> Regards,
> 
> Laurent Pinchart
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sh" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: applesmc oops in 3.10/3.11

2013-09-30 Thread Guenter Roeck

On 09/30/2013 06:57 PM, Chris Murphy wrote:


On Sep 27, 2013, at 5:33 PM, Guenter Roeck  wrote:


On 09/27/2013 11:03 AM, Chris Murphy wrote:


On Sep 27, 2013, at 11:59 AM, Guenter Roeck  wrote:


On Fri, Sep 27, 2013 at 11:41:42AM -0600, Chris Murphy wrote:


On Sep 27, 2013, at 11:12 AM, Guenter Roeck  wrote:


On Fri, Sep 27, 2013 at 12:21:04PM -0400, Josh Boyer wrote:

On Thu, Sep 26, 2013 at 2:34 AM, Henrik Rydberg  wrote:

This suggests that initialization may be attempted more than once. The key cache
is allocated only once, but the number of keys is read for each attempt.

No idea if that can happen, but if the number of keys can increase after
the first initialization attempt you would have an explanation for the crash.


Good idea, and easy enough to test with the patch below.


Should we apply this patch even though it may not solve the specific problem ?


Yes, why not - it certainly won't hurt. I am running it right now, so
it is at least run-tested.


Again, not sure if the key count can change, but the current code is at the very
least inconsistent, as it keeps reading the key count without updating or
verifying the cache size.


Yes - I agree that the error state is far-fetched, but it is hard to
see any other logical explanation. There is of course always the
possibility that the problem is somewhere else completely.

Proper patch attached.

Thanks,
Henrik

---

 From dedefba9167913c46e1896ce0624e68ffe95d532 Mon Sep 17 00:00:00 2001
From: Henrik Rydberg 
Date: Thu, 26 Sep 2013 08:33:16 +0200
Subject: [PATCH] hwmon: (applesmc) Check key count before proceeding

After reports from Chris and Josh Boyer of a rare crash in applesmc,
Guenter pointed at the initialization problem fixed below. The patch
has not been verified to fix the crash, but should be applied
regardless.

Reported-by: 
Suggested-by: Guenter Roeck 
Signed-off-by: Henrik Rydberg 
---
drivers/hwmon/applesmc.c | 11 ++-
1 file changed, 10 insertions(+), 1 deletion(-)


Thanks for the quick reply.  I'll get this rolled into our kernels soon.


I sent a pull request to Linus, so you should be able to pull it from
the upstream kernel shortly. Would be great to get feedback if the patch
solves the problem (or doesn't).


I'll start running it when it appears in koji. It's very transient, maybe one 
oops per week with lots of (other) testing. I'm not even sure if it happens on 
warm or cold boots or both.


When you do, can you possibly trigger an event based on the warning added
with the patch ? This might help us to identify if the problem fixed
with the patch actually happens.


I don't understand the question. I'm uncertain how to trigger, and also what 
event.



The patch includes a new warning message.

pr_warn("key count changed from %d to %d\n",
s->key_count, count);

It would be great if there would be a means to detect if this message is seen
in a kernel log, because it would show that the potential crash condition
fixed with the patch was actually encountered. This would help us to determine
if we actually fixed the problem or not.

Of course, we'll know if is wasn't fixed if the system still crashes.


Warning message triggered with 3.12.0-0.rc3.git0.1.fc21.x86_64.

[   10.886016] applesmc: key count changed from 261 to 1174405121



Explains the crash, but the new key count is very wrong. 1174405121 = 
0x4601.
Which I guess explains the subsequent memory allocation error in the log.

Henrik, any idea what might be going on ? Is it possible that the previous
command failure leaves some state machine in a bad state ?

Guenter


Attaching new full dmesg to the bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1011719#c11

Chris



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: spinlock contention of files->file_lock

2013-09-30 Thread Eric Dumazet
On Tue, 2013-10-01 at 04:27 +0100, Al Viro wrote:
> On Mon, Sep 30, 2013 at 07:02:23PM -0700, Linus Torvalds wrote:
> 
> > Shouldn't a cmpxchg() in just the dup2 code solve that?
> > 
> > If the old value was NULL, you'd have to repeat and go back and see if
> > the open_fds[] bit had been cleared in the meantime (ie it's NULL not
> > because somebody else is busy installing it, but because somebody just
> > uninstalled it).
> 
> Yechh...  Under ->file_lock (in do_dup2()), hopefully?  Or you'll get
> all kinds of fun with close() thrown into the game, as well...
> 
> > But yeah, I do agree that that sounds nasty and a complication I
> > hadn't even thought about. dup2() does violate our normal "let's
> > pre-allocate the fd slot" rule. Ugh.
> 
> Hell knows...  Descriptor handling *is* pretty well isolated these
> days, so it just might be doable without disrupting the living hell
> out of anything else.  fs/file.c is pretty much it - everything else
> goes through it.

I have a patch mostly working here, and pretty short.

I'll do proper tests before posting it tomorrow.

 fs/fcntl.c  |7 ++-
 fs/file.c   |   17 +
 fs/open.c   |   21 +
 include/linux/fdtable.h |1 +
 4 files changed, 33 insertions(+), 13 deletions(-)

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: spinlock contention of files->file_lock

2013-09-30 Thread Al Viro
On Mon, Sep 30, 2013 at 07:02:23PM -0700, Linus Torvalds wrote:

> Shouldn't a cmpxchg() in just the dup2 code solve that?
> 
> If the old value was NULL, you'd have to repeat and go back and see if
> the open_fds[] bit had been cleared in the meantime (ie it's NULL not
> because somebody else is busy installing it, but because somebody just
> uninstalled it).

Yechh...  Under ->file_lock (in do_dup2()), hopefully?  Or you'll get
all kinds of fun with close() thrown into the game, as well...

> But yeah, I do agree that that sounds nasty and a complication I
> hadn't even thought about. dup2() does violate our normal "let's
> pre-allocate the fd slot" rule. Ugh.

Hell knows...  Descriptor handling *is* pretty well isolated these
days, so it just might be doable without disrupting the living hell
out of anything else.  fs/file.c is pretty much it - everything else
goes through it.

I've enough on my plate at the moment with fs/namespace.c and fs/namei.c,
though, and praying hard fs/inode.c doesn't enter the game.  I _know_
that fs/notify will and I'm not enjoying that for a second.  BTW, has
eparis resurfaced with any fixes for *notify/umount races?  I don't seem
to have anything related in the mailbox, but...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH trivial v4] include/uapi/Kbuild: modify comment to provide summary descriptions for Linux UAPI

2013-09-30 Thread Chen Gang
UAPI is whole Linux kernel API, it is important enough to get summary
descriptions by comments.

In "include/uapi/Kbuild", "Makefile..." and "non-arch..." comments are
historical, at present, they need be improved.

Also add more explanations for "include/uapi/" in "include/uapi/Kbuild".


Signed-off-by: Chen Gang 
---
 include/uapi/Kbuild |   13 +++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/uapi/Kbuild b/include/uapi/Kbuild
index 81d2106..f396115 100644
--- a/include/uapi/Kbuild
+++ b/include/uapi/Kbuild
@@ -1,6 +1,15 @@
 # UAPI Header export list
-# Top-level Makefile calls into asm-$(ARCH)
-# List only non-arch directories below
+#
+# UAPI ("Userspace API") is in "include/uapi/" and "arch/*/include/uapi/",
+# which will be installed to "/usr/include".
+#
+# Linux generic (architecture independent) UAPI is at "include/uapi/".
+# Except "linux/", other current level sub-directories are for compatibility.
+# It will not accept new additional current level sub-directories or files.
+#
+# "Architecture Specific Mechanism" UAPI is at "arch/*/include/uapi/".
+# Each architecture must have "arch/*/include/uapi/asm" sub-directory.
+# If architecture permits, it can content additional sub-directories.
 
 
 header-y += asm-generic/
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-09-30 Thread Paul E. McKenney
On Tue, Oct 01, 2013 at 12:05:03PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-09-30 at 17:56 -0700, Linus Torvalds wrote:
> > On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling  wrote:
> > >
> > > The scary part is that we to make all register volatile.  You were not
> > > that keen on doing this as there are a lot of places in exception
> > > entry/exit where we only save/restore a subset of the registers.  We'd
> > > need to catch all these.
> > 
> > Ugh. It's very possible it's not worth using for the kernel then. The
> > example I posted is normally fine *without* any transactional support,
> > since it's a very local per-dentry lock, and since we only take that
> > lock when the last reference drops (so it's not some common directory
> > dentry, it's a end-point file dentry). In fact, on ARM they just made
> > the cmpxchg much faster by making it entirely non-serializing (since
> > it only updates a reference count, there is no locking involved apart
> > from checking that the lock state is unlocked)

A memory-barrier-free cmpxchg() would be easy on Power as well.

> > So there is basically never any contention, and the transaction needs
> > to basically be pretty much the same cost as a "cmpxchg". It's not
> > clear if the intel TSX is good enough for that, and if you have to
> > save a lot of registers in order to use transactions on POWER8, I
> > doubt it's worthwhile.
> 
> Well we don't have to, I think Mikey wasn't totally clear about that
> "making all registers volatile" business :-) This is just something we
> need to handle in assembly if we are going to reclaim the suspended
> transaction.
> 
> So basically, what we need is something along the lines of
> enable_kernel_tm() which checks if there's a suspended user transaction
> and if yes, kills/reclaims it.
> 
> Then we also need to handle in our interrupt handlers that we have an
> active/suspended transaction from a kernel state, which we don't deal
> with at this point, and do whatever has to be done to kill it... we
> might get away with something simple if we can state that we only allow
> kernel transactions at task level and not from interrupt/softirq
> contexts, at least initially.

Call me a coward, but this is starting to sound a bit scary.  ;-)

> > We have very few - if any - locks where contention or even cache
> > bouncing is common or normal. Sure, we have a few particular loads
> > that can trigger it, but even that is becoming rare. So from a
> > performance standpoint, the target always needs to be "comparable to
> > hot spinlock in local cache".
> 
> I am not quite familiar with the performance profile of our
> transactional hardware. I think we should definitely try to hack
> something together for that dput() case and measure it.
> 
> > >> They also have interesting ordering semantics vs. locks, we need to be
> > >> a tad careful (as long as we don't access a lock variable
> > >> transactionally we should be ok. If we do, then spin_unlock needs a
> > >> stronger barrier).
> > >
> > > Yep.
> > 
> > Well, just about any kernel transaction will at least read the state
> > of a lock. Without that, it's generally totally useless. My dput()
> > example sequence very much verified that the lock was not held, for
> > example.
> > 
> > I'm not sure how that affects anything. The actual transaction had
> > better not be visible inside the locked region (ie as far as any lock
> > users go, transactions better all happen fully before or after the
> > lock, if they read the lock and see it being unlocked).
> > 
> > That said, I cannot see how POWER8 could possibly violate that rule.
> > The whole "transactions are atomic" is kind of the whole and only
> > point of a transaction. So I'm not sure what odd lock restrictions
> > POWER8 could have.
> 
> Has to do with the memory model :-(
> 
> I dug the whole story from my mbox and the situation is indeed as dire
> as feared. If the transaction reads the lock, then the corresponding
> spin_lock must have a full sync barrier in it instead of the current
> lighter one.
> 
> Now I believe we are already "on the fence" with our locks today since
> technically speaking, our unlock + lock sequence is *not* exactly a full
> barrier (it is only if it's the same lock I think)
> 
> CC'ing Paul McKenney here who's been chasing that issue. In the end, we
> might end up having to turn our locks into sync anyway

Well, there have been a lot of fixed software bugs since the last
suspicious sighting, but on the other hand, I am just now getting my
RCU testing going again on Power.  I would certainly feel better
about things if unlock-lock was really truly a full barrier, but
this clearly needs a clean sighting.

> Yay ! The isanity^Wjoy of an OO memory model !

;-) ;-) ;-)

Thanx, Paul

> > > FWIW eg.
> > >
> > >  tbegin
> > >  beq abort /* passes first time through */
> > >  
> > >  transactional stuff
> > >  
> 

Re: applesmc oops in 3.10/3.11

2013-09-30 Thread Chris Murphy

On Sep 27, 2013, at 5:33 PM, Guenter Roeck  wrote:

> On 09/27/2013 11:03 AM, Chris Murphy wrote:
>> 
>> On Sep 27, 2013, at 11:59 AM, Guenter Roeck  wrote:
>> 
>>> On Fri, Sep 27, 2013 at 11:41:42AM -0600, Chris Murphy wrote:
 
 On Sep 27, 2013, at 11:12 AM, Guenter Roeck  wrote:
 
> On Fri, Sep 27, 2013 at 12:21:04PM -0400, Josh Boyer wrote:
>> On Thu, Sep 26, 2013 at 2:34 AM, Henrik Rydberg  
>> wrote:
>> This suggests that initialization may be attempted more than once. 
>> The key cache
>> is allocated only once, but the number of keys is read for each 
>> attempt.
>> 
>> No idea if that can happen, but if the number of keys can increase 
>> after
>> the first initialization attempt you would have an explanation for 
>> the crash.
> 
> Good idea, and easy enough to test with the patch below.
> 
 Should we apply this patch even though it may not solve the specific 
 problem ?
>>> 
>>> Yes, why not - it certainly won't hurt. I am running it right now, so
>>> it is at least run-tested.
>>> 
 Again, not sure if the key count can change, but the current code is 
 at the very
 least inconsistent, as it keeps reading the key count without updating 
 or
 verifying the cache size.
>>> 
>>> Yes - I agree that the error state is far-fetched, but it is hard to
>>> see any other logical explanation. There is of course always the
>>> possibility that the problem is somewhere else completely.
>>> 
>>> Proper patch attached.
>>> 
>>> Thanks,
>>> Henrik
>>> 
>>> ---
>>> 
>>> From dedefba9167913c46e1896ce0624e68ffe95d532 Mon Sep 17 00:00:00 2001
>>> From: Henrik Rydberg 
>>> Date: Thu, 26 Sep 2013 08:33:16 +0200
>>> Subject: [PATCH] hwmon: (applesmc) Check key count before proceeding
>>> 
>>> After reports from Chris and Josh Boyer of a rare crash in applesmc,
>>> Guenter pointed at the initialization problem fixed below. The patch
>>> has not been verified to fix the crash, but should be applied
>>> regardless.
>>> 
>>> Reported-by: 
>>> Suggested-by: Guenter Roeck 
>>> Signed-off-by: Henrik Rydberg 
>>> ---
>>> drivers/hwmon/applesmc.c | 11 ++-
>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>> 
>> Thanks for the quick reply.  I'll get this rolled into our kernels soon.
>> 
> I sent a pull request to Linus, so you should be able to pull it from
> the upstream kernel shortly. Would be great to get feedback if the patch
> solves the problem (or doesn't).
 
 I'll start running it when it appears in koji. It's very transient, maybe 
 one oops per week with lots of (other) testing. I'm not even sure if it 
 happens on warm or cold boots or both.
 
>>> When you do, can you possibly trigger an event based on the warning added
>>> with the patch ? This might help us to identify if the problem fixed
>>> with the patch actually happens.
>> 
>> I don't understand the question. I'm uncertain how to trigger, and also what 
>> event.
>> 
> 
> The patch includes a new warning message.
> 
>   pr_warn("key count changed from %d to %d\n",
>s->key_count, count);
> 
> It would be great if there would be a means to detect if this message is seen
> in a kernel log, because it would show that the potential crash condition
> fixed with the patch was actually encountered. This would help us to determine
> if we actually fixed the problem or not.
> 
> Of course, we'll know if is wasn't fixed if the system still crashes.

Warning message triggered with 3.12.0-0.rc3.git0.1.fc21.x86_64. 

[   10.886016] applesmc: key count changed from 261 to 1174405121

Attaching new full dmesg to the bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1011719#c11

Chris--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCHv8 03/18] thermal: introduce device tree parser

2013-09-30 Thread Eduardo Valentin
This patch introduces a device tree bindings for
describing the hardware thermal behavior and limits.
Also a parser to read and interpret the data and feed
it in the thermal framework is presented.

This patch introduces a thermal data parser for device
tree. The parsed data is used to build thermal zones
and thermal binding parameters. The output data
can then be used to deploy thermal policies.

This patch adds also documentation regarding this
API and how to define tree nodes to use
this infrastructure.

Note that, in order to be able to have control
on the sensor registration on the DT thermal zone,
it was required to allow changing the thermal zone
.get_temp callback. For this reason, this patch
also removes the 'const' modifier from the .ops
field of thermal zone devices.

Cc: Zhang Rui 
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Eduardo Valentin 
---
Hello Mark, folks,

So, here is v8. Pretty small changes. Most are better
English phrasing in the binding document. It follows changelog:
- Several rephrasing in the binding document, including
spelling, grammar and better phrasing.
- Removed WiP warning from binding document.
- Several spacing and formatting changes in the binding document
- Used the '-specifier' nomenclature properly in the binding doc
- Removed the optional property 'thermal-sensors-names', because
it does not provide useful runtime information
- Fixed description of 'polling-delay-passive'
- Improved examples by adding comments and better explanation
- s/fw/framework/g
- Added a WARN when sensors specifiers are greater and 1.

All best,

Eduardo

---
 .../devicetree/bindings/thermal/thermal.txt| 587 ++
 drivers/thermal/Kconfig|  13 +
 drivers/thermal/Makefile   |   1 +
 drivers/thermal/of-thermal.c   | 849 +
 drivers/thermal/thermal_core.c |   9 +-
 drivers/thermal/thermal_core.h |   9 +
 include/dt-bindings/thermal/thermal.h  |  27 +
 include/linux/thermal.h|  28 +-
 8 files changed, 1520 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/thermal/thermal.txt
 create mode 100644 drivers/thermal/of-thermal.c
 create mode 100644 include/dt-bindings/thermal/thermal.h

diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt 
b/Documentation/devicetree/bindings/thermal/thermal.txt
new file mode 100644
index 000..ad06a8d
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/thermal.txt
@@ -0,0 +1,587 @@
+* Thermal Framework Device Tree descriptor
+
+This file describes a generic binding to provide a way of
+defining hardware thermal structure using device tree.
+A thermal structure includes thermal zones and their components,
+such as trip points, polling intervals, sensors and cooling devices
+binding descriptors.
+
+The target of device tree thermal descriptors is to describe only
+the hardware thermal aspects. The thermal device tree bindings are
+not about how the system must control or which algorithm or policy
+must be taken in place.
+
+There are five types of nodes involved to describe thermal bindings:
+- sensors: used to describe the device source of temperature sensing;
+- cooling devices: used to describe devices source of power dissipation 
control;
+- trip points: used to describe points in temperature domain defined to
+make the system aware of hardware limits;
+- cooling maps: used to describe links between trip points and cooling devices;
+- thermal zones: used to describe thermal data within the hardware;
+
+It follows a description of each type of these device tree nodes.
+
+* Thermal sensor devices
+
+Thermal sensor devices are nodes providing temperature sensing capabilities on
+thermal zones. Typical devices are I2C ADC converters and bandgaps. These are
+nodes providing temperature data to thermal zones. Thermal sensor devices may
+control one or more internal sensors.
+
+Required property:
+- #thermal-sensor-cells: Used to provide sensor device specific information
+  Type: unsignedwhile referring to it. Typically 0 on thermal sensor
+  Size: one cellnodes with only one sensor, and at least 1 on nodes
+with several internal sensors, in order
+to identify uniquely the sensor instances within
+the IC. See thermal zone binding for more details
+on how consumers refer to sensor devices.
+
+* Cooling device nodes
+
+Cooling devices are nodes providing control on power dissipation. There
+are essentially two ways to provide control on power dissipation. First
+is by means of regulating device performance, which is known as passive
+cooling. A typical passive cooling is a CPU that has dynamic voltage and
+frequency scaling (DVFS), and uses lower frequencies as cooling states.
+Second 

Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t

2013-09-30 Thread Waiman Long

On 09/30/2013 03:47 PM, Tim Chen wrote:

My qrwlock doesn't enable qrwlock by default. You have to use menuconfig
to explicitly enable it. Have you done that when you build the test
kernel? I am thinking of explicitly enabling it for x86 if the anon-vma
lock is converted back to a rwlock.


Yes, I have explicitly enabled it during my testing.

Thanks.
Tim


Thank for the info.

I had tested Ingo's 2nd patch myself with the qrwlock patch on a 8-node
machine with a 3.12.0-rc2 kernel. The results of AIM7's high_systime
(at 1500 users) were:

Anon-vmas lock  JPM%Change
--  ------
rwsem148265   -
rwlock238715  +61%
qrwlock242048  +63%

So the queue rwlock was only slightly faster in this case. Below were
the perf profile with rwlock:

 18.20%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
  9.36%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  2.91%   reaim  [kernel.kallsyms]  [k] mspin_lock
  2.73%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  2.23%  ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.29%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
  1.21%true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.14%   reaim  [kernel.kallsyms]  [k] zap_pte_range
  1.13%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
  1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner

The perf profile with qrwlock:

 10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
  5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
  2.86%  ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
  1.52%true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
  1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
  1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
  1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
  1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux

In the qrwlock kernel, less time were spent in the rwlock slowpath
path (about half). However, more time were now spent in the spinlock
and mutex spinning. Another observation is that no noticeable idle
time was reported whereas the system could be half idle with rwsem.
There was also a lot less idle balancing activities.

The qrwlock is fair wrt the writers. So its performance may not be
as good as the fully unfair rwlock. However, queuing reduces a lot
of cache contention traffic, thus improving performance. It is the
interplay of these 2 factors that determine how much performance
benefit we can see. Another factor to consider is that when we have
less contention in anon-vmas, other areas of contentions will show up.

With qrwlock, the spinlock contention was:

 10.57%   reaim  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
 |--58.70%-- release_pages
 |--38.42%-- pagevec_lru_move_fn
 |--0.64%-- get_page_from_freelist
 |--0.64%-- __page_cache_release
  --1.60%-- [...]

  2.86%  ls  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--52.73%-- pagevec_lru_move_fn
|--46.25%-- release_pages
 --1.02%-- [...]

  1.52%true  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
  |--53.76%-- pagevec_lru_move_fn
  |--43.95%-- release_pages
  |--1.02%-- __page_cache_release

-Longman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:sched/core] sched/balancing: Fix cfs_rq-> task_h_load calculation

2013-09-30 Thread Paul Turner
On Mon, Sep 30, 2013 at 7:22 PM, Yuanhan Liu
 wrote:
> On Mon, Sep 30, 2013 at 12:14:03PM +0400, Vladimir Davydov wrote:
>> On 09/29/2013 01:47 PM, Yuanhan Liu wrote:
>> >On Fri, Sep 20, 2013 at 06:46:59AM -0700, tip-bot for Vladimir Davydov 
>> >wrote:
>> >>Commit-ID:  7e3115ef5149fc502e3a2e80719dba54a8e7409d
>> >>Gitweb:http://git.kernel.org/tip/7e3115ef5149fc502e3a2e80719dba54a8e7409d
>> >>Author: Vladimir Davydov
>> >>AuthorDate: Sat, 14 Sep 2013 19:39:46 +0400
>> >>Committer:  Ingo Molnar
>> >>CommitDate: Fri, 20 Sep 2013 11:59:39 +0200
>> >>
>> >>sched/balancing: Fix cfs_rq->task_h_load calculation
>> >>
>> >>Patch a003a2 (sched: Consider runnable load average in move_tasks())
>> >>sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
>> >>always 0. This mistype leads to all tasks having weight 0 when load
>> >>balancing in a cpu-cgroup enabled setup. There obviously should be sum
>> >>of weights of all runnable tasks there instead. Fix it.
>> >Hi Vladimir,
>> >
>> >FYI, Here we found a 17% netperf regression by this patch. Here are some
>> >changed stats between this commit 7e3115ef5149fc502e3a2e80719dba54a8e7409d
>> >and it's parent(3029ede39373c368f402a76896600d85a4f7121b)
>>
>> Hello,
>>
>> Could you please report the following info:
>
> Hi Vladimir,
>
> This regression was first found at a 2-core 32 CPU Sandybridge server
> with 64G memory. However, I can't ssh to it now and we are off work
> this week due to holiday. So, sorry, email response may be delayed.
>
> Then I found this regression exists at another atom micro server as
> well. And the following machine and testcase specific info are all from it.
>
> And to not make old data confuse you, here I also update the changed
> stats and corresponding text plot as well in attachment.
>>
>> 1) the test machine cpu topology (i.e. output of 
>> /sys/devices/system/cpu/cpu*/{thread_siblings_list,core_siblings_list})
>
> # grep . 
> /sys/devices/system/cpu/cpu*/topology/{thread_siblings_list,core_siblings_list}
> /sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1
> /sys/devices/system/cpu/cpu1/topology/thread_siblings_list:0-1
> /sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2-3
> /sys/devices/system/cpu/cpu3/topology/thread_siblings_list:2-3
> /sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
> /sys/devices/system/cpu/cpu1/topology/core_siblings_list:0-3
> /sys/devices/system/cpu/cpu2/topology/core_siblings_list:0-3
> /sys/devices/system/cpu/cpu3/topology/core_siblings_list:0-3
>

>> 2) kernel config you used during the test
>
> Attached.
>
>> 3) the output of /sys/kernel/debug/sched_features (debugfs mounted).
>
> # cat /sys/kernel/debug/sched_features
> GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY
> WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_POWER
> TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN NO_NUMA 
> NO_NUMA_FORCE
>
>> 4) netperf server/client options
>
> Here is our testscript we used:
> #!/bin/bash
> # - test
>
> # start netserver
> netserver
>
> sleep 1
>
> for i in $(seq $nr_threads)
> do
> netperf -t $test -c -C -l $runtime &
> done
>
> Where,
>   $test is TCP_SENDFILE,
>   $nr_threads is 8, two times of nr cpu
>   $runtime is 120s
>
>> 5) did you place netserver into a separate cpu cgroup?
>
> Nope.
>


If this is causing a regression I think it actually calls into
question the original series that included a003a25b227d59d.  This
patch only makes h_load not be a nonsense value.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH trivial v3] include/uapi/Kbuild: modify the comments for it

2013-09-30 Thread Chen Gang
On 09/05/2013 09:09 AM, Chen Gang wrote:
> "include/uapi/" is the whole Linux kernel API, it is important enough
> to get more global explanations by comments.
> 
> In "include/uapi/Kbuild", "Makefile..." and "non-arch..." comments are
> meaningless for current 'Kbuild', so delete them.
> 

Hmm...it is better to improve original comments instead of delete them:
Need give some comments for "arch/*/include/uapi/asm" in
"include/uapi/Kbuild". :-)

> And add more explanations for "include/uapi/" in "include/uapi/Kbuild",
> 

"add more explanations" is still need.


I will send patch v4 for it. :-)

Thanks.

> 
> Signed-off-by: Chen Gang 
> ---
>  include/uapi/Kbuild |7 ---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/uapi/Kbuild b/include/uapi/Kbuild
> index 81d2106..287e8d0 100644
> --- a/include/uapi/Kbuild
> +++ b/include/uapi/Kbuild
> @@ -1,7 +1,8 @@
>  # UAPI Header export list
> -# Top-level Makefile calls into asm-$(ARCH)
> -# List only non-arch directories below
> -
> +#
> +# UAPI means "Userspace API" which will be installed to "/usr/include".
> +# Except "linux/", the other current level directories are for compatibility.
> +# It will not accept the new additional current level directories or files.
>  
>  header-y += asm-generic/
>  header-y += linux/
> 


-- 
Chen Gang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: spinlock contention of files->file_lock

2013-09-30 Thread Eric Dumazet
On Mon, 2013-09-30 at 18:44 -0700, Linus Torvalds wrote:
> On Mon, Sep 30, 2013 at 6:05 PM, Eric Dumazet  wrote:
> > Speaking of spinlock contention, the files->file_lock is a good example.
> >
> > Multi threaded programs hit this spinlock three times per fd :
> 
> .. do you actually have programs that see contention?

Yes, Google-Bug-Id 9072743 for Googlers ;)

Basically, we have (many) servers with 10,000,000+ sockets, where this
lock contention is a real problem.

We use a SOCK_FD_FASTALLOC accept4()/socket() flag to relax POSIX
requirements and not having to parse huge bitmap in
find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd)  (O_FD_FASTALLOC
uses a random starting point)

I am writing a patch using cmpxchg() to see if I can solve the dup2()
'problem'


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-09-30 Thread Benjamin Herrenschmidt
On Mon, 2013-09-30 at 17:56 -0700, Linus Torvalds wrote:
> On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling  wrote:
> >
> > The scary part is that we to make all register volatile.  You were not
> > that keen on doing this as there are a lot of places in exception
> > entry/exit where we only save/restore a subset of the registers.  We'd
> > need to catch all these.
> 
> Ugh. It's very possible it's not worth using for the kernel then. The
> example I posted is normally fine *without* any transactional support,
> since it's a very local per-dentry lock, and since we only take that
> lock when the last reference drops (so it's not some common directory
> dentry, it's a end-point file dentry). In fact, on ARM they just made
> the cmpxchg much faster by making it entirely non-serializing (since
> it only updates a reference count, there is no locking involved apart
> from checking that the lock state is unlocked)
> 
> So there is basically never any contention, and the transaction needs
> to basically be pretty much the same cost as a "cmpxchg". It's not
> clear if the intel TSX is good enough for that, and if you have to
> save a lot of registers in order to use transactions on POWER8, I
> doubt it's worthwhile.

Well we don't have to, I think Mikey wasn't totally clear about that
"making all registers volatile" business :-) This is just something we
need to handle in assembly if we are going to reclaim the suspended
transaction.

So basically, what we need is something along the lines of
enable_kernel_tm() which checks if there's a suspended user transaction
and if yes, kills/reclaims it.

Then we also need to handle in our interrupt handlers that we have an
active/suspended transaction from a kernel state, which we don't deal
with at this point, and do whatever has to be done to kill it... we
might get away with something simple if we can state that we only allow
kernel transactions at task level and not from interrupt/softirq
contexts, at least initially.

> We have very few - if any - locks where contention or even cache
> bouncing is common or normal. Sure, we have a few particular loads
> that can trigger it, but even that is becoming rare. So from a
> performance standpoint, the target always needs to be "comparable to
> hot spinlock in local cache".

I am not quite familiar with the performance profile of our
transactional hardware. I think we should definitely try to hack
something together for that dput() case and measure it.

> >> They also have interesting ordering semantics vs. locks, we need to be
> >> a tad careful (as long as we don't access a lock variable
> >> transactionally we should be ok. If we do, then spin_unlock needs a
> >> stronger barrier).
> >
> > Yep.
> 
> Well, just about any kernel transaction will at least read the state
> of a lock. Without that, it's generally totally useless. My dput()
> example sequence very much verified that the lock was not held, for
> example.
> 
> I'm not sure how that affects anything. The actual transaction had
> better not be visible inside the locked region (ie as far as any lock
> users go, transactions better all happen fully before or after the
> lock, if they read the lock and see it being unlocked).
> 
> That said, I cannot see how POWER8 could possibly violate that rule.
> The whole "transactions are atomic" is kind of the whole and only
> point of a transaction. So I'm not sure what odd lock restrictions
> POWER8 could have.

Has to do with the memory model :-(

I dug the whole story from my mbox and the situation is indeed as dire
as feared. If the transaction reads the lock, then the corresponding
spin_lock must have a full sync barrier in it instead of the current
lighter one.

Now I believe we are already "on the fence" with our locks today since
technically speaking, our unlock + lock sequence is *not* exactly a full
barrier (it is only if it's the same lock I think)

CC'ing Paul McKenney here who's been chasing that issue. In the end, we
might end up having to turn our locks into sync anyway

Yay ! The isanity^Wjoy of an OO memory model !

> > FWIW eg.
> >
> >  tbegin
> >  beq abort /* passes first time through */
> >  
> >  transactional stuff
> >  
> >  tend
> >  b pass
> >
> > abort:
> >
> > pass:
> 
> That's fine, and matches the x86 semantics fairly closely, except
> "xbegin" kind of "contains" that "jump to abort address". But we could
> definitely use the same models. Call it
> "transaction_begin/abort/end()", and it should be architecture-neutral
> naming-wise.

Right.

> Of course, if tbegin then acts basically like some crazy
> assembly-level setjmp (I'm guessing it does exactly, and presumably
> precisely that kind of compiler support - ie a function with
> "__attribute((returns_twice))" in gcc-speak), the overhead of doing it
> may kill it.

Well, all the registers are checkpointed so we *should* be ok but gcc
always makes me nervous in those circumstances ...

Ben.


> 

Re: spinlock contention of files->file_lock

2013-09-30 Thread Linus Torvalds
On Mon, Sep 30, 2013 at 6:53 PM, Al Viro  wrote:
>
> The problem is dup2()

Shouldn't a cmpxchg() in just the dup2 code solve that?

If the old value was NULL, you'd have to repeat and go back and see if
the open_fds[] bit had been cleared in the meantime (ie it's NULL not
because somebody else is busy installing it, but because somebody just
uninstalled it).

But yeah, I do agree that that sounds nasty and a complication I
hadn't even thought about. dup2() does violate our normal "let's
pre-allocate the fd slot" rule. Ugh.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [net-next PATCH V2] virtio-net: switch to use XPS to choose txq

2013-09-30 Thread Rusty Russell
"Michael S. Tsirkin"  writes:
> On Mon, Sep 30, 2013 at 03:37:17PM +0800, Jason Wang wrote:
>> We used to use a percpu structure vq_index to record the cpu to queue
>> mapping, this is suboptimal since it duplicates the work of XPS and
>> loses all other XPS functionality such as allowing use to configure
>> their own transmission steering strategy.
>> 
>> So this patch switches to use XPS and suggest a default mapping when
>> the number of cpus is equal to the number of queues. With XPS support,
>> there's no need for keeping per-cpu vq_index and .ndo_select_queue(),
>> so they were removed also.
>> 
>> Cc: Rusty Russell 
>> Cc: Michael S. Tsirkin 
>> Signed-off-by: Jason Wang 
>
> Acked-by: Michael S. Tsirkin 

Acked-by: Rusty Russell 

Dave, please apply.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: spinlock contention of files->file_lock

2013-09-30 Thread Al Viro
On Mon, Sep 30, 2013 at 06:05:03PM -0700, Eric Dumazet wrote:
> Speaking of spinlock contention, the files->file_lock is a good example.
> 
> Multi threaded programs hit this spinlock three times per fd : 
> 
> alloc_fd() / fd_install() / close() 
> 
> I think fd_install() could be done without this spinlock.

The problem is dup2(), not table resize...  I'm not saying it can't be
done, but it won't be trivial and you seem to be looking for potential
trouble in the wrong place.  dup2() semantics is the real bitch.
What we want is
* dup2() over unused descriptor => insert and be done with that
* dup2() over opened one => replace and do equivalent of close()
for what had been there before
The trouble is, what to do with dup2() over reserved, but still not
installed descriptor?  We handle that as -EBUSY and we use ->file_lock
to get atomicity wrt transitions between those states.

And yes, dup2() is a lousy API in multithreaded situation.  It's still
there...  See comments in do_dup2() for some amusement.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: spinlock contention of files->file_lock

2013-09-30 Thread Linus Torvalds
On Mon, Sep 30, 2013 at 6:05 PM, Eric Dumazet  wrote:
> Speaking of spinlock contention, the files->file_lock is a good example.
>
> Multi threaded programs hit this spinlock three times per fd :

.. do you actually have programs that see contention?

> alloc_fd() / fd_install() / close()
>
> I think fd_install() could be done without this spinlock.

Yes, I haven't thought much about it, but from a quick look it should
be trivial to do fd_install(). We already free the fdtable using rcu
(because we look things up without locking), so we could just get the
rcu read-lock, do fd_install() without any locking at all, and then
verify that the fd-table is still the same. Rinse and repeat if not.

> - Add a seqcount, and cmpxchg() to synchronize fd_install() with the
> potential resizer. (Might need additional RCU locking)

I don't think you even need that. alloc_fd() has already reserved the
spot, so I think you really can just assign without any fancy cmpxchg
at all. If you write to the old fdtable, who cares? Probably just
something like

   do {
  fdt = files_fdtable(files);
  rcu_assign_pointer(fdt->fd[fd], file);
  smp_mb();
   } while (fdt != files_fdtable(files));

or similar. Under the RCU lock to guarantee the allocations don't
disappear from under you, but with no other locking.

Maybe I'm missing something, but it really doesn't look that bad.

Now, that only gets rid of fd_install(), but I suspect you could do
something similar for put_unused_fd() (that one does need cmpxchg for
the "next_fd" thing, though). We'd have to replace the non-atomic
bitops on open_fds[] with atomic ones, just to make sure adjacent bit
clearings don't screw up concurrent adjacent bit values, but that
looks fairly straightforward too.

Now, getting rid of the spinlock for alloc_fd() would be more work,
and you'd likely want to keep it for actually resizing (just to keep
that case more straightforward), but you could probably do the
non-resizing cases fairly easily there too without holding the lock.
Scan for a free bit optimistically and without any locking, and then
be somewhat careful with setting that open_fds[] bit - not only using
an atomic test_and_set() for it, but also do the same "let's check
that the fdtable base pointers haven't changed in the meantime and
repeat".

On the whole the fd table looks like if it really is a contention
point, it would be fairly straightforward to fix without any huge
contortions. Sure, lots of care and small details to look out for, but
nothing looks really all that fundamentally difficult.

But is it really a contention point on any real load?

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] staging: vt6655: don't leak when returning -EOPNOTSUPP in vt6655_hostap_ioctl

2013-09-30 Thread Greg Kroah-Hartman
On Mon, Sep 30, 2013 at 10:22:11PM +0200, Jesper Juhl wrote:
> Make sure we always free(param); and remove a redundant "goto out;"
> just before we'll hit the label anyway.
> 
> Signed-off-by: Jesper Juhl 
> ---
>  drivers/staging/vt6655/hostap.c |   25 ++---
>  1 files changed, 10 insertions(+), 15 deletions(-)

Applied, thanks.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Staging: vt6655: 80211mgr: Cleanup of brace coding style issues

2013-09-30 Thread Greg KH
On Sat, Sep 28, 2013 at 09:03:24PM +0200, Martin Berglund wrote:
> Cleanup of a few brace coding style issues.
> 
> Signed-off-by: Martin Berglund 
> ---
>  drivers/staging/vt6655/80211mgr.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)

Applied, thanks.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: correct missing private flag clearing

2013-09-30 Thread Joonsoo Kim
On Mon, Sep 30, 2013 at 02:35:14PM -0700, Andrew Morton wrote:
> On Mon, 30 Sep 2013 16:59:44 +0900 Joonsoo Kim  wrote:
> 
> > We should clear the page's private flag when returing the page to
> > the page allocator or the hugepage pool. This patch fixes it.
> > 
> > Signed-off-by: Joonsoo Kim 
> > ---
> > Hello, Andrew.
> > 
> > I sent the new version of commit ('07443a8') before you did pull request,
> > but it isn't included. It may be losted :)
> > So I send this fix. IMO, this is good for v3.12.
> > 
> > Thanks.
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index b49579c..691f226 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -653,6 +653,7 @@ static void free_huge_page(struct page *page)
> > BUG_ON(page_count(page));
> > BUG_ON(page_mapcount(page));
> > restore_reserve = PagePrivate(page);
> > +   ClearPagePrivate(page);
> >  
> 
> You describe it as a fix, but what does it fix?  IOW, what are the
> user-visible effects of the change?
> 
> update_and_free_page() already clears PG_private, but afaict the bit
> remains unaltered if free_huge_page() takes the enqueue_huge_page()
> route.

Yes, you are right.
I attach another version having more explanation.
Please refer this and merge it.

Thanks.
->8-
>From 9d2ead1800de6e1192824e11dafe68bed7b78cce Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 30 Sep 2013 11:51:54 +0900
Subject: [PATCH] mm, hugetlb: correct missing private flag clearing

We should clear the page's private flag when returing the page to
the hugepage pool. Otherwise, marked hugepage can be allocated to
the user who tries to allocate the non-reserved hugepage. If this user
fail to map this hugepage, he would try to return the page to the
hugepage pool. Since this page has a private flag, resv_huge_pages would
mistakenly increase. This patch fixes this situation.

Signed-off-by: Joonsoo Kim 

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b49579c..691f226 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -653,6 +653,7 @@ static void free_huge_page(struct page *page)
BUG_ON(page_count(page));
BUG_ON(page_mapcount(page));
restore_reserve = PagePrivate(page);
+   ClearPagePrivate(page);
 
spin_lock(_lock);
hugetlb_cgroup_uncharge_page(hstate_index(h),
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


spinlock contention of files->file_lock

2013-09-30 Thread Eric Dumazet
Speaking of spinlock contention, the files->file_lock is a good example.

Multi threaded programs hit this spinlock three times per fd : 

alloc_fd() / fd_install() / close() 

I think fd_install() could be done without this spinlock.

we want to protect fd_install() from a concurrent resize of the fd
table.

We could either :

- Add a spinlock used for resize, and fd_install().
  A bit suboptimal but easy to code and review.

- Add a seqcount, and cmpxchg() to synchronize fd_install() with the
potential resizer. (Might need additional RCU locking)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-09-30 Thread Linus Torvalds
On Mon, Sep 30, 2013 at 5:36 PM, Michael Neuling  wrote:
>
> The scary part is that we to make all register volatile.  You were not
> that keen on doing this as there are a lot of places in exception
> entry/exit where we only save/restore a subset of the registers.  We'd
> need to catch all these.

Ugh. It's very possible it's not worth using for the kernel then. The
example I posted is normally fine *without* any transactional support,
since it's a very local per-dentry lock, and since we only take that
lock when the last reference drops (so it's not some common directory
dentry, it's a end-point file dentry). In fact, on ARM they just made
the cmpxchg much faster by making it entirely non-serializing (since
it only updates a reference count, there is no locking involved apart
from checking that the lock state is unlocked)

So there is basically never any contention, and the transaction needs
to basically be pretty much the same cost as a "cmpxchg". It's not
clear if the intel TSX is good enough for that, and if you have to
save a lot of registers in order to use transactions on POWER8, I
doubt it's worthwhile.

We have very few - if any - locks where contention or even cache
bouncing is common or normal. Sure, we have a few particular loads
that can trigger it, but even that is becoming rare. So from a
performance standpoint, the target always needs to be "comparable to
hot spinlock in local cache".

>> They also have interesting ordering semantics vs. locks, we need to be
>> a tad careful (as long as we don't access a lock variable
>> transactionally we should be ok. If we do, then spin_unlock needs a
>> stronger barrier).
>
> Yep.

Well, just about any kernel transaction will at least read the state
of a lock. Without that, it's generally totally useless. My dput()
example sequence very much verified that the lock was not held, for
example.

I'm not sure how that affects anything. The actual transaction had
better not be visible inside the locked region (ie as far as any lock
users go, transactions better all happen fully before or after the
lock, if they read the lock and see it being unlocked).

That said, I cannot see how POWER8 could possibly violate that rule.
The whole "transactions are atomic" is kind of the whole and only
point of a transaction. So I'm not sure what odd lock restrictions
POWER8 could have.

> FWIW eg.
>
>  tbegin
>  beq abort /* passes first time through */
>  
>  transactional stuff
>  
>  tend
>  b pass
>
> abort:
>
> pass:

That's fine, and matches the x86 semantics fairly closely, except
"xbegin" kind of "contains" that "jump to abort address". But we could
definitely use the same models. Call it
"transaction_begin/abort/end()", and it should be architecture-neutral
naming-wise.

Of course, if tbegin then acts basically like some crazy
assembly-level setjmp (I'm guessing it does exactly, and presumably
precisely that kind of compiler support - ie a function with
"__attribute((returns_twice))" in gcc-speak), the overhead of doing it
may kill it.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-09-30 Thread Michael Neuling
Ben,

> On Mon, 2013-09-30 at 12:29 -0700, Linus Torvalds wrote:
> > 
> > But I'm cc'ing the POWER people, because I don't know the POWER8
> > interfaces, and I don't want to necessarily call this "xbegin"/"xend"
> > when I actually wrap it in some helper functions.
> 
> The main problem with powerpc TM is that we need to handle interrupts
> happening while in transactional state. We currently only handle that
> for userspace.

Yep.  

> Mikey, how hard would it be to detect the in-kernel TM case and just
> simulate an abort in the exception entry ? (return to tbegin basically)

It's doable.

The scary part is that we to make all register volatile.  You were not
that keen on doing this as there are a lot of places in exception
entry/exit where we only save/restore a subset of the registers.  We'd
need to catch all these.

> Transactions in kernel make me nervous because of the PC jumping
> around on aborts and how easy we can get that stuff wrong :-) 

The same applies for userspace.  We are pretty careful not to screw that
up though.

It's also one of the reason we don't do software rollback of userspace
transactions even in transactional (non-suspended) mode.  We always save
and restore all state and let the HW deal with the PC and state jumping
around.

> They also have interesting ordering semantics vs. locks, we need to be
> a tad careful (as long as we don't access a lock variable
> transactionally we should be ok. If we do, then spin_unlock needs a
> stronger barrier).

Yep.

> The basic semantic for us is tbegin. [...] tend instructions. If the
> transaction fails, control returns to tbegin. (can happen at any
> point) which returns a CC code indicating success or failure.

FWIW eg.

 tbegin
 beq abort /* passes first time through */
 
 transactional stuff
 
 tend
 b pass
 
abort:

pass:

> Things get really complicated if you take an interrupt however, the
> transaction gets into some special "suspended" state, it doesn't just
> die so we need to handle things in our interrupt entry (even if it's
> just to make the transaction abort cleanly) and right now we don't
> when coming from kernel space.

Yep, but we could check easily enough.

Mikey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: coretemp.ko not auto-loaded on 3.11.x

2013-09-30 Thread Brian Gitonga Marete
On Tue, Oct 1, 2013 at 3:16 AM, Guenter Roeck  wrote:
> On 09/30/2013 03:29 PM, Brian Gitonga Marete wrote:
>>
>> On Tue, Oct 1, 2013 at 12:25 AM, Brian Gitonga Marete
>>  wrote:
>>>
>>> Hello all,
>>>
>>> Some time ago, an enhancement was made to the kernel that caused
>>> coretemp.ko (as well as other cpu-specific drivers) to be auto-loaded.
>>>
>>> This has broken for me in 3.11.x. It works fine on 3.10.x.
>>>
>>> Please see the attached kernel config and let me know if I left out
>>> something notwithstanding doing a `sillentoldconfig' from my 3.10.x
>>> config, which does not exhibit the problem.
>>>
>>
>> Some further information which upon some research, may be useful:
>>
>> Kernel is 3.11.2 (But misbehavior first seen with 3.11.0)
>>
>> $ cat /sys/devices/system/cpu/modalias
>>
>> x86cpu:vendor::family:0006:model:0025:feature:,,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,007C,0080,0082,0083,0084,0085,0087,0088,0089,008D,008E,008F,0091,0093,0094,0097,00C0,00E0,00E1,00E7,0100,0101,0102,0103,0104
>>
>> $ modinfo coretemp
>> filename:   /lib/modules/3.11.2/kernel/drivers/hwmon/coretemp.ko
>> license:GPL
>> description:Intel Core temperature monitor
>> author: Rudolf Marek 
>> srcversion: 31B63A96FD9FD13CDDBDF32
>> alias:  x86cpu:vendor::family:*:model:*:feature:*00E7*
>> depends:
>> intree: Y
>> vermagic:   3.11.2 SMP mod_unload modversions
>> parm:   tjmax:TjMax value in degrees Celsius (int)
>>
>> I have UDEV 175 from Ubuntu 12.04.
>>
>> The module alias seems to match that in cat
>> /sys/devices/system/cpu/modalias, so udev should be able to probe for
>> and load coretemp.ko, but that is not happening, which is strange to
>> me.
>>
>> I also note that microcode.ko, which is handled by the same mechanism
>> for auto-loading based on CPUID, is no longer auto-loaded in kernel
>> 3.11.x.
>>
>> I have added Thomas Renninger and Greg Kroah-Hartman to the thread CC.
>>
>
> See
>
> http://lists.lm-sensors.org/pipermail/lm-sensors/2013-August/039745.html
>
> and
>
> http://lists.opensuse.org/opensuse-bugs/2013-08/msg01676.html
>
> as well as
>
> http://cgit.freedesktop.org/systemd/systemd/commit?id=bf7f800f2b3e93ccd1229d4717166f3a4d3af72f
>
> In summary, you'll need a new udev rule.
>
> Guenter
>

Ah, that makes perfect sense. Thanks, Guenter.

-- 
Brian Gitonga Marete
CEO/CTO Toshnix Systems
http://toshnix.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf session: Fix infinite loop on invalid perf.data file

2013-09-30 Thread Sonny Rao
On Mon, Sep 30, 2013 at 6:49 AM, David Ahern  wrote:
> On 9/30/13 2:19 AM, Namhyung Kim wrote:
>>
>> From: Namhyung Kim 
>>
>> perf-record updates the header in the perf.data file at termination.
>> Without this update perf-report (and other processing built-ins) it
>> caused an infinite loop when perf report (or something like) called.
>>
>> This is because the algorithm in __perf_session__process_events()
>> depends on the data_size which is read from file header.  Use file
>> size directly instead in this case to do the best-effort processing.
>>
>> Cc: David Ahern 
>> Cc: Sonny Rao 
>> Signed-off-by: David Ahern 
>> Signed-off-by: Namhyung Kim 
>
>
> worked ok for me. Sonny can you verify?

Yes, it works for me as well, thanks!


> David
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: coretemp.ko not auto-loaded on 3.11.x

2013-09-30 Thread Guenter Roeck

On 09/30/2013 03:29 PM, Brian Gitonga Marete wrote:

On Tue, Oct 1, 2013 at 12:25 AM, Brian Gitonga Marete
 wrote:

Hello all,

Some time ago, an enhancement was made to the kernel that caused
coretemp.ko (as well as other cpu-specific drivers) to be auto-loaded.

This has broken for me in 3.11.x. It works fine on 3.10.x.

Please see the attached kernel config and let me know if I left out
something notwithstanding doing a `sillentoldconfig' from my 3.10.x
config, which does not exhibit the problem.



Some further information which upon some research, may be useful:

Kernel is 3.11.2 (But misbehavior first seen with 3.11.0)

$ cat /sys/devices/system/cpu/modalias
x86cpu:vendor::family:0006:model:0025:feature:,,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,007C,0080,0082,0083,0084,0085,0087,0088,0089,008D,008E,008F,0091,0093,0094,0097,00C0,00E0,00E1,00E7,0100,0101,0102,0103,0104

$ modinfo coretemp
filename:   /lib/modules/3.11.2/kernel/drivers/hwmon/coretemp.ko
license:GPL
description:Intel Core temperature monitor
author: Rudolf Marek 
srcversion: 31B63A96FD9FD13CDDBDF32
alias:  x86cpu:vendor::family:*:model:*:feature:*00E7*
depends:
intree: Y
vermagic:   3.11.2 SMP mod_unload modversions
parm:   tjmax:TjMax value in degrees Celsius (int)

I have UDEV 175 from Ubuntu 12.04.

The module alias seems to match that in cat
/sys/devices/system/cpu/modalias, so udev should be able to probe for
and load coretemp.ko, but that is not happening, which is strange to
me.

I also note that microcode.ko, which is handled by the same mechanism
for auto-loading based on CPUID, is no longer auto-loaded in kernel
3.11.x.

I have added Thomas Renninger and Greg Kroah-Hartman to the thread CC.



See

http://lists.lm-sensors.org/pipermail/lm-sensors/2013-August/039745.html

and

http://lists.opensuse.org/opensuse-bugs/2013-08/msg01676.html

as well as

http://cgit.freedesktop.org/systemd/systemd/commit?id=bf7f800f2b3e93ccd1229d4717166f3a4d3af72f

In summary, you'll need a new udev rule.

Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xpad input driver: Xbox 360 Wireless Adapter

2013-09-30 Thread Zachary Lund
I apologize for poor assumptions and lack of general knowledge
concerning what I'm talking about. However, I feel I can still help on
the subject.

As to what device I'm talking about, I'm talking about the more
properly termed "Xbox 360 Wireless Gaming Reciever". More information
and a picture can be found here:
http://www.microsoft.com/games/en-US/Hardware/Controllers/Pages/XboxWirelessGamingReceiverforWindows.aspx

Future references will refer the above device as "wireless reciever"
and the opposing wired controller that requires no reciever as "wired
controller".
When I refer to the "xpad driver", I mean the USB driver sitting here:
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/drivers/input/joystick/xpad.c

To be clear, the wireless receiver connects to a USB port in the PC
and interacts wirelessly with any Xbox 360 controller that can connect
to the Xbox 360 console (but the driver doesn't need to know this).
When a controller is "synced" with the wireless receiver, it sends
events like normal over the corresponding USB interface via interrupt
endpoints.

>> Second, the driver acts strangely when setting the LED. It calls
>> xpad_send_led_command during xpad_led_probe during xpad_probe but
>> there's a chance that a controller might not even be connected if
>> using the wireless adapter during that time!
>
>What? During xpad_probe() a device must be fully functional. What
>adapter are your talking about?

I apologize again for not explaining well enough. When the wireless
receiver is connected, all 8 interfaces are probed immediately but a
wireless controller may not actually be synced with the wireless
receiver. Even if it were, the current method the xpad driver takes
doesn't seem to work on the wireless receiver and leaves the
controllers LED in a default "blinking" state.

>> The only way to seemingly
>> tell if a controller is connected is by receiving the correct
>> connection packets. If I use the correct packet structure (which I
>> ripped almost directly from xboxdrv) and set the led after parsing a
>> connection packet, the LED seemingly works fine!
>
>Sounds reasonable. Do all devices send the connection-packets? If yes,
>feel free to send a patch which moves LED initialization after receipt
>of this package.

My inexperience comes into play here probably so take what I say with
a grain of salt please. Whenever a controller is synced with the
wireless reciever, the wireless receiver sends an interrupt packet
containing 0x08 0x80 to the driver to say that a new controller is
connected (which corresponds to the USB interface it was sent on,
explained below) or 0x08 0x00 to say that it has disconnected. I
believe there's already a patch available (not created by me) after
further diving in the mailing list, although I cannot confirm whether
it works or not as I've not tested it myself:
https://lkml.org/lkml/2012/11/30/558

To further explain, there are differences between the wireless
receiver and wired controller concerning how it works. The wired
controller seems to have only 4 USB interfaces (according to
http://www.free60.org/GamePad) whereas the wireless receiver has 8. I
cannot explain what the interfaces for the wired controller do but for
the wireless receiver, all odd interfaces seem to deal with input
events, LEDs, and anything to do with the actual controller. So
interfaces 1, 3, 5, and 7 can represent a controller. All even
interfaces (0, 2, 4, 6) seem to represent something else, probably a
headset which I can't confirm or test. Either way, the driver doesn't
support whatever these interfaces do so should xpad actually claim
them? The patch mentioned above seems to also remove the out bulk
endpoint sense it doesn't seem to be useful to any of the Xbox 360
controllers including the wired ones. A lot of things I don't know
myself. I apologize if I just confused the matter.

>> Third, I'm incredibly new to really low level development. Whenever
>> loading the module, it finds my wireless adapter but then creates 4
>> devices (which seems to mean only 4 controllers are allowed per
>> wireless adapter), each of which cause a call to xpad_probe. I
>> couldn't figure out how to tell if other wireless controllers were
>> already connected to the wireless adapter so I could light up the
>> correct LED. How would I go about this properly?
>
>Ugh? Sorry, but I don't understand what kind of wireless adapter this
>is? Please give us a bit more information here. If the device is a
>Bluetooth-device, why use an adapter at all?

Please ignore... I did not properly understand how USB devices work
(rather, I'm still learning) when writing this question and its poorly
written which just confused the topic if I haven't done so more in
previous comments.

On Mon, Sep 30, 2013 at 1:49 PM, David Herrmann  wrote:
> Hi
>
> I'm not very familiar with the xbox-gamepad driver, but please see
> below for some comments:
>
> On Mon, Sep 23, 2013 at 2:34 AM, Zachary Lund  wrote:
>> I'm 

Re: [PATCH v2 0/5] ARM: shmobile: CPUFreq support on Lager

2013-09-30 Thread Simon Horman
Hi Guennadi,

On Mon, Sep 30, 2013 at 05:25:42PM +0200, Guennadi Liakhovetski wrote:
> Hi Simon,
> 
> Just a quick check of the status of this patch series:
> 
> On Thu, 26 Sep 2013, Guennadi Liakhovetski wrote:
> 
> > On Lager a da9210 regulator is used to supply DVFS power to the SoC. This
> > patch series adds I2C and Z-clock support on r8a7790 and CPUFreq support
> > for the Lager board. Please note, that patch 4/4 is an RFC as explained
> > inline.
> > 
> > Developed on top of today's
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas.gitdevel
> > 
> > plus i2c-rcar patches from my earlier patch series
> > 
> > [PATCH v2 0/5] i2c: rcar: Device Tree support and clock improvements
> > 
> > and the "sh-pfc: r8a7790: Add I2C pin groups and functions" patch from
> > Ulrich Hecht.
> > 
> > All changes since v1 are noted in respective patches. They all are trivial,
> > still as soon as test hardware is available, it would be good to run-time
> > test this version too.
> > 
> > Cc: Guennadi Liakhovetski 
> > 
> > Guennadi Liakhovetski (5):
> >   pinctrl: sh-pfc: r8a7790: add pin definitions for the I2C3 interface
> 
> Laurent has taken this one.
> 
> >   ARM: shmobile: r8a7790: add I2C clocks and aliases for the DT mode
> >   ARM: shmobile: r8a7790: add I2C DT nodes
> 
> You've applied the above two too, thanks.
> 
> >   ARM: shmobile: r8a7790: add CPUFreq clock support
> 
> Do I see it right, that you're waiting for a review for the above one?

Yes, sorry for not being clearer.
I would like a review of this.

> >   ARM: shmobile: lager: (DEVEL) add CPUFreq support
> 
> Once a review of #4/5 is posted, I can split the above patch into two to 
> separate shmobile and DT parts, as requested by Sergei. Is that about what 
> you're expecting too?

Again, sorry for not being clearer. Yes, I agree with Sergei.
I would like the patch split into a DT patch and a board patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V3 net-next] fib_trie: avoid a redundant bit judgement in inflate

2013-09-30 Thread baker . kernel
From: "baker.zhang" 

Because 'node' is the i'st child of 'oldnode',
thus, here 'i' equals
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits)

we just get 1 more bit,
and need not care the detail value of this bits.

I apologize for the mistake.

I generated the patch on a branch version,
and did not notice the put_child has been changed.

I have redone the test on HEAD version with my patch.

two cases are used.
case 1. inflate a node which has a leaf child node.
case 2: inflate a node which has a an child node with skipped bits

test env:
  ip link set eth0 up
  ip a add dev eth0 192.168.11.1/32
here, we just focus on route table(MAIN),
so I use a "192.168.11.1/32" address to simplify the test case.

call trace:
+ fib_insert_node
+ + trie_rebalance
+ + + resize
+ + + + inflate

Test case 1:  inflate a node which has a leaf child node.
===
step 1. prepare a fib trie
--
  ip r a 192.168.0.0/24 via 192.168.11.1
  ip r a 192.168.1.0/24 via 192.168.11.1

we get a fib trie.
root@baker:~# cat /proc/net/fib_trie
Main:
  +-- 192.168.0.0/23 1 0 0
   |-- 192.168.0.0
/24 universe UNICAST
   |-- 192.168.1.0
/24 universe UNICAST
Local:
.

step 2. Add the third route
--
root@baker:~# ip r a 192.168.2.0/24 via 192.168.11.1

A fib_trie leaf will be inserted in fib_insert_node before trie_rebalance.

For function 'inflate':
'inflate' is called with following trie.
  +-- 192.168.0.0/22 1 1 0 <=== tn node
+-- 192.168.0.0/23 1 0 0<== node a
|-- 192.168.0.0
  /24 universe UNICAST
|-- 192.168.1.0
  /24 universe UNICAST
  |-- 192.168.2.0  <== leaf(node b)

When process node b, which is a leaf. here:
i is 1,
node key "192.168.2.0"
oldnode is (pos:22, bits:1)

unpatch source:
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,2,0", 22 + 1, 1)

thus got 0, and call put_child(tn, 2*i, node); <== 2*i=2.

patched source:
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1),
tkey_extract_bits("192.168,2,0", 22, 1 + 1)  <== get 2.

Test case 2:  inflate a node which has a an child node with skipped bits
==
step 1. prepare a fib trie.
  ip link set eth0 up
  ip a add dev eth0 192.168.11.1/32
  ip r a 192.168.128.0/24 via 192.168.11.1
  ip r a 192.168.0.0/24  via 192.168.11.1
  ip r a 192.168.16.0/24   via 192.168.11.1
  ip r a 192.168.32.0/24  via 192.168.11.1
  ip r a 192.168.48.0/24  via 192.168.11.1
  ip r a 192.168.144.0/24   via 192.168.11.1
  ip r a 192.168.160.0/24   via 192.168.11.1
  ip r a 192.168.176.0/24   via 192.168.11.1

check:
root@baker:~# cat /proc/net/fib_trie
Main:
  +-- 192.168.0.0/16 1 0 0
 +-- 192.168.0.0/18 2 0 0
|-- 192.168.0.0
   /24 universe UNICAST
|-- 192.168.16.0
   /24 universe UNICAST
|-- 192.168.32.0
   /24 universe UNICAST
|-- 192.168.48.0
   /24 universe UNICAST
 +-- 192.168.128.0/18 2 0 0
|-- 192.168.128.0
   /24 universe UNICAST
|-- 192.168.144.0
   /24 universe UNICAST
|-- 192.168.160.0
   /24 universe UNICAST
|-- 192.168.176.0
   /24 universe UNICAST
Local:
  ...

step 2. add a route to trigger inflate.
  ip r a 192.168.96.0/24   via 192.168.11.1

This command will call serveral times inflate.
In the first time, the fib_trie is:

+-- 192.168.128.0/(16, 1) <== tn node
 +-- 192.168.0.0/(17, 1)  <== node a
  +-- 192.168.0.0/(18, 2)
   |-- 192.168.0.0
   |-- 192.168.16.0
   |-- 192.168.32.0
   |-- 192.168.48.0
  |-- 192.168.96.0
 +-- 192.168.128.0/(18, 2) <== node b.
  |-- 192.168.128.0
  |-- 192.168.144.0
  |-- 192.168.160.0
  |-- 192.168.176.0

NOTE: node b is a interal node with skipped bits.
here,
i:1,
node->key "192.168.128.0",
oldnode:(pos:16, bits:1)
so
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16 + 1, 1) <=== 0

tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16, 1+1) <=== 2

2*i + 0 == 2, so the result is same.

Signed-off-by: baker.zhang 
---
 net/ipv4/fib_trie.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3df6d3e..45c74ba 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -762,12 +762,9 @@ static struct tnode *inflate(struct trie *t, struct tnode 
*tn)
 
if (IS_LEAF(node) || ((struct tnode *) node)->pos >
   tn->pos + tn->bits - 1) {
-   if (tkey_extract_bits(node->key,
- oldtnode->pos + oldtnode->bits,
- 1) == 0)
- 

Re: IRQ affinity notifiers vs RT

2013-09-30 Thread Ben Hutchings
On Mon, 2013-09-23 at 14:58 +0200, Sebastian Andrzej Siewior wrote:
> On 08/30/2013 11:29 PM, Ben Hutchings wrote:
> > Sebastian, I saw you came up with a fix for this but apparently without
> > seeing my earlier message:
> 
> Yes Ben, I haven't seen it. If I was on Cc I very sorry for overlooking
> it.

You weren't, as I didn't realise you were maintaining the RT patch set
then.

> > On Thu, 2013-07-25 at 00:31 +0100, Ben Hutchings wrote:
> 
> >> Workqueue code uses spin_lock_irq() on the workqueue lock, which with
> >> PREEMPT_RT enabled doesn't actually block IRQs.
> >>
> >> In 3.6, the irq_cpu_rmap functions relies on a workqueue flush to finish
> >> any outstanding notifications before freeing the cpu_rmap that they use.
> >> This won't be reliable if the notification is scheduled after releasing
> >> the irq_desc lock.
> >>
> >> However, following commit 896f97ea95c1 ('lib: cpu_rmap: avoid flushing
> >> all workqueues') in 3.8, I think that it is sufficient to do only
> >> kref_get(>affinity_notify->kref) in __irq_set_affinity_locked()
> >> and then call schedule_work() in irq_set_affinity() after releasing the
> >> lock.  Something like this (untested):
> > 
> > Does the following make sense to you?
> 
> This was suggested by the original submitter on rt-users@v.k.o (Joe
> Korty) where I've been made aware of this for the first time. This okay
> except for the part where the workqueue is not scheduled if calling by
> the __ function (i.e. the mips case). If I read the code correctly, the
> CPU goes offline and affinity change should be updated / users notified
> and this is not the case with this patch.
> 
> It is a valid question why only one mips SoC needs the function. If you
> look at commit 0c3263870f ("MIPS: Octeon: Rewrite interrupt handling
> code.") you can see that tglx himself made this adjustment so it might
> be valid :) Therefore I assume that we may get more callers of this
> function and the workqueue should be executed and I made something
> simple that works on RT.

Right, but it looks quite strange to have this thread just for a
(probably) quite rare event.  Maybe some day someone will work out how
to make Octeon's IRQ management less unusual and then you can use the
simpler approach for RT...

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] extcon: arizona: Get pdata from arizona structure not device

2013-09-30 Thread Mark Brown
On Tue, Oct 01, 2013 at 08:14:39AM +0900, Chanwoo Choi wrote:

> Additionally,I have a question.
> As I mentioned, extcon-arizon driver don't get pdata from dt parsing.
> I think extcon-arizona haven't operated without pdata.

It should work fine on Wolfson reference systems and anything cloned
from them (it did when I was working there, that was the intened goal).


signature.asc
Description: Digital signature


Re: [PATCH] extcon: arizona: Get pdata from arizona structure not device

2013-09-30 Thread Mark Brown
On Tue, Oct 01, 2013 at 08:04:09AM +0900, Chanwoo Choi wrote:
> On 09/30/2013 06:52 PM, Charles Keepax wrote:

> > I would also be happy to implement this as a NULL check on the
> > pdata when we use it if that is preferable? But since we have the
> > cached pdata seems we might as well use it.

> I find below pdata list for extcon-arizona driver.
> But, drivers/mfd/arizona-core.c don't parse dt data for below pdata list
> of extcon-arizona. Did you test this patch for extcon-arizona operation?

Charles had some patches out for review but there were issues with the
DT binding that meant it needs a respin - it will need to support extcon
at some point.


signature.asc
Description: Digital signature


Re: [PATCH] extcon: arizona: Get pdata from arizona structure not device

2013-09-30 Thread Chanwoo Choi
On 10/01/2013 08:04 AM, Chanwoo Choi wrote:
> On 09/30/2013 06:52 PM, Charles Keepax wrote:
>> On Mon, Sep 30, 2013 at 08:37:30AM +0900, Chanwoo Choi wrote:
>>> No, extcon-arizona driver don't currently support DT to get platform data.
>>> I cannot find some dt function to parse data from dts file.
>>> You have to implement extcon-arizona driver by using DT binding style
>>> to get platform data. I think this patch is not necessary.
>>
>> Currently the Arizona MFD driver reads the device tree
>> information and populates the pdata structure, this happens in
>> drivers/mfd/arizona-core.c. Then the various drivers just use the
>> pdata as normal.
>>
>> Admittedly, at the moment we don't parse any data for the extcon
>> driver but without this patch we will attempt to use a NULL
>> pointer on device tree systems.
>>
>> I would also be happy to implement this as a NULL check on the
>> pdata when we use it if that is preferable? But since we have the
>> cached pdata seems we might as well use it.
>>
> 
> I find below pdata list for extcon-arizona driver.
> But, drivers/mfd/arizona-core.c don't parse dt data for below pdata list
> of extcon-arizona. Did you test this patch for extcon-arizona operation?
> 
> arizona->pdata.micd_pol_gpio
> arizona->pdata.micd_force_micbias 
> arizona->pdata.hpdet_id_gpio
> arizona->pdata.hpdet_acc_id
> arizona->pdata.hpdet_acc_id_line
> arizona->pdata.micd_detect_debounce
> arizona->pdata.jd_gpio5
> arizona->pdata.micd_timeout
> arizona->pdata.num_micd_configs
> arizona->pdata.micd_configs
> arizona->pdata.micd_bias_start_time
> arizona->pdata.micd_rate
> arizona->pdata.micd_dbtime
> arizona->pdata.num_micd_ranges
> ...
> 
> If you fix NULL pointer error about pdata,
> I think only that extcon-arizona modify it as following:
> 
> diff --git a/drivers/extcon/extcon-arizona.c b/drivers/extcon/extcon-arizona.c
> index e557130..3429906 100644
> --- a/drivers/extcon/extcon-arizona.c
> +++ b/drivers/extcon/extcon-arizona.c
> @@ -1088,6 +1088,10 @@ static int arizona_extcon_probe(struct platform_device 
> *pdev)
> return -EPROBE_DEFER;
>  
> pdata = dev_get_platdata(arizona->dev);
> +   if (!pdata) {
> +   dev_err(>dev, "Failed to get platform data\n");
> +   return -EINVAL;
> +   }
>  
> info = devm_kzalloc(>dev, sizeof(*info), GFP_KERNEL);
> if (!info) {
> 

Additionally,I have a question.
As I mentioned, extcon-arizon driver don't get pdata from dt parsing.
I think extcon-arizona haven't operated without pdata.

Did you only build test for extcon-arizona?

I'd like you to implement parse function for extcon-arizona
to solve this issue(NULL pointer error).

Thanks
Chanwoo Choi




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


к нам приходят за финансами

2013-09-30 Thread polonism_71
денежная помощь населению http://li.ru/go?9url.in/21ea8c
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 05/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-09-30 Thread Marcelo Tosatti
On Thu, Sep 05, 2013 at 06:29:08PM +0800, Xiao Guangrong wrote:
> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
> write-proect the sptes, it is because:
> - we have marked large sptes readonly instead of dropping them that means we
>   just change the spte from writable to readonly so that we only need to care
>   the case of changing spte from present to present (changing the spte from
>   present to nonpresent will flush all the TLBs immediately), in other words,
>   the only case we need to care is mmu_spte_update()
> 
> - in mmu_spte_update(), we have checked
>   SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
>   means it does not depend on PT_WRITABLE_MASK anymore
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 18 ++
>  arch/x86/kvm/x86.c |  9 +++--
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7488229..a983570 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm 
> *kvm, int slot)
>   if (*rmapp)
>   __rmap_write_protect(kvm, rmapp, false);
>  
> - if (need_resched() || spin_needbreak(>mmu_lock)) {
> - kvm_flush_remote_tlbs(kvm);
> + if (need_resched() || spin_needbreak(>mmu_lock))
>   cond_resched_lock(>mmu_lock);
> - }
>   }
>   }
>  
> - kvm_flush_remote_tlbs(kvm);
>   spin_unlock(>mmu_lock);
> +
> + /*
> +  * We can flush all the TLBs out of the mmu lock without TLB
> +  * corruption since we just change the spte from writable to
> +  * readonly so that we only need to care the case of changing
> +  * spte from present to present (changing the spte from present
> +  * to nonpresent will flush all the TLBs immediately), in other
> +  * words, the only case we care is mmu_spte_update() where we
> +  * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
> +  * instead of PT_WRITABLE_MASK, that means it does not depend
> +  * on PT_WRITABLE_MASK anymore.
> +  */
> + kvm_flush_remote_tlbs(kvm);
>  }

What about need_remote_flush? 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] extcon: arizona: Get pdata from arizona structure not device

2013-09-30 Thread Chanwoo Choi
On 09/30/2013 06:52 PM, Charles Keepax wrote:
> On Mon, Sep 30, 2013 at 08:37:30AM +0900, Chanwoo Choi wrote:
>> No, extcon-arizona driver don't currently support DT to get platform data.
>> I cannot find some dt function to parse data from dts file.
>> You have to implement extcon-arizona driver by using DT binding style
>> to get platform data. I think this patch is not necessary.
> 
> Currently the Arizona MFD driver reads the device tree
> information and populates the pdata structure, this happens in
> drivers/mfd/arizona-core.c. Then the various drivers just use the
> pdata as normal.
> 
> Admittedly, at the moment we don't parse any data for the extcon
> driver but without this patch we will attempt to use a NULL
> pointer on device tree systems.
> 
> I would also be happy to implement this as a NULL check on the
> pdata when we use it if that is preferable? But since we have the
> cached pdata seems we might as well use it.
> 

I find below pdata list for extcon-arizona driver.
But, drivers/mfd/arizona-core.c don't parse dt data for below pdata list
of extcon-arizona. Did you test this patch for extcon-arizona operation?

arizona->pdata.micd_pol_gpio
arizona->pdata.micd_force_micbias   
arizona->pdata.hpdet_id_gpio
arizona->pdata.hpdet_acc_id
arizona->pdata.hpdet_acc_id_line
arizona->pdata.micd_detect_debounce
arizona->pdata.jd_gpio5
arizona->pdata.micd_timeout
arizona->pdata.num_micd_configs
arizona->pdata.micd_configs
arizona->pdata.micd_bias_start_time
arizona->pdata.micd_rate
arizona->pdata.micd_dbtime
arizona->pdata.num_micd_ranges
...

If you fix NULL pointer error about pdata,
I think only that extcon-arizona modify it as following:

diff --git a/drivers/extcon/extcon-arizona.c b/drivers/extcon/extcon-arizona.c
index e557130..3429906 100644
--- a/drivers/extcon/extcon-arizona.c
+++ b/drivers/extcon/extcon-arizona.c
@@ -1088,6 +1088,10 @@ static int arizona_extcon_probe(struct platform_device 
*pdev)
return -EPROBE_DEFER;
 
pdata = dev_get_platdata(arizona->dev);
+   if (!pdata) {
+   dev_err(>dev, "Failed to get platform data\n");
+   return -EINVAL;
+   }
 
info = devm_kzalloc(>dev, sizeof(*info), GFP_KERNEL);
if (!info) {

Thanks,
Chanwoo Choi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Avoiding the dentry d_lock on final dput(), part deux: transactional memory

2013-09-30 Thread Benjamin Herrenschmidt
On Mon, 2013-09-30 at 12:29 -0700, Linus Torvalds wrote:
> 
> But I'm cc'ing the POWER people, because I don't know the POWER8
> interfaces, and I don't want to necessarily call this "xbegin"/"xend"
> when I actually wrap it in some helper functions.

The main problem with powerpc TM is that we need to handle interrupts
happening while in transactional state. We currently only handle that
for userspace.

Mikey, how hard would it be to detect the in-kernel TM case and just
simulate an abort in the exception entry ? (return to tbegin basically)

Transactions in kernel make me nervous because of the PC jumping around
on aborts and how easy we can get that stuff wrong :-) They also have
interesting ordering semantics vs. locks, we need to be a tad careful
(as long as we don't access a lock variable transactionally we should be
ok. If we do, then spin_unlock needs a stronger barrier).

The basic semantic for us is tbegin. [...] tend instructions. If the
transaction fails, control returns to tbegin. (can happen at any point)
which returns a CC code indicating success or failure.

Things get really complicated if you take an interrupt however, the
transaction gets into some special "suspended" state, it doesn't just
die so we need to handle things in our interrupt entry (even if it's
just to make the transaction abort cleanly) and right now we don't when
coming from kernel space.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ARM: ks8695: fix incorrect placement of __initdata tag

2013-09-30 Thread Greg Ungerer

On 30/09/13 23:02, Bartlomiej Zolnierkiewicz wrote:

__initdata tag should be placed between the variable name and equal
sign for the variable to be placed in the intended .init.data section.

Signed-off-by: Bartlomiej Zolnierkiewicz 
Signed-off-by: Kyungmin Park 


Acked-by: Greg Ungerer 



---
  arch/arm/mach-ks8695/board-og.c | 2 +-
  arch/arm/mach-ks8695/cpu.c  | 3 +--
  2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/arm/mach-ks8695/board-og.c b/arch/arm/mach-ks8695/board-og.c
index 002bc61..9204f5a 100644
--- a/arch/arm/mach-ks8695/board-og.c
+++ b/arch/arm/mach-ks8695/board-og.c
@@ -79,7 +79,7 @@ static void __init og_pci_bus_reset(void)
  #define   S8250_VIRT  0xf400
  #define   S8250_SIZE  0x0010

-static struct __initdata map_desc og_io_desc[] = {
+static struct map_desc og_io_desc[] __initdata = {
{
.virtual= S8250_VIRT,
.pfn= __phys_to_pfn(S8250_PHYS),
diff --git a/arch/arm/mach-ks8695/cpu.c b/arch/arm/mach-ks8695/cpu.c
index ddb2422..9618654 100644
--- a/arch/arm/mach-ks8695/cpu.c
+++ b/arch/arm/mach-ks8695/cpu.c
@@ -33,8 +33,7 @@
  #include 
  #include 

-
-static struct __initdata map_desc ks8695_io_desc[] = {
+static struct map_desc ks8695_io_desc[] __initdata = {
{
.virtual= (unsigned long)KS8695_IO_VA,
.pfn= __phys_to_pfn(KS8695_IO_PA),



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 03/15] KVM: MMU: lazily drop large spte

2013-09-30 Thread Marcelo Tosatti
On Thu, Sep 05, 2013 at 06:29:06PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoided
> 
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter.  This removes the need for the return value.
> 
> This version has fixed the issue reported in 6b73a9606, the reason of that
> issue is that fast_page_fault() directly sets the readonly large spte to
> writable but only dirty the first page into the dirty-bitmap that means
> other pages are missed. Fixed it by only the normal sptes (on the
> PT_PAGE_TABLE_LEVEL level) can be fast fixed
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 36 
>  arch/x86/kvm/x86.c |  8 ++--
>  2 files changed, 26 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 869f1db..88107ee 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
>  
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
>   *
>   * Note: write protection is difference between drity logging and spte
>   * protection:
> @@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
>   * - for spte protection, the spte can be writable only after unsync-ing
>   *   shadow page.
>   *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
>   */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>  {
>   u64 spte = *sptep;
>  
> @@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> *flush, bool pt_protect)
>  
>   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>  
> - if (__drop_large_spte(kvm, sptep)) {
> - *flush |= true;
> - return true;
> - }
> -
>   if (pt_protect)
>   spte &= ~SPTE_MMU_WRITEABLE;
>   spte = spte & ~PT_WRITABLE_MASK;
>  
> - *flush |= mmu_spte_update(sptep, spte);
> - return false;
> + return mmu_spte_update(sptep, spte);
>  }

Is it necessary for kvm_mmu_unprotect_page to search for an entire range large 
page range now, instead of a 4k page?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: coretemp.ko not auto-loaded on 3.11.x

2013-09-30 Thread Brian Gitonga Marete
On Tue, Oct 1, 2013 at 12:25 AM, Brian Gitonga Marete
 wrote:
> Hello all,
>
> Some time ago, an enhancement was made to the kernel that caused
> coretemp.ko (as well as other cpu-specific drivers) to be auto-loaded.
>
> This has broken for me in 3.11.x. It works fine on 3.10.x.
>
> Please see the attached kernel config and let me know if I left out
> something notwithstanding doing a `sillentoldconfig' from my 3.10.x
> config, which does not exhibit the problem.
>

Some further information which upon some research, may be useful:

Kernel is 3.11.2 (But misbehavior first seen with 3.11.0)

$ cat /sys/devices/system/cpu/modalias
x86cpu:vendor::family:0006:model:0025:feature:,,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,000D,000E,000F,0010,0011,0013,0015,0016,0017,0018,0019,001A,001B,001C,001D,001F,002B,0034,003B,003D,0068,006B,006C,006D,006F,0070,0072,0074,0075,0076,0078,007C,0080,0082,0083,0084,0085,0087,0088,0089,008D,008E,008F,0091,0093,0094,0097,00C0,00E0,00E1,00E7,0100,0101,0102,0103,0104

$ modinfo coretemp
filename:   /lib/modules/3.11.2/kernel/drivers/hwmon/coretemp.ko
license:GPL
description:Intel Core temperature monitor
author: Rudolf Marek 
srcversion: 31B63A96FD9FD13CDDBDF32
alias:  x86cpu:vendor::family:*:model:*:feature:*00E7*
depends:
intree: Y
vermagic:   3.11.2 SMP mod_unload modversions
parm:   tjmax:TjMax value in degrees Celsius (int)

I have UDEV 175 from Ubuntu 12.04.

The module alias seems to match that in cat
/sys/devices/system/cpu/modalias, so udev should be able to probe for
and load coretemp.ko, but that is not happening, which is strange to
me.

I also note that microcode.ko, which is handled by the same mechanism
for auto-loading based on CPUID, is no longer auto-loaded in kernel
3.11.x.

I have added Thomas Renninger and Greg Kroah-Hartman to the thread CC.

Thanks.

BGM.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] epoll: Do not take global 'epmutex' for simple topologies

2013-09-30 Thread Jason Baron
When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
directly to a wakeup source, we do not need to take the global 'epmutex',
unless the epoll file descriptor is nested. The purpose of taking
the 'epmutex' on add is to prevent complex topologies such as loops and
deep wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
operations. However, for the simple case of an epoll file descriptor
attached directly to a wakeup source (with no nesting), we should not
have to pay by taking the 'epmutex'.

This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
performance:

"
On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
In addition the benchmark when from scaling well on 10 sockets to scaling well
on just over 40 sockets.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck.
"

Tested-by: Nathan Zimmer 
Signed-off-by: Jason Baron 
---
 fs/eventpoll.c | 94 ++
 1 file changed, 68 insertions(+), 26 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index f4251a6..daaec16 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -595,8 +595,7 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
 static int ep_scan_ready_list(struct eventpoll *ep,
  int (*sproc)(struct eventpoll *,
   struct list_head *, void *),
- void *priv,
- int depth)
+ void *priv, int depth, int ep_locked)
 {
int error, pwake = 0;
unsigned long flags;
@@ -607,7 +606,9 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 * We need to lock this because we could be hit by
 * eventpoll_release_file() and epoll_ctl().
 */
-   mutex_lock_nested(>mtx, depth);
+
+   if (!ep_locked)
+   mutex_lock_nested(>mtx, depth);
 
/*
 * Steal the ready list, and re-init the original one to the
@@ -671,7 +672,8 @@ static int ep_scan_ready_list(struct eventpoll *ep,
}
spin_unlock_irqrestore(>lock, flags);
 
-   mutex_unlock(>mtx);
+   if (!ep_locked)
+   mutex_unlock(>mtx);
 
/* We have to call this outside the lock */
if (pwake)
@@ -824,15 +826,34 @@ static int ep_read_events_proc(struct eventpoll *ep, 
struct list_head *head,
return 0;
 }
 
+static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
+poll_table *pt);
+
+struct readyevents_arg {
+   struct eventpoll *ep;
+   int locked;
+};
+
 static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
 {
-   return ep_scan_ready_list(priv, ep_read_events_proc, NULL, call_nests + 
1);
+   struct readyevents_arg *arg = (struct readyevents_arg *)priv;
+
+   return ep_scan_ready_list(arg->ep, ep_read_events_proc, NULL,
+ call_nests + 1, arg->locked);
 }
 
 static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
 {
int pollflags;
struct eventpoll *ep = file->private_data;
+   struct readyevents_arg arg;
+
+   /*
+* During ep_insert() we already hold the ep->mtx for the tfile.
+* Prevent re-aquisition.
+*/
+   arg.locked = ((wait && (wait->_qproc == ep_ptable_queue_proc)) ? 1 : 0);
+   arg.ep = ep;
 
/* Insert inside our poll wait queue */
poll_wait(file, >poll_wait, wait);
@@ -844,7 +865,7 @@ static unsigned int ep_eventpoll_poll(struct file *file, 
poll_table *wait)
 * could re-enter here.
 */
pollflags = ep_call_nested(_readywalk_ncalls, EP_MAX_NESTS,
-  ep_poll_readyevents_proc, ep, ep, current);
+  ep_poll_readyevents_proc, , ep, current);
 
return pollflags != -1 ? pollflags : 0;
 }
@@ -1245,7 +1266,7 @@ static noinline void ep_destroy_wakeup_source(struct 
epitem *epi)
  * Must be called with "mtx" held.
  */
 static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
-struct file *tfile, int fd)
+struct file *tfile, int fd, int full_check)
 {
int error, revents, pwake = 0;
unsigned long flags;
@@ -1313,7 +1334,7 @@ static int ep_insert(struct eventpoll *ep, struct 
epoll_event *event,
 
/* now check if we've created too many backpaths */
error = -EINVAL;
-   if (reverse_path_check())
+   if (full_check && reverse_path_check())
goto error_remove_epi;
 
/* We have to drop the new item inside our item list to keep track of 
it */
@@ -1537,7 +1558,7 @@ static int ep_send_events(struct eventpoll *ep,
esed.maxevents = maxevents;
   

[PATCH 3/3] epoll: restore 'struct epitem' size

2013-09-30 Thread Jason Baron
The patch entitled 'epoll: optimize EPOLL_CTL_DEL using rcu', increases the
size of 'struct epitem' over 128 bytes as it adds a 'struct rcu_head' field
to 'struct epitem', in order to make use of RCU. This patch restores the size
of 'struct epitem' back to <= 128 bytes. This size restriction and enforcement
was orignally brought in by commit 39732ca5af4b09f4db561149041ddad7211019a5.

The idea of this patch is to use the 'struct rb_node rbn', which is embedded in
the 'stuct epitem' as the 'rcu_head' callback point. This is ok, since the 'rbn'
is no longer in use when we schedule the item to be freed with RCU and its
access is guarded by ep->mtx. Further, the RCU reader does not access the rbn
field.

I've also added a build-time check to ensure that 'struct rb_node' is >= 'struct
rcu_head'.

Note, I've kept this separate from 'epoll: optimize EPOLL_CTL_DEL using rcu'
in order to make clear the hack-ish nature of this thing.

Tested-by: Nathan Zimmer 
Signed-off-by: Jason Baron 
---
 fs/eventpoll.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index daaec16..2f06737 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -168,11 +168,7 @@ struct epitem {
/* The structure that describe the interested events and the source fd 
*/
struct epoll_event event;
 
-   /*
-* Free the epitem using rcu to enable a CTL_DEL to happen in parallel
-* with reverse path checks.
-*/
-   struct rcu_head rcu;
+   /* The fllink is in use. Since rcu can't do 'list_del_init()' */
int on_list;
 };
 
@@ -682,9 +678,10 @@ static int ep_scan_ready_list(struct eventpoll *ep,
return error;
 }
 
-static void epi_rcu_free(struct rcu_head *head)
+static void epi_rcu_free(struct rcu_head *rcu)
 {
-   struct epitem *epi = container_of(head, struct epitem, rcu);
+   struct epitem *epi = (struct epitem *)((char *)rcu -
+   offsetof(struct epitem, rbn));
kmem_cache_free(epi_cache, epi);
 }
 
@@ -723,9 +720,15 @@ static int ep_remove(struct eventpoll *ep, struct epitem 
*epi)
spin_unlock_irqrestore(>lock, flags);
 
wakeup_source_unregister(ep_wakeup_source(epi));
-
-   /* At this point it is safe to free the eventpoll item */
-   call_rcu(>rcu, epi_rcu_free);
+   /*
+* At this point it is safe to free the eventpoll item.
+* Use epi->rbn, instead of struct rcu_head, since we
+* are trying to minimize the size of 'struct epitem'.
+* The 'rbn' field is no longer in use. Protected
+* by ep->mtx. the rcu read side, reverse_path_check_proc(),
+* does not make use of the rbn field.
+*/
+   call_rcu((struct rcu_head *)>rbn, epi_rcu_free);
 
atomic_long_dec(>user->epoll_watches);
 
@@ -2126,6 +2129,15 @@ static int __init eventpoll_init(void)
/* Initialize the structure used to perform file's f_op->poll() calls */
ep_nested_calls_init(_readywalk_ncalls);
 
+   /*
+* We can have many thousands of epitems, so prevent this from
+* using an extra cache line on 64-bit (and smaller) CPUs
+*/
+   BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
+
+   /* make sure the overloading continues to work */
+   BUILD_BUG_ON(sizeof(struct rb_node) < sizeof(struct rcu_head));
+
/* Allocates slab cache used to allocate "struct epitem" items */
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
-- 
1.8.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] epoll: optimize EPOLL_CTL_DEL using rcu

2013-09-30 Thread Jason Baron
Optimize EPOLL_CTL_DEL so that it does not require the 'epmutex' by
converting the file->f_ep_links list into an rcu one. In this way, we can
traverse the epoll network on the add path in parallel with deletes. Since
deletes can't create loops or worse wakeup paths, this is safe.

Note that I've removed the build time check limiting 'struct epitem' to
128 bytes or less. This check is restored by subsequent patch. The reason
for separating it out was to call attention to its somewhat hacky nature.

This patch in combination with the patch "epoll: Do not take global 'epmutex'
for simple topologies", shows a dramatic performance improvement in
scalability for SPECjbb.

Tested-by: Nathan Zimmer 
Signed-off-by: Jason Baron 
---
 fs/eventpoll.c | 58 --
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 473e09d..f4251a6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * LOCKING:
@@ -166,6 +167,13 @@ struct epitem {
 
/* The structure that describe the interested events and the source fd 
*/
struct epoll_event event;
+
+   /*
+* Free the epitem using rcu to enable a CTL_DEL to happen in parallel
+* with reverse path checks.
+*/
+   struct rcu_head rcu;
+   int on_list;
 };
 
 /*
@@ -672,6 +680,12 @@ static int ep_scan_ready_list(struct eventpoll *ep,
return error;
 }
 
+static void epi_rcu_free(struct rcu_head *head)
+{
+   struct epitem *epi = container_of(head, struct epitem, rcu);
+   kmem_cache_free(epi_cache, epi);
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
@@ -693,8 +707,10 @@ static int ep_remove(struct eventpoll *ep, struct epitem 
*epi)
 
/* Remove the current item from the list of epoll hooks */
spin_lock(>f_lock);
-   if (ep_is_linked(>fllink))
-   list_del_init(>fllink);
+   if (epi->on_list) {
+   list_del_rcu(>fllink);
+   epi->on_list = 0;
+   }
spin_unlock(>f_lock);
 
rb_erase(>rbn, >rbr);
@@ -707,7 +723,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem 
*epi)
wakeup_source_unregister(ep_wakeup_source(epi));
 
/* At this point it is safe to free the eventpoll item */
-   kmem_cache_free(epi_cache, epi);
+   call_rcu(>rcu, epi_rcu_free);
 
atomic_long_dec(>user->epoll_watches);
 
@@ -873,7 +889,6 @@ static const struct file_operations eventpoll_fops = {
  */
 void eventpoll_release_file(struct file *file)
 {
-   struct list_head *lsthead = >f_ep_links;
struct eventpoll *ep;
struct epitem *epi;
 
@@ -891,17 +906,12 @@ void eventpoll_release_file(struct file *file)
 * Besides, ep_remove() acquires the lock, so we can't hold it here.
 */
mutex_lock();
-
-   while (!list_empty(lsthead)) {
-   epi = list_first_entry(lsthead, struct epitem, fllink);
-
+   list_for_each_entry_rcu(epi, >f_ep_links, fllink) {
ep = epi->ep;
-   list_del_init(>fllink);
mutex_lock_nested(>mtx, 0);
ep_remove(ep, epi);
mutex_unlock(>mtx);
}
-
mutex_unlock();
 }
 
@@ -1139,7 +1149,9 @@ static int reverse_path_check_proc(void *priv, void 
*cookie, int call_nests)
struct file *child_file;
struct epitem *epi;
 
-   list_for_each_entry(epi, >f_ep_links, fllink) {
+   /* CTL_DEL can remove links here, but that can't increase our count */
+   rcu_read_lock();
+   list_for_each_entry_rcu(epi, >f_ep_links, fllink) {
child_file = epi->ep->file;
if (is_file_epoll(child_file)) {
if (list_empty(_file->f_ep_links)) {
@@ -1161,6 +1173,7 @@ static int reverse_path_check_proc(void *priv, void 
*cookie, int call_nests)
"file is not an ep!\n");
}
}
+   rcu_read_unlock();
return error;
 }
 
@@ -1255,6 +1268,7 @@ static int ep_insert(struct eventpoll *ep, struct 
epoll_event *event,
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
+   epi->on_list = 0;
if (epi->event.events & EPOLLWAKEUP) {
error = ep_create_wakeup_source(epi);
if (error)
@@ -1287,7 +1301,8 @@ static int ep_insert(struct eventpoll *ep, struct 
epoll_event *event,
 
/* Add the current item to the list of active epoll hook for this file 
*/
spin_lock(>f_lock);
-   list_add_tail(>fllink, >f_ep_links);
+   list_add_tail_rcu(>fllink, >f_ep_links);
+   epi->on_list = 1;
spin_unlock(>f_lock);
 
/*
@@ -1328,8 +1343,8 @@ static int ep_insert(struct eventpoll *ep, struct 
epoll_event *event,
 
 

[PATCH 0/3] epoll: reduce 'epmutex' lock contention

2013-09-30 Thread Jason Baron
Hi,

Nathan Zimmer found that once we get over 10+ cpus, the scalability of
SPECjbb falls over due to the contention on the global 'epmutex', which
is taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path by
using rcu to guard against any concurrent traversals.

Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for simple
topologies. IE when adding a link from an epoll file descriptor to a wakeup
source, where the epoll file descriptor is not nested.

Patch #3, restores the size of 'struct epitem' to <= 128 bytes, which was
broken in Patch #1. Its a bit hacky, so I decided to break it out as a separate
patch for review purposes.

Performance of SPECjbb improves considerably. From Nathan Zimmer's testing.
Thread: http://marc.info/?l=linux-kernel=137908766013329=2

"
On the 16 socket run the performance went from 35k jOPS to 125k jOPS.
In addition the benchmark when from scaling well on 10 sockets to scaling well
on just over 40 sockets.

I should also note there system responsiveness of various commands was quite
improved when under the full load of the benchmark.
...


Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck.
"

Jason Baron (3):
  epoll: optimize EPOLL_CTL_DEL using rcu
  epoll: Do not take global 'epmutex' for simple topologies
  epoll: restore 'struct epitem' size

 fs/eventpoll.c | 150 -
 1 file changed, 105 insertions(+), 45 deletions(-)

-- 
1.8.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3: Add Smp support for Allwinner A20. 1/3] Add smp support for Allwinner A20(sunxi 7i).

2013-09-30 Thread Maxime Ripard
Hi Fan,

On Sat, Sep 28, 2013 at 09:48:11PM +0800, cinifr wrote:
> Hi  Maxime,
> I have test it, but I found it does not work. If using
> smp_prepare_cpus, the kernenl cannot find the secondary cpus because
> that smp_prepare_cpus semms not be excuted before kernel is booting
> secondary cpus. So I have to use early_initcall.

At least two other platforms (sti and rockchip) do it like that.

And an early_initcall is not an option.

Thanks!
Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com


signature.asc
Description: Digital signature


Re: [tpmdd-devel] [PATCH 09/13] tpm: Pull everything related to sysfs into tpm-sysfs.c

2013-09-30 Thread Joel Schopp


> There is also the fact that the driver may not be able to tell if a
> locality is available without doing some kind of test command. The Xen
> TPM interface doesn't expose what localities are available, for example,
> and the TIS interface may need to test to see if locality 3 and 4 are
> actually blocked by the chipset - at least 3 might be available on some
> systems (the spec leaves this "implementation dependent").
> 
>>> Perhaps:
>>> default_locality - default to CONFIG_USER_DEFAULT_LOCALITY
>>> sysfs node permissions 0644
>>> kernel_locality - default to #CONFIG_KERNEL_DEFAULT_LOCALITY
>>> 0444 if CONFIG_KERNEL_ONLY_LOCALITY=y
>>> 0644 if CONFIG_KERNEL_ONLY_LOCALITY=n
>>> ioctl TPM_{GET,SET}_LOCALITY on an open /dev/tpmX
>>>
>>> If CONFIG_KERNEL_ONLY_LOCALITY=y, the userspace locality is not
>>> permitted to be equal to kernel_locality (but may take any other valid
>>> value).  Drivers may reject locality values that they consider invalid
>>> (the default should be to only allow 0-4, which is all that is defined
>>> in the spec) or may fail on attempted use of the TPM by passing down an
>>> error from the hardware - I would expect the latter to be the case on
>>> attempts to use locality 4 in the tpm_tis driver.
>>
>> Seems resonable, CONFIG_KERNEL_ONLY_LOCALITY could be
>> CONFIG_TPM_ONE_TIME_LOCALITY (eg you get to set kernel_locality only
>> once)
> 
> Hmm, how much trouble would it be to make this a menu selection? Even
> with the one-time-set option, you still need a default set either in
> the code or by CONFIG_ so that the TPM is not unavailable before the
> sysfs write. The options would be:
> 
>   - CONFIG_TPM_KERNEL_DEFAULT_LOCALITY = [int]
>   - CONFIG_TPM_KERNEL_LOCALITY_FIXED - no changes from userspace
>   - CONFIG_TPM_KERNEL_LOCALITY_ONESHOT - only one change possible
>   - CONFIG_TPM_KERNEL_LOCALITY_ANY - may be changed freely
> 
> The userspace locality is not allowed to use the kernel locality if
> the mode is either FIXED or ONESHOT, but may share locality if ANY
> is used.
> 
> Or, for more flexibility (I actually like this one better):
> 
>   - CONFIG_TPM_KERNEL_DEFAULT_LOCALITY = [int]
>   - CONFIG_TPM_KERNEL_LOCALITY_FIXED = [bool]

This seems best of the options discussed to me.

> 
> And sysfs contains:
>   - kernel_locality [0644, int; 0444 if FIXED=y or when locked(?)]
>   - lock_kernel_locality [write-once; only exists if FIXED=n]
> 
> Where kernel_locality may be changed until a write is made to
> local_kernel_locality, at which time the value of kernel_locality
> becomes read-only and no longer available via /dev/tpmX.
> 
>>> The only one I see immediately is seal/unseal (security/keys/trusted.c).
>>> The invocation of the seal command would need to be changed to seal the
>>> trusted keys to the kernel-only locality in order to take advantage of
>>> the increased protection provided by a kernel-only locality.
>>
>> Right
> 
> Actually, only the invocation needs to be changed - the PCR selection
> is passed in from userspace, which will just need to use PCR_INFO_LONG
> with the proper locality mask.
> 
 Do you know anyone on the userspace SW side who could look at this?
>>
>>> I should be able to find someone.
>>
>> Okay, let me know. I'd like to get a few more clean ups done before
>> mucking with the sysfs, but the way forward for locality looks pretty
>> clear..
>>
>> Thanks,
>> Jason
> 
> So far, nobody I have talked to has offered any strong opinions on
> what locality should be used or how it should be set. I think finding
> a developer of trousers may be the most useful to talk about how the
> ioctl portion of this would need to be set up - if someone is actually
> needed.
> 

I am a TrouSerS developer and am ccing Richard, another TrouSerS
developer, and ccing the trousers-tech list.  It would be good if you
could elaborate on the question and context for those not following the
entire thread, myself included.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] OOM killer: wait for tasks with pending SIGKILL to exit

2013-09-30 Thread David Rientjes
On Fri, 27 Sep 2013, Sergey Dyasly wrote:

> What you are saying contradicts current OOMk code the way I read it. Comment 
> in
> oom_kill_process() says:
> 
> "If the task is already exiting ... set TIF_MEMDIE so it can die quickly"
> 
> I just want to know the right solution.
> 

That's a comment, not code.  The point of the PF_EXITING special handling 
in oom_kill_process() is to avoid telling sysadmins that a process has 
been killed to free memory when it has already called exit() and to avoid 
sacrificing one of its children for the exiting process.

It may or may not need access to memory reserves to actually exit after 
PF_EXITING depending on whether it needs to allocate memory for 
coredumping or anything else.  So instead of waiting for it to recall the 
oom killer, TIF_MEMDIE is set anyway.  The point is that PF_EXITING 
processes can already get TIF_MEMDIE immediately when their memory 
allocation fails so there's no reason not to set it now as an 
optimization.

But we definitely want to avoid printing anything to the kernel log when 
the process has already called exit() and issuing the SIGKILL at that 
point would be pointless.

> You are mistaken, oom_kill_process() is only called from out_of_memory()
> and mem_cgroup_out_of_memory().
> 

out_of_memory() calls oom_kill_process() in two places, plus the call from 
mem_cgroup_out_of_memory(), making three calls in the tree.  Not that this 
matters in the slightest, though.

> > Read the comment about why we don't emit anything to the kernel log in 
> > this case; the process is already exiting, there's no need to kill it or 
> > make anyone believe that it was killed.
> 
> Yes, but there is already the PF_EXITING check in oom_scan_process_thread(),
> and in this case oom_kill_process() won't be even called. That's why it's
> redundant.
> 

You apparently have no idea how long select_bad_process() runs on a large 
system with thousands of processes.  Keep in mind that SGI requested the 
addition of the oom_kill_allocating_task sysctl specifically because of 
how long select_bad_process() runs.  The PF_EXITING check in 
oom_kill_process() is simply an optimization to return early and with 
access to memory reserves so it can exit as quickly as possible and 
without the kernel stating it's killing something that has already called 
exit().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH, v2] anon_vmas: Convert the rwsem to an rwlock_t

2013-09-30 Thread Tim Chen
On Mon, 2013-09-30 at 12:47 -0700, Tim Chen wrote:
> On Mon, 2013-09-30 at 15:35 -0400, Waiman Long wrote:
> > On 09/30/2013 03:23 PM, Tim Chen wrote:
> > > On Mon, 2013-09-30 at 20:14 +0200, Peter Zijlstra wrote:
> > >> On Mon, Sep 30, 2013 at 10:10:27AM -0700, Tim Chen wrote:
> > >>> Here's the exim workload data:
> > >>>
> > >>> rwsem improvment:
> > >>> Waimain's patch:+2.0%
> > >>> Alex+Tim's patchset:+4.8%
> > >>> Waiman+Alex+Tim:+5.3%
> > >>>
> > >>> convert rwsem to rwlock_t for root anon_vma lock
> > >>> Ingo's patch+11.7%
> > >>>
> > >> What happens if you stuff Waiman's qrwlock patches on top of that?
> > >> admittedly and oft mentioned in this thread, our current rwlock_t is
> > >> somewhat suboptimal under a number of conditions.
> > > I've tested with Waiman's qrwlock patches on top of Ingo's patches.
> > > It does not affect the throughput for exim and I still get
> > > about +11.7% throughput change (same as with Ingo's patch only).
> > >
> > > Tim
> > >
> > 
> > My qrwlock doesn't enable qrwlock by default. You have to use menuconfig 
> > to explicitly enable it. Have you done that when you build the test 
> > kernel? I am thinking of explicitly enabling it for x86 if the anon-vma 
> > lock is converted back to a rwlock.
> > 
> 
> Yes, I have explicitly enabled it during my testing.
> 
The workload I have is dominated by writer locks, with only very
occasional readers. So it may not benefit from the various tweaks you
have put in qrwlock.

Tim 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: pagevec: cleanup: drop pvec->cold argument in all places

2013-09-30 Thread Andrew Morton
On Sat, 28 Sep 2013 16:33:58 +0800 Bob Liu  wrote:

> Nobody uses the pvec->cold argument of pagevec and it's also unreasonable for
> pages in pagevec released as cold page, so drop the cold argument from 
> pagevec.

Is it unreasonable?  I'd say it's unreasonable to assume that all pages
in all cases are likely to be cache-hot.  Example: what if the pages
are being truncated and were found to be on the inactive LRU,
unreferenced?

A useful exercise would be to go through all those pagevec_init() sites
and convince ourselves that the decision at each place was the correct
one.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Please pull NFS client bugfixes

2013-09-30 Thread Myklebust, Trond

Hi Linus,

The following changes since commit 4a10c2ac2f368583138b774ca41fac4207911983:

  Linux 3.12-rc2 (2013-09-23 15:41:09 -0700)

are available in the git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-3.12-4

for you to fetch changes up to 367156d9a87b21b5232dd93107c5fc61b09ba2ef:

  NFS: Give "flavor" an initial value to fix a compile warning (2013-09-29 
16:03:34 -0400)


NFS client bugfixes for 3.12

- Stable fix for Oopses in the pNFS files layout driver
- Fix a regression when doing a non-exclusive file create on NFSv4.x
- NFSv4.1 security negotiation fixes when looking up the root filesystem
- Fix a memory ordering issue in the pNFS files layout driver


Anna Schumaker (1):
  NFS: Give "flavor" an initial value to fix a compile warning

Trond Myklebust (3):
  NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem 
method
  NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
  NFSv4.1: Ensure memory ordering between nfs4_ds_connect and 
nfs4_fl_prepare_ds

Weston Andros Adamson (1):
  NFSv4.1: try SECINFO_NO_NAME flavs until one works

 fs/nfs/dir.c   |  2 +-
 fs/nfs/nfs4file.c  |  3 ++-
 fs/nfs/nfs4filelayoutdev.c | 20 +---
 fs/nfs/nfs4proc.c  | 58 +-
 include/linux/nfs_xdr.h|  3 ++-
 5 files changed, 63 insertions(+), 23 deletions(-)

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com
N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

[PATCH 1/6] perf, core: Add generic transaction flags v5

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

Add a generic qualifier for transaction events, as a new sample
type that returns a flag word. This is particularly useful
for qualifying aborts: to distinguish aborts which happen
due to asynchronous events (like conflicts caused by another
CPU) versus instructions that lead to an abort.

The tuning strategies are very different for those cases,
so it's important to distinguish them easily and early.

Since it's inconvenient and inflexible to filter for this
in the kernel we report all the events out and allow
some post processing in user space.

The flags are based on the Intel TSX events, but should be fairly
generic and mostly applicable to other HTM architectures too. In addition
to various flag words there's also reserved space to report an
program supplied abort code. For TSX this is used to distinguish specific
classes of aborts, like a lock busy abort when doing lock elision.

Flags:

Elision and generic transactions   (ELISION vs TRANSACTION)
(HLE vs RTM on TSX; IBM etc.  would likely only use TRANSACTION)
Aborts caused by current thread vs aborts caused by others (SYNC vs ASYNC)
Retryable transaction  (RETRY)
Conflicts with other threads   (CONFLICT)
Transaction write capacity overflow(CAPACITY WRITE)
Transaction read capacity overflow (CAPACITY READ)

Transactions implicitely aborted can also return an abort code.
This can be used to signal specific events to the profiler. A common
case is abort on lock busy in a RTM eliding library (code 0xff)
To handle this case we include the TSX abort code

Common example aborts in TSX would be:

- Data conflict with another thread on memory read.
  Flags: TRANSACTION|ASYNC|CONFLICT
- executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC
- HLE transaction in user space is too large
  Flags: ELISION|SYNC|CAPACITY-WRITE

The only flag that is somewhat TSX specific is ELISION.

This adds the perf core glue needed for reporting the new flag word out.

v2: Add MEM/MISC
v3: Move transaction to the end
v4: Separate capacity-read/write and remove misc
v5: Remove _SAMPLE. Move abort flags to 32bit. Rename
transaction to txn
Signed-off-by: Andi Kleen 
---
 include/linux/perf_event.h  |  5 +
 include/uapi/linux/perf_event.h | 25 -
 kernel/events/core.c|  6 ++
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 866e85c..ee96093 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -562,6 +562,10 @@ struct perf_sample_data {
struct perf_regs_user   regs_user;
u64 stack_user_size;
u64 weight;
+   /*
+* Transaction flags for abort events:
+*/
+   u64 txn;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data,
@@ -577,6 +581,7 @@ static inline void perf_sample_data_init(struct 
perf_sample_data *data,
data->stack_user_size = 0;
data->weight = 0;
data->data_src.val = 0;
+   data->txn = 0;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ca1d90b..fee1264 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -136,8 +136,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_WEIGHT  = 1U << 14,
PERF_SAMPLE_DATA_SRC= 1U << 15,
PERF_SAMPLE_IDENTIFIER  = 1U << 16,
+   PERF_SAMPLE_TRANSACTION = 1U << 17,
 
-   PERF_SAMPLE_MAX = 1U << 17, /* non-ABI */
+   PERF_SAMPLE_MAX = 1U << 18, /* non-ABI */
 };
 
 /*
@@ -181,6 +182,28 @@ enum perf_sample_regs_abi {
 };
 
 /*
+ * Values for the memory transaction event qualifier, mostly for
+ * abort events. Multiple bits can be set.
+ */
+enum {
+   PERF_TXN_ELISION= (1 << 0), /* From elision */
+   PERF_TXN_TRANSACTION= (1 << 1), /* From transaction */
+   PERF_TXN_SYNC   = (1 << 2), /* Instruction is related */
+   PERF_TXN_ASYNC  = (1 << 3), /* Instruction not related */
+   PERF_TXN_RETRY  = (1 << 4), /* Retry possible */
+   PERF_TXN_CONFLICT   = (1 << 5), /* Conflict abort */
+   PERF_TXN_CAPACITY_WRITE = (1 << 6), /* Capacity write abort */
+   PERF_TXN_CAPACITY_READ  = (1 << 7), /* Capacity read abort */
+
+   PERF_TXN_MAX= (1 << 8), /* non-ABI */
+
+   /* bits 32..63 are reserved for the abort code */
+
+   PERF_TXN_ABORT_MASK  = (0xULL << 32),
+   PERF_TXN_ABORT_SHIFT = 32,
+};
+
+/*
  * The format of the data returned by read() on a perf 

[PATCH 5/6] perf, tools: Add support for record transaction flags v5

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

Add support for recording and displaying the transaction flags.
They are essentially a new sort key. Also display them
in a nice way to the user.

v2: Fix manpage
v3: Move transaction to the end
v4: Handle capacity-read/write
v5: Adjust for new names
Reviewed-by: Jiri Olsa 
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-record.txt |  4 +-
 tools/perf/Documentation/perf-report.txt |  4 ++
 tools/perf/Documentation/perf-top.txt|  2 +-
 tools/perf/builtin-annotate.c|  2 +-
 tools/perf/builtin-diff.c|  8 ++--
 tools/perf/builtin-record.c  |  2 +
 tools/perf/builtin-report.c  |  4 +-
 tools/perf/builtin-top.c |  5 +--
 tools/perf/perf.h|  1 +
 tools/perf/tests/hists_link.c|  6 ++-
 tools/perf/util/event.h  |  1 +
 tools/perf/util/evsel.c  |  9 
 tools/perf/util/hist.c   |  7 ++-
 tools/perf/util/hist.h   |  4 +-
 tools/perf/util/session.c|  3 ++
 tools/perf/util/sort.c   | 73 
 tools/perf/util/sort.h   |  2 +
 17 files changed, 122 insertions(+), 15 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index 6bec1c9..f732eaa 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -179,12 +179,14 @@ is enabled for all the sampling events. The sampled 
branch type is the same for
 The various filters must be specified as a comma separated list: 
--branch-filter any_ret,u,k
 Note that this feature may not be available on all processors.
 
--W::
 --weight::
 Enable weightened sampling. An additional weight is recorded per sample and 
can be
 displayed with the weight and local_weight sort keys.  This currently works 
for TSX
 abort events and some memory events in precise mode on modern Intel CPUs.
 
+--transaction::
+Record transaction flags for transaction related events.
+
 SEE ALSO
 
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/Documentation/perf-report.txt 
b/tools/perf/Documentation/perf-report.txt
index ae337e3..be5ad87 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -72,6 +72,10 @@ OPTIONS
- cpu: cpu number the task ran at the time of sample
- srcline: filename and line number executed at the time of sample.  The
DWARF debugging info must be provided.
+   - weight: Event specific weight, e.g. memory latency or transaction
+   abort cost. This is the global weight.
+   - local_weight: Local weight version of the weight above.
+   - transaction: Transaction abort flags.
 
By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
diff --git a/tools/perf/Documentation/perf-top.txt 
b/tools/perf/Documentation/perf-top.txt
index f852eb5..6d70fbf 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -113,7 +113,7 @@ Default is to monitor all CPUS.
 -s::
 --sort::
Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
-   local_weight, abort, in_tx
+   local_weight, abort, in_tx, transaction
 
 -n::
 --show-nr-samples::
diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c
index 5ebd0c3..0393d98 100644
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@@ -63,7 +63,7 @@ static int perf_evsel__add_sample(struct perf_evsel *evsel,
return 0;
}
 
-   he = __hists__add_entry(>hists, al, NULL, 1, 1);
+   he = __hists__add_entry(>hists, al, NULL, 1, 1, 0);
if (he == NULL)
return -ENOMEM;
 
diff --git a/tools/perf/builtin-diff.c b/tools/perf/builtin-diff.c
index f28799e..2a78dc8 100644
--- a/tools/perf/builtin-diff.c
+++ b/tools/perf/builtin-diff.c
@@ -304,9 +304,10 @@ static int formula_fprintf(struct hist_entry *he, struct 
hist_entry *pair,
 
 static int hists__add_entry(struct hists *self,
struct addr_location *al, u64 period,
-   u64 weight)
+   u64 weight, u64 transaction)
 {
-   if (__hists__add_entry(self, al, NULL, period, weight) != NULL)
+   if (__hists__add_entry(self, al, NULL, period, weight, transaction)
+   != NULL)
return 0;
return -ENOMEM;
 }
@@ -328,7 +329,8 @@ static int diff__process_sample_event(struct perf_tool 
*tool __maybe_unused,
if (al.filtered)
return 0;
 
-   if (hists__add_entry(>hists, , sample->period, 
sample->weight)) {
+   if (hists__add_entry(>hists, , sample->period,
+sample->weight, sample->transaction)) {
pr_warning("problem incrementing symbol period, skipping 
event\n");
   

[PATCH 4/6] perf, tools: Add abort_tx,no_tx,in_tx branch filter options to perf record -j v3

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

Make perf record -j aware of the new in_tx,no_tx,abort_tx branch qualifiers.

v2: ABORT -> ABORTTX
v3: Add more _
Reviewed-by: Jiri Olsa 
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-record.txt | 3 +++
 tools/perf/builtin-record.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index e297b74..6bec1c9 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -166,6 +166,9 @@ following filters are defined:
 - u:  only when the branch target is at the user level
 - k: only when the branch target is in the kernel
 - hv: only when the target is at the hypervisor level
+   - in_tx: only when the target is in a hardware transaction
+   - no_tx: only when the target is not in a hardware transaction
+   - abort_tx: only when the target is a hardware transaction abort
 
 +
 The option requires at least one branch type among any, any_call, any_ret, 
ind_call.
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a41ac415..8384b54 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -618,6 +618,9 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+   BRANCH_OPT("abort_tx", PERF_SAMPLE_BRANCH_ABORT_TX),
+   BRANCH_OPT("in_tx", PERF_SAMPLE_BRANCH_IN_TX),
+   BRANCH_OPT("no_tx", PERF_SAMPLE_BRANCH_NO_TX),
BRANCH_END
 };
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


perf, x86: Add last TSX PMU code for Haswell v3

2013-09-30 Thread Andi Kleen
[This has kernel and user parts.
Both sides have been reviewed now, so hopefully it's good
to merge.]
[v2: Address Peter's feedback for the kernel parts]
[v3: Rebased to current tip. Added Ack/Reviewed-by for user tools code]

Note this also needs the separately posted 
64bit comparison fix to correctly report abort codes.

This is currently the last part of the TSX PMU code,
just adding the left over bits:

This adds some changes to the user interfaces.
I'll send patches for the manpage separately.

- Report the transaction abort flags to user space
using a new field, and add the code to display them.
This is used to classify abort types, also fairly
important for tuning as it guides the tuning process,
together with the abort weight that was added earleir.

[3 patches, generic, x86, user tools]

- Add support for reporting the two new TSX LBR flags: in_tx
and abort_tx. The code to handle the LBRs was already
added earlier, this just adds the code to report,
filter and display them.

- Add a workaround for a Haswell issue that it reports
an extra LBR record for every abort. We just filter
those out in the kernel.

Open perf TSX issues left:
- Revisit automatic enabling of precise for tx/el-abort
- Need to fix the sort handling in the user tools
to actually sort on other fields
- The aggregated LBR display in the user tools is not 
very useful for transactions, need a way to report them 
in a histogram like backtraces.
- May want some shortcut options for
record --transaction --weight / report --sort symbol,transaction,weight

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/6] perf, tools: Support sorting by in_tx, abort branch flags v3

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

Extend the perf branch sorting code to support sorting by in_tx
or abort_tx qualifiers. Also print out those qualifiers.

This also fixes up some of the existing sort key documentation.

We do not support no_tx here, because it's simply not showing
the in_tx flag.

v2: Readd flags to man pages
v3: Rename intx
Reviewed-by: Jiri Olsa 
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-report.txt |  4 ++-
 tools/perf/Documentation/perf-top.txt|  3 +-
 tools/perf/builtin-report.c  |  2 +-
 tools/perf/builtin-top.c |  3 +-
 tools/perf/perf.h|  4 ++-
 tools/perf/util/hist.h   |  2 ++
 tools/perf/util/sort.c   | 51 
 tools/perf/util/sort.h   |  2 ++
 8 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt 
b/tools/perf/Documentation/perf-report.txt
index 2b8097e..ae337e3 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -71,7 +71,7 @@ OPTIONS
entries are displayed as "[other]".
- cpu: cpu number the task ran at the time of sample
- srcline: filename and line number executed at the time of sample.  The
-   DWARF debuggin info must be provided.
+   DWARF debugging info must be provided.
 
By default, comm, dso and symbol keys are used.
(i.e. --sort comm,dso,symbol)
@@ -85,6 +85,8 @@ OPTIONS
- symbol_from: name of function branched from
- symbol_to: name of function branched to
- mispredict: "N" for predicted branch, "Y" for mispredicted branch
+   - in_tx: branch in TSX transaction
+   - abort: TSX transaction abort.
 
And default sort keys are changed to comm, dso_from, symbol_from, dso_to
and symbol_to, see '--branch-stack'.
diff --git a/tools/perf/Documentation/perf-top.txt 
b/tools/perf/Documentation/perf-top.txt
index 58d6598..f852eb5 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -112,7 +112,8 @@ Default is to monitor all CPUS.
 
 -s::
 --sort::
-   Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight, 
local_weight.
+   Sort by key(s): pid, comm, dso, symbol, parent, srcline, weight,
+   local_weight, abort, in_tx
 
 -n::
 --show-nr-samples::
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 8e50d8d..1e84103 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -786,7 +786,7 @@ int cmd_report(int argc, const char **argv, const char 
*prefix __maybe_unused)
   "sort by key(s): pid, comm, dso, symbol, parent, cpu, 
srcline,"
   " dso_to, dso_from, symbol_to, symbol_from, mispredict,"
   " weight, local_weight, mem, symbol_daddr, dso_daddr, tlb, "
-  "snoop, locked"),
+  "snoop, locked, abort, in_tx"),
OPT_BOOLEAN(0, "showcpuutilization", _conf.show_cpu_utilization,
"Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", _pattern, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 2122141..6534a37 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1103,7 +1103,8 @@ int cmd_top(int argc, const char **argv, const char 
*prefix __maybe_unused)
OPT_INCR('v', "verbose", ,
"be more verbose (show counter open errors, etc)"),
OPT_STRING('s', "sort", _order, "key[,key2...]",
-  "sort by key(s): pid, comm, dso, symbol, parent, weight, 
local_weight"),
+  "sort by key(s): pid, comm, dso, symbol, parent, weight, 
local_weight,"
+  " abort, in_tx"),
OPT_BOOLEAN('n', "show-nr-samples", _conf.show_nr_samples,
"Show a column with the number of samples"),
OPT_CALLBACK_DEFAULT('G', "call-graph", _opts,
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index cf20187..acf3d66 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -182,7 +182,9 @@ struct ip_callchain {
 struct branch_flags {
u64 mispred:1;
u64 predicted:1;
-   u64 reserved:62;
+   u64 in_tx:1;
+   u64 abort:1;
+   u64 reserved:60;
 };
 
 struct branch_entry {
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 1329b6b..f743e96 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -45,6 +45,8 @@ enum hist_column {
HISTC_CPU,
HISTC_SRCLINE,
HISTC_MISPREDICT,
+   HISTC_IN_TX,
+   HISTC_ABORT,
HISTC_SYMBOL_FROM,
HISTC_SYMBOL_TO,
HISTC_DSO_FROM,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 5f118a0..1771566 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -858,6 +858,55 @@ struct sort_entry sort_mem_snoop = {

[PATCH 6/6] perf, x86: Suppress duplicated abort LBR records v2

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

Haswell always give an extra LBR record after every TSX abort.
Suppress the extra record.

This only works when the abort is visible in the LBR
If the original abort has already left the 16 LBR entries
the extra entry will will stay.

v2: Adjust white space.
Signed-off-by: Andi Kleen 
---
 arch/x86/kernel/cpu/perf_event.h   |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c |  1 +
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 29 +
 3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index ce84ede..fd00bb2 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -445,6 +445,7 @@ struct x86_pmu {
int lbr_nr;/* hardware stack size */
u64 lbr_sel_mask;  /* LBR_SELECT valid bits */
const int   *lbr_sel_map;  /* lbr_select mappings */
+   boollbr_double_abort;  /* duplicated lbr aborts */
 
/*
 * Extra registers for events
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 353b7a3..d6badfc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2518,6 +2518,7 @@ __init int intel_pmu_init(void)
x86_pmu.hw_config = hsw_hw_config;
x86_pmu.get_event_constraints = hsw_get_event_constraints;
x86_pmu.cpu_events = hsw_events_attrs;
+   x86_pmu.lbr_double_abort = true;
pr_cont("Haswell events, ");
break;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c 
b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d5be06a..90ee6c1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -284,6 +284,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events 
*cpuc)
int lbr_format = x86_pmu.intel_cap.lbr_format;
u64 tos = intel_pmu_lbr_tos();
int i;
+   int out = 0;
 
for (i = 0; i < x86_pmu.lbr_nr; i++) {
unsigned long lbr_idx = (tos - i) & mask;
@@ -306,15 +307,27 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events 
*cpuc)
}
from = (u64)s64)from) << skip) >> skip);
 
-   cpuc->lbr_entries[i].from   = from;
-   cpuc->lbr_entries[i].to = to;
-   cpuc->lbr_entries[i].mispred= mis;
-   cpuc->lbr_entries[i].predicted  = pred;
-   cpuc->lbr_entries[i].in_tx  = in_tx;
-   cpuc->lbr_entries[i].abort  = abort;
-   cpuc->lbr_entries[i].reserved   = 0;
+   /*
+* Some CPUs report duplicated abort records,
+* with the second entry not having an abort bit set.
+* Skip them here. This loop runs backwards,
+* so we need to undo the previous record.
+* If the abort just happened outside the window
+* the extra entry cannot be removed.
+*/
+   if (abort && x86_pmu.lbr_double_abort && out > 0)
+   out--;
+
+   cpuc->lbr_entries[out].from  = from;
+   cpuc->lbr_entries[out].to= to;
+   cpuc->lbr_entries[out].mispred   = mis;
+   cpuc->lbr_entries[out].predicted = pred;
+   cpuc->lbr_entries[out].in_tx = in_tx;
+   cpuc->lbr_entries[out].abort = abort;
+   cpuc->lbr_entries[out].reserved  = 0;
+   out++;
}
-   cpuc->lbr_stack.nr = i;
+   cpuc->lbr_stack.nr = out;
 }
 
 void intel_pmu_lbr_read(void)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/6] perf, x86: Add Haswell specific transaction flag reporting v5

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

In the PEBS handler report the transaction flags using the new
generic transaction flags facility. Most of them come from
the "tsx_tuning" field in PEBSv2, but the abort code is derived
from the RAX register reported in the PEBS record.

v2: Fix interaction with precise-loads
v3: Mask out reserved bits. More comments.
v4: Adjust white space
v5: Refactor PEBS handler code to collapse one if
and move parts into new inline.
Signed-off-by: Andi Kleen 
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c 
b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 54ff6ce..8df 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -206,6 +206,8 @@ union hsw_tsx_tuning {
u64 value;
 };
 
+#define PEBS_HSW_TSX_FLAGS 0xffULL
+
 void init_debug_store_on_cpu(int cpu)
 {
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -806,6 +808,16 @@ static inline u64 intel_hsw_weight(struct pebs_record_hsw 
*pebs)
return 0;
 }
 
+static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
+{
+   u64 txn = (pebs->tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
+
+   /* For RTM XABORTs also log the abort code from AX */
+   if ((txn & PERF_TXN_TRANSACTION) && (pebs->ax & 1))
+   txn |= ((pebs->ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
+   return txn;
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
   struct pt_regs *iregs, void *__pebs)
 {
@@ -884,10 +896,14 @@ static void __intel_pmu_pebs_event(struct perf_event 
*event,
x86_pmu.intel_cap.pebs_format >= 1)
data.addr = pebs->dla;
 
-   /* Only set the TSX weight when no memory weight was requested. */
-   if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) && !fll &&
-   (x86_pmu.intel_cap.pebs_format >= 2))
-   data.weight = intel_hsw_weight(pebs);
+   if (x86_pmu.intel_cap.pebs_format >= 2) {
+   /* Only set the TSX weight when no memory weight. */
+   if ((event->attr.sample_type & PERF_SAMPLE_WEIGHT) && !fll)
+   data.weight = intel_hsw_weight(pebs);
+
+   if (event->attr.sample_type & PERF_SAMPLE_TRANSACTION)
+   data.txn = intel_hsw_transaction(pebs);
+   }
 
if (has_branch_stack(event))
data.br_stack = >lbr_stack;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kthread: Make kthread_create() killable.

2013-09-30 Thread David Rientjes
On Sat, 28 Sep 2013, Tetsuo Handa wrote:

> Some of enterprise users might prefer "kernel panic followed by kdump and
> automatic reboot" to "a system is not responding for unpredictable period", 
> for
> the panic helps getting information for analyzing what process caused the
> freeze. Well, can they use "Panic (Reboot) On Soft Lockups" option?
> 

Or, when the system doesn't respond for a long period of time you do 
sysrq+T and you find the TIF_MEMDIE bit set on a process that makes no 
progress exiting.  These instances _should_ be very rare since we don't 
have any other reports of it (and the oom killer hasn't differed in this 
regard for over three years).  It used to be much more common for 
mm->mmap_sem dependencies that were fixed.

> Currently the OOM killer kills a process after
> 
>   blocking_notifier_call_chain(_notify_list, 0, );
> 
> in out_of_memory() released all reclaimable memory.

The oom notifiers usually don't do any good on x86.

> This call helps reducing
> the chance to kill a process if the bad process no longer asks for more 
> memory.

The "bad process" could be anything, it's simply the process that is 
allocating memory when all memory is exhausted.

> But if the bad process continues asking for more memory and the chosen task is
> in TASK_UNINTERRUPTIBLE state, this call helps the OOM killer to be disabled
> for unpredictable period. Therefore, releasing all reclaimable memory before
> the OOM killer kills a process might be considered bad.
> 

I don't follow this statement, could you reword it?

If current calls the oom killer and the oom notifiers don't free any 
memory (which is very likely), then choosing an uninterruptible process is 
possible and has always been possible.  If sending SIGKILL and giving that 
process access to memory reserves does not allow it to exit in a short 
amount of time, then it must be waiting on another process that also 
cannot make forward process.  We must identify these cases (which is 
easily doable as described above) and fix them.

> Then, what about an approach described below?
> 
> (1) Introduce a kernel thread which reserves (e.g.) 1 percent of kernel memory
> (this amount should be configurable via sysctl) upon startup.
> 

We don't need kernel threads, this is what per-zone memory reserves are 
intended to provide for GFP_ATOMIC and TIF_MEMDIE allocations (or 
PF_MEMALLOC for reclaimers).

> (2) The kernel thread sleeps using wait_event(memory_reservoir_wait) and
> releases PAGE_SIZE bytes from the reserved memory upon each wakeup.
> 
> (3) The OOM killer calls wake_up() like
> 
>   if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
>   if (unlikely(frozen(task)))
>   __thaw_task(task);
> + /* Let the memory reservoir release memory if the chosen 
> process cannot die. */
> + if (time_after(jiffies, p->memdie_stamp) &&
> + task->state == TASK_UNINTERRUPTIBLE)
> + wake_up(_reservoir_wait);
>   if (!force_kill)
>   return OOM_SCAN_ABORT;
>   }
> 
> in oom_scan_process_thread().
> 

This doesn't guarantee that the process that the chosen process is waiting 
for will be able to allocate that page, so it's useless.

> (4) When a task where test_tsk_thread_flag(task, TIF_MEMDIE) is true has
> terminated and memory used by the task is reclaimed, the reclaimed memory
> is again reserved by the kernel thread up to 1 percent of kernel memory.
> 
> In this way, we could shorten the duration of the OOM killer being disabled
> unless the reserved memory was not enough to terminate the chosen process.
> 

Could you please describe and post the details of any case where this 
currently happens so we can address the problem directly instead of trying 
to workaround it?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 02/15] KVM: MMU: properly check last spte in fast_page_fault()

2013-09-30 Thread Marcelo Tosatti
On Thu, Sep 05, 2013 at 06:29:05PM +0800, Xiao Guangrong wrote:
> Using sp->role.level instead of @level since @level is not got from the
> page table hierarchy
> 
> There is no issue in current code since the fast page fault currently only
> fixes the fault caused by dirty-log that is always on the last level
> (level = 1)
> 
> This patch makes the code more readable and avoids potential issue in the
> further development
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7714fd8..869f1db 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2804,9 +2804,9 @@ static bool page_fault_can_be_fast(u32 error_code)
>  }
>  
>  static bool
> -fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
> +fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> + u64 *sptep, u64 spte)
>  {
> - struct kvm_mmu_page *sp = page_header(__pa(sptep));
>   gfn_t gfn;
>  
>   WARN_ON(!sp->role.direct);
> @@ -2832,6 +2832,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, 
> gva_t gva, int level,
>   u32 error_code)
>  {
>   struct kvm_shadow_walk_iterator iterator;
> + struct kvm_mmu_page *sp;
>   bool ret = false;
>   u64 spte = 0ull;
>  
> @@ -2852,7 +2853,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, 
> gva_t gva, int level,
>   goto exit;
>   }
>  
> - if (!is_last_spte(spte, level))
> + sp = page_header(__pa(iterator.sptep));
> + if (!is_last_spte(spte, sp->role.level))
>   goto exit;
>  
>   /*
> @@ -2878,7 +2880,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, 
> gva_t gva, int level,
>* the gfn is not stable for indirect shadow page.
>* See Documentation/virtual/kvm/locking.txt to get more detail.
>*/
> - ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
> + ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
>  exit:
>   trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
> spte, ret);
> -- 
> 1.8.1.4

Unrelated to this patch:

If vcpu->mode = OUTSIDE_GUEST_MODE, no IPI is sent
by kvm_flush_remote_tlbs.

So how is this supposed to work again?

/*
 * Wait for all vcpus to exit guest mode and/or lockless shadow
 * page table walks.
 */
kvm_flush_remote_tlbs(kvm);

Patch looks fine.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] vfs: avoid sb->s_umount lock while changing bind-mount flags

2013-09-30 Thread Aditya Kali
On Mon, Sep 30, 2013 at 1:03 PM, Al Viro  wrote:
> On Mon, Sep 30, 2013 at 11:13:23AM -0700, Aditya Kali wrote:
>> +Ted Ts'o, Tejun Heo, Jens Axboe
>>
>>
>> On 09/30/2013 10:54 AM, Aditya Kali wrote:
>> >Hi Al and other fs-developers,
>> >
>> >Please let me know what you think about this patch.
>
> Don't top-post, please...  What prevents a race between MS_BIND remounts
> and plain ones?  Used to be serialized on ->s_umount, but...

The rational here is that MS_BIND remounts need to act on the bind
mount-point only (struct vfsmount *) and not on the underlying
superblock (struct super_block *). So, intuitively, it should not be
necessary to take a superblock level lock for the MS_BIND case. As for
the modification of mnt_flags on vfsmount, there is already
vfsmount_lock taken in both MS_BIND and non-MS_BIND case to prevent
the race.

Following is the example that demonstrates the problem that I am
trying to address with this patch:

(1) /dev/sda is mounted at /mnt/sda
(2) /mnt/sda/users/user1 is bind mounted at /home/user1/ ;
/mnt/sda/users/user2 is bind mounted at /home/user2/
(3) user1 is doing buffered writes in /home/user1/logs/
(4) admin tries to make /home/user2/bin/ read-only using a bind-mounts:
$ mount --bind /home/user2/bin/ /home/user2/bin/# this is fast
$ mount --bind -o remount,ro /home/user2/bin/# this blocks
behind any writeback happening on sda (because of sb->s_umount write
lock)

Please advise.

Thanks,
-- 
Aditya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Fix NULL pointer dereference while loading initramfs

2013-09-30 Thread P J P
  Hello Andrew,

I was wondering if you had a chance to review the updated patch that I sent. 
You aren't waiting on me for something, are you?

Thank you.
--
Prasad J Pandit / Red Hat Security Response Team
DB7A 84C5 D3F9 7CD1 B5EB  C939 D048 7860 3655 602B
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Export initial ramdisk compression config

2013-09-30 Thread P J P
  Hello Andrew,

I was wondering if you had a chance to review this patch further? Should I 
send similar patches for other architectures too? As in you aren't waiting on 
me for that, are you?

Thank you.
--
Prasad J Pandit / Red Hat Security Response Team
DB7A 84C5 D3F9 7CD1 B5EB  C939 D048 7860 3655 602B
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: correct missing private flag clearing

2013-09-30 Thread Andrew Morton
On Mon, 30 Sep 2013 16:59:44 +0900 Joonsoo Kim  wrote:

> We should clear the page's private flag when returing the page to
> the page allocator or the hugepage pool. This patch fixes it.
> 
> Signed-off-by: Joonsoo Kim 
> ---
> Hello, Andrew.
> 
> I sent the new version of commit ('07443a8') before you did pull request,
> but it isn't included. It may be losted :)
> So I send this fix. IMO, this is good for v3.12.
> 
> Thanks.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b49579c..691f226 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -653,6 +653,7 @@ static void free_huge_page(struct page *page)
>   BUG_ON(page_count(page));
>   BUG_ON(page_mapcount(page));
>   restore_reserve = PagePrivate(page);
> + ClearPagePrivate(page);
>  

You describe it as a fix, but what does it fix?  IOW, what are the
user-visible effects of the change?

update_and_free_page() already clears PG_private, but afaict the bit
remains unaltered if free_huge_page() takes the enqueue_huge_page()
route.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tpmdd-devel] [PATCH 09/13] tpm: Pull everything related to sysfs into tpm-sysfs.c

2013-09-30 Thread Jason Gunthorpe
On Mon, Sep 30, 2013 at 04:36:27PM -0400, Daniel De Graaf wrote:

> >I think using CONFIG_ options would make this feature unavaiable to
> >distro kernel users...
> 
> This just moves the problem - now you need a custom initrd instead of
> a custom kernel. Other TPM options like IMA's PCR selection also must
> be changed at CONFIG_ time, although that seems to be more justified
> since IMA in TCB mode is not usable on any distro kernel that makes
> the TPM driver a module (i.e. most or all of them).

A 'custom' initrd is something a distro can automate. Eg a distro's
initrd generation script could read /etc/tpm.cfg and generate an
initrd with the module load and correct sysfs writes. This is more
accessible than recompiling the kernel.

My comments would apply to IMA as well, it should work with standard
distros, meaning the initrd must be able to set it up. So, load the
module in the initrd, setup localities, select the PCR, then enable
IMA.

The bootloader should measure the kernel and initrd together.

IMHO, distros are not making it easy to enable TPM features, and
requiring a kernel recompile is not helping :)

> There is also the fact that the driver may not be able to tell if a
> locality is available without doing some kind of test command. The
> Xen

Make sense.

> Or, for more flexibility (I actually like this one better):
> 
>  - CONFIG_TPM_KERNEL_DEFAULT_LOCALITY = [int]
>  - CONFIG_TPM_KERNEL_LOCALITY_FIXED = [bool]
> 
> And sysfs contains:
>  - kernel_locality [0644, int; 0444 if FIXED=y or when locked(?)]
>  - lock_kernel_locality [write-once; only exists if FIXED=n]

Yes, this looks simple and sane.

But if there isn't really a need to have a hardwired kernel, the
defaults can be DEFAULT_LOCALITY=0, LOCALITY_FIXED=n and we can
recommend distros rely on the initrd.

> So far, nobody I have talked to has offered any strong opinions on
> what locality should be used or how it should be set. I think finding
> a developer of trousers may be the most useful to talk about how the
> ioctl portion of this would need to be set up - if someone is actually
> needed.

It would be nice to have a user! As I said, we don't use it here.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] conditionally reschedule while loading selinux policy.

2013-09-30 Thread Dave Jones
On Mon, Sep 30, 2013 at 01:37:53PM -0400, Stephen Smalley wrote:
 
 > > With that patch applied, the problem seems to have moved elsewhere..
 > > 
 > > BUG: soft lockup - CPU#3 stuck for 22s! [load_policy:8119]
 > >  irq event stamp: 1590886
 > >  hardirqs last  enabled at (1590885): [] 
 > > __slab_alloc.constprop.78+0x4c0/0x4d7
 > >  hardirqs last disabled at (1590886): [] 
 > > apic_timer_interrupt+0x6a/0x80
 > >  softirqs last  enabled at (1590336): [] 
 > > __do_softirq+0x169/0x200
 > >  softirqs last disabled at (1590331): [] 
 > > irq_exit+0x11d/0x140
 > >  RIP: 0010:[]  [] 
 > > hashtab_insert+0x62/0x110
 > > 
 > > Call Trace:
 > >  [] policydb_read+0xc25/0x1200
 > ...
 > > We're holding a bunch of locks here, so we can't just cond_resched.  
 > > Thoughts ?
 > 
 > Sorry, what locks are we holding there?  You ought to be able to do a
 > cond_resched() anywhere during policydb_read() AFAIK; it is loading the
 > policy into a new structure that isn't being accessed by anything else
 > yet and the policy_rwlock is only held by security_load_policy after
 > calling policydb_read and only to switch it into place as the active
 > policydb.

Hmm, I thought I had tried this already, and got a lot of spew, but it turns out
for some reason I had previously patched hashtab_search instead.

I'll try running with this for a while..

Dave


diff --git a/security/selinux/ss/hashtab.c b/security/selinux/ss/hashtab.c
index 933e735..2cc4961 100644
--- a/security/selinux/ss/hashtab.c
+++ b/security/selinux/ss/hashtab.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "hashtab.h"
 
 struct hashtab *hashtab_create(u32 (*hash_value)(struct hashtab *h, const void 
*key),
@@ -40,6 +41,8 @@ int hashtab_insert(struct hashtab *h, void *key, void *datum)
u32 hvalue;
struct hashtab_node *prev, *cur, *newnode;
 
+   cond_resched();
+
if (!h || h->nel == HASHTAB_MAX_NODES)
return -EINVAL;
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] [IA64] aliasing-test: fix mem leaks

2013-09-30 Thread Jesper Juhl
Fix a few trivial mem leaks when returning from scan_tree/scan_rom.

Signed-off-by: Jesper Juhl 
---
 Documentation/ia64/aliasing-test.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/Documentation/ia64/aliasing-test.c 
b/Documentation/ia64/aliasing-test.c
index 62a190d..33ed7a7 100644
--- a/Documentation/ia64/aliasing-test.c
+++ b/Documentation/ia64/aliasing-test.c
@@ -96,6 +96,7 @@ static int scan_tree(char *path, char *file, off_t offset, 
size_t length, int to
fprintf(stderr, "PASS: %s 0x%lx-0x%lx not 
mappable\n", path2, offset, offset + length);
else {
fprintf(stderr, "FAIL: %s 0x%lx-0x%lx not 
accessible\n", path2, offset, offset + length);
+   free(path2);
return rc;
}
} else {
@@ -185,6 +186,7 @@ static int scan_rom(char *path, char *file)
fprintf(stderr, "PASS: %s read %d bytes\n", 
path2, rc);
else {
fprintf(stderr, "PASS: %s not readable\n", 
path2);
+   free(path2);
return rc;
}
} else {
-- 
1.7.1

-- 
Jesper Juhlhttp://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v2] EFI: Runtime services virtual mapping

2013-09-30 Thread Vivek Goyal
On Mon, Sep 30, 2013 at 11:06:46PM +0200, Borislav Petkov wrote:
> On Mon, Sep 30, 2013 at 04:46:42PM -0400, Vivek Goyal wrote:
> > On Mon, Sep 30, 2013 at 10:41:54PM +0200, Borislav Petkov wrote:
> > 
> > [..]
> > > hpa, so, for the struct efi_mapping thing to work, I will have to grep
> > > dmesg of the first kernel to get an output like the one below, filter
> > > the runtime regions so that I can build the struct efi_mapping array of
> > > mem regions which I pass on to the second, kexec kernel with setup_data.
> > 
> > Any data you need to pass to second kernel will have to be exported to
> > user space either using /sys or /proc so that kexec-tools can parse it.
> 
> And I cannot parse dmesg? Because I have all the info there already.

dmesg can 't relied on. It is just a kernel buffer which root can
clear. (dmesg --clear). So required info has to be exported to user
space in a suitable form.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v2] EFI: Runtime services virtual mapping

2013-09-30 Thread Borislav Petkov
On Mon, Sep 30, 2013 at 04:46:42PM -0400, Vivek Goyal wrote:
> On Mon, Sep 30, 2013 at 10:41:54PM +0200, Borislav Petkov wrote:
> 
> [..]
> > hpa, so, for the struct efi_mapping thing to work, I will have to grep
> > dmesg of the first kernel to get an output like the one below, filter
> > the runtime regions so that I can build the struct efi_mapping array of
> > mem regions which I pass on to the second, kexec kernel with setup_data.
> 
> Any data you need to pass to second kernel will have to be exported to
> user space either using /sys or /proc so that kexec-tools can parse it.

And I cannot parse dmesg? Because I have all the info there already.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/11] epoll: Remove unnecessary error path

2013-09-30 Thread Andi Kleen
> > @@ -1276,15 +1276,6 @@ static int ep_insert(struct eventpoll *ep, struct 
> > epoll_event *event,
> >  */
> > revents = ep_item_poll(epi, );
> 
> ep_item_poll calls f_op->poll, which calls poll_wait().
> poll_wait() will call ep_ptable_queue_proc.

Thanks. Good point.
-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/11] epoll: Remove unnecessary error path

2013-09-30 Thread Eric Wong
Andi Kleen  wrote:
> From: Andi Kleen 
> 
> A static checker was pointing out that nothing can possible set
> nwait < 0 in this path. The comment and the check appears to be
> outdated. Remove it.

I don't think so...

> Cc: v...@zeniv.linux.org.uk
> Signed-off-by: Andi Kleen 
> ---
>  fs/eventpoll.c | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 473e09d..f72bf55 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1276,15 +1276,6 @@ static int ep_insert(struct eventpoll *ep, struct 
> epoll_event *event,
>*/
>   revents = ep_item_poll(epi, );

ep_item_poll calls f_op->poll, which calls poll_wait().
poll_wait() will call ep_ptable_queue_proc.

> - /*
> -  * We have to check if something went wrong during the poll wait queue
> -  * install process. Namely an allocation for a wait queue failed due
> -  * high memory pressure.
> -  */
> - error = -ENOMEM;
> - if (epi->nwait < 0)
> - goto error_unregister;
> -
>   /* Add the current item to the list of active epoll hook for this file 
> */
>   spin_lock(>f_lock);
>   list_add_tail(>fllink, >f_ep_links);
> @@ -1334,7 +1325,6 @@ error_remove_epi:
>  
>   rb_erase(>rbn, >rbr);
>  
> -error_unregister:
>   ep_unregister_pollwait(ep, epi);
>  
>   /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 01/11] random: don't feed stack data into pool when interrupt regs NULL

2013-09-30 Thread Luck, Tony
> In this case fast_mix would use two uninitialized ints from the stack
> and mix it into the pool.

Is the concern here is that an attacker might know (or be able to control) what 
is on
the stack - and so get knowledge of what is being mixed into the pool?

> In this case set the input to 0.

And the fix is to guarantee that everyone knows what is being mixed in? (!)

Wouldn't it be better to adjust the "nbytes" parameter to

fast_mix(..., ..., sizeof (input));

to only mix in the part of input[] that we successfully initialized?

Untested patch below.

Signed-off-by: Tony Luck 

---

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 7737b5bd26af..5c4ec0abb702 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -745,16 +745,19 @@ void add_interrupt_randomness(int irq, int irq_flags)
struct pt_regs  *regs = get_irq_regs();
unsigned long   now = jiffies;
__u32   input[4], cycles = get_cycles();
+   int nbytes;
 
input[0] = cycles ^ jiffies;
input[1] = irq;
+   nbytes = 2 * sizeof(input[0]);
if (regs) {
__u64 ip = instruction_pointer(regs);
input[2] = ip;
input[3] = ip >> 32;
+   nbytes += 2 * sizeof(input[0]);
}
 
-   fast_mix(fast_pool, input, sizeof(input));
+   fast_mix(fast_pool, input, nbytes);
 
if ((fast_pool->count & 1023) &&
!time_after(now, fast_pool->last + HZ))


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v2] EFI: Runtime services virtual mapping

2013-09-30 Thread Borislav Petkov
On Mon, Sep 30, 2013 at 04:35:05PM -0400, Vivek Goyal wrote:
> On Mon, Sep 30, 2013 at 10:17:30PM +0200, Borislav Petkov wrote:
> > On Thu, Sep 26, 2013 at 11:12:42AM +0800, Dave Young wrote:
> > > If we choose this approach, can we save not only the efi_mapping, but
> > > also the fields which will be converted to virt addr, like fw_vendor,
> > > runtime, tables? During my test on a HP workstation, the config table
> > > item (SMBIOS) also is converted to virt addr though spec only mention
> > > fw_vendor/runtime/tables.
> > 
> > Btw, I was about to ask: how do you pass boot_params to the kexec
> > kernel?
> 
> kexec-tools in user space prepares the boot_params and passes it to
> second kernel.

Ok, thanks.

hpa, so, for the struct efi_mapping thing to work, I will have to grep
dmesg of the first kernel to get an output like the one below, filter
the runtime regions so that I can build the struct efi_mapping array of
mem regions which I pass on to the second, kexec kernel with setup_data.

Unless you have a better idea, of course... :)

[0.00] efi: EFI v2.00 by American Megatrends
[0.00] efi:  ACPI 2.0=0xac812f98  SMBIOS=0xac573018 
[0.00] efi: mem00: type=3, attr=0xf, 
range=[0x-0x8000) (0MB)
[0.00] efi: mem01: type=7, attr=0xf, 
range=[0x8000-0x0004e000) (0MB)
[0.00] efi: mem02: type=4, attr=0xf, 
range=[0x0004e000-0x0006) (0MB)
[0.00] efi: mem03: type=3, attr=0xf, 
range=[0x0006-0x000a) (0MB)
[0.00] efi: mem04: type=2, attr=0xf, 
range=[0x0010-0x0058b000) (4MB)
[0.00] efi: mem05: type=7, attr=0xf, 
range=[0x0058b000-0x0100) (10MB)
[0.00] efi: mem06: type=4, attr=0xf, 
range=[0x0100-0x0102) (0MB)
[0.00] efi: mem07: type=7, attr=0xf, 
range=[0x0102-0x378fe000) (872MB)
[0.00] efi: mem08: type=2, attr=0xf, 
range=[0x378fe000-0x37c77000) (3MB)
[0.00] efi: mem09: type=7, attr=0xf, 
range=[0x37c77000-0x7d6a8000) (1114MB)
[0.00] efi: mem10: type=2, attr=0xf, 
range=[0x7d6a8000-0xa742) (669MB)
[0.00] efi: mem11: type=1, attr=0xf, 
range=[0xa742-0xa743d000) (0MB)
[0.00] efi: mem12: type=4, attr=0xf, 
range=[0xa743d000-0xa7495000) (0MB)
[0.00] efi: mem13: type=7, attr=0xf, 
range=[0xa7495000-0xa74be000) (0MB)
[0.00] efi: mem14: type=4, attr=0xf, 
range=[0xa74be000-0xa74ef000) (0MB)
[0.00] efi: mem15: type=7, attr=0xf, 
range=[0xa74ef000-0xa74f2000) (0MB)
[0.00] efi: mem16: type=4, attr=0xf, 
range=[0xa74f2000-0xa79db000) (4MB)
[0.00] efi: mem17: type=6, attr=0x800f, 
range=[0xa79db000-0xa99dc000) (32MB)
[0.00] efi: mem18: type=4, attr=0xf, 
range=[0xa99dc000-0xa9ca1000) (2MB)
[0.00] efi: mem19: type=7, attr=0xf, 
range=[0xa9ca1000-0xa9ca3000) (0MB)
[0.00] efi: mem20: type=4, attr=0xf, 
range=[0xa9ca3000-0xabf12000) (34MB)
[0.00] efi: mem21: type=7, attr=0xf, 
range=[0xabf12000-0xac0a6000) (1MB)
[0.00] efi: mem22: type=3, attr=0xf, 
range=[0xac0a6000-0xac512000) (4MB)
[0.00] efi: mem23: type=7, attr=0xf, 
range=[0xac512000-0xac529000) (0MB)
[0.00] efi: mem24: type=5, attr=0x800f, 
range=[0xac529000-0xac54c000) (0MB)
[0.00] efi: mem25: type=7, attr=0xf, 
range=[0xac54c000-0xac566000) (0MB)
[0.00] efi: mem26: type=6, attr=0x800f, 
range=[0xac566000-0xac569000) (0MB)
[0.00] efi: mem27: type=7, attr=0xf, 
range=[0xac569000-0xac56a000) (0MB)
[0.00] efi: mem28: type=6, attr=0x800f, 
range=[0xac56a000-0xac576000) (0MB)
[0.00] efi: mem29: type=7, attr=0xf, 
range=[0xac576000-0xac596000) (0MB)
[0.00] efi: mem30: type=0, attr=0xf, 
range=[0xac596000-0xac5fb000) (0MB)
[0.00] efi: mem31: type=7, attr=0xf, 
range=[0xac5fb000-0xac6b7000) (0MB)
[0.00] efi: mem32: type=10, attr=0xf, 
range=[0xac6b7000-0xac7e3000) (1MB)
[0.00] efi: mem33: type=7, attr=0xf, 
range=[0xac7e3000-0xac7e4000) (0MB)
[0.00] efi: mem34: type=10, attr=0xf, 
range=[0xac7e4000-0xac7fb000) (0MB)
[0.00] efi: mem35: type=7, attr=0xf, 
range=[0xac7fb000-0xac80a000) (0MB)
[0.00] efi: mem36: type=2, attr=0xf, 
range=[0xac80a000-0xac81) (0MB)
[0.00] efi: mem37: type=9, attr=0xf, 
range=[0xac81-0xac813000) (0MB)
[0.00] efi: mem38: type=4, 

Re: [PATCHv7 03/18] thermal: introduce device tree parser

2013-09-30 Thread Eduardo Valentin
Hi Mark

Thanks for another good review round. I am following up your concerns
inline.

On 30-09-2013 11:36, Mark Rutland wrote:
> Hi Eduardo,
> 
> On Fri, Sep 27, 2013 at 04:13:10AM +0100, Eduardo Valentin wrote:
>> This patch introduces a device tree bindings for
>> describing the hardware thermal behavior and limits.
>> Also a parser to read and interpret the data and feed
>> it in the thermal framework is presented.
>>
>> This patch introduces a thermal data parser for device
>> tree. The parsed data is used to build thermal zones
>> and thermal binding parameters. The output data
>> can then be used to deploy thermal policies.
>>
>> This patch adds also documentation regarding this
>> API and how to define tree nodes to use
>> this infrastructure.
>>
>> Note that, in order to be able to have control
>> on the sensor registration on the DT thermal zone,
>> it was required to allow changing the thermal zone
>> .get_temp callback. For this reason, this patch
>> also removes the 'const' modifier from the .ops
>> field of thermal zone devices.
>>
>> Cc: Zhang Rui 
>> Cc: linux...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Eduardo Valentin 
>> ---
>>
>> Hello folks,
>>
>> For those interested on this work, here is a short changelog from v6:
>> - Reviewed binding documentation and added type and size on all properties
>> - Described 'thermal-zones' node in binding documentation
>> - Using now cooling state terminology instead of cooling level
>> - Renamed milliCelsius to millicelsius
>> - Renamed cooling-attachments to 'cooling-maps'
>> - Renamed sensor to thermal sensor
>> - Renamed #sensor-cells to #thermal-sensor-cells
>> - #thermal-sensor-cells now is allowed to be 0
>> - Renamed 'sensors' property to 'thermal-sensors'
>> - Changed trip type property to be now string and documented possible values
>> - Described better the cooling properties
>> - Now parses 'contribution' instead of 'usage'
>> - Renamed cdev to cooling device
>> - Renamed 'passive-delay' property to 'polling-delay-passive'
>> - Fixed a couple of error messages to match the property name
>> - Fixed a binding issue while using cpufreq as module
>> - Reworked thermal_of_build_thermal_zone function so that it frees
>> requested memory if something goes wrong and checks for sub-nodes
>> with zero children.
>>
>> ---
>>  .../devicetree/bindings/thermal/thermal.txt| 537 +
>>  drivers/thermal/Kconfig|  13 +
>>  drivers/thermal/Makefile   |   1 +
>>  drivers/thermal/of-thermal.c   | 845 
>> +
>>  drivers/thermal/thermal_core.c |   9 +-
>>  drivers/thermal/thermal_core.h |   9 +
>>  include/dt-bindings/thermal/thermal.h  |  27 +
>>  include/linux/thermal.h|  28 +-
>>  8 files changed, 1466 insertions(+), 3 deletions(-)
>>  create mode 100644 Documentation/devicetree/bindings/thermal/thermal.txt
>>  create mode 100644 drivers/thermal/of-thermal.c
>>  create mode 100644 include/dt-bindings/thermal/thermal.h
>>
>> diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt 
>> b/Documentation/devicetree/bindings/thermal/thermal.txt
>> new file mode 100644
>> index 000..3dbc017
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/thermal/thermal.txt
>> @@ -0,0 +1,537 @@
>> +* Thermal Framework Device Tree descriptor
>> +
>> +Generic binding to provide a way of defining hardware thermal
>> +structure using device tree. A thermal structure includes thermal
>> +zones and their components, such as trip points, polling intervals,
>> +sensors and cooling devices binding descriptors.
>> +
>> +The target of device tree thermal descriptors is to describe only
>> +the hardware thermal aspects, not how the system must control or which
>> +algorithm or policy must be taken in place.
>> +
>> +There are five types of nodes involved to describe thermal bindings:
>> +- sensors: used to describe the device source of temperature sensing;
>> +- cooling devices: used to describe devices source of power dissipation 
>> control;
>> +- trip points: used to describe points in temperature domain defined to
>> +make the system aware of hardware limits;
>> +- cooling maps: used to describe links between trip points and
>> +cooling devices;
>> +- thermal zones: used to describe thermal data within the hardware;
>> +
>> +It follows a description of each type of these device tree nodes.
>> +
>> +Note: This binding is a working in progress.
> 
> What does this mean? Is the binding expected to change in incompatible
> ways?

Well, no. The note was more to say we still don't have a final
agreement. But for merging this patch, once we have agreed to get it
merged, I believe this comment can be left out.

> 
> I see we have similar wording at the top of the clock bindings, which
> should probably go -- it's ABI now...

OK.

> 
>> +
>> +* Thermal sensor 

Re: [PATCH -v2] EFI: Runtime services virtual mapping

2013-09-30 Thread Vivek Goyal
On Mon, Sep 30, 2013 at 10:41:54PM +0200, Borislav Petkov wrote:

[..]
> hpa, so, for the struct efi_mapping thing to work, I will have to grep
> dmesg of the first kernel to get an output like the one below, filter
> the runtime regions so that I can build the struct efi_mapping array of
> mem regions which I pass on to the second, kexec kernel with setup_data.

Any data you need to pass to second kernel will have to be exported to
user space either using /sys or /proc so that kexec-tools can parse it.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] staging: lustre: Don't leak 'buffer' in cfs_get_environ()

2013-09-30 Thread Jesper Juhl
If 'down_read_trylock' fails we'll curently leak the memory allocated to 
'buffer'.
Fix the leak by simply kfree'ing 'buffer' before returning '-EDEADLK'.

Signed-off-by: Jesper Juhl 
---
 .../lustre/lustre/libcfs/linux/linux-curproc.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/lustre/lustre/libcfs/linux/linux-curproc.c 
b/drivers/staging/lustre/lustre/libcfs/linux/linux-curproc.c
index ea9e949..ecd031f 100644
--- a/drivers/staging/lustre/lustre/libcfs/linux/linux-curproc.c
+++ b/drivers/staging/lustre/lustre/libcfs/linux/linux-curproc.c
@@ -227,8 +227,10 @@ int cfs_get_environ(const char *key, char *value, int 
*val_len)
 * which is already holding mmap_sem for writes.  If some other
 * thread gets the write lock in the meantime, this thread will
 * block, but at least it won't deadlock on itself.  LU-1735 */
-   if (down_read_trylock(>mmap_sem) == 0)
+   if (down_read_trylock(>mmap_sem) == 0) {
+   kfree(buffer);
return -EDEADLK;
+   }
up_read(>mmap_sem);
 
addr = mm->env_start;
-- 
1.7.1


-- 
Jesper Juhlhttp://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tpmdd-devel] [PATCH 09/13] tpm: Pull everything related to sysfs into tpm-sysfs.c

2013-09-30 Thread Daniel De Graaf

On 09/30/2013 02:10 PM, Jason Gunthorpe wrote:

On Tue, Sep 24, 2013 at 10:28:41AM -0400, Daniel De Graaf wrote:

On 09/23/2013 06:23 PM, Jason Gunthorpe wrote:

On Mon, Sep 23, 2013 at 06:00:46PM -0400, Daniel De Graaf wrote:


In a PC client TPM, normal OS code (as opposed to firmware or microcode)
is already restricted to locality 0-2. It may make sense to restrict
locality 2 to the kernel, which would allow an in-kernel TPM seal
command to be able to bind data so that userspace cannot unseal it.
However, only allowing localities 0 and 1 to userspace may be too
restrictive if userspace also wishes to use locality for separation,
since locality 1 does not have the ability to reset any PCRs that
locality 0 cannot also reset.
The kernel could reserve only locality 1 for its own use, giving it the
ability to seal data but not interfering with the ability to reset PCRs.
This would be my preference, although it is less intuitive to allow code
of lower privilege (userspace) to control the higher numbered locality
2.


This matches my vague understanding (we don't use localities here)


Perhaps a .config option would be useful to allow the system designer to
choose what, if any, locality to reserve for kernel use?


A runtime sysfs seems reasonable..


Allowing a userspace application to change which locality is kernel- and
userspace-only will eliminate the primary benefit of having a locality
restricted to the kernel.


Right, I was sort of thinking these sysfs files would be write-once or
trapped doored to prevent going backwards (like how secure level
worked). Ideally the write would be in the initramfs, which is part of
the PCR computation, so it should have the same properties as CONFIG_?

I think using CONFIG_ options would make this feature unavaiable to
distro kernel users...


This just moves the problem - now you need a custom initrd instead of
a custom kernel. Other TPM options like IMA's PCR selection also must
be changed at CONFIG_ time, although that seems to be more justified
since IMA in TCB mode is not usable on any distro kernel that makes
the TPM driver a module (i.e. most or all of them).


At least "supported_localities" should be generated by the driver if it
is generated at all. There are a few different proposals for handling
localities over 4 in virtual TPMs; one is that locality numbers between
32-255 would be permitted and 5-31 made non-addressable. While this
would work with a bitmask, I'm not sure that is the best solution.


Hmm, a 256 bit wide mask isn't impossible, not sure what other
options are good. Do you think userspace needs to know what localities
are valid or is an ioctl scan sufficient?


I don't think userspace needs to know this directly. The system designer
will need know in order to create any kind of policy about what locality
is being used (even if that's just setting the default), but they would
also need to be aware of the hardware or virtual environment and could
always enumerate them manually as a last resort.

There is also the fact that the driver may not be able to tell if a
locality is available without doing some kind of test command. The Xen
TPM interface doesn't expose what localities are available, for example,
and the TIS interface may need to test to see if locality 3 and 4 are
actually blocked by the chipset - at least 3 might be available on some
systems (the spec leaves this "implementation dependent").


Perhaps:
default_locality - default to CONFIG_USER_DEFAULT_LOCALITY
sysfs node permissions 0644
kernel_locality - default to #CONFIG_KERNEL_DEFAULT_LOCALITY
0444 if CONFIG_KERNEL_ONLY_LOCALITY=y
0644 if CONFIG_KERNEL_ONLY_LOCALITY=n
ioctl TPM_{GET,SET}_LOCALITY on an open /dev/tpmX

If CONFIG_KERNEL_ONLY_LOCALITY=y, the userspace locality is not
permitted to be equal to kernel_locality (but may take any other valid
value).  Drivers may reject locality values that they consider invalid
(the default should be to only allow 0-4, which is all that is defined
in the spec) or may fail on attempted use of the TPM by passing down an
error from the hardware - I would expect the latter to be the case on
attempts to use locality 4 in the tpm_tis driver.


Seems resonable, CONFIG_KERNEL_ONLY_LOCALITY could be
CONFIG_TPM_ONE_TIME_LOCALITY (eg you get to set kernel_locality only
once)


Hmm, how much trouble would it be to make this a menu selection? Even
with the one-time-set option, you still need a default set either in
the code or by CONFIG_ so that the TPM is not unavailable before the
sysfs write. The options would be:

 - CONFIG_TPM_KERNEL_DEFAULT_LOCALITY = [int]
 - CONFIG_TPM_KERNEL_LOCALITY_FIXED - no changes from userspace
 - CONFIG_TPM_KERNEL_LOCALITY_ONESHOT - only one change possible
 - CONFIG_TPM_KERNEL_LOCALITY_ANY - may be changed freely

The userspace locality is not allowed to use the kernel locality if
the mode is either FIXED or ONESHOT, but may share locality if ANY

Re: perf, x86: Add last TSX PMU code for Haswell v2

2013-09-30 Thread Andi Kleen
> I knew I've already sent this out! heh, even acked ;-)
> http://marc.info/?l=linux-kernel=138021331913851=2
> 
> got confused by your new PING^2

Thanks I didn't see that before. I'll add the Acks.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 11/11] perf: Avoid uninitialized sample type reference in __perf_event__output_id_sample

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

__perf_event__output_id_sample looks at data->type to decide
what to output.

A lot of of the custom output functions, for example perf_log_throttle
start with perf_event_header__init_id, which only initializes
the header when event->attr.sample_id_all is true.

But when this is false the output function is still called,
and will look at an uninitialized header.

I changed all the callers to perf_event_header__init_id
to __perf_event_header__init_id which unconditionally
initializes the header.

FWIW I'm not fully sure this is the correct fix and what the
exact semantics of sample_id_all are supposed to be.
Should it disable all throttling etc. messages?
Please review carefully.

Cc: mi...@kernel.org
Cc: eran...@google.com
Cc: pet...@infradead.org
Signed-off-by: Andi Kleen 
---
 kernel/events/core.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cb4238e..0b15209 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4732,7 +4732,7 @@ perf_event_read_event(struct perf_event *event,
};
int ret;
 
-   perf_event_header__init_id(_event.header, , event);
+   __perf_event_header__init_id(_event.header, , event);
ret = perf_output_begin(, event, read_event.header.size);
if (ret)
return;
@@ -4837,7 +4837,7 @@ static void perf_event_task_output(struct perf_event 
*event,
if (!perf_event_task_match(event))
return;
 
-   perf_event_header__init_id(_event->event_id.header, , 
event);
+   __perf_event_header__init_id(_event->event_id.header, , 
event);
 
ret = perf_output_begin(, event,
task_event->event_id.header.size);
@@ -4931,7 +4931,7 @@ static void perf_event_comm_output(struct perf_event 
*event,
if (!perf_event_comm_match(event))
return;
 
-   perf_event_header__init_id(_event->event_id.header, , 
event);
+   __perf_event_header__init_id(_event->event_id.header, , 
event);
ret = perf_output_begin(, event,
comm_event->event_id.header.size);
 
@@ -5063,7 +5063,7 @@ static void perf_event_mmap_output(struct perf_event 
*event,
mmap_event->event_id.header.size += 
sizeof(mmap_event->ino_generation);
}
 
-   perf_event_header__init_id(_event->event_id.header, , 
event);
+   __perf_event_header__init_id(_event->event_id.header, , 
event);
ret = perf_output_begin(, event,
mmap_event->event_id.header.size);
if (ret)
@@ -5237,7 +5237,7 @@ static void perf_log_throttle(struct perf_event *event, 
int enable)
if (enable)
throttle_event.header.type = PERF_RECORD_UNTHROTTLE;
 
-   perf_event_header__init_id(_event.header, , event);
+   __perf_event_header__init_id(_event.header, , event);
 
ret = perf_output_begin(, event,
throttle_event.header.size);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v2] EFI: Runtime services virtual mapping

2013-09-30 Thread Vivek Goyal
On Mon, Sep 30, 2013 at 10:17:30PM +0200, Borislav Petkov wrote:
> On Thu, Sep 26, 2013 at 11:12:42AM +0800, Dave Young wrote:
> > If we choose this approach, can we save not only the efi_mapping, but
> > also the fields which will be converted to virt addr, like fw_vendor,
> > runtime, tables? During my test on a HP workstation, the config table
> > item (SMBIOS) also is converted to virt addr though spec only mention
> > fw_vendor/runtime/tables.
> 
> Btw, I was about to ask: how do you pass boot_params to the kexec
> kernel?

kexec-tools in user space prepares the boot_params and passes it to
second kernel.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 07/11] igb: Avoid uninitialized advertised variable in eee_set_cur

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

eee_get_cur assumes that the output data is already zeroed. It can
read-modify-write the advertised field:

  if (ipcnfg & E1000_IPCNFG_EEE_100M_AN)
2594edata->advertised |= ADVERTISED_100baseT_Full;

This is ok for the normal ethtool eee_get call, which always
zeroes the input data before.

But eee_set_cur also calls eee_get_cur and it did not zero the input
field. Later on it then compares agsinst the field, which can contain partial
stack garbage.

Zero the input field in eee_set_cur() too.

Cc: jeffrey.t.kirs...@intel.com
Cc: net...@vger.kernel.org
Signed-off-by: Andi Kleen 
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 48cbc83..41e37ff 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2652,6 +2652,8 @@ static int igb_set_eee(struct net_device *netdev,
(hw->phy.media_type != e1000_media_type_copper))
return -EOPNOTSUPP;
 
+   memset(_curr, 0, sizeof(struct ethtool_eee));
+
ret_val = igb_get_eee(netdev, _curr);
if (ret_val)
return ret_val;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Various static checker fixes

2013-09-30 Thread Andi Kleen
I was playing around with LLVM's static checker. Here are some fixes
from reviewing the output on a kernel build.

Some merely shut up the checker -- it is technically right, but
the problem is not a real bug as far as I can tell. Still it's
cleaner/clearer to avoid uninitialized variables and similar.

A few others are real bugs.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 05/11] seq_file: Handle ->next error in seq_read

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

When ->next fails the error would not be returned to user space.
Add the missing check.

Cc: v...@zeniv.linux.org.uk
Signed-off-by: Andi Kleen 
---
 fs/seq_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 3135c25..e16b4a8 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -262,6 +262,8 @@ Fill:
pos = next;
}
m->op->stop(m, p);
+   if (err)
+   goto Done;
n = min(m->count, size);
err = copy_to_user(buf, m->buf, n);
if (err)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 03/11] posix-timers: Initialize timer value to 0 for invalid timers.

2013-09-30 Thread Andi Kleen
From: Andi Kleen 

cpu_clock_sample et.al. can fail with an invalid timer type.
For most callers this cannot happen because the timer has been
successfully initialized earlier, so they don't check
the return value, but still use the val output.

Unfortunately static checkers for uninitialied variables
don't understand this and always throw a warning in this case.

Initialize the timer value to 0 for the error case to
shut it up.

This doesn't fix any real bug, just reduces warning noise.

Cc: t...@linutronix.de
Signed-off-by: Andi Kleen 
---
 kernel/posix-cpu-timers.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index c7f31aa..cc3c3f2 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -182,6 +182,7 @@ static int cpu_clock_sample(const clockid_t which_clock, 
struct task_struct *p,
 {
switch (CPUCLOCK_WHICH(which_clock)) {
default:
+   *sample = 0;
return -EINVAL;
case CPUCLOCK_PROF:
*sample = prof_ticks(p);
@@ -243,6 +244,7 @@ static int cpu_clock_sample_group(const clockid_t 
which_clock,
 
switch (CPUCLOCK_WHICH(which_clock)) {
default:
+   *sample = 0;
return -EINVAL;
case CPUCLOCK_PROF:
thread_group_cputime(p, );
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   10   >