Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

2013-01-05 Thread Alex Shi

>>  static unsigned long weighted_cpuload(const int cpu)
>>  {
>> -return cpu_rq(cpu)->load.weight;
>> +return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
> 
> The change on the line above causes the aim9 multitask benchmark to drop
> about 10% in performance on many x86 machines. The profile just shows that
> cpuidle enter is called more often.
> The testing command:
> 
> #( echo $hostname ; echo test ; echo 1 ; echo 2000 ; echo 2 ; echo 2000
> ; echo 100 ) | ./multitask -nl
> 
> The oprofile output here:
> with this patch set
> 101978 total                            0.0134
>  54406 cpuidle_wrap_enter             499.1376
>   2098 __do_page_fault                  2.0349
>   1976 rwsem_wake                      29.0588
>   1824 finish_task_switch              12.4932
>   1560 copy_user_generic_string        24.3750
>   1346 clear_page_c                    84.1250
>   1249 unmap_single_vma                 0.6885
>   1141 copy_page_rep                   71.3125
>   1093 anon_vma_interval_tree_insert    8.1567
> 
> 3.8-rc2
>  68982 total                            0.0090
>  22166 cpuidle_wrap_enter             203.3578
>   2188 rwsem_wake                      32.1765
>   2136 __do_page_fault                  2.0718
>   1920 finish_task_switch              13.1507
>   1724 poll_idle                       15.2566
>   1433 copy_user_generic_string        22.3906
>   1237 clear_page_c                    77.3125
>   1222 unmap_single_vma                 0.6736
>   1053 anon_vma_interval_tree_insert    7.8582
> 
> Without load avg in periodic balancing, each cpu is weighted with the
> load of all its tasks.
> 
> With the new load tracking, we only update the cfs_rq load avg with each
> task at enqueue/dequeue time, and only update the current task in
> scheduler_tick. I am wondering whether the sampling is a bit too sparse.
> 
> What's your opinion of this, Paul?
> 

Ingo & Paul:

I just looked into the aim9 benchmark. In this case it forks 2000 tasks;
after all tasks are ready, aim9 gives a signal, then all tasks wake up in a
burst and run until they are all finished.
Since each task finishes very quickly, an imbalanced, empty cpu may go to
sleep until a regular balancing gives it some new tasks. That causes the
performance drop and the additional idle entries.

By design, the load avg needs time to accumulate its load weight. So it is
hard to find a way to resolve this problem.
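
To make the ramp-up concrete, here is a rough userspace sketch. It assumes
the geometric decay of per-entity load tracking with a per-period factor y
such that y^32 = 0.5, and it uses floating point rather than the kernel's
fixed-point tables. The names DECAY_Y and ramp_up_fraction() are made up
for this illustration only.

```c
/* 2^(-1/32): assumed per-period decay factor, chosen so that y^32 == 0.5. */
#define DECAY_Y 0.97857206

/*
 * Fraction of the steady-state runnable sum reached by a task that has
 * been continuously runnable for `periods` periods, starting with no
 * history at all (as a freshly woken burst task would).
 */
static double ramp_up_fraction(int periods)
{
	const double max_sum = 1.0 / (1.0 - DECAY_Y);
	double sum = 0.0;

	/* Each period decays the old history and adds one full period. */
	for (int i = 0; i < periods; i++)
		sum = sum * DECAY_Y + 1.0;

	return sum / max_sum;
}
```

Under these assumptions a task reaches only about half of its eventual
contribution after 32 periods, so tasks that finish within a few periods
barely register in the average before they exit.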

As for other scheduler related benchmarks, like kbuild, specjbb2005,
hackbench, sysbench etc., I didn't find a clear improvement or regression
with the load avg balancing.

Any comments on this problem?

-- 
Thanks Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote:

> OK so I observed no change with this patch, either on the loopback
> data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
> experimentation anyway.
> 

Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim
only.

> Concerning the loopback MTU, I find it strange that the MTU changes
> the splice() behaviour and not send/recv. I thought that there could
> be a relation between the MTU and the pipe size, but it does not
> appear to be the case either, as I tried various sizes between 16kB
> and 256kB without achieving original performance.


It is probably related to a receive window that is too small. Given that the
MTU was multiplied by 4, I guess we need to make some adjustments.

You could also try:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..b68cdfb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
used = recv_actor(desc, skb, offset, len);
+   /* Clean up data we have read: This will do ACK frames. */
+   if (used > 0)
+   tcp_cleanup_rbuf(sk, used);
if (used < 0) {
if (!copied)
copied = used;




[PATCH V3 0/2] handle polling errors

2013-01-05 Thread Jason Wang
This is an updated version of the previous series to fix the handling of
polling errors in vhost/vhost_net.

Currently, vhost and vhost_net ignore polling errors, which can lead to a
kernel crash when vhost tries to remove itself from the waitqueue after a
polling failure. Fix this by checking poll->wqh before the removal and by
reporting an error when a polling error is met.

Changes from v2:
- check poll->wqh instead of the wrong assumption about POLLERR and waitqueue
- drop the whole tx polling state check since it was replaced by the wqh
  checking
- drop the buggy tuntap patch

Changes from v1:
- restore the state before the ioctl when vhost_init_used() fails
- log the error when polling errors are met in the data path
- don't put into waitqueue when tun_chr_poll() returns POLLERR

Jason Wang (2):
  vhost_net: correct error handling in vhost_net_set_backend()
  vhost: handle polling errors

 drivers/vhost/net.c   |   88 +++-
 drivers/vhost/vhost.c |   31 +
 drivers/vhost/vhost.h |2 +-
 3 files changed, 59 insertions(+), 62 deletions(-)



[PATCH V3 2/2] vhost: handle polling errors

2013-01-05 Thread Jason Wang
Polling errors were ignored by vhost/vhost_net. This may lead to a crash when
trying to remove vhost from the waitqueue after the polling has failed. Solve
this problem by:

- checking poll->wqh before trying to remove from the waitqueue
- reporting an error when poll() returns POLLERR in vhost_start_poll()
- reporting an error when vhost_start_poll() fails in
  vhost_vring_ioctl()/vhost_net_set_backend(), which is used to notify
  userspace of the failure
- reporting an error in the data path in vhost_net when polling errors are met

After those changes, we can safely drop the tx polling state in vhost_net since
it is replaced by the check of poll->wqh.
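
The idea of using the waitqueue pointer itself as the state can be sketched
in plain userspace C. This is only a model under stated assumptions, not the
vhost code: struct model_poll, struct model_waitqueue, MODEL_POLLERR and both
functions are hypothetical stand-ins, and the `queued` argument models
whether the backend's poll callback added the waiter before it returned.

```c
#include <stddef.h>

#define MODEL_POLLERR 0x8u	/* hypothetical stand-in for POLLERR */

struct model_waitqueue {
	int nr_waiters;
};

struct model_poll {
	struct model_waitqueue *wqh;	/* NULL while not queued anywhere */
};

/* Safe to call whether or not the poll was ever queued. */
static void model_poll_stop(struct model_poll *poll)
{
	if (poll->wqh) {
		poll->wqh->nr_waiters--;
		poll->wqh = NULL;
	}
}

/*
 * `mask` is what the backend's poll callback returned; `queued` says
 * whether that callback put us on `wq` before returning.
 */
static int model_poll_start(struct model_poll *poll,
			    struct model_waitqueue *wq,
			    unsigned int mask, int queued)
{
	if (poll->wqh)
		return 0;	/* already polling */

	if (queued) {
		wq->nr_waiters++;
		poll->wqh = wq;
	}

	if (mask & MODEL_POLLERR) {
		/* Undo the queueing only if it really happened. */
		model_poll_stop(poll);
		return -1;
	}
	return 0;
}
```

Because the stop path looks only at the pointer, it makes no assumption
about whether an error return implies the waiter was queued, and no separate
started/stopped state needs to be kept in sync.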

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   |   74 
 drivers/vhost/vhost.c |   31 +++-
 drivers/vhost/vhost.h |2 +-
 3 files changed, 49 insertions(+), 58 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index d10ad6f..125c1e5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -64,20 +64,10 @@ enum {
VHOST_NET_VQ_MAX = 2,
 };
 
-enum vhost_net_poll_state {
-   VHOST_NET_POLL_DISABLED = 0,
-   VHOST_NET_POLL_STARTED = 1,
-   VHOST_NET_POLL_STOPPED = 2,
-};
-
 struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
-   /* Tells us whether we are polling a socket for TX.
-* We only do this when socket buffer fills up.
-* Protected by tx vq lock. */
-   enum vhost_net_poll_state tx_poll_state;
/* Number of TX recently submitted.
 * Protected by tx vq lock. */
unsigned tx_packets;
@@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
}
 }
 
-/* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
-{
-   if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
-   return;
-   vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-   net->tx_poll_state = VHOST_NET_POLL_STOPPED;
-}
-
-/* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
-{
-   if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
-   return;
-   vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-   net->tx_poll_state = VHOST_NET_POLL_STARTED;
-}
-
 /* In case of DMA done not in order in lower device driver for some reason.
  * upend_idx is used to track end of used idx, done_idx is used to track head
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
@@ -227,6 +199,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 static void handle_tx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+   struct vhost_poll *poll = net->poll + VHOST_NET_VQ_TX;
unsigned out, in, s;
int head;
struct msghdr msg = {
@@ -252,7 +225,8 @@ static void handle_tx(struct vhost_net *net)
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
if (wmem >= sock->sk->sk_sndbuf) {
mutex_lock(&vq->mutex);
-   tx_poll_start(net, sock);
+   if (vhost_poll_start(poll, sock->file))
+   vq_err(vq, "Fail to start TX polling\n");
mutex_unlock(&vq->mutex);
return;
}
@@ -261,7 +235,7 @@ static void handle_tx(struct vhost_net *net)
vhost_disable_notify(&net->dev, vq);
 
if (wmem < sock->sk->sk_sndbuf / 2)
-   tx_poll_stop(net);
+   vhost_poll_stop(poll);
hdr_size = vq->vhost_hlen;
zcopy = vq->ubufs;
 
@@ -283,8 +257,10 @@ static void handle_tx(struct vhost_net *net)
 
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-   tx_poll_start(net, sock);
-   set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+   if (vhost_poll_start(poll, sock->file))
+   vq_err(vq, "Fail to start TX polling\n");
+   else
+   set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
/* If more outstanding DMAs, queue the work.
@@ -294,8 +270,10 @@ static void handle_tx(struct vhost_net *net)
(vq->upend_idx - vq->done_idx) :
(vq->upend_idx + UIO_MAXIOV - vq->done_idx);
if (unlikely(num_pends > VHOST_MAX_PEND)) {
-   tx_poll_start(net, sock);
-   set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+   if (vhost_poll_start(poll, sock->file))
+   

[PATCH V3 1/2] vhost_net: correct error handling in vhost_net_set_backend()

2013-01-05 Thread Jason Wang
Currently, when vhost_init_used() fails, the sock refcnt and ubufs are
leaked. Correct this by calling vhost_init_used() before assigning ubufs and
restoring the oldsock when it fails.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c |   16 +++-
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ebd08b2..d10ad6f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -827,15 +827,16 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
r = PTR_ERR(ubufs);
goto err_ubufs;
}
-   oldubufs = vq->ubufs;
-   vq->ubufs = ubufs;
+
vhost_net_disable_vq(n, vq);
rcu_assign_pointer(vq->private_data, sock);
-   vhost_net_enable_vq(n, vq);
-
r = vhost_init_used(vq);
if (r)
-   goto err_vq;
+   goto err_used;
+   vhost_net_enable_vq(n, vq);
+
+   oldubufs = vq->ubufs;
+   vq->ubufs = ubufs;
 
n->tx_packets = 0;
n->tx_zcopy_err = 0;
@@ -859,6 +860,11 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
mutex_unlock(&n->dev.mutex);
return 0;
 
+err_used:
+   rcu_assign_pointer(vq->private_data, oldsock);
+   vhost_net_enable_vq(n, vq);
+   if (ubufs)
+   vhost_ubuf_put_and_wait(ubufs);
 err_ubufs:
fput(sock->file);
 err_vq:
-- 
1.7.1



[PATCH] mm: compaction: fix echo 1 > compact_memory return error issue

2013-01-05 Thread Jason Liu
When running the following command in a shell, it returns an error:
sh/$ echo 1 > /proc/sys/vm/compact_memory
sh/$ sh: write error: Bad address

After strace, I found the following log:
...
write(1, "1\n", 2)   = 3
write(1, "", 4294967295) = -1 EFAULT (Bad address)
write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
) = 31

This shows that the system returns 3 (COMPACT_COMPLETE) after writing data to
compact_memory.

The fix is to make the system return 0 instead of 3 (COMPACT_COMPLETE) from
sysctl_compaction_handler after compact_nodes has finished.
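
The second write() in the strace log, with a length of 4294967295, is
consistent with a caller that subtracts the reported byte count from the
requested length using 32-bit unsigned arithmetic. The sketch below only
illustrates that arithmetic; remaining_after_write() is a hypothetical
helper and not taken from any shell's source.

```c
#include <stdint.h>

/*
 * Hypothetical model of a write loop on a system with a 32-bit size type:
 * the caller subtracts the reported byte count from the requested length.
 */
static uint32_t remaining_after_write(uint32_t requested, uint32_t reported)
{
	/* Unsigned arithmetic wraps modulo 2^32 when reported > requested. */
	return requested - reported;
}
```

With requested = 2 and reported = 3 the result wraps to 4294967295, which
matches the length passed to the failing write() above.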

Suggested-by: David Rientjes 
Cc: Mel Gorman 
Cc: Andrew Morton 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Jason Liu 
---
 mm/compaction.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 6b807e4..f8f5c11 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1210,7 +1210,7 @@ static int compact_node(int nid)
 }
 
 /* Compact all nodes in the system */
-static int compact_nodes(void)
+static void compact_nodes(void)
 {
int nid;
 
@@ -1219,8 +1219,6 @@ static int compact_nodes(void)
 
for_each_online_node(nid)
compact_node(nid);
-
-   return COMPACT_COMPLETE;
 }
 
 /* The written value is actually unused, all memory is compacted */
@@ -1231,7 +1229,7 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
if (write)
-   return compact_nodes();
+   compact_nodes();
 
return 0;
 }
-- 
1.7.5.4




Re: Fix test relying in wrong behavior of is_printable

2013-01-05 Thread David Gibson
On Fri, Jan 04, 2013 at 09:16:08PM +0200, Pantelis Antoniou wrote:
> After fixing the is_printable bug the test suite fails.
> Fix it with this patch
> 
> Signed-off-by: Pantelis Antoniou 

Rather than just removing the test, it would be better to still run it
using an explicit -t bi to force the byte output.

> ---
>  tests/run_tests.sh | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/run_tests.sh b/tests/run_tests.sh
> index dd7f217..43279c9 100755
> --- a/tests/run_tests.sh
> +++ b/tests/run_tests.sh
> @@ -498,9 +498,8 @@ fdtget_tests () {
>  
>  # run_fdtget_test  []   
>  run_fdtget_test "MyBoardName" $dtb / model
> -run_fdtget_test "77 121 66 111 \
> -97 114 100 78 97 109 101 0 77 121 66 111 97 114 100 70 97 109 105 \
> -108 121 78 97 109 101 0" $dtb / compatible
> +# run_fdtget_test "77 121 66 111 97 114 100 78 97 109 101 0 77 121 66 111 97 114 100 70 97 109 105 108 121 78 97 109 101 0" $dtb / compatible
> +run_fdtget_test "MyBoardName MyBoardFamilyName" $dtb / compatible
>  run_fdtget_test "MyBoardName MyBoardFamilyName" -t s $dtb / compatible
>  run_fdtget_test 32768 $dtb /cpus/PowerPC,970@1 d-cache-size
>  run_fdtget_test 8000 -tx $dtb /cpus/PowerPC,970@1 d-cache-size

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH v4 07/18] perf: add generic memory sampling interface

2013-01-05 Thread Andi Kleen
> Why dont use enums for this?

enums can have unpredictable signed/unsignedness issues. #defines for
hardware constants are usually far safer.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


[PATCH RFC] exec: avoid possible undefined behavior in count()

2013-01-05 Thread Xi Wang
The tricky problem is this check:

if (i++ >= max)

icc (mis)optimizes this check as:

if (++i > max)

The check now becomes a no-op since max is MAX_ARG_STRINGS (0x7FFFFFFF).

This is "allowed" by the C standard, assuming i++ never overflows,
because signed integer overflow is undefined behavior.  This optimization
effectively reverts the previous commit 362e6663ef ("exec.c, compat.c:
fix count(), compat_count() bounds checking") that tries to fix the check.

This patch simply moves ++ after the check.

Signed-off-by: Xi Wang 
---
Not sure how many people are using Intel's icc to compile the kernel.
Some projects like LinuxDNA did.

The kernel uses gcc's -fno-strict-overflow to disable this optimization.
icc probably doesn't recognize the option.

To illustrate the problem, try this simple program:

int count(int i, int max)
{
	if (i++ >= max) {
		__builtin_trap();
		return -1;
	}
	return i;
}

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int x = atoi(argv[1]);
	int max = atoi(argv[2]);
	printf("%d %d %d\n", x, max, count(x, max));
}

$ gcc -O2 t.c
$ ./a.out 2147483647 2147483647
Illegal instruction (core dumped)

$ icc -O2 t.c
$ ./a.out 2147483647 2147483647
2147483647 2147483647 -2147483648

There's no difference whether we add -fno-strict-overflow or not.
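
For comparison, here is a userspace sketch of the same test function with
the increment moved after the check, which is the shape the patch gives to
the kernel code. The name count_fixed() and the -1 return value (standing in
for -E2BIG) are only for this illustration.

```c
#include <limits.h>

/* Userspace sketch; -1 stands in for the kernel's -E2BIG. */
static int count_fixed(int i, int max)
{
	if (i >= max)
		return -1;

	/* Only reached when i < max <= INT_MAX, so this cannot overflow. */
	++i;
	return i;
}
```

Since the comparison no longer depends on a possibly overflowing increment,
the compiler has no undefined behavior to exploit when it transforms the
check.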
---
 fs/exec.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18c45ca..20df02c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -434,8 +434,9 @@ static int count(struct user_arg_ptr argv, int max)
if (IS_ERR(p))
return -EFAULT;
 
-   if (i++ >= max)
+   if (i >= max)
return -E2BIG;
+   ++i;
 
if (fatal_signal_pending(current))
return -ERESTARTNOHAND;
-- 
1.7.10.4



Re: [PATCH] drivers/power/88pm860x_battery.c: use devm_request_threaded_irq

2013-01-05 Thread Anton Vorontsov
On Sat, Dec 08, 2012 at 06:16:35PM +0100, Julia Lawall wrote:
> From: Julia Lawall 
> 
> devm_request_threaded_irq requests an irq that is freed when a driver
> detaches.  This patch uses devm_request_threaded_irq for irqs that are
> requested in the probe function of a platform device and are only freed in
> the remove function.
> 
> Additionally, the original code used devm_kzalloc, but kfree.  This would
> lead to a double free.  The problem was found using the following semantic
> match (http://coccinelle.lip6.fr/):
> 
> // <smpl>
> @@
> expression x,e;
> @@
> x = devm_kzalloc(...)
> ... when != x = e
> ?-kfree(x,...);
> // </smpl>
> 
> The error handling code in the probe function is also simplified in the
> cases where there is now nothing to do other than return.
> 
> Signed-off-by: Julia Lawall 
> 
> ---
[]
> @@ -994,9 +989,6 @@ static int pm860x_battery_remove(struct platform_device *pdev)
>   struct pm860x_battery_info *info = platform_get_drvdata(pdev);
>  
>   power_supply_unregister(&info->battery);
> - free_irq(info->irq_batt, info);
> - free_irq(info->irq_cc, info);
> - kfree(info);

It is not safe to access the battery ('struct power_supply') object after
_unregister() (and the irq handlers surely will). Instead of removing
free_irq(), the right fix would be to place the two calls before
_unregister().

Thanks,
Anton

>   platform_set_drvdata(pdev, NULL);
>   return 0;
>  }
> 
> 


Re: [PATCH V2 3/3] tuntap: don't add to waitqueue when POLLERR

2013-01-05 Thread Jason Wang
On 01/06/2013 03:37 AM, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:34 +0800, Jason Wang wrote:
>> Currently, tun_chr_poll() returns POLLERR after waitqueue adding during
>> device unregistration. This would confuse some of its users, such as vhost,
>> which assume that when POLLERR is returned, it wasn't added to the
>> waitqueue. Fix this by returning POLLERR before adding to waitqueue.
>>
>> Signed-off-by: Jason Wang 
>> ---
>>  drivers/net/tun.c |5 +
>>  1 files changed, 1 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index fbd106e..f9c0049 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -886,7 +886,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>>  struct sock *sk;
>>  unsigned int mask = 0;
>>  
>> -if (!tun)
>> +if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
>>  return POLLERR;
>>  
>>  sk = tfile->socket.sk;
>> @@ -903,9 +903,6 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>>   sock_writeable(sk)))
>>  mask |= POLLOUT | POLLWRNORM;
>>  
>> -if (tun->dev->reg_state != NETREG_REGISTERED)
>> -mask = POLLERR;
>> -
>>  tun_put(tun);
>>  return mask;
>>  }
> This patch is buggy.
>
> First, the caller assuming POLLERR means poll_wait() was not called is
> wrong.

True, it looks like vhost needs to check poll->wqh before trying to
remove itself from the waitqueue instead of relying on this wrong
assumption. And then we can drop the whole tx polling state.
>
> Secondly, you add a ref leak.

Yes, will drop this patch.

Thanks.



[PATCH] regulator: da9055: Remove unused v_shift field from struct da9055_volt_reg

2013-01-05 Thread Axel Lin
Signed-off-by: Axel Lin 
---
 drivers/regulator/da9055-regulator.c |3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/regulator/da9055-regulator.c b/drivers/regulator/da9055-regulator.c
index 1a05ac6..3022109 100644
--- a/drivers/regulator/da9055-regulator.c
+++ b/drivers/regulator/da9055-regulator.c
@@ -58,7 +58,6 @@ struct da9055_volt_reg {
int reg_b;
int sl_shift;
int v_mask;
-   int v_shift;
 };
 
 struct da9055_mode_reg {
@@ -388,7 +387,6 @@ static struct regulator_ops da9055_ldo_ops = {
.reg_b = DA9055_REG_VBCORE_B + DA9055_ID_##_id, \
.sl_shift = 7,\
.v_mask = (1 << (vbits)) - 1,\
-   .v_shift = (vbits),\
},\
 }
 
@@ -417,7 +415,6 @@ static struct regulator_ops da9055_ldo_ops = {
.reg_b = DA9055_REG_VBCORE_B + DA9055_ID_##_id, \
.sl_shift = 7,\
.v_mask = (1 << (vbits)) - 1,\
-   .v_shift = (vbits),\
},\
.mode = {\
.reg = DA9055_REG_BCORE_MODE,\
-- 
1.7.9.5





Re: [PATCH 0/6] Introducing Device Tree Overlays

2013-01-05 Thread Rob Landley

On 01/05/2013 03:35:58 AM, Richard Cochran wrote:

> On Sat, Jan 05, 2013 at 12:16:51AM -0600, Joel A Fernandes wrote:
> >
> > The problem being addressed is discussed in this thread:
> > http://permalink.gmane.org/gmane.linux.kernel/1389017
>
> Thanks for the link.
>
> Since the motivation is already documented in that post, why not add
> it into Documentation/devicetree/overlay-notes.txt as well?


Seconded.

Rob


Re: oops in copy_page_rep()

2013-01-05 Thread Linus Torvalds
Adding more people in case somebody else has any idea. Anybody?

On Sat, Jan 5, 2013 at 7:22 AM, Dave Jones  wrote:
> I have no idea what happened here, but this is the first time I've seen this one.
> This was running a tree pulled yesterday afternoon.
>
> BUG: unable to handle kernel paging request at 880100201000

This is %rsi, which is the source for the page copy:

  copy_user_highpage()->
copy_user_page()->
  copy_page()->
copy_page_rep

I don't know exactly which copy_user_highpage() case this is from, the
call trace implies this *could* be a hugepage, and those functions do
copy pages individually in a loop too.

> IP: [] copy_page_rep+0x5/0x10
> PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 800100201160

Hmm. That PTE looks really odd. If I read the PUD/PMD contents right,
the page tables are for individual pages, but then the PTE doesn't
have the present bit set: other than that it looks like it could be a
valid PTE (NX and global bit set, Accessed and dirty also set, but the
two low bits are clear: present and writable are clear).
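
The low flag bits can be checked mechanically. The sketch below decodes only
the low bits (0x160) of the PTE value quoted above, using the standard x86-64
page table bit positions; the NX bit is left out because it lives in bit 63
and only the low part of the value is relied on here. The struct and function
names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86-64 page table entry. */
#define PTE_PRESENT	(1u << 0)
#define PTE_WRITABLE	(1u << 1)
#define PTE_ACCESSED	(1u << 5)
#define PTE_DIRTY	(1u << 6)
#define PTE_GLOBAL	(1u << 8)

struct pte_flags {
	bool present;
	bool writable;
	bool accessed;
	bool dirty;
	bool global;
};

/* Decode the software-visible low flag bits of a raw PTE value. */
static struct pte_flags decode_pte_low_bits(uint64_t pte)
{
	struct pte_flags f = {
		.present  = (pte & PTE_PRESENT) != 0,
		.writable = (pte & PTE_WRITABLE) != 0,
		.accessed = (pte & PTE_ACCESSED) != 0,
		.dirty    = (pte & PTE_DIRTY) != 0,
		.global   = (pte & PTE_GLOBAL) != 0,
	};
	return f;
}
```

Feeding 0x160 through this gives accessed, dirty and global set, with
present and writable clear, which is the combination described above.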

 I think it's due to DEBUG_PAGEALLOC, so the (free) page has been
unmapped from the kernel mapping.

But how could a page that is the source of a page fault be free?

> Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte 
> Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[]  [] copy_page_rep+0x5/0x10
> RAX: 000100201000 RBX: 00011d215000 RCX: 0200

The RCX value is 0x200, so this is the first access to that page. As expected.

> RDX: cccd RSI: 880100201000 RDI: 88011d215000

RSI (source) and RDI (destination) both look like valid kernel mapping
pages. But RSI isn't mapped, presumably because debug-pagealloc thinks
it is free.

Anybody with any ideas? The call trace indicates a normal page fault
from user space, so..

   Linus

> Call Trace:
>  [] ? do_huge_pmd_wp_page+0x707/0xc00
>  [] handle_mm_fault+0x14c/0x590
>  [] ? __lock_is_held+0x5e/0x90
>  [] __do_page_fault+0x15c/0x4e0
>  [] ? native_sched_clock+0x26/0x90
>  [] ? trace_hardirqs_off_caller+0x28/0xc0
>  [] ? trace_hardirqs_off_thunk+0x3a/0x3c
>  [] do_page_fault+0xe/0x10
>  [] page_fault+0x22/0x30
> Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 
> 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00  48 a5 c3 0f 1f 
> 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90
> RIP  [] copy_page_rep+0x5/0x10


Re: [PATCH 1/2] signals: sys_ssetmask() uses uninitialized newmask

2013-01-05 Thread CAI Qian


- Original Message -
> From: "Oleg Nesterov" 
> To: "CAI Qian" , "Andrew Morton" 
> , "Linus Torvalds"
> 
> Cc: "Linda Wang" , "Matt Zywusko" , 
> "Al Viro" ,
> linux-kernel@vger.kernel.org
> Sent: Sunday, January 6, 2013 2:13:13 AM
> Subject: [PATCH 1/2] signals: sys_ssetmask() uses uninitialized newmask
> 
> 77097ae5 "most of set_current_blocked() callers want SIGKILL/SIGSTOP
> removed from set" removed the initialization of newmask by accident,
> restore.
> 
> Reported-by: CAI Qian 
> Signed-off-by: Oleg Nesterov 
> Cc: sta...@kernel.org # v3.5+
Thanks Oleg. This is now passing the testing.

Tested-by: CAI Qian 
> ---
>  kernel/signal.c |1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 7aaa51d..9692499 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3286,6 +3286,7 @@ SYSCALL_DEFINE1(ssetmask, int, newmask)
>   int old = current->blocked.sig[0];
>   sigset_t newset;
>  
> + siginitset(&newset, newmask);
>   set_current_blocked(&newset);
>  
>   return old;
> --
> 1.5.5.1
> 
> 
> 


Re: [PATCH 1/1] power/ab8500_charger: Use devm_regulator_get API

2013-01-05 Thread Anton Vorontsov
On Fri, Dec 07, 2012 at 05:23:28PM +0530, Sachin Kamat wrote:
> devm_regulator_get() is device managed and makes error handling
> and code cleanup simpler.
> 
> Cc: Arun R Murthy 
> Signed-off-by: Sachin Kamat 
> ---
> Compile tested using linux-next.
> ---

Applied, thanks!

>  drivers/power/ab8500_charger.c |   11 +++
>  1 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/power/ab8500_charger.c b/drivers/power/ab8500_charger.c
> index 3be9c0e..5062023 100644
> --- a/drivers/power/ab8500_charger.c
> +++ b/drivers/power/ab8500_charger.c
> @@ -2509,9 +2509,6 @@ static int ab8500_charger_remove(struct platform_device *pdev)
>   free_irq(irq, di);
>   }
>  
> - /* disable the regulator */
> - regulator_put(di->regu);
> -
>   /* Backup battery voltage and current disable */
>   ret = abx500_mask_and_set_register_interruptible(di->dev,
>   AB8500_RTC, AB8500_RTC_CTRL_REG, RTC_BUP_CH_ENA, 0);
> @@ -2665,7 +2662,7 @@ static int ab8500_charger_probe(struct platform_device *pdev)
>* is a charger connected to avoid erroneous BTEMP_HIGH/LOW
>* interrupts during charging
>*/
> - di->regu = regulator_get(di->dev, "vddadc");
> + di->regu = devm_regulator_get(di->dev, "vddadc");
>   if (IS_ERR(di->regu)) {
>   ret = PTR_ERR(di->regu);
>   dev_err(di->dev, "failed to get vddadc regulator\n");
> @@ -2677,14 +2674,14 @@ static int ab8500_charger_probe(struct platform_device *pdev)
>   ret = ab8500_charger_init_hw_registers(di);
>   if (ret) {
>   dev_err(di->dev, "failed to initialize ABB registers\n");
> - goto free_regulator;
> + goto free_charger_wq;
>   }
>  
>   /* Register AC charger class */
>   ret = power_supply_register(di->dev, &di->ac_chg.psy);
>   if (ret) {
>   dev_err(di->dev, "failed to register AC charger\n");
> - goto free_regulator;
> + goto free_charger_wq;
>   }
>  
>   /* Register USB charger class */
> @@ -2758,8 +2755,6 @@ free_usb:
>   power_supply_unregister(&di->usb_chg.psy);
>  free_ac:
>   power_supply_unregister(&di->ac_chg.psy);
> -free_regulator:
> - regulator_put(di->regu);
>  free_charger_wq:
>   destroy_workqueue(di->charger_wq);
>   return ret;
> -- 
> 1.7.4.1
> 


Re: [PATCH] bq27x00_battery: fix bugs introduced with BQ27425 support

2013-01-05 Thread Anton Vorontsov
On Sun, Dec 02, 2012 at 08:34:21PM +1100, NeilBrown wrote:
> commit a66f59ba2e994bf70274ef0513e24e0e7ae20c63
> bq27x00_battery: Add support for BQ27425 chip
> 
> introduced 2 bugs.
> 
> 1/ 'chip' was set to BQ27425 unconditionally - breaking support for
>    other devices.
> 2/ BQ27425 does not support cycle count, however the code still tries to
>    get the cycle count for BQ27425, and now does it twice for other chips.
> 
> Cc: Saranya Gopal 
> Signed-off-by: NeilBrown 

Applied, thanks!

> diff --git a/drivers/power/bq27x00_battery.c b/drivers/power/bq27x00_battery.c
> index e2659f1..51d4017 100644
> --- a/drivers/power/bq27x00_battery.c
> +++ b/drivers/power/bq27x00_battery.c
> @@ -445,7 +445,6 @@ static void bq27x00_update(struct bq27x00_device_info *di)
>   cache.temperature = bq27x00_battery_read_temperature(di);
>   if (!is_bq27425)
>   cache.cycle_count = bq27x00_battery_read_cyct(di);
> - cache.cycle_count = bq27x00_battery_read_cyct(di);
>   cache.power_avg =
>   bq27x00_battery_read_pwr_avg(di, BQ27x00_POWER_AVG);
>  
> @@ -697,7 +696,6 @@ static int bq27x00_powersupply_init(struct bq27x00_device_info *di)
>   int ret;
>  
>   di->bat.type = POWER_SUPPLY_TYPE_BATTERY;
> - di->chip = BQ27425;
>   if (di->chip == BQ27425) {
>   di->bat.properties = bq27425_battery_props;
>   di->bat.num_properties = ARRAY_SIZE(bq27425_battery_props);


Re: [PATCH] power_supply: add watchdog and safety timer expiries under PROP_HEALTH_*

2013-01-05 Thread Anton Vorontsov
On Fri, Nov 30, 2012 at 01:57:46PM +0530, Ramakrishna Pallala wrote:
> Most charger chips come with two kinds of safety features
> related to timing.
> 1. Watchdog Timer (in terms of seconds/mins)
> 2. Safety Timer (in terms of hours)
> 
> This patch adds these to the fault causes in the POWER_SUPPLY_PROP_HEALTH_*
> enums so that whenever there is either a watchdog timeout or a safety timer
> timeout, the driver can notify user space accurately about the fault,
> which will also be helpful for debugging.
> 
> Signed-off-by: Ramakrishna Pallala 

Applied, thanks a lot!

> ---
>  drivers/power/power_supply_sysfs.c |3 ++-
>  include/linux/power_supply.h   |2 ++
>  2 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/power/power_supply_sysfs.c 
> b/drivers/power/power_supply_sysfs.c
> index 40fa3b7..29178f7 100644
> --- a/drivers/power/power_supply_sysfs.c
> +++ b/drivers/power/power_supply_sysfs.c
> @@ -55,7 +55,8 @@ static ssize_t power_supply_show_property(struct device 
> *dev,
>   };
>   static char *health_text[] = {
>   "Unknown", "Good", "Overheat", "Dead", "Over voltage",
> - "Unspecified failure", "Cold",
> + "Unspecified failure", "Cold", "Watchdog timer expire",
> + "Safety timer expire"
>   };
>   static char *technology_text[] = {
>   "Unknown", "NiMH", "Li-ion", "Li-poly", "LiFe", "NiCd",
> diff --git a/include/linux/power_supply.h b/include/linux/power_supply.h
> index 35cdf2c..4e672e1 100644
> --- a/include/linux/power_supply.h
> +++ b/include/linux/power_supply.h
> @@ -54,6 +54,8 @@ enum {
>   POWER_SUPPLY_HEALTH_OVERVOLTAGE,
>   POWER_SUPPLY_HEALTH_UNSPEC_FAILURE,
>   POWER_SUPPLY_HEALTH_COLD,
> + POWER_SUPPLY_HEALTH_WATCHDOG_TIMER_EXPIRE,
> + POWER_SUPPLY_HEALTH_SAFETY_TIMER_EXPIRE,
>  };
>  
>  enum {
> -- 
> 1.7.0.4
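
The patch above extends the health enum and the health_text[] string table in
two different files, and the two only stay correct as long as their order and
length match. A minimal userspace sketch of one way to guard that pairing at
compile time follows. The shortened enum names, the HEALTH_COUNT sentinel and
the health_to_text() helper are assumptions made for this illustration and are
not part of the patch or of the power supply class.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the health enum from the patch.
 * HEALTH_COUNT is an addition for this sketch only. */
enum health {
	HEALTH_UNKNOWN = 0,
	HEALTH_GOOD,
	HEALTH_OVERHEAT,
	HEALTH_DEAD,
	HEALTH_OVERVOLTAGE,
	HEALTH_UNSPEC_FAILURE,
	HEALTH_COLD,
	HEALTH_WATCHDOG_TIMER_EXPIRE,
	HEALTH_SAFETY_TIMER_EXPIRE,
	HEALTH_COUNT	/* sentinel: number of real entries */
};

/* Same strings, same order, as health_text[] in the quoted hunk. */
static const char *const health_text[] = {
	"Unknown", "Good", "Overheat", "Dead", "Over voltage",
	"Unspecified failure", "Cold", "Watchdog timer expire",
	"Safety timer expire",
};

/* Compilation fails if an enum value is added without a matching string. */
_Static_assert(sizeof(health_text) / sizeof(health_text[0]) == HEALTH_COUNT,
	       "health_text[] is out of sync with enum health");

/* Map a health value to its text, falling back to "Unknown" when the
 * value is out of range. */
const char *health_to_text(int health)
{
	if (health < 0 || health >= HEALTH_COUNT)
		return health_text[HEALTH_UNKNOWN];
	return health_text[health];
}
```

With such a guard, forgetting one half of a change like this one becomes a
build error instead of a wrong or out-of-bounds string in sysfs.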


Re: [PATCH V3 8/8] memcg: Document cgroup dirty/writeback memory statistics

2013-01-05 Thread Sha Zhengju
On Fri, Dec 28, 2012 at 9:10 AM, Kamezawa Hiroyuki
 wrote:
> (2012/12/26 2:28), Sha Zhengju wrote:
>> From: Sha Zhengju 
>>
>> Signed-off-by: Sha Zhengju 
>
> I don't think your words are bad but it may be better to sync with meminfo's 
> text.
>
>> ---
>>   Documentation/cgroups/memory.txt |2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/Documentation/cgroups/memory.txt 
>> b/Documentation/cgroups/memory.txt
>> index addb1f1..2828164 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -487,6 +487,8 @@ pgpgin- # of charging events to the memory 
>> cgroup. The charging
>>   pgpgout - # of uncharging events to the memory cgroup. The 
>> uncharging
>>   event happens each time a page is unaccounted from the cgroup.
>>   swap- # of bytes of swap usage
>> +dirty  - # of bytes of file cache that are not in sync with the 
>> disk copy.
>> +writeback  - # of bytes of file/anon cache that are queued for syncing 
>> to disk.
>>   inactive_anon   - # of bytes of anonymous memory and swap cache memory 
>> on
>>   LRU list.
>>   active_anon - # of bytes of anonymous and swap cache memory on active
>>
>
> Documentation/filesystems/proc.txt
>
>Dirty: Memory which is waiting to get written back to the disk
>Writeback: Memory which is actively being written back to the disk
>
> even if others are not ;(
>


The wording was actually revised by Fengguang earlier:
https://lkml.org/lkml/2012/7/7/49
It might be more accurate than the previous one, and I just followed his advice...


Thanks,
Sha


Re: [PATCH 1/2] power_supply: Add charge control struct in power supply class

2013-01-05 Thread Anton Vorontsov
On Tue, Nov 27, 2012 at 01:17:02PM +0530, Ramakrishna Pallala wrote:
[...]
> +++ b/drivers/power/power_supply_core.c
> @@ -158,6 +158,24 @@ struct power_supply *power_supply_get_by_name(char *name)
>  }
>  EXPORT_SYMBOL_GPL(power_supply_get_by_name);
>  
> +#ifdef CONFIG_PSY_CM_LOW_LEVEL_SUPPORT
> +struct power_supply_charger_control
> + *power_supply_get_chrg_cntl_by_name(const char *name)
> +{
> + struct device *dev = class_find_device(power_supply_class, NULL,
> + (char *)name, power_supply_match_device_by_name);
> +
> + return dev ? ((struct power_supply *)dev_get_drvdata(dev))->chrg_cntl : 
> NULL;
> +}
> +#else
> +struct power_supply_charger_control
> + *power_supply_get_chrg_cntl_by_name(const char *name)
> +{
> + return NULL;
> +}
> +#endif
> +EXPORT_SYMBOL_GPL(power_supply_get_chrg_cntl_by_name);

>  int power_supply_powers(struct power_supply *psy, struct device *dev)
>  {
>   return sysfs_create_link(&psy->dev->kobj, &dev->kobj, "powers");
> diff --git a/include/linux/power_supply.h b/include/linux/power_supply.h
> index 1f0ab90..35cdf2c 100644
> --- a/include/linux/power_supply.h
> +++ b/include/linux/power_supply.h
> @@ -191,6 +191,10 @@ struct power_supply {
>   struct thermal_cooling_device *tcd;
>  #endif
>  
> +#ifdef CONFIG_PSY_CM_LOW_LEVEL_SUPPORT
> + struct power_supply_charger_control *chrg_cntl;
> +#endif
> +
>  #ifdef CONFIG_LEDS_TRIGGERS
>   struct led_trigger *charging_full_trig;
>   char *charging_full_trig_name;
> @@ -224,7 +228,29 @@ struct power_supply_info {
>   int use_for_apm;
>  };
>  
> +struct power_supply_charger_control {
> + const char *name;
> + /* get charging status */
> + int (*is_charging_enabled)(void);
> + int (*is_charger_enabled)(void);
> +
> + /* set charging parameters */
> + int (*set_in_current_limit)(int uA);
> + int (*set_charge_current)(int uA);
> + int (*set_charge_voltage)(int uV);
> +
> + /* control battery charging */
> + int (*enable_charging)(void);
> + int (*disable_charging)(void);
> +
> + /* control VSYS or system supply */
> + int (*turnon_charger)(void);
> + int (*turnoff_charger)(void);
> +};
> +

I'm all for this patch, but why do you need to place it into
power_supply.h and power_supply_core.c? :) I see nothing generic here,
it's pure charger-manager stuff. So, place everything into
charger-manager.{c,h}.

You can still add this:

> +#ifdef CONFIG_PSY_CM_LOW_LEVEL_SUPPORT
> + struct power_supply_charger_control *chrg_cntl;
> +#endif

to power_supply.h, of course. It's OK.

>  extern struct power_supply *power_supply_get_by_name(char *name);
> +extern struct power_supply_charger_control
> + *power_supply_get_chrg_cntl_by_name(const char *name);
>  extern void power_supply_changed(struct power_supply *psy);
>  extern int power_supply_am_i_supplied(struct power_supply *psy);
>  extern int power_supply_set_battery_charged(struct power_supply *psy);
> -- 
> 1.7.0.4
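
The proposed power_supply_charger_control is a table of function pointers that
is looked up by name, with NULL returned when nothing matches or when the
feature is compiled out. A minimal userspace sketch of that pattern, and of
the NULL checks a caller needs, follows. The reduced struct, the
demo_controls[] registry and every function name here are hypothetical
stand-ins for illustration; none of this is the charger-manager or power
supply API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced, hypothetical version of the ops table from the patch. */
struct charger_control {
	const char *name;
	int (*enable_charging)(void);
	int (*disable_charging)(void);
};

/* Fake hardware state so the callbacks have something to change. */
static int demo_charging;

static int demo_enable(void)  { demo_charging = 1; return 0; }
static int demo_disable(void) { demo_charging = 0; return 0; }

/* Stand-in registry; the patch walks the device class instead. */
static struct charger_control demo_controls[] = {
	{ "demo-charger", demo_enable, demo_disable },
};

/* Look up a control block by name; NULL when nothing matches. */
struct charger_control *get_chrg_cntl_by_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_controls) / sizeof(demo_controls[0]); i++)
		if (strcmp(demo_controls[i].name, name) == 0)
			return &demo_controls[i];
	return NULL;
}

/* Callers must cope with a missing control block or a missing callback. */
int enable_charging_on(const char *name)
{
	struct charger_control *cc = get_chrg_cntl_by_name(name);

	if (!cc || !cc->enable_charging)
		return -1;
	return cc->enable_charging();
}
```

Nothing in the sketch depends on the generic power supply code, which fits the
review comment that the struct and its lookup are charger-manager specific.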


Re: [PATCH 6/6] OF: Introduce DT overlay support.

2013-01-05 Thread Rob Landley

On 01/04/2013 01:31:10 PM, Pantelis Antoniou wrote:

Introduce DT overlay support.
Using this functionality it is possible to dynamically overlay a part of
the kernel's tree with another tree that's been dynamically loaded.
It is also possible to remove node and properties.

Signed-off-by: Pantelis Antoniou 


Just commenting on the documentation a bit...


---
 Documentation/devicetree/overlay-notes.txt | 179 +++
 drivers/of/Kconfig |  10 +
 drivers/of/Makefile|   1 +
 drivers/of/overlay.c   | 831 +
 include/linux/of.h | 107 
 5 files changed, 1128 insertions(+)
 create mode 100644 Documentation/devicetree/overlay-notes.txt
 create mode 100644 drivers/of/overlay.c

diff --git a/Documentation/devicetree/overlay-notes.txt b/Documentation/devicetree/overlay-notes.txt
new file mode 100644
index 000..5289cbb
--- /dev/null
+++ b/Documentation/devicetree/overlay-notes.txt
@@ -0,0 +1,179 @@
+Device Tree Overlay Notes
+-------------------------
+
+This document describes the implementation of the in-kernel
+device tree overlay functionality residing in drivers/of/overlay.c and is a
+companion document to Documentation/devicetree/dt-object-internal.txt[1] &
+Documentation/devicetree/dynamic-resolution-notes.txt[2]
+
+How overlays work
+-----------------
+
+A Device Tree's overlay purpose is to modify the kernel's live tree, and
+have the modification affecting the state of the the kernel in a way that
+is reflecting the changes.


My wild guess here is this has something to do with hotplug support,  
but I don't know if modules are expected to do this or if userspace  
does it and modules respond... Could you give a couple sentences about  
the purpose and potential users of this mechanism in the summary?


+Since the kernel mainly deals with devices, any new device node that result

results

+in an active device should have it created while if the device node is either
+disabled or removed all together, the affected device should be deregistered.


I'm not following this bit. It looks like some test is missing between  
"while if"?


+Lets take an example where we have a foo board with the following base tree
+which is taken from [1].
+
+---- foo.dts -----------------------------------------------------------------
+	/* FOO platform */
+	/ {
+		compatible = "corp,foo";
+
+		/* shared resources */
+		res: res {
+		};
+
+		/* On chip peripherals */
+		ocp: ocp {
+			/* peripherals that are always instantiated */
+			peripheral1 { ... };
+		}
+	};
+---- foo.dts -----------------------------------------------------------------
+
+The overlay bar.dts, when loaded (and resolved as described in [2]) should
+
+---- bar.dts -----------------------------------------------------------------
+/plugin/;	/* allow undefined label references and record them */
+/ {
+	/* various properties for loader use; i.e. part id etc. */
+	fragment@0 {
+		target = <&ocp>;
+		__overlay__ {
+			/* bar peripheral */
+			bar {
+				compatible = "corp,bar";
+				... /* various properties and child nodes */
+			}
+		};
+	};
+};
+---- bar.dts -----------------------------------------------------------------
+
+result in foo+bar.dts
+
+---- foo+bar.dts -------------------------------------------------------------
+	/* FOO platform + bar peripheral */
+	/ {
+		compatible = "corp,foo";
+
+		/* shared resources */
+		res: res {
+		};
+
+		/* On chip peripherals */
+		ocp: ocp {
+			/* peripherals that are always instantiated */
+			peripheral1 { ... };
+
+			/* bar peripheral */
+			bar {
+				compatible = "corp,bar";
+				... /* various properties and child nodes */
+			}
+		}
+	};
+---- foo+bar.dts -------------------------------------------------------------
+
+As a result of the the overlay, a new device node (bar) has been created
+so a bar platform device will be registered and if a matching device driver
+is loaded the device will be created as expected.


Is this done by a module, or does doing this then trigger a hotplug  
event that requests a module? (Or is this a syntax allowing a  
bootloader to collate multiple device tree segments and then Linux  
links them when parsing the device tree...?)



+Overlay in-kernel API
+---------------------
+
+The steps typically required to get an overlay to work are as follows:
+
+1. Use 

Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > 
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > > 
> > > Hmm, or we should fix __skb_splice_bits()
> > > 
> > > I'll send a patch.
> > > 
> > 
> > Could you try the following ?
> 
> Or more exactly...
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, 
> unsigned int poff,
>   return false;
>   }
>  
> - /* ignore any bits we already processed */
> - if (*off) {
> - __segment_seek(&page, &poff, &plen, *off);
> - *off = 0;
> - }
> + __segment_seek(&page, &poff, &plen, *off);
> + *off = 0;
>  
>   do {
>   unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
> struct pipe_inode_info *pipe,
> struct splice_pipe_desc *spd, struct sock *sk)
>  {
>   int seg;
> + struct page *page = virt_to_page(skb->data);
> + unsigned int poff = skb->data - (unsigned char *)page_address(page);
>  
>   /* map the linear part :
>* If skb->head_frag is set, this 'linear' part is backed by a
>* fragment, and if the head is not shared with any clones then
>* we can avoid a copy since we own the head portion of this page.
>*/
> - if (__splice_segment(virt_to_page(skb->data),
> -  (unsigned long) skb->data & (PAGE_SIZE - 1),
> + if (__splice_segment(page, poff,
>skb_headlen(skb),
>offset, len, skb, spd,
>skb_head_is_locked(skb),
> 

OK so I observed no change with this patch, either on the loopback
data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
experimentation anyway.
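
A note on the second hunk of the patch quoted above. Assuming virt_to_page()
resolves to the page that contains skb->data and page_address() returns that
page's aligned start (both are assumptions about the kernel helpers, not
statements made in this thread), subtracting the base yields the same offset
as masking with PAGE_SIZE - 1. If that holds, the hunk on its own would not be
expected to change behaviour, which would be consistent with the result
reported here. A userspace sketch of the arithmetic, with a fixed 4 kB page
size and hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 kB pages */

/* Offset within the page, obtained by masking the low address bits
 * (the form used before the patch). */
unsigned int offset_by_mask(uintptr_t addr)
{
	return (unsigned int)(addr & (SKETCH_PAGE_SIZE - 1));
}

/* Offset within the page, obtained by subtracting the page-aligned base
 * (the form used after the patch; the base is modelled by rounding down). */
unsigned int offset_by_base(uintptr_t addr)
{
	uintptr_t base = addr & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);

	return (unsigned int)(addr - base);
}
```

The two helpers agree for any address as long as the base is page aligned, so
any behavioural difference would have to come from the first hunk, which makes
the __segment_seek() call unconditional.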

Concerning the loopback MTU, I find it strange that the MTU changes
the splice() behaviour and not send/recv. I thought that there could
be a relation between the MTU and the pipe size, but it does not
appear to be the case either, as I tried various sizes between 16kB
and 256kB without achieving original performance.

I've started to bisect the 10GE issue again (since both issues are
unrelated), but I'll finish tomorrow, it's time to get some sleep
now.

Best regards,
Willy



Re: [PATCH 0/3] power: bq2415x_charger: Some cleanup

2013-01-05 Thread Anton Vorontsov
On Tue, Nov 27, 2012 at 11:28:43AM +0530, Sachin Kamat wrote:
> This series is build tested against the linux-next tree (20121126)
> 
> Sachin Kamat (3):
>   power: bq2415x_charger: Remove unneeded version.h inclusion
>   power: bq2415x_charger: Use module_i2c_driver
>   power: bq2415x_charger: Use devm_kzalloc()

Applied, thanks a lot!


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:32 +0100, Willy Tarreau wrote:

> It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
> reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
> The trick is that once the MTU has been set to this large a value,
> even when I set it back to 16kB the problem persists.
> 

Well, this MTU change can uncover a prior bug, or make it happen faster,
for sure.





Re: [PATCH] power: 88pm860x_battery: Add a few more devm_* APIs

2013-01-05 Thread Anton Vorontsov
On Fri, Nov 23, 2012 at 06:11:52PM +0530, Tushar Behera wrote:
> Add devm_* APIs for threaded IRQ.
> 
> Also since devres managed objects are removed when the device gets
> detached, remove explicit freeing of them.
> 
> Signed-off-by: Tushar Behera 
> ---
[...]
> @@ -994,9 +988,6 @@ static int __devexit pm860x_battery_remove(struct 
> platform_device *pdev)
>   struct pm860x_battery_info *info = platform_get_drvdata(pdev);
>  
>   power_supply_unregister(&info->battery);
> - free_irq(info->irq_batt, info);
> - free_irq(info->irq_cc, info);
> - kfree(info);

It is not safe to access the battery ('struct power_supply') object after
_unregister() (and the irq handlers surely will). Instead of removing
free_irq(), you should place the two calls before _unregister().

Thanks,

>   platform_set_drvdata(pdev, NULL);
>   return 0;
>  }
> -- 
> 1.7.4.1
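
The ordering concern raised in this review can be modelled in userspace: if
the supply is unregistered while an interrupt can still fire, the handler
touches an object that is already gone. The sketch below only counts such
accesses. Every type and function in it is a hypothetical stand-in, not the
kernel's IRQ or power supply API, and the "late interrupt" is simulated by
calling the handler by hand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver state. */
struct fake_supply {
	bool registered;
};

struct fake_info {
	struct fake_supply battery;
	bool irq_active;	/* true while the handler may still run */
	int bad_accesses;	/* handler runs that saw an unregistered supply */
};

/* Models an interrupt handler that uses the battery object. */
void fake_irq_handler(struct fake_info *info)
{
	if (!info->irq_active)
		return;			/* IRQ already freed: handler never runs */
	if (!info->battery.registered)
		info->bad_accesses++;	/* would be a use-after-unregister */
}

void fake_free_irq(struct fake_info *info)   { info->irq_active = false; }
void fake_unregister(struct fake_info *info) { info->battery.registered = false; }

/* Order suggested in the review: release the IRQs, then unregister. */
int remove_irq_first(struct fake_info *info)
{
	fake_free_irq(info);
	fake_unregister(info);
	fake_irq_handler(info);		/* simulated late interrupt: ignored */
	return info->bad_accesses;
}

/* Order left by the patch: unregister while the IRQ is still live. */
int remove_unregister_first(struct fake_info *info)
{
	fake_unregister(info);
	fake_irq_handler(info);		/* simulated interrupt in the window */
	fake_free_irq(info);
	return info->bad_accesses;
}
```

In the first ordering the simulated interrupt finds the handler disabled. In
the second it runs against an unregistered supply, which is the window the
review asks to close.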


Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:22:13PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
> > On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> > > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > > > 
> > > > > > Ah interesting because these were some of the mm patches that I had
> > > > > > tried to revert.
> > > > > 
> > > > > Hmm, or we should fix __skb_splice_bits()
> > > > > 
> > > > > I'll send a patch.
> > > > > 
> > > > 
> > > > Could you try the following ?
> > > 
> > > Or more exactly...
> > 
> > The first one did not change an iota unfortunately. I'm about to
> > spot the commit causing the loopback regression. It's a few patches
> > before the first one you pointed to. It's almost finished and I'll test
> > your patch below immediately after.
> 
> I bet you are going to find commit
> 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
> (net: use bigger pages in __netdev_alloc_frag )
> 
> Am I wrong ?

Yes this time you guessed wrong :-) Well, maybe it's contributing
to the issue.

It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
The trick is that once the MTU has been set to this large a value,
even when I set it back to 16kB the problem persists.

Now I'm retrying your other patch to see if it brings the 10GE back
to full speed.

Willy



Re: [PATCH] EXTCON: Get and set cable properties

2013-01-05 Thread Anton Vorontsov
On Mon, Dec 03, 2012 at 02:09:02AM +, Tc, Jenny wrote:
> > > Could you please review this. This is a follow up patch for "PATCH]
> > > extcon : callback function to read cable property"
> > 
> > While I see nothing wrong with the patch itself, I beg you to send some 
> > users
> > for the new calls. Don't be obsessed with the extcon internals too much,
> > think more about how things will interact (i.e. I really really want to see 
> > how
> > you use these calls from the power supply drivers).
> 
> The usage of extcon cable property is captured in patch 
> https://lkml.org/lkml/2012/10/18/219
> This patch uses a extcon_dev  callback function get_cable_properties() to get 
> the
> cable properties. As discussed in the previous mail thread, it may not be 
> good to have a extcon call
> back function since the extcon provider may not be aware of the cable 
> properties. This patch replaces
> the callback function with an API, so that whoever knows the cable property, 
> can set the property
> using the extcon API extcon_cable_set_data().
> 
> The usage flow would be
> 1)Consumer gets a notification from the extcon
> 2)consumer reads the property using the API extcon_cable_get_data
> 
> This way it isn't mandatory for the extcon provider to give the cable
> property.
> Anyone who is aware of the cable property can set the cable property using 
> the API.
> It makes the consumer and provider implementations very simple.
> 
> With this new API, the callback function in patch 
> https://lkml.org/lkml/2012/10/18/219 can be
> replaced by the API extcon_cable_set_data().

Looking at this, the whole idea of hiding the power source behind "extcon"
seems dubious. Why don't you use the USB device to get the current?

"extcon" subsystem, as I see it, should only be used to get notified about
external connectors events. And that's all. And chargers probably should
not even care about extcon (well, with the exception of the direct AC/gpio
power source).

For USB, it would make more sense if you'd get plug/unplug
notifications *and* properties from the USB device (or OTG transceiver)
directly, not from the extcon. And I guess we have this mechanism already,
see drivers/power/pda_power.c.

Thanks,
Anton


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
> On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > > 
> > > > > Ah interesting because these were some of the mm patches that I had
> > > > > tried to revert.
> > > > 
> > > > Hmm, or we should fix __skb_splice_bits()
> > > > 
> > > > I'll send a patch.
> > > > 
> > > 
> > > Could you try the following ?
> > 
> > Or more exactly...
> 
> The first one did not change an iota unfortunately. I'm about to
> spot the commit causing the loopback regression. It's a few patches
> before the first one you pointed to. It's almost finished and I'll test
> your patch below immediately after.

I bet you are going to find commit
69b08f62e17439ee3d436faf0b9a7ca6fffb78db
(net: use bigger pages in __netdev_alloc_frag )

Am I wrong ?




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > 
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > > 
> > > Hmm, or we should fix __skb_splice_bits()
> > > 
> > > I'll send a patch.
> > > 
> > 
> > Could you try the following ?
> 
> Or more exactly...

The first one did not change an iota unfortunately. I'm about to
spot the commit causing the loopback regression. It's a few patches
before the first one you pointed to. It's almost finished and I'll test
your patch below immediately after.

Thanks,
Willy

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, 
> unsigned int poff,
>   return false;
>   }
>  
> - /* ignore any bits we already processed */
> - if (*off) {
> - __segment_seek(&page, &poff, &plen, *off);
> - *off = 0;
> - }
> + __segment_seek(&page, &poff, &plen, *off);
> + *off = 0;
>  
>   do {
>   unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
> struct pipe_inode_info *pipe,
> struct splice_pipe_desc *spd, struct sock *sk)
>  {
>   int seg;
> + struct page *page = virt_to_page(skb->data);
> + unsigned int poff = skb->data - (unsigned char *)page_address(page);
>  
>   /* map the linear part :
>* If skb->head_frag is set, this 'linear' part is backed by a
>* fragment, and if the head is not shared with any clones then
>* we can avoid a copy since we own the head portion of this page.
>*/
> - if (__splice_segment(virt_to_page(skb->data),
> -  (unsigned long) skb->data & (PAGE_SIZE - 1),
> + if (__splice_segment(page, poff,
>skb_headlen(skb),
>offset, len, skb, spd,
>skb_head_is_locked(skb),
> 


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > 
> > > Ah interesting because these were some of the mm patches that I had
> > > tried to revert.
> > 
> > Hmm, or we should fix __skb_splice_bits()
> > 
> > I'll send a patch.
> > 
> 
> Could you try the following ?

Or more exactly...

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..01f222c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned 
int poff,
return false;
}
 
-   /* ignore any bits we already processed */
-   if (*off) {
-   __segment_seek(&page, &poff, &plen, *off);
-   *off = 0;
-   }
+   __segment_seek(&page, &poff, &plen, *off);
+   *off = 0;
 
do {
unsigned int flen = min(*len, plen);
@@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct page *page = virt_to_page(skb->data);
+   unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
 * fragment, and if the head is not shared with any clones then
 * we can avoid a copy since we own the head portion of this page.
 */
-   if (__splice_segment(virt_to_page(skb->data),
-(unsigned long) skb->data & (PAGE_SIZE - 1),
+   if (__splice_segment(page, poff,
 skb_headlen(skb),
 offset, len, skb, spd,
 skb_head_is_locked(skb),




RE: Incorrect accounting of irq into the running task

2013-01-05 Thread Sadasivan Shaiju
Hi Shaun,

-Original Message-
From: Shaun Ruffell [mailto:sruff...@digium.com]
Sent: Saturday, January 05, 2013 9:21 AM
To: Sadasivan Shaiju
Cc: linux-kernel@vger.kernel.org; ve...@google.com; a.p.zijls...@chello.nl
Subject: Re: Incorrect accounting of irq into the running task

On Fri, Jan 04, 2013 at 10:22:12AM -0800, Sadasivan Shaiju wrote:
> Hi  Venkatesh,
>
> I have applied the following patches for the incorrect accounting of
> irq into the running task .
>
>
> [PATCH] x86: Add IRQ_TIME_ACCOUNTING
> [e82b8e4ea4f3dffe6e7939f90e78da675fcc450e]
> [PATCH] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
> [b52bfee445d315549d41eacf2fa7c156e7d153d5]
>
> [PATCH] sched: Do not account irq time to current task
> [305e6835e05513406fa12820e40e4a8ecb63743c]
> [PATCH] sched: Export ns irqtimes through /proc/stat
> [abb74cefa9c682fb38ba86c17ca3c86fed6cc464]
>
>
>
> But still the stime and utime of the process in /proc/pid/stat are
> high. I think the above patches do not update the stime and utime
> values in /proc/pid/stat.
>
>
> Or am I missing anything?

Just checking that you do have CONFIG_IRQ_TIME_ACCOUNTING=y in your kernel
config?

  Yes, it is turned on.

Regards,
Shaiju.

Cheers,
Shaun


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> 
> > Ah interesting because these were some of the mm patches that I had
> > tried to revert.
> 
> Hmm, or we should fix __skb_splice_bits()
> 
> I'll send a patch.
> 

Could you try the following ?

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..c5246be 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1768,14 +1768,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct page *page = virt_to_page(skb->data);
+   unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
 * fragment, and if the head is not shared with any clones then
 * we can avoid a copy since we own the head portion of this page.
 */
-   if (__splice_segment(virt_to_page(skb->data),
-(unsigned long) skb->data & (PAGE_SIZE - 1),
+   if (__splice_segment(page, poff,
 skb_headlen(skb),
 offset, len, skb, spd,
 skb_head_is_locked(skb),




Re: [PATCH 2/2] ab8500: promote ab8500_fg probe before ab8500_btemp probe

2013-01-05 Thread Anton Vorontsov
On Mon, Dec 03, 2012 at 11:42:55PM +0530, Rajanikanth H.V wrote:
> From: "Rajanikanth H.V" 
> 
> ab8500_fg driver prepares instance list of fuelgauge which is
> required by btemp driver for battery identification. So make sure
> that ab8500 fuelgauge list is ready before btemp driver starts.
> 
> for '3.7-rc5': of git://git.infradead.org/battery-2.6.git
> 
> Acked-by: Lee Jones 
> Signed-off-by: Rajanikanth H.V 
> ---

This one, and "1/2 v2" applied, thanks!

>  drivers/power/Makefile |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/power/Makefile b/drivers/power/Makefile
> index 696e3a9..070c73d 100644
> --- a/drivers/power/Makefile
> +++ b/drivers/power/Makefile
> @@ -38,7 +38,7 @@ obj-$(CONFIG_CHARGER_PCF50633)  += pcf50633-charger.o
>  obj-$(CONFIG_BATTERY_JZ4740) += jz4740-battery.o
>  obj-$(CONFIG_BATTERY_INTEL_MID)  += intel_mid_battery.o
>  obj-$(CONFIG_BATTERY_RX51)   += rx51_battery.o
> -obj-$(CONFIG_AB8500_BM)  += ab8500_bmdata.o ab8500_charger.o 
> ab8500_btemp.o ab8500_fg.o abx500_chargalg.o
> +obj-$(CONFIG_AB8500_BM)  += ab8500_bmdata.o ab8500_charger.o 
> ab8500_fg.o ab8500_btemp.o abx500_chargalg.o
>  obj-$(CONFIG_CHARGER_ISP1704)+= isp1704_charger.o
>  obj-$(CONFIG_CHARGER_MAX8903)+= max8903_charger.o
>  obj-$(CONFIG_CHARGER_TWL4030)+= twl4030_charger.o
> -- 
> 1.7.10.4


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:

> Ah interesting because these were some of the mm patches that I had
> tried to revert.

Hmm, or we should fix __skb_splice_bits()

I'll send a patch.




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 05:21:16PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:
> 
> > Yes, I've removed all zero counters in this short view for easier
> > reading (complete version appended at the end of this email). This
> > was after around 140 GB were transferred :
> 
> OK I only wanted to make sure skb were not linearized in xmit.
> 
> Could you try to disable CONFIG_COMPACTION ?

It's already disabled.

> ( This is the other thread mentioning this : "ppoll() stuck on POLLIN
> while TCP peer is sending" )

Ah interesting because these were some of the mm patches that I had
tried to revert.

Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:

> Yes, I've removed all zero counters in this short view for easier
> reading (complete version appended at the end of this email). This
> was after around 140 GB were transferred :

OK I only wanted to make sure skb were not linearized in xmit.

Could you try to disable CONFIG_COMPACTION ?

( This is the other thread mentioning this : "ppoll() stuck on POLLIN
while TCP peer is sending" )






Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 04:02:03PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:
> 
> > > 2) Another possibility would be that Myri card/driver doesnt like very
> > > well high order pages.
> > 
> > It looks like it has not changed much since 3.6 :-/ I really suspect
> > something is wrong with memory allocation. I have tried reverting many
> > patches affecting the mm/ directory just in case but I did not come to
> > anything useful yet.
> > 
> 
> Hmm, I was referring to TCP stack now using order-3 pages instead of
> order-0 ones
> 
> See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
> (net: use a per task frag allocator)

OK, so you think there are two distinct problems ?

I have tried to revert this one but it did not change the performance, I'm
still saturating at ~6.9 Gbps.

> Could you please post :
> 
> ethtool -S eth0

Yes, I've removed all zero counters in this short view for easier
reading (complete version appended at the end of this email). This
was after around 140 GB were transferred :

# ethtool -S eth1|grep -vw 0
NIC statistics:
 rx_packets: 8001500
 tx_packets: 10015409
 rx_bytes: 480115998
 tx_bytes: 148825674976
 tx_boundary: 2048
 WC: 1
 irq: 45
 MSI: 1
 read_dma_bw_MBs: 1200
 write_dma_bw_MBs: 1614
 read_write_dma_bw_MBs: 2101
 serial_number: 320061
 link_changes: 2
 link_up: 1
 tx_pkt_start: 10015409
 tx_pkt_done: 10015409
 tx_req: 93407411
 tx_done: 93407411
 rx_small_cnt: 8001500
 wake_queue: 187727
 stop_queue: 187727
 LRO aggregated: 146
 LRO flushed: 146
 LRO avg aggr: 1
 LRO no_desc: 80

Quite honestly, this is typically the pattern that I'm used to
observing here. I'm now trying to bisect; hopefully we'll get
something exploitable.

Cheers,
Willy

- full ethtool -S 

NIC statistics:
 rx_packets: 8001500
 tx_packets: 10015409
 rx_bytes: 480115998
 tx_bytes: 148825674976
 rx_errors: 0
 tx_errors: 0
 rx_dropped: 0
 tx_dropped: 0
 multicast: 0
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_boundary: 2048
 WC: 1
 irq: 45
 MSI: 1
 MSIX: 0
 read_dma_bw_MBs: 1200
 write_dma_bw_MBs: 1614
 read_write_dma_bw_MBs: 2101
 serial_number: 320061
 watchdog_resets: 0
 link_changes: 2
 link_up: 1
 dropped_link_overflow: 0
 dropped_link_error_or_filtered: 0
 dropped_pause: 0
 dropped_bad_phy: 0
 dropped_bad_crc32: 0
 dropped_unicast_filtered: 0
 dropped_multicast_filtered: 0
 dropped_runt: 0
 dropped_overrun: 0
 dropped_no_small_buffer: 0
 dropped_no_big_buffer: 0
 --- slice -: 0
 tx_pkt_start: 10015409
 tx_pkt_done: 10015409
 tx_req: 93407411
 tx_done: 93407411
 rx_small_cnt: 8001500
 rx_big_cnt: 0
 wake_queue: 187727
 stop_queue: 187727
 tx_linearized: 0
 LRO aggregated: 146
 LRO flushed: 146
 LRO avg aggr: 1
 LRO no_desc: 80



Linux 3.8-rc1: compiling problem in perf-event-p6.o

2013-01-05 Thread werner


The problem continues with 3.8-rc

This is grave, no vmlinuz is produced.


wl

  CC  arch/x86/kernel/cpu/perf_event.o
  CC  arch/x86/kernel/cpu/perf_event_amd.o
  CC  arch/x86/kernel/cpu/perf_event_p6.o
arch/x86/kernel/cpu/perf_event_p6.c:22: error: p6_hw_cache_event_ids causes a section type conflict
make[3]: [arch/x86/kernel/cpu/perf_event_p6.o] Error 1 (ignored)

  CC  arch/x86/kernel/cpu/perf_event_knc.o
  CC  arch/x86/kernel/cpu/perf_event_p4.o
  CC  arch/x86/kernel/cpu/perf_event_intel_lbr.o


A compile error occurs in perf_event_p6.o, apparently a
regression. Unfortunately I lost the full build log, but I
think it was some incompatibility or redefinition involving
something else. Please check and correct that, if not already
done.

W.Landgraf


Re: [PATCH/RFC 0/1] Delete legacy power trace API

2013-01-05 Thread Paul Gortmaker
[Re: [PATCH/RFC 0/1] Delete legacy power trace API] On 05/01/2013 (Sat 23:10) 
Rafael J. Wysocki wrote:

> On Friday, January 04, 2013 08:49:03 PM Paul Gortmaker wrote:
> > The actual deletion is mind-numbingly simple; and if you go by the
> > comments in the code, it is well overdue.  However, in discussions
> > with Frederic, he suggested to me that those comments might have
> > been overly optimistic, and that there may still be people out
> > there who are still unknowingly using this dead API.
> > 
> > So, that is the crux of the RFC component -- to check whether the
> > comments saying "delete by v3.1" can be taken at face value, or
> > whether they were overly optimistic, and hence this stuff is still
> > actively used even though it is overdue for deletion.
> 
> Do you want me or the tracing maintainers to handle this?

I have no particular preference as to what path it takes in
getting merged to mainline - just so long as we don't hear anyone
requesting it to _not_ be removed in the next few days.

Paul.
--


> 
> Rafael
> 
> 
> > ---
> > 
> > Paul Gortmaker (1):
> >   tracing: remove deprecated power trace API
> > 
> >  Documentation/trace/events-power.txt | 27 +--
> >  arch/arm/mach-omap2/pm34xx.c |  2 -
> >  arch/x86/kernel/process.c|  6 ---
> >  drivers/cpufreq/cpufreq.c|  1 -
> >  drivers/cpuidle/cpuidle.c|  2 -
> >  include/trace/events/power.h | 92 
> > 
> >  kernel/trace/Kconfig | 15 --
> >  kernel/trace/power-traces.c  |  3 --
> >  8 files changed, 1 insertion(+), 147 deletions(-)
> > 
> > 
> -- 
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:

> > 2) Another possibility would be that Myri card/driver doesnt like very
> > well high order pages.
> 
> It looks like it has not changed much since 3.6 :-/ I really suspect
> something is wrong with memory allocation. I have tried reverting many
> patches affecting the mm/ directory just in case but I did not come to
> anything useful yet.
> 

Hmm, I was referring to TCP stack now using order-3 pages instead of
order-0 ones

See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
(net: use a per task frag allocator)
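
For readers less familiar with the term: an order-n allocation is a block
of 2^n physically contiguous pages. A minimal sketch of the arithmetic,
assuming 4 KiB base pages (MODEL_PAGE_SIZE and order_to_bytes() are
made-up names for illustration, not kernel symbols):

```c
/* Assumption: 4 KiB base pages, as on x86. */
#define MODEL_PAGE_SIZE 4096UL

/* Bytes covered by one order-n allocation: 2^n contiguous pages. */
static unsigned long order_to_bytes(unsigned int order)
{
	return MODEL_PAGE_SIZE << order;
}
```

Under that assumption an order-0 allocation is a single 4 KiB page and an
order-3 allocation is 32 KiB of contiguous memory.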

Could you please post :

ethtool -S eth0





Re: [PATCH -v2 09/26] infiniband: rename random32() to prandom_u32()

2013-01-05 Thread Steve Wise

Reviewed-by: Steve Wise 


Re: [PATCH -v2 09/26] infiniband: rename random32() to prandom_u32()

2013-01-05 Thread Steve Wise

On 1/5/2013 7:37 AM, Akinobu Mita wrote:
> 2013/1/5 Steve Wise :
>> I'm asking: why are you bothering with renaming the functions?  This seems
>> like a needless change, _unless_ there are really non-pseudo-random services
>> being added.
>
> We already have get_random_byte() which is not pseudo-random number
> generator.
>
> Apart from that, the naming scheme was confusing without "prandom" prefix.
> Because I introduced new functions in the commit 6582c665 ("prandom:
> introduce prandom_bytes() and prandom_bytes_state()").

Ok, thanks for the explanation.


Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
Hi Eric,

On Sat, Jan 05, 2013 at 03:18:46PM -0800, Eric Dumazet wrote:
> Hi Willy, another good finding during the week end ! ;)

Yes, I wanted to experiment with TFO and stopped on this :-)

> 1) This looks like interrupts are spreaded on multiple cpus, and this
> give Out Of Order problems with TCP stack.

No, I forgot to mention this, I have tried to bind IRQs to a single
core, with the server either on the same or another one, but the
problem remained.

Also, the loopback is much more affected and doesn't use IRQs. And
BTW tcpdump on the loopback shouldn't drop that many packets (up to
90% even at low rate). I just noticed something, transferring data
using netcat on the loopback doesn't affect tcpdump. So it's likely
only the spliced data that are affected.

> 2) Another possibility would be that Myri card/driver doesnt like very
> well high order pages.

It looks like it has not changed much since 3.6 :-/ I really suspect
something is wrong with memory allocation. I have tried reverting many
patches affecting the mm/ directory just in case but I did not come to
anything useful yet.

I'm continuing to dig.

Thanks,
Willy



Re: [PATCH v2 2/4] input: keyboard: tegra: use devm_* for resource allocation

2013-01-05 Thread Dmitry Torokhov
On Sat, Jan 05, 2013 at 04:50:58PM +0530, Laxman Dewangan wrote:
> HI Dmitry,
> Thanks for quick review.
> 
> I will take care of your comment in next version. Some have my answer.
> 
> 
> On Saturday 05 January 2013 01:36 PM, Dmitry Torokhov wrote:
> >Hi Laxman,
> >
> >On Sat, Jan 05, 2013 at 01:15:08PM +0530, Laxman Dewangan wrote:
> >>Use devm_* for memory, clock, input device allocation. This reduces
> >>code for freeing these resources.
> 
> >>err = tegra_kbd_setup_keymap(kbc);
> >>-   if (err) {
> >>+   if (err < 0) {
> >Why is this change? As far as I can see tegra_kbd_setup_keymap() never
> >returns positive values.
> 
> Ok, mostly errors are in negative and hence this change, I will
> revert it and will keep original.
> 
> >
> >>dev_err(&pdev->dev, "failed to setup keymap\n");
> >>-   goto err_put_clk;
> >>+   return err;
> >>}
> >>__set_bit(EV_REP, input_dev->evbit);
> >>@@ -790,15 +784,15 @@ static int tegra_kbc_probe(struct platform_device *pdev)
> >>err = request_irq(kbc->irq, tegra_kbc_isr,
> >>  IRQF_NO_SUSPEND | IRQF_TRIGGER_HIGH, pdev->name, kbc);
> >>-   if (err) {
> >>+   if (err < 0) {
> >Neither request_irq(). BTW, why not devm_request_irq?
> 
> I understand from Mark B on different patches that using
> devm_request_irq() can create race condition when removing device.
> Interrupt can occur when device resource release is in process and
> so it can cause isr call which can use the freed pointer.
> devm_request_irq() should be avoided.

devm_request_irq() can create a race condition, but that depends on the
driver. Here the tegra driver ensures that interrupts are inhibited when
the input device is unregistered by providing the tegra_kbc_close()
method, so it is safe to use devm_request_irq().

Also, when using managed input devices, the unregistering and final
freeing is a 2-step process, so even in absence of close() method, if
initialization sequence was:

devm_input_allocate_device()
...
devm_request_irq()
...
input_register_device()

then order of freeing resources (behind the scenes) will be

devm_input_device_unregister();
/* input device is still present in memory and can
 * handle input_event() calls.
 */
free_irq();
devm_input_device_release();

So using managed request_irq() _together_ with managed input devices is
OK.
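
To illustrate the ordering above, here is a small user-space model (not
kernel code). It assumes that managed resources are released in reverse
order of registration, and that registering a managed input device adds
its own unregister action at that point. All model_* names are made up
for this sketch.

```c
#include <stdio.h>
#include <string.h>

/* User-space model of devres-style cleanup; none of this is kernel API. */
#define MAX_ACTIONS 8

struct model_dev {
	const char *actions[MAX_ACTIONS];	/* cleanup names, in registration order */
	int count;
};

/* Record a cleanup action, as a managed (devm_*) call would. */
static void model_add_action(struct model_dev *dev, const char *name)
{
	if (dev->count < MAX_ACTIONS)
		dev->actions[dev->count++] = name;
}

/* Run cleanup actions newest-first and append their names to out. */
static void model_release_all(struct model_dev *dev, char *out, size_t len)
{
	out[0] = '\0';
	while (dev->count > 0) {
		const char *name = dev->actions[--dev->count];

		strncat(out, name, len - strlen(out) - 1);
		if (dev->count > 0)
			strncat(out, " -> ", len - strlen(out) - 1);
	}
}

/* The probe sequence from the message, reduced to its cleanup actions. */
static void model_probe(struct model_dev *dev)
{
	/* devm_input_allocate_device() */
	model_add_action(dev, "devm_input_device_release");
	/* devm_request_irq() */
	model_add_action(dev, "free_irq");
	/* registering the managed input device */
	model_add_action(dev, "devm_input_device_unregister");
}
```

Calling model_probe() and then model_release_all() yields the unregister
step first, then free_irq, then the final release, so in this model the
interrupt handler never runs against freed input device memory.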

Thanks.

-- 
Dmitry


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote:
> Hi,
> 
> I'm observing multiple apparently unrelated network performance
> issues in 3.7, to the point that I'm doubting it comes from the
> network stack.
> 
> My setup involves 3 machines connected point-to-point with myri
> 10GE NICs (the middle machine has 2 NICs). The middle machine
> normally runs haproxy, the other two run either an HTTP load
> generator or a dummy web server :
> 
> 
>   [ client ] <> [ haproxy ] <> [ server ]
> 
> Usually transferring HTTP objects from the server to the client
> via haproxy causes no problem at 10 Gbps for moderately large
> objects.
> 
> This time I observed that it was not possible to go beyond 6.8 Gbps,
> with all the chain idling a lot. I tried to change the IRQ rate, CPU
> affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
> knobs, nothing managed to go beyond.
> 
> So I removed haproxy from the equation, and simply started the client
> on the middle machine. Same issue. I thought about concurrency issues,
> so I reduced to a single connection, and nothing changed (usually I
> achieve 10G even with a single connection with large enough TCP windows).
> I tried to start tcpdump and the transfer immediately stalled and did not
> come back after I stopped tcpdump. This was reproducible several times
> but not always.
> 
> So I first thought about an issue in the myri10ge driver and wanted to
> confirm that everything was OK on the middle machine.
> 
> I started the server on it and aimed the client at it via the loopback.
> The transfer rate was even worse : randomly oscillating between 10 and
> 100 MB/s ! Normally on the loop back, I get several GB/s here.
> 
> Running tcpdump on the loopback showed me several very concerning issues :
> 
> 1) lots of packets are lost before reaching tcpdump. The trace shows that
>these segments are ACKed so they're correctly received, but tcpdump
>does not get them. Tcpdump stats at the end report impressive numbers,
>around 90% packet dropped from the capture!
> 
> 2) ACKs seem to be immediately delivered but do not trigger sending, the
>system seems to be running with delayed ACKs, as it waits 40 or 200ms
>before restarting, and this is visible even in the first round trips :
> 
>- connection setup :
> 
>18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 
> 2036886615:2036886615(0) win 8030 
>18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 
> 126397113:126397113(0) ack 2036886616 win 8030  65495,nop,nop,sackOK,nop,wscale 9>
>18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
> 
>- GET /?s=1g HTTP/1.0
> 
>18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 
> 2036886616:2036886738(122) ack 126397114 win 16
> 
>- HTTP/1.1 200 OK with the beginning of the response :
> 
>18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
> 126397114:126401210(4096) ack 2036886738 win 16
>18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 
> 250
>==> 200ms pause here
>18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 
> 126401210:126463006(61796) ack 2036886738 win 16
>==> 40ms pause here
>18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 
> 256
>18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
> 126463006:126527006(64000) ack 2036886738 win 16
> 
>... and so on
> 
>My server is using splice() with the SPLICE_F_MORE flag to send data.
>I noticed that not using splice and relying on send(MSG_MORE) instead
>I don't get the issue.
> 
> 3) I wondered if this had something to do with the 64k MTU on the loopback
>so I lowered it to 16kB. The performance was even worse (about 5MB/s).
>Starting tcpdump managed to make my transfer stall, just like with the
>myri10ge. In this last test, I noticed that there were some real drops,
>because there were some SACKs :
> 
>18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 956153186:956169530(16344) ack 131668746 win 16
>18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
>18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957035762:957052106(16344) ack 131668746 win 16
>18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
>18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957052106:957099566(47460) ack 131668746 win 16
>18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957402550:957418894(16344) ack 131668746 win 16
>18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 
> 1846
>18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957672806:957689150(16344) ack 131668746 win 16
>18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 
> 2902
>18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 958962966:958979310(16344) ack 131668746 win 16
>18:45:17.516116 IP 

Re: [PATCH] drivers/input/keyboard/lm8323.c: fix incorrect left shift

2013-01-05 Thread Dmitry Torokhov
On Sat, Jan 05, 2013 at 02:09:05PM -0500, Nickolai Zeldovich wrote:
> In drivers/input/keyboard/lm8323.c, INT_PWM1 is already a bitmask,
> not the bit number, so shifting by INT_PWM1 is incorrect.
> 
> Signed-off-by: Nickolai Zeldovich 

Applied, thank you Nickolai.

> ---
>  drivers/input/keyboard/lm8323.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/input/keyboard/lm8323.c b/drivers/input/keyboard/lm8323.c
> index 93c8126..0de23f4 100644
> --- a/drivers/input/keyboard/lm8323.c
> +++ b/drivers/input/keyboard/lm8323.c
> @@ -398,7 +398,7 @@ static irqreturn_t lm8323_irq(int irq, void *_lm)
>   lm8323_configure(lm);
>   }
>   for (i = 0; i < LM8323_NUM_PWMS; i++) {
> - if (ints & (1 << (INT_PWM1 + i))) {
> + if (ints & (INT_PWM1 << i)) {
>   dev_vdbg(&lm->client->dev,
>    "pwm%d engine completed\n", i);
>   pwm_done(&lm->pwm[i]);
> -- 
> 1.7.10.4
> 
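
The difference between the two expressions in the hunk above can be seen
in a small stand-alone sketch. The value 0x20 for INT_PWM1 is an
assumption made for illustration (a mask with bit 5 set); the helper
names are hypothetical, and 64-bit arithmetic is used only so that the
old expression stays well defined here.

```c
#include <stdint.h>

/* Assumed value for illustration: a mask with bit 5 set, not bit number 5. */
#define INT_PWM1	0x20
#define NUM_PWMS	3

/* Old expression: treats INT_PWM1 as a bit number, so the shift is 32 + i. */
static uint64_t pwm_mask_old(unsigned int i)
{
	return UINT64_C(1) << (INT_PWM1 + i);
}

/* Fixed expression: shifts the mask itself, selecting bit 5 + i. */
static uint64_t pwm_mask_fixed(unsigned int i)
{
	return (uint64_t)INT_PWM1 << i;
}
```

With that assumed value, the fixed form gives 0x20, 0x40 and 0x80 for
i = 0, 1, 2, while the old form never lands inside the low eight bits, so
a test against an 8-bit interrupt status could not match.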

-- 
Dmitry


Re: 3.8-rc[12] cpufreq build errors...

2013-01-05 Thread Woody Suwalski

Larry Finger wrote:
> Woody,
>
> There is a patch pending that fixes this problem. See
> http://lkml.indiana.edu/hypermail/linux/kernel/1212.3/01201.html. Note
> that Rafael wrote "If you don't mind, I'll rename
> CONFIG_CPU_FREQ_GOVERNOR to CONFIG_CPU_FREQ_GOV_COMMON when applying it,
> though". I think this patch is working its way through the system, but
> I have not seen it in mainline yet.
>
> Larry

Thanks, that fix does not work directly for me: cpufreq_governor.o is
not even built now. But maybe I have missed something, as I had to
apply it "by hand".


Rafael has confirmed that he will include a fix...

Thanks, Woody


Re: [PATCH 01/25] charger_manager: don't use [delayed_]work_pending()

2013-01-05 Thread Anton Vorontsov
On Fri, Dec 21, 2012 at 05:56:51PM -0800, Tejun Heo wrote:
> There's no need to test whether a (delayed) work item in pending
> before queueing, flushing or cancelling it.  Most uses are unnecessary
> and quite a few of them are buggy.
> 
> Remove unnecessary pending tests and rewrite _setup_polling() so that
> it uses mod_delayed_work() if the next polling interval is sooner than
> currently scheduled.  queue_delayed_work() is used otherwise.
> 
> Only compile tested.  I noticed that two work items - setup_polling
> and cm_monitor_work - schedule each other.  It's a very unusual
> construct and I'm fairly sure it's racy.  You can't break such
> circular dependency by calling cancel on each.  I strongly recommend
> revising the mechanism.
> 
> Signed-off-by: Tejun Heo 
> Cc: Anton Vorontsov 
> Cc: David Woodhouse 
> Cc: Donggeun Kim 
> Cc: MyungJoo Ham 
> ---
> Please let me know how this patch should be routed.  I can take it
> through the workqueue tree if necessary.

Charger manager is a fast moving target, so it is prone to conflict; it is
better if I take it via battery-2.6.git tree. It is merged, thanks a lot!

Anton


Re: [PATCH 1/3] charger-manager: Fix bug related to checking fully charged state of battery

2013-01-05 Thread Anton Vorontsov
On Thu, Nov 22, 2012 at 04:44:15PM +0900, Chanwoo Choi wrote:
> This patch fix bug related to checking fully charged state of battery
> when charger-manager call is_full_charged() function. After reading
> property of charger/fuel-gauge through power_supply API, val.intval is
> more than 1. So, is_full_charged() function always return true. If true,
> battery means fully charged state.
> 
> Signed-off-by: Chanwoo Choi 
> Signed-off-by: Myungjoo Ham 
> Signed-off-by: Kyungmin Park 
> ---

Applied, thanks!


Re: [PATCH/RFC 0/1] Delete legacy power trace API

2013-01-05 Thread Rafael J. Wysocki
On Friday, January 04, 2013 08:49:03 PM Paul Gortmaker wrote:
> The actual deletion is mind-numbingly simple; and if you go by the
> comments in the code, it is well overdue.  However, in discussions
> with Frederic, he suggested to me that those comments might have
> been overly optimistic, and that there may still be people out
> there who are still unknowingly using this dead API.
> 
> So, that is the crux of the RFC component -- to check whether the
> comments saying "delete by v3.1" can be taken at face value, or
> whether they were overly optimistic, and hence this stuff is still
> actively used even though it is overdue for deletion.

Do you want me or the tracing maintainers to handle this?

Rafael


> ---
> 
> Paul Gortmaker (1):
>   tracing: remove deprecated power trace API
> 
>  Documentation/trace/events-power.txt | 27 +--
>  arch/arm/mach-omap2/pm34xx.c |  2 -
>  arch/x86/kernel/process.c|  6 ---
>  drivers/cpufreq/cpufreq.c|  1 -
>  drivers/cpuidle/cpuidle.c|  2 -
>  include/trace/events/power.h | 92 
> 
>  kernel/trace/Kconfig | 15 --
>  kernel/trace/power-traces.c  |  3 --
>  8 files changed, 1 insertion(+), 147 deletions(-)
> 
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: [PATCH v7u1 26/31] x86: Don't enable swiotlb if there is not enough ram for it

2013-01-05 Thread Shuah Khan
On Fri, Jan 4, 2013 at 9:10 PM, Yinghai Lu  wrote:
> On Fri, Jan 4, 2013 at 6:02 PM, Shuah Khan  wrote:
>> I applied your patch to 3.6.11 and changed the panic() to pr_info()
>> and also changed enough_mem_for_swiotlb() to always return false to
>> simulate not enough memory condition as this system does have enough
>> memory.
>>
>> So at least on this AMD system, your patch will result in a panic.
>
> ok, thanks for testing.
>
> if enough_mem_for_swiotlb() return false really,  allocating buffer
> for swiotlb with bootmem would panic already, right?
>
> so this patch just delay the panic a while for AMD system with
> unhandled devices by IOMMU.
>
> Thanks
>
> Yinghai

Right. It will eventually panic. I think this is not a valid test. I
am planning to run more tests without forcing the no-memory condition,
which is what I should have done in the first place. I will let you
know what I find, very likely on Monday.

Thanks,
-- Shuah


Re: [PATCH 0/6] ACPI / PM: ACPI power management update

2013-01-05 Thread Rafael J. Wysocki
On Saturday, January 05, 2013 10:31:11 AM Sedat Dilek wrote:
> Hi Rafael,
> 
> against which Linux-kernel version is your patchset?
> Linux v3.8-rc2?
> Mambo number 5 [1] aka patch 5/6 does not apply cleanly.

Oh, I didn't say, sorry about that.

It is on top of the linux-next branch of the linux-pm.git tree with the
additional patchset at https://lkml.org/lkml/2013/1/3/460 applied.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
Hi,

I'm observing multiple apparently unrelated network performance
issues in 3.7, to the point that I'm doubting it comes from the
network stack.

My setup involves 3 machines connected point-to-point with myri
10GE NICs (the middle machine has 2 NICs). The middle machine
normally runs haproxy, the other two run either an HTTP load
generator or a dummy web server :


  [ client ] <> [ haproxy ] <> [ server ]

Usually transferring HTTP objects from the server to the client
via haproxy causes no problem at 10 Gbps for moderately large
objects.

This time I observed that it was not possible to go beyond 6.8 Gbps,
with all the chain idling a lot. I tried to change the IRQ rate, CPU
affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
knobs, nothing managed to go beyond.

So I removed haproxy from the equation, and simply started the client
on the middle machine. Same issue. I thought about concurrency issues,
so I reduced to a single connection, and nothing changed (usually I
achieve 10G even with a single connection with large enough TCP windows).
I tried to start tcpdump and the transfer immediately stalled and did not
come back after I stopped tcpdump. This was reproducible several times
but not always.

So I first thought about an issue in the myri10ge driver and wanted to
confirm that everything was OK on the middle machine.

I started the server on it and aimed the client at it via the loopback.
The transfer rate was even worse : randomly oscillating between 10 and
100 MB/s ! Normally on the loop back, I get several GB/s here.

Running tcpdump on the loopback showed me several very concerning issues :

1) lots of packets are lost before reaching tcpdump. The trace shows that
   these segments are ACKed so they're correctly received, but tcpdump
   does not get them. Tcpdump stats at the end report impressive numbers,
   around 90% packet dropped from the capture!

2) ACKs seem to be immediately delivered but do not trigger sending, the
   system seems to be running with delayed ACKs, as it waits 40 or 200ms
   before restarting, and this is visible even in the first round trips :

   - connection setup :

   18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 
2036886615:2036886615(0) win 8030 
   18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 
126397113:126397113(0) ack 2036886616 win 8030 
   18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16

   - GET /?s=1g HTTP/1.0

   18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 
2036886616:2036886738(122) ack 126397114 win 16

   - HTTP/1.1 200 OK with the beginning of the response :

   18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
126397114:126401210(4096) ack 2036886738 win 16
   18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
   ==> 200ms pause here
   18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 
126401210:126463006(61796) ack 2036886738 win 16
   ==> 40ms pause here
   18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
   18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
126463006:126527006(64000) ack 2036886738 win 16

   ... and so on

   My server is using splice() with the SPLICE_F_MORE flag to send data.
   I noticed that not using splice and relying on send(MSG_MORE) instead
   I don't get the issue.

3) I wondered if this had something to do with the 64k MTU on the loopback
   so I lowered it to 16kB. The performance was even worse (about 5MB/s).
   Starting tcpdump managed to make my transfer stall, just like with the
   myri10ge. In this last test, I noticed that there were some real drops,
   because there were some SACKs :

   18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
956153186:956169530(16344) ack 131668746 win 16
   18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
   18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957035762:957052106(16344) ack 131668746 win 16
   18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
   18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957052106:957099566(47460) ack 131668746 win 16
   18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957402550:957418894(16344) ack 131668746 win 16
   18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
   18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957672806:957689150(16344) ack 131668746 win 16
   18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
   18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
958962966:958979310(16344) ack 131668746 win 16
   18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
   18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 

   18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 
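
For reference, the two send paths compared in point 2 look roughly like
this. It is only a sketch under stated assumptions: it uses an AF_UNIX
socketpair and a pipe so that it is self-contained, it does not reproduce
the stalls, and demo_both_paths() is a hypothetical helper, not code from
the server.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for splice() and SPLICE_F_MORE */
#endif
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Path 1: copy from a user buffer; MSG_MORE hints that more data follows. */
static ssize_t send_with_more(int sock, const void *buf, size_t len)
{
	return send(sock, buf, len, MSG_MORE);
}

/* Path 2: move data already sitting in a pipe into the socket. */
static ssize_t splice_with_more(int pipe_rd, int sock, size_t len)
{
	return splice(pipe_rd, NULL, sock, NULL, len, SPLICE_F_MORE);
}

/*
 * Push len bytes through both paths and return the number of bytes the
 * peer received in total, or -1 on error.
 */
static ssize_t demo_both_paths(const char *data, size_t len)
{
	int sv[2], p[2];
	char rx[256];
	size_t total = 0;
	ssize_t ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (pipe(p) < 0)
		goto out_sock;

	if (send_with_more(sv[0], data, len) != (ssize_t)len)
		goto out;
	if (write(p[1], data, len) != (ssize_t)len)
		goto out;
	if (splice_with_more(p[0], sv[0], len) != (ssize_t)len)
		goto out;

	/* Drain the peer until both copies of the payload have arrived. */
	while (total < 2 * len) {
		ssize_t n = recv(sv[1], rx, sizeof(rx), 0);

		if (n <= 0)
			goto out;
		total += (size_t)n;
	}
	ret = (ssize_t)total;
out:
	close(p[0]);
	close(p[1]);
out_sock:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Both calls carry the same "more data is coming" hint; the difference is
that splice() hands pages over from the pipe instead of copying from a
user buffer, which is the path that shows the pauses in the traces above.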


Please note that the Myri card is running with the normal MTU of 1500,
jumbo frames were 

Re: [5/6] ACPI / PM: Move device power management functions to device_pm.c

2013-01-05 Thread Rafael J. Wysocki
On Saturday, January 05, 2013 10:43:31 AM Sedat Dilek wrote:
> Just a small typo in the comments:
> ...
> +#ifdef CONFIG_PM
> ...
> +#else /* !CONFIG_PM */
> ...
> +#endif /* !CONFIG_PM */ <--- /* CONFIG_PM (without "!") */

This actually isn't a typo.  It means that the block started by #else ends
here.
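
A minimal sketch of the comment convention described above, assuming a
made-up helper named demo_pm_enabled() (it is not part of the patch under
discussion). The comment on #endif repeats the condition of the branch
that is being closed, which here is the #else branch:

```c
/* CONFIG_PM normally comes from Kconfig; it is defined by hand here
 * only so that the sketch compiles on its own. */
#define CONFIG_PM 1

#ifdef CONFIG_PM
/* Hypothetical helper: variant built when power management is enabled. */
static int demo_pm_enabled(void)
{
	return 1;
}
#else /* !CONFIG_PM */
/* Hypothetical helper: stub built when power management is disabled. */
static int demo_pm_enabled(void)
{
	return 0;
}
#endif /* !CONFIG_PM */
```

Calling demo_pm_enabled() from a test program shows which branch was
compiled in; removing the #define selects the stub instead.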

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.8-rc[12] cpufreq build errors...

2013-01-05 Thread Rafael J. Wysocki
On Saturday, January 05, 2013 09:39:41 AM Woody Suwalski wrote:
> Rafael, in 3.8 kernel part of the common logic has been moved to a 
> separate cpufreq_governor.c file.
> According to the Makefile, the cpufreq_governor.o is to be linked to 
> other cpufreq  modules.
> 
> However I see that a separate malformed cpufreq_governor.ko is created, 
> and then the real modules can not work without the common logic either.
> 
> The build config is a simple 32-bit config, on top of vanilla source.
> 
> I submit a verbose build log with "ls *.ko" and some "modinfo" outputs 
> attached to the bottom.
> Clearly the build subsystem is misinterpreting the build intentions.
> 
> Could you please check if you see the same issue in your builds?
> I am using a 32-bit October-ish Debian Testing as a build machine (Eeepc 
> netbook).

This issue has already been reported and there's a fix scheduled for
inclusion in the master branch of the linux-pm.git tree.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: [3.6.9 -> 3.7.1 regression] sound: snd_hda_intel codec probing issue?

2013-01-05 Thread Vincent Blut
Le jeudi 03 janvier 2013 à 10:19 +0100, Takashi Iwai a écrit :
> At Fri, 28 Dec 2012 15:25:40 +0100,
> Vincent Blut wrote:
> > 
> > Hi,
> > 
> > Since I updated to Linux 3.7.1, listening to some audio/video bits
> > frequently causes the following:
> > 
> > [ 7896.166946] hda-intel: azx_get_response timeout, switching to polling
> > mode: last cmd=0x020c
> > [ 7897.173444] hda-intel: No response from codec, disabling MSI: last
> > cmd=0x020c
> > [ 7898.179932] hda_intel: azx_get_response timeout, switching to
> > single_cmd mode: last cmd=0x020c
> > [ 7898.179983] hda-codec: out of range cmd 0:0:20:400:f7ff
> > [ 9445.034371] plugin-containe[5873]: segfault at 7f44bb95e639 ip
> > 7f44e454bca0 sp 7f44c91165f8 error 4 in
> > libc-2.13.so[7f44e442c000+18]
> > 
> > It seems to be a codec probing failure (?). This is really fatal because
> > the sound becomes very choppy and can't recover until I reboot.
> > I'll try to play with 'probe_mask' kernel parameter to see if I can
> > narrow the correct codec slots!
> > 
> > By the way I can't reproduce this on 3.6.9, so is there something that
> > changed in this area in 3.7.1?
> 
> If it's new in 3.7, this could be a regression by runtime D3.
> Try to pass power_save_controller=0 option to snd-hda-intel module
> (or change it via sysfs dynamically).
> 
> 
> thanks,
> 
> Takashi

Hi Takashi,

Well, power_save_controller=0 seems to do the trick but I get plenty of:


[   15.389270] pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x37 [snd_hda_intel] returns -11
[   25.178725] pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x37 [snd_hda_intel] returns -11
[   72.296536] pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x37 [snd_hda_intel] returns -11
[ 2318.147505] pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x37 [snd_hda_intel] returns -11
[ 6086.029839] pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x37 [snd_hda_intel] returns -11
[ 7390.772818] pci_pm_runtime_suspend(): azx_runtime_suspend+0x0/0x37 [snd_hda_intel] returns -11


which I think is fixed in 3.8 by commit 6eb827d23577

So what's the next step? Adding a quirk for this sound card? Or is there
a way to fix the root cause? 

Cheers,
Vincent



Re: [PATCH 00/18] AB8500 battery management series upgrade

2013-01-05 Thread Anton Vorontsov
On Thu, Dec 13, 2012 at 03:21:23PM +, Lee Jones wrote:
> Please find the next instalment of the AB8500 Power drivers upgrade.
> A lot of work has taken place on the internal development track, but
> little effort has gone into mainlining it. At last count there were
> around 70+ patches which are in need of forward-porting, then
> upstreaming. This patch-set aims to make a good start. :)

Lee,

The series seems to be part of a larger 57-patch series that I reviewed
in late September (it took me 3 days to review it back then :).

So, today I took a look at the first few patches, and I see that none of
my comments were addressed. I guess there is some internal
miscommunication between the teams, so maybe you didn't know about the
previous effort of upstreaming the fixes.

So, please take a look at these comments:

http://lkml.org/lkml/2012/9/25/587

Thanks!
Anton


Re: 3.8-rc[12] cpufreq build errors...

2013-01-05 Thread Larry Finger

Woody,

There is a patch pending that fixes this problem. See
http://lkml.indiana.edu/hypermail/linux/kernel/1212.3/01201.html. Note that 
Rafael wrote "If you don't mind, I'll rename CONFIG_CPU_FREQ_GOVERNOR to
CONFIG_CPU_FREQ_GOV_COMMON when applying it, though". I think this patch is 
working its way through the system, but I have not seen it in mainline yet.


Larry


Re: mmotm 2013-01-04-15-43 uploaded (aio)

2013-01-05 Thread Randy Dunlap
On 01/04/13 15:44, a...@linux-foundation.org wrote:
> The mm-of-the-moment snapshot 2013-01-04-15-43 has been uploaded to
> 
>http://www.ozlabs.org/~akpm/mmotm/
> 
> mmotm-readme.txt says
> 
> README for mm-of-the-moment:
> 
> http://www.ozlabs.org/~akpm/mmotm/
> 
> This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
> more than once a week.
> 


A few hundred of these warnings:

include/linux/aio.h:102:43: warning: 'return' with a value, in function returning void [enabled by default]

and these errors:

fs/aio.c:697:2: error: dereferencing pointer to incomplete type
fs/aio.c:697:2: error: dereferencing pointer to incomplete type
fs/aio.c:697:2: error: dereferencing pointer to incomplete type
fs/aio.c:707:21: error: dereferencing pointer to incomplete type
fs/aio.c:808:30: error: dereferencing pointer to incomplete type
fs/aio.c:824:40: error: dereferencing pointer to incomplete type
fs/aio.c:826:25: error: storage size of 'batch_stack' isn't known

when CONFIG_BLOCK is not enabled.

-- 
~Randy


Re: [PATCH V2 3/3] tuntap: don't add to waitqueue when POLLERR

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:34 +0800, Jason Wang wrote:
> Currently, tun_chr_poll() returns POLLERR after waitqueue adding during device
> unregistration. This would confuse some of its users, such as vhost, which assume
> that when POLLERR is returned, it wasn't added to the waitqueue. Fix this by
> returning POLLERR before adding to waitqueue.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/tun.c |5 +
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index fbd106e..f9c0049 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -886,7 +886,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>   struct sock *sk;
>   unsigned int mask = 0;
>  
> - if (!tun)
> + if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
>   return POLLERR;
>  
>   sk = tfile->socket.sk;
> @@ -903,9 +903,6 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>sock_writeable(sk)))
>   mask |= POLLOUT | POLLWRNORM;
>  
> - if (tun->dev->reg_state != NETREG_REGISTERED)
> - mask = POLLERR;
> -
>   tun_put(tun);
>   return mask;
>  }

This patch is buggy.

First, the caller assuming POLLERR means poll_wait() was not called is
wrong.

Secondly, you add a ref leak.
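
The second point can be illustrated with a small user-space model. This
is only a sketch: the demo_* names and DEMO_* constants are invented for
illustration, and it assumes that the tun pointer in tun_chr_poll() is
obtained by a getter that takes a reference (the getter is not visible in
the quoted hunks, only the tun_put() before "return mask" is).

```c
#include <stddef.h>

/* Hypothetical stand-in for the device; not the real kernel type. */
struct demo_tun {
	int refcnt;     /* models the reference taken by the getter */
	int registered; /* models reg_state == NETREG_REGISTERED */
};

#define DEMO_POLLERR 0x008u
#define DEMO_POLLOUT 0x004u

/* Takes a reference if the device exists. */
static struct demo_tun *demo_tun_get(struct demo_tun *tun)
{
	if (tun)
		tun->refcnt++;
	return tun;
}

/* Drops a reference. */
static void demo_tun_put(struct demo_tun *tun)
{
	tun->refcnt--;
}

/* Shape of the patched function: the combined early return skips the put. */
static unsigned int demo_poll_leaky(struct demo_tun *dev)
{
	struct demo_tun *tun = demo_tun_get(dev);

	if (!tun || !tun->registered)
		return DEMO_POLLERR; /* reference leaked when tun != NULL */

	demo_tun_put(tun);
	return DEMO_POLLOUT;
}

/* One possible shape that keeps get/put balanced on the error path. */
static unsigned int demo_poll_balanced(struct demo_tun *dev)
{
	struct demo_tun *tun = demo_tun_get(dev);

	if (!tun)
		return DEMO_POLLERR;
	if (!tun->registered) {
		demo_tun_put(tun);
		return DEMO_POLLERR;
	}
	demo_tun_put(tun);
	return DEMO_POLLOUT;
}
```

Calling demo_poll_leaky() on an unregistered device leaves refcnt raised
by one, while demo_poll_balanced() returns it to its starting value.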





Re: [PATCH] drivers/staging/speakup: avoid out-of-range access

2013-01-05 Thread Samuel Thibault
Indeed. The same happens in synth_add, so Greg please use this instead:


Check that array index is in-bounds before accessing the synths[] array.

Signed-off-by: Nickolai Zeldovich 
Signed-off-by: Samuel Thibault 

---
 drivers/staging/speakup/synth.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/speakup/synth.c b/drivers/staging/speakup/synth.c
index df95337..b91d22b 100644
--- a/drivers/staging/speakup/synth.c
+++ b/drivers/staging/speakup/synth.c
@@ -342,7 +342,7 @@ int synth_init(char *synth_name)
 
mutex_lock(_mutex);
/* First, check if we already have it loaded. */
-   for (i = 0; synths[i] != NULL && i < MAXSYNTHS; i++)
+   for (i = 0; i < MAXSYNTHS && synths[i] != NULL; i++)
if (strcmp(synths[i]->name, synth_name) == 0)
synth = synths[i];
 
@@ -423,7 +423,7 @@ int synth_add(struct spk_synth *in_synth
int i;
int status = 0;
mutex_lock(_mutex);
-   for (i = 0; synths[i] != NULL && i < MAXSYNTHS; i++)
+   for (i = 0; i < MAXSYNTHS && synths[i] != NULL; i++)
/* synth_remove() is responsible for rotating the array down */
if (in_synth == synths[i]) {
mutex_unlock(_mutex);
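
The reordering in the two loops above relies on C's left-to-right
short-circuit evaluation of &&. A stand-alone sketch of the same pattern,
with a hypothetical table size DEMO_MAXSYNTHS and helper name that are
not taken from the driver:

```c
#include <stddef.h>

#define DEMO_MAXSYNTHS 4 /* hypothetical size, stands in for MAXSYNTHS */

/* Count the leading non-NULL entries of a fixed-size table. */
static int demo_count_entries(const char *const table[DEMO_MAXSYNTHS])
{
	int i;

	/*
	 * The bound is tested first. When i == DEMO_MAXSYNTHS the left
	 * operand is false, so table[DEMO_MAXSYNTHS] is never read.
	 * With the operands swapped, a completely full table would be
	 * read one element past its end.
	 */
	for (i = 0; i < DEMO_MAXSYNTHS && table[i] != NULL; i++)
		;
	return i;
}
```

With a full table the loop stops because of the bound check alone, which
is exactly the case the original operand order got wrong.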


Re: [PATCH] nfs: avoid dereferencing null pointer in initiate_bulk_draining

2013-01-05 Thread Myklebust, Trond
On Sat, 2013-01-05 at 14:19 -0500, Nickolai Zeldovich wrote:
> Fix an inverted null pointer check in initiate_bulk_draining().
> 
> Signed-off-by: Nickolai Zeldovich 
> ---
>  fs/nfs/callback_proc.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
> index c89b26b..264d1aa 100644
> --- a/fs/nfs/callback_proc.c
> +++ b/fs/nfs/callback_proc.c
> @@ -206,7 +206,7 @@ static u32 initiate_bulk_draining(struct nfs_client *clp,
>  
>   list_for_each_entry(lo, >layouts, plh_layouts) {
>   ino = igrab(lo->plh_inode);
> - if (ino)
> + if (!ino)
>   continue;
>   spin_lock(&ino->i_lock);
>   /* Is this layout in the process of being freed? */

Thanks for spotting. Applied to the 'bugfixes' branch.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


[RESEND][PATCH v3] mm: Use aligned zone start for pfn_to_bitidx calculation

2013-01-05 Thread Laura Abbott
The current calculation in pfn_to_bitidx assumes that
(pfn - zone->zone_start_pfn) >> pageblock_order will return the
same bit for all pfn in a pageblock. If zone_start_pfn is not
aligned to pageblock_nr_pages, this may not always be correct.

Consider the following with pageblock order = 10, zone start 2MB:

pfn | pfn - zone start | (pfn - zone start) >> page block order

0x26000 | 0x25e00  |  0x97
0x26100 | 0x25f00  |  0x97
0x26200 | 0x26000  |  0x98
0x26300 | 0x26100  |  0x98

This means that calling {get,set}_pageblock_migratetype on a single
page will not set the migratetype for the full block. Fix this by
rounding down zone_start_pfn when doing the bitidx calculation.

Signed-off-by: Laura Abbott 
Acked-by: Mel Gorman 
---
 mm/page_alloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92dd060..b6a2510 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5422,7 +5422,7 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
pfn &= (PAGES_PER_SECTION-1);
return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #else
-   pfn = pfn - zone->zone_start_pfn;
+   pfn = pfn - round_down(zone->zone_start_pfn, pageblock_nr_pages);
return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #endif /* CONFIG_SPARSEMEM */
 }
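
The table in the changelog can be reproduced with a few lines of
user-space C. This is a sketch under stated assumptions: the demo_* names
are invented, a page size of 4KB is assumed so that a 2MB zone start
corresponds to pfn 0x200, and the NR_PAGEBLOCK_BITS multiplier is left
out because it does not affect which block a pfn lands in.

```c
#define DEMO_PAGEBLOCK_ORDER    10UL
#define DEMO_PAGEBLOCK_NR_PAGES (1UL << DEMO_PAGEBLOCK_ORDER)

/* Round x down to a multiple of y, where y is a power of two. */
static unsigned long demo_round_down(unsigned long x, unsigned long y)
{
	return x & ~(y - 1);
}

/* Block index as computed before the patch. */
static unsigned long demo_block_old(unsigned long pfn, unsigned long zone_start)
{
	return (pfn - zone_start) >> DEMO_PAGEBLOCK_ORDER;
}

/* Block index with the zone start rounded down to a pageblock boundary. */
static unsigned long demo_block_new(unsigned long pfn, unsigned long zone_start)
{
	unsigned long base = demo_round_down(zone_start, DEMO_PAGEBLOCK_NR_PAGES);

	return (pfn - base) >> DEMO_PAGEBLOCK_ORDER;
}
```

With zone_start = 0x200, the old calculation maps pfn 0x26000 to block
0x97 and pfn 0x26200 to block 0x98 although both lie in the same
pageblock; the new calculation maps every pfn from 0x26000 to 0x263ff to
block 0x98.
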
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation



Re: [PATCH v4 07/18] perf: add generic memory sampling interface

2013-01-05 Thread Jiri Olsa
On Thu, Dec 20, 2012 at 04:41:37PM +0100, Stephane Eranian wrote:
> This patch adds PERF_SAMPLE_COST and PERF_SAMPLE_DSRC.
> The first collects a cost associated with the sampled
> event. In case of memory access, the cost would be
> the latency of the load, otherwise it defaults to
> the sampling period.
> 
> PERF_SAMPLE_DSRC collects the data source, i.e., where
> did the data associated with the sampled instruction
> come from. Information is stored in a perf_mem_dsrc
> structure. It contains opcode, mem level, tlb, snoop,
> lock information, subject to availability in hardware.
> 
> Signed-off-by: Stephane Eranian 

SNIP

> +
> +/* type of opcode (load/store/prefetch,code) */
> +#define PERF_MEM_OP_NA         0x01 /* not available */
> +#define PERF_MEM_OP_LOAD       0x02 /* load instruction */
> +#define PERF_MEM_OP_STORE      0x04 /* store instruction */
> +#define PERF_MEM_OP_PFETCH     0x08 /* prefetch */
> +#define PERF_MEM_OP_EXEC       0x10 /* code (execution) */
> +#define PERF_MEM_OP_SHIFT      0
> +
> +/* memory hierarchy (memory level, hit or miss) */
> +#define PERF_MEM_LVL_NA        0x01  /* not available */
> +#define PERF_MEM_LVL_HIT       0x02  /* hit level */
> +#define PERF_MEM_LVL_MISS      0x04  /* miss level  */
> +#define PERF_MEM_LVL_L1        0x08  /* L1 */
> +#define PERF_MEM_LVL_LFB       0x10  /* Line Fill Buffer */
> +#define PERF_MEM_LVL_L2        0x20  /* L2 hit */
> +#define PERF_MEM_LVL_L3        0x40  /* L3 hit */
> +#define PERF_MEM_LVL_LOC_RAM   0x80  /* Local DRAM */
> +#define PERF_MEM_LVL_REM_RAM1  0x100 /* Remote DRAM (1 hop) */
> +#define PERF_MEM_LVL_REM_RAM2  0x200 /* Remote DRAM (2 hops) */
> +#define PERF_MEM_LVL_REM_CCE1  0x400 /* Remote Cache (1 hop) */
> +#define PERF_MEM_LVL_REM_CCE2  0x800 /* Remote Cache (2 hops) */
> +#define PERF_MEM_LVL_IO        0x1000 /* I/O memory */
> +#define PERF_MEM_LVL_UNC       0x2000 /* Uncached memory */
> +#define PERF_MEM_LVL_SHIFT     5
> +
> +/* snoop mode */
> +#define PERF_MEM_SNOOP_NA      0x01 /* not available */
> +#define PERF_MEM_SNOOP_NONE    0x02 /* no snoop */
> +#define PERF_MEM_SNOOP_HIT     0x04 /* snoop hit */
> +#define PERF_MEM_SNOOP_MISS    0x08 /* snoop miss */
> +#define PERF_MEM_SNOOP_HITM    0x10 /* snoop hit modified */
> +#define PERF_MEM_SNOOP_SHIFT   19
> +
> +/* locked instruction */
> +#define PERF_MEM_LOCK_NA       0x01 /* not available */
> +#define PERF_MEM_LOCK_LOCKED   0x02 /* locked transaction */
> +#define PERF_MEM_LOCK_SHIFT    24
> +
> +/* TLB access */
> +#define PERF_MEM_TLB_NA        0x01 /* not available */
> +#define PERF_MEM_TLB_HIT       0x02 /* hit level */
> +#define PERF_MEM_TLB_MISS      0x04 /* miss level */
> +#define PERF_MEM_TLB_L1        0x08 /* L1 */
> +#define PERF_MEM_TLB_L2        0x10 /* L2 */
> +#define PERF_MEM_TLB_WK        0x20 /* Hardware Walker*/
> +#define PERF_MEM_TLB_OS        0x40 /* OS fault handler */
> +#define PERF_MEM_TLB_SHIFT     26
> +
> +#define PERF_MEM_S(a, s) \
> + (((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

Why not use enums for this?
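
For reference, a user-space sketch of how the quoted PERF_MEM_S() macro
builds one packed value from the per-field flags. Only the constants
needed for the example are copied from the quoted patch, with
PERF_MEM_OP_SHIFT taken as 0; u64 is replaced by uint64_t, and
demo_l1_hit() is a made-up name.

```c
#include <stdint.h>

/* Subset of the constants from the quoted patch. */
#define PERF_MEM_OP_LOAD     0x02
#define PERF_MEM_OP_SHIFT    0
#define PERF_MEM_LVL_HIT     0x02
#define PERF_MEM_LVL_L1      0x08
#define PERF_MEM_LVL_SHIFT   5
#define PERF_MEM_SNOOP_NONE  0x02
#define PERF_MEM_SNOOP_SHIFT 19

/* Token pasting picks both the flag and the shift of its field. */
#define PERF_MEM_S(a, s) \
	(((uint64_t)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

/* Hypothetical helper: encode "load, L1 hit, no snoop". */
static uint64_t demo_l1_hit(void)
{
	return PERF_MEM_S(OP, LOAD) |   /* 0x02 << 0  */
	       PERF_MEM_S(LVL, HIT) |   /* 0x02 << 5  */
	       PERF_MEM_S(LVL, L1) |    /* 0x08 << 5  */
	       PERF_MEM_S(SNOOP, NONE); /* 0x02 << 19 */
}
```

Because the macro pastes the field name into both the flag and the shift
identifier, an enum-based variant would still need the same naming
pattern for the paste to resolve.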

jirka


Re: [PATCH v4 08/18] perf/x86: add memory profiling via PEBS Load Latency

2013-01-05 Thread Jiri Olsa
On Thu, Dec 20, 2012 at 04:41:38PM +0100, Stephane Eranian wrote:
> This patch adds support for memory profiling using the
> PEBS Load Latency facility.
> 
> Load accesses are sampled by HW and the instruction
> address, data address, load latency, data source, tlb,
> locked information can be saved in the sampling buffer
> if using the PERF_SAMPLE_COST (for latency),

PERF_SAMPLE_WEIGHT ?

> PERF_SAMPLE_ADDR, PERF_SAMPLE_DSRC types.
> 
> To enable PEBS Load Latency, users have to use the
> model specific event:
> - on NHM/WSM: MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD
> - on SNB/IVB: MEM_TRANS_RETIRED:LATENCY_ABOVE_THRESHOLD
> 
> To make things easier, this patch also exports a generic
> alias via sysfs: mem-loads. It exports the right event
> encoding based on the host CPU and can be used directly
> by the perf tool.
> 
> Loosely based on Intel's Lin Ming patch posted on LKML
> in July 2011.
> 
> Signed-off-by: Stephane Eranian 

SNIP

> +/*
> + * Map PEBS Load Latency Data Source encodings to generic
> + * memory data source information
> + */
> +#define P(a, b) PERF_MEM_S(a, b)
> +#define OP_LH (P(OP, LOAD) | P(LVL, HIT))
> +#define SNOOP_NONE_MISS (P(SNOOP, NONE) | P(SNOOP, MISS))
> +

I checked Intel SDM 'Table 18-13. Data Source Encoding for Load Latency Record'
and it seems to differ (see below) at some points. Did you use another
source?

> +static const u64 pebs_data_source[] = {
> + P(OP, LOAD) | P(LVL, MISS) | P(LVL, L3) | P(SNOOP, NA),/* 0x00:ukn L3 */
> + OP_LH | P(LVL, L1) | P(SNOOP, NONE),/* 0x01: L1 local */
> + OP_LH | P(LVL, LFB)| P(SNOOP, NONE),/* 0x02: LFB hit */
> + OP_LH | P(LVL, L2) | P(SNOOP, NONE),/* 0x03: L2 hit */
> + OP_LH | P(LVL, L3) | P(SNOOP, NONE),/* 0x04: L3 hit */
> + OP_LH | P(LVL, L3) | P(SNOOP, MISS),/* 0x05: L3 hit, snoop miss */
> + OP_LH | P(LVL, L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */

0x6:
L3 HIT. Local or Remote home requests that hit the L3 cache and was serviced by
another processor core with a cross core snoop where modified copies were found.
(HITM).


> + OP_LH | P(LVL, L3) | P(SNOOP, HITM),/* 0x07: L3 hit, snoop hitm */

0x7:
Reserved

> + OP_LH | P(LVL, REM_CCE1) | P(SNOOP, HIT),  /* 0x08: L3 miss snoop hit */
> + OP_LH | P(LVL, REM_CCE1) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/

0x9:
Reserved

> + OP_LH | P(LVL, LOC_RAM)  | P(SNOOP, HIT),  /* 0x0a: L3 miss, shared */
> + OP_LH | P(LVL, REM_RAM1) | P(SNOOP, HIT),  /* 0x0b: L3 miss, shared */
> + OP_LH | P(LVL, LOC_RAM)  | SNOOP_NONE_MISS,/* 0x0c: L3 miss, excl */
> + OP_LH | P(LVL, REM_RAM1) | SNOOP_NONE_MISS,/* 0x0d: L3 miss, excl */
> + OP_LH | P(LVL, IO) | P(SNOOP, NONE), /* 0x0e: I/O */
> + OP_LH | P(LVL,UNC) | P(SNOOP, NONE), /* 0x0f: uncached */
> +};

thanks,
jirka


Re: [PATCH v4 07/18] perf: add generic memory sampling interface

2013-01-05 Thread Jiri Olsa
On Thu, Dec 20, 2012 at 04:41:37PM +0100, Stephane Eranian wrote:
> This patch adds PERF_SAMPLE_COST and PERF_SAMPLE_DSRC.

I guess PERF_SAMPLE_COST was replaced by PERF_SAMPLE_WEIGHT added by Andi

jirka

> The first collects a cost associated with the sampled
> event. In case of memory access, the cost would be
> the latency of the load, otherwise it defaults to
> the sampling period.
> 
> PERF_SAMPLE_DSRC collects the data source, i.e., where
> did the data associated with the sampled instruction
> come from. Information is stored in a perf_mem_dsrc
> structure. It contains opcode, mem level, tlb, snoop,
> lock information, subject to availability in hardware.
> 
> Signed-off-by: Stephane Eranian 


Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs

2013-01-05 Thread Frederic Weisbecker
2013/1/5 Paul E. McKenney :
> On Sat, Jan 05, 2013 at 06:21:01PM +0100, Frederic Weisbecker wrote:
>> Hi Paul,
>>
>> 2013/1/5 Paul E. McKenney :
>> > From: Paul Gortmaker 
>> >
>> > The wait_event() at the head of the rcu_nocb_kthread() can result in
>> > soft-lockup complaints if the CPU in question does not register RCU
>> > callbacks for an extended period.  This commit therefore changes
>> > the wait_event() to a wait_event_interruptible().
>> >
>> > Reported-by: Frederic Weisbecker 
>> > Signed-off-by: Paul Gortmaker 
>> > Signed-off-by: Paul E. McKenney 
>> > ---
>> >  kernel/rcutree_plugin.h |3 ++-
>> >  1 files changed, 2 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>> > index f6e5ec2..43dba2d 100644
>> > --- a/kernel/rcutree_plugin.h
>> > +++ b/kernel/rcutree_plugin.h
>> > @@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
>> > for (;;) {
>> > /* If not polling, wait for next batch of callbacks. */
>> > if (!rcu_nocb_poll)
>> > -   wait_event(rdp->nocb_wq, rdp->nocb_head);
>> > +   wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
>> > list = ACCESS_ONCE(rdp->nocb_head);
>> > if (!list) {
>> > schedule_timeout_interruptible(1);
>> > +   flush_signals(current);
>>
>> Why is that needed?
>
> To satisfy my paranoia.  ;-)  And in case someone ever figures out some
> way to send a signal to a kthread.

Ok. I don't want to cause any insomnia to anyone, so I won't insist ;)


[PATCH 2/2] signals: set_current_blocked() can use __set_current_blocked()

2013-01-05 Thread Oleg Nesterov
Cleanup. And I think we need more cleanups, in particular
__set_current_blocked() and sigprocmask() should die. Nobody
should ever block SIGKILL or SIGSTOP.

- Change set_current_blocked() to use __set_current_blocked()

- Change sys_sigprocmask() to use set_current_blocked(), this
  way it should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov 
---
 kernel/signal.c |8 ++--
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 9692499..372771e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2528,11 +2528,8 @@ static void __set_task_blocked(struct task_struct *tsk, const sigset_t *newset)
  */
 void set_current_blocked(sigset_t *newset)
 {
-   struct task_struct *tsk = current;
sigdelsetmask(newset, sigmask(SIGKILL) | sigmask(SIGSTOP));
-   spin_lock_irq(&tsk->sighand->siglock);
-   __set_task_blocked(tsk, newset);
-   spin_unlock_irq(&tsk->sighand->siglock);
+   __set_current_blocked(newset);
 }
 
 void __set_current_blocked(const sigset_t *newset)
@@ -3204,7 +3201,6 @@ SYSCALL_DEFINE3(sigprocmask, int, how, old_sigset_t __user *, nset,
if (nset) {
if (copy_from_user(&new_set, nset, sizeof(*nset)))
return -EFAULT;
-   new_set &= ~(sigmask(SIGKILL) | sigmask(SIGSTOP));
 
new_blocked = current->blocked;
 
@@ -3222,7 +3218,7 @@ SYSCALL_DEFINE3(sigprocmask, int, how, old_sigset_t __user *, nset,
return -EINVAL;
}
 
-   __set_current_blocked(&new_blocked);
+   set_current_blocked(&new_blocked);
}
 
if (oset) {
-- 
1.5.5.1
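
Seen from user space, the rule that SIGKILL and SIGSTOP can never be
blocked looks roughly like the sketch below. The function name
demo_try_block() is invented, and the C library decides which syscall is
actually used, so this shows the observable behaviour that POSIX
specifies rather than the exact kernel path touched by this patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Ask to block SIGKILL, SIGSTOP and SIGUSR1, then read the mask back.
 * Returns a bitmask: bit 0 set if SIGKILL is blocked, bit 1 if SIGSTOP
 * is blocked, bit 2 if SIGUSR1 is blocked; returns -1 on error.
 */
static int demo_try_block(void)
{
	sigset_t want, now;
	int result = 0;

	sigemptyset(&want);
	sigaddset(&want, SIGKILL);
	sigaddset(&want, SIGSTOP);
	sigaddset(&want, SIGUSR1);

	/* The request succeeds; the unblockable signals are silently dropped. */
	if (sigprocmask(SIG_BLOCK, &want, NULL) != 0)
		return -1;
	/* A NULL new set only queries the current mask. */
	if (sigprocmask(SIG_BLOCK, NULL, &now) != 0)
		return -1;

	if (sigismember(&now, SIGKILL) == 1)
		result |= 1;
	if (sigismember(&now, SIGSTOP) == 1)
		result |= 2;
	if (sigismember(&now, SIGUSR1) == 1)
		result |= 4;
	return result;
}
```

Only the SIGUSR1 bit is expected to be set in the result, since the
request for the other two is filtered out before the mask is installed.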




[PATCH 0/2] Was: ssetmask/sgetmask syscalls

2013-01-05 Thread Oleg Nesterov
On 01/05, CAI Qian wrote:
>
> FYI, I noticed that ssetmask/sgetmask syscalls starting to
> fail
>
> ssetmask011  TFAIL  :  sgetmask() failed: TEST_ERRNO=???(0): Success

Thanks!

Should be fixed by 1/2. Probably trivial enough for 3.8

2/2 is a minor "while at it" cleanup.

Oleg.



[PATCH 1/2] signals: sys_ssetmask() uses uninitialized newmask

2013-01-05 Thread Oleg Nesterov
77097ae5 "most of set_current_blocked() callers want SIGKILL/SIGSTOP
removed from set" removed the initialization of newmask by accident,
restore.

Reported-by: CAI Qian 
Signed-off-by: Oleg Nesterov 
Cc: sta...@kernel.org   # v3.5+
---
 kernel/signal.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 7aaa51d..9692499 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3286,6 +3286,7 @@ SYSCALL_DEFINE1(ssetmask, int, newmask)
int old = current->blocked.sig[0];
sigset_t newset;
 
+   siginitset(&newset, newmask);
set_current_blocked(&newset);
 
return old;
-- 
1.5.5.1




Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs

2013-01-05 Thread Paul E. McKenney
On Sat, Jan 05, 2013 at 06:21:01PM +0100, Frederic Weisbecker wrote:
> Hi Paul,
> 
> 2013/1/5 Paul E. McKenney :
> > From: Paul Gortmaker 
> >
> > The wait_event() at the head of the rcu_nocb_kthread() can result in
> > soft-lockup complaints if the CPU in question does not register RCU
> > callbacks for an extended period.  This commit therefore changes
> > the wait_event() to a wait_event_interruptible().
> >
> > Reported-by: Frederic Weisbecker 
> > Signed-off-by: Paul Gortmaker 
> > Signed-off-by: Paul E. McKenney 
> > ---
> >  kernel/rcutree_plugin.h |3 ++-
> >  1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index f6e5ec2..43dba2d 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
> > for (;;) {
> > /* If not polling, wait for next batch of callbacks. */
> > if (!rcu_nocb_poll)
> > -   wait_event(rdp->nocb_wq, rdp->nocb_head);
> > +   wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
> > list = ACCESS_ONCE(rdp->nocb_head);
> > if (!list) {
> > schedule_timeout_interruptible(1);
> > +   flush_signals(current);
> 
> Why is that needed?

To satisfy my paranoia.  ;-)  And in case someone ever figures out some
way to send a signal to a kthread.

Thanx, Paul



[PATCH tip/core/rcu 04/14] rcu: Provide compile-time control for no-CBs CPUs

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Currently, the only way to specify no-CBs CPUs is via the rcu_nocbs
kernel command-line parameter.  This is inconvenient in some cases,
particularly for randconfig testing, so this commit adds a new
RCU_NOCB_CPU_DEFAULT kernel configuration parameter.  Setting this
new parameter to zero (the default) retains the old behavior, setting
it to one offloads callback processing from CPU 0 (along with any
other CPUs specified by the rcu_nocbs boot-time parameter), and setting
it to two offloads callback processing from all CPUs.

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
---
 init/Kconfig|   13 +
 kernel/rcutree_plugin.h |   13 +
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index fc6a3ca..35dcedb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -676,6 +676,19 @@ config RCU_NOCB_CPU
  Say Y here if you want to help to debug reduced OS jitter.
  Say N here if you are unsure.
 
+config RCU_NOCB_CPU_DEFAULT
+   int "Offload RCU callback processing from compile-selected CPUs"
+   depends on RCU_NOCB_CPU
+   range 0 2
+   default 0
+   help
+ Set this option to zero to only offload RCU callback processing
+ from those CPUs specified by the boot-time rcu_nocbs kernel
+ parameter.  Set it to one to offload processing from CPU 0
+ in addition to any CPUs specified at boot time.  Set it to
+ two to offload processing from all CPUs, regardless of the
+ setting of the boot-time rcu_nocbs kernel parameter.
+
 endmenu # "RCU Subsystem"
 
 config IKCONFIG
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 37750bc..eb9b473 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -86,6 +86,19 @@ static void __init rcu_bootup_announce_oddness(void)
if (nr_cpu_ids != NR_CPUS)
printk(KERN_INFO "\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids);
 #ifdef CONFIG_RCU_NOCB_CPU
+#if CONFIG_RCU_NOCB_CPU_DEFAULT != 0
+   if (!have_rcu_nocb_mask) {
+   alloc_bootmem_cpumask_var(&rcu_nocb_mask);
+   have_rcu_nocb_mask = true;
+   }
+#if CONFIG_RCU_NOCB_CPU_DEFAULT == 1
+   pr_info("\tExperimental no-CBs CPU 0\n");
+   cpumask_set_cpu(0, rcu_nocb_mask);
+#else /* #if CONFIG_RCU_NOCB_CPU_DEFAULT == 1 */
+   pr_info("\tExperimental no-CBs for all CPUs\n");
+   cpumask_setall(rcu_nocb_mask);
+#endif /* #else #if CONFIG_RCU_NOCB_CPU_DEFAULT == 1 */
+#endif /* #if CONFIG_RCU_NOCB_CPU_DEFAULT != 0 */
if (have_rcu_nocb_mask) {
cpulist_scnprintf(nocb_buf, sizeof(nocb_buf), rcu_nocb_mask);
pr_info("\tExperimental no-CBs CPUs: %s.\n", nocb_buf);
-- 
1.7.8
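
The three settings described in the changelog can be modelled in user
space as follows. This is a rough sketch, not kernel code: the function
demo_nocb_mask() and the use of a 64-bit word with one bit per CPU are
assumptions made for illustration, whereas the patch operates on a
cpumask.

```c
#include <stdint.h>

/*
 * Hypothetical model of the resulting no-CBs set.
 * config_default: value of RCU_NOCB_CPU_DEFAULT (0, 1 or 2).
 * boot_mask:      CPUs named by the rcu_nocbs boot parameter.
 * nr_cpus:        number of CPUs in the system (at most 64 here).
 */
static uint64_t demo_nocb_mask(int config_default, uint64_t boot_mask,
			       unsigned int nr_cpus)
{
	uint64_t all = nr_cpus >= 64 ? UINT64_MAX
				     : (UINT64_C(1) << nr_cpus) - 1;

	switch (config_default) {
	case 1:
		/* CPU 0 in addition to the boot-time selection. */
		return (boot_mask | UINT64_C(1)) & all;
	case 2:
		/* All CPUs, regardless of the boot-time selection. */
		return all;
	default:
		/* Zero: only the boot-time selection. */
		return boot_mask & all;
	}
}
```

For a four-CPU system booted with rcu_nocbs=1-2 (mask 0x6), the settings
0, 1 and 2 give 0x6, 0x7 and 0xf respectively in this model.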



Re: [PATCH] arm/davinci/musb: fix mispint introduced by commit 032ec49f5351e9cb242b1a1c367d14415043ab95

2013-01-05 Thread Sergei Shtylyov
On 12/22/2012 08:52 PM, Sergei Shtylyov wrote:

>> please respin this patch with a real commit log.

> And then, when referring to the commit ID that broke da8xx.c, don't forget to
> also specify the commit summary in parens (or however you like).

   Also, please s/davinci/da8xx/ in the subject. OMAP-L1x/DA8xx is not exactly
DaVinci.

>>> Signed-off-by: Mikhail Kshevetskiy 

WBR, Sergei



[PATCH tip/core/rcu 03/14] rcu: Remove restrictions on no-CBs CPUs

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Currently, CPU 0 is constrained to not be a no-CBs CPU, and furthermore
at least one no-CBs CPU must remain online at any given time.  These
restrictions are problematic in some situations, such as cases where
all CPUs must run a real-time workload that needs to be insulated from
OS jitter and latencies due to RCU callback invocation.  This commit
therefore provides no-CBs CPUs a way to start and to wait for grace
periods independently of the normal RCU callback mechanisms.  This
approach allows any or all of the CPUs to be designated as no-CBs CPUs,
and allows any proper subset of the CPUs (whether no-CBs CPUs or not)
to be offlined.

This commit also provides event tracing, as well as a fix for a locking
bug spotted by Xie ChanglongX.

Signed-off-by: Paul E. McKenney 
---
 include/trace/events/rcu.h |   55 +
 init/Kconfig   |4 +-
 kernel/rcutree.c   |   15 ++-
 kernel/rcutree.h   |   18 ++--
 kernel/rcutree_plugin.h|  271 +++-
 5 files changed, 244 insertions(+), 119 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 5678114..ef0bf31 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -72,6 +72,58 @@ TRACE_EVENT(rcu_grace_period,
 );
 
 /*
+ * Tracepoint for no-callbacks grace-period events.  The caller should
+ * pull the data from the rcu_node structure, other than rcuname, which
+ * comes from the rcu_state structure, and event, which is one of the
+ * following:
+ *
+ * "Startleaf": Request a nocb grace period based on leaf-node data.
+ * "Startedleaf": Leaf-node start proved sufficient.
+ * "Startedleafroot": Leaf-node start proved sufficient after checking root.
+ * "Startedroot": Requested a nocb grace period based on root-node data.
+ * "StartWait": Start waiting for the requested grace period.
+ * "ResumeWait": Resume waiting after signal.
+ * "EndWait": Complete wait.
+ * "Cleanup": Clean up rcu_node structure after previous GP.
+ * "CleanupMore": Clean up, and another no-CB GP is needed.
+ */
+TRACE_EVENT(rcu_nocb_grace_period,
+
+   TP_PROTO(char *rcuname, unsigned long gpnum, unsigned long completed,
+unsigned long c, u8 level, int grplo, int grphi,
+char *gpevent),
+
+   TP_ARGS(rcuname, gpnum, completed, c, level, grplo, grphi, gpevent),
+
+   TP_STRUCT__entry(
+   __field(char *, rcuname)
+   __field(unsigned long, gpnum)
+   __field(unsigned long, completed)
+   __field(unsigned long, c)
+   __field(u8, level)
+   __field(int, grplo)
+   __field(int, grphi)
+   __field(char *, gpevent)
+   ),
+
+   TP_fast_assign(
+   __entry->rcuname = rcuname;
+   __entry->gpnum = gpnum;
+   __entry->completed = completed;
+   __entry->c = c;
+   __entry->level = level;
+   __entry->grplo = grplo;
+   __entry->grphi = grphi;
+   __entry->gpevent = gpevent;
+   ),
+
+   TP_printk("%s %lu %lu %lu %u %d %d %s",
+ __entry->rcuname, __entry->gpnum, __entry->completed,
+ __entry->c, __entry->level, __entry->grplo, __entry->grphi,
+ __entry->gpevent)
+);
+
+/*
  * Tracepoint for grace-period-initialization events.  These are
  * distinguished by the type of RCU, the new grace-period number, the
  * rcu_node structure level, the starting and ending CPU covered by the
@@ -593,6 +645,9 @@ TRACE_EVENT(rcu_barrier,
 #define trace_rcu_grace_period(rcuname, gpnum, gpevent) do { } while (0)
 #define trace_rcu_grace_period_init(rcuname, gpnum, level, grplo, grphi, \
qsmask) do { } while (0)
+#define trace_rcu_nocb_grace_period(rcuname, gpnum, completed, c, \
+   level, grplo, grphi, event) \
+   do { } while (0)
 #define trace_rcu_preempt_task(rcuname, pid, gpnum) do { } while (0)
 #define trace_rcu_unlock_preempted_task(rcuname, gpnum, pid) do { } while (0)
 #define trace_rcu_quiescent_state_report(rcuname, gpnum, mask, qsmask, level, \
diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..fc6a3ca 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -655,7 +655,7 @@ config RCU_BOOST_DELAY
  Accept the default if unsure.
 
 config RCU_NOCB_CPU
-   bool "Offload RCU callback processing from boot-selected CPUs"
+   bool "Offload RCU callback processing from boot-selected CPUs (EXPERIMENTAL)"
depends on TREE_RCU || TREE_PREEMPT_RCU
default n
help
@@ -673,7 +673,7 @@ config RCU_NOCB_CPU
  callback, and (2) affinity or cgroups can be used to force
  the kthreads to run on whatever set of CPUs is desired.
 
- Say Y here if you want reduced OS jitter on selected CPUs.

[PATCH tip/core/rcu 1/1] Tiny RCU changes for 3.9

2013-01-05 Thread Paul E. McKenney
rcu: Provide RCU CPU stall warnings for tiny RCU

Tiny RCU has historically omitted RCU CPU stall warnings in order to
reduce memory requirements.  However, the lack of these warnings recently
caused Thomas Gleixner some debugging pain.  Therefore, this commit adds RCU
CPU stall warnings to tiny RCU if RCU_TRACE=y.  This keeps the memory
footprint small, while still enabling CPU stall warnings in kernels
built to enable them.

This is still a bit on the high-risk side, so running this will likely
be a debugging exercise.

Reported-by: Thomas Gleixner 
Signed-off-by: Paul E. McKenney 
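
The clamping done by rcu_jiffies_till_stall_check() in the patch below
can be sketched in userspace as follows.  This is an illustration only:
TOY_HZ and toy_jiffies_till_stall_check() are hypothetical names, and
the tick rate of 100 is an assumption, not something the patch specifies.

```c
#include <assert.h>

#define TOY_HZ 100	/* assumed tick rate, for illustration only */

/*
 * Clamp a stall timeout given in seconds to the range [3, 300], write
 * the clamped value back, and convert it to ticks plus an optional
 * extra delay (the patch adds such a delay when CONFIG_PROVE_RCU is set).
 */
static int toy_jiffies_till_stall_check(int *timeout_s, int extra_ticks)
{
	if (*timeout_s < 3)
		*timeout_s = 3;
	else if (*timeout_s > 300)
		*timeout_s = 300;
	return *timeout_s * TOY_HZ + extra_ticks;
}
```

Writing the clamped value back mirrors what the patch does, so that an
out-of-range setting is corrected once rather than on every check.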

diff --git a/kernel/rcu.h b/kernel/rcu.h
index 20dfba5..7ff057d 100644
--- a/kernel/rcu.h
+++ b/kernel/rcu.h
@@ -111,4 +111,11 @@ static inline bool __rcu_reclaim(char *rn, struct rcu_head *head)
 
 extern int rcu_expedited;
 
+#if defined(CONFIG_SMP) || defined(CONFIG_RCU_TRACE)
+
+extern int rcu_cpu_stall_suppress;
+int rcu_jiffies_till_stall_check(void);
+
+#endif /* defined(CONFIG_SMP) || defined(CONFIG_RCU_TRACE) */
+
 #endif /* __LINUX_RCU_H */
diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index a2cf761..06cec61 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -412,3 +412,54 @@ EXPORT_SYMBOL_GPL(do_trace_rcu_torture_read);
 #else
 #define do_trace_rcu_torture_read(rcutorturename, rhp) do { } while (0)
 #endif
+
+#if defined(CONFIG_SMP) || defined(CONFIG_RCU_TRACE)
+
+#ifdef CONFIG_PROVE_RCU
+#define RCU_STALL_DELAY_DELTA (5 * HZ)
+#else
+#define RCU_STALL_DELAY_DELTA 0
+#endif
+
+int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
+int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
+
+module_param(rcu_cpu_stall_suppress, int, 0644);
+module_param(rcu_cpu_stall_timeout, int, 0644);
+
+int rcu_jiffies_till_stall_check(void)
+{
+   int till_stall_check = ACCESS_ONCE(rcu_cpu_stall_timeout);
+
+   /*
+* Limit check must be consistent with the Kconfig limits
+* for CONFIG_RCU_CPU_STALL_TIMEOUT.
+*/
+   if (till_stall_check < 3) {
+   ACCESS_ONCE(rcu_cpu_stall_timeout) = 3;
+   till_stall_check = 3;
+   } else if (till_stall_check > 300) {
+   ACCESS_ONCE(rcu_cpu_stall_timeout) = 300;
+   till_stall_check = 300;
+   }
+   return till_stall_check * HZ + RCU_STALL_DELAY_DELTA;
+}
+
+static int rcu_panic(struct notifier_block *this, unsigned long ev, void *ptr)
+{
+   rcu_cpu_stall_suppress = 1;
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block rcu_panic_block = {
+   .notifier_call = rcu_panic,
+};
+
+static int __init check_cpu_stall_init(void)
+{
+   atomic_notifier_chain_register(&panic_notifier_list, &rcu_panic_block);
+   return 0;
+}
+early_initcall(check_cpu_stall_init);
+
+#endif /* defined(CONFIG_SMP) || defined(CONFIG_RCU_TRACE) */
diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
index e7dce58..b899df3 100644
--- a/kernel/rcutiny.c
+++ b/kernel/rcutiny.c
@@ -51,10 +51,10 @@ static void __call_rcu(struct rcu_head *head,
   void (*func)(struct rcu_head *rcu),
   struct rcu_ctrlblk *rcp);
 
-#include "rcutiny_plugin.h"
-
 static long long rcu_dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 
+#include "rcutiny_plugin.h"
+
 /* Common code for rcu_idle_enter() and rcu_irq_exit(), see kernel/rcutree.c. */
 static void rcu_idle_enter_common(long long newval)
 {
@@ -205,6 +205,7 @@ int rcu_is_cpu_rrupt_from_idle(void)
  */
 static int rcu_qsctr_help(struct rcu_ctrlblk *rcp)
 {
+   reset_cpu_stall_ticks(rcp);
if (rcp->rcucblist != NULL &&
rcp->donetail != rcp->curtail) {
rcp->donetail = rcp->curtail;
@@ -251,6 +252,7 @@ void rcu_bh_qs(int cpu)
  */
 void rcu_check_callbacks(int cpu, int user)
 {
+   check_cpu_stalls();
if (user || rcu_is_cpu_rrupt_from_idle())
rcu_sched_qs(cpu);
else if (!in_softirq())
diff --git a/kernel/rcutiny_plugin.h b/kernel/rcutiny_plugin.h
index f85016a..8a23300 100644
--- a/kernel/rcutiny_plugin.h
+++ b/kernel/rcutiny_plugin.h
@@ -33,6 +33,9 @@ struct rcu_ctrlblk {
struct rcu_head **donetail; /* ->next pointer of last "done" CB. */
struct rcu_head **curtail;  /* ->next pointer of last CB. */
RCU_TRACE(long qlen);   /* Number of pending CBs. */
+   RCU_TRACE(unsigned long gp_start); /* Start time for stalls. */
+   RCU_TRACE(unsigned long ticks_this_gp); /* Statistic for stalls. */
+   RCU_TRACE(unsigned long jiffies_stall); /* Jiffies at next stall. */
RCU_TRACE(char *name);  /* Name of RCU type. */
 };
 
@@ -54,6 +57,51 @@ int rcu_scheduler_active __read_mostly;
 EXPORT_SYMBOL_GPL(rcu_scheduler_active);
 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
+#ifdef CONFIG_RCU_TRACE
+
+static void check_cpu_stall(struct rcu_ctrlblk *rcp)
+{
+   unsigned long j;
+   unsigned long js;
+
+   if 

[PATCH tip/core/rcu 01/14] rcu: Tag callback lists with corresponding grace-period number

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Currently, callbacks are advanced each time the corresponding CPU
notices a change in its leaf rcu_node structure's ->completed value
(this value counts grace-period completions).  This approach has worked
quite well, but with the advent of RCU_FAST_NO_HZ, we cannot count on
a given CPU seeing all the grace-period completions.  When a CPU misses
a grace-period completion that occurs while it is in dyntick-idle mode,
this will delay invocation of its callbacks.

In addition, acceleration of callbacks (when RCU realizes that a given
callback need only wait until the end of the next grace period, rather
than having to wait for a partial grace period followed by a full
grace period) must be carried out extremely carefully.  Insufficient
acceleration will result in unnecessarily long grace-period latencies,
while excessive acceleration will result in premature callback invocation.
Changes that involve this tradeoff are therefore among the most
nerve-wracking changes to RCU.

This commit therefore explicitly tags groups of callbacks with the
number of the grace period that they are waiting for.  This means that
callback-advancement and callback-acceleration functions are idempotent,
so that excessive acceleration will merely waste a few CPU cycles.  This
also allows a CPU to take full advantage of any grace periods that have
elapsed while it has been in dyntick-idle mode.  It should also enable
simultaneous simplifications to and optimizations of RCU_FAST_NO_HZ.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |  195 ++
 kernel/rcutree.h |2 +
 2 files changed, 169 insertions(+), 28 deletions(-)
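
The tagging rule described above can be sketched in userspace roughly
as follows.  All toy_* names are hypothetical; the wrap-safe comparison
follows the usual ULONG_CMP_LT() definition, and the tag computation
follows rcu_cbs_completed() from the diff below.

```c
#include <assert.h>
#include <limits.h>

/* Wrap-safe "a < b" for free-running unsigned counters. */
static int toy_cmp_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

struct toy_node {
	unsigned long gpnum;		/* most recently started grace period */
	unsigned long completed;	/* most recently completed grace period */
	int is_root;			/* nonzero for the root of the tree */
};

/*
 * Grace-period number that newly arrived callbacks must wait for.
 * Only an idle root node can promise "the very next grace period";
 * everything else must allow for a partial grace period first.
 */
static unsigned long toy_cbs_completed(const struct toy_node *n)
{
	if (n->is_root && n->gpnum == n->completed)
		return n->completed + 1;
	return n->completed + 2;
}

/* A group tagged with 'tag' must keep waiting until completed reaches it. */
static int toy_must_wait(unsigned long completed_now, unsigned long tag)
{
	return toy_cmp_lt(completed_now, tag);
}
```

Because the answer depends only on the tag and the current ->completed
value, repeating the check after an idle period gives the same result,
which is what makes the advancement functions idempotent.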

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e441b77..ac6a75d 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -305,17 +305,27 @@ cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
 }
 
 /*
- * Does the current CPU require a yet-as-unscheduled grace period?
+ * Does the current CPU require a not-yet-started grace period?
+ * The caller must have disabled interrupts to prevent races with
+ * normal callback registry.
  */
 static int
 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 {
-   struct rcu_head **ntp;
+   int i;
 
-   ntp = rdp->nxttail[RCU_DONE_TAIL +
-  (ACCESS_ONCE(rsp->completed) != rdp->completed)];
-   return rdp->nxttail[RCU_DONE_TAIL] && ntp && *ntp &&
-  !rcu_gp_in_progress(rsp);
+   if (rcu_gp_in_progress(rsp))
+   return 0;  /* No, a grace period is already in progress. */
+   if (!rdp->nxttail[RCU_NEXT_TAIL])
+   return 0;  /* No, this is a no-CBs (or offline) CPU. */
+   if (*rdp->nxttail[RCU_NEXT_READY_TAIL])
+   return 1;  /* Yes, this CPU has newly registered callbacks. */
+   for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++)
+   if (rdp->nxttail[i - 1] != rdp->nxttail[i] &&
+   ULONG_CMP_LT(ACCESS_ONCE(rsp->completed),
+rdp->nxtcompleted[i]))
+   return 1;  /* Yes, CBs for future grace period. */
+   return 0; /* No grace period needed. */
 }
 
 /*
@@ -1071,6 +1081,139 @@ static void init_callback_list(struct rcu_data *rdp)
 }
 
 /*
+ * Determine the value that ->completed will have at the end of the
+ * next subsequent grace period.  This is used to tag callbacks so that
+ * a CPU can invoke callbacks in a timely fashion even if that CPU has
+ * been dyntick-idle for an extended period with callbacks under the
+ * influence of RCU_FAST_NO_HZ.
+ *
+ * The caller must hold rnp->lock with interrupts disabled.
+ */
+static unsigned long rcu_cbs_completed(struct rcu_state *rsp,
+  struct rcu_node *rnp)
+{
+   /*
+* If RCU is idle, we just wait for the next grace period.
+* But we can only be sure that RCU is idle if we are looking
+* at the root rcu_node structure -- otherwise, a new grace
+* period might have started, but just not yet gotten around
+* to initializing the current non-root rcu_node structure.
+*/
+   if (rcu_get_root(rsp) == rnp && rnp->gpnum == rnp->completed)
+   return rnp->completed + 1;
+
+   /*
+* Otherwise, wait for a possible partial grace period and
+* then the subsequent full grace period.
+*/
+   return rnp->completed + 2;
+}
+
+/*
+ * If there is room, assign a ->completed number to any callbacks on
+ * this CPU that have not already been assigned.  Also accelerate any
+ * callbacks that were previously assigned a ->completed number that has
+ * since proven to be too conservative, which can happen if callbacks get
+ * assigned a ->completed number while RCU is idle, but with reference to
+ * a non-root rcu_node structure.  This function is idempotent, so it does
+ * not hurt to call it 

[PATCH tip/core/rcu 11/14] rcu: Push lock release to rcu_start_gp()'s callers

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

If CPUs are to give prior notice of needed grace periods, it will be
necessary to invoke rcu_start_gp() without dropping the root rcu_node
structure's ->lock.  This commit takes a second step in this direction
by moving the release of this lock to rcu_start_gp()'s callers.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c|   24 ++--
 kernel/rcutree_plugin.h |5 ++---
 2 files changed, 12 insertions(+), 17 deletions(-)
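
The change in locking responsibility can be illustrated with a small
userspace model.  This is a sketch under stated assumptions: struct
toy_lock is a hypothetical stand-in for the root rcu_node ->lock that
merely tracks whether it is held, and none of the toy_* names exist in
the kernel.

```c
#include <assert.h>

struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a second acquire would deadlock a real lock */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

struct toy_state {
	struct toy_lock root_lock;
	int gp_requested;
	int cbs_waiting;
};

/* Caller must hold root_lock, which is still held on return. */
static void toy_start_gp(struct toy_state *s)
{
	assert(s->root_lock.held);
	if (!s->cbs_waiting)
		return;		/* early exit needs no unlock of its own */
	s->gp_requested = 1;
}

/* Each caller now pairs its own acquire with its own release. */
static void toy_caller(struct toy_state *s)
{
	toy_lock_acquire(&s->root_lock);
	toy_start_gp(s);
	toy_lock_release(&s->root_lock);
}
```

Keeping the release in the caller lets a later patch do more work under
the same lock after toy_start_gp() returns, which is the motivation
given in the commit message.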

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 7207435..8ca18ec 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1525,16 +1525,14 @@ static int __noreturn rcu_gp_kthread(void *arg)
 /*
  * Start a new RCU grace period if warranted, re-initializing the hierarchy
  * in preparation for detecting the next grace period.  The caller must hold
- * the root node's ->lock, which is released before return.  Hard irqs must
- * be disabled.
+ * the root node's ->lock and hard irqs must be disabled.
  *
  * Note that it is legal for a dying CPU (which is marked as offline) to
  * invoke this function.  This can happen when the dying CPU reports its
  * quiescent state.
  */
 static void
-rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
-   __releases(rcu_get_root(rsp)->lock)
+rcu_start_gp(struct rcu_state *rsp)
 {
struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
struct rcu_node *rnp = rcu_get_root(rsp);
@@ -1548,15 +1546,13 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 */
rcu_advance_cbs(rsp, rnp, rdp);
 
-   if (!rsp->gp_kthread ||
-   !cpu_needs_another_gp(rsp, rdp)) {
+   if (!rsp->gp_kthread || !cpu_needs_another_gp(rsp, rdp)) {
/*
 * Either we have not yet spawned the grace-period
 * task, this CPU does not need another grace period,
 * or a grace period is already in progress.
 * Either way, don't start a new grace period.
 */
-   raw_spin_unlock_irqrestore(&rnp->lock, flags);
return;
}
rsp->gp_flags = RCU_GP_FLAG_INIT;
@@ -1566,15 +1562,14 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 
/* Wake up rcu_gp_kthread() to start the grace period. */
wake_up(&rsp->gp_wq);
-   raw_spin_unlock_irqrestore(&rnp->lock, flags);
 }
 
 /*
  * Report a full set of quiescent states to the specified rcu_state
  * data structure.  This involves cleaning up after the prior grace
  * period and letting rcu_start_gp() start up the next grace period
- * if one is needed.  Note that the caller must hold rnp->lock, as
- * required by rcu_start_gp(), which will release it.
+ * if one is needed.  Note that the caller must hold rnp->lock, which
+ * is released before return.
  */
 static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
__releases(rcu_get_root(rsp)->lock)
@@ -2172,7 +2167,8 @@ __rcu_process_callbacks(struct rcu_state *rsp)
local_irq_save(flags);
if (cpu_needs_another_gp(rsp, rdp)) {
raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */
-   rcu_start_gp(rsp, flags);  /* releases above lock */
+   rcu_start_gp(rsp);
+   raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
} else {
local_irq_restore(flags);
}
@@ -2252,11 +2248,11 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 
/* Start a new grace period if one not already started. */
if (!rcu_gp_in_progress(rsp)) {
-   unsigned long nestflag;
struct rcu_node *rnp_root = rcu_get_root(rsp);
 
-   raw_spin_lock_irqsave(&rnp_root->lock, nestflag);
-   rcu_start_gp(rsp, nestflag);  /* rlses rnp_root->lock */
+   raw_spin_lock(&rnp_root->lock);
+   rcu_start_gp(rsp);
+   raw_spin_unlock(&rnp_root->lock);
} else {
/* Give the grace period a kick. */
rdp->blimit = LONG_MAX;
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index d09acdf..736dd2c 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2213,7 +2213,6 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
unsigned long c;
bool d;
unsigned long flags;
-   unsigned long flags1;
struct rcu_node *rnp = rdp->mynode;
struct rcu_node *rnp_root = rcu_get_root(rdp->rsp);
 
@@ -2275,8 +2274,8 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
  c, rnp->level,
  rnp->grplo, rnp->grphi,
  "Startedroot");
-   local_save_flags(flags1);
-  

[PATCH tip/core/rcu 02/14] rcu: Trace callback acceleration

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

This commit adds event tracing for callback acceleration to allow better
tracking of callbacks through the system.

Signed-off-by: Paul E. McKenney 
---
 include/trace/events/rcu.h |6 --
 kernel/rcutree.c   |6 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index d4f559b..5678114 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -44,8 +44,10 @@ TRACE_EVENT(rcu_utilization,
  * of a new grace period or the end of an old grace period ("cpustart"
  * and "cpuend", respectively), a CPU passing through a quiescent
  * state ("cpuqs"), a CPU coming online or going offline ("cpuonl"
- * and "cpuofl", respectively), and a CPU being kicked for being too
- * long in dyntick-idle mode ("kick").
+ * and "cpuofl", respectively), a CPU being kicked for being too
+ * long in dyntick-idle mode ("kick"), a CPU accelerating its new
+ * callbacks to RCU_NEXT_READY_TAIL ("AccReadyCB"), and a CPU
+ * accelerating its new callbacks to RCU_WAIT_TAIL ("AccWaitCB").
  */
 TRACE_EVENT(rcu_grace_period,
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ac6a75d..e9dce4f 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1168,6 +1168,12 @@ static void rcu_accelerate_cbs(struct rcu_state *rsp, struct rcu_node *rnp,
rdp->nxttail[i] = rdp->nxttail[RCU_NEXT_TAIL];
rdp->nxtcompleted[i] = c;
}
+
+   /* Trace depending on how much we were able to accelerate. */
+   if (!*rdp->nxttail[RCU_WAIT_TAIL])
+   trace_rcu_grace_period(rsp->name, rdp->gpnum, "AccWaitCB");
+   else
+   trace_rcu_grace_period(rsp->name, rdp->gpnum, "AccReadyCB");
 }
 
 /*
-- 
1.7.8



[PATCH tip/core/rcu 13/14] rcu: Abstract rcu_start_future_gp() from rcu_nocb_wait_gp()

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

CPUs going idle will need to record the need for a future grace
period, but won't actually need to block waiting on it.  This commit
therefore splits rcu_start_future_gp(), which does the recording, from
rcu_nocb_wait_gp(), which now invokes rcu_start_future_gp() to do the
recording, after which rcu_nocb_wait_gp() does the waiting.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c|  123 +--
 kernel/rcutree.h|2 +-
 kernel/rcutree_plugin.h |  104 
 3 files changed, 130 insertions(+), 99 deletions(-)
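
The recording step can be sketched in userspace as shown here.  The
two-slot array indexed by the low bit of the grace-period number comes
from the diff below; the toy_* names are hypothetical, and the cleanup
helper is an assumption about how a slot is retired, since that part of
the patch is not shown in full.

```c
#include <assert.h>

struct toy_node {
	int need_future_gp[2];	/* indexed by grace-period number parity */
};

/* Returns 1 if the request was newly recorded, 0 if already present. */
static int toy_record_future_gp(struct toy_node *n, unsigned long c)
{
	if (n->need_future_gp[c & 0x1])
		return 0;
	n->need_future_gp[c & 0x1]++;
	return 1;
}

/*
 * Assumed cleanup at the end of grace period c: retire its slot and
 * report whether the following grace period has been requested.
 */
static int toy_future_gp_cleanup(struct toy_node *n, unsigned long c)
{
	n->need_future_gp[c & 0x1] = 0;
	return n->need_future_gp[(c + 1) & 0x1] != 0;
}
```

Two slots are enough in this model because a requested number is never
more than two beyond ->completed, so the requests that can be pending
at any one time differ in parity.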

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 8ca18ec..bd42feb 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -230,6 +230,7 @@ static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
 module_param(jiffies_till_first_fqs, ulong, 0644);
 module_param(jiffies_till_next_fqs, ulong, 0644);
 
+static void rcu_start_gp(struct rcu_state *rsp);
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
@@ -1113,6 +1114,120 @@ static unsigned long rcu_cbs_completed(struct rcu_state *rsp,
 }
 
 /*
+ * Trace-event helper function for rcu_start_future_gp() and
+ * rcu_nocb_wait_gp().
+ */
+static void trace_rcu_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
+   unsigned long c, char *s)
+{
+   trace_rcu_future_grace_period(rdp->rsp->name, rnp->gpnum,
+ rnp->completed, c, rnp->level,
+ rnp->grplo, rnp->grphi, s);
+}
+
+/*
+ * Start some future grace period, as needed to handle newly arrived
+ * callbacks.  The required future grace periods are recorded in each
+ * rcu_node structure's ->need_future_gp field.
+ *
+ * The caller must hold the specified rcu_node structure's ->lock.
+ */
+static unsigned long __maybe_unused
+rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp)
+{
+   unsigned long c;
+   int i;
+   struct rcu_node *rnp_root = rcu_get_root(rdp->rsp);
+
+   /*
+* Pick up grace-period number for new callbacks.  If this
+* grace period is already marked as needed, return to the caller.
+*/
+   c = rcu_cbs_completed(rdp->rsp, rnp);
+   trace_rcu_future_gp(rnp, rdp, c, "Startleaf");
+   if (rnp->need_future_gp[c & 0x1]) {
+   trace_rcu_future_gp(rnp, rdp, c, "Prestartleaf");
+   return c;
+   }
+
+   /*
+* If either this rcu_node structure or the root rcu_node structure
+* believes that a grace period is in progress, then we must wait
+* for the one following, which is in "c".  Because our request
+* will be noticed at the end of the current grace period, we don't
+* need to explicitly start one.
+*/
+   if (rnp->gpnum != rnp->completed ||
+   ACCESS_ONCE(rnp_root->gpnum) != ACCESS_ONCE(rnp_root->completed)) {
+   rnp->need_future_gp[c & 0x1]++;
+   trace_rcu_future_gp(rnp, rdp, c, "Startedleaf");
+   return c;
+   }
+
+   /*
+* There might be no grace period in progress.  If we don't already
+* hold it, acquire the root rcu_node structure's lock in order to
+* start one (if needed).
+*/
+   if (rnp != rnp_root)
+   raw_spin_lock(&rnp_root->lock);
+
+   /*
+* Get a new grace-period number.  If there really is no grace
+* period in progress, it will be smaller than the one we obtained
+* earlier.  Adjust callbacks as needed.  Note that even no-CBs
+* CPUs have a ->nxtcompleted[] array, so no no-CBs checks needed.
+*/
+   c = rcu_cbs_completed(rdp->rsp, rnp_root);
+   for (i = RCU_DONE_TAIL; i < RCU_NEXT_TAIL; i++)
+   if (ULONG_CMP_LT(c, rdp->nxtcompleted[i]))
+   rdp->nxtcompleted[i] = c;
+
+   /*
+* If the need for the required grace period is already
+* recorded, trace and leave.
+*/
+   if (rnp_root->need_future_gp[c & 0x1]) {
+   trace_rcu_future_gp(rnp, rdp, c, "Prestartedroot");
+   goto unlock_out;
+   }
+
+   /* Record the need for the future grace period. */
+   rnp_root->need_future_gp[c & 0x1]++;
+
+   /* If a grace period is not already in progress, start one. */
+   if (rnp_root->gpnum != rnp_root->completed) {
+   trace_rcu_future_gp(rnp, rdp, c, "Startedleafroot");
+   } else {
+   trace_rcu_future_gp(rnp, rdp, c, "Startedroot");
+   rcu_start_gp(rdp->rsp);
+   }
+unlock_out:
+   if (rnp != rnp_root)
+   raw_spin_unlock(&rnp_root->lock);
+   return c;
+}
+
+/*
+ * Clean up any old requests for the just-ended grace period.  Also return
+ * 

[PATCH tip/core/rcu 06/14] rcu: Export RCU_FAST_NO_HZ parameters to sysfs

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

RCU_FAST_NO_HZ operation is controlled by four compile-time C-preprocessor
macros, but some use cases benefit greatly from runtime adjustment,
particularly when tuning devices.  This commit therefore creates the
corresponding sysfs entries.

Reported-by: Robin Randhawa 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree_plugin.h |   31 ---
 1 files changed, 20 insertions(+), 11 deletions(-)
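
The timer arithmetic that now uses rcu_idle_gp_delay can be sketched as
follows.  This is a userspace illustration with hypothetical toy_*
names; it assumes the delay is a power of two, as the default value of
4 is, because the mask-based rounding shown here is only valid then.

```c
#include <assert.h>

/* Round x up to a multiple of y; valid only when y is a power of two. */
static unsigned long toy_round_up_pow2(unsigned long x, unsigned long y)
{
	return ((x - 1) | (y - 1)) + 1;
}

/*
 * Ticks from 'now' until the next multiple of gp_delay that is at least
 * gp_delay ticks away, so that CPUs going idle at about the same time
 * share a wakeup.
 */
static unsigned long toy_delta_jiffies(unsigned long now,
				       unsigned long gp_delay)
{
	return toy_round_up_pow2(gp_delay + now, gp_delay) - now;
}
```

With now = 10 and a delay of 4, the target is tick 16, so the result is
6.  A delay set at run time to something other than a power of two would
need a different rounding helper in this sketch.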

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index ab1bdde..0997e9f 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1617,6 +1617,15 @@ static void rcu_idle_count_callbacks_posted(void)
 #define RCU_IDLE_GP_DELAY 4/* Roughly one grace period. */
 #define RCU_IDLE_LAZY_GP_DELAY (6 * HZ)/* Roughly six seconds. */
 
+static int rcu_idle_flushes = RCU_IDLE_FLUSHES;
+module_param(rcu_idle_flushes, int, 0644);
+static int rcu_idle_opt_flushes = RCU_IDLE_OPT_FLUSHES;
+module_param(rcu_idle_opt_flushes, int, 0644);
+static int rcu_idle_gp_delay = RCU_IDLE_GP_DELAY;
+module_param(rcu_idle_gp_delay, int, 0644);
+static int rcu_idle_lazy_gp_delay = RCU_IDLE_LAZY_GP_DELAY;
+module_param(rcu_idle_lazy_gp_delay, int, 0644);
+
 extern int tick_nohz_enabled;
 
 /*
@@ -1696,10 +1705,10 @@ int rcu_needs_cpu(int cpu, unsigned long *delta_jiffies)
}
/* Set up for the possibility that RCU will post a timer. */
if (rcu_cpu_has_nonlazy_callbacks(cpu)) {
-   *delta_jiffies = round_up(RCU_IDLE_GP_DELAY + jiffies,
- RCU_IDLE_GP_DELAY) - jiffies;
+   *delta_jiffies = round_up(rcu_idle_gp_delay + jiffies,
+ rcu_idle_gp_delay) - jiffies;
} else {
-   *delta_jiffies = jiffies + RCU_IDLE_LAZY_GP_DELAY;
+   *delta_jiffies = jiffies + rcu_idle_lazy_gp_delay;
*delta_jiffies = round_jiffies(*delta_jiffies) - jiffies;
}
return 0;
@@ -1805,11 +1814,11 @@ static void rcu_prepare_for_idle(int cpu)
if (rcu_cpu_has_nonlazy_callbacks(cpu)) {
trace_rcu_prep_idle("User dyntick with callbacks");
rdtp->idle_gp_timer_expires =
-   round_up(jiffies + RCU_IDLE_GP_DELAY,
-RCU_IDLE_GP_DELAY);
+   round_up(jiffies + rcu_idle_gp_delay,
+rcu_idle_gp_delay);
} else if (rcu_cpu_has_callbacks(cpu)) {
rdtp->idle_gp_timer_expires =
-   round_jiffies(jiffies + RCU_IDLE_LAZY_GP_DELAY);
+   round_jiffies(jiffies + rcu_idle_lazy_gp_delay);
trace_rcu_prep_idle("User dyntick with lazy callbacks");
} else {
return;
@@ -1861,8 +1870,8 @@ static void rcu_prepare_for_idle(int cpu)
/* Check and update the ->dyntick_drain sequencing. */
if (rdtp->dyntick_drain <= 0) {
/* First time through, initialize the counter. */
-   rdtp->dyntick_drain = RCU_IDLE_FLUSHES;
-   } else if (rdtp->dyntick_drain <= RCU_IDLE_OPT_FLUSHES &&
+   rdtp->dyntick_drain = rcu_idle_flushes;
+   } else if (rdtp->dyntick_drain <= rcu_idle_opt_flushes &&
   !rcu_pending(cpu) &&
   !local_softirq_pending()) {
/* Can we go dyntick-idle despite still having callbacks? */
@@ -1871,11 +1880,11 @@ static void rcu_prepare_for_idle(int cpu)
if (rcu_cpu_has_nonlazy_callbacks(cpu)) {
trace_rcu_prep_idle("Dyntick with callbacks");
rdtp->idle_gp_timer_expires =
-   round_up(jiffies + RCU_IDLE_GP_DELAY,
-RCU_IDLE_GP_DELAY);
+   round_up(jiffies + rcu_idle_gp_delay,
+rcu_idle_gp_delay);
} else {
rdtp->idle_gp_timer_expires =
-   round_jiffies(jiffies + RCU_IDLE_LAZY_GP_DELAY);
+   round_jiffies(jiffies + rcu_idle_lazy_gp_delay);
trace_rcu_prep_idle("Dyntick with lazy callbacks");
}
tp = &rdtp->idle_gp_timer;
-- 
1.7.8



[PATCH tip/core/rcu 07/14] rcu: Accelerate RCU callbacks at grace-period end

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Now that callback acceleration is idempotent, it is safe to accelerate
callbacks during grace-period cleanup on any CPUs that the kthread happens
to be running on.  This commit therefore propagates the completion
of the grace period to the per-CPU data structures, and also adds an
rcu_advance_cbs() just before the cpu_needs_another_gp() check in order
to reduce false-positive grace periods.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |   21 +
 1 files changed, 13 insertions(+), 8 deletions(-)
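
Why it is safe to advance callbacks from more than one place can be
seen in a small userspace model.  This is a sketch only: the toy_*
names and the three-group layout are hypothetical simplifications of
the tagged callback groups, not the kernel's data structures.

```c
#include <assert.h>
#include <limits.h>

#define TOY_NSEG 3

struct toy_cbs {
	unsigned long tag[TOY_NSEG];	/* grace period each group waits for */
	int pending[TOY_NSEG];		/* callbacks still waiting per group */
	int done;			/* callbacks ready to be invoked */
};

/* Wrap-safe "a >= b" for free-running unsigned counters. */
static int toy_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/*
 * Move every group whose grace period has completed to the done count.
 * Returns the number of callbacks moved by this call.
 */
static int toy_advance_cbs(struct toy_cbs *c, unsigned long completed)
{
	int i;
	int moved = 0;

	for (i = 0; i < TOY_NSEG; i++) {
		if (c->pending[i] && toy_cmp_ge(completed, c->tag[i])) {
			moved += c->pending[i];
			c->pending[i] = 0;
		}
	}
	c->done += moved;
	return moved;
}
```

A second call with the same completed value moves nothing, so an extra
call before the cpu_needs_another_gp() check costs only a few cycles.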

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4ec797e..392c977 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1434,6 +1434,9 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
rcu_for_each_node_breadth_first(rsp, rnp) {
raw_spin_lock_irq(&rnp->lock);
rnp->completed = rsp->gpnum;
+   rdp = this_cpu_ptr(rsp->rda);
+   if (rnp == rdp->mynode)
+   __rcu_process_gp_end(rsp, rnp, rdp);
nocb += rcu_nocb_gp_cleanup(rsp, rnp);
raw_spin_unlock_irq(&rnp->lock);
cond_resched();
@@ -1446,6 +1449,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
trace_rcu_grace_period(rsp->name, rsp->completed, "end");
rsp->fqs_state = RCU_GP_IDLE;
rdp = this_cpu_ptr(rsp->rda);
+   rcu_advance_cbs(rsp, rnp, rdp);  /* Reduce false positives below. */
if (cpu_needs_another_gp(rsp, rdp))
rsp->gp_flags = 1;
raw_spin_unlock_irq(&rnp->lock);
@@ -1535,6 +1539,15 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
struct rcu_node *rnp = rcu_get_root(rsp);
 
+   /*
+* If there is no grace period in progress right now, any
+* callbacks we have up to this point will be satisfied by the
+* next grace period.  Also, advancing the callbacks reduces the
+* probability of false positives from cpu_needs_another_gp()
+* resulting in pointless grace periods.  So, advance callbacks!
+*/
+   rcu_advance_cbs(rsp, rnp, rdp);
+
if (!rsp->gp_kthread ||
!cpu_needs_another_gp(rsp, rdp)) {
/*
@@ -1547,14 +1560,6 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
return;
}
 
-   /*
-* Because there is no grace period in progress right now,
-* any callbacks we have up to this point will be satisfied
-* by the next grace period.  So this is a good place to
-* assign a grace period number to recently posted callbacks.
-*/
-   rcu_accelerate_cbs(rsp, rnp, rdp);
-
rsp->gp_flags = RCU_GP_FLAG_INIT;
raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */
 
-- 
1.7.8



[PATCH tip/core/rcu 10/14] rcu: Repurpose no-CBs event tracing to future-GP events

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Dyntick-idle CPUs need to be able to pre-announce their need for grace
periods.  This can be done using something similar to the mechanism used
by no-CB CPUs to announce their need for grace periods.  This commit
moves in this direction by renaming the no-CBs grace-period event tracing
to suit the new future-grace-period needs.

Signed-off-by: Paul E. McKenney 
---
 include/trace/events/rcu.h |   16 +-
 kernel/rcutree_plugin.h|   62 ++-
 2 files changed, 40 insertions(+), 38 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index ef0bf31..0dc0177 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -72,10 +72,10 @@ TRACE_EVENT(rcu_grace_period,
 );
 
 /*
- * Tracepoint for no-callbacks grace-period events.  The caller should
- * pull the data from the rcu_node structure, other than rcuname, which
- * comes from the rcu_state structure, and event, which is one of the
- * following:
+ * Tracepoint for future grace-period events, including those for no-callbacks
+ * CPUs.  The caller should pull the data from the rcu_node structure,
+ * other than rcuname, which comes from the rcu_state structure, and event,
+ * which is one of the following:
  *
  * "Startleaf": Request a nocb grace period based on leaf-node data.
  * "Startedleaf": Leaf-node start proved sufficient.
@@ -87,7 +87,7 @@ TRACE_EVENT(rcu_grace_period,
  * "Cleanup": Clean up rcu_node structure after previous GP.
  * "CleanupMore": Clean up, and another no-CB GP is needed.
  */
-TRACE_EVENT(rcu_nocb_grace_period,
+TRACE_EVENT(rcu_future_grace_period,
 
TP_PROTO(char *rcuname, unsigned long gpnum, unsigned long completed,
 unsigned long c, u8 level, int grplo, int grphi,
@@ -645,9 +645,9 @@ TRACE_EVENT(rcu_barrier,
 #define trace_rcu_grace_period(rcuname, gpnum, gpevent) do { } while (0)
 #define trace_rcu_grace_period_init(rcuname, gpnum, level, grplo, grphi, \
qsmask) do { } while (0)
-#define trace_rcu_nocb_grace_period(rcuname, gpnum, completed, c, \
-   level, grplo, grphi, event) \
-   do { } while (0)
+#define trace_rcu_future_grace_period(rcuname, gpnum, completed, c, \
+ level, grplo, grphi, event) \
+ do { } while (0)
 #define trace_rcu_preempt_task(rcuname, pid, gpnum) do { } while (0)
 #define trace_rcu_unlock_preempted_task(rcuname, gpnum, pid) do { } while (0)
 #define trace_rcu_quiescent_state_report(rcuname, gpnum, mask, qsmask, level, \
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 9371bdd..d09acdf 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2073,9 +2073,9 @@ static int rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
wake_up_all(&rnp->nocb_gp_wq[c & 0x1]);
rnp->n_nocb_gp_requests[c & 0x1] = 0;
needmore = rnp->n_nocb_gp_requests[(c + 1) & 0x1];
-   trace_rcu_nocb_grace_period(rsp->name, rnp->gpnum, rnp->completed,
-   c, rnp->level, rnp->grplo, rnp->grphi,
-   needmore ? "CleanupMore" : "Cleanup");
+   trace_rcu_future_grace_period(rsp->name, rnp->gpnum, rnp->completed,
+ c, rnp->level, rnp->grplo, rnp->grphi,
+ needmore ? "CleanupMore" : "Cleanup");
return needmore;
 }
 
@@ -,9 +,9 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 
/* Count our request for a grace period. */
rnp->n_nocb_gp_requests[c & 0x1]++;
-   trace_rcu_nocb_grace_period(rdp->rsp->name, rnp->gpnum, rnp->completed,
-   c, rnp->level, rnp->grplo, rnp->grphi,
-   "Startleaf");
+   trace_rcu_future_grace_period(rdp->rsp->name, rnp->gpnum,
+ rnp->completed, c, rnp->level,
+ rnp->grplo, rnp->grphi, "Startleaf");
 
if (rnp->gpnum != rnp->completed) {
 
@@ -2233,10 +2233,10 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 * is in progress, so we are done.  When this grace
 * period ends, our request will be acted upon.
 */
-   trace_rcu_nocb_grace_period(rdp->rsp->name,
-   rnp->gpnum, rnp->completed, c,
-   rnp->level, rnp->grplo, rnp->grphi,
-   "Startedleaf");
+   trace_rcu_future_grace_period(rdp->rsp->name, rnp->gpnum,
+ rnp->completed, c, rnp->level,
+ rnp->grplo, rnp->grphi,
+

[PATCH tip/core/rcu 14/14] rcu: Make rcu_accelerate_cbs() note need for future grace periods

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Now that rcu_start_future_gp() has been abstracted from
rcu_nocb_wait_gp(), rcu_accelerate_cbs() can invoke rcu_start_future_gp()
so as to register the need for any future grace periods needed by a
CPU about to enter dyntick-idle mode.  This commit makes this change.
Note that some refactoring of rcu_start_gp() is carried out to avoid
recursion and subsequent self-deadlocks.

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |   50 --
 1 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index bd42feb..f7399b4 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -230,7 +230,8 @@ static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
 module_param(jiffies_till_first_fqs, ulong, 0644);
 module_param(jiffies_till_next_fqs, ulong, 0644);
 
-static void rcu_start_gp(struct rcu_state *rsp);
+static void rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
+ struct rcu_data *rdp);
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
@@ -1200,7 +1201,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp)
trace_rcu_future_gp(rnp, rdp, c, "Startedleafroot");
} else {
trace_rcu_future_gp(rnp, rdp, c, "Startedroot");
-   rcu_start_gp(rdp->rsp);
+   rcu_start_gp_advanced(rdp->rsp, rnp_root, rdp);
}
 unlock_out:
if (rnp != rnp_root)
@@ -1286,6 +1287,8 @@ static void rcu_accelerate_cbs(struct rcu_state *rsp, struct rcu_node *rnp,
rdp->nxttail[i] = rdp->nxttail[RCU_NEXT_TAIL];
rdp->nxtcompleted[i] = c;
}
+   /* Record any needed additional grace periods. */
+   rcu_start_future_gp(rnp, rdp);
 
/* Trace depending on how much we were able to accelerate. */
if (!*rdp->nxttail[RCU_WAIT_TAIL])
@@ -1647,20 +1650,9 @@ static int __noreturn rcu_gp_kthread(void *arg)
  * quiescent state.
  */
 static void
-rcu_start_gp(struct rcu_state *rsp)
+rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
+ struct rcu_data *rdp)
 {
-   struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
-   struct rcu_node *rnp = rcu_get_root(rsp);
-
-   /*
-* If there is no grace period in progress right now, any
-* callbacks we have up to this point will be satisfied by the
-* next grace period.  Also, advancing the callbacks reduces the
-* probability of false positives from cpu_needs_another_gp()
-* resulting in pointless grace periods.  So, advance callbacks!
-*/
-   rcu_advance_cbs(rsp, rnp, rdp);
-
if (!rsp->gp_kthread || !cpu_needs_another_gp(rsp, rdp)) {
/*
 * Either we have not yet spawned the grace-period
@@ -1672,14 +1664,36 @@ rcu_start_gp(struct rcu_state *rsp)
}
rsp->gp_flags = RCU_GP_FLAG_INIT;
 
-   /* Ensure that CPU is aware of completion of last grace period. */
-   __rcu_process_gp_end(rsp, rdp->mynode, rdp);
-
/* Wake up rcu_gp_kthread() to start the grace period. */
wake_up(&rsp->gp_wq);
 }
 
 /*
+ * Similar to rcu_start_gp_advanced(), but also advance the calling CPU's
+ * callbacks.  Note that rcu_start_gp_advanced() cannot do this because it
+ * is invoked indirectly from rcu_advance_cbs(), which would result in
+ * endless recursion -- or would do so if it wasn't for the self-deadlock
+ * that is encountered beforehand.
+ */
+static void
+rcu_start_gp(struct rcu_state *rsp)
+{
+   struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
+   struct rcu_node *rnp = rcu_get_root(rsp);
+
+   /*
+* If there is no grace period in progress right now, any
+* callbacks we have up to this point will be satisfied by the
+* next grace period.  Also, advancing the callbacks reduces the
+* probability of false positives from cpu_needs_another_gp()
+* resulting in pointless grace periods.  So, advance callbacks
+* then start the grace period!
+*/
+   rcu_advance_cbs(rsp, rnp, rdp);
+   rcu_start_gp_advanced(rsp, rnp, rdp);
+}
+
+/*
  * Report a full set of quiescent states to the specified rcu_state
  * data structure.  This involves cleaning up after the prior grace
  * period and letting rcu_start_gp() start up the next grace period
-- 
1.7.8
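
The refactoring in this patch can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the kernel code: the struct and the toy_* function names are hypothetical stand-ins. It shows why the callback-advancing step is kept out of the inner function. The advancing path can itself end up requesting a grace period, so only the outer wrapper is allowed to advance callbacks.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the RCU state; not the kernel's struct rcu_state. */
struct toy_state {
	int advance_calls;	/* times callbacks were advanced */
	int gp_requests;	/* times a grace period was requested */
	bool need_gp;		/* does this CPU still need a grace period? */
};

/* Inner step: request a grace period, assuming callbacks are already advanced. */
static void toy_start_gp_advanced(struct toy_state *s)
{
	if (!s->need_gp)
		return;
	s->gp_requests++;
	s->need_gp = false;
}

/*
 * Advancing callbacks may discover that a future grace period is needed,
 * so it calls the inner step.  If the inner step advanced callbacks too,
 * the two functions would call each other without end.
 */
static void toy_advance_cbs(struct toy_state *s)
{
	s->advance_calls++;
	toy_start_gp_advanced(s);
}

/* Outer wrapper: advance first, then start the grace period. */
static void toy_start_gp(struct toy_state *s)
{
	toy_advance_cbs(s);
	toy_start_gp_advanced(s);
}
```

In this sketch a call to toy_start_gp() advances callbacks exactly once, and the second call to the inner step is a no-op when the first one already issued the request.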



[PATCH tip/core/rcu 05/14] rcu: Distinguish "rcuo" kthreads by RCU flavor

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Currently, the per-no-CBs-CPU kthreads are named "rcuo" followed by
the CPU number, for example, "rcuo0".  This is problematic given that
there are either two or three RCU flavors, each of which gets a per-CPU
kthread with exactly the same name.  This commit therefore introduces
a one-letter abbreviation for each RCU flavor, namely 'b' for RCU-bh,
'p' for RCU-preempt, and 's' for RCU-sched.  This abbreviation is used
to distinguish the "rcuo" kthreads, for example, for CPU 0 we would have
"rcuo0b", "rcuo0p", and "rcuo0s".

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
Tested-by: Dietmar Eggemann 
---
 kernel/rcutree.c|7 ---
 kernel/rcutree.h|1 +
 kernel/rcutree_plugin.h |5 +++--
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 8b110fa..4ec797e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -64,7 +64,7 @@
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS];
 
-#define RCU_STATE_INITIALIZER(sname, cr) { \
+#define RCU_STATE_INITIALIZER(sname, sabbr, cr) { \
.level = { &sname##_state.node[0] }, \
.call = cr, \
.fqs_state = RCU_GP_IDLE, \
@@ -76,13 +76,14 @@ static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS];
.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
.onoff_mutex = __MUTEX_INITIALIZER(sname##_state.onoff_mutex), \
.name = #sname, \
+   .abbr = sabbr, \
 }
 
 struct rcu_state rcu_sched_state =
-   RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched);
+   RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
 DEFINE_PER_CPU(struct rcu_data, rcu_sched_data);
 
-struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh, call_rcu_bh);
+struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh);
 DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
 
 static struct rcu_state *rcu_state;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index ef26eab..c865117 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -452,6 +452,7 @@ struct rcu_state {
unsigned long gp_max;   /* Maximum GP duration in */
/*  jiffies. */
char *name; /* Name of structure. */
+   char abbr;  /* Abbreviated name. */
struct list_head flavors;   /* List of RCU flavors. */
 };
 
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index eb9b473..ab1bdde 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -111,7 +111,7 @@ static void __init rcu_bootup_announce_oddness(void)
 #ifdef CONFIG_TREE_PREEMPT_RCU
 
 struct rcu_state rcu_preempt_state =
-   RCU_STATE_INITIALIZER(rcu_preempt, call_rcu);
+   RCU_STATE_INITIALIZER(rcu_preempt, 'p', call_rcu);
 DEFINE_PER_CPU(struct rcu_data, rcu_preempt_data);
 static struct rcu_state *rcu_state = &rcu_preempt_state;
 
@@ -2510,7 +2510,8 @@ static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
return;
for_each_cpu(cpu, rcu_nocb_mask) {
rdp = per_cpu_ptr(rsp->rda, cpu);
-   t = kthread_run(rcu_nocb_kthread, rdp, "rcuo%d", cpu);
+   t = kthread_run(rcu_nocb_kthread, rdp,
+   "rcuo%d%c", cpu, rsp->abbr);
BUG_ON(IS_ERR(t));
ACCESS_ONCE(rdp->nocb_kthread) = t;
}
-- 
1.7.8
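
The naming scheme can be tried in user space with the same format string that the patch passes to kthread_run(). This is only a sketch: the helper name rcuo_name() is hypothetical and snprintf() stands in for the kernel's own name formatting.

```c
#include <stdio.h>

/*
 * Build a per-CPU, per-flavor kthread name such as "rcuo0b".
 * The format string matches the one used in the patch above.
 */
static int rcuo_name(char *buf, size_t len, int cpu, char abbr)
{
	return snprintf(buf, len, "rcuo%d%c", cpu, abbr);
}
```

With the abbreviations from the commit message, CPU 0 yields "rcuo0b", "rcuo0p", and "rcuo0s", so the three kthreads can be told apart in ps output.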



[PATCH tip/core/rcu 09/14] rcu: Rearrange locking in rcu_start_gp()

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

If CPUs are to give prior notice of needed grace periods, it will be
necessary to invoke rcu_start_gp() without dropping the root rcu_node
structure's ->lock.  This commit takes a first step in this direction
by moving the release of this lock to the end of rcu_start_gp().

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4b8d91c..7207435 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1559,16 +1559,14 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
raw_spin_unlock_irqrestore(&rnp->lock, flags);
return;
}
-
rsp->gp_flags = RCU_GP_FLAG_INIT;
-   raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */
 
/* Ensure that CPU is aware of completion of last grace period. */
-   rcu_process_gp_end(rsp, rdp);
-   local_irq_restore(flags);
+   __rcu_process_gp_end(rsp, rdp->mynode, rdp);
 
/* Wake up rcu_gp_kthread() to start the grace period. */
wake_up(&rsp->gp_wq);
+   raw_spin_unlock_irqrestore(&rnp->lock, flags);
 }
 
 /*
-- 
1.7.8



[PATCH tip/core/rcu 12/14] rcu: Rename n_nocb_gp_requests to need_future_gp

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

CPUs going idle need to be able to indicate their need for future grace
periods.  A mechanism for doing this already exists for no-callbacks
CPUs, so the idea is to re-use that mechanism.  This commit therefore
moves the ->n_nocb_gp_requests field of the rcu_node structure out from
under the CONFIG_RCU_NOCB_CPU #ifdef and renames it to ->need_future_gp.

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.h|4 ++--
 kernel/rcutree_plugin.h |   18 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 282b1d7..775d96c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -198,9 +198,9 @@ struct rcu_node {
 #ifdef CONFIG_RCU_NOCB_CPU
wait_queue_head_t nocb_gp_wq[2];
/* Place for rcu_nocb_kthread() to wait GP. */
-   int n_nocb_gp_requests[2];
-   /* Counts of upcoming no-CB GP requests. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+   int need_future_gp[2];
+   /* Counts of upcoming no-CB GP requests. */
raw_spinlock_t fqslock cacheline_internodealigned_in_smp;
 } cacheline_internodealigned_in_smp;
 
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 736dd2c..e4037bd 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2057,7 +2057,7 @@ static int rcu_nocb_needs_gp(struct rcu_state *rsp)
 {
struct rcu_node *rnp = rcu_get_root(rsp);
 
-   return rnp->n_nocb_gp_requests[(ACCESS_ONCE(rnp->completed) + 1) & 0x1];
+   return rnp->need_future_gp[(ACCESS_ONCE(rnp->completed) + 1) & 0x1];
 }
 
 /*
@@ -2071,8 +2071,8 @@ static int rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
int needmore;
 
wake_up_all(&rnp->nocb_gp_wq[c & 0x1]);
-   rnp->n_nocb_gp_requests[c & 0x1] = 0;
-   needmore = rnp->n_nocb_gp_requests[(c + 1) & 0x1];
+   rnp->need_future_gp[c & 0x1] = 0;
+   needmore = rnp->need_future_gp[(c + 1) & 0x1];
trace_rcu_future_grace_period(rsp->name, rnp->gpnum, rnp->completed,
  c, rnp->level, rnp->grplo, rnp->grphi,
  needmore ? "CleanupMore" : "Cleanup");
@@ -2080,7 +2080,7 @@ static int rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 }
 
 /*
- * Set the root rcu_node structure's ->n_nocb_gp_requests field
+ * Set the root rcu_node structure's ->need_future_gp field
  * based on the sum of those of all rcu_node structures.  This does
  * double-count the root rcu_node structure's requests, but this
  * is necessary to handle the possibility of a rcu_nocb_kthread()
@@ -2089,7 +2089,7 @@ static int rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
  */
 static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq)
 {
-   rnp->n_nocb_gp_requests[(rnp->completed + 1) & 0x1] += nrq;
+   rnp->need_future_gp[(rnp->completed + 1) & 0x1] += nrq;
 }
 
 static void rcu_init_one_nocb(struct rcu_node *rnp)
@@ -2220,7 +2220,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
c = rnp->completed + 2;
 
/* Count our request for a grace period. */
-   rnp->n_nocb_gp_requests[c & 0x1]++;
+   rnp->need_future_gp[c & 0x1]++;
trace_rcu_future_grace_period(rdp->rsp->name, rnp->gpnum,
  rnp->completed, c, rnp->level,
  rnp->grplo, rnp->grphi, "Startleaf");
@@ -2264,10 +2264,10 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 * Adjust counters accordingly and start the
 * needed grace period.
 */
-   rnp->n_nocb_gp_requests[c & 0x1]--;
+   rnp->need_future_gp[c & 0x1]--;
c = rnp_root->completed + 1;
-   rnp->n_nocb_gp_requests[c & 0x1]++;
-   rnp_root->n_nocb_gp_requests[c & 0x1]++;
+   rnp->need_future_gp[c & 0x1]++;
+   rnp_root->need_future_gp[c & 0x1]++;
trace_rcu_future_grace_period(rdp->rsp->name,
  rnp->gpnum,
  rnp->completed,
-- 
1.7.8
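
The two-element ->need_future_gp array is indexed by the low-order bit of the grace-period number. A minimal user-space sketch of that indexing follows. The struct and the toy_* helpers are hypothetical simplifications, and the sketch assumes (as the code in this series appears to) that requests only ever target one of the next two grace periods, which is why two slots are enough.

```c
/* Hypothetical, simplified stand-in for struct rcu_node. */
struct toy_node {
	unsigned long completed;	/* number of the last completed grace period */
	int need_future_gp[2];		/* request counts, indexed by GP-number parity */
};

/* Record a request for grace period number c. */
static void toy_request_gp(struct toy_node *np, unsigned long c)
{
	np->need_future_gp[c & 0x1]++;
}

/*
 * Grace period c has just ended: clear its slot and report whether
 * anyone has already asked for the following grace period.
 */
static int toy_gp_cleanup(struct toy_node *np, unsigned long c)
{
	np->completed = c;
	np->need_future_gp[c & 0x1] = 0;
	return np->need_future_gp[(c + 1) & 0x1];
}
```

A nonzero return value from toy_gp_cleanup() corresponds to the "CleanupMore" trace event, and zero corresponds to "Cleanup".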



[PATCH tip/core/rcu 08/14] rcu: Make RCU_FAST_NO_HZ take advantage of numbered callbacks

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

Because RCU callbacks are now associated with the number of the grace
period that they must wait for, CPUs can now advance callbacks
corresponding to grace periods that ended while a given CPU was in
dyntick-idle mode.  This eliminates the need to try forcing the RCU
state machine while entering idle, thus reducing the CPU intensiveness
of RCU_FAST_NO_HZ, which should increase its energy efficiency.

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcupdate.h |1 +
 kernel/rcutree.c |   28 +++--
 kernel/rcutree.h |   12 +--
 kernel/rcutree_plugin.h  |  350 +++---
 kernel/rcutree_trace.c   |2 -
 5 files changed, 131 insertions(+), 262 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 275aa3f..4b37b50 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -73,6 +73,7 @@ extern void do_trace_rcu_torture_read(char *rcutorturename,
 #define UINT_CMP_LT(a, b)  (UINT_MAX / 2 < (a) - (b))
 #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))
 #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))
+#define ulong2long(a)  (*(long *)(&(a)))
 
 /* Exported common interfaces */
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 392c977..4b8d91c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2678,19 +2678,27 @@ static int rcu_pending(int cpu)
 }
 
 /*
- * Check to see if any future RCU-related work will need to be done
- * by the current CPU, even if none need be done immediately, returning
- * 1 if so.
+ * Return true if the specified CPU has any callback.  If all_lazy is
+ * non-NULL, store an indication of whether all callbacks are lazy.
+ * (If there are no callbacks, all of them are deemed to be lazy.)
  */
-static int rcu_cpu_has_callbacks(int cpu)
+static int rcu_cpu_has_callbacks(int cpu, bool *all_lazy)
 {
+   bool al = true;
+   bool hc = false;
+   struct rcu_data *rdp;
struct rcu_state *rsp;
 
-   /* RCU callbacks either ready or pending? */
-   for_each_rcu_flavor(rsp)
-   if (per_cpu_ptr(rsp->rda, cpu)->nxtlist)
-   return 1;
-   return 0;
+   for_each_rcu_flavor(rsp) {
+   rdp = per_cpu_ptr(rsp->rda, cpu);
+   if (rdp->qlen != rdp->qlen_lazy)
+   al = false;
+   if (rdp->nxtlist)
+   hc = true;
+   }
+   if (all_lazy)
+   *all_lazy = al;
+   return hc;
 }
 
 /*
@@ -2912,7 +2920,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
atomic_set(&rdp->dynticks->dynticks,
   (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
-   rcu_prepare_for_idle_init(cpu);
raw_spin_unlock(&rnp->lock);/* irqs remain disabled. */
 
/* Add CPU to rcu_node bitmasks. */
@@ -2986,7 +2993,6 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
 */
for_each_rcu_flavor(rsp)
rcu_cleanup_dying_cpu(rsp);
-   rcu_cleanup_after_idle(cpu);
break;
case CPU_DEAD:
case CPU_DEAD_FROZEN:
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index c865117..282b1d7 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -88,18 +88,13 @@ struct rcu_dynticks {
int dynticks_nmi_nesting;   /* Track NMI nesting level. */
atomic_t dynticks;  /* Even value for idle, else odd. */
 #ifdef CONFIG_RCU_FAST_NO_HZ
-   int dyntick_drain;  /* Prepare-for-idle state variable. */
-   unsigned long dyntick_holdoff;
-   /* No retries for the jiffy of failure. */
-   struct timer_list idle_gp_timer;
-   /* Wake up CPU sleeping with callbacks. */
-   unsigned long idle_gp_timer_expires;
-   /* When to wake up CPU (for repost). */
-   bool idle_first_pass;   /* First pass of attempt to go idle? */
+   bool all_lazy;  /* Are all CPU's CBs lazy? */
unsigned long nonlazy_posted;
/* # times non-lazy CBs posted to CPU. */
unsigned long nonlazy_posted_snap;
/* idle-period nonlazy_posted snapshot. */
+   unsigned long last_accelerate;
+   /* Last jiffy CBs were accelerated. */
int tick_nohz_enabled_snap; /* Previously seen value from sysfs. */
 #endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
 #ifdef CONFIG_RCU_USER_QS
@@ -530,7 +525,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
 struct rcu_node *rnp);
 #endif /* #ifdef CONFIG_RCU_BOOST */
 static void __cpuinit rcu_prepare_kthreads(int 

[PATCH tip/core/rcu 0/14] RCU idle/no-CB changes for 3.9

2013-01-05 Thread Paul E. McKenney
Hello!

This series contains changes to RCU_FAST_NO_HZ idle entry/exit and also
removes restrictions on no-CBs CPUs.  This series contains some commits
that are still rather experimental, so you should avoid using these patches
unless you would like to help debug them.  ;-)

1.  Tag callback lists with the grace-period number that they are
waiting for.  This change enables a number of optimizations
for RCU_FAST_NO_HZ, and though it adds a bit of code, it greatly
simplifies RCU's callback handling.
2.  Trace callback acceleration (which is when RCU notices that a
group of callbacks doesn't actually need to wait as long as it
previously thought).
3.  Remove restrictions on no-CBs CPUs.  This patch is probably the
highest-risk of the group.
4.  Allow some control of no-CBs CPUs at kernel-build time.  The option
of most interest is probably the one that makes -all- CPUs be
no-CBs CPUs.
5.  Distinguish the no-CBs kthreads for the various RCU flavors.
Without this patch, CPU 0 would have up to three kthreads all
named "rcuo0", which is less than optimal.
6.  Export RCU_FAST_NO_HZ parameters to sysfs to allow run-time
adjustment.
7.  Re-introduce callback acceleration during grace-period cleanup.
Now that the callbacks are associated with specific grace periods,
such acceleration is idempotent, and it is now safe to accelerate
more than needed.  (In contrast, in the past, too-frequent callback
acceleration resulted in infrequent RCU failures.)
8.  Use the newly numbered callbacks to greatly reduce the CPU overhead
incurred at idle entry by RCU_FAST_NO_HZ.  The fact that the
callbacks are now numbered means that instead of repeatedly
cranking the RCU state machine to try to get all callbacks
invoked, we can instead rely on the numbering so that the CPU
can take full advantage of any grace periods that elapse while
it is asleep.  CPUs with callbacks still have limited sleep times,
especially if they have at least one non-lazy callback queued.
9-14.   Allow CPUs to make known their need for future grace periods,
which is also used to reduce the need for frenetic RCU
state-machine cranking upon RCU_FAST_NO_HZ entry to idle.
9.  Move the release of the root rcu_node structure's ->lock
to the end of rcu_start_gp().
10. Repurpose no-CB's grace-period event tracing to that of
future grace periods, which share no-CB's grace-period
mechanism.
11. Move the release of the root rcu_node structure's ->lock
to rcu_start_gp()'s callers.
12. Rename the rcu_node ->n_nocb_gp_requests field to
->need_future_gp.
13. Abstract rcu_start_future_gp() from rcu_nocb_wait_gp()
so that RCU_FAST_NO_HZ can use the no-CB CPUs mechanism
for allowing a CPU to record its need for future grace
periods.
14. Make rcu_accelerate_cbs() note the need for future
grace periods, thus avoiding delays in starting grace
periods that currently happen due to the CPUs needing
those grace periods being out of action when the previous
grace period ends.

Thanx, Paul


 b/include/linux/rcupdate.h   |1 
 b/include/trace/events/rcu.h |   77 +++
 b/init/Kconfig   |   17 
 b/kernel/rcutree.c   |  475 ++-
 b/kernel/rcutree.h   |   39 -
 b/kernel/rcutree_plugin.h|  859 ++-
 b/kernel/rcutree_trace.c |2 
 7 files changed, 848 insertions(+), 622 deletions(-)
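
The idea behind items 1 and 8 can be sketched in user space. The sketch below is a simplification under stated assumptions: struct toy_batch and toy_ready_cbs() are hypothetical, and only the wrap-safe comparison macro is taken from include/linux/rcupdate.h. Each batch of callbacks carries the number of the grace period it waits for, so a CPU that slept through several grace periods can tell with one comparison per batch which callbacks are ready.

```c
#include <limits.h>

/* Wrap-safe "a >= b" for grace-period numbers, as in include/linux/rcupdate.h. */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

/* Hypothetical callback batch tagged with the grace period it waits for. */
struct toy_batch {
	unsigned long gp_needed;	/* grace-period number that must complete */
	int ncbs;			/* callbacks in this batch */
};

/*
 * Count the callbacks that are ready to invoke once grace period
 * "completed" has ended, however many grace periods elapsed meanwhile.
 */
static int toy_ready_cbs(const struct toy_batch *b, int n, unsigned long completed)
{
	int i, ready = 0;

	for (i = 0; i < n; i++)
		if (ULONG_CMP_GE(completed, b[i].gp_needed))
			ready += b[i].ncbs;
	return ready;
}
```

Because the comparison is done on the unsigned difference, it keeps working when the grace-period counter wraps around.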



kernel BUG at kernel/sched_rt.c:493!

2013-01-05 Thread Shawn Bohrer
We recently managed to crash 10 of our test machines at the same time.
Half of the machines were running a 3.1.9 kernel and half were running
3.4.9.  I realize that these are both fairly old kernels but I've
skimmed the list of fixes in the 3.4.* stable series and didn't see
anything that appeared to be relevant to this issue.

All we managed to get was some screenshots of the stacks from the
consoles. On one of the 3.1.9 machines you can see we hit the
BUG_ON(want) statement in __disable_runtime() at
kernel/sched_rt.c:493, and all of the machines had essentially the
same stack showing:

rq_offline_rt
rq_attach_root
cpu_attach_domain
partition_sched_domains
do_rebuild_sched_domains

Here is one of the screenshots of the 3.1.9 machines:

https://dl.dropbox.com/u/84066079/berbox38.png

And here is one from a 3.4.9 machine:

https://dl.dropbox.com/u/84066079/berbox18.png

Three of the five 3.4.9 machines also managed to print
"[sched_delayed] sched: RT throttling activated" ~7 minutes before the
machines locked up.

I've tried reproducing the issue, but so far I've been unsuccessful.
I believe that is because my RT tasks aren't using enough CPU to
cause borrowing from the other runqueues.  Normally our RT tasks use
very little CPU so I'm not entirely sure what conditions caused them
to run into throttling on the day that this happened.

The details that I do know about the workload that caused this are as
follows.

1) These are all dual socket 4 core X5460 systems with no
hyperthreading.  Thus there are 8 cores total in the system.
2) We use the cpuset cgroup to apply CPU affinity to various types of
processes.  Initially everything starts out in a single cpuset and the
top level cpuset has cpuset.sched_load_balance=1 thus there is only a
single scheduling domain.
3) In this case tasks were then placed into four non-overlapping
cpusets: one containing a single core and a single SCHED_FIFO task, two
each containing two cores and multiple SCHED_FIFO tasks, and one
containing three cores and everything else on the system running as
SCHED_OTHER.
4) In the case of cpusets that contain SCHED_FIFO tasks, the tasks
start out as SCHED_OTHER, are placed into the cpuset, and then change
their policy to SCHED_FIFO.
5) Once all tasks are placed into non overlapping cpusets the top
level cpuset.sched_load_balance is set to 0 to split the system into
four scheduling domains.
6) The system ran like this for some unknown amount of time.
7) All the processes are then sent a signal to exit, and at the same
time the top level cpuset.sched_load_balance is set back to 1.  This
is when the systems locked up.

Hopefully that is enough information to give someone more familiar
with the scheduler code an idea of where the bug is here.  I will
point out that in step #5 above there is a small window where the RT
tasks could encounter runtime limits but are still in a single big
scheduling domain.  I don't know if that is what happened or if it is
simply sufficient to hit the runtime limits while the system is split
into four domains.  For the curious we are using the default RT
runtime limits:

# grep . /proc/sys/kernel/sched_rt_*
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
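
For reference, the share of CPU time these two limits give to realtime tasks can be worked out with a few lines of C. This is only an arithmetic sketch: the function name is hypothetical, and the values in the comment are the usual kernel defaults of 1000000 us and 950000 us, which is an assumption about what the output above represents.

```c
/*
 * Percentage of each period that realtime tasks may consume before
 * throttling, given the two sysctl values in microseconds.
 * With the usual defaults (1000000 and 950000) the result is 95.
 */
static long rt_bandwidth_percent(long period_us, long runtime_us)
{
	if (runtime_us < 0)	/* -1 disables RT throttling */
		return 100;
	if (period_us <= 0)	/* guard against division by zero */
		return 0;
	return runtime_us * 100 / period_us;
}
```

Note that this is the per-period budget only; runtime borrowed from other runqueues in the same scheduling domain is not modeled here.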

Let me know if anyone needs any more information about this issue.

Thanks,
Shawn



Re: [3.8-rc] regression: NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

2013-01-05 Thread Francois Romieu
Jörg Otte  :
[...]
> jojo@ahorn:~$ dmesg | grep XID
> [1.808847] r8169 :02:00.0 eth0: RTL8168evl/8111evl at
> 0xc9054000, 5c:9a:d8:69:2b:39, XID 0c900800 IRQ 42

Can you check if things improve with v3.8-rc2 after removing :

1. 9ecb9aabaf634677c77af467f4e3028b09d7bcda 
   r8169: workaround for missing extended GigaMAC registers
2. d64ec841517a25f6d468bde9f67e5b4cffdc67c7
   r8169: enable internal ASPM and clock request settings
3. e0c075577965d1c01b30038d38bf637b027a1df3
   r8169: enable ALDPS for power saving

(you can directly try v3.7 r8169.c with v3.8-rc2 if it worked for you
so far) 

If the regression is still there, please apply the patch below to both
v3.8-rc2 unpatched and a known working version then send me their dmesg
after you 'ip link set dev eth0 up'.

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index ed96f30..3d2d2446 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -90,10 +90,28 @@ static const int multicast_filter_limit = 32;
 #define RTL8169_TX_TIMEOUT  (6*HZ)
 #define RTL8169_PHY_TIMEOUT (10*HZ)
 
+static void rw8(void __iomem *ioaddr, u8 b)
+{
+   printk(KERN_DEBUG PFX "w %p %02x\n", ioaddr, b);
+   writeb(b, ioaddr);
+}
+
+static void rw16(void __iomem *ioaddr, u16 w)
+{
+   printk(KERN_DEBUG PFX "w %p %04x\n", ioaddr, w);
+   writew(w, ioaddr);
+}
+
+static void rw32(void __iomem *ioaddr, u32 d)
+{
+   printk(KERN_DEBUG PFX "w %p %08x\n", ioaddr, d);
+   writel(d, ioaddr);
+}
+
 /* write/read MMIO register */
-#define RTL_W8(reg, val8)   writeb ((val8), ioaddr + (reg))
-#define RTL_W16(reg, val16) writew ((val16), ioaddr + (reg))
-#define RTL_W32(reg, val32) writel ((val32), ioaddr + (reg))
+#define RTL_W8(reg, val8)   rw8(ioaddr + (reg), (val8))
+#define RTL_W16(reg, val16) rw16(ioaddr + (reg), (val16))
+#define RTL_W32(reg, val32) rw32(ioaddr + (reg), (val32))
 #define RTL_R8(reg)         readb (ioaddr + (reg))
 #define RTL_R16(reg)        readw (ioaddr + (reg))
 #define RTL_R32(reg)        readl (ioaddr + (reg))
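
The tracing technique in this debug patch (route every register write through a helper that logs before it writes) can be tried in user space. The sketch below is not driver code: fake_regs and write_count are hypothetical, a plain array stands in for the MMIO window, and a plain store replaces writeb(), which is only acceptable because the target here is ordinary memory.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical register file standing in for the NIC's MMIO window. */
static uint8_t fake_regs[256];
static unsigned int write_count;

/* Log the write, then perform it, mirroring the rw8() helper in the patch. */
static void rw8(uint8_t *addr, uint8_t b)
{
	fprintf(stderr, "w %p %02x\n", (void *)addr, (unsigned int)b);
	write_count++;
	*addr = b;	/* real MMIO would need writeb() here */
}

/* Same call shape as the driver macro: register offset first, value second. */
#define RTL_W8(reg, val8)	rw8(ioaddr + (reg), (val8))
```

As in the driver, the macro expects a variable named ioaddr to be in scope at the call site. Comparing the logged write sequences from a working and a failing kernel is what makes the regression visible.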
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Incorrect accounting of irq into the running task

2013-01-05 Thread Shaun Ruffell
On Fri, Jan 04, 2013 at 10:22:12AM -0800, Sadasivan Shaiju wrote:
> Hi  Venkatesh,
> 
> I have applied the following patches for the incorrect accounting
> of irq into the running task .
> 
> 
> [PATCH] x86: Add IRQ_TIME_ACCOUNTING
> [e82b8e4ea4f3dffe6e7939f90e78da675fcc450e]
> [PATCH] sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
> [b52bfee445d315549d41eacf2fa7c156e7d153d5]
> 
> [PATCH] sched: Do not account irq time to current task
> [305e6835e05513406fa12820e40e4a8ecb63743c]
> [PATCH] sched: Export ns irqtimes through /proc/stat
> [abb74cefa9c682fb38ba86c17ca3c86fed6cc464]
> 
> 
> 
> But the stime and utime of the process in /proc/pid/stat are still
> high. I think the above patches do not update the stime and
> utime values in /proc/pid/stat.
> 
> 
> Or am I missing anything?

Just checking that you do have CONFIG_IRQ_TIME_ACCOUNTING=y in your
kernel config?

Cheers,
Shaun


Re: [PATCH tip/core/urgent 1/2] rcu: Prevent soft-lockup complaints about no-CBs CPUs

2013-01-05 Thread Frederic Weisbecker
Hi Paul,

2013/1/5 Paul E. McKenney :
> From: Paul Gortmaker 
>
> The wait_event() at the head of the rcu_nocb_kthread() can result in
> soft-lockup complaints if the CPU in question does not register RCU
> callbacks for an extended period.  This commit therefore changes
> the wait_event() to a wait_event_interruptible().
>
> Reported-by: Frederic Weisbecker 
> Signed-off-by: Paul Gortmaker 
> Signed-off-by: Paul E. McKenney 
> ---
>  kernel/rcutree_plugin.h |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index f6e5ec2..43dba2d 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -2366,10 +2366,11 @@ static int rcu_nocb_kthread(void *arg)
> for (;;) {
> /* If not polling, wait for next batch of callbacks. */
> if (!rcu_nocb_poll)
> -   wait_event(rdp->nocb_wq, rdp->nocb_head);
> +   wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
> list = ACCESS_ONCE(rdp->nocb_head);
> if (!list) {
> schedule_timeout_interruptible(1);
> +   flush_signals(current);

Why is that needed?


Re: [PATCH] rcu: comment: correct 'optimized' to 'optimize'.

2013-01-05 Thread Paul E. McKenney
On Fri, Jan 04, 2013 at 10:33:53AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 04, 2013 at 12:59:40PM -0500, Cody P Schafer wrote:
> > Small grammar fix in rcutree comment regarding 'rcu_scheduler_active'
> > var.
> 
> Queued, thank you!  I could not resist fixing a small nit in the commit
> log as well.  ;-)

And are you OK with my adding your Signed-off-by?  Either way, on future
patches, please add your Signed-off-by in your original posting.

>   Thanx, Paul
> 
> > ---
> >  kernel/rcutree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index e441b77..bfb8972 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -105,7 +105,7 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> >   * The rcu_scheduler_active variable transitions from zero to one just
> >   * before the first task is spawned.  So when this variable is zero, RCU
> >   * can assume that there is but one task, allowing RCU to (for example)
> > - * optimized synchronize_sched() to a simple barrier().  When this variable
> > + * optimize synchronize_sched() to a simple barrier().  When this variable
> >   * is one, RCU must actually do all the hard work required to detect real
> >   * grace periods.  This variable is also used to suppress boot-time false
> >   * positives from lockdep-RCU error checking.
> > -- 
> > 1.8.0.3
> > 



Re: [PATCH tip/core/rcu 0/6] RCU fixes for 3.9

2013-01-05 Thread Paul E. McKenney
On Sat, Jan 05, 2013 at 09:09:20AM -0800, Paul E. McKenney wrote:
> Hello!
> 
> The following fixes are intended for 3.9:
> 
> 1.  Fix int/long type confusion in trace_rcu_batch_start().
> 2.  Declare rcu_is_cpu_rrupt_from_idle() static, courtesy of
>     Josh Triplett.
> 3.  Make rcu_eqs_enter_common() trace the new nesting value instead
>     of zero, courtesy of Li Zhong.
> 4.  Silence a gcc array-out-of-bounds false positive in rcu_init_one().
> 5.  Code style fix in rcu_torture_barrier_init(), courtesy of Sasha Levin.

and:

6.  Grammar fix to rcu_scheduler_active comment, courtesy of Cody
Schafer.

>   Thanx, Paul
> 
> 
>  b/include/trace/events/rcu.h |6 +++---
>  b/kernel/rcutiny.c   |2 +-
>  b/kernel/rcutorture.c|2 +-
>  b/kernel/rcutree.c   |   10 +++---
>  4 files changed, 12 insertions(+), 8 deletions(-)



[PATCH tip/core/rcu 2/6] rcu: Make rcu_is_cpu_rrupt_from_idle helper functions static

2013-01-05 Thread Paul E. McKenney
From: Josh Triplett 

Both rcutiny and rcutree define a helper function named
rcu_is_cpu_rrupt_from_idle(), each used exactly once, later in the
same file.  This commit therefore declares these helper functions static.

Signed-off-by: Josh Triplett 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutiny.c |2 +-
 kernel/rcutree.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
index e7dce58..9f72a0f 100644
--- a/kernel/rcutiny.c
+++ b/kernel/rcutiny.c
@@ -193,7 +193,7 @@ EXPORT_SYMBOL(rcu_is_cpu_idle);
  * interrupts don't count, we must be running at the first interrupt
  * level.
  */
-int rcu_is_cpu_rrupt_from_idle(void)
+static int rcu_is_cpu_rrupt_from_idle(void)
 {
return rcu_dynticks_nesting <= 1;
 }
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e441b77..cceda76 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -727,7 +727,7 @@ EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
  * interrupt from idle, return true.  The caller must have at least
  * disabled preemption.
  */
-int rcu_is_cpu_rrupt_from_idle(void)
+static int rcu_is_cpu_rrupt_from_idle(void)
 {
return __get_cpu_var(rcu_dynticks).dynticks_nesting <= 1;
 }
-- 
1.7.8



[PATCH tip/core/rcu 3/6] rcu: Use new nesting value for rcu_dyntick trace in rcu_eqs_enter_common

2013-01-05 Thread Paul E. McKenney
From: Li Zhong 

This patch uses the real new value of dynticks_nesting instead of 0 in
rcu_eqs_enter_common().

Signed-off-by: Li Zhong 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cceda76..d145796 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -336,7 +336,7 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 static void rcu_eqs_enter_common(struct rcu_dynticks *rdtp, long long oldval,
bool user)
 {
-   trace_rcu_dyntick("Start", oldval, 0);
+   trace_rcu_dyntick("Start", oldval, rdtp->dynticks_nesting);
if (!user && !is_idle_task(current)) {
struct task_struct *idle = idle_task(smp_processor_id());
 
-- 
1.7.8



[PATCH tip/core/rcu 1/6] rcu: Fix blimit type for trace_rcu_batch_start()

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

When the type of global variable blimit changed from int to long, the
type of the blimit argument of trace_rcu_batch_start() should have
changed as well.  This commit fixes this issue.

Signed-off-by: Paul E. McKenney 
---
 include/trace/events/rcu.h |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index d4f559b..f919498 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -393,7 +393,7 @@ TRACE_EVENT(rcu_kfree_callback,
  */
 TRACE_EVENT(rcu_batch_start,
 
-   TP_PROTO(char *rcuname, long qlen_lazy, long qlen, int blimit),
+   TP_PROTO(char *rcuname, long qlen_lazy, long qlen, long blimit),
 
TP_ARGS(rcuname, qlen_lazy, qlen, blimit),
 
@@ -401,7 +401,7 @@ TRACE_EVENT(rcu_batch_start,
__field(char *, rcuname)
__field(long, qlen_lazy)
__field(long, qlen)
-   __field(int, blimit)
+   __field(long, blimit)
),
 
TP_fast_assign(
@@ -411,7 +411,7 @@ TRACE_EVENT(rcu_batch_start,
__entry->blimit = blimit;
),
 
-   TP_printk("%s CBs=%ld/%ld bl=%d",
+   TP_printk("%s CBs=%ld/%ld bl=%ld",
  __entry->rcuname, __entry->qlen_lazy, __entry->qlen,
  __entry->blimit)
 );
-- 
1.7.8
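
For readers following along, the int-versus-long mismatch that the patch above addresses can be sketched in userspace C. This is only an illustration under assumptions: format_blimit_int() and format_blimit_long() are hypothetical stand-ins for the tracepoint before and after the change, not kernel functions.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the tracepoint before the fix: the
 * parameter is int, so a long argument is narrowed at the call site
 * on platforms where long is wider than int.
 */
static int format_blimit_int(char *buf, size_t len, int blimit)
{
	return snprintf(buf, len, "bl=%d", blimit);
}

/*
 * Hypothetical stand-in after the fix: the parameter type and the
 * format specifier are both widened to long, so the value is passed
 * and printed without narrowing.
 */
static int format_blimit_long(char *buf, size_t len, long blimit)
{
	return snprintf(buf, len, "bl=%ld", blimit);
}
```

Both helpers behave identically for small values. They would be expected to differ only when the caller holds a long that does not fit in an int, in which case the int variant receives an implementation-defined narrowed value.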



[PATCH tip/core/rcu 4/6] rcu: Silence compiler array out-of-bounds false positive

2013-01-05 Thread Paul E. McKenney
From: "Paul E. McKenney" 

It turns out that gcc 4.8 warns on array indexes being out of bounds
unless it can prove otherwise.  It gives this warning on some RCU
initialization code.  Because this is far from any fastpath, add
an explicit check for array bounds and panic if the check fails.  This
gives the compiler enough information to figure out that the array
index is never out of bounds.

However, if a similar false positive occurs on a fastpath, it will
probably be necessary to tell the compiler to keep its array-index
anxieties to itself.  ;-)

Markus Trippelsdorf 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d145796..e0d9815 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2938,6 +2938,10 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 
BUILD_BUG_ON(MAX_RCU_LVLS > ARRAY_SIZE(buf));  /* Fix buf[] init! */
 
+   /* Silence gcc 4.8 warning about array index out of range. */
+   if (rcu_num_lvls > RCU_NUM_LVLS)
+   panic("rcu_init_one: rcu_num_lvls overflow");
+
/* Initialize the level-tracking arrays. */
 
for (i = 0; i < rcu_num_lvls; i++)
-- 
1.7.8
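
The idea in the patch above, an explicit range check ahead of a loop that indexes a fixed-size array, can be sketched in userspace C roughly as follows. MAX_LVLS and sum_levels() are hypothetical names chosen for illustration, and abort() is used as a userspace stand-in for panic(); whether a given compiler version still warns without the check is not something this sketch demonstrates.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_LVLS 4	/* hypothetical compile-time array size */

/*
 * Sum the first num_lvls entries of a fixed-size table.  The explicit
 * check up front tells both the reader and the compiler that the loop
 * index below always stays within [0, MAX_LVLS).
 */
static int sum_levels(const int levelcnt[MAX_LVLS], int num_lvls)
{
	int i, sum = 0;

	if (num_lvls < 0 || num_lvls > MAX_LVLS) {
		fprintf(stderr, "sum_levels: num_lvls out of range\n");
		abort();	/* userspace stand-in for panic() */
	}

	for (i = 0; i < num_lvls; i++)
		sum += levelcnt[i];
	return sum;
}
```

As the commit message notes, this approach fits slow paths such as initialization code, where one extra comparison costs nothing that matters.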



[PATCH tip/core/rcu 5/6] rcutorture: don't compare ptr with 0

2013-01-05 Thread Paul E. McKenney
From: Sasha Levin 

Signed-off-by: Sasha Levin 
Reviewed-by: Josh Triplett 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcutorture.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 31dea01..0249800 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -1749,7 +1749,7 @@ static int rcu_torture_barrier_init(void)
barrier_cbs_wq =
kzalloc(n_barrier_cbs * sizeof(barrier_cbs_wq[0]),
GFP_KERNEL);
-   if (barrier_cbs_tasks == NULL || barrier_cbs_wq == 0)
+   if (barrier_cbs_tasks == NULL || !barrier_cbs_wq)
return -ENOMEM;
for (i = 0; i < n_barrier_cbs; i++) {
init_waitqueue_head(&barrier_cbs_wq[i]);
-- 
1.7.8



[PATCH tip/core/rcu 6/6] rcu: Correct 'optimized' to 'optimize' in header comment

2013-01-05 Thread Paul E. McKenney
From: Cody P Schafer 

Small grammar fix in rcutree comment regarding 'rcu_scheduler_active'
var.

Signed-off-by: Paul E. McKenney 
---
 kernel/rcutree.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e0d9815..d78ba60 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -105,7 +105,7 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
  * The rcu_scheduler_active variable transitions from zero to one just
  * before the first task is spawned.  So when this variable is zero, RCU
  * can assume that there is but one task, allowing RCU to (for example)
- * optimized synchronize_sched() to a simple barrier().  When this variable
+ * optimize synchronize_sched() to a simple barrier().  When this variable
  * is one, RCU must actually do all the hard work required to detect real
  * grace periods.  This variable is also used to suppress boot-time false
  * positives from lockdep-RCU error checking.
-- 
1.7.8


