date:20140627

Re: [PATCH] target: fix deadlock on unload

2014-06-27 Thread Nicholas A. Bellinger

On Fri, 2014-06-27 at 08:35 -0400, Mikulas Patocka wrote:
> 
> On Thu, 26 Jun 2014, Nicholas A. Bellinger wrote:
> 
> > Hi Mikulas,
> > 
> > On Mon, 2014-06-23 at 13:42 -0400, Mikulas Patocka wrote:
> > > target: fix deadlock on unload
> > > 
> > > On uniprocessor preemptible kernel, target core deadlocks on unload. The
> > > following events happen:
> > > * iscsit_del_np is called
> > > * it calls send_sig(SIGINT, np->np_thread, 1);
> > > * the scheduler switches to the np_thread
> > > * the np_thread is woken up, it sees that kthread_should_stop() returns
> > >   false, so it doesn't terminate
> > > * the np_thread clears signals with flush_signals(current); and goes back
> > >   to sleep in iscsit_accept_np
> > > * the scheduler switches back to iscsit_del_np
> > > * iscsit_del_np calls kthread_stop(np->np_thread);
> > > * the np_thread is waiting in iscsit_accept_np and it doesn't respond to
> > >   kthread_stop
> > > 
> > > The deadlock could be resolved if the administrator sends SIGINT signal to
> > > the np_thread with killall -INT iscsi_np
> > > 
> > > The reproducible deadlock was introduced in commit
> > > db6077fd0b7dd41dc6ff18329cec979379071f87, but the thread-stopping code was
> > > racy even before.
> > > 
> > > This patch fixes the problem. Using kthread_should_stop to stop the
> > > np_thread is unreliable, so we test np_thread_state instead. If
> > > np_thread_state equals ISCSI_NP_THREAD_SHUTDOWN, the thread exits.
> > > 
> > > Signed-off-by: Mikulas Patocka 
> > > Cc: sta...@vger.kernel.org
> > > 
> > 
> > Apologies for the delayed response..
> > 
> > Applied to target-pending/master and including in the next -rc3 PULL
> > request.
> > 
> > Also FYI, I've added '3.12+' to the stable tag to match how far back
> > commit db6077fd0 has been included in stable.
> > 
> > Thanks,
> > 
> > --nab
> 
> Hi
> 
> I think db6077fd0 should be backported to stable kernels beginning with 
> 3.10 (because they set np->np_thread = NULL in 
> __iscsi_target_login_thread). The current 3.10-stable branch misses this 
> patch.
> 
> 
> This patch for unload deadlock should be backported to all stable kernels 
> (because unload is racy there), but because of different code, we should 
> make a different patch for old stable branches.
> 
> For example in 3.4.95, __iscsi_target_login_thread contains this code:
> spin_lock_bh(&np->np_thread_lock);
> if (np->np_thread_state == ISCSI_NP_THREAD_RESET) {
> np->np_thread_state = ISCSI_NP_THREAD_ACTIVE;
> complete(&np->np_restart_comp);
> } else {
> np->np_thread_state = ISCSI_NP_THREAD_ACTIVE;
> }
> spin_unlock_bh(&np->np_thread_lock);
> If the state is ISCSI_NP_THREAD_SHUTDOWN, the above piece of code will 
> change it to ISCSI_NP_THREAD_ACTIVE and open the same kthread_should_stop 
> race.
> 

Ok, dropping the 'v3.12+' stable tag from this patch, and will include
it in a PATCH-v3.10.y series together with db6077fd0 once Greg-KH
attempts to queue it up.

As for v3.4.x, care to send along a separate patch with your Tested-by +
Signed-off-by..?

Thanks for the extra stable info!

--nab

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

2014-06-27 Thread Alexei Starovoitov

On Fri, Jun 27, 2014 at 11:25 PM, Andy Lutomirski  wrote:
> On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov  
> wrote:
>> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski  wrote:
>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  
>>> wrote:
 BPF syscall is a demux for different BPF releated commands.

 'maps' is a generic storage of different types for sharing data between 
 kernel
 and userspace.

 The maps can be created/deleted from user space via BPF syscall:
 - create a map with given id, type and attributes
   map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int 
 len)
   returns positive map id or negative error

 - delete map with given map id
   err = bpf_map_delete(int map_id)
   returns zero or negative error
>>>
>>> What's the scope of "id"?  How is it secured?
>>
>> the map and program id space is global and it's cap_sys_admin only.
>> There is no pressing need to do it with per-user limits.
>> So the whole thing is root only for now.
>>
>
> Hmm.  This may be unpleasant if you ever want to support non-root or
> namespaced operation.

I think it will be easy to extend it per namespace when we lift
root-only restriction. It will be seamless without user api changes.

> How hard would it be to give these things fds?

you mean programs/maps auto-terminate when creator process
exits? I thought about it and it's appealing at first glance, but
doesn't fit the model of existing tracepoint events which are global.
The programs attached to events need to live without 'daemon'
hanging around. Therefore I picked 'kernel module'- like method.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Problems booting kernel with UEFI bios

2014-06-27 Thread Fouldmother Finnishfucker

Greetings

Two years ago I decided to ditch the shim and shams that come with booting in 
UEFI on the laptop. This was working fine. The last year someone had a 
suggestion to sign the kernel, and enable certificate checking, as describe 
here http://kroah.com/log/blog/2013/09/02/booting-a-self-signed-linux-kernel/, 
and this too was good.

Then along came version 3.15.0. The same config but now I see an EFI_MIXED 
option. I tried ignoring this, and I found that I could not boot. I enabled 
this and I still get the same result. So I waited for two versions to see if 
the problem is fixed and nothing. I have tried using early printks in EFI and 
the display is blank. I have no serial, but USB and I cannot get the traces 
out. So two questions: Has anyone else the same woes? and more importantly how 
does one debug this stuff? I tried qemu with tianocore and that barfs on the 
3.14.8 kernel that is currently working.

Any suggestions would be apprecaited.

FM FF
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Michael Welling

On Sat, Jun 28, 2014 at 05:10:46AM +0200, Guillaume Morin wrote:
> On 27 Jun 22:37, Greg Kroah-Hartman wrote:
> > Put that below the --- line.
> 
> Will do.
> 
> > > > And what checkpatch error did you fix?  And are you sure it needs to be
> > > > fixed?
> > > 
> > > That's what I changed:
> > > 
> > > $ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
> > > ERROR: Macros with complex values should be enclosed in parenthesis
> > 
> > Then why didn't you say that :)
> 
> Well it was not totally clear to me if that was obvious or not.  Anyway,
> I'll mention it in the future.
> 
> > 
> > > I assumed that if it was reported as an error, it needed to be fixed...
> > 
> > Use your judgement, checkpatch is a tool, it isn't always correct.
> 
> Right, I guess it's borderline.  Should I resend the patch or just drop
> it?

These days we have GENMASK.
 
http://lxr.free-electrons.com/source/include/linux/bitops.h#L21
 
Maybe the macro can be used directly instead of the value_mask.

> 
> Guillaume.
> 
> -- 
> Guillaume Morin 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-iio" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 07/14] bpf: expand BPF syscall with program load/unload

2014-06-27 Thread Andy Lutomirski

On Fri, Jun 27, 2014 at 11:12 PM, Alexei Starovoitov  wrote:
> On Fri, Jun 27, 2014 at 5:19 PM, Andy Lutomirski  wrote:
>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  
>> wrote:
>>> eBPF programs are safe run-to-completion functions with load/unload
>>> methods from userspace similar to kernel modules.
>>>
>>> User space API:
>>>
>>> - load eBPF program
>>>   prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, 
>>> int len)
>>>
>>>   where 'prog' is a sequence of sections (currently TEXT and LICENSE)
>>>   TEXT - array of eBPF instructions
>>>   LICENSE - GPL compatible
>>> +
>>> +   err = -EINVAL;
>>> +   /* look for mandatory license string */
>>> +   if (!tb[BPF_PROG_LICENSE])
>>> +   goto free_attr;
>>> +
>>> +   /* eBPF programs must be GPL compatible */
>>> +   if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
>>> +   goto free_attr;
>>
>> Seriously?  My mind boggles.
>
> Yes. Quite a bit of logic can fit into one eBPF program. I don't think it's 
> wise
> to leave this door open for abuse. This check makes it clear that if you
> write a program in C, the source code must be available.
> If program is written in assembler than this check is nop anyway.
>

I can see this seriously annoying lots of users.  For example,
Chromium might object.

If you want to add GPL-only functions in the future, that would be one
thing.  But if someone writes a nice eBPF compiler, and someone else
writes a little program that filters on network packets, I see no
reason to claim that the little program is a derivative work of the
kernel and therefore must be GPL.

> btw this patch doesn't include debugfs access to all loaded eBPF programs.
> Similarly to kernel modules I'm planning to have a way to list all loaded
> programs with optional assembler dump of instructions.

Users can also dump running programs with ptrace.  That doesn't mean
that all loaded programs need to be GPL.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

2014-06-27 Thread Andy Lutomirski

On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov  wrote:
> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski  wrote:
>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  
>> wrote:
>>> BPF syscall is a demux for different BPF releated commands.
>>>
>>> 'maps' is a generic storage of different types for sharing data between 
>>> kernel
>>> and userspace.
>>>
>>> The maps can be created/deleted from user space via BPF syscall:
>>> - create a map with given id, type and attributes
>>>   map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int 
>>> len)
>>>   returns positive map id or negative error
>>>
>>> - delete map with given map id
>>>   err = bpf_map_delete(int map_id)
>>>   returns zero or negative error
>>
>> What's the scope of "id"?  How is it secured?
>
> the map and program id space is global and it's cap_sys_admin only.
> There is no pressing need to do it with per-user limits.
> So the whole thing is root only for now.
>

Hmm.  This may be unpleasant if you ever want to support non-root or
namespaced operation.

How hard would it be to give these things fds?

> Since I got your attention please review the most interesting
> verifier bits (patch 08/14) ;)

Will do.  Or at least I'll try :)

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 13/14] samples: bpf: example of stateful socket filtering

2014-06-27 Thread Alexei Starovoitov

On Fri, Jun 27, 2014 at 5:21 PM, Andy Lutomirski  wrote:
> On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov  wrote:
>> this socket filter example does:
>>
>> - creates a hashtable in kernel with key 4 bytes and value 8 bytes
>>
>> - populates map[6] = 0; map[17] = 0;  // 6 - tcp_proto, 17 - udp_proto
>>
>> - loads eBPF program:
>>   r0 = skb[14 + 9]; // load one byte of ip->proto
>>   *(u32*)(fp - 4) = r0;
>>   value = bpf_map_lookup_elem(map_id, fp - 4);
>>   if (value)
>>(*(u64*)value) += 1;
>
> In the code below, this is XADD.  Is there anything that validates
> that shared things like this can only be poked at by atomic
> operations?

Correct. The asm code uses xadd to increment packet stats.
It's up to the program itself to decide what it's doing.
Some programs may prefer speed vs accuracy when counting
and they will be using regular "ld, add, st", instead of xadd.
Verifier checks that programs can only access a valid memory
region. The program itself needs to do something sensible with it.
Theoretically I can add a check to verifier that shared map elements
are read-only and xadd-only, but that limits usability and unnecessary.
We actually do have a use case when we do a regular add, since
'lock add' is too costly at high event rates.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-27 Thread Austin Schuh

On Fri, Jun 27, 2014 at 8:32 PM, Mike Galbraith
 wrote:
> On Fri, 2014-06-27 at 18:18 -0700, Austin Schuh wrote:
>
>> It would be more context switches, but I wonder if we could kick the
>> workqueue logic completely out of the scheduler into a thread.  Have
>> the scheduler increment/decrement an atomic pool counter, and wake up
>> the monitoring thread to spawn new threads when needed?  That would
>> get rid of the recursive pool lock problem, and should reduce
>> scheduler latency if we would need to spawn a new thread.
>
> I was wondering the same thing, and not only for workqueue, but also the
> plug pulling.  It's kind of a wart to have that stuff sitting in the
> hear of the scheduler in the first place, would be nice if it just went
> away.  When a task can't help itself, you _could_ wake a proxy do that
> for you.  Trouble is, I can imagine that being a heck of a lot of
> context switches with some loads.. and who's gonna help the helper when
> he blocks while trying to help?
>
> -Mike

For workqueues, as long as the helper doesn't block on a lock which
requires the work queue to be freed up, it will eventually become
unblocked and make progress.  The helper _should_ only need the pool
lock, which will wake the helper back up when it is available again.
Nothing should go to sleep in an un-recoverable way with the work pool
lock held.

To drop the extra context switch, you could have a minimum of 2 worker
threads around at all times, and have the management thread start the
work and delegate to the next management thread.  That thread would
then wait for the first thread to block, spawn a new thread, and then
start the next piece of work.  Definitely a bit more complicated.

Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/4] staging: gdm72xx: use lower case for variable names for consistency

2014-06-27 Thread Ben Chan

Signed-off-by: Ben Chan 
---
 drivers/staging/gdm72xx/gdm_qos.c | 38 +++---
 drivers/staging/gdm72xx/gdm_qos.h |  2 +-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/staging/gdm72xx/gdm_qos.c 
b/drivers/staging/gdm72xx/gdm_qos.c
index a2efc5c..b08c8e1 100644
--- a/drivers/staging/gdm72xx/gdm_qos.c
+++ b/drivers/staging/gdm72xx/gdm_qos.c
@@ -190,15 +190,15 @@ static int chk_ipv4_rule(struct gdm_wimax_csr_s *csr, u8 
*stream, u8 *port)
 
 static int get_qos_index(struct nic *nic, u8 *iph, u8 *tcpudph)
 {
-   int IP_ver, i;
+   int ip_ver, i;
struct qos_cb_s *qcb = &nic->qos;
 
if (iph == NULL || tcpudph == NULL)
return -1;
 
-   IP_ver = (iph[0]>>4)&0xf;
+   ip_ver = (iph[0]>>4)&0xf;
 
-   if (IP_ver != 4)
+   if (ip_ver != 4)
return -1;
 
for (i = 0; i < QOS_MAX; i++) {
@@ -303,12 +303,12 @@ out:
return ret;
 }
 
-static int get_csr(struct qos_cb_s *qcb, u32 SFID, int mode)
+static int get_csr(struct qos_cb_s *qcb, u32 sfid, int mode)
 {
int i;
 
for (i = 0; i < qcb->qos_list_cnt; i++) {
-   if (qcb->csr[i].SFID == SFID)
+   if (qcb->csr[i].sfid == sfid)
return i;
}
 
@@ -332,7 +332,7 @@ void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int 
size)
 {
struct nic *nic = nic_ptr;
int i, index, pos;
-   u32 SFID;
+   u32 sfid;
u8 sub_cmd_evt;
struct qos_cb_s *qcb = &nic->qos;
struct qos_entry_s *entry, *n;
@@ -345,11 +345,11 @@ void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int 
size)
if (sub_cmd_evt == QOS_REPORT) {
spin_lock_irqsave(&qcb->qos_lock, flags);
for (i = 0; i < qcb->qos_list_cnt; i++) {
-   SFID = ((buf[(i*5)+6]<<24)&0xff00);
-   SFID += ((buf[(i*5)+7]<<16)&0xff);
-   SFID += ((buf[(i*5)+8]<<8)&0xff00);
-   SFID += (buf[(i*5)+9]);
-   index = get_csr(qcb, SFID, 0);
+   sfid = ((buf[(i*5)+6]<<24)&0xff00);
+   sfid += ((buf[(i*5)+7]<<16)&0xff);
+   sfid += ((buf[(i*5)+8]<<8)&0xff00);
+   sfid += (buf[(i*5)+9]);
+   index = get_csr(qcb, sfid, 0);
if (index == -1) {
spin_unlock_irqrestore(&qcb->qos_lock, flags);
netdev_err(nic->netdev, "QoS ERROR: No SF\n");
@@ -366,12 +366,12 @@ void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int 
size)
 
/* sub_cmd_evt == QOS_ADD || sub_cmd_evt == QOS_CHANG_DEL */
pos = 6;
-   SFID = ((buf[pos++]<<24)&0xff00);
-   SFID += ((buf[pos++]<<16)&0xff);
-   SFID += ((buf[pos++]<<8)&0xff00);
-   SFID += (buf[pos++]);
+   sfid = ((buf[pos++]<<24)&0xff00);
+   sfid += ((buf[pos++]<<16)&0xff);
+   sfid += ((buf[pos++]<<8)&0xff00);
+   sfid += (buf[pos++]);
 
-   index = get_csr(qcb, SFID, 1);
+   index = get_csr(qcb, sfid, 1);
if (index == -1) {
netdev_err(nic->netdev,
   "QoS ERROR: csr Update Error / Wrong index (%d)\n",
@@ -381,10 +381,10 @@ void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int 
size)
 
if (sub_cmd_evt == QOS_ADD) {
netdev_dbg(nic->netdev, "QOS_ADD SFID = 0x%x, index=%d\n",
-  SFID, index);
+  sfid, index);
 
spin_lock_irqsave(&qcb->qos_lock, flags);
-   qcb->csr[index].SFID = SFID;
+   qcb->csr[index].sfid = sfid;
qcb->csr[index].classifier_rule_en = ((buf[pos++]<<8)&0xff00);
qcb->csr[index].classifier_rule_en += buf[pos++];
if (qcb->csr[index].classifier_rule_en == 0)
@@ -422,7 +422,7 @@ void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int 
size)
spin_unlock_irqrestore(&qcb->qos_lock, flags);
} else if (sub_cmd_evt == QOS_CHANGE_DEL) {
netdev_dbg(nic->netdev, "QOS_CHANGE_DEL SFID = 0x%x, 
index=%d\n",
-  SFID, index);
+  sfid, index);
 
INIT_LIST_HEAD(&free_list);
 
diff --git a/drivers/staging/gdm72xx/gdm_qos.h 
b/drivers/staging/gdm72xx/gdm_qos.h
index ab03d33..8f742f3 100644
--- a/drivers/staging/gdm72xx/gdm_qos.h
+++ b/drivers/staging/gdm72xx/gdm_qos.h
@@ -33,7 +33,7 @@
 
 struct gdm_wimax_csr_s {
boolenabled;
-   u32 SFID;
+   u32 sfid;
u8  qos_buf_count;
u16 classifier_rule_en;
u8  ip2s_lo;
-- 
2.0.0.526.g5318336

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
Mor

[PATCH 3/4] staging: gdm72xx: use int instead of u32 whenever makes sense

2014-06-27 Thread Ben Chan

This patch addresses the following issues:
- Use int instead of u32 whenever makes sense
- Turn extract_qos_list() in gdm_qos.c, which previously always returned
  0, into a void function.

Reported-by: Dan Carpenter 
Reported-by: Michalis Pappas 
Signed-off-by: Ben Chan 
---
 drivers/staging/gdm72xx/gdm_qos.c | 15 +++
 drivers/staging/gdm72xx/gdm_qos.h |  6 +++---
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/staging/gdm72xx/gdm_qos.c 
b/drivers/staging/gdm72xx/gdm_qos.c
index 732f009..a2efc5c 100644
--- a/drivers/staging/gdm72xx/gdm_qos.c
+++ b/drivers/staging/gdm72xx/gdm_qos.c
@@ -142,7 +142,7 @@ void gdm_qos_release_list(void *nic_ptr)
free_qos_entry_list(&free_list);
 }
 
-static u32 chk_ipv4_rule(struct gdm_wimax_csr_s *csr, u8 *stream, u8 *port)
+static int chk_ipv4_rule(struct gdm_wimax_csr_s *csr, u8 *stream, u8 *port)
 {
int i;
 
@@ -188,9 +188,9 @@ static u32 chk_ipv4_rule(struct gdm_wimax_csr_s *csr, u8 
*stream, u8 *port)
return 0;
 }
 
-static u32 get_qos_index(struct nic *nic, u8 *iph, u8 *tcpudph)
+static int get_qos_index(struct nic *nic, u8 *iph, u8 *tcpudph)
 {
-   u32 IP_ver, i;
+   int IP_ver, i;
struct qos_cb_s *qcb = &nic->qos;
 
if (iph == NULL || tcpudph == NULL)
@@ -213,7 +213,7 @@ static u32 get_qos_index(struct nic *nic, u8 *iph, u8 
*tcpudph)
return -1;
 }
 
-static u32 extract_qos_list(struct nic *nic, struct list_head *head)
+static void extract_qos_list(struct nic *nic, struct list_head *head)
 {
struct qos_cb_s *qcb = &nic->qos;
struct qos_entry_s *entry;
@@ -238,8 +238,6 @@ static u32 extract_qos_list(struct nic *nic, struct 
list_head *head)
if (!list_empty(&qcb->qos_list[i]))
netdev_warn(nic->netdev, "Index(%d) is piled!!\n", i);
}
-
-   return 0;
 }
 
 static void send_qos_list(struct nic *nic, struct list_head *head)
@@ -305,7 +303,7 @@ out:
return ret;
 }
 
-static u32 get_csr(struct qos_cb_s *qcb, u32 SFID, int mode)
+static int get_csr(struct qos_cb_s *qcb, u32 SFID, int mode)
 {
int i;
 
@@ -333,7 +331,8 @@ static u32 get_csr(struct qos_cb_s *qcb, u32 SFID, int mode)
 void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int size)
 {
struct nic *nic = nic_ptr;
-   u32 i, SFID, index, pos;
+   int i, index, pos;
+   u32 SFID;
u8 sub_cmd_evt;
struct qos_cb_s *qcb = &nic->qos;
struct qos_entry_s *entry, *n;
diff --git a/drivers/staging/gdm72xx/gdm_qos.h 
b/drivers/staging/gdm72xx/gdm_qos.h
index 50aa191..ab03d33 100644
--- a/drivers/staging/gdm72xx/gdm_qos.h
+++ b/drivers/staging/gdm72xx/gdm_qos.h
@@ -59,11 +59,11 @@ struct qos_entry_s {
 
 struct qos_cb_s {
struct list_headqos_list[QOS_MAX];
-   u32 qos_list_cnt;
-   u32 qos_null_idx;
+   int qos_list_cnt;
+   int qos_null_idx;
struct gdm_wimax_csr_s  csr[QOS_MAX];
spinlock_t  qos_lock;
-   u32 qos_limit_size;
+   int qos_limit_size;
 };
 
 void gdm_qos_init(void *nic_ptr);
-- 
2.0.0.526.g5318336

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/4] staging: gdm72xx: return -EINVAL instead of BUG_ON for invalid data length

2014-06-27 Thread Ben Chan

This patch changes gdm_usb_send() and gdm_sdio_send() to return -EINVAL instead
of calling BUG_ON if an invalid data length is passed to the functions.

Reported-by: Dan Carpenter 
Reported-by: Michalis Pappas 
Signed-off-by: Ben Chan 
---
 drivers/staging/gdm72xx/gdm_sdio.c | 3 ++-
 drivers/staging/gdm72xx/gdm_usb.c  | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/gdm72xx/gdm_sdio.c 
b/drivers/staging/gdm72xx/gdm_sdio.c
index 0c6a3eb..9d2de6f 100644
--- a/drivers/staging/gdm72xx/gdm_sdio.c
+++ b/drivers/staging/gdm72xx/gdm_sdio.c
@@ -390,7 +390,8 @@ static int gdm_sdio_send(void *priv_dev, void *data, int 
len,
u16 cmd_evt;
unsigned long flags;
 
-   BUG_ON(len > TX_BUF_SIZE - TYPE_A_HEADER_SIZE);
+   if (len > TX_BUF_SIZE - TYPE_A_HEADER_SIZE)
+   return -EINVAL;
 
spin_lock_irqsave(&tx->lock, flags);
 
diff --git a/drivers/staging/gdm72xx/gdm_usb.c 
b/drivers/staging/gdm72xx/gdm_usb.c
index cd8e6e4..971976c 100644
--- a/drivers/staging/gdm72xx/gdm_usb.c
+++ b/drivers/staging/gdm72xx/gdm_usb.c
@@ -312,7 +312,8 @@ static int gdm_usb_send(void *priv_dev, void *data, int len,
return -ENODEV;
}
 
-   BUG_ON(len > TX_BUF_SIZE - padding - 1);
+   if (len > TX_BUF_SIZE - padding - 1)
+   return -EINVAL;
 
spin_lock_irqsave(&tx->lock, flags);
 
-- 
2.0.0.526.g5318336

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/4] staging: gdm72xx: use bool instead of custom-defined BOOLEAN

2014-06-27 Thread Ben Chan

Signed-off-by: Ben Chan 
---
 drivers/staging/gdm72xx/gdm_qos.c | 10 +-
 drivers/staging/gdm72xx/gdm_qos.h |  4 +---
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/staging/gdm72xx/gdm_qos.c 
b/drivers/staging/gdm72xx/gdm_qos.c
index df6f000..732f009 100644
--- a/drivers/staging/gdm72xx/gdm_qos.c
+++ b/drivers/staging/gdm72xx/gdm_qos.c
@@ -100,7 +100,7 @@ void gdm_qos_init(void *nic_ptr)
for (i = 0; i < QOS_MAX; i++) {
INIT_LIST_HEAD(&qcb->qos_list[i]);
qcb->csr[i].qos_buf_count = 0;
-   qcb->csr[i].enabled = 0;
+   qcb->csr[i].enabled = false;
}
 
qcb->qos_list_cnt = 0;
@@ -127,7 +127,7 @@ void gdm_qos_release_list(void *nic_ptr)
 
for (i = 0; i < QOS_MAX; i++) {
qcb->csr[i].qos_buf_count = 0;
-   qcb->csr[i].enabled = 0;
+   qcb->csr[i].enabled = false;
}
 
qcb->qos_list_cnt = 0;
@@ -316,8 +316,8 @@ static u32 get_csr(struct qos_cb_s *qcb, u32 SFID, int mode)
 
if (mode) {
for (i = 0; i < QOS_MAX; i++) {
-   if (qcb->csr[i].enabled == 0) {
-   qcb->csr[i].enabled = 1;
+   if (!qcb->csr[i].enabled) {
+   qcb->csr[i].enabled = true;
qcb->qos_list_cnt++;
return i;
}
@@ -428,7 +428,7 @@ void gdm_recv_qos_hci_packet(void *nic_ptr, u8 *buf, int 
size)
INIT_LIST_HEAD(&free_list);
 
spin_lock_irqsave(&qcb->qos_lock, flags);
-   qcb->csr[index].enabled = 0;
+   qcb->csr[index].enabled = false;
qcb->qos_list_cnt--;
qcb->qos_limit_size = 254/qcb->qos_list_cnt;
 
diff --git a/drivers/staging/gdm72xx/gdm_qos.h 
b/drivers/staging/gdm72xx/gdm_qos.h
index 6543cff..50aa191 100644
--- a/drivers/staging/gdm72xx/gdm_qos.h
+++ b/drivers/staging/gdm72xx/gdm_qos.h
@@ -18,8 +18,6 @@
 #include 
 #include 
 
-#define BOOLEANu8
-
 #define QOS_MAX16
 #define IPTYPEOFSERVICE0x8000
 #definePROTOCOL0x4000
@@ -34,7 +32,7 @@
 #defineIEEE802_1QVLANID0x10
 
 struct gdm_wimax_csr_s {
-   BOOLEAN enabled;
+   boolenabled;
u32 SFID;
u8  qos_buf_count;
u16 classifier_rule_en;
-- 
2.0.0.526.g5318336

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 07/14] bpf: expand BPF syscall with program load/unload

2014-06-27 Thread Alexei Starovoitov

On Fri, Jun 27, 2014 at 5:19 PM, Andy Lutomirski  wrote:
> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  wrote:
>> eBPF programs are safe run-to-completion functions with load/unload
>> methods from userspace similar to kernel modules.
>>
>> User space API:
>>
>> - load eBPF program
>>   prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, 
>> int len)
>>
>>   where 'prog' is a sequence of sections (currently TEXT and LICENSE)
>>   TEXT - array of eBPF instructions
>>   LICENSE - GPL compatible
>> +
>> +   err = -EINVAL;
>> +   /* look for mandatory license string */
>> +   if (!tb[BPF_PROG_LICENSE])
>> +   goto free_attr;
>> +
>> +   /* eBPF programs must be GPL compatible */
>> +   if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
>> +   goto free_attr;
>
> Seriously?  My mind boggles.

Yes. Quite a bit of logic can fit into one eBPF program. I don't think it's wise
to leave this door open for abuse. This check makes it clear that if you
write a program in C, the source code must be available.
If program is written in assembler than this check is nop anyway.

btw this patch doesn't include debugfs access to all loaded eBPF programs.
Similarly to kernel modules I'm planning to have a way to list all loaded
programs with optional assembler dump of instructions.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 04/14] bpf: update MAINTAINERS entry

2014-06-27 Thread Alexei Starovoitov

On Fri, Jun 27, 2014 at 5:18 PM, Joe Perches  wrote:
> Add MAINTAINERS entry.
>
> On Fri, 2014-06-27 at 17:05 -0700, Alexei Starovoitov wrote:
>> diff --git a/MAINTAINERS b/MAINTAINERS
> []
>> @@ -1881,6 +1881,15 @@ S: Supported
>>  F:   drivers/net/bonding/
>>  F:   include/uapi/linux/if_bonding.h
>>
>> +BPF
>
> While a lot of people know what BPF is, I think it'd
> be better to have something like
>
> BPF - SOCKET FILTER (Berkeley Packet Filter like)

BPF is indeed succinct, but 'socket filter' suffix would be misleading,
since it's way more than just socket filtering.
May be: "BPF (Safe dynamic programs and tools)"
since 'perf' will become 'stap/dtrace'-like based on this infra.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

2014-06-27 Thread Alexei Starovoitov

On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski  wrote:
> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  wrote:
>> BPF syscall is a demux for different BPF releated commands.
>>
>> 'maps' is a generic storage of different types for sharing data between 
>> kernel
>> and userspace.
>>
>> The maps can be created/deleted from user space via BPF syscall:
>> - create a map with given id, type and attributes
>>   map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>   returns positive map id or negative error
>>
>> - delete map with given map id
>>   err = bpf_map_delete(int map_id)
>>   returns zero or negative error
>
> What's the scope of "id"?  How is it secured?

the map and program id space is global and it's cap_sys_admin only.
There is no pressing need to do it with per-user limits.
So the whole thing is root only for now.

Since I got your attention please review the most interesting
verifier bits (patch 08/14) ;)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-27 Thread Mike Galbraith

On Fri, 2014-06-27 at 16:24 +0200, Thomas Gleixner wrote:

> Completely untested patch below.

It's no longer completely untested, killer_module is no longer a killer.
I'll let box (lockdep etc is enabled) chew on it a while, no news is
good news as usual.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/3] staging: comedi: addi_apci_1564: clean up apci1564_interrupt()

2014-06-27 Thread Chase Southwood

The code in apci1564_interrupt() for handling counter interrupts is currently
repeated four times; once for each counter.  This code is identical save for the
registers it is using, so just handle all four counters with a for loop.

Also, the interrupt function was doing a useless set-and-check of
devpriv->timer_select_mode before processing any triggered interrupts, remove
all occurrences of this.

Signed-off-by: Chase Southwood 
Cc: Ian Abbott 
Cc: H Hartley Sweeten 
---
Hartley,
I remember that you mentioned that the counters could be handled using a for
loop here.  Is there a better way to go about that or am I on the right track
with this?

Thanks,
Chase

 drivers/staging/comedi/drivers/addi_apci_1564.c | 108 +---
 1 file changed, 23 insertions(+), 85 deletions(-)

diff --git a/drivers/staging/comedi/drivers/addi_apci_1564.c 
b/drivers/staging/comedi/drivers/addi_apci_1564.c
index 0141ed9..f40910e 100644
--- a/drivers/staging/comedi/drivers/addi_apci_1564.c
+++ b/drivers/staging/comedi/drivers/addi_apci_1564.c
@@ -60,8 +60,9 @@ static irqreturn_t apci1564_interrupt(int irq, void *d)
struct comedi_subdevice *s = dev->read_subdev;
unsigned int ui_DO, ui_DI;
unsigned int ui_Timer;
-   unsigned int ui_C1, ui_C2, ui_C3, ui_C4;
+   unsigned int counters[4];
unsigned int ul_Command2 = 0;
+   int i;
 
/* check interrupt is from this device */
if ((inl(devpriv->amcc_iobase + AMCC_OP_REG_INTCSR) &
@@ -73,16 +74,17 @@ static irqreturn_t apci1564_interrupt(int irq, void *d)
   APCI1564_DI_INT_ENABLE;
ui_DO = inl(devpriv->amcc_iobase + APCI1564_DO_IRQ_REG) & 0x01;
ui_Timer = inl(devpriv->amcc_iobase + APCI1564_TIMER_IRQ_REG) & 0x01;
-   ui_C1 =
+   counters[0] =
inl(dev->iobase + APCI1564_TCW_IRQ_REG(APCI1564_COUNTER1)) & 
0x1;
-   ui_C2 =
+   counters[1] =
inl(dev->iobase + APCI1564_TCW_IRQ_REG(APCI1564_COUNTER2)) & 
0x1;
-   ui_C3 =
+   counters[2] =
inl(dev->iobase + APCI1564_TCW_IRQ_REG(APCI1564_COUNTER3)) & 
0x1;
-   ui_C4 =
+   counters[3] =
inl(dev->iobase + APCI1564_TCW_IRQ_REG(APCI1564_COUNTER4)) & 
0x1;
-   if (ui_DI == 0 && ui_DO == 0 && ui_Timer == 0 && ui_C1 == 0
-   && ui_C2 == 0 && ui_C3 == 0 && ui_C4 == 0) {
+   if (ui_DI == 0 && ui_DO == 0 && ui_Timer == 0 && counters[0] == 0
+   && counters[1] == 0 && counters[2] == 0 && counters[3] == 0) {
+   dev_err(dev->class_dev, "Interrupt from unknown source.\n");
return IRQ_HANDLED;
}
 
@@ -113,95 +115,31 @@ static irqreturn_t apci1564_interrupt(int irq, void *d)
}
 
if (ui_Timer == 1) {
-   devpriv->timer_select_mode = ADDIDATA_TIMER;
-   if (devpriv->timer_select_mode) {
+   /*  Disable Timer Interrupt */
+   ul_Command2 = inl(devpriv->amcc_iobase + 
APCI1564_TIMER_CTRL_REG);
+   outl(0x0, devpriv->amcc_iobase + APCI1564_TIMER_CTRL_REG);
 
-   /*  Disable Timer Interrupt */
-   ul_Command2 = inl(devpriv->amcc_iobase + 
APCI1564_TIMER_CTRL_REG);
-   outl(0x0, devpriv->amcc_iobase + 
APCI1564_TIMER_CTRL_REG);
-
-   /* Send a signal to from kernel to user space */
-   send_sig(SIGIO, devpriv->tsk_current, 0);
-
-   /*  Enable Timer Interrupt */
-
-   outl(ul_Command2, devpriv->amcc_iobase + 
APCI1564_TIMER_CTRL_REG);
-   }
-   }
-
-   if (ui_C1 == 1) {
-   devpriv->timer_select_mode = ADDIDATA_COUNTER;
-   if (devpriv->timer_select_mode) {
-
-   /*  Disable Counter Interrupt */
-   ul_Command2 =
-   inl(dev->iobase + 
APCI1564_TCW_CTRL_REG(APCI1564_COUNTER1));
-   outl(0x0,
-dev->iobase + 
APCI1564_TCW_CTRL_REG(APCI1564_COUNTER1));
-
-   /* Send a signal to from kernel to user space */
-   send_sig(SIGIO, devpriv->tsk_current, 0);
-
-   /*  Enable Counter Interrupt */
-   outl(ul_Command2,
-dev->iobase + 
APCI1564_TCW_CTRL_REG(APCI1564_COUNTER1));
-   }
-   }
-
-   if (ui_C2 == 1) {
-   devpriv->timer_select_mode = ADDIDATA_COUNTER;
-   if (devpriv->timer_select_mode) {
-
-   /*  Disable Counter Interrupt */
-   ul_Command2 =
-   inl(dev->iobase + 
APCI1564_TCW_CTRL_REG(APCI1564_COUNTER2));
-   outl(0x0,
-dev->iobase + 
APCI1564_TCW_CTRL_REG(APCI1564_COUNTER2));
-
-   /* Send a signal to from kernel to user space */
-

[PATCH 2/3] staging: comedi: addi_apci_1564: fix use of apci1564_reset() to disable DI interrupts

2014-06-27 Thread Chase Southwood

apci1564_cos_insn_config() is currently using apci1564_reset() to disable
digital input interrupts when the configuration operation is
COMEDI_DIGITAL_TRIG_DISABLE.  However, this is incorrect as the device reset
function also resets the registers for the digital outputs, timer, watchdog, and
counters as well.  Replace the reset function call with a direct disabling of
just the digital input interrupts.

Signed-off-by: Chase Southwood 
Cc: Ian Abbott 
Cc: H Hartley Sweeten 
---
 drivers/staging/comedi/drivers/addi_apci_1564.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/comedi/drivers/addi_apci_1564.c 
b/drivers/staging/comedi/drivers/addi_apci_1564.c
index 59786e7..0141ed9 100644
--- a/drivers/staging/comedi/drivers/addi_apci_1564.c
+++ b/drivers/staging/comedi/drivers/addi_apci_1564.c
@@ -285,7 +285,10 @@ static int apci1564_cos_insn_config(struct comedi_device 
*dev,
devpriv->ctrl = 0;
devpriv->mode1 = 0;
devpriv->mode2 = 0;
-   apci1564_reset(dev);
+   outl(0x0, devpriv->amcc_iobase + APCI1564_DI_IRQ_REG);
+   inl(devpriv->amcc_iobase + APCI1564_DI_INT_STATUS_REG);
+   outl(0x0, devpriv->amcc_iobase + 
APCI1564_DI_INT_MODE1_REG);
+   outl(0x0, devpriv->amcc_iobase + 
APCI1564_DI_INT_MODE2_REG);
break;
case COMEDI_DIGITAL_TRIG_ENABLE_EDGES:
if (devpriv->ctrl != (APCI1564_DI_INT_ENABLE |
-- 
2.0.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/3] staging: comedi: addi_apci_1564: miscellaneous fixes and cleanups

2014-06-27 Thread Chase Southwood

This patchset moves a misplaced include to the proper file, swaps out an overly
aggressive placement of apci1564_reset(), and cleans up apci1564_interrupt().

Chase Southwood (3):
  staging: comedi: addi_apci_1564: move addi_watchdog.h include to
addi_apci_1564.c
  staging: comedi: addi_apci_1564: fix use of apci1564_reset() to
disable DI interrupts
  staging: comedi: addi_apci_1564: clean up apci1564_interrupt()

 .../comedi/drivers/addi-data/hwdrv_apci1564.c  |   2 -
 drivers/staging/comedi/drivers/addi_apci_1564.c| 114 +
 2 files changed, 28 insertions(+), 88 deletions(-)

-- 
2.0.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] staging: dgnc_driver.c: code style fixes

2014-06-27 Thread Guillaume Morin

From: Guillaume Morin 

Simple code style fixes

Signed-off-by: Guillaume Morin 
---
 drivers/staging/dgnc/dgnc_driver.c |   11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_driver.c 
b/drivers/staging/dgnc/dgnc_driver.c
index d52a9e8..68460af 100644
--- a/drivers/staging/dgnc/dgnc_driver.c
+++ b/drivers/staging/dgnc/dgnc_driver.c
@@ -88,8 +88,7 @@ module_exit(dgnc_cleanup_module);
 /*
  * File operations permitted on Control/Management major.
  */
-static const struct file_operations dgnc_BoardFops =
-{
+static const struct file_operations dgnc_BoardFops = {
.owner  =   THIS_MODULE,
.unlocked_ioctl =   dgnc_mgmt_ioctl,
.open   =   dgnc_mgmt_open,
@@ -407,7 +406,7 @@ static void dgnc_cleanup_board(struct dgnc_board *brd)
 {
int i = 0;
 
-   if(!brd || brd->magic != DGNC_BOARD_MAGIC)
+   if (!brd || brd->magic != DGNC_BOARD_MAGIC)
return;
 
switch (brd->device) {
@@ -480,7 +479,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
/* get the board structure and prep it */
brd = dgnc_Board[dgnc_NumBoards] =
kzalloc(sizeof(*brd), GFP_KERNEL);
-   if (!brd) 
+   if (!brd)
return -ENOMEM;
 
/* make a temporary message buffer for the boot messages */
@@ -523,7 +522,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
brd->irq = pci_irq;
 
 
-   switch(brd->device) {
+   switch (brd->device) {
 
case PCI_DEVICE_CLASSIC_4_DID:
case PCI_DEVICE_CLASSIC_8_DID:
@@ -887,7 +886,7 @@ int dgnc_ms_sleep(ulong ms)
  */
 char *dgnc_ioctl_name(int cmd)
 {
-   switch(cmd) {
+   switch (cmd) {
 
case TCGETA:return "TCGETA";
case TCGETS:return "TCGETS";
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/3] staging: comedi: addi_apci_1564: move addi_watchdog.h include to addi_apci_1564.c

2014-06-27 Thread Chase Southwood

Commit aed3f9d (staging: comedi: addi_apci_1564: absorb apci1564_reset()) moved
the only use of addi_watchdog.h from hwdrv_apci1564.c to addi_apci_1564.c, but
left the include statement itself in the former file.  Move this include to the
file which actually uses it.

Signed-off-by: Chase Southwood 
Cc: Ian Abbott 
Cc: H Hartley Sweeten 
---
 drivers/staging/comedi/drivers/addi-data/hwdrv_apci1564.c | 2 --
 drivers/staging/comedi/drivers/addi_apci_1564.c   | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/staging/comedi/drivers/addi-data/hwdrv_apci1564.c 
b/drivers/staging/comedi/drivers/addi-data/hwdrv_apci1564.c
index 4007fd2..7326f3a 100644
--- a/drivers/staging/comedi/drivers/addi-data/hwdrv_apci1564.c
+++ b/drivers/staging/comedi/drivers/addi-data/hwdrv_apci1564.c
@@ -21,8 +21,6 @@
  *
  */
 
-#include "../addi_watchdog.h"
-
 #define APCI1564_ADDRESS_RANGE 128
 
 /* Digital Input IRQ Function Selection */
diff --git a/drivers/staging/comedi/drivers/addi_apci_1564.c 
b/drivers/staging/comedi/drivers/addi_apci_1564.c
index f71ee02..59786e7 100644
--- a/drivers/staging/comedi/drivers/addi_apci_1564.c
+++ b/drivers/staging/comedi/drivers/addi_apci_1564.c
@@ -4,6 +4,7 @@
 #include "../comedidev.h"
 #include "comedi_fc.h"
 #include "amcc_s5933.h"
+#include "addi_watchdog.h"
 
 #include "addi-data/addi_common.h"
 
-- 
2.0.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] FIXME of file toploogy.h for alpha cpus

2014-06-27 Thread Nicholas Krause

This patch fixs the FIXME message in the function *cpumask_of_node
for using this function multiple times and the issue with recaluting
the cpu node mask when reusing this function.

Signed-off-by: Nicholas Krause 
---
 arch/alpha/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/alpha/include/asm/topology.h 
b/arch/alpha/include/asm/topology.h
index 9251e13..d301f66 100644
--- a/arch/alpha/include/asm/topology.h
+++ b/arch/alpha/include/asm/topology.h
@@ -31,6 +31,9 @@ static const struct cpumask *cpumask_of_node(int node)
if (node == -1)
return cpu_all_mask;
 
+   else if (node == &node_to_cpumask_map[node])
+   return &node_to_cpumask_map[node];
+
cpumask_clear(&node_to_cpumask_map[node]);
 
for_each_online_cpu(cpu) {
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-27 Thread Mike Galbraith

On Fri, 2014-06-27 at 18:18 -0700, Austin Schuh wrote:

> It would be more context switches, but I wonder if we could kick the
> workqueue logic completely out of the scheduler into a thread.  Have
> the scheduler increment/decrement an atomic pool counter, and wake up
> the monitoring thread to spawn new threads when needed?  That would
> get rid of the recursive pool lock problem, and should reduce
> scheduler latency if we would need to spawn a new thread.

I was wondering the same thing, and not only for workqueue, but also the
plug pulling.  It's kind of a wart to have that stuff sitting in the
hear of the scheduler in the first place, would be nice if it just went
away.  When a task can't help itself, you _could_ wake a proxy do that
for you.  Trouble is, I can imagine that being a heck of a lot of
context switches with some loads.. and who's gonna help the helper when
he blocks while trying to help?

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin

On 27 Jun 22:37, Greg Kroah-Hartman wrote:
> Put that below the --- line.

Will do.

> > > And what checkpatch error did you fix?  And are you sure it needs to be
> > > fixed?
> > 
> > That's what I changed:
> > 
> > $ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
> > ERROR: Macros with complex values should be enclosed in parenthesis
> 
> Then why didn't you say that :)

Well it was not totally clear to me if that was obvious or not.  Anyway,
I'll mention it in the future.

> 
> > I assumed that if it was reported as an error, it needed to be fixed...
> 
> Use your judgement, checkpatch is a tool, it isn't always correct.

Right, I guess it's borderline.  Should I resend the patch or just drop
it?

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 16/18] perf tools: Add debug prints for ordered events queue

2014-06-27 Thread David Ahern


On 6/18/14, 8:58 AM, Jiri Olsa wrote:

Adding some prints for ordered events queue, to help
debug issues.


went to enable this and it is really odd to have to edit a config file 
to enable debugging. How about hooking it into verbose option? Maybe 
like multiple levels of -v or -v  or -v queue.


David
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Cleanup of Kernel Bugzilla

2014-06-27 Thread Nick Krause

Do any of you use the kernel Bugzilla? If you do I was wondering if we
can clean it up.
Otherwise I was wondering were  I can get an accurate list of open
bugs in the newest
kernels.
Cheers Nick


On Fri, Jun 27, 2014 at 2:11 PM, Nick Krause  wrote:
> Hey fellow developers
> I seem to be finding lots of bugs on the kernel  Bugzilla that are now
> fixed , it would be great if the maintainers or bug reporters closed them.
>  In addition most of them , seem to from the years 2011 -2013. I have
> searched through assigned ,reopened ,need info and new bug states on
> the kernel Bugzilla . The bugs are up to date on assigned but the other
> open states for bugs  need to be cleaned up a lot. It would be great if
> when you and the other maintainers  have time if the bugs that are
> fixed will be closed from these years that are now resolved.
> Cheers ,
> Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Greg Kroah-Hartman

On Sat, Jun 28, 2014 at 04:30:09AM +0200, Guillaume Morin wrote:
> On 27 Jun 19:09, Greg Kroah-Hartman wrote:
> > > v2: add missing Signed-off-by 
> > 
> > That doesn't go here.
> 
> I guess I am struggling to get git send-email do what I want

Put that below the --- line.

> > And what checkpatch error did you fix?  And are you sure it needs to be
> > fixed?
> 
> That's what I changed:
> 
> $ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
> ERROR: Macros with complex values should be enclosed in parenthesis

Then why didn't you say that :)

> I assumed that if it was reported as an error, it needed to be fixed...

Use your judgement, checkpatch is a tool, it isn't always correct.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin

On 27 Jun 19:09, Greg Kroah-Hartman wrote:
> > v2: add missing Signed-off-by 
> 
> That doesn't go here.

I guess I am struggling to get git send-email do what I want

> And what checkpatch error did you fix?  And are you sure it needs to be
> fixed?

That's what I changed:

$ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
ERROR: Macros with complex values should be enclosed in parenthesis
#24: FILE: drivers/staging/iio/frequency/ad9850.c:24:
+#define value_mask (u16)0xf000

I assumed that if it was reported as an error, it needed to be fixed...

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Greg Kroah-Hartman

On Sat, Jun 28, 2014 at 03:46:56AM +0200, Guillaume Morin wrote:
> v2: add missing Signed-off-by 

That doesn't go here.

And what checkpatch error did you fix?  And are you sure it needs to be
fixed?

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull] IOMMU Fixes for Linux v3.16-rc2

2014-06-27 Thread Linus Torvalds

Joerg,
 this email was in my spam-box. No real indication as to why, although
the usual suspect is

   Received-SPF: none (google.com: j...@8bytes.org does not designate
permitted sender hosts) client-ip=85.214.48.195;

presumably together with some other trigger that makes gmail unhappy.

Anyway, pulled,

  Linus

On Tue, Jun 24, 2014 at 1:54 AM, Joerg Roedel  wrote:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
> tags/iommu-fixes-v3.16-rc1
>
> IOMMU Fixes for Linux v3.16-rc1
>
> * Fix VT-d regression with handling multiple RMRR entries per
>   device
> * Fix a small race that was left in the mmu_notifier handling in
>   the AMD IOMMUv2 driver
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.

2014-06-27 Thread Jérôme Glisse

From: Jérôme Glisse 

Several subsystem require a callback when a mm struct is being destroy
so that they can cleanup there respective per mm struct. Instead of
having each subsystem add its callback to mmput use a notifier chain
to call each of the subsystem.

This will allow new subsystem to register callback even if they are
module. There should be no contention on the rw semaphore protecting
the call chain and the impact on the code path should be low and
burried in the noise.

Note that this patch also move the call to cleanup functions after
exit_mmap so that new call back can assume that mmu_notifier_release
have already been call. This does not impact existing cleanup functions
as they do not rely on anything that exit_mmap is freeing. Also moved
khugepaged_exit to exit_mmap so that ordering is preserved for that
function.

Signed-off-by: Jérôme Glisse 
---
 fs/aio.c| 29 ++---
 include/linux/aio.h |  2 --
 include/linux/ksm.h | 11 ---
 include/linux/sched.h   |  5 +
 include/linux/uprobes.h |  1 -
 kernel/events/uprobes.c | 19 ---
 kernel/fork.c   | 22 ++
 mm/ksm.c| 26 +-
 mm/mmap.c   |  3 +++
 9 files changed, 85 insertions(+), 33 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c1d8c48..1d06e92 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
 EXPORT_SYMBOL(wait_on_sync_kiocb);
 
 /*
- * exit_aio: called when the last user of mm goes away.  At this point, there 
is
+ * aio_exit: called when the last user of mm goes away.  At this point, there 
is
  * no way for any new requests to be submited or any of the io_* syscalls to be
  * called on the context.
  *
  * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
  * them.
  */
-void exit_aio(struct mm_struct *mm)
+static int aio_exit(struct notifier_block *nb,
+   unsigned long action, void *data)
 {
+   struct mm_struct *mm = data;
struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
int i;
 
if (!table)
-   return;
+   return 0;
 
for (i = 0; i < table->nr; ++i) {
struct kioctx *ctx = table->table[i];
@@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
continue;
/*
 * We don't need to bother with munmap() here - exit_mmap(mm)
-* is coming and it'll unmap everything. And we simply can't,
-* this is not necessarily our ->mm.
-* Since kill_ioctx() uses non-zero ->mmap_size as indicator
-* that it needs to unmap the area, just set it to 0.
+* have already been call and everything is unmap by now. But
+* to be safe set ->mmap_size to 0 since aio_free_ring() uses
+* non-zero ->mmap_size as indicator that it needs to unmap the
+* area.
 */
ctx->mmap_size = 0;
kill_ioctx(mm, ctx, NULL);
@@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
 
RCU_INIT_POINTER(mm->ioctx_table, NULL);
kfree(table);
+   return 0;
 }
 
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
@@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
}
return ret;
 }
+
+static struct notifier_block aio_mmput_nb = {
+   .notifier_call  = aio_exit,
+   .priority   = 1,
+};
+
+static int __init aio_init(void)
+{
+   return mmput_register_notifier(&aio_mmput_nb);
+}
+subsys_initcall(aio_init);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..6308fac 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
 extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
 extern void aio_complete(struct kiocb *iocb, long res, long res2);
 struct mm_struct;
-extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 struct iocb __user *__user *iocbpp, bool compat);
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
@@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn 
*cancel);
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
 struct mm_struct;
-static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
struct iocb __user * __user *iocbpp,
bool compat) { return 0; }
diff --git a/include/linux/ks

[PATCH 2/6] mm: differentiate unmap for vmscan from other unmap.

2014-06-27 Thread Jérôme Glisse

From: Jérôme Glisse 

New code will need to be able to differentiate between a regular unmap and
an unmap trigger by vmscan in which case we want to be as quick as possible.

Signed-off-by: Jérôme Glisse 
---
 include/linux/rmap.h | 15 ---
 mm/memory-failure.c  |  2 +-
 mm/vmscan.c  |  4 ++--
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index be57450..eddbc07 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -72,13 +72,14 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-   TTU_UNMAP = 1,  /* unmap mode */
-   TTU_MIGRATION = 2,  /* migration mode */
-   TTU_MUNLOCK = 4,/* munlock mode */
-
-   TTU_IGNORE_MLOCK = (1 << 8),/* ignore mlock */
-   TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
-   TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+   TTU_VMSCAN = 1, /* unmap for vmscan */
+   TTU_POISON = 2, /* unmap for poison */
+   TTU_MIGRATION = 4,  /* migration mode */
+   TTU_MUNLOCK = 8,/* munlock mode */
+
+   TTU_IGNORE_MLOCK = (1 << 9),/* ignore mlock */
+   TTU_IGNORE_ACCESS = (1 << 10),  /* don't age */
+   TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a7a89eb..ba176c4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page 
*p,
 static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
  int trapno, int flags, struct page **hpagep)
 {
-   enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+   enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
int ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d24fd6..5a7d286 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1163,7 +1163,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone 
*zone,
}
 
ret = shrink_page_list(&clean_pages, zone, &sc,
-   TTU_UNMAP|TTU_IGNORE_ACCESS,
+   TTU_VMSCAN|TTU_IGNORE_ACCESS,
&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
list_splice(&clean_pages, page_list);
mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
@@ -1518,7 +1518,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
if (nr_taken == 0)
return 0;
 
-   nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+   nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
&nr_writeback, &nr_immediate,
false);
-- 
1.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

mm preparatory patches for HMM and IOMMUv2

2014-06-27 Thread Jérôme Glisse

Andrew so here are a set of mm patch that do some ground modification to core
mm code. They apply on top of today's linux-next and they pass checkpatch.pl
with flying color (except patch 4 but i did not wanted to be a nazi about 80
char line).

Patch 1 is the mmput notifier call chain we discussed with AMD.

Patch 2, 3 and 4 are so far only useful to HMM but i am discussing with AMD and
i believe it will be useful to them to (in the context of IOMMUv2).

Patch 2 allows to differentiate page unmap for vmscan reason or for poisoning.

Patch 3 associate mmu_notifier with an event type allowing to take different 
code
path inside mmu_notifier callback depending on what is currently happening to 
the
cpu page table. There is no functional change, it just add a new argument to the
various mmu_notifier calls and callback.

Patch 4 pass along the vma into which the range invalidation is happening. There
is few functional changes in place where mmu_notifier_range_invalidate_start/end
used [0, -1] as range, instead now those place call the notifier once for each
vma. This might prove to add unwanted overhead hence why i did it as a separate
patch.

I did not include the core hmm patch but i intend to send a v4 next week. So i
really would like to see those included for next release.

As usual comments welcome.

Cheers,
Jérôme Glisse

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page

2014-06-27 Thread Jérôme Glisse

From: Jérôme Glisse 

New user of the mmu_notifier interface need to lookup vma in order to
perform the invalidation operation. Instead of redoing a vma lookup
inside the callback just pass through the vma from the call site where
it is already available.

This needs small refactoring in memory.c to call invalidate_range on
vma boundary the overhead should be low enough.

Signed-off-by: Jérôme Glisse 
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
 drivers/iommu/amd_iommu_v2.c|  3 +++
 drivers/misc/sgi-gru/grutlbpurge.c  |  6 -
 drivers/xen/gntdev.c|  4 +++-
 fs/proc/task_mmu.c  | 16 -
 include/linux/mmu_notifier.h| 19 ---
 kernel/events/uprobes.c |  4 ++--
 mm/filemap_xip.c|  3 ++-
 mm/huge_memory.c| 26 ++--
 mm/hugetlb.c| 16 ++---
 mm/ksm.c|  8 +++
 mm/memory.c | 42 +
 mm/migrate.c|  6 ++---
 mm/mmu_notifier.c   |  9 ---
 mm/mprotect.c   |  5 ++--
 mm/mremap.c |  4 ++--
 mm/rmap.c   |  9 +++
 virt/kvm/kvm_main.c |  3 +++
 18 files changed, 116 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/i915_gem_userptr.c
index ed6f35e..191ac71 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -55,6 +55,7 @@ struct i915_mmu_object {
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier 
*_mn,
   struct mm_struct *mm,
+  struct vm_area_struct 
*vma,
   unsigned long start,
   unsigned long end,
   enum mmu_event event)
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 2bb9771..9f9e706 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -422,6 +422,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
   struct mm_struct *mm,
+  struct vm_area_struct *vma,
   unsigned long address,
   enum mmu_event event)
 {
@@ -430,6 +431,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
  struct mm_struct *mm,
+ struct vm_area_struct *vma,
  unsigned long start,
  unsigned long end,
  enum mmu_event event)
@@ -453,6 +455,7 @@ static void mn_invalidate_range_start(struct mmu_notifier 
*mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
+   struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
enum mmu_event event)
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c 
b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..d02e4c7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
   struct mm_struct *mm,
+  struct vm_area_struct *vma,
   unsigned long start, unsigned long end,
   enum mmu_event event)
 {
@@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier 
*mn,
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-struct mm_struct *mm, unsigned long start,
+struct mm_struct *mm,
+struct vm_area_struct *vma,
+unsigned long start,
 unsigned long end,
 enum mmu_event event)
 {
@@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier 
*mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
+   struct vm_area_struct *vma,

[PATCH 3/6] mmu_notifier: add event information to address invalidation v2

2014-06-27 Thread Jérôme Glisse

From: Jérôme Glisse 

The event information will be usefull for new user of mmu_notifier API.
The event argument differentiate between a vma disappearing, a page
being write protected or simply a page being unmaped. This allow new
user to take different path for different event for instance on unmap
the resource used to track a vma are still valid and should stay around.
While if the event is saying that a vma is being destroy it means that any
resources used to track this vma can be free.

Changed since v1:
  - renamed action into event (updated commit message too).
  - simplified the event names and clarified their intented usage
also documenting what exceptation the listener can have in
respect to each event.

Signed-off-by: Jérôme Glisse 
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/iommu/amd_iommu_v2.c|  14 ++--
 drivers/misc/sgi-gru/grutlbpurge.c  |   9 ++-
 drivers/xen/gntdev.c|   9 ++-
 fs/proc/task_mmu.c  |   6 +-
 include/linux/hugetlb.h |   7 +-
 include/linux/mmu_notifier.h| 117 ++--
 kernel/events/uprobes.c |  10 ++-
 mm/filemap_xip.c|   2 +-
 mm/huge_memory.c|  51 --
 mm/hugetlb.c|  25 ---
 mm/ksm.c|  18 +++--
 mm/memory.c |  27 +---
 mm/migrate.c|   9 ++-
 mm/mmu_notifier.c   |  28 +---
 mm/mprotect.c   |  33 ++---
 mm/mremap.c |   6 +-
 mm/rmap.c   |  24 +--
 virt/kvm/kvm_main.c |  12 ++--
 19 files changed, 291 insertions(+), 119 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 21ea928..ed6f35e 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -56,7 +56,8 @@ struct i915_mmu_object {
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier 
*_mn,
   struct mm_struct *mm,
   unsigned long start,
-  unsigned long end)
+  unsigned long end,
+  enum mmu_event event)
 {
struct i915_mmu_notifier *mn = container_of(_mn, struct 
i915_mmu_notifier, mn);
struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 499b436..2bb9771 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
  struct mm_struct *mm,
  unsigned long address,
- pte_t pte)
+ pte_t pte,
+ enum mmu_event event)
 {
__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
   struct mm_struct *mm,
-  unsigned long address)
+  unsigned long address,
+  enum mmu_event event)
 {
__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
  struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end,
+ enum mmu_event event)
 {
struct pasid_state *pasid_state;
struct device_state *dev_state;
@@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier 
*mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
struct mm_struct *mm,
-   unsigned long start, unsigned long end)
+   unsigned long start,
+   unsigned long end,
+   enum mmu_event event)
 {
struct pasid_state *pasid_state;
struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c 
b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
   struct mm_struct *mm,
-

[PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin

v2: add missing Signed-off-by 

Signed-off-by: Guillaume Morin 

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..6183670 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,7 +21,7 @@
 
 #define DRV_NAME "ad9850"
 
-#define value_mask (u16)0xf000
+#define value_mask ((u16)0xf000)
 #define addr_shift 12
 
 /* Register format: 4 bits addr + 12 bits value */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] drm/gk20a: add BAR instance

2014-06-27 Thread Ken Adams


On 6/27/14 8:56 PM, "Ben Skeggs"  wrote:

>On Sat, Jun 28, 2014 at 4:51 AM, Ken Adams  wrote:
>> quick note re: tegra and gpu bars...
>>
>> to this point we've explicitly avoided providing user-mode mappings due
>>to
>> power management issues, etc.
>> looks to me like this would allow such mappings.  is that the case?  are
>> there any paths which would require such mappings to function properly?

>What power management issues are you worried about in particular?  We
>have these concerns on discrete cards too, when doing things like
>changing vram frequencies.  TTM is able to kick out all userspace
>mappings, and clients will then block in the fault handler until it's
>safe - if they touch the mappings.
>
>Ben.


hi ben,

primarily it's the access problem you mentioned.  managing those mappings,
and kicking them out at best adds to the latency to take down power/detach
busii and the like.

and, generally, there are very few (if any) cases where there isn't a
better way to manipulate the pixels than with the cpu :) i understand
there are plenty of paths i don't know about hereŠ and so i asked.

it's a solvable problem, of course.  but especially in the mobile world it
can pop up unexpectedly.  typically on someone's perf/power/stress tests :)

---
ken





>
>>
>> thanks
>> ---
>> ken
>>
>> p.s.: hello :)
>>
>> On 6/27/14 7:36 AM, "Alex Courbot"  wrote:
>>
>>>GK20A's BAR is functionally identical to NVC0's, but do not support
>>>being ioremapped write-combined. Create a BAR instance for GK20A that
>>>reflect that state.
>>>
>>>Signed-off-by: Alexandre Courbot 
>>>---
>>>Changes since v1:
>>>- Fix compilation warning due to missing cast
>>>
>>>Patch 1 of the series was ok and thus has not been resent.
>>>
>>> drivers/gpu/drm/nouveau/Makefile  |  1 +
>>> drivers/gpu/drm/nouveau/core/engine/device/nve0.c |  2 +-
>>> drivers/gpu/drm/nouveau/core/include/subdev/bar.h |  1 +
>>> drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c   | 54
>>>+++
>>> drivers/gpu/drm/nouveau/core/subdev/bar/nvc0.c|  6 +--
>>> drivers/gpu/drm/nouveau/core/subdev/bar/priv.h|  6 +++
>>> 6 files changed, 66 insertions(+), 4 deletions(-)
>>> create mode 100644 drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>>
>>>diff --git a/drivers/gpu/drm/nouveau/Makefile
>>>b/drivers/gpu/drm/nouveau/Makefile
>>>index 8b307e143632..11d9561d67c1 100644
>>>--- a/drivers/gpu/drm/nouveau/Makefile
>>>+++ b/drivers/gpu/drm/nouveau/Makefile
>>>@@ -26,6 +26,7 @@ nouveau-y += core/core/subdev.o
>>> nouveau-y += core/subdev/bar/base.o
>>> nouveau-y += core/subdev/bar/nv50.o
>>> nouveau-y += core/subdev/bar/nvc0.o
>>>+nouveau-y += core/subdev/bar/gk20a.o
>>> nouveau-y += core/subdev/bios/base.o
>>> nouveau-y += core/subdev/bios/bit.o
>>> nouveau-y += core/subdev/bios/boost.o
>>>diff --git a/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>>b/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>>index 2d1e97d4264f..a2b9ccc48f66 100644
>>>--- a/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>>+++ b/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>>@@ -165,7 +165,7 @@ nve0_identify(struct nouveau_device *device)
>>>   device->oclass[NVDEV_SUBDEV_IBUS   ] =
>>>&gk20a_ibus_oclass;
>>>   device->oclass[NVDEV_SUBDEV_INSTMEM] =
>>>nv50_instmem_oclass;
>>>   device->oclass[NVDEV_SUBDEV_VM ] =
>>>&nvc0_vmmgr_oclass;
>>>-  device->oclass[NVDEV_SUBDEV_BAR] = &nvc0_bar_oclass;
>>>+  device->oclass[NVDEV_SUBDEV_BAR] = &gk20a_bar_oclass;
>>>   device->oclass[NVDEV_ENGINE_DMAOBJ ] =
>>>&nvd0_dmaeng_oclass;
>>>   device->oclass[NVDEV_ENGINE_FIFO   ] =
>>>gk20a_fifo_oclass;
>>>   device->oclass[NVDEV_ENGINE_SW ] =
>>>nvc0_software_oclass;
>>>diff --git a/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>>b/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>>index 9002cbb6432b..be037fac534c 100644
>>>--- a/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>>+++ b/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>>@@ -33,5 +33,6 @@ nouveau_bar(void *obj)
>>>
>>> extern struct nouveau_oclass nv50_bar_oclass;
>>> extern struct nouveau_oclass nvc0_bar_oclass;
>>>+extern struct nouveau_oclass gk20a_bar_oclass;
>>>
>>> #endif
>>>diff --git a/drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>>b/drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>>new file mode 100644
>>>index ..bf877af9d3bd
>>>--- /dev/null
>>>+++ b/drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>>@@ -0,0 +1,54 @@
>>>+/*
>>>+ * Copyright (c) 2014, NVIDIA CORPORATION. All rights reserved.
>>>+ *
>>>+ * Permission is hereby granted, free of charge, to any person
>>>obtaining
>>>a
>>>+ * copy of this software and associated documentation files (the
>>>"Software"),
>>>+ * to deal in the Software without restriction, including without
>>>limitation
>>>+ * the rights to use, copy, modify, merge, publish, distribute,
>>>sublicense,
>>>+ * and/or sell

[PATCH 2/2] MAINTAINERS: exceptions for Documentation maintainer

2014-06-27 Thread Randy Dunlap

From: Randy Dunlap 

Note that I don't maintain Documentation/ABI/,
Documentation/devicetree/, or the language translation files.

Signed-off-by: Randy Dunlap 
---
 MAINTAINERS |3 +++
 1 file changed, 3 insertions(+)

Index: lnx-315-rc5/MAINTAINERS
===
--- lnx-315-rc5.orig/MAINTAINERS
+++ lnx-315-rc5/MAINTAINERS
@@ -2886,6 +2886,9 @@ L:linux-...@vger.kernel.org
 T: quilt http://www.infradead.org/~rdunlap/Doc/patches/
 S: Maintained
 F: Documentation/
+X: Documentation/ABI/
+X: Documentation/devicetree/
+X: Documentation/[a-z][a-z]_[A-Z][A-Z]/
 
 DOUBLETALK DRIVER
 M: "James R. Van Zandt" 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] Documentation: add section about git to email-clients.txt

2014-06-27 Thread Randy Dunlap

From: Dan Carpenter 

These days most people use git to send patches so I have added a section
about that.

Signed-off-by: Dan Carpenter 
Signed-off-by: Randy Dunlap 
---
v2: fix typo in commit message
v3: update git am and log commands.  Mention the man pages.
v4: s/list/appropriate mailing list(s)/

---
 Documentation/email-clients.txt |   11 +++
 1 file changed, 11 insertions(+)

--- lnx-315-rc5.orig/Documentation/email-clients.txt
+++ lnx-315-rc5/Documentation/email-clients.txt
@@ -1,6 +1,17 @@
 Email clients info for Linux
 ==
 
+Git
+--
+These days most developers use `git send-email` instead of regular
+email clients.  The man page for this is quite good.  On the receiving
+end, maintainers use `git am` to apply the patches.
+
+If you are new to git then send your first patch to yourself.  Save it
+as raw text including all the headers.  Run `git am raw_email.txt` and
+then review the changelog with `git log`.  When that works then send
+the patch to the appropriate mailing list(s).
+
 General Preferences
 --
 Patches for the Linux kernel are submitted via email, preferably as
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin


diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..6183670 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,7 +21,7 @@
 
 #define DRV_NAME "ad9850"
 
-#define value_mask (u16)0xf000
+#define value_mask ((u16)0xf000)
 #define addr_shift 12
 
 /* Register format: 4 bits addr + 12 bits value */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[GIT PULL] Another Compression bugfixe for 3.16-rc3

2014-06-27 Thread Greg KH

The following changes since commit 206204a1162b995e2185275167b22468c00d6b36:

  lz4: ensure length does not wrap (2014-06-23 14:12:01 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/ 
tags/compress-3.16-rc3

for you to fetch changes up to 4148c1f67abf823099b2d7db6851e4aea407f5ee:

  lz4: fix another possible overrun (2014-06-27 11:21:07 -0700)


Compress bugfix for 3.16-rc3

Here is another lz4 bugfix for 3.16-rc3 that resolves a reported issue
with that compression algorithm.

Signed-off-by: Greg Kroah-Hartman 


Greg Kroah-Hartman (1):
  lz4: fix another possible overrun

 lib/lz4/lz4_decompress.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v8 4/4] printk: allow increasing the ring buffer depending on the number of CPUs

2014-06-27 Thread Luis R. Rodriguez

On Fri, Jun 27, 2014 at 04:59:14PM -0700, Andrew Morton wrote:
> On Thu, 26 Jun 2014 16:32:15 -0700 "Luis R. Rodriguez"  
> wrote:
> 
> > On Thu, Jun 26, 2014 at 4:20 PM, Andrew Morton
> >  wrote:
> > > On Fri, 27 Jun 2014 01:16:30 +0200 "Luis R. Rodriguez"  
> > > wrote:
> > >
> > >> > > Another note --  since this option depends on SMP and !BASE_SMALL 
> > >> > > technically
> > >> > > num_possible_cpus() won't ever return something smaller than or 
> > >> > > equal to 1
> > >> > > but because of the default values chosen the -1 on the compuation 
> > >> > > does affect
> > >> > > whether or not this will trigger on > 64 CPUs or >= 64 CPUs, keeping 
> > >> > > the
> > >> > > -1 means we require > 64 CPUs.
> > >> >
> > >> > hm, that sounds like more complexity.
> > >> >
> > >> > > This all can be changed however we like but the language and 
> > >> > > explained logic
> > >> > > would just need to be changed.
> > >> >
> > >> > Let's start out simple.  What's wrong with doing
> > >> >
> > >> > log buf len = max(__LOG_BUF_LEN, nr_possible_cpus * per-cpu log 
> > >> > buf len)
> > >>
> > >> Sure, you already took in the patch series though so how would you like 
> > >> to
> > >> handle a respin, you just drop the last patch and we respin it?
> > >
> > > A fresh patch would suit.  That's if you think it is a reasonable
> > > approach - you've thought about this stuff more than I have!
> > 
> > The way its implemented now makes more technical sense, in short it
> > assumes the first boot (and CPU) gets the full default kernel ring
> > buffer size, the extra size is for the gibberish that each extra CPU
> > is expected to spew out in the worst case. What you propose makes the
> > explanation simpler and easier to understand but sends the wrong
> > message about exactly how the growth of the kernel ring buffer is
> > expected scale with the addition of more CPUs.
> 
> OK, it's finally starting to sink in.  The model for the kernel-wide
> printk output is "a great pile of CPU-independent stuff plus a certain
> amount of per-cpu stuff".  And the code at present attempts to follow
> that model.  Yes?

Yup, exactly.

> I'm rather internet-challenged at present - please let me take another look at
> the patch on Monday.

OK!

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Filesystem lockup with CONFIG_PREEMPT_RT

2014-06-27 Thread Austin Schuh

On Fri, Jun 27, 2014 at 11:19 AM, Steven Rostedt  wrote:
> On Fri, 27 Jun 2014 20:07:54 +0200
> Mike Galbraith  wrote:
>
>> > Why do we need the wakeup? the owner of the lock should wake it up
>> > shouldn't it?
>>
>> True, but that can take ages.
>
> Can it? If the workqueue is of some higher priority, it should boost
> the process that owns the lock. Otherwise it just waits like anything
> else does.
>
> I much rather keep the paradigm of the mainline kernel than to add a
> bunch of hacks that can cause more unforeseen side effects that may
> cause other issues.
>
> Remember, this would only be for spinlocks converted into a rtmutex,
> not for normal mutex or other sleeps. In mainline, the wake up still
> would not happen so why are we waking it up here?
>
> This seems similar to the BKL crap we had to deal with as well. If we
> were going to sleep because we were blocked on a spinlock converted
> rtmutex we could not release and retake the BKL because we would end up
> blocked on two locks. Instead, we made sure that the spinlock would not
> release or take the BKL. It kept with the paradigm of mainline and
> worked. Sucked, but it worked.
>
> -- Steve

Sounds like you are arguing that we should disable preemption (or
whatever the right mechanism is) while holding the pool lock?

Workqueues spin up more threads when work that they are executing
blocks.  This is done through hooks in the scheduler.  This means that
we have to acquire the pool lock when work blocks on a lock in order
to see if there is more work and whether or not we need to spin up a
new thread.

It would be more context switches, but I wonder if we could kick the
workqueue logic completely out of the scheduler into a thread.  Have
the scheduler increment/decrement an atomic pool counter, and wake up
the monitoring thread to spawn new threads when needed?  That would
get rid of the recursive pool lock problem, and should reduce
scheduler latency if we would need to spawn a new thread.

Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/6] cgroup: reorganize cgroup_subtree_control_write()

2014-06-27 Thread Tejun Heo

Make the following two reorganizations to
cgroup_subtree_control_write().  These are to prepare for future
changes and shouldn't cause any functional difference.

* Move availability above css offlining wait.

* Move cgrp->child_subsys_mask update above new css creation.

Signed-off-by: Tejun Heo 
---
 kernel/cgroup.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7868fc3..a46d7e2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2613,6 +2613,14 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
continue;
}
 
+   /* unavailable or not enabled on the parent? */
+   if (!(cgrp_dfl_root.subsys_mask & (1 << ssid)) ||
+   (cgroup_parent(cgrp) &&
+!(cgroup_parent(cgrp)->child_subsys_mask & (1 << 
ssid {
+   ret = -ENOENT;
+   goto out_unlock;
+   }
+
/*
 * Because css offlining is asynchronous, userland
 * might try to re-enable the same controller while
@@ -2635,14 +2643,6 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
 
return restart_syscall();
}
-
-   /* unavailable or not enabled on the parent? */
-   if (!(cgrp_dfl_root.subsys_mask & (1 << ssid)) ||
-   (cgroup_parent(cgrp) &&
-!(cgroup_parent(cgrp)->child_subsys_mask & (1 << 
ssid {
-   ret = -ENOENT;
-   goto out_unlock;
-   }
} else if (disable & (1 << ssid)) {
if (!(cgrp->child_subsys_mask & (1 << ssid))) {
disable &= ~(1 << ssid);
@@ -2673,12 +2673,10 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
goto out_unlock;
}
 
-   /*
-* Create csses for enables and update child_subsys_mask.  This
-* changes cgroup_e_css() results which in turn makes the
-* subsequent cgroup_update_dfl_csses() associate all tasks in the
-* subtree to the updated csses.
-*/
+   cgrp->child_subsys_mask |= enable;
+   cgrp->child_subsys_mask &= ~disable;
+
+   /* create new csses */
for_each_subsys(ss, ssid) {
if (!(enable & (1 << ssid)))
continue;
@@ -2690,9 +2688,11 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
}
}
 
-   cgrp->child_subsys_mask |= enable;
-   cgrp->child_subsys_mask &= ~disable;
-
+   /*
+* At this point, cgroup_e_css() results reflect the new csses
+* making the following cgroup_update_dfl_csses() properly update
+* css associations of all tasks in the subtree.
+*/
ret = cgroup_update_dfl_csses(cgrp);
if (ret)
goto err_undo_css;
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 5/6] cgroup: implement cgroup_subsys->depends_on

2014-06-27 Thread Tejun Heo

Currently, the blkio subsystem attributes all of writeback IOs to the
root.  One of the issues is that there's no way to tell who originated
a writeback IO from block layer.  Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages.  The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.

blkio piggybacking on memory is an implementation detail which
preferably should be handled automatically without requiring explicit
userland action.  To achieve that, this patch implements
cgroup_subsys->depends_on which contains the mask of subsystems which
should be enabled together when the subsystem is enabled.

The previous patches already implemented the support for enabled but
invisible subsystems and cgroup_subsys->depends_on can be easily
implemented by updating cgroup_refresh_child_subsys_mask() so that it
calculates cgroup->child_subsys_mask considering
cgroup_subsys->depends_on of the explicitly enabled subsystems.

Documentation/cgroups/unified-hierarchy.txt is updated to explain that
subsystems may not become immediately available after being unused
from userland and that dependency could be a factor in it.  As
subsystems may already keep residual references, this doesn't
significantly change how subsystem rebinding can be used.

Signed-off-by: Tejun Heo 
---
 Documentation/cgroups/unified-hierarchy.txt | 23 --
 include/linux/cgroup.h  |  9 ++
 kernel/cgroup.c | 49 -
 3 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt 
b/Documentation/cgroups/unified-hierarchy.txt
index 324b182..a7a2205 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -97,9 +97,26 @@ change soon.
 All controllers which are not bound to other hierarchies are
 automatically bound to unified hierarchy and show up at the root of
 it.  Controllers which are enabled only in the root of unified
-hierarchy can be bound to other hierarchies at any time.  This allows
-mixing unified hierarchy with the traditional multiple hierarchies in
-a fully backward compatible way.
+hierarchy can be bound to other hierarchies.  This allows mixing
+unified hierarchy with the traditional multiple hierarchies in a fully
+backward compatible way.
+
+A controller can be moved across hierarchies only after the controller
+is no longer referenced in its current hierarchy.  Because per-cgroup
+controller states are destroyed asynchronously and controllers may
+have lingering references, a controller may not show up immediately on
+the unified hierarchy after the final umount of the previous
+hierarchy.  Similarly, a controller should be fully disabled to be
+moved out of the unified hierarchy and it may take some time for the
+disabled controller to become available for other hierarchies;
+furthermore, due to dependencies among controllers, other controllers
+may need to be disabled too.
+
+While useful for development and manual configurations, dynamically
+moving controllers between the unified and other hierarchies is
+strongly discouraged for production use.  It is recommended to decide
+the hierarchies and controller associations before starting using the
+controllers.
 
 
 2-2. cgroup.subtree_control
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index db99e3b..28853e7 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -693,6 +693,15 @@ struct cgroup_subsys {
 
/* base cftypes, automatically registered with subsys itself */
struct cftype *base_cftypes;
+
+   /*
+* A subsystem may depend on other subsystems.  When such subsystem
+* is enabled on a cgroup, the depended-upon subsystems are enabled
+* together if available.  Subsystems enabled due to dependency are
+* not visible to userland until explicitly enabled.  The following
+* specifies the mask of subsystems that this one depends on.
+*/
+   unsigned int depends_on;
 };
 
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3a6b77d..cd02e99 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1037,9 +1037,56 @@ static void cgroup_put(struct cgroup *cgrp)
css_put(&cgrp->self);
 }
 
+/**
+ * cgroup_refresh_child_subsys_mask - update child_subsys_mask
+ * @cgrp: the target cgroup
+ *
+ * On the default hierarchy, a subsystem may request other subsystems to be
+ * enabled together through its ->depends_on mask.  In such cases, more
+ * subsystems than specified in "cgroup.subtree_control" may be enabled.
+ *
+ * This function determines which subsystems need to be enabled given the
+ * current @cgrp->subtree_control and records it in
+ * @cg

[PATCH 3/6] cgroup: make interface files visible iff enabled on cgroup->subtree_control

2014-06-27 Thread Tejun Heo

cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

The preceding patch distinguished cgroup->subtree_control and
->child_subsys_mask where the former is the subsystems explicitly
configured by the userland and the latter is all enabled subsystems
currently is equal to the former but will include subsystems
implicitly enabled through dependency.

Subsystems which are enabled due to dependency shouldn't be visible to
userland.  This patch updates cgroup_subtree_control_write() and
create_css() such that interface files are not created for implicitly
enabled subsytems.

* @visible paramter is added to create_css().  Interface files are
  created only when true.

* If an already implicitly enabled subsystem is turned on through
  "cgroup.subtree_control", the existing css should be used.  css
  draining is skipped.

* cgroup_subtree_control_write() computes the new target
  cgroup->child_subsys_mask and create/kill or show/hide csses
  accordingly.

As the two subsystem masks are still kept identical, this patch
doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup.c| 78 +-
 2 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8d52c8e..5287f93 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -208,6 +208,8 @@ struct cgroup {
 * ->subtree_control is the one configured through
 * "cgroup.subtree_control" while ->child_subsys_mask is the
 * effective one which may have more subsystems enabled.
+* Controller knobs are made available iff it's enabled in
+* ->subtree_control.
 */
unsigned int subtree_control;
unsigned int child_subsys_mask;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 14a9d88..331fa296 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -186,7 +186,8 @@ static void cgroup_put(struct cgroup *cgrp);
 static int rebind_subsystems(struct cgroup_root *dst_root,
 unsigned int ss_mask);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
-static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss);
+static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
+ bool visible);
 static void css_release(struct percpu_ref *ref);
 static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
@@ -2577,6 +2578,7 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
loff_t off)
 {
unsigned int enable = 0, disable = 0;
+   unsigned int css_enable, css_disable, old_ctrl, new_ctrl;
struct cgroup *cgrp, *child;
struct cgroup_subsys *ss;
char *tok;
@@ -2630,6 +2632,13 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
}
 
/*
+* @ss is already enabled through dependency and
+* we'll just make it visible.  Skip draining.
+*/
+   if (cgrp->child_subsys_mask & (1 << ssid))
+   continue;
+
+   /*
 * Because css offlining is asynchronous, userland
 * might try to re-enable the same controller while
 * the previous instance is still around.  In such
@@ -2681,17 +2690,39 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
goto out_unlock;
}
 
+   /*
+* Update subsys masks and calculate what needs to be done.  More
+* subsystems than specified may need to be enabled or disabled
+* depending on subsystem dependencies.
+*/
cgrp->subtree_control |= enable;
cgrp->subtree_control &= ~disable;
+
+   old_ctrl = cgrp->child_subsys_mask;
cgroup_refresh_child_subsys_mask(cgrp);
+   new_ctrl = cgrp->child_subsys_mask;
+
+   css_enable = ~old_ctrl & new_ctrl;
+   css_disable = old_ctrl & ~new_ctrl;
+   enable |= css_enable;
+   disable |= css_disable;
 
-   /* create new csses */
+   /*
+* Create new csses or make the existing ones visible.  A css is
+* created invisible if it's being implicitly enabled through
+* dependency.  An invisible css is made visible when the userland
+* explicitly enables it.
+*/
for_each_subsys(ss, ssid) {
if (!(enable & (1 << ssid)))
continue;
 
cgroup_for_each_live_child(child, cgrp) {
-   ret = create_css(child, ss);
+

[PATCH 2/6] cgroup: introduce cgroup->subtree_control

2014-06-27 Thread Tejun Heo

cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

Previously, cgroup->child_subsys_mask directly reflected
"cgroup.subtree_control" and the enabled subsystems in the child
cgroups.  This patch adds cgroup->subtree_control which
"cgroup.subtree_control" operates on.  cgroup->child_subsys_mask is
now calculated from cgroup->subtree_control by
cgroup_refresh_child_subsys_mask(), which sets it identical to
cgroup->subtree_control for now.

This will allow using cgroup->child_subsys_mask for all the enabled
subsystems including the implicit ones and ->subtree_control for
tracking the explicitly requested ones.  This patch keeps the two
masks identical and doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  8 +++-
 kernel/cgroup.c| 46 +-
 2 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8a111dd..8d52c8e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -203,7 +203,13 @@ struct cgroup {
struct kernfs_node *kn; /* cgroup kernfs entry */
struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" 
*/
 
-   /* the bitmask of subsystems enabled on the child cgroups */
+   /*
+* The bitmask of subsystems enabled on the child cgroups.
+* ->subtree_control is the one configured through
+* "cgroup.subtree_control" while ->child_subsys_mask is the
+* effective one which may have more subsystems enabled.
+*/
+   unsigned int subtree_control;
unsigned int child_subsys_mask;
 
/* Private pointers for each registered subsystem */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a46d7e2..14a9d88 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1036,6 +1036,11 @@ static void cgroup_put(struct cgroup *cgrp)
css_put(&cgrp->self);
 }
 
+static void cgroup_refresh_child_subsys_mask(struct cgroup *cgrp)
+{
+   cgrp->child_subsys_mask = cgrp->subtree_control;
+}
+
 /**
  * cgroup_kn_unlock - unlocking helper for cgroup kernfs methods
  * @kn: the kernfs_node being serviced
@@ -1208,12 +1213,15 @@ static int rebind_subsystems(struct cgroup_root 
*dst_root, unsigned int ss_mask)
up_write(&css_set_rwsem);
 
src_root->subsys_mask &= ~(1 << ssid);
-   src_root->cgrp.child_subsys_mask &= ~(1 << ssid);
+   src_root->cgrp.subtree_control &= ~(1 << ssid);
+   cgroup_refresh_child_subsys_mask(&src_root->cgrp);
 
/* default hierarchy doesn't enable controllers by default */
dst_root->subsys_mask |= 1 << ssid;
-   if (dst_root != &cgrp_dfl_root)
-   dst_root->cgrp.child_subsys_mask |= 1 << ssid;
+   if (dst_root != &cgrp_dfl_root) {
+   dst_root->cgrp.subtree_control |= 1 << ssid;
+   cgroup_refresh_child_subsys_mask(&dst_root->cgrp);
+   }
 
if (ss->bind)
ss->bind(css);
@@ -2454,7 +2462,7 @@ static int cgroup_controllers_show(struct seq_file *seq, 
void *v)
 {
struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-   cgroup_print_ss_mask(seq, cgroup_parent(cgrp)->child_subsys_mask);
+   cgroup_print_ss_mask(seq, cgroup_parent(cgrp)->subtree_control);
return 0;
 }
 
@@ -2463,7 +2471,7 @@ static int cgroup_subtree_control_show(struct seq_file 
*seq, void *v)
 {
struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-   cgroup_print_ss_mask(seq, cgrp->child_subsys_mask);
+   cgroup_print_ss_mask(seq, cgrp->subtree_control);
return 0;
 }
 
@@ -2608,7 +2616,7 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
 
for_each_subsys(ss, ssid) {
if (enable & (1 << ssid)) {
-   if (cgrp->child_subsys_mask & (1 << ssid)) {
+   if (cgrp->subtree_control & (1 << ssid)) {
enable &= ~(1 << ssid);
continue;
}
@@ -2616,7 +2624,7 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
/* unavailable or not enabled on the parent? */
if (!(cgrp_dfl_root.subsys_mask & (1 << ssid)) ||
(cgroup_parent(cgrp) &&
-!(cgroup_parent(cgrp)->child_subsys_mask & (1 << 
ssid {
+!(cgroup_parent(cgrp)->subtree_control & (1 << 
ssid {
ret = -ENOENT;
goto out_unlock;
}
@@ -2644,14 +2652,14 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,

[PATCH 4/6] cgroup: implement cgroup_subsys->css_reset()

2014-06-27 Thread Tejun Heo

cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

The previous patches added support for explicitly and implicitly
enabled subsystems and showing/hiding their interface files.  An
explicitly enabled subsystem may become implicitly enabled if it's
turned off through "cgroup.subtree_control" but there are subsystems
depending on it.  In such cases, the subsystem, as it's turned off
when seen from userland, shouldn't enforce any resource control.
Also, the subsystem may be explicitly turned on later again and its
interface files should be as close to the intial state as possible.

This patch adds cgroup_subsys->css_reset() which is invoked when a css
is hidden.  The callback should disable resource control and reset the
state to the vanilla state.

Signed-off-by: Tejun Heo 
---
 Documentation/cgroups/cgroups.txt | 14 ++
 include/linux/cgroup.h|  1 +
 kernel/cgroup.c   | 16 
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt 
b/Documentation/cgroups/cgroups.txt
index 821de56..10c949b 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -599,6 +599,20 @@ fork. If this method returns 0 (success) then this should 
remain valid
 while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
+void css_reset(struct cgroup_subsys_state *css)
+(cgroup_mutex held by caller)
+
+An optional operation which should restore @css's configuration to the
+initial state.  This is currently only used on the unified hierarchy
+when a subsystem is disabled on a cgroup through
+"cgroup.subtree_control" but should remain enabled because other
+subsystems depend on it.  cgroup core makes such a css invisible by
+removing the associated interface files and invokes this callback so
+that the hidden subsystem can return to the initial neutral state.
+This prevents unexpected resource control from a hidden css and
+ensures that the configuration is in the initial state when it is made
+visible again later.
+
 void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 (cgroup_mutex held by caller)
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5287f93..db99e3b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -642,6 +642,7 @@ struct cgroup_subsys {
int (*css_online)(struct cgroup_subsys_state *css);
void (*css_offline)(struct cgroup_subsys_state *css);
void (*css_free)(struct cgroup_subsys_state *css);
+   void (*css_reset)(struct cgroup_subsys_state *css);
 
int (*can_attach)(struct cgroup_subsys_state *css,
  struct cgroup_taskset *tset);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 331fa296..3a6b77d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2740,17 +2740,25 @@ static ssize_t cgroup_subtree_control_write(struct 
kernfs_open_file *of,
/*
 * All tasks are migrated out of disabled csses.  Kill or hide
 * them.  A css is hidden when the userland requests it to be
-* disabled while other subsystems are still depending on it.
+* disabled while other subsystems are still depending on it.  The
+* css must not actively control resources and be in the vanilla
+* state if it's made visible again later.  Controllers which may
+* be depended upon should provide ->css_reset() for this purpose.
 */
for_each_subsys(ss, ssid) {
if (!(disable & (1 << ssid)))
continue;
 
cgroup_for_each_live_child(child, cgrp) {
-   if (css_disable & (1 << ssid))
-   kill_css(cgroup_css(child, ss));
-   else
+   struct cgroup_subsys_state *css = cgroup_css(child, ss);
+
+   if (css_disable & (1 << ssid)) {
+   kill_css(css);
+   } else {
cgroup_clear_dir(child, 1 << ssid);
+   if (ss->css_reset)
+   ss->css_reset(css);
+   }
}
}
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 6/6] blkcg, memcg: make blkcg depend on memcg on the default hierarchy

2014-06-27 Thread Tejun Heo

Currently, the blkio subsystem attributes all of writeback IOs to the
root.  One of the issues is that there's no way to tell who originated
a writeback IO from block layer.  Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages.  The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.

cgroup now has a mechanism to express such dependency -
cgroup_subsys->depends_on.  This patch declares that blkcg depends on
memcg so that memcg is enabled automatically on the default hierarchy
when available.  Future changes will make blkcg map the memcg tag to
find out the cgroup to blame for writeback IOs.

As this means that a memcg may be made invisible, this patch also
implements css_reset() for memcg which resets its basic
configurations.  This implementation will probably need to be expanded
to cover other states which are used in the default hierarchy.

Signed-off-by: Tejun Heo 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vivek Goyal 
Cc: Jens Axboe 
---
 block/blk-cgroup.c |  7 +++
 mm/memcontrol.c| 24 
 2 files changed, 31 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 069bc20..c9f7547 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -925,6 +925,13 @@ struct cgroup_subsys blkio_cgrp_subsys = {
.css_free = blkcg_css_free,
.can_attach = blkcg_can_attach,
.base_cftypes = blkcg_files,
+
+   /*
+* This ensures that, if available, memcg is automatically enabled
+* together on the default hierarchy so that the owner cgroup can
+* be retrieved from writeback pages.
+*/
+   .depends_on = 1 << memory_cgrp_id,
 };
 EXPORT_SYMBOL_GPL(blkio_cgrp_subsys);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb..db536e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6407,6 +6407,29 @@ static void mem_cgroup_css_free(struct 
cgroup_subsys_state *css)
__mem_cgroup_free(memcg);
 }
 
+/**
+ * mem_cgroup_css_reset - reset the states of a mem_cgroup
+ * @css: the target css
+ *
+ * Reset the states of the mem_cgroup associated with @css.  This is
+ * invoked when the userland requests disabling on the default hierarchy
+ * but the memcg is pinned through dependency.  The memcg should stop
+ * applying policies and should revert to the vanilla state as it may be
+ * made visible again.
+ *
+ * The current implementation only resets the essential configurations.
+ * This needs to be expanded to cover all the visible parts.
+ */
+static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+   mem_cgroup_resize_limit(memcg, ULLONG_MAX);
+   mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX);
+   memcg_update_kmem_limit(memcg, ULLONG_MAX);
+   res_counter_set_soft_limit(&memcg->res, ULLONG_MAX);
+}
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 #define PRECHARGE_COUNT_AT_ONCE256
@@ -7019,6 +7042,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_online = mem_cgroup_css_online,
.css_offline = mem_cgroup_css_offline,
.css_free = mem_cgroup_css_free,
+   .css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCHSET cgroup/for-3.17] cgroup, blkcg, memcg: make blkcg depend on memcg on unified hierarchy

2014-06-27 Thread Tejun Heo

Hello, guys.

Currently, the blkio subsystem attributes all of writeback IOs to the
root.  One of the issues is that there's no way to tell who originated
a writeback IO from block layer.  Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages.  The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.

This can be achieved on the unified hierarchy without too much
difficulty.  This patchset implements a dependency mechanism in the
cgroup such that a subsystem can depends on other subsystems.  If
available, the depended-upon subsystems are enabled together
implicitly when the subsystem is turned on.  Implicitly enabled
subsystems are invisible and the dependencies are transparent to
userland.

This patchset implements the dependency mechanism in cgroup core and
make blkcg depend on memcg.  This doesn't actually solve the writeback
problem yet but is an important step.

This patchset contains the following six patches.

 0001-cgroup-reorganize-cgroup_subtree_control_write.patch
 0002-cgroup-introduce-cgroup-subtree_control.patch
 0003-cgroup-make-interface-files-visible-iff-enabled-on-c.patch
 0004-cgroup-implement-cgroup_subsys-css_reset.patch
 0005-cgroup-implement-cgroup_subsys-depends_on.patch
 0006-blkcg-memcg-make-blkcg-depend-on-memcg-on-the-defaul.patch

0001-0005 gradually implement the dependency mechanism.

0006 makes blkcg depend on memcg.

This patchset is on top of a497c3ba1d97 ("Linux 3.16-rc2") and
available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
review-cgroup-dependency

diffstat follows.  Thanks.

 Documentation/cgroups/cgroups.txt   |   14 +
 Documentation/cgroups/unified-hierarchy.txt |   23 ++-
 block/blk-cgroup.c  |7
 include/linux/cgroup.h  |   20 ++
 kernel/cgroup.c |  201 ++--
 mm/memcontrol.c |   24 +++
 6 files changed, 243 insertions(+), 46 deletions(-)

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] drm/gk20a: add BAR instance

2014-06-27 Thread Ben Skeggs

On Sat, Jun 28, 2014 at 4:51 AM, Ken Adams  wrote:
> quick note re: tegra and gpu bars...
>
> to this point we've explicitly avoided providing user-mode mappings due to
> power management issues, etc.
> looks to me like this would allow such mappings.  is that the case?  are
> there any paths which would require such mappings to function properly?
What power management issues are you worried about in particular?  We
have these concerns on discrete cards too, when doing things like
changing vram frequencies.  TTM is able to kick out all userspace
mappings, and clients will then block in the fault handler until it's
safe - if they touch the mappings.

Ben.

>
> thanks
> ---
> ken
>
> p.s.: hello :)
>
> On 6/27/14 7:36 AM, "Alex Courbot"  wrote:
>
>>GK20A's BAR is functionally identical to NVC0's, but do not support
>>being ioremapped write-combined. Create a BAR instance for GK20A that
>>reflect that state.
>>
>>Signed-off-by: Alexandre Courbot 
>>---
>>Changes since v1:
>>- Fix compilation warning due to missing cast
>>
>>Patch 1 of the series was ok and thus has not been resent.
>>
>> drivers/gpu/drm/nouveau/Makefile  |  1 +
>> drivers/gpu/drm/nouveau/core/engine/device/nve0.c |  2 +-
>> drivers/gpu/drm/nouveau/core/include/subdev/bar.h |  1 +
>> drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c   | 54
>>+++
>> drivers/gpu/drm/nouveau/core/subdev/bar/nvc0.c|  6 +--
>> drivers/gpu/drm/nouveau/core/subdev/bar/priv.h|  6 +++
>> 6 files changed, 66 insertions(+), 4 deletions(-)
>> create mode 100644 drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>
>>diff --git a/drivers/gpu/drm/nouveau/Makefile
>>b/drivers/gpu/drm/nouveau/Makefile
>>index 8b307e143632..11d9561d67c1 100644
>>--- a/drivers/gpu/drm/nouveau/Makefile
>>+++ b/drivers/gpu/drm/nouveau/Makefile
>>@@ -26,6 +26,7 @@ nouveau-y += core/core/subdev.o
>> nouveau-y += core/subdev/bar/base.o
>> nouveau-y += core/subdev/bar/nv50.o
>> nouveau-y += core/subdev/bar/nvc0.o
>>+nouveau-y += core/subdev/bar/gk20a.o
>> nouveau-y += core/subdev/bios/base.o
>> nouveau-y += core/subdev/bios/bit.o
>> nouveau-y += core/subdev/bios/boost.o
>>diff --git a/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>b/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>index 2d1e97d4264f..a2b9ccc48f66 100644
>>--- a/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>+++ b/drivers/gpu/drm/nouveau/core/engine/device/nve0.c
>>@@ -165,7 +165,7 @@ nve0_identify(struct nouveau_device *device)
>>   device->oclass[NVDEV_SUBDEV_IBUS   ] = &gk20a_ibus_oclass;
>>   device->oclass[NVDEV_SUBDEV_INSTMEM] = nv50_instmem_oclass;
>>   device->oclass[NVDEV_SUBDEV_VM ] = &nvc0_vmmgr_oclass;
>>-  device->oclass[NVDEV_SUBDEV_BAR] = &nvc0_bar_oclass;
>>+  device->oclass[NVDEV_SUBDEV_BAR] = &gk20a_bar_oclass;
>>   device->oclass[NVDEV_ENGINE_DMAOBJ ] = &nvd0_dmaeng_oclass;
>>   device->oclass[NVDEV_ENGINE_FIFO   ] =  gk20a_fifo_oclass;
>>   device->oclass[NVDEV_ENGINE_SW ] =  nvc0_software_oclass;
>>diff --git a/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>b/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>index 9002cbb6432b..be037fac534c 100644
>>--- a/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>+++ b/drivers/gpu/drm/nouveau/core/include/subdev/bar.h
>>@@ -33,5 +33,6 @@ nouveau_bar(void *obj)
>>
>> extern struct nouveau_oclass nv50_bar_oclass;
>> extern struct nouveau_oclass nvc0_bar_oclass;
>>+extern struct nouveau_oclass gk20a_bar_oclass;
>>
>> #endif
>>diff --git a/drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>b/drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>new file mode 100644
>>index ..bf877af9d3bd
>>--- /dev/null
>>+++ b/drivers/gpu/drm/nouveau/core/subdev/bar/gk20a.c
>>@@ -0,0 +1,54 @@
>>+/*
>>+ * Copyright (c) 2014, NVIDIA CORPORATION. All rights reserved.
>>+ *
>>+ * Permission is hereby granted, free of charge, to any person obtaining
>>a
>>+ * copy of this software and associated documentation files (the
>>"Software"),
>>+ * to deal in the Software without restriction, including without
>>limitation
>>+ * the rights to use, copy, modify, merge, publish, distribute,
>>sublicense,
>>+ * and/or sell copies of the Software, and to permit persons to whom the
>>+ * Software is furnished to do so, subject to the following conditions:
>>+ *
>>+ * The above copyright notice and this permission notice shall be
>>included in
>>+ * all copies or substantial portions of the Software.
>>+ *
>>+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>>EXPRESS OR
>>+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>>MERCHANTABILITY,
>>+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT
>>SHALL
>>+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>>OTHER
>>+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>>ARISING
>>+ * FROM, OUT OF OR IN CON

[PATCH] dm-io: Fix a race condition in the wake up code for sync_io

2014-06-27 Thread Minfei Huang

There's a race condition between the atomic_dec_and_test(&io->count)
in dec_count() and the waking of the sync_io() thread.  If the thread
is spuriously woken immediately after the decrement it may exit,
making the on the stack io struct invalid, yet the dec_count could
still be using it.

There are smaller fixes than the one here (eg, just take the io object
off the stack).  But I feel this code could use a clean up.

- simplify dec_count().

  - It always calls a callback fn now.
  - It always frees the io object back to the pool.

- sync_io()

  - Take the io object off the stack and allocate it from the pool the
same as async_io.
  - Use a completion object rather than an explicit io_schedule()
loop.  The callback triggers the completion.

Signed-off-by: Minfei Huang 
---
 drivers/md/dm-io.c |   22 +-
 1 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 3842ac7..05583da 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -10,6 +10,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,7 +33,7 @@ struct dm_io_client {
 struct io {
unsigned long error_bits;
atomic_t count;
-   struct task_struct *sleeper;
+   struct completion *wait;
struct dm_io_client *client;
io_notify_fn callback;
void *context;
@@ -121,8 +122,8 @@ static void dec_count(struct io *io, unsigned int region, 
int error)
invalidate_kernel_vmap_range(io->vma_invalidate_address,
 io->vma_invalidate_size);
 
-   if (io->sleeper)
-   wake_up_process(io->sleeper);
+   if (io->wait)
+   complete(io->wait);
 
else {
unsigned long r = io->error_bits;
@@ -387,6 +388,7 @@ static int sync_io(struct dm_io_client *client, unsigned 
int num_regions,
 */
volatile char io_[sizeof(struct io) + __alignof__(struct io) - 1];
struct io *io = (struct io *)PTR_ALIGN(&io_, __alignof__(struct io));
+   DECLARE_COMPLETION_ONSTACK(wait);
 
if (num_regions > 1 && (rw & RW_MASK) != WRITE) {
WARN_ON(1);
@@ -395,7 +397,7 @@ static int sync_io(struct dm_io_client *client, unsigned 
int num_regions,
 
io->error_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
-   io->sleeper = current;
+   io->wait = &wait;
io->client = client;
 
io->vma_invalidate_address = dp->vma_invalidate_address;
@@ -403,15 +405,9 @@ static int sync_io(struct dm_io_client *client, unsigned 
int num_regions,
 
dispatch_io(rw, num_regions, where, dp, io, 1);
 
-   while (1) {
-   set_current_state(TASK_UNINTERRUPTIBLE);
-
-   if (!atomic_read(&io->count))
-   break;
-
-   io_schedule();
+   while (atomic_read(&io->count) != 0) {
+   wait_for_completion_io_timeout(&wait, 5);
}
-   set_current_state(TASK_RUNNING);
 
if (error_bits)
*error_bits = io->error_bits;
@@ -434,7 +430,7 @@ static int async_io(struct dm_io_client *client, unsigned 
int num_regions,
io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
-   io->sleeper = NULL;
+   io->wait = NULL;
io->client = client;
io->callback = fn;
io->context = context;
-- 
1.7.1


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 13/14] samples: bpf: example of stateful socket filtering

2014-06-27 Thread Andy Lutomirski

On Fri, Jun 27, 2014 at 5:06 PM, Alexei Starovoitov  wrote:
> this socket filter example does:
>
> - creates a hashtable in kernel with key 4 bytes and value 8 bytes
>
> - populates map[6] = 0; map[17] = 0;  // 6 - tcp_proto, 17 - udp_proto
>
> - loads eBPF program:
>   r0 = skb[14 + 9]; // load one byte of ip->proto
>   *(u32*)(fp - 4) = r0;
>   value = bpf_map_lookup_elem(map_id, fp - 4);
>   if (value)
>(*(u64*)value) += 1;

In the code below, this is XADD.  Is there anything that validates
that shared things like this can only be poked at by atomic
operations?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v3] Tools: hv: fix file overwriting of hv_fcopy_daemon

2014-06-27 Thread Yue Zhang

From: Yue Zhang 

hv_fcopy_daemon fails to overwrite a file if the target file already
exits.

Add O_TRUNC flag on opening.

Signed-off-by: Yue Zhang 
---
 tools/hv/hv_fcopy_daemon.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index fba1c75..8f96b3e 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -88,7 +88,8 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
}
}
 
-   target_fd = open(target_fname, O_RDWR | O_CREAT | O_CLOEXEC, 0744);
+   target_fd = open(target_fname,
+O_RDWR | O_CREAT | O_TRUNC | O_CLOEXEC, 0744);
if (target_fd == -1) {
syslog(LOG_INFO, "Open Failed: %s", strerror(errno));
goto done;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 07/14] bpf: expand BPF syscall with program load/unload

2014-06-27 Thread Andy Lutomirski

On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  wrote:
> eBPF programs are safe run-to-completion functions with load/unload
> methods from userspace similar to kernel modules.
>
> User space API:
>
> - load eBPF program
>   prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, 
> int len)
>
>   where 'prog' is a sequence of sections (currently TEXT and LICENSE)
>   TEXT - array of eBPF instructions
>   LICENSE - GPL compatible


> +
> +   err = -EINVAL;
> +   /* look for mandatory license string */
> +   if (!tb[BPF_PROG_LICENSE])
> +   goto free_attr;
> +
> +   /* eBPF programs must be GPL compatible */
> +   if (!license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE])))
> +   goto free_attr;

Seriously?  My mind boggles.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] cpufreq: make table sentinal macros unsigned to match use

2014-06-27 Thread Simon Horman

On Fri, Jun 27, 2014 at 04:09:39PM -0500, Brian W Hart wrote:
> Commit 5eeaf1f18973 (cpufreq: Fix build error on some platforms that
> use cpufreq_for_each_*) moved function cpufreq_next_valid() to a public
> header.  Warnings are now generated when objects including that header
> are built with -Wsign-compare (as an out-of-tree module might be):
> 
> .../include/linux/cpufreq.h: In function ‘cpufreq_next_valid’:
> .../include/linux/cpufreq.h:519:27: warning: comparison between signed
> and unsigned integer expressions [-Wsign-compare]
>   while ((*pos)->frequency != CPUFREQ_TABLE_END)
>^
> .../include/linux/cpufreq.h:520:25: warning: comparison between signed
> and unsigned integer expressions [-Wsign-compare]
>if ((*pos)->frequency != CPUFREQ_ENTRY_INVALID)
>  ^
> 
> Constants CPUFREQ_ENTRY_INVALID and CPUFREQ_TABLE_END are signed, but
> are used with unsigned member 'frequency' of cpufreq_frequency_table.
> Update the macro definitions to be explicitly unsigned to match their
> use.
> 
> This also corrects potentially wrong behavior of clk_rate_table_iter()
> if unsigned long is wider than usigned int.
> 
> Signed-off-by: Brian W Hart 

Reviewed-by: Simon Horman 

> ---
> These macros are fairly broadly used in the kernel so I was bit leery
> of changing them, but after inspection I think it's fine.  I found 102
> uses of the macros, of which:
> 
> 99 are uses with cpufreq_frequency_table.frequency (95) or with local
>variables of the same type as frequency (4).  These should be just
>fine with this change--we're just making explicit a conversion that
>was previously implicit.
> 
>  2 are uses with a local variable of different type (unsigned long) than
>'frequency' (in drivers/sh/clk/core.c).  One of these uses is safe;
>the other (in clk_rate_table_iter()) is broken if unsigned long
>is wider than unsigned int.  As a side-effect, this patch corrects
>the potential misbehavior there.
> 
>  1 is a use in macro cpufreq_for_each_entry() with what _should_ be the
>frequency member of a cpufreq_frequency_table, provided the caller it
>well-behaved.  There are 18 callers of this macro; all are well-behaved.
>So these should also be safe.
> 
>  include/linux/cpufreq.h |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index ec4112d..8f8ae95 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -482,8 +482,8 @@ extern struct cpufreq_governor cpufreq_gov_conservative;
>   */
>  
>  /* Special Values of .frequency field */
> -#define CPUFREQ_ENTRY_INVALID~0
> -#define CPUFREQ_TABLE_END~1
> +#define CPUFREQ_ENTRY_INVALID~0u
> +#define CPUFREQ_TABLE_END~1u
>  /* Special Values of .flags field */
>  #define CPUFREQ_BOOST_FREQ   (1 << 0)
>  
> -- 
> 1.7.1
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 04/14] bpf: update MAINTAINERS entry

2014-06-27 Thread Joe Perches

Add MAINTAINERS entry.

On Fri, 2014-06-27 at 17:05 -0700, Alexei Starovoitov wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
[]
> @@ -1881,6 +1881,15 @@ S: Supported
>  F:   drivers/net/bonding/
>  F:   include/uapi/linux/if_bonding.h
>  
> +BPF

While a lot of people know what BPF is, I think it'd
be better to have something like

BPF - SOCKET FILTER (Berkeley Packet Filter like)
> +M:   Alexei Starovoitov 
> +L:   net...@vger.kernel.org
> +L:   linux-kernel@vger.kernel.org
> +S:   Supported
> +F:   kernel/bpf/
> +F:   include/uapi/linux/bpf.h
> +F:   include/linux/bpf.h


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

2014-06-27 Thread Andy Lutomirski

On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov  wrote:
> BPF syscall is a demux for different BPF releated commands.
>
> 'maps' is a generic storage of different types for sharing data between kernel
> and userspace.
>
> The maps can be created/deleted from user space via BPF syscall:
> - create a map with given id, type and attributes
>   map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>   returns positive map id or negative error
>
> - delete map with given map id
>   err = bpf_map_delete(int map_id)
>   returns zero or negative error

What's the scope of "id"?  How is it secured?

This question is brought to you by keyctl, which is terminally fucked.
At some point I'll generate some proof of concept exploits for severe
bugs caused by misdesign of a namespace.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC net-next 02/14] net: filter: split filter.h and expose eBPF to user space

2014-06-27 Thread Alexei Starovoitov

eBPF can be used from user space.

uapi/linux/bpf.h: eBPF instruction set definition

linux/filter.h: the rest

This patch only moves macro definitions, but practically it freezes existing
eBPF instruction set, though new instructions can still be added in the future.

These eBPF definitions cannot go into uapi/linux/filter.h, since the names
may conflict with existing applications.

Signed-off-by: Alexei Starovoitov 
---
 include/linux/filter.h|  294 +--
 include/uapi/linux/Kbuild |1 +
 include/uapi/linux/bpf.h  |  305 +
 3 files changed, 307 insertions(+), 293 deletions(-)
 create mode 100644 include/uapi/linux/bpf.h

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a7e3c48d73a7..6766577635ff 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -8,303 +8,11 @@
 #include 
 #include 
 #include 
-
-/* Internally used and optimized filter representation with extended
- * instruction set based on top of classic BPF.
- */
-
-/* instruction classes */
-#define BPF_ALU64  0x07/* alu mode in double word width */
-
-/* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
-#define BPF_XADD   0xc0/* exclusive add */
-
-/* alu/jmp fields */
-#define BPF_MOV0xb0/* mov reg to reg */
-#define BPF_ARSH   0xc0/* sign extending arithmetic shift right */
-
-/* change endianness of a register */
-#define BPF_END0xd0/* flags for endianness conversion: */
-#define BPF_TO_LE  0x00/* convert to little-endian */
-#define BPF_TO_BE  0x08/* convert to big-endian */
-#define BPF_FROM_LEBPF_TO_LE
-#define BPF_FROM_BEBPF_TO_BE
-
-#define BPF_JNE0x50/* jump != */
-#define BPF_JSGT   0x60/* SGT is signed '>', GT in x86 */
-#define BPF_JSGE   0x70/* SGE is signed '>=', GE in x86 */
-#define BPF_CALL   0x80/* function call */
-#define BPF_EXIT   0x90/* function return */
-
-/* Register numbers */
-enum {
-   BPF_REG_0 = 0,
-   BPF_REG_1,
-   BPF_REG_2,
-   BPF_REG_3,
-   BPF_REG_4,
-   BPF_REG_5,
-   BPF_REG_6,
-   BPF_REG_7,
-   BPF_REG_8,
-   BPF_REG_9,
-   BPF_REG_10,
-   __MAX_BPF_REG,
-};
-
-/* BPF has 10 general purpose 64-bit registers and stack frame. */
-#define MAX_BPF_REG__MAX_BPF_REG
-
-/* ArgX, context and stack frame pointer register positions. Note,
- * Arg1, Arg2, Arg3, etc are used as argument mappings of function
- * calls in BPF_CALL instruction.
- */
-#define BPF_REG_ARG1   BPF_REG_1
-#define BPF_REG_ARG2   BPF_REG_2
-#define BPF_REG_ARG3   BPF_REG_3
-#define BPF_REG_ARG4   BPF_REG_4
-#define BPF_REG_ARG5   BPF_REG_5
-#define BPF_REG_CTXBPF_REG_6
-#define BPF_REG_FP BPF_REG_10
-
-/* Additional register mappings for converted user programs. */
-#define BPF_REG_A  BPF_REG_0
-#define BPF_REG_X  BPF_REG_7
-#define BPF_REG_TMPBPF_REG_8
-
-/* BPF program can access up to 512 bytes of stack space. */
-#define MAX_BPF_STACK  512
-
-/* Helper macros for filter block array initializers. */
-
-/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
-
-#define BPF_ALU64_REG(OP, DST, SRC)\
-   ((struct sock_filter_int) { \
-   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,\
-   .dst_reg = DST, \
-   .src_reg = SRC, \
-   .off   = 0, \
-   .imm   = 0 })
-
-#define BPF_ALU32_REG(OP, DST, SRC)\
-   ((struct sock_filter_int) { \
-   .code  = BPF_ALU | BPF_OP(OP) | BPF_X,  \
-   .dst_reg = DST, \
-   .src_reg = SRC, \
-   .off   = 0, \
-   .imm   = 0 })
-
-/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
-
-#define BPF_ALU64_IMM(OP, DST, IMM)\
-   ((struct sock_filter_int) { \
-   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,\
-   .dst_reg = DST, \
-   .src_reg = 0,   \
-   .off   = 0, \
-   .imm   = IMM })
-
-#define BPF_ALU32_IMM(OP, DST, IMM)\
-   ((struct sock_filter_int) { \
-   .code  = BPF_ALU | BPF_OP(OP) | BPF_K,  \
-   .dst_reg = DST, \
-   .src_reg = 0,   \
-   .off   = 0, \
-

[PATCH RFC net-next 04/14] bpf: update MAINTAINERS entry

2014-06-27 Thread Alexei Starovoitov

Signed-off-by: Alexei Starovoitov 
---
 MAINTAINERS |9 +
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 48f4ef44b252..ebd831cd1a25 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1881,6 +1881,15 @@ S:   Supported
 F: drivers/net/bonding/
 F: include/uapi/linux/if_bonding.h
 
+BPF
+M: Alexei Starovoitov 
+L: net...@vger.kernel.org
+L: linux-kernel@vger.kernel.org
+S: Supported
+F: kernel/bpf/
+F: include/uapi/linux/bpf.h
+F: include/linux/bpf.h
+
 BROADCOM B44 10/100 ETHERNET DRIVER
 M: Gary Zambrano 
 L: net...@vger.kernel.org
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

2014-06-27 Thread Alexei Starovoitov

BPF syscall is a demux for different BPF releated commands.

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps can be created/deleted from user space via BPF syscall:
- create a map with given id, type and attributes
  map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
  returns positive map id or negative error

- delete map with given map id
  err = bpf_map_delete(int map_id)
  returns zero or negative error

Next patch allows userspace programs to populate/read maps that eBPF programs
are concurrently updating.

maps can have different types: hash, bloom filter, radix-tree, etc.

The map is defined by:
  . id
  . type
  . max number of elements
  . key size in bytes
  . value size in bytes

Next patches allow eBPF programs to access maps via API:
  void * bpf_map_lookup_elem(u32 map_id, void *key);
  int bpf_map_update_elem(u32 map_id, void *key, void *value);
  int bpf_map_delete_elem(u32 map_id, void *key);

This patch establishes core infrastructure for BPF maps.
Next patches implement lookup/update and hashtable type.
More map types can be added in the future.

syscall is using type-length-value style of passing arguments to be backwards
compatible with future extensions to map attributes. Different map types may
use different attributes as well.
The concept of type-lenght-value is borrowed from netlink, but netlink itself
is not applicable here, since BPF programs and maps can be used in NET-less
configurations.

Signed-off-by: Alexei Starovoitov 
---
 Documentation/networking/filter.txt |   69 ++
 arch/x86/syscalls/syscall_64.tbl|1 +
 include/linux/bpf.h |   44 +++
 include/linux/syscalls.h|2 +
 include/uapi/asm-generic/unistd.h   |4 +-
 include/uapi/linux/bpf.h|   29 +
 kernel/bpf/Makefile |2 +-
 kernel/bpf/syscall.c|  238 +++
 kernel/sys_ni.c |3 +
 9 files changed, 390 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 kernel/bpf/syscall.c

diff --git a/Documentation/networking/filter.txt 
b/Documentation/networking/filter.txt
index ee78eba78a9d..e14e486f69cd 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -995,6 +995,75 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + 
off16) += src_reg
 Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
 2 byte atomic increments are not supported.
 
+eBPF maps
+-
+'maps' is a generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via BPF syscall, which has commands:
+- create a map with given id, type and attributes
+  map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
+  returns positive map id or negative error
+
+- delete map with given map id
+  err = bpf_map_delete(int map_id)
+  returns zero or negative error
+
+- lookup key in a given map referenced by map_id
+  err = bpf_map_lookup_elem(int map_id, void *key, void *value)
+  returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+  err = bpf_map_update_elem(int map_id, void *key, void *value)
+  returns zero or negative error
+
+- find and delete element by key in a given map
+  err = bpf_map_delete_elem(int map_id, void *key)
+
+userspace programs uses this API to create/populate/read maps that eBPF 
programs
+are concurrently updating.
+
+maps can have different types: hash, bloom filter, radix-tree, etc.
+
+The map is defined by:
+  . id
+  . type
+  . max number of elements
+  . key size in bytes
+  . value size in bytes
+
+The maps are accesible from eBPF program with API:
+  void * bpf_map_lookup_elem(u32 map_id, void *key);
+  int bpf_map_update_elem(u32 map_id, void *key, void *value);
+  int bpf_map_delete_elem(u32 map_id, void *key);
+
+If eBPF verifier is configured to recognize extra calls in the program
+bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like:
+  ...
+  ptr_to_value = map_lookup_elem(const_int_map_id, key)
+  access memory [ptr_to_value, ptr_to_value + value_size_in_bytes]
+  ...
+  prepare key2 and value2 on stack of key_size and value_size
+  err = map_update_elem(const_int_map_id2, key2, value2)
+  ...
+
+eBPF program cannot create or delete maps
+(such calls will be unknown to verifier)
+
+During program loading the refcnt of used maps is incremented, so they don't 
get
+deleted while program is running
+
+bpf_map_update_elem() can fail if maximum number of elements reached.
+if key2 already exists, bpf_map_update_elem() replaces it with value2 
atomically
+
+bpf_map_lookup_elem() can return null or ptr_to_value
+ptr_to_value is read/write from the program point of view.
+
+The verifier will check that the program accesses map elem

[PATCH RFC net-next 05/14] bpf: add lookup/update/delete/iterate methods to BPF maps

2014-06-27 Thread Alexei Starovoitov

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given id, type and attributes
  map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
  returns positive map id or negative error

- delete map with given map id
  err = bpf_map_delete(int map_id)
  returns zero or negative error

- lookup key in a given map referenced by map_id
  err = bpf_map_lookup_elem(int map_id, void *key, void *value)
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  err = bpf_map_update_elem(int map_id, void *key, void *value)
  returns zero or negative error

- find and delete element by key in a given map
  err = bpf_map_delete_elem(int map_id, void *key)

- iterate map elements (based on input key return next_key)
  err = bpf_map_get_next_key(int map_id, void *key, void *next_key)

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |6 ++
 include/uapi/linux/bpf.h |   25 +++
 kernel/bpf/syscall.c |  180 ++
 3 files changed, 211 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6448b9beea89..19cd394bdbcc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -18,6 +18,12 @@ struct bpf_map_ops {
/* funcs callable from userspace (via syscall) */
struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 
1]);
void (*map_free)(struct bpf_map *);
+   int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+
+   /* funcs callable from userspace and from eBPF programs */
+   void *(*map_lookup_elem)(struct bpf_map *map, void *key);
+   int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
+   int (*map_delete_elem)(struct bpf_map *map, void *key);
 };
 
 struct bpf_map {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 04374e57c290..faed2ce2d25a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -315,6 +315,31 @@ enum bpf_cmd {
 * returns zero or negative error
 */
BPF_MAP_DELETE,
+
+   /* lookup key in a given map referenced by map_id
+* err = bpf_map_lookup_elem(int map_id, void *key, void *value)
+* returns zero and stores found elem into value
+* or negative error
+*/
+   BPF_MAP_LOOKUP_ELEM,
+
+   /* create or update key/value pair in a given map
+* err = bpf_map_update_elem(int map_id, void *key, void *value)
+* returns zero or negative error
+*/
+   BPF_MAP_UPDATE_ELEM,
+
+   /* find and delete elem by key in a given map
+* err = bpf_map_delete_elem(int map_id, void *key)
+* returns zero or negative error
+*/
+   BPF_MAP_DELETE_ELEM,
+
+   /* lookup key in a given map and return next key
+* err = bpf_map_get_elem(int map_id, void *key, void *next_key)
+* returns zero and stores next key or negative error
+*/
+   BPF_MAP_GET_NEXT_KEY,
 };
 
 enum bpf_map_attributes {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b9509923b16f..1a48da23a939 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -219,6 +219,174 @@ static int map_delete(int map_id)
return 0;
 }
 
+static int map_lookup_elem(int map_id, void __user *ukey, void __user *uvalue)
+{
+   struct bpf_map *map;
+   void *key, *value;
+   int err;
+
+   if (map_id < 0)
+   return -EINVAL;
+
+   rcu_read_lock();
+   map = idr_find(&bpf_map_id_idr, map_id);
+   err = -EINVAL;
+   if (!map)
+   goto err_unlock;
+
+   err = -ENOMEM;
+   key = kmalloc(map->key_size, GFP_ATOMIC);
+   if (!key)
+   goto err_unlock;
+
+   err = -EFAULT;
+   if (copy_from_user(key, ukey, map->key_size) != 0)
+   goto free_key;
+
+   err = -ESRCH;
+   value = map->ops->map_lookup_elem(map, key);
+   if (!value)
+   goto free_key;
+
+   err = -EFAULT;
+   if (copy_to_user(uvalue, value, map->value_size) != 0)
+   goto free_key;
+
+   err = 0;
+
+free_key:
+   kfree(key);
+err_unlock:
+   rcu_read_unlock();
+   return err;
+}
+
+static int map_update_elem(int map_id, void __user *ukey, void __user *uvalue)
+{
+   struct bpf_map *map;
+   void *key, *value;
+   int err;
+
+   if (map_id < 0)
+   return -EINVAL;
+
+   rcu_read_lock();
+   map = idr_find(&bpf_map_id_idr, map_id);
+   err = -EINVAL;
+   if (!map)
+   goto err_unlock;
+
+   err = -ENOMEM;
+   key = kmalloc(map->key_size, GFP_ATOMIC);
+   if (!key)
+   goto err_unlock;
+
+   err = -EFAULT;
+   if (copy_from_user(key, ukey, map->key_size) != 0)
+

[PATCH RFC net-next 00/14] BPF syscall, maps, verifier, samples

2014-06-27 Thread Alexei Starovoitov

Hi All,

this patch set demonstrates the potential of eBPF.

First patch "net: filter: split filter.c into two files" splits eBPF interpreter
out of networking into kernel/bpf/. The goal for BPF subsystem is to be usable
in NET-less configuration. Though the whole set is marked is RFC, the 1st patch
is good to go. Similar version of the patch that was posted few weeks ago, but
was deferred. I'm assuming due to lack of forward visibility. I hope that this
patch set shows what eBPF is capable of and where it's heading.

Other patches expose eBPF instruction set to user space and introduce concepts
of maps and programs accessible via syscall.

'maps' is a generic storage of different types for sharing data between kernel
and userspace. Maps are referrenced by global id. Root can create multiple
maps of different types where key/value are opaque bytes of data. It's up to
user space and eBPF program to decide what they store in the maps.

eBPF programs are similar to kernel modules. They live in global space and
have unique prog_id. Each program is a safe run-to-completion set of
instructions. eBPF verifier statically determines that the program terminates
and safe to execute. During verification the program takes a hold of maps
that it intends to use, so selected maps cannot be removed until program is
unloaded. The program can be attached to different events. These events can
be packets, tracepoint events and other types in the future. New event triggers
execution of the program which may store information about the event in the 
maps.
Beyond storing data the programs may call into in-kernel helper functions
which may, for example, dump stack, do trace_printk or other forms of live
kernel debugging. Same program can be attached to multiple events. Different
programs can access the same map:

  tracepoint  tracepoint  tracepointsk_buffsk_buff
   event A event B event C  on eth0on eth1
| |  ||  |
| |  ||  |
--> tracing <--  tracing   socket  socket
 prog_1   prog_2   prog_3  prog_4
 |  |   ||
  |---  -|  |---|   map_3
map_1   map_2

User space (via syscall) and eBPF programs access maps concurrently.

Last two patches are sample code. 1st demonstrates stateful packet inspection.
It counts tcp and udp packets on eth0. Should be easy to see how this eBPF
framework can be used for network analytics.
2nd sample does simple 'drop monitor'. It attaches to kfree_skb tracepoint
event and counts number of packet drops at particular $pc location.
User space periodically summarizes what eBPF programs recorded.
In these two samples the eBPF programs are tiny and written in 'assembler'
with macroses. More complex programs can be written C (llvm backend is not
part of this diff to reduce 'huge' perception).
Since eBPF is fully JITed on x64, the cost of running eBPF program is very
small even for high frequency events. Here are the numbers comparing
flow_dissector in C vs eBPF:
  x86_64 skb_flow_dissect() same skb (all cached) -  42 nsec per call
  x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per call
eBPF+jit skb_flow_dissect() same skb (all cached) -  51 nsec per call
eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call

Detailed explanation on eBPF verifier and safety is in patch 08/14

Thanks
Alexei

minor todo: rename 'struct sock_filter_int' into 'struct bpf_insn'. It's not
part of this diff to reduce size.
  
--

The following changes since commit c1c27fb9b3040a2559d4d3e1183afa8c106bc94a:

  Merge branch 'master' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next (2014-06-27 
12:59:38 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

for you to fetch changes up to 4c8da0f21220087e38894c69339cddc64c1220f9:

  samples: bpf: example of tracing filters with eBPF (2014-06-27 15:22:07 -0700)


Alexei Starovoitov (14):
  net: filter: split filter.c into two files
  net: filter: split filter.h and expose eBPF to user space
  bpf: introduce syscall(BPF, ...) and BPF maps
  bpf: update MAINTAINERS entry
  bpf: add lookup/update/delete/iterate methods to BPF maps
  bpf: add hashtable type of BPF maps
  bpf: expand BPF syscall with program load/unload
  bpf: add eBPF verifier
  bpf: allow eBPF programs to use maps
  net: sock: allow eBPF programs to be attached to sockets
  tracing: allow eBPF programs to be attached to events
  samples: bpf: add mini eBPF library to manipulate maps and programs
  samples: bpf: example of stateful socket filtering
  samples: bpf: example of tracing filters with eBPF

 Documentation/networking/filter.txt|  302 +++
 MAINT

[PATCH RFC net-next 07/14] bpf: expand BPF syscall with program load/unload

2014-06-27 Thread Alexei Starovoitov

eBPF programs are safe run-to-completion functions with load/unload
methods from userspace similar to kernel modules.

User space API:

- load eBPF program
  prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr *prog, int 
len)

  where 'prog' is a sequence of sections (currently TEXT and LICENSE)
  TEXT - array of eBPF instructions
  LICENSE - GPL compatible

- unload eBPF program
  err = bpf_prog_unload(int prog_id)

User space example of syscall(__NR_bpf, BPF_PROG_LOAD, prog_id, prog_type, ...)
follows in later patches

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |   32 ++
 include/linux/filter.h   |9 +-
 include/uapi/linux/bpf.h |   34 ++
 kernel/bpf/core.c|5 +-
 kernel/bpf/syscall.c |  275 ++
 net/core/filter.c|9 +-
 6 files changed, 358 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 19cd394bdbcc..7bfcad87018e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -47,4 +47,36 @@ struct bpf_map_type_list {
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 struct bpf_map *bpf_map_get(u32 map_id);
 
+/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF 
programs
+ * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
+ * instructions after verifying
+ */
+struct bpf_func_proto {
+   s32 func_off;
+};
+
+struct bpf_verifier_ops {
+   /* return eBPF function prototype for verification */
+   const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id 
func_id);
+};
+
+struct bpf_prog_type_list {
+   struct list_head list_node;
+   struct bpf_verifier_ops *ops;
+   enum bpf_prog_type type;
+};
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl);
+
+struct bpf_prog_info {
+   int prog_id;
+   enum bpf_prog_type prog_type;
+   struct bpf_verifier_ops *ops;
+   u32 *used_maps;
+   u32 used_map_cnt;
+};
+
+void free_bpf_prog_info(struct bpf_prog_info *info);
+struct sk_filter *bpf_prog_get(u32 prog_id);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6766577635ff..9873cc8fd31b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -29,12 +29,17 @@ struct sock_fprog_kern {
 struct sk_buff;
 struct sock;
 struct seccomp_data;
+struct bpf_prog_info;
 
 struct sk_filter {
atomic_trefcnt;
u32 jited:1,/* Is our filter JIT'ed? */
-   len:31; /* Number of filter blocks */
-   struct sock_fprog_kern  *orig_prog; /* Original BPF program */
+   ebpf:1, /* Is it eBPF program ? */
+   len:30; /* Number of filter blocks */
+   union {
+   struct sock_fprog_kern  *orig_prog; /* Original BPF program 
*/
+   struct bpf_prog_info*info;
+   };
struct rcu_head rcu;
unsigned int(*bpf_func)(const struct sk_buff *skb,
const struct sock_filter_int 
*filter);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1399ed1d5dad..ed067e245099 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -340,6 +340,19 @@ enum bpf_cmd {
 * returns zero and stores next key or negative error
 */
BPF_MAP_GET_NEXT_KEY,
+
+   /* verify and load eBPF program
+* prog_id = bpf_prog_load(int prog_id, bpf_prog_type, struct nlattr 
*prog, int len)
+* prog is a sequence of sections
+* returns positive program id or negative error
+*/
+   BPF_PROG_LOAD,
+
+   /* unload eBPF program
+* err = bpf_prog_unload(int prog_id)
+* returns zero or negative error
+*/
+   BPF_PROG_UNLOAD,
 };
 
 enum bpf_map_attributes {
@@ -357,4 +370,25 @@ enum bpf_map_type {
BPF_MAP_TYPE_HASH,
 };
 
+enum bpf_prog_attributes {
+   BPF_PROG_UNSPEC,
+   BPF_PROG_TEXT,  /* array of eBPF instructions */
+   BPF_PROG_LICENSE,   /* license string */
+   __BPF_PROG_ATTR_MAX,
+};
+#define BPF_PROG_ATTR_MAX (__BPF_PROG_ATTR_MAX - 1)
+#define BPF_PROG_MAX_ATTR_SIZE 65535
+
+enum bpf_prog_type {
+   BPF_PROG_TYPE_UNSPEC,
+};
+
+/* integer value in 'imm' field of BPF_CALL instruction selects which helper
+ * function eBPF program intends to call
+ */
+enum bpf_func_id {
+   BPF_FUNC_unspec,
+   __BPF_FUNC_MAX_ID,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index dd9c29ff720e..b9f743929d86 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Registers */
 #define BPF_R0 regs[BPF_REG_0]
@@ -537,9 +538,11 @@ void sk_filter_select_runtime(struct sk_filter *fp)
 }
 EXPORT_SYMBOL_GPL(sk_fi

[PATCH RFC net-next 14/14] samples: bpf: example of tracing filters with eBPF

2014-06-27 Thread Alexei Starovoitov

simple packet drop monitor:
- in-kernel eBPF program attaches to kfree_skb() event and records number
  of packet drops at given location
- userspace iterates over the map every second and prints stats

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/Makefile  |4 +-
 samples/bpf/dropmon.c |  127 +
 2 files changed, 130 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/dropmon.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95c990151644..8e3dfa0c25e4 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,14 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := sock_example
+hostprogs-y := sock_example dropmon
 
 sock_example-objs := sock_example.o libbpf.o
+dropmon-objs := dropmon.o libbpf.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 
 HOSTCFLAGS_libbpf.o += -I$(objtree)/usr/include
 HOSTCFLAGS_sock_example.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropmon.o += -I$(objtree)/usr/include
diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
new file mode 100644
index ..80d80066f518
--- /dev/null
+++ b/samples/bpf/dropmon.c
@@ -0,0 +1,127 @@
+/* simple packet drop monitor:
+ * - in-kernel eBPF program attaches to kfree_skb() event and records number
+ *   of packet drops at given location
+ * - userspace iterates over the map every second and prints stats
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+
+#define MAP_ID 1
+
+#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
+
+static void write_to_file(const char *file, const char *str)
+{
+   int fd, err;
+
+   fd = open(file, O_WRONLY);
+   err = write(fd, str, strlen(str));
+   (void) err;
+   close(fd);
+}
+
+static int dropmon(void)
+{
+   /* the following eBPF program is equivalent to C:
+* void filter(struct bpf_context *ctx)
+* {
+*   long loc = ctx->arg2;
+*   long init_val = 1;
+*   void *value;
+*
+*   value = bpf_map_lookup_elem(MAP_ID, &loc);
+*   if (value) {
+*  (*(long *) value) += 1;
+*   } else {
+*  bpf_map_update_elem(MAP_ID, &loc, &init_val);
+*   }
+* }
+*/
+   static struct sock_filter_int prog[] = {
+   BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 
*)(r1 + 8) */
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp 
- 8) = r2 */
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+   BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+   BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1), /* r1 = 1 */
+   BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 
0, 0), /* xadd r0 += r1 */
+   BPF_EXIT_INSN(),
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 
1 */
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_3, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+   BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_map_update_elem),
+   BPF_EXIT_INSN(),
+   };
+
+   int prog_id = 1, i;
+   long long key, next_key, value = 0;
+   char fmt[32];
+
+   if (bpf_create_map(MAP_ID, sizeof(key), sizeof(value), 1024) < 0) {
+   printf("failed to create map '%s'\n", strerror(errno));
+   goto cleanup;
+   }
+
+   prog_id = bpf_prog_load(prog_id, BPF_PROG_TYPE_TRACING_FILTER, prog, 
sizeof(prog), "GPL");
+   if (prog_id < 0) {
+   printf("failed to load prog '%s'\n", strerror(errno));
+   return -1;
+   }
+
+   sprintf(fmt, "bpf_%d", prog_id);
+
+   write_to_file(TRACEPOINT "filter", fmt);
+   write_to_file(TRACEPOINT "enable", "1");
+
+   for (i = 0; i < 10; i++) {
+   key = 0;
+   while (bpf_get_next_key(MAP_ID, &key, &next_key) == 0) {
+   bpf_lookup_elem(MAP_ID, &next_key, &value);
+   printf("location 0x%llx count %lld\n", next_key, value);
+   key = next_key;
+   }
+   if (key)
+   printf("\n");
+   sleep(1);
+   }
+
+cleanup:
+   bpf_prog_unload(prog_id);
+

[PATCH RFC net-next 09/14] bpf: allow eBPF programs to use maps

2014-06-27 Thread Alexei Starovoitov

expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |5 +++
 include/uapi/linux/bpf.h |3 ++
 kernel/bpf/syscall.c |   85 ++
 3 files changed, 93 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 67fd49eac904..bc505093683a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -127,4 +127,9 @@ struct sk_filter *bpf_prog_get(u32 prog_id);
 /* verify correctness of eBPF program */
 int bpf_check(struct sk_filter *fp);
 
+/* in-kernel helper functions called from eBPF programs */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597a35cc101d..03c65eedd3d5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -389,6 +389,9 @@ enum bpf_prog_type {
  */
 enum bpf_func_id {
BPF_FUNC_unspec,
+   BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(map_id, void *key) */
+   BPF_FUNC_map_update_elem, /* int map_update_elem(map_id, void *key, 
void *value) */
+   BPF_FUNC_map_delete_elem, /* int map_delete_elem(map_id, void *key) */
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 48d8f43da151..266136f0d333 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -691,3 +691,88 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, 
unsigned long, arg3,
return -EINVAL;
}
 }
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ *.ret_type = PTR_TO_MAP_CONDITIONAL,
+ *.arg1_type = CONST_ARG_MAP_ID,
+ *.arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map;
+   int map_id = r1;
+   void *key = (void *) (unsigned long) r2;
+   void *value;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());
+
+   map = idr_find(&bpf_map_id_idr, map_id);
+   /* eBPF verifier guarantees that map_id is valid for the life of
+* the program
+*/
+   BUG_ON(!map);
+
+   value = map->ops->map_lookup_elem(map, key);
+
+   return (unsigned long) value;
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ *.ret_type = RET_INTEGER,
+ *.arg1_type = CONST_ARG_MAP_ID,
+ *.arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ *.arg3_type = PTR_TO_STACK_IMM_MAP_VALUE,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map;
+   int map_id = r1;
+   void *key = (void *) (unsigned long) r2;
+   void *value = (void *) (unsigned long) r3;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());
+
+   map = idr_find(&bpf_map_id_idr, map_id);
+   /* eBPF verifier guarantees that map_id is valid */
+   BUG_ON(!map);
+
+   return map->ops->map_update_elem(map, key, value);
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * (struct bpf_func_proto) {
+ *.ret_type = RET_INTEGER,
+ *.arg1_type = CONST_ARG_MAP_ID,
+ *.arg2_type = PTR_TO_STACK_IMM_MAP_KEY,
+ * }
+ * so that eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map;
+   int map_id = r1;
+   void *key = (void *) (unsigned long) r2;
+
+   WARN_ON_ONCE(!rcu_read_lock_held());
+
+   map = idr_find(&bpf_map_id_idr, map_id);
+   /* eBPF verifier guarantees that map_id is valid */
+   BUG_ON(!map);
+
+   return map->ops->map_delete_elem(map, key);
+}
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC net-next 08/14] bpf: add eBPF verifier

2014-06-27 Thread Alexei Starovoitov

Safety of eBPF programs is statically determined by the verifier, which detects:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention

It checks that
- R1-R5 registers statisfy function prototype
- program terminates
- BPF_LD_ABS|IND instructions are only used in socket filters

It is configured with:

- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
  that provides information to the verifer which fields of 'ctx'
  are accessible (remember 'ctx' is the first argument to eBPF program)

- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
  reports argument types of kernel helper functions that eBPF program
  may call, so that verifier can checks that R1-R5 types match prototype

More details in Documentation/networking/filter.txt

Signed-off-by: Alexei Starovoitov 
---
 Documentation/networking/filter.txt |  233 ++
 include/linux/bpf.h |   48 ++
 include/uapi/linux/bpf.h|1 +
 kernel/bpf/Makefile |2 +-
 kernel/bpf/syscall.c|2 +-
 kernel/bpf/verifier.c   | 1431 +++
 6 files changed, 1715 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/verifier.c

diff --git a/Documentation/networking/filter.txt 
b/Documentation/networking/filter.txt
index e14e486f69cd..05fee8fcedf1 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -995,6 +995,108 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + 
off16) += src_reg
 Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
 2 byte atomic increments are not supported.
 
+eBPF verifier
+-
+The safety of the eBPF program is determined in two steps.
+
+First step does DAG check to disallow loops and other CFG validation.
+In particular it will detect programs that have unreachable instructions.
+(though classic BPF checker allows them)
+
+Second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX.
+If verifier sees an insn that does R2=R1, then R2 has now type
+PTR_TO_CTX as well and can be used on the right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=INVALID_PTR,
+since addition of two valid pointers makes invalid pointer.
+
+If register was never written to, it's not readable:
+  bpf_mov R0 = R2
+  bpf_exit
+will be rejected, since R2 is unreadable at the start of the program.
+
+After kernel function call, R1-R5 are reset to unreadable and
+R0 has a return type of the function.
+
+Since R6-R9 are callee saved, their state is preserved across the call.
+  bpf_mov R6 = 1
+  bpf_call foo
+  bpf_mov R0 = R6
+  bpf_exit
+is a correct program. If there was R1 instead of R6, it would have
+been rejected.
+
+Classic BPF register X is mapped to eBPF register R7 inside 
sk_convert_filter(),
+so that its state is preserved across calls.
+
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment 
checked.
+For example:
+ bpf_mov R1 = 1
+ bpf_mov R2 = 2
+ bpf_xadd *(u32 *)(R1 + 3) += R2
+ bpf_exit
+will be rejected, since R1 doesn't have a valid pointer type at the time of
+execution of instruction bpf_xadd.
+
+At the start R1 contains pointer to ctx and R1 type is PTR_TO_CTX.
+ctx is generic. verifier is configured to known what context is for particular
+class of bpf programs. For example, context == skb (for socket filters) and
+ctx == seccomp_data for seccomp filters.
+A callback is used to customize verifier to restrict eBPF program access to 
only
+certain fields within ctx structure with specified size and alignment.
+
+For example, the following insn:
+  bpf_ld R0 = *(u32 *)(R6 + 8)
+intends to load a word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise
+the verifier will reject the program.
+If R6=PTR_TO_STACK, then access should be aligned and be within
+stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
+so it will fail verification, since it's out of bounds.
+
+The verifier will allow eBPF program to read data from stack only after
+it wrote into it.
+Classic BPF verifier does similar check with M[0-15] memory slots.
+For example:
+  bpf_ld R0 = *(u32 *)(R10 - 4)
+  bpf_exit
+is invalid program.
+Though R10 is correct read-only register and has type PTR_TO_STACK
+and R10 - 4 is within stack bounds, there were no stores into that location.
+
+Pointer register spill/fill is

[PATCH RFC net-next 12/14] samples: bpf: add mini eBPF library to manipulate maps and programs

2014-06-27 Thread Alexei Starovoitov

the library includes a trivial set of BPF syscall wrappers:

int bpf_delete_map(int map_id);

int bpf_create_map(int map_id, int key_size, int value_size, int max_entries);

int bpf_update_elem(int map_id, void *key, void *value);

int bpf_lookup_elem(int map_id, void *key, void *value);

int bpf_delete_elem(int map_id, void *key);

int bpf_get_next_key(int map_id, void *key, void *next_key);

int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type,
  struct sock_filter_int *insns, int insn_cnt,
  const char *license);

int bpf_prog_unload(int prog_id);

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/libbpf.c |  114 ++
 samples/bpf/libbpf.h |   18 
 2 files changed, 132 insertions(+)
 create mode 100644 samples/bpf/libbpf.c
 create mode 100644 samples/bpf/libbpf.h

diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
new file mode 100644
index ..763eaf4b9814
--- /dev/null
+++ b/samples/bpf/libbpf.c
@@ -0,0 +1,114 @@
+/* eBPF mini library */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+
+struct nlattr_u32 {
+   __u16 nla_len;
+   __u16 nla_type;
+   __u32 val;
+};
+
+int bpf_delete_map(int map_id)
+{
+   return syscall(__NR_bpf, BPF_MAP_DELETE, map_id);
+}
+
+int bpf_create_map(int map_id, int key_size, int value_size, int max_entries)
+{
+   struct nlattr_u32 attr[] = {
+   {
+   .nla_len = sizeof(struct nlattr_u32),
+   .nla_type = BPF_MAP_KEY_SIZE,
+   .val = key_size,
+   },
+   {
+   .nla_len = sizeof(struct nlattr_u32),
+   .nla_type = BPF_MAP_VALUE_SIZE,
+   .val = value_size,
+   },
+   {
+   .nla_len = sizeof(struct nlattr_u32),
+   .nla_type = BPF_MAP_MAX_ENTRIES,
+   .val = max_entries,
+   },
+   };
+   int err;
+
+   err = syscall(__NR_bpf, BPF_MAP_CREATE, map_id, BPF_MAP_TYPE_HASH, 
attr, sizeof(attr));
+   if (err > 0 && err != map_id && map_id != 0) {
+   bpf_delete_map(err);
+   errno = EEXIST;
+   err = -1;
+   }
+   return err;
+}
+
+
+int bpf_update_elem(int map_id, void *key, void *value)
+{
+   return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, map_id, key, value);
+}
+
+int bpf_lookup_elem(int map_id, void *key, void *value)
+{
+   return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, map_id, key, value);
+}
+
+int bpf_delete_elem(int map_id, void *key)
+{
+   return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, map_id, key);
+}
+
+int bpf_get_next_key(int map_id, void *key, void *next_key)
+{
+   return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, map_id, key, next_key);
+}
+
+#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
+
+int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type,
+ struct sock_filter_int *insns, int prog_len,
+ const char *license)
+{
+   int nlattr_size, license_len, err;
+   void *nlattr, *ptr;
+
+   license_len = strlen(license) + 1;
+   nlattr_size = sizeof(struct nlattr) + prog_len + sizeof(struct nlattr) +
+   ROUND_UP(license_len, 4);
+
+   ptr = nlattr = malloc(nlattr_size);
+
+   *(struct nlattr *) ptr = (struct nlattr) {
+   .nla_len = prog_len + sizeof(struct nlattr),
+   .nla_type = BPF_PROG_TEXT,
+   };
+   ptr += sizeof(struct nlattr);
+
+   memcpy(ptr, insns, prog_len);
+   ptr += prog_len;
+
+   *(struct nlattr *) ptr = (struct nlattr) {
+   .nla_len = ROUND_UP(license_len, 4) + sizeof(struct nlattr),
+   .nla_type = BPF_PROG_LICENSE,
+   };
+   ptr += sizeof(struct nlattr);
+
+   memcpy(ptr, license, license_len);
+
+   err = syscall(__NR_bpf, BPF_PROG_LOAD, prog_id, prog_type, nlattr,
+ nlattr_size);
+   free(nlattr);
+   return err;
+}
+
+int bpf_prog_unload(int prog_id)
+{
+   return syscall(__NR_bpf, BPF_PROG_UNLOAD, prog_id);
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
new file mode 100644
index ..408368e6d4d5
--- /dev/null
+++ b/samples/bpf/libbpf.h
@@ -0,0 +1,18 @@
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+struct sock_filter_int;
+
+int bpf_delete_map(int map_id);
+int bpf_create_map(int map_id, int key_size, int value_size, int max_entries);
+int bpf_update_elem(int map_id, void *key, void *value);
+int bpf_lookup_elem(int map_id, void *key, void *value);
+int bpf_delete_elem(int map_id, void *key);
+int bpf_get_next_key(int map_id, void *key, void *next_key);
+int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type,
+ struct sock_filter_int *insns, int insn_cnt,
+ const char *l

[PATCH RFC net-next 11/14] tracing: allow eBPF programs to be attached to events

2014-06-27 Thread Alexei Starovoitov

User interface:
cat bpf_123 > /sys/kernel/debug/tracing/__event__/filter

where 123 is an id of the eBPF program priorly loaded.
__event__ is static tracepoint event.
(kprobe events will be supported in the future patches)

eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- memcmp
- trace_printk
- load_pointer
- dump_stack

Signed-off-by: Alexei Starovoitov 
---
 include/linux/ftrace_event.h   |5 +
 include/trace/bpf_trace.h  |   29 +
 include/trace/ftrace.h |   10 ++
 include/uapi/linux/bpf.h   |5 +
 kernel/trace/Kconfig   |1 +
 kernel/trace/Makefile  |1 +
 kernel/trace/bpf_trace.c   |  217 
 kernel/trace/trace.h   |3 +
 kernel/trace/trace_events.c|7 ++
 kernel/trace/trace_events_filter.c |   72 +++-
 10 files changed, 349 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cff3106ffe2c..de313bd9a434 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -237,6 +237,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED_BIT,
TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
TRACE_EVENT_FL_TRACEPOINT_BIT,
+   TRACE_EVENT_FL_BPF_BIT,
 };
 
 /*
@@ -259,6 +260,7 @@ enum {
TRACE_EVENT_FL_WAS_ENABLED  = (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
TRACE_EVENT_FL_USE_CALL_FILTER  = (1 << 
TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
TRACE_EVENT_FL_TRACEPOINT   = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+   TRACE_EVENT_FL_BPF  = (1 << TRACE_EVENT_FL_BPF_BIT),
 };
 
 struct ftrace_event_call {
@@ -536,6 +538,9 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file 
*file,
event_triggers_post_call(file, tt);
 }
 
+struct bpf_context;
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context 
*ctx);
+
 enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index ..2122437f1317
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,29 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracing filters save first six arguments of tracepoint events.
+ * On 64-bit architectures argN fields will match one to one to arguments 
passed
+ * to tracepoint events.
+ * On 32-bit architectures u64 arguments to events will be seen into two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
+ * match one to one
+ */
+struct bpf_context {
+   unsigned long arg1;
+   unsigned long arg2;
+   unsigned long arg3;
+   unsigned long arg4;
+   unsigned long arg5;
+   unsigned long arg6;
+};
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...);
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..ad4987ac68bb 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
  */
 
 #include 
+#include 
 
 /*
  * DECLARE_EVENT_CLASS can be used to add a generic function
@@ -634,6 +635,15 @@ ftrace_raw_event_##call(void *__data, proto)   
\
if (ftrace_trigger_soft_disabled(ftrace_file))  \
return; \
\
+   if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&  \
+   unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
+   struct bpf_context __ctx;   \
+   \
+   populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0);  \
+   trace_filter_call_bpf(ftrace_file->filter, &__ctx); \
+   return; \
+   }   \
+   \
__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
\
entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file,  \
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 03c65eedd3d5..d03b8b39e031 100644
--- a/include/uapi/linux/bpf.h
+++

[PATCH RFC net-next 13/14] samples: bpf: example of stateful socket filtering

2014-06-27 Thread Alexei Starovoitov

this socket filter example does:

- creates a hashtable in kernel with key 4 bytes and value 8 bytes

- populates map[6] = 0; map[17] = 0;  // 6 - tcp_proto, 17 - udp_proto

- loads eBPF program:
  r0 = skb[14 + 9]; // load one byte of ip->proto
  *(u32*)(fp - 4) = r0;
  value = bpf_map_lookup_elem(map_id, fp - 4);
  if (value)
   (*(u64*)value) += 1;

- attaches this program to eth0 raw socket

- every second user space reads map[6] and map[17] to see how many
  TCP and UDP packets were seen on eth0

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/.gitignore |1 +
 samples/bpf/Makefile   |   13 
 samples/bpf/sock_example.c |  160 
 3 files changed, 174 insertions(+)
 create mode 100644 samples/bpf/.gitignore
 create mode 100644 samples/bpf/Makefile
 create mode 100644 samples/bpf/sock_example.c

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index ..5465c6e92a00
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1 @@
+sock_example
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
new file mode 100644
index ..95c990151644
--- /dev/null
+++ b/samples/bpf/Makefile
@@ -0,0 +1,13 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := sock_example
+
+sock_example-objs := sock_example.o libbpf.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_libbpf.o += -I$(objtree)/usr/include
+HOSTCFLAGS_sock_example.o += -I$(objtree)/usr/include
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
new file mode 100644
index ..5cf091571d4f
--- /dev/null
+++ b/samples/bpf/sock_example.c
@@ -0,0 +1,160 @@
+/* eBPF example program:
+ * - creates a hashtable in kernel with key 4 bytes and value 8 bytes
+ *
+ * - populates map[6] = 0; map[17] = 0;  // 6 - tcp_proto, 17 - udp_proto
+ *
+ * - loads eBPF program:
+ *   r0 = skb[14 + 9]; // load one byte of ip->proto
+ *   *(u32*)(fp - 4) = r0;
+ *   value = bpf_map_lookup_elem(map_id, fp - 4);
+ *   if (value)
+ *(*(u64*)value) += 1;
+ *
+ * - attaches this program to eth0 raw socket
+ *
+ * - every second user space reads map[6] and map[17] to see how many
+ *   TCP and UDP packets were seen on eth0
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+
+static int open_raw_sock(const char *name)
+{
+   struct sockaddr_ll sll;
+   struct packet_mreq mr;
+   struct ifreq ifr;
+   int sock;
+
+   sock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC, 
htons(ETH_P_ALL));
+   if (sock < 0) {
+   printf("cannot open socket!\n");
+   return -1;
+   }
+
+   memset(&ifr, 0, sizeof(ifr));
+   strncpy((char *)ifr.ifr_name, name, IFNAMSIZ);
+   if (ioctl(sock, SIOCGIFINDEX, &ifr) < 0) {
+   printf("ioctl: %s\n", strerror(errno));
+   close(sock);
+   return -1;
+   }
+
+   memset(&sll, 0, sizeof(sll));
+   sll.sll_family = AF_PACKET;
+   sll.sll_ifindex = ifr.ifr_ifindex;
+   sll.sll_protocol = htons(ETH_P_ALL);
+   if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+   printf("bind: %s\n", strerror(errno));
+   close(sock);
+   return -1;
+   }
+
+   memset(&mr, 0, sizeof(mr));
+   mr.mr_ifindex = ifr.ifr_ifindex;
+   mr.mr_type = PACKET_MR_PROMISC;
+   if (setsockopt(sock, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, 
sizeof(mr)) < 0) {
+   printf("set_promisc: %s\n", strerror(errno));
+   close(sock);
+   return -1;
+   }
+   return sock;
+}
+
+#define MAP_ID 1
+
+static int test_sock(void)
+{
+   static struct sock_filter_int prog[] = {
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1),
+   BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
+   BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 
4) = r0 */
+   BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+   BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, MAP_ID), /* r1 = MAP_ID */
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_map_lookup_elem),
+   BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+   BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1), /* r1 = 1 */
+   BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 
0, 0), /* xadd r0 += r1 */
+   BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0), /* r0 = 0 */
+   BPF_EXIT_INSN(),
+   };
+
+   int sock = -1, prog_id = 1, i, key;
+   long long value = 0, tcp_cnt, udp_cnt;
+
+   if (bpf_create_map(MAP_ID

[PATCH RFC net-next 01/14] net: filter: split filter.c into two files

2014-06-27 Thread Alexei Starovoitov

BPF is used in several kernel components. This split creates logical boundary
between generic eBPF core and the rest

kernel/bpf/core.c: eBPF interpreter

net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters

This patch only moves functions.

Signed-off-by: Alexei Starovoitov 
---
 kernel/Makefile |1 +
 kernel/bpf/Makefile |1 +
 kernel/bpf/core.c   |  545 +++
 net/core/filter.c   |  520 
 4 files changed, 547 insertions(+), 520 deletions(-)
 create mode 100644 kernel/bpf/Makefile
 create mode 100644 kernel/bpf/core.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b6246ce9..e7360b7c2c0e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
+obj-$(CONFIG_NET) += bpf/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
new file mode 100644
index ..6a71145e2769
--- /dev/null
+++ b/kernel/bpf/Makefile
@@ -0,0 +1 @@
+obj-y := core.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
new file mode 100644
index ..dd9c29ff720e
--- /dev/null
+++ b/kernel/bpf/core.c
@@ -0,0 +1,545 @@
+/*
+ * Linux Socket Filter - Kernel level socket filtering
+ *
+ * Based on the design of the Berkeley Packet Filter. The new
+ * internal format has been designed by PLUMgrid:
+ *
+ * Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
+ *
+ * Authors:
+ *
+ * Jay Schulist 
+ * Alexei Starovoitov 
+ * Daniel Borkmann 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Andi Kleen - Fix a few bad bugs and races.
+ * Kris Katterjohn - Added many additional checks in sk_chk_filter()
+ */
+#include 
+#include 
+#include 
+
+/* Registers */
+#define BPF_R0 regs[BPF_REG_0]
+#define BPF_R1 regs[BPF_REG_1]
+#define BPF_R2 regs[BPF_REG_2]
+#define BPF_R3 regs[BPF_REG_3]
+#define BPF_R4 regs[BPF_REG_4]
+#define BPF_R5 regs[BPF_REG_5]
+#define BPF_R6 regs[BPF_REG_6]
+#define BPF_R7 regs[BPF_REG_7]
+#define BPF_R8 regs[BPF_REG_8]
+#define BPF_R9 regs[BPF_REG_9]
+#define BPF_R10regs[BPF_REG_10]
+
+/* Named registers */
+#define DSTregs[insn->dst_reg]
+#define SRCregs[insn->src_reg]
+#define FP regs[BPF_REG_FP]
+#define ARG1   regs[BPF_REG_ARG1]
+#define CTXregs[BPF_REG_CTX]
+#define IMMinsn->imm
+
+/* No hurry in this branch
+ *
+ * Exported for the bpf jit load helper.
+ */
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, 
unsigned int size)
+{
+   u8 *ptr = NULL;
+
+   if (k >= SKF_NET_OFF)
+   ptr = skb_network_header(skb) + k - SKF_NET_OFF;
+   else if (k >= SKF_LL_OFF)
+   ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
+   if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
+   return ptr;
+
+   return NULL;
+}
+
+static inline void *load_pointer(const struct sk_buff *skb, int k,
+unsigned int size, void *buffer)
+{
+   if (k >= 0)
+   return skb_header_pointer(skb, k, size, buffer);
+
+   return bpf_internal_load_pointer_neg_helper(skb, k, size);
+}
+
+/* Base function for offset calculation. Needs to go into .text section,
+ * therefore keeping it non-static as well; will also be used by JITs
+ * anyway later on, so do not let the compiler omit it.
+ */
+noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   return 0;
+}
+
+/**
+ * __sk_run_filter - run a filter on a given context
+ * @ctx: buffer to run the filter on
+ * @insn: filter to apply
+ *
+ * Decode and apply filter instructions to the skb->data. Return length to
+ * keep, 0 for none. @ctx is the data we are operating on, @insn is the
+ * array of filter instructions.
+ */
+static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int 
*insn)
+{
+   u64 stack[MAX_BPF_STACK / sizeof(u64)];
+   u64 regs[MAX_BPF_REG], tmp;
+   static const void *jumptable[256] = {
+   [0 ... 255] = &&default_label,
+   /* Now overwrite non-defaults ... */
+   /* 32 bit ALU operations */
+   [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
+   [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
+   [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
+   [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
+   [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
+   [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
+   [BPF_ALU | BPF_OR | BPF_X]  = &&ALU_OR_X,
+   [BPF_ALU | BPF_OR | BPF_K]  = &&A

[PATCH RFC net-next 10/14] net: sock: allow eBPF programs to be attached to sockets

2014-06-27 Thread Alexei Starovoitov

introduce new setsockopt() command:

int prog_id;
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &prog_id, sizeof(prog_id))

prog_id is eBPF program id priorly loaded via:

prog_id = syscall(__NR_bpf, BPF_PROG_LOAD, 0, BPF_PROG_TYPE_SOCKET_FILTER,
  &prog, sizeof(prog));

setsockopt() calls bpf_prog_get() which increment refcnt of the program,
so it doesn't get unloaded while socket is using the program.

The same eBPF program can be attached to different sockets.

Program exit automatically closes socket which calls sk_filter_uncharge()
which decrements refcnt of eBPF program

Signed-off-by: Alexei Starovoitov 
---
 arch/alpha/include/uapi/asm/socket.h   |2 +
 arch/avr32/include/uapi/asm/socket.h   |2 +
 arch/cris/include/uapi/asm/socket.h|2 +
 arch/frv/include/uapi/asm/socket.h |2 +
 arch/ia64/include/uapi/asm/socket.h|2 +
 arch/m32r/include/uapi/asm/socket.h|2 +
 arch/mips/include/uapi/asm/socket.h|2 +
 arch/mn10300/include/uapi/asm/socket.h |2 +
 arch/parisc/include/uapi/asm/socket.h  |2 +
 arch/powerpc/include/uapi/asm/socket.h |2 +
 arch/s390/include/uapi/asm/socket.h|2 +
 arch/sparc/include/uapi/asm/socket.h   |2 +
 arch/xtensa/include/uapi/asm/socket.h  |2 +
 include/linux/filter.h |1 +
 include/uapi/asm-generic/socket.h  |2 +
 net/core/filter.c  |  117 
 net/core/sock.c|   13 
 17 files changed, 159 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 3de1394bcab8..8c83c376b5ba 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h 
b/arch/avr32/include/uapi/asm/socket.h
index 6e6cd159924b..498ef7220466 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h 
b/arch/cris/include/uapi/asm/socket.h
index ed94e5ed0a23..0d5120724780 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -82,6 +82,8 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _ASM_SOCKET_H */
 
 
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index ca2c6e6f31c6..81fba267c285 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -80,5 +80,7 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index a1b49bac7951..9cbb2e82fa7c 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -89,4 +89,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index 6c9a24b3aefa..587ac2fb4106 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index a14baa218c76..ab1aed2306db 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -98,4 +98,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index 6aa3ce1854aa..1c4f916d0ef1 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index fe35ceacf0e7..d189bb79ca07 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -79,4 +79,6 @@
 
 #define SO_BPF_EXTENSIONS  0x4029
 
+#define SO_ATTACH_FILTER_EBPF  0x402a
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h 
b/arch/powerpc/include/uapi/asm/socket.h
index a9c3e2e18c05..88488f24ae7f 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
 
 #define SO_BPF_EXTENSIONS  48
 
+#define SO_ATTACH_FILTER_EBPF  49
+
 #endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/a

[PATCH RFC net-next 06/14] bpf: add hashtable type of BPF maps

2014-06-27 Thread Alexei Starovoitov

add new map type: BPF_MAP_TYPE_HASH
and its simple (not auto resizeable) hash table implementation

Signed-off-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h |1 +
 kernel/bpf/Makefile  |2 +-
 kernel/bpf/hashtab.c |  371 ++
 3 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/hashtab.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index faed2ce2d25a..1399ed1d5dad 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -354,6 +354,7 @@ enum bpf_map_attributes {
 
 enum bpf_map_type {
BPF_MAP_TYPE_UNSPEC,
+   BPF_MAP_TYPE_HASH,
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e9f7334ed07a..558e12712ebc 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o
+obj-y := core.o syscall.o hashtab.o
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
new file mode 100644
index ..6e481cacbba3
--- /dev/null
+++ b/kernel/bpf/hashtab.c
@@ -0,0 +1,371 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+
+struct bpf_htab {
+   struct bpf_map map;
+   struct hlist_head *buckets;
+   struct kmem_cache *elem_cache;
+   char *slab_name;
+   spinlock_t lock;
+   u32 count; /* number of elements in this hashtable */
+   u32 n_buckets; /* number of hash buckets */
+   u32 elem_size; /* size of each element in bytes */
+};
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+   struct hlist_node hash_node;
+   struct rcu_head rcu;
+   struct bpf_htab *htab;
+   u32 hash;
+   u32 pad;
+   char key[0];
+};
+
+#define HASH_MAX_BUCKETS 1024
+#define BPF_MAP_MAX_KEY_SIZE 256
+static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 
1])
+{
+   struct bpf_htab *htab;
+   int err, i;
+
+   htab = kmalloc(sizeof(*htab), GFP_USER);
+   if (!htab)
+   return ERR_PTR(-ENOMEM);
+
+   /* look for mandatory map attributes */
+   err = -EINVAL;
+   if (!attr[BPF_MAP_KEY_SIZE])
+   goto free_htab;
+   htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
+
+   if (!attr[BPF_MAP_VALUE_SIZE])
+   goto free_htab;
+   htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
+
+   if (!attr[BPF_MAP_MAX_ENTRIES])
+   goto free_htab;
+   htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
+
+   htab->n_buckets = (htab->map.max_entries <= HASH_MAX_BUCKETS) ?
+ htab->map.max_entries : HASH_MAX_BUCKETS;
+
+   /* hash table size must be power of 2 */
+   if ((htab->n_buckets & (htab->n_buckets - 1)) != 0)
+   goto free_htab;
+
+   err = -E2BIG;
+   if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
+   goto free_htab;
+
+   err = -ENOMEM;
+   htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
+   GFP_USER);
+
+   if (!htab->buckets)
+   goto free_htab;
+
+   for (i = 0; i < htab->n_buckets; i++)
+   INIT_HLIST_HEAD(&htab->buckets[i]);
+
+   spin_lock_init(&htab->lock);
+   htab->count = 0;
+
+   htab->elem_size = sizeof(struct htab_elem) +
+ round_up(htab->map.key_size, 8) +
+ htab->map.value_size;
+
+   htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);
+   if (!htab->slab_name)
+   goto free_buckets;
+
+   htab->elem_cache = kmem_cache_create(htab->slab_name,
+htab->elem_size, 0, 0, NULL);
+   if (!htab->elem_cache)
+   goto free_slab_name;
+
+   return &htab->map;
+
+free_slab_name:
+   kfree(htab->slab_name);
+free_buckets:
+   kfree(htab->buckets);
+free_htab:
+   kfree(htab);
+   return ERR_PTR(err);
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+   return jhash(key, key_len, 0);
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+   return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
+void *key, u32 key_size)
+{
+   struct htab_elem *l;
+
+   hlist_for_each_entry_rcu(l, head,

[PATCH v1 1/2] genirq: Fix error path for resuming irqs

2014-06-27 Thread Derek Basehore

In the case of a late abort to suspend/hibernate, irqs marked with
IRQF_EARLY_RESUME will not be enabled. This is due to syscore_resume not getting
called on these paths.

This can happen with a pm test for platform, a late wakeup irq, and other
instances. This change removes the function from syscore and calls it explicitly
in suspend, hibernate, etc.

This regression was introduced in 9bab0b7f "genirq: Add IRQF_RESUME_EARLY"

Signed-off-by: Derek Basehore 
---
 drivers/base/power/main.c |  5 -
 drivers/xen/manage.c  |  5 -
 include/linux/interrupt.h |  1 +
 include/linux/pm.h|  2 +-
 kernel/irq/pm.c   | 17 +++--
 kernel/kexec.c|  2 +-
 kernel/power/hibernate.c  |  6 +++---
 kernel/power/suspend.c|  2 +-
 8 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index bf41296..a087473 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -712,8 +712,10 @@ static void dpm_resume_early(pm_message_t state)
  * dpm_resume_start - Execute "noirq" and "early" device callbacks.
  * @state: PM transition of the system being carried out.
  */
-void dpm_resume_start(pm_message_t state)
+void dpm_resume_start(pm_message_t state, bool enable_early_irqs)
 {
+   if (enable_early_irqs)
+   early_resume_device_irqs();
dpm_resume_noirq(state);
dpm_resume_early(state);
 }
@@ -1132,6 +1134,7 @@ static int dpm_suspend_noirq(pm_message_t state)
if (error) {
suspend_stats.failed_suspend_noirq++;
dpm_save_failed_step(SUSPEND_SUSPEND_NOIRQ);
+   early_resume_device_irqs();
dpm_resume_noirq(resume_event(state));
} else {
dpm_show_time(starttime, state, "noirq");
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c3667b2..d387cdf 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -68,6 +68,7 @@ static int xen_suspend(void *data)
err = syscore_suspend();
if (err) {
pr_err("%s: system core suspend failed: %d\n", __func__, err);
+   early_resume_device_irqs();
return err;
}
 
@@ -92,6 +93,8 @@ static int xen_suspend(void *data)
xen_timer_resume();
}
 
+   early_resume_device_irqs();
+
syscore_resume();
 
return 0;
@@ -137,7 +140,7 @@ static void do_suspend(void)
 
raw_notifier_call_chain(&xen_resume_notifier, 0, NULL);
 
-   dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+   dpm_resume_start(si.cancelled ? PMSG_THAW : PMSG_RESTORE, false);
 
if (err) {
pr_err("failed to start xen_suspend: %d\n", err);
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 698ad05..7f390e3 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -193,6 +193,7 @@ extern void irq_wake_thread(unsigned int irq, void *dev_id);
 /* The following three functions are for the core kernel use only. */
 extern void suspend_device_irqs(void);
 extern void resume_device_irqs(void);
+extern void early_resume_device_irqs(void);
 #ifdef CONFIG_PM_SLEEP
 extern int check_wakeup_irqs(void);
 #else
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 72c0fe0..ae5b26a 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -677,7 +677,7 @@ struct dev_pm_domain {
 
 #ifdef CONFIG_PM_SLEEP
 extern void device_pm_lock(void);
-extern void dpm_resume_start(pm_message_t state);
+extern void dpm_resume_start(pm_message_t state, bool enable_early_irqs);
 extern void dpm_resume_end(pm_message_t state);
 extern void dpm_resume(pm_message_t state);
 extern void dpm_complete(pm_message_t state);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index abcd6ca..b07dc9c 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -60,26 +60,15 @@ static void resume_irqs(bool want_early)
 }
 
 /**
- * irq_pm_syscore_ops - enable interrupt lines early
+ * early_resume_device_irqs - enable interrupt lines early
  *
  * Enable all interrupt lines with %IRQF_EARLY_RESUME set.
  */
-static void irq_pm_syscore_resume(void)
+void early_resume_device_irqs(void)
 {
resume_irqs(true);
 }
-
-static struct syscore_ops irq_pm_syscore_ops = {
-   .resume = irq_pm_syscore_resume,
-};
-
-static int __init irq_pm_init_ops(void)
-{
-   register_syscore_ops(&irq_pm_syscore_ops);
-   return 0;
-}
-
-device_initcall(irq_pm_init_ops);
+EXPORT_SYMBOL_GPL(early_resume_device_irqs);
 
 /**
  * resume_device_irqs - enable interrupt lines disabled by 
suspend_device_irqs()
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 369f41a..272853b 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1700,7 +1700,7 @@ int kernel_kexec(void)
local_irq_enable();
  Enable_cpus:
enable_nonboot_cpus();
-   dpm_resume_start(PMSG_RESTORE);
+   dpm_resume_start(PMSG_RESTO

[PATCH v1 2/2] Revert "irq: Enable all irqs unconditionally in irq_resume"

2014-06-27 Thread Derek Basehore

This reverts the fix to IRQF_EARLY_RESUME irqs staying disabled after a suspend
failure. It incorrectly stated that Xen is the only platform that uses this
feature. Some rtc drivers such as rtc-as3722.c use the feature and can have its
irq permanently enabled with the change. The driver does disable/enable the irq
for the rtc alarm, so it needs a different fix which is in "genirq: Fix error
path for resuming irqs"

We should also keep correct enable/disable parity for irqs.

This reverts commit ac01810c9d2814238f08a227062e66a35a0e1ea2.

Signed-off-by: Derek Basehore 
---
 kernel/irq/pm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index b07dc9c..a5eaf1f 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -50,7 +50,7 @@ static void resume_irqs(bool want_early)
bool is_early = desc->action &&
desc->action->flags & IRQF_EARLY_RESUME;
 
-   if (!is_early && want_early)
+   if (is_early != want_early)
continue;
 
raw_spin_lock_irqsave(&desc->lock, flags);
-- 
2.0.0.526.g5318336

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v8 4/4] printk: allow increasing the ring buffer depending on the number of CPUs

2014-06-27 Thread Andrew Morton

On Thu, 26 Jun 2014 16:32:15 -0700 "Luis R. Rodriguez"  wrote:

> On Thu, Jun 26, 2014 at 4:20 PM, Andrew Morton
>  wrote:
> > On Fri, 27 Jun 2014 01:16:30 +0200 "Luis R. Rodriguez"  
> > wrote:
> >
> >> > > Another note --  since this option depends on SMP and !BASE_SMALL 
> >> > > technically
> >> > > num_possible_cpus() won't ever return something smaller than or equal 
> >> > > to 1
> >> > > but because of the default values chosen the -1 on the compuation does 
> >> > > affect
> >> > > whether or not this will trigger on > 64 CPUs or >= 64 CPUs, keeping 
> >> > > the
> >> > > -1 means we require > 64 CPUs.
> >> >
> >> > hm, that sounds like more complexity.
> >> >
> >> > > This all can be changed however we like but the language and explained 
> >> > > logic
> >> > > would just need to be changed.
> >> >
> >> > Let's start out simple.  What's wrong with doing
> >> >
> >> > log buf len = max(__LOG_BUF_LEN, nr_possible_cpus * per-cpu log buf 
> >> > len)
> >>
> >> Sure, you already took in the patch series though so how would you like to
> >> handle a respin, you just drop the last patch and we respin it?
> >
> > A fresh patch would suit.  That's if you think it is a reasonable
> > approach - you've thought about this stuff more than I have!
> 
> The way its implemented now makes more technical sense, in short it
> assumes the first boot (and CPU) gets the full default kernel ring
> buffer size, the extra size is for the gibberish that each extra CPU
> is expected to spew out in the worst case. What you propose makes the
> explanation simpler and easier to understand but sends the wrong
> message about exactly how the growth of the kernel ring buffer is
> expected scale with the addition of more CPUs.

OK, it's finally starting to sink in.  The model for the kernel-wide
printk output is "a great pile of CPU-independent stuff plus a certain
amount of per-cpu stuff".  And the code at present attempts to follow
that model.  Yes?

I'm rather internet-challenged at present - please let me take another look at
the patch on Monday.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: Find correct 64 bit ramdisk address for microcode early update

2014-06-27 Thread H. Peter Anvin

On 06/10/2014 10:04 PM, Yinghai Lu wrote:
> When using kexec with 64bit kernel, bzImage and ramdisk could be
> loaded above 4G. We need this to get correct ramdisk adress.
> 
> Make get_ramdisk_image() global and use it for early microcode updating.
> Also make it to take boot_params pointer for different usage.
> 
> Signed-off-by: Yinghai Lu 

Please update your patch description to explain what "different usage"
you have in mind here.

-hpa


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] Tools: hv: fix file overwriting of hv_fcopy_daemon

2014-06-27 Thread Yue Zhang (OSTC DEV)



> -Original Message-
> From: Greg KH [mailto:g...@kroah.com]
> > From: Yue Zhang 
> >
> > hv_fcopy_daemon fails to overwrite a file if the target file already
> > exits.
> >
> > Add O_TRUNC flag on opening.
> >
> > MS-TFS: 341345
> 
> It's as if the people on your team don't talk to each other about what they
> should, or should not, include in their patch descriptions...
> 
> Please remove.

Sorry for this. It is added by mistake. I will remove it.

Yue Zhang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 0/14] input: cyapa: re-architecture driver to support multi-trackpads in one driver

2014-06-27 Thread Patrik Fimml

Hi Dudley,

I tried to apply your patchset today, but was not successful: it seems
like tabs have been replaced by spaces, and there's a Cypress
signature and a winmail.dat file added to every email, making it
impossible to apply your patches directly.

I've tried to rule out errors on my end. I checked with
http://marc.info/?l=linux-input&m=140203994303131&q=raw that the
original email indeed has all tabs replaced with spaces.

Can you fix your email setup so that these things don't happen - there
is some documentation in Documentation/SubmittingPatches and
Documentation/email-clients.txt - and send the patches again?

Alternatively, maybe you could at least send the patches as
attachments (as output by git format-patch), so that your email system
doesn't mess with them. That's probably not the preferred solution for
the general lkml audience, but would work as a short-term solution for
me.

If I'm mistaken here and someone else was able to apply the patches
successfully, please point me in the right direction.

Thanks,
Patrik
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFA][PATCH 02/27] PM / Sleep: Remove ftrace_stop/start() from suspend and hibernate

2014-06-27 Thread Rafael J. Wysocki

On Thursday, June 26, 2014 12:52:23 PM Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" 
> 
> ftrace_stop() and ftrace_start() were added to the suspend and hibernate
> process because there was some function within the work flow that caused
> the system to reboot if it was traced. This function has recently been
> found (restore_processor_state()). Now there's no reason to disable
> function tracing while we are going into suspend or hibernate, which means
> that being able to trace this will help tremendously in debugging any
> issues with suspend or hibernate.
> 
> This also means that the ftrace_stop/start() functions can be removed
> and simplify the function tracing code a bit.
> 
> Signed-off-by: Steven Rostedt 

ACK

> ---
>  kernel/power/hibernate.c | 6 --
>  kernel/power/suspend.c   | 2 --
>  2 files changed, 8 deletions(-)
> 
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 49e0a20fd010..ca7b1906c6c8 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -365,7 +365,6 @@ int hibernation_snapshot(int platform_mode)
>   }
>  
>   suspend_console();
> - ftrace_stop();
>   pm_restrict_gfp_mask();
>  
>   error = dpm_suspend(PMSG_FREEZE);
> @@ -391,7 +390,6 @@ int hibernation_snapshot(int platform_mode)
>   if (error || !in_suspend)
>   pm_restore_gfp_mask();
>  
> - ftrace_start();
>   resume_console();
>   dpm_complete(msg);
>  
> @@ -494,7 +492,6 @@ int hibernation_restore(int platform_mode)
>  
>   pm_prepare_console();
>   suspend_console();
> - ftrace_stop();
>   pm_restrict_gfp_mask();
>   error = dpm_suspend_start(PMSG_QUIESCE);
>   if (!error) {
> @@ -502,7 +499,6 @@ int hibernation_restore(int platform_mode)
>   dpm_resume_end(PMSG_RECOVER);
>   }
>   pm_restore_gfp_mask();
> - ftrace_start();
>   resume_console();
>   pm_restore_console();
>   return error;
> @@ -529,7 +525,6 @@ int hibernation_platform_enter(void)
>  
>   entering_platform_hibernation = true;
>   suspend_console();
> - ftrace_stop();
>   error = dpm_suspend_start(PMSG_HIBERNATE);
>   if (error) {
>   if (hibernation_ops->recover)
> @@ -573,7 +568,6 @@ int hibernation_platform_enter(void)
>   Resume_devices:
>   entering_platform_hibernation = false;
>   dpm_resume_end(PMSG_RESTORE);
> - ftrace_start();
>   resume_console();
>  
>   Close:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 4dd8822f732a..f6623da034d8 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -248,7 +248,6 @@ static int suspend_enter(suspend_state_t state, bool 
> *wakeup)
>   goto Platform_wake;
>   }
>  
> - ftrace_stop();
>   error = disable_nonboot_cpus();
>   if (error || suspend_test(TEST_CPUS))
>   goto Enable_cpus;
> @@ -275,7 +274,6 @@ static int suspend_enter(suspend_state_t state, bool 
> *wakeup)
>  
>   Enable_cpus:
>   enable_nonboot_cpus();
> - ftrace_start();
>  
>   Platform_wake:
>   if (need_suspend_ops(state) && suspend_ops->wake)
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFA][PATCH 01/27] x86, power, suspend: Annotate restore_processor_state() with notrace

2014-06-27 Thread Rafael J. Wysocki

On Thursday, June 26, 2014 12:52:22 PM Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" 
> 
> ftrace_stop() is used to stop function tracing during suspend and resume
> which removes a lot of possible debugging opportunities with tracing.
> The reason was that some function in the resume path was causing a triple
> fault if it were to be traced. The issue I found was that doing something
> as simple as calling smp_processor_id() would reboot the box!
> 
> When function tracing was first created I didn't have a good way to figure
> out what function was having issues, or it looked to be multiple ones. To
> fix it, we just created a big hammer approach to the problem which was to
> add a flag in the mcount trampoline that could be checked and not call
> the traced functions.
> 
> Lately I developed better ways to find problem functions and I can bisect
> down to see what function is causing the issue. I removed the flag that
> stopped tracing and proceeded to find the problem function and it ended
> up being restore_processor_state(). This function makes sense as when the
> CPU comes back online from a suspend it calls this function to set up
> registers, amongst them the GS register, which stores things such as
> what CPU the processor is (if you call smp_processor_id() without this
> set up properly, it would fault).
> 
> By making restore_processor_state() notrace, the system can suspend and
> resume without the need of the big hammer tracing to stop.
> 
> Signed-off-by: Steven Rostedt 

ACK

> ---
>  arch/x86/power/cpu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index 424f4c97a44d..6ec7910f59bf 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -165,7 +165,7 @@ static void fix_processor_context(void)
>   *   by __save_processor_state()
>   *   @ctxt - structure to load the registers contents from
>   */
> -static void __restore_processor_state(struct saved_context *ctxt)
> +static void notrace __restore_processor_state(struct saved_context *ctxt)
>  {
>   if (ctxt->misc_enable_saved)
>   wrmsrl(MSR_IA32_MISC_ENABLE, ctxt->misc_enable);
> @@ -239,7 +239,7 @@ static void __restore_processor_state(struct 
> saved_context *ctxt)
>  }
>  
>  /* Needed by apm.c */
> -void restore_processor_state(void)
> +void notrace restore_processor_state(void)
>  {
>   __restore_processor_state(&saved_context);
>  }
> 

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Tools: hv: fix file overwriting of hv_fcopy_daemon

2014-06-27 Thread Greg KH

On Fri, Jun 27, 2014 at 05:16:48PM -0700, Yue Zhang wrote:
> From: Yue Zhang 
> 
> hv_fcopy_daemon fails to overwrite a file if the target file already
> exits.
> 
> Add O_TRUNC flag on opening.
> 
> MS-TFS: 341345

It's as if the people on your team don't talk to each other about what
they should, or should not, include in their patch descriptions...

Please remove.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] perf tool: Carve out ctype.h et al

2014-06-27 Thread Borislav Petkov

On Thu, Jun 26, 2014 at 02:14:33PM +0200, Jiri Olsa wrote:
> this one compiles ok for me

Ok, cool. So guys, can we apply this one so that I can continue with the
next round?

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 01/11] seccomp: create internal mode-setting function

2014-06-27 Thread Kees Cook

In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook 
---
 kernel/seccomp.c |   16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 301bbc24739c..afb916c7e890 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -473,7 +473,7 @@ long prctl_get_seccomp(void)
 }
 
 /**
- * prctl_set_seccomp: configures current->seccomp.mode
+ * seccomp_set_mode: internal function for setting seccomp mode
  * @seccomp_mode: requested mode to use
  * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
  *
@@ -486,7 +486,7 @@ long prctl_get_seccomp(void)
  *
  * Returns 0 on success or -EINVAL on failure.
  */
-long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
 {
long ret = -EINVAL;
 
@@ -517,3 +517,15 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char 
__user *filter)
 out:
return ret;
 }
+
+/**
+ * prctl_set_seccomp: configures current->seccomp.mode
+ * @seccomp_mode: requested mode to use
+ * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
+{
+   return seccomp_set_mode(seccomp_mode, filter);
+}
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 10/11] seccomp: allow mode setting across threads

2014-06-27 Thread Kees Cook

This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.

Signed-off-by: Kees Cook 
---
 kernel/seccomp.c |   27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e1ff2c193190..7bbcb9ed16df 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -207,12 +207,18 @@ static inline bool seccomp_check_mode(unsigned long 
seccomp_mode)
return true;
 }
 
-static inline void seccomp_assign_mode(unsigned long seccomp_mode)
+static inline void seccomp_assign_mode(struct task_struct *task,
+  unsigned long seccomp_mode)
 {
-   BUG_ON(!spin_is_locked(¤t->sighand->siglock));
+   BUG_ON(!spin_is_locked(&task->sighand->siglock));
 
-   current->seccomp.mode = seccomp_mode;
-   set_tsk_thread_flag(current, TIF_SECCOMP);
+   task->seccomp.mode = seccomp_mode;
+   /*
+* Make sure TIF_SECCOMP cannot be set before the mode (and
+* filter) is set.
+*/
+   smp_mb__before_atomic();
+   set_tsk_thread_flag(task, TIF_SECCOMP);
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
@@ -433,12 +439,17 @@ static int mode1_syscalls_32[] = {
 
 int __secure_computing(int this_syscall)
 {
-   int mode = current->seccomp.mode;
int exit_sig = 0;
int *syscall;
u32 ret;
 
-   switch (mode) {
+   /*
+* Make sure that any changes to mode from another thread have
+* been seen after TIF_SECCOMP was seen.
+*/
+   rmb();
+
+   switch (current->seccomp.mode) {
case SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
 #ifdef CONFIG_COMPAT
@@ -543,7 +554,7 @@ static long seccomp_set_mode_strict(void)
 #ifdef TIF_NOTSC
disable_TSC();
 #endif
-   seccomp_assign_mode(seccomp_mode);
+   seccomp_assign_mode(current, seccomp_mode);
ret = 0;
 
 out:
@@ -593,7 +604,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
/* Do not free the successfully attached filter. */
prepared = NULL;
 
-   seccomp_assign_mode(seccomp_mode);
+   seccomp_assign_mode(current, seccomp_mode);
 out:
spin_unlock_irq(¤t->sighand->siglock);
seccomp_filter_free(prepared);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 05/11] ARM: add seccomp syscall

2014-06-27 Thread Kees Cook

Wires up the new seccomp syscall.

Signed-off-by: Kees Cook 
---
 arch/arm/include/uapi/asm/unistd.h |1 +
 arch/arm/kernel/calls.S|1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/arm/include/uapi/asm/unistd.h 
b/arch/arm/include/uapi/asm/unistd.h
index ba94446c72d9..e21b4a069701 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -409,6 +409,7 @@
 #define __NR_sched_setattr (__NR_SYSCALL_BASE+380)
 #define __NR_sched_getattr (__NR_SYSCALL_BASE+381)
 #define __NR_renameat2 (__NR_SYSCALL_BASE+382)
+#define __NR_seccomp   (__NR_SYSCALL_BASE+383)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 8f51bdcdacbb..bea85f97f363 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -392,6 +392,7 @@
 /* 380 */  CALL(sys_sched_setattr)
CALL(sys_sched_getattr)
CALL(sys_renameat2)
+   CALL(sys_seccomp)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 08/11] seccomp: split filter prep from check and apply

2014-06-27 Thread Kees Cook

In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

Signed-off-by: Kees Cook 
---
 kernel/seccomp.c |   93 +-
 1 file changed, 64 insertions(+), 29 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 137e40c7ae3b..502e54d7f86d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* #define SECCOMP_DEBUG 1 */
@@ -27,7 +28,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -213,27 +213,21 @@ static inline void seccomp_assign_mode(unsigned long 
seccomp_mode)
 
 #ifdef CONFIG_SECCOMP_FILTER
 /**
- * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
  *
- * Returns 0 on success or an errno on failure.
+ * Returns filter on success or an ERR_PTR on failure.
  */
-static long seccomp_attach_filter(struct sock_fprog *fprog)
+static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 {
struct seccomp_filter *filter;
unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
-   unsigned long total_insns = fprog->len;
struct sock_filter *fp;
int new_len;
long ret;
 
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
-   return -EINVAL;
-
-   for (filter = current->seccomp.filter; filter; filter = filter->prev)
-   total_insns += filter->prog->len + 4;  /* include a 4 instr 
penalty */
-   if (total_insns > MAX_INSNS_PER_PATH)
-   return -ENOMEM;
+   return ERR_PTR(-EINVAL);
 
/*
 * Installing a seccomp filter requires that the task has
@@ -244,11 +238,11 @@ static long seccomp_attach_filter(struct sock_fprog 
*fprog)
if (!task_no_new_privs(current) &&
security_capable_noaudit(current_cred(), current_user_ns(),
 CAP_SYS_ADMIN) != 0)
-   return -EACCES;
+   return ERR_PTR(-EACCES);
 
fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
if (!fp)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
 
/* Copy the instructions from fprog. */
ret = -EFAULT;
@@ -292,13 +286,7 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 
sk_filter_select_runtime(filter->prog);
 
-   /*
-* If there is an existing filter, make it the prev and don't drop its
-* task reference.
-*/
-   filter->prev = current->seccomp.filter;
-   current->seccomp.filter = filter;
-   return 0;
+   return filter;
 
 free_filter_prog:
kfree(filter->prog);
@@ -306,19 +294,20 @@ free_filter:
kfree(filter);
 free_prog:
kfree(fp);
-   return ret;
+   return ERR_PTR(ret);
 }
 
 /**
- * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * seccomp_prepare_user_filter - prepares a user-supplied sock_fprog
  * @user_filter: pointer to the user data containing a sock_fprog.
  *
  * Returns 0 on success and non-zero otherwise.
  */
-static long seccomp_attach_user_filter(const char __user *user_filter)
+static struct seccomp_filter *
+seccomp_prepare_user_filter(const char __user *user_filter)
 {
struct sock_fprog fprog;
-   long ret = -EFAULT;
+   struct seccomp_filter *filter = ERR_PTR(-EFAULT);
 
 #ifdef CONFIG_COMPAT
if (is_compat_task()) {
@@ -331,9 +320,39 @@ static long seccomp_attach_user_filter(const char __user 
*user_filter)
 #endif
if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
goto out;
-   ret = seccomp_attach_filter(&fprog);
+   filter = seccomp_prepare_filter(&fprog);
 out:
-   return ret;
+   return filter;
+}
+
+/**
+ * seccomp_attach_filter: validate and attach filter
+ * @flags:  flags to change filter behavior
+ * @filter: seccomp filter to add to the current process
+ *
+ * Returns 0 on success, -ve on error.
+ */
+static long seccomp_attach_filter(unsigned int flags,
+ struct seccomp_filter *filter)
+{
+   unsigned long total_insns;
+   struct seccomp_filter *walker;
+
+   /* Validate resulting filter length. */
+   total_insns = filter->prog->len;
+   for (walker = current->seccomp.filter; walker; walker = walker->prev)
+   total_insns += walker->prog->len + 4;  /* 4 instr penalty */
+   if (total_insns > MAX_INSNS_PER_PATH)
+   return -ENOMEM;
+
+   /*
+* If there is an existing filter, make it the prev and don't drop its
+* task reference.
+*/
+   filter->prev = current->seccomp.filter

[PATCH v9 03/11] seccomp: split mode setting routines

2014-06-27 Thread Kees Cook

Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.

Signed-off-by: Kees Cook 
---
 kernel/seccomp.c |   71 --
 1 file changed, 48 insertions(+), 23 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 03a5959b7930..812cea2e7ffb 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -489,48 +489,66 @@ long prctl_get_seccomp(void)
 }
 
 /**
- * seccomp_set_mode: internal function for setting seccomp mode
- * @seccomp_mode: requested mode to use
- * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
- *
- * This function may be called repeatedly with a @seccomp_mode of
- * SECCOMP_MODE_FILTER to install additional filters.  Every filter
- * successfully installed will be evaluated (in reverse order) for each system
- * call the task makes.
+ * seccomp_set_mode_strict: internal function for setting strict seccomp
  *
  * Once current->seccomp.mode is non-zero, it may not be changed.
  *
  * Returns 0 on success or -EINVAL on failure.
  */
-static long seccomp_set_mode(unsigned long seccomp_mode, char __user *filter)
+static long seccomp_set_mode_strict(void)
 {
+   const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
long ret = -EINVAL;
 
if (!seccomp_check_mode(seccomp_mode))
goto out;
 
-   switch (seccomp_mode) {
-   case SECCOMP_MODE_STRICT:
-   ret = 0;
 #ifdef TIF_NOTSC
-   disable_TSC();
+   disable_TSC();
 #endif
-   break;
+   seccomp_assign_mode(seccomp_mode);
+   ret = 0;
+
+out:
+
+   return ret;
+}
+
 #ifdef CONFIG_SECCOMP_FILTER
-   case SECCOMP_MODE_FILTER:
-   ret = seccomp_attach_user_filter(filter);
-   if (ret)
-   goto out;
-   break;
-#endif
-   default:
+/**
+ * seccomp_set_mode_filter: internal function for setting seccomp filter
+ * @filter: struct sock_fprog containing filter
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the task makes.
+ *
+ * Once current->seccomp.mode is non-zero, it may not be changed.
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+static long seccomp_set_mode_filter(char __user *filter)
+{
+   const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
+   long ret = -EINVAL;
+
+   if (!seccomp_check_mode(seccomp_mode))
+   goto out;
+
+   ret = seccomp_attach_user_filter(filter);
+   if (ret)
goto out;
-   }
 
seccomp_assign_mode(seccomp_mode);
 out:
return ret;
 }
+#else
+static inline long seccomp_set_mode_filter(char __user *filter)
+{
+   return -EINVAL;
+}
+#endif
 
 /**
  * prctl_set_seccomp: configures current->seccomp.mode
@@ -541,5 +559,12 @@ out:
  */
 long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 {
-   return seccomp_set_mode(seccomp_mode, filter);
+   switch (seccomp_mode) {
+   case SECCOMP_MODE_STRICT:
+   return seccomp_set_mode_strict();
+   case SECCOMP_MODE_FILTER:
+   return seccomp_set_mode_filter(filter);
+   default:
+   return -EINVAL;
+   }
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 11/11] seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

2014-06-27 Thread Kees Cook

Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). To make sure that de_thread() is actually able
to kill other threads during an exec, any sighand holders need to check
if they've been scheduled to be killed, and to give up on their work.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes 
Signed-off-by: Kees Cook 
---
 fs/exec.c|2 +-
 include/linux/seccomp.h  |2 +
 include/uapi/linux/seccomp.h |3 +
 kernel/seccomp.c |  139 +-
 4 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0f5c272410f6..ab1f1200ce5d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1216,7 +1216,7 @@ EXPORT_SYMBOL(install_exec_creds);
 /*
  * determine how safe it is to execute the proposed program
  * - the caller must hold ->cred_guard_mutex to protect against
- *   PTRACE_ATTACH
+ *   PTRACE_ATTACH or seccomp thread-sync
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
 {
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9ff98b4bfe2e..15de2a711518 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -3,6 +3,8 @@
 
 #include 
 
+#define SECCOMP_FILTER_FLAG_MASK   ~(SECCOMP_FILTER_FLAG_TSYNC)
+
 #ifdef CONFIG_SECCOMP
 
 #include 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index b258878ba754..0f238a43ff1e 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -14,6 +14,9 @@
 #define SECCOMP_SET_MODE_STRICT0
 #define SECCOMP_SET_MODE_FILTER1
 
+/* Valid flags for SECCOMP_SET_MODE_FILTER */
+#define SECCOMP_FILTER_FLAG_TSYNC  1
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 7bbcb9ed16df..0a82e16da7ef 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -26,6 +26,7 @@
 #ifdef CONFIG_SECCOMP_FILTER
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -222,6 +223,108 @@ static inline void seccomp_assign_mode(struct task_struct 
*task,
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+/* Returns 1 if the candidate is an ancestor. */
+static int is_ancestor(struct seccomp_filter *candidate,
+  struct seccomp_filter *child)
+{
+   /* NULL is the root ancestor. */
+   if (candidate == NULL)
+   return 1;
+   for (; child; child = child->prev)
+   if (child == candidate)
+   return 1;
+   return 0;
+}
+
+/**
+ * seccomp_can_sync_threads: checks if all threads can be synchronized
+ *
+ * Expects sighand and cred_guard_mutex locks to be held.
+ *
+ * Returns 0 on success, -ve on error, or the pid of a thread which was
+ * either not in the correct seccomp mode or it did not have an ancestral
+ * seccomp filter.
+ */
+static inline pid_t seccomp_can_sync_threads(void)
+{
+   struct task_struct *thread, *caller;
+
+   BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex));
+   BUG_ON(!spin_is_locked(¤t->sighand->siglock));
+
+   if (current->seccomp.mode != SECCOMP_MODE_FILTER)
+   return -EACCES;
+
+   /* Validate all threads bein

[PATCH v9 09/11] seccomp: introduce writer locking

2014-06-27 Thread Kees Cook

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to
the seccomp fields are now protected by the sighand spinlock (which
is unique to the thread group). Read access remains lockless because
pointer updates themselves are atomic.  However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain have
the same seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook 
---
 include/linux/seccomp.h |6 +++---
 kernel/fork.c   |   45 -
 kernel/seccomp.c|   26 --
 3 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 4054b0994071..9ff98b4bfe2e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -14,11 +14,11 @@ struct seccomp_filter;
  *
  * @mode:  indicates one of the valid values above for controlled
  * system calls available to a process.
- * @filter: The metadata and ruleset for determining what system calls
- *  are allowed for a task.
+ * @filter: must always point to a valid seccomp-filter or NULL as it is
+ *  accessed without locking during system call entry.
  *
  *  @filter must only be accessed from the context of current as there
- *  is no locking.
+ *  is no read locking.
  */
 struct seccomp {
int mode;
diff --git a/kernel/fork.c b/kernel/fork.c
index 6a13c46cd87d..ffc1b43e351f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -315,6 +315,15 @@ static struct task_struct *dup_task_struct(struct 
task_struct *orig)
goto free_ti;
 
tsk->stack = ti;
+#ifdef CONFIG_SECCOMP
+   /*
+* We must handle setting up seccomp filters once we're under
+* the sighand lock in case orig has changed between now and
+* then. Until then, filter must be NULL to avoid messing up
+* the usage counts on the error path calling free_task.
+*/
+   tsk->seccomp.filter = NULL;
+#endif
 
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
@@ -1081,6 +1090,35 @@ static int copy_signal(unsigned long clone_flags, struct 
task_struct *tsk)
return 0;
 }
 
+static void copy_seccomp(struct task_struct *p)
+{
+#ifdef CONFIG_SECCOMP
+   /*
+* Must be called with sighand->lock held, which is common to
+* all threads in the group. Holding cred_guard_mutex is not
+* needed because this new task is not yet running and cannot
+* be racing exec.
+*/
+   BUG_ON(!spin_is_locked(¤t->sighand->siglock));
+
+   /* Ref-count the new filter user, and assign it. */
+   get_seccomp_filter(current);
+   p->seccomp = current->seccomp;
+
+   /*
+* Explicitly enable no_new_privs here in case it got set
+* between the task_struct being duplicated and holding the
+* sighand lock. The seccomp state and nnp must be in sync.
+*/
+   if (task_no_new_privs(current))
+   task_set_no_new_privs(p);
+
+   /* If we have a seccomp mode, enable the thread flag. */
+   if (p->seccomp.mode != SECCOMP_MODE_DISABLED)
+   set_tsk_thread_flag(p, TIF_SECCOMP);
+#endif
+}
+
 SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
 {
current->clear_child_tid = tidptr;
@@ -1196,7 +1234,6 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
goto fork_out;
 
ftrace_graph_init_task(p);
-   get_seccomp_filter(p);
 
rt_mutex_init_task(p);
 
@@ -1437,6 +1474,12 @@ static struct task_struct *copy_process(unsigned long 
clone_flags,
spin_lock(¤t->sighand->siglock);
 
/*
+* Copy seccomp details explicitly here, in case they were changed
+* before holding sighand lock.
+*/
+   copy_seccomp(p);
+
+   /*
 * Process group and session signals need to be delivered to just the
 * parent before the fork or both the parent and the child after the
 * fork. Restart if a signal comes in before we add the new process to
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 502e54d7f86d..e1ff2c193190 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -173,12 +173,12 @@ static int seccomp_check_filter(struct sock_filter 
*filter, unsign

[PATCH v9 02/11] seccomp: extract check/assign mode helpers

2014-06-27 Thread Kees Cook

To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.

Signed-off-by: Kees Cook 
---
 kernel/seccomp.c |   22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index afb916c7e890..03a5959b7930 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -194,7 +194,23 @@ static u32 seccomp_run_filters(int syscall)
}
return ret;
 }
+#endif /* CONFIG_SECCOMP_FILTER */
 
+static inline bool seccomp_check_mode(unsigned long seccomp_mode)
+{
+   if (current->seccomp.mode && current->seccomp.mode != seccomp_mode)
+   return false;
+
+   return true;
+}
+
+static inline void seccomp_assign_mode(unsigned long seccomp_mode)
+{
+   current->seccomp.mode = seccomp_mode;
+   set_tsk_thread_flag(current, TIF_SECCOMP);
+}
+
+#ifdef CONFIG_SECCOMP_FILTER
 /**
  * seccomp_attach_filter: Attaches a seccomp filter to current.
  * @fprog: BPF program to install
@@ -490,8 +506,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, 
char __user *filter)
 {
long ret = -EINVAL;
 
-   if (current->seccomp.mode &&
-   current->seccomp.mode != seccomp_mode)
+   if (!seccomp_check_mode(seccomp_mode))
goto out;
 
switch (seccomp_mode) {
@@ -512,8 +527,7 @@ static long seccomp_set_mode(unsigned long seccomp_mode, 
char __user *filter)
goto out;
}
 
-   current->seccomp.mode = seccomp_mode;
-   set_thread_flag(TIF_SECCOMP);
+   seccomp_assign_mode(seccomp_mode);
 out:
return ret;
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 07/11] sched: move no_new_privs into new atomic flags

2014-06-27 Thread Kees Cook

Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.

Signed-off-by: Kees Cook 
---
 fs/exec.c  |4 ++--
 include/linux/sched.h  |   18 +++---
 kernel/seccomp.c   |2 +-
 kernel/sys.c   |4 ++--
 security/apparmor/domain.c |4 ++--
 5 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a3d33fe592d6..0f5c272410f6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1234,7 +1234,7 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 * This isn't strictly necessary, but it makes it harder for LSMs to
 * mess up.
 */
-   if (current->no_new_privs)
+   if (task_no_new_privs(current))
bprm->unsafe |= LSM_UNSAFE_NO_NEW_PRIVS;
 
t = p;
@@ -1272,7 +1272,7 @@ int prepare_binprm(struct linux_binprm *bprm)
bprm->cred->egid = current_egid();
 
if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
-   !current->no_new_privs &&
+   !task_no_new_privs(current) &&
kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
/* Set-uid? */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..0fd19055bb64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1307,13 +1307,12 @@ struct task_struct {
 * execve */
unsigned in_iowait:1;
 
-   /* task may not gain privileges */
-   unsigned no_new_privs:1;
-
/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
 
+   unsigned long atomic_flags; /* Flags needing atomic access. */
+
pid_t pid;
pid_t tgid;
 
@@ -1967,6 +1966,19 @@ static inline void memalloc_noio_restore(unsigned int 
flags)
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+/* Per-process atomic flags. */
+#define PFA_NO_NEW_PRIVS 0x0001/* May not gain new privileges. */
+
+static inline bool task_no_new_privs(struct task_struct *p)
+{
+   return test_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
+}
+
+static inline void task_set_no_new_privs(struct task_struct *p)
+{
+   set_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
+}
+
 /*
  * task->jobctl flags
  */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 2f83496d6016..137e40c7ae3b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -241,7 +241,7 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 * This avoids scenarios where unprivileged tasks can affect the
 * behavior of privileged children.
 */
-   if (!current->no_new_privs &&
+   if (!task_no_new_privs(current) &&
security_capable_noaudit(current_cred(), current_user_ns(),
 CAP_SYS_ADMIN) != 0)
return -EACCES;
diff --git a/kernel/sys.c b/kernel/sys.c
index 66a751ebf9d9..ce8129192a26 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1990,12 +1990,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
arg2, unsigned long, arg3,
if (arg2 != 1 || arg3 || arg4 || arg5)
return -EINVAL;
 
-   current->no_new_privs = 1;
+   task_set_no_new_privs(current);
break;
case PR_GET_NO_NEW_PRIVS:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
-   return current->no_new_privs ? 1 : 0;
+   return task_no_new_privs(current) ? 1 : 0;
case PR_GET_THP_DISABLE:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 452567d3a08e..d97cba3e3849 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -621,7 +621,7 @@ int aa_change_hat(const char *hats[], int count, u64 token, 
bool permtest)
 * There is no exception for unconfined as change_hat is not
 * available.
 */
-   if (current->no_new_privs)
+   if (task_no_new_privs(current))
return -EPERM;
 
/* released below */
@@ -776,7 +776,7 @@ int aa_change_profile(const char *ns_name, const char 
*hname, bool onexec,
 * no_new_privs is set because this aways results in a reduction
 * of permissions.
 */
-   if (current->no_new_privs && !unconfined(profile)) {
+   if (task_no_new_privs(current) && !unconfined(profile)) {
put_cred(cred);
return -EPERM;
}
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.

[PATCH v9 0/11] seccomp: add thread sync ability

2014-06-27 Thread Kees Cook

This adds the ability for threads to request seccomp filter
synchronization across their thread group (at filter attach time).
For example, for Chrome to make sure graphic driver threads are fully
confined after seccomp filters have been attached.

To support this, locking on seccomp changes via thread-group-shared
sighand lock is introduced, along with refactoring of no_new_privs. Races
with thread creation are handled via delayed duplication of the seccomp
task struct field.

This includes a new syscall (instead of adding a new prctl option),
as suggested by Andy Lutomirski and Michael Kerrisk.

Thanks!

-Kees

v9:
 - rearranged/split patches to make things more reviewable
 - added use of cred_guard_mutex to solve exec race (oleg, luto)
 - added barriers for TIF_SECCOMP vs seccomp.mode race (oleg, luto)
 - fixed missed copying of nnp state after v8 refactor (oleg)
v8:
 - drop use of tasklist_lock, appears redundant against sighand (oleg)
 - reduced use of smp_load_acquire to logical minimum (oleg)
 - change nnp to a task struct held atomic flags field (oleg, luto)
 - drop needless irqflags changes in fork.c for holding sighand lock (oleg)
 - cleaned up use of thread for-each loop (oleg)
 - rearranged patch order to keep syscall changes adjacent
 - added example code to manpage (mtk)
v7:
 - rebase on Linus's tree (merged with network bpf changes)
 - wrote manpage text documenting API (follows this series)
v6:
 - switch from seccomp-specific lock to thread-group lock to gain atomicity
 - implement seccomp syscall across all architectures with seccomp filter
 - clean up sparse warnings around locking
v5:
 - move includes around (drysdale)
 - drop set_nnp return value (luto)
 - use smp_load_acquire/store_release (luto)
 - merge nnp changes to seccomp always, fewer ifdef (luto)
v4:
 - cleaned up locking further, as noticed by David Drysdale
v3:
 - added SECCOMP_EXT_ACT_FILTER for new filter install options
v2:
 - reworked to avoid clone races

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 04/11] seccomp: add "seccomp" syscall

2014-06-27 Thread Kees Cook

This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.

Signed-off-by: Kees Cook 
---
 arch/Kconfig  |1 +
 arch/x86/syscalls/syscall_32.tbl  |1 +
 arch/x86/syscalls/syscall_64.tbl  |1 +
 include/linux/syscalls.h  |2 ++
 include/uapi/asm-generic/unistd.h |4 ++-
 include/uapi/linux/seccomp.h  |4 +++
 kernel/seccomp.c  |   55 +
 kernel/sys_ni.c   |3 ++
 8 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 97ff872c7acc..0eae9df35b88 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -321,6 +321,7 @@ config HAVE_ARCH_SECCOMP_FILTER
  - secure_computing is called from a ptrace_event()-safe context
  - secure_computing return value is checked and a return value of -1
results in the system call being skipped immediately.
+ - seccomp syscall wired up
 
 config SECCOMP_FILTER
def_bool y
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b867921612..7527eac24122 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351i386sched_setattr   sys_sched_setattr
 352i386sched_getattr   sys_sched_getattr
 353i386renameat2   sys_renameat2
+354i386seccomp sys_seccomp
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..16272a6c12b7 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314common  sched_setattr   sys_sched_setattr
 315common  sched_getattr   sys_sched_getattr
 316common  renameat2   sys_renameat2
+317common  seccomp sys_seccomp
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0ed322..1713977ee26f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
+   const char __user *uargs);
 #endif
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index 333640608087..65acbf0e2867 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -699,9 +699,11 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
 __SYSCALL(__NR_sched_getattr, sys_sched_getattr)
 #define __NR_renameat2 276
 __SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_seccomp 277
+__SYSCALL(__NR_seccomp, sys_seccomp)
 
 #undef __NR_syscalls
-#define __NR_syscalls 277
+#define __NR_syscalls 278
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index ac2dc9f72973..b258878ba754 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -10,6 +10,10 @@
 #define SECCOMP_MODE_STRICT1 /* uses hard-coded filter. */
 #define SECCOMP_MODE_FILTER2 /* uses user-supplied filter. */
 
+/* Valid operations for seccomp syscall. */
+#define SECCOMP_SET_MODE_STRICT0
+#define SECCOMP_SET_MODE_FILTER1
+
 /*
  * All BPF programs must return a 32-bit value.
  * The bottom 16-bits are for optional return data.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 812cea2e7ffb..2f83496d6016 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* #define SECCOMP_DEBUG 1 */
 
@@ -314,7 +315,7 @@ free_prog:
  *
  * Returns 0 on success and non-zero otherwise.
  */
-static long seccomp_attach_user_filter(char __user *user_filter)
+static long seccomp_attach_user_filter(const char __user *user_filter)
 {
struct sock_fprog fprog;
long ret = -EFAULT;
@@ -517,6 +518,7

[PATCH v2] Tools: hv: fix file overwriting of hv_fcopy_daemon

2014-06-27 Thread Yue Zhang

From: Yue Zhang 

hv_fcopy_daemon fails to overwrite a file if the target file already
exits.

Add O_TRUNC flag on opening.

Signed-off-by: Yue Zhang 
---
 tools/hv/hv_fcopy_daemon.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index fba1c75..2a86297 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -88,7 +88,8 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
}
}
 
-   target_fd = open(target_fname, O_RDWR | O_CREAT | O_CLOEXEC, 0744);
+   target_fd = open(target_fname,
+   O_RDWR | O_CREAT | O_TRUNC | O_CLOEXEC, 0744);
if (target_fd == -1) {
syslog(LOG_INFO, "Open Failed: %s", strerror(errno));
goto done;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v9 06/11] MIPS: add seccomp syscall

2014-06-27 Thread Kees Cook

Wires up the new seccomp syscall.

Signed-off-by: Kees Cook 
---
 arch/mips/include/uapi/asm/unistd.h |   15 +--
 arch/mips/kernel/scall32-o32.S  |1 +
 arch/mips/kernel/scall64-64.S   |1 +
 arch/mips/kernel/scall64-n32.S  |1 +
 arch/mips/kernel/scall64-o32.S  |1 +
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/uapi/asm/unistd.h 
b/arch/mips/include/uapi/asm/unistd.h
index 5805414777e0..9bc13eaf9d67 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -372,16 +372,17 @@
 #define __NR_sched_setattr (__NR_Linux + 349)
 #define __NR_sched_getattr (__NR_Linux + 350)
 #define __NR_renameat2 (__NR_Linux + 351)
+#define __NR_seccomp   (__NR_Linux + 352)
 
 /*
  * Offset of the last Linux o32 flavoured syscall
  */
-#define __NR_Linux_syscalls351
+#define __NR_Linux_syscalls352
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
 
 #define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls351
+#define __NR_O32_Linux_syscalls352
 
 #if _MIPS_SIM == _MIPS_SIM_ABI64
 
@@ -701,16 +702,17 @@
 #define __NR_sched_setattr (__NR_Linux + 309)
 #define __NR_sched_getattr (__NR_Linux + 310)
 #define __NR_renameat2 (__NR_Linux + 311)
+#define __NR_seccomp   (__NR_Linux + 312)
 
 /*
  * Offset of the last Linux 64-bit flavoured syscall
  */
-#define __NR_Linux_syscalls311
+#define __NR_Linux_syscalls312
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */
 
 #define __NR_64_Linux  5000
-#define __NR_64_Linux_syscalls 311
+#define __NR_64_Linux_syscalls 312
 
 #if _MIPS_SIM == _MIPS_SIM_NABI32
 
@@ -1034,15 +1036,16 @@
 #define __NR_sched_setattr (__NR_Linux + 313)
 #define __NR_sched_getattr (__NR_Linux + 314)
 #define __NR_renameat2 (__NR_Linux + 315)
+#define __NR_seccomp   (__NR_Linux + 316)
 
 /*
  * Offset of the last N32 flavoured syscall
  */
-#define __NR_Linux_syscalls315
+#define __NR_Linux_syscalls316
 
 #endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */
 
 #define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls315
+#define __NR_N32_Linux_syscalls316
 
 #endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 3245474f19d5..ab02d14f1b5c 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -578,3 +578,4 @@ EXPORT(sys_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr   /* 4350 */
PTR sys_renameat2
+   PTR sys_seccomp
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index be2fedd4ae33..010dccf128ec 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -431,4 +431,5 @@ EXPORT(sys_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr   /* 5310 */
PTR sys_renameat2
+   PTR sys_seccomp
.size   sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index c1dbcda4b816..c3b3b6525df5 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -424,4 +424,5 @@ EXPORT(sysn32_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr
PTR sys_renameat2   /* 6315 */
+   PTR sys_seccomp
.size   sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index f1343ccd7ed7..bb1550b1f501 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -557,4 +557,5 @@ EXPORT(sys32_call_table)
PTR sys_sched_setattr
PTR sys_sched_getattr   /* 4350 */
PTR sys_renameat2
+   PTR sys_seccomp
.size   sys32_call_table,.-sys32_call_table
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH] Tools: hv: fix file overwriting of hv_fcopy_daemon

2014-06-27 Thread KY Srinivasan



> -Original Message-
> From: Yue Zhang [mailto:yue...@microsoft.com]
> Sent: Friday, June 27, 2014 5:17 PM
> To: KY Srinivasan; Haiyang Zhang; driverdev-de...@linuxdriverproject.org;
> linux-kernel@vger.kernel.org; o...@aepfle.de; jasow...@redhat.com;
> a...@canonical.com
> Cc: Dexuan Cui; Thomas Shao
> Subject: [PATCH] Tools: hv: fix file overwriting of hv_fcopy_daemon
> 
> From: Yue Zhang 
> 
> hv_fcopy_daemon fails to overwrite a file if the target file already exits.
> 
> Add O_TRUNC flag on opening.
> 
> MS-TFS: 341345
You need to include Greg in the "to list". Also get rid of the MS-TFS tag.

> 
> Signed-off-by: Yue Zhang 
> ---
>  tools/hv/hv_fcopy_daemon.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
> index fba1c75..2a86297 100644
> --- a/tools/hv/hv_fcopy_daemon.c
> +++ b/tools/hv/hv_fcopy_daemon.c
> @@ -88,7 +88,8 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
>   }
>   }
> 
> - target_fd = open(target_fname, O_RDWR | O_CREAT | O_CLOEXEC,
> 0744);
> + target_fd = open(target_fname,
> + O_RDWR | O_CREAT | O_TRUNC | O_CLOEXEC,
> 0744);
Please align properly and there is no need for three lines here.

K. Y
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

perf: Add support for full Intel event lists v7

2014-06-27 Thread Andi Kleen

Should be ready for merge now. Please consider.

[v2: Review feedback addressed and some minor improvements]
[v3: More review feedback addressed and handle test failures better.
Ported to latest tip/core.]
[v4: Addressed Namhyung's feedback]
[v5: Rebase to latest tree. Minor description update.]
[v6: Rebase. Add acked by from Namhyung and address feedback. Some minor
fixes. Should be good to go now I hope. The period patch was dropped,
as that is already handled. I added an extra patch for a --quiet argument
for perf list]
[v7: Just rebase to latest tip/core. Should be ready to merge.]

perf has high level events which are useful in many cases. However
there are some tuning situations where low level events in the CPU
are needed. Traditionally this required specifying the event in 
raw form (very awkward) or using non standard frontends
like ocperf or patching in libpfm.

Intel CPUs can have very large event files (Haswell has ~336 core events,
much more if you add uncore or all the offcore combinations), which is too
large to describe through the kernel interface. It would require tying up
significant amounts of unswappable memory for this.

oprofile always had separate event list files that were maintained by 
the CPU vendors. The oprofile events were shipped with the tool.
The Intel events get updated regularly, for example to add references
to the specification updates or add new events.

Unfortunately oprofile usually did not keep up with these updates,
so the events in oprofile were often out of date. In addition
it ties up quite a bit of disk space, mostly for CPUs you don't have.

This patch kit implements another mechanism that avoids these problems.
Intel releases the event lists for CPUs in a standardized JSON format
on a download server.

I implemented an automatic downloader to get the event file for the
current CPU.  The events are stored in ~/.cache/pmu-events.
Then perf adds a parser that converts the JSON format into perf event
aliases, which then can be used directly as any other perf event.

The parsing is done using a simple existing JSON library.

The events are still abstracted for perf, but the abstraction mechanism is
through the downloaded file instead of through the kernel.

The JSON format and perf parser has some minor Intelisms, but they
are simple and small and optional. It's easy to extend, so it would be
possible to use it for other CPUs too, add different pmu attributes, and
add new download sites to the downloader tool.

Currently only core events are supported, uncore may come at a later
point. No kernel changes, all code in perf user tools only.

Some of the parser files are partially shared with separate event parser
library and are thus 2-clause BSD licensed.

Patches also available from
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/json

Example output:

% perf download 
Downloading models file
Downloading readme.txt
2014-03-05 10:39:33 URL:https://download.01.org/perfmon/readme.txt 
[10320/10320] -> "readme.txt" [1]
2014-03-05 10:39:34 URL:https://download.01.org/perfmon/mapfile.csv [1207/1207] 
-> "mapfile.csv" [1]
Downloading events file
% perf list
...
  br_inst_exec.all_branches  [Speculative and retired
  branches]
  br_inst_exec.all_conditional   [Speculative and retired
  macro-conditional
  branches]
  br_inst_exec.all_direct_jmp[Speculative and retired
  macro-unconditional
  branches excluding
  calls and indirects]
... 333 more new events ...

% perf stat -e br_inst_exec.all_direct_jmp true

 Performance counter stats for 'true':

 6,817  cpu/br_inst_exec.all_direct_jmp/
   

   0.003503212 seconds time elapsed

One nice feature is that a pointer to the specification update is now
included in the description, which will hopefully clear up many problems:

% perf list
...
  mem_load_uops_l3_hit_retired.xsnp_hit  [Retired load uops which
  data sources were L3
  and cross-core snoop
  hits in on-pkg core
  cache. Supports address
  when precise. Spec
  update: HSM26, HSM30
  (Precise event)]
...


-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo inf

[PATCH 2/9] perf, tools: Add support for text descriptions of events and alias add

2014-06-27 Thread Andi Kleen

From: Andi Kleen 

Change pmu.c to allow descriptions of events and add interfaces
to add aliases at runtime from another file. To be used by jevents in the
next patch.

Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/util/pmu.c | 127 ++
 1 file changed, 98 insertions(+), 29 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 7a811eb..baec090 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -14,6 +14,7 @@
 
 struct perf_pmu_alias {
char *name;
+   char *desc;
struct list_head terms;
struct list_head list;
char unit[UNIT_MAX_LEN+1];
@@ -171,17 +172,12 @@ error:
return -1;
 }
 
-static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, 
FILE *file)
+static int __perf_pmu__new_alias(struct list_head *list, char *name,
+char *dir, char *desc, char *val)
 {
struct perf_pmu_alias *alias;
-   char buf[256];
int ret;
 
-   ret = fread(buf, 1, sizeof(buf), file);
-   if (ret == 0)
-   return -EINVAL;
-   buf[ret] = 0;
-
alias = malloc(sizeof(*alias));
if (!alias)
return -ENOMEM;
@@ -190,24 +186,45 @@ static int perf_pmu__new_alias(struct list_head *list, 
char *dir, char *name, FI
alias->scale = 1.0;
alias->unit[0] = '\0';
 
-   ret = parse_events_terms(&alias->terms, buf);
+   ret = parse_events_terms(&alias->terms, val);
if (ret) {
+   pr_err("Cannot parse alias %s: %d\n", val, ret);
free(alias);
return ret;
}
 
alias->name = strdup(name);
-   /*
-* load unit name and scale if available
-*/
-   perf_pmu__parse_unit(alias, dir, name);
-   perf_pmu__parse_scale(alias, dir, name);
 
+   if (dir) {
+   /*
+* load unit name and scale if available
+*/
+   perf_pmu__parse_unit(alias, dir, name);
+   perf_pmu__parse_scale(alias, dir, name);
+   }
+
+   alias->desc = desc ? strdup(desc) : NULL;
list_add_tail(&alias->list, list);
 
return 0;
 }
 
+static int perf_pmu__new_alias(struct list_head *list,
+  char *dir,
+  char *name,
+  FILE *file)
+{
+   char buf[256];
+   int ret;
+
+   ret = fread(buf, 1, sizeof(buf), file);
+   if (ret == 0)
+   return -EINVAL;
+   buf[ret] = 0;
+
+   return __perf_pmu__new_alias(list, name, dir, NULL, buf);
+}
+
 /*
  * Process all the sysfs attributes located under the directory
  * specified in 'dir' parameter.
@@ -720,11 +737,51 @@ static char *format_alias_or(char *buf, int len, struct 
perf_pmu *pmu,
return buf;
 }
 
-static int cmp_string(const void *a, const void *b)
+struct pair {
+   char *name;
+   char *desc;
+};
+
+static int cmp_pair(const void *a, const void *b)
+{
+   const struct pair *as = a;
+   const struct pair *bs = b;
+
+   /* Put downloaded event list last */
+   if (!!as->desc != !!bs->desc)
+   return !!as->desc - !!bs->desc;
+   return strcmp(as->name, bs->name);
+}
+
+static void wordwrap(char *s, int start, int max, int corr)
 {
-   const char * const *as = a;
-   const char * const *bs = b;
-   return strcmp(*as, *bs);
+   int column = start;
+   int n;
+
+   while (*s) {
+   int wlen = strcspn(s, " \t");
+
+   if (column + wlen >= max && column > start) {
+   printf("\n%*s", start, "");
+   column = start + corr;
+   }
+   n = printf("%s%.*s", column > start ? " " : "", wlen, s);
+   if (n <= 0)
+   break;
+   s += wlen;
+   column += n;
+   while (isspace(*s))
+   s++;
+   }
+}
+
+static int get_columns(void)
+{
+   /*
+* Should ask the terminal with TIOCGWINSZ here, but we
+* need the original fd before the pager.
+*/
+   return 79;
 }
 
 void print_pmu_events(const char *event_glob, bool name_only)
@@ -734,21 +791,24 @@ void print_pmu_events(const char *event_glob, bool 
name_only)
char buf[1024];
int printed = 0;
int len, j;
-   char **aliases;
+   struct pair *aliases;
+   int numdesc = 0;
+   int columns = get_columns();
 
pmu = NULL;
len = 0;
while ((pmu = perf_pmu__scan(pmu)) != NULL)
list_for_each_entry(alias, &pmu->aliases, list)
len++;
-   aliases = malloc(sizeof(char *) * len);
+   aliases = malloc(sizeof(struct pair) * len);
if (!aliases)
return;
pmu = NULL;
j = 0;
while ((pmu = perf_pmu__scan(pmu)) != N

[PATCH 5/9] perf, tools: Add perf download to download event files v4

2014-06-27 Thread Andi Kleen

From: Andi Kleen 

Add a downloader to automatically download the right
files from a download site.

This is implemented as a script calling wget, similar to
perf archive. The perf driver automatically calls the right
binary. The downloader is extensible, but currently only
implements an Intel event download.  It would be straightforward
to add other sites too for other vendors.

The downloaded event files are put into ~/.cache/pmu-events, where the
builtin event parser in util/* can find them automatically.

v2: Use ~/.cache
v3: Check for wget. Some cleanups.
v4: Improve manpage.
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-download.txt | 31 
 tools/perf/Documentation/perf-list.txt | 12 ++-
 tools/perf/Makefile.perf   |  5 ++-
 tools/perf/perf-download.sh| 57 ++
 4 files changed, 103 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-download.txt
 create mode 100755 tools/perf/perf-download.sh

diff --git a/tools/perf/Documentation/perf-download.txt 
b/tools/perf/Documentation/perf-download.txt
new file mode 100644
index 000..9e5b28e
--- /dev/null
+++ b/tools/perf/Documentation/perf-download.txt
@@ -0,0 +1,31 @@
+perf-download(1)
+===
+
+NAME
+
+perf-download - Download event files for current CPU.
+
+SYNOPSIS
+
+[verse]
+'perf download' [vendor-family-model]
+
+DESCRIPTION
+---
+This command automatically downloads the event list for the current CPU and
+stores them in $XDG_CACHE_HOME/pmu-events (or $HOME/.cache/pmu-events).
+The other tools automatically look for them there. The CPU can be also
+specified at the command line.
+
+The downloading is done using http through wget, which needs
+to be installed. When behind a firewall the proxies
+may also need to be set up using "export https_proxy="
+
+The user should regularly call this to download updated event lists
+for the current CPU.
+
+Note the downloaded files are stored per user, so if perf is
+used as both normal user and with sudo the event files may
+also need to be moved to root's home directory with
+sudo mkdir /root/.cache ; sud cp -r ~/.cache/pmu-events /root/.cache
+after downloading.
diff --git a/tools/perf/Documentation/perf-list.txt 
b/tools/perf/Documentation/perf-list.txt
index 9305a37..2b4eba0 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -61,6 +61,16 @@ Sampling). Examples to use IBS:
  perf record -a -e r076:p ...  # same as -e cpu-cycles:p
  perf record -a -e r0C1:p ...  # use ibs op counting micro-ops
 
+PER CPU EVENT LISTS
+---
+
+For some CPUs (particularly modern Intel CPUs) "perf download" can
+download additional CPU specific event definitions, which then
+become visible in perf list and available in the other perf tools.
+
+This obsoletes the raw event description method described below
+for most cases.
+
 RAW HARDWARE EVENT DESCRIPTOR
 -
 Even when an event is not available in a symbolic form within perf right now,
@@ -123,6 +133,6 @@ types specified.
 SEE ALSO
 
 linkperf:perf-stat[1], linkperf:perf-top[1],
-linkperf:perf-record[1],
+linkperf:perf-record[1], linkperf:perf-download[1],
 http://www.intel.com/Assets/PDF/manual/253669.pdf[Intel® 64 and IA-32 
Architectures Software Developer's Manual Volume 3B: System Programming Guide],
 http://support.amd.com/us/Processor_TechDocs/24593_APM_v2.pdf[AMD64 
Architecture Programmer’s Manual Volume 2: System Programming]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 0016d1a..0600425 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -126,6 +126,7 @@ PYRF_OBJS =
 SCRIPT_SH =
 
 SCRIPT_SH += perf-archive.sh
+SCRIPT_SH += perf-download.sh
 
 grep-libs = $(filter -l%,$(1))
 strip-libs = $(filter-out -l%,$(1))
@@ -877,6 +878,8 @@ install-bin: all install-gtk
$(INSTALL) -d -m 755 '$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
$(call QUIET_INSTALL, perf-archive) \
$(INSTALL) $(OUTPUT)perf-archive -t 
'$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
+   $(call QUIET_INSTALL, perf-download) \
+   $(INSTALL) $(OUTPUT)perf-download -t 
'$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
 ifndef NO_LIBPERL
$(call QUIET_INSTALL, perl-scripts) \
$(INSTALL) -d -m 755 
'$(DESTDIR_SQ)$(perfexec_instdir_SQ)/scripts/perl/Perf-Trace-Util/lib/Perf/Trace';
 \
@@ -922,7 +925,7 @@ config-clean:
@$(MAKE) -C config/feature-checks clean >/dev/null
 
 clean: $(LIBTRACEEVENT)-clean $(LIBAPIKFS)-clean config-clean
-   $(call QUIET_CLEAN, core-objs)  $(RM) $(LIB_OBJS) $(BUILTIN_OBJS) 
$(LIB_FILE) $(OUTPUT)perf-archive $(OUTPUT)perf.o $(LANG_BINDINGS) $(GTK_OBJS)
+   $(call QUIET_CLEAN, core-objs)  $(RM) $(LIB_OBJS) $(BUILTIN_OBJS) 
$(LIB_FILE) $(OUTPUT)perf-archive $(OUTPUT)/perf-download $(OU

1 2 3 4 5 6 >

1 - 100 of 568 matches

Mail list logo