Re: [PATCH v2 25/35] nds32: Build infrastructure

2017-11-29 Thread Geert Uytterhoeven
On Thu, Nov 30, 2017 at 6:48 AM, Greentime Hu  wrote:
> 2017-11-30 4:27 GMT+08:00 Arnd Bergmann :
>> On Wed, Nov 29, 2017 at 3:10 PM, Greentime Hu  wrote:
>>> 2017-11-29 19:57 GMT+08:00 Arnd Bergmann :
>>>> On Wed, Nov 29, 2017 at 12:39 PM, Greentime Hu wrote:
> I think I can use this name "CPU_V3" for all nds32 v3 compatible cpu.
> It will be implemented like this.
>
> config HWZOL
> 	bool "hardware zero overhead loop support"
> 	depends on CPU_D10 || CPU_D15
> 	default n
> 	help
> 	  A set of Zero-Overhead Loop mechanism is provided to reduce the
> 	  instruction fetch and execution overhead of loop-control instructions.
> 	  It will save 3 registers($LB, $LC, $LE) for context saving if say Y.
> 	  You don't need to save these registers if you can make sure your user
> 	  program doesn't use these registers.
>
> 	  If unsure, say N.
>
> config CPU_CACHE_NONALIASING
> 	bool "Non-aliasing cache"
> 	depends on !CPU_N10 && !CPU_D10
> 	default n
> 	help
> 	  If this CPU is using VIPT data cache and its cache way size is larger
> 	  than page size, say N. If it is using PIPT data cache, say Y.
>
> 	  If unsure, say N.

I still think it will be easier to revert the logic, and have
CPU_CACHE_ALIASING.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v4 7/8] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-29 Thread Souptick Joarder
Hi David, Dan,


On Thu, Nov 30, 2017 at 12:50 AM, David Daney  wrote:
> On 11/29/2017 08:07 AM, Souptick Joarder wrote:
>>
>> On Wed, Nov 29, 2017 at 4:00 PM, Souptick Joarder 
>> wrote:
>>>
>>> On Wed, Nov 29, 2017 at 6:25 AM, David Daney 
>>> wrote:

>>>> From: Carlos Munoz
>>>>
>>>> The Cavium OCTEON cn78xx and cn73xx SoCs have network packet I/O
>>>> hardware that is significantly different from previous generations of
>>>> the family.
>>
>>
>>>> diff --git a/drivers/net/ethernet/cavium/octeon/octeon3-bgx-port.c b/drivers/net/ethernet/cavium/octeon/octeon3-bgx-port.c
>>>> new file mode 100644
>>>> index 000000000000..4dad35fa4270
>>>> --- /dev/null
>>>> +++ b/drivers/net/ethernet/cavium/octeon/octeon3-bgx-port.c
>>>> @@ -0,0 +1,2033 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2017 Cavium, Inc.
>>>> + *
>>>> + * This file is subject to the terms and conditions of the GNU General Public
>>>> + * License.  See the file "COPYING" in the main directory of this archive
>>>> + * for more details.
>>>> + */
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +#include
>>>> +
>>
>>
>>>> +static void bgx_port_sgmii_set_link_down(struct bgx_port_priv *priv)
>>>> +{
>>>> +	u64 data;
>>
>>
>>>> +	data = oct_csr_read(BGX_GMP_PCS_MISC_CTL(priv->node, priv->bgx, priv->index));
>>>> +	data |= BIT(11);
>>>> +	oct_csr_write(data, BGX_GMP_PCS_MISC_CTL(priv->node, priv->bgx, priv->index));
>>>> +	data = oct_csr_read(BGX_GMP_PCS_MISC_CTL(priv->node, priv->bgx, priv->index));
>>>
>>>
>>> Any particular reason to read immediately after write ?
>>
>>
>
> Yes, to ensure the write is committed to hardware before the next step.
>
>>
>>
>>>> +static int bgx_port_sgmii_set_link_speed(struct bgx_port_priv *priv, struct port_status status)
>>>> +{
>>>> +	u64 data;
>>>> +	u64 prtx;
>>>> +	u64 miscx;
>>>> +	int timeout;
>>>> +
>>
>>
>>>> +
>>>> +	switch (status.speed) {
>>>> +	case 10:
>>>
>>>
>>> In my opinion, instead of hard coding the value, is it fine to use ENUM ?
>>
>> Similar comments applicable in other places where hard coded values
>> are used.
>>
>
> There is nothing to be gained by interposing an extra layer of abstraction
> in this case.  The code is more clear with the raw numbers in this
> particular case.

   As mentioned by Andrew,  macros defined in uapi/linux/ethtool.h may
be useful here.
   Otherwise it's fine to me :)
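
For illustration only, a minimal user-space sketch of what switching on the
ethtool speed macros could look like. The SPEED_* definitions below are local
stand-ins that mirror the values in uapi/linux/ethtool.h (the driver would
include the header rather than redefine them), and speed_name() is a
hypothetical helper, not code from the patch:

```c
#include <stdio.h>

/* Local stand-ins mirroring the SPEED_* values in uapi/linux/ethtool.h.
 * Assumption: real driver code includes the header instead. */
#define SPEED_10	10
#define SPEED_100	100
#define SPEED_1000	1000

/* Hypothetical helper (not in the patch): map a link speed to a label. */
static const char *speed_name(int speed)
{
	switch (speed) {
	case SPEED_10:
		return "10 Mb/s";
	case SPEED_100:
		return "100 Mb/s";
	case SPEED_1000:
		return "1 Gb/s";
	default:
		return "unsupported";
	}
}
```

Because the macros expand to the same numbers as the literals, a switch
written this way behaves the same as one using raw 10/100/1000; the only
difference is that the case labels can be found by name.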
>
>
>>
>>
>>>> +static int bgx_port_gser_27882(struct bgx_port_priv *priv)
>>>> +{
>>>> +	u64 data;
>>>> +	u64 addr;
>>>
>>>
>>>> +	int timeout = 200;
>>>> +
>>>> +	//timeout = 200;
>>
>> Better to initialize the timeout value

>
>
> What are you talking about?  It is properly initialized using valid C code.

  I mean, instead of writing

   int timeout;
   timeout = 200;

  write,

   int timeout = 200;

Anyway both are correct and there is nothing wrong in your code.
Please ignore my comment here.

>
>
>>
>>
>>>> +static int bgx_port_qlm_rx_equalization(struct bgx_port_priv *priv, int qlm, int lane)
>>>> +{
>>>> +	lmode = oct_csr_read(GSER_LANE_MODE(priv->node, qlm));
>>>> +	lmode &= 0xf;
>>>> +	addr = GSER_LANE_P_MODE_1(priv->node, qlm, lmode);
>>>> +	data = oct_csr_read(addr);
>>>> +	/* Don't complete rx equalization if in VMA manual mode */
>>>> +	if (data & BIT(14))
>>>> +		return 0;
>>>> +
>>>> +	/* Apply rx equalization for speed > 6250 */
>>>> +	if (bgx_port_get_qlm_speed(priv, qlm) < 6250)
>>>> +		return 0;
>>>> +
>>>> +	/* Wait until rx data is valid (CDRLOCK) */
>>>> +	timeout = 500;
>>>
>>>
>>> 500 us is the min required value or it can be further reduced ?
>>
>>
>
>
> 500 uS works well and is shorter than the 2000 uS from the hardware manual.
>
> If you would like to verify shorter timeout values, we could consider
> merging such a patch.  But really, this doesn't matter as it is a very short
> one-off action when the link is brought up.

   Ok.
>
>>
>>>> +static int bgx_port_init_xaui_link(struct bgx_port_priv *priv)
>>>> +{
>>
>>
>>>> +
>>>> +	if (use_ber) {
>>>> +		timeout = 1;
>>>> +		do {
>>>> +			data = oct_csr_read(BGX_SPU_BR_STATUS1(priv->node, priv->bgx, priv->index));
>>>> +			if (data & BIT(0))
>>>> +				break;
>>>> +			timeout--;
>>>> +			udelay(1);
>>>> +		} while (timeout);
>>>
>>>
>>> In my opinion, it's better to implement similar kind of loops 

Re: [PATCH v2 25/35] nds32: Build infrastructure

2017-11-29 Thread Greentime Hu
2017-11-30 4:27 GMT+08:00 Arnd Bergmann :
> On Wed, Nov 29, 2017 at 3:10 PM, Greentime Hu  wrote:
>> 2017-11-29 19:57 GMT+08:00 Arnd Bergmann :
>>> On Wed, Nov 29, 2017 at 12:39 PM, Greentime Hu  wrote:

>>>> How about this?
>>>>
>>>> choice
>>>> 	prompt "CPU type"
>>>> 	default CPU_N13
>>>> config CPU_N15
>>>> 	bool "AndesCore N15"
>>>> 	select CPU_CACHE_NONALIASING
>>>> config CPU_N13
>>>> 	bool "AndesCore N13"
>>>> 	select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
>>>> config CPU_N10
>>>> 	bool "AndesCore N10"
>>>> 	select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
>>>> config CPU_D15
>>>> 	bool "AndesCore D15"
>>>> 	select CPU_CACHE_NONALIASING
>>>> 	select HWZOL
>>>> config CPU_D10
>>>> 	bool "AndesCore D10"
>>>> 	select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
>>>> endchoice
>>>
>>> With a 'choice' statement this works, but I would consider that
>>> suboptimal for another reason: now you cannot build a kernel that
>>> works on e.g. both N13 and N15.
>>>
>>> This is what we had on ARM a long time ago and on MIPS not so long
>>> ago, but it's really a burden for testing and distribution once you get
>>> support for more than a handful of machines supported in the mainline
>>> kernel: If each CPU core is mutually incompatible with the other ones,
>>> this means you likely end up having one defconfig per CPU core,
>>> or even one defconfig per SoC or board.
>>>
>>> I would always try to get the largest amount of hardware to work
>>> in the same kernel concurrently.
>>>
>>> One way out of this would be to define the "CPU type" as the minimum
>>> supported CPU, e.g. selecting D15 would result in a kernel that
>>> only works on D15, while selecting N15 would work on both N15 and
>>> D15, and selecting D10 would work on both D10 and D15.
>>>
>>
>> Hi, Arnd:
>>
>> Maybe we should keep the original implementation for this reason.
>> The default value of CPU_CACHE_NONALIASING and ANDES_PAGE_SIZE_8KB is
>> available for all CPU types for now.
>> Users can boot a kernel built with these configs on almost all nds32 CPUs.
>>
>> It might be a little bit weird if we config CPU_N10 but run on an N13 CPU.
>> This might confuse our users.
>
> I think it really depends on how much flexibility you want to give to users.
> The way I suggested first was to allow selecting an arbitrary combination
> of CPUs, and then let Kconfig derive the correct set of optimization flags
> and other options that work with those. That is probably the easiest for
> the users, but can be tricky to get right for all combinations.
>
> When you put them in a sorted list like I mentioned for simplicity, you
> could reduce the confusion by naming them differently, e.g.
> CONFIG_CPU_N10_OR_NEWER.
>
> Having only the CPU_CACHE_NONALIASING option is fine if you
> never need to make any other decisions based on the CPU core
> type, but then the help text should describe specifically which cases
> are affected (N10/N13/D13 with 4K page size), and you can decide to
> hide the option and make it always-on when using 8K page size.
>
>Arnd


Hi, Arnd:

I think I can use this name "CPU_V3" for all nds32 v3 compatible cpu.
It will be implemented like this.

config HWZOL
	bool "hardware zero overhead loop support"
	depends on CPU_D10 || CPU_D15
	default n
	help
	  A set of Zero-Overhead Loop mechanism is provided to reduce the
	  instruction fetch and execution overhead of loop-control instructions.
	  It will save 3 registers($LB, $LC, $LE) for context saving if say Y.
	  You don't need to save these registers if you can make sure your user
	  program doesn't use these registers.

	  If unsure, say N.

config CPU_CACHE_NONALIASING
	bool "Non-aliasing cache"
	depends on !CPU_N10 && !CPU_D10
	default n
	help
	  If this CPU is using VIPT data cache and its cache way size is larger
	  than page size, say N. If it is using PIPT data cache, say Y.

	  If unsure, say N.

choice
	prompt "CPU type"
	default CPU_V3
config CPU_N15
	bool "AndesCore N15"
	select CPU_CACHE_NONALIASING
config CPU_N13
	bool "AndesCore N13"
	select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
config CPU_N10
	bool "AndesCore N10"
config CPU_D15
	bool "AndesCore D15"
	select CPU_CACHE_NONALIASING
config CPU_D10
	bool "AndesCore D10"
config CPU_V3
	bool "AndesCore v3 compatible"
	select ANDES_PAGE_SIZE_4KB
endchoice


Re: [PATCH net-next 0/2] bpf/tracing: allow user space to query prog array on the same tp

2017-11-29 Thread Alexei Starovoitov

On 11/28/17 11:20 PM, Yonghong Song wrote:

> Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
> for a single perf event") added support to attach multiple
> bpf programs to a single perf event. Given a perf event
> (kprobe, uprobe, or kernel tracepoint), the perf ioctl interface
> is used to query bpf programs attached to the same trace event.
> The same ioctl interface is also used to attach bpf program.
> Patch #1 had the core implementation and patch #2 added
> a test case in tools bpf selftests suite.


We actually had an implementation of the same tracepoint+bpf
introspection via BPF_PROG_QUERY command that we use for cgroup+bpf,
but it looks cleaner to use ioctl() style of api here,
since attach to tracepoint/kuprobe is also done via ioctl.

For the set:
Acked-by: Alexei Starovoitov 

The patch touches 3 lines in events/core.c but most likely
it won't conflict with anything in tip, so we plan to have
this set in bpf-next.git -> net-next.git only.

Thanks



Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Tobin C. Harding
On Wed, Nov 29, 2017 at 08:41:36PM -0800, Joe Perches wrote:
> On Thu, 2017-11-30 at 15:18 +1100, Tobin C. Harding wrote:
> > On Wed, Nov 29, 2017 at 07:58:26PM -0800, Joe Perches wrote:
> > > On Thu, 2017-11-30 at 10:26 +1100, Tobin C. Harding wrote:
> > > > On Wed, Nov 29, 2017 at 03:20:58PM -0800, Andrew Morton wrote:
> > > > > On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding" wrote:
> > > > >
> > > > > > printk specifier %p now hashes all addresses before printing. Sometimes
> > > > > > we need to see the actual unmodified address. This can be achieved using
> > > > > > %lx but then we face the risk that if in future we want to change the
> > > > > > way the Kernel handles printing of pointers we will have to grep through
> > > > > > the already existent 50 000 %lx call sites. Let's add specifier %px as a
> > > > > > clear, opt-in, way to print a pointer and maintain some level of
> > > > > > isolation from all the other hex integer output within the Kernel.
> > > > > >
> > > > > > Add printk specifier %px to print the actual unmodified address.
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > +Unmodified Addresses
> > > > > > +
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > +   %px 01234567 or 0123456789abcdef
> > > > > > +
> > > > > > +For printing pointers when you _really_ want to print the address. Please
> > > > > > +consider whether or not you are leaking sensitive information about the
> > > > > > +Kernel layout in memory before printing pointers with %px. %px is
> > > > > > +functionally equivalent to %lx. %px is preferred to %lx because it is more
> > > > > > +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> > > > > > +handles printing pointers it will be nice to be able to find the call
> > > > > > +sites.
> > > > > > +
> > > > > > +
> > > > > 
> > > > > You might want to add a checkpatch rule which emits a stern
> > > > > do-you-really-want-to-do-this warning when someone uses %px.
> > > > > 
> > > > 
> > > > Oh, nice idea. It has to be a CHECK but right?
> > > 
> > > No, it has to be something that's not --strict
> > > so a WARN would probably be best.
> > > 
> > > > By stern, you mean use stern language?
> > > 
> > > I hope he doesn't mean tweet.
> > 
> > /me says tweet tweet (like a bird)
> > 
> > > Something like:
> > > ---
> > >  scripts/checkpatch.pl | 31 +--
> > >  1 file changed, 25 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > > index 0ce249f157a1..9d789cbe7df5 100755
> > > --- a/scripts/checkpatch.pl
> > > +++ b/scripts/checkpatch.pl
> > > @@ -5758,21 +5758,40 @@ sub process {
> > >  		    defined $stat &&
> > >  		    $stat =~ /^\+(?![^\{]*\{\s*).*\b(\w+)\s*\(.*$String\s*,/s &&
> > >  		    $1 !~ /^_*volatile_*$/) {
> > > +			my $complete_extension = "";
> > > +			my $extension = "";
> > >  			my $bad_extension = "";
> > >  			my $lc = $stat =~ tr@\n@@;
> > >  			$lc = $lc + $linenr;
> > > +			my $stat_real;
> > >  			for (my $count = $linenr; $count <= $lc; $count++) {
> > >  				my $fmt = get_quoted_string($lines[$count - 1], raw_line($count, 0));
> > >  				$fmt =~ s/%%//g;
> > > -				if ($fmt =~ /(\%[\*\d\.]*p(?![\WFfSsBKRraEhMmIiUDdgVCbGNO]).)/) {
> > > -					$bad_extension = $1;
> > > -					last;
> > > +				while ($fmt =~ /(\%[\*\d\.]*p(\w))/g) {
> > > +					$complete_extension = $1;
> > > +					$extension = $2;
> > > +					if ($extension !~ /[FfSsBKRraEhMmIiUDdgVCbGNOx]/) {
> > > +						$bad_extension = $complete_extension;
> > > +						last;
> > > +					}
> > > +					if ($extension eq "x") {
> > > +						if (!defined($stat_real)) {
> > > +							$stat_real = raw_line($linenr, 0);
> > > +							for (my $count = $linenr + 1; $count <= $lc; $count++) {
> > > +								$stat_real = $stat_real . "\n" . raw_line($count, 0);
> > > +							}
> > > +						}
> > > +						WARN("VSPRINTF_POINTER_PX",
> > > +						     "Using vsprintf pointer extension

Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Joe Perches
On Thu, 2017-11-30 at 15:18 +1100, Tobin C. Harding wrote:
> On Wed, Nov 29, 2017 at 07:58:26PM -0800, Joe Perches wrote:
> > On Thu, 2017-11-30 at 10:26 +1100, Tobin C. Harding wrote:
> > > On Wed, Nov 29, 2017 at 03:20:58PM -0800, Andrew Morton wrote:
> > > > On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding" wrote:
> > > >
> > > > > printk specifier %p now hashes all addresses before printing. Sometimes
> > > > > we need to see the actual unmodified address. This can be achieved using
> > > > > %lx but then we face the risk that if in future we want to change the
> > > > > way the Kernel handles printing of pointers we will have to grep through
> > > > > the already existent 50 000 %lx call sites. Let's add specifier %px as a
> > > > > clear, opt-in, way to print a pointer and maintain some level of
> > > > > isolation from all the other hex integer output within the Kernel.
> > > > >
> > > > > Add printk specifier %px to print the actual unmodified address.
> > > > >
> > > > > ...
> > > > >
> > > > > +Unmodified Addresses
> > > > > +
> > > > > +
> > > > > +::
> > > > > +
> > > > > + %px 01234567 or 0123456789abcdef
> > > > > +
> > > > > +For printing pointers when you _really_ want to print the address. Please
> > > > > +consider whether or not you are leaking sensitive information about the
> > > > > +Kernel layout in memory before printing pointers with %px. %px is
> > > > > +functionally equivalent to %lx. %px is preferred to %lx because it is more
> > > > > +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> > > > > +handles printing pointers it will be nice to be able to find the call
> > > > > +sites.
> > > > > +
> > > > > +
> > > > 
> > > > You might want to add a checkpatch rule which emits a stern
> > > > do-you-really-want-to-do-this warning when someone uses %px.
> > > > 
> > > 
> > > Oh, nice idea. It has to be a CHECK but right?
> > 
> > No, it has to be something that's not --strict
> > so a WARN would probably be best.
> > 
> > > By stern, you mean use stern language?
> > 
> > I hope he doesn't mean tweet.
> 
> /me says tweet tweet (like a bird)
> 
> > Something like:
> > ---
> >  scripts/checkpatch.pl | 31 +--
> >  1 file changed, 25 insertions(+), 6 deletions(-)
> > 
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > index 0ce249f157a1..9d789cbe7df5 100755
> > --- a/scripts/checkpatch.pl
> > +++ b/scripts/checkpatch.pl
> > @@ -5758,21 +5758,40 @@ sub process {
> >  		    defined $stat &&
> >  		    $stat =~ /^\+(?![^\{]*\{\s*).*\b(\w+)\s*\(.*$String\s*,/s &&
> >  		    $1 !~ /^_*volatile_*$/) {
> > +			my $complete_extension = "";
> > +			my $extension = "";
> >  			my $bad_extension = "";
> >  			my $lc = $stat =~ tr@\n@@;
> >  			$lc = $lc + $linenr;
> > +			my $stat_real;
> >  			for (my $count = $linenr; $count <= $lc; $count++) {
> >  				my $fmt = get_quoted_string($lines[$count - 1], raw_line($count, 0));
> >  				$fmt =~ s/%%//g;
> > -				if ($fmt =~ /(\%[\*\d\.]*p(?![\WFfSsBKRraEhMmIiUDdgVCbGNO]).)/) {
> > -					$bad_extension = $1;
> > -					last;
> > +				while ($fmt =~ /(\%[\*\d\.]*p(\w))/g) {
> > +					$complete_extension = $1;
> > +					$extension = $2;
> > +					if ($extension !~ /[FfSsBKRraEhMmIiUDdgVCbGNOx]/) {
> > +						$bad_extension = $complete_extension;
> > +						last;
> > +					}
> > +					if ($extension eq "x") {
> > +						if (!defined($stat_real)) {
> > +							$stat_real = raw_line($linenr, 0);
> > +							for (my $count = $linenr + 1; $count <= $lc; $count++) {
> > +								$stat_real = $stat_real . "\n" . raw_line($count, 0);
> > +							}
> > +						}
> > +						WARN("VSPRINTF_POINTER_PX",
> > +						     "Using vsprintf pointer extension '$complete_extension' exposes kernel address for possible hacking\n" . "$here\n$stat_real\n");
> > +					}
> >  				}
> >  			}
> >  			if ($bad_extension ne

Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()

2017-11-29 Thread Wei Xu
On Wed, Nov 29, 2017 at 10:43:33PM +0800, Jason Wang wrote:
> 
> 
> On 2017年11月29日 22:23, w...@redhat.com wrote:
> > From: Wei Xu 
> > 
> > Matthew found a roughly 40% tcp throughput regression with commit
> > c67df11f(vhost_net: try batch dequing from skb array) as discussed
> > in the following thread:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
> > 
> > Eventually we figured out that it was a skb leak in handle_rx()
> > when sending packets to the VM. This usually happens when a guest
> > can not drain out vq as fast as vhost fills in, afterwards it sets
> > off the traffic jam and leaks skb(s) which occurs as no headcount
> > to send on the vq from vhost side.
> > 
> > This can be avoided by making sure we have got enough headcount
> > before actually consuming a skb from the batched rx array while
> > transmitting, which is simply done by moving checking the zero
> > headcount a bit ahead.
> > 
> > Also strengthen the small possibility of leak in case of recvmsg()
> > fails by freeing the skb.
> > 
> > Signed-off-by: Wei Xu 
> > Reported-by: Matthew Rosato 
> > ---
> >   drivers/vhost/net.c | 23 +--
> >   1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > v2:
> > - add Matthew as the reporter, thanks matthew.
> > - moving zero headcount check ahead instead of defer consuming skb
> >due to jason and mst's comment.
> > - add freeing skb in favor of recvmsg() fails.
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8d626d7..e302e08 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
> >  		/* On error, stop handling until the next kick. */
> >  		if (unlikely(headcount < 0))
> >  			goto out;
> > -		if (nvq->rx_array)
> > -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > -		/* On overrun, truncate and discard */
> > -		if (unlikely(headcount > UIO_MAXIOV)) {
> > -			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > -			err = sock->ops->recvmsg(sock, &msg,
> > -						 1, MSG_DONTWAIT | MSG_TRUNC);
> > -			pr_debug("Discarded rx packet: len %zd\n", sock_len);
> > -			continue;
> > -		}
> >  		/* OK, now we need to know about added descriptors. */
> >  		if (!headcount) {
> >  			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> > @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net)
> >  			 * they refilled. */
> >  			goto out;
> >  		}
> > +		if (nvq->rx_array)
> > +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > +		/* On overrun, truncate and discard */
> > +		if (unlikely(headcount > UIO_MAXIOV)) {
> > +			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > +			err = sock->ops->recvmsg(sock, &msg,
> > +						 1, MSG_DONTWAIT | MSG_TRUNC);
> > +			if (unlikely(err != 1))
> > +				kfree_skb((struct sk_buff *)msg.msg_control);
> 
> I think we'd better fix this in tun/tap (better in another patch) otherwise
> it lead to an odd API: some case skb were freed in recvmsg() but caller
> still need to deal with the rest case.

Right, it is better to handle it in recvmsg().

Wei
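
The ordering change described in the patch (check the headcount before
consuming from the batched rx array) can be illustrated with a toy model.
This is a user-space sketch under stated assumptions, not the vhost code:
struct toy_rx, rx_consume_first() and rx_check_first() are hypothetical
names, and "free_heads" merely stands in for the headcount.

```c
#include <stddef.h>

/* Toy model: packets already dequeued into a batch, plus the number of
 * descriptors the guest has made available. All names are hypothetical. */
struct toy_rx {
	size_t head, tail;	/* batch holds packets while head < tail */
	int free_heads;		/* stand-in for headcount */
	int delivered;		/* packets handed to the guest */
	int leaked;		/* packets consumed but never delivered */
};

/* Old order: take the packet out of the batch, then find there is no room. */
static void rx_consume_first(struct toy_rx *rx)
{
	while (rx->head < rx->tail) {
		rx->head++;			/* packet leaves the batch */
		if (rx->free_heads == 0) {
			rx->leaked++;		/* nothing owns it any more */
			return;
		}
		rx->free_heads--;
		rx->delivered++;
	}
}

/* New order: make sure there is room before touching the batch. */
static void rx_check_first(struct toy_rx *rx)
{
	while (rx->head < rx->tail) {
		if (rx->free_heads == 0)
			return;			/* packet stays queued */
		rx->head++;
		rx->free_heads--;
		rx->delivered++;
	}
}
```

With three queued packets and two free descriptors, the first variant drops
one packet on the floor while the second leaves it in the batch for the next
kick.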


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-11-29 Thread Jakub Kicinski
On Wed, 29 Nov 2017 20:10:09 -0800, Stephen Hemminger wrote:
> On Wed, 29 Nov 2017 19:51:38 -0800 Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:  
> > > On 2017年11月29日 03:27, Jesse Brandeburg wrote:
> > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > > Author: stephen hemminger 
> > > Date:   Tue Aug 1 19:58:53 2017 -0700
> > > 
> > >      netvsc: transparent VF management
> > > 
> > >      This patch implements transparent fail over from synthetic NIC
> > > to SR-IOV virtual function NIC in Hyper-V environment. It is a
> > > better alternative to using bonding as is done now. Instead, the
> > > receive and transmit fail over is done internally inside the driver.
> > > 
> > >      Using bonding driver has lots of issues because it depends on
> > > the script being run early enough in the boot process and with
> > > sufficient information to make the association. This patch moves
> > > all that functionality into the kernel.
> > > 
> > >      Signed-off-by: Stephen Hemminger 
> > >      Signed-off-by: David S. Miller 
> > > 
> > > If my understanding is correct there's no need for any extension
> > > of virtio spec. If this is true, maybe you can start to prepare the
> > > patch?
> > 
> > IMHO this is as close to policy in the kernel as one can get.  User
> > land has all the information it needs to instantiate that bond/team
> > automatically.  In fact I'm trying to discuss this with NetworkManager
> > folks and Red Hat right now:
> > 
> > https://mail.gnome.org/archives/networkmanager-list/2017-November/msg00038.html
> > 
> > Can we flip the argument and ask why is the kernel supposed to be
> > responsible for this?  It's not like we run DHCP out of the kernel
> > on new interfaces...   
> 
> Although "policy should not be in the kernel" is a great mantra,
> it is not practical in the real world.
> 
> If you think it can be solved in userspace, then you haven't had to
> deal with four different network initialization
> systems, multiple orchestration systems and customers on ancient
> Enterprise distributions.

I would accept that argument if anyone ever tried to get those
Enterprise distros to handle this use case.  From conversations I 
had it seemed like no one ever did, and SR-IOV+virtio bonding is 
what has been done to solve this since day 1 of SR-IOV networking.

For practical reasons it's easier to push this into the kernel, 
because vendors rarely employ developers of the user space
orchestration systems.  Is that not the real problem here,
potentially? :)


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Tobin C. Harding
On Wed, Nov 29, 2017 at 07:58:26PM -0800, Joe Perches wrote:
> On Thu, 2017-11-30 at 10:26 +1100, Tobin C. Harding wrote:
> > On Wed, Nov 29, 2017 at 03:20:58PM -0800, Andrew Morton wrote:
> > > On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding" wrote:
> > > 
> > > > printk specifier %p now hashes all addresses before printing. Sometimes
> > > > we need to see the actual unmodified address. This can be achieved using
> > > > %lx but then we face the risk that if in future we want to change the
> > > > way the Kernel handles printing of pointers we will have to grep through
> > > > the already existent 50 000 %lx call sites. Let's add specifier %px as a
> > > > clear, opt-in, way to print a pointer and maintain some level of
> > > > isolation from all the other hex integer output within the Kernel.
> > > > 
> > > > Add printk specifier %px to print the actual unmodified address.
> > > > 
> > > > ...
> > > > 
> > > > +Unmodified Addresses
> > > > +
> > > > +
> > > > +::
> > > > +
> > > > +   %px 01234567 or 0123456789abcdef
> > > > +
> > > > +For printing pointers when you _really_ want to print the address. Please
> > > > +consider whether or not you are leaking sensitive information about the
> > > > +Kernel layout in memory before printing pointers with %px. %px is
> > > > +functionally equivalent to %lx. %px is preferred to %lx because it is more
> > > > +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> > > > +handles printing pointers it will be nice to be able to find the call
> > > > +sites.
> > > > +
> > > 
> > > You might want to add a checkpatch rule which emits a stern
> > > do-you-really-want-to-do-this warning when someone uses %px.
> > > 
> > 
> > Oh, nice idea. It has to be a CHECK but right?
> 
> No, it has to be something that's not --strict
> so a WARN would probably be best.
> 
> > By stern, you mean use stern language?
> 
> I hope he doesn't mean tweet.

/me says tweet tweet (like a bird)

> Something like:
> ---
>  scripts/checkpatch.pl | 31 +--
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 0ce249f157a1..9d789cbe7df5 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -5758,21 +5758,40 @@ sub process {
>  		    defined $stat &&
>  		    $stat =~ /^\+(?![^\{]*\{\s*).*\b(\w+)\s*\(.*$String\s*,/s &&
>  		    $1 !~ /^_*volatile_*$/) {
> +			my $complete_extension = "";
> +			my $extension = "";
>  			my $bad_extension = "";
>  			my $lc = $stat =~ tr@\n@@;
>  			$lc = $lc + $linenr;
> +			my $stat_real;
>  			for (my $count = $linenr; $count <= $lc; $count++) {
>  				my $fmt = get_quoted_string($lines[$count - 1], raw_line($count, 0));
>  				$fmt =~ s/%%//g;
> -				if ($fmt =~ /(\%[\*\d\.]*p(?![\WFfSsBKRraEhMmIiUDdgVCbGNO]).)/) {
> -					$bad_extension = $1;
> -					last;
> +				while ($fmt =~ /(\%[\*\d\.]*p(\w))/g) {
> +					$complete_extension = $1;
> +					$extension = $2;
> +					if ($extension !~ /[FfSsBKRraEhMmIiUDdgVCbGNOx]/) {
> +						$bad_extension = $complete_extension;
> +						last;
> +					}
> +					if ($extension eq "x") {
> +						if (!defined($stat_real)) {
> +							$stat_real = raw_line($linenr, 0);
> +							for (my $count = $linenr + 1; $count <= $lc; $count++) {
> +								$stat_real = $stat_real . "\n" . raw_line($count, 0);
> +							}
> +						}
> +						WARN("VSPRINTF_POINTER_PX",
> +						     "Using vsprintf pointer extension '$complete_extension' exposes kernel address for possible hacking\n" . "$here\n$stat_real\n");
> +					}
>  				}
>  			}
>  			if ($bad_extension ne "") {
> -				my $stat_real = raw_line($linenr, 0);
> -				for (my $count = $linenr + 1; $count <= $lc; $count++) {
> -					$stat_real = $stat_real . "\n" .
> 

Re: KASAN: use-after-free Read in sock_release

2017-11-29 Thread Al Viro
On Thu, Nov 30, 2017 at 02:07:19AM +0000, Al Viro wrote:

> FWIW, looking through the callers of sock_alloc_file()... we might be
> better off if it did sock_release() on failure.  Then the calling
> conventions become "sock_alloc_file() means not calling sock_release()
> directly - either it'll be done by the final fput() on resulting file,
> or by sock_alloc_file() itself".

FWIW^2: vfs.git#work.net is (completely untested) implementation of
that.  KCM fixes + sock_alloc_file() calling conventions change.

That's _not_ a pull request, it obviously needs testing and review on
netdev.  I like the way it looks, though...
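
To make the proposed calling convention concrete, here is a toy user-space
sketch. It is an illustration only: the toy_* names are hypothetical and are
not kernel APIs, and failure is reported with NULL here rather than the
ERR_PTR() style the kernel code uses. The point is that the allocation
function takes ownership of the socket unconditionally, so the caller never
releases it directly.

```c
#include <stdlib.h>

/* Toy stand-ins; none of these are kernel APIs. */
struct toy_sock { int released; };
struct toy_file { struct toy_sock *sock; };

static int toy_release_calls;

static void toy_sock_release(struct toy_sock *s)
{
	s->released = 1;
	toy_release_calls++;
}

/* Takes ownership of @s in every case: on failure it releases @s itself
 * and returns NULL; on success the returned file owns @s. */
static struct toy_file *toy_alloc_file(struct toy_sock *s, int simulate_failure)
{
	struct toy_file *f = simulate_failure ? NULL : malloc(sizeof(*f));

	if (!f) {
		toy_sock_release(s);
		return NULL;
	}
	f->sock = s;
	return f;
}

/* Final put: dropping the file also releases the socket it owns. */
static void toy_fput(struct toy_file *f)
{
	toy_sock_release(f->sock);
	free(f);
}

/* The caller has no error-path release of its own. */
static int toy_caller(struct toy_sock *s, int simulate_failure)
{
	struct toy_file *f = toy_alloc_file(s, simulate_failure);

	if (!f)
		return -1;
	toy_fput(f);
	return 0;
}
```

On either path the socket is released exactly once, which is what lets the
error branches in the callers shrink to a plain return.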

Al Viro (3):
  socketpair(): allocate descriptors first
  fix kcm_clone()
  make sock_alloc_file() do sock_release() on failures
Diffstat:
 drivers/staging/lustre/lnet/lnet/lib-socket.c |   8 ++---
 net/9p/trans_fd.c                             |   1 -
 net/kcm/kcmsock.c                             |  68 ++---
 net/sctp/socket.c                             |   1 -
 net/socket.c                                  | 110 +++
 5 files changed, 69 insertions(+), 119 deletions(-)

Cumulative diff:

diff --git a/drivers/staging/lustre/lnet/lnet/lib-socket.c b/drivers/staging/lustre/lnet/lnet/lib-socket.c
index 539a26444f31..7d49d4865298 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-socket.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-socket.c
@@ -71,16 +71,12 @@ lnet_sock_ioctl(int cmd, unsigned long arg)
}
 
sock_filp = sock_alloc_file(sock, 0, NULL);
-   if (IS_ERR(sock_filp)) {
-   sock_release(sock);
-   rc = PTR_ERR(sock_filp);
-   goto out;
-   }
+   if (IS_ERR(sock_filp))
+   return PTR_ERR(sock_filp);
 
rc = kernel_sock_unlocked_ioctl(sock_filp, cmd, arg);
 
fput(sock_filp);
-out:
return rc;
 }
 
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 985046ae4231..80f5c79053a4 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -839,7 +839,6 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket)
if (IS_ERR(file)) {
pr_err("%s (%d): failed to map fd\n",
   __func__, task_pid_nr(current));
-   sock_release(csocket);
kfree(p);
return PTR_ERR(file);
}
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 0b750a22c4b9..d4e98f20fc2a 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1625,60 +1625,30 @@ static struct proto kcm_proto = {
 };
 
 /* Clone a kcm socket. */
-static int kcm_clone(struct socket *osock, struct kcm_clone *info,
-struct socket **newsockp)
+static struct file *kcm_clone(struct socket *osock)
 {
struct socket *newsock;
struct sock *newsk;
-   struct file *newfile;
-   int err, newfd;
 
-   err = -ENFILE;
newsock = sock_alloc();
if (!newsock)
-   goto out;
+   return ERR_PTR(-ENFILE);
 
newsock->type = osock->type;
newsock->ops = osock->ops;
 
__module_get(newsock->ops->owner);
 
-   newfd = get_unused_fd_flags(0);
-   if (unlikely(newfd < 0)) {
-   err = newfd;
-   goto out_fd_fail;
-   }
-
-   newfile = sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name);
-   if (IS_ERR(newfile)) {
-   err = PTR_ERR(newfile);
-   goto out_sock_alloc_fail;
-   }
-
newsk = sk_alloc(sock_net(osock->sk), PF_KCM, GFP_KERNEL,
                 &kcm_proto, true);
if (!newsk) {
-   err = -ENOMEM;
-   goto out_sk_alloc_fail;
+   sock_release(newsock);
+   return ERR_PTR(-ENOMEM);
}
-
sock_init_data(newsock, newsk);
init_kcm_sock(kcm_sk(newsk), kcm_sk(osock->sk)->mux);
 
-   fd_install(newfd, newfile);
-   *newsockp = newsock;
-   info->fd = newfd;
-
-   return 0;
-
-out_sk_alloc_fail:
-   fput(newfile);
-out_sock_alloc_fail:
-   put_unused_fd(newfd);
-out_fd_fail:
-   sock_release(newsock);
-out:
-   return err;
+   return sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name);
 }
 
 static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
@@ -1708,17 +1678,25 @@ static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
}
case SIOCKCMCLONE: {
struct kcm_clone info;
-   struct socket *newsock = NULL;
-
-   err = kcm_clone(sock, &info, &newsock);
-   if (!err) {
-   if (copy_to_user((void __user *)arg, &info,
-sizeof(info))) {
-   err = -EFAULT;
-   sys_close(info.fd);
-   }
-   }
+  

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-11-29 Thread Stephen Hemminger
On Wed, 29 Nov 2017 19:51:38 -0800
Jakub Kicinski  wrote:

> On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> > On 2017年11月29日 03:27, Jesse Brandeburg wrote:  
> > > Hi, I'd like to get some feedback on a proposal to enhance
> > > virtio-net to ease configuration of a VM and that would enable
> > > live migration of passthrough network SR-IOV devices.
> > >
> > > Today we have SR-IOV network devices (VFs) that can be passed
> > > into a VM in order to enable high performance networking direct
> > > within the VM. The problem I am trying to address is that this
> > > configuration is generally difficult to live-migrate.  There is
> > > documentation [1] indicating that some OS/Hypervisor vendors will
> > > support live migration of a system with a direct assigned
> > > networking device.  The problem I see with these implementations
> > > is that the network configuration requirements that are passed on
> > > to the owner of the VM are quite complicated.  You have to set up
> > > bonding, you have to configure it to enslave two interfaces,
> > > those interfaces (one is virtio-net, the other is SR-IOV
> > > device/driver like ixgbevf) must support MAC address changes
> > > requested in the VM, and on and on...
> > >
> > > So, on to the proposal:
> > > Modify virtio-net driver to be a single VM network device that
> > > enslaves an SR-IOV network device (inside the VM) with the same
> > > MAC address. This would cause the virtio-net driver to appear and
> > > work like a simplified bonding/team driver.  The live migration
> > > problem would be solved just like today's bonding solution, but
> > > the VM user's networking config would be greatly simplified.
> > >
> > > At its simplest, it would appear something like this in the VM.
> > >
> > > ==
> > > = vnet0  =
> > >   =
> > > (virtio- =   |
> > >   net)=   |
> > >   =  ==
> > >   =  = ixgbef =
> > > ==  ==
> > >
> > > (forgive the ASCII art)
> > >
> > > The fast path traffic would prefer the ixgbevf or other SR-IOV
> > > device path, and fall back to virtio's transmit/receive when
> > > migrating.
> > >
> > > Compared to today's options this proposal would
> > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > > speeds
> > > 2) simplify end user configuration in the VM (most if not all of
> > > the set up to enable migration would be done in the hypervisor)
> > > 3) allow live migration via a simple link down and maybe a PCI
> > > hot-unplug of the SR-IOV device, with failover to the
> > > virtio-net driver core
> > > 4) allow vendor agnostic hardware acceleration, and live migration
> > > between vendors if the VM os has driver support for all the
> > > required SR-IOV devices.
> > >
> > > Runtime operation proposed:
> > > -  virtio-net driver loads, SR-IOV driver loads
> > > - virtio-net finds other NICs that match its MAC address by
> > >   both examining existing interfaces and setting up a new device
> > >   notifier
> > > - virtio-net enslaves the first NIC with the same MAC address
> > > - virtio-net brings up the slave, and makes it the "preferred"
> > > path
> > > - virtio-net follows the behavior of an active backup bond/team
> > > - virtio-net acts as the interface to the VM
> > > - live migration initiates
> > > - link goes down on SR-IOV, or SR-IOV device is removed
> > > - failover to virtio-net as primary path
> > > - migration continues to new host
> > > - new host is started with virtio-net as primary
> > > - if no SR-IOV, virtio-net stays primary
> > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > > - virtio-net notices new NIC and starts over at enslave step above
> > >
> > > Future ideas (brainstorming):
> > > - Optimize Fast east-west by having special rules to direct
> > > east-west traffic through virtio-net traffic path
> > >
> > > Thanks for reading!
> > > Jesse
> > 
> > Cc netdev.
> > 
> > Interesting, and this method is actually used by netvsc now:
> > 
> > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > Author: stephen hemminger 
> > Date:   Tue Aug 1 19:58:53 2017 -0700
> > 
> >      netvsc: transparent VF management
> > 
> >      This patch implements transparent fail over from synthetic NIC
> > to SR-IOV virtual function NIC in Hyper-V environment. It is a
> > better alternative to using bonding as is done now. Instead, the
> > receive and transmit fail over is done internally inside the driver.
> > 
> >      Using bonding driver has lots of issues because it depends on
> > the script being run early enough in the boot process and with
> > sufficient information to make the association. This patch moves
> > all that functionality into the kernel.
> > 
> >      Signed-off-by: Stephen Hemminger 
> >      Signed-off-by: David S. Miller 
> > 
> > If my understanding is correct 

Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Joe Perches
On Thu, 2017-11-30 at 10:26 +1100, Tobin C. Harding wrote:
> On Wed, Nov 29, 2017 at 03:20:58PM -0800, Andrew Morton wrote:
> > On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding"  wrote:
> > 
> > > printk specifier %p now hashes all addresses before printing. Sometimes
> > > we need to see the actual unmodified address. This can be achieved using
> > > %lx but then we face the risk that if in future we want to change the
> > > way the Kernel handles printing of pointers we will have to grep through
> > > the already existent 50 000 %lx call sites. Let's add specifier %px as a
> > > clear, opt-in, way to print a pointer and maintain some level of
> > > isolation from all the other hex integer output within the Kernel.
> > > 
> > > Add printk specifier %px to print the actual unmodified address.
> > > 
> > > ...
> > > 
> > > +Unmodified Addresses
> > > +
> > > +
> > > +::
> > > +
> > > + %px 01234567 or 0123456789abcdef
> > > +
> > > +For printing pointers when you _really_ want to print the address. Please
> > > +consider whether or not you are leaking sensitive information about the
> > > +Kernel layout in memory before printing pointers with %px. %px is
> > > +functionally equivalent to %lx. %px is preferred to %lx because it is 
> > > more
> > > +uniquely grep'able. If, in the future, we need to modify the way the 
> > > Kernel
> > > +handles printing pointers it will be nice to be able to find the call
> > > +sites.
> > > +
> > 
> > You might want to add a checkpatch rule which emits a stern
> > do-you-really-want-to-do-this warning when someone uses %px.
> > 
> 
> Oh, nice idea. It has to be a CHECK but right?

No, it has to be something that's not --strict
so a WARN would probably be best.

> By stern, you mean use stern language?

I hope he doesn't mean tweet.

Something like:
---
 scripts/checkpatch.pl | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 0ce249f157a1..9d789cbe7df5 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -5758,21 +5758,40 @@ sub process {
defined $stat &&
$stat =~ /^\+(?![^\{]*\{\s*).*\b(\w+)\s*\(.*$String\s*,/s &&
$1 !~ /^_*volatile_*$/) {
+   my $complete_extension = "";
+   my $extension = "";
my $bad_extension = "";
my $lc = $stat =~ tr@\n@@;
$lc = $lc + $linenr;
+   my $stat_real;
for (my $count = $linenr; $count <= $lc; $count++) {
my $fmt = get_quoted_string($lines[$count - 1], raw_line($count, 0));
$fmt =~ s/%%//g;
-   if ($fmt =~ /(\%[\*\d\.]*p(?![\WFfSsBKRraEhMmIiUDdgVCbGNO]).)/) {
-   $bad_extension = $1;
-   last;
+   while ($fmt =~ /(\%[\*\d\.]*p(\w))/g) {
+   $complete_extension = $1;
+   $extension = $2;
+   if ($extension !~ /[FfSsBKRraEhMmIiUDdgVCbGNOx]/) {
+   $bad_extension = $complete_extension;
+   last;
+   }
+   if ($extension eq "x") {
+   if (!defined($stat_real)) {
+   $stat_real = raw_line($linenr, 0);
+   for (my $count = $linenr + 1; $count <= $lc; $count++) {
+   $stat_real = $stat_real . "\n" . raw_line($count, 0);
+   }
+   }
+   WARN("VSPRINTF_POINTER_PX",
+"Using vsprintf pointer extension '$complete_extension' exposes kernel address for possible hacking\n" . "$here\n$stat_real\n");
+   }
}
}
if ($bad_extension ne "") {
-   my $stat_real = raw_line($linenr, 0);
-   for (my $count = $linenr + 1; $count <= $lc; $count++) {
-   $stat_real = $stat_real . "\n" . raw_line($count, 0);
+   if (!defined($stat_real)) {
+   $stat_real = raw_line($linenr, 0);
+   for (my $count = $linenr + 1; $count <= $lc; $count++) {
+
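
For readers who do not speak Perl, the classification the patch above performs on each format string can be sketched in plain C. This is only an illustration and not part of checkpatch: the names classify_format, px_verdict and known_ext are made up for this sketch, and the list of extension letters is copied from the character class in the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Result of scanning one format string for %p extensions. */
enum px_verdict {
    PX_OK = 0,          /* plain %p or only known, hashed extensions */
    PX_RAW_ADDRESS,     /* at least one %px was found */
    PX_BAD_EXTENSION    /* a %p<letter> that is not in the known list */
};

/* Extension letters copied from the character class in the patch. */
static const char known_ext[] = "FfSsBKRraEhMmIiUDdgVCbGNOx";

enum px_verdict classify_format(const char *fmt)
{
    enum px_verdict verdict = PX_OK;
    const char *p;

    assert(fmt != NULL);
    for (p = fmt; *p != '\0'; p++) {
        const char *q;
        char ext;

        if (*p != '%')
            continue;
        if (p[1] == '%') {      /* literal "%%", skip both characters */
            p++;
            continue;
        }
        /* Skip width and precision, i.e. the [\*\d\.]* part of the regex. */
        q = p + 1;
        while (*q != '\0' && strchr("*0123456789.", *q) != NULL)
            q++;
        if (*q != 'p')
            continue;
        ext = q[1];
        /* The regex requires a word character (\w) after the 'p'. */
        if (!isalnum((unsigned char)ext) && ext != '_')
            continue;           /* plain %p, nothing to report */
        if (strchr(known_ext, ext) == NULL)
            return PX_BAD_EXTENSION;    /* mirrors the "last" in the patch */
        if (ext == 'x')
            verdict = PX_RAW_ADDRESS;   /* the case that triggers the WARN */
    }
    return verdict;
}
```

A caller would treat PX_RAW_ADDRESS as the "do you really want to do this" case and PX_BAD_EXTENSION as the existing invalid-extension error.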

Re: Commit 05cf0d1bf4 ("net: stmmac: free an skb first when there are no longer any descriptors using it") breaks stmmac?

2017-11-29 Thread Niklas Cassel
On Mon, Nov 27, 2017 at 02:41:00PM +, Jose Abreu wrote:
> Hi Niklas,

Hello Jose,

> 
> I think your commit 05cf0d1bf4 ("net: stmmac: free an skb first
> when there are no longer any descriptors using it") is breaking
> stmmac driver in multi-queue configuration (this stacktrace may
> contain some extra characters as I was using serial port):
> 
> ->8-
> general protection fault:  [#1] SMP
> Modules linked in: stmmac_pci stmmac libphy igb ptp pps_core
> x86_pkg_temp_thermal
> CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW   4.14.0-rc5 #2
> Hardware name: Default string Default string/SKYBAY, BIOS 5.0.1.1
> 10/06/2016
> task: a2fe14d8b100 task.stack: b8c6000b8000
> RIP: 0010:skb_release_data+0x66/0x110
> RSP: 0018:a2fe2dd43d98 EFLAGS: 00010206
> RAX: 0030 RBX: a2fe13fab100 RCX: 05aa
> RDX: a2fe12a5 RSI:  RDI: fffcfffdfffbfffc
> RBP: a2fe2dd43db0 R08: a2fe2dfcd000 R09: 0001
> R10: a06245d0 R11: a2fe14c03700 R12: 
> R13: a2fe11e686c0 R14: a2fe13fab100 R15: a2fe129b8940
> FS:  () GS:a2fe2dd4()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fe26c457000 CR3: 2b609003 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  skb_release_all+0x1f/0x30
>  consume_skb+0x1d/0x40
>  __dev_kfree_skb_any+0x2a/0x30
>  stmmac_tx_clean+0x230/0x4d0 [stmmac]
>  stmmac_poll+0x4b/0x980 [stmmac]
>  net_rx_action+0x1ad/0x290
>  __do_softirq+0xdd/0x1d6
>  irq_exit+0x77/0x80
>  do_IRQ+0x4a/0xc0
>  common_interrupt+0x93/0x93
>  
> RIP: 0010:cpuidle_enter_state+0x16a/0x210
> RSP: 0018:b8c6000bbe90 EFLAGS: 0286 ORIG_RAX:
> ffae
> RAX: a2fe2dd575c0 RBX: a2fe14560200 RCX: 001f
> RDX:  RSI: 0122dbd7ae06 RDI: 
> RBP: b8c6000bbec0 R08: 0020 R09: 0002
> R10: b8c6000bbe60 R11: 00123400 R12: 004f3c11b47a
> R13: 0003 R14: a0e3aa58 R15: 0003
>  cpuidle_enter+0x12/0x20
>  call_cpuidle+0x1e/0x40
>  do_idle+0x16a/0x1c0
>  cpu_startup_entry+0x18/0x20
>  start_secondary+0x10d/0x110
>  secondary_startup_64+0xa5/0xa5
> ->8-
> 
> Using tree with your commit I get this stacktrace upon streaming
> data at random time (stacktrace does not appear everytime),
> without the commit I get no errors.
> 
> I understand the reason for your commit but we already have
> dirty_tx in stmmac_tx_clean(), shouldn't it be enough to prevent
> skb use-after-free? Can you please review your patch to make sure
> nothing else is missing?

Dirty is not enough.

The problem that I thought I solved
was the case where a single skb is spread out over
3 tx descs like this:

TX DESC #1: first descriptor bit set
TX DESC #2: 
TX DESC #3: last descriptor bit set

before my patch the
tx_q->tx_skbuff[first_entry] would have the skb pointer.

Let's say that curr_tx is pointing to TX DESC #4,
since curr_tx is supposed to point to the place where
we are going to start filling next time.

The problem I thought I solved was that if stmmac_tx_clean
came and started cleaning, it could clean TX DESC #1,
and thus free the skb (which TX DESC #2 and TX DESC #3 still use).

However, by reading the databook more carefully:
If an Ethernet packet is stored over data buffers in multiple
descriptors, the DMA will fetch all descriptors, and
will only update the own bit in all descriptors,
after the complete packet has been sent.

Because of this, I think that the case where
TX DESC #1 gets its own bit cleared, while the hardware
hasn't yet fetched the TX DESC #2 and TX DESC #3 cannot
happen, so my patch 05cf0d1bf4 ("net: stmmac: free an skb first
when there are no longer any descriptors using it")
has no real purpose, and could be reverted.
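
To make the argument above concrete, here is a small user-space simulation. It is only a sketch under the assumption stated in the databook quote (the own bit is cleared on all descriptors of a packet at once, after the whole packet has been sent). It is not the stmmac code: struct tx_ring, ring_xmit, hw_complete_packet and ring_clean are hypothetical names, and the skb is represented by a plain integer id stored at the first entry, as in the pre-patch layout described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define RING_SIZE 8

struct tx_desc {
    bool own;    /* true while the simulated DMA engine owns the descriptor */
    bool first;  /* first descriptor bit */
    bool last;   /* last descriptor bit */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    int skb_id[RING_SIZE];  /* 0 means "no skb stored at this entry" */
    unsigned int cur_tx;    /* next entry to fill */
    unsigned int dirty_tx;  /* next entry to clean */
};

void ring_init(struct tx_ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Queue one packet over ndesc descriptors; returns the first entry.
 * The sketch does not check for a full ring. */
unsigned int ring_xmit(struct tx_ring *r, int skb_id, unsigned int ndesc)
{
    unsigned int first = r->cur_tx;
    unsigned int i;

    assert(skb_id != 0 && ndesc >= 1 && ndesc < RING_SIZE);
    for (i = 0; i < ndesc; i++) {
        unsigned int e = (first + i) % RING_SIZE;

        r->desc[e].own = true;
        r->desc[e].first = (i == 0);
        r->desc[e].last = (i == ndesc - 1);
        r->skb_id[e] = (i == 0) ? skb_id : 0;  /* skb kept at first entry */
    }
    r->cur_tx = (first + ndesc) % RING_SIZE;
    return first;
}

/* Assumed hardware behaviour: once the packet is sent, the own bit is
 * cleared on every descriptor of that packet in one go. */
void hw_complete_packet(struct tx_ring *r, unsigned int first)
{
    unsigned int e = first;

    assert(r->desc[first].first);
    for (;;) {
        bool last = r->desc[e].last;

        r->desc[e].own = false;
        if (last)
            break;
        e = (e + 1) % RING_SIZE;
    }
}

/* Walk from dirty_tx towards cur_tx and stop at the first descriptor
 * still owned by the hardware. Returns the number of skbs freed. */
int ring_clean(struct tx_ring *r)
{
    int freed = 0;

    while (r->dirty_tx != r->cur_tx) {
        unsigned int e = r->dirty_tx;

        if (r->desc[e].own)
            break;
        if (r->skb_id[e] != 0) {
            r->skb_id[e] = 0;   /* stands in for freeing the skb */
            freed++;
        }
        r->dirty_tx = (e + 1) % RING_SIZE;
    }
    return freed;
}
```

Under that assumption the cleaner either sees all three descriptors still owned (and frees nothing) or sees all three released, so the skb stored at TX DESC #1 is never freed while TX DESC #2 and #3 are still in flight.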



However, I don't see why it should cause your
kernel to crash.

It looks like skb_release_data is trying to access a member in a pointer
that is null, i.e. a double free of an skb.

I assume this is because the same skb address exists in several queues.

My guess is that the problem is that tx_q->tx_skbuff[] does not get set to
NULL properly when having multiple queues. (It obviously works for a single
tx queue).

stmmac_tx_clean() will do:
if (skb)
tx_q->tx_skbuff[entry] = NULL;

so skbs should always be set to NULL.

However, one difference seems to be that stmmac_xmit and stmmac_tso_xmit,
inside the nfrags for-loop, set the skb pointer to NULL.

for (i = 0; i < nfrags; i++) {
tx_q->tx_skbuff[entry] = NULL;

Considering this is already done in stmmac_tx_clean, this seems redundant.

But one difference is that before my patch,
all frags, except first TX DESC, would get NULL:ed
(first desc would be assigned the skb).

After my patch, 

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-11-29 Thread Jakub Kicinski
On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> On 2017年11月29日 03:27, Jesse Brandeburg wrote:
> > Hi, I'd like to get some feedback on a proposal to enhance virtio-net
> > to ease configuration of a VM and that would enable live migration of
> > passthrough network SR-IOV devices.
> >
> > Today we have SR-IOV network devices (VFs) that can be passed into a VM
> > in order to enable high performance networking direct within the VM.
> > The problem I am trying to address is that this configuration is
> > generally difficult to live-migrate.  There is documentation [1]
> > indicating that some OS/Hypervisor vendors will support live migration
> > of a system with a direct assigned networking device.  The problem I
> > see with these implementations is that the network configuration
> > requirements that are passed on to the owner of the VM are quite
> > complicated.  You have to set up bonding, you have to configure it to
> > enslave two interfaces, those interfaces (one is virtio-net, the other
> > is SR-IOV device/driver like ixgbevf) must support MAC address changes
> > requested in the VM, and on and on...
> >
> > So, on to the proposal:
> > Modify virtio-net driver to be a single VM network device that
> > enslaves an SR-IOV network device (inside the VM) with the same MAC
> > address. This would cause the virtio-net driver to appear and work like
> > a simplified bonding/team driver.  The live migration problem would be
> > solved just like today's bonding solution, but the VM user's networking
> > config would be greatly simplified.
> >
> > At its simplest, it would appear something like this in the VM.
> >
> > ==
> > = vnet0  =
> >   =
> > (virtio- =   |
> >   net)=   |
> >   =  ==
> >   =  = ixgbef =
> > ==  ==
> >
> > (forgive the ASCII art)
> >
> > The fast path traffic would prefer the ixgbevf or other SR-IOV device
> > path, and fall back to virtio's transmit/receive when migrating.
> >
> > Compared to today's options this proposal would
> > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > speeds
> > 2) simplify end user configuration in the VM (most if not all of the
> > set up to enable migration would be done in the hypervisor)
> > 3) allow live migration via a simple link down and maybe a PCI
> > hot-unplug of the SR-IOV device, with failover to the virtio-net
> > driver core
> > 4) allow vendor agnostic hardware acceleration, and live migration
> > between vendors if the VM os has driver support for all the required
> > SR-IOV devices.
> >
> > Runtime operation proposed:
> > -  virtio-net driver loads, SR-IOV driver loads
> > - virtio-net finds other NICs that match its MAC address by
> >   both examining existing interfaces and setting up a new device notifier
> > - virtio-net enslaves the first NIC with the same MAC address
> > - virtio-net brings up the slave, and makes it the "preferred" path
> > - virtio-net follows the behavior of an active backup bond/team
> > - virtio-net acts as the interface to the VM
> > - live migration initiates
> > - link goes down on SR-IOV, or SR-IOV device is removed
> > - failover to virtio-net as primary path
> > - migration continues to new host
> > - new host is started with virtio-net as primary
> > - if no SR-IOV, virtio-net stays primary
> > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > - virtio-net notices new NIC and starts over at enslave step above
> >
> > Future ideas (brainstorming):
> > - Optimize Fast east-west by having special rules to direct east-west
> >traffic through virtio-net traffic path
> >
> > Thanks for reading!
> > Jesse  
> 
> Cc netdev.
> 
> Interesting, and this method is actually used by netvsc now:
> 
> commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> Author: stephen hemminger 
> Date:   Tue Aug 1 19:58:53 2017 -0700
> 
>      netvsc: transparent VF management
> 
>      This patch implements transparent fail over from synthetic NIC to
>      SR-IOV virtual function NIC in Hyper-V environment. It is a better
>      alternative to using bonding as is done now. Instead, the receive and
>      transmit fail over is done internally inside the driver.
> 
>      Using bonding driver has lots of issues because it depends on the
>      script being run early enough in the boot process and with sufficient
>      information to make the association. This patch moves all that
>      functionality into the kernel.
> 
>      Signed-off-by: Stephen Hemminger 
>      Signed-off-by: David S. Miller 
> 
> If my understanding is correct, there's no need for any extension of the
> virtio spec. If this is true, maybe you can start to prepare the patch?

IMHO this is as close to policy in the kernel as one can get.  User
land has all the information it needs to instantiate that bond/team
automatically.  In fact I'm 

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-11-29 Thread Jason Wang



On 2017年11月29日 03:27, Jesse Brandeburg wrote:

Hi, I'd like to get some feedback on a proposal to enhance virtio-net
to ease configuration of a VM and that would enable live migration of
passthrough network SR-IOV devices.

Today we have SR-IOV network devices (VFs) that can be passed into a VM
in order to enable high performance networking direct within the VM.
The problem I am trying to address is that this configuration is
generally difficult to live-migrate.  There is documentation [1]
indicating that some OS/Hypervisor vendors will support live migration
of a system with a direct assigned networking device.  The problem I
see with these implementations is that the network configuration
requirements that are passed on to the owner of the VM are quite
complicated.  You have to set up bonding, you have to configure it to
enslave two interfaces, those interfaces (one is virtio-net, the other
is SR-IOV device/driver like ixgbevf) must support MAC address changes
requested in the VM, and on and on...

So, on to the proposal:
Modify virtio-net driver to be a single VM network device that
enslaves an SR-IOV network device (inside the VM) with the same MAC
address. This would cause the virtio-net driver to appear and work like
a simplified bonding/team driver.  The live migration problem would be
solved just like today's bonding solution, but the VM user's networking
config would be greatly simplified.

At its simplest, it would appear something like this in the VM.

==
= vnet0  =
  =
(virtio- =   |
  net)=   |
  =  ==
  =  = ixgbef =
==  ==

(forgive the ASCII art)

The fast path traffic would prefer the ixgbevf or other SR-IOV device
path, and fall back to virtio's transmit/receive when migrating.

Compared to today's options this proposal would
1) make virtio-net more sticky, allow fast path traffic at SR-IOV
speeds
2) simplify end user configuration in the VM (most if not all of the
set up to enable migration would be done in the hypervisor)
3) allow live migration via a simple link down and maybe a PCI
hot-unplug of the SR-IOV device, with failover to the virtio-net
driver core
4) allow vendor agnostic hardware acceleration, and live migration
between vendors if the VM os has driver support for all the required
SR-IOV devices.

Runtime operation proposed:
-  virtio-net driver loads, SR-IOV driver loads
- virtio-net finds other NICs that match its MAC address by
   both examining existing interfaces and setting up a new device notifier
- virtio-net enslaves the first NIC with the same MAC address
- virtio-net brings up the slave, and makes it the "preferred" path
- virtio-net follows the behavior of an active backup bond/team
- virtio-net acts as the interface to the VM
- live migration initiates
- link goes down on SR-IOV, or SR-IOV device is removed
- failover to virtio-net as primary path
- migration continues to new host
- new host is started with virtio-net as primary
- if no SR-IOV, virtio-net stays primary
- hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
- virtio-net notices new NIC and starts over at enslave step above
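
The runtime steps above boil down to an active-backup path selection keyed on the MAC address. A minimal user-space sketch of that state machine follows; it is not virtio-net or netvsc code, and struct failover and the failover_* functions are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAC_LEN 6

enum tx_path { PATH_VIRTIO, PATH_VF };

struct failover {
    unsigned char mac[MAC_LEN];  /* MAC address of the virtio-net device */
    bool vf_enslaved;            /* an SR-IOV VF with the same MAC is attached */
    bool vf_link_up;             /* link state of that VF */
};

void failover_init(struct failover *f, const unsigned char *mac)
{
    assert(f != NULL && mac != NULL);
    memcpy(f->mac, mac, MAC_LEN);
    f->vf_enslaved = false;
    f->vf_link_up = false;
}

/* Device notifier: a new NIC appeared. Only the first NIC with a
 * matching MAC address is enslaved. Returns true if it was enslaved. */
bool failover_vf_added(struct failover *f, const unsigned char *vf_mac,
                       bool link_up)
{
    if (f->vf_enslaved || memcmp(f->mac, vf_mac, MAC_LEN) != 0)
        return false;
    f->vf_enslaved = true;
    f->vf_link_up = link_up;
    return true;
}

/* Link change on the enslaved VF, e.g. link down when migration starts. */
void failover_vf_link(struct failover *f, bool up)
{
    if (f->vf_enslaved)
        f->vf_link_up = up;
}

/* The VF was hot-unplugged. */
void failover_vf_removed(struct failover *f)
{
    f->vf_enslaved = false;
    f->vf_link_up = false;
}

/* The VF is the preferred path; virtio is the fallback. */
enum tx_path failover_tx_path(const struct failover *f)
{
    return (f->vf_enslaved && f->vf_link_up) ? PATH_VF : PATH_VIRTIO;
}
```

On the destination host the same sequence simply starts again: the path stays on virtio until a VF with the matching MAC is hot-added.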

Future ideas (brainstorming):
- Optimize Fast east-west by having special rules to direct east-west
   traffic through virtio-net traffic path

Thanks for reading!
Jesse


Cc netdev.

Interesting, and this method is actually used by netvsc now:

commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
Author: stephen hemminger 
Date:   Tue Aug 1 19:58:53 2017 -0700

    netvsc: transparent VF management

    This patch implements transparent fail over from synthetic NIC to
    SR-IOV virtual function NIC in Hyper-V environment. It is a better
    alternative to using bonding as is done now. Instead, the receive and
    transmit fail over is done internally inside the driver.

    Using bonding driver has lots of issues because it depends on the
    script being run early enough in the boot process and with sufficient
    information to make the association. This patch moves all that
    functionality into the kernel.

    Signed-off-by: Stephen Hemminger 
    Signed-off-by: David S. Miller 

If my understanding is correct, there's no need for any extension of the
virtio spec. If this is true, maybe you can start to prepare the patch?


Thanks



[1]
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts




Re: [BUG] kernel stack corruption during/after Netlabel error

2017-11-29 Thread Casey Schaufler

On 11/29/2017 4:31 PM, James Morris wrote:
> On Wed, 29 Nov 2017, Casey Schaufler wrote:
>
>> I see that there is a proposed fix later in the thread, but I don't see
>> the patch. Could you send it to me, so I can try it on my problem?
> Forwarded off-list.

The patch does fix the problem I was seeing in Smack.

>
> Interestingly, I didn't see the KASAN output email from Stephen here.
>
>



Re: [Patch net v2] act_sample: get rid of tcf_sample_cleanup_rcu()

2017-11-29 Thread Eric Dumazet
On Wed, 2017-11-29 at 16:07 -0800, Cong Wang wrote:
> Similar to commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu"),
> TC actions don't need to respect RCU grace period, because it
> is either just detached from tc filter (standalone case) or
> it is removed together with tc filter (bound case) in which case
> RCU grace period is already respected at filter layer.
> 
> Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
> Reported-by: Eric Dumazet 
> Cc: Jamal Hadi Salim 
> Cc: Jiri Pirko 
> Cc: Yotam Gigi 
> Signed-off-by: Cong Wang 
> ---
>  include/net/tc_act/tc_sample.h |  1 -
>  net/sched/act_sample.c | 14 +++---
>  2 files changed, 3 insertions(+), 12 deletions(-)
> 

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()

2017-11-29 Thread Jason Wang



On 2017年11月29日 23:31, Michael S. Tsirkin wrote:

On Wed, Nov 29, 2017 at 09:23:24AM -0500, w...@redhat.com wrote:

From: Wei Xu

Matthew found a roughly 40% tcp throughput regression with commit
c67df11f(vhost_net: try batch dequing from skb array) as discussed
in the following thread:
https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

Eventually we figured out that it was a skb leak in handle_rx()
when sending packets to the VM. This usually happens when a guest
cannot drain the vq as fast as vhost fills it; the resulting traffic
jam leaks skb(s), because vhost has no headcount left to send on
the vq.

This can be avoided by making sure we have got enough headcount
before actually consuming a skb from the batched rx array while
transmitting, which is simply done by moving checking the zero
headcount a bit ahead.

Also close the small window for a leak when recvmsg() fails, by
freeing the skb.
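
The ordering problem described in this commit message can be shown with a toy model. The sketch below is not the vhost code: struct rx_batch, rx_one_buggy, rx_one_fixed and rx_leaked are hypothetical names, and an skb is reduced to a counter. It only contrasts "consume from the batch, then check headcount" with "check headcount, then consume".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rx_batch {
    int pending;    /* skbs still sitting in the batched rx array */
    int consumed;   /* skbs taken out of the array */
    int delivered;  /* skbs actually handed to the guest */
};

/* Old order: the skb is consumed first, then the zero-headcount check
 * bails out, so the consumed skb is neither delivered nor put back. */
bool rx_one_buggy(struct rx_batch *b, int headcount)
{
    assert(b != NULL && headcount >= 0);
    if (b->pending == 0)
        return false;
    b->pending--;
    b->consumed++;
    if (headcount == 0)
        return false;   /* leak: nobody owns the skb any more */
    b->delivered++;
    return true;
}

/* New order: check headcount first, consume only when the skb can be
 * delivered. With no headcount the skb simply stays in the array. */
bool rx_one_fixed(struct rx_batch *b, int headcount)
{
    assert(b != NULL && headcount >= 0);
    if (b->pending == 0)
        return false;
    if (headcount == 0)
        return false;
    b->pending--;
    b->consumed++;
    b->delivered++;
    return true;
}

/* skbs that were consumed but never delivered. */
int rx_leaked(const struct rx_batch *b)
{
    return b->consumed - b->delivered;
}
```

With two pending skbs and a guest that offers no buffers, the first variant loses one skb per call while the second leaves the batch untouched.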

Signed-off-by: Wei Xu
Reported-by: Matthew Rosato
---
  drivers/vhost/net.c | 23 +--
  1 file changed, 13 insertions(+), 10 deletions(-)

v2:
- add Matthew as the reporter, thanks matthew.
- moving zero headcount check ahead instead of defer consuming skb
   due to jason and mst's comment.
- add freeing skb in favor of recvmsg() fails.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8d626d7..e302e08 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
-   if (nvq->rx_array)
-   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
-   /* On overrun, truncate and discard */
-   if (unlikely(headcount > UIO_MAXIOV)) {
-   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
-   err = sock->ops->recvmsg(sock, &msg,
-1, MSG_DONTWAIT | MSG_TRUNC);
-   pr_debug("Discarded rx packet: len %zd\n", sock_len);
-   continue;
-   }
/* OK, now we need to know about added descriptors. */
if (!headcount) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net)
 * they refilled. */
goto out;
}
+   if (nvq->rx_array)
+   msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+   /* On overrun, truncate and discard */
+   if (unlikely(headcount > UIO_MAXIOV)) {
+   iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
+   err = sock->ops->recvmsg(sock, &msg,
+1, MSG_DONTWAIT | MSG_TRUNC);
+   if (unlikely(err != 1))

Why 1? How is receiving 1 byte special or even possible?
Also, I wouldn't put an unlikely here. It's all error handling code anyway.


+   kfree_skb((struct sk_buff *)msg.msg_control);

You do not need a cast here.
Also, is it really safe to refer to msg_control here?
I'd rather keep a copy of the skb pointer and use it than assume
caller did not change it. But also see below.


+   pr_debug("Discarded rx packet: len %zd\n", sock_len);
+   continue;
+   }
/* We don't need to be notified again. */
iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
fixup = msg.msg_iter;
@@ -818,6 +820,7 @@ static void handle_rx(struct vhost_net *net)
pr_debug("Discarded rx packet: "
 " len %d, expected %zd\n", err, sock_len);
vhost_discard_vq_desc(vq, headcount);
+   kfree_skb((struct sk_buff *)msg.msg_control);

You do not need a cast here.

Also, we have

 ret = tun_put_user(tun, tfile, skb, to);
 if (unlikely(ret < 0))
 kfree_skb(skb);
 else
 consume_skb(skb);

 return ret;

So it looks like recvmsg actually always consumes the skb.
So I was wrong when I said you need to kfree it after
recv msg, and your original patch was good.

Jason, what do you think?



tun_recvmsg() has the following check:

static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
           int flags)
{
    struct tun_file *tfile = container_of(sock, struct tun_file, socket);
    struct tun_struct *tun = __tun_get(tfile);
    int ret;

    if (!tun)
        return -EBADFD;

    if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
        ret = -EINVAL;
        goto out;
    }

And tun_do_read() 

[PATCH net] sit: update frag_off info

2017-11-29 Thread Hangbin Liu
After parsing the sit netlink change info, we forget to update frag_off in
ipip6_tunnel_update(). Fix it by assigning frag_off with new value.

Fixes: f37234160233 ("sit: add support of link creation via rtnl")
Reported-by: Jianlin Shi 
Signed-off-by: Hangbin Liu 
---
 net/ipv6/sit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index d60ddcb..d7dc23c 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1098,6 +1098,7 @@ static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p,
ipip6_tunnel_link(sitn, t);
t->parms.iph.ttl = p->iph.ttl;
t->parms.iph.tos = p->iph.tos;
+   t->parms.iph.frag_off = p->iph.frag_off;
if (t->parms.link != p->link || t->fwmark != fwmark) {
t->parms.link = p->link;
t->fwmark = fwmark;
-- 
2.5.5



Re: KASAN: use-after-free Read in sock_release

2017-11-29 Thread Al Viro
On Wed, Nov 29, 2017 at 11:37:04AM -0800, Cong Wang wrote:

> > Allocated by task 31066:
> >  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
> >  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3613
> >  kmalloc include/linux/slab.h:499 [inline]
> >  sock_alloc_inode+0xb4/0x300 net/socket.c:253
> >  alloc_inode+0x65/0x180 fs/inode.c:208
> >  new_inode_pseudo+0x69/0x190 fs/inode.c:890
> >  sock_alloc+0x41/0x270 net/socket.c:565
> >  __sock_create+0x148/0x850 net/socket.c:1225
> >  sock_create net/socket.c:1301 [inline]
> >  SYSC_socket net/socket.c:1331 [inline]
> >  SyS_socket+0xeb/0x200 net/socket.c:1311
> >  entry_SYSCALL_64_fastpath+0x1f/0x96
> >
> > Freed by task 3039:
> >  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
> >  __cache_free mm/slab.c:3491 [inline]
> >  kfree+0xca/0x250 mm/slab.c:3806
> >  __rcu_reclaim kernel/rcu/rcu.h:190 [inline]
> >  rcu_do_batch kernel/rcu/tree.c:2758 [inline]
> >  invoke_rcu_callbacks kernel/rcu/tree.c:3012 [inline]
> >  __rcu_process_callbacks kernel/rcu/tree.c:2979 [inline]
> >  rcu_process_callbacks+0xe79/0x17d0 kernel/rcu/tree.c:2996
> >  __do_softirq+0x29d/0xbb2 kernel/softirq.c:285

IDGI.  We are running into the object pointed to by sock->wq
already freed, right?  So how the hell had we managed to _fetch_
the pointer in the first place?  Freeing had been scheduled
by
wq = rcu_dereference_protected(ei->socket.wq, 1);  
kfree_rcu(wq, rcu);
kmem_cache_free(sock_inode_cachep, ei);
so we should have
* sock_destroy_inode() run on another CPU while we are
in the middle of sock_release(), sock->wq fetched by sock_release(),
sock->wq fed to kfree_rcu() by sock_destroy_inode() *AND* freed
before sock_release() got around to dereferencing it.

Not impossible to hit, but... why hadn't we run into a
much wider window?  If that sock_destroy_inode() on another
CPU had gotten to the call right after that kfree_rcu(), we
would've seen use-after-free on attempt to fetch ->wq...

And it goes without saying that sock_destroy_inode() should
not have happened in parallel to sock_release(), or, for that matter,
to anything else done to struct socket instance...


Re: KASAN: use-after-free Read in sock_release

2017-11-29 Thread Al Viro
On Wed, Nov 29, 2017 at 12:24:55PM -0800, Linus Torvalds wrote:
> Ugh. The inode freeing really is confusing and fairly involved, but
> the last free *should* happen as part of the final dput() that is done
> at the end of __fput().

Note that struct socket is coallocated with its inode.  _Normally_
from sock_alloc() (and that's the case here, apparently), but in several
cases it's embedded into another object.  TUN and TAP definitely do that,
and others might have been added since.  Those should never be passed to
sock_release() at all.

> So in __fput() calls into the
> 
> if (file->f_op->release)
> file->f_op->release(inode, file);
> 
> then the inode should still be around, because the final ref won't be
> done until later. And RCU simply shouldn't be an issue, because of
> that reference count on the inode.
> 
> So it smells like some reference counting went wrong. The socket inode
> creation is a bit confusing, and then in "sock_release()" we do have
> that
> 
> if (!sock->file) {
> iput(SOCK_INODE(sock));
> return;
> }
> sock->file = NULL;
> 
> which *also* tries to free the inode. I'm not sure what the logic (and
> what the locking) behind that code all is.

If socket has never gone through sock_alloc_file(), sock_release() on it
is called explicitly and frees the sucker.  If it has been through
sock_alloc_file(), we must not call sock_release() directly and freeing
is done by iput() from final fput().

> What *is* the locking for "sock->file" anyway?

Pretty much assign-once - zeroing it in the end of sock_release() is
pure cosmetics (we'd damn better have no other references to that
sucker left anywhere; there's still a reference to embedded inode,
but that's it).

FWIW, looking through the callers of sock_alloc_file()... we might be
better off if it did sock_release() on failure.  Then the calling
conventions become "sock_alloc_file() means not calling sock_release()
directly - either it'll be done by the final fput() on resulting file,
or by sock_alloc_file() itself".
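
A userspace model of that proposed convention, as a sketch only.  The
model_sock and model_file types and every model_* helper are hypothetical
stand-ins, not kernel APIs, and the real sock_alloc_file() reports errors
via ERR_PTR() rather than through an out parameter.  The point is just
that the allocation helper releases the socket on failure, so no caller
needs its own sock_release() in the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct socket and struct file. */
struct model_sock { int released; };
struct model_file { struct model_sock *sock; };

static void model_sock_release(struct model_sock *sock)
{
	sock->released = 1;
}

/*
 * Proposed convention: on success the file owns the socket, on failure
 * the socket has already been released here.  Either way the caller must
 * not call model_sock_release() on it afterwards.
 */
static struct model_file *model_sock_alloc_file(struct model_sock *sock,
						int simulate_failure, int *err)
{
	struct model_file *file;

	file = simulate_failure ? NULL : malloc(sizeof(*file));
	if (!file) {
		model_sock_release(sock);
		*err = -ENOMEM;
		return NULL;
	}
	file->sock = sock;
	*err = 0;
	return file;
}

/* Stand-in for the final fput(): drops the file and, through it, the socket. */
static void model_fput(struct model_file *file)
{
	model_sock_release(file->sock);
	free(file);
}
```

With that shape, a caller's failure branch shrinks to propagating the
error, and the success path only ever drops the file.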

Look:
1) in lustre:
sock_filp = sock_alloc_file(sock, 0, NULL);
if (IS_ERR(sock_filp)) {
sock_release(sock);
rc = PTR_ERR(sock_filp);
goto out;
}
2) in net/9p:
file = sock_alloc_file(csocket, 0, NULL);
if (IS_ERR(file)) {
pr_err("%s (%d): failed to map fd\n",
   __func__, task_pid_nr(current));
sock_release(csocket);
kfree(p);
return PTR_ERR(file);
}
3) in sctp:
*newfile = sock_alloc_file(newsock, 0, NULL);
if (IS_ERR(*newfile)) {
put_unused_fd(retval); 
sock_release(newsock);
retval = PTR_ERR(*newfile);
*newfile = NULL;
return retval;
}
4) in accept4():
newfile = sock_alloc_file(newsock, flags, 
sock->sk->sk_prot_creator->name);
if (IS_ERR(newfile)) {
err = PTR_ERR(newfile);
put_unused_fd(newfd);
sock_release(newsock);
goto out_put;
}

5) called in sock_map_fd(), and the sole caller is
retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
if (retval < 0)
goto out_release;
...
out_release:
sock_release(sock);
return retval;
(with no fallthrough or other goto into out_release)

6) the second caller in socketpair():
newfile2 = sock_alloc_file(sock2, flags, NULL);
if (IS_ERR(newfile2)) {
err = PTR_ERR(newfile2);
goto out_fput_1;
}
...
out_fput_1:
fput(newfile1);
put_unused_fd(fd2);
put_unused_fd(fd1);
sock_release(sock2);
goto out;
(again, no fallthrough or other goto into out_fput_1)

7) the first caller in socketpair():
newfile1 = sock_alloc_file(sock1, flags, NULL);
if (IS_ERR(newfile1)) {
err = PTR_ERR(newfile1);
goto out_put_unused_both;
}
...
out_put_unused_both:
put_unused_fd(fd2);
out_put_unused_1:
put_unused_fd(fd1);
out_release_both:
sock_release(sock2);
out_release_1:
sock_release(sock1);
out:
return err;
No fallthrough or goto either.  Sure, we get a failure exit unshared,
but AFAICS some reordering can simplify things quite a bit there.

8) kcm_clone().  Fucked in head - we allocate socket, then file, *THEN*
sock, then attach sock to socket (already attached to file), then finally
deign to initialize sock (already attached to socket, which is attached
to file).  And, surprise, surprise, failure exits are all wrong.
Moreover, calling conventions are broken by design - after we'd put
the damn file into descriptor table we return the pointer to sock
to the caller.  By that time it might have bloody well been destroyed
by close(2) from another thread; 

Re: [PATCH v5 next 2/5] modules:capabilities: add cap_kernel_module_request() permission check

2017-11-29 Thread Luis R. Rodriguez
On Mon, Nov 27, 2017 at 06:18:35PM +0100, Djalal Harouni wrote:
> +/* Determine whether a module auto-load operation is permitted. */
> +int may_autoload_module(char *kmod_name, int required_cap,
> + const char *kmod_prefix);
> +

While we are reviewing a general LSM for this, it has me wondering if an LSM or
userspace feed info may ever want to use other possible context we could add
for free to make a determination.

For instance, since all request_module() calls are in header files, we could
add THIS_MODULE (so struct module) as context to may_autoload_module() for
free as well. The LSM could in theory then also help ensure only specific
modules are allowed to request a module load. Perhaps userspace could say
only built-in code could request certain modules.
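
A rough userspace model of that idea, with every name hypothetical
(model_module, model_may_autoload, MODEL_THIS_MODULE and
model_request_module are made up for illustration, and the "crypto-"
policy is an invented example).  It relies only on the fact that in the
kernel THIS_MODULE is a null pointer for built-in code, so a macro in a
header can hand the requester's identity to the policy check without
touching any call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct module; NULL models built-in code. */
struct model_module { const char *name; };

/* Hypothetical policy: only built-in code may request "crypto-*" modules. */
static bool model_may_autoload(const struct model_module *requester,
			       const char *kmod_name)
{
	if (strncmp(kmod_name, "crypto-", 7) == 0)
		return requester == NULL;
	return true;
}

/*
 * Each translation unit sees its own MODEL_THIS_MODULE, so the macro
 * picks up the requester's identity with no change at the call sites.
 * Here it is defined as NULL, i.e. this file models built-in code.
 */
#define MODEL_THIS_MODULE NULL
#define model_request_module(name) \
	model_may_autoload(MODEL_THIS_MODULE, (name))
```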

Just a thought.

  Luis


Re: [PATCH resend] trace/xdp: fix compile warning: 'struct bpf_map' declared inside parameter list

2017-11-29 Thread Daniel Borkmann
On 11/30/2017 02:41 AM, Xie XiuQi wrote:
> We meet this compile warning, which caused by missing bpf.h in xdp.h.
> 
> In file included from ./include/trace/events/xdp.h:10:0,
>  from ./include/linux/bpf_trace.h:6,
>  from drivers/net/ethernet/intel/i40e/i40e_txrx.c:29:
> ./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside 
> parameter list will not be visible outside of this definition or declaration
> const struct bpf_map *map, u32 map_index),
>  ^
> ./include/linux/tracepoint.h:187:34: note: in definition of macro 
> ‘__DECLARE_TRACE’
>   static inline void trace_##name(proto)\
>   ^
> ./include/linux/tracepoint.h:352:24: note: in expansion of macro ‘PARAMS’
>   __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),  \
> ^~
> ./include/linux/tracepoint.h:477:2: note: in expansion of macro 
> ‘DECLARE_TRACE’
>   DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
>   ^
> ./include/linux/tracepoint.h:477:22: note: in expansion of macro ‘PARAMS’
>   DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
>   ^~
> ./include/trace/events/xdp.h:89:1: note: in expansion of macro ‘DEFINE_EVENT’
>  DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
>  ^~~~
> ./include/trace/events/xdp.h:90:2: note: in expansion of macro ‘TP_PROTO’
>   TP_PROTO(const struct net_device *dev,
>   ^~~~
> ./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside 
> parameter list will not be visible outside of this definition or declaration
> const struct bpf_map *map, u32 map_index),
>  ^
> ./include/linux/tracepoint.h:203:38: note: in definition of macro 
> ‘__DECLARE_TRACE’
>   register_trace_##name(void (*probe)(data_proto), void *data) \
>   ^~
> ./include/linux/tracepoint.h:354:4: note: in expansion of macro ‘PARAMS’
> PARAMS(void *__data, proto),   \
> ^~
> 
> Reported-by: Huang Daode 
> Cc: Hanjun Guo 
> Fixes: 8d3b778ff544 ("xdp: tracepoint xdp_redirect also need a map argument")
> Signed-off-by: Xie XiuQi 
> Acked-by: Jesper Dangaard Brouer 
> Acked-by: Steven Rostedt (VMware) 

Applied to bpf tree, thanks Xie!


Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps

2017-11-29 Thread Willem de Bruijn
On Wed, Nov 29, 2017 at 8:39 PM, Willem de Bruijn
 wrote:
> On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann  wrote:
>> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
>>  wrote:
>>>> Thanks for the review! Any suggestions for how to do the testing? If you
>>>> have existing test cases, could you give my next version a test run to
>>>> see if there are any regressions and if the timestamps work as expected?
>>>>
>>>> I see that there are test cases in tools/testing/selftests/net/, but none
>>>> of them seem to use the time stamps so far, and I'm not overly familiar
>>>> with how it works in the details to extend it in a meaningful way.
>>>
>>> I could not find any good tests for this interface, either. The only
>>> user of the interface I found was a little tool I wrote a few years
>>> ago that compares timestamps at multiple points in the transmit
>>> path for latency measurement [1]. But it may be easier to just write
>>> a new test under tools/testing/selftests/net for this purpose. I can
>>> help with that, too, if you want.
>>
>> Thanks, that would be great!
>
> I'll reply to this thread with git send-email with an extension to
> tools/testing/selftests/net/psock_tpacket.c.

It appears that it did not end up in this thread. At least not when
using gmail threading. Patch at http://patchwork.ozlabs.org/patch/842854/


[PATCH net-next RFC] selftests: test timestamps in psock_tpacket

2017-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Packet rings can return timestamps. Optionally test this path.

Verify that the returned values are sane.
Also test new timestamp modes skip and ns64.

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/psock_tpacket.c | 125 +++-
 1 file changed, 124 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/psock_tpacket.c 
b/tools/testing/selftests/net/psock_tpacket.c
index 7f6cd9fdacf3..5af11016a5de 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -36,6 +36,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -57,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "psock_lib.h"
 
@@ -75,6 +77,19 @@
 #define NUM_PACKETS 100
 #define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1))
 
+const uint64_t tstamp_bound_ns64 = 1000UL * 1000 * 1000;
+
+enum cfg_tstamp_type {
+   tstype_none,
+   tstype_default,
+   tstype_skip,
+   tstype_ns64
+};
+
+static enum cfg_tstamp_type cfg_tstamp;
+
+static uint64_t tstamp_start_ns64;
+
 struct ring {
struct iovec *rd;
uint8_t *mm_space;
@@ -150,6 +165,43 @@ static void test_payload(void *pay, size_t len)
}
 }
 
+static void test_tstamp(uint32_t sec, uint32_t nsec)
+{
+   uint64_t tstamp_ns64;
+
+   if (!cfg_tstamp)
+   return;
+
+   if (cfg_tstamp == tstype_skip) {
+   if (sec || nsec) {
+   fprintf(stderr, "%s: unexpected tstamp %u:%u\n",
+   __func__, sec, nsec);
+   exit(1);
+   }
+   return;
+   }
+
+   if (cfg_tstamp == tstype_ns64)
+   tstamp_ns64 = (((uint64_t) sec) << 32) | nsec;
+   else
+   tstamp_ns64 = (sec * 1000UL * 1000 * 1000) + nsec;
+
+   if (tstamp_ns64 < tstamp_start_ns64) {
+   fprintf(stderr, "tstamp: %lu lowerbound=%lu under=%lu\n",
+   tstamp_ns64, tstamp_start_ns64,
+   tstamp_start_ns64 - tstamp_ns64);
+   exit(1);
+   }
+   if (tstamp_ns64 > tstamp_start_ns64 + tstamp_bound_ns64) {
+   fprintf(stderr, "tstamp: %lu upperbound=%lu over=%lu\n",
+   tstamp_ns64,
+   tstamp_start_ns64 + tstamp_bound_ns64,
+   tstamp_ns64 - (tstamp_start_ns64 +
+  tstamp_bound_ns64));
+   exit(1);
+   }
+}
+
 static void create_payload(void *pay, size_t *len)
 {
int i;
@@ -256,12 +308,18 @@ static void walk_v1_v2_rx(int sock, struct ring *ring)
case TPACKET_V1:
test_payload((uint8_t *) ppd.raw + 
ppd.v1->tp_h.tp_mac,
 ppd.v1->tp_h.tp_snaplen);
+   test_tstamp(ppd.v1->tp_h.tp_sec,
+   cfg_tstamp == tstype_ns64 ?
+   ppd.v1->tp_h.tp_usec :
+   ppd.v1->tp_h.tp_usec * 1000);
total_bytes += ppd.v1->tp_h.tp_snaplen;
break;
 
case TPACKET_V2:
test_payload((uint8_t *) ppd.raw + 
ppd.v2->tp_h.tp_mac,
 ppd.v2->tp_h.tp_snaplen);
+   test_tstamp(ppd.v2->tp_h.tp_sec,
+   ppd.v2->tp_h.tp_nsec);
total_bytes += ppd.v2->tp_h.tp_snaplen;
break;
}
@@ -572,6 +630,7 @@ static void __v3_walk_block(struct block_desc *pbd, const 
int block_num)
bytes_with_padding += ALIGN_8(ppd->tp_snaplen + 
ppd->tp_mac);
 
test_payload((uint8_t *) ppd + ppd->tp_mac, ppd->tp_snaplen);
+   test_tstamp(ppd->tp_sec, ppd->tp_nsec);
 
status_bar_update();
total_packets++;
@@ -766,6 +825,38 @@ static void unmap_ring(int sock, struct ring *ring)
free(ring->rd);
 }
 
+static void setup_tstamp(int sock)
+{
+   struct timeval tv;
+   int one = 1;
+
+   gettimeofday(&tv, NULL);
+   tstamp_start_ns64 = (tv.tv_sec * 1000UL * 1000 * 1000) +
+   (tv.tv_usec * 1000UL);
+
+/* TODO: remove before submit: temporary */
+#ifndef PACKET_SKIPTIMESTAMP
+#define PACKET_SKIPTIMESTAMP   23
+#endif
+#ifndef PACKET_TIMESTAMP_NS64
+#define PACKET_TIMESTAMP_NS64  24
+#endif
+
+   if (cfg_tstamp == tstype_skip) {
+   if (setsockopt(sock, SOL_PACKET, PACKET_SKIPTIMESTAMP,
+  &one, sizeof(one))) {
+ 

[PATCH v2 1/6] perf: Add new types PERF_TYPE_KPROBE and PERF_TYPE_UPROBE

2017-11-29 Thread Song Liu
Two new perf types, PERF_TYPE_KPROBE and PERF_TYPE_UPROBE, are added
to allow creating [k,u]probes with perf_event_open. These [k,u]probes
are associated with the file descriptor created by perf_event_open,
and are thus easy to clean up when the file descriptor is destroyed.

kprobe_func and uprobe_path are added to union config1 for pointers
to function name for kprobe or binary path for uprobe.

kprobe_addr and probe_offset are added to union config2 for kernel
address (when kprobe_func is NULL), or [k,u]probe offset.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/perf_event.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 362493a..5220600 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -33,6 +33,8 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE  = 3,
PERF_TYPE_RAW   = 4,
PERF_TYPE_BREAKPOINT= 5,
+   PERF_TYPE_KPROBE= 6,
+   PERF_TYPE_UPROBE= 7,
 
PERF_TYPE_MAX,  /* non-ABI */
 };
@@ -299,6 +301,8 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4 104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5 112 /* add: aux_watermark */
 
+#define MAX_PROBE_FUNC_NAME_LEN 64
+
 /*
  * Hardware event_id to monitor via a performance monitoring event:
  *
@@ -380,10 +384,14 @@ struct perf_event_attr {
__u32   bp_type;
union {
__u64   bp_addr;
+   __u64   kprobe_func; /* for PERF_TYPE_KPROBE */
+   __u64   uprobe_path; /* for PERF_TYPE_UPROBE */
__u64   config1; /* extension of config */
};
union {
__u64   bp_len;
+   __u64   kprobe_addr; /* for PERF_TYPE_KPROBE, with 
kprobe_func == NULL */
+   __u64   probe_offset; /* for PERF_TYPE_[K,U]PROBE */
__u64   config2; /* extension of config1 */
};
__u64   branch_sample_type; /* enum perf_branch_sample_type */
-- 
2.9.5
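
For illustration, user space might fill the attr for a function-name
kprobe along these lines.  This is a sketch under assumptions: the type
value 6 is the one proposed in this patch and is defined locally as
MODEL_PERF_TYPE_KPROBE (a made-up name), the helper fill_kprobe_attr() is
hypothetical, and config1/config2 are used because they share the unions
with the proposed kprobe_func and probe_offset fields.  The config
encoding (0 for kprobe, 1 for kretprobe) follows the man-page patch in
this series.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Value proposed by this patch; an assumption, not a mainline constant. */
#define MODEL_PERF_TYPE_KPROBE 6

/* Hypothetical helper: describe a k(ret)probe on func+offset. */
static void fill_kprobe_attr(struct perf_event_attr *attr,
			     const char *func, uint64_t offset, int is_return)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = MODEL_PERF_TYPE_KPROBE;
	attr->config = is_return ? 1 : 0;		/* 0: kprobe, 1: kretprobe */
	attr->config1 = (uint64_t)(uintptr_t)func;	/* aliases kprobe_func */
	attr->config2 = offset;				/* aliases probe_offset */
}
```

The filled attr would then be passed to perf_event_open(), and closing
the returned file descriptor is what tears the probe down again.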



[PATCH v2 5/6] bpf: add option for bpf_load.c to use PERF_TYPE_KPROBE

2017-11-29 Thread Song Liu
Function load_and_attach() is updated to be able to create kprobes
with either the old text-based API or the new PERF_TYPE_KPROBE API.

A global flag use_perf_type_probe is added to select between the
two APIs.

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/bpf_load.c | 54 +++---
 samples/bpf/bpf_load.h |  8 
 2 files changed, 42 insertions(+), 20 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 2325d7a..872510e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -42,6 +41,7 @@ int prog_array_fd = -1;
 
 struct bpf_map_data map_data[MAX_MAPS];
 int map_data_count = 0;
+bool use_perf_type_probe = true;
 
 static int populate_prog_array(const char *event, int prog_fd)
 {
@@ -70,7 +70,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
-   int fd, efd, err, id;
+   int fd, efd, err, id = -1;
struct perf_event_attr attr = {};
 
attr.type = PERF_TYPE_TRACEPOINT;
@@ -128,7 +128,7 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
return populate_prog_array(event, fd);
}
 
-   if (is_kprobe || is_kretprobe) {
+   if (!use_perf_type_probe && (is_kprobe || is_kretprobe)) {
if (is_kprobe)
event += 7;
else
@@ -169,27 +169,41 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
strcat(buf, "/id");
}
 
-   efd = open(buf, O_RDONLY, 0);
-   if (efd < 0) {
-   printf("failed to open event %s\n", event);
-   return -1;
-   }
-
-   err = read(efd, buf, sizeof(buf));
-   if (err < 0 || err >= sizeof(buf)) {
-   printf("read from '%s' failed '%s'\n", event, strerror(errno));
-   return -1;
+   if (use_perf_type_probe && (is_kprobe || is_kretprobe)) {
+   attr.type = PERF_TYPE_KPROBE;
+   attr.kprobe_func = ptr_to_u64(
+   event + strlen(is_kprobe ? "kprobe/" : "kretprobe/"));
+   attr.probe_offset = 0;
+   attr.config  = !!is_kretprobe;
+   } else {
+   efd = open(buf, O_RDONLY, 0);
+   if (efd < 0) {
+   printf("failed to open event %s\n", event);
+   return -1;
+   }
+   err = read(efd, buf, sizeof(buf));
+   if (err < 0 || err >= sizeof(buf)) {
+   printf("read from '%s' failed '%s'\n", event,
+  strerror(errno));
+   return -1;
+   }
+   close(efd);
+   buf[err] = 0;
+   id = atoi(buf);
+   attr.config = id;
}
 
-   close(efd);
-
-   buf[err] = 0;
-   id = atoi(buf);
-   attr.config = id;
-
efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 
0);
if (efd < 0) {
-   printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+   if (use_perf_type_probe && (is_kprobe || is_kretprobe))
+   printf("k%sprobe %s fd %d err %s\n",
+  is_kprobe ? "" : "ret",
+  event + strlen(is_kprobe ? "kprobe/"
+ : "kretprobe/"),
+  efd, strerror(errno));
+   else
+   printf("event %d fd %d err %s\n", id, efd,
+  strerror(errno));
return -1;
}
event_fd[prog_cnt - 1] = efd;
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 7d57a42..e7a8a21 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -2,6 +2,7 @@
 #ifndef __BPF_LOAD_H
 #define __BPF_LOAD_H
 
+#include 
 #include "libbpf.h"
 
 #define MAX_MAPS 32
@@ -38,6 +39,8 @@ extern int map_fd[MAX_MAPS];
 extern struct bpf_map_data map_data[MAX_MAPS];
 extern int map_data_count;
 
+extern bool use_perf_type_probe;
+
 /* parses elf file compiled by llvm .c->.o
  * . parses 'maps' section and creates maps via BPF syscall
  * . parses 'license' section and passes it to syscall
@@ -59,6 +62,11 @@ struct ksym {
char *name;
 };
 
+static inline __u64 ptr_to_u64(const void *ptr)
+{
+   return (__u64) (unsigned long) ptr;
+}
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
-- 
2.9.5



[PATCH v2 4/6] perf: implement support of PERF_TYPE_UPROBE

2017-11-29 Thread Song Liu
This patch adds perf_uprobe support, following a pattern similar to the
previous patch (for kprobe).

Two functions, create_local_trace_uprobe() and
destroy_local_trace_uprobe(), are created so a uprobe can be created
and attached to the file descriptor created by perf_event_open().

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 include/linux/trace_events.h|  2 +
 kernel/events/core.c| 39 +-
 kernel/trace/trace_event_perf.c | 58 ++
 kernel/trace/trace_probe.h  |  4 ++
 kernel/trace/trace_uprobe.c | 90 -
 5 files changed, 181 insertions(+), 12 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 51f748c9..9272fa6 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -496,6 +496,8 @@ extern int  perf_trace_add(struct perf_event *event, int 
flags);
 extern void perf_trace_del(struct perf_event *event, int flags);
 extern int  perf_kprobe_init(struct perf_event *event);
 extern void perf_kprobe_destroy(struct perf_event *event);
+extern int  perf_uprobe_init(struct perf_event *event);
+extern void perf_uprobe_destroy(struct perf_event *event);
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index daa6e0a..b566a53 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7992,6 +7992,28 @@ static int perf_kprobe_event_init(struct perf_event 
*event)
return 0;
 }
 
+static int perf_uprobe_event_init(struct perf_event *event)
+{
+   int err;
+
+   if (event->attr.type != PERF_TYPE_UPROBE)
+   return -ENOENT;
+
+   /*
+* no branch sampling for probe events
+*/
+   if (has_branch_stack(event))
+   return -EOPNOTSUPP;
+
+   err = perf_uprobe_init(event);
+   if (err)
+   return err;
+
+   event->destroy = perf_uprobe_destroy;
+
+   return 0;
+}
+
 static struct pmu perf_tracepoint = {
.task_ctx_nr= perf_sw_context,
 
@@ -8013,10 +8035,21 @@ static struct pmu perf_kprobe = {
.read   = perf_swevent_read,
 };
 
+static struct pmu perf_uprobe = {
+   .task_ctx_nr= perf_sw_context,
+   .event_init = perf_uprobe_event_init,
+   .add= perf_trace_add,
+   .del= perf_trace_del,
+   .start  = perf_swevent_start,
+   .stop   = perf_swevent_stop,
+   .read   = perf_swevent_read,
+};
+
 static inline void perf_tp_register(void)
 {
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
perf_pmu_register(&perf_kprobe, "kprobe", PERF_TYPE_KPROBE);
+   perf_pmu_register(&perf_uprobe, "uprobe", PERF_TYPE_UPROBE);
 }
 
 static void perf_event_free_filter(struct perf_event *event)
@@ -8099,7 +8132,8 @@ static int perf_event_set_bpf_prog(struct perf_event 
*event, u32 prog_fd)
struct bpf_prog *prog;
 
if (event->attr.type != PERF_TYPE_TRACEPOINT &&
-   event->attr.type != PERF_TYPE_KPROBE)
+   event->attr.type != PERF_TYPE_KPROBE &&
+   event->attr.type != PERF_TYPE_UPROBE)
return perf_event_set_bpf_handler(event, prog_fd);
 
if (event->tp_event->prog)
@@ -8572,7 +8606,8 @@ static int perf_event_set_filter(struct perf_event 
*event, void __user *arg)
int ret = -EINVAL;
 
if (((event->attr.type != PERF_TYPE_TRACEPOINT &&
- event->attr.type != PERF_TYPE_KPROBE) ||
+ event->attr.type != PERF_TYPE_KPROBE &&
+ event->attr.type != PERF_TYPE_UPROBE) ||
 !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
!has_addr_filter(event))
return -EINVAL;
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 7cf0d99..1b97ea2 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -272,6 +272,52 @@ int perf_kprobe_init(struct perf_event *p_event)
 #endif /* CONFIG_KPROBE_EVENTS */
 }
 
+int perf_uprobe_init(struct perf_event *p_event)
+{
+   int ret;
+   char *path = NULL;
+   struct trace_event_call *tp_event;
+
+#ifdef CONFIG_UPROBE_EVENTS
+   if (!p_event->attr.uprobe_path)
+   return -EINVAL;
+   path = kzalloc(PATH_MAX, GFP_KERNEL);
+   if (!path)
+   return -ENOMEM;
+   ret = strncpy_from_user(
+   path, u64_to_user_ptr(p_event->attr.uprobe_path), PATH_MAX);
+   if (ret < 0)
+   goto out;
+   if (path[0] == '\0') {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   tp_event = create_local_trace_uprobe(
+   path, p_event->attr.probe_offset, 

[PATCH v2 0/6] enable creating [k,u]probe with perf_event_open

2017-11-29 Thread Song Liu
Changes PATCH v1 to PATCH v2:
  Split PERF_TYPE_PROBE into PERF_TYPE_KPROBE and PERF_TYPE_UPROBE.
  Split perf_probe into perf_kprobe and perf_uprobe.
  Remove struct probe_desc, use config1 and config2 instead.

Changes RFC v2 to PATCH v1:
  Check type PERF_TYPE_PROBE in perf_event_set_filter().
  Rebase on to tip perf/core.

Changes RFC v1 to RFC v2:
  Fix build issue reported by kbuild test bot by adding ifdef of
  CONFIG_KPROBE_EVENTS, and CONFIG_UPROBE_EVENTS.

RFC v1 cover letter:

This is to follow up the discussion over "new kprobe api" at Linux
Plumbers 2017:

https://www.linuxplumbersconf.org/2017/ocw/proposals/4808

With the current kernel, user space tools can only create/destroy [k,u]probes
with a text-based API (kprobe_events and uprobe_events in tracefs). This
approach relies on user space to clean up the [k,u]probes after using them.
However, it is not easy for user space to clean up properly.
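
To make the cleanup burden concrete, here is a small sketch of the two
command lines a tool has to write to kprobe_events under the text-based
API: one to create the probe and a matching one to delete it later.  The
helper names fmt_kprobe_create() and fmt_kprobe_delete() are made up for
this illustration; the "p:", "r:" and "-:" line formats are the ones
described in the kernel's kprobetrace documentation.  If the tool dies
between the two writes, the probe stays behind.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format the line written to kprobe_events to create a k(ret)probe. */
static int fmt_kprobe_create(char *buf, size_t len, int is_return,
			     const char *event, const char *func)
{
	return snprintf(buf, len, "%c:kprobes/%s %s",
			is_return ? 'r' : 'p', event, func);
}

/* Format the line that must be written later to remove the same probe. */
static int fmt_kprobe_delete(char *buf, size_t len, const char *event)
{
	return snprintf(buf, len, "-:kprobes/%s", event);
}
```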

To solve this problem, we introduce a file descriptor based API.
Specifically, we extended perf_event_open to create [k,u]probe, and attach
this [k,u]probe to the file descriptor created by perf_event_open. These
[k,u]probe are associated with this file descriptor, so they are not
available in tracefs.

We reuse a large portion of the existing trace_kprobe and trace_uprobe code.
Currently, the file descriptor API does not support arguments as the
text-based API does. This should not be a problem, as users of the file
descriptor based API read data through other methods (bpf, etc.).

I also include a patch to bcc and a patch to the perf_event_open man page.
Please see the list below. A fork of bcc with this patch is also available
on github:

  https://github.com/liu-song-6/bcc/tree/perf_event_open

Thanks,
Song

man-pages patch:
  perf_event_open.2: add type PERF_TYPE_KPROBE and PERF_TYPE_UPROBE

bcc patch:
  bcc: Try use new API to create [k,u]probe with perf_event_open

kernel patches:

Song Liu (6):
  perf: Add new types PERF_TYPE_KPROBE and PERF_TYPE_UPROBE
  perf: copy new perf_event.h to tools/include/uapi
  perf: implement support of PERF_TYPE_KPROBE
  perf: implement support of PERF_TYPE_UPROBE
  bpf: add option for bpf_load.c to use PERF_TYPE_KPROBE
  bpf: add new test test_many_kprobe

 include/linux/trace_events.h  |   4 +
 include/uapi/linux/perf_event.h   |   8 ++
 kernel/events/core.c  |  76 +-
 kernel/trace/trace_event_perf.c   | 111 +
 kernel/trace/trace_kprobe.c   |  91 +++--
 kernel/trace/trace_probe.h|  11 ++
 kernel/trace/trace_uprobe.c   |  90 +++--
 samples/bpf/Makefile  |   3 +
 samples/bpf/bpf_load.c|  59 ++-
 samples/bpf/bpf_load.h|  12 +++
 samples/bpf/test_many_kprobe_user.c   | 182 ++
 tools/include/uapi/linux/perf_event.h |   8 ++
 12 files changed, 611 insertions(+), 44 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

--
2.9.5


[PATCH v2 2/6] perf: copy new perf_event.h to tools/include/uapi

2017-11-29 Thread Song Liu
perf_event.h was updated in the previous patch; this patch applies the same
changes to the tools/ version. This part is put in a separate
patch in case the two files are backported separately.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/linux/perf_event.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index b9a4953..c361442 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -33,6 +33,8 @@ enum perf_type_id {
PERF_TYPE_HW_CACHE  = 3,
PERF_TYPE_RAW   = 4,
PERF_TYPE_BREAKPOINT= 5,
+   PERF_TYPE_KPROBE= 6,
+   PERF_TYPE_UPROBE= 7,
 
PERF_TYPE_MAX,  /* non-ABI */
 };
@@ -299,6 +301,8 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4 104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5 112 /* add: aux_watermark */
 
+#define MAX_PROBE_FUNC_NAME_LEN 64
+
 /*
  * Hardware event_id to monitor via a performance monitoring event:
  *
@@ -380,10 +384,14 @@ struct perf_event_attr {
__u32   bp_type;
union {
__u64   bp_addr;
+   __u64   kprobe_func;  /* for PERF_TYPE_KPROBE */
+   __u64   uprobe_path;  /* for PERF_TYPE_UPROBE */
__u64   config1; /* extension of config */
};
union {
__u64   bp_len;
+   __u64   kprobe_addr; /* for PERF_TYPE_KPROBE, with 
kprobe_func == NULL */
+   __u64   probe_offset; /* for PERF_TYPE_[K,U]PROBE */
__u64   config2; /* extension of config1 */
};
__u64   branch_sample_type; /* enum perf_branch_sample_type */
-- 
2.9.5



[PATCH v2] perf_event_open.2: add type PERF_TYPE_KPROBE and PERF_TYPE_UPROBE

2017-11-29 Thread Song Liu
Two new types, PERF_TYPE_KPROBE and PERF_TYPE_UPROBE, are being added
to perf_event_attr. This patch adds information about these types.

Signed-off-by: Song Liu 
---
 man2/perf_event_open.2 | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index c91da3f..e662332 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -256,11 +256,15 @@ struct perf_event_attr {
 
 union {
 __u64 bp_addr;  /* breakpoint address */
+__u64 kprobe_func;  /* for PERF_TYPE_KPROBE */
+__u64 uprobe_path;  /* for PERF_TYPE_UPROBE */
 __u64 config1;  /* extension of config */
 };
 
 union {
 __u64 bp_len;   /* breakpoint length */
+__u64 kprobe_addr;  /* for PERF_TYPE_KPROBE, with kprobe_func == 
NULL */
+__u64 probe_offset; /* for PERF_TYPE_[K,U]PROBE */
 __u64 config2;  /* extension of config1 */
 };
 __u64 branch_sample_type;   /* enum perf_branch_sample_type */
@@ -317,6 +321,13 @@ This indicates a hardware breakpoint as provided by the 
CPU.
 Breakpoints can be read/write accesses to an address as well as
 execution of an instruction address.
 .TP
+.BR PERF_TYPE_KPROBE " and " PERF_TYPE_UPROBE " (since Linux 4.TBD)"
+This indicates a kprobe or uprobe should be created and
+attached to the file descriptor.
+See fields
+.IR kprobe_func ", " uprobe_path ", " kprobe_addr ", and " probe_offset
+for more details.
+.TP
 .RB "dynamic PMU"
 Since Linux 2.6.38,
 .\" commit 2e80a82a49c4c7eca4e35734380f28298ba5db19
@@ -627,6 +638,37 @@ then leave
 .I config
 set to zero.
 Its parameters are set in other places.
+.PP
+If
+.I type
+is
+.BR PERF_TYPE_KPROBE
+or
+.BR PERF_TYPE_UPROBE ,
+.I config
+of 0 means kprobe/uprobe, while
+.I config
+of 1 means kretprobe/uretprobe.
+.RE
+.TP
+.IR kprobe_func ", " uprobe_path ", " kprobe_addr ", and " probe_offset
+.EE
+These fields describe the kprobe/uprobe for
+.BR PERF_TYPE_KPROBE
+and
+.BR PERF_TYPE_UPROBE .
+For kprobe: use
+.I kprobe_func
+and
+.IR probe_offset ,
+or use
+.I kprobe_addr
+and leave
+.I kprobe_func
+as NULL. For uprobe: use
+.I uprobe_path
+and
+.IR probe_offset .
 .RE
 .TP
 .IR sample_period ", " sample_freq
-- 
2.9.5



[PATCH v2] bcc: Try use new API to create [k,u]probe with perf_event_open

2017-11-29 Thread Song Liu
A new kernel API allows creating [k,u]probes with perf_event_open.
This patch tries to use the new API. If the new API doesn't work,
we fall back to the old API.

bpf_detach_probe() looks up the event being removed. If the event
is not found, we skip the clean up procedure.

Signed-off-by: Song Liu 
---
 src/cc/libbpf.c | 224 +++-
 1 file changed, 155 insertions(+), 69 deletions(-)

diff --git a/src/cc/libbpf.c b/src/cc/libbpf.c
index ef6daf3..5bbcdfd 100644
--- a/src/cc/libbpf.c
+++ b/src/cc/libbpf.c
@@ -526,38 +526,72 @@ int bpf_attach_socket(int sock, int prog) {
   return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog, sizeof(prog));
 }
 
+/*
+ * new kernel API allows creating [k,u]probe with perf_event_open, which
+ * makes it easier to clean up the [k,u]probe. This function tries to
+ * create pfd with the new API.
+ */
+static int bpf_try_perf_event_open_with_probe(const char *name, uint64_t offs,
+int pid, int cpu, int group_fd, int is_uprobe, int is_return)
+{
+  struct perf_event_attr attr = {};
+
+  attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+  attr.sample_period = 1;
+  attr.wakeup_events = 1;
+  attr.config = is_return ? 1 : 0;
+  attr.probe_offset = offs;  /* for kprobe, if name is NULL, this is the addr */
+  attr.size = sizeof(attr);
+  if (is_uprobe) {
+attr.type = PERF_TYPE_UPROBE;
+attr.uprobe_path = ptr_to_u64((void *)name);
+  } else {
+attr.type = PERF_TYPE_KPROBE;
+attr.kprobe_func = ptr_to_u64((void *)name);
+  }
+  return syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd,
+ PERF_FLAG_FD_CLOEXEC);
+}
+
 static int bpf_attach_tracing_event(int progfd, const char *event_path,
-struct perf_reader *reader, int pid, int cpu, int group_fd) {
-  int efd, pfd;
+struct perf_reader *reader, int pid, int cpu, int group_fd, int pfd) {
+  int efd;
   ssize_t bytes;
   char buf[256];
   struct perf_event_attr attr = {};
 
-  snprintf(buf, sizeof(buf), "%s/id", event_path);
-  efd = open(buf, O_RDONLY, 0);
-  if (efd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
-return -1;
-  }
+  /*
+   * Only look up id and call perf_event_open when
+   * bpf_try_perf_event_open_with_probe() did not return a valid pfd.
+   */
+  if (pfd < 0) {
+snprintf(buf, sizeof(buf), "%s/id", event_path);
+efd = open(buf, O_RDONLY, 0);
+if (efd < 0) {
+  fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
+  return -1;
+}
 
-  bytes = read(efd, buf, sizeof(buf));
-  if (bytes <= 0 || bytes >= sizeof(buf)) {
-fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+bytes = read(efd, buf, sizeof(buf));
+if (bytes <= 0 || bytes >= sizeof(buf)) {
+  fprintf(stderr, "read(%s): %s\n", buf, strerror(errno));
+  close(efd);
+  return -1;
+}
 close(efd);
-return -1;
-  }
-  close(efd);
-  buf[bytes] = '\0';
-  attr.config = strtol(buf, NULL, 0);
-  attr.type = PERF_TYPE_TRACEPOINT;
-  attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
-  attr.sample_period = 1;
-  attr.wakeup_events = 1;
-  pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd, PERF_FLAG_FD_CLOEXEC);
-  if (pfd < 0) {
-fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, strerror(errno));
-return -1;
+buf[bytes] = '\0';
+attr.config = strtol(buf, NULL, 0);
+attr.type = PERF_TYPE_TRACEPOINT;
+attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+attr.sample_period = 1;
+attr.wakeup_events = 1;
+pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd, PERF_FLAG_FD_CLOEXEC);
+if (pfd < 0) {
+  fprintf(stderr, "perf_event_open(%s/id): %s\n", event_path, strerror(errno));
+  return -1;
+}
   }
+
   perf_reader_set_fd(reader, pfd);
 
   if (perf_reader_mmap(reader, attr.type, attr.sample_type) < 0)
@@ -585,31 +619,38 @@ void * bpf_attach_kprobe(int progfd, enum bpf_probe_attach_type attach_type, con
   char event_alias[128];
   struct perf_reader *reader = NULL;
   static char *event_type = "kprobe";
+  int pfd;
 
   reader = perf_reader_new(cb, NULL, NULL, cb_cookie, probe_perf_reader_page_cnt);
   if (!reader)
 goto error;
 
-  snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", event_type);
-  kfd = open(buf, O_WRONLY | O_APPEND, 0);
-  if (kfd < 0) {
-fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
-goto error;
-  }
+  /* try use new API to create kprobe */
+  pfd = bpf_try_perf_event_open_with_probe(fn_name, 0, pid, cpu, group_fd, 0,
+   attach_type != BPF_PROBE_ENTRY);
 
-  snprintf(event_alias, sizeof(event_alias), "%s_bcc_%d", ev_name, getpid());
-  snprintf(buf, sizeof(buf), "%c:%ss/%s %s", attach_type==BPF_PROBE_ENTRY ? 'p' : 'r',
-   event_type, event_alias, fn_name);
-  if (write(kfd, buf, strlen(buf)) < 0) {
-if (errno == EINVAL)
-  fprintf(stderr, "check dmesg 

[PATCH v2 3/6] perf: implement support of PERF_TYPE_KPROBE

2017-11-29 Thread Song Liu
A new pmu, perf_kprobe, is created for PERF_TYPE_KPROBE. Based on
input from perf_event_open(), perf_kprobe creates a kprobe (or
kretprobe) for the perf_event. This kprobe is private to this
perf_event, and thus not added to global lists, and not
available in tracefs.

Two functions, create_local_trace_kprobe() and
destroy_local_trace_kprobe(), are added to create and destroy these
local trace_kprobes.

Signed-off-by: Song Liu 
Reviewed-by: Yonghong Song 
Reviewed-by: Josef Bacik 
---
 include/linux/trace_events.h|  2 +
 kernel/events/core.c| 41 +--
 kernel/trace/trace_event_perf.c | 53 
 kernel/trace/trace_kprobe.c | 91 +
 kernel/trace/trace_probe.h  |  7 
 5 files changed, 183 insertions(+), 11 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2bcb4dc..51f748c9 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -494,6 +494,8 @@ extern int  perf_trace_init(struct perf_event *event);
 extern void perf_trace_destroy(struct perf_event *event);
 extern int  perf_trace_add(struct perf_event *event, int flags);
 extern void perf_trace_del(struct perf_event *event, int flags);
+extern int  perf_kprobe_init(struct perf_event *event);
+extern void perf_kprobe_destroy(struct perf_event *event);
 extern int  ftrace_profile_set_filter(struct perf_event *event, int event_id,
 char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 494eca1..daa6e0a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7970,6 +7970,28 @@ static int perf_tp_event_init(struct perf_event *event)
return 0;
 }
 
+static int perf_kprobe_event_init(struct perf_event *event)
+{
+   int err;
+
+   if (event->attr.type != PERF_TYPE_KPROBE)
+   return -ENOENT;
+
+   /*
+* no branch sampling for probe events
+*/
+   if (has_branch_stack(event))
+   return -EOPNOTSUPP;
+
+   err = perf_kprobe_init(event);
+   if (err)
+   return err;
+
+   event->destroy = perf_kprobe_destroy;
+
+   return 0;
+}
+
 static struct pmu perf_tracepoint = {
.task_ctx_nr= perf_sw_context,
 
@@ -7981,9 +8003,20 @@ static struct pmu perf_tracepoint = {
.read   = perf_swevent_read,
 };
 
+static struct pmu perf_kprobe = {
+   .task_ctx_nr= perf_sw_context,
+   .event_init = perf_kprobe_event_init,
+   .add= perf_trace_add,
+   .del= perf_trace_del,
+   .start  = perf_swevent_start,
+   .stop   = perf_swevent_stop,
+   .read   = perf_swevent_read,
+};
+
 static inline void perf_tp_register(void)
 {
 perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
+   perf_pmu_register(&perf_kprobe, "kprobe", PERF_TYPE_KPROBE);
 }
 
 static void perf_event_free_filter(struct perf_event *event)
@@ -8065,7 +8098,8 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
bool is_kprobe, is_tracepoint, is_syscall_tp;
struct bpf_prog *prog;
 
-   if (event->attr.type != PERF_TYPE_TRACEPOINT)
+   if (event->attr.type != PERF_TYPE_TRACEPOINT &&
+   event->attr.type != PERF_TYPE_KPROBE)
return perf_event_set_bpf_handler(event, prog_fd);
 
if (event->tp_event->prog)
@@ -8537,8 +8571,9 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
char *filter_str;
int ret = -EINVAL;
 
-   if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
-   !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
+   if (((event->attr.type != PERF_TYPE_TRACEPOINT &&
+ event->attr.type != PERF_TYPE_KPROBE) ||
+!IS_ENABLED(CONFIG_EVENT_TRACING)) &&
!has_addr_filter(event))
return -EINVAL;
 
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 13ba2d3..7cf0d99 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include "trace.h"
+#include "trace_probe.h"
 
 static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS];
 
@@ -229,6 +230,48 @@ int perf_trace_init(struct perf_event *p_event)
return ret;
 }
 
+int perf_kprobe_init(struct perf_event *p_event)
+{
+   int ret;
+   char *func = NULL;
+   struct trace_event_call *tp_event;
+
+#ifdef CONFIG_KPROBE_EVENTS
+   if (p_event->attr.kprobe_func) {
+   func = kzalloc(MAX_PROBE_FUNC_NAME_LEN, GFP_KERNEL);
+   if (!func)
+   return -ENOMEM;
+   ret = strncpy_from_user(
+   func, u64_to_user_ptr(p_event->attr.kprobe_func),
+ 

[PATCH v2 6/6] bpf: add new test test_many_kprobe

2017-11-29 Thread Song Liu
The test compares old text based kprobe API with PERF_TYPE_KPROBE.

Here is a sample output of this test:

Creating 1000 kprobes with text-based API takes 6.979683 seconds
Cleaning 1000 kprobes with text-based API takes 84.897687 seconds
Creating 1000 kprobes with PERF_TYPE_KPROBE (function name) takes 5.077558 seconds
Cleaning 1000 kprobes with PERF_TYPE_KPROBE (function name) takes 81.241354 seconds
Creating 1000 kprobes with PERF_TYPE_KPROBE (function addr) takes 5.218255 seconds
Cleaning 1000 kprobes with PERF_TYPE_KPROBE (function addr) takes 80.010731 seconds

Signed-off-by: Song Liu 
Reviewed-by: Josef Bacik 
---
 samples/bpf/Makefile|   3 +
 samples/bpf/bpf_load.c  |   5 +-
 samples/bpf/bpf_load.h  |   4 +
 samples/bpf/test_many_kprobe_user.c | 182 
 4 files changed, 191 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/test_many_kprobe_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9b4a66e..ec92f35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -42,6 +42,7 @@ hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_monitor
 hostprogs-y += syscall_tp
+hostprogs-y += test_many_kprobe
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -87,6 +88,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+test_many_kprobe-objs := bpf_load.o $(LIBBPF) test_many_kprobe_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -172,6 +174,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_test_many_kprobe += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 872510e..caba9bc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -635,9 +635,8 @@ void read_trace_pipe(void)
}
 }
 
-#define MAX_SYMS 30
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
+struct ksym syms[MAX_SYMS];
+int sym_cnt;
 
 static int ksym_cmp(const void *p1, const void *p2)
 {
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index e7a8a21..16bc263 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -67,6 +67,10 @@ static inline __u64 ptr_to_u64(const void *ptr)
return (__u64) (unsigned long) ptr;
 }
 
+#define MAX_SYMS 30
+extern struct ksym syms[MAX_SYMS];
+extern int sym_cnt;
+
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
diff --git a/samples/bpf/test_many_kprobe_user.c b/samples/bpf/test_many_kprobe_user.c
new file mode 100644
index 0000000..1f3ee07
--- /dev/null
+++ b/samples/bpf/test_many_kprobe_user.c
@@ -0,0 +1,182 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+#define MAX_KPROBES 1000
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+int kprobes[MAX_KPROBES] = {0};
+int kprobe_count;
+int perf_event_fds[MAX_KPROBES];
+const char license[] = "GPL";
+
+static __u64 time_get_ns(void)
+{
+   struct timespec ts;
+
+   clock_gettime(CLOCK_MONOTONIC, &ts);
+   return ts.tv_sec * 1000000000ull + ts.tv_nsec;
+}
+
+static int kprobe_api(char *func, void *addr, bool use_new_api)
+{
+   int efd;
+   struct perf_event_attr attr = {};
+   char buf[256];
+   int err, id;
+
+   attr.sample_type = PERF_SAMPLE_RAW;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   if (use_new_api) {
+   attr.type = PERF_TYPE_KPROBE;
+   if (func) {
+   attr.kprobe_func = ptr_to_u64(func);
+   attr.probe_offset = 0;
+   } else {
+   attr.kprobe_func = 0;
+   attr.kprobe_addr = ptr_to_u64(addr);
+   }
+   } else {
+   attr.type = PERF_TYPE_TRACEPOINT;
+   snprintf(buf, sizeof(buf),
+"echo 'p:%s %s' >> /sys/kernel/debug/tracing/kprobe_events",
+func, func);
+   err = system(buf);
+   if (err < 0) {
+   

Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-29 Thread Song Liu

> On Nov 23, 2017, at 2:22 AM, Peter Zijlstra  wrote:
> 
> On Wed, Nov 15, 2017 at 09:23:33AM -0800, Song Liu wrote:
>> A new perf type PERF_TYPE_PROBE is added to allow creating [k,u]probe
>> with perf_event_open. These [k,u]probe are associated with the file
>> descriptor created by perf_event_open, thus are easy to clean when
>> the file descriptor is destroyed.
>> 
>> Struct probe_desc and two flags, is_uprobe and is_return, are added
>> to describe the probe being created with perf_event_open.
> 
>> ---
>> include/uapi/linux/perf_event.h | 35 +--
>> 1 file changed, 33 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/uapi/linux/perf_event.h 
>> b/include/uapi/linux/perf_event.h
>> index 362493a..cc42d59 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -33,6 +33,7 @@ enum perf_type_id {
>>  PERF_TYPE_HW_CACHE  = 3,
>>  PERF_TYPE_RAW   = 4,
>>  PERF_TYPE_BREAKPOINT= 5,
>> +PERF_TYPE_PROBE = 6,
> 
> Not required.. these fixed types are mostly legacy at this point.

Dear Peter,

Thanks a lot for your feedback. I have incorporated them in the next version
(sending soon). 

I added two fixed types (PERF_TYPE_KPROBE and PERF_TYPE_UPROBE) in the new 
version. I know that perf doesn't need them any more. But currently bcc still 
relies on these fixed types to use the probes/tracepoints. 

Thanks,
Song



[PATCH net] tcp: remove buggy call to tcp_v6_restore_cb()

2017-11-29 Thread Eric Dumazet
From: Eric Dumazet 

tcp_v6_send_reset() expects to receive an skb with skb->cb[] layout as
used in TCP stack.
MD5 lookup uses tcp_v6_iif() and tcp_v6_sdif() and thus
TCP_SKB_CB(skb)->header.h6

This patch probably fixes RST packets sent on behalf of a timewait md5
ipv6 socket.

Before Florian patch, tcp_v6_restore_cb() was needed before jumping to
no_tcp_socket label.

Fixes: 271c3b9b7bda ("tcp: honour SO_BINDTODEVICE for TW_RST case too")
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
---
 net/ipv6/tcp_ipv6.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6bb98c93edfe2ed2f16fe5229605f8108cfc7f9a..be11dc13aa705145a83177e17d23594e9416e11a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1590,7 +1590,6 @@ static int tcp_v6_rcv(struct sk_buff *skb)
tcp_v6_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v6_restore_cb(skb);
tcp_v6_send_reset(sk, skb);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;



[PATCH resend] trace/xdp: fix compile warning: 'struct bpf_map' declared inside parameter list

2017-11-29 Thread Xie XiuQi
We hit this compile warning, which is caused by a missing bpf.h include in xdp.h.

In file included from ./include/trace/events/xdp.h:10:0,
 from ./include/linux/bpf_trace.h:6,
 from drivers/net/ethernet/intel/i40e/i40e_txrx.c:29:
./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside 
parameter list will not be visible outside of this definition or declaration
const struct bpf_map *map, u32 map_index),
 ^
./include/linux/tracepoint.h:187:34: note: in definition of macro 
‘__DECLARE_TRACE’
  static inline void trace_##name(proto)\
  ^
./include/linux/tracepoint.h:352:24: note: in expansion of macro ‘PARAMS’
  __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),  \
^~
./include/linux/tracepoint.h:477:2: note: in expansion of macro ‘DECLARE_TRACE’
  DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
  ^
./include/linux/tracepoint.h:477:22: note: in expansion of macro ‘PARAMS’
  DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
  ^~
./include/trace/events/xdp.h:89:1: note: in expansion of macro ‘DEFINE_EVENT’
 DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
 ^~~~
./include/trace/events/xdp.h:90:2: note: in expansion of macro ‘TP_PROTO’
  TP_PROTO(const struct net_device *dev,
  ^~~~
./include/trace/events/xdp.h:93:17: warning: ‘struct bpf_map’ declared inside 
parameter list will not be visible outside of this definition or declaration
const struct bpf_map *map, u32 map_index),
 ^
./include/linux/tracepoint.h:203:38: note: in definition of macro 
‘__DECLARE_TRACE’
  register_trace_##name(void (*probe)(data_proto), void *data) \
  ^~
./include/linux/tracepoint.h:354:4: note: in expansion of macro ‘PARAMS’
PARAMS(void *__data, proto),   \
^~

Reported-by: Huang Daode 
Cc: Hanjun Guo 
Fixes: 8d3b778ff544 ("xdp: tracepoint xdp_redirect also need a map argument")
Signed-off-by: Xie XiuQi 
Acked-by: Jesper Dangaard Brouer 
Acked-by: Steven Rostedt (VMware) 
---
 include/trace/events/xdp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 4cd0f05..8989a92 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf.h>
 
 #define __XDP_ACT_MAP(FN)  \
FN(ABORTED) \
-- 
1.8.3.1



Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps

2017-11-29 Thread Willem de Bruijn
On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann  wrote:
> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
>  wrote:
>>> Thanks for the review! Any suggestions for how to do the testing? If you 
>>> have
>>> existing test cases, could you give my next version a test run to see if 
>>> there
>>> are any regressions and if the timestamps work as expected?
>>>
>>> I see that there are test cases in tools/testing/selftests/net/, but none
>>> of them seem to use the time stamps so far, and I'm not overly familiar
>>> with how it works in the details to extend it in a meaningful way.
>>
>> I could not find any good tests for this interface, either. The only
>> user of the interface I found was a little tool I wrote a few years
>> ago that compares timestamps at multiple points in the transmit
>> path for latency measurement [1]. But it may be easier to just write
>> a new test under tools/testing/selftests/net for this purpose. I can
>> help with that, too, if you want.
>
> Thanks, that would be great!

I'll reply to this thread with git send-email with an extension to
tools/testing/selftests/net/psock_tpacket.c. I can resend that for
submission after your feature is merged (as it depends on it) or
feel free to include it in your patchset. The test currently fails for
the ns64 case. I probably did not convert correctly, but have to leave
the office and want to send what I have.

Two other comments: the test crashed the kernel due to a NULL ptr
in tpacket_get_timestamp. We cannot rely on skb->sk being set to
the packet socket here. And assignment to bitfield requires a cast to
boolean.
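
The bitfield point can be shown in isolation. This is a minimal userspace sketch, not the af_packet code: struct flags_sketch and both helper functions are hypothetical names, and it assumes the flags are unsigned 1-bit fields. Storing a value such as 2 keeps only its lowest bit, so the flag would read back as 0 unless the value is normalized with !! first.

```c
/* Hypothetical stand-in for the two 1-bit flags; not the real struct packet_sock. */
struct flags_sketch {
	unsigned int tp_skiptstamp:1;
	unsigned int tp_tstamp_ns64:1;
};

/* Plain assignment: only the lowest bit of val survives the store. */
static unsigned int store_truncating(int val)
{
	struct flags_sketch f = { 0 };

	f.tp_skiptstamp = val;
	return f.tp_skiptstamp;
}

/* !!val maps every non-zero value to 1 before the store. */
static unsigned int store_normalized(int val)
{
	struct flags_sketch f = { 0 };

	f.tp_skiptstamp = !!val;
	return f.tp_skiptstamp;
}
```

With this sketch, a setsockopt-style value of 2 would leave the flag clear through store_truncating() but set through store_normalized().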

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f55f330ab547..e9decc7fc5c3 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -439,9 +439,9 @@ static int __packet_get_status(struct packet_sock
*po, void *frame)
}
 }

-static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo)
+static __u32 tpacket_get_timestamp(struct packet_sock *po, struct sk_buff *skb,
+  __u32 *hi, __u32 *lo)
 {
-   struct packet_sock *po = pkt_sk(skb->sk);
struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
ktime_t stamp;
u32 type;
@@ -508,7 +508,7 @@ static __u32 __packet_set_timestamp(struct
packet_sock *po, void *frame,
union tpacket_uhdr h;
__u32 ts_status, hi, lo;

-   if (!(ts_status = tpacket_get_timestamp(skb, &hi, &lo)))
+   if (!(ts_status = tpacket_get_timestamp(po, skb, &hi, &lo)))
return 0;

h.raw = frame;
@@ -2352,7 +2352,7 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,

skb_copy_bits(skb, 0, h.raw + macoff, snaplen);

-   if (!(ts_status = tpacket_get_timestamp(skb, _hi, _lo)))
+   if (!(ts_status = tpacket_get_timestamp(po, skb, _hi,
_lo)))
packet_get_time(po, _hi, _lo);

status |= ts_status;
@@ -3835,7 +3835,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
 if (copy_from_user(&val, optval, sizeof(val)))
return -EFAULT;

-   po->tp_skiptstamp = val;
+   po->tp_skiptstamp = !!val;
return 0;
}
case PACKET_TIMESTAMP_NS64:
@@ -3847,7 +3847,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
 if (copy_from_user(&val, optval, sizeof(val)))
return -EFAULT;

-   po->tp_tstamp_ns64 = val;
+   po->tp_tstamp_ns64 = !!val;
return 0;
}
case PACKET_FANOUT:


Re: [PATCH] trace/xdp: fix compile warning: ‘struct bpf_map’ declared inside parameter list

2017-11-29 Thread Xie XiuQi

On 2017/11/29 19:13, Daniel Borkmann wrote:
> Xie, thanks for the patch! We could route this fix via bpf tree if you want.
> 
> Could you resend your patch with below Fixes and Acked-by tag added to
> netdev@vger.kernel.org in Cc, so that it ends up in patchwork there?
> 

Sure, I'll resend soon.

>> Fixes: 8d3b778ff544 ("xdp: tracepoint xdp_redirect also need a map argument")
>>
>> Acked-by: Jesper Dangaard Brouer 
> Thanks,
> Daniel

-- 
Thanks,
Xie XiuQi



Re: [PATCH v5 next 3/5] modules:capabilities: automatic module loading restriction

2017-11-29 Thread Luis R. Rodriguez
On Mon, Nov 27, 2017 at 06:18:36PM +0100, Djalal Harouni wrote:
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 5cbb239..c36aed8 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -261,7 +261,16 @@ struct notifier_block;
>  
>  #ifdef CONFIG_MODULES
>  
> -extern int modules_disabled; /* for sysctl */
> +enum {
> + MODULES_AUTOLOAD_ALLOWED= 0,
> + MODULES_AUTOLOAD_PRIVILEGED = 1,
> + MODULES_AUTOLOAD_DISABLED   = 2,
> +};
> +

Can you kdocify these and add a respective rst doc file?  Maybe stuff your
extensive docs which you are currently adding to
Documentation/sysctl/kernel.txt to this new file and in kernel.txt just refer
to it. This way this can be also nicely visibly documented on the web with the
new sphinx.

This way you can take advantage of the kdocs you are already adding and refer
to them.

> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2fb4e27..0b6f0c8 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -683,6 +688,15 @@ static struct ctl_table kern_table[] = {
>   .extra1 = ,
>   .extra2 = ,
>   },
> + {
> + .procname   = "modules_autoload_mode",
> + .data   = _autoload_mode,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + .proc_handler   = modules_autoload_dointvec_minmax,

It would seem this is an unsigned int ... so why not reflect that?

> @@ -2499,6 +2513,20 @@ static int proc_dointvec_minmax_sysadmin(struct 
> ctl_table *table, int write,
>  }
>  #endif
>  
> +#ifdef CONFIG_MODULES
> +static int modules_autoload_dointvec_minmax(struct ctl_table *table, int 
> write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + /*
> +  * Only CAP_SYS_MODULE in init user namespace are allowed to change this
> +  */
> + if (write && !capable(CAP_SYS_MODULE))
> + return -EPERM;
> +
> + return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +}
> +#endif

We now have proc_douintvec_minmax().

  Luis


Re: [PATCH net 0/6] tools: bpftool: fix a minor issues with JSON and Makefiles

2017-11-29 Thread Daniel Borkmann
On 11/29/2017 02:44 AM, Jakub Kicinski wrote:
> Quentin says:
> 
> First commit in this series fixes a crash that occurs when incorrect
> arguments are passed to bpftool after the `--json` option. It comes from
> the usage() function trying to use the JSON writer, although the latter
> has not been created yet at that point.
> 
> Other patches add destruction of the writer in case the program exits in
> usage(), fix error messages handling when an unrecognized option is
> encountered, remove a spurious new-line character in an error message.
> 
> Last patches are related to the Makefiles. They fix the installation
> directory prefix and .PHONY targets.

Applied to bpf tree, thanks guys!


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 11:28:52AM -0600, Serge E. Hallyn wrote:
> 
> Just to be clear, module loading requires - and must always continue to
> require - CAP_SYS_MODULE against the initial user namespace.  Containers
> in user namespaces do not have that.
> 
> I don't believe anyone has ever claimed that containers which are not in
> a user namespace are in any way secure.

Unless the container performs some action which causes the kernel to
call request_module(), which then loads some kernel module,
potentially containing cr*p unmaintained code which was included when
the distro compiled the world, into the host kernel.

This is an attack vector that doesn't exist if you are using VM's.
And in general, the attack surface of the entire Linux
kernel<->userspace API is far larger than that which is exposed by the
guest<->host interface.

For that reason, containers are *far* more insecure than VM's, since
once the attacker gets root on the guest VM, they then have to attack
the hypervisor interface.  And if you compare the attack surface of
the two, it's pretty clear which is larger, and it's not the
hypervisor interface.

- Ted


Re: [PATCH RFC 2/2] veth: propagate bridge GSO to peer

2017-11-29 Thread Solio Sarabia
On Mon, Nov 27, 2017 at 07:02:01PM -0700, David Ahern wrote:
> On 11/27/17 6:42 PM, Solio Sarabia wrote:
> > Adding ioctl support for 'ip link set' would work. I'm still concerned
> > how to enforce the upper limit to not exceed that of the lower devices.
> > 
Actually, giving the user control to change gso doesn't solve the issue.
In a VM, the user could simply ignore setting the gso, still hurting host
perf. We need to enforce the lower gso on the bridge/veth.

Should this issue be fixed at hv_netvsc level? Why is the driver passing
down gso buffer sizes greater than what the synthetic interface allows?

> > Consider a system with three NICs, each reporting values in the range
> > [60,000 - 62,780]. Users could set virtual interfaces' gso to 65,536,
> > exceeding the limit, and having the host do sw gso (vms settings must
> > not affect host performance.)
> > 
> > Looping through interfaces?  With the difference that now it'd be
> > trigger upon user's request, not every time a veth is created (like one
> > previous patch discussed.)
> > 
> 
> You are concerned about the routed case right? One option is to have VRF
> devices propagate gso sizes to all devices (veth, vlan, etc) enslaved to
> it. VRF devices are Layer 3 master devices so an L3 parallel to a bridge.
Having the VRF device propagate the gso to its slaves is opposite of
what we do now: get minimum of all ports and assign it to bridge
(net/bridge/br_if.c, br_min_mtu, br_set_gso_limits.)

Would it be right to change the logic flow? If so, this could work:

(1) bridge gets gso from lower devices upon init/setup
(2) when new device is attached to bridge, bridge sets gso for this new
slave (and its peer if it's veth.)
(3) as the code is now, there's an optimization opportunity: for every
new interface attached to bridge, bridge loops through all ports to
set gso, mtu. It's not necessary as bridge already has the minimum
from previous interfaces attached. Could be O(1) instead of O(n).
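
A rough userspace sketch of points (2) and (3), assuming the bridge caches the minimum it has seen so far. struct bridge_sketch, bridge_init() and bridge_attach() are made-up names for illustration and are not the functions in net/bridge/br_if.c.

```c
#include <limits.h>

/* Hypothetical model of a bridge that caches the smallest gso limit of its ports. */
struct bridge_sketch {
	unsigned int gso_max_size;	/* minimum over all attached ports */
	unsigned int nports;
};

static void bridge_init(struct bridge_sketch *br)
{
	br->gso_max_size = UINT_MAX;	/* no port attached yet */
	br->nports = 0;
}

/*
 * O(1) attach: compare the new port against the cached minimum instead of
 * looping over every port. Returns the limit the new port (and its veth
 * peer, if any) would be set to.
 */
static unsigned int bridge_attach(struct bridge_sketch *br, unsigned int port_gso)
{
	if (port_gso < br->gso_max_size)
		br->gso_max_size = port_gso;
	br->nports++;
	return br->gso_max_size;
}
```

The sketch only covers attach. If a new port lowers the cached minimum, ports attached earlier would still need to be updated, and removing the port that held the minimum would still need a rescan, so the O(1) path applies only when the minimum is unchanged.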


Re: [BUG] kernel stack corruption during/after Netlabel error

2017-11-29 Thread James Morris
On Wed, 29 Nov 2017, Casey Schaufler wrote:

> I see that there is a proposed fix later in the thread, but I don't see
> the patch. Could you send it to me, so I can try it on my problem?

Forwarded off-list.

Interestingly, I didn't see the KASAN output email from Stephen here.


-- 
James Morris




[RFC] bpf: offload: report device information for offloaded programs

2017-11-29 Thread Jakub Kicinski
Report to the user ifindex and namespace information of offloaded
programs.  Always set dev_bound to true if program was loaded for
a device which has been since removed.  Specify the namespace
using dev/inode combination.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Reviewed-by: Quentin Monnet 
---
 fs/nsfs.c  |  2 +-
 include/linux/bpf.h|  2 ++
 include/linux/proc_ns.h|  1 +
 include/uapi/linux/bpf.h   |  5 +
 kernel/bpf/offload.c   | 34 ++
 kernel/bpf/syscall.c   |  6 ++
 tools/include/uapi/linux/bpf.h |  5 +
 7 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index ef243e14b6eb..d2b89372544a 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -51,7 +51,7 @@ static void nsfs_evict(struct inode *inode)
ns->ops->put(ns);
 }
 
-static void *__ns_get_path(struct path *path, struct ns_common *ns)
+void *__ns_get_path(struct path *path, struct ns_common *ns)
 {
struct vfsmount *mnt = nsfs_mnt;
struct dentry *dentry;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e55e4255a210..fc7ab26e10bf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -516,6 +516,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 
 int bpf_prog_offload_compile(struct bpf_prog *prog);
 void bpf_prog_offload_destroy(struct bpf_prog *prog);
+int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
+  struct bpf_prog *prog);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 2ff18c9840a7..1733359cf713 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -76,6 +76,7 @@ static inline int ns_alloc_inum(struct ns_common *ns)
 
 extern struct file *proc_ns_fget(int fd);
 #define get_proc_ns(inode) ((struct ns_common *)(inode)->i_private)
+extern void *__ns_get_path(struct path *path, struct ns_common *ns);
 extern void *ns_get_path(struct path *path, struct task_struct *task,
const struct proc_ns_operations *ns_ops);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4c223ab30293..3183674496a2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -910,6 +910,11 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
+   __u32 dev_bound:1;
+   __u32 reserved:31;
+   __u32 ifindex;
+   __u64 ns_dev;
+   __u64 ns_inode;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 8455b89d1bbf..da98349c647d 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -16,9 +16,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* protected by RTNL */
@@ -164,6 +166,38 @@ int bpf_prog_offload_compile(struct bpf_prog *prog)
return bpf_prog_offload_translate(prog);
 }
 
+int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
+  struct bpf_prog *prog)
+{
+   struct bpf_dev_offload *offload = prog->aux->offload;
+   struct inode *ns_inode;
+   struct path ns_path;
+   struct net *net;
+   int ret = 0;
+   void *ptr;
+
+   info->dev_bound = 1;
+
+   rtnl_lock();
+   if (!offload->netdev)
+   goto out;
+
+   net = dev_net(offload->netdev);
+   get_net(net); /* __ns_get_path() drops the reference */
+   ptr = __ns_get_path(&ns_path, &net->ns);
+   ret = PTR_ERR_OR_ZERO(ptr);
+   if (ret)
+   goto out;
+   ns_inode = ns_path.dentry->d_inode;
+
+   info->ns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
+   info->ns_inode = ns_inode->i_ino;
+   info->ifindex = offload->netdev->ifindex;
+out:
+   rtnl_unlock();
+   return ret;
+}
+
 const struct bpf_prog_ops bpf_offload_prog_ops = {
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2c4cfeaa8d5e..101ee3a3e80e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1616,6 +1616,12 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
return -EFAULT;
}
 
+   if (bpf_prog_is_dev_bound(prog->aux)) {
+   err = bpf_prog_offload_info_fill(&info, prog);
+   if (err)
+   return err;
+   }
+
 done:
 if (copy_to_user(uinfo, &info, info_len) ||
 put_user(info_len, &uattr->info.info_len))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4c223ab30293..3183674496a2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -910,6 +910,11 @@ struct bpf_prog_info {
__u32 

Re: [BUG] kernel stack corruption during/after Netlabel error

2017-11-29 Thread Casey Schaufler
On 11/29/2017 2:26 AM, James Morris wrote:
> I'm seeing a kernel stack corruption bug (detected via gcc) when running 
> the SELinux testsuite on a 4.15-rc1 kernel, in the 2nd inet_socket test:
>
> https://github.com/SELinuxProject/selinux-testsuite/blob/master/tests/inet_socket/test
>
>   # Verify that unauthorized client cannot communicate with the server.
>   $result = system
>   "runcon -t test_inet_bad_client_t -- $basedir/client stream 127.0.0.1 65535 2>&1";
>
> This correctly causes an access control error in the Netlabel code, and 
> the bug seems to be triggered during the ICMP send:
>
> ..
>
> This is mostly reliable, and I'm only seeing it on bare metal (not in a 
> virtualbox vm).
>
> The SELinux skb parse error at the start only sometimes appears, and 
> looking at the code, I suspect some kind of memory corruption being the 
> cause at that point (basic packet header checks).
>
> I bisected the bug down to the following change:
>
> commit bffa72cf7f9df842f0016ba03586039296b4caaf
> Author: Eric Dumazet 
> Date:   Tue Sep 19 05:14:24 2017 -0700
>
> net: sk_buff rbnode reorg
> ...
>
>
> Anyone else able to reproduce this, or have any ideas on what's happening?

I have also bisected a problem to this change. I do not have a trace
because the problem manifests as a hard system hang without a trace
being presented. The issue arises when Smack attempts to relabel a TCP
socket using netlbl_sock_setattr().

I see that there is a proposed fix later in the thread, but I don't see
the patch. Could you send it to me, so I can try it on my problem?

Thank you.

>
>
>
> - James



Re: [RFC 1/3] kallsyms: don't leak address when symbol not found

2017-11-29 Thread Tobin C. Harding
I reordered the To's and CC's, I hope this doesn't break
threading. (clearly I haven't grokked email yet :( )

On Tue, Nov 28, 2017 at 09:30:17AM +1100, Tobin C. Harding wrote:
> Currently if kallsyms_lookup() fails to find the symbol then the address
> is printed. This potentially leaks sensitive information. Instead of
> printing the address we can return an error, giving the calling code the
> option to print the address or print some sanitized message.
> 
> Return error instead of printing address to argument buffer. Leave
> buffer in a sane state.
> 
> Signed-off-by: Tobin C. Harding 
> ---
>  kernel/kallsyms.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> index 531ffa984bc2..4bfa4ee3ce93 100644
> --- a/kernel/kallsyms.c
> +++ b/kernel/kallsyms.c
> @@ -394,8 +394,10 @@ static int __sprint_symbol(char *buffer, unsigned long address,
>  
>   address += symbol_offset;
>   name = kallsyms_lookup(address, &size, &offset, &modname, buffer);
> - if (!name)
> - return sprintf(buffer, "0x%lx", address - symbol_offset);
> + if (!name) {
> + buffer[0] = '\0';
> + return -1;
> + }
>  
>   if (name != buffer)
>   strcpy(buffer, name);
> -- 
> 2.7.4
> 

Do you want a Suggested-by: tag for this patch Steve? I mentioned you in
the cover letter but as far as going into the git history I'm not
entirely sure on the protocol for adding suggested-by. The kernel docs
say not to add it without authorization, so ...

thanks,
Tobin.
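
The behaviour change in the patch above (return an error and leave the buffer empty instead of formatting the raw address) can be sketched in user space as follows. This is only an illustration under stated assumptions: the symbol table, `lookup()`, `sprint_symbol_sketch()` and `describe()` are hypothetical stand-ins, not kernel code, and the "<no-symbol>" placeholder is just one choice a caller might make.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol table standing in for kallsyms; not kernel code. */
struct sym {
	unsigned long addr;
	const char *name;
};

static const struct sym table[] = {
	{ 0x1000UL, "start_kernel" },
	{ 0x2000UL, "do_fork" },
};

/* Exact-match lookup; returns NULL when no symbol is found. */
static const char *lookup(unsigned long addr)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i].addr == addr)
			return table[i].name;
	return NULL;
}

/*
 * Mirrors the patched behaviour: on lookup failure the buffer is left
 * as an empty string and -1 is returned, so the address is never
 * formatted into the output. len must be at least 1.
 */
int sprint_symbol_sketch(char *buffer, size_t len, unsigned long addr)
{
	const char *name = lookup(addr);

	if (!name) {
		buffer[0] = '\0';
		return -1;
	}
	return snprintf(buffer, len, "%s", name);
}

/* The caller decides what to show on failure; here a fixed placeholder. */
const char *describe(char *buffer, size_t len, unsigned long addr)
{
	if (sprint_symbol_sketch(buffer, len, addr) < 0)
		return "<no-symbol>";
	return buffer;
}
```

The point of the design is that the policy decision (print the address, or print something sanitized) moves to the caller rather than being made inside the formatting helper.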


RE: [PATCH] sched/deadline: fix one-bit signed bitfields to be unsigned

2017-11-29 Thread Keller, Jacob E
> -Original Message-
> From: Jakub Kicinski [mailto:kubak...@wp.pl]
> Sent: Tuesday, November 28, 2017 8:08 PM
> To: Kirsher, Jeffrey T 
> Cc: mi...@redhat.com; pet...@infradead.org; Keller, Jacob E
> ; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; luca abeni 
> Subject: Re: [PATCH] sched/deadline: fix one-bit signed bitfields to be unsigned
> 
> On Tue, 28 Nov 2017 12:36:19 -0800, Jeff Kirsher wrote:
> > From: Jacob Keller 
> >
> > Commit 799ba82de01e ("sched/deadline: Use C bitfields for the state
> > flags", 2017-10-10) introduced the use of C bitfields for these
> > variables. However, sparse complains about them:
> >
> > ./include/linux/sched.h:476:62: error: dubious one-bit signed bitfield
> > ./include/linux/sched.h:477:62: error: dubious one-bit signed bitfield
> > ./include/linux/sched.h:478:62: error: dubious one-bit signed bitfield
> > ./include/linux/sched.h:479:62: error: dubious one-bit signed bitfield
> >
> > This is because a one-bit signed bitfield can only hold the values 0 and
> > -1, which can cause problems if the program expects to be able to
> > represent the value positive 1.
> >
> > In practice, this may not cause a bug since -1 would be considered
> > "true" in logical tests, however we should avoid the practice anyways.
> >
> > Fixes: 799ba82de01e ("sched/deadline: Use C bitfields for the state flags", 2017-10-10)
> > Signed-off-by: Jacob Keller 
> > Cc: luca abeni 
> > Tested-by: Andrew Bowers 
> > Signed-off-by: Jeff Kirsher 
> 
> This is already in Linus's tree (I've been waiting for it to land as
> well :))
> 

Excellent.

Regards,
Jake
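
The commit message quoted above says a one-bit signed bitfield can only hold 0 and -1. A minimal user-space sketch of that point follows; the struct and function names are hypothetical and are not the fields from include/linux/sched.h. It assumes the usual two's-complement representation.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the flags discussed above; not kernel code. */
struct flags_signed {
	signed int throttled : 1;	/* representable values: 0 and -1 */
};

struct flags_unsigned {
	unsigned int throttled : 1;	/* representable values: 0 and 1 */
};

/* A set signed one-bit flag is still "true" in a logical test. */
int signed_flag_is_true(void)
{
	struct flags_signed f = { 0 };

	f.throttled = -1;	/* the only nonzero value the field can hold */
	return f.throttled ? 1 : 0;
}

/* ...but it never compares equal to positive 1. */
int signed_flag_equals_one(void)
{
	struct flags_signed f = { 0 };

	f.throttled = -1;
	return f.throttled == 1;
}

/* The unsigned variant behaves the way most code expects. */
int unsigned_flag_equals_one(void)
{
	struct flags_unsigned f = { 0 };

	f.throttled = 1;
	return f.throttled == 1;
}
```

This matches the commit message's reasoning: logical tests keep working with the signed field, while any code that compares the flag against 1 would silently fail.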


[Patch net v2] act_sample: get rid of tcf_sample_cleanup_rcu()

2017-11-29 Thread Cong Wang
Similar to commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu"),
TC actions don't need to respect RCU grace period, because it
is either just detached from tc filter (standalone case) or
it is removed together with tc filter (bound case) in which case
RCU grace period is already respected at filter layer.

Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
Reported-by: Eric Dumazet 
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Cc: Yotam Gigi 
Signed-off-by: Cong Wang 
---
 include/net/tc_act/tc_sample.h |  1 -
 net/sched/act_sample.c | 14 +++---
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index 524cee4f4c81..01dbfea32672 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -14,7 +14,6 @@ struct tcf_sample {
struct psample_group __rcu *psample_group;
u32 psample_group_num;
struct list_head tcfm_list;
-   struct rcu_head rcu;
 };
 #define to_sample(a) ((struct tcf_sample *)a)
 
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 8b5abcd2f32f..9438969290a6 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -96,23 +96,16 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
return ret;
 }
 
-static void tcf_sample_cleanup_rcu(struct rcu_head *rcu)
+static void tcf_sample_cleanup(struct tc_action *a, int bind)
 {
-   struct tcf_sample *s = container_of(rcu, struct tcf_sample, rcu);
+   struct tcf_sample *s = to_sample(a);
struct psample_group *psample_group;
 
-   psample_group = rcu_dereference_protected(s->psample_group, 1);
+   psample_group = rtnl_dereference(s->psample_group);
RCU_INIT_POINTER(s->psample_group, NULL);
psample_group_put(psample_group);
 }
 
-static void tcf_sample_cleanup(struct tc_action *a, int bind)
-{
-   struct tcf_sample *s = to_sample(a);
-
-   call_rcu(&s->rcu, tcf_sample_cleanup_rcu);
-}
-
 static bool tcf_sample_dev_ok_push(struct net_device *dev)
 {
switch (dev->type) {
@@ -264,7 +257,6 @@ static int __init sample_init_module(void)
 
 static void __exit sample_cleanup_module(void)
 {
-   rcu_barrier();
tcf_unregister_action(&act_sample_ops, &sample_net_ops);
 }
 
-- 
2.13.0



Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Kees Cook
On Wed, Nov 29, 2017 at 2:45 PM, Linus Torvalds
 wrote:
> On Wed, Nov 29, 2017 at 7:58 AM, David Miller  wrote:
>>
>> We're talking about making sure that loading "ppp.ko" really gets
>> ppp.ko rather than some_other_module.ko renamed to ppp.ko via some
>> other mechanism.
>>
>> Both modules have legitimate signatures so the kernel will happily
>> load both.
>
> Yes. We could make the module name be part of the signing process, but
> one problem with that is that at module loading time we don't actually
> have the filename any more.

FWIW, I added this (well, KBUILD_MODNAME) to the module info just recently:

3e2e857f9c3a ("module: Add module name to modinfo")

> User space opens the file and then just feeds the data to the kernel.
> So if you fooled modprobe into feeding the wrong module, that's it.
>
> And yes, we can obviously embed the module name into the ELF headers
> (that is all part of the signed payload), but the module name doesn't
> actually necessarily match what we originally asked for.
>
> Why? Module aliases and module dependencies - which is why we have
> that user mode side at all. When we do "request_module(XYZ)" we don't
> necessarily know what the dependencies are, so we expect modprobe to
> just load the right modules.
>
> So if modprobe then loads some other module (dccp or whatever), the
> kernel has no real way to know "oh, that wasn't part of the dependency
> chain for the module we asked for".
>
> Now, if modprobe is taught to check that the filename of the module
> that it opens actually matches the metadata in the ELF sections, that
> would solve it, but it's out of the kernel's hands..

Right, the aliases are why these kinds of renaming shenanigans don't
mean anything: it's not distinguishable from whatever modprobe.conf
ultimately tells modprobe to do.

If you can't trust your filesystem to hold your kernel modules
correctly, you have much bigger problems. (And yes, capabilities are a
problem here, since there are many paths to full root from individual
capabilities, but that's a known issue that is much larger than
tricking modprobe.)

-Kees

-- 
Kees Cook
Pixel Security
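
The user-space check Linus describes above (modprobe verifying that the file it opens matches the name embedded in the module) might look roughly like the sketch below. It is an illustration only: `module_name_matches()` is a hypothetical helper, not part of kmod or the kernel; treating '-' and '_' as equivalent is an assumption about module-name normalization; and compressed module files are not handled. As Kees points out, aliases and modprobe.conf limit what such a check can prove.

```c
#include <string.h>

/*
 * Return 1 if the basename of path, minus a trailing ".ko", equals
 * name, treating '-' and '_' as the same character; 0 otherwise.
 * name stands for the module name recorded in the signed payload.
 */
int module_name_matches(const char *path, const char *name)
{
	const char *base = strrchr(path, '/');
	size_t len, i;

	base = base ? base + 1 : path;
	len = strlen(base);

	/* Require a ".ko" suffix and strip it. */
	if (len < 3 || strcmp(base + len - 3, ".ko") != 0)
		return 0;
	len -= 3;

	if (strlen(name) != len)
		return 0;

	for (i = 0; i < len; i++) {
		char a = base[i] == '-' ? '_' : base[i];
		char b = name[i] == '-' ? '_' : name[i];

		if (a != b)
			return 0;
	}
	return 1;
}
```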


Re: [kernel-hardening] Re: [RFC 0/3] kallsyms: don't leak address when printing symbol

2017-11-29 Thread Tobin C. Harding
On Tue, Nov 28, 2017 at 08:58:44AM +0530, Kaiwan N Billimoria wrote:
> On Tue, Nov 28, 2017 at 7:20 AM, Tobin C. Harding  wrote:
> >
> > Noob question: how do we _know_ this. In other words how do we know no
> > userland tools rely on the current behaviour? No stress to answer Kees,
> > this is a pretty general kernel dev question.
> 
> Perhaps I'm reading this wrong, but anyway: besides ftrace, kprobes
> will require a
> symbol-to-address lookup. Specifically, in the function
> kprobe_lookup_name() which
> in turn invokes kallsyms_lookup_name().

We should be right for this call chain because the patch doesn't touch
kallsyms_lookup_name().

> AFAIK, SystemTap (userland) is built on top of the kprobes infrastructure..

This actually indirectly answers the concern. Since no userland tool
should be looking up a kernel address the only code we can break is
kernel code.


thanks,
Tobin


Re: [PATCH 0/2] replace %pK with %p

2017-11-29 Thread Kees Cook
On Wed, Nov 29, 2017 at 3:38 PM, Tobin C. Harding  wrote:
> We are now hashing addresses printed with %pK (when
> kptr_restrict==0). Perhaps we can get rid of %pK (and kptr_restrict)
> entirely. Instead of rushing ahead and doing so let's replace all printk
> format strings that use %pK with %p.

NAK. Real people use kptr_restrict -- removing %pK is a regression for
them. Setting kptr_restrict should zero the values marked with %pK.
There is still a risk of correlating information leaks to at least
select a target. If we add a knob for the %p hashing to switch to
zeroing, then we could drop %pK, IMO.

-Kees

>
> It is a nice time to do this now while we are prepared for breakages
> from applying the pointer hashing patch series.
>
> The patch to remove kptr_restrict entirely should then be a non-event.
>
> Second patch adds printk specifier %pz to display zeroed address. This
> may be useful for fixing things that break during the fallout from
> hashing and replacing %pK. We can always revert this patch if it turns
> out to be worthless, right?
>
> Patch 1 was created using
>
> for file in $(git grep -l '%pK')
> do
> perl -pi -e 's/%pK/%p/g' $file
> done
>
> thanks,
> Tobin.
>
> Tobin C. Harding (2):
>   tree-wide: replace all users of %pK with %p
>   printk: add specifier %pz, for zeroed address
>
>  Documentation/printk-formats.txt   | 11 +++
>  arch/arm/mm/physaddr.c |  2 +-
>  arch/arm64/mm/physaddr.c   |  2 +-
>  arch/mips/kernel/relocate.c| 10 +--
>  arch/mips/kvm/mips.c   |  2 +-
>  arch/powerpc/perf/hv-24x7.c|  8 +--
>  arch/s390/kvm/intercept.c  |  2 +-
>  arch/s390/kvm/kvm-s390.c   | 10 +--
>  arch/s390/kvm/trace-s390.h |  4 +-
>  drivers/android/binder.c   |  2 +-
>  drivers/android/binder_alloc.c | 28 
>  drivers/gpu/drm/exynos/exynos_drm_dsi.c|  4 +-
>  drivers/gpu/drm/exynos/exynos_drm_fimc.c   |  2 +-
>  drivers/gpu/drm/exynos/exynos_drm_gem.c|  2 +-
>  drivers/gpu/drm/exynos/exynos_drm_gsc.c|  2 +-
>  drivers/gpu/drm/exynos/exynos_drm_ipp.c| 22 +++---
>  drivers/gpu/drm/exynos/exynos_drm_rotator.c|  2 +-
>  drivers/gpu/drm/i915/i915_debugfs.c|  2 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
>  drivers/net/wireless/ath/ath10k/ahb.c  |  2 +-
>  drivers/net/wireless/ath/ath10k/bmi.c  |  4 +-
>  drivers/net/wireless/ath/ath10k/ce.c   |  4 +-
>  drivers/net/wireless/ath/ath10k/core.c |  4 +-
>  drivers/net/wireless/ath/ath10k/htc.c  |  6 +-
>  drivers/net/wireless/ath/ath10k/htt_rx.c   |  2 +-
>  drivers/net/wireless/ath/ath10k/mac.c  | 22 +++---
>  drivers/net/wireless/ath/ath10k/pci.c  |  2 +-
>  drivers/net/wireless/ath/ath10k/testmode.c |  4 +-
>  drivers/net/wireless/ath/ath10k/txrx.c |  2 +-
>  drivers/net/wireless/ath/ath10k/usb.c  |  4 +-
>  drivers/net/wireless/ath/ath10k/wmi.c  |  4 +-
>  drivers/spi/spi-loopback-test.c| 12 ++--
>  drivers/staging/ccree/ssi_buffer_mgr.c | 54 +++---
>  drivers/staging/ccree/ssi_cipher.c |  4 +-
>  drivers/staging/ccree/ssi_hash.c   | 30 
>  .../interface/vchiq_arm/vchiq_2835_arm.c   |  6 +-
>  .../vc04_services/interface/vchiq_arm/vchiq_arm.c  | 16 ++---
>  .../vc04_services/interface/vchiq_arm/vchiq_core.c | 84 +++---
>  .../interface/vchiq_arm/vchiq_kern_lib.c   |  4 +-
>  drivers/usb/core/devio.c   | 14 ++--
>  drivers/usb/core/hcd.c |  4 +-
>  drivers/usb/core/urb.c |  2 +-
>  drivers/usb/dwc3/dwc3-st.c |  2 +-
>  drivers/usb/dwc3/gadget.c  |  4 +-
>  include/linux/filter.h |  2 +-
>  kernel/cgroup/debug.c  |  8 +--
>  kernel/module.c|  2 +-
>  kernel/time/timer_list.c   |  4 +-
>  lib/vsprintf.c | 26 +--
>  mm/vmalloc.c   |  4 +-
>  net/atm/proc.c |  4 +-
>  net/bluetooth/af_bluetooth.c   |  2 +-
>  net/can/bcm.c  |  6 +-
>  net/can/proc.c |  4 +-
>  net/ipv4/ping.c|  2 +-
>  net/ipv4/raw.c |  2 +-
>  net/ipv4/tcp_ipv4.c|  6 +-
>  net/ipv4/udp.c 

Re: [BUG] kernel stack corruption during/after Netlabel error

2017-11-29 Thread James Morris
On Wed, 29 Nov 2017, Eric Dumazet wrote:

> On Wed, 2017-11-29 at 12:23 -0800, Eric Dumazet wrote:
> > 
> > I suspect this exposes an ancient bug, caused by fact that TCP moves
> > IP[6]CB in skb->cb[]
> > 
> > Basically the 2nd tcp_filter() added in commit
> > 8fac365f63c866a00015fa13932d8ffc584518b8
> > ("tcp: Add a tcp_filter hook before handle ack packet") was not
> > expecting selinux code being called a 2nd time,
> > while skb->cb[] has been mangled [1]
> > 
> > [1]
> > > memmove(&TCP_SKB_CB(skb)->header.h4, IPCB(skb),
> > sizeof(struct inet_skb_parm));
> 
> Please try this fix for IPv4 (a similar patch will be needed for IPv6)
> 
>  net/ipv4/tcp_ipv4.c |   51 ++
>  1 file changed, 32 insertions(+), 19 deletions(-)

Works for me, no crashes with the testsuite running in a loop.


Tested-by: James Morris 


-- 
James Morris



[PATCH 0/2] replace %pK with %p

2017-11-29 Thread Tobin C. Harding
We are now hashing addresses printed with %pK (when
kptr_restrict==0). Perhaps we can get rid of %pK (and kptr_restrict)
entirely. Instead of rushing ahead and doing so let's replace all printk
format strings that use %pK with %p.

It is a nice time to do this now while we are prepared for breakages
from applying the pointer hashing patch series.

The patch to remove kptr_restrict entirely should then be a non-event.

Second patch adds printk specifier %pz to display zeroed address. This
may be useful for fixing things that break during the fallout from
hashing and replacing %pK. We can always revert this patch if it turns
out to be worthless, right?

Patch 1 was created using

for file in $(git grep -l '%pK')
do
perl -pi -e 's/%pK/%p/g' $file
done

thanks,
Tobin.

Tobin C. Harding (2):
  tree-wide: replace all users of %pK with %p
  printk: add specifier %pz, for zeroed address

 Documentation/printk-formats.txt   | 11 +++
 arch/arm/mm/physaddr.c |  2 +-
 arch/arm64/mm/physaddr.c   |  2 +-
 arch/mips/kernel/relocate.c| 10 +--
 arch/mips/kvm/mips.c   |  2 +-
 arch/powerpc/perf/hv-24x7.c|  8 +--
 arch/s390/kvm/intercept.c  |  2 +-
 arch/s390/kvm/kvm-s390.c   | 10 +--
 arch/s390/kvm/trace-s390.h |  4 +-
 drivers/android/binder.c   |  2 +-
 drivers/android/binder_alloc.c | 28 
 drivers/gpu/drm/exynos/exynos_drm_dsi.c|  4 +-
 drivers/gpu/drm/exynos/exynos_drm_fimc.c   |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c|  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gsc.c|  2 +-
 drivers/gpu/drm/exynos/exynos_drm_ipp.c| 22 +++---
 drivers/gpu/drm/exynos/exynos_drm_rotator.c|  2 +-
 drivers/gpu/drm/i915/i915_debugfs.c|  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
 drivers/net/wireless/ath/ath10k/ahb.c  |  2 +-
 drivers/net/wireless/ath/ath10k/bmi.c  |  4 +-
 drivers/net/wireless/ath/ath10k/ce.c   |  4 +-
 drivers/net/wireless/ath/ath10k/core.c |  4 +-
 drivers/net/wireless/ath/ath10k/htc.c  |  6 +-
 drivers/net/wireless/ath/ath10k/htt_rx.c   |  2 +-
 drivers/net/wireless/ath/ath10k/mac.c  | 22 +++---
 drivers/net/wireless/ath/ath10k/pci.c  |  2 +-
 drivers/net/wireless/ath/ath10k/testmode.c |  4 +-
 drivers/net/wireless/ath/ath10k/txrx.c |  2 +-
 drivers/net/wireless/ath/ath10k/usb.c  |  4 +-
 drivers/net/wireless/ath/ath10k/wmi.c  |  4 +-
 drivers/spi/spi-loopback-test.c| 12 ++--
 drivers/staging/ccree/ssi_buffer_mgr.c | 54 +++---
 drivers/staging/ccree/ssi_cipher.c |  4 +-
 drivers/staging/ccree/ssi_hash.c   | 30 
 .../interface/vchiq_arm/vchiq_2835_arm.c   |  6 +-
 .../vc04_services/interface/vchiq_arm/vchiq_arm.c  | 16 ++---
 .../vc04_services/interface/vchiq_arm/vchiq_core.c | 84 +++---
 .../interface/vchiq_arm/vchiq_kern_lib.c   |  4 +-
 drivers/usb/core/devio.c   | 14 ++--
 drivers/usb/core/hcd.c |  4 +-
 drivers/usb/core/urb.c |  2 +-
 drivers/usb/dwc3/dwc3-st.c |  2 +-
 drivers/usb/dwc3/gadget.c  |  4 +-
 include/linux/filter.h |  2 +-
 kernel/cgroup/debug.c  |  8 +--
 kernel/module.c|  2 +-
 kernel/time/timer_list.c   |  4 +-
 lib/vsprintf.c | 26 +--
 mm/vmalloc.c   |  4 +-
 net/atm/proc.c |  4 +-
 net/bluetooth/af_bluetooth.c   |  2 +-
 net/can/bcm.c  |  6 +-
 net/can/proc.c |  4 +-
 net/ipv4/ping.c|  2 +-
 net/ipv4/raw.c |  2 +-
 net/ipv4/tcp_ipv4.c|  6 +-
 net/ipv4/udp.c |  2 +-
 net/ipv6/datagram.c|  2 +-
 net/ipv6/tcp_ipv6.c|  6 +-
 net/key/af_key.c   |  2 +-
 net/netlink/af_netlink.c   |  2 +-
 net/packet/af_packet.c |  2 +-
 net/phonet/socket.c|  2 +-
 net/unix/af_unix.c |  2 +-
 sound/soc/bcm/cygnus-pcm.c |  2 +-
 66 files changed, 269 insertions(+), 240 deletions(-)

-- 
2.7.4



[PATCH 2/2] printk: add specifier %pz, for zeroed address

2017-11-29 Thread Tobin C. Harding
Currently %pK [at times] zeros addresses. It would be nice to remove %pK
entirely. Printing zero addresses is useful if we want to sanitize an
address but there may be userland tools that currently rely on the
address format (i.e the correct width).

Add printk specifier %pz.

Signed-off-by: Tobin C. Harding 
---
 Documentation/printk-formats.txt | 11 +++
 lib/vsprintf.c   | 18 ++
 2 files changed, 29 insertions(+)

diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index aa0a776c817a..f88b06485378 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -122,6 +122,17 @@ uniquely grep'able. If, in the future, we need to modify the way the Kernel
 handles printing pointers it will be nice to be able to find the call
 sites.
 
+Zeroed Addresses
+================
+
+::
+
+   %pz 00000000 or 0000000000000000
+
+For printing zeroed addresses. This is useful if you want to sanitize
+an address but there may be userland tools that depend on the correct width
+of the address for parsing.
+
 Struct Resources
 
 
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 81e9ce8f52f9..ebf911618858 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1727,6 +1727,19 @@ static char *ptr_to_id(char *buf, char *end, void *ptr, struct printf_spec spec)
return number(buf, end, hashval, spec);
 }
 
+static noinline_for_stack
+char *zero_string(char *buf, char *end, struct printf_spec spec)
+{
+   spec.base = 16;
+   spec.flags |= SMALL;
+   if (spec.field_width == -1) {
+   spec.field_width = 2 * sizeof(void *);
+   spec.flags |= ZEROPAD;
+   }
+
+   return number(buf, end, 0, spec);
+}
+
 /*
  * Show a '%p' thing.  A kernel extension is that the '%p' is followed
  * by an extra set of alphanumeric characters that are extended format
@@ -1833,6 +1846,9 @@ static char *ptr_to_id(char *buf, char *end, void *ptr, struct printf_spec spec)
  *C full compatible string
  *
  * - 'x' For printing the address. Equivalent to "%lx".
+ * - 'z' For printing zeroed addresses. This is useful if you want to
+ *   sanitize an address but there may be userland tools that depend on the
+ *   correct width of the address for parsing.
  *
  * ** Please update also Documentation/printk-formats.txt when making changes **
  *
@@ -1960,6 +1976,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
}
case 'x':
return pointer_string(buf, end, ptr, spec);
+   case 'z':
+   return zero_string(buf, end, spec);
}
 
/* default is to _not_ leak addresses, hash before printing */
-- 
2.7.4
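
For readers who want to see what the default %pz output would look like, here is a user-space approximation of `zero_string()` from the patch above. It is a sketch under assumptions: `format_zeroed_pointer()` is a hypothetical name, it uses `snprintf()` rather than the kernel's `number()` helper, and it covers only the case where no field width was given (the `spec.field_width == -1` branch).

```c
#include <stdio.h>

/*
 * Approximates the default %pz behaviour: the value 0 printed as
 * lowercase hex, zero-padded to the width of a pointer on this
 * machine (two hex digits per byte). Returns the number of
 * characters written, as snprintf() does.
 */
int format_zeroed_pointer(char *buf, size_t len)
{
	int width = (int)(2 * sizeof(void *));

	return snprintf(buf, len, "%0*x", width, 0u);
}
```

On a 64-bit build this yields sixteen '0' characters, on a 32-bit build eight, which is the property the commit message cares about: the address is sanitized but keeps the width that parsing tools may expect.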



[PATCH 1/2] tree-wide: replace all users of %pK with %p

2017-11-29 Thread Tobin C. Harding
%p is now hashed, it is therefore more secure than %pK (with
 kptr_restrict==0). We may be able to remove %pK and kptr_restrict
 altogether now. First, let's replace all in tree users of %pK. We can
 give this a while in the wild to see what breaks. Then if things play
 nicely we can remove %pK  (and kptr_restrict).

Search and replace all uses of %pK with %p.

Signed-off-by: Tobin C. Harding 
---
 arch/arm/mm/physaddr.c |  2 +-
 arch/arm64/mm/physaddr.c   |  2 +-
 arch/mips/kernel/relocate.c| 10 +--
 arch/mips/kvm/mips.c   |  2 +-
 arch/powerpc/perf/hv-24x7.c|  8 +--
 arch/s390/kvm/intercept.c  |  2 +-
 arch/s390/kvm/kvm-s390.c   | 10 +--
 arch/s390/kvm/trace-s390.h |  4 +-
 drivers/android/binder.c   |  2 +-
 drivers/android/binder_alloc.c | 28 
 drivers/gpu/drm/exynos/exynos_drm_dsi.c|  4 +-
 drivers/gpu/drm/exynos/exynos_drm_fimc.c   |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c|  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gsc.c|  2 +-
 drivers/gpu/drm/exynos/exynos_drm_ipp.c| 22 +++---
 drivers/gpu/drm/exynos/exynos_drm_rotator.c|  2 +-
 drivers/gpu/drm/i915/i915_debugfs.c|  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
 drivers/net/wireless/ath/ath10k/ahb.c  |  2 +-
 drivers/net/wireless/ath/ath10k/bmi.c  |  4 +-
 drivers/net/wireless/ath/ath10k/ce.c   |  4 +-
 drivers/net/wireless/ath/ath10k/core.c |  4 +-
 drivers/net/wireless/ath/ath10k/htc.c  |  6 +-
 drivers/net/wireless/ath/ath10k/htt_rx.c   |  2 +-
 drivers/net/wireless/ath/ath10k/mac.c  | 22 +++---
 drivers/net/wireless/ath/ath10k/pci.c  |  2 +-
 drivers/net/wireless/ath/ath10k/testmode.c |  4 +-
 drivers/net/wireless/ath/ath10k/txrx.c |  2 +-
 drivers/net/wireless/ath/ath10k/usb.c  |  4 +-
 drivers/net/wireless/ath/ath10k/wmi.c  |  4 +-
 drivers/spi/spi-loopback-test.c| 12 ++--
 drivers/staging/ccree/ssi_buffer_mgr.c | 54 +++---
 drivers/staging/ccree/ssi_cipher.c |  4 +-
 drivers/staging/ccree/ssi_hash.c   | 30 
 .../interface/vchiq_arm/vchiq_2835_arm.c   |  6 +-
 .../vc04_services/interface/vchiq_arm/vchiq_arm.c  | 16 ++---
 .../vc04_services/interface/vchiq_arm/vchiq_core.c | 84 +++---
 .../interface/vchiq_arm/vchiq_kern_lib.c   |  4 +-
 drivers/usb/core/devio.c   | 14 ++--
 drivers/usb/core/hcd.c |  4 +-
 drivers/usb/core/urb.c |  2 +-
 drivers/usb/dwc3/dwc3-st.c |  2 +-
 drivers/usb/dwc3/gadget.c  |  4 +-
 include/linux/filter.h |  2 +-
 kernel/cgroup/debug.c  |  8 +--
 kernel/module.c|  2 +-
 kernel/time/timer_list.c   |  4 +-
 lib/vsprintf.c |  8 +--
 mm/vmalloc.c   |  4 +-
 net/atm/proc.c |  4 +-
 net/bluetooth/af_bluetooth.c   |  2 +-
 net/can/bcm.c  |  6 +-
 net/can/proc.c |  4 +-
 net/ipv4/ping.c|  2 +-
 net/ipv4/raw.c |  2 +-
 net/ipv4/tcp_ipv4.c|  6 +-
 net/ipv4/udp.c |  2 +-
 net/ipv6/datagram.c|  2 +-
 net/ipv6/tcp_ipv6.c|  6 +-
 net/key/af_key.c   |  2 +-
 net/netlink/af_netlink.c   |  2 +-
 net/packet/af_packet.c |  2 +-
 net/phonet/socket.c|  2 +-
 net/unix/af_unix.c |  2 +-
 sound/soc/bcm/cygnus-pcm.c |  2 +-
 65 files changed, 240 insertions(+), 240 deletions(-)

diff --git a/arch/arm/mm/physaddr.c b/arch/arm/mm/physaddr.c
index cf75819e4c13..3147984ca70b 100644
--- a/arch/arm/mm/physaddr.c
+++ b/arch/arm/mm/physaddr.c
@@ -38,7 +38,7 @@ static inline bool __virt_addr_valid(unsigned long x)
 phys_addr_t __virt_to_phys(unsigned long x)
 {
WARN(!__virt_addr_valid(x),
-"virt_to_phys used for non-linear address: %pK (%pS)\n",
+"virt_to_phys used for non-linear address: %p (%pS)\n",
 (void *)x, (void *)x);
 
return __virt_to_phys_nodebug(x);
diff --git 

Re: [PATCH 3/4] RFC: net: dsa: Add bindings for Realtek SMI DSAs

2017-11-29 Thread Andrew Lunn
> While Andrew's suggestion to use of_mdiobus_register() even for the
> built-in DSA created slave_mii_bus makes sense, I would rather recommend
> you instantiate your own bus (ala mv88e6xxx), such that your DT will
> likely look like:

Hi Florian

It could still look like this, if the built in slave_mii_bus looked for
the mdio node.

Something like:

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 44e3fb7dec8c..6b64c09413bf 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -312,6 +312,7 @@ static void dsa_port_teardown(struct dsa_port *dp)
 
 static int dsa_switch_setup(struct dsa_switch *ds)
 {
+   struct device_node *node;
int err;
 
/* Initialize ds->phys_mii_mask before registering the slave MDIO bus
@@ -347,7 +348,11 @@ static int dsa_switch_setup(struct dsa_switch *ds)
 
dsa_slave_mii_bus_init(ds);
 
-   err = mdiobus_register(ds->slave_mii_bus);
+   if (ds->dev->of_node &&
+       (node = of_get_child_by_name(ds->dev->of_node, "mdio")))
+   err = of_mdiobus_register(ds->slave_mii_bus, node);
+   else
+   err = mdiobus_register(ds->slave_mii_bus);
if (err < 0)
return err;
}

Andrew


Re: [PATCH V11 0/5] hash addresses printed with %p

2017-11-29 Thread Tobin C. Harding
On Wed, Nov 29, 2017 at 03:20:40PM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2017 13:05:00 +1100 "Tobin C. Harding"  wrote:
> 
> > Currently there exist approximately 14 000 places in the Kernel where
> > addresses are being printed using an unadorned %p. This potentially
> > leaks sensitive information regarding the Kernel layout in memory. Many
> > of these calls are stale, instead of fixing every call let's hash the
> > address by default before printing. This will of course break some
> > users, forcing code printing needed addresses to be updated. We can add
> > a printk specifier for this purpose (%px) to give developers a clear
> > upgrade path for breakages caused by applying this patch set.
> > 
> > The added advantage of hashing %p is that security is now opt-out, if
> > you _really_ want the address you have to work a little harder and use
> > %px.
> > 
> > The idea for creating the printk specifier %px to print the actual
> > address was suggested by Kees Cook (see below for email threads by
> > subject).
> 
> Maybe I'm being thick, but...  if we're rendering these addresses
> unusable by hashing them, why not just print something like
> "" in their place?  That loses the uniqueness thing but I
> wonder how valuable that is in practice?

The discussion on this has been fragmented over _at least_ 5 patch sets
with totally different subjects. And I only just added you to the CC
list, my apologies if this is a bit confusing.

Consensus was that if we provide a unique identifier (the hashed
address) then this is useful for debugging (i.e differentiating between
structs when you have a list of them).

The first 32 bits (on 64 bit machines) were zeroed out because

1. they are unnecessary in achieving the aim.
2. it reduces noise.
3. makes explicit some funny business was going on.

And bonus points, hopefully we don't break userland tools that parse
addresses if the format is still the same.

thanks,
Tobin.
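
The scheme described above (a hashed value used as a unique identifier, with the first 32 bits zeroed on 64-bit machines) can be sketched as follows. The thread does not show the hash function itself, so everything here is an assumption for illustration: `mix64()` is a generic non-cryptographic mixer standing in for whatever keyed hash the series uses, and `ptr_to_id_sketch()` and its `key` parameter are hypothetical names.

```c
#include <stdint.h>

/*
 * Stand-in 64-bit mixing function. NOT cryptographically secure and
 * not the hash used by the patch series; it only makes the example
 * self-contained.
 */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/*
 * Map an address to an identifier: mix it with a secret key, then
 * keep only the low 32 bits so the upper half prints as zeros.
 * The same address and key always give the same identifier, which is
 * what makes the output usable for telling objects apart in logs.
 */
uint64_t ptr_to_id_sketch(uint64_t addr, uint64_t key)
{
	return mix64(addr ^ key) & 0xffffffffULL;
}
```

Printed with a 16-digit zero-padded hex format, such a value keeps the familiar address width while the zeroed upper half signals that it is not a real pointer.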


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Tobin C. Harding
On Wed, Nov 29, 2017 at 03:20:58PM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding"  wrote:
> 
> > printk specifier %p now hashes all addresses before printing. Sometimes
> > we need to see the actual unmodified address. This can be achieved using
> > %lx but then we face the risk that if in future we want to change the
> > way the Kernel handles printing of pointers we will have to grep through
> > the already existent 50 000 %lx call sites. Let's add specifier %px as a
> > clear, opt-in, way to print a pointer and maintain some level of
> > isolation from all the other hex integer output within the Kernel.
> > 
> > Add printk specifier %px to print the actual unmodified address.
> > 
> > ...
> >
> > +Unmodified Addresses
> > +
> > +
> > +::
> > +
> > +   %px 01234567 or 0123456789abcdef
> > +
> > +For printing pointers when you _really_ want to print the address. Please
> > +consider whether or not you are leaking sensitive information about the
> > +Kernel layout in memory before printing pointers with %px. %px is
> > +functionally equivalent to %lx. %px is preferred to %lx because it is more
> > +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> > +handles printing pointers it will be nice to be able to find the call
> > +sites.
> > +
> 
> You might want to add a checkpatch rule which emits a stern
> do-you-really-want-to-do-this warning when someone uses %px.
> 

Oh, nice idea. It has to be a CHECK though, right? By stern, you mean use
stern language?

thanks,
Tobin.


Re: [PATCH 3/4] RFC: net: dsa: Add bindings for Realtek SMI DSAs

2017-11-29 Thread Florian Fainelli
On 11/29/2017 03:19 PM, Linus Walleij wrote:
> On Wed, Nov 29, 2017 at 10:56 PM, Andrew Lunn  wrote:
> 
>> I think the problem might be, you are using the DSA provided MDIO bus.
>> The Marvell switches has a similar setup in terms of interrupts. The
>> PHY interrupts appear within the switch. So i implemented an interrupt
>> controller, just the same as you.
>>
>> The problem is, the DSA provided MDIO bus is not linked to device
>> tree. So you cannot have phy nodes in device tree associated to it.
>>
>> What i did for the Marvell driver is that driver itself implements an
>> MDIO bus (two actually in some chips), and the internal or external
>> PHYs are placed on the switch drivers MDIO bus, rather than the DSA
>> MDIO bus. The switch driver MDIO bus links to an mdio node in device
>> tree. I can then have interrupt properties in the phys on this MDIO
>> bus in device tree.
>>
>> What actually might make sense, is to have the DSA MDIO bus look
>> inside the switches device tree node and see if there is an mdio
>> node. If so allow dsa_switch_setup() to use of_mdiobus_register()
>> instead of mdiobus_register().
> 
> Aha I think I see where my thinking went wrong.
> 
> I have been assuming (thought it was intuitive...) that ports and
> PHYs are mapped 1:1.
> 
> So I assumed the port with reg =  is also the PHY with
> reg = 
> 
> So naturally I added the PHY interrupt to the port node.
> 
> So you are saying that the PHY and the port are two
> very disparate things in DSA terminology?

Yes, because the port is some sort of simplified Ethernet MAC, whereas
the PHY is the PHY, and it usually exists in the same shape and size
irrespective of whether it is integrated into a switch, external,
or internal to a proper Ethernet NIC.

> 
> I guess all ports except the CPU port actually have
> a 1:1 mapped PHY though, am I right?

This is the typical case, but is not universally true.

> 
> Or are there in practice things such that reg is different
> on the port and the PHY connected to it? Then it makes
> much sense to put an MDIO bus inside the switch DT
> node and populate the PHY interrupts from there as you
> say.

Yes, I have such systems here, Port 0 has its PHY at MDIO address 5 for
instance.

> 
> I can take a stab at fixing that if that is what we want.

While Andrew's suggestion to use of_mdiobus_register() even for the
built-in DSA created slave_mii_bus makes sense, I would rather recommend
you instantiate your own bus (ala mv88e6xxx), such that your DT will
likely look like:

switch@0 {
	compatible = "acme,switch";
	#address-cells = <1>;
	#size-cells = <0>;

	ports {

		port@0 {
			reg = <0>;
			phy-handle = <>;
		};

		port@1 {
			reg = <1>;
			phy-handle = <>;
		};

		port@8 {
			reg = <8>;
			ethernet = <>;
		};
	};

	mdio {
		compatible = "acme,switch-mdio";

		phy@0 {
			reg = <0>;
		};

		phy@1 {
			reg = <1>;
		};
	};
};

That way it's clear which port maps to which PHY, and that the MDIO
controller is internal within the switch (and so are the PHYs).
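
The point that a port's reg and the MDIO address of its PHY need not be
the same number can also be sketched in plain C. This is only an
illustration, not DSA code: the struct, the table and the function name
are made up, and the port 0 -> PHY 5 entry simply mirrors the example
system mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only: none of these names exist in the kernel. */
struct port_phy_map {
	int port;	/* "reg" of the port node */
	int phy_addr;	/* "reg" of the PHY on the switch MDIO bus, -1 if none */
};

static const struct port_phy_map sketch_map[] = {
	{ 0, 5 },	/* port 0 uses the PHY at MDIO address 5 */
	{ 1, 1 },	/* port 1 happens to map 1:1 */
	{ 8, -1 },	/* CPU port: no PHY behind it in this sketch */
};

/* Return the MDIO address of the PHY behind a port, or -1 if there is none. */
int sketch_port_to_phy(int port)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_map) / sizeof(sketch_map[0]); i++)
		if (sketch_map[i].port == port)
			return sketch_map[i].phy_addr;
	return -1;
}
```

In a real binding the table is what phy-handle expresses: each port
points at its PHY node instead of assuming matching numbers.
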
-- 
Florian


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Andrew Morton
On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding"  wrote:

> printk specifier %p now hashes all addresses before printing. Sometimes
> we need to see the actual unmodified address. This can be achieved using
> %lx but then we face the risk that if in future we want to change the
> way the Kernel handles printing of pointers we will have to grep through
> the already existent 50 000 %lx call sites. Let's add specifier %px as a
> clear, opt-in, way to print a pointer and maintain some level of
> isolation from all the other hex integer output within the Kernel.
> 
> Add printk specifier %px to print the actual unmodified address.
> 
> ...
>
> +Unmodified Addresses
> +
> +
> +::
> +
> + %px 01234567 or 0123456789abcdef
> +
> +For printing pointers when you _really_ want to print the address. Please
> +consider whether or not you are leaking sensitive information about the
> +Kernel layout in memory before printing pointers with %px. %px is
> +functionally equivalent to %lx. %px is preferred to %lx because it is more
> +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> +handles printing pointers it will be nice to be able to find the call
> +sites.
> +

You might want to add a checkpatch rule which emits a stern
do-you-really-want-to-do-this warning when someone uses %px.



Re: [PATCH V11 3/5] printk: hash addresses printed with %p

2017-11-29 Thread Andrew Morton
On Wed, 29 Nov 2017 13:05:03 +1100 "Tobin C. Harding"  wrote:

> Currently there exist approximately 14 000 places in the kernel where
> addresses are being printed using an unadorned %p. This potentially
> leaks sensitive information regarding the Kernel layout in memory. Many
> of these calls are stale; instead of fixing every call, let's hash the
> address by default before printing. This will of course break some
> users, forcing code printing needed addresses to be updated.
> 
> Code that _really_ needs the address will soon be able to use the new
> printk specifier %px to print the address.
> 
> For what it's worth, usage of unadorned %p can be broken down as
> follows (thanks to Joe Perches).
> 
> $ git grep -E '%p[^A-Za-z0-9]' | cut -f1 -d"/" | sort | uniq -c
>    1084 arch
>      20 block
>      10 crypto
>      32 Documentation
>    8121 drivers
>    1221 fs
>     143 include
>     101 kernel
>      69 lib
>     100 mm
>    1510 net
>      40 samples
>       7 scripts
>      11 security
>     166 sound
>     152 tools
>       2 virt
> 
> Add function ptr_to_id() to map an address to a 32 bit unique
> identifier. Hash any unadorned usage of specifier %p and any malformed
> specifiers.
> 
> ...
>
> @@ -1644,6 +1646,73 @@ char *device_node_string(char *buf, char *end, struct device_node *dn,
>   return widen_string(buf, buf - buf_start, end, spec);
>  }
>  
> +static bool have_filled_random_ptr_key __read_mostly;
> +static siphash_key_t ptr_key __read_mostly;
> +
> +static void fill_random_ptr_key(struct random_ready_callback *unused)
> +{
> + get_random_bytes(&ptr_key, sizeof(ptr_key));
> + /*
> +  * have_filled_random_ptr_key==true is dependent on get_random_bytes().
> +  * ptr_to_id() needs to see have_filled_random_ptr_key==true
> +  * after get_random_bytes() returns.
> +  */
> + smp_mb();
> + WRITE_ONCE(have_filled_random_ptr_key, true);
> +}

I don't think I'm seeing anything which prevents two CPUs from
initializing ptr_key at the same time.  Probably doesn't matter much...
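
If it ever did matter, one way to let only a single caller fill the key
is to claim the initialization with a compare-and-exchange before
writing. The following is a userspace sketch with C11 atomics, not the
code from the patch; all names are hypothetical, and the release/acquire
pair only plays the role that smp_mb() and WRITE_ONCE() play in the
quoted hunk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace sketch with C11 atomics; not the kernel code from the patch. */
static uint64_t sketch_key[2];
static atomic_int sketch_state;	/* 0 = empty, 1 = being filled, 2 = ready */

/*
 * Fill the key at most once.  Returns true for the single caller that
 * performed the fill, false for everyone else.
 */
bool sketch_fill_key(uint64_t k0, uint64_t k1)
{
	int expected = 0;

	/* Only one caller can move the state from 0 to 1. */
	if (!atomic_compare_exchange_strong(&sketch_state, &expected, 1))
		return false;

	sketch_key[0] = k0;
	sketch_key[1] = k1;

	/* Release: the key writes above become visible before "ready". */
	atomic_store_explicit(&sketch_state, 2, memory_order_release);
	return true;
}

/* Copy the key out, but only once it has been published as ready. */
bool sketch_read_key(uint64_t out[2])
{
	if (atomic_load_explicit(&sketch_state, memory_order_acquire) != 2)
		return false;
	out[0] = sketch_key[0];
	out[1] = sketch_key[1];
	return true;
}
```

A second call to sketch_fill_key() returns false and leaves the key
untouched, which is the property the question above is about.
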



Re: [PATCH V11 0/5] hash addresses printed with %p

2017-11-29 Thread Andrew Morton
On Wed, 29 Nov 2017 13:05:00 +1100 "Tobin C. Harding"  wrote:

> Currently there exist approximately 14 000 places in the Kernel where
> addresses are being printed using an unadorned %p. This potentially
> leaks sensitive information regarding the Kernel layout in memory. Many
> of these calls are stale; instead of fixing every call, let's hash the
> address by default before printing. This will of course break some
> users, forcing code printing needed addresses to be updated. We can add
> a printk specifier for this purpose (%px) to give developers a clear
> upgrade path for breakages caused by applying this patch set.
> 
> The added advantage of hashing %p is that security is now opt-out, if
> you _really_ want the address you have to work a little harder and use
> %px.
> 
> The idea for creating the printk specifier %px to print the actual
> address was suggested by Kees Cook (see below for email threads by
> subject).

Maybe I'm being thick, but...  if we're rendering these addresses
unusable by hashing them, why not just print something like
"" in their place?  That loses the uniqueness thing but I
wonder how valuable that is in practice?
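
For reference, the "uniqueness thing" amounts to this: a keyed hash hides
the address but still lets a reader see whether two log lines refer to
the same object. A toy sketch follows. The mixer below is an assumption
made for illustration, it is not siphash and not secure, and the function
name is made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy keyed mixer standing in for the siphash used by the patch.
 * It is NOT cryptographically secure; it only shows the idea.
 */
uint32_t sketch_ptr_to_id(const void *ptr, uint64_t key)
{
	uint64_t x = (uint64_t)(uintptr_t)ptr ^ key;

	x ^= x >> 33;
	x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33;
	return (uint32_t)x;	/* 32 bit identifier, as in ptr_to_id() */
}
```

The same pointer always yields the same identifier for a given key, so
equality survives; distances between pointers and the region they point
into do not.
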




Re: [PATCH 3/4] RFC: net: dsa: Add bindings for Realtek SMI DSAs

2017-11-29 Thread Linus Walleij
On Wed, Nov 29, 2017 at 10:56 PM, Andrew Lunn  wrote:

> I think the problem might be, you are using the DSA provided MDIO bus.
> The Marvell switches have a similar setup in terms of interrupts. The
> PHY interrupts appear within the switch. So i implemented an interrupt
> controller, just the same as you.
>
> The problem is, the DSA provided MDIO bus is not linked to device
> tree. So you cannot have phy nodes in device tree associated to it.
>
> What i did for the Marvell driver is that driver itself implements an
> MDIO bus (two actually in some chips), and the internal or external
> PHYs are placed on the switch driver's MDIO bus, rather than the DSA
> MDIO bus. The switch driver MDIO bus links to an mdio node in device
> tree. I can then have interrupt properties in the phys on this MDIO
> bus in device tree.
>
> What actually might make sense, is to have the DSA MDIO bus look
> inside the switch's device tree node and see if there is an mdio
> node. If so allow dsa_switch_setup() to use of_mdiobus_register()
> instead of mdiobus_register().

Aha I think I see where my thinking went wrong.

I have been assuming (thought it was intuitive...) that ports and
PHYs are mapped 1:1.

So I assumed the port with reg =  is also the PHY with
reg = 

So naturally I added the PHY interrupt to the port node.

So you are saying that the PHY and the port are two
very disparate things in DSA terminology?

I guess all ports except the CPU port actually have
a 1:1 mapped PHY though, am I right?

Or are there in practice things such that reg is different
on the port and the PHY connected to it? Then it makes
much sense to put an MDIO bus inside the switch DT
node and populate the PHY interrupts from there as you
say.

I can take a stab at fixing that if that is what we want.

Yours,
Linus Walleij


Re: [PATCH v4 7/8] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-29 Thread David Daney

On 11/29/2017 02:56 PM, Andrew Lunn wrote:

On Tue, Nov 28, 2017 at 04:55:39PM -0800, David Daney wrote:

+static int bgx_probe(struct platform_device *pdev)
+{
+   struct mac_platform_data platform_data;
+   const __be32 *reg;
+   u32 port;
+   u64 addr;
+   struct device_node *child;
+   struct platform_device *new_dev;
+   struct platform_device *pki_dev;
+   int numa_node, interface;
+   int i;
+   int r = 0;
+   char id[64];
+   u64 data;
+
+   reg = of_get_property(pdev->dev.of_node, "reg", NULL);
+   addr = of_translate_address(pdev->dev.of_node, reg);
+   interface = (addr >> 24) & 0xf;
+   numa_node = (addr >> 36) & 0x7;


Hi David

You have these two a few times in the code. Maybe add a helper to do
it? The NUMA one i assume could go somewhere in the SoC code?



Thanks for looking at it, I will try with helpers.


The rest of your comments below raise valid points, I will fix those too.





+static int bgx_mix_init_from_fdt(void)
+{
+   struct device_node  *node;
+   struct device_node  *parent = NULL;
+   int mix = 0;



+   /* Get the lmac index */
+   reg = of_get_property(lmac_fdt_node, "reg", NULL);
+   if (!reg)
+   goto err;
+
+   mix_port_lmacs[mix].lmac = *reg;


I don't think of_get_property() deals with endianness. Is there any
danger of this driver being used on hardware with the other endianness
to what you have tested?


+/**
+ * bgx_pki_init_from_param - Initialize the list of lmacs that connect to the
+ *  pki from information in the "pki_port" parameter.
+ *
+ *  The pki_port parameter format is as follows:
+ *  pki_port=nbl
+ *  where:
+ * n = node
+ * b = bgx
+ * l = lmac
+ *
+ *  Commas must be used to separate multiple lmacs:
+ *  pki_port=000,100,110
+ *
+ *  Asterisks (*) specify all possible characters in
+ *  the subset:
+ *  pki_port=00* (all lmacs of node0 bgx0).
+ *
+ *  Missing lmacs identifiers default to all
+ *  possible characters in the subset:
+ *  pki_port=00 (all lmacs on node0 bgx0)
+ *
+ *  Brackets ('[' and ']') specify the valid
+ *  characters in the subset:
+ *  pki_port=00[01] (lmac0 and lmac1 of node0 bgx0).
+ *
+ * Returns 0 if successful.
+ * Returns <0 for error codes.


I've not used kerneldoc much, but i suspect this is wrongly formatted:

https://www.kernel.org/doc/html/v4.9/kernel-documentation.html#function-documentation


+int bgx_port_ethtool_set_settings(struct net_device *netdev,
+                                  struct ethtool_cmd *cmd)
+{
+   struct bgx_port_priv *p = bgx_port_netdev2priv(netdev);
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;


Not required. The core enforces this. See dev_ethtool()


+
+   if (p->phydev)
+   return phy_ethtool_sset(p->phydev, cmd);
+
+   return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(bgx_port_ethtool_set_settings);
+
+int bgx_port_ethtool_nway_reset(struct net_device *netdev)
+{
+   struct bgx_port_priv *p = bgx_port_netdev2priv(netdev);
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;


Also not needed.


+static void bgx_port_adjust_link(struct net_device *netdev)
+{
+   struct bgx_port_priv *priv = bgx_port_netdev2priv(netdev);
+   bool link_changed = false;
+   unsigned int link;
+   unsigned int speed;
+   unsigned int duplex;
+
+   mutex_lock(&priv->lock);
+
+   if (!priv->phydev->link && priv->last_status.link)
+   link_changed = true;
+
+   if (priv->phydev->link &&
+   (priv->last_status.link != priv->phydev->link ||
+priv->last_status.duplex != priv->phydev->duplex ||
+priv->last_status.speed != priv->phydev->speed))
+   link_changed = true;
+
+   link = priv->phydev->link;
+   priv->last_status.link = priv->phydev->link;
+
+   speed = priv->phydev->speed;
+   priv->last_status.speed = priv->phydev->speed;
+
+   duplex = priv->phydev->duplex;
+   priv->last_status.duplex = priv->phydev->duplex;
+
+   mutex_unlock(&priv->lock);
+
+   if (link_changed) {
+   struct port_status status;
+
+   phy_print_status(priv->phydev);
+
+   status.link = link ? 1 : 0;
+   status.duplex = duplex;
+   status.speed = speed;
+   if (!link) {
+   netif_carrier_off(netdev);
+

Re: [PATCH v4 7/8] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-29 Thread Andrew Lunn
On Tue, Nov 28, 2017 at 04:55:39PM -0800, David Daney wrote:
> +static int bgx_probe(struct platform_device *pdev)
> +{
> + struct mac_platform_data platform_data;
> + const __be32 *reg;
> + u32 port;
> + u64 addr;
> + struct device_node *child;
> + struct platform_device *new_dev;
> + struct platform_device *pki_dev;
> + int numa_node, interface;
> + int i;
> + int r = 0;
> + char id[64];
> + u64 data;
> +
> + reg = of_get_property(pdev->dev.of_node, "reg", NULL);
> + addr = of_translate_address(pdev->dev.of_node, reg);
> + interface = (addr >> 24) & 0xf;
> + numa_node = (addr >> 36) & 0x7;

Hi David

You have these two a few times in the code. Maybe add a helper to do
it? The NUMA one i assume could go somewhere in the SoC code?
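
Something along these lines, as a rough userspace sketch. The helper
names are made up; the shifts and masks are the ones from the quoted
bgx_probe().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; bit positions taken from bgx_probe(). */
int sketch_addr_to_interface(uint64_t addr)
{
	return (addr >> 24) & 0xf;	/* bits 27..24 */
}

int sketch_addr_to_numa_node(uint64_t addr)
{
	return (addr >> 36) & 0x7;	/* bits 38..36 */
}
```

Each call site then reads as a named operation instead of repeating the
magic shifts.
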

> +static int bgx_mix_init_from_fdt(void)
> +{
> + struct device_node  *node;
> + struct device_node  *parent = NULL;
> + int mix = 0;

> + /* Get the lmac index */
> + reg = of_get_property(lmac_fdt_node, "reg", NULL);
> + if (!reg)
> + goto err;
> +
> + mix_port_lmacs[mix].lmac = *reg;

I don't think of_get_property() deals with endianness. Is there any
danger of this driver being used on hardware with the other endianness
to what you have tested?
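
Device tree cells are stored big-endian, so dereferencing the raw
property pointer only gives the right value on a big-endian CPU. In the
kernel, be32_to_cpup() or of_property_read_u32() take care of that. A
userspace sketch of the conversion, with a made-up function name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Portable big-endian 32-bit read, a userspace stand-in for what
 * be32_to_cpup() does in the kernel.
 */
uint32_t sketch_be32_to_cpu(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

This gives the same result regardless of the endianness of the machine
it runs on.
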

> +/**
> + * bgx_pki_init_from_param - Initialize the list of lmacs that connect to the
> + *  pki from information in the "pki_port" parameter.
> + *
> + *  The pki_port parameter format is as follows:
> + *  pki_port=nbl
> + *  where:
> + *   n = node
> + *   b = bgx
> + *   l = lmac
> + *
> + *  Commas must be used to separate multiple lmacs:
> + *  pki_port=000,100,110
> + *
> + *  Asterisks (*) specify all possible characters in
> + *  the subset:
> + *  pki_port=00* (all lmacs of node0 bgx0).
> + *
> + *  Missing lmacs identifiers default to all
> + *  possible characters in the subset:
> + *  pki_port=00 (all lmacs on node0 bgx0)
> + *
> + *  Brackets ('[' and ']') specify the valid
> + *  characters in the subset:
> + *  pki_port=00[01] (lmac0 and lmac1 of node0 bgx0).
> + *
> + * Returns 0 if successful.
> + * Returns <0 for error codes.

I've not used kerneldoc much, but i suspect this is wrongly formatted:

https://www.kernel.org/doc/html/v4.9/kernel-documentation.html#function-documentation

> +int bgx_port_ethtool_set_settings(struct net_device  *netdev,
> +   struct ethtool_cmd*cmd)
> +{
> + struct bgx_port_priv *p = bgx_port_netdev2priv(netdev);
> +
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;

Not required. The core enforces this. See dev_ethtool()

> +
> + if (p->phydev)
> + return phy_ethtool_sset(p->phydev, cmd);
> +
> + return -EOPNOTSUPP;
> +}
> +EXPORT_SYMBOL(bgx_port_ethtool_set_settings);
> +
> +int bgx_port_ethtool_nway_reset(struct net_device *netdev)
> +{
> + struct bgx_port_priv *p = bgx_port_netdev2priv(netdev);
> +
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;

Also not needed.

> +static void bgx_port_adjust_link(struct net_device *netdev)
> +{
> + struct bgx_port_priv *priv = bgx_port_netdev2priv(netdev);
> + bool link_changed = false;
> + unsigned int link;
> + unsigned int speed;
> + unsigned int duplex;
> +
> + mutex_lock(&priv->lock);
> +
> + if (!priv->phydev->link && priv->last_status.link)
> + link_changed = true;
> +
> + if (priv->phydev->link &&
> + (priv->last_status.link != priv->phydev->link ||
> +  priv->last_status.duplex != priv->phydev->duplex ||
> +  priv->last_status.speed != priv->phydev->speed))
> + link_changed = true;
> +
> + link = priv->phydev->link;
> + priv->last_status.link = priv->phydev->link;
> +
> + speed = priv->phydev->speed;
> + priv->last_status.speed = priv->phydev->speed;
> +
> + duplex = priv->phydev->duplex;
> + priv->last_status.duplex = priv->phydev->duplex;
> +
> + mutex_unlock(&priv->lock);
> +
> + if (link_changed) {
> + struct port_status status;
> +
> + phy_print_status(priv->phydev);
> +
> + status.link = link ? 1 : 0;
> + status.duplex = duplex;
> + status.speed = speed;
> + if (!link) {
> + netif_carrier_off(netdev);
> +  /* Let TX drain. FIXME check that it is drained. */
> + mdelay(50);
> + }
> +  

Re: [BUG] kernel stack corruption during/after Netlabel error

2017-11-29 Thread Eric Dumazet
On Wed, 2017-11-29 at 12:23 -0800, Eric Dumazet wrote:
> 
> I suspect this exposes an ancient bug, caused by fact that TCP moves
> IP[6]CB in skb->cb[]
> 
> Basically the 2nd tcp_filter() added in commit
> 8fac365f63c866a00015fa13932d8ffc584518b8
> ("tcp: Add a tcp_filter hook before handle ack packet") was not
> expecting selinux code being called a 2nd time,
> while skb->cb[] has been mangled [1]
> 
> [1]
> memmove(&TCP_SKB_CB(skb)->header.h4, IPCB(skb),
> sizeof(struct inet_skb_parm));

Please try this fix for IPv4 (a similar patch will be needed for IPv6)

 net/ipv4/tcp_ipv4.c |   51 ++
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c6bc0c4d19c624888b0d0b5a4246c7183edf63f5..912928105942b9714dda9132e45961ab1baf0852 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1591,6 +1591,28 @@ int tcp_filter(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_filter);
 
+static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
+  const struct tcphdr *th)
+{
+   /* This is tricky : We move IPCB at its correct location into TCP_SKB_CB()
+    * barrier() makes sure compiler wont play fool^Waliasing games.
+    */
+   memmove(&TCP_SKB_CB(skb)->header.h4, IPCB(skb),
+   sizeof(struct inet_skb_parm));
+   barrier();
+
+   TCP_SKB_CB(skb)->seq = ntohl(th->seq);
+   TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+   skb->len - th->doff * 4);
+   TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
+   TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
+   TCP_SKB_CB(skb)->tcp_tw_isn = 0;
+   TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+   TCP_SKB_CB(skb)->sacked  = 0;
+   TCP_SKB_CB(skb)->has_rxtstamp =
+   skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+}
+
 /*
  * From tcp_input.c
  */
@@ -1631,24 +1653,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
-   /* This is tricky : We move IPCB at its correct location into TCP_SKB_CB()
-    * barrier() makes sure compiler wont play fool^Waliasing games.
-    */
-   memmove(&TCP_SKB_CB(skb)->header.h4, IPCB(skb),
-   sizeof(struct inet_skb_parm));
-   barrier();
-
-   TCP_SKB_CB(skb)->seq = ntohl(th->seq);
-   TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
-   skb->len - th->doff * 4);
-   TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-   TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
-   TCP_SKB_CB(skb)->tcp_tw_isn = 0;
-   TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
-   TCP_SKB_CB(skb)->sacked  = 0;
-   TCP_SKB_CB(skb)->has_rxtstamp =
-   skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
-
 lookup:
sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
                       th->dest, sdif, &refcounted);
@@ -1679,8 +1683,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
sock_hold(sk);
refcounted = true;
nsk = NULL;
-   if (!tcp_filter(sk, skb))
+   if (!tcp_filter(sk, skb)) {
+   th = (const struct tcphdr *)skb->data;
+   iph = ip_hdr(skb);
+   tcp_v4_fill_cb(skb, iph, th);
nsk = tcp_check_req(sk, skb, req, false);
+   }
if (!nsk) {
reqsk_put(req);
goto discard_and_relse;
@@ -1712,6 +1720,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
goto discard_and_relse;
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
+   tcp_v4_fill_cb(skb, iph, th);
 
skb->dev = NULL;
 
@@ -1742,6 +1751,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard_it;
 
+   tcp_v4_fill_cb(skb, iph, th);
+
if (tcp_checksum_complete(skb)) {
 csum_error:
__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
@@ -1768,6 +1779,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
goto discard_it;
}
 
+   tcp_v4_fill_cb(skb, iph, th);
+
if (tcp_checksum_complete(skb)) {
inet_twsk_put(inet_twsk(sk));
goto csum_error;


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Linus Torvalds
On Wed, Nov 29, 2017 at 2:28 PM, Kees Cook  wrote:
>
> In the future, maybe we could have a knob: unhashed, hashed (default),
> or zeroed.

I haven't actually seen a case for that yet.

Let's see if there are actually any debug issues at all, and how big
they are before worrying about it.

   Linus


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Linus Torvalds
On Wed, Nov 29, 2017 at 7:58 AM, David Miller  wrote:
>
> We're talking about making sure that loading "ppp.ko" really gets
> ppp.ko rather than some_other_module.ko renamed to ppp.ko via some
> other mechanism.
>
> Both modules have legitimate signatures so the kernel will happily
> load both.

Yes. We could make the module name be part of the signing process, but
one problem with that is that at module loading time we don't actually
have the filename any more.

User space opens the file and then just feeds the data to the kernel.
So if you fooled modprobe into feeding the wrong module, that's it.

And yes, we can obviously embed the module name into the ELF headers
(that is all part of the signed payload), but the module name doesn't
actually necessarily match what we originally asked for.

Why? Module aliases and module dependencies - which is why we have
that user mode side at all. When we do "request_module(XYZ)" we don't
necessarily know what the dependencies are, so we expect modprobe to
just load the right modules.

So if modprobe then loads some other module (dccp or whatever), the
kernel has no real way to know "oh, that wasn't part of the dependency
chain for the module we asked for".

Now, if modprobe is taught to check that the filename of the module
that it opens actually matches the metadata in the ELF sections, that
would solve it, but it's out of the kernel's hands.
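
As a rough idea of what such a user-space check could look like, here is
a sketch in plain C. It is hypothetical and not taken from modprobe: the
function names are made up, treating '-' and '_' as equal is an
assumption about how module names are normalised, and compressed module
files are not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Treat '-' and '_' as the same character (assumed normalisation rule). */
static char sketch_norm(char c)
{
	return c == '-' ? '_' : c;
}

/*
 * Hypothetical check: does the file name in 'path' (minus directory and
 * a trailing ".ko") equal the module name embedded in the signed payload?
 */
bool sketch_module_name_matches(const char *path, const char *embedded)
{
	const char *base = strrchr(path, '/');
	size_t len, i;

	base = base ? base + 1 : path;
	len = strlen(base);
	if (len > 3 && strcmp(base + len - 3, ".ko") == 0)
		len -= 3;

	if (strlen(embedded) != len)
		return false;
	for (i = 0; i < len; i++)
		if (sketch_norm(base[i]) != sketch_norm(embedded[i]))
			return false;
	return true;
}
```

A file called ppp.ko whose embedded name is "dccp" would be rejected by
this comparison, which is the renamed-module case discussed above.
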

 Linus


RE: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Roberts, William C


> -Original Message-
> From: keesc...@google.com [mailto:keesc...@google.com] On Behalf Of Kees
> Cook
> Sent: Wednesday, November 29, 2017 2:28 PM
> To: David Laight 
> Cc: Linus Torvalds ; Tobin C. Harding
> ; kernel-harden...@lists.openwall.com; Jason A. Donenfeld
> ; Theodore Ts'o ; Paolo Bonzini
> ; Tycho Andersen ; Roberts, William C
> ; Tejun Heo ; Jordan Glover
> ; Greg KH ;
> Petr Mladek ; Joe Perches ; Ian
> Campbell ; Sergey Senozhatsky
> ; Catalin Marinas ;
> Will Deacon ; Steven Rostedt ;
> Chris Fries ; Dave Weinstein ; Daniel
> Micay ; Djalal Harouni ; Radim
> Krcmár ; Linux Kernel Mailing List  ker...@vger.kernel.org>; Network Development ;
> David Miller ; Stephen Rothwell
> ; Andrey Ryabinin ;
> Alexander Potapenko ; Dmitry Vyukov
> ; Andrew Morton 
> Subject: Re: [PATCH V11 4/5] vsprintf: add printk specifier %px
> 
> On Wed, Nov 29, 2017 at 2:07 AM, David Laight 
> wrote:
> > From: Linus Torvalds
> >> Sent: 29 November 2017 02:29
> >>
> >> On Tue, Nov 28, 2017 at 6:05 PM, Tobin C. Harding  wrote:
> >> >
> >> >Let's add specifier %px as a
> >> > clear, opt-in, way to print a pointer and maintain some level of
> >> > isolation from all the other hex integer output within the Kernel.
> >>
> >> Yes, I like this model. It's easy and it's obvious ("'x' for hex"),
> >> and it gives people a good way to say "yes, I really want the actual
> >> address as hex" for if/when the hashed pointer doesn't work for some
> >> reason.
> >
> > Remind me to change every %p to %px on kernels that support it.
> >
> > Although the absolute values of pointers may not be useful, knowing
> > that two pointers differ by a small amount is useful.
> > It is also useful to know whether pointers are to stack, code, static
> > data or heap.
> >
> > This change to %p is going to make debugging a nightmare.
> 
> In the future, maybe we could have a knob: unhashed, hashed (default), or
> zeroed.

Isn't that just kptr_restrict, which gets us right back to the simpler
patches I proposed?

> 
> -Kees
> 
> --
> Kees Cook
> Pixel Security


Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Kees Cook
On Wed, Nov 29, 2017 at 2:07 AM, David Laight  wrote:
> From: Linus Torvalds
>> Sent: 29 November 2017 02:29
>>
>> On Tue, Nov 28, 2017 at 6:05 PM, Tobin C. Harding  wrote:
>> >
>> >Let's add specifier %px as a
>> > clear, opt-in, way to print a pointer and maintain some level of
>> > isolation from all the other hex integer output within the Kernel.
>>
>> Yes, I like this model. It's easy and it's obvious ("'x' for hex"),
>> and it gives people a good way to say "yes, I really want the actual
>> address as hex" for if/when the hashed pointer doesn't work for some
>> reason.
>
> Remind me to change every %p to %px on kernels that support it.
>
> Although the absolute values of pointers may not be useful, knowing
> that two pointers differ by a small amount is useful.
> It is also useful to know whether pointers are to stack, code, static
> data or heap.
>
> This change to %p is going to make debugging a nightmare.

In the future, maybe we could have a knob: unhashed, hashed (default),
or zeroed.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v4 7/8] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-29 Thread Andrew Lunn
On Wed, Nov 29, 2017 at 10:11:38PM +0300, Dan Carpenter wrote:
> On Wed, Nov 29, 2017 at 09:37:15PM +0530, Souptick Joarder wrote:
> > >> +static int bgx_port_sgmii_set_link_speed(struct bgx_port_priv *priv, struct port_status status)
> > >> +{
> > >> +   u64 data;
> > >> +   u64 prtx;
> > >> +   u64 miscx;
> > >> +   int timeout;
> > >> +
> > 
> > >> +
> > >> +   switch (status.speed) {
> > >> +   case 10:
> > >
> > > In my opinion, instead of hard coding the value, is it fine to use ENUM ?
> > > Similar comments applicable in other places where hard coded values are
> > > used.
> > 
> 
> 10 means 10M right?  That's not really a magic number.  It's fine.

There are also:
uapi/linux/ethtool.h:#define SPEED_10       10
uapi/linux/ethtool.h:#define SPEED_100      100
uapi/linux/ethtool.h:#define SPEED_1000     1000
uapi/linux/ethtool.h:#define SPEED_10000    10000
uapi/linux/ethtool.h:#define SPEED_100000   100000

 Andrew


Re: Sending 802.1Q packets using AF_PACKET socket on filtered bridge forwards with wrong MAC addresses

2017-11-29 Thread Brandon Carpenter
I narrowed the search to a memmove() called from
skb_reorder_vlan_header() in net/core/skbuff.c.

> memmove(skb->data - ETH_HLEN, skb->data - skb->mac_len - VLAN_HLEN,
>2 * ETH_ALEN);

Calling skb_reset_mac_len() after skb_reset_mac_header() before
calling br_allowed_ingress() in net/bridge/br_device.c fixes the
problem.

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index af5b8c87f590..e10131e2f68f 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -58,6 +58,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
BR_INPUT_SKB_CB(skb)->brdev = dev;

skb_reset_mac_header(skb);
+   skb_reset_mac_len(skb);
eth = eth_hdr(skb);
skb_pull(skb, ETH_HLEN);
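
To show why a stale mac_len matters for that memmove(), here is a small
userspace model. It is a sketch under assumptions: the SK_* constants
and function names are made up, the frame is just bytes whose value
equals their offset, and the stale mac_len value of 0 is picked only for
illustration (mac_len must not exceed 14 in this model).

```c
#include <assert.h>
#include <string.h>

#define SK_ETH_ALEN	6
#define SK_ETH_HLEN	14
#define SK_VLAN_HLEN	4

/*
 * Userspace model of the memmove() quoted above.  'data' points just past
 * the VLAN tag, 'mac_len' is whatever length was last recorded for the
 * MAC header.
 */
void sketch_reorder_vlan_header(unsigned char *data, unsigned int mac_len)
{
	memmove(data - SK_ETH_HLEN, data - mac_len - SK_VLAN_HLEN,
		2 * SK_ETH_ALEN);
}

/* Returns 1 if the MAC addresses survive the move intact, 0 otherwise. */
int sketch_macs_survive(unsigned int mac_len)
{
	unsigned char frame[32];
	unsigned char macs[2 * SK_ETH_ALEN];
	unsigned int i;

	for (i = 0; i < sizeof(frame); i++)
		frame[i] = (unsigned char)i;	/* byte value == offset */
	memcpy(macs, frame, sizeof(macs));	/* dst + src MAC */

	/* data sits after the 14-byte Ethernet header and the 4-byte tag */
	sketch_reorder_vlan_header(frame + SK_ETH_HLEN + SK_VLAN_HLEN, mac_len);

	return memcmp(frame + SK_VLAN_HLEN, macs, sizeof(macs)) == 0;
}
```

With mac_len equal to the Ethernet header length the addresses are moved
over the tag as intended; with the stale value the copy starts in the
wrong place and the addresses are replaced by other frame bytes.
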


I'll put together an official patch and submit it. Should I use
another email account? Are my emails being ignored because of that
stupid disclaimer my employer attaches to my messages (outside my
control)?

Brandon

-- 


CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is 
for the sole use of the intended recipient(s) and may contain proprietary, 
confidential or privileged information or otherwise be protected by law. 
Any unauthorized review, use, disclosure or distribution is prohibited. If 
you are not the intended recipient, please notify the sender and destroy 
all copies and the original message.


Re: [PATCH 3/4] RFC: net: dsa: Add bindings for Realtek SMI DSAs

2017-11-29 Thread Andrew Lunn
Hi Linus

> Just that the PHYs are on the MDIO bus inside the switch, of
> course.

I think the problem might be, you are using the DSA provided MDIO bus.
The Marvell switches have a similar setup in terms of interrupts. The
PHY interrupts appear within the switch. So i implemented an interrupt
controller, just the same as you.

The problem is, the DSA provided MDIO bus is not linked to device
tree. So you cannot have phy nodes in device tree associated to it.

What i did for the Marvell driver is that driver itself implements an
MDIO bus (two actually in some chips), and the internal or external
PHYs are placed on the switch driver's MDIO bus, rather than the DSA
MDIO bus. The switch driver MDIO bus links to an mdio node in device
tree. I can then have interrupt properties in the phys on this MDIO
bus in device tree.

What actually might make sense, is to have the DSA MDIO bus look
inside the switch's device tree node and see if there is an mdio
node. If so allow dsa_switch_setup() to use of_mdiobus_register()
instead of mdiobus_register().

  Andrew


Re: [PATCH 3/4] RFC: net: dsa: Add bindings for Realtek SMI DSAs

2017-11-29 Thread Florian Fainelli
On 11/29/2017 01:28 PM, Linus Walleij wrote:
> On Wed, Nov 29, 2017 at 4:56 PM, Andrew Lunn  wrote:
>>> I have the phy-handle in the ethernet controller. This RTL8366RB
>>> thing is just one big PHY as far as I know.
>>
>> We don't model switches as PHYs. They are their own device type.  And
>> the internal or external PHYs are just normal PHYs in the linux
>> model. Meaning their interrupt properties goes in the PHY node in
>> device tree, as documented in the phy.txt binding documentation.
> 
> I do model the PHYs on the switch as PHYs.
> They are using the driver in drivers/net/phy/realtek.c.

That's good.

> 
> The interrupts are assigned to the PHYs not to the Switch.
> Just that the PHYs are on the MDIO bus inside the switch, of
> course.
> 
> The switch however provides an irqchip to demux the interrupts.
> 
> I think there is some misunderstanding in what I'm trying to do..
> 
> I have tried learning the DSA ideas by reading e.g. your paper:
> https://www.netdevconf.org/2.1/papers/distributed-switch-architecture.pdf
> 
> So I try my best to conform with these ideas.
> 
> I however have a hard time testing things since I don't really have a
> system to compare to. What would be useful is to know how
> commands like "ip" and "ifconfig" are used on a typical
> say home router.

There is a mock-up driver: drivers/net/dsa/dsa_loop.c which does not
pass any packets, but at least allows you to exercise user-space tools
and so on.
-- 
Florian


Re: [PATCH 3/4] RFC: net: dsa: Add bindings for Realtek SMI DSAs

2017-11-29 Thread Linus Walleij
On Wed, Nov 29, 2017 at 4:56 PM, Andrew Lunn  wrote:
>> I have the phy-handle in the ethernet controller. This RTL8366RB
>> thing is just one big PHY as far as I know.
>
> We don't model switches as PHYs. They are their own device type.  And
> the internal or external PHYs are just normal PHYs in the linux
> model. Meaning their interrupt properties goes in the PHY node in
> device tree, as documented in the phy.txt binding documentation.

I do model the PHYs on the switch as PHYs.
They are using the driver in drivers/net/phy/realtek.c.

The interrupts are assigned to the PHYs not to the Switch.
Just that the PHYs are on the MDIO bus inside the switch, of
course.

The switch however provides an irqchip to demux the interrupts.

I think there is some misunderstanding in what I'm trying to do..

I have tried learning the DSA ideas by reading e.g. your paper:
https://www.netdevconf.org/2.1/papers/distributed-switch-architecture.pdf

So I try my best to conform with these ideas.

I however have a hard time testing things since I don't really have a
system to compare to. What would be useful is to know how
commands like "ip" and "ifconfig" are used on a typical
say home router.

Yours,
Linus Walleij


Re: [EXT] Re: [PATCH net] net: phylink: fix link state on phy-connect

2017-11-29 Thread Russell King - ARM Linux
On Wed, Nov 29, 2017 at 09:06:56PM +0000, Yan Markman wrote:
> The attached p21 patch doesn't change anything.
> But the other one, from the mail text, is good:
>   void phylink_disconnect_phy(struct phylink *pl)
>   +   pl->phy_state.link = false;
> 
> There is still a problem (not for my MRVL-PP2):
>   on ifconfig down, the callback
>   pl->ops->mac_link_down(ndev, pl->link_an_mode);
>   is expected to be called, but it isn't.

Are you calling phylink_stop() or are you just calling phylink_disconnect() ?

You must call phylink_stop() prior to phylink_disconnect().  This
probably explains why the p21 patch did nothing.
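
The ordering requirement above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: struct mock_link and
the mock_* functions are hypothetical stand-ins, not the phylink API, and the
model assumes that only the "stop" step notifies the MAC that the link went
down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel API. */
struct mock_link {
	bool started;        /* set while the interface is up */
	bool phy_attached;   /* a PHY is currently connected */
	bool phy_link;       /* cached PHY link state */
	bool mac_link_up;    /* what the MAC side believes */
	int  mac_down_calls; /* how often the MAC was told "link down" */
};

/* Models the "stop" step: take the MAC link down synchronously. */
static void mock_stop(struct mock_link *l)
{
	assert(l != NULL);
	if (l->started && l->mac_link_up) {
		l->mac_link_up = false;
		l->mac_down_calls++;
	}
	l->started = false;
}

/*
 * Models the "disconnect" step: detach the PHY and clear the cached
 * link state. In this model it never notifies the MAC.
 */
static void mock_disconnect(struct mock_link *l)
{
	assert(l != NULL);
	l->phy_attached = false;
	l->phy_link = false;
}

/* Teardown in the order described above: stop first, then disconnect. */
static void mock_teardown(struct mock_link *l)
{
	mock_stop(l);
	mock_disconnect(l);
}
```

In this model, calling only mock_disconnect() leaves mac_link_up set and
mac_down_calls at zero, which mirrors the "callback is not called" symptom.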

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


RE: [EXT] Re: [PATCH net] net: phylink: fix link state on phy-connect

2017-11-29 Thread Yan Markman
The attached p21 patch doesn't change anything.
But the other one, from the mail text, is good:
void phylink_disconnect_phy(struct phylink *pl)
+   pl->phy_state.link = false;

There is still a problem (not for my MRVL-PP2):
It is expected that on ifconfig-down the callback
pl->ops->mac_link_down(ndev, pl->link_an_mode);
would be called, but it isn't.


-Original Message-
From: Russell King - ARM Linux [mailto:li...@armlinux.org.uk] 
Sent: Wednesday, November 29, 2017 9:59 PM
To: Yan Markman 
Cc: Antoine Tenart ; and...@lunn.ch; 
f.faine...@gmail.com; da...@davemloft.net; gregory.clem...@free-electrons.com; 
thomas.petazz...@free-electrons.com; miquel.ray...@free-electrons.com; Nadav 
Haklai ; m...@semihalf.com; Stefan Chulski 
; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: [EXT] Re: [PATCH net] net: phylink: fix link state on phy-connect

External Email

--
On Wed, Nov 29, 2017 at 07:33:44PM +, Yan Markman wrote:
> Hi Russell
> 
> On my board I have a [Marvell 88E1510] PHY working with STATUS-POLLING. I 
> see some inconsistencies: the first ifconfig-up behaves differently from later 
> ones, and there are no "link is down" reports.
> Please refer to the behavior example below.
> My patch is a "simple solution": always reset/clear the link-state parameters 
> before going UP.
> Possibly, a more correct (but much more complicated) solution would be to 
> modify the phy state machine and the phylink resolve logic.
> I just found that 
> on ifconfig-down, the phy-state-machine and phylink-resolve
> are stopped before passing through a full graceful down/reset 
> state.
> The subsequent ifconfig-up starts with the old state parameters.
> Special cases are not tested, but 2 logical test cases are:
>    the remote side changes speed whilst the link is Down or Disconnected, but 
> the local ifconfig-up starts with the old speed.

Hi,

I think this is covered in my "phy" branch - but could probably do with further 
testing, specifically this patch (which I've attached):

"phylink: ensure we take the link down when phylink_stop() is called"

This takes the link down on the MAC side synchronously when phylink_stop() is 
called.  However, I think your case might also benefit from this patch - please 
test the patch referred to without this change, and let me know if you need 
this change to solve your problem:

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 8f43f8779317..c90ad50204b0 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -798,6 +798,7 @@ void phylink_disconnect_phy(struct phylink *pl)
mutex_lock(&pl->state_mutex);
pl->netdev->phydev = NULL;
pl->phydev = NULL;
+   pl->phy_state.link = false;
mutex_unlock(&pl->state_mutex);
mutex_unlock(&phy->lock);
flush_work(&pl->resolve);

Thanks.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up 
According to speedtest.net: 8.21Mbps down 510kbps up


Re: [PATCH 1/2] net: phy: core: use genphy version of callbacks read_status and config_aneg per default

2017-11-29 Thread David Miller
From: Heiner Kallweit 
Date: Wed, 29 Nov 2017 21:47:16 +0100

> Am 15.11.2017 um 22:56 schrieb Florian Fainelli:
>> On 11/15/2017 01:42 PM, Heiner Kallweit wrote:
>>> read_status and config_aneg are the only mandatory callbacks and most
>>> of the time the generic implementation is used by drivers.
>>> So make the core fall back to the generic version if a driver doesn't
>>> implement the respective callback.
>>>
>>> Also currently the core doesn't seem to verify that drivers implement
>>> the mandatory calls. If a driver doesn't do so we'd just get a NPE.
>> 
>> Right, which is not an unusual way to signal that something is mandatory.
>> 
>>> With this patch this potential issue doesn't exist any longer.
>>>
>>> Signed-off-by: Heiner Kallweit 
>> 
>> Reviewed-by: Florian Fainelli 
>> 
>> Note that net-next is closed at the moment, so you will have to resubmit
>> this when the tree opens back again.
>> 
> I see that the two patches have status "deferred" in patchwork.
> So do I have to actually resubmit or are they going to be picked up
> from patchwork?

You must resubmit when the net-next tree opens back up.


Re: [PATCH 1/2] net: phy: core: use genphy version of callbacks read_status and config_aneg per default

2017-11-29 Thread Heiner Kallweit
Am 15.11.2017 um 22:56 schrieb Florian Fainelli:
> On 11/15/2017 01:42 PM, Heiner Kallweit wrote:
>> read_status and config_aneg are the only mandatory callbacks and most
>> of the time the generic implementation is used by drivers.
>> So make the core fall back to the generic version if a driver doesn't
>> implement the respective callback.
>>
>> Also currently the core doesn't seem to verify that drivers implement
>> the mandatory calls. If a driver doesn't do so we'd just get a NPE.
> 
> Right, which is not an unusual way to signal that something is mandatory.
> 
>> With this patch this potential issue doesn't exist any longer.
>>
>> Signed-off-by: Heiner Kallweit 
> 
> Reviewed-by: Florian Fainelli 
> 
> Note that net-next is closed at the moment, so you will have to resubmit
> this when the tree opens back again.
> 
I see that the two patches have status "deferred" in patchwork.
So do I have to actually resubmit or are they going to be picked up
from patchwork?

Rgds, Heiner

>> ---
>>  drivers/net/phy/phy.c |  5 -
>>  include/linux/phy.h   | 33 ++---
>>  2 files changed, 22 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index 2b1e67bc1..a0e7605dc 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -493,7 +493,10 @@ static int phy_start_aneg_priv(struct phy_device 
>> *phydev, bool sync)
>>  /* Invalidate LP advertising flags */
>>  phydev->lp_advertising = 0;
>>  
>> -err = phydev->drv->config_aneg(phydev);
>> +if (phydev->drv->config_aneg)
>> +err = phydev->drv->config_aneg(phydev);
>> +else
>> +err = genphy_config_aneg(phydev);
>>  if (err < 0)
>>  goto out_unlock;
>>  
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index dc82a07cb..958b5162a 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
>> @@ -497,13 +497,13 @@ struct phy_device {
>>   * flags: A bitfield defining certain other features this PHY
>>   *   supports (like interrupts)
>>   *
>> - * The drivers must implement config_aneg and read_status.  All
>> - * other functions are optional. Note that none of these
>> - * functions should be called from interrupt time.  The goal is
>> - * for the bus read/write functions to be able to block when the
>> - * bus transaction is happening, and be freed up by an interrupt
>> - * (The MPC85xx has this ability, though it is not currently
>> - * supported in the driver).
>> + * All functions are optional. If config_aneg or read_status
>> + * are not implemented, the phy core uses the genphy versions.
>> + * Note that none of these functions should be called from
>> + * interrupt time. The goal is for the bus read/write functions
>> + * to be able to block when the bus transaction is happening,
>> + * and be freed up by an interrupt (The MPC85xx has this ability,
>> + * though it is not currently supported in the driver).
>>   */
>>  struct phy_driver {
>>  struct mdio_driver_common mdiodrv;
>> @@ -841,14 +841,6 @@ int phy_aneg_done(struct phy_device *phydev);
>>  int phy_stop_interrupts(struct phy_device *phydev);
>>  int phy_restart_aneg(struct phy_device *phydev);
>>  
>> -static inline int phy_read_status(struct phy_device *phydev)
>> -{
>> -if (!phydev->drv)
>> -return -EIO;
>> -
>> -return phydev->drv->read_status(phydev);
>> -}
>> -
>>  #define phydev_err(_phydev, format, args...)\
>>  dev_err(&_phydev->mdio.dev, format, ##args)
>>  
>> @@ -890,6 +882,17 @@ int genphy_c45_read_pma(struct phy_device *phydev);
>>  int genphy_c45_pma_setup_forced(struct phy_device *phydev);
>>  int genphy_c45_an_disable_aneg(struct phy_device *phydev);
>>  
>> +static inline int phy_read_status(struct phy_device *phydev)
>> +{
>> +if (!phydev->drv)
>> +return -EIO;
>> +
>> +if (phydev->drv->read_status)
>> +return phydev->drv->read_status(phydev);
>> +else
>> +return genphy_read_status(phydev);
>> +}
>> +
>>  void phy_driver_unregister(struct phy_driver *drv);
>>  void phy_drivers_unregister(struct phy_driver *drv, int n);
>>  int phy_driver_register(struct phy_driver *new_driver, struct module 
>> *owner);
>>
> 
> 
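
The fallback behaviour the quoted patch describes (use the driver's callback
if it is set, otherwise use the generic version) can be sketched in user space
as follows. This is a minimal model, not the kernel code: struct mock_phy,
struct mock_phy_driver and all mock_* functions are hypothetical names
introduced only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_phy;

/* Hypothetical stand-in for a PHY driver with one optional callback. */
struct mock_phy_driver {
	int (*read_status)(struct mock_phy *phy); /* may be NULL */
};

struct mock_phy {
	const struct mock_phy_driver *drv; /* NULL when no driver is bound */
	int link;                          /* 1 = set by generic, 2 = set by custom */
};

/* Stand-in for the generic implementation used as the fallback. */
static int mock_generic_read_status(struct mock_phy *phy)
{
	phy->link = 1;
	return 0;
}

/* Stand-in for a driver-specific implementation. */
static int mock_custom_read_status(struct mock_phy *phy)
{
	phy->link = 2;
	return 0;
}

/*
 * Dispatch helper: no driver is an error, a driver callback wins if
 * present, and a missing callback falls back to the generic version
 * instead of dereferencing a NULL function pointer.
 */
static int mock_read_status(struct mock_phy *phy)
{
	assert(phy != NULL);
	if (!phy->drv)
		return -EIO;
	if (phy->drv->read_status)
		return phy->drv->read_status(phy);
	return mock_generic_read_status(phy);
}
```

With this shape, a driver that is happy with the generic behaviour simply
leaves the callback unset.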



Re: [PATCH RfC 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-11-29 Thread Heiner Kallweit
Am 16.11.2017 um 10:51 schrieb Ard Biesheuvel:
> On 15 November 2017 at 22:19, Heiner Kallweit  wrote:
>> Am 15.11.2017 um 23:04 schrieb Florian Fainelli:
>>> On 11/12/2017 01:08 PM, Heiner Kallweit wrote:
 After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
 sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
 some simplification" all relevant code pieces run in process context
 anyway and I don't think we need the disabling of interrupts any longer.

 Interestingly enough, latter commit already removed the comment
 explaining why interrupts need to be temporarily disabled.

 On my system phy interrupt mode works fine with this patch.
 However I may miss something, especially in the context of shared phy
 interrupts, therefore I'd appreciate if more people could test this.
>>>
>>> I am not currently in a position to test this, but this looks very
>>> similar, if not identical to what Ard submitted a few days earlier:
>>>
>>
>> Thanks for the hint. Indeed it's exactly the same patch, so the one
>> sent by me can be disregarded.
>>
> 
> Well, it does appear your patch is more complete. Another difference
> is that I actually need this change to fix an issue with a
> hierarchical irqchip stacked on top of the GIC.
> 
> 
>>> https://patchwork.kernel.org/patch/10048901/
>>>
>>> Since net-next is closed at the moment, that should allow us to give
>>> this some good amount of testing.
>>>
>>> Thanks
>>>

 Signed-off-by: Heiner Kallweit 
> 
> For the record
> 
> Acked-by: Ard Biesheuvel 
> 
> Dear maintainers,
> 
> Please take whichever of these patches looks more correct to you.
> 
> Thanks,
> Ard.
> 

These two patches have status RFC in patchwork. Based on Ard's review
and comment, any action to be taken from his or my side?

Rgds, Heiner

 ---
  drivers/net/phy/phy.c | 26 ++
  include/linux/phy.h   |  1 -
  2 files changed, 2 insertions(+), 25 deletions(-)

 diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
 index 2b1e67bc1..b3784c9a2 100644
 --- a/drivers/net/phy/phy.c
 +++ b/drivers/net/phy/phy.c
 @@ -629,9 +629,6 @@ static irqreturn_t phy_interrupt(int irq, void 
 *phy_dat)
  if (PHY_HALTED == phydev->state)
  return IRQ_NONE;/* It can't be ours.  */

 -disable_irq_nosync(irq);
 -atomic_inc(&phydev->irq_disable);
 -
  phy_change(phydev);

  return IRQ_HANDLED;
 @@ -689,7 +686,6 @@ static int phy_disable_interrupts(struct phy_device 
 *phydev)
   */
  int phy_start_interrupts(struct phy_device *phydev)
  {
 -atomic_set(&phydev->irq_disable, 0);
  if (request_threaded_irq(phydev->irq, NULL, phy_interrupt,
   IRQF_ONESHOT | IRQF_SHARED,
   phydev_name(phydev), phydev) < 0) {
 @@ -716,13 +712,6 @@ int phy_stop_interrupts(struct phy_device *phydev)

  free_irq(phydev->irq, phydev);

 -/* If work indeed has been cancelled, disable_irq() will have
 - * been left unbalanced from phy_interrupt() and enable_irq()
 - * has to be called so that other devices on the line work.
 - */
 -while (atomic_dec_return(&phydev->irq_disable) >= 0)
 -enable_irq(phydev->irq);
 -
  return err;
  }
  EXPORT_SYMBOL(phy_stop_interrupts);
 @@ -736,7 +725,7 @@ void phy_change(struct phy_device *phydev)
  if (phy_interrupt_is_valid(phydev)) {
  if (phydev->drv->did_interrupt &&
  !phydev->drv->did_interrupt(phydev))
 -goto ignore;
 +return;

  if (phy_disable_interrupts(phydev))
  goto phy_err;
 @@ -748,27 +737,16 @@ void phy_change(struct phy_device *phydev)
  mutex_unlock(&phydev->lock);

  if (phy_interrupt_is_valid(phydev)) {
 -atomic_dec(&phydev->irq_disable);
 -enable_irq(phydev->irq);
 -
  /* Reenable interrupts */
  if (PHY_HALTED != phydev->state &&
  phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
 -goto irq_enable_err;
 +goto phy_err;
  }

  /* reschedule state queue work to run as soon as possible */
  phy_trigger_machine(phydev, true);
  return;

 -ignore:
 -atomic_dec(&phydev->irq_disable);
 -enable_irq(phydev->irq);
 -return;
 -
 -irq_enable_err:
 -disable_irq(phydev->irq);
 -atomic_inc(&phydev->irq_disable);
  phy_err:
  phy_error(phydev);
  }
 diff --git a/include/linux/phy.h b/include/linux/phy.h
 index dc82a07cb..8a87e441f 100644
 --- 

Re: KASAN: use-after-free Read in sock_release

2017-11-29 Thread Eric Dumazet
On Wed, 2017-11-29 at 11:37 -0800, Cong Wang wrote:
> (Cc'ing fs people...)
> 
> On Wed, Nov 29, 2017 at 12:33 AM, syzbot wrote:
> > Hello,
> > 
> > syzkaller hit the following crash on
> > 1d3b78bbc6e983fabb3fbf91b76339bf66e4a12c
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > 
> > Unfortunately, I don't have any reproducer for this bug yet.
> > 
> > 
> > device syz3 left promiscuous mode
> > device syz3 entered promiscuous mode
> > ==
> > BUG: KASAN: use-after-free in sock_release+0x1c6/0x1e0
> > net/socket.c:601
> > Read of size 8 at addr 8801c8dd1d10 by task syz-executor4/31085
> > 
> > CPU: 0 PID: 31085 Comm: syz-executor4 Not tainted 4.14.0+ #129
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:17 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:53
> >  print_address_description+0x73/0x250 mm/kasan/report.c:252
> >  kasan_report_error mm/kasan/report.c:351 [inline]
> >  kasan_report+0x25b/0x340 mm/kasan/report.c:409
> >  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
> >  sock_release+0x1c6/0x1e0 net/socket.c:601
> >  sock_close+0x16/0x20 net/socket.c:1125
> >  __fput+0x333/0x7f0 fs/file_table.c:210
> >  fput+0x15/0x20 fs/file_table.c:244
> >  task_work_run+0x199/0x270 kernel/task_work.c:113
> >  exit_task_work include/linux/task_work.h:22 [inline]
> >  do_exit+0x9bb/0x1ae0 kernel/exit.c:865
> >  do_group_exit+0x149/0x400 kernel/exit.c:968
> >  get_signal+0x73f/0x16c0 kernel/signal.c:2335
> >  do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:809
> >  exit_to_usermode_loop+0x214/0x310 arch/x86/entry/common.c:158
> >  prepare_exit_to_usermode arch/x86/entry/common.c:195 [inline]
> >  syscall_return_slowpath+0x490/0x550 arch/x86/entry/common.c:264
> >  entry_SYSCALL_64_fastpath+0x94/0x96
> > RIP: 0033:0x452879
> > RSP: 002b:7fb1c2435ce8 EFLAGS: 0246 ORIG_RAX:
> > 00ca
> > RAX: fe00 RBX: 00758100 RCX: 00452879
> > RDX:  RSI:  RDI: 00758100
> > RBP: 00758100 R08: 0304 R09: 007580d8
> > R10:  R11: 0246 R12: 
> > R13: 00a6f7ff R14: 7fb1c24369c0 R15: 000e
> > 
> > Allocated by task 31066:
> >  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
> >  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3613
> >  kmalloc include/linux/slab.h:499 [inline]
> >  sock_alloc_inode+0xb4/0x300 net/socket.c:253
> >  alloc_inode+0x65/0x180 fs/inode.c:208
> >  new_inode_pseudo+0x69/0x190 fs/inode.c:890
> >  sock_alloc+0x41/0x270 net/socket.c:565
> >  __sock_create+0x148/0x850 net/socket.c:1225
> >  sock_create net/socket.c:1301 [inline]
> >  SYSC_socket net/socket.c:1331 [inline]
> >  SyS_socket+0xeb/0x200 net/socket.c:1311
> >  entry_SYSCALL_64_fastpath+0x1f/0x96
> > 
> > Freed by task 3039:
> >  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
> >  set_track mm/kasan/kasan.c:459 [inline]
> >  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
> >  __cache_free mm/slab.c:3491 [inline]
> >  kfree+0xca/0x250 mm/slab.c:3806
> >  __rcu_reclaim kernel/rcu/rcu.h:190 [inline]
> >  rcu_do_batch kernel/rcu/tree.c:2758 [inline]
> >  invoke_rcu_callbacks kernel/rcu/tree.c:3012 [inline]
> >  __rcu_process_callbacks kernel/rcu/tree.c:2979 [inline]
> >  rcu_process_callbacks+0xe79/0x17d0 kernel/rcu/tree.c:2996
> >  __do_softirq+0x29d/0xbb2 kernel/softirq.c:285
> 
> This looks more like a fs issue than network, my fs knowledge
> is not good enough to justify why the hell the inode could be
> destroyed before we release the fd.
> 
> My _guess_ is that it is because we defer the fput() to a
> task work. If this is the case, then fs layer is not guilty for this.
> 
> On the other hand, if we have to blame net layer, it does look
> suspicious on the RCU usage in sock_release() where we
> claim RCU protection but I don't see we hold any RCU lock
> there.

There is rcu protection for sock->wq, and the 1 in 
rcu_dereference_protected(sock->wq, 1) is because we do not have a
lockdep convenient way to express that we are the last user of sock,
and about to free it.


>  Also, the code that dereferences sock->wq is pretty much
> useless now, at least I don't see it catching any bug.
> 
> 
> diff --git a/net/socket.c b/net/socket.c
> index 42d8e9c9ccd5..b2390b5591a9 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -598,9 +598,6 @@ void sock_release(struct socket *sock)
> module_put(owner);
> }
> 
> -   if (rcu_dereference_protected(sock->wq, 1)->fasync_list)

Re: [PATCH v2 25/35] nds32: Build infrastructure

2017-11-29 Thread Arnd Bergmann
On Wed, Nov 29, 2017 at 3:10 PM, Greentime Hu  wrote:
> 2017-11-29 19:57 GMT+08:00 Arnd Bergmann :
>> On Wed, Nov 29, 2017 at 12:39 PM, Greentime Hu  wrote:
>>>
>>> How about this?
>>>
>>> choice
>>> prompt "CPU type"
>>> default CPU_N13
>>> config CPU_N15
>>> bool "AndesCore N15"
>>> select CPU_CACHE_NONALIASING
>>> config CPU_N13
>>> bool "AndesCore N13"
>>> select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
>>> config CPU_N10
>>> bool "AndesCore N10"
>>> select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
>>> config CPU_D15
>>> bool "AndesCore D15"
>>> select CPU_CACHE_NONALIASING
>>> select HWZOL
>>> config CPU_D10
>>> bool "AndesCore D10"
>>> select CPU_CACHE_NONALIASING if ANDES_PAGE_SIZE_8KB
>>> endchoice
>>
>> With a 'choice' statement this works, but I would consider that
>> suboptimal for another reason: now you cannot build a kernel that
>> works on e.g. both N13 and N15.
>>
>> This is what we had on ARM a long time ago and on MIPS not so long
>> ago, but it's really a burden for testing and distribution once you get
>> support for more than a handful of machines supported in the mainline
>> kernel: If each CPU core is mutually incompatible with the other ones,
>> this means you likely end up having one defconfig per CPU core,
>> or even one defconfig per SoC or board.
>>
>> I would always try to get the largest amount of hardware to work
>> in the same kernel concurrently.
>>
>> One way out of this would be to define the "CPU type" as the minimum
>> supported CPU, e.g. selecting D15 would result in a kernel that
>> only works on D15, while selecting N15 would work on both N15 and
>> D15, and selecting D10 would work on both D10 and D15.
>>
>
> Hi, Arnd:
>
> Maybe we should keep the original implementation for this reason.
> The default value of CPU_CACHE_NONALIASING and ANDES_PAGE_SIZE_8KB is
> available for all CPU types for now.
> Users can use a kernel built with these configs to boot on almost all nds32 CPUs.
>
> It might be a little bit weird if we config CPU_N10 but run on a N13 CPU.
> This might confuse our users.

I think it really depends on how much flexibility you want to give to users.
The way I suggested first was to allow selecting an arbitrary combination
of CPUs, and then let Kconfig derive the correct set of optimization flags
and other options that work with those. That is probably the easiest for
the users, but can be tricky to get right for all combinations.

When you put them in a sorted list like I mentioned for simplicity, you
could reduce the confusion by naming them differently, e.g.
CONFIG_CPU_N10_OR_NEWER.

Having only the CPU_CACHE_NONALIASING option is fine if you
never need to make any other decisions based on the CPU core
type, but then the help text should describe specifically which cases
are affected (N10/N13/D10 with 4K page size), and you can decide to
hide the option and make it always-on when using 8K page size.

   Arnd


Re: KASAN: use-after-free Read in sock_release

2017-11-29 Thread Linus Torvalds
On Wed, Nov 29, 2017 at 11:37 AM, Cong Wang  wrote:
> (Cc'ing fs people...)
>
> On Wed, Nov 29, 2017 at 12:33 AM, syzbot wrote:
>> BUG: KASAN: use-after-free in sock_release+0x1c6/0x1e0 net/socket.c:601

Lovely.

Yeah, that is:

   601  if (rcu_dereference_protected(sock->wq, 1)->fasync_list)

and as you say, that "rcu_dereference_protected()" is confusing, but
that should be ok because we have a ref to the inode, and we're really
just testing that the pointer is zero.

The call trace here is:

>>  sock_release+0x1c6/0x1e0 net/socket.c:601
>>  sock_close+0x16/0x20 net/socket.c:1125
>>  __fput+0x333/0x7f0 fs/file_table.c:210
>>  fput+0x15/0x20 fs/file_table.c:244
>>  task_work_run+0x199/0x270 kernel/task_work.c:113

and there is no RCU protection anywhere, but it's really just a sanity
check, and the access _should_ be ok.

The stale access does seem to be because it is 'sock' (embedded in the
inode) itself that has been free'd:

>> Allocated by task 31066:
>>  kmalloc include/linux/slab.h:499 [inline]
>>  sock_alloc_inode+0xb4/0x300 net/socket.c:253
>>  alloc_inode+0x65/0x180 fs/inode.c:208
>>  new_inode_pseudo+0x69/0x190 fs/inode.c:890
>>  sock_alloc+0x41/0x270 net/socket.c:565
>>  __sock_create+0x148/0x850 net/socket.c:1225
>>  sock_create net/socket.c:1301 [inline]
>>  SYSC_socket net/socket.c:1331 [inline]
>>  SyS_socket+0xeb/0x200 net/socket.c:1311
>
> This looks more like a fs issue than network, my fs knowledge
> is not good enough to justify why the hell the inode could be
> destroyed before we release the fd.

Ugh. The inode freeing really is confusing and fairly involved, but
the last free *should* happen as part of the final dput() that is done
at the end of __fput().

So when __fput() calls into the

if (file->f_op->release)
file->f_op->release(inode, file);

then the inode should still be around, because the final ref won't be
done until later. And RCU simply shouldn't be an issue, because of
that reference count on the inode.

So it smells like some reference counting went wrong. The socket inode
creation is a bit confusing, and then in "sock_release()" we do have
that

if (!sock->file) {
iput(SOCK_INODE(sock));
return;
}
sock->file = NULL;

which *also* tries to free the inode. I'm not sure what the logic (and
what the locking) behind that code all is.

What *is* the locking for "sock->file" anyway?

Al, can you take a look on the vfs side? But I'm inclined to blame the
socket code, because if we really had a "inode free'd early" issue at
a vfs level, I'd have expected us to see infinite chaos.

 Linus


Re: [BUG] kernel stack corruption during/after Netlabel error

2017-11-29 Thread Eric Dumazet
On Wed, Nov 29, 2017 at 11:59 AM, Stephen Smalley  wrote:
> On Wed, 2017-11-29 at 09:34 -0800, Eric Dumazet wrote:
>> On Wed, Nov 29, 2017 at 9:31 AM, Stephen Smalley 
>> wrote:
>> > On Wed, 2017-11-29 at 21:26 +1100, James Morris wrote:
>> > > I'm seeing a kernel stack corruption bug (detected via gcc) when
>> > > running
>> > > the SELinux testsuite on a 4.15-rc1 kernel, in the 2nd
>> > > inet_socket
>> > > test:
>> > >
>> > > https://github.com/SELinuxProject/selinux-testsuite/blob/master/tests/inet_socket/test
>> > >
>> > >   # Verify that unauthorized client cannot communicate with the
>> > > server.
>> > >   $result = system
>> > >   "runcon -t test_inet_bad_client_t -- $basedir/client stream
>> > > 127.0.0.1 65535 2>&1";
>> > >
>> > > This correctly causes an access control error in the Netlabel
>> > > code,
>> > > and
>> > > the bug seems to be triggered during the ICMP send:
>> > >
>> > > [  339.806024] SELinux: failure in selinux_parse_skb(), unable to
>> > > parse packet
>> > > [  339.822505] Kernel panic - not syncing: stack-protector:
>> > > Kernel
>> > > stack is corrupted in: 81745af5
>> > > [  339.822505]
>> > > [  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-
>> > > rc1-
>> > > test #15
>> > > [  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS
>> > > FWKT68A   01/19/2017
>> > > [  339.885060] Call Trace:
>> > > [  339.896875]  
>> > > [  339.908103]  dump_stack+0x63/0x87
>> > > [  339.920645]  panic+0xe8/0x248
>> > > [  339.932668]  ? ip_push_pending_frames+0x33/0x40
>> > > [  339.946328]  ? icmp_send+0x525/0x530
>> > > [  339.958861]  ? kfree_skbmem+0x60/0x70
>> > > [  339.971431]  __stack_chk_fail+0x1b/0x20
>> > > [  339.984049]  icmp_send+0x525/0x530
>> > > [  339.996205]  ? netlbl_skbuff_err+0x36/0x40
>> > > [  340.008997]  ? selinux_netlbl_err+0x11/0x20
>> > > [  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
>> > > [  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
>> > > [  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
>> > > [  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
>> > > [  340.074562]  ? tcp_filter+0x2c/0x40
>> > > [  340.086400]  ? tcp_v4_rcv+0x820/0xa20
>> > > [  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
>> > > [  340.111279]  ? ip_local_deliver+0x6f/0xe0
>> > > [  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
>> > > [  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
>> > > [  340.147442]  ? ip_rcv+0x27c/0x3c0
>> > > [  340.158668]  ? inet_del_offload+0x40/0x40
>> > > [  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
>> > > [  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
>> > > [  340.195282]  ? __netif_receive_skb+0x18/0x60
>> > > [  340.207288]  ? process_backlog+0x95/0x140
>> > > [  340.218948]  ? net_rx_action+0x26c/0x3b0
>> > > [  340.230416]  ? __do_softirq+0xc9/0x26a
>> > > [  340.241625]  ? do_softirq_own_stack+0x2a/0x40
>> > > [  340.253368]  
>> > > [  340.262673]  ? do_softirq+0x50/0x60
>> > > [  340.273450]  ? __local_bh_enable_ip+0x57/0x60
>> > > [  340.285045]  ? ip_finish_output2+0x175/0x350
>> > > [  340.296403]  ? ip_finish_output+0x127/0x1d0
>> > > [  340.307665]  ? nf_hook_slow+0x3c/0xb0
>> > > [  340.318230]  ? ip_output+0x72/0xe0
>> > > [  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
>> > > [  340.340070]  ? ip_local_out+0x35/0x40
>> > > [  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
>> > > [  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
>> > > [  340.372484]  ? __skb_clone+0x2e/0x130
>> > > [  340.382633]  ? tcp_transmit_skb+0x558/0xa10
>> > > [  340.393262]  ? tcp_connect+0x938/0xad0
>> > > [  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
>> > > [  340.414206]  ? tcp_v4_connect+0x457/0x4e0
>> > > [  340.424471]  ? __inet_stream_connect+0xb3/0x300
>> > > [  340.435195]  ? inet_stream_connect+0x3b/0x60
>> > > [  340.445607]  ? SYSC_connect+0xd9/0x110
>> > > [  340.455455]  ? __audit_syscall_entry+0xaf/0x100
>> > > [  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
>> > > [  340.476636]  ? __audit_syscall_exit+0x209/0x290
>> > > [  340.487151]  ? SyS_connect+0xe/0x10
>> > > [  340.496453]  ? do_syscall_64+0x67/0x1b0
>> > > [  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25
>> > > [  340.516693] Kernel Offset: disabled
>> > > [  340.526393] Rebooting in 11 seconds..
>> > >
>> > > This is mostly reliable, and I'm only seeing it on bare metal
>> > > (not in
>> > > a
>> > > virtualbox vm).
>> > >
>> > > The SELinux skb parse error at the start only sometimes appears,
>> > > and
>> > > looking at the code, I suspect some kind of memory corruption
>> > > being
>> > > the
>> > > cause at that point (basic packet header checks).
>> > >
>> > > I bisected the bug down to the following change:
>> > >
>> > > commit bffa72cf7f9df842f0016ba03586039296b4caaf
>> > > Author: Eric Dumazet 
>> > > Date:   Tue Sep 19 05:14:24 2017 -0700
>> > >
>> > > net: sk_buff rbnode reorg
>> > > ...
>> > >
>> > >
>> > > Anyone else able to 

[PATCH v2 01/17] idr: Fix build

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

The IDR calls WARN_ON without including 

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 7c3a365f7e12..dd048cf456b7 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -13,6 +13,7 @@
 #define __IDR_H__
 
 #include 
+#include 
 #include 
 #include 
 
-- 
2.15.0



[PATCH v2 07/17] idr: Delete idr_find_ext function

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Simply changing idr_find's 'id' argument to 'unsigned long' works
for all callers.

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h| 7 +--
 net/sched/act_api.c| 2 +-
 net/sched/cls_flower.c | 2 +-
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 90dbe7a3735c..12514ec0cd28 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -179,16 +179,11 @@ static inline void idr_preload_end(void)
  * This function can be called under rcu_read_lock(), given that the leaf
  * pointers lifetimes are correctly managed.
  */
-static inline void *idr_find_ext(const struct idr *idr, unsigned long id)
+static inline void *idr_find(const struct idr *idr, unsigned long id)
 {
return radix_tree_lookup(&idr->idr_rt, id);
 }
 
-static inline void *idr_find(const struct idr *idr, int id)
-{
-   return idr_find_ext(idr, id);
-}
-
 /**
  * idr_for_each_entry - iterate over an idr's elements of a given type
  * @idr: idr handle
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 7e901e855d68..efb90b8a3bf0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -222,7 +222,7 @@ static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
struct tc_action *p = NULL;
 
spin_lock_bh(&idrinfo->lock);
-   p = idr_find_ext(&idrinfo->action_idr, index);
+   p = idr_find(&idrinfo->action_idr, index);
spin_unlock_bh(&idrinfo->lock);
 
return p;
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index ca71823bee03..ec0dc92f6104 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -329,7 +329,7 @@ static void *fl_get(struct tcf_proto *tp, u32 handle)
 {
struct cls_fl_head *head = rtnl_dereference(tp->root);
 
-   return idr_find_ext(&head->handle_idr, handle);
+   return idr_find(&head->handle_idr, handle);
 }
 
 static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
-- 
2.15.0



[PATCH v2 17/17] idr: Warn if old iterators see large IDs

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Now that the IDR can be used to store large IDs, it is possible somebody
might only partially convert their old code and use the iterators which
can only handle IDs up to INT_MAX.  It's probably unwise to show them a
truncated ID, so settle for spewing warnings to dmesg, and terminating
the iteration.

Signed-off-by: Matthew Wilcox 
---
 lib/idr.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/idr.c b/lib/idr.c
index 772a24513d1e..1aaeb5a8c426 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -145,7 +145,11 @@ int idr_for_each(const struct idr *idr,
void __rcu **slot;
 
radix_tree_for_each_slot(slot, &idr->idr_rt, &iter, 0) {
-   int ret = fn(iter.index, rcu_dereference_raw(*slot), data);
+   int ret;
+
+   if (WARN_ON_ONCE(iter.index > INT_MAX))
+   break;
+   ret = fn(iter.index, rcu_dereference_raw(*slot), data);
if (ret)
return ret;
}
@@ -173,6 +177,9 @@ void *idr_get_next(struct idr *idr, int *nextid)
if (!slot)
return NULL;
 
+   if (WARN_ON_ONCE(iter.index > INT_MAX))
+   return NULL;
+
*nextid = iter.index;
return rcu_dereference_raw(*slot);
 }
-- 
2.15.0



[PATCH v2 16/17] idr: Rename idr_for_each_entry_ext

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Match idr_alloc_ul with idr_get_next_ul and idr_for_each_entry_ul.
Also add kernel-doc.

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h | 17 ++---
 lib/idr.c   | 20 +++-
 net/sched/act_api.c |  6 +++---
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 344380fd0887..91d27a9bcdf4 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -88,7 +88,7 @@ int idr_alloc_cyclic(struct idr *, void *entry, int start, 
int end, gfp_t);
 int idr_for_each(const struct idr *,
 int (*fn)(int id, void *p, void *data), void *data);
 void *idr_get_next(struct idr *, int *nextid);
-void *idr_get_next_ext(struct idr *idr, unsigned long *nextid);
+void *idr_get_next_ul(struct idr *, unsigned long *nextid);
 void *idr_replace(struct idr *, void *, unsigned long id);
 void idr_destroy(struct idr *);
 
@@ -178,8 +178,19 @@ static inline void *idr_find(const struct idr *idr, 
unsigned long id)
  */
 #define idr_for_each_entry(idr, entry, id) \
for (id = 0; ((entry) = idr_get_next(idr, &(id))) != NULL; ++id)
-#define idr_for_each_entry_ext(idr, entry, id) \
-   for (id = 0; ((entry) = idr_get_next_ext(idr, &(id))) != NULL; ++id)
+
+/**
+ * idr_for_each_entry_ul() - iterate over an IDR's elements of a given type.
+ * @idr: IDR handle.
+ * @entry: The type * to use as cursor.
+ * @id: Entry ID.
+ *
+ * @entry and @id do not need to be initialized before the loop, and
+ * after normal termination @entry is left with the value NULL.  This
+ * is convenient for a "not found" value.
+ */
+#define idr_for_each_entry_ul(idr, entry, id)  \
+   for (id = 0; ((entry) = idr_get_next_ul(idr, &(id))) != NULL; ++id)
 
 /**
  * idr_for_each_entry_continue - continue iteration over an idr's elements of 
a given type
diff --git a/lib/idr.c b/lib/idr.c
index 103afb97b4bd..772a24513d1e 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -155,9 +155,9 @@ int idr_for_each(const struct idr *idr,
 EXPORT_SYMBOL(idr_for_each);
 
 /**
- * idr_get_next - Find next populated entry
- * @idr: idr handle
- * @nextid: Pointer to lowest possible ID to return
+ * idr_get_next() - Find next populated entry.
+ * @idr: IDR handle.
+ * @nextid: Pointer to lowest possible ID to return.
  *
  * Returns the next populated entry in the tree with an ID greater than
  * or equal to the value pointed to by @nextid.  On exit, @nextid is updated
@@ -178,7 +178,17 @@ void *idr_get_next(struct idr *idr, int *nextid)
 }
 EXPORT_SYMBOL(idr_get_next);
 
-void *idr_get_next_ext(struct idr *idr, unsigned long *nextid)
+/**
+ * idr_get_next_ul() - Find next populated entry.
+ * @idr: IDR handle.
+ * @nextid: Pointer to lowest possible ID to return.
+ *
+ * Returns the next populated entry in the tree with an ID greater than
+ * or equal to the value pointed to by @nextid.  On exit, @nextid is updated
+ * to the ID of the found value.  To use in a loop, the value pointed to by
+ * nextid must be incremented by the user.
+ */
+void *idr_get_next_ul(struct idr *idr, unsigned long *nextid)
 {
struct radix_tree_iter iter;
void __rcu **slot;
@@ -190,7 +200,7 @@ void *idr_get_next_ext(struct idr *idr, unsigned long 
*nextid)
*nextid = iter.index;
return rcu_dereference_raw(*slot);
 }
-EXPORT_SYMBOL(idr_get_next_ext);
+EXPORT_SYMBOL(idr_get_next_ul);
 
 /**
  * idr_replace - replace pointer for given id
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 156302c110af..4133d91b7029 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -124,7 +124,7 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, 
struct sk_buff *skb,
 
s_i = cb->args[0];
 
-   idr_for_each_entry_ext(idr, p, id) {
+   idr_for_each_entry_ul(idr, p, id) {
index++;
if (index < s_i)
continue;
@@ -181,7 +181,7 @@ static int tcf_del_walker(struct tcf_idrinfo *idrinfo, 
struct sk_buff *skb,
if (nla_put_string(skb, TCA_KIND, ops->kind))
goto nla_put_failure;
 
-   idr_for_each_entry_ext(idr, p, id) {
+   idr_for_each_entry_ul(idr, p, id) {
ret = __tcf_idr_release(p, false, true);
if (ret == ACT_P_DELETED) {
module_put(ops->owner);
@@ -351,7 +351,7 @@ void tcf_idrinfo_destroy(const struct tc_action_ops *ops,
int ret;
unsigned long id = 1;
 
-   idr_for_each_entry_ext(idr, p, id) {
+   idr_for_each_entry_ul(idr, p, id) {
ret = __tcf_idr_release(p, false, true);
if (ret == ACT_P_DELETED)
module_put(ops->owner);
-- 
2.15.0



[PATCH v2 13/17] cls_u32: Reinstate cyclic allocation

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Commit e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles")
converted htid allocation to use the IDR.  The ID allocated by this
scheme changes; it used to be cyclic, but now always allocates the
lowest available.  The IDR supports cyclic allocation, so just use
the right function.
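
To make the difference between the two policies concrete, here is a toy user-space model. It is only a sketch: the eight-entry ID space, struct toy_ids and the toy_* helpers are hypothetical, and the real IDR is a radix tree rather than a bitmap. Only the `(id | 0x800U) << 20` encoding is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ID 8    /* hypothetical small ID space: 1..8 */

struct toy_ids {
    bool used[TOY_MAX_ID + 1];  /* index 0 is never handed out */
    int cursor;                 /* next ID to try for cyclic allocation */
};

/* Lowest-free policy: always return the smallest unused ID, or -1. */
static int toy_alloc_lowest(struct toy_ids *t)
{
    for (int id = 1; id <= TOY_MAX_ID; id++) {
        if (!t->used[id]) {
            t->used[id] = true;
            return id;
        }
    }
    return -1;
}

/* Cyclic policy: search upward from the cursor, wrapping once, or -1. */
static int toy_alloc_cyclic(struct toy_ids *t)
{
    int start = (t->cursor >= 1 && t->cursor <= TOY_MAX_ID) ? t->cursor : 1;

    for (int n = 0; n < TOY_MAX_ID; n++) {
        int id = (start - 1 + n) % TOY_MAX_ID + 1;

        if (!t->used[id]) {
            t->used[id] = true;
            t->cursor = id + 1;
            return id;
        }
    }
    return -1;
}

static void toy_free(struct toy_ids *t, int id)
{
    assert(id >= 1 && id <= TOY_MAX_ID);
    t->used[id] = false;
}

/* Handle encoding used by gen_new_htid() in the patch. */
static unsigned int toy_htid(int id)
{
    return ((unsigned int)id | 0x800U) << 20;
}
```

After allocating IDs 1 and 2 and freeing 1, the lowest-free policy hands out 1 again, while the cyclic policy moves on to 3. That is the user-visible change the commit message refers to.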

Signed-off-by: Matthew Wilcox 
---
 net/sched/cls_u32.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 9d48674a70e0..e65b47483dc0 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -316,19 +316,13 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
return u32_lookup_key(ht, handle);
 }
 
+/* Protected by rtnl lock */
 static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
 {
-   unsigned long idr_index;
-   int err;
-
-   /* This is only used inside rtnl lock it is safe to increment
-* without read _copy_ update semantics
-*/
-   err = idr_alloc_ext(&tp_c->handle_idr, ptr, &idr_index,
-   1, 0x7FF, GFP_KERNEL);
-   if (err)
+   int id = idr_alloc_cyclic(&tp_c->handle_idr, ptr, 1, 0x7FF, GFP_KERNEL);
+   if (id < 0)
return 0;
-   return (u32)(idr_index | 0x800) << 20;
+   return (id | 0x800U) << 20;
 }
 
 static struct hlist_head *tc_u_common_hash;
-- 
2.15.0



[PATCH v2 10/17] cls_basic: Convert to use idr_alloc_u32

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Use the new helper which saves a temporary variable and a few lines of
code.

Signed-off-by: Matthew Wilcox 
---
 net/sched/cls_basic.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 147700afcf31..c4b242fee8e4 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -182,7 +182,6 @@ static int basic_change(struct net *net, struct sk_buff 
*in_skb,
struct nlattr *tb[TCA_BASIC_MAX + 1];
struct basic_filter *fold = (struct basic_filter *) *arg;
struct basic_filter *fnew;
-   unsigned long idr_index;
 
if (tca[TCA_OPTIONS] == NULL)
return -EINVAL;
@@ -205,21 +204,17 @@ static int basic_change(struct net *net, struct sk_buff 
*in_skb,
if (err < 0)
goto errout;
 
-   if (handle) {
-   fnew->handle = handle;
-   if (!fold) {
-   err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
-   handle, handle + 1, GFP_KERNEL);
-   if (err)
-   goto errout;
-   }
-   } else {
-   err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
-   1, 0x7FFF, GFP_KERNEL);
-   if (err)
-   goto errout;
-   fnew->handle = idr_index;
+   if (!handle) {
+   handle = 1;
+   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+   INT_MAX, GFP_KERNEL);
+   } else if (!fold) {
+   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+   handle, GFP_KERNEL);
}
+   if (err)
+   goto errout;
+   fnew->handle = handle;
 
err = basic_set_parms(net, tp, fnew, base, tb, tca[TCA_RATE], ovr);
if (err < 0) {
-- 
2.15.0



[PATCH v2 12/17] cls_flower: Convert to idr_alloc_u32

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Use the new helper which saves a temporary variable and a few lines
of code.

Signed-off-by: Matthew Wilcox 
---
 net/sched/cls_flower.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index ec0dc92f6104..adee3cf30bb3 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -858,7 +858,6 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
struct cls_fl_filter *fnew;
struct nlattr **tb;
struct fl_flow_mask mask = {};
-   unsigned long idr_index;
int err;
 
if (!tca[TCA_OPTIONS])
@@ -889,21 +888,17 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
goto errout;
 
if (!handle) {
-   err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
-   1, 0x8000, GFP_KERNEL);
-   if (err)
-   goto errout;
-   fnew->handle = idr_index;
-   }
-
-   /* user specifies a handle and it doesn't exist */
-   if (handle && !fold) {
-   err = idr_alloc_ext(&head->handle_idr, fnew, &idr_index,
-   handle, handle + 1, GFP_KERNEL);
-   if (err)
-   goto errout;
-   fnew->handle = idr_index;
+   handle = 1;
+   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+   INT_MAX, GFP_KERNEL);
+   } else if (!fold) {
+   /* user specifies a handle and it doesn't exist */
+   err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+   handle, GFP_KERNEL);
}
+   if (err)
+   goto errout;
+   fnew->handle = handle;
 
if (tb[TCA_FLOWER_FLAGS]) {
fnew->flags = nla_get_u32(tb[TCA_FLOWER_FLAGS]);
@@ -957,7 +952,6 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
*arg = fnew;
 
if (fold) {
-   fnew->handle = handle;
idr_replace(>handle_idr, fnew, fnew->handle);
list_replace_rcu(>list, >list);
tcf_unbind_filter(tp, >res);
-- 
2.15.0



[PATCH v2 11/17] cls_bpf: Convert to use idr_alloc_u32

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Use the new helper.  This has a modest reduction in both lines of code
and compiled code size.

Signed-off-by: Matthew Wilcox 
---
 net/sched/cls_bpf.c | 24 ++--
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 1660fc8294ef..db1dd4de7d6a 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -472,7 +472,6 @@ static int cls_bpf_change(struct net *net, struct sk_buff 
*in_skb,
struct cls_bpf_prog *oldprog = *arg;
struct nlattr *tb[TCA_BPF_MAX + 1];
struct cls_bpf_prog *prog;
-   unsigned long idr_index;
int ret;
 
if (tca[TCA_OPTIONS] == NULL)
@@ -499,21 +498,18 @@ static int cls_bpf_change(struct net *net, struct sk_buff 
*in_skb,
}
 
if (handle == 0) {
-   ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
-   1, 0x7FFF, GFP_KERNEL);
-   if (ret)
-   goto errout;
-   prog->handle = idr_index;
-   } else {
-   if (!oldprog) {
-   ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
-   handle, handle + 1, GFP_KERNEL);
-   if (ret)
-   goto errout;
-   }
-   prog->handle = handle;
+   handle = 1;
+   ret = idr_alloc_u32(&head->handle_idr, prog, &handle,
+   INT_MAX, GFP_KERNEL);
+   } else if (!oldprog) {
+   ret = idr_alloc_u32(&head->handle_idr, prog, &handle,
+   handle, GFP_KERNEL);
}
 
+   if (ret)
+   goto errout;
+   prog->handle = handle;
+
ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
if (ret < 0)
goto errout_idr;
-- 
2.15.0



[PATCH v2 06/17] idr: Delete idr_replace_ext function

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Changing idr_replace's 'id' argument to 'unsigned long' works for all
callers.  Callers which passed a negative ID now get -ENOENT instead of
-EINVAL.  No callers relied on this error value.
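
One way to see why a negative ID now ends up as "not found": converting a negative int to unsigned long yields a value far above INT_MAX, which an IDR holding only small IDs will not contain. The helper below is a hypothetical user-space illustration of that conversion and does not call any IDR function.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: the value an int ID takes once widened to
 * the 'unsigned long id' parameter. */
static unsigned long id_as_ul(int id)
{
    return (unsigned long)id;
}

/* Self-check of the conversion for a negative and a positive ID. */
static void check_id_as_ul(void)
{
    assert(id_as_ul(-1) == ULONG_MAX);
    assert(id_as_ul(-1) > (unsigned long)INT_MAX);
    assert(id_as_ul(5) == 5UL);
}
```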

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h|  3 +--
 lib/idr.c  | 15 +++
 net/sched/act_api.c|  2 +-
 net/sched/cls_basic.c  |  2 +-
 net/sched/cls_bpf.c|  2 +-
 net/sched/cls_flower.c |  2 +-
 net/sched/cls_u32.c|  2 +-
 7 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 9a4042489ec6..90dbe7a3735c 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -136,8 +136,7 @@ int idr_for_each(const struct idr *,
 int (*fn)(int id, void *p, void *data), void *data);
 void *idr_get_next(struct idr *, int *nextid);
 void *idr_get_next_ext(struct idr *idr, unsigned long *nextid);
-void *idr_replace(struct idr *, void *, int id);
-void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id);
+void *idr_replace(struct idr *, void *, unsigned long id);
 void idr_destroy(struct idr *);
 
 static inline void *idr_remove(struct idr *idr, unsigned long id)
diff --git a/lib/idr.c b/lib/idr.c
index 2593ce513a18..577bfd4fe5c2 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -147,18 +147,9 @@ EXPORT_SYMBOL(idr_get_next_ext);
  * the one being replaced!).
  *
  * Returns: the old value on success.  %-ENOENT indicates that @id was not
- * found.  %-EINVAL indicates that @id or @ptr were not valid.
+ * found.  %-EINVAL indicates that @ptr was not valid.
  */
-void *idr_replace(struct idr *idr, void *ptr, int id)
-{
-   if (id < 0)
-   return ERR_PTR(-EINVAL);
-
-   return idr_replace_ext(idr, ptr, id);
-}
-EXPORT_SYMBOL(idr_replace);
-
-void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id)
+void *idr_replace(struct idr *idr, void *ptr, unsigned long id)
 {
struct radix_tree_node *node;
void __rcu **slot = NULL;
@@ -175,7 +166,7 @@ void *idr_replace_ext(struct idr *idr, void *ptr, unsigned 
long id)
 
return entry;
 }
-EXPORT_SYMBOL(idr_replace_ext);
+EXPORT_SYMBOL(idr_replace);
 
 /**
  * DOC: IDA description
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index bab81574a420..7e901e855d68 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -348,7 +348,7 @@ void tcf_idr_insert(struct tc_action_net *tn, struct 
tc_action *a)
struct tcf_idrinfo *idrinfo = tn->idrinfo;
 
spin_lock_bh(&idrinfo->lock);
-   idr_replace_ext(&idrinfo->action_idr, a, a->tcfa_index);
+   idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
spin_unlock_bh(&idrinfo->lock);
 }
 EXPORT_SYMBOL(tcf_idr_insert);
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index d2193304bad0..147700afcf31 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -231,7 +231,7 @@ static int basic_change(struct net *net, struct sk_buff 
*in_skb,
*arg = fnew;
 
if (fold) {
-   idr_replace_ext(>handle_idr, fnew, fnew->handle);
+   idr_replace(>handle_idr, fnew, fnew->handle);
list_replace_rcu(>link, >link);
tcf_unbind_filter(tp, >res);
tcf_exts_get_net(>exts);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index b017d99fd7e1..1660fc8294ef 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -526,7 +526,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff 
*in_skb,
prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
if (oldprog) {
-   idr_replace_ext(>handle_idr, prog, handle);
+   idr_replace(>handle_idr, prog, handle);
list_replace_rcu(>link, >link);
tcf_unbind_filter(tp, >res);
tcf_exts_get_net(>exts);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 3e89b0be1706..ca71823bee03 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -958,7 +958,7 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
 
if (fold) {
fnew->handle = handle;
-   idr_replace_ext(>handle_idr, fnew, fnew->handle);
+   idr_replace(>handle_idr, fnew, fnew->handle);
list_replace_rcu(>list, >list);
tcf_unbind_filter(tp, >res);
tcf_exts_get_net(>exts);
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 6fe4e3549ad3..9d48674a70e0 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -833,7 +833,7 @@ static void u32_replace_knode(struct tcf_proto *tp, struct 
tc_u_common *tp_c,
if (pins->handle == n->handle)
break;
 
-   idr_replace_ext(>handle_idr, n, n->handle);
+   idr_replace(>handle_idr, n, n->handle);
RCU_INIT_POINTER(n->next, pins->next);
rcu_assign_pointer(*ins, n);
 }
-- 
2.15.0



[PATCH v2 08/17] idr: Add idr_alloc_u32 helper

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

All current users of idr_alloc_ext() actually want to allocate a u32 and
it's a little painful for them to use idr_alloc_ext().  This convenience
function makes it simple.

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h | 29 +
 1 file changed, 29 insertions(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 12514ec0cd28..9b2fd6f408b2 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -139,6 +139,35 @@ void *idr_get_next_ext(struct idr *idr, unsigned long 
*nextid);
 void *idr_replace(struct idr *, void *, unsigned long id);
 void idr_destroy(struct idr *);
 
+/**
+ * idr_alloc_u32() - Allocate an ID.
+ * @idr: IDR handle.
+ * @ptr: Pointer to be associated with the new ID.
+ * @nextid: The new ID.
+ * @max: The maximum ID to allocate (inclusive).
+ * @gfp: Memory allocation flags.
+ *
+ * Allocates an unused ID in the range [*nextid, max] and updates the @nextid
+ * pointer with the newly assigned ID.  Returns -ENOSPC and does not modify
+ * @nextid if there are no unused IDs in the range.
+ *
+ * The caller should provide their own locking to ensure that two concurrent
+ * modifications to the IDR are not possible.  Read-only accesses to the
+ * IDR may be done under the RCU read lock or may exclude simultaneous
+ * writers.
+ *
+ * Return: 0 on success, -ENOMEM for memory allocation errors, -ENOSPC if
+ * there are no free IDs in the range.
+ */
+static inline int __must_check idr_alloc_u32(struct idr *idr, void *ptr,
+   u32 *nextid, unsigned long max, gfp_t gfp)
+{
+   unsigned long tmp = *nextid;
+   int ret = idr_alloc_ext(idr, ptr, &tmp, tmp, max + 1, gfp);
+   *nextid = tmp;
+   return ret;
+}
+
 static inline void *idr_remove(struct idr *idr, unsigned long id)
 {
return radix_tree_delete_item(&idr->idr_rt, id, NULL);
-- 
2.15.0



[PATCH v2 15/17] idr: Rename idr_alloc_ext to idr_alloc_ul

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

idr_alloc_ul fits better with other parts of the Linux kernel where we
need to name a function based on the types it operates on.

It uses a 'nextid' pointer argument instead of separate minimum ID and
output assigned ID, (like idr_get_next), reducing the number of arguments
by one.  It also takes a 'max' argument rather than an 'end' argument
(unlike idr_alloc, but the semantics of 'end' don't work for unsigned long
arguments).  And its return value is an errno, so mark it as __must_check.
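
One reading of why 'end' semantics break down for unsigned long: an exclusive upper bound can never include ULONG_MAX, because the bound would have to be ULONG_MAX + 1, which wraps to 0. The two predicates below are hypothetical user-space helpers that contrast the half-open and closed range forms.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Half-open range [start, end): ULONG_MAX is never a member. */
static bool in_range_end(unsigned long id, unsigned long start,
                         unsigned long end)
{
    return id >= start && id < end;
}

/* Closed range [start, max]: every unsigned long value is reachable. */
static bool in_range_max(unsigned long id, unsigned long start,
                         unsigned long max)
{
    return id >= start && id <= max;
}

/* Self-check of the wrap-around argument. */
static void check_ranges(void)
{
    assert(ULONG_MAX + 1UL == 0UL);
    assert(!in_range_end(ULONG_MAX, 0UL, ULONG_MAX));
    assert(!in_range_end(ULONG_MAX, 0UL, ULONG_MAX + 1UL));
    assert(in_range_max(ULONG_MAX, 0UL, ULONG_MAX));
}
```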

Includes kernel-doc for idr_alloc_ul, which idr_alloc_ext didn't have,
and I realised we were missing a test-case where idr_alloc_cyclic wraps
around INT_MAX.  Chris Mi has promised to contribute
test-cases for idr_alloc_ul.

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h | 55 ++---
 include/linux/radix-tree.h  | 17 +--
 lib/idr.c   | 99 +
 lib/radix-tree.c|  3 +-
 net/sched/cls_u32.c | 20 
 tools/testing/radix-tree/idr-test.c | 17 +++
 6 files changed, 111 insertions(+), 100 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 9b2fd6f408b2..344380fd0887 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -13,7 +13,6 @@
 #define __IDR_H__
 
 #include 
-#include 
 #include 
 #include 
 
@@ -82,55 +81,9 @@ static inline void idr_set_cursor(struct idr *idr, unsigned 
int val)
 
 void idr_preload(gfp_t gfp_mask);
 
-int idr_alloc_cmn(struct idr *idr, void *ptr, unsigned long *index,
- unsigned long start, unsigned long end, gfp_t gfp,
- bool ext);
-
-/**
- * idr_alloc - allocate an id
- * @idr: idr handle
- * @ptr: pointer to be associated with the new id
- * @start: the minimum id (inclusive)
- * @end: the maximum id (exclusive)
- * @gfp: memory allocation flags
- *
- * Allocates an unused ID in the range [start, end).  Returns -ENOSPC
- * if there are no unused IDs in that range.
- *
- * Note that @end is treated as max when <= 0.  This is to always allow
- * using @start + N as @end as long as N is inside integer range.
- *
- * Simultaneous modifications to the @idr are not allowed and should be
- * prevented by the user, usually with a lock.  idr_alloc() may be called
- * concurrently with read-only accesses to the @idr, such as idr_find() and
- * idr_for_each_entry().
- */
-static inline int idr_alloc(struct idr *idr, void *ptr,
-   int start, int end, gfp_t gfp)
-{
-   unsigned long id;
-   int ret;
-
-   if (WARN_ON_ONCE(start < 0))
-   return -EINVAL;
-
-   ret = idr_alloc_cmn(idr, ptr, &id, start, end, gfp, false);
-
-   if (ret)
-   return ret;
-
-   return id;
-}
-
-static inline int idr_alloc_ext(struct idr *idr, void *ptr,
-   unsigned long *index,
-   unsigned long start,
-   unsigned long end,
-   gfp_t gfp)
-{
-   return idr_alloc_cmn(idr, ptr, index, start, end, gfp, true);
-}
-
+int idr_alloc(struct idr *, void *, int start, int end, gfp_t);
+int __must_check idr_alloc_ul(struct idr *, void *, unsigned long *nextid,
+   unsigned long max, gfp_t);
 int idr_alloc_cyclic(struct idr *, void *entry, int start, int end, gfp_t);
 int idr_for_each(const struct idr *,
 int (*fn)(int id, void *p, void *data), void *data);
@@ -163,7 +116,7 @@ static inline int __must_check idr_alloc_u32(struct idr 
*idr, void *ptr,
u32 *nextid, unsigned long max, gfp_t gfp)
 {
unsigned long tmp = *nextid;
-   int ret = idr_alloc_ext(idr, ptr, &tmp, tmp, max + 1, gfp);
+   int ret = idr_alloc_ul(idr, ptr, &tmp, max, gfp);
*nextid = tmp;
return ret;
 }
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 23a9c89c7ad9..fc55ff31eca7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -356,24 +356,9 @@ int radix_tree_split(struct radix_tree_root *, unsigned 
long index,
 int radix_tree_join(struct radix_tree_root *, unsigned long index,
unsigned new_order, void *);
 
-void __rcu **idr_get_free_cmn(struct radix_tree_root *root,
+void __rcu **idr_get_free(struct radix_tree_root *root,
  struct radix_tree_iter *iter, gfp_t gfp,
  unsigned long max);
-static inline void __rcu **idr_get_free(struct radix_tree_root *root,
-   struct radix_tree_iter *iter,
-   gfp_t gfp,
-   int end)
-{
-   return idr_get_free_cmn(root, iter, gfp, end > 0 ? end - 1 : INT_MAX);
-}
-
-static inline void __rcu **idr_get_free_ext(struct 

[PATCH v2 14/17] cls_u32: Convert to idr_alloc_u32

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

No real benefit to this classifier, but since we're allocating a u32
anyway, we should use this function.

Signed-off-by: Matthew Wilcox 
---
 net/sched/cls_u32.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index e65b47483dc0..e433d1adccc8 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -970,8 +970,8 @@ static int u32_change(struct net *net, struct sk_buff 
*in_skb,
return -ENOMEM;
}
} else {
-   err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
-   handle, handle + 1, GFP_KERNEL);
+   err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle,
+   handle, GFP_KERNEL);
if (err) {
kfree(ht);
return err;
@@ -1020,8 +1020,7 @@ static int u32_change(struct net *net, struct sk_buff 
*in_skb,
if (TC_U32_HTID(handle) && TC_U32_HTID(handle^htid))
return -EINVAL;
handle = htid | TC_U32_NODE(handle);
-   err = idr_alloc_ext(>handle_idr, NULL, NULL,
-   handle, handle + 1,
+   err = idr_alloc_u32(>handle_idr, NULL, , handle,
GFP_KERNEL);
if (err)
return err;
-- 
2.15.0



[PATCH v2 05/17] idr: Delete idr_remove_ext function

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

Simply changing idr_remove's 'id' argument to 'unsigned long' suffices
for all callers.

Signed-off-by: Matthew Wilcox 
---
 include/linux/idr.h| 7 +--
 net/sched/act_api.c| 2 +-
 net/sched/cls_basic.c  | 6 +++---
 net/sched/cls_bpf.c| 4 ++--
 net/sched/cls_flower.c | 4 ++--
 net/sched/cls_u32.c| 8 
 6 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index dd048cf456b7..9a4042489ec6 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -140,16 +140,11 @@ void *idr_replace(struct idr *, void *, int id);
 void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id);
 void idr_destroy(struct idr *);
 
-static inline void *idr_remove_ext(struct idr *idr, unsigned long id)
+static inline void *idr_remove(struct idr *idr, unsigned long id)
 {
return radix_tree_delete_item(&idr->idr_rt, id, NULL);
 }
 
-static inline void *idr_remove(struct idr *idr, int id)
-{
-   return idr_remove_ext(idr, id);
-}
-
 static inline void idr_init(struct idr *idr)
 {
INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4d33a50a8a6d..bab81574a420 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -78,7 +78,7 @@ static void free_tcf(struct tc_action *p)
 static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct tc_action *p)
 {
spin_lock_bh(&idrinfo->lock);
-   idr_remove_ext(&idrinfo->action_idr, p->tcfa_index);
+   idr_remove(&idrinfo->action_idr, p->tcfa_index);
spin_unlock_bh(&idrinfo->lock);
gen_kill_estimator(&p->tcfa_rate_est);
free_tcf(p);
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 5f169ded347e..d2193304bad0 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -120,7 +120,7 @@ static void basic_destroy(struct tcf_proto *tp)
list_for_each_entry_safe(f, n, >flist, link) {
list_del_rcu(>link);
tcf_unbind_filter(tp, >res);
-   idr_remove_ext(>handle_idr, f->handle);
+   idr_remove(>handle_idr, f->handle);
if (tcf_exts_get_net(>exts))
call_rcu(>rcu, basic_delete_filter);
else
@@ -137,7 +137,7 @@ static int basic_delete(struct tcf_proto *tp, void *arg, 
bool *last)
 
list_del_rcu(>link);
tcf_unbind_filter(tp, >res);
-   idr_remove_ext(>handle_idr, f->handle);
+   idr_remove(>handle_idr, f->handle);
tcf_exts_get_net(>exts);
call_rcu(>rcu, basic_delete_filter);
*last = list_empty(>flist);
@@ -224,7 +224,7 @@ static int basic_change(struct net *net, struct sk_buff 
*in_skb,
err = basic_set_parms(net, tp, fnew, base, tb, tca[TCA_RATE], ovr);
if (err < 0) {
if (!fold)
-   idr_remove_ext(>handle_idr, fnew->handle);
+   idr_remove(>handle_idr, fnew->handle);
goto errout;
}
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 6fe798c2df1a..b017d99fd7e1 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -299,7 +299,7 @@ static void __cls_bpf_delete(struct tcf_proto *tp, struct 
cls_bpf_prog *prog)
 {
struct cls_bpf_head *head = rtnl_dereference(tp->root);
 
-   idr_remove_ext(&head->handle_idr, prog->handle);
+   idr_remove(&head->handle_idr, prog->handle);
cls_bpf_stop_offload(tp, prog);
list_del_rcu(>link);
tcf_unbind_filter(tp, >res);
@@ -542,7 +542,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff 
*in_skb,
cls_bpf_free_parms(prog);
 errout_idr:
if (!oldprog)
-   idr_remove_ext(>handle_idr, prog->handle);
+   idr_remove(>handle_idr, prog->handle);
 errout:
tcf_exts_destroy(>exts);
kfree(prog);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 543a3e875d05..3e89b0be1706 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -283,7 +283,7 @@ static void __fl_delete(struct tcf_proto *tp, struct 
cls_fl_filter *f)
 {
struct cls_fl_head *head = rtnl_dereference(tp->root);
 
-   idr_remove_ext(&head->handle_idr, f->handle);
+   idr_remove(&head->handle_idr, f->handle);
list_del_rcu(>list);
if (!tc_skip_hw(f->flags))
fl_hw_destroy_filter(tp, f);
@@ -972,7 +972,7 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
 
 errout_idr:
if (fnew->handle)
-   idr_remove_ext(>handle_idr, fnew->handle);
+   idr_remove(>handle_idr, fnew->handle);
 errout:
tcf_exts_destroy(>exts);
kfree(fnew);
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index ac152b4f4247..6fe4e3549ad3 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -591,7 +591,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct 
tc_u_hnode *ht)
 

[PATCH v2 04/17] IDR test suite: Check handling negative end correctly

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

One of the charming quirks of the idr_alloc() interface is that you
can pass a negative end and it will be interpreted as "maximum".  Ensure
we don't break that.
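
The quirk can be written down as a one-line conversion from the exclusive 'end' to an inclusive maximum. The expression is the one used by the idr_get_free() wrapper that appears elsewhere in this series; the function name end_to_max() is hypothetical, and this is a user-space sketch rather than the kernel implementation.

```c
#include <assert.h>
#include <limits.h>

/* Map an exclusive 'end' to an inclusive maximum ID.  Zero or a
 * negative 'end' means "no limit", i.e. INT_MAX. */
static int end_to_max(int end)
{
    return end > 0 ? end - 1 : INT_MAX;
}

/* Self-check of the positive, zero and negative cases. */
static void check_end_to_max(void)
{
    assert(end_to_max(10) == 9);
    assert(end_to_max(0) == INT_MAX);
    assert(end_to_max(-5) == INT_MAX);
}
```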

Signed-off-by: Matthew Wilcox 
---
 tools/testing/radix-tree/idr-test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/radix-tree/idr-test.c 
b/tools/testing/radix-tree/idr-test.c
index 193450b29bf0..892ef8855b02 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -207,6 +207,7 @@ void idr_checks(void)
assert(idr_alloc(&idr, item, i, i + 10, GFP_KERNEL) == i);
}
assert(idr_alloc(&idr, DUMMY_PTR, i - 2, i, GFP_KERNEL) == -ENOSPC);
+   assert(idr_alloc(&idr, DUMMY_PTR, i - 2, i + 10, GFP_KERNEL) == -ENOSPC);
 
idr_for_each(&idr, item_idr_free, &idr);
idr_destroy(&idr);
-- 
2.15.0



[PATCH v2 00/17] IDR changes for v4.15-rc1

2017-11-29 Thread Matthew Wilcox
From: Matthew Wilcox 

v2:
 - Rebased on f6454f80e8a965fca203dab28723f68ec78db608 to resolve
   conflicting changes with cls_bpf
 - Fix whitespace
 - Change a WARN_ON to WARN_ON_ONCE (git snafu yesterday)

Original cover letter:

The patches here are of three types:

 - Enhancing the test suite (fixing the build, adding a couple of new
   tests, fixing a bug in the test)
 - Replacing the 'extended' IDR API
 - Fixing some low-probability bugs 

As far as the 'extended' IDR API goes, this was added by Chris Mi to
permit a savvy user to use IDs up to ULONG_MAX in size (the traditional
IDR API only permits IDs to be INT_MAX).  It's harder to use, so we
wouldn't want to convert all users over to it.  But it can be made
easier to use than it currently is, which is what I've done here.  The
principal way that I've made it easier to use is by introducing
idr_alloc_u32(), which is what all but one of the existing users
actually want.

The last patch at the end I thought of just now -- what happens when
somebody adds an IDR entry with an ID > INT_MAX and then tries to
iterate over all entries in the IDR using an old interface that can't
return these large IDs?  It's not safe to return those IDs, so I've
settled for a dmesg warning and terminating the iteration.

Matthew Wilcox (17):
  idr: Fix build
  radix tree test suite: Remove ARRAY_SIZE
  idr test suite: Fix ida_test_random()
  IDR test suite: Check handling negative end correctly
  idr: Delete idr_remove_ext function
  idr: Delete idr_replace_ext function
  idr: Delete idr_find_ext function
  idr: Add idr_alloc_u32 helper
  net sched actions: Convert to use idr_alloc_u32
  cls_basic: Convert to use idr_alloc_u32
  cls_bpf: Convert to use idr_alloc_u32
  cls_flower: Convert to idr_alloc_u32
  cls_u32: Reinstate cyclic allocation
  cls_u32: Convert to idr_alloc_u32
  idr: Rename idr_alloc_ext to idr_alloc_ul
  idr: Rename idr_for_each_entry_ext
  idr: Warn if old iterators see large IDs

 include/linux/idr.h | 109 ++--
 include/linux/radix-tree.h  |  17 +---
 lib/idr.c   | 143 +++-
 lib/radix-tree.c|   3 +-
 net/sched/act_api.c |  72 +++-
 net/sched/cls_basic.c   |  33 
 net/sched/cls_bpf.c |  30 +++
 net/sched/cls_flower.c  |  34 
 net/sched/cls_u32.c |  51 +---
 tools/testing/radix-tree/idr-test.c |  22 -
 tools/testing/radix-tree/linux/kernel.h |   2 -
 11 files changed, 266 insertions(+), 250 deletions(-)

-- 
2.15.0


